# RAG Fusion

## RAG Fusion combines RAG with Reciprocal Rank Fusion and generated queries. In this demo, we show how to implement RAG Fusion with Anyscale Endpoint, together with Pinecone and LlamaIndex.
### Reference: https://github.com/Raudaschl/rag-fusion

In [None]:
!pip install pinecone-client langchain
!pip install -U transformers

## First, we will use AE (Anyscale Endpoint) to generate multiple queries related to our original query, "What is Ray cluster"

In [1]:
import openai
import random

ANYSCALE_API_KEY = "esecret_xxxxxx"
def generate_queries_llama(original_query, ft_suffix=None, split="\n"):
    if ft_suffix:
        openai.api_base = "https://api.endpoints.anyscale.com/v1"
        model = "meta-llama/Llama-2-7b-chat-hf"+ft_suffix
    else:
        openai.api_base = "https://console.endpoints.anyscale.com/m/v1"
        model = "meta-llama/Llama-2-70b-chat-hf"
    openai.api_key = ANYSCALE_API_KEY
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant that generates multiple search queries based on a single input query."},
            {"role": "user", "content": f"Generate multiple search queries related to: {original_query}"},
            {"role": "user", "content": "Be precise. Only output the result, go straight into the answer. Do NOT say something like 'sure, here are the result' at the beginning or 'do you need sth else'"},
            {"role": "user", "content": "OUTPUT (4 queries):"}
        ]
    )

    generated_queries = response.choices[0]["message"]["content"].strip().split(split)
    return generated_queries

## With the original Llama2 70B model, the output is often verbose and the generated queries are low quality, even with prompt engineering.

In [23]:
original_query = "What is Ray cluster in the context of computer science"
generated_queries = generate_queries_llama(original_query)
generated_queries

['Sure, here are four search queries related to "What is Ray cluster in the context of computer science":',
 '',
 '1. "Ray cluster computer science"',
 '2. "What is a Ray cluster"',
 '3. "Ray cluster analysis"',
 '4. "Ray cluster algorithm"',
 '',
 "I hope these queries help you find the information you're looking for! Let me know if you need any further assistance."]

## With fine-tuning, even the Llama2 7B model generates good-quality queries.
### You can see more details in the "Finetune with FireAct" cookbook

In [3]:
generated_queries = generate_queries_llama(original_query,":FT_MODEL:FT_ID", "\\n")
generated_queries

['- Ray cluster origins',
 '- History of Ray cluster in computer science',
 "- Ray cluster's relevance in computer science",
 '- Current applications of Ray cluster in computer science']
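
Both outputs above carry list markers (`1.` or `-`), and the untuned output also contains greeting and sign-off lines. The following is a minimal post-processing sketch, not part of the original notebook: `clean_queries` is a hypothetical helper that assumes every real query line starts with a bullet or a number and discards everything else.

```python
import re

# Matches a leading bullet ("-", "*", "•") or a number such as "1." or "2)"
_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")

def clean_queries(lines):
    """Keep only lines that look like list items and strip their markers.

    Assumption: the model emits each query as a bulleted or numbered line;
    greeting and sign-off lines have no marker and are dropped.
    """
    cleaned = []
    for line in lines:
        if not _MARKER.match(line):
            continue  # chatter or empty line
        # Remove the marker, then surrounding whitespace and quote characters
        text = _MARKER.sub("", line, count=1).strip().strip("\"'")
        if text:
            cleaned.append(text)
    return cleaned

sample = [
    'Sure, here are four search queries:',
    '',
    '1. "Ray cluster computer science"',
    "- Ray cluster's relevance in computer science",
    'I hope these queries help!',
]
cleaned_sample = clean_queries(sample)
```

Passing cleaned queries to the retriever keeps the list markers out of the embedded text. The heuristic will drop queries that the model emits without any marker, so it is worth checking against real outputs.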

## Now let's use Pinecone and LlamaIndex to run RAG Fusion with these 4 queries

In [None]:
# Pinecone initialization
# Connect to a pre-built index; see the App_RAG_Pinecone cookbook for details
import pinecone

pineconeApikey = "PINECONE_API_KEY"
environment = "PINECONE_ENVIRONMENT"
pinecone.init(api_key=pineconeApikey, environment=environment)
index_name = 'PINECONE_INDEX_NAME'
pinecone.list_indexes()
pinecone_index = pinecone.Index(index_name)

pinecone_index.describe_index_stats()

In [12]:
from llama_index.llms import Anyscale
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.vector_stores import PineconeVectorStore

#Create vector_store from Pinecone
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

#Create service_context with AE and OpenAI embedding models
service_context = ServiceContext.from_defaults(
    llm=Anyscale(model = "meta-llama/Llama-2-70b-chat-hf",
                 api_key=ANYSCALE_API_KEY),
    embed_model=OpenAIEmbedding(model="text-embedding-ada-002",
                                api_base="https://api.openai.com/v1",
                                api_key="OPENAI_API_KEY")
)

# Get the retriever from LlamaIndex
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
retriever = index.as_retriever()  

In [14]:
# Retrieve relevant nodes from the vector store
def pinecone_search(query, retriever):
    response = retriever.retrieve(query)
    return {resp.node.id_: resp.score for resp in response},{resp.node.id_: resp.text for resp in response}

all_results = {}
all_id_contents = {}
for query in generated_queries:
    search_results, id_contents = pinecone_search(query, retriever)
    all_results[query] = search_results
    for id_ in id_contents.keys():
        if id_ in all_id_contents.keys():
            continue
        all_id_contents[id_] = id_contents[id_]
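
To make the shape of `all_results` and `all_id_contents` concrete without a live Pinecone index, here is a small sketch that runs the same collection logic against a stand-in retriever. `StubNode`, `StubHit`, `StubRetriever`, and `collect_results` are hypothetical names introduced only for illustration; they mimic the `.node.id_`, `.score`, and `.text` attributes that the cell above reads.

```python
from dataclasses import dataclass

@dataclass
class StubNode:
    id_: str

@dataclass
class StubHit:
    # Mimics only the attributes read by the notebook: node.id_, score, text
    node: StubNode
    score: float
    text: str

class StubRetriever:
    """Stand-in retriever that returns canned hits for each query."""
    def __init__(self, hits_by_query):
        self._hits = hits_by_query

    def retrieve(self, query):
        return self._hits.get(query, [])

def collect_results(queries, retriever):
    all_scores, all_texts = {}, {}
    for query in queries:
        hits = retriever.retrieve(query)
        # Per-query mapping: node id -> similarity score
        all_scores[query] = {hit.node.id_: hit.score for hit in hits}
        for hit in hits:
            # Keep the first text seen for each node id
            all_texts.setdefault(hit.node.id_, hit.text)
    return all_scores, all_texts

stub = StubRetriever({
    "q1": [StubHit(StubNode("a"), 0.42, "text a"), StubHit(StubNode("b"), 0.38, "text b")],
    "q2": [StubHit(StubNode("a"), 0.40, "text a"), StubHit(StubNode("c"), 0.37, "text c")],
})
scores, texts = collect_results(["q1", "q2"], stub)
```

Scores stay separated per query because the fusion step ranks documents within each query, while node texts are deduplicated across queries because each node only needs to be stored once.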

In [15]:
# Reciprocal Rank Fusion algorithm to rerank the relevant nodes
def reciprocal_rank_fusion(search_results_dict, k=60):
    fused_scores = {}
    print("Initial individual search result ranks:")
    for query, doc_scores in search_results_dict.items():
        print(f"For query '{query}': {doc_scores}")
        
    for query, doc_scores in search_results_dict.items():
        for rank, (doc, score) in enumerate(sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)):
            if doc not in fused_scores:
                fused_scores[doc] = 0
            previous_score = fused_scores[doc]
            fused_scores[doc] += 1 / (rank + k)
            print(f"Updating score for {doc} from {previous_score} to {fused_scores[doc]} based on rank {rank} in query '{query}'")

    reranked_results = {doc: score for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)}
    print("Final reranked results:", reranked_results)
    return reranked_results

reranked_results = reciprocal_rank_fusion(all_results)

Initial individual search result ranks:
For query '- Ray cluster origins': {'80ab95ea-ff5b-46cd-8c01-64c1464581e5': 0.383875966, '38b9042c-64f1-44f9-81b6-d30c60d48280': 0.416363239}
For query '- History of Ray cluster in computer science': {'80ab95ea-ff5b-46cd-8c01-64c1464581e5': 0.355046272, '38b9042c-64f1-44f9-81b6-d30c60d48280': 0.391810656}
For query '- Ray cluster's relevance in computer science': {'80ab95ea-ff5b-46cd-8c01-64c1464581e5': 0.315196395, '819ea5ea-2d24-4fbb-a24e-f26710aa47a0': 0.369936943}
For query '- Current applications of Ray cluster in computer science': {'80ab95ea-ff5b-46cd-8c01-64c1464581e5': 0.319083452, '085ca8b4-cbdd-45f7-992a-b69ac3c76ff9': 0.374597192}
Updating score for 38b9042c-64f1-44f9-81b6-d30c60d48280 from 0 to 0.016666666666666666 based on rank 0 in query '- Ray cluster origins'
Updating score for 80ab95ea-ff5b-46cd-8c01-64c1464581e5 from 0 to 0.01639344262295082 based on rank 1 in query '- Ray cluster origins'
Updating score for 38b9042c-64f1-44f9-
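
The fusion step reduces to one rule: each document's fused score is the sum, over all queries, of `1 / (rank + k)`, where `rank` is the document's position in that query's result list. The sketch below repeats the notebook's logic without the logging, on a toy input whose document ids and scores are made up for illustration. Like the notebook, it counts ranks from 0; Reciprocal Rank Fusion is commonly stated with ranks starting at 1, which changes the score values slightly but not the idea.

```python
def rrf(search_results_dict, k=60):
    """Fuse per-query {doc_id: score} dicts into one {doc_id: fused_score} dict."""
    fused = {}
    for doc_scores in search_results_dict.values():
        # Rank documents within this query by descending similarity score
        ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
        for rank, (doc, _score) in enumerate(ranked):  # rank starts at 0, as in the notebook
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (rank + k)
    # Highest fused score first
    return dict(sorted(fused.items(), key=lambda item: item[1], reverse=True))

toy_results = {
    "q1": {"a": 0.9, "b": 0.8},   # a is rank 0, b is rank 1
    "q2": {"c": 0.95, "a": 0.7},  # c is rank 0, a is rank 1
}
toy_fused = rrf(toy_results)
print(list(toy_fused))  # ['a', 'c', 'b']
```

Document `a` never has the highest raw score in `q2`, yet it ranks first overall because it appears in both result lists (`1/60 + 1/61`), while `c` and `b` each appear once (`1/60` and `1/61`). Only the ranks matter; the raw similarity scores are used solely for ordering within a query.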

## To generate the RAG output, we can use the top-K results from the re-ranked nodes, and fit them in the allowed context length
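
The budgeting idea can be sketched in isolation: walk the documents in reranked order, count tokens, and truncate the first document that no longer fits. This is a simplified illustration rather than the notebook's implementation; `pack_documents` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer so that the example runs without downloading a model.

```python
def pack_documents(docs, max_tokens, tokenize=str.split, detokenize=" ".join):
    """Return (packed_docs, tokens_used), keeping at most max_tokens tokens.

    Assumption: tokenize/detokenize are simple stand-ins; with a real
    tokenizer they would be its encode and decode functions.
    """
    packed, used = [], 0
    for doc in docs:
        remaining = max_tokens - used
        if remaining <= 0:
            break  # budget exhausted
        tokens = tokenize(doc)
        if len(tokens) > remaining:
            # Truncate by tokens, not characters, then stop
            packed.append(detokenize(tokens[:remaining]))
            used = max_tokens
            break
        packed.append(doc)
        used += len(tokens)
    return packed, used

packed_docs, tokens_used = pack_documents(["a b c", "d e f g", "h"], max_tokens=5)
print(packed_docs, tokens_used)  # ['a b c', 'd e'] 5
```

Because the documents arrive in reranked order, truncation always falls on the least relevant document that was still admitted.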

In [22]:
from transformers import LlamaTokenizerFast
tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")

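def generate_output(reranked_results, all_id_contents, queries, tokenizer, top_k=-1, max_len=4098):
    tot_len = 0
    refer_docs = ""
    # Use every reranked node unless a valid top_k is given
    if top_k <= 0 or top_k > len(reranked_results):
        reranked_keys = list(reranked_results.keys())
    else:
        reranked_keys = list(reranked_results.keys())[:top_k]
    for id_ in reranked_keys:
        refer_doc = all_id_contents[id_]
        # Skip special tokens so they are neither counted nor decoded back into the text
        token_ids = tokenizer.encode(refer_doc, add_special_tokens=False)
        remaining = max_len - tot_len
        if len(token_ids) >= remaining:
            # Truncate by tokens (not characters) to fill the remaining budget, then stop
            refer_docs += tokenizer.decode(token_ids[:remaining])
            tot_len = max_len
            break
        refer_docs += refer_doc
        tot_len += len(token_ids)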
    messages = [
        {"role": "system", "content": "You are a helpful assistant that generates a summary based on the query question and reference documents"},
        {"role": "user", "content": f"Here is the main query: {original_query} Here are some reference documents: {refer_docs}"},
        {"role": "user", "content": "OUTPUT:"}
    ]
    #print(tot_len, messages)
    response = openai.ChatCompletion.create(
        api_base = "https://console.endpoints.anyscale.com/m/v1",
        api_key = ANYSCALE_API_KEY,
        model="meta-llama/Llama-2-70b-chat-hf",
        messages=messages
    )
    return response.choices[0]["message"]["content"]

final_output = generate_output(reranked_results, all_id_contents, generated_queries,
                               tokenizer, top_k=3, max_len=4000)

print(final_output)

  Sure, here's a summary of the main query and reference documents:

Main Query: What is Ray cluster in the context of computer science?

Reference Documents: Ray Clusters Overview, Ray Documentation

Summary:

A Ray cluster is a set of worker nodes connected to a common Ray head node, allowing for seamless scaling of workloads from a laptop to a large cluster. Ray clusters can be fixed-size or autoscale up and down according to the resources requested by applications running on the cluster. Ray provides native cluster deployment support on various technology stacks, including AWS, GCP, Kubernetes, and manual deployment on Linux. Ray clusters can be launched using the Cluster Launcher, which starts a cluster on the cloud and creates a designated head node and worker nodes. Additionally, users can connect other nodes to the head node to create a Ray cluster by calling ray start on those nodes.

Key concepts related to Ray clusters include:

* Ray head node: The machine that runs the Ray