# Iterative Optimization of LlamaIndex RAG Pipeline: A Step-by-Step Approach

Llama new website: https://docs.llamaindex.ai/en/stable

In [2]:
!pip3 install deeplake llama_index langchain openai tiktoken cohere pandas torch sentence-transformers llama-index-llms-litellm llama-index-embeddings-cohere

Collecting sentence-transformers
  Using cached sentence_transformers-2.6.0-py3-none-any.whl.metadata (11 kB)
Collecting transformers<5.0.0,>=4.32.0 (from sentence-transformers)
  Using cached transformers-4.39.1-py3-none-any.whl.metadata (134 kB)
Collecting torch>=1.11.0 (from sentence-transformers)
  Using cached torch-2.2.1-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting scikit-learn (from sentence-transformers)
  Using cached scikit_learn-1.4.1.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting scipy (from sentence-transformers)
  Using cached scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting jinja2 (from torch>=1.11.0->sentence-transformers)
  Using cached Jinja2-3.1.3-py3-none-any.whl.metadata (3.3 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

True

## Load data

In [2]:
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core import SimpleDirectoryReader

# First we create Document LlamaIndex objects from the text data
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)

# By default, the node/chunks ids are set to random uuids. To ensure same id's per run, we manually set them.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

print(f"Number of Documents: {len(documents)}")
print(f"Number of nodes: {len(nodes)} with the current chunk size of {node_parser.chunk_size}")

Number of Documents: 1
Number of nodes: 61 with the current chunk size of 512


## Create vector store index

In [3]:
from llama_index.core import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores.deeplake import DeepLakeVectorStore

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Create a local Deep Lake VectorStore
dataset_path = "./data/paul_graham/deep_lake_db"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)

# LLM that will answer questions with the retrieved context
llm = OpenAI(model="gpt-3.5-turbo-1106")
# We use OpenAI's embedding model "text-embedding-ada-002"
embed_model = OpenAIEmbedding()

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

vector_index = VectorStoreIndex(nodes, service_context=service_context, storage_context=storage_context, show_progress=True)

  service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
  from .autonotebook import tqdm as notebook_tqdm
Generating embeddings: 100%|██████████| 61/61 [00:03<00:00, 18.16it/s]

Uploading data to deeplake dataset.



100%|██████████| 61/61 [00:00<00:00, 383.51it/s]

Dataset(path='./data/paul_graham/deep_lake_db', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (61, 1)      str     None   
 metadata     json      (61, 1)      str     None   
 embedding  embedding  (61, 1536)  float32   None   
    id        text      (61, 1)      str     None   





In [4]:
query_engine = vector_index.as_query_engine(similarity_top_k=10)
response_vector = query_engine.query("What are the main things Paul worked on before college?")
print(response_vector.response)

Before college, the main things Paul worked on outside of school were writing and programming. He wrote short stories and experimented with programming on the IBM 1401, using an early version of Fortran.


## Create embeddigns for qa pairs

In [5]:
from llama_index.core.evaluation import generate_question_context_pairs

qc_dataset = None

if not os.path.exists("qc_dataset.json"):
    qc_dataset = generate_question_context_pairs(
        nodes,
        llm=llm,
        num_questions_per_chunk=1
    )
    # We can save the dataset as a json file for later use.
    qc_dataset.save_json("qc_dataset.json")

## Load embeddings dataset

In [5]:
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

if os.path.exists("qc_dataset.json"):
    qc_dataset = EmbeddingQAFinetuneDataset.from_json(
        "qc_dataset.json"
    )

## Prompt Template

In [17]:
DEFAULT_QA_GENERATE_PROMPT_TMPL = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided."
"""

# Evaluation

Both the below metrics are useful for evaluating the effectiveness of a retrieval system, like how well a search engine or a recommendation system works.

1. **Hit Rate:** When guessing a correct option from a list of options. The hit rate measures how often you guess the correct answer by only looking at your top few guesses. You have a high hit rate if you often find the right answer in your first few guesses. So, in a retrieval system, **it's about how frequently the system finds the correct document within its top 'k' picks** (where 'k' is a number you decide, like top 5 or top 10).

2. **Mean Reciprocal Rank (MRR):** MRR is like measuring how quickly you can find a treasure in a list of boxes. Imagine having a row of boxes and if the treasure is in the 1st box then MMR is 1. If its in 2nd then 1/2 and for nth treasure box its 1/n. The MMR scores for each retrieval is avg at the end. **MMR looks at where the correct doc ranks in the system's guesses**.

- See Blog on [Hit Rate & MRR](https://tamilselvan-subramanian.medium.com/how-hit-rate-and-mrr-measure-llm-retrievers-ai-simplified-series-7203ba2d4032#:~:text=Remember%3A-,Hit%20Rate%20tells%20you%20if%20the%20LLM%20found%20any%20relevant,the%20future%20of%20information%20access.)

In [32]:
import pandas as pd


def display_results_retriever(name, eval_results: list, show_list: bool= False) -> pd.DataFrame:
    """Display results from evaluate"""

    metric_dicts = [eval_result.metric_vals_dict  for eval_result in eval_results]

    full_df = pd.DataFrame(metric_dicts)

    if show_list:
        print(list(full_df['hit_rate']))
        print(list(full_df['mrr']))

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

In [7]:
from llama_index.core.evaluation import RetrieverEvaluator

# we can evaluate the retrievers with different top_k values.

for i in range(2, 10 + 1, 2):
    retriever = vector_index.as_retriever(similarity_top_k= i)
    retriever_evaluator= RetrieverEvaluator.from_metric_names(['mrr', 'hit_rate'], retriever= retriever)

    eval_results= await retriever_evaluator.aevaluate_dataset(qc_dataset)
    print(display_results_retriever(f"Retriever top_{i}", eval_results, True))

[1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
[1.0, 1.0, 0.0, 0.0, 1.0, 0.5, 0.5, 1.0, 0.0, 1.0, 0.0, 0.5, 0.0, 1.0, 1.0, 0.0, 0.0, 0.5, 0.0, 1.0, 0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.5, 1.0, 1.0, 1.0, 0.0, 0.0

**Observation**

- Notice how hit rate increase as our top k increase as well, which is expected. We're increasing the probability of the correct answer being included in the returned set.

## Evaluation for Relevancy and Faithfulness metrics

- **Relevancy**: evaluates whether the retrieved context and answer are relevant to the query.
- **Faithfulness**: evaluates if the answer is faithful to the retrieved contexts or, in other words, whether there's hallucination.

In [34]:
from llama_index.core.evaluation import (
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    BatchEvalRunner,
)

for i in range(2, 10 + 1, 2):
    # set faithfulness and relevancy evaluators
    query_engine = vector_index.as_query_engine(similarity_top_k=i)

    # while we use gpt3.5-turbo to answer questions
    # we can use gpt4 to evaluate the answers
    llm_gpt4 = OpenAI(temperature=0, model="gpt-4-0125-preview")

    service_context_gpt4 = ServiceContext.from_defaults(llm=llm_gpt4)

    faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context_gpt4)
    relevancy_evaluator = RelevancyEvaluator(service_context=service_context_gpt4)

    # Run evaluation
    queries = list(qc_dataset.queries.values())
    batch_eval_queries = queries[:20]

    runner = BatchEvalRunner(
        {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
        workers=8,
    )

    eval_results = await runner.aevaluate_queries(
        query_engine, queries=batch_eval_queries
    )

    faithfulness_score = sum(
        result.passing for result in eval_results["faithfulness"]
    ) / len(eval_results["faithfulness"])
    print(f"top_{i} faithfulness_score: {faithfulness_score}")

    relevancy_score = sum(
        result.passing for result in eval_results["faithfulness"]
    ) / len(eval_results["relevancy"])
    print(f"top_{i} relevancy_score: {relevancy_score}")

  service_context_gpt4 = ServiceContext.from_defaults(llm=llm_gpt4)


top_2 faithfulness_score: 1.0
top_2 relevancy_score: 1.0
top_4 faithfulness_score: 1.0
top_4 relevancy_score: 1.0
top_6 faithfulness_score: 1.0
top_6 relevancy_score: 1.0
top_8 faithfulness_score: 1.0
top_8 relevancy_score: 1.0
top_10 faithfulness_score: 1.0
top_10 relevancy_score: 1.0


**Observation**


Here above we are getting score=1.0 for each top_i as we are using gpt-4 and the data we are using for retrieval is not too complicated. But, for a larger dataset and vector store these scores may vary. 

### Default eval prompt template

```py
DEFAULT_EVAL_TEMPLATE = PromptTemplate(
    "Your task is to evaluate if the response for the query \
    is in line with the context information provided.\n"
    "You have two options to answer. Either YES/ NO.\n"
    "Answer - YES, if the response for the query \
    is in line with context information otherwise NO.\n"
    "Query and Response: \n {query_str}\n"
    "Context: \n {context_str}\n"
    "Answer: "
)
```

# Changing the embedding model

- Checkout huggingface leaderboard for best embeddings: https://huggingface.co/spaces/mteb/leaderboard
- Now that we have the baseline evaluation score, we can start changing some modules of our LlamaIndex RAG pipeline. 

## Cohere Embedding model

In [8]:
import os
from llama_index.core import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.openai import OpenAI

# Create another local DeepLakeVectorStore to store the embeddings
dataset_path = "./data/paul_graham/deep_lake_db_1"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)

llm = OpenAI(model="gpt-3.5-turbo-1106")
embed_model = CohereEmbedding(
    cohere_api_key=os.getenv('COHERE_API_KEY'),
    model_name="embed-english-v3.0",
    input_type="search_document",
)

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex(nodes, service_context=service_context, storage_context=storage_context, show_progress=True)

Deep Lake Dataset in ./data/paul_graham/deep_lake_db_1 already exists, loading from the storage


  service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
Generating embeddings:   0%|          | 0/61 [00:00<?, ?it/s]

CohereAPIError: invalid api token

## Eval using cohere

In [13]:
from llama_index.core.evaluation import RetrieverEvaluator

# embed_model.input_type = "search_query"
retriever = vector_index.as_retriever(similarity_top_k=10, embed_model= embed_model)
retriever_evaluator = RetrieverEvaluator.from_metric_names(['mrr', 'hit_rate'], retriever= retriever)
eval_results = await retriever_evaluator.aevaluate_dataset(qc_dataset, show_progres= True, workers=1)
print(display_results_retriever(f"Retriever_cohere_embeds", eval_results))

CohereAPIError: You are using a Trial key, which is limited to 100 API calls / minute. You can continue to use the Trial key for free or upgrade to a Production key with higher rate limits at 'https://dashboard.cohere.com/api-keys'. Contact us on 'https://discord.gg/XW44jPfYJu' or email us at support@cohere.com with any questions

# Reranker

- cross-encoder/ms-marco-MiniLM-L-6-v2 from the Hugging Face hub.
- LlamaIndex’s LLMRerank 
- Cohere’s CohereRerank. 

In [9]:
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core.postprocessor import SentenceTransformerRerank, LLMRerank
from llama_index.core.evaluation import RetrieverEvaluator

st_reranker = SentenceTransformerRerank(
    top_n=5, model="cross-encoder/ms-marco-MiniLM-L-6-v2"
)

llm_reranker = LLMRerank(choice_batch_size=4, top_n=5)
embed_model = OpenAIEmbedding()
cohere_rerank = CohereRerank(api_key=os.getenv("COHERE_API_KEY"), top_n=10)

for reranker in [st_reranker, llm_reranker, cohere_rerank]:
    retriever_with_reranker = vector_index.as_retriever(
        similarity_top_k=10, postprocessor=reranker, embed_model=embed_model
    )

    retriever_evaluator_1 = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever_with_reranker
    )

    eval_results1 = await retriever_evaluator_1.aevaluate_dataset(qc_dataset)
    print(display_results_retriever(f"retriever with added reranker", eval_results1))
    print("=" * 20)

                  Retriever Name  Hit Rate       MRR
0  retriever with added reranker  0.931507  0.601272
                  Retriever Name  Hit Rate       MRR
0  retriever with added reranker  0.931507  0.601272
                  Retriever Name  Hit Rate       MRR
0  retriever with added reranker  0.931507  0.601272


# Employing activeloop's deep memory

- This will train a tiny neural network layer to match user queries with relevant data from a corpus.
- It can improve the retrieval accuracy by upto 27%.

In [10]:
def create_query_relevance(qa_dataset):
    """Fun for converting llama index dataset to correct format for deep memeory training"""

    queries= [text for _, text in qa_dataset.queries.items()]

    relevant_docs = qa_dataset.relevant_docs
    relevance = [[(relevant_docs[doc][0], 1)]  for doc in relevant_docs]
    return queries, relevance

train_queries, train_relevance = create_query_relevance(qc_dataset)
print(len(train_queries))

146


In [16]:
import deeplake
local = "./data/paul_graham/deep_lake_db"
username = "akshatunsubscribe"

hub_path = f"hub://{username}/optimization_paul_graham"
hub_managed_path = f"hub://{username}/optimization_paul_graham_managed"

# First upload our local vector store
deeplake.deepcopy(local, hub_path, overwrite=True)
# Create a managed vector store
deeplake.deepcopy(hub_path, hub_managed_path, overwrite=True, runtime={"tensor_db": True})

Copying dataset: 96%|█████████▋| 27/28 [00:31<00:01


This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/akshatunsubscribe/optimization_paul_graham
Your Deep Lake dataset has been successfully created!


Copying dataset: 96%|█████████▋| 27/28 [00:48<00:01


This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/akshatunsubscribe/optimization_paul_graham_managed
Your Deep Lake dataset has been successfully created!


Dataset(path='hub://akshatunsubscribe/optimization_paul_graham_managed', tensors=['embedding', 'id', 'metadata', 'text'])

## Loading dataset from deep lake cloud storage

In [17]:
import os
from llama_index.core import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")
embed_model = OpenAIEmbedding()

vector_store = DeepLakeVectorStore(
    dataset_path=hub_managed_path,
    overwrite=False,
    runtime={"tensor_db": True},
    read_only=True,
)

service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

vector_index = VectorStoreIndex.from_vector_store(
    vector_store,
    service_context=service_context,
    storage_context=storage_context,
    use_async=False,
    show_progress=True,
)

Deep Lake Dataset in hub://akshatunsubscribe/optimization_paul_graham_managed already exists, loading from the storage


  service_context = ServiceContext.from_defaults(


In [19]:
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

job_id = vector_store._vectorstore.deep_memory.train(
    queries=train_queries,
    relevance=train_relevance,
    embedding_function=embeddings.embed_documents,
)

Starting DeepMemory training job


  warn_deprecated(


Your Deep Lake dataset has been successfully created!


 

Preparing training data for deepmemory:


Creating 146 embeddings in 1 batches of size 146:: 100%|██████████| 1/1 [00:37<00:00, 37.52s/it]


DeepMemory training job started. Job ID: 660126675787feddb98239c6


In [27]:
vector_store._vectorstore.deep_memory.status(job_id)

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/akshatunsubscribe/optimization_paul_graham_managed
--------------------------------------------------------------
|                  660126675787feddb98239c6                  |
--------------------------------------------------------------
| status                     | completed                     |
--------------------------------------------------------------
| progress                   | eta: 0.5 seconds              |
|                            | recall@10: 86.21% (+10.34%)   |
--------------------------------------------------------------
| results                    | recall@10: 86.21% (+10.34%)   |
--------------------------------------------------------------




## Evaluate deep memory new embeddings retrieval

In [23]:
from llama_index.core.evaluation import generate_question_context_pairs

if not os.path.exists("./data/test_dataset.json"):
    # Generate dataset
    test_dataset = generate_question_context_pairs(
        nodes[:20],
        llm= llm,
        num_questions_per_chunk= 1
    )

    test_dataset.save_json("./data/test_dataset.json")

else:
    # Load from local if already exists
    from llama_index.core.evaluation import EmbeddingQAFinetuneDataset


    test_dataset = EmbeddingQAFinetuneDataset.from_json("./data/test_dataset.json")


In [24]:
# start creating dataset for deep memory
test_queries, test_relevance= create_query_relevance(test_dataset)

### Evaluate test set for @recall

- **Recall** is calculates as  `TP/(TP + FN)` where TP= True Pos; FN= False Neg.

- In LLM's we @recall formula is (# of relevant items retrieved)/(Total # of relevant items in dataset).

- Compared to hit rate recall is about the system's attention to detail in retrieving all the relevant items as we consider all the retrieved items relevancy, and hit rate is ensuring that each query retrieves something relevant.

In [29]:
recalls = vector_store._vectorstore.deep_memory.evaluate(
    queries= test_queries,
    relevance= test_relevance,
    embedding_function= embeddings.embed_documents
)

Embedding queries took 1.73 seconds
---- Evaluating without Deep Memory ---- 
Recall@1:	  38.8%
Recall@3:	  79.6%
Recall@5:	  87.8%
Recall@10:	  93.9%
Recall@50:	  100.0%
Recall@100:	  100.0%
---- Evaluating with Deep Memory ---- 
Recall@1:	  49.0%
Recall@3:	  81.6%
Recall@5:	  89.8%
Recall@10:	  91.8%
Recall@50:	  100.0%
Recall@100:	  100.0%


In [31]:
import os
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core.evaluation import (
    RetrieverEvaluator,
)

base_retriever = vector_index.as_retriever(similarity_top_k=10)
deep_memory_retriever = vector_index.as_retriever(
    similarity_top_k=10, vector_store_kwargs={"deep_memory": True}
)

base_retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=base_retriever
)
eval_results = await base_retriever_evaluator.aevaluate_dataset(test_dataset)
print(display_results_retriever("Retriever Results", eval_results, True))

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
[1.0, 1.0, 0.5, 0.5, 1.0, 0.1, 1.0, 0.25, 0.5, 1.0, 0.0, 1.0, 0.5, 0.5, 0.2, 1.0, 0.5, 1.0, 0.14285714285714285, 0.5, 0.5, 1.0, 0.5, 0.1, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 0.3333333333333333, 1.0, 0.5, 1.0, 0.3333333333333333, 1.0, 0.5, 0.0, 1.0, 0.25, 0.5, 0.5, 0.5, 0.5, 0.2]
      Retriever Name  Hit Rate       MRR
0  Retriever Results  0.938776  0.610398


Observation:

- Looking at the below table we can see that deep memory indeed increases the MRR and Hit Rate.
- The gains may not be too high as we are using a very small train/test dataset, using a bigger dataset can significantly increase the evaluation metrics.

```text

     Retriever Name               Hit Rate       MRR
0  Retriever top_10               0.931507  0.601272
====================
                  Retriever Name  Hit Rate       MRR
0  retriever with added reranker  0.931507  0.601272
====================
                  Retriever Name  Hit Rate       MRR
0  retriever with added reranker  0.931507  0.601272
====================
                  Retriever Name  Hit Rate       MRR
0  retriever with added reranker  0.931507  0.601272
====================
```

# Resourses

- [Hit Rate & MRR Blog](https://tamilselvan-subramanian.medium.com/how-hit-rate-and-mrr-measure-llm-retrievers-ai-simplified-series-7203ba2d4032#:~:text=Remember%3A-,Hit%20Rate%20tells%20you%20if%20the%20LLM%20found%20any%20relevant,the%20future%20of%20information%20access.)