## Evaluation of Text Retrieval Techniques for RAG 

In [1]:
import json

with open('documents-with-ids.json', 'rt') as f_in:
    documents = json.load(f_in)

```bash
docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```

In [2]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200') 

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
        }
    }
}

index_name = "course-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [3]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/1010 [00:00<?, ?it/s]
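
Indexing one document per request is fine for roughly a thousand documents, but it costs one HTTP round trip per document. A minimal sketch of a batched alternative follows, assuming the `elasticsearch.helpers.bulk` helper from the official Python client is available. `generate_actions` is a hypothetical helper name, and the sample documents are made up for illustration:

```python
def generate_actions(documents, index_name):
    """Yield one bulk action per document (hypothetical helper, not from the notebook)."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


# Made-up documents with the same fields as the index mapping
sample_docs = [
    {"id": "doc-1", "course": "demo-course", "question": "q1", "section": "s", "text": "t1"},
    {"id": "doc-2", "course": "demo-course", "question": "q2", "section": "s", "text": "t2"},
]
actions = list(generate_actions(sample_docs, "course-questions"))
print(actions[0])

# With a running cluster, the actions could be sent in a single call
# (assumption: the same es_client, documents and index_name as above):
# from elasticsearch.helpers import bulk
# bulk(es_client, generate_actions(documents, index_name))
```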

In [4]:
def elastic_search(query, course):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs
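
The request body is the part of `elastic_search` most likely to be tuned (field boosts, number of results). The sketch below isolates it in a pure function so it can be inspected without a running cluster. `build_search_query` and its `size` and `question_boost` parameters are hypothetical and not part of the notebook:

```python
def build_search_query(query, course, size=5, question_boost=3):
    """Build the same request body as elastic_search (hypothetical helper)."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text match across the three text fields, question boosted
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": [f"question^{question_boost}", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Exact-match restriction to a single course
                "filter": {"term": {"course": course}},
            }
        },
    }


body = build_search_query("When does the course begin?", "data-engineering-zoomcamp")
print(body["query"]["bool"]["must"]["multi_match"]["fields"])
```

The `filter` clause restricts results to one course without contributing to the relevance score, while `question^3` boosts matches in the `question` field by a factor of 3.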

In [5]:
import pandas as pd 

In [6]:
df_ground_truth = pd.read_csv("ground-truth-data.csv")

In [7]:
df_ground_truth

| Unnamed: 0 | question | course | document |
|---|---|---|---|
| 0 | When does the course begin? | data-engineering-zoomcamp | c02e79ef |
| 1 | How can I get the course schedule? | data-engineering-zoomcamp | c02e79ef |
| 2 | What is the link for course registration? | data-engineering-zoomcamp | c02e79ef |
| 3 | How can I receive course announcements? | data-engineering-zoomcamp | c02e79ef |
| 4 | Where do I join the Slack channel? | data-engineering-zoomcamp | c02e79ef |
| ... | ... | ... | ... |
| 4545 | How should I destroy infrastructure created us... | mlops-zoomcamp | 886d1617 |
| 4546 | What is the first step to destroy AWS infrastr... | mlops-zoomcamp | 886d1617 |
| 4547 | Can I destroy infrastructure created with GitH... | mlops-zoomcamp | 886d1617 |
| 4548 | What command initializes Terraform with specif... | mlops-zoomcamp | 886d1617 |


In [8]:
ground_truth = df_ground_truth.to_dict(orient="records")
#ground_truth

In [9]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q["document"]
    results = elastic_search(query=q["question"], course=q["course"])
    # Flag each result as relevant if its id matches the expected document (simple, not optimized)
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)
     

  0%|          | 0/4550 [00:00<?, ?it/s]

In [10]:
relevance

[True, False, False, False, False]

*from [PerplexityAI](https://www.perplexity.ai)*

---

To evaluate the performance of RAG systems, two key metrics are often used: hit rate and Mean Reciprocal Rank (MRR).

## Hit Rate

The hit rate is a metric also used in other fields, such as sales performance, and it adapts naturally to evaluating RAG systems. In this context, it can be defined as follows:

**Hit Rate = Number of Successful Retrievals / Total Number of Queries**

A successful retrieval occurs when the RAG system retrieves relevant information that contributes to generating an accurate response. In this notebook, a retrieval counts as successful when the ground-truth document appears among the top 5 results. The hit rate therefore measures how often the system retrieves the expected information.

**Key points about hit rate:**

- It ranges from 0 to 1 and is often reported as a percentage from 0% to 100%.
- A higher hit rate indicates better performance of the retrieval component.
- It can be used to compare different RAG systems or to track improvements over time.
- The definition of a "successful retrieval" may vary depending on the specific application and requirements.

## Mean Reciprocal Rank (MRR)

MRR is a more nuanced metric that takes into account not just whether relevant information was retrieved, but also its position in the list of retrieved items. It's particularly useful for evaluating ranking systems, including those used in RAGs.

The formula for MRR is:

$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$$

Where:
- $|Q|$ is the total number of queries
- $rank_i$ is the position (counting from 1) of the first relevant item for query $i$. For example, for the relevance list `[False, False, True, False, False]`, $rank_i = 3$. A query with no relevant item contributes 0 to the sum.

**Key points about MRR:**

- MRR ranges from 0 to 1, with 1 being the best possible score.
- It emphasizes the importance of ranking relevant information higher in the list of retrieved items.
- MRR is particularly useful when only the first relevant result matters, as it focuses on the rank of the first correct answer.
- It's often used in information retrieval systems, question-answering systems, and recommendation systems[3][6].
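
As a small worked example with made-up data, suppose there are three queries whose first relevant result appears at rank 1, at rank 3, and not at all:

$$MRR = \frac{1}{3}\left(\frac{1}{1} + \frac{1}{3} + 0\right) = \frac{4}{9} \approx 0.44$$

The hit rate for the same three queries is $2/3 \approx 0.67$, since two of them retrieve a relevant item.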

## Applying Hit Rate and MRR to RAGs

When evaluating RAG systems, both hit rate and MRR can provide valuable insights:

1. **Hit Rate for RAGs**: This metric can help assess how often the RAG system successfully retrieves relevant information from its external knowledge base. A high hit rate indicates that the system is effectively finding and utilizing external information to augment its responses.

2. **MRR for RAGs**: MRR can be used to evaluate the ranking capability of the RAG system. It helps determine how well the system prioritizes the most relevant information. A high MRR suggests that the system not only retrieves relevant information but also ranks it appropriately, ensuring that the most useful data is readily available for generating responses.

3. **Combined Analysis**: Using both metrics together can provide a more comprehensive evaluation. While hit rate gives an overall success rate, MRR offers insights into the quality of the ranking. A system with a high hit rate but low MRR might be retrieving relevant information but failing to prioritize it effectively.

4. **Benchmarking and Improvement**: These metrics can be used to compare different RAG implementations, track improvements over time, and identify areas for optimization. For example, if the hit rate is high but MRR is low, efforts might focus on improving the ranking algorithm.

5. **Context-Specific Evaluation**: The interpretation of these metrics can vary depending on the specific application of the RAG system. For instance, in a customer service chatbot, a very high MRR might be crucial to ensure that the most relevant information is immediately available to address customer queries.

By utilizing both hit rate and MRR, developers and researchers can gain a nuanced understanding of their RAG system's performance, helping to guide improvements and ensure that the system effectively leverages external knowledge to enhance its responses[1][4].
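
The "high hit rate but low MRR" case can be illustrated with a short sketch. The relevance lists are made up, and `hit_rate_of` and `mrr_of` are hypothetical compact helpers written only for this illustration:

```python
def hit_rate_of(relevance_total):
    # Fraction of queries with at least one relevant result
    return sum(any(line) for line in relevance_total) / len(relevance_total)


def mrr_of(relevance_total):
    # Mean of 1/rank of the first relevant result; a query with none adds 0
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)


# Made-up top-3 relevance flags for two hypothetical retrievers on the same 4 queries
system_a = [[True, False, False], [True, False, False], [False, True, False], [False, False, False]]
system_b = [[False, False, True], [False, False, True], [False, True, False], [False, False, False]]

for name, rel in [("A", system_a), ("B", system_b)]:
    print(name, "hit rate:", hit_rate_of(rel), "MRR:", round(mrr_of(rel), 3))
```

Both systems find a relevant document for three of the four queries, so their hit rates are equal, but system B places it lower in the list and therefore gets a noticeably lower MRR.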

Citations:
* [1] https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/
* [2] https://en.wikipedia.org/wiki/Hit_rate
* [3] https://ru.wikipedia.org/wiki/%D0%A1%D1%80%D0%B5%D0%B4%D0%BD%D0%B5%D0%BE%D0%B1%D1%80%D0%B0%D1%82%D0%BD%D1%8B%D0%B9_%D1%80%D0%B0%D0%BD%D0%B3
* [4] https://cloud.google.com/use-cases/retrieval-augmented-generation
* [5] https://1up.ai/sales-hit-rate/
* [6] https://www.evidentlyai.com/ranking-metrics/mean-reciprocal-rank-mrr


In [11]:
example = [
    # One row per query: top-5 relevance flags; trailing comment: 1 = hit, 0 = miss
    [True, False, False, False, False],   # 1
    [False, False, False, False, False],  # 0
    [False, False, False, False, False],  # 0
    [False, False, False, False, False],  # 0
    [False, False, False, False, False],  # 0
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [False, False, True, False, False],   # 1 (relevant result at rank 3)
    [False, False, False, False, False],  # 0
]


In [12]:
def hit_rate(relevance_total):
    count = 0

    for line in relevance_total:
        if True in line:
            count = count + 1

    return count / len(relevance_total)

def mrr(relevance_total):
    score = 0
    for q in relevance_total:
        for rank, is_relevant in enumerate(q, start=1):
            if is_relevant:
                # Only the first relevant result counts toward the reciprocal rank
                score += 1 / rank
                break

    return score / len(relevance_total)

In [13]:
print("Hit Rate:", hit_rate(example))
print('MRR:', mrr(example))

Hit Rate: 0.5833333333333334
MRR: 0.5277777777777778
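
These values can be checked by hand. Seven of the twelve example queries contain a relevant result: six have it at rank 1 and one has it at rank 3.

$$\text{Hit Rate} = \frac{7}{12} \approx 0.583, \qquad MRR = \frac{1}{12}\left(6 \cdot 1 + \frac{1}{3}\right) = \frac{19}{36} \approx 0.528$$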


In [14]:
print("Hit Rate:", hit_rate(relevance_total))
print('MRR:', mrr(relevance_total))

Hit Rate: 0.7345054945054945
MRR: 0.5993626373626381


### Compare Elasticsearch and minsearch



In [16]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

<minsearch.Index at 0x723cc088c610>

In [17]:
def minsearch_search(query, course):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

In [18]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/4550 [00:00<?, ?it/s]

In [20]:
print("Hit Rate:", hit_rate(relevance_total))
print('MRR:', mrr(relevance_total))

Hit Rate: 0.7687912087912088
MRR: 0.6596373626373627


#### minsearch

* Hit Rate: 0.7687912087912088

* MRR: 0.6596373626373627

#### Elasticsearch

* Hit Rate: 0.7345054945054945

* MRR: 0.5993626373626381
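
To put the gap in perspective, the relative change between the two backends can be computed directly from the numbers above. This is a small sketch; the dictionary names are chosen here for illustration:

```python
# Metrics copied from the two runs above
elastic_metrics = {"hit_rate": 0.7345054945054945, "mrr": 0.5993626373626381}
minsearch_metrics = {"hit_rate": 0.7687912087912088, "mrr": 0.6596373626373627}

# Relative change of minsearch over Elasticsearch for each metric
relative_change = {
    name: (minsearch_metrics[name] - elastic_metrics[name]) / elastic_metrics[name]
    for name in elastic_metrics
}

for name, change in relative_change.items():
    print(f"{name}: {change:+.1%}")
```

On this dataset that is a relative gain of roughly 5% in hit rate and 10% in MRR for minsearch. The two configurations are not identical (the minsearch run down-weights `section` to 0.5), so the gap reflects both the engine and its settings.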

### Function for easier comparison 


In [21]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

In [22]:
evaluate(ground_truth, lambda q: elastic_search(q['question'], q['course']))

  0%|          | 0/4550 [00:00<?, ?it/s]

{'hit_rate': 0.7345054945054945, 'mrr': 0.5993626373626381}

In [23]:
evaluate(ground_truth, lambda q: minsearch_search(q['question'], q['course']))

  0%|          | 0/4550 [00:00<?, ?it/s]

{'hit_rate': 0.7687912087912088, 'mrr': 0.6596373626373627}