# Evaluation of retrieval methods

## Create the ground truth dataset for retrieval evaluation

The ground truth is a dataset that lists, for each query, the documents that the retrieval system should return. Think of it as a labelled dataset: the correct documents for every query are known in advance.

You can create a ground truth dataset in several ways:
- Human annotators: people manually find and label all the relevant documents for each query
- User interaction annotation: in a production system, people or LLMs examine user queries and system results and label the most relevant documents
- LLM synthetic data: an LLM generates a number of synthetic user questions for each record/document that we want to retrieve

In this exercise, the last option will be used.

In [1]:
import pandas as pd

In [2]:
# Download the documents that we want to retrieve
import requests
# The documents are downloaded from the GitHub repo
url_path = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/documents.json'
response = requests.get(url_path)
# Parse the JSON response
docs_raw = response.json()
# Flatten the JSON (add the course name to each document)
documents = []
# For each course in the JSON
for courses in docs_raw:
    # Add the course name to the document
    for doc in courses['documents']:
        doc['course'] = courses['course']
        documents.append(doc)
# See the first document
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

### Create a unique identifier for each document

In [3]:
# Use the library to create a hash key
import hashlib

# Create the function to generate the unique doc id
def generate_document_id(doc):
    # To create a unique string to hash we take the text from different elements of the document
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    # Create the hash object from the string
    hash_object = hashlib.md5(combined.encode())
    # Create a string from the hashed object
    hash_hex = hash_object.hexdigest()
    # Take the first 8 hex digits of the string
    document_id = hash_hex[:8]
    return document_id

In [4]:
# Generate a unique id for each document
for doc in documents:
    doc['id'] = generate_document_id(doc)
# Examine the first record
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'id': 'c02e79ef'}

In [5]:
# Check for duplicates
from collections import defaultdict
hashes = defaultdict(list)
for doc in documents:
    doc_id = doc['id']
    hashes[doc_id].append(doc)
# Compare the number of unique ids with the number of documents
print(len(hashes), len(documents))
# Find the duplicate entries
for k, values in hashes.items():
    if len(values) > 1:
        print(k, len(values))

947 948
593f7569 2


### Generate user questions for each record using an LLM

In [5]:
# Create the prompt template for the LLM
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as few words as possible from the record.

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

In [9]:
# Import the OpenAI client
from openai import OpenAI
# Initialize the client (reads the API key from the OPENAI_API_KEY environment variable)
client = OpenAI()

In [19]:
# Generate questions from the first record
sample = documents[0]
# Create the prompt
prompt = prompt_template.format(**sample)
# Make the request
full_response = client.chat.completions.create(
    model = 'gpt-4o',
    messages = [{"role": "user", "content": prompt}])
# Parse the response
response = full_response.choices[0].message.content
# Print the response
print(response)

[
    "When is the exact start date and time of the course?",
    "How can I subscribe to the course's public Google Calendar?",
    "Where can I register before the course begins?",
    "How do I join the course's Telegram channel for announcements?",
    "Which Slack channel should I join after registering in DataTalks.Club?"
]


In [13]:
# Create the function to generate the questions
def generate_questions(doc):
    # Create the prompt from a template
    prompt = prompt_template.format(**doc)
    # Request from the model
    full_response = client.chat.completions.create(
        model = 'gpt-4o',
        messages = [{"role": "user", "content": prompt}])
    # Parse the response
    response = full_response.choices[0].message.content
    return response
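
The prompt asks for parsable JSON, but the model output is not guaranteed to follow that instruction. The sketch below shows one way to validate the returned string before using it. `parse_questions` is a hypothetical helper, not part of this notebook, and the fence-stripping step is an assumption about how a response might deviate.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

def parse_questions(raw):
    """Parse the raw model output into a list of question strings."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and, if present, the closing fence line
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed input
    questions = json.loads(text)
    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise ValueError("expected a JSON list of strings")
    return questions

# Example with a well-formed response
print(parse_questions('["When does it start?", "Where do I register?"]'))
```

Catching `ValueError` around this call makes it possible to retry or skip a record instead of failing in the middle of a long generation run.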

### Create the ground truth dataset

In [28]:
# Create a subset of the dataset to generate the questions
docs = documents[:5]

# Initialize the results object
results = {}
# For each document generate the user questions
for doc in docs:
    doc_id = doc['id']
    results[doc_id] = generate_questions(doc)

In [48]:
import json

ids = []
user_query = []
course = []

# Look up each document by its id to recover the course
docs_by_id = {doc['id']: doc for doc in docs}

for doc_id, questions_string in results.items():
    # Convert the JSON string of questions to a list
    questions = json.loads(questions_string)
    for query in questions:
        ids.append(doc_id)
        user_query.append(query)
        course.append(docs_by_id[doc_id]['course'])
# Create the dictionary to save as a dataframe
results_dic = {'document':ids,'question': user_query, 'course':course}
# Create the dataframe with the ground truth dataset
df = pd.DataFrame(results_dic)
# View the dataset
df.head(10)

Unnamed: 0,document,question,course
0,c02e79ef,When will the course begin?,data-engineering-zoomcamp
1,c02e79ef,How can I add the course schedule to my calendar?,data-engineering-zoomcamp
2,c02e79ef,Where should I register before the course starts?,data-engineering-zoomcamp
3,c02e79ef,Is there a Telegram channel for course announc...,data-engineering-zoomcamp
4,c02e79ef,Should I join any specific Slack channels for ...,data-engineering-zoomcamp
5,1f6520ca,What background knowledge do I need before enr...,data-engineering-zoomcamp
6,1f6520ca,Are there any specific skills required to star...,data-engineering-zoomcamp
7,1f6520ca,Where can I find the required prerequisites to...,data-engineering-zoomcamp
8,1f6520ca,Is there a list of prerequisites available for...,data-engineering-zoomcamp
9,1f6520ca,How can I check the prerequisites for this cou...,data-engineering-zoomcamp


## Evaluate the different search methods 

To evaluate the search methods that we have created for our RAG system, we will compute and compare the following metrics:
- **Hit Rate (HR) at k**: the fraction of retrieval requests whose top k results contain the relevant document
- **Mean Reciprocal Rank (MRR)**: also takes the rank of the relevant document into account, so a response that ranks the relevant document higher gets a bigger score
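
As a sketch, assume each query $q$ in the evaluation set $Q$ has exactly one relevant document, and write $\mathrm{rank}_q$ for its position in the returned list (these symbols are introduced here for illustration). A query whose relevant document is missing from the top $k$ results contributes 0 to both sums. The two metrics can then be written as:

$$
\mathrm{HR@}k = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\left[\mathrm{rank}_q \le k\right],
\qquad
\mathrm{MRR@}k = \frac{1}{|Q|} \sum_{q \in Q} \frac{\mathbb{1}\left[\mathrm{rank}_q \le k\right]}{\mathrm{rank}_q}
$$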

In [3]:
# Download the full dataset with ground truth
!wget https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-vector-search/eval/ground-truth-data.csv

--2024-09-09 06:22:24--  https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-vector-search/eval/ground-truth-data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 504747 (493K) [text/plain]
Saving to: ‘ground-truth-data.csv’


2024-09-09 06:22:24 (22.3 MB/s) - ‘ground-truth-data.csv’ saved [504747/504747]



In [6]:
# Open the full dataset with the ground truth
df_ground_truth = pd.read_csv('ground-truth-data.csv')
df_ground_truth.head(10)

Unnamed: 0,question,course,document
0,When does the course begin?,data-engineering-zoomcamp,c02e79ef
1,How can I get the course schedule?,data-engineering-zoomcamp,c02e79ef
2,What is the link for course registration?,data-engineering-zoomcamp,c02e79ef
3,How can I receive course announcements?,data-engineering-zoomcamp,c02e79ef
4,Where do I join the Slack channel?,data-engineering-zoomcamp,c02e79ef
5,Where can I find the prerequisites for this co...,data-engineering-zoomcamp,1f6520ca
6,How do I check the prerequisites for this course?,data-engineering-zoomcamp,1f6520ca
7,Where are the course prerequisites listed?,data-engineering-zoomcamp,1f6520ca
8,What are the requirements for joining this cou...,data-engineering-zoomcamp,1f6520ca
9,Where is the list of prerequisites for the cou...,data-engineering-zoomcamp,1f6520ca


In [7]:
# Create a list of records
ground_truth = df_ground_truth.to_dict(orient='records')
print(ground_truth[:5])

[{'question': 'When does the course begin?', 'course': 'data-engineering-zoomcamp', 'document': 'c02e79ef'}, {'question': 'How can I get the course schedule?', 'course': 'data-engineering-zoomcamp', 'document': 'c02e79ef'}, {'question': 'What is the link for course registration?', 'course': 'data-engineering-zoomcamp', 'document': 'c02e79ef'}, {'question': 'How can I receive course announcements?', 'course': 'data-engineering-zoomcamp', 'document': 'c02e79ef'}, {'question': 'Where do I join the Slack channel?', 'course': 'data-engineering-zoomcamp', 'document': 'c02e79ef'}]


### Create the two metrics we will examine

In [8]:
# Create the HR metric
def hit_rate(relevance_total):
    cnt = 0
    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

In [9]:
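# Create the MRR metric
# Note: every hit adds 1 / rank, so this equals the standard MRR only when
# each result list contains at most one relevant document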
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)
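
A small worked example may help to check the two metrics by hand. The functions are restated compactly so the snippet runs on its own, and the relevance lists are made up for illustration.

```python
def hit_rate(relevance_total):
    # Fraction of queries with at least one relevant document in the results
    return sum(1 for line in relevance_total if True in line) / len(relevance_total)

def mrr(relevance_total):
    # Sum of 1 / rank over all hits, averaged over the queries
    total_score = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total_score += 1 / rank
    return total_score / len(relevance_total)

# One list per query, one boolean per returned document (made-up data)
toy_relevance = [
    [False, True, False],   # relevant document at rank 2 -> contributes 1/2
    [True, False, False],   # relevant document at rank 1 -> contributes 1
    [False, False, False],  # relevant document not retrieved -> contributes 0
]

toy_hr = hit_rate(toy_relevance)   # 2 of 3 queries have a hit
toy_mrr = mrr(toy_relevance)       # (1/2 + 1 + 0) / 3
print(toy_hr, toy_mrr)
```

Two of the three queries contain a hit, so the hit rate is 2/3, while the MRR is (1/2 + 1 + 0) / 3 = 0.5 because one of the hits is only at rank 2.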

### Evaluate Elasticsearch text search

In [10]:
# Import the Elasticsearch client and the tqdm progress bar
from elasticsearch import Elasticsearch
from tqdm.auto import tqdm

In [11]:
# Initialize the client
es_client = Elasticsearch('http://localhost:9200') # Port exposed by the Elasticsearch Docker container

In [12]:
# Create the Schema of the Elastic Search Index
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"}
        }
    }
}

# Provide the name of the index
index_name = "course-questions"
# Create the Elasticsearch index
response = es_client.indices.create(index=index_name, body=index_settings)
# Verify that the index was created
response

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [18]:
# Index the documents into Elasticsearch
for doc in tqdm(documents):
    es_client.index(index = index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]

In [29]:
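def elastic_search(query, course):
    # Full-text search over question (boosted x3), text and section,
    # restricted to one course; returns the top 5 documents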
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [30]:
# Run each ground truth query and record which results match the relevant document
relevance_total = []
# Send a search request for each query in the ground truth
for q in tqdm(ground_truth):
    doc_id = q['document']
    results = elastic_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/4627 [00:00<?, ?it/s]

In [31]:
# Evaluate the search method
hit_rate(relevance_total), mrr(relevance_total)

(0.7395720769397017, 0.6029788920106625)

In [13]:
# Create a function to evaluate different search_functions
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
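
`evaluate` passes the whole ground truth record to `search_function`, so a search function with a different signature, such as `elastic_search(query, course)`, needs a small adapter, for example `lambda q: elastic_search(query=q['question'], course=q['course'])`. The sketch below shows the same pattern without a running Elasticsearch instance. `toy_docs`, `toy_search`, `toy_ground_truth` and `relevance_lists` are hypothetical stand-ins made up for illustration.

```python
# Made-up documents and ground truth records
toy_docs = [
    {'id': 'a1', 'course': 'demo', 'question': 'When does the course start?'},
    {'id': 'b2', 'course': 'demo', 'question': 'How do I install Docker?'},
]
toy_ground_truth = [
    {'question': 'When will the course start?', 'course': 'demo', 'document': 'a1'},
    {'question': 'How can I install Docker?', 'course': 'demo', 'document': 'b2'},
]

def toy_search(query, course):
    # Rank the documents of one course by the number of words shared with the query
    words = set(query.lower().split())
    candidates = [d for d in toy_docs if d['course'] == course]
    return sorted(candidates,
                  key=lambda d: -len(words & set(d['question'].lower().split())))

def relevance_lists(ground_truth, search_function):
    # Same loop as in evaluate(), without the progress bar and the metric calls
    relevance_total = []
    for q in ground_truth:
        results = search_function(q)
        relevance_total.append([d['id'] == q['document'] for d in results])
    return relevance_total

# The adapter unpacks the record into the arguments the search function expects
adapter = lambda q: toy_search(query=q['question'], course=q['course'])
relevance = relevance_lists(toy_ground_truth, adapter)
print(relevance)
```

Keeping the adapter outside the search function means the same evaluation loop can be reused for any retrieval method that returns documents with an `id` field.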

## Evaluate vector search

To evaluate the vector search, we follow these steps:
- Initialize a transformer model to create embeddings
- Create the embeddings of the specific fields of each record
- Adjust the index settings in Elasticsearch and index these embeddings
- Create the embedding of the user query and query the data
- Calculate the HR and MRR metrics by comparing the retrieved data with the ground truth dataset
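
The index and query steps above can be sketched as plain dictionaries, which keeps the snippet runnable without a cluster. This is an illustration under stated assumptions, not the notebook's actual settings: the field name `question_text_vector`, the helper `build_knn_query`, and the `k` and `num_candidates` defaults are hypothetical. The dimension 384 matches the output size of the `multi-qa-MiniLM-L6-cos-v1` model.

```python
EMBEDDING_DIMS = 384  # output size of multi-qa-MiniLM-L6-cos-v1

# Index settings: the text fields used before plus one dense_vector field
vector_index_settings = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
            # Hypothetical field holding the embedding of each record
            "question_text_vector": {
                "type": "dense_vector",
                "dims": EMBEDDING_DIMS,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
}

def build_knn_query(query_vector, course, k=5, num_candidates=10000):
    # Approximate nearest-neighbour query restricted to one course
    return {
        "field": "question_text_vector",
        # Plain Python floats serialize cleanly to JSON
        "query_vector": [float(x) for x in query_vector],
        "k": k,
        "num_candidates": num_candidates,
        "filter": {"term": {"course": course}},
    }

# Example with a dummy vector instead of a real embedding
example_query = build_knn_query([0.0] * EMBEDDING_DIMS, "data-engineering-zoomcamp")
print(example_query["k"], len(example_query["query_vector"]))
```

With a running cluster and a loaded model, the query could be sent with something like `es_client.search(index=index_name, knn=build_knn_query(model.encode(question), course))`; the exact call depends on the installed client version.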

In [14]:
# Import the library to create the embeddings
from sentence_transformers import SentenceTransformer

# Initialize the selected model to create the embeddings
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
