# Evaluation of retrieval methods

## Create the ground truth dataset for retrieval evaluation

The ground truth is the dataset that lists, for each query, all the relevant documents that should be retrieved. Think of it as a labeled dataset: for each query, we know in advance which documents are the correct ones to retrieve.

You can create a ground truth dataset in several ways:
- Human annotators: people manually find and label all the relevant documents for each query
- User interaction annotators: in a production system, people or LLMs examine user queries and system results and label the most relevant documents
- LLM synthetic data: an LLM generates a number of synthetic user questions for each record/document that we want to retrieve

In this exercise, the last option will be used.
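
To make the target concrete, here is a minimal sketch of what one ground truth record could look like and how it might be checked against a retriever's output. The field names follow the dataset built at the end of this notebook; the `retrieved_ids` list and the `is_hit` helper are hypothetical and only serve as an illustration.

```python
# One ground truth record: a query plus the id of the document that answers it.
# The values are illustrative; the field names follow the dataset built in this notebook.
ground_truth_record = {
    "question": "When will the course begin?",
    "course": "data-engineering-zoomcamp",
    "document": "c02e79ef",
}

def is_hit(record, retrieved_ids):
    """Return True if the expected document id appears in the retrieved results."""
    return record["document"] in retrieved_ids

# Hypothetical output of some retriever for the query above
retrieved_ids = ["1f6520ca", "c02e79ef", "593f7569"]
print(is_hit(ground_truth_record, retrieved_ids))
```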

In [41]:
import pandas as pd

In [1]:
# Fetch the documents that we want to retrieve
import requests
# Download the documents from the GitHub repo
url_path = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/documents.json'
response = requests.get(url_path)
# Stop early if the download failed
response.raise_for_status()
# Parse the JSON response
docs_raw = response.json()
# Flatten the JSON (add the course name to each document)
documents = []
# For each course in the JSON
for course_dict in docs_raw:
    # Add the course name to each of its documents
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)
# See the first document
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

### Create a unique identifier for each document

In [2]:
# Use the library to create a hash key
import hashlib

# Create the function to generate the unique doc id
def generate_document_id(doc):
    # To create a unique string to hash we take the text from different elements of the document
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    # Create the hash object from the string
    hash_object = hashlib.md5(combined.encode())
    # Create a string from the hashed object
    hash_hex = hash_object.hexdigest()
    # Take the first 8 hex characters of the string
    document_id = hash_hex[:8]
    return document_id

In [3]:
# Generate a unique id for each document
for doc in documents:
    doc['id'] = generate_document_id(doc)
# Examine the first record
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'id': 'c02e79ef'}

In [4]:
# Check for duplicates
from collections import defaultdict
hashes = defaultdict(list)
for doc in documents:
    doc_id = doc['id']
    hashes[doc_id].append(doc)
# See the length
print(len(hashes), len(documents))
# Find the duplicate entries
for k, values in hashes.items():
    if len(values) > 1:
        print(k, len(values))

947 948
593f7569 2
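
The output shows 947 distinct ids for 948 documents, so two records share the id `593f7569`. With this id scheme, that happens whenever two records have the same course, the same question, and the same first 10 characters of text. The sketch below reproduces the effect with two made-up records and shows one possible remedy, keeping only the first record per id. Whether duplicates should be dropped or the id made more specific is a judgment call that this notebook does not settle.

```python
import hashlib

def generate_document_id(doc):
    # Same scheme as above: course, question, and the first 10 characters of the text
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]

# Two made-up records whose texts differ only after the 10th character
doc_a = {"course": "demo-course", "question": "How do I install X?",
         "text": "You can use pip to install it."}
doc_b = {"course": "demo-course", "question": "How do I install X?",
         "text": "You can use conda instead."}

for doc in (doc_a, doc_b):
    doc["id"] = generate_document_id(doc)

# Both records hash the same combined string, so the ids collide
print(doc_a["id"] == doc_b["id"])

# One possible remedy: keep only the first record seen for each id
unique_docs = {}
for doc in (doc_a, doc_b):
    unique_docs.setdefault(doc["id"], doc)
print(len(unique_docs))
```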


### Generate user questions for each record using LLM

In [5]:
# Create the prompt template for the LLM
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as few words as possible from the record.

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

In [9]:
# Import the OpenAI client class
from openai import OpenAI
# Initialize the client (by default it reads the API key from the OPENAI_API_KEY environment variable)
client = OpenAI()
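
The prompt template is filled with `str.format(**doc)`, which unpacks the document's keys as keyword arguments for the `{section}`, `{question}`, and `{text}` placeholders. Keys that the template does not reference, such as `course` and `id`, are ignored, while a missing key raises `KeyError`. A minimal sketch with a shortened, made-up template and record:

```python
# Shortened stand-in for the prompt template (illustrative only)
mini_template = "section: {section}\nquestion: {question}\nanswer: {text}"

# Made-up record with the same keys as the FAQ documents
record = {
    "text": "Yes, you can still join.",
    "section": "General",
    "question": "Can I join late?",
    "course": "demo-course",  # not referenced by the template, so ignored
    "id": "00000000",         # also ignored
}

prompt = mini_template.format(**record)
print(prompt)

# A record without a required key fails loudly instead of producing a broken prompt
missing_key_error = None
try:
    mini_template.format(section="General")
except KeyError as err:
    missing_key_error = err
    print("missing key:", err)
```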

In [19]:
# Generate questions from the first record
sample = documents[0]
# Create the prompt
prompt = prompt_template.format(**sample)
# Make the request
full_response = client.chat.completions.create(
    model = 'gpt-4o',
    messages = [{"role": "user", "content": prompt}])
# Parse the response
response = full_response.choices[0].message.content
# Print the response
print(response)

[
    "When is the exact start date and time of the course?",
    "How can I subscribe to the course's public Google Calendar?",
    "Where can I register before the course begins?",
    "How do I join the course's Telegram channel for announcements?",
    "Which Slack channel should I join after registering in DataTalks.Club?"
]


In [13]:
# Create the function to generate the questions
def generate_questions(doc):
    # Create the prompt from a template
    prompt = prompt_template.format(**doc)
    # Request from the model
    full_response = client.chat.completions.create(
        model = 'gpt-4o',
        messages = [{"role": "user", "content": prompt}])
    # Parse the response
    response = full_response.choices[0].message.content
    return response
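
`generate_questions` returns the raw model output as a string. The prompt asks for a JSON array, but the model may occasionally return something that does not parse, so it can help to validate the string before using it. The helper below, `parse_questions`, is a hypothetical addition and not part of the original notebook; it returns `None` on failure so the caller can decide whether to retry or skip the document.

```python
import json

def parse_questions(raw_response):
    """Parse the model output into a list of question strings, or return None if invalid."""
    try:
        parsed = json.loads(raw_response)
    except (json.JSONDecodeError, TypeError):
        # Not valid JSON, or not a string at all (for example None)
        return None
    # Accept only a JSON array whose items are all strings
    if isinstance(parsed, list) and all(isinstance(q, str) for q in parsed):
        return parsed
    return None

# Made-up responses: one well-formed, one with extra text and a missing bracket
good = '["When does the course start?", "How do I register?"]'
bad = 'Here are the questions: ["When does the course start?"'

print(parse_questions(good))
print(parse_questions(bad))
```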

### Create the ground truth dataset

In [28]:
# Create a subset of the dataset to generate the questions
docs = documents[:5]

# Initialize the results object
results = {}
# For each document generate the user questions
for doc in docs:
    doc_id = doc['id']
    results[doc_id] = generate_questions(doc)
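
Each call to `generate_questions` is an API request, so when scaling from five documents to the full set it may be worth skipping ids that were already processed (the duplicate check above found one repeated id). A sketch of that idea, using a stand-in function `fake_generate_questions` and made-up records so the example runs offline; both are assumptions for illustration only:

```python
import json

def fake_generate_questions(doc):
    # Stand-in for generate_questions: returns a JSON string without calling any API
    return json.dumps([f"Question about: {doc['question']}"])

# Made-up documents, the last one repeats the id of the first
docs = [
    {"id": "aaaa0001", "question": "When will the course start?"},
    {"id": "aaaa0002", "question": "What are the prerequisites?"},
    {"id": "aaaa0001", "question": "When will the course start?"},
]

results = {}
calls = 0
for doc in docs:
    doc_id = doc["id"]
    if doc_id in results:
        continue  # questions already generated for this id, skip the request
    results[doc_id] = fake_generate_questions(doc)
    calls += 1

print(calls, len(results))
```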

In [48]:
import json

# Map each document id to its course, so the lookup does not rely on loop order
doc_course = {doc['id']: doc['course'] for doc in docs}

ids = []
user_query = []
course = []

for doc_id, questions_string in results.items():
    # Convert the JSON string of questions to a list
    questions = json.loads(questions_string)
    for query in questions:
        ids.append(doc_id)
        user_query.append(query)
        course.append(doc_course[doc_id])
# Create the dictionary to save as a dataframe
results_dic = {'document': ids, 'question': user_query, 'course': course}
# Create the dataframe with the ground truth dataset
df = pd.DataFrame(results_dic)
# View the dataset
df.head(10)

| | document | question | course |
|---|---|---|---|
| 0 | c02e79ef | When will the course begin? | data-engineering-zoomcamp |
| 1 | c02e79ef | How can I add the course schedule to my calendar? | data-engineering-zoomcamp |
| 2 | c02e79ef | Where should I register before the course starts? | data-engineering-zoomcamp |
| 3 | c02e79ef | Is there a Telegram channel for course announc... | data-engineering-zoomcamp |
| 4 | c02e79ef | Should I join any specific Slack channels for ... | data-engineering-zoomcamp |
| 5 | 1f6520ca | What background knowledge do I need before enr... | data-engineering-zoomcamp |
| 6 | 1f6520ca | Are there any specific skills required to star... | data-engineering-zoomcamp |
| 7 | 1f6520ca | Where can I find the required prerequisites to... | data-engineering-zoomcamp |
| 8 | 1f6520ca | Is there a list of prerequisites available for... | data-engineering-zoomcamp |
| 9 | 1f6520ca | How can I check the prerequisites for this cou... | data-engineering-zoomcamp |
