# Evaluation of retrieval methods

## Create the ground truth dataset for retrieval evaluation

The ground truth is a dataset that lists, for each query, all the relevant documents that should be retrieved. Think of it as a labeled dataset: for every query, we know in advance which documents are the correct ones to retrieve.
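
As a minimal sketch of what such a dataset can look like, the snippet below stores the ground truth as a mapping from each query to the ids of its relevant documents and checks a set of retrieved results against it. The queries, the ids, and the `hit_rate` helper are hypothetical and only illustrate the idea; they are not part of the course data.

```python
# Hypothetical ground truth: each query maps to the ids of the documents
# that a retrieval method should return for it.
ground_truth = {
    "When does the course start?": {"doc-a"},
    "How do I join the Slack channel?": {"doc-b", "doc-c"},
}

def hit_rate(ground_truth, retrieved):
    """Fraction of queries for which at least one relevant document was retrieved."""
    hits = 0
    for query, relevant_ids in ground_truth.items():
        # A query counts as a hit if any retrieved id is in its relevant set
        if relevant_ids & set(retrieved.get(query, [])):
            hits += 1
    return hits / len(ground_truth)

# Hypothetical output of a retrieval method for the same queries
retrieved = {
    "When does the course start?": ["doc-a", "doc-x"],
    "How do I join the Slack channel?": ["doc-y", "doc-z"],
}

print(hit_rate(ground_truth, retrieved))  # 0.5
```

Only the first query has a relevant document among its results, so half of the queries count as hits.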

You can create a ground truth dataset in several ways:
- Human annotators: people manually find and label all the relevant documents for each query.
- User interaction annotators: in a production system, people or LLMs examine user queries and system results and label the most relevant documents.
- LLM synthetic data: an LLM generates a number of synthetic user questions for each record/document that we want to retrieve.

In this exercise, the last option will be used.
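
A rough sketch of how the synthetic-data option could be prepared is shown below. It only builds the prompt text for one record; the call to an LLM is left out. The template wording, the `build_question_prompt` name, and the default of five questions are assumptions made for illustration, while the `section`, `question` and `text` fields match the records used in this notebook.

```python
# Hypothetical prompt template; the wording and the number of questions are assumptions.
PROMPT_TEMPLATE = """
You emulate a student who is taking this course.
Write {n_questions} questions this student might ask that can be answered
from the FAQ record below. Use as few words from the record as possible.

section: {section}
question: {question}
answer: {text}
""".strip()

def build_question_prompt(doc, n_questions=5):
    """Fill the template with the fields of one FAQ record."""
    return PROMPT_TEMPLATE.format(
        n_questions=n_questions,
        section=doc["section"],
        question=doc["question"],
        text=doc["text"],
    )

# Toy record with the same fields as the course documents
example_doc = {
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "text": "The course will start on 15th Jan 2024 at 17h00.",
}

print(build_question_prompt(example_doc))
```

The resulting string would be sent to an LLM once per document, and each generated question would be stored together with the id of the document it came from.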

In [2]:
# Download the documents that we want to retrieve
import requests
# The documents are downloaded from the GitHub repo
url_path = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/documents.json'
response = requests.get(url_path)
# Stop early if the download failed
response.raise_for_status()
# Parse the JSON file
docs_raw = response.json()
# Flatten the JSON (add the course name to each document)
documents = []
# For each course in the JSON
for course in docs_raw:
    # Add the course name to each of its documents
    for doc in course['documents']:
        doc['course'] = course['course']
        documents.append(doc)
# See the first document
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

### Create a unique identifier for each document

In [4]:
# hashlib provides the hash functions
import hashlib

# Create the function to generate the unique doc id
def generate_document_id(doc):
    # Build the string to hash from the course, the question and the first 10 characters of the text
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    # Create the hash object from the string
    hash_object = hashlib.md5(combined.encode())
    # Get the hexadecimal digest as a string
    hash_hex = hash_object.hexdigest()
    # Take the first 8 characters of the string
    document_id = hash_hex[:8]
    return document_id
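
The id depends only on the course, the question and the first 10 characters of the text, so two records that agree on those three parts receive the same id even if the rest of the text differs. The sketch below illustrates this with a hypothetical `short_id` helper that mirrors the scheme; the course name, question and texts are made up.

```python
import hashlib

def short_id(course, question, text):
    # Mirrors the scheme above: hash the course, the question and
    # the first 10 characters of the text, then keep 8 hex characters
    combined = f"{course}-{question}-{text[:10]}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]

# Same course and question; the texts share their first 10 characters
a = short_id("demo-course", "How do I install X?", "Run the installer from the website.")
b = short_id("demo-course", "How do I install X?", "Run the installer with admin rights.")
# Same course and question; the text starts differently
c = short_id("demo-course", "How do I install X?", "Use the package manager instead.")

print(a == b)  # True: both texts start with "Run the in"
print(a == c)
```

The hash is deterministic, so rerunning the notebook yields the same ids for unchanged records.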

In [7]:
# Generate a unique id for each document
for doc in documents:
    doc['id'] = generate_document_id(doc)
# Examine the first record
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'id': 'c02e79ef'}

In [9]:
# Check for duplicates
from collections import defaultdict
hashes = defaultdict(list)
for doc in documents:
    doc_id = doc['id']
    hashes[doc_id].append(doc)
# Compare the number of unique ids with the number of documents
print(len(hashes), len(documents))
# Find the duplicate entries
for k, values in hashes.items():
    if len(values) > 1:
        print(k, len(values))

947 948
593f7569 2
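
The output shows 947 unique ids for 948 documents, with one id shared by two records. One possible way to handle this, sketched below on toy data, is to group the records by id, list the ids that occur more than once, and keep the first record of each group. The records and the keep-the-first policy are assumptions for illustration; whether the two colliding records are true duplicates should be checked by inspecting them.

```python
from collections import defaultdict

# Toy records standing in for `documents`; the ids are made up for illustration
docs = [
    {"id": "aaaa1111", "question": "Q1", "text": "Answer one"},
    {"id": "bbbb2222", "question": "Q2", "text": "Answer two"},
    {"id": "bbbb2222", "question": "Q2", "text": "Answer two, longer version"},
]

# Group the records by id
by_id = defaultdict(list)
for doc in docs:
    by_id[doc["id"]].append(doc)

# Ids that are shared by more than one record
duplicate_ids = [doc_id for doc_id, group in by_id.items() if len(group) > 1]

# Keep the first record seen for each id (one possible policy, not the only one)
unique_docs = [group[0] for group in by_id.values()]

print(len(docs), len(unique_docs), duplicate_ids)  # 3 2 ['bbbb2222']
```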
