# Ground Truth Data

Ground truth data is a set of queries paired with their expected ideal results, which can be used to evaluate retrieval methods.

In this case we are going to use a set of queries with a single relevant document each. This is a simplification, since a single question usually has several relevant answers. What we will do is:

for each record in FAQ:
    generate 5 questions

Therefore, for each of the generated questions we know that the record it came from is the relevant answer.

If our dataset has a size of $N$, we will have $5N$ queries in our set for which we know the expected answer.

In production systems it's usually a good idea to use feedback from users.
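
To make the purpose of this dataset concrete, here is a minimal sketch of how query/document pairs like these could be used to score a retriever. The metric choice (hit rate and mean reciprocal rank), the function names, the toy data, and `toy_search` are all assumptions made for illustration; they are not part of this notebook.

```python
def hit_rate(relevance):
    """Share of queries whose expected document appears anywhere in the results."""
    return sum(any(row) for row in relevance) / len(relevance)


def mean_reciprocal_rank(relevance):
    """Average of 1/rank of the expected document (0 when it is missing)."""
    total = 0.0
    for row in relevance:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance)


def evaluate(ground_truth, search):
    """Run every query through `search` and score the returned document ids."""
    relevance = []
    for record in ground_truth:
        returned_ids = search(record["question"])
        relevance.append([doc_id == record["document"] for doc_id in returned_ids])
    return {
        "hit_rate": hit_rate(relevance),
        "mrr": mean_reciprocal_rank(relevance),
    }


# Toy data: two queries, each with one expected document id (made up).
toy_ground_truth = [
    {"question": "When does the course begin?", "document": "doc-a"},
    {"question": "Where do I join the Slack channel?", "document": "doc-b"},
]


def toy_search(query):
    # Hypothetical retriever that always returns the same ranking.
    return ["doc-c", "doc-a", "doc-d"]


scores = evaluate(toy_ground_truth, toy_search)
print(scores)
```

With one relevant document per query, both metrics reduce to checking where (or whether) that single id shows up in the ranked results.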


In [1]:
import requests 

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [2]:
documents[1]

{'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
 'section': 'General course-related questions',
 'question': 'Course - What are the prerequisites for this course?',
 'course': 'data-engineering-zoomcamp'}

We need a way of assigning an id to each document so that we can identify it when we link the generated queries to it.

In [3]:
import hashlib

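def generate_document_id(doc):
    # Build the id from the course, the question and the first 10 characters
    # of the answer, so the same record always gets the same id.
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    # Keep only the first 8 hex characters as a short identifier.
    document_id = hash_hex[:8]
    return document_id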

In [4]:
for doc in documents:
    doc['id'] = generate_document_id(doc)

In [5]:
documents[3]

{'text': "You don't need it. You're accepted. You can also just start learning and submitting homework without registering. It is not checked against any registered list. Registration is just to gauge interest before the start date.",
 'section': 'General course-related questions',
 'question': 'Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?',
 'course': 'data-engineering-zoomcamp',
 'id': '0bbf41ec'}

In [6]:
from collections import defaultdict

hashes = defaultdict(list)

for doc in documents:
    doc_id = doc['id']
    hashes[doc_id].append(doc)

In [7]:
len(hashes), len(documents)

(947, 948)

In [8]:
for k, values in hashes.items():
    if len(values) > 1:
        print(k, len(values))

593f7569 2


In [9]:
hashes['593f7569']

[{'text': "They both do the same, it's just less typing from the script.\nAsked by Andrew Katoch, Added by Edidiong Esu",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '593f7569'},
 {'text': "They both do the same, it's just less typing from the script.",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '593f7569'}]
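
The two records share an id because the id is built from the course, the question, and only the first ten characters of the text, and these two differ only further into the text. The notebook keeps both records. If one record per id were preferred, a minimal sketch could look like the following; `deduplicate_by_id` and the sample data are hypothetical and not part of the original workflow.

```python
def deduplicate_by_id(docs):
    """Keep the first document seen for each id, preserving the input order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc["id"] in seen:
            continue
        seen.add(doc["id"])
        unique.append(doc)
    return unique


# Toy records: two share an id, as in the collision shown above.
sample_docs = [
    {"id": "593f7569", "text": "first copy"},
    {"id": "0bbf41ec", "text": "another record"},
    {"id": "593f7569", "text": "second copy"},
]

unique_docs = deduplicate_by_id(sample_docs)
print(len(sample_docs), len(unique_docs))
```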

In [10]:
import json
with open("documents-with-ids.json", "wt") as f_out:
    json.dump(documents, f_out, indent=2)

Now we will use an LLM to generate user questions.

In [11]:
prompt_template = """
You emulate a student taking our course. 
Formulate 5 user questions This student might ask based on a FAQ record. The record should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as fewer words as possible from the record.

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in a parsable JSON without using code blocs:

["question1", "question2", ..., "question5"]
""".strip()

In [42]:
import os
from dotenv import load_dotenv
import vertexai
from vertexai.generative_models import GenerativeModel, HarmCategory, HarmBlockThreshold

# Load environment variables from a .env file, if present.
load_dotenv()

PROJECT_ID = os.getenv('PROJECT_ID')
REGION = os.getenv('REGION')

vertexai.init(
    project=PROJECT_ID,
    location=REGION,
)

vertex_llm = GenerativeModel("gemini-1.5-flash-001")

generation_config = {
    "max_output_tokens": 8192,
    "temperature": 0.5,
    "top_p": 0.95,
}

safety_settings = {
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
}



In [36]:
doc = documents[2]
prompt = prompt_template.format(**doc)

In [37]:
print(prompt)

You emulate a student taking our course. 
Formulate 5 user questions This student might ask based on a FAQ record. The record should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as fewer words as possible from the record.

The record:

section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

Provide the output in a parsable JSON without using code blocs:

["question1", "question2", ..., "question5"]


In [38]:
def generate_questions(doc):
    prompt = prompt_template.format(**doc)
    
    responses = vertex_llm.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=True,
    )

    return "".join(response.text for response in responses)

In [39]:
json_response = generate_questions(doc)

In [40]:
json.loads(json_response)

['If I join the course after the start date, will I still be able to submit homework assignments?',
 "Is it possible to join the course after the start date even if I haven't registered?",
 'What are the deadlines for turning in the final projects if I join the course after the start date?',
 'If I join the course late, will I have access to all the course materials?',
 'Is there a penalty for joining the course after the start date?']

In [20]:
from tqdm.auto import tqdm
import time
vertex_results = {}

In [43]:

for doc in tqdm(documents):    
    doc_id = doc['id']
    if doc_id in vertex_results:
        continue
    questions = generate_questions(doc)
    vertex_results[doc_id] = questions
    time.sleep(10)

100%|██████████| 948/948 [17:17<00:00,  1.09s/it]
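
The loop above can be resumed after an interruption because it skips ids that are already in `vertex_results`, but a single failed request would still stop it. One possible way to make it more tolerant is a small retry helper with exponential backoff. This is only a sketch: `call_with_retries` is a hypothetical helper, and catching every `Exception` is an assumption made for brevity, since the notebook does not say which errors the API raises.

```python
import time


def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff if it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception:
            # Catching Exception is an assumption made for brevity; in practice
            # this would be narrowed to the errors the client actually raises.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a stand-in function that fails twice before succeeding.
attempts = {"count": 0}


def flaky_generate(doc):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return '["question1", "question2"]'


recorded_delays = []
output = call_with_retries(flaky_generate, {"id": "demo"}, sleep=recorded_delays.append)
print(output, recorded_delays)
```

Inside the generation loop, the call would become `questions = call_with_retries(generate_questions, doc)`.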


In [None]:
import pickle
with open("vertex-results.bin", "wb") as r_file:
    pickle.dump(vertex_results, r_file)

In [66]:
import pickle

with open("results.bin", "rb") as f_in:
    results = pickle.load(f_in)

In [69]:
import json 

parsed_results = {}

for doc_id, json_questions in results.items():
    parsed_results[doc_id] = json.loads(json_questions)

In [68]:
json_questions = [
r"How can I resolve the Docker error 'invalid mode: \Program Files\Git\var\lib\postgresql\data'?",
"What should I do if I encounter an invalid mode error in Docker on Windows?",
"What is the correct mounting path to use in Docker for PostgreSQL data on Windows?",
"Can you provide an example of a correct Docker mounting path for PostgreSQL data?",
r"How do I correct the mounting path error in Docker for \Program Files\Git\var\lib\postgresql\data'?"
]

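# Store the corrected questions for doc_id as a JSON string;
# json.dumps escapes the backslashes in the Windows path properly.
results[doc_id] = json.dumps(json_questions)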
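
A plain `json.loads` loop stops at the first response that is not valid JSON, for example one containing unescaped backslashes like the Windows path above. As a minimal sketch, the parsing could instead collect the failing ids so they can be reviewed together. The helper names `parse_questions` and `parse_all`, the code-fence stripping, and the toy responses are assumptions for illustration, not part of the original notebook.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker that a model may wrap around its answer


def parse_questions(raw):
    """Parse one raw LLM response into a list of question strings."""
    text = raw.strip()
    # Drop a surrounding code fence (with optional language tag), if present.
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = "\n".join(text.splitlines()[1:-1]).strip()
    questions = json.loads(text)
    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise ValueError("expected a JSON list of strings")
    return questions


def parse_all(raw_results):
    """Return (parsed, failed): parsed questions and error messages, keyed by doc id."""
    parsed, failed = {}, {}
    for doc_id, raw in raw_results.items():
        try:
            parsed[doc_id] = parse_questions(raw)
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            failed[doc_id] = str(exc)
    return parsed, failed


# Toy responses (made up): one valid, one wrapped in a code fence, one with a bad escape.
toy_results = {
    "ok": '["question1", "question2"]',
    "fenced": FENCE + 'json\n["question1"]\n' + FENCE,
    "bad-escape": r'["invalid mode: \Program Files"]',
}

toy_parsed, toy_failed = parse_all(toy_results)
print(sorted(toy_parsed), sorted(toy_failed))
```

The ids in the failed dictionary can then be fixed by hand, as done above, or regenerated.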

In [70]:
doc_index = {d['id']:d for d in documents}

final_results = []

for doc_id, questions in parsed_results.items():
    course = doc_index[doc_id]['course']
    for q in questions:
        final_results.append((q, course, doc_id))
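
Before saving, it can be worth checking that the flattened rows match the plan of five questions per record. The following is a small sketch of such a check over `(question, course, doc_id)` tuples; `check_ground_truth` and the toy rows are hypothetical and only illustrate the idea.

```python
from collections import Counter


def check_ground_truth(rows, known_ids, expected_per_doc=5):
    """Report ids that are unknown, missing, or have an unexpected question count."""
    counts = Counter(doc_id for _, _, doc_id in rows)
    return {
        "unknown_ids": sorted(set(counts) - set(known_ids)),
        "missing_ids": sorted(set(known_ids) - set(counts)),
        "unexpected_counts": {
            doc_id: n for doc_id, n in counts.items() if n != expected_per_doc
        },
    }


# Toy rows (made up), checked against two questions per document.
toy_rows = [
    ("q1", "course-a", "id1"),
    ("q2", "course-a", "id1"),
    ("q3", "course-a", "id2"),
    ("q4", "course-b", "id9"),
]

report = check_ground_truth(toy_rows, known_ids=["id1", "id2", "id3"], expected_per_doc=2)
print(report)
```

In the notebook, the equivalent call would pass `final_results` and the keys of `doc_index`.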

In [71]:
import pandas as pd

df = pd.DataFrame(final_results, columns=["question", "course", "document"])

In [72]:
df

      question                                           course                     document
0     When does the course begin?                        data-engineering-zoomcamp  c02e79ef
1     How can I get the course schedule?                 data-engineering-zoomcamp  c02e79ef
2     What is the link for course registration?          data-engineering-zoomcamp  c02e79ef
3     How can I receive course announcements?            data-engineering-zoomcamp  c02e79ef
4     Where do I join the Slack channel?                 data-engineering-zoomcamp  c02e79ef
...   ...                                                ...                        ...
4622  How should I destroy infrastructure created us...  mlops-zoomcamp             886d1617
4623  What is the first step to destroy AWS infrastr...  mlops-zoomcamp             886d1617
4624  Can I destroy infrastructure created with GitH...  mlops-zoomcamp             886d1617
4625  What command initializes Terraform with specif...  mlops-zoomcamp             886d1617


In [74]:
df.to_csv('groud-truth-data.csv', index=False)