# Evaluation metrics for retrieval
* https://youtu.be/APMrUnC_dy0?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R
* https://youtu.be/bpxi6fKcyLw?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R
* Evaluation of text retrieval techniques for RAG: https://youtu.be/fdIV4xCsp0c?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R
* Evaluation metrics: https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/evaluation-metrics.md

In [32]:
# Load Python libraries
import requests
import hashlib
import json
import os
from openai import OpenAI
import pickle

In [3]:
docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
docs_raw = docs_response.json()

In [4]:
documents = []

for course in docs_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [6]:
documents[1]

{'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
 'section': 'General course-related questions',
 'question': 'Course - What are the prerequisites for this course?',
 'course': 'data-engineering-zoomcamp'}

In [87]:
# Create IDs for the documents
# https://youtu.be/bpxi6fKcyLw?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=422
def generate_document_id(doc):
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    doc_id = hash_hex[:8]
    return doc_id

In [88]:
generate_document_id(documents[1])

'1f6520ca'

In [89]:
# Generate IDs for all records in documents
for doc in documents:
    doc['id'] = generate_document_id(doc)

In [90]:
documents[1:3]

[{'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
  'section': 'General course-related questions',
  'question': 'Course - What are the prerequisites for this course?',
  'course': 'data-engineering-zoomcamp',
  'id': '1f6520ca'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'}]
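
Because the ID is built from the course, the question, and only the first 10 characters of the text, two records that agree on all three get the same ID. The following is a minimal sketch of how such collisions could be detected. `find_id_collisions` and `sample_docs` are hypothetical names and data introduced for illustration only; they are not part of the original notebook.

```python
import hashlib
from collections import defaultdict


def generate_document_id(doc):
    # Same scheme as in the notebook: hash course, question and first 10 chars of text
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]


def find_id_collisions(docs):
    """Group documents by generated ID and keep only IDs shared by 2+ documents."""
    by_id = defaultdict(list)
    for doc in docs:
        by_id[generate_document_id(doc)].append(doc)
    return {doc_id: group for doc_id, group in by_id.items() if len(group) > 1}


# Made-up records: the first two differ only after the 10th character of 'text'
sample_docs = [
    {'course': 'c1', 'question': 'q1', 'text': 'same first ten chars, variant A'},
    {'course': 'c1', 'question': 'q1', 'text': 'same first ten chars, variant B'},
    {'course': 'c1', 'question': 'q2', 'text': 'a different record'},
]

collisions = find_id_collisions(sample_docs)
for doc_id, group in collisions.items():
    print(doc_id, len(group))
```

If this reports collisions on the real data, one option is to inspect the affected records and decide whether they are true duplicates or whether more of the text should go into the hash.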

In [91]:
# Save the documents with IDs as a new JSON file
with open('docs_with_ids.json', 'wt') as fout:
    json.dump(documents, fout, indent=2)

In [92]:
!ls -la

total 1792
drwxrwxrwx+ 3 codespace codespace   4096 Jul 15 20:51 .
drwxrwxrwx+ 7 codespace root        4096 Jul 15 13:10 ..
drwxrwxrwx+ 2 codespace codespace   4096 Jul 15 18:16 .ipynb_checkpoints
-rw-rw-rw-  1 codespace codespace 693170 Jul 15 18:33 ElasticSearch_example.ipynb
-rw-rw-rw-  1 codespace codespace  19062 Jul 15 20:51 Retrieval_Eval_Metrics.ipynb
-rw-rw-rw-  1 codespace codespace 699257 Jul 15 20:51 docs_with_ids.json
-rw-rw-rw-  1 codespace codespace 403252 Jul 15 20:29 results.bin


In [93]:
!head docs_with_ids.json

[
  {
    "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "course": "data-engineering-zoomcamp",
    "id": "c02e79ef"
  },
  {
    "text": "GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites",


In [17]:
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as fewer words as possible from the record. 

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

In [None]:
# https://youtu.be/bpxi6fKcyLw?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=964

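# Set OPENAI_API_KEY (replace the 'API_KEY' placeholder with your own key;
# avoid committing a real key to the notebook)
os.environ['OPENAI_API_KEY'] = 'API_KEY'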

# OpenAI API client (reads the key from OPENAI_API_KEY)
client = OpenAI()

In [94]:
# Create a prompt
doc = documents[2]
doc

{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 'section': 'General course-related questions',
 'question': 'Course - Can I still join the course after the start date?',
 'course': 'data-engineering-zoomcamp',
 'id': '7842b56a'}

In [95]:
prompt = prompt_template.format(**doc)
print(prompt)

You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as fewer words as possible from the record. 

The record:

section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]


In [29]:
def generate_questions(doc):
    prompt = prompt_template.format(**doc)

    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": prompt}]
    )

    json_response = response.choices[0].message.content
    return json_response

In [None]:
# https://youtu.be/bpxi6fKcyLw?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=1129
# Note: generating questions for all documents costs roughly 4 USD in API calls
results = {}
for doc in documents: 
    doc_id = doc['id']
    if doc_id in results:
        continue

    questions = generate_questions(doc)
    results[doc_id] = questions
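
The loop above skips IDs that are already in `results`, so it can be resumed, but only if `results` survives between sessions. Below is a minimal sketch of saving and restoring that dictionary with `pickle`, the same format used for `results.bin`. The helper names `save_results` and `load_results` and the checkpoint file name are assumptions made for this example, not part of the original notebook.

```python
import os
import pickle
import tempfile


def save_results(results, path):
    """Write the results dictionary to disk in pickle format."""
    with open(path, 'wb') as f_out:
        pickle.dump(results, f_out)


def load_results(path):
    """Return previously saved results, or an empty dict if nothing was saved yet."""
    if not os.path.exists(path):
        return {}
    with open(path, 'rb') as f_in:
        return pickle.load(f_in)


# Hypothetical checkpoint location; a temp directory keeps the sketch side-effect free
checkpoint_path = os.path.join(tempfile.mkdtemp(), 'results_checkpoint.bin')

results = load_results(checkpoint_path)  # empty on the first run
results['1f6520ca'] = '["Where can I find the prerequisites for this course?"]'
save_results(results, checkpoint_path)

restored = load_results(checkpoint_path)
print(restored == results)
```

Calling `save_results` every few documents inside the generation loop would limit how much paid work is lost if the session is interrupted.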

In [52]:
# Load the pre-generated results instead of calling the API
# File from: https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/results.bin
with open('results.bin', 'rb') as f_in:
    results = pickle.load(f_in)

In [97]:
results['1f6520ca']

'["Where can I find the prerequisites for this course?", "How do I check the prerequisites for this course?", "Where are the course prerequisites listed?", "What are the requirements for joining this course?", "Where is the list of prerequisites for the course?"]'

In [98]:
# https://youtu.be/bpxi6fKcyLw?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=1332
parsed_results = {}
for docid, json_questions in results.items():
    parsed_results[docid] = json.loads(json_questions)
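
The loop above stops at the first record whose string is not valid JSON, which leaves `docid` pointing at the failing record. As an alternative, here is a minimal sketch that parses everything it can and collects the failures for later repair. `parse_all` and the `sample` data are hypothetical and only illustrate the idea.

```python
import json


def parse_all(results):
    """Parse every JSON string; return (parsed, failed) instead of stopping on errors."""
    parsed, failed = {}, {}
    for docid, json_questions in results.items():
        try:
            parsed[docid] = json.loads(json_questions)
        except json.JSONDecodeError as err:
            # Keep the error message so the record can be fixed by hand
            failed[docid] = str(err)
    return parsed, failed


# Made-up input: the second value contains backslashes that are not valid JSON escapes
sample = {
    'good0001': '["Question one?", "Question two?"]',
    'bad00001': '["Path is \\Program Files\\Git"]',
}

parsed, failed = parse_all(sample)
print(list(parsed))
print(list(failed))
```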

In [61]:
# Fix the one record that failed to parse: its questions contain a Windows path
# with backslashes, which apparently were not escaped as valid JSON.
# Rewrite the questions by hand, using raw strings to keep the backslashes literal.
json_questions = [
    r"How can I resolve the Docker error 'invalid mode: \Program Files\Git\var\lib\postgresql\data'?",
    "What should I do if I encounter an invalid mode error in Docker on Windows?",
    "What is the correct mounting path to use in Docker for PostgreSQL data on Windows?",
    "Can you provide an example of a correct Docker mounting path for PostgreSQL data?",
    r"How do I correct the mounting path error in Docker for '\Program Files\Git\var\lib\postgresql\data'?"
]

In [63]:
docid

'58c9f99f'

In [64]:
# Store the corrected questions back as a properly escaped JSON string
results[docid] = json.dumps(json_questions)

In [66]:
# After the fix, re-run the parsing loop above:
#for docid, json_questions in results.items():
#    parsed_results[docid] = json.loads(json_questions)

In [99]:
parsed_results['c02e79ef']

['When does the course begin?',
 'How can I get the course schedule?',
 'What is the link for course registration?',
 'How can I receive course announcements?',
 'Where do I join the Slack channel?']

In [107]:
# https://youtu.be/bpxi6fKcyLw?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=1469
doc_index = {d['id']: d for d in documents}
doc_index['c02e79ef']

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'id': 'c02e79ef'}

In [110]:
final_results = []
for docid, questions in parsed_results.items():
    course = doc_index[docid]['course']
    for q in questions:
        final_results.append((q, course, docid))
final_results[:4]

[('When does the course begin?', 'data-engineering-zoomcamp', 'c02e79ef'),
 ('How can I get the course schedule?',
  'data-engineering-zoomcamp',
  'c02e79ef'),
 ('What is the link for course registration?',
  'data-engineering-zoomcamp',
  'c02e79ef'),
 ('How can I receive course announcements?',
  'data-engineering-zoomcamp',
  'c02e79ef')]

In [109]:
import pandas as pd

In [113]:
# Ground truth dataset
df = pd.DataFrame(final_results, columns=['question', 'course', 'document'])
df.head()

|   | question | course | document |
|---|----------|--------|----------|
| 0 | When does the course begin? | data-engineering-zoomcamp | c02e79ef |
| 1 | How can I get the course schedule? | data-engineering-zoomcamp | c02e79ef |
| 2 | What is the link for course registration? | data-engineering-zoomcamp | c02e79ef |
| 3 | How can I receive course announcements? | data-engineering-zoomcamp | c02e79ef |
| 4 | Where do I join the Slack channel? | data-engineering-zoomcamp | c02e79ef |


In [114]:
df.to_csv('ground-truth-data.csv', index=False)

In [116]:
!head ground-truth-data.csv

question,course,document
When does the course begin?,data-engineering-zoomcamp,c02e79ef
How can I get the course schedule?,data-engineering-zoomcamp,c02e79ef
What is the link for course registration?,data-engineering-zoomcamp,c02e79ef
How can I receive course announcements?,data-engineering-zoomcamp,c02e79ef
Where do I join the Slack channel?,data-engineering-zoomcamp,c02e79ef
Where can I find the prerequisites for this course?,data-engineering-zoomcamp,1f6520ca
How do I check the prerequisites for this course?,data-engineering-zoomcamp,1f6520ca
Where are the course prerequisites listed?,data-engineering-zoomcamp,1f6520ca
What are the requirements for joining this course?,data-engineering-zoomcamp,1f6520ca
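
Each row of the ground truth file pairs a generated question with the ID of the document a retrieval system is expected to return. As a rough sketch of how such pairs are commonly scored, the code below computes hit rate and mean reciprocal rank (MRR) from ranked lists of retrieved IDs. The function names and the retrieved lists are made up for illustration; a real evaluation would take the rankings from the search system under test.

```python
def relevance_row(expected_id, retrieved_ids):
    """Mark each retrieved ID as relevant (True) if it is the expected document."""
    return [doc_id == expected_id for doc_id in retrieved_ids]


def hit_rate(relevance_total):
    """Share of queries where the expected document appears anywhere in the results."""
    hits = sum(1 for row in relevance_total if any(row))
    return hits / len(relevance_total)


def mrr(relevance_total):
    """Mean of 1/rank of the expected document (0 when it is not retrieved)."""
    total = 0.0
    for row in relevance_total:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)


# Illustrative rankings only: expected document at rank 1, at rank 3, and missing
rows = [
    relevance_row('c02e79ef', ['c02e79ef', '1f6520ca', '7842b56a']),
    relevance_row('1f6520ca', ['c02e79ef', '7842b56a', '1f6520ca']),
    relevance_row('7842b56a', ['c02e79ef', '1f6520ca', '2ed9b986']),
]

print(hit_rate(rows))
print(mrr(rows))
```

Hit rate only asks whether the expected document was found, while MRR also rewards placing it near the top, so the two numbers are usually reported together.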


## Evaluation of text retrieval techniques
https://youtu.be/fdIV4xCsp0c?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R

In [118]:
!ls

ElasticSearch_example.ipynb   docs_with_ids.json     results.bin
Retrieval_Eval_Metrics.ipynb  ground-truth-data.csv


In [119]:
import json

In [120]:
with open('docs_with_ids.json', 'rt') as fin:
    docs = json.load(fin)

In [123]:
docs[5]

{'text': "There are 3 Zoom Camps in a year, as of 2024. However, they are for separate courses:\nData-Engineering (Jan - Apr)\nMLOps (May - Aug)\nMachine Learning (Sep - Jan)\nThere's only one Data-Engineering Zoomcamp “live” cohort per year, for the certification. Same as for the other Zoomcamps.\nThey follow pretty much the same schedule for each cohort per zoomcamp. For Data-Engineering it is (generally) from Jan-Apr of the year. If you’re not interested in the Certificate, you can take any zoom camps at any time, at your own pace, out of sync with any “live” cohort.",
 'section': 'General course-related questions',
 'question': 'Course - how many Zoomcamps in a year?',
 'course': 'data-engineering-zoomcamp',
 'id': '2ed9b986'}

In [None]:
# Start Elasticsearch in Docker (run this in a terminal, not in a Python cell)
docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3

In [124]:
from elasticsearch import Elasticsearch

In [125]:
es_client = Elasticsearch('http://localhost:9200')