# Evaluation metrics for retrieval
* Evaluation Metrics for Retrieval: https://youtu.be/APMrUnC_dy0?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R
* Ground Truth Dataset Generation for Retrieval Evaluation: https://youtu.be/bpxi6fKcyLw?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R
* Evaluation of text retrieval techniques for RAG: https://youtu.be/fdIV4xCsp0c?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R
* Evaluating Vector Retrieval: https://www.youtube.com/watch?v=VRprIm9-VV8&list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&index=23
* Evaluation metrics: https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/evaluation-metrics.md

In [32]:
# Load Python libraries
import requests
import hashlib
import json
import os
from openai import OpenAI
import pickle

In [3]:
docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
docs_raw = docs_response.json()

In [4]:
documents = []

for course in docs_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [6]:
documents[1]

{'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
 'section': 'General course-related questions',
 'question': 'Course - What are the prerequisites for this course?',
 'course': 'data-engineering-zoomcamp'}

In [87]:
# Create IDs for the documents
# https://youtu.be/bpxi6fKcyLw?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=422
def generate_document_id(doc):
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    doc_id = hash_hex[:8]
    return doc_id

In [88]:
generate_document_id(documents[1])

'1f6520ca'

In [89]:
# Generate IDs for all records in documents
for doc in documents:
    doc['id'] = generate_document_id(doc)

In [90]:
documents[1:3]

[{'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
  'section': 'General course-related questions',
  'question': 'Course - What are the prerequisites for this course?',
  'course': 'data-engineering-zoomcamp',
  'id': '1f6520ca'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'}]
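The ID hashes only the course, the question, and the first 10 characters of the text, so two records that agree on all three receive the same ID. A minimal sketch of how such collisions could be counted before using the IDs as ground-truth keys; `find_id_collisions` and the toy records are hypothetical and not part of the dataset:

```python
import hashlib
from collections import defaultdict


def generate_document_id(doc):
    # Same scheme as above: course + question + first 10 characters of the text
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]


def find_id_collisions(docs):
    """Group documents by generated ID and keep the IDs shared by several records."""
    by_id = defaultdict(list)
    for doc in docs:
        by_id[generate_document_id(doc)].append(doc)
    return {doc_id: group for doc_id, group in by_id.items() if len(group) > 1}


# Hypothetical toy records: the first two differ only after the 10th character
toy_docs = [
    {'course': 'c1', 'question': 'q1', 'text': 'Same start, first ending'},
    {'course': 'c1', 'question': 'q1', 'text': 'Same start, second ending'},
    {'course': 'c1', 'question': 'q2', 'text': 'Another answer'},
]

collisions = find_id_collisions(toy_docs)
print(len(collisions))
```

If collisions show up on the real data, including more of the text in the hashed string is one possible remedy.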

In [91]:
# Save the documents with IDs as a new JSON file
with open('docs_with_ids.json', 'wt') as fout:
    json.dump(documents, fout, indent=2)

In [92]:
!ls -la

total 1792
drwxrwxrwx+ 3 codespace codespace   4096 Jul 15 20:51 .
drwxrwxrwx+ 7 codespace root        4096 Jul 15 13:10 ..
drwxrwxrwx+ 2 codespace codespace   4096 Jul 15 18:16 .ipynb_checkpoints
-rw-rw-rw-  1 codespace codespace 693170 Jul 15 18:33 ElasticSearch_example.ipynb
-rw-rw-rw-  1 codespace codespace  19062 Jul 15 20:51 Retrieval_Eval_Metrics.ipynb
-rw-rw-rw-  1 codespace codespace 699257 Jul 15 20:51 docs_with_ids.json
-rw-rw-rw-  1 codespace codespace 403252 Jul 15 20:29 results.bin


In [93]:
!head docs_with_ids.json

[
  {
    "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "course": "data-engineering-zoomcamp",
    "id": "c02e79ef"
  },
  {
    "text": "GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites",


In [17]:
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as fewer words as possible from the record. 

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

In [None]:
# https://youtu.be/bpxi6fKcyLw?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=964

# Set OPENAI_API_KEY (replace the 'API_KEY' placeholder with your own key)
os.environ['OPENAI_API_KEY'] = 'API_KEY'

# OpenAI client
client = OpenAI()

In [94]:
# Create a prompt
doc = documents[2]
doc

{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 'section': 'General course-related questions',
 'question': 'Course - Can I still join the course after the start date?',
 'course': 'data-engineering-zoomcamp',
 'id': '7842b56a'}

In [95]:
prompt = prompt_template.format(**doc)
print(prompt)

You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as fewer words as possible from the record. 

The record:

section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]


In [29]:
def generate_questions(doc):
    prompt = prompt_template.format(**doc)

    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": prompt}]
    )

    json_response = response.choices[0].message.content
    return json_response

In [None]:
# https://youtu.be/bpxi6fKcyLw?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=1129
# Running this loop over all documents costs about 4 USD in API calls
results = {}
for doc in documents: 
    doc_id = doc['id']
    if doc_id in results:
        continue

    questions = generate_questions(doc)
    results[doc_id] = questions
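Since the loop above is paid and can be interrupted, it may help to persist `results` after every call so that a restart skips what is already done. A minimal sketch under stated assumptions: `load_checkpoint`, `save_checkpoint`, and `fake_generate_questions` are hypothetical helpers, and the last one stands in for the real API call:

```python
import os
import pickle
import tempfile


def load_checkpoint(path):
    """Return previously saved results, or an empty dict if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, 'rb') as f_in:
            return pickle.load(f_in)
    return {}


def save_checkpoint(results, path):
    """Write to a temporary file first, then rename it over the checkpoint."""
    tmp_path = path + '.tmp'
    with open(tmp_path, 'wb') as f_out:
        pickle.dump(results, f_out)
    os.replace(tmp_path, path)


def fake_generate_questions(doc):
    # Hypothetical stand-in for the paid API call
    return '["placeholder question"]'


checkpoint_path = os.path.join(tempfile.mkdtemp(), 'results.bin')
toy_docs = [{'id': 'a1'}, {'id': 'b2'}, {'id': 'a1'}]

results = load_checkpoint(checkpoint_path)
for doc in toy_docs:
    if doc['id'] in results:
        continue  # already generated, skip the call
    results[doc['id']] = fake_generate_questions(doc)
    save_checkpoint(results, checkpoint_path)

restored = load_checkpoint(checkpoint_path)
```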

In [52]:
# Load the already processed results instead of calling the API
# file from: https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/results.bin
with open('results.bin', 'rb') as f_in:
    results = pickle.load(f_in)

In [97]:
results['1f6520ca']

'["Where can I find the prerequisites for this course?", "How do I check the prerequisites for this course?", "Where are the course prerequisites listed?", "What are the requirements for joining this course?", "Where is the list of prerequisites for the course?"]'

In [98]:
# https://youtu.be/bpxi6fKcyLw?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=1332
parsed_results = {}
for docid, json_questions in results.items():
    parsed_results[docid] = json.loads(json_questions)

In [61]:
# Fix one record whose generated JSON could not be parsed (backslashes in a Windows path):
json_questions = [
r"How can I resolve the Docker error 'invalid mode: \Program Files\Git\var\lib\postgresql\data'?",
"What should I do if I encounter an invalid mode error in Docker on Windows?",
"What is the correct mounting path to use in Docker for PostgreSQL data on Windows?",
"Can you provide an example of a correct Docker mounting path for PostgreSQL data?",
r"How do I correct the mounting path error in Docker for \Program Files\Git\var\lib\postgresql\data'?"
]

In [63]:
docid

'58c9f99f'

In [64]:
# Store the corrected questions for that record as valid JSON:
results[docid] = json.dumps(json_questions)

In [66]:
# Re-run the parsing loop after the fix:
#for docid, json_questions in results.items():
#    parsed_results[docid] = json.loads(json_questions)
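The parsing loop stops at the first string that `json.loads` cannot decode, which is how the broken record was found. A minimal sketch of an alternative that collects every failing ID in one pass; `parse_all` and the toy input are hypothetical:

```python
import json


def parse_all(raw_results):
    """Parse each JSON string; return (parsed, failed) instead of stopping at the first error."""
    parsed, failed = {}, {}
    for doc_id, raw in raw_results.items():
        try:
            parsed[doc_id] = json.loads(raw)
        except json.JSONDecodeError as e:
            failed[doc_id] = str(e)
    return parsed, failed


# Hypothetical toy input: the second string contains an invalid "\P" escape
toy_results = {
    'good0001': '["How do I join?", "Where is the schedule?"]',
    'bad00002': '["What does \\Program Files mean?"]',
}

parsed, failed = parse_all(toy_results)
print(sorted(failed))
```

The failing IDs can then be repaired by hand, as done above for `58c9f99f`, or regenerated.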

In [99]:
parsed_results['c02e79ef']

['When does the course begin?',
 'How can I get the course schedule?',
 'What is the link for course registration?',
 'How can I receive course announcements?',
 'Where do I join the Slack channel?']

In [107]:
# https://youtu.be/bpxi6fKcyLw?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=1469
doc_index = {d['id']: d for d in documents}
doc_index['c02e79ef']

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'id': 'c02e79ef'}

In [110]:
final_results = []
for docid, questions in parsed_results.items():
    course = doc_index[docid]['course']
    for q in questions:
        final_results.append((q, course, docid))
final_results[:4]

[('When does the course begin?', 'data-engineering-zoomcamp', 'c02e79ef'),
 ('How can I get the course schedule?',
  'data-engineering-zoomcamp',
  'c02e79ef'),
 ('What is the link for course registration?',
  'data-engineering-zoomcamp',
  'c02e79ef'),
 ('How can I receive course announcements?',
  'data-engineering-zoomcamp',
  'c02e79ef')]

In [109]:
import pandas as pd

In [113]:
# Ground truth dataset
df = pd.DataFrame(final_results, columns=['question', 'course', 'document'])
df.head()

Unnamed: 0,question,course,document
0,When does the course begin?,data-engineering-zoomcamp,c02e79ef
1,How can I get the course schedule?,data-engineering-zoomcamp,c02e79ef
2,What is the link for course registration?,data-engineering-zoomcamp,c02e79ef
3,How can I receive course announcements?,data-engineering-zoomcamp,c02e79ef
4,Where do I join the Slack channel?,data-engineering-zoomcamp,c02e79ef


In [114]:
df.to_csv('ground-truth-data.csv', index=False)

In [116]:
!head ground-truth-data.csv

question,course,document
When does the course begin?,data-engineering-zoomcamp,c02e79ef
How can I get the course schedule?,data-engineering-zoomcamp,c02e79ef
What is the link for course registration?,data-engineering-zoomcamp,c02e79ef
How can I receive course announcements?,data-engineering-zoomcamp,c02e79ef
Where do I join the Slack channel?,data-engineering-zoomcamp,c02e79ef
Where can I find the prerequisites for this course?,data-engineering-zoomcamp,1f6520ca
How do I check the prerequisites for this course?,data-engineering-zoomcamp,1f6520ca
Where are the course prerequisites listed?,data-engineering-zoomcamp,1f6520ca
What are the requirements for joining this course?,data-engineering-zoomcamp,1f6520ca


## Evaluation of text retrieval techniques
Metrics: hit rate and Mean Reciprocal Rank (MRR)<br/>
https://youtu.be/fdIV4xCsp0c?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R

In [8]:
import json

In [9]:
with open('docs_with_ids.json', 'rt') as fin:
    docs = json.load(fin)

In [10]:
docs[5]

{'text': "There are 3 Zoom Camps in a year, as of 2024. However, they are for separate courses:\nData-Engineering (Jan - Apr)\nMLOps (May - Aug)\nMachine Learning (Sep - Jan)\nThere's only one Data-Engineering Zoomcamp “live” cohort per year, for the certification. Same as for the other Zoomcamps.\nThey follow pretty much the same schedule for each cohort per zoomcamp. For Data-Engineering it is (generally) from Jan-Apr of the year. If you’re not interested in the Certificate, you can take any zoom camps at any time, at your own pace, out of sync with any “live” cohort.",
 'section': 'General course-related questions',
 'question': 'Course - how many Zoomcamps in a year?',
 'course': 'data-engineering-zoomcamp',
 'id': '2ed9b986'}

In [None]:
# Start Elasticsearch in Docker (run this in a terminal, not in a notebook cell)
docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3

In [1]:
from elasticsearch import Elasticsearch

In [2]:
es_client = Elasticsearch('http://localhost:9200')

In [3]:
es_client.info()

ObjectApiResponse({'name': 'ac55416b00ae', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'sLwCY69hRUqa5iU965wp5A', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [4]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"}
        }
    }
}

In [6]:
# Delete the index if it exists, then create it
index_name = "course-questions"
es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [11]:
# Index dataset
for doc in docs:
    try:
        es_client.index(index=index_name, document=doc)
    except Exception as e:
        print(e)

In [12]:
docs[5]

{'text': "There are 3 Zoom Camps in a year, as of 2024. However, they are for separate courses:\nData-Engineering (Jan - Apr)\nMLOps (May - Aug)\nMachine Learning (Sep - Jan)\nThere's only one Data-Engineering Zoomcamp “live” cohort per year, for the certification. Same as for the other Zoomcamps.\nThey follow pretty much the same schedule for each cohort per zoomcamp. For Data-Engineering it is (generally) from Jan-Apr of the year. If you’re not interested in the Certificate, you can take any zoom camps at any time, at your own pace, out of sync with any “live” cohort.",
 'section': 'General course-related questions',
 'question': 'Course - how many Zoomcamps in a year?',
 'course': 'data-engineering-zoomcamp',
 'id': '2ed9b986'}

In [13]:
def elastic_search(query, course):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [14]:
elastic_search(
    query="I just discovered the course. Can I still join?",
    course="data-engineering-zoomcamp"
)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp',
  'id': '63394d91'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it fin

In [77]:
# https://youtu.be/fdIV4xCsp0c?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=223
import pandas as pd

In [17]:
df_ground_truth = pd.read_csv('ground-truth-data.csv')

In [54]:
df_ground_truth.head()

Unnamed: 0,question,course,document
0,When does the course begin?,data-engineering-zoomcamp,c02e79ef
1,How can I get the course schedule?,data-engineering-zoomcamp,c02e79ef
2,What is the link for course registration?,data-engineering-zoomcamp,c02e79ef
3,How can I receive course announcements?,data-engineering-zoomcamp,c02e79ef
4,Where do I join the Slack channel?,data-engineering-zoomcamp,c02e79ef


In [18]:
ground_truth = df_ground_truth.to_dict(orient='records')

In [56]:
ground_truth[0:2]

[{'question': 'When does the course begin?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'How can I get the course schedule?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'}]

In [19]:
# https://youtu.be/fdIV4xCsp0c?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=257
from tqdm.auto import tqdm
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = elastic_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

100%|███████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [00:13<00:00, 346.25it/s]


In [27]:
relevance_total[0:2]

[[True, False, False, False, False], [False, False, False, False, False]]

In [78]:
# https://youtu.be/fdIV4xCsp0c?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=609
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

In [79]:
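# Mean Reciprocal Rank
# https://youtu.be/fdIV4xCsp0c?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=813
# Note: every relevant hit adds 1 / rank. This matches the standard MRR
# (first relevant hit only) as long as each result list contains the
# expected document at most once.
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank]:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)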

In [67]:
# Elasticsearch (text search)
hit_rate(relevance_total), mrr(relevance_total)

(0.7395720769397017, 0.6029788920106625)
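To make the two metrics concrete, here is a small worked example on hypothetical relevance lists. The functions are compact re-statements of `hit_rate` and `mrr` from above so that the block runs on its own:

```python
def hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result
    return sum(1 for line in relevance_total if True in line) / len(relevance_total)


def mrr(relevance_total):
    # Same summation as the function above
    total_score = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total_score += 1 / (rank + 1)
    return total_score / len(relevance_total)


# Hypothetical relevance lists for four queries, top-4 results each
toy = [
    [True, False, False, False],   # rank 1 -> 1
    [False, True, False, False],   # rank 2 -> 1/2
    [False, False, False, True],   # rank 4 -> 1/4
    [False, False, False, False],  # miss   -> 0
]

print(hit_rate(toy))  # 0.75
print(mrr(toy))       # 0.4375
```

Hit rate only asks whether the expected document appears at all, while MRR also rewards placing it near the top: (1 + 1/2 + 1/4 + 0) / 4 = 0.4375.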

In [60]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-07-16 16:36:58--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py’


2024-07-16 16:36:58 (25.7 MB/s) - ‘minsearch.py’ saved [3832/3832]



In [63]:
# https://youtu.be/fdIV4xCsp0c?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=991
import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(docs)

<minsearch.Index at 0x774ef9074430>

In [64]:
def minsearch_search(query, course):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

In [65]:
relevance_total_minis = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total_minis.append(relevance)

100%|███████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [00:17<00:00, 271.16it/s]


In [73]:
from IPython.display import Markdown, display

eval_total = f"""**minsearch**<br/>Hit-rate: {hit_rate(relevance_total_minis):.4f}, MRR: {mrr(relevance_total_minis):.4f}\n
**Elasticsearch**<br/>Hit-rate: {hit_rate(relevance_total):.4f}, MRR: {mrr(relevance_total):.4f}"""

display(Markdown(eval_total))

**minsearch**<br/>Hit-rate: 0.7722, MRR: 0.6615

**Elasticsearch**<br/>Hit-rate: 0.7396, MRR: 0.6030

In [80]:
# https://youtu.be/fdIV4xCsp0c?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=1172
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

In [75]:
evaluate(ground_truth, lambda q: elastic_search(q['question'], q['course']))

100%|███████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [00:09<00:00, 463.49it/s]


{'hit_rate': 0.7395720769397017, 'mrr': 0.6029788920106625}

In [76]:
evaluate(ground_truth, lambda q: minsearch_search(q['question'], q['course']))

100%|███████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [00:16<00:00, 272.44it/s]


{'hit_rate': 0.7722066133563864, 'mrr': 0.661454506159499}

## Evaluating vector retrieval
* https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R
* https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/evaluate-vector.ipynb

In [1]:
# Import Q&A JSON file
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=59
import json

with open('docs_with_ids.json', 'rt') as fin:
    documents = json.load(fin)

In [2]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=91
from sentence_transformers import SentenceTransformer # pip install sentence_transformers==2.7.0



In [3]:
# https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
model_name = 'multi-qa-MiniLM-L6-cos-v1'
model = SentenceTransformer(model_name)

In [4]:
v = model.encode('I just discovered the course. Can I still join?')

In [5]:
len(v)

384

In [6]:
v.dot(v)

1.0
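A dot product of 1.0 with itself suggests the model returns unit-length vectors. For such vectors, cosine similarity reduces to a plain dot product. A minimal sketch in pure Python, using hypothetical 3-dimensional vectors in place of the 384-dimensional embeddings:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(a):
    # Scale the vector to unit length
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical vectors standing in for model embeddings
a = normalize([3.0, 4.0, 0.0])
b = normalize([4.0, 3.0, 0.0])

print(round(dot(a, a), 6))                    # length squared of a unit vector
print(math.isclose(dot(a, b), cosine(a, b)))  # dot product equals cosine here
```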

In [7]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=281
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200') 

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
            "question_vector": {
                "type": "dense_vector",
                "dims": 384, # len(v)
                "index": True,
                "similarity": "cosine"
            },
            "text_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
            "question_text_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
        }
    }
}

index_name = "course-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [8]:
from tqdm.auto import tqdm

In [9]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=347
for doc in tqdm(documents):
    question = doc['question']
    text = doc['text']
    qt = question + ' ' + text

    doc['question_vector'] = model.encode(question)
    doc['text_vector'] = model.encode(text)
    doc['question_text_vector'] = model.encode(qt)

100%|█████████████████████████████████████████████████████████████████████████████████████████████| 948/948 [01:54<00:00,  8.28it/s]


In [10]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=557
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|█████████████████████████████████████████████████████████████████████████████████████████████| 948/948 [00:29<00:00, 32.67it/s]
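Indexing one document per request took about half a minute here. The Python client also ships `elasticsearch.helpers.bulk`, which accepts an iterable of action dicts. A minimal sketch of building those actions; `to_bulk_actions` is a hypothetical helper, the toy document is made up, and the final call is left as a comment because it needs a running cluster:

```python
def to_bulk_actions(docs, index_name):
    """Yield one action per document in the dict shape used by elasticsearch.helpers.bulk."""
    for doc in docs:
        # Convert array-like vectors (e.g. NumPy arrays) to plain lists
        source = {
            key: (value.tolist() if hasattr(value, 'tolist') else value)
            for key, value in doc.items()
        }
        yield {'_index': index_name, '_source': source}


# Hypothetical toy document with a short vector
toy_docs = [{'id': 'abc12345', 'question': 'q', 'question_vector': [0.1, 0.2]}]
actions = list(to_bulk_actions(toy_docs, 'course-questions'))
print(actions[0]['_index'])

# With a running cluster, the actions could then be sent in one call:
# from elasticsearch import helpers
# helpers.bulk(es_client, to_bulk_actions(documents, index_name))
```

Whether this is noticeably faster depends on the setup; it was not measured in this notebook.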


In [11]:
query = 'I just discovered the course. Can I still join it?'

In [12]:
v_q = model.encode(query)

In [13]:
v_q.shape

(384,)

In [23]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=588
search_query = {
    "field": "question_vector",
    "query_vector": v_q,
    "k": 5,
    "num_candidates": 10000
}

es_results = es_client.search(
    index=index_name,
    knn=search_query,
    source=["text", "section", "question", "course", "id"]
)

In [24]:
es_results

ObjectApiResponse({'took': 25, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 5, 'relation': 'eq'}, 'max_score': 0.9597523, 'hits': [{'_index': 'course-questions', '_id': 'YL5fvZABkUlDw6EOX6Ir', '_score': 0.9597523, '_source': {'question': 'The course has already started. Can I still join it?', 'course': 'machine-learning-zoomcamp', 'section': 'General course-related questions', 'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.', 'id': 'ee58a693'}}, {'_index': 'course-questions', '_id': 'ob5fvZABkUlDw6EOKqAc', '_score': 0.89216983, '_source': {'question': 'Course - Can I still join the cour

In [25]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=725
result_docs = []

for hit in es_results['hits']['hits']:
    result_docs.append(hit['_source'])

result_docs[0]

{'question': 'The course has already started. Can I still join it?',
 'course': 'machine-learning-zoomcamp',
 'section': 'General course-related questions',
 'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
 'id': 'ee58a693'}

In [26]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=966
def elastic_search_knn(field, vector, course):
    knn = {
        "field": field,
        "query_vector": vector,
        "k": 5,
        "num_candidates": 10000,
        "filter": {
            "term": {
                "course": course
            }
        }
    }

    search_query = {
        "knn": knn,
        "_source": ["text", "section", "question", "course", "id"]
    }

    es_results = es_client.search(
        index=index_name,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

In [29]:
elastic_search_knn('question_vector', v_q, 'data-engineering-zoomcamp')[:2]

[{'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'section': 'General course-related questions',
  'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'id': '7842b56a'},
 {'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp',
  'section': 'General course-related questions',
  'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'id': 'a482086d'}]

In [30]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=1140
def question_vector_knn(q):
    question = q['question']
    course = q['course']

    v_q = model.encode(question)

    return elastic_search_knn('question_vector', v_q, course)

In [16]:
import pandas as pd

In [18]:
df_ground_truth = pd.read_csv('ground-truth-data.csv')

In [19]:
df_ground_truth.head()

Unnamed: 0,question,course,document
0,When does the course begin?,data-engineering-zoomcamp,c02e79ef
1,How can I get the course schedule?,data-engineering-zoomcamp,c02e79ef
2,What is the link for course registration?,data-engineering-zoomcamp,c02e79ef
3,How can I receive course announcements?,data-engineering-zoomcamp,c02e79ef
4,Where do I join the Slack channel?,data-engineering-zoomcamp,c02e79ef


In [20]:
ground_truth = df_ground_truth.to_dict(orient='records')
ground_truth[0]

{'question': 'When does the course begin?',
 'course': 'data-engineering-zoomcamp',
 'document': 'c02e79ef'}

In [31]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

In [32]:
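# Note: every relevant hit adds 1 / rank, which equals the standard MRR
# when each result list contains the expected document at most once.
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank]:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)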
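The `mrr` function above adds a reciprocal rank for every `True` in a result list, whereas the usual definition of MRR uses only the first relevant result. The two agree unless the same ID shows up more than once in a result list. A minimal sketch of the first-hit variant, under the hypothetical name `mrr_first_hit`; the numbers reported in this notebook come from the original function, not from this one:

```python
def mrr_first_hit(relevance_total):
    """MRR that scores only the first relevant result of each query."""
    total_score = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total_score += 1 / (rank + 1)
                break  # ignore any later relevant results
    return total_score / len(relevance_total)


# Hypothetical case: the expected ID appears twice in the first result list
toy = [
    [True, True, False],
    [False, False, False],
]
print(mrr_first_hit(toy))  # 0.5
```

On this toy input the summing version would return 0.75, since it also counts the second hit at rank 2.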

In [40]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=1278
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

In [34]:
evaluate(ground_truth, question_vector_knn)

100%|███████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [01:45<00:00, 43.89it/s]


{'hit_rate': 0.773071104387292, 'mrr': 0.6666810748505158}

In [35]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=1467
def text_vector_knn(q):
    question = q['question']
    course = q['course']

    v_q = model.encode(question)

    return elastic_search_knn('text_vector', v_q, course)

In [36]:
evaluate(ground_truth, text_vector_knn)

100%|███████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [01:39<00:00, 46.34it/s]


{'hit_rate': 0.8286146531229739, 'mrr': 0.7062315395144454}

In [38]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=1583
def question_text_vector_knn(q):
    question = q['question']
    course = q['course']

    v_q = model.encode(question)

    return elastic_search_knn('question_text_vector', v_q, course)

In [39]:
evaluate(ground_truth, question_text_vector_knn)

100%|███████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [01:36<00:00, 47.83it/s]


{'hit_rate': 0.9172249837907932, 'mrr': 0.824306606152295}

In [43]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=1750
def elastic_search_knn_combined(vector, course):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": [
                    {
                        "script_score": {
                            "query": {
                                "term": {
                                    "course": course
                                }
                            },
                            "script": {
                                "source": """
                                    cosineSimilarity(params.query_vector, 'question_vector') + 
                                    cosineSimilarity(params.query_vector, 'text_vector') + 
                                    cosineSimilarity(params.query_vector, 'question_text_vector') + 
                                    1
                                """,
                                "params": {
                                    "query_vector": vector
                                }
                            }
                        }
                    }
                ],
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        },
        "_source": ["text", "section", "question", "course", "id"]
    }

    es_results = es_client.search(
        index=index_name,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

In [44]:
elastic_search_knn_combined(v_q, 'data-engineering-zoomcamp')[:2]

[{'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'section': 'General course-related questions',
  'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'id': '7842b56a'},
 {'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp',
  'section': 'General course-related questions',
  'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'id': 'a482086d'}]
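The script in `elastic_search_knn_combined` scores each document as the sum of three cosine similarities plus a constant 1. Elasticsearch does not accept negative scores from `script_score`, which is presumably why the constant is there. A minimal sketch of the same arithmetic in pure Python, with hypothetical 2-dimensional vectors in place of the real embeddings:

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def combined_score(query_vector, doc):
    """Mirror the script above: three cosine similarities plus a constant 1."""
    return (
        cosine(query_vector, doc['question_vector'])
        + cosine(query_vector, doc['text_vector'])
        + cosine(query_vector, doc['question_text_vector'])
        + 1
    )


# Hypothetical vectors, not real embeddings
query = [1.0, 0.0]
toy_doc = {
    'question_vector': [1.0, 0.0],       # cosine 1
    'text_vector': [0.0, 1.0],           # cosine 0
    'question_text_vector': [1.0, 1.0],  # cosine 1/sqrt(2)
}

print(round(combined_score(query, toy_doc), 4))  # 2.7071
```

All three fields carry equal weight here; weighting them differently would be a separate experiment.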

In [46]:
# https://youtu.be/VRprIm9-VV8?list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R&t=2022
def vector_combined_knn(q):
    question = q['question']
    course = q['course']

    v_q = model.encode(question)

    return elastic_search_knn_combined(v_q, course)

In [47]:
evaluate(ground_truth, vector_combined_knn)

100%|███████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [01:39<00:00, 46.61it/s]


{'hit_rate': 0.9023125135076724, 'mrr': 0.804480945176861}
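To compare all runs in one place, the sketch below collects the hit rate and MRR values reported above (rounded to four decimals) and ranks them by MRR. The labels are informal names chosen for this summary:

```python
# Hit rate and MRR as reported in the runs above (rounded to 4 decimals)
reported = {
    'minsearch (text)': (0.7722, 0.6615),
    'Elasticsearch (text)': (0.7396, 0.6030),
    'question_vector': (0.7731, 0.6667),
    'text_vector': (0.8286, 0.7062),
    'question_text_vector': (0.9172, 0.8243),
    'combined vectors': (0.9023, 0.8045),
}

# Sort by MRR, best first
ranking = sorted(reported.items(), key=lambda item: item[1][1], reverse=True)

for name, (hr, mrr_value) in ranking:
    print(f'{name:<22} hit_rate={hr:.4f}  mrr={mrr_value:.4f}')
```

On this ground-truth set, the single `question_text_vector` field scores highest on both metrics, slightly ahead of the combined score.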