# Homework: Search Evaluation
In this homework, we will evaluate the results of vector search.

It's possible that your answers won't match exactly. If that's the case, select the closest one.

## Required libraries
We will use minsearch and Qdrant. Make sure you have the most up-to-date versions of both.

## Evaluation data
For this homework, we will use the same dataset we generated in the videos.

Let's load the documents and the ground truth data:

In [1]:
import requests
import pandas as pd

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')

In [22]:
for q in ground_truth:
    print(q)
    break

{'question': 'When does the course begin?', 'course': 'data-engineering-zoomcamp', 'document': 'c02e79ef'}


In [2]:
for q in documents:
    print(q)
    break

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.", 'section': 'General course-related questions', 'question': 'Course - When will the course start?', 'course': 'data-engineering-zoomcamp', 'id': 'c02e79ef'}


In [3]:
from tqdm import tqdm

def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
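To make the two metrics concrete, here is a small worked example on a hand-made relevance matrix. The names `toy_relevance`, `toy_hit_rate`, and `toy_mrr` are illustrative and not part of the homework. Unlike the `mrr` function above, this sketch stops at the first relevant result in each row; the two versions agree whenever a row contains at most one `True`, which is the case when document ids are unique.

```python
# Each row is one query; each entry says whether the result at that rank
# is the expected document.
toy_relevance = [
    [False, True, False],   # hit at rank 2 -> reciprocal rank 1/2
    [False, False, False],  # no hit        -> reciprocal rank 0
    [True, False, False],   # hit at rank 1 -> reciprocal rank 1
    [False, False, True],   # hit at rank 3 -> reciprocal rank 1/3
]

def toy_hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result
    return sum(any(row) for row in relevance_total) / len(relevance_total)

def toy_mrr(relevance_total):
    # Mean of 1 / (rank of the first relevant result), 0 if there is none
    total = 0.0
    for row in relevance_total:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)

print(toy_hit_rate(toy_relevance))       # 0.75
print(round(toy_mrr(toy_relevance), 4))  # 0.4583
```

Three of the four queries have a hit, so the hit rate is 3/4. The MRR is (1/2 + 0 + 1 + 1/3) / 4.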

## Q1. Minsearch text
Now let's evaluate our usual minsearch approach, but tweak the parameters. Let's use the following boosting params:
```python
boost = {'question': 1.5, 'section': 0.1}
```
**What's the hit rate for this approach?**

- 0.64
- 0.74
- 0.84
- 0.94

In [5]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

def minsearch_search(query, course):
    boost = {'question': 1.5, 'section': 0.1}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

In [6]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

100%|██████████| 4627/4627 [00:43<00:00, 107.25it/s]


In [7]:
hit_rate(relevance_total), mrr(relevance_total)

(0.848714069591528, 0.7288235717887772)
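The manual loop above does the same work as the `evaluate` helper defined earlier, which expects a function that takes a whole ground-truth record. A minimal sketch of that adapter pattern follows, assuming a toy in-memory search: `evaluate_sketch`, `toy_docs`, and `toy_search` are hypothetical stand-ins (no minsearch, no tqdm) so that the example runs on its own.

```python
def evaluate_sketch(ground_truth, search_function):
    # Same idea as evaluate(): collect one relevance row per query
    relevance_total = []
    for q in ground_truth:
        results = search_function(q)
        relevance_total.append([d['id'] == q['document'] for d in results])

    n = len(relevance_total)
    hits = sum(any(row) for row in relevance_total)
    # Reciprocal rank of the first relevant result, 0.0 when there is none
    rr = sum(
        next((1 / (i + 1) for i, rel in enumerate(row) if rel), 0.0)
        for row in relevance_total
    )
    return {'hit_rate': hits / n, 'mrr': rr / n}

# Hypothetical toy corpus and a word-overlap search with a course filter
toy_docs = [
    {'id': 'a1', 'course': 'c1', 'question': 'how do i install docker'},
    {'id': 'b2', 'course': 'c1', 'question': 'when does the course start'},
    {'id': 'c3', 'course': 'c2', 'question': 'when does the course start'},
]

def toy_search(query, course, num_results=2):
    words = set(query.lower().split())
    candidates = [d for d in toy_docs if d['course'] == course]
    ranked = sorted(
        candidates,
        key=lambda d: len(words & set(d['question'].split())),
        reverse=True,
    )
    return ranked[:num_results]

toy_ground_truth = [
    {'question': 'When will the course start', 'course': 'c1', 'document': 'b2'},
    {'question': 'When does docker start', 'course': 'c1', 'document': 'a1'},
    {'question': 'course start date', 'course': 'c2', 'document': 'c3'},
]

# The lambda adapts a (query, course) search to the one-argument interface
toy_metrics = evaluate_sketch(
    toy_ground_truth,
    lambda q: toy_search(q['question'], q['course']),
)
print(toy_metrics)
```

With the real index, the same pattern would be `evaluate(ground_truth, lambda q: minsearch_search(q['question'], q['course']))`.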

## Embeddings
The latest version of minsearch also supports vector search. We will use it:
```python
from minsearch import VectorSearch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
```
Let's create embeddings for the "question" field:

In [8]:
from minsearch import VectorSearch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)

## Q2. Vector search for question
Now let's index these embeddings with minsearch:

```python
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)
```
Evaluate this search method. What's the MRR for it?

- 0.25
- 0.35
- 0.45
- 0.55

In [9]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x7717d060cb10>
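Conceptually, a filtered vector search keeps only the documents that match the keyword filter and ranks the rest by similarity to the query vector. The brute-force sketch below illustrates that idea with plain Python lists. It is not minsearch's implementation: the function `filtered_vector_search`, the use of cosine similarity as the score, and the toy data are all assumptions made for illustration.

```python
import math

def cosine_sim(u, v):
    # Cosine similarity for plain lists; returns 0.0 if either vector is all zeros
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def filtered_vector_search(query_vec, vectors, payloads, filter_dict, num_results=5):
    scored = []
    for vec, payload in zip(vectors, payloads):
        # Skip documents that do not match every key in the filter
        if any(payload.get(key) != value for key, value in filter_dict.items()):
            continue
        scored.append((cosine_sim(query_vec, vec), payload))
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [payload for _, payload in scored[:num_results]]

toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
toy_payloads = [
    {'id': 'x', 'course': 'c1'},
    {'id': 'y', 'course': 'c1'},
    {'id': 'z', 'course': 'c2'},
]

toy_hits = filtered_vector_search([1.0, 0.1], toy_vectors, toy_payloads, {'course': 'c1'})
print([d['id'] for d in toy_hits])  # ['x', 'y']
```

Document `z` never appears in the results for course `c1`, regardless of its similarity score, because the filter is applied before ranking.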

In [10]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    Y = pipeline.transform([q['question']])
    results = vindex.search(Y, filter_dict={'course': q['course']}, num_results=5)
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)
    

100%|██████████| 4627/4627 [00:20<00:00, 230.31it/s]


In [11]:
mrr(relevance_total)

0.3572833369353793

## Q3. Vector search for question and answer
In Q2, we used only the question. We can use both the question and the answer:
```python
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)
```
Using the same pipeline (`min_df=3` for the TF-IDF vectorizer and `n_components=128` for SVD), evaluate the performance of this approach.

What's the hit rate?

- 0.62
- 0.72
- 0.82
- 0.92

In [12]:
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)
    
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)

In [13]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x7717d0603d90>

In [14]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    Y = pipeline.transform([q['question']])
    results = vindex.search(Y, filter_dict={'course': q['course']}, num_results=5)
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

100%|██████████| 4627/4627 [00:25<00:00, 181.76it/s]


In [15]:
hit_rate(relevance_total)

0.8210503566025502

## Q4. Qdrant
Now let's evaluate the following settings in Qdrant:

- `text = doc['question'] + ' ' + doc['text']`
- `model_handle = "jinaai/jina-embeddings-v2-small-en"`
- `limit = 5`

What's the MRR?
- 0.65
- 0.75
- 0.85
- 0.95

In [40]:
from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")  # connect to the local Qdrant instance


In [44]:
EMBEDDING_DIMENSIONALITY = 512
model_handle = "jinaai/jina-embeddings-v2-small-en"

collection_name = "zoomcamp-faq"

client.delete_collection(collection_name=collection_name)

False

In [45]:
# Create the collection with specified vector parameters
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,
        distance=models.Distance.COSINE
    )
)

True

In [46]:
client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword"
)

UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)

In [47]:
points = []

for i, doc in enumerate(documents):
    text = doc['question'] + ' ' + doc['text']
    vector = models.Document(text=text, model=model_handle)
    point = models.PointStruct(
        id=i,
        vector=vector,
        payload=doc
    )
    points.append(point)

In [48]:
client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=2, status=<UpdateStatus.COMPLETED: 'completed'>)

In [49]:
def vector_search(question):
    query_points = client.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=question,
            model=model_handle 
        ),
        limit=5,
        with_payload=True
    )
    
    results = []
    
    for point in query_points.points:
        results.append(point.payload)
    
    return results

In [50]:
query = 'I just discovered the course. Can I join now?'
scores = vector_search(query)
scores

# print("The highest score in the results: ", scores[0])

[{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
  'section': 'General course-related questions',
  'question': 'The course has already started. Can I still join it?',
  'course': 'machine-learning-zoomcamp',
  'id': 'ee58a693'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'text': 'Yes, we wi

In [51]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = vector_search(q['question'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

100%|██████████| 4627/4627 [05:09<00:00, 14.95it/s]


In [52]:
mrr(relevance_total)

0.8243462286578789

## Q5. Cosine similarity
In the second part of the module, we looked at evaluating the entire RAG approach. In particular, we looked at comparing the answer generated by our system with the actual answer from the FAQ.

One of the ways of doing it is using the cosine similarity. Let's see how to calculate it.

Cosine similarity is the dot product of two normalized vectors. In a geometrical sense, it's the cosine of the angle between the vectors. Look up "cosine similarity geometry" if you want to learn more about it.

For us, it means that we need two things:

- First, we normalize each of the vectors
- Then, compute the dot product

So, we get this:

In [55]:
import numpy as np

def normalize(u):
    norm = np.sqrt(u.dot(u))
    return u / norm

In [56]:

def cosine(u, v):
    u = normalize(u)
    v = normalize(v)
    return u.dot(v)

In [57]:
def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)
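As a quick sanity check of the definition, the sketch below computes cosine similarity for a few small vectors using plain Python lists instead of NumPy, so each step of the arithmetic is visible. The name `cosine_plain` and the zero-vector guard are assumptions added for illustration: the NumPy versions above divide by the norm directly, which yields `nan` for an all-zero vector (for example, a text with no words in the TF-IDF vocabulary).

```python
import math

def cosine_plain(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    u_norm = math.sqrt(sum(a * a for a in u))
    v_norm = math.sqrt(sum(b * b for b in v))
    # Assumed convention: similarity with a zero vector is reported as 0.0
    if u_norm == 0 or v_norm == 0:
        return 0.0
    return dot / (u_norm * v_norm)

print(round(cosine_plain([1, 0], [1, 1]), 4))  # 0.7071, a 45 degree angle
print(round(cosine_plain([1, 2], [2, 4]), 4))  # 1.0, same direction
print(round(cosine_plain([1, 0], [0, 1]), 4))  # 0.0, orthogonal
print(cosine_plain([0, 0], [1, 1]))            # 0.0, zero-vector guard
```

Scaling a vector does not change the result, which is why `[1, 2]` and `[2, 4]` have similarity 1.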

Now let's use this function to compute the A->Q->A cosine similarity.

We will use the results from our gpt-4o-mini evaluations:

In [58]:
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)

df_results.head()

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


To create the embeddings, we will use a simple approach, the same one we used in the Embeddings section:

In [60]:
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)

Let's fit the vectorizer on all the text data we have:

In [61]:
pipeline.fit(df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question)


In [64]:
# Create embeddings for the 'answer_llm' and 'answer_orig' columns
embeddings_llm = pipeline.transform(df_results['answer_llm'])
embeddings_orig = pipeline.transform(df_results['answer_orig'])
print("Embeddings created for 'answer_llm' and 'answer_orig'.")

Embeddings created for 'answer_llm' and 'answer_orig'.


In [65]:
# Calculate cosine similarity for each pair of embeddings
cosine_similarities = []
for i in range(len(embeddings_llm)):
    sim = cosine(embeddings_llm[i], embeddings_orig[i])
    cosine_similarities.append(sim)

In [66]:
# Add the calculated cosine similarities as a new column to the DataFrame
df_results['cosine_similarity'] = cosine_similarities
print("Cosine similarities calculated for each pair and added to DataFrame.")
print(df_results[['answer_llm', 'answer_orig', 'cosine_similarity']].head())

Cosine similarities calculated for each pair and added to DataFrame.
                                          answer_llm  \
0  You can sign up for the course by visiting the...   
1  You can sign up using the link provided in the...   
2  Yes, there is an FAQ for the Machine Learning ...   
3  The context does not provide any specific info...   
4  To structure your questions and answers for th...   

                                         answer_orig  cosine_similarity  
0  Machine Learning Zoomcamp FAQ\nThe purpose of ...           0.463526  
1  Machine Learning Zoomcamp FAQ\nThe purpose of ...           0.781565  
2  Machine Learning Zoomcamp FAQ\nThe purpose of ...           0.889158  
3  Machine Learning Zoomcamp FAQ\nThe purpose of ...           0.614962  
4  Machine Learning Zoomcamp FAQ\nThe purpose of ...           0.624086  


In [67]:
df_results['cosine_similarity'].describe()

count    1830.000000
mean        0.841584
std         0.173737
min         0.079093
25%         0.806927
50%         0.905812
75%         0.950711
max         0.996457
Name: cosine_similarity, dtype: float64

## Q6. Rouge
An alternative way to see how similar two texts are is ROUGE.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

In [69]:
df_results.iloc[10]

answer_llm           Yes, all sessions are recorded, so if you miss...
answer_orig          Everything is recorded, so you won’t miss anyt...
document                                                      5170565b
question                          Are sessions recorded if I miss one?
course                                       machine-learning-zoomcamp
cosine_similarity                                             0.942857
Name: 10, dtype: object

In [68]:
from rouge import Rouge
rouge_scorer = Rouge()

r = df_results.iloc[10]
scores = rouge_scorer.get_scores(r.answer_llm, r.answer_orig)[0]
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}
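To see where the `r`, `p`, and `f` values come from, here is a simplified ROUGE-1 sketch based on unigram overlap. It is an illustration under stated assumptions, not the `rouge` package's implementation: the function name `rouge1_sketch` is hypothetical, tokenization is a lowercase whitespace split, and repeated words are counted with clipping, so the numbers may differ from the package's output on real answers.

```python
from collections import Counter

def rouge1_sketch(hypothesis, reference):
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return {'r': 0.0, 'p': 0.0, 'f': 0.0}

    precision = overlap / sum(hyp_counts.values())  # share of hypothesis words found in the reference
    recall = overlap / sum(ref_counts.values())     # share of reference words found in the hypothesis
    f1 = 2 * precision * recall / (precision + recall)
    return {'r': recall, 'p': precision, 'f': f1}

toy_scores = rouge1_sketch('the cat is on the mat', 'the cat sat on the mat')
print({k: round(v, 4) for k, v in toy_scores.items()})
# {'r': 0.8333, 'p': 0.8333, 'f': 0.8333}
```

Five of the six words overlap in each direction, so precision, recall, and F1 are all 5/6. As in the cell above, the first argument plays the role of the generated answer and the second the reference answer.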

In [71]:
rogue_scores = []
for i, row in df_results.iterrows():
    scores = rouge_scorer.get_scores(row.answer_llm, row.answer_orig)[0]
    rogue_scores.append(scores)

In [72]:
rogue_scores[:3]

[{'rouge-1': {'r': 0.061224489795918366,
   'p': 0.21428571428571427,
   'f': 0.09523809178130524},
  'rouge-2': {'r': 0.017543859649122806,
   'p': 0.07142857142857142,
   'f': 0.028169010918468917},
  'rouge-l': {'r': 0.061224489795918366,
   'p': 0.21428571428571427,
   'f': 0.09523809178130524}},
 {'rouge-1': {'r': 0.08163265306122448,
   'p': 0.26666666666666666,
   'f': 0.12499999641113292},
  'rouge-2': {'r': 0.03508771929824561,
   'p': 0.13333333333333333,
   'f': 0.05555555225694465},
  'rouge-l': {'r': 0.061224489795918366, 'p': 0.2, 'f': 0.09374999641113295}},
 {'rouge-1': {'r': 0.32653061224489793,
   'p': 0.5714285714285714,
   'f': 0.41558441095631643},
  'rouge-2': {'r': 0.14035087719298245,
   'p': 0.24242424242424243,
   'f': 0.17777777313333343},
  'rouge-l': {'r': 0.30612244897959184,
   'p': 0.5357142857142857,
   'f': 0.3896103849822905}}]

In [73]:
r1_f1 = []
for score in rogue_scores:
    r1_f1.append(score['rouge-1']['f'])

In [74]:
# Average ROUGE-1 F1 across all answer pairs
average_r1_f1 = np.mean(r1_f1)

print(f"Average ROUGE-1 F1: {average_r1_f1}")

Average ROUGE-1 F1: 0.3516946452113943
