## Homework: Search Evaluation
---
In this homework, we will evaluate the results of vector search.

> It's possible that your answers won't match exactly. If that's the case, select the closest one.

## Required libraries
---
We will use minsearch and Qdrant. Make sure you have the most up-to-date versions:

```pip install -U minsearch qdrant_client```

minsearch should be at least 0.0.4.

## Evaluation data
---
For this homework, we will use the same dataset we generated in the videos.

Let's load the documents and the ground truth data:

In [1]:
import requests
import pandas as pd

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')

Index the documents

In [2]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

<minsearch.minsearch.Index at 0x705898302f60>

Here, ```documents``` contains the documents from the FAQ database with unique IDs, and ```ground_truth``` contains generated question-answer pairs.

Also, we will need the code for evaluating retrieval:

In [3]:
from tqdm.auto import tqdm

def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)
                break  # only the first relevant result counts towards the reciprocal rank

    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }



## Q1. Minsearch text
---
Now let's evaluate our usual minsearch approach, but tweak the parameters. Let's use the following boosting params:

```boost = {'question': 1.5, 'section': 0.1}```

What's the hit rate for this approach?

* 0.64
* 0.74
* **0.84** *<- hit rate for this approach* 
* 0.94

In [4]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'id': 'c02e79ef'}

In [5]:
def search(query, course):
    boost = {'question': 1.5, 'section': 0.1}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

In [6]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [00:15<00:00, 293.83it/s]


In [7]:
relevance_total

[[True, False, False, False, False],
 [False, False, False, True, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [False, True, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [False, False, False, True, False],
 [False, False, False, False, False],
 [False, True, False, False, False],
 [True, False, False, False, False],
 [False, True, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [],
 [],
 [],
 [],
 [],
 [True, False, False, False, False],
 [False, False, False, False, True],
 [False, False, False, False, False],
 [True, False, False, False, False],
 [False, True, False, False, False],
 [True, False, False, False, False],
 [True, Fal

In [8]:
example = [[True, False, False, False, False],
 [False, False, False, True, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [False, True, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [True, False, False, False, False],
 [False, False, False, True, False],
 [False, False, False, False, False],
]

* Hit-rate (recall)

In [9]:
11 / len(example) # 11 is the number of rows with at least one True

0.9166666666666666

In [10]:
hit_rate(example)

0.9166666666666666

In [11]:
hit_rate(relevance_total)

0.848714069591528
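
The same toy ```example``` can be used to check MRR by hand. The sketch below is self-contained: it repeats the 12 rows from above and uses a compact variant of ```mrr``` that stops at the first relevant result (the name ```mrr_first_hit``` is only illustrative, not part of the homework code). Each row contributes 1 / rank of its first ```True```, or 0 if there is none.

```python
# The 12 rows of the toy example above: one row per query,
# one boolean per returned result (True = the expected document).
example = [
    [True, False, False, False, False],
    [False, False, False, True, False],
    [True, False, False, False, False],
    [True, False, False, False, False],
    [False, True, False, False, False],
    [True, False, False, False, False],
    [True, False, False, False, False],
    [True, False, False, False, False],
    [True, False, False, False, False],
    [True, False, False, False, False],
    [False, False, False, True, False],
    [False, False, False, False, False],
]


def mrr_first_hit(relevance_total):
    """Mean reciprocal rank: average of 1 / rank of the first relevant result."""
    total_score = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total_score += 1 / (rank + 1)
                break  # later hits do not matter for the reciprocal rank
    return total_score / len(relevance_total)


# Eight rows hit at rank 1 (8 * 1), one at rank 2 (0.5),
# two at rank 4 (2 * 0.25), one has no hit (0): 9 / 12.
example_mrr = mrr_first_hit(example)
print(example_mrr)  # 0.75
```

Note that MRR (0.75) is lower than the hit rate (about 0.92) on the same data, because hits at lower ranks are only partially rewarded.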

## Embeddings
---
The latest version of minsearch also supports vector search. We will use it:

In [12]:
from minsearch import VectorSearch

We will also use TF-IDF and Singular Value Decomposition to create embeddings from texts. You can refer to our ["Create Your Own Search Engine" workshop](https://github.com/alexeygrigorev/build-your-own-search-engine) if you want to know more about it.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from minsearch import VectorSearch

Let's create embeddings for the "question" field:

In [16]:
texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)
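
To make the shapes concrete, here is a minimal sketch of the same TF-IDF + SVD idea on a made-up four-document corpus. The names ```toy_texts```, ```toy_pipeline```, ```X_toy``` and ```q_toy``` are illustrative, and ```min_df=1``` and ```n_components=2``` are chosen only because the corpus is tiny; the homework itself uses ```min_df=3``` and ```n_components=128```.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up corpus, for illustration only
toy_texts = [
    "how do I install docker",
    "when does the course start",
    "can I still join the course after the start date",
    "docker compose fails to start",
]

toy_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1),                     # sparse TF-IDF vectors
    TruncatedSVD(n_components=2, random_state=1),  # dense 2-dimensional embeddings
)

# Fit on the documents once: one embedding row per document
X_toy = toy_pipeline.fit_transform(toy_texts)

# Queries must go through the SAME fitted pipeline (transform, not fit_transform),
# so that they land in the same vector space as the documents
q_toy = toy_pipeline.transform(["when will the course start"])[0]

print(X_toy.shape)  # (4, 2)
print(q_toy.shape)  # (2,)
```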

## Q2. Vector search for question
---
Now let's index these embeddings with minsearch:

In [17]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x70589881a120>

Evaluate this search method. What's the MRR for it?

* 0.25
* **0.35** *<-- MRR*
* 0.45
* 0.55

In [18]:
# Search function: embed the query with the SAME fitted pipeline used for the documents
def search_function(q):
    query_vector = pipeline.transform([q['question']])[0]
    return vindex.search(
        query_vector=query_vector,
        filter_dict={'course': q['course']},
        num_results=5
    )

# Evaluate
minsearch_vector_eval_results = evaluate(ground_truth, search_function)

print(f"Minsearch vector hit rate: {minsearch_vector_eval_results['hit_rate']}")
print(f"Minsearch vector MRR: {minsearch_vector_eval_results['mrr']}")

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [00:07<00:00, 638.88it/s]

Minsearch vector hit rate: 0.48173762697212014
Minsearch vector MRR: 0.3572833369353793





## Q3. Vector search for question and answer
---
We only used question in Q2. We can use both question and answer:

In [19]:
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)

Using the same pipeline (```min_df=3``` for the TF-IDF vectorizer and ```n_components=128``` for SVD), evaluate the performance of this approach.

What's the hit rate?

* 0.62
* 0.72
* **0.82** *<- hit rate*
* 0.92

In [20]:
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)

In [21]:
# let's index these embeddings with minsearch
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x705896d77fb0>

In [22]:
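# Precompute embeddings for the ground_truth questions.
# Note: search_function re-embeds each question itself, so these stored vectors are optional.
X_q = pipeline.transform([q['question'] for q in ground_truth])
for question, vector in zip(ground_truth, X_q):
    question['vector'] = vector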

In [23]:
minsearch_vector_eval_results = evaluate(ground_truth, search_function)

print(f"Minsearch vector hit rate: {minsearch_vector_eval_results['hit_rate']}")
print(f"Minsearch vector MRR: {minsearch_vector_eval_results['mrr']}")

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [00:07<00:00, 612.74it/s]

Minsearch vector hit rate: 0.8210503566025502
Minsearch vector MRR: 0.6717347453353508





## Q4. Qdrant
---
Now let's evaluate the following settings in Qdrant:

* ```text = doc['question'] + ' ' + doc['text']```
* ```model_handle = "jinaai/jina-embeddings-v2-small-en"```
* ```limit = 5```

What's the MRR?

* 0.65
* 0.75
* **0.85** *<- MRR*
* 0.95

```
docker run -p 6333:6333 -p 6334:6334 \
   -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
   qdrant/qdrant
```

In [24]:
from qdrant_client import QdrantClient, models

qd_client = QdrantClient("http://localhost:6333")

EMBEDDING_DIMENSIONALITY = 512
model_handle = "jinaai/jina-embeddings-v2-small-en"

In [25]:
collection_name = "evaluating_ground_truth"
qd_client.delete_collection(collection_name=collection_name)

True

In [26]:
qd_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,
        distance=models.Distance.COSINE
    )
)

True

In [27]:
qd_client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword"
)

UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)

In [28]:
points = []

for i, doc in enumerate(documents):
    text = doc['question'] + ' ' + doc['text']
    vector = models.Document(text=text, model=model_handle)
    point = models.PointStruct(
        id=i,
        vector=vector,
        payload=doc
    )
    points.append(point)

In [29]:
qd_client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=2, status=<UpdateStatus.COMPLETED: 'completed'>)

In [30]:
def qdrant_vector_search(question):    
    course = question['course']
    query_points = qd_client.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=question['question'],
            model=model_handle 
        ),
        query_filter=models.Filter( 
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=course)
                )
            ]
        ),
        limit=5,
        with_payload=True
    )
    
    results = []
    
    for point in query_points.points:
        results.append(point.payload)
    
    return results

In [31]:
evaluate(ground_truth, qdrant_vector_search)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4627/4627 [01:24<00:00, 54.81it/s]


{'hit_rate': 0.9299762264966501, 'mrr': 0.8517722066133576}

## Q5. Cosine similarity
---
In the second part of the module, we looked at evaluating the entire RAG approach. In particular, we looked at comparing the answer generated by our system with the actual answer from the FAQ.

One of the ways of doing it is using the cosine similarity. Let's see how to calculate it.

Cosine similarity is the dot product of two normalized vectors. In a geometrical sense, it's the cosine of the angle between the vectors. Look up "cosine similarity geometry" if you want to learn more about it.

For us, it means that we need two things:

* First, we normalize each of the vectors
* Then, we compute the dot product

So, we get this:
```python
def cosine(u, v):
    u = normalize(u)
    v = normalize(v)
    return u.dot(v)
```
For normalization, we first compute the vector norm (its length), and then divide the vector by it (here ```np``` comes from ```import numpy as np```):
```python
def normalize(u):
    norm = np.sqrt(u.dot(u))
    return u / norm
```
Or we can simplify it:
```python
def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)
```
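
As a quick sanity check of the formula, here is a dependency-free sketch using only ```math```. The name ```cosine_plain``` and the choice to return 0.0 for a zero-length vector are assumptions made for this illustration; the NumPy version above has no such guard and would typically produce ```nan``` when one of the vectors is all zeros.

```python
import math


def cosine_plain(u, v):
    """Cosine similarity for two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    u_norm = math.sqrt(sum(a * a for a in u))
    v_norm = math.sqrt(sum(b * b for b in v))
    if u_norm == 0 or v_norm == 0:
        # Assumption for this sketch: treat a zero vector as "not similar"
        return 0.0
    return dot / (u_norm * v_norm)


same_direction = cosine_plain([1.0, 2.0], [2.0, 4.0])  # parallel vectors: close to 1
orthogonal = cosine_plain([1.0, 0.0], [0.0, 3.0])      # right angle: 0
opposite = cosine_plain([1.0, 0.0], [-2.0, 0.0])       # opposite directions: -1
zero_case = cosine_plain([0.0, 0.0], [1.0, 1.0])       # guarded case: 0.0

print(same_direction, orthogonal, opposite, zero_case)
```

The result depends only on the direction of the vectors, not on their length, which is why scaling ```[1.0, 2.0]``` to ```[2.0, 4.0]``` does not change the similarity.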
Now let's use this function to compute the A->Q->A cosine similarity.

We will use the results from our [gpt-4o-mini evaluations](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-evaluation/rag_evaluation/data/results-gpt4o-mini.csv):

In [32]:
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)

In [33]:
import numpy as np

def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)

When creating embeddings, we will use a simple approach, the same one we used in the [Embeddings](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/cohorts/2025/03-evaluation/homework.md#embeddings) section:

In [34]:
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)

Let's fit the vectorizer on all the text data we have:

In [35]:
pipeline.fit(df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question)

Now use the ```transform``` method of the pipeline to create the embeddings and calculate the cosine similarity between each pair.

What's the average cosine?

* 0.64
* 0.74
* **0.84** *<- average cosine*
* 0.94

This is how you do it:

* For each answer pair, compute
    * v_llm for the answer from the LLM
    * v_orig for the original answer
    * then compute the cosine between them
* At the end, take the average
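
The steps above can also be done for all pairs at once. The following is a minimal sketch, assuming the two sets of embeddings are available as NumPy arrays with one row per answer; ```rowwise_cosine```, ```V_llm``` and ```V_orig``` are illustrative names, and the small arrays are made-up stand-ins for the real embeddings.

```python
import numpy as np


def rowwise_cosine(A, B):
    """Cosine similarity between matching rows of A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    dots = np.sum(A * B, axis=1)                                   # one dot product per row
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)  # product of row lengths
    return dots / norms


# Made-up 2-dimensional "embeddings" for three answer pairs
V_llm = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
V_orig = np.array([[2.0, 0.0], [1.0, 0.0], [3.0, 0.0]])

cosines = rowwise_cosine(V_llm, V_orig)  # per-pair cosine similarity
average_cosine = cosines.mean()          # the number the question asks about

print(cosines)
print(average_cosine)
```

Here the three pairs have cosines of 1, about 0.707 and 0, so the average is about 0.569.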

In [36]:
df_results['v_llm'] = pipeline.transform(df_results['answer_llm']).tolist()
df_results['v_orig'] = pipeline.transform(df_results['answer_orig']).tolist()

In [37]:
df_results['cosine_similarity'] = df_results[['v_llm', 'v_orig']].apply(
    lambda x: cosine(np.array(x['v_llm']), np.array(x['v_orig'])), axis=1)

In [38]:
df_results.head(2)

| Unnamed: 0 | answer_llm | answer_orig | document | question | course | v_llm | v_orig | cosine_similarity |
|---|---|---|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp | [0.1554985879579983, 0.11219644369710587, -0.1... | [0.22746772878326751, 0.12079641681716524, -0.... | 0.463526 |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp | [0.1489427947945486, 0.1767921364621164, -0.16... | [0.22746772878326751, 0.12079641681716524, -0.... | 0.781565 |


In [39]:
df_results['cosine_similarity'].mean()

np.float64(0.8415841233490402)

## Q6. Rouge
---
An alternative way to measure how similar two texts are is ROUGE.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves. There's a Python package for it:

In [40]:
pip install rouge


(The latest version at the moment of writing is ```1.0.1```)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (```doc_id=5170565b```):

In [41]:
from rouge import Rouge
rouge_scorer = Rouge()

r = df_results.iloc[10]
scores = rouge_scorer.get_scores(r.answer_llm, r.answer_orig)[0]
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

There are three scores: rouge-1, rouge-2 and rouge-l, and precision, recall and F1 score for each.

* ```rouge-1``` - the overlap of unigrams,
* ```rouge-2``` - bigrams,
* ```rouge-l``` - the longest common subsequence
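
To illustrate what "overlap of unigrams" means, here is a simplified ROUGE-1 sketch using whitespace tokenization and clipped word counts. It is not the implementation of the ```rouge``` package, whose tokenization and counting details may differ, so its numbers need not match the package's output; ```rouge_1_sketch``` and the two sentences are made up for this example.

```python
from collections import Counter


def rouge_1_sketch(hypothesis, reference):
    """Simplified ROUGE-1: precision, recall and F1 over unigram overlap."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())

    hyp_total = sum(hyp_counts.values())
    ref_total = sum(ref_counts.values())

    p = overlap / hyp_total if hyp_total else 0.0  # share of hypothesis words found in the reference
    r = overlap / ref_total if ref_total else 0.0  # share of reference words found in the hypothesis
    f = 2 * p * r / (p + r) if (p + r) else 0.0    # harmonic mean of precision and recall

    return {'r': r, 'p': p, 'f': f}


toy_scores = rouge_1_sketch("the cat sat on the mat", "the cat lay on the mat")
print(toy_scores)
```

In this toy pair, five of the six words overlap on each side ("sat" and "lay" differ), so precision, recall and F1 are all 5/6.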

For the document at index 10, the ROUGE-1 F1 score is 0.45.

Let's compute it for the pairs in the entire dataframe. What's the average Rouge-1 F1?

* 0.25
* **0.35** *<- average Rouge-1 F1*
* 0.45
* 0.55

In [42]:
full_score = rouge_scorer.get_scores(df_results['answer_llm'], df_results['answer_orig'])

In [43]:
# Average ROUGE-1 recall (r), precision (p) and F1 (f) over all answer pairs
metrics = ['r', 'p', 'f']
for metric in metrics:
    print(f"{metric}_avg: {np.mean([record['rouge-1'][metric] for record in full_score])}")

r_avg: 0.34043594697723023
p_avg: 0.4299569796022711
f_avg: 0.3516946452113943
