## Homework: Search Evaluation
In this homework, we will evaluate the results of vector search.

Your answers may not match exactly. If that's the case, select the closest option.

In [2]:
# !pip install -U minsearch qdrant_client

## Evaluation data
For this homework, we will use the same dataset we generated in the videos.

In [1]:
import os
import math
import warnings
from typing import Callable, Dict, List


warnings.filterwarnings("ignore")

In [2]:
import requests
import numpy as np
import pandas as pd

print("Downloading documents and ground truth...")

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
print(f"- Loaded {len(documents)} documents")
print(f"- Loaded {len(df_ground_truth)} ground-truth queries")
ground_truth = df_ground_truth.to_dict(orient='records')



Downloading documents and ground truth...
- Loaded 948 documents
- Loaded 4627 ground-truth queries


Here, documents contains the documents from the FAQ database with unique IDs, and ground_truth contains generated question-answer pairs.

Also, we will need the code for evaluating retrieval:

In [3]:
from tqdm.auto import tqdm

def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
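To make the two metrics concrete, here is a small worked example on a hand-made relevance matrix. It is only a sketch: `toy_hit_rate` and `toy_mrr` are hypothetical names for compact re-implementations, and the three relevance lists are invented for illustration rather than taken from the dataset.

```python
# Toy relevance lists: each inner list marks, per rank, whether the
# retrieved document is the expected one (values invented for illustration).
toy_relevance = [
    [False, True, False],   # hit at rank 2 -> reciprocal rank 1/2
    [False, False, False],  # no hit        -> reciprocal rank 0
    [True, False, False],   # hit at rank 1 -> reciprocal rank 1
]

def toy_hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result
    return sum(any(line) for line in relevance_total) / len(relevance_total)

def toy_mrr(relevance_total):
    # Mean of 1 / rank of the first relevant result (0 when there is none)
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)

toy_hr = toy_hit_rate(toy_relevance)   # 2 / 3
toy_rr = toy_mrr(toy_relevance)        # (1/2 + 0 + 1) / 3 = 0.5
print(f"hit_rate={toy_hr:.3f}, mrr={toy_rr:.3f}")
```

The sketch stops at the first relevant result per query. That gives the same value as summing over all ranks whenever each query has at most one relevant document, which is the case when document IDs are unique.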

In [4]:
CHOICES = {
    "Q1": [0.64, 0.74, 0.84, 0.94],
    "Q2": [0.25, 0.35, 0.45, 0.55],
    "Q3": [0.62, 0.72, 0.82, 0.92],
    "Q4": [0.65, 0.75, 0.85, 0.95],
    "Q5": [0.64, 0.74, 0.84, 0.94],
    "Q6": [0.25, 0.35, 0.45, 0.55],
}
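Since computed metrics rarely match an option exactly, a small helper can pick the closest one. The sketch below is one possible version of such a helper: the name `nearest_choice` and its signature are assumptions, and the input value is only an example.

```python
def nearest_choice(value, choices):
    # Return the option with the smallest absolute distance to the value
    return min(choices, key=lambda c: abs(c - value))

example_choices = [0.64, 0.74, 0.84, 0.94]
picked = nearest_choice(0.849, example_choices)  # example value, not a result
print(picked)
```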


## Q1. Minsearch text
Now let's evaluate our usual minsearch approach, indexing documents with:

text_fields=["question", "section", "text"],
keyword_fields=["course", "id"]

but tweak the parameters for search. Let's use the following boosting params:

boost = {'question': 1.5, 'section': 0.1}
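Conceptually, a boost is a per-field weight on the relevance score. The following is a minimal sketch, assuming each text field yields its own similarity score and the final score is their weighted sum with a default weight of 1.0. The function name `combine_scores` and the per-field scores are hypothetical, and minsearch's internals may differ in detail.

```python
def combine_scores(field_scores, boost, default_weight=1.0):
    # Weighted sum of per-field scores; fields without a boost use the default
    return sum(
        score * boost.get(field, default_weight)
        for field, score in field_scores.items()
    )

boost = {"question": 1.5, "section": 0.1}
field_scores = {"question": 0.4, "text": 0.2, "section": 0.5}  # invented values

combined = combine_scores(field_scores, boost)  # 0.4*1.5 + 0.2*1.0 + 0.5*0.1
print(round(combined, 2))
```

With these weights, a match in the question field counts 15 times more than an equally strong match in the section field.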

### What's the hitrate for this approach?

0.64

0.74

0.84

0.94


In [5]:
# -----------------------------
# Q1: Minsearch lexical with boosts
# -----------------------------

import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"],
)
index.fit(documents)

def search_fn(q):
    boost = {"question": 1.5, "section": 0.1}
    return index.search(
        query=q["question"],
        filter_dict={"course": q["course"]},
        boost_dict=boost,
        num_results=5,
    )

res = evaluate(ground_truth, search_fn)
print(f"Q1 — Minsearch (lexical w/ boosts) -> HitRate={res['hit_rate']:.3f}, MRR={res['mrr']:.3f}")
# print(f"  Nearest choice (HitRate): {nearest_choice(res['hit_rate'], CHOICES['Q1']):.2f}")


100%|██████████| 4627/4627 [00:10<00:00, 456.61it/s]

Q1 — Minsearch (lexical w/ boosts) -> HitRate=0.849, MRR=0.729





Answer selected: 0.84 (Close to calculated 0.849)

## Embeddings
The latest version of minsearch also supports vector search. We will use it.


We will also use TF-IDF and Singular Value Decomposition to create embeddings from texts. You can refer to our "Create Your Own Search Engine" workshop if you want to know more about it.



In [6]:
# -----------------------------
# Q2: Vector search (question field only)
# -----------------------------

from minsearch import VectorSearch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

## Let's create embeddings for the "question" field:

texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)

X = pipeline.fit_transform(texts)

# Now let's index these embeddings with minsearch:

vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x24d461325d0>

In [7]:
# Evaluate this search method. What's the MRR for it?
def search_fn(q):
    qv = pipeline.transform([q["question"]])
    return vindex.search(
        query_vector=qv[0],
        filter_dict={"course": q["course"]},
        num_results=5,
    )

res = evaluate(ground_truth, search_fn)
print(f"Q2 — Vector (question only) -> HitRate={res['hit_rate']:.3f}, MRR={res['mrr']:.3f}")
    

100%|██████████| 4627/4627 [00:04<00:00, 932.29it/s]

Q2 — Vector (question only) -> HitRate=0.482, MRR=0.357





Choices: 0.25, 0.35, 0.45, 0.55

Answer selected: 0.35 (Close to calculated .357)

## Q3. Vector search for question and answer
We only used question in Q2. We can use both question and answer:


Using the same pipeline (min_df=3 for the TF-IDF vectorizer and n_components=128 for SVD), evaluate the performance of this approach.

What's the hitrate? Choices given:

0.62

0.72

0.82

0.92

In [None]:
from minsearch import VectorSearch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)

pipe = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1),
)
X = pipe.fit_transform(texts)





In [9]:
vindex = VectorSearch(keyword_fields={"course"})
vindex.fit(X, documents)

def search_fn(q):
    qv = pipe.transform([q["question"]])
    return vindex.search(
        query_vector=qv[0], filter_dict={"course": q["course"]}, num_results=5
    )

res = evaluate(ground_truth, search_fn)
print(f"Q3 — Vector (question+text) -> HitRate={res['hit_rate']:.3f}, MRR={res['mrr']:.3f}")
    

100%|██████████| 4627/4627 [00:09<00:00, 496.73it/s]

Q3 — Vector (question+text) -> HitRate=0.821, MRR=0.672





Answer selected: 0.82 (Close to calculated .821)

## Q4. Qdrant
Now let's evaluate vector search in Qdrant, embedding the concatenated question and text of each document and retrieving the top 5 results.

What's the MRR?

0.65

0.75

0.85

0.95

In [10]:
# !pip install fastembed

In [11]:
# -----------------------------
# Q4: Qdrant ANN search (SentenceTransformer embeddings)
# -----------------------------

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.http.models import (
    Distance,
    FieldCondition,
    Filter,
    MatchValue,
    PointStruct,
    VectorParams,
)


# ---------- Qdrant in-memory collection ----------
client = QdrantClient(":memory:")  # in-memory, no server



texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)

# model_name  = "jinaai/jina-embeddings-v2-small-en"
model_name  = "all-MiniLM-L6-v2"
limit = 5

model = SentenceTransformer(model_name)
embeddings  = model.encode(texts,show_progress_bar=True)

dim = embeddings.shape[1]

collection = "course-questions"
client.create_collection(
    collection_name=collection,
    vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
)


# Upsert points with original FAQ IDs in payload so we can return them later

# Build all points first, then upsert them in a single call.
points = []

for i, (doc, emb) in enumerate(zip(documents, embeddings)):
    points.append(
        PointStruct(
            id=i,  # internal numeric ID
            vector=emb.tolist(),
            payload=doc
        )
    )
client.upsert(collection_name=collection, points=points)

Batches: 100%|██████████| 30/30 [00:11<00:00,  2.55it/s]


UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [12]:
def qdrant_search_fn(qrow):
    query = qrow['question']
    q_vec = model.encode([query])

    # Restrict results to the same course as the query
    query_filter = Filter(
        must=[
            FieldCondition(
                key="course",
                match=MatchValue(value=qrow['course'])
            )
        ]
    )

    hits = client.search(
        collection_name=collection,
        query_vector=q_vec[0].tolist(),
        query_filter=query_filter,
        limit=5
    )
    # Return the original FAQ IDs in the expected format
    return [{'id': hit.payload['id'], 'score': hit.score} for hit in hits]


In [13]:
res = evaluate(ground_truth, qdrant_search_fn)
print(f"Q4 — Qdrant ANN -> HitRate={res['hit_rate']:.3f}, MRR={res['mrr']:.3f}")
 

100%|██████████| 4627/4627 [04:33<00:00, 16.92it/s]

Q4 — Qdrant ANN -> HitRate=0.920, MRR=0.827





Answer selected: 0.85 (Close to calculated 0.827)

## Q5. Cosine similarity
Cosine similarity is the dot product of two normalized vectors. In a geometrical sense, it's the cosine of the angle between the vectors. Look up "cosine similarity geometry" if you want to learn more about it.
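As a quick check of the geometric reading, here is a dependency-free sketch that normalizes two vectors and takes their dot product. The function name `cosine_similarity_plain` and the example vectors are illustrative assumptions, and the sketch assumes non-zero vectors.

```python
import math

def cosine_similarity_plain(u, v):
    # Normalize each vector by its length, then take the dot product
    u_norm = math.sqrt(sum(x * x for x in u))
    v_norm = math.sqrt(sum(x * x for x in v))
    u_unit = [x / u_norm for x in u]
    v_unit = [x / v_norm for x in v]
    return sum(a * b for a, b in zip(u_unit, v_unit))

same_direction = cosine_similarity_plain([1.0, 2.0], [2.0, 4.0])  # angle of 0 degrees
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 3.0])      # angle of 90 degrees
diagonal = cosine_similarity_plain([1.0, 0.0], [1.0, 1.0])        # angle of 45 degrees
print(same_direction, orthogonal, diagonal)
```

Vectors pointing the same way score close to 1, perpendicular vectors score 0, and a 45-degree angle gives cos(45°), roughly 0.707, regardless of vector length.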



In [None]:
# -----------------------------
# Q5: Average cosine between TF-IDF vectors
# -----------------------------
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# For normalization, we first compute the vector norm (its length), and then divide the vector by it.
# First, we normalize each of the vectors
# Then, compute the dot product

def cosine(u, v, eps=1e-12):
    u = np.asarray(u)
    v = np.asarray(v)
    u_norm = np.linalg.norm(u)
    v_norm = np.linalg.norm(v)
    if u_norm < eps or v_norm < eps:
        return 0.0
    return float(np.dot(u, v) / (u_norm * v_norm))


In [40]:
# Now let's use this function to compute the A->Q->A cosine similarity.
# We will use the results from our gpt-4o-mini evaluations.

results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)

In [41]:
questions = df_results['question']
answers  = df_results['answer_llm']


vect = TfidfVectorizer(min_df=3)
Q = vect.fit_transform(questions)
A = vect.transform(answers)

In [42]:
df_results.head()

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


In [43]:
# When creating embeddings, we will use a simple way - the same we used in the Embeddings section

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)

# Let's fit the vectorizer on all the text data we have:
pipeline.fit(df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question)



Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(min_df=3)), ('truncatedsvd', TruncatedSVD(n_components=128, random_state=1))])


In [44]:

# Now use the `transform` method of the pipeline to create the embeddings and calculate the cosine similarity between each pair.

# Compute per-row cosine similarities
cosines = []

for _, row in df_results.iterrows():
# for _, row in df_results.dropna(subset=["answer_llm","answer_orig"]).iterrows():
    # Create embeddings for the LLM answer and the original answer
    v_llm = pipeline.transform([row['answer_llm']])[0]
    v_orig = pipeline.transform([row['answer_orig']])[0]

    # Compute cosine similarity and store it
    cosines.append(cosine(v_llm, v_orig))


avg_cosine = float(np.mean(cosines))
print(f"Average cosine similarity: {avg_cosine:.4f}")

Average cosine similarity: 0.8416


Answer selected: 0.84 (Close to calculated .8416)

## Q6. ROUGE
An alternative way to see how similar two texts are is ROUGE.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.
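To illustrate the unigram case, here is a simplified sketch of ROUGE-1 precision, recall and F1 based on clipped word counts. The function name `rouge1_sketch`, the whitespace tokenization and the example sentences are assumptions made for illustration. The `rouge` package may tokenize and count n-grams differently, so its numbers need not match this sketch.

```python
from collections import Counter

def rouge1_sketch(hypothesis, reference):
    # Whitespace tokenization; real implementations may tokenize differently
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped unigram overlap: a word counts at most as often as it
    # appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return {"p": 0.0, "r": 0.0, "f": 0.0}
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

# Five of the six words overlap, so precision, recall and F1 are all 5/6
example = rouge1_sketch("the cat sat on the mat", "the cat lay on the mat")
print(example)
```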

In [21]:
# !pip install rouge

In [22]:
# Let's compute the ROUGE score between the answers at the index 10 of our dataframe (doc_id=5170565b)

from rouge import Rouge
import numpy as np

# (Optional) sanity check: the row with index 10 (doc_id=5170565b)
rouge_scorer = Rouge()

r = df_results.iloc[10]
scores_example = rouge_scorer.get_scores(r.answer_llm, r.answer_orig)[0]
print("Row 10 ROUGE-1:", scores_example['rouge-1'])   # shows p/r/f for ROUGE-1


# scores = rouge_scorer.get_scores(r.answer_llm, r.answer_orig)[0]
# scores


Row 10 ROUGE-1: {'r': 0.45454545454545453, 'p': 0.45454545454545453, 'f': 0.45454544954545456}


There are three scores: rouge-1, rouge-2 and rouge-l, with precision, recall and F1 score for each.

- rouge-1: the overlap of unigrams
- rouge-2: the overlap of bigrams
- rouge-l: the longest common subsequence

For the document at index 10, the ROUGE-1 F1 score is 0.45.

Let's compute it for the pairs in the entire dataframe. What's the average Rouge-1 F1?

0.25

0.35

0.45

0.55


In [23]:
# 1) Prepare data (skip rows with missing answers)
pairs = df_results[['answer_llm', 'answer_orig']].dropna()

# 2) Compute ROUGE for the whole set in one go and average
#    (rouge.get_scores accepts lists and avg=True returns corpus averages)
scores_avg = rouge_scorer.get_scores(
    pairs['answer_llm'].tolist(),
    pairs['answer_orig'].tolist(),
    avg=True
)

avg_r1_f1 = scores_avg['rouge-1']['f']
print(f"Average ROUGE-1 F1: {avg_r1_f1:.4f}")

Average ROUGE-1 F1: 0.3517


Answer selected: 0.35 (Close to calculated .3517)