## Homework: Search Evaluation

In this homework, we will evaluate the results of vector
search.

> It's possible that your answers won't match exactly. If it's the case, select the closest one.
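
Picking the closest option can be done by eye, but a one-line helper makes the rule explicit. The sketch below uses a hypothetical helper name, `closest_option`, and an illustrative metric value; neither comes from the course materials.

```python
def closest_option(value, options):
    """Return the option with the smallest absolute distance to value."""
    return min(options, key=lambda option: abs(option - value))


# Illustrative numbers only: a computed metric and four answer options.
print(closest_option(0.8487, [0.64, 0.74, 0.84, 0.94]))  # prints 0.84
```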


## Required libraries

We will use minsearch and Qdrant. Make sure you have the most up-to-date versions:

```bash
pip install -U minsearch qdrant_client
``` 

minsearch should be at least 0.0.4.



## Evaluation data

For this homework, we will use the same dataset we generated
in the videos.

Let's download it:

```python
import requests
import pandas as pd

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')
```

Here, `documents` contains the documents from the FAQ database
with unique IDs, and `ground_truth` contains generated
question-answer pairs. 
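
The evaluation code in this homework reads `q['document']`, `q['question']` and `q['course']` from each ground-truth record and compares `q['document']` against the `id` of each returned document. A quick sanity check, sketched here on hypothetical toy records rather than the real files, is to confirm that every ground-truth `document` value refers to an existing document ID. The helper name `find_missing_ids` is illustrative.

```python
def find_missing_ids(documents, ground_truth):
    """Return ground-truth records whose 'document' is not a known document ID."""
    known_ids = {doc["id"] for doc in documents}
    return [q for q in ground_truth if q["document"] not in known_ids]


# Hypothetical toy records that mimic the fields used in this homework.
toy_documents = [
    {"id": "a1", "course": "course-x", "question": "How do I enroll?", "text": "Use the form."},
    {"id": "b2", "course": "course-x", "question": "Is there a deadline?", "text": "Yes."},
]
toy_ground_truth = [
    {"question": "Where can I sign up?", "course": "course-x", "document": "a1"},
    {"question": "When is the cutoff?", "course": "course-x", "document": "zz"},
]

missing = find_missing_ids(toy_documents, toy_ground_truth)
print(len(missing))  # prints 1, because "zz" matches no document
```

On the real data, the same check would be called with `documents` and `ground_truth` as loaded above.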

Also, we will need the code for evaluating retrieval:

```python
from tqdm.auto import tqdm

def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)
                break  # only the first relevant result counts toward MRR

    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
```
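
To see what the two metrics measure, it helps to run them on a tiny hand-made input. In the sketch below each inner list holds the relevance flags for one query's ranked results. The data is made up, and the functions are compact restatements of `hit_rate` and `mrr` under the assumption that each query has at most one relevant document (true here because document IDs are unique).

```python
def hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result.
    return sum(1 for line in relevance_total if True in line) / len(relevance_total)


def mrr(relevance_total):
    # Mean of 1 / rank of the first relevant result (0 when there is none).
    total = 0.0
    for line in relevance_total:
        if True in line:
            total += 1 / (line.index(True) + 1)
    return total / len(relevance_total)


toy_relevance = [
    [True, False, False],   # relevant document at rank 1 -> reciprocal rank 1
    [False, False, True],   # relevant document at rank 3 -> reciprocal rank 1/3
    [False, False, False],  # relevant document not retrieved -> 0
    [False, True, False],   # relevant document at rank 2 -> reciprocal rank 1/2
]

print(hit_rate(toy_relevance))       # prints 0.75
print(round(mrr(toy_relevance), 4))  # prints 0.4583
```

Hit rate only asks whether the right document appears at all, while MRR also rewards placing it near the top, which is why MRR is never larger than hit rate.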



## Q1. Minsearch text

Now let's evaluate our usual minsearch approach, indexing documents with:
```python
text_fields=["question", "section", "text"],
keyword_fields=["course", "id"]
```
but tweak the parameters for search. Let's use the following boosting params:

```python
boost = {'question': 1.5, 'section': 0.1}
```

What's the hit rate for this approach?

* 0.64
* 0.74
* 0.84
* 0.94

In [None]:
!pip install -U minsearch qdrant_client ipywidgets jupyter

In [None]:
import requests  # for downloading datasets
import pandas as pd  # for loading and handling tabular data
from minsearch import Index  # the MinSearch class for text-based search
from tqdm.auto import tqdm  # for progress bars during evaluation

In [2]:
# Define base URL to GitHub raw data
url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'

# URL for the documents JSON file
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'

# Download and parse the documents
documents = requests.get(docs_url).json()

# URL for the ground truth CSV file
ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'

# Load ground truth into a DataFrame
df_ground_truth = pd.read_csv(ground_truth_url)

# Convert DataFrame to list of dictionaries for easier processing
ground_truth = df_ground_truth.to_dict(orient='records')


In [3]:
# Compute Hit Rate: % of queries for which the correct document was retrieved
def hit_rate(relevance_total):
    cnt = 0
    for line in relevance_total:
        if True in line:  # If any retrieved doc matches the correct ID
            cnt = cnt + 1
    return cnt / len(relevance_total)

# Compute Mean Reciprocal Rank (MRR)
def mrr(relevance_total):
    total_score = 0.0
    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:  # True means relevant doc found at this rank
                total_score = total_score + 1 / (rank + 1)
                break  # only the first correct hit counts for MRR
    return total_score / len(relevance_total)

# Main evaluation loop
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):  # iterate over each query
        doc_id = q['document']  # correct document id
        results = search_function(q)  # run search
        relevance = [d['id'] == doc_id for d in results]  # check if results match the true id
        relevance_total.append(relevance)  # collect all relevance flags

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }


In [4]:
# Create a MinSearch index with specified text and keyword fields
index = Index(
    text_fields=["question", "section", "text"],  # full-text searchable fields
    keyword_fields=["course", "id"]  # fields for exact matching
)

# Fit the index to our document list
index.fit(documents)


<minsearch.minsearch.Index at 0x78353f668dd0>

In [5]:
def search_function(q):
    return index.search(
        q["question"],  # use the question as the search query
        filter_dict={"course": q["course"]},  # filter by course to narrow down results
        boost_dict={"question": 1.5, "section": 0.1},  # boost weights for fields
        num_results=5  # how many top results to return
    )


In [6]:
# Evaluate using the ground truth and the defined search function
results = evaluate(ground_truth, search_function)

# Print final evaluation metrics
print(results)


{'hit_rate': 0.848714069591528, 'mrr': 0.7283553058137033}




## Embeddings 

The latest version of minsearch also supports vector search. 
We will use it:

```python
from minsearch import VectorSearch
```

We will also use TF-IDF and Singular Value Decomposition to 
create embeddings from texts. You can refer to our
["Create Your Own Search Engine" workshop](https://github.com/alexeygrigorev/build-your-own-search-engine)
if you want to know more about it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
```

Let's create embeddings for the "question" field:

```python
texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)
```
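
The `min_df=3` argument tells the vectorizer to ignore terms that appear in fewer than three documents, which shrinks the vocabulary before SVD. The following plain-Python sketch illustrates that document-frequency cutoff on a made-up corpus. It uses naive whitespace tokenization rather than scikit-learn's tokenizer, and `build_vocabulary` is a hypothetical helper.

```python
from collections import Counter


def build_vocabulary(texts, min_df):
    """Keep terms that occur in at least min_df different texts."""
    document_frequency = Counter()
    for text in texts:
        # set() so a term counts once per text, however often it repeats.
        document_frequency.update(set(text.lower().split()))
    return sorted(term for term, df in document_frequency.items() if df >= min_df)


toy_texts = [
    "how do i install docker",
    "docker install fails on windows",
    "how do i submit homework",
    "install python and docker",
]

print(build_vocabulary(toy_texts, min_df=3))  # prints ['docker', 'install']
```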

## Q2. Vector search for question

Now let's index these embeddings with minsearch:

```python
vindex = VectorSearch(keyword_fields=['course'])
vindex.fit(X, documents)
```

Evaluate this search method. What's the MRR for it?

- 0.25
- 0.35
- 0.45
- 0.55
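
Conceptually, a filtered vector search keeps only the documents that match the keyword filter, scores each remaining embedding against the query vector, and returns the top results. The plain-Python sketch below illustrates that idea with cosine similarity on made-up 2D vectors. It is not minsearch's actual implementation, and `toy_vector_search` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toy_vector_search(query_vector, vectors, documents, course, num_results=5):
    """Rank documents of one course by cosine similarity to the query."""
    scored = [
        (cosine(query_vector, vec), doc)
        for vec, doc in zip(vectors, documents)
        if doc["course"] == course
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


toy_docs = [
    {"id": "a", "course": "x"},
    {"id": "b", "course": "x"},
    {"id": "c", "course": "y"},
]
toy_vectors = [[1.0, 0.0], [0.6, 0.8], [1.0, 0.1]]

hits = toy_vector_search([1.0, 0.2], toy_vectors, toy_docs, course="x", num_results=2)
print([doc["id"] for doc in hits])  # prints ['a', 'b']
```

Document `c` is close to the query but belongs to another course, so the filter drops it before ranking.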


In [7]:
from minsearch import VectorSearch  # For vector-based semantic search
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF text vectorization
from sklearn.decomposition import TruncatedSVD  # Dimensionality reduction
from sklearn.pipeline import make_pipeline  # To chain TF-IDF + SVD


In [8]:
texts = []

for doc in documents:
    # Extract the 'question' field from each document
    t = doc['question']
    texts.append(t)


In [9]:
# Create a pipeline: TF-IDF vectorization + Truncated SVD (128 dimensions)
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),  # ignore rare words
    TruncatedSVD(n_components=128, random_state=1)  # reduce to 128D
)

# Fit and transform the texts into embeddings (2D numpy array)
X = pipeline.fit_transform(texts)


In [10]:
# Initialize the vector index with 'course' as a keyword filter
vindex = VectorSearch(keyword_fields=['course'])

# Fit the vector index with our embeddings and original documents
vindex.fit(X, documents)


<minsearch.vector.VectorSearch at 0x78353e58a950>

In [11]:
def vector_search_function(q):
    # Convert the input question to a single embedding vector
    query_vec = pipeline.transform([q["question"]])  # shape: (1, 128)

    # Run the vector search, filter by 'course'
    return vindex.search(
        query_vector=query_vec[0],  # use 1D vector
        filter_dict={"course": q["course"]},  # filter documents by course
        num_results=5  # return top 5 matches
    )


In [12]:
# Use the same evaluation function from earlier
results = evaluate(ground_truth, vector_search_function)

# Print metrics
print(results)


{'hit_rate': 0.48173762697212014, 'mrr': 0.3568510914199265}


## Q3. Vector search for question and answer

In Q2 we used only the question. We can also use both the question and the answer:

```python
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)
```

Using the same pipeline (`min_df=3` for the TF-IDF vectorizer and `n_components=128` for SVD), evaluate the performance of this
approach.

What's the hit rate?

- 0.62
- 0.72
- 0.82
- 0.92

In [13]:
texts = []

for doc in documents:
    # Concatenate question and answer text
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Create pipeline for embeddings
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),  # filter rare terms
    TruncatedSVD(n_components=128, random_state=1)  # reduce to 128D
)

# Generate embeddings from combined text
X = pipeline.fit_transform(texts)


In [15]:
from minsearch import VectorSearch

# Create a new vector search index with 'course' as keyword field
vindex = VectorSearch(keyword_fields=['course'])

# Fit the vector search index with embeddings and documents
vindex.fit(X, documents)


<minsearch.vector.VectorSearch at 0x78353e796750>

In [16]:
def vector_search_combined(q):
    # Create embedding from the input question only (same pipeline)
    query_vec = pipeline.transform([q["question"]])  # query is still only the question

    # Perform search using combined vector index
    return vindex.search(
        query_vector=query_vec[0],
        filter_dict={"course": q["course"]},
        num_results=5
    )


In [17]:
results = evaluate(ground_truth, vector_search_combined)

print(results)


{'hit_rate': 0.8210503566025502, 'mrr': 0.6711944384410349}


## Q4. Qdrant

Now let's evaluate the following settings in Qdrant:

- `text = doc['question'] + ' ' + doc['text']`
- `model_handle = "jinaai/jina-embeddings-v2-small-en"`
- `limit = 5`

What's the MRR?

- 0.65
- 0.75
- 0.85
- 0.95

In [None]:
!pip install -U qdrant-client sentence-transformers



In [None]:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue

from sentence_transformers import SentenceTransformer


In [None]:
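# Load pretrained embedding model
# (the Jina v2 models ship custom model code, so the model card asks for trust_remote_code=True)
model = SentenceTransformer("jinaai/jina-embeddings-v2-small-en", trust_remote_code=True)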

# Initialize Qdrant (in-memory)
client = QdrantClient(":memory:")  # or "http://localhost:6333" if running a server


In [None]:
# Collection name
COLLECTION_NAME = "faq_data"

# Create embeddings from question + text
texts = [doc["question"] + " " + doc["text"] for doc in documents]
vectors = model.encode(texts).tolist()  # Convert to list of float vectors

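# Create collection with vector settings
# (recreate_collection is deprecated in recent qdrant-client releases)
if client.collection_exists(COLLECTION_NAME):
    client.delete_collection(COLLECTION_NAME)

client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=len(vectors[0]), distance=Distance.COSINE)
)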

# Prepare points
points = [
    PointStruct(
        id=i,
        vector=vectors[i],
        payload={
            "id": doc["id"],
            "course": doc["course"],
            "question": doc["question"],
            "text": doc["text"]
        }
    )
    for i, doc in enumerate(documents)
]

# Upload points to Qdrant
client.upsert(
    collection_name=COLLECTION_NAME,
    points=points
)


In [None]:
def qdrant_search_function(q):
    # Embed the query
    query_vector = model.encode(q["question"]).tolist()

    # Filter by course
    course_filter = Filter(
        must=[
            FieldCondition(
                key="course",
                match=MatchValue(value=q["course"])
            )
        ]
    )

    # Search (query_points replaces the deprecated client.search)
    hits = client.query_points(
        collection_name=COLLECTION_NAME,
        query=query_vector,
        query_filter=course_filter,
        limit=5
    ).points

    # Convert back to required document format
    return [
        {
            "id": hit.payload["id"],
            "question": hit.payload["question"],
            "text": hit.payload["text"],
            "course": hit.payload["course"]
        }
        for hit in hits
    ]


In [None]:
results = evaluate(ground_truth, qdrant_search_function)

print(results)
