## Homework: Search Evaluation

In this homework, we will evaluate the results of vector
search.


## Required libraries

We will use minsearch and Qdrant. Make sure you have the most up-to-date versions:

```bash
pip install -U minsearch qdrant_client
``` 

minsearch should be at least 0.0.4.

## Evaluation data

For this homework, we will use the same dataset we generated
in the videos.

Let's load it:

In [1]:
import requests
import pandas as pd

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')

Here, `documents` contains the documents from the FAQ database
with unique IDs, and `ground_truth` contains generated
question-answer pairs. 

Also, we will need the code for evaluating retrieval:

In [2]:
from tqdm.auto import tqdm

def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
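
To make the two metrics concrete, here is a small sketch on made-up relevance lists (the values are illustrative and not taken from the dataset). The functions are compact restatements of `hit_rate` and `mrr` from the cell above, repeated only so that the snippet runs on its own.

```python
# One list per query; True marks the position of the expected document.
relevance_total = [
    [False, True, False],   # expected document found at rank 2
    [True, False, False],   # expected document found at rank 1
    [False, False, False],  # expected document not retrieved
]

def hit_rate(relevance_total):
    # Share of queries with at least one relevant result.
    return sum(1 for line in relevance_total if True in line) / len(relevance_total)

def mrr(relevance_total):
    # Average of 1 / rank over all queries (0 when nothing relevant is found).
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total += 1 / (rank + 1)
    return total / len(relevance_total)

toy_hit_rate = hit_rate(relevance_total)  # 2 of 3 queries have a hit
toy_mrr = mrr(relevance_total)            # (1/2 + 1 + 0) / 3
print(toy_hit_rate, toy_mrr)
```

Hit rate only asks whether the expected document shows up at all, while MRR also rewards it for appearing near the top of the list.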

## Q1. Minsearch text

Now let's evaluate our usual minsearch approach, indexing documents with:
```python
text_fields=["question", "section", "text"],
keyword_fields=["course", "id"]
```
but tweak the parameters for search. Let's use the following boosting params:

```python
boost = {'question': 1.5, 'section': 0.1}
```

What's the hit rate for this approach?

> 0.84

In [21]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "section", "text"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

boost = {"question": 1.5, "section": 0.1}

<minsearch.minsearch.Index at 0x1788c3850>

In [43]:
def search_function(q):
    """
    q is one element from ground_truth, e.g.
      {'question': 'What is a vector index?', 'document': 42, ...}
    
    """
    return index.search(
        query=q["question"],           # what we search for
        filter_dict={'course': q["course"]},      # filter by course
        boost_dict=boost         # the homework tweak
    )


In [44]:
metrics = evaluate(ground_truth, search_function)
print("Hit-rate:", metrics["hit_rate"])
print("MRR:     ", metrics["mrr"])


Hit-rate: 0.8995029176572293
MRR:      0.7356124850343578


In [None]:
def minsearch_search(query, course):

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

In [36]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

In [37]:
print("Hit-rate:", hit_rate(relevance_total))

Hit-rate: 0.848714069591528


## Embeddings 

The latest version of minsearch also supports vector search. 
We will use it:

```python
from minsearch import VectorSearch
```

We will also use TF-IDF and Singular Value Decomposition to 
create embeddings from texts. You can refer to our
["Create Your Own Search Engine" workshop](https://github.com/alexeygrigorev/build-your-own-search-engine)
if you want to know more about it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
```

Let's create embeddings for the "question" field:

```python
texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)
```

In [16]:
from minsearch import VectorSearch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)

## Q2. Vector search for question

Now let's index these embeddings with minsearch:

```python
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)
```

Evaluate this search method. What's the MRR for it?

> 0.35


In [17]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x177a1dfc0>

In [49]:
def search_function_vindex(q):
    """
    q is one row from ground_truth, e.g.
      {'question': 'What is a vector index?', 'course': 'embeddings', ...}

    Returns a list of up to k hits; each hit is the dict we stored in vindex.
    """
    query_vec = pipeline.transform([q["question"]])        # embed the query

    return vindex.search(
        query_vec,             # ← vector not text
        filter_dict={"course": q["course"]},  # stay in the same course
    )

metrics = evaluate(ground_truth, search_function_vindex)

print("Hit-rate:", metrics["hit_rate"])
print("MRR:     ", metrics["mrr"])

Hit-rate: 0.560622433542252
MRR:      0.36761837866765423


## Q3. Vector search for question and answer

We only used question in Q2. We can use both question and answer:

```python
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)
```

Using the same pipeline (`min_df=3` for the TF-IDF vectorizer and `n_components=128` for SVD), evaluate the performance of this
approach.

What's the hit rate?

- 0.62
- 0.72
- 0.82
- 0.92

In [57]:
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)

# Create a pipeline with TfidfVectorizer and TruncatedSVD
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)

# Fit the pipeline to the texts, with both question and answer text
X = pipeline.fit_transform(texts)

# index the vectors
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

# search function with question and answer text
def search_function_vindex_qa(q):
    """
    q is one row from ground_truth, e.g.
      {'question': 'What is a vector index?', 'course': 'embeddings', ...}

    Returns a list of up to k hits; each hit is the dict we stored in vindex.
    """

    query_vec = pipeline.transform([                # embed the user's question
        q["question"]           # query text is the question only; documents were indexed as question + text
    ])


    return vindex.search(
        query_vec,             # ← vector not text
        filter_dict={"course": q["course"]},  # stay in the same course
    )

# Run the evaluation
metrics = evaluate(ground_truth, search_function_vindex_qa)

print("Hit-rate:", metrics["hit_rate"])
print("MRR:     ", metrics["mrr"])

Hit-rate: 0.8841582018586557
MRR:      0.6805470650186457


***NOTE***: Adding the full answer text gives the vectoriser many more words to anchor on, so a query that phrases things differently from the stored question can still land near the right document in the SVD space. Here the hit rate rises from about 0.56 with question-only embeddings (Q2) to about 0.88.

## Q4. Qdrant

Now let's evaluate the following settings in Qdrant:

- `text = doc['question'] + ' ' + doc['text']`
- `model_handle = "jinaai/jina-embeddings-v2-small-en"`
- `limit = 5`

What's the MRR?

- 0.65
- 0.75
- 0.85
- 0.95

In [79]:
from qdrant_client import QdrantClient, models
qd_client = QdrantClient("http://localhost:6333")

EMBEDDING_DIMENSIONALITY = 512
model_handle = "jinaai/jina-embeddings-v2-small-en"
collection_name = "zoomcamp-hw-week3"

In [80]:
qd_client.delete_collection(collection_name=collection_name)

False

In [81]:
qd_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,
        distance=models.Distance.COSINE
    )
)

qd_client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword"
)

points = []

for i, doc in enumerate(documents):
    text = doc['question'] + ' ' + doc['text']
    vector = models.Document(text=text, model=model_handle)
    point = models.PointStruct(
        id=i,
        vector=vector,
        payload=doc
    )
    points.append(point)

qd_client.upsert(
    collection_name=collection_name,
    points=points
)


UpdateResult(operation_id=2, status=<UpdateStatus.COMPLETED: 'completed'>)

In [85]:
def vector_search(q):
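    # The course filter below is disabled, so the search runs across all courses.
    # To enable it, set course = q["course"] and uncomment query_filter.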
    
    query_points = qd_client.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=q["question"]  ,
            model=model_handle 
        ),
        #query_filter=models.Filter( 
        #    must=[
        #        models.FieldCondition(
        #            key="course",
        #            match=models.MatchValue(value=course)
        #        )
        #    ]
        #),
        limit=5,
        with_payload=True
    )
    
    results = []
    
    for point in query_points.points:
        results.append(point.payload)
    
    return results

In [86]:
# Run the evaluation
metrics = evaluate(ground_truth, vector_search)

print("Hit-rate:", metrics["hit_rate"])
print("MRR:     ", metrics["mrr"])

Hit-rate: 0.9120380376053598
MRR:      0.8245623514156052


## Q5. Cosine similarity

In the second part of the module, we looked at evaluating
the entire RAG approach. In particular, we looked at 
comparing the answer generated by our system with the actual
answer from the FAQ.

One of the ways of doing it is using the cosine similarity. 
Let's see how to calculate it.

Cosine similarity is the dot product of two normalized vectors.
Geometrically, it is the cosine of the angle between
the vectors. Look up "cosine similarity geometry" if you want to
learn more about it.

For us, it means that we need two things:

- First, we normalize each of the vectors
- Then, compute the dot product

So, we get this:

```python
def cosine(u, v):
    u = normalize(u)
    v = normalize(v)
    return u.dot(v)
```

For normalization, we first compute the vector norm (its length),
and then divide the vector by it:

```python
def normalize(u):
    norm = np.sqrt(u.dot(u))
    return u / norm
```

(here `np` refers to NumPy, imported with `import numpy as np`)

Or we can simplify it:

```python
def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)
```
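
As a quick sanity check, the sketch below confirms on a small example that the two formulations agree. It uses plain Python lists and the `math` module instead of NumPy so that it runs without dependencies; the helper names `dot`, `cosine_normalized` and `cosine_direct` are made up for this illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    norm = math.sqrt(dot(u, u))  # vector length
    return [a / norm for a in u]

def cosine_normalized(u, v):
    # Normalize first, then take the dot product.
    return dot(normalize(u), normalize(v))

def cosine_direct(u, v):
    # Dot product divided by the product of the norms.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [1.0, 0.0]
v = [1.0, 1.0]  # 45 degrees away from u

cos_a = cosine_normalized(u, v)
cos_b = cosine_direct(u, v)
angle_deg = math.degrees(math.acos(cos_b))
print(cos_a, cos_b, angle_deg)
```

Both functions return the cosine of the angle between `u` and `v`, and converting back with `acos` recovers the 45 degree angle up to floating-point error.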

Now let's use this function to compute the
A->Q->A cosine similarity.

We will use the results from [our gpt-4o-mini evaluations](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-evaluation/rag_evaluation/data/results-gpt4o-mini.csv):


```python
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)
```


To create embeddings, we will use the same simple approach
as in the [Embeddings](#embeddings) section:

```python
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
```

Let's fit the vectorizer on all the text data we have:

```python
pipeline.fit(df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question)
```

Now use the `transform` method of the pipeline to create the embeddings and calculate the cosine similarity for each
pair.

What's the average cosine?

- 0.64
- 0.74
- 0.84
- 0.94

This is how you do it:

- For each answer pair, compute
    - `v_llm` for the answer from the LLM 
    - `v_orig` for the original answer
    - then compute the cosine between them
- At the end, take the average
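
The steps above can be sketched as follows. This is a minimal illustration on hypothetical two-dimensional vectors, not the actual solution: with the real data, the rows would presumably come from `pipeline.transform(df_results.answer_llm)` and `pipeline.transform(df_results.answer_orig)`, and the resulting average would differ from the toy value computed here.

```python
import math

def cosine(u, v):
    dot_uv = sum(a * b for a, b in zip(u, v))
    u_norm = math.sqrt(sum(a * a for a in u))
    v_norm = math.sqrt(sum(b * b for b in v))
    return dot_uv / (u_norm * v_norm)

# Hypothetical embeddings standing in for the pipeline output:
# row i of each list belongs to the same answer pair.
v_llm_rows = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
v_orig_rows = [[2.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

# Cosine for each (LLM answer, original answer) pair, then the mean.
cosines = [cosine(v_llm, v_orig) for v_llm, v_orig in zip(v_llm_rows, v_orig_rows)]
average_cosine = sum(cosines) / len(cosines)
print(cosines, average_cosine)
```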


In [87]:
import numpy as np

def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)

In [88]:
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)

In [89]:
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)

In [90]:
pipeline.fit(df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question)