Evaluation data

For this homework, we will use the same dataset we generated in the videos.

In [1]:
# Downloading documents and ground truth data for search evaluation
# This script fetches the documents and ground truth data from a specified URL
# and prepares them for further analysis.
import requests
import pandas as pd

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')

Here, documents contains the documents from the FAQ database with unique IDs, and ground_truth contains generated question-answer pairs.

Also, we will need the code for evaluating retrieval:

In [2]:
# The code below is used to evaluate the retrieval performance of a search function.
# It calculates the hit rate and mean reciprocal rank (MRR) based on the relevance of
# the search results to the ground truth data.

# Use the tqdm library to show a progress bar while evaluating each query.
from tqdm.auto import tqdm

# Calculate the hit rate: proportion of queries where the correct document is found in the results
def hit_rate(relevance_total):
    cnt = 0
    for line in relevance_total:
        if True in line:
            cnt = cnt + 1
    return cnt / len(relevance_total)

# Calculate the mean reciprocal rank (MRR): average of reciprocal ranks of the first relevant result
def mrr(relevance_total):
    total_score = 0.0
    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)
    return total_score / len(relevance_total)

# Evaluate a search function using ground truth data
# Returns a dictionary with hit_rate and mrr
def evaluate(ground_truth, search_function):
    relevance_total = []
    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        # Check if each result matches the ground truth document id
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)
    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
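
As a quick sanity check of the two metrics, here is a minimal sketch on a hand-made relevance matrix. The toy data is made up for illustration, and `toy_hit_rate` and `toy_mrr` are compact restatements of the functions above so that the snippet runs on its own.

```python
# Toy relevance lists (hypothetical): one list per query, one flag per returned result
toy_relevance = [
    [False, True, False],   # correct document at rank 2 -> reciprocal rank 1/2
    [True, False, False],   # correct document at rank 1 -> reciprocal rank 1
    [False, False, False],  # correct document not retrieved -> contributes 0
]

def toy_hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result
    return sum(any(line) for line in relevance_total) / len(relevance_total)

def toy_mrr(relevance_total):
    # Sum 1 / rank over relevant results, averaged over all queries
    # (each toy query has at most one relevant result)
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total += 1 / (rank + 1)
    return total / len(relevance_total)

toy_hr = toy_hit_rate(toy_relevance)   # 2 of 3 queries have a hit
toy_rr = toy_mrr(toy_relevance)        # (1/2 + 1 + 0) / 3
print(toy_hr, toy_rr)
```

Two of the three queries retrieve the correct document, so the hit rate is 2/3, while the MRR also rewards the second query for placing its hit first.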


### Q1. Minsearch text

Now let's evaluate our usual minsearch approach, but tweak the parameters. Let's use the following boosting params: 
`boost = {'question': 1.5, 'section': 0.1}`

What's the hitrate for this approach?


In [None]:
import minsearch
# Create a minsearch index with specified text and keyword fields
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)
# Fit the index to the documents (build the search index)
index.fit(documents)


<minsearch.minsearch.Index at 0x7e801a7b36e0>

In [4]:
# This function performs a search query on the minsearch index for course-related questions.
# It takes a query string and a course identifier as parameters.
#
# The search query uses the minsearch index to find documents that match the query string,
# with a higher weight given to the "question" field.
# 
# It filters the results to only include documents that match the specified course.
# The search results are limited to the top 5 matches based on relevance.
def minsearch_search(query, course):
    boost = {'question': 1.5, 'section': 0.1}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results
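
To build intuition for what `boost_dict` does, here is a rough sketch of the idea: each text field produces its own similarity score, and the boost acts as a weight when those scores are combined. The per-field scores below are made-up numbers and `combined_score` is a hypothetical helper. This illustrates the weighting idea only and is not minsearch's actual code; fields missing from the boost dictionary are assumed here to have weight 1.

```python
def combined_score(field_scores, boost):
    # Weighted sum of per-field similarity scores; unlisted fields get weight 1.0 (assumption)
    return sum(boost.get(field, 1.0) * score for field, score in field_scores.items())

boost = {'question': 1.5, 'section': 0.1}

# Hypothetical similarity scores of one document for one query
field_scores = {'question': 0.6, 'text': 0.3, 'section': 0.5}

score = combined_score(field_scores, boost)
print(round(score, 2))
```

With these weights, a match in the `question` field counts fifteen times as much as an equally strong match in `section`.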

In [5]:
evaluate(ground_truth, lambda q: minsearch_search(q['question'], q['course']))


{'hit_rate': 0.848714069591528, 'mrr': 0.7288235717887772}

Answer: hit_rate is 0.8487 (0.84 when truncated to two decimals)

Embeddings

The latest version of minsearch also supports vector search. We will use it:

In [None]:
pip install -U minsearch qdrant_client

In [7]:
from minsearch import VectorSearch

We will also use TF-IDF and Singular Value Decomposition to create embeddings from texts. 
You can refer to our "Create Your Own Search Engine" workshop if you want to know more about it.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

Let's create embeddings for the "question" field:

In [None]:
# This code creates a pipeline that combines TF-IDF vectorization and Truncated SVD for dimensionality reduction.
# It processes the 'question' field from the documents to generate a matrix of features.

# Prepare a list of questions from the documents for embedding
texts = []
for doc in documents:
    t = doc['question']
    texts.append(t)

# Create a pipeline that vectorizes text using TF-IDF and reduces dimensionality with SVD
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),                  # Ignore terms that appear in fewer than 3 documents
    TruncatedSVD(n_components=128, random_state=1)  # Reduce to 128 dimensions
)

# Fit the pipeline to the questions and transform them into embeddings
X = pipeline.fit_transform(texts)


### Q2. Vector search for question

Now let's index these embeddings with minsearch:



In [None]:
# Create a vector search index using minsearch, specifying 'course' as a keyword field for filtering
vindex = VectorSearch(keyword_fields={'course'})

# Fit the vector index with the embeddings (X) and the original documents
vindex.fit(X, documents)

In [None]:
# This function performs a search query on the vector index for course-related questions.
def vector_search(query, course):
    x = pipeline.transform([query])         # Convert query to vector
    results = vindex.search(                # Search using vector index
        x[0],                               # x[0] is the vector itself
        filter_dict={'course': course},     # Same filtering as before
        num_results=5
    )
    return results
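
For intuition, filtered vector search can be approximated in a few lines of NumPy. This is a rough sketch of the general idea (score by dot product, drop documents from other courses, keep the top results), not the actual `VectorSearch` implementation. `toy_vector_search` and the toy data are hypothetical.

```python
import numpy as np

def toy_vector_search(query_vec, doc_vecs, docs, course, num_results=5):
    # Score every document by its dot product with the query vector
    scores = doc_vecs.dot(query_vec)
    # Documents from other courses get a score of -inf so they sort last
    mask = np.array([d['course'] == course for d in docs])
    scores = np.where(mask, scores, -np.inf)
    # Indices of the best-scoring documents, highest score first
    top = np.argsort(-scores)[:num_results]
    # Drop any filtered-out documents that slipped into the top slice
    return [docs[i] for i in top if mask[i]]

toy_docs = [
    {'id': 'a', 'course': 'c1'},
    {'id': 'b', 'course': 'c2'},
    {'id': 'c', 'course': 'c1'},
]
toy_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

toy_results = toy_vector_search(
    np.array([1.0, 0.0]), toy_vecs, toy_docs, course='c1', num_results=2
)
print([d['id'] for d in toy_results])
```

Document `b` scores higher than `c`, but it belongs to another course, so the filter removes it from the results.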


In [None]:
evaluate(ground_truth, lambda q: vector_search(q['question'], q['course']))

Answer: MRR for this search method is 0.35

### Q3. Vector search for question and answer

We only used question in Q2. We can use both question and answer:



In [None]:
# Create a list to hold the combined question and text for each document
texts = []

for doc in documents:
    # Concatenate the 'question' and 'text' fields for each document
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)

# Create a pipeline that vectorizes text using TF-IDF and reduces dimensionality with SVD
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),                  # Ignore terms that appear in fewer than 3 documents
    TruncatedSVD(n_components=128, random_state=1)  # Reduce to 128 dimensions
)

# Fit the pipeline to the combined texts and transform them into embeddings
X = pipeline.fit_transform(texts)

In [None]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

In [None]:
def vector_search_qa(query, course):
    x = pipeline.transform([query])
    results = vindex.search(
        x[0],
        filter_dict={'course': course},
        num_results=5
    )
    return results


Using the same pipeline (`min_df=3` for the TF-IDF vectorizer and `n_components=128` for SVD), evaluate the performance of this approach.



In [None]:
evaluate(ground_truth, lambda q: vector_search_qa(q['question'], q['course']))

Answer: Hit Rate for this approach is 0.82

### Q4. Qdrant
Now let's evaluate the following settings in Qdrant:
- `text = doc['question'] + ' ' + doc['text']`
- `model_handle = "jinaai/jina-embeddings-v2-small-en"`
- `limit = 5`

In [4]:
pip install "qdrant-client[fastembed]>=1.14.2"

In [5]:
from qdrant_client import QdrantClient, models
from fastembed import TextEmbedding
from typing import List
from uuid import uuid4

In [6]:
client = QdrantClient("http://localhost:6333")

In [7]:
model_name="jinaai/jina-embeddings-v2-small-en"

In [8]:
model = TextEmbedding(model_name=model_name)

In [None]:
# Define the collection name
collection_name = "llm-zoomcamp-homework3"

# Create the collection with the specified vector parameters if it does not exist yet
if not client.collection_exists(collection_name):
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=client.get_embedding_size(model_name),  # Dimensionality of the vectors
            distance=models.Distance.COSINE  # Distance metric for similarity search
        )
    )

In [13]:
# Prepare texts and metadata
texts = []
payloads = []
for doc in documents:
    text = doc['question'] + ' ' + doc['text']
    texts.append(text)
    payloads.append({
        "text": doc['text'],
        "section": doc['section'],
        "course": doc['course'],
        "id": doc['id']
    })

In [None]:
# Set the batch size for uploading vectors to Qdrant
batch_size = 16

# Iterate over the texts and payloads in batches
for i in tqdm(range(0, len(texts), batch_size)):
    batch_texts = texts[i:i + batch_size]           # Get a batch of texts
    batch_payloads = payloads[i:i + batch_size]     # Get the corresponding payloads
    batch_vectors = list(model.embed(batch_texts))  # Generate embeddings for the batch

    points = []
    # Create Qdrant PointStructs for each vector in the batch
    for j, vector in enumerate(batch_vectors):
        points.append(models.PointStruct(
            id=str(uuid4()),                       # Generate a unique ID for each point
            vector=vector,                         # The embedding vector
            payload=batch_payloads[j]              # Associated metadata
        ))

    # Upload the batch of points to the Qdrant collection
    client.upsert(collection_name=collection_name, points=points)
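
The slicing pattern used in the upload loop can be checked in isolation. The sketch below uses a hypothetical helper, `iter_batches`, and a toy list in place of the real texts; it only demonstrates that stepping through `range(0, len(items), batch_size)` covers every item once, including a shorter final batch.

```python
def iter_batches(items, batch_size):
    # Yield consecutive slices of at most batch_size items
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

toy_items = list(range(10))
batches = list(iter_batches(toy_items, 4))
print(batches)
```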


In [None]:
def qdrant_search_fastembed(question, course, limit=5):
    # Generate the embedding vector for the input question using the fastembed model
    query_vector = list(model.embed([question]))[0]

    # Perform a vector search in the Qdrant collection, filtering by course
    hits = client.search(
        collection_name=collection_name,
        query_vector=query_vector,
        limit=limit,
        query_filter={
            "must": [{"key": "course", "match": {"value": course}}]
        }
    )
    # Return the payload (metadata) of each search result
    return [hit.payload for hit in hits]

In [19]:
evaluate(ground_truth, lambda q: qdrant_search_fastembed(q['question'], q['course']))

Answer: MRR is 0.85

### Q5. Cosine similarity
In the second part of the module, we looked at evaluating the entire RAG approach. In particular, we looked at comparing the answer generated by our system with the actual answer from the FAQ.

One of the ways of doing it is using the cosine similarity. Let's see how to calculate it.

Cosine similarity is the dot product of two normalized vectors. Geometrically, it's the cosine of the angle between the vectors. Look up "cosine similarity geometry" if you want to learn more about it.

For us, it means that we need two things:

- First, we normalize each of the vectors
- Then, compute the dot product

So, we get this:

```
def cosine(u, v):
    u = normalize(u)
    v = normalize(v)
    return u.dot(v)
```

For normalization, we first compute the vector norm (its length), and then divide the vector by it:

```
def normalize(u):
    norm = np.sqrt(u.dot(u))
    return u / norm
```

(where `np` comes from `import numpy as np`)

Or we can simplify it:
```
def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)
```
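
To connect the formula with the geometric picture, here is a small check on hand-picked vectors (the vectors are made up for illustration): parallel vectors should give 1, orthogonal vectors 0, and vectors 45 degrees apart about 0.707.

```python
import numpy as np

def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)

same_direction = cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0]))  # parallel vectors
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 3.0]))      # 90 degrees apart
diagonal = cosine(np.array([1.0, 0.0]), np.array([1.0, 1.0]))        # 45 degrees apart
print(same_direction, orthogonal, diagonal)
```

Note that the result depends only on direction: scaling either vector leaves the cosine unchanged.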

Now let's use this function to compute the A->Q->A cosine similarity.

We will use the results from our gpt-4o-mini evaluations:

```
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)
```
When creating embeddings, we will use a simple approach, the same one we used in the Embeddings section:
```
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
```
Let's fit the vectorizer on all the text data we have:
`pipeline.fit(df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question)`

Now use the `transform` method of the pipeline to create the embeddings and calculate the cosine similarity between each pair.

- For each answer pair, compute
    - `v_llm` for the answer from the LLM
    - `v_orig` for the original answer
    - then compute the cosine between them
- At the end, take the average

What's the average cosine?

In [3]:
def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)

In [4]:
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

In [6]:
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)

In [7]:
pipeline.fit(df_results.answer_llm + ' ' + df_results.answer_orig + ' ' + df_results.question)

In [8]:
# Compute embeddings
X_llm = pipeline.transform(df_results['answer_llm'])
X_orig = pipeline.transform(df_results['answer_orig'])

In [10]:
import pandas as pd
import numpy as np

In [11]:
cosines = [cosine(u, v) for u, v in zip(X_llm, X_orig)]
print("Average cosine similarity:", round(np.mean(cosines), 2))

Average cosine similarity: 0.84


Answer: average cosine similarity: 0.84
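
As an alternative to looping over pairs, the same per-row cosine can be computed in one vectorized step. This is a sketch on small made-up matrices standing in for `X_llm` and `X_orig`; `rowwise_cosine` is a hypothetical helper name.

```python
import numpy as np

def rowwise_cosine(A, B):
    # Cosine similarity between matching rows of A and B
    dots = np.sum(A * B, axis=1)
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return dots / norms

# Small made-up matrices, one embedding per row
A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
B = np.array([[2.0, 0.0], [1.0, 0.0], [3.0, 0.0]])

pair_cosines = rowwise_cosine(A, B)
print(pair_cosines, pair_cosines.mean())
```

Both versions divide by the product of the norms, so an all-zero embedding row would produce `nan` and would need to be handled separately.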

### Q6. Rouge
An alternative way to see how similar two texts are is ROUGE.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves. There's a Python package for it:

In [12]:
pip install rouge


(The latest version at the moment of writing is 1.0.1)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (doc_id=5170565b):

In [None]:
from rouge import Rouge

# Initialize the Rouge scorer
rouge_scorer = Rouge()

# Select the row at index 10 from the results dataframe
sample_result = df_results.iloc[10]

# Compute ROUGE scores between the LLM answer and the original answer for the sample
scores = rouge_scorer.get_scores(sample_result.answer_llm, sample_result.answer_orig)[0]

# Display the ROUGE scores
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

- `rouge-1` - the overlap of unigrams,
- `rouge-2` - bigrams,
- `rouge-l` - the longest common subsequence
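
To make the unigram case concrete, here is a simplified sketch of a ROUGE-1 style computation. It assumes lowercase whitespace tokenization and clipped counts; the `rouge` package may tokenize and count differently, so its numbers need not match this sketch exactly. `rouge_1_sketch` is a hypothetical helper and the sentences are made up.

```python
from collections import Counter

def rouge_1_sketch(hypothesis, reference):
    # Whitespace tokenization is a simplifying assumption
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Unigrams shared by both texts, counted with clipping
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return {'r': 0.0, 'p': 0.0, 'f': 0.0}
    p = overlap / sum(hyp_counts.values())  # share of hypothesis unigrams found in the reference
    r = overlap / sum(ref_counts.values())  # share of reference unigrams found in the hypothesis
    f = 2 * p * r / (p + r)                 # harmonic mean of precision and recall
    return {'r': r, 'p': p, 'f': f}

toy_scores = rouge_1_sketch('the cat lay on the mat', 'the cat sat on the mat')
print(toy_scores)
```

Five of the six words overlap in each direction, so precision, recall and F1 all equal 5/6 in this toy case.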

For the document at index 10, the Rouge-1 F1 score is 0.45.

Let's compute it for the pairs in the entire dataframe. What's the average Rouge-1 F1?


In [None]:
# Initialize an empty list to store Rouge-1 F1 scores for each answer pair
rouge_1_f1 = []

# Iterate over each row in the results dataframe
for _, row in df_results.iterrows():
    # Compute ROUGE scores between the LLM answer and the original answer
    scores = rouge_scorer.get_scores(row.answer_llm, row.answer_orig)[0]
    # Append the Rouge-1 F1 score to the list
    rouge_1_f1.append(scores['rouge-1']['f'])

In [18]:
avg_rouge_1_f1 = sum(rouge_1_f1) / len(rouge_1_f1)
avg_rouge_1_f1

0.3516946452113944

Answer: the average Rouge-1 F1 is 0.35