I will evaluate the quality of the RAG system.

## Getting the data
Let's start with the dataset. I will use the data generated in the module, specifically the results of our RAG system with gpt-4o-mini.

Load it with pandas:

In [20]:
import pandas as pd
import numpy as np

In [3]:
# The ?raw=1 suffix makes GitHub serve the raw CSV instead of the HTML page
url = 'https://github.com/arsonor/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv?raw=1'
df = pd.read_csv(url)

In [4]:
df.head()

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


I will use only the first 300 documents:

In [6]:
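# .copy() makes an independent DataFrame, so adding columns later
# does not raise pandas' SettingWithCopyWarning
df = df.iloc[:300].copy()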

## Q1. Getting the embeddings model
Now let's load the embedding model **multi-qa-mpnet-base-dot-v1** from the [Sentence Transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview).

In [8]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)

Create the embeddings for the first LLM answer:

In [14]:
answer_llm = df.iloc[0].answer_llm

embeddings = embedding_model.encode(answer_llm)
embeddings[0]

-0.4224466

What's the first value of the resulting vector?

+ **-0.42**
+ -0.22
+ -0.02
+ 0.21

## Q2. Computing the dot product
Now, for each answer pair, let's create the embeddings and compute the dot product between them.

I will put the resulting scores into the evaluations list.

In [15]:
def compute_dot_product(record):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = embedding_model.encode(answer_llm)
    v_orig = embedding_model.encode(answer_orig)
    
    return v_llm.dot(v_orig)

In [17]:
from tqdm.auto import tqdm
evaluations = []

for index, record in tqdm(df.iterrows(), total=df.shape[0]):
    sim = compute_dot_product(record)
    evaluations.append(sim)

100%|██████████| 300/300 [00:59<00:00,  5.05it/s]
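
The loop above encodes one answer at a time. `encode` also accepts a list of strings and returns one embedding per row, so the pairwise scores could be computed in a single vectorized step. Below is a minimal sketch of that step only. The helper name `rowwise_dot` is hypothetical, and the two small arrays are hand-written stand-ins for the embedding matrices, not real model output.

```python
import numpy as np

def rowwise_dot(a, b):
    """Dot product of each row of `a` with the matching row of `b`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both matrices must have the same shape")
    # Multiply element-wise and sum over the embedding dimension
    return np.einsum("ij,ij->i", a, b)

# Toy stand-ins for the two embedding matrices (2 answer pairs, 3 dimensions)
v_llm = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])
v_orig = np.array([[4.0, 5.0, 6.0], [1.0, 0.0, 0.0]])

scores = rowwise_dot(v_llm, v_orig)
print(scores)  # first pair: 32.0, second pair: 0.0
```

With the real data, the two matrices would presumably come from `embedding_model.encode(df['answer_llm'].tolist())` and `embedding_model.encode(df['answer_orig'].tolist())`.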


In [18]:
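# Note: despite the column name, these values are raw dot products,
# not cosine similarities (the vectors are not normalized yet)
df['cosine'] = evaluations
df['cosine'].describe()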


count    300.000000
mean      27.495996
std        6.384744
min        4.547925
25%       24.307845
50%       28.336863
75%       31.674306
max       39.476021
Name: cosine, dtype: float64

What's the 75th percentile of the score?

+ 21.67
+ **31.67**
+ 41.67
+ 51.67

## Q3. Computing the cosine
From Q2, we can see that the results are not within the [-1, 1] range that a cosine similarity must fall in. That's because the vectors coming from this model are not normalized.

So I need to normalize them.

To normalize a vector:

+ Compute the norm of the vector
+ Divide each element by this norm

So, for a vector v, the normalized vector is v / ||v||.

In numpy, this is how you do it:

+ norm = np.sqrt((v * v).sum())
+ v_norm = v / norm
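
As a quick sanity check of the two steps above, here is a small worked example on hand-picked vectors (the values are illustrative only, not taken from the model). It also shows that `np.linalg.norm` gives the same norm.

```python
import numpy as np

v = np.array([3.0, 4.0])

norm = np.sqrt((v * v).sum())    # sqrt(9 + 16) = 5.0
v_norm = v / norm                # [0.6, 0.8]

# A normalized vector has length 1
unit_length = np.sqrt((v_norm * v_norm).sum())

# np.linalg.norm computes the same Euclidean norm
same_norm = np.linalg.norm(v)

# Cosine similarity = dot product of the two normalized vectors
w = np.array([4.0, 3.0])
cosine = (v / np.linalg.norm(v)).dot(w / np.linalg.norm(w))  # 24 / 25 = 0.96

print(norm, unit_length, cosine)
```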

Let's put it into a function and then compute the dot product between the normalized vectors. This gives us the cosine similarity.

In [21]:
def normalize_vector(v):
    norm = np.sqrt((v * v).sum())
    if norm == 0:
        return v
    return v / norm

def compute_cosine_similarity(record):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = embedding_model.encode(answer_llm)
    v_orig = embedding_model.encode(answer_orig)
    
    v_llm_norm = normalize_vector(v_llm)
    v_orig_norm = normalize_vector(v_orig)
    
    return np.dot(v_llm_norm, v_orig_norm)

evaluations = []

for index, record in tqdm(df.iterrows(), total=df.shape[0]):
    sim = compute_cosine_similarity(record)
    evaluations.append(sim)

# Now evaluations contains the cosine similarities

100%|██████████| 300/300 [00:58<00:00,  5.13it/s]


What's the 75th percentile of the cosine similarity scores?

+ 0.63
+ 0.73
+ **0.83**
+ 0.93

In [23]:
percentile_75 = np.percentile(evaluations, 75)
print(f"The 75th percentile of the cosine similarity scores is: {percentile_75}")

The 75th percentile of the cosine similarity scores is: 0.8362348526716232
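
For intuition about what `np.percentile` returns, here is a tiny example on made-up scores (not the real evaluations). By default numpy interpolates linearly between the two nearest sorted values, which is also the default in pandas' `describe()`.

```python
import numpy as np

# Made-up scores, already sorted for readability
toy_scores = [0.2, 0.5, 0.7, 0.9]

p75 = np.percentile(toy_scores, 75)

# Position in the sorted list: 0.75 * (4 - 1) = 2.25
# -> a quarter of the way from 0.7 to 0.9: 0.7 + 0.25 * (0.9 - 0.7) = 0.75
print(p75)
```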


## Q4. Rouge
Now we will explore an alternative metric - the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.
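
To make the n-gram overlap idea concrete, here is a minimal sketch of an n-gram based score. It assumes lowercase whitespace tokenization and clipped n-gram counts. The function names `ngrams` and `rouge_n` are hypothetical, and real ROUGE implementations may tokenize and count differently, so the numbers need not match theirs.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(hypothesis, reference, n=1):
    """Precision, recall and F1 of n-gram overlap (simplified sketch)."""
    hyp = Counter(ngrams(hypothesis.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))

    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((hyp & ref).values())
    hyp_total = sum(hyp.values())
    ref_total = sum(ref.values())

    p = overlap / hyp_total if hyp_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

unigram = rouge_n("the cat sat on the mat", "the cat is on the mat", n=1)
bigram = rouge_n("the cat sat on the mat", "the cat is on the mat", n=2)
print(unigram)  # 5 of 6 unigrams overlap
print(bigram)   # 3 of 5 bigrams overlap
```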

We don't need to implement it ourselves; there's a Python package for it:

In [24]:
pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1
Note: you may need to restart the kernel to use updated packages.





Let's compute the ROUGE score between the two answers at index 10 of our dataframe (doc_id=5170565b):

In [38]:
df.iloc[10]

answer_llm     Yes, all sessions are recorded, so if you miss...
answer_orig    Everything is recorded, so you won’t miss anyt...
document                                                5170565b
question                    Are sessions recorded if I miss one?
course                                 machine-learning-zoomcamp
cosine                                                 32.344704
Name: 10, dtype: object

In [45]:
from rouge import Rouge
rouge_scorer = Rouge()

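# Score only the record at index 10 instead of all 300 pairs;
# get_scores returns a list with one dict per (hypothesis, reference) pair
record = df.iloc[10]
scores = rouge_scorer.get_scores(record['answer_llm'], record['answer_orig'])[0]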
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

There are three scores: rouge-1, rouge-2 and rouge-l, and precision, recall and F1 score for each.

+ rouge-1 - the overlap of unigrams,
+ rouge-2 - bigrams,
+ rouge-l - the longest common subsequence
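
The longest common subsequence behind rouge-l can be illustrated with a short dynamic-programming sketch. It assumes lowercase whitespace tokenization and a plain harmonic-mean F-score. The names `lcs_length` and `rouge_l` are hypothetical, and the package's own implementation may differ in details.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the subsequence that ended just before both items
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the best result seen so far
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(hypothesis, reference):
    """Precision, recall and F1 based on the LCS (simplified sketch)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(hyp, ref)
    p = lcs / len(hyp) if hyp else 0.0
    r = lcs / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

# LCS here is "the cat on the mat": 5 of 6 tokens on each side
result = rouge_l("the cat sat on the mat", "the cat is on the mat")
print(result)
```

Unlike n-gram overlap, the subsequence does not have to be contiguous, which is why "the cat ... on the mat" still counts despite the differing word in the middle.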

What's the F score for rouge-1?

+ 0.35
+ **0.45**
+ 0.55
+ 0.65

## Q5. Average rouge score
Let's compute the average F-score between rouge-1, rouge-2 and rouge-l for the same record from Q4

+ **0.35**
+ 0.45
+ 0.55
+ 0.65

In [46]:
# Extract F-scores
rouge_1_f = scores['rouge-1']['f']
rouge_2_f = scores['rouge-2']['f']
rouge_l_f = scores['rouge-l']['f']

# Compute the average F-score
average_f_score = (rouge_1_f + rouge_2_f + rouge_l_f) / 3

print(f"The F-score for ROUGE-1 is: {rouge_1_f}")
print(f"The average F-score between ROUGE-1, ROUGE-2, and ROUGE-L is: {average_f_score}")

The F-score for ROUGE-1 is: 0.45454544954545456
The average F-score between ROUGE-1, ROUGE-2, and ROUGE-L is: 0.35490034990035496


## Q6. Average rouge score for all the data points
Now let's compute the score for all the records and create a dataframe from them.

What's the average ROUGE-2 F-score across all the records?

+ 0.10
+ **0.20**
+ 0.30
+ 0.40

In [48]:
all_scores = rouge_scorer.get_scores(df['answer_llm'].tolist(), df['answer_orig'].tolist())

rouge_2_f_scores = [score['rouge-2']['f'] for score in all_scores]

average_rouge_2_f_score = sum(rouge_2_f_scores) / len(rouge_2_f_scores)

print(f"The average ROUGE-2 F-score across all records is: {average_rouge_2_f_score}")

The average ROUGE-2 F-score across all records is: 0.2069650198342332
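
The question also mentions creating a dataframe from the scores. One possible way to do that is to flatten each nested score dict into a single row. The helper name `scores_to_frame`, the column naming scheme, and the toy values below are all assumptions for illustration, not actual results.

```python
import pandas as pd

def scores_to_frame(all_scores):
    """Flatten a list of nested ROUGE score dicts into a DataFrame."""
    rows = []
    for score in all_scores:
        row = {}
        for metric, values in score.items():
            name = metric.replace('-', '_')      # 'rouge-2' -> 'rouge_2'
            for key, value in values.items():    # 'r', 'p', 'f'
                row[f'{name}_{key}'] = value
        rows.append(row)
    return pd.DataFrame(rows)

# Made-up scores with the same structure as the get_scores output
toy_scores = [
    {'rouge-1': {'r': 0.5, 'p': 0.5, 'f': 0.5},
     'rouge-2': {'r': 0.2, 'p': 0.2, 'f': 0.2},
     'rouge-l': {'r': 0.4, 'p': 0.4, 'f': 0.4}},
    {'rouge-1': {'r': 0.7, 'p': 0.7, 'f': 0.7},
     'rouge-2': {'r': 0.4, 'p': 0.4, 'f': 0.4},
     'rouge-l': {'r': 0.6, 'p': 0.6, 'f': 0.6}},
]

rouge_df = scores_to_frame(toy_scores)
mean_rouge_2_f = rouge_df['rouge_2_f'].mean()  # (0.2 + 0.4) / 2
print(rouge_df.mean())
```

Applied to `all_scores` from the cell above, `rouge_df['rouge_2_f'].mean()` should match the average printed there.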
