# Homework: Evaluation and Monitoring
### In this homework, we'll evaluate the quality of our RAG system.

### Getting the data
#### Let's start by getting the dataset. We will use the data we generated in the module.

#### In particular, we'll evaluate the quality of our RAG system when it uses gpt-4o-mini to generate the answers.

In [1]:
import pandas as pd

In [2]:
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'

url = f'{github_url}?raw=1'
df = pd.read_csv(url)

In [3]:
df = df.iloc[:300]

In [4]:
df.head(5)

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


In [10]:
df.count()

answer_llm     300
answer_orig    300
document       300
question       300
course         300
dtype: int64

## Q1. Getting the embeddings model

In [5]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)

In [6]:
# Create the embeddings for the first LLM answer:
answer_llm = df.iloc[0].answer_llm

In [7]:
llm_answer_vector = embedding_model.encode(answer_llm)

### What's the first value of the resulting vector?

In [8]:
llm_answer_vector[0]

-0.42244655

## Q2. Computing the dot product
Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.
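To make the score concrete, here is a minimal sketch of a dot product on two made-up 3-dimensional vectors. The real embeddings are much longer; the toy values and the `dot` helper are illustrative assumptions, not part of the homework. The second call shows that the score grows with vector length, so raw dot products are not bounded.

```python
def dot(u, v):
    """Sum of the element-wise products of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


# Toy stand-ins for the two answer embeddings (hypothetical values).
v_llm = [1.0, 2.0, 3.0]
v_orig = [2.0, 0.0, 1.0]

score = dot(v_llm, v_orig)  # 1*2 + 2*0 + 3*1

# Doubling one vector doubles the score, although its direction is unchanged.
scaled = dot([2 * x for x in v_llm], v_orig)

print(score, scaled)
```

In the notebook itself, the vectors are NumPy arrays, so the same operation is written as `v_llm.dot(v_orig)`.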

In [9]:
from tqdm.auto import tqdm

In [10]:
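def compute_similarity(record, normalized=False):
    """Return the dot product between the embeddings of the original and LLM answers.

    With normalized=True both vectors are scaled to unit length first,
    so the result is the cosine similarity.
    """
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']

    v_llm = embedding_model.encode(answer_llm)
    v_orig = embedding_model.encode(answer_orig)

    if normalized:
        # normalize_vector is defined in Q3; run that cell before passing normalized=True
        v_llm_norm = normalize_vector(v_llm)
        v_orig_norm = normalize_vector(v_orig)
        return v_llm_norm.dot(v_orig_norm)
    else:
        return v_llm.dot(v_orig)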

In [11]:
evaluations = []

results_df = df.to_dict(orient='records')

for record in tqdm(results_df):
    sim = compute_similarity(record)
    evaluations.append(sim)


In [12]:
evaluations[10]

32.34471

### What's the 75th percentile of the scores?

In [13]:
# Note: the column is named 'cosine', but these are raw dot products (not yet normalized)
df['cosine'] = evaluations
df['cosine'].describe()

count    300.000000
mean      27.495996
std        6.384742
min        4.547924
25%       24.307844
50%       28.336870
75%       31.674309
max       39.476013
Name: cosine, dtype: float64
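The `75%` row above is the 75th percentile. As a rough sketch of what that number means, the helper below computes a percentile with linear interpolation between neighbouring sorted values, which is the default behaviour of pandas quantiles. The `percentile` function and the sample values are hypothetical and only serve as illustration.

```python
def percentile(values, q):
    """Percentile for q in [0, 1], using linear interpolation between sorted values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")

    ordered = sorted(values)
    position = (len(ordered) - 1) * q  # fractional index into the sorted list
    lower = int(position)              # index of the value just below the position
    fraction = position - lower

    if lower + 1 < len(ordered):
        step = ordered[lower + 1] - ordered[lower]
        return ordered[lower] + fraction * step
    return ordered[lower]


toy_scores = [4.0, 1.0, 3.0, 2.0]  # made-up scores, not the homework data
p75 = percentile(toy_scores, 0.75)
print(p75)
```

With four values, the 75th percentile sits at fractional index 2.25 of the sorted list, a quarter of the way from the third value to the fourth.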

## Q3. Computing the cosine
From Q2, we can see that the results are not within the [-1, 1] range of a cosine similarity. That's because the vectors coming from this model are not normalized, so the dot product also depends on their lengths.

So we need to normalize them.

To do it, we:

- Compute the norm of a vector
- Divide each element by this norm
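The two steps above can be sketched in plain Python on small made-up vectors. The helper names `norm`, `normalize`, and `dot` and the sample values are assumptions for illustration; the notebook's own `normalize_vector` performs the same steps on NumPy arrays.

```python
import math


def norm(v):
    """Euclidean length of a vector: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Divide each element by the norm, giving a vector of length 1."""
    length = norm(v)
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


a = [3.0, 4.0]   # norm is 5
b = [6.0, 8.0]   # same direction as a, twice as long
c = [4.0, -3.0]  # perpendicular to a

raw_ab = dot(a, b)                        # depends on the vector lengths
cos_ab = dot(normalize(a), normalize(b))  # close to 1: same direction
cos_ac = dot(normalize(a), normalize(c))  # close to 0: perpendicular

print(raw_ab, cos_ab, cos_ac)
```

After normalization the dot product no longer depends on vector length, only on the angle between the vectors, which is what makes the scores comparable across answer pairs.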

In [14]:
import numpy as np

In [15]:
def normalize_vector(v):
    norm = np.sqrt((v * v).sum())
    v_norm = v / norm
    return v_norm

In [16]:
evaluations_norm = []

results_df = df.to_dict(orient='records')

for record in tqdm(results_df):
    sim = compute_similarity(record, normalized=True)
    evaluations_norm.append(sim)


### What's the 75th percentile of the cosine scores?

In [17]:
df['cosine_norm'] = evaluations_norm
df['cosine_norm'].describe()

count    300.000000
mean       0.728393
std        0.157755
min        0.125357
25%        0.651273
50%        0.763761
75%        0.836235
max        0.958796
Name: cosine_norm, dtype: float64

## Q4. ROUGE
Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

Because it looks at the actual words rather than at embeddings, it can give a more nuanced view of text similarity than cosine similarity alone.
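As a rough illustration of n-gram overlap, the sketch below computes precision, recall, and F1 for unigrams and bigrams on two made-up sentences. It is a simplified version that assumes lowercasing, whitespace tokenization, and clipped n-gram counts; the `rouge` package used below may tokenize and count differently, so its numbers are not guaranteed to match. The `ngrams` and `rouge_n` helpers are hypothetical names.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(hypothesis, reference, n=1):
    """Simplified ROUGE-N: overlap of n-grams between two strings."""
    hyp = Counter(ngrams(hypothesis.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))

    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((hyp & ref).values())
    hyp_total = sum(hyp.values())
    ref_total = sum(ref.values())

    precision = overlap / hyp_total if hyp_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {'r': recall, 'p': precision, 'f': f1}


# Made-up sentences, not taken from the homework data.
generated = "the cat sat on the mat"
original = "the cat lay on the rug"

unigram_scores = rouge_n(generated, original, n=1)
bigram_scores = rouge_n(generated, original, n=2)
print(unigram_scores)
print(bigram_scores)
```

Here four of the six unigrams and two of the five bigrams are shared, so the bigram scores are lower: longer n-grams reward matching word order, not just matching vocabulary.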

In [18]:
# pip install rouge


Let's compute the ROUGE score between the answers at index 10 of our dataframe (doc_id=5170565b).

In [24]:
df_row10 = df.iloc[10]

In [27]:
from rouge import Rouge
rouge_scorer = Rouge()

scores = rouge_scorer.get_scores(df_row10['answer_llm'], df_row10['answer_orig'])[0]

There are three scores: rouge-1, rouge-2 and rouge-l, with precision, recall and F1 score for each.

- rouge-1: the overlap of unigrams
- rouge-2: the overlap of bigrams
- rouge-l: the longest common subsequence

### What's the F score for rouge-1?

In [28]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}
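The rouge-l entry is based on the longest common subsequence (LCS): the longest run of tokens that appear in both answers in the same order, not necessarily next to each other. The sketch below is a simplified illustration on made-up sentences, assuming lowercasing and whitespace tokenization. The `lcs_length` and `rouge_l` helpers are hypothetical, and the `rouge` package may compute its score differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the table
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(hypothesis, reference):
    """Simplified ROUGE-L: LCS length relative to each text's length."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(hyp, ref)

    precision = lcs / len(hyp) if hyp else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {'r': recall, 'p': precision, 'f': f1}


# Made-up sentences, not taken from the homework data.
lcs_scores = rouge_l("the cat sat on the mat", "the cat lay on the rug")
print(lcs_scores)
```

For these two sentences the longest common subsequence is "the cat on the", four tokens out of six on each side.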