## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

> It's possible that your answers won't match exactly. If it's the case, select the closest one.

Solution:

* Video: TBA
* Notebook: TBA

## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv)


Read it:

```python
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
```

We will use only the first 300 documents:


```python
df = df.iloc[:300]
```


In [1]:
import pandas as pd

github_url = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv"
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
df = df.iloc[:300]

df.head()

Unnamed: 0,answer_llm,answer_orig,document,question,course
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp


In [2]:
len(df)

300

## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview)

> Note: this is not the same model as in HW3

```bash
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* -0.42
* -0.22
* -0.02
* 0.21

In [3]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)

answer_llm = df.iloc[0].answer_llm

answer_llm

  from .autonotebook import tqdm as notebook_tqdm
You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).'

In [4]:
query_vector = embedding_model.encode(answer_llm)
v = query_vector
print(round(v[0],2))

-0.42


## Q2. Computing the dot product


Now for each answer pair, let's create embeddings and compute dot product between them

We will put the results (scores) into the `evaluations` list

What's the 75% percentile of the score?

* 21.67
* 31.67
* 41.67
* 51.67

In [5]:
df.head()

Unnamed: 0,answer_llm,answer_orig,document,question,course
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp


In [6]:
df["answer_llm_emb"] = df["answer_llm"].apply(embedding_model.encode)
df["answer_orig_emb"] = df["answer_orig"].apply(embedding_model.encode)

In [7]:
df.head()

Unnamed: 0,answer_llm,answer_orig,document,question,course,answer_llm_emb,answer_orig_emb
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp,"[-0.42244655, -0.22485626, -0.3240584, -0.2847...","[-0.030214058, -0.3444381, -0.28076234, 0.0615..."
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp,"[-0.38068146, 0.047848288, -0.31510952, -0.210...","[-0.030214058, -0.3444381, -0.28076234, 0.0615..."
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp,"[-0.05881373, -0.33736944, -0.36157572, 0.0217...","[-0.030214058, -0.3444381, -0.28076234, 0.0615..."
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp,"[-0.22753648, -0.008134096, -0.21719913, -0.11...","[-0.030214058, -0.3444381, -0.28076234, 0.0615..."
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp,"[-0.06969386, -0.5005093, -0.1659844, 0.306661...","[-0.030214058, -0.3444381, -0.28076234, 0.0615..."


In [8]:
# np.dot(df["answer_llm_emb"][1], df["answer_llm_emb"][1])

In [17]:

answer = []
for each in range(0, len(df)):
    answer.append(df["answer_orig_emb"][each].dot(df["answer_llm_emb"][each]))


df_dot_prod = pd.DataFrame(answer, columns=['dot_prod'])
df["dot_product"] = df_dot_prod

df_dot_prod.describe()

Unnamed: 0,dot_prod
count,300.0
mean,27.495996
std,6.384742
min,4.547923
25%,24.307844
50%,28.33687
75%,31.674309
max,39.476013


A/ 31.67

## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we 

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute dot product 
between normalized vectors. This will give us cosine similarity

What's the 75% cosine in the scores?

* 0.63
* 0.73
* 0.83
* 0.93

In [18]:
df.head()

Unnamed: 0,answer_llm,answer_orig,document,question,course,answer_llm_emb,answer_orig_emb,dot_product
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp,"[-0.42244655, -0.22485626, -0.3240584, -0.2847...","[-0.030214058, -0.3444381, -0.28076234, 0.0615...",17.515987
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp,"[-0.38068146, 0.047848288, -0.31510952, -0.210...","[-0.030214058, -0.3444381, -0.28076234, 0.0615...",13.418402
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp,"[-0.05881373, -0.33736944, -0.36157572, 0.0217...","[-0.030214058, -0.3444381, -0.28076234, 0.0615...",25.313255
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp,"[-0.22753648, -0.008134096, -0.21719913, -0.11...","[-0.030214058, -0.3444381, -0.28076234, 0.0615...",12.147415
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp,"[-0.06969386, -0.5005093, -0.1659844, 0.306661...","[-0.030214058, -0.3444381, -0.28076234, 0.0615...",18.747736


In [20]:
import numpy as np

def norm_vector(v):
    norm = np.sqrt((v * v).sum())
    v_norm = v / norm
    return v_norm

In [21]:
df["answer_llm_emb_norm"] = df["answer_llm_emb"].apply(norm_vector)
df["answer_orig_emb_norm"] = df["answer_orig_emb"].apply(norm_vector)

In [23]:
answer = []
for each in range(0, len(df)):
    answer.append(df["answer_orig_emb_norm"][each].dot(df["answer_llm_emb_norm"][each]))


df_dot_prod = pd.DataFrame(answer, columns=['dot_prod'])
df["dot_product_norm"] = df_dot_prod

df_dot_prod.describe()

Unnamed: 0,dot_prod
count,300.0
mean,0.728393
std,0.157755
min,0.125357
25%,0.651273
50%,0.763761
75%,0.836235
max,0.958796


A/ 0.83

## Q4. Rouge

Now we will explore an alternative metric - the ROUGE score.  

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves, there's a python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at the index 10 of our dataframe (`doc_id=5170565b`)

```
from rouge import Rouge
rouge_scorer = Rouge()

scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence

What's the F score for `rouge-1`?

- 0.35
- 0.45
- 0.55
- 0.65

In [48]:
df[df.document == "5170565b"]

Unnamed: 0,answer_llm,answer_orig,document,question,course,answer_llm_emb,answer_orig_emb,dot_product,answer_llm_emb_norm,answer_orig_emb_norm,dot_product_norm
10,"Yes, all sessions are recorded, so if you miss...","Everything is recorded, so you won’t miss anyt...",5170565b,Are sessions recorded if I miss one?,machine-learning-zoomcamp,"[-0.10797262, -0.07068468, -0.091208436, 0.092...","[-0.22097382, -0.07662514, -0.19240223, -0.038...",32.344711,"[-0.016557612, -0.010839502, -0.013986822, 0.0...","[-0.03465839, -0.012018184, -0.030177113, -0.0...",0.777956
11,"Yes, you can ask your questions in advance if ...","Everything is recorded, so you won’t miss anyt...",5170565b,Can I ask questions in advance if I can't atte...,machine-learning-zoomcamp,"[-0.38412586, -0.30479348, -0.2386713, 0.07005...","[-0.22097382, -0.07662514, -0.19240223, -0.038...",31.441843,"[-0.061034273, -0.048429046, -0.037922803, 0.0...","[-0.03465839, -0.012018184, -0.030177113, -0.0...",0.783566
12,"If you miss a session, don't worry! Everything...","Everything is recorded, so you won’t miss anyt...",5170565b,How will my questions be addressed if I miss a...,machine-learning-zoomcamp,"[-0.28844845, -0.2045337, -0.18220218, -0.0422...","[-0.22097382, -0.07662514, -0.19240223, -0.038...",36.380718,"[-0.045732845, -0.03242835, -0.028887738, -0.0...","[-0.03465839, -0.012018184, -0.030177113, -0.0...",0.904688
13,"Yes, there is a way to catch up on a missed se...","Everything is recorded, so you won’t miss anyt...",5170565b,Is there a way to catch up on a missed session?,machine-learning-zoomcamp,"[-0.4017524, -0.16281958, -0.14009969, 0.03922...","[-0.22097382, -0.07662514, -0.19240223, -0.038...",33.340504,"[-0.061946534, -0.025105285, -0.021602087, 0.0...","[-0.03465839, -0.012018184, -0.030177113, -0.0...",0.806303
14,"Yes, you can still interact with instructors a...","Everything is recorded, so you won’t miss anyt...",5170565b,Can I still interact with instructors after mi...,machine-learning-zoomcamp,"[-0.20765506, -0.2724766, -0.111881085, -0.051...","[-0.22097382, -0.07662514, -0.19240223, -0.038...",30.606163,"[-0.03147433, -0.041299347, -0.016957844, -0.0...","[-0.03465839, -0.012018184, -0.030177113, -0.0...",0.727596


In [24]:
!pip install rouge

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [44]:
from rouge import Rouge
rouge_scorer = Rouge()

answer_llm = df[df.document == "5170565b"]['answer_llm'][10]
answer_orig = df[df.document == "5170565b"]['answer_orig'][10]

scores = rouge_scorer.get_scores(answer_llm, answer_orig)[0]
scores["rouge-1"]["f"]

0.45454544954545456

In [56]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}


## Q5. Average rouge score

Let's compute the average between `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4

- 0.35
- 0.45
- 0.55
- 0.65


In [51]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

In [54]:
scores_f = []
for each in scores.keys():
    scores_f.append(scores[each]["f"])

average = sum(scores_f) / len(scores_f)

print("Average:", average)

Average: 0.35490034990035496


## Q6. Average rouge score for all the data points

Now let's compute the score for all the records

```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```

And create a dataframe from them

What's the agerage `rouge_2` across all the records?

- 0.10
- 0.20
- 0.30
- 0.40

In [85]:
scores_r1_f = []
scores_r2_f = []
scores_rl_f = []

for each in range(0, len(df)):
    score = rouge_scorer.get_scores(df["answer_llm"][each], df["answer_orig"][each])
    scores_r1_f.append(scores["rouge-1"]["f"])
    scores_r2_f.append(scores["rouge-2"]["f"])
    scores_rl_f.append(scores["rouge-l"]["f"])

df_rouge_scores_r1_f = pd.DataFrame(scores_r1_f, columns=['scores_r1_f'])
df_rouge_scores_r2_f = pd.DataFrame(scores_r2_f, columns=['scores_r2_f'])
df_rouge_scores_rl_f = pd.DataFrame(scores_rl_f, columns=['scores_rl_f'])

df["scores_r1_f"] = df_rouge_scores_r1_f
df["scores_r2_f"] = df_rouge_scores_r2_f
df["scores_rl_f"] = df_rouge_scores_rl_f
df["scores_f_avg"] = (df["scores_r1_f"] + df["scores_r2_f"] + df["scores_rl_f"])/3

df.head()

Unnamed: 0,answer_llm,answer_orig,document,question,course,answer_llm_emb,answer_orig_emb,dot_product,answer_llm_emb_norm,answer_orig_emb_norm,dot_product_norm,scores_r1_f,scores_r2_f,scores_rl_f,scores_f_avg
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp,"[-0.42244655, -0.22485626, -0.3240584, -0.2847...","[-0.030214058, -0.3444381, -0.28076234, 0.0615...",17.515987,"[-0.071590446, -0.038105555, -0.054916974, -0....","[-0.005158082, -0.058801766, -0.047931172, 0.0...",0.506754,0.454545,0.216216,0.393939,0.3549
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp,"[-0.38068146, 0.047848288, -0.31510952, -0.210...","[-0.030214058, -0.3444381, -0.28076234, 0.0615...",13.418402,"[-0.06456947, 0.008115811, -0.05344746, -0.035...","[-0.005158082, -0.058801766, -0.047931172, 0.0...",0.388549,0.454545,0.216216,0.393939,0.3549
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp,"[-0.05881373, -0.33736944, -0.36157572, 0.0217...","[-0.030214058, -0.3444381, -0.28076234, 0.0615...",25.313255,"[-0.009779983, -0.056100298, -0.060125496, 0.0...","[-0.005158082, -0.058801766, -0.047931172, 0.0...",0.718599,0.454545,0.216216,0.393939,0.3549
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp,"[-0.22753648, -0.008134096, -0.21719913, -0.11...","[-0.030214058, -0.3444381, -0.28076234, 0.0615...",12.147415,"[-0.037005045, -0.0013228761, -0.035323843, -0...","[-0.005158082, -0.058801766, -0.047931172, 0.0...",0.337266,0.454545,0.216216,0.393939,0.3549
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp,"[-0.06969386, -0.5005093, -0.1659844, 0.306661...","[-0.030214058, -0.3444381, -0.28076234, 0.0615...",18.747736,"[-0.011362247, -0.08159844, -0.027060572, 0.04...","[-0.005158082, -0.058801766, -0.047931172, 0.0...",0.521792,0.454545,0.216216,0.393939,0.3549


In [89]:
df["scores_r2_f"].describe()

count    3.000000e+02
mean     2.162162e-01
std      2.780195e-17
min      2.162162e-01
25%      2.162162e-01
50%      2.162162e-01
75%      2.162162e-01
max      2.162162e-01
Name: scores_r2_f, dtype: float64

In [91]:
df["scores_r2_f"].mean()

0.21621621121621634

## Submit the results

* Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/hw4
* It's possible that your answers won't match exactly. If it's the case, select the closest one.
