## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

## Load documents with IDs

In [1]:
import requests 

base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
relative_url = '03-vector-search/eval/documents-with-ids.json'
docs_url = f'{base_url}/{relative_url}?raw=1'
docs_response = requests.get(docs_url)
documents = docs_response.json()

In [3]:
print("Count of documents: ",len(documents))
documents[10]
print(type(documents))

Count of documents:  948
<class 'list'>


## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of the answers our RAG system generated
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).


In [19]:
import pandas as pd

base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
relative_url = '04-monitoring/data/results-gpt4o-mini.csv'
documents_url = f'{base_url}/{relative_url}?raw=1'

df = pd.read_csv(documents_url)
df = df.iloc[:300]
documents = df.to_dict(orient='records')
print("Count of documents: ",len(df))
print(df[:2])


Count of documents:  300
                                          answer_llm  \
0  You can sign up for the course by visiting the...   
1  You can sign up using the link provided in the...   

                                         answer_orig  document  \
0  Machine Learning Zoomcamp FAQ\nThe purpose of ...  0227b872   
1  Machine Learning Zoomcamp FAQ\nThe purpose of ...  0227b872   

                              question                     course  
0  Where can I sign up for the course?  machine-learning-zoomcamp  
1   Can you provide a link to sign up?  machine-learning-zoomcamp  


## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview)

> Note: this is not the same model as in HW3


In [8]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
model = SentenceTransformer(model_name)

You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





Create the embeddings for the first LLM answer.

What's the first value of the resulting vector?

In [22]:
answer_llm = documents[0]['answer_llm']
print("Answer: ",answer_llm)
embedd_answer = model.encode(answer_llm)
print("Embedding: ",embedd_answer[0])

Answer:  You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).
Embedding:  -0.42244655


Our answer is -0.42

## Q2. Computing the dot product

Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the score?

* 21.67
* 31.67
* 41.67
* 51.67
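To make the "75th percentile" concrete, here is a minimal pure-Python sketch of the two steps: a dot product per pair, then a percentile with linear interpolation between ranks, which mirrors the default `linear` method of `np.percentile`. The helper names `dot` and `percentile` and the toy vectors are made up for illustration; they are not the real embeddings.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100  # fractional index into the sorted list
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Toy "embedding" pairs (hypothetical values, not model output)
pairs = [
    ([1.0, 2.0], [3.0, 4.0]),  # dot = 11
    ([0.0, 1.0], [5.0, 2.0]),  # dot = 2
    ([2.0, 2.0], [2.0, 2.0]),  # dot = 8
    ([1.0, 0.0], [4.0, 9.0]),  # dot = 4
]
toy_scores = [dot(a, b) for a, b in pairs]

# Sorted scores are [2, 4, 8, 11]; rank 2.25 falls a quarter of the way from 8 to 11
p75 = percentile(toy_scores, 75)
print(p75)
```

Three quarters of the scores are at or below this value, so it describes the upper end of typical similarity without being dominated by the single best pair.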


In [20]:
documents[0]['answer_llm']

'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).'

In [24]:
from tqdm.auto import tqdm

evaluations=[]
for doc in tqdm(documents):
    answer_llm_emb = model.encode(doc['answer_llm'])
    answer_orig_emb = model.encode(doc['answer_orig'])
    dot_product= answer_llm_emb.dot(answer_orig_emb)
    evaluations.append(dot_product)

print("Count of evaluations:", len(evaluations))
print(evaluations)

100%|██████████| 300/300 [02:22<00:00,  2.11it/s]



In [26]:
import numpy as np

p75 = np.percentile(evaluations, 75)
print("Percentile 75%: ", p75)

Percentile 75%:  31.67430877685547


## Q3. Computing the cosine

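From Q2, we can see that the results are not within the [0, 1] range. That's because the vectors coming from this model are not normalized, so the dot product also grows with their lengths.

So we need to normalize them. The dot product of two normalized vectors is their cosine similarity, which always lies in [-1, 1].

To do it, we: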

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`
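As a worked example of the two steps above, the following sketch normalizes two small vectors by hand and compares the raw dot product with the dot product of the normalized vectors. It uses plain Python lists rather than NumPy arrays, and the helper names and toy vectors are assumptions chosen for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_norm(v):
    """Euclidean length ||v|| of a vector."""
    return math.sqrt(dot(v, v))

def normalize(v):
    """Return v / ||v||; a zero vector is returned unchanged."""
    norm = l2_norm(v)
    if norm == 0:
        return list(v)
    return [x / norm for x in v]

u = [3.0, 4.0]  # ||u|| = 5
w = [4.0, 3.0]  # ||w|| = 5

raw_dot = dot(u, w)                        # depends on vector lengths
cosine = dot(normalize(u), normalize(w))   # depends only on direction

# Scaling a vector changes the raw dot product but not the cosine
scaled_u = [10 * x for x in u]
raw_dot_scaled = dot(scaled_u, w)
cosine_scaled = dot(normalize(scaled_u), normalize(w))

print(raw_dot, cosine, raw_dot_scaled, cosine_scaled)
```

The raw dot product grows tenfold when `u` is scaled, while the cosine stays the same, which is why normalized scores are easier to compare across answer pairs.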


In [29]:
def normalize_vector(v):
    # Euclidean (L2) norm of the vector
    norm = np.sqrt((v * v).sum())
    if norm == 0:
        # A zero vector has no direction; return it unchanged to avoid dividing by zero
        return v
    return v / norm

In [32]:
normalize_vector(model.encode(documents[1]['answer_llm']))

array([-6.45694733e-02,  8.11581127e-03, -5.34474589e-02, -3.57535817e-02,
       -1.01497117e-02,  2.92569622e-02, -2.13802066e-02, -8.96153972e-03,
       -5.17105358e-03,  2.96327323e-02, -5.15980041e-03,  2.52887625e-02,
       -5.15669351e-03,  3.74120660e-02,  4.31329645e-02, -1.28087755e-02,
        1.61199700e-02,  4.69575822e-03, -1.77060179e-02, -3.86353321e-02,
        7.35478802e-03,  2.44517960e-02, -9.20967683e-02, -8.68302584e-03,
        2.09734831e-02, -3.06710731e-02, -2.15153843e-02, -4.84315567e-02,
       -1.07426653e-02, -2.64777057e-02, -3.52982315e-03,  3.71070281e-02,
        2.14214120e-02,  3.17584313e-02, -1.62356409e-05, -6.16304623e-03,
       -6.43580128e-03,  7.39409495e-03, -2.46850140e-02,  1.35752521e-02,
       -2.79324502e-02,  3.36448364e-02, -3.27904150e-02, -1.72284581e-02,
       -7.17717260e-02,  3.49436887e-02,  5.79928532e-02,  2.37358194e-02,
        8.91923979e-02,  4.56683226e-02,  5.72560914e-02, -3.40075344e-02,
        7.16267452e-02, -

Let's put it into a function and then compute the dot product
between the normalized vectors. This will give us the cosine similarity.

What's the 75th percentile of the cosine similarity scores?


In [33]:
evaluations=[]
for doc in tqdm(documents):
    answer_llm_emb = normalize_vector(model.encode(doc['answer_llm']))
    answer_orig_emb = normalize_vector(model.encode(doc['answer_orig']))
    cosine_sim= answer_llm_emb.dot(answer_orig_emb)
    evaluations.append(cosine_sim)

print("Count of evaluations:", len(evaluations))
print(evaluations)

100%|██████████| 300/300 [02:22<00:00,  2.10it/s]

Count of evaluations: 300
[0.5067539, 0.38854873, 0.7185989, 0.33726627, 0.5217923, 0.83053213, 0.7462835, 0.6944061, 0.84688616, 0.65590763, 0.7779559, 0.78356636, 0.90468806, 0.80630296, 0.72759616, 0.7751896, 0.71516633, 0.5890557, 0.53322953, 0.5857593, 0.81232715, 0.83714426, 0.76611555, 0.43333992, 0.81558585, 0.92667866, 0.552616, 0.7622108, 0.9452982, 0.8478371, 0.7192839, 0.6864791, 0.6100939, 0.64910805, 0.48555, 0.6549567, 0.52971876, 0.84890294, 0.73956215, 0.76096815, 0.70153177, 0.7140965, 0.77817, 0.6202106, 0.62210196, 0.33472955, 0.3324926, 0.31343076, 0.25845352, 0.27644622, 0.77109647, 0.89201, 0.5712719, 0.7779895, 0.7033882, 0.8988763, 0.7822658, 0.69761264, 0.6318737, 0.5829771, 0.59635806, 0.5221753, 0.5993201, 0.65132016, 0.53131604, 0.761606, 0.6682948, 0.6511333, 0.66239053, 0.75467545, 0.89955723, 0.87245953, 0.75394404, 0.7211681, 0.8531313, 0.74570763, 0.85769904, 0.6625385, 0.91524327, 0.55959284, 0.8276353, 0.8465157, 0.74230355, 0.8715825, 0.7529516, 0.8




In [34]:
import numpy as np

p75 = np.percentile(evaluations, 75)
print("Percentile 75%: ", p75)

Percentile 75%:  0.8362348973751068


## Q4. Rouge

Now we will explore an alternative metric: the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves, because there's a Python package for it:

```bash
pip install rouge
```


In [35]:
!pip install rouge==1.0.1

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting rouge==1.0.1
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

```python
from rouge import Rouge
rouge_scorer = Rouge()

# r is the record at index 10 of the dataframe
r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
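To show what these three scores measure, here is a simplified pure-Python sketch of ROUGE-N (clipped n-gram overlap) and ROUGE-L (longest common subsequence) on whitespace tokens. It is an illustration under stated assumptions, not the implementation used by the `rouge` package, which may differ in tokenization, n-gram counting and smoothing, so the numbers need not match the package exactly. All function names and the two sentences are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prf(overlap, hyp_total, ref_total):
    """Precision, recall and F1 from an overlap count."""
    p = overlap / hyp_total if hyp_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {'p': p, 'r': r, 'f': f}

def rouge_n(hypothesis, reference, n):
    """N-gram overlap; each n-gram counts at most as often as it appears in both texts."""
    hyp = Counter(ngrams(hypothesis.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((hyp & ref).values())
    return prf(overlap, sum(hyp.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(hypothesis, reference):
    """Longest-common-subsequence overlap between the two token sequences."""
    hyp, ref = hypothesis.split(), reference.split()
    return prf(lcs_length(hyp, ref), len(hyp), len(ref))

hyp_text = "the cat sat on the mat"  # made-up generated answer
ref_text = "the cat is on the mat"   # made-up reference answer

r1 = rouge_n(hyp_text, ref_text, 1)  # 5 of 6 unigrams overlap
r2 = rouge_n(hyp_text, ref_text, 2)  # 3 of 5 bigrams overlap
rl = rouge_l(hyp_text, ref_text)     # LCS is "the cat on the mat" (5 tokens)
print(r1, r2, rl)
```

Changing a single word lowers the bigram score much more than the unigram score, because one substitution breaks two adjacent bigrams.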

What's the F score for `rouge-1`?


In [44]:
from rouge import Rouge
rouge_scorer = Rouge()

# get_scores expects (hypothesis, reference), i.e. the LLM answer first
scores = rouge_scorer.get_scores(documents[10]['answer_llm'], documents[10]['answer_orig'])[0]
print("Rouge Score: \n", scores)


Rouge Score: 
 {'rouge-1': {'r': 0.45454545454545453, 'p': 0.45454545454545453, 'f': 0.45454544954545456}, 'rouge-2': {'r': 0.21621621621621623, 'p': 0.21621621621621623, 'f': 0.21621621121621637}, 'rouge-l': {'r': 0.42424242424242425, 'p': 0.42424242424242425, 'f': 0.42424241924242434}}


## Q5. Average rouge score

Let's compute the average of the `rouge-1`, `rouge-2` and `rouge-l` F scores for the same record from Q4.

- 0.35
- 0.45
- 0.55
- 0.65


In [48]:
print("Rouge Score: \n", scores)

Rouge Score: 
 {'rouge-1': {'r': 0.45454545454545453, 'p': 0.45454545454545453, 'f': 0.45454544954545456}, 'rouge-2': {'r': 0.21621621621621623, 'p': 0.21621621621621623, 'f': 0.21621621121621637}, 'rouge-l': {'r': 0.42424242424242425, 'p': 0.42424242424242425, 'f': 0.42424241924242434}}


In [49]:
np.mean([scores['rouge-1']['f'],scores['rouge-2']['f'], scores['rouge-l']['f']])

0.36500136000136507

In [51]:
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
print("Avg score: ",rouge_avg)

Avg score:  0.36500136000136507


## Q6. Average rouge score for all the data points

Now let's compute the score for all the records:

```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```


In [52]:
def avg_rouge_score(scores):
    rouge_1 = scores['rouge-1']['f']
    rouge_2 = scores['rouge-2']['f']
    rouge_l = scores['rouge-l']['f']
    
    rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3    

    return rouge_avg

In [53]:
evaluations=[]
for doc in tqdm(documents):
    scores = rouge_scorer.get_scores(doc['answer_llm'], doc['answer_orig'])[0]
    avg_score= avg_rouge_score(scores)
    evaluations.append(avg_score)

print("Count of evaluations:", len(evaluations))
print(evaluations)

100%|██████████| 300/300 [00:00<00:00, 307.70it/s]

Count of evaluations: 300
[0.07288173149369313, 0.09143518169307017, 0.32765752302398016, 0.15082140518956247, 0.09873112518185549, 0.37360320114980733, 0.3137973090718354, 0.2351894201987098, 0.4184529312211076, 0.3005952332728943, 0.35490034990035496, 0.5342902661335307, 0.6952852637710035, 0.6679536630162465, 0.5918253653550112, 0.3919191869193432, 0.4052287533533585, 0.28968253494163104, 0.21221531594688722, 0.3240740691477366, 0.18479477562485383, 0.22380560534963087, 0.44576497682443134, 0.14211711470058744, 0.4201871727487044, 0.5330459720143877, 0.282828278570679, 0.34875680959824235, 0.5676035296417705, 0.30272255382544655, 0.3687641674943503, 0.43293246502892596, 0.3040873805183951, 0.20606060205768265, 0.11111110781893012, 0.18461538068595776, 0.20085469692527402, 0.3809523766780046, 0.18678478708345356, 0.6283602100822864, 0.43134042634470965, 0.3309225280145684, 0.475330635751632, 0.2989604939688171, 0.1753590288787108, 0.07407407198559676, 0.10526315589104342, 0.066666664




And create a dataframe from them.

What's the average `rouge_2` across all the records?
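The cells in this notebook keep only the per-record average, so a per-metric question like this one needs each metric stored in its own column. A minimal sketch of that idea follows, using plain dictionaries instead of a dataframe. The helpers `flatten_scores` and `column_mean` are hypothetical names, and the toy results are hand-written dicts shaped like the `get_scores(...)[0]` output shown in Q4, not real scores.

```python
from statistics import fmean

def flatten_scores(scores):
    """Turn one Rouge result dict into a flat row of F scores."""
    row = {
        'rouge_1': scores['rouge-1']['f'],
        'rouge_2': scores['rouge-2']['f'],
        'rouge_l': scores['rouge-l']['f'],
    }
    row['rouge_avg'] = (row['rouge_1'] + row['rouge_2'] + row['rouge_l']) / 3
    return row

def column_mean(rows, column):
    """Mean of one metric across all rows."""
    return fmean(row[column] for row in rows)

# Made-up results with the same structure as the Rouge output (illustrative values)
toy_results = [
    {'rouge-1': {'r': 0.5, 'p': 0.5, 'f': 0.5},
     'rouge-2': {'r': 0.2, 'p': 0.2, 'f': 0.2},
     'rouge-l': {'r': 0.5, 'p': 0.5, 'f': 0.5}},
    {'rouge-1': {'r': 0.7, 'p': 0.7, 'f': 0.7},
     'rouge-2': {'r': 0.4, 'p': 0.4, 'f': 0.4},
     'rouge-l': {'r': 0.7, 'p': 0.7, 'f': 0.7}},
]

rows = [flatten_scores(s) for s in toy_results]
mean_rouge_2 = column_mean(rows, 'rouge_2')
mean_rouge_avg = column_mean(rows, 'rouge_avg')

# With pandas, pd.DataFrame(rows)['rouge_2'].mean() computes the same per-column mean
print(mean_rouge_2, mean_rouge_avg)
```

In the notebook, `rows` would be built inside the loop over `documents`, appending `flatten_scores(scores)` instead of only the average.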


In [55]:
df_rouge= pd.DataFrame(evaluations, columns=['avg_rouge_score'])
df_rouge

     avg_rouge_score
0           0.072882
1           0.091435
2           0.327658
3           0.150821
4           0.098731
..               ...
295         0.604570
296         0.535991
297         0.618851
298         0.247252


In [57]:
df_rouge.mean()

avg_rouge_score    0.313205
dtype: float64