## Homework week 4 - Evaluation and Monitoring
In this homework, we'll evaluate the quality of our RAG system.



### Getting the data
Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of our RAG system with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/cohorts/2024/04-monitoring/homework.md#:~:text=Let%27s%20start%20by,Read%20it%3A).

Read it:



In [27]:
import pandas as pd

github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'

df = pd.read_csv(url)

We will use only the first 300 documents:

In [28]:
df = df.iloc[:300]

### Q1. Getting the embeddings model
Now, get the embeddings model multi-qa-mpnet-base-dot-v1 from the [Sentence Transformer library](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/cohorts/2024/04-monitoring/homework.md#:~:text=Now%2C%20get%20the%20embeddings%20model%20multi%2Dqa%2Dmpnet%2Dbase%2Ddot%2Dv1%20from%20the%20Sentence%20Transformer%20library)



In [7]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)



Create the embeddings for the first LLM answer and look at the first value of the resulting vector:



In [48]:
answer_llm = df.iloc[0].answer_llm
v = embedding_model.encode(answer_llm)[0]
v

np.float32(-0.42244655)

### Q2. Computing the dot product
Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the evaluations list.

What's the 75th percentile of the score?



In [29]:
records = df.to_dict(orient='records')
records[0]

{'answer_llm': 'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).',
 'answer_orig': 'Machine Learning Zoomcamp FAQ\nThe purpose of this document is to capture frequently asked technical questions.\nWe did this for our data engineering course and it worked quite well. Check this document for inspiration on how to structure your questions and answers:\nData Engineering Zoomcamp FAQ\nIn the course GitHub repository there’s a link. Here it is: https://airtable.com/shryxwLd0COOEaqXo\nwork',
 'document': '0227b872',
 'question': 'Where can I sign up for the course?',
 'course': 'machine-learning-zoomcamp'}

In [34]:
from tqdm.auto import tqdm

evaluations = []

for doc in tqdm(records):
    v_llm = embedding_model.encode(doc['answer_llm'])
    v_orig = embedding_model.encode(doc['answer_orig'])
    evaluations.append(v_llm.dot(v_orig))

100%|████████████████████████| 300/300 [01:25<00:00,  3.52it/s]


In [35]:
import numpy as np

np.percentile(evaluations, 75)

np.float32(31.674309)
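To make the percentile step concrete, here is a minimal, dependency-free sketch of a percentile computed with linear interpolation between neighbouring ranks, which is the default behaviour of `np.percentile`. The helper name `percentile_sketch` and the toy scores are illustrative assumptions, not part of the homework data.

```python
def percentile_sketch(values, q):
    """Return the q-th percentile (0-100) of a non-empty list,
    interpolating linearly between the two closest ranks."""
    ordered = sorted(values)
    # Fractional position of the percentile inside the sorted list
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical toy scores, chosen so the result is easy to check by hand
toy_scores = [4.0, 1.0, 3.0, 2.0]
p75 = percentile_sketch(toy_scores, 75)
print(p75)  # 3.25
```

With four sorted values, the 75th percentile sits a quarter of the way between the third and fourth value, hence 3.25.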

### Q3. Computing the cosine
From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do it, we:

- Compute the norm of a vector
- Divide each element by this norm

So, for vector v, it'll be v / ||v||

In numpy, this is how you do it:




In [46]:
def normalize(v):
    return v / np.sqrt((v * v).sum())

In [53]:
from tqdm.auto import tqdm

norm_evaluations = []

for doc in tqdm(records):
    v_llm = embedding_model.encode(doc['answer_llm'])
    v_orig = embedding_model.encode(doc['answer_orig'])
    llm_norm = normalize(v_llm)
    orig_norm = normalize(v_orig)
    norm_evaluations.append(llm_norm.dot(orig_norm))

100%|████████████████████████| 300/300 [01:24<00:00,  3.53it/s]


In [55]:
evaluations = pd.Series(norm_evaluations)
evaluations.describe()

count    300.000000
mean       0.728393
std        0.157755
min        0.125357
25%        0.651273
50%        0.763761
75%        0.836235
max        0.958796
dtype: float64
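The dot product of two normalized vectors is their cosine similarity. As a sanity check, here is a small sketch in plain Python with vectors simple enough to verify by hand. The vectors `a` and `b` and the helpers `normalize_list` and `dot` are illustrative assumptions and do not come from the embedding model.

```python
import math


def normalize_list(v):
    """Divide each element by the Euclidean norm of the vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 2-d vectors; both have norm 5
a = [3.0, 4.0]
b = [4.0, 3.0]

raw = dot(a, b)                                     # grows with vector length
cosine = dot(normalize_list(a), normalize_list(b))  # bounded after normalizing

print(raw)               # 24.0
print(round(cosine, 2))  # 0.96
```

The raw dot product depends on the vector lengths, while the normalized version depends only on the angle between the vectors.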

### Q4. Rouge
Now we will explore an alternative metric - the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:



In [56]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


(The latest version at the moment of writing is 1.0.1)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (doc_id=5170565b):



In [61]:
from rouge import Rouge
rouge_scorer = Rouge()
doc = records[10]
scores = rouge_scorer.get_scores(doc['answer_llm'], doc['answer_orig'])[0]

In [62]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}
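In this output, `r`, `p` and `f` stand for recall, precision and F1. To show how the three relate, here is a simplified sketch of a ROUGE-1-style score. It assumes lowercase whitespace tokenization and counts unique unigrams only, so it is not the `rouge` package's exact implementation and its numbers may differ from the package's. The function name and the example sentences are hypothetical.

```python
def rouge_1_sketch(hypothesis, reference):
    """Simplified unigram-overlap score (assumption: unique lowercase
    whitespace tokens; real ROUGE implementations may tokenize differently)."""
    hyp_tokens = set(hypothesis.lower().split())
    ref_tokens = set(reference.lower().split())
    overlap = len(hyp_tokens & ref_tokens)
    if overlap == 0:
        return {'r': 0.0, 'p': 0.0, 'f': 0.0}
    recall = overlap / len(ref_tokens)      # share of the reference covered
    precision = overlap / len(hyp_tokens)   # share of the hypothesis that matches
    f1 = 2 * precision * recall / (precision + recall)
    return {'r': recall, 'p': precision, 'f': f1}


# Hypothetical sentences: 4 shared unigrams ("the", "cat", "on", "mat")
example = rouge_1_sketch('the cat sat on the mat', 'the cat lay on a mat')
print(example)
```

Here the hypothesis has 5 unique tokens and the reference has 6, so precision is 4/5, recall is 4/6, and F1 is their harmonic mean, 8/11.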

### Q5. Average rouge score
Let's compute the average of the rouge-1, rouge-2 and rouge-l F-scores for the same record from Q4:

In [67]:
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
(rouge_1 + rouge_2 + rouge_l) / 3

0.35490034990035496

### Q6. Average rouge score for all the data points
Now let's compute the F-score for all the records and create a dataframe from them.

What's the average F-score in rouge_2 across all the records?



In [84]:
evaluations = []

for doc in tqdm(records):
    scores = rouge_scorer.get_scores(doc['answer_llm'], doc['answer_orig'])[0]

    rouge_1 = scores['rouge-1']['f']
    rouge_2 = scores['rouge-2']['f']
    rouge_l = scores['rouge-l']['f']
    
    evaluations.append({
        'rouge_1': rouge_1,
        'rouge_2': rouge_2,
        'rouge_l': rouge_l,
        'rouge_avg': (rouge_1 + rouge_2 + rouge_l) / 3,
    })

100%|███████████████████████| 300/300 [00:00<00:00, 411.98it/s]


In [86]:
evaluation = pd.DataFrame(evaluations)
evaluation['rouge_2'].describe()

count    300.000000
mean       0.206965
std        0.153550
min        0.000000
25%        0.097809
50%        0.178671
75%        0.286181
max        0.739130
Name: rouge_2, dtype: float64