# Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG systems.

## Getting the data

Let's start by getting the dataset. We will use the data generated in the module, specifically the results of our RAG system with "gpt-4o-mini".

In [2]:
import pandas as pd

base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
relative_url = '04-monitoring/data/results-gpt4o-mini.csv'
url = f'{base_url}/{relative_url}?raw=1'

df = pd.read_csv(url)

results_gpt4o_mini = df.to_dict(orient="records")

In [3]:
results_gpt4o_mini[0]

{'answer_llm': 'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).',
 'answer_orig': 'Machine Learning Zoomcamp FAQ\nThe purpose of this document is to capture frequently asked technical questions.\nWe did this for our data engineering course and it worked quite well. Check this document for inspiration on how to structure your questions and answers:\nData Engineering Zoomcamp FAQ\nIn the course GitHub repository there’s a link. Here it is: https://airtable.com/shryxwLd0COOEaqXo\nwork',
 'document': '0227b872',
 'question': 'Where can I sign up for the course?',
 'course': 'machine-learning-zoomcamp'}

We will use only the first 300 documents:

In [4]:
df = df.iloc[:300]

In [5]:
df.head()

| | answer_llm | answer_orig | document | question | course |
|---|---|---|---|---|---|
| 0 | You can sign up for the course by visiting the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Where can I sign up for the course? | machine-learning-zoomcamp |
| 1 | You can sign up using the link provided in the... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Can you provide a link to sign up? | machine-learning-zoomcamp |
| 2 | Yes, there is an FAQ for the Machine Learning ... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Is there an FAQ for this Machine Learning course? | machine-learning-zoomcamp |
| 3 | The context does not provide any specific info... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | Does this course have a GitHub repository for ... | machine-learning-zoomcamp |
| 4 | To structure your questions and answers for th... | Machine Learning Zoomcamp FAQ\nThe purpose of ... | 0227b872 | How can I structure my questions and answers f... | machine-learning-zoomcamp |


In [6]:
df.shape

(300, 5)

## Q1. Getting the embeddings model

Now get the embedding model `multi-qa-mpnet-base-dot-v1` from the Sentence Transformers library:

In [7]:
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
model = SentenceTransformer(model_name)

You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.

Create the embeddings for the first LLM answer:

In [8]:
answer_llm = df.iloc[0].answer_llm

In [9]:
answer_llm

'You can sign up for the course by visiting the course page at [http://mlzoomcamp.com/](http://mlzoomcamp.com/).'

In [10]:
model.encode(answer_llm)[0]

-0.42244673

## Q2. Computing the dot product

For each answer pair, let's create embeddings and compute the dot product between them.
We will put the results (scores) into the `evaluations` list.
What's the 75th percentile of the score?
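To make the two steps concrete before running them on real embeddings, here is a minimal sketch in plain Python. The 3-dimensional vectors are made up for illustration (the real embeddings come from the model and are much longer), and `dot`, `pairs`, `toy_scores` and `p75` are hypothetical names. It assumes Python 3.8+ for `statistics.quantiles`.

```python
import statistics


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Made-up (original answer, LLM answer) "embedding" pairs, for illustration only.
pairs = [
    ([1.0, 2.0, 0.5], [1.0, 1.5, 0.0]),
    ([0.0, 1.0, 3.0], [0.0, 1.0, 2.0]),
    ([2.0, 2.0, 2.0], [1.0, 1.0, 1.0]),
    ([0.5, 0.0, 1.0], [4.0, 1.0, 0.0]),
]

# One score per pair, like the `evaluations` list.
toy_scores = [dot(u, v) for u, v in pairs]

# quantiles(..., n=4) returns the three quartile cut points;
# the last one is the 75th percentile. method='inclusive' interpolates
# linearly between the sorted values.
p75 = statistics.quantiles(toy_scores, n=4, method='inclusive')[2]

print(toy_scores)
print(p75)
```

The `inclusive` method interpolates linearly between sorted values, which should agree with the default behaviour of `np.percentile`, so the same idea carries over to the real scores.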

In [11]:
results_dict = df.to_dict(orient="records")

In [12]:
from tqdm.auto import tqdm

def compute_similarity(record):
    answer_orig = record['answer_orig']
    answer_llm = record["answer_llm"]
    
    v_orig = model.encode(answer_orig)
    v_llm = model.encode(answer_llm)
    
    return v_llm.dot(v_orig)


evaluations = [
    compute_similarity(record) for record in tqdm(results_dict)
]

100%|██████████| 300/300 [01:14<00:00,  4.02it/s]


In [13]:
pd.Series(evaluations).describe()

count    300.000000
mean      27.495996
std        6.384742
min        4.547925
25%       24.307845
50%       28.336873
75%       31.674312
max       39.476009
dtype: float64

In [14]:
import numpy as np

np.percentile(evaluations, 75)

31.674311637878418

## Q3. Computing the cosine

We can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

To normalize them, we:

* Compute the norm of a vector
* Divide each element by this norm
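The two steps above can be sketched on toy vectors in plain Python. The vectors and the helper names (`norm`, `normalize`, `dot`) are made up for illustration; with NumPy arrays, `np.linalg.norm` computes the same norm.

```python
import math


def norm(v):
    """Euclidean (L2) norm: square root of the sum of squared elements."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Divide each element by the norm, giving a vector of length 1."""
    n = norm(v)
    return [x / n for x in v]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


u = [3.0, 4.0]
v = [6.0, 8.0]    # same direction as u, twice as long
w = [-4.0, 3.0]   # perpendicular to u

raw_uv = dot(u, v)                         # depends on the vector lengths
cos_uv = dot(normalize(u), normalize(v))   # close to 1: same direction
cos_uw = dot(normalize(u), normalize(w))   # close to 0: perpendicular

print(raw_uv, cos_uv, cos_uw)
```

The raw dot product grows with the length of the vectors, while the dot product of the normalized vectors (the cosine) only reflects the angle between them.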

In [15]:

def compute_cosine(record):
    answer_orig = record['answer_orig']
    answer_llm = record["answer_llm"]
    
    v_orig = model.encode(answer_orig)
    v_llm = model.encode(answer_llm)
    
    orig_norm = np.sqrt((v_orig*v_orig).sum())
    llm_norm = np.sqrt((v_llm*v_llm).sum())
    
    v_orig_norm = v_orig/orig_norm
    v_llm_norm = v_llm/llm_norm
    
    return v_llm_norm.dot(v_orig_norm)

cosines = [
    compute_cosine(record) for record in tqdm(results_dict)
]

100%|██████████| 300/300 [01:16<00:00,  3.91it/s]


In [16]:
pd.Series(cosines).describe()

count    300.000000
mean       0.728393
std        0.157755
min        0.125357
25%        0.651273
50%        0.763761
75%        0.836235
max        0.958796
dtype: float64

In [17]:
np.percentile(cosines, 75)

0.836234912276268

## Q4. Rouge

Now we will explore an alternative metric - the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves, there's a Python package for it:

In [18]:
%%bash

pip install rouge

Defaulting to user installation because normal site-packages is not writeable
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


Let's compute the ROUGE score between the answers at index 10 of our dataframe (doc_id=5170565b):

In [21]:
df.iloc[10, :]

answer_llm     Yes, all sessions are recorded, so if you miss...
answer_orig    Everything is recorded, so you won’t miss anyt...
document                                                5170565b
question                    Are sessions recorded if I miss one?
course                                 machine-learning-zoomcamp
Name: 10, dtype: object

In [27]:
from rouge import Rouge
rouge_scorer = Rouge()
record = df.iloc[10, :]
scores = rouge_scorer.get_scores(record['answer_llm'], record['answer_orig'])[0]

There are three scores: rouge-1, rouge-2 and rouge-l, each reported as recall (r), precision (p) and F1 score (f).

* rouge-1: The overlap of unigrams (single words)
* rouge-2: The overlap of bigrams (pairs of consecutive words)
* rouge-l: The longest common subsequence
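To make these three definitions concrete, here is a simplified sketch in plain Python. It is not the implementation used by the `rouge` package, which may tokenize and count differently, so its numbers need not match the package's output. The function names and the two example sentences are made up for illustration.

```python
def ngrams(text, n):
    """Set of word n-grams in `text` (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(hypothesis, reference, n):
    """Set-based recall, precision and F1 over word n-grams."""
    hyp = ngrams(hypothesis, n)
    ref = ngrams(reference, n)
    overlap = len(hyp & ref)
    p = overlap / len(hyp) if hyp else 0.0
    r = overlap / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return {'r': r, 'p': p, 'f': f}


def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Made-up sentences, for illustration only.
hyp = 'all sessions are recorded'
ref = 'everything is recorded so all sessions can be watched later'

uni = ngram_overlap(hyp, ref, 1)   # shares "all", "sessions", "recorded"
bi = ngram_overlap(hyp, ref, 2)    # shares only ("all", "sessions")
lcs = lcs_length(hyp.split(), ref.split())  # "all sessions"

print(uni, bi, lcs)
```

Note how the bigram overlap is much smaller than the unigram overlap: the two sentences share words, but rarely in the same order.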

In [28]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}
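In this output, `r` and `p` are equal for each metric, yet `f` is slightly smaller than both. That is consistent with an F1 score computed with a tiny constant added to the denominator to avoid division by zero. The sketch below reproduces the rouge-1 value under that assumption. The value `eps=1e-8` is inferred from the printed numbers, not taken from the package documentation, and `f_score` is a hypothetical helper.

```python
def f_score(p, r, eps=1e-8):
    """Harmonic mean of precision and recall.

    `eps` is an assumed small constant that keeps the denominator non-zero.
    """
    return 2 * p * r / (p + r + eps)


# Precision and recall for rouge-1, copied from the output above.
p = r = 0.45454545454545453
f1 = f_score(p, r)

print(f1)
```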

## Q5. Average rouge score

Let's compute the average F-score across rouge-1, rouge-2 and rouge-l for the same record from Q4:

In [33]:
np.mean([rouge['f'] for rouge in scores.values()])

0.35490034990035496

## Q6. Average rouge score for all the data points

Now let's compute the score for all the records and create a dataframe from them.

What's the average rouge-2 F-score across all the records?

In [34]:
def get_rouge_scores(record):
    return rouge_scorer.get_scores(record['answer_llm'], record['answer_orig'])[0]

In [36]:
rouge_scores = list(df.apply(get_rouge_scores, axis=1))

In [37]:
pd.DataFrame(rouge_scores)

| | rouge-1 | rouge-2 | rouge-l |
|---|---|---|---|
| 0 | {'r': 0.061224489795918366, 'p': 0.21428571428... | {'r': 0.017543859649122806, 'p': 0.07142857142... | {'r': 0.061224489795918366, 'p': 0.21428571428... |
| 1 | {'r': 0.08163265306122448, 'p': 0.266666666666... | {'r': 0.03508771929824561, 'p': 0.133333333333... | {'r': 0.061224489795918366, 'p': 0.2, 'f': 0.0... |
| 2 | {'r': 0.32653061224489793, 'p': 0.571428571428... | {'r': 0.14035087719298245, 'p': 0.242424242424... | {'r': 0.30612244897959184, 'p': 0.535714285714... |
| 3 | {'r': 0.16326530612244897, 'p': 0.32, 'f': 0.2... | {'r': 0.03508771929824561, 'p': 0.071428571428... | {'r': 0.14285714285714285, 'p': 0.28, 'f': 0.1... |
| 4 | {'r': 0.2653061224489796, 'p': 0.0970149253731... | {'r': 0.07017543859649122, 'p': 0.022346368715... | {'r': 0.22448979591836735, 'p': 0.082089552238... |
| ... | ... | ... | ... |
| 295 | {'r': 0.6428571428571429, 'p': 0.6666666666666... | {'r': 0.559322033898305, 'p': 0.52380952380952... | {'r': 0.6071428571428571, 'p': 0.6296296296296... |
| 296 | {'r': 0.6428571428571429, 'p': 0.5454545454545... | {'r': 0.5423728813559322, 'p': 0.4, 'f': 0.460... | {'r': 0.6071428571428571, 'p': 0.5151515151515... |
| 297 | {'r': 0.6607142857142857, 'p': 0.6491228070175... | {'r': 0.5932203389830508, 'p': 0.5384615384615... | {'r': 0.6428571428571429, 'p': 0.6315789473684... |
| 298 | {'r': 0.2857142857142857, 'p': 0.3265306122448... | {'r': 0.13559322033898305, 'p': 0.129032258064... | {'r': 0.2857142857142857, 'p': 0.3265306122448... |
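Each cell in this dataframe holds a whole dictionary, which makes column-wise statistics awkward. One possible alternative, sketched here in plain Python, is to flatten each record into one number per key before averaging. The helper `flatten_scores`, the key format `rouge-2_f` and the toy values are all made up for illustration.

```python
def flatten_scores(scores):
    """Turn {'rouge-1': {'r': ..., 'p': ..., 'f': ...}, ...}
    into {'rouge-1_r': ..., 'rouge-1_p': ..., 'rouge-1_f': ..., ...}."""
    return {
        f'{metric}_{part}': value
        for metric, parts in scores.items()
        for part, value in parts.items()
    }


# Made-up scores in the same nested shape as `rouge_scores`.
toy_rouge_scores = [
    {'rouge-1': {'r': 0.5, 'p': 0.5, 'f': 0.5},
     'rouge-2': {'r': 0.25, 'p': 0.25, 'f': 0.25},
     'rouge-l': {'r': 0.5, 'p': 0.5, 'f': 0.5}},
    {'rouge-1': {'r': 1.0, 'p': 1.0, 'f': 1.0},
     'rouge-2': {'r': 0.75, 'p': 0.75, 'f': 0.75},
     'rouge-l': {'r': 1.0, 'p': 1.0, 'f': 1.0}},
]

rows = [flatten_scores(scores) for scores in toy_rouge_scores]

# Average of one flattened column across all records.
avg_rouge_2_f = sum(row['rouge-2_f'] for row in rows) / len(rows)

print(sorted(rows[0]))
print(avg_rouge_2_f)
```

Passing a list of flat dictionaries like `rows` to `pd.DataFrame` gives one numeric column per key, so the usual `.mean()` and `.describe()` methods apply directly.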


In [40]:
np.mean([np.mean([rouge['f'] for rouge in scores.values()]) for scores in rouge_scores])

0.313205367339838

In [42]:
np.mean([score['rouge-2']['f'] for score in rouge_scores])

0.20696501983423318