In [None]:
%%bash
which python
python --version
#python -m ipykernel install --name py3.10-env --user
pip install -q tqdm openai elasticsearch pandas scikit-learn transformers accelerate bitsandbytes tiktoken

## Homework: Evaluation and Monitoring

In this homework, we'll evaluate the quality of our RAG system.

> It's possible that your answers won't match exactly. If it's the case, select the closest one.

## Getting the data

Let's start by getting the dataset. We will use the data we generated in the module.

In particular, we'll evaluate the quality of the answers generated
with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv).


Read it:

```python
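import pandas as pd

# github_url is the link to the results file given above
github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'
url = f'{github_url}?raw=1'
df = pd.read_csv(url)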
```

We will use only the first 300 documents:


```python
df = df.iloc[:300]
```

In [1]:
import pandas as pd
github_url="https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/3757854db171c4d22da407a085e79fb370f1fae3/04-monitoring/data/results-gpt4o-mini.csv"
url = f'{github_url}?raw=1'
df = pd.read_csv(url)
df = df.iloc[:300]

In [2]:
df.head()

Unnamed: 0,answer_llm,answer_orig,document,question,course
0,You can sign up for the course by visiting the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Where can I sign up for the course?,machine-learning-zoomcamp
1,You can sign up using the link provided in the...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Can you provide a link to sign up?,machine-learning-zoomcamp
2,"Yes, there is an FAQ for the Machine Learning ...",Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Is there an FAQ for this Machine Learning course?,machine-learning-zoomcamp
3,The context does not provide any specific info...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,Does this course have a GitHub repository for ...,machine-learning-zoomcamp
4,To structure your questions and answers for th...,Machine Learning Zoomcamp FAQ\nThe purpose of ...,0227b872,How can I structure my questions and answers f...,machine-learning-zoomcamp


## Q1. Getting the embeddings model

Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
[the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview)

> Note: this is not the same model as in HW3

```python
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)
```

Create the embeddings for the first LLM answer:

```python
answer_llm = df.iloc[0].answer_llm
```

What's the first value of the resulting vector?

* -0.42
* -0.22
* -0.02
* 0.21

In [3]:
from sentence_transformers import SentenceTransformer
model_name = "multi-qa-mpnet-base-dot-v1"
embedding_model = SentenceTransformer(model_name)


In [4]:
answer_llm = df.iloc[0].answer_llm
answer_orig = df.iloc[0].answer_orig

In [5]:
llm_emb = embedding_model.encode(answer_llm)
# orig_emb = embedding_model.encode(answer_orig)

In [6]:
llm_emb[0]

-0.42244655

## Q2. Computing the dot product


Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into the `evaluations` list.

What's the 75th percentile of the scores?

* 21.67
* 31.67
* 41.67
* 51.67
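As a quick illustration of how a percentile can be read off a list of scores, here is a minimal sketch. The values in `toy_scores` are made up for the example and are not homework results; `np.percentile` uses linear interpolation by default, which is also the default behind the `75%` row of `pd.Series.describe()`.

```python
import numpy as np

# Hypothetical scores standing in for the real `evaluations` list.
toy_scores = [10.0, 20.0, 30.0, 40.0, 50.0]

# 75th percentile: the value below which 75% of the scores fall.
p75 = np.percentile(toy_scores, 75)
print(p75)
```

With the real data, the same call on `evaluations` should agree with the `75%` row reported by `describe()`.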

In [7]:
def compute_similarity(record:dict, model:SentenceTransformer):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = model.encode(answer_llm)
    v_orig = model.encode(answer_orig)
    
    return v_llm.dot(v_orig)

In [8]:
from tqdm import tqdm
evaluations = []

for record in tqdm(df.to_dict(orient='records')):
    sim = compute_similarity(record,embedding_model)
    evaluations.append(sim)

100%|██████████| 300/300 [01:23<00:00,  3.59it/s]


In [9]:
pd.Series(evaluations).describe()

count    300.000000
mean      27.495996
std        6.384742
min        4.547923
25%       24.307844
50%       28.336870
75%       31.674309
max       39.476013
dtype: float64

## Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.

So we need to normalize them.

To do this, we:

* Compute the norm of a vector
* Divide each element by this norm

So, for vector `v`, it'll be `v / ||v||`

In numpy, this is how you do it:

```python
norm = np.sqrt((v * v).sum())
v_norm = v / norm
```

Let's put it into a function and then compute the dot product
between the normalized vectors. This will give us the cosine similarity.
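To see why this works, here is a small sketch on two hand-picked 2D vectors (toy values, not model embeddings). It checks that the dot product of the normalized vectors equals the usual cosine formula `u·w / (||u|| ||w||)`. In general, cosine similarity lies in [-1, 1].

```python
import numpy as np

def normalize(v):
    # Divide the vector by its Euclidean norm.
    norm = np.sqrt((v * v).sum())
    return v / norm

# Toy vectors chosen so that both norms equal 5.
u = np.array([3.0, 4.0])
w = np.array([4.0, 3.0])

raw_dot = u.dot(w)                        # unbounded: depends on vector lengths
cosine = normalize(u).dot(normalize(w))   # depends only on the angle
reference = u.dot(w) / (np.linalg.norm(u) * np.linalg.norm(w))

print(raw_dot, cosine, reference)
```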

What's the 75th percentile of the cosine scores?

* 0.63
* 0.73
* 0.83
* 0.93

In [10]:
import numpy as np
def normalise(v):
    norm = np.sqrt((v * v).sum())
    v_norm = v / norm
    return v_norm

def compute_normalised_similarity(record:dict, model:SentenceTransformer):
    answer_orig = record['answer_orig']
    answer_llm = record['answer_llm']
    
    v_llm = normalise(model.encode(answer_llm))
    v_orig = normalise(model.encode(answer_orig))

    return v_llm.dot(v_orig)


norm_evaluations = []

for record in tqdm(df.to_dict(orient='records')):
    sim = compute_normalised_similarity(record, embedding_model)
    norm_evaluations.append(sim)

100%|██████████| 300/300 [01:23<00:00,  3.59it/s]


In [11]:
pd.Series(norm_evaluations).describe()

count    300.000000
mean       0.728393
std        0.157755
min        0.125357
25%        0.651273
50%        0.763761
75%        0.836235
max        0.958796
dtype: float64

## Q4. Rouge

Now we will explore an alternative metric - the ROUGE score.  

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than just cosine similarity alone.

We don't need to implement it ourselves, as there's a Python package for it:

```bash
pip install rouge
```

(The latest version at the moment of writing is `1.0.1`)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (`doc_id=5170565b`):

```python
from rouge import Rouge
rouge_scorer = Rouge()

r = df.iloc[10]  # the record at index 10
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
```

There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.

* `rouge-1` - the overlap of unigrams,
* `rouge-2` - bigrams,
* `rouge-l` - the longest common subsequence
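To make the unigram case concrete, here is a simplified sketch of a ROUGE-1-style computation on two toy sentences. The function `unigram_overlap_scores` is a hypothetical helper written for this illustration. It is not how the `rouge` package is implemented, and the package may tokenize and count differently, so the numbers are not guaranteed to match.

```python
from collections import Counter

def unigram_overlap_scores(hypothesis: str, reference: str) -> dict:
    """Simplified ROUGE-1-style scores from lowercase whitespace tokens."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return {"r": 0.0, "p": 0.0, "f": 0.0}

    precision = overlap / sum(hyp_counts.values())  # share of hypothesis tokens matched
    recall = overlap / sum(ref_counts.values())     # share of reference tokens matched
    f1 = 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}

toy_scores = unigram_overlap_scores("the cat sat on the mat", "the cat lay on a mat")
print(toy_scores)
```

Here 4 of the 6 tokens overlap on each side, so precision, recall and F1 all come out to 2/3.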

What's the F score for `rouge-1`?

- 0.35
- 0.45
- 0.55
- 0.65

In [12]:
!pip install rouge==1.0.1 -q


In [13]:
from rouge import Rouge
rouge_scorer = Rouge()

In [14]:
r = df.iloc[10]

In [15]:
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]

In [16]:
scores

{'rouge-1': {'r': 0.45454545454545453,
  'p': 0.45454545454545453,
  'f': 0.45454544954545456},
 'rouge-2': {'r': 0.21621621621621623,
  'p': 0.21621621621621623,
  'f': 0.21621621121621637},
 'rouge-l': {'r': 0.3939393939393939,
  'p': 0.3939393939393939,
  'f': 0.393939388939394}}

## Q5. Average rouge score

Let's compute the average of the F scores for `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4.

- 0.35
- 0.45
- 0.55
- 0.65



In [19]:
f_vals = (scores['rouge-1']['f'],scores['rouge-2']['f'],scores['rouge-l']['f'])
sum(f_vals)/len(f_vals)

0.35490034990035496

## Q6. Average rouge score for all the data points

Now let's compute the score for all the records

```python
rouge_1 = scores['rouge-1']['f']
rouge_2 = scores['rouge-2']['f']
rouge_l = scores['rouge-l']['f']
rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
```

And create a dataframe from them
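One possible way to go from per-record scores to a dataframe is to collect one dictionary per record and pass the list to `pd.DataFrame`. The sketch below uses made-up scores in `toy_rows` (not homework results) to show the shape of the data and how a column mean is taken.

```python
import pandas as pd

# Hypothetical per-record F scores; in the homework they come from the ROUGE scorer.
toy_rows = [
    {"rouge_1": 0.5, "rouge_2": 0.25, "rouge_l": 0.375},
    {"rouge_1": 0.25, "rouge_2": 0.125, "rouge_l": 0.25},
]

# Add the per-record average of the three F scores.
for row in toy_rows:
    row["rouge_avg"] = (row["rouge_1"] + row["rouge_2"] + row["rouge_l"]) / 3

toy_df = pd.DataFrame(toy_rows)
mean_rouge_2 = toy_df["rouge_2"].mean()
print(mean_rouge_2)
```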

What's the average `rouge_2` across all the records?

- 0.10
- 0.20
- 0.30
- 0.40

In [30]:
rouge_scorer = Rouge()

def rouge_scores(answer_llm, answer_orig, rouge_scorer=rouge_scorer):
    # F scores for one answer pair, plus their average
    scores = rouge_scorer.get_scores(answer_llm, answer_orig)[0]
    rouge_1 = scores['rouge-1']['f']
    rouge_2 = scores['rouge-2']['f']
    rouge_l = scores['rouge-l']['f']
    rouge_avg = (rouge_1 + rouge_2 + rouge_l) / 3
    return pd.Series(
        [rouge_1, rouge_2, rouge_l, rouge_avg],
        index=['rouge_1', 'rouge_2', 'rouge_l', 'rouge_avg'],
    )


# One row of scores per record in df
r = df.apply(lambda r: rouge_scores(r.answer_llm, r.answer_orig), axis=1, result_type='expand')



In [36]:
r.describe()

Unnamed: 0,rouge_1,rouge_2,rouge_l,rouge_avg
count,300.0,300.0,300.0,300.0
mean,0.378844,0.206965,0.353807,0.313205
std,0.165977,0.15355,0.162965,0.158133
min,0.0,0.0,0.0,0.0
25%,0.261625,0.097809,0.228032,0.197358
50%,0.378762,0.178671,0.337792,0.29864
75%,0.479281,0.286181,0.451613,0.404169
max,0.85,0.73913,0.85,0.813043
