# 5. Choosing, Loading and Using Metrics for Automatic Evaluation

Welcome to the 5th session of the LLM tutorial!

Congratulations! Through the previous sessions, you've learned how to use LLMs to generate text in various creative and effective ways (e.g., using prompts, RAG, etc.).

In the final two sessions, we will focus on the evaluation of LLMs, exploring how to evaluate models efficiently and effectively.

The goal of **this session** is to learn how to use common automatic metrics to assess the performance of LLMs.

## 5.1 Introduction to Automatic Evaluation Metrics

**Automatic metrics** are computed measures of NLP model performance that require **no human intervention** during evaluation.

These metrics can be categorized along several dimensions:
- **Reference-based** vs. **Reference-free**:
    - **Reference-based metrics** (e.g., BLEU, TER) compare the generated text against reference texts to assess similarity.
    - **Reference-free metrics** evaluate the generated text directly, without requiring reference texts. These can focus on:
        - **Fluency and linguistic quality** (e.g., Perplexity), which measures how well a language model predicts the generated text; lower perplexity means the text looks more probable to that model.
        - **Input alignment** (e.g., entailment detection or faithfulness metrics), which directly evaluates the relationship between the input and the output without comparing against reference texts.
- **Design structure (Pattern-matched vs. Semantics-based)**:
    - **Pattern-matched metrics** assess surface-level overlap between generated and reference texts, such as n-gram matches (e.g., BLEU, ROUGE).
    - **Semantics-based metrics** evaluate the semantic similarity of the texts, focusing on meaning rather than exact word matching (e.g., BERTScore).
- **Task-specific metrics**:
    - These are tailored for specific NLP tasks, such as BLEU for machine translation, ROUGE for text summarization, and Accuracy for classification/retrieval problems.
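
To make the reference-free idea concrete, here is a minimal sketch of how perplexity is computed from per-token probabilities. The probabilities below are made up for illustration (no real language model is queried), and `perplexity` is a hypothetical helper name, not part of any library used in this session.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities that some language model assigned to each token
fluent = [0.5, 0.5, 0.5, 0.5]   # every token was fairly expected
odd = [0.5, 0.01, 0.5, 0.01]    # two tokens were very unexpected

ppl_fluent = perplexity(fluent)
ppl_odd = perplexity(odd)
print(f"fluent: {ppl_fluent:.2f}, odd: {ppl_odd:.2f}")
```

Lower values mean the model found the text more predictable. Note that no reference text is involved at any point.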

It's important to emphasize that **there is no perfect metric**: each has its limitations. Therefore, it is crucial to select metrics based on the specific requirements and goals of your task.

In this session, we will explore how to use different metrics from the Hugging Face Evaluate library to evaluate a machine translation model (EN-FR).

**Note.** Before you start running the code, please activate one GPU: runtime -> change runtime type -> T4 ([How to activate GPU in Colab?](https://saturncloud.io/blog/how-to-activate-gpu-computing-in-google-colab/))

## 5.2 Environment Setup

In [None]:
!pip install -q datasets
!pip install -q transformers
!pip install -q evaluate
!pip install -q tqdm


## 5.3 Load an EN-FR dataset
We will use a dataset with translations from English to French. You can preview the data [here](https://huggingface.co/datasets/Helsinki-NLP/opus_books/viewer/en-fr).

In [None]:
from datasets import load_dataset

# Load En-Fr Dataset
data = load_dataset("Helsinki-NLP/opus_books", "en-fr")

# Load part of the dataset to speed up the evaluation (optional)
data_portion = load_dataset("Helsinki-NLP/opus_books", "en-fr", split="train[:100]")

# Show data structure
print(data)

# Print the first instance of data
print(data['train'][0])


DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 127085
    })
})
{'id': '0', 'translation': {'en': 'The Wanderer', 'fr': 'Le grand Meaulnes'}}


## 5.4 Load an LLM for translation

In [None]:
from transformers import AutoTokenizer
from transformers import pipeline
import torch
# Check if GPU is available
print("GPU Available:", torch.cuda.is_available())

# Set up model name (Feel free to try different models)
model_name = "google-t5/t5-small"

# Define translator
if torch.cuda.is_available():
    translator = pipeline("translation_en_to_fr", model=model_name, device=0)
else:
    translator = pipeline("translation_en_to_fr", model=model_name)

GPU Available: True



Here is an example that translates a few sentences.

In [None]:
# Example list
texts = ["Good morning", "How are you?", "This is a test for batch translation."]

# Perform batch translation
translations = translator(texts)

# Print translations
for i, translation in enumerate(translations):
    print(f"Original: {texts[i]}")
    print(f"Translated: {translation['translation_text']}\n")

Original: Good morning
Translated: Bonjour

Original: How are you?
Translated: Comment êtes-vous?

Original: This is a test for batch translation.
Translated: Il s'agit d'un test pour la traduction par lots.



In [None]:
# Translate the English side of the dataset into French
from tqdm import tqdm

def translation_en_to_fr(data, translator, batch_size=16):
    # Collect the English sources and French references into parallel lists
    text_en_list = [i['translation']['en'] for i in data]
    ref_fr_list = [i['translation']['fr'] for i in data]

    # Translate in batches, with tqdm showing progress
    french_translation = []
    for i in tqdm(range(0, len(text_en_list), batch_size)):
        batch_texts = text_en_list[i:i+batch_size]
        batch_translations = translator(batch_texts)
        french_translation.extend(batch_translations)

    # french_translation is a list of dicts with a 'translation_text' key
    return text_en_list, ref_fr_list, french_translation

In [None]:
text_en, ref_fr, predict_fr = translation_en_to_fr(data_portion, translator)

100%|██████████| 7/7 [01:01<00:00,  8.79s/it]


In [None]:
for i in range(6, 10):
    print(f"English: {text_en[i]}")
    print(f"Reference: {ref_fr[i]}")
    print(f"Predicted: {predict_fr[i]['translation_text']}\n")

English: I still say 'our home,' although the house no longer belongs to us.
Reference: Je continue à dire « chez nous », bien que la maison ne nous appartienne plus.
Predicted: Je dirais encore « notre maison », même si la maison ne nous appartient plus.

English: We left that part of the country nearly fifteen years ago and shall certainly never go back to it.
Reference: Nous avons quitté le pays depuis bientôt quinze ans et nous n’y reviendrons certainement jamais.
Predicted: Nous avons quitté cette partie du pays il y a près de quinze ans et ne reviendrons certainement jamais à celle-ci.

English: We were living in the building of the Higher Elementary Classes at Sainte-Agathe's School.
Reference: Nous habitions les bâtiments du Cours Supérieur de Sainte-Agathe.
Predicted: Nous vivions dans l'édifice des classes élémentaires supérieures de l'école Sainte-Agathe's.

English: My father, whom I used to call M. Seurel as did other pupils, was head of the Middle School and also of the H

## 5.5 Evaluate

### 5.5.1 SacreBLEU

In [None]:
!pip install -q sacrebleu

import evaluate

metric = evaluate.load("sacrebleu")


In [None]:
def postprocess_text(preds, refs):
    # Extract the translated strings from the pipeline output and strip surrounding whitespace
    preds = [pred['translation_text'].strip() for pred in preds]
    refs = [ref.strip() for ref in refs]

    return preds, refs

In [None]:
def compute_bleu_metrics(predict_fr, ref_fr):
    predict_fr, ref_fr = postprocess_text(predict_fr, ref_fr)

    result = metric.compute(predictions=predict_fr, references=ref_fr)
    result = {"sacrebleu": result["score"]}

    return result

In [None]:
compute_bleu_metrics(predict_fr, ref_fr)

{'sacrebleu': 15.84051049276102}
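
To give a feel for what a score on this 0 to 100 scale reflects, below is a simplified, sentence-level sketch of the BLEU idea: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. This is an illustration only, not SacreBLEU's implementation. It assumes whitespace tokenization and no smoothing, and `simple_bleu` is a hypothetical helper name.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(prediction, reference, max_n=4):
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clip each predicted n-gram count by its count in the reference
        matches = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        total = sum(pred_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize predictions shorter than the reference
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

ref = "la maison ne nous appartient plus"
exact_score = simple_bleu("la maison ne nous appartient plus", ref)
near_score = simple_bleu("la maison ne nous appartient pas", ref)
para_score = simple_bleu("la maison n'est plus à nous", ref)
print(exact_score, round(near_score, 2), para_score)
```

The paraphrase shares no trigram with the reference, so this unsmoothed sketch scores it 0 even though the meaning is close. This is the kind of limitation that motivates semantics-based metrics.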

### 5.5.2 METEOR

In [None]:
def compute_meteor_metrics(predict_fr, ref_fr):
    predict_fr, ref_fr = postprocess_text(predict_fr, ref_fr)

    metric = evaluate.load("meteor")
    result = metric.compute(predictions=predict_fr, references=ref_fr)

    return result

In [None]:
compute_meteor_metrics(predict_fr, ref_fr)


{'meteor': 0.3671263141061779}

### 5.5.3 BERTScore

In [None]:
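!pip install -q bert_score

def compute_bertscore_metrics(predict_fr, ref_fr):
    predict_fr, ref_fr = postprocess_text(predict_fr, ref_fr)
    metric = evaluate.load("bertscore")
    # Note: distilbert-base-uncased is an English model. For French text, a multilingual
    # model_type (or lang="fr", which lets bert_score pick a default) may be a better fit.
    result = metric.compute(predictions=predict_fr, references=ref_fr, model_type="distilbert-base-uncased")

    # Average the per-sentence scores into corpus-level precision, recall, and F1
    return sum(result['precision'])/len(result['precision']), sum(result['recall'])/len(result['recall']), sum(result['f1'])/len(result['f1'])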


In [None]:
precision, recall, f1 = compute_bertscore_metrics(predict_fr, ref_fr)
print(f"Precision: {precision}, Recall: {recall}, F1: {f1}")


Precision: 0.887796248793602, Recall: 0.8896603459119796, F1: 0.8883174055814743
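
The three numbers above come from matching token embeddings rather than surface strings. As a rough sketch of the idea: precision averages, for each predicted token, its best cosine similarity to any reference token; recall does the same from the reference side; F1 is their harmonic mean. The example below uses made-up 2-D vectors in place of real BERT embeddings and leaves out options such as IDF weighting, so treat it as an illustration rather than the `bert_score` implementation. `greedy_match_scores` is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(pred_vecs, ref_vecs):
    # Precision: each predicted token is matched to its most similar reference token
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: each reference token is matched to its most similar predicted token
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical 2-D "token embeddings" (real models use hundreds of dimensions)
pred_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f = greedy_match_scores(pred_vecs, ref_vecs)
print(f"Precision: {p:.4f}, Recall: {r:.4f}, F1: {f:.4f}")
```

Here every predicted token has a perfect match in the reference, so precision is 1, while the third reference token is only partly covered, which lowers recall.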


## 5.6 Exercise

In the previous sections, we demonstrated how to evaluate translation models using the Evaluate library. However, for real-world projects, **I strongly recommend consulting the official documentation or GitHub repositories of the metrics you intend to use.** Metric implementations bundled with Evaluate can lag behind their upstream releases: for instance, a metric might already be at version 1.5 upstream while Evaluate still ships version 1.1. Additionally, the Evaluate library supports only a limited selection of commonly used metrics.

In the next exercise, we encourage you to explore **reference-less** metrics to evaluate the performance of a translation model on the same En-Fr dataset. This means assessing the model's translation quality without relying on French reference translations. Instead, you will compare the model's French outputs directly with the original English inputs. You may need to explore libraries and resources beyond the Evaluate library to achieve this.

Below are some suggested reference-less metrics you can explore (you are also welcome to choose others):

- Prism: https://github.com/thompsonb/prism
- NMTScore: https://github.com/ZurichNLP/nmtscore/tree/master
- CometKiwi: https://github.com/Unbabel/COMET
- SBERT (Multilingual embedding model + cosine similarity): https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#multilingual-models

Feel free to explore, experiment, and apply these or other suitable reference-less metrics to evaluate your translation models!


### Using SBERT as the reference-less metric

In [None]:
from sentence_transformers import SentenceTransformer
import torch


In [None]:
# Load the SBERT model
sbert_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')



In [None]:
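# Function to get sentence embeddings
def get_sentence_embeddings(sentences):
    return sbert_model.encode(sentences, convert_to_tensor=True)

# The pipeline output is a list of dicts, so extract the translated strings first
predict_fr_texts = [pred['translation_text'] for pred in predict_fr]

# Get embeddings for the original and translated texts
original_embeddings = get_sentence_embeddings(text_en)
translated_embeddings = get_sentence_embeddings(predict_fr_texts)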


In [None]:
# Function to calculate the cosine similarity between each source sentence and its own translation (row by row)
def calculate_cosine_similarities(embeddings1, embeddings2):
    return torch.nn.functional.cosine_similarity(embeddings1, embeddings2)

# Calculate the cosine similarities
similarities = calculate_cosine_similarities(original_embeddings, translated_embeddings)
for i, similarity in enumerate(similarities):
    print(f"Sentence {i+1} Cosine Similarity: {similarity.item()}")


Sentence 1 Cosine Similarity: 0.838951826095581
Sentence 2 Cosine Similarity: 1.0000001192092896
Sentence 3 Cosine Similarity: 0.9229799509048462
Sentence 4 Cosine Similarity: 1.0
Sentence 5 Cosine Similarity: 0.6841739416122437
Sentence 6 Cosine Similarity: 0.9802467823028564
Sentence 7 Cosine Similarity: 0.9014228582382202
Sentence 8 Cosine Similarity: 0.9625925421714783
Sentence 9 Cosine Similarity: 0.9660935997962952
Sentence 10 Cosine Similarity: 0.9449139833450317
Sentence 11 Cosine Similarity: 0.9762867093086243
Sentence 12 Cosine Similarity: 0.8934643268585205
Sentence 13 Cosine Similarity: 0.8394966125488281
Sentence 14 Cosine Similarity: 0.8939528465270996
Sentence 15 Cosine Similarity: 0.8193341493606567
Sentence 16 Cosine Similarity: 0.8755040168762207
Sentence 17 Cosine Similarity: 0.9120951890945435
Sentence 18 Cosine Similarity: 0.8223617672920227
Sentence 19 Cosine Similarity: 0.898890495300293
Sentence 20 Cosine Similarity: 0.9333308935165405
Sentence 21 Cosine Similar

In [None]:
# Calculate the average cosine similarity
average_similarity = torch.mean(similarities).item()
print(f"Average Cosine Similarity: {average_similarity}")

Average Cosine Similarity: 0.8884351849555969
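
One caveat when reading these numbers: a similarity of about 1.0 (as for sentences 2 and 4 above) may simply mean that the model copied the source text unchanged, which can happen with proper names or very short inputs, and an embedding-based score cannot tell such a copy from a good translation. The sketch below flags those cases so they can be inspected separately. `flag_copied_outputs` is a hypothetical helper, and the toy data is made up rather than taken from the run above.

```python
def flag_copied_outputs(sources, outputs):
    """Return indices where the output equals the source after trimming and lowercasing."""
    return [
        i for i, (src, out) in enumerate(zip(sources, outputs))
        if src.strip().lower() == out.strip().lower()
    ]

# Hypothetical toy data for illustration
sources = ["Good morning", "Alain-Fournier", "How are you?"]
outputs = ["Bonjour", "Alain-Fournier", "Comment allez-vous ?"]

copied = flag_copied_outputs(sources, outputs)
print(copied)
```

In this notebook, the equivalent call would be `flag_copied_outputs(text_en, [p['translation_text'] for p in predict_fr])`.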
