## 🧩 Evaluation of Text Generation

In this notebook, we'll continue to **look at text generation**, this time with a focus on **evaluation**.

Evaluation of Text Generation is a central challenge in natural language generation research. How we evaluate systems shapes not only model comparison but also our very definition of what counts as a “good” output. In this session, you will explore different types of evaluation metrics to uncover their strengths, limitations, and inherent biases.

The goal of this notebook is **not to chase perfect evaluation scores**, but to **experiment** and **build intuition** about how evaluation of text generation actually works.  

You’re encouraged to:  
- Try out different evaluation metrics from various classes,  
- Compare how they rate the same model outputs, and  
- Reflect on when and why these metrics agree—or fail to.  

In this notebook, we’ll focus on two main types of evaluation metrics:  
(a) **Content-overlap metrics**   
(b) **Model-based metrics**.  

Beyond these automatic methods, you’re also encouraged to **manually evaluate** some generated outputs yourself — observe which metrics best align with your own intuition about quality.

By the end of this notebook, you’ll have a practical understanding of how to **evaluate text generation models** and a clearer sense of **what good evaluation really means**.

### 🧮 The `evaluate` Library

[`evaluate`](https://huggingface.co/docs/evaluate) is a lightweight library from Hugging Face that provides a unified interface for computing a wide range of NLP evaluation metrics, from classic ones like **BLEU** and **Perplexity** to modern model-based metrics such as **BERTScore** and **COMET**.

Beyond text, it also covers other modalities such as computer vision and audio, and it includes tools for evaluating both models and datasets.

You can browse more metrics here: https://huggingface.co/evaluate-metric/spaces

Each metric is a separate Python module, but all of them are loaded through a single entry point: `evaluate.load()`.

### 📏 Content-overlap Metrics

**Content-overlap metrics** evaluate how closely a generated text matches one or more reference texts by comparing their surface forms — typically through word or n-gram overlap.  

These metrics are simple, interpretable, and fast to compute, but they often fail to capture deeper semantic meaning or paraphrasing.
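To make the idea of surface overlap concrete, here is a minimal standard-library sketch. The helper names (`ngrams`, `overlap`), the naive tokenization, and the example sentences are illustrative assumptions, not part of the `evaluate` library.

```python
from collections import Counter


def ngrams(text, n):
    """Count word n-grams after lowercasing and naive whitespace tokenization."""
    tokens = text.lower().replace(".", "").split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate, reference, n):
    """Number of candidate n-grams that also occur in the reference.

    Counter intersection clips each n-gram at its count in the reference.
    """
    return sum((ngrams(candidate, n) & ngrams(reference, n)).values())


# Illustrative sentences (not from the exercises below).
reference = "He bought a new car yesterday."
paraphrase = "Yesterday he purchased a brand new automobile."

print("shared unigrams:", overlap(paraphrase, reference, 1))
print("shared bigrams: ", overlap(paraphrase, reference, 2))
```

The two sentences mean roughly the same thing, yet they share only a few single words and no bigram at all. This is the kind of case where overlap-based metrics undervalue a valid paraphrase.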

In this notebook, we’ll focus on two of the most widely used metrics in this category:  

- **BLEU**: computes the overlap of n-grams between generated and reference texts, and is widely used in **machine translation** and **summarization**.  

- **ROUGE**: measures the overlap of n-grams, words, or word sequences, but is especially designed for **summarization** tasks, focusing on recall rather than precision.


### BLEU

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: “the closer a machine translation is to a professional human translation, the better it is” – this is the central idea behind BLEU. 

BLEU and BLEU-derived metrics are most often used for machine translation.
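For intuition about what goes into the score, the following is a simplified sentence-level sketch of BLEU's two standard ingredients: clipped n-gram precision and a brevity penalty. It assumes a single reference, whitespace tokenization, and no smoothing, and `simple_bleu` is a hypothetical helper. The `bleu` metric in `evaluate` uses its own tokenizer and aggregates over the whole corpus, so its numbers may differ.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: one reference, no smoothing (an assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        matched = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # a single zero precision drives the geometric mean to zero
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


ref = "the cat is on the mat"
cand = "the cat sat on the mat"

print(simple_bleu(ref, ref))             # identical sentences
print(simple_bleu(cand, ref, max_n=2))   # unigrams and bigrams only
print(simple_bleu(cand, ref))            # up to 4-grams: no 4-gram matches
print(simple_bleu("the cat", ref, max_n=2))  # very short candidate
```

Changing a single word in a short sentence is enough to remove every matching 4-gram. This is why unsmoothed BLEU is usually reported at corpus level rather than per sentence.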



First run `pip install evaluate` to install the library. Then you can load a metric like this:

In [None]:
import evaluate

bleu = evaluate.load("bleu")

Here are some example texts:



In [None]:
predictions = [
    "The cat is on the mat.",
    "There is a cat sitting on the carpet."
]
references = [
    ["The cat sits on the mat."],
    ["A cat is on the carpet."]
]


------------
**`TODO:`** Compute the BLEU score for the `predictions` list against the `references` list using the `bleu.compute()` function.  
- Use the `predictions` variable as the input for the `predictions` parameter.  
- Use the `references` variable as the input for the `references` parameter.  
- Store the result in a variable named `results`.  
- Print the BLEU score from the `results` dictionary using the key `'bleu'`.  

This will help you understand how BLEU evaluates the overlap between the generated and reference texts.

-----

BLEU also has several limitations, which we’ll illustrate through examples.

In [None]:
references = [["The cat is on the mat."]]
predictions_good = ["A cat sits on the rug."]    
predictions_bad = ["The cat is not on the mat."]  


----
**`TODO:`** Compute the BLEU scores of `predictions_good` and `predictions_bad` against the `references`, and **discuss** why this happens. What limitation of BLEU does it reveal?

[Your Answer]

----

### ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics and a software package for evaluating automatic summarization and machine translation in natural language processing. The metrics compare an automatically produced summary or translation against one or more human-produced reference summaries or translations. ROUGE scores range between 0 and 1, with higher scores indicating higher similarity between the automatically produced text and the reference.



This metric is a wrapper around the Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

Unlike BLEU, which focuses on precision, ROUGE emphasizes **recall** — how much of the reference text’s content is captured in the generated text.

Here are the main variants you’ll encounter:

- **ROUGE-1**: Measures the overlap of individual words (unigrams) between the prediction and reference.  
  → Captures basic lexical similarity.

- **ROUGE-2**: Measures the overlap of 2-word sequences (bigrams).  
  → Reflects fluency and short-phrase consistency.

- **ROUGE-L**: Based on the *Longest Common Subsequence (LCS)* between prediction and reference.  
  → Captures sentence-level structure and word order similarity.

- **ROUGE-Lsum**: A summary-level variant of ROUGE-L that splits the texts into sentences (on newlines in the Google Research implementation), computes LCS matches sentence by sentence, and aggregates them over the whole summary.  
  → More suitable for multi-sentence summarization tasks.

💡 **Interpretation:**  
Higher ROUGE scores generally indicate better content overlap with the reference, but like BLEU, ROUGE is still surface-based and does not measure semantic correctness or factual accuracy.
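The variants above can be sketched in a few lines of plain Python. This is a simplified illustration with naive tokenization and a single reference, and the helper names (`rouge_n`, `rouge_l`, `lcs_length`) and the example sentences are assumptions. The `rouge` metric in `evaluate` applies its own tokenization and options, so its numbers may differ.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase, drop periods, split on whitespace."""
    return text.lower().replace(".", "").split()


def prf(matches, n_pred, n_ref):
    """Precision, recall, and F1 from a match count."""
    precision = matches / n_pred if n_pred else 0.0
    recall = matches / n_ref if n_ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rouge_n(prediction, reference, n=1):
    """ROUGE-N sketch: clipped n-gram overlap."""
    pred, ref = tokenize(prediction), tokenize(reference)
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matches = sum((pred_ngrams & ref_ngrams).values())
    return prf(matches, sum(pred_ngrams.values()), sum(ref_ngrams.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(prediction, reference):
    """ROUGE-L sketch: LCS length relative to each text's length."""
    pred, ref = tokenize(prediction), tokenize(reference)
    return prf(lcs_length(pred, ref), len(pred), len(ref))


# Illustrative pair: same words, reversed meaning.
ref_text = "The police killed the gunman."
pred_text = "The gunman killed the police."

for name, scores in [("ROUGE-1", rouge_n(pred_text, ref_text, 1)),
                     ("ROUGE-2", rouge_n(pred_text, ref_text, 2)),
                     ("ROUGE-L", rouge_l(pred_text, ref_text))]:
    print(name, "P=%.2f R=%.2f F1=%.2f" % scores)
```

Here ROUGE-1 is perfect even though the two sentences describe opposite events. ROUGE-2 and ROUGE-L drop somewhat because they are sensitive to word order, but they still reward the shared surface form.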

Here is the example we will use:

In [None]:
reference = "The cat is on the mat."

prediction_normal = "The cat is on the mat."
prediction_paraphrase = "A cat sits on the rug."
prediction_negated = "The cat is not on the mat."

----

**`TODO:`** Use the example above to compute **ROUGE** scores for your generated outputs.



1. **Load the ROUGE metric** using `evaluate.load("rouge")`.

2. **Compute ROUGE** for your different predictions (e.g., `prediction_normal`, `prediction_paraphrase`, and `prediction_negated`) against the same reference.

3. **Access specific ROUGE variants** such as ROUGE-1, ROUGE-2, ROUGE-L, or ROUGE-Lsum by indexing the result dictionary (e.g., `rouge.compute(...)['rouge1']`).

4. **Compare the results** across the three predictions:  
   - How do the scores differ between exact matches, paraphrases, and negated sentences?  
   - Do higher ROUGE scores always correspond to better or more semantically accurate outputs?

5. **Discuss your findings:**  
   Consider where ROUGE may fail to capture semantic equivalence or meaning preservation.


----

[Your Answer]

### Model-based Metrics: BERTScore

**BERTScore** is a *model-based evaluation metric* that measures the similarity between generated and reference texts using contextual embeddings from pretrained language models such as **BERT** or **RoBERTa**.  

Instead of comparing surface-level n-gram overlap (like BLEU or ROUGE), BERTScore computes **semantic similarity** between words by aligning their embeddings in a high-dimensional space.  It captures meaning even when different words or phrases are used.


💡 **Note:**  
BERTScore relies on a large pretrained model, so it is computationally heavier than BLEU or ROUGE, but it provides a more meaning-aware evaluation of generated text.

You can refer to Tianyi’s paper for more details (but unfortunately, it’s a different Tianyi — not me 🤡): https://arxiv.org/pdf/1904.09675


We reuse the same example:

In [None]:
reference = "The cat is on the mat."

prediction_normal = "The cat is on the mat."
prediction_paraphrase = "A cat sits on the rug."
prediction_negated = "The cat is not on the mat."

BERTScore uses contextual embeddings from a pretrained model (like BERT) to measure **semantic similarity** between tokens in the prediction and reference sentences.  
Here’s how each component is computed:

1. **Token-level similarity:**  
   Each token is represented as a vector embedding.  
   The similarity between two tokens is measured using **cosine similarity**.

2. **Precision (P):**  
   For each token in the *prediction*, find the **most similar** token in the *reference*,  
   then take the **average** of these maximum similarities.  


3. **Recall (R):**  
   For each token in the *reference*, find the **most similar** token in the *prediction*,  
   then take the average of these maximum similarities.  

4. **F1 Score:**  
   The harmonic mean of Precision and Recall, capturing overall alignment between prediction and reference.


💡 **Intuition:**  
- Precision measures how *relevant* the generated tokens are to the reference.  
- Recall measures how much of the reference meaning is *covered* by the generation.  
- F1 balances both — higher F1 indicates stronger semantic similarity overall.
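The greedy matching described above can be sketched without any pretrained model. In the snippet below, the 3-dimensional vectors in `toy_embeddings` are made-up stand-ins for contextual embeddings and `greedy_match_score` is a hypothetical helper. Real BERTScore obtains its vectors from a model such as BERT, so this only illustrates the arithmetic.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(pred_vecs, ref_vecs):
    """Precision, recall, and F1 via greedy best-match cosine similarity."""
    # Precision: each prediction token takes its most similar reference token.
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: each reference token takes its most similar prediction token.
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up embeddings (assumption): related words point in similar directions.
toy_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.8, 0.6, 0.0],
    "sleeps": [0.0, 0.0, 1.0],
    "naps": [0.0, 0.6, 0.8],
}

ref_vecs = [toy_embeddings[w] for w in ["cat", "sleeps"]]
full_pred = [toy_embeddings[w] for w in ["kitten", "naps"]]
short_pred = [toy_embeddings[w] for w in ["kitten"]]

print("full prediction: ", greedy_match_score(full_pred, ref_vecs))
print("short prediction:", greedy_match_score(short_pred, ref_vecs))
```

The full prediction shares no word with the reference, yet every token finds a close match, so precision and recall are both high. The short prediction keeps its precision but loses recall, because nothing in it covers "sleeps".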

----
**`TODO:`** Use the example above to compute **BERTScore** for your generated outputs.
You can access specific values via keys like `res['precision']`, `res['recall']`, and `res['f1']` (e.g., `res['f1'][0]`). You can use any `model_type` setting (e.g., `model_type="bert-base-uncased"`). Then compare the scores with BLEU and ROUGE: which metric aligns best with your intuition?

✅ Steps to follow:

1. **Load the BERTScore metric** using `evaluate.load("bertscore")`.

2. **Compute BERTScore** for your predictions (e.g., `prediction_normal`, `prediction_paraphrase`, and `prediction_negated`) against the same reference.  
   You can experiment with different model backbones, such as `model_type="bert-base-uncased"` or `model_type="roberta-large"`.

3. **Access individual score components** using keys like  
   `res['precision']`, `res['recall']`, and `res['f1']`  
   (for example, `res['f1'][0]` to view the F1 score of a single instance).

4. **Compare the results** across the three predictions and reflect on:  
   - Which metric—**BLEU**, **ROUGE**, or **BERTScore**—best matches your own intuition about semantic similarity?  
   - Are there cases where BERTScore provides a more meaningful evaluation than surface-level metrics?



-----


-----
**`TODO:`** Discussion: Why does `prediction_negated` still get a high score? How could the evaluation be improved?

[Your Answer]



----