#### **Evaluation Metrics for Text Generation**

Aim: A text generation model aims to produce human-like text.

Standard classification metrics like accuracy and F1 score fall short for these tasks, because many different outputs can be equally valid for the same input.

Therefore, we use metrics like -
- BLEU (Bilingual Evaluation Understudy)
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

#### **BLEU**

BLEU compares the generated text against one or more reference texts and measures how many of the generated n-grams also occur in a reference (n-gram precision). Outputs that are shorter than the reference are penalized.

In the sentence "The cat is on the mat" (lowercased),
- 1-grams (unigrams) are [`the`, `cat`, `is`, `on`, `the`, `mat`]
- 2-grams (bigrams) are [`the cat`, `cat is`, `is on`, `on the`, `the mat`]
- and so on for larger values of n
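
The n-gram lists above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `ngrams` is made up for illustration, and it assumes simple lowercasing and whitespace tokenization.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: lowercase + whitespace split is enough for this toy sentence
tokens = "The Cat is on the mat".lower().split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print([" ".join(g) for g in unigrams])  # ['the', 'cat', 'is', 'on', 'the', 'mat']
print([" ".join(g) for g in bigrams])   # ['the cat', 'cat is', 'is on', 'on the', 'the mat']
```

A sentence with `k` tokens has `k - n + 1` n-grams, so asking for an n larger than the sentence length returns an empty list.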

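A BLEU score of `1.0` means the generated text matches a reference exactly, \
while a score of `0` means no match. With the default unsmoothed 4-gram setting, the score is also `0` whenever no 4-gram matches.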

In [3]:
# BLEU score with TorchMetrics

from torchmetrics.text import BLEUScore

# One generated sentence, and a list of acceptable references for it
generated_text = ['the cat is on the mat']
real_text = [['there is a cat on the mat', 'a cat is on the mat']]

bleu = BLEUScore()
bleu_metric = bleu(generated_text, real_text)
print(f'BLEU Score: {bleu_metric.item():.4f}')

BLEU Score: 0.7598
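
To see where `0.7598` comes from, the score can be recomputed by hand. The sketch below assumes the textbook unsmoothed BLEU: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. The function names `ngram_counts` and `bleu_sketch` are illustrative and not part of TorchMetrics, whose tokenization details may differ on other inputs.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_sketch(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU (assumes whitespace tokenization)."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        # For each n-gram, keep its highest count in any single reference
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip candidate counts so repeated words cannot inflate precision
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference whose length is closest
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

references = ['there is a cat on the mat', 'a cat is on the mat']
score = bleu_sketch('the cat is on the mat', references)
print(f'{score:.4f}')  # 0.7598
```

For this example the clipped precisions are 5/6, 4/5, 3/4 and 2/3. Their product is 1/3, and the fourth root of 1/3 is about 0.7598. The brevity penalty is 1 because the closest reference has the same length as the generated text (6 tokens).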


#### **ROUGE**

It compares a generated text to a reference text in two ways -
- `ROUGE-N`: Considers overlapping n-grams (N=1 for unigrams, 2 for bigrams, etc.) in both texts.
- `ROUGE-L`: Looks at the longest common subsequence (LCS) between the texts.

For each variant, ROUGE reports three values -
- Precision: Fraction of the n-grams in the generated text that also appear in the reference text
- Recall: Fraction of the n-grams in the reference text that also appear in the generated text
- F-Measure: Harmonic mean of precision and recall

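In the output, the `rouge1`, `rouge2`, and `rougeL` prefixes refer to unigrams, bigrams, and LCS, respectively. `rougeLsum` is a summary-level variant of `rougeL` that is computed sentence by sentence, so it gives the same values as `rougeL` for a single-sentence input.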

In [1]:
from torchmetrics.text import ROUGEScore

generated_text = 'Hello, how are you doing?'
real_text = "Hello, how are you?"

rouge = ROUGEScore()

rouge_score = rouge([generated_text], [[real_text]])
print(f"ROUGE Score: {rouge_score}")

ROUGE Score: {'rouge1_fmeasure': tensor(0.8889), 'rouge1_precision': tensor(0.8000), 'rouge1_recall': tensor(1.), 'rouge2_fmeasure': tensor(0.8571), 'rouge2_precision': tensor(0.7500), 'rouge2_recall': tensor(1.), 'rougeL_fmeasure': tensor(0.8889), 'rougeL_precision': tensor(0.8000), 'rougeL_recall': tensor(1.), 'rougeLsum_fmeasure': tensor(0.8889), 'rougeLsum_precision': tensor(0.8000), 'rougeLsum_recall': tensor(1.)}
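
The numbers above can be checked by hand. The sketch below assumes a simple tokenizer (lowercase, letters and digits only), which matches the library's behaviour on this example but is not guaranteed to match it in general. The helper names `tokenize`, `prf`, `rouge_n`, `lcs_length` and `rouge_l` are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and keep only alphanumeric runs
    return re.findall(r"[a-z0-9]+", text.lower())

def prf(matches, pred_total, ref_total):
    """Precision, recall and F-measure from raw match counts."""
    precision = matches / pred_total if pred_total else 0.0
    recall = matches / ref_total if ref_total else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

def rouge_n(pred, ref, n):
    p, r = tokenize(pred), tokenize(ref)
    pred_counts = Counter(tuple(p[i:i + n]) for i in range(len(p) - n + 1))
    ref_counts = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
    # Counter intersection keeps the minimum count of each shared n-gram
    matches = sum((pred_counts & ref_counts).values())
    return prf(matches, sum(pred_counts.values()), sum(ref_counts.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(pred, ref):
    p, r = tokenize(pred), tokenize(ref)
    return prf(lcs_length(p, r), len(p), len(r))

generated = 'Hello, how are you doing?'
reference = 'Hello, how are you?'
r1 = rouge_n(generated, reference, 1)
r2 = rouge_n(generated, reference, 2)
rl = rouge_l(generated, reference)
print([f'{v:.4f}' for v in r1])  # ['0.8000', '1.0000', '0.8889']
print([f'{v:.4f}' for v in r2])  # ['0.7500', '1.0000', '0.8571']
print([f'{v:.4f}' for v in rl])  # ['0.8000', '1.0000', '0.8889']
```

Recall is 1.0 in every case because each reference token (and bigram) also appears in the generated text. Precision is lower because the extra word `doing` has no counterpart in the reference.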


#### **Limitations of BLEU and ROUGE**

- They measure surface word overlap without considering semantic meaning, so a valid paraphrase can score poorly.
- They are sensitive to the length of the generated text.
- The quality of reference text affects the scores.
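
The first limitation is easy to demonstrate with a toy overlap score. This is a rough sketch: `unigram_f1` is a hypothetical helper that mimics a ROUGE-1 style F-measure with whitespace tokenization, and the sentences are invented for illustration.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """ROUGE-1 style F-measure on lowercased, whitespace-split tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = 'the movie was excellent'
paraphrase = 'the film was great'          # same meaning, different words
negation = 'the movie was not excellent'   # opposite meaning, same words

paraphrase_score = unigram_f1(paraphrase, reference)
negation_score = unigram_f1(negation, reference)
print(f'{paraphrase_score:.4f}')  # 0.5000
print(f'{negation_score:.4f}')    # 0.8889
```

The sentence with the opposite meaning scores higher than the faithful paraphrase, because the metric only counts shared words.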