# Chapter 6: Summarization - Metrics deep-dive
In this notebook I am going to take a deep dive into the BLEU and ROUGE metrics.  
I am going to use the Hugging Face implementations of both.

In [2]:
# Uncomment and run this cell if loading this notebook for the first time
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
  Downloading evaluate-0.2.2-py3-none-any.whl (69 kB)
Collecting huggingface-hub>=0.7.0
  Downloading huggingface_hub-0.10.0-py3-none-any.whl (163 kB)
Collecting datasets>=2.0.0
  Downloading datasets-2.5.1-py3-none-any.whl (431 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.13-py37-none-any.whl (115 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  

In [3]:
import evaluate
import nltk
import pandas as pd
import numpy as np

In [4]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Recommended read > [click here](https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499)

## Introduction
In text generation (e.g., machine translation, summarization), finding an evaluation metric can be tricky.  
In either classification or regression, we ultimately calculate some sort of "deviation" of our prediction from the ground truth.  
When we generate text, however, it is often the case that our prediction, while not identical to the ground truth, is perfectly equivalent to it. In machine translation, for example, one sentence can be translated in slightly different ways that still carry the same meaning. The same holds for summarisation, where two possible summaries - while written differently - might be equally good.  
  
Given this context, the two metrics I am going to present in this notebook (BLEU and ROUGE) are built on a *common-sense* intuition, i.e., counting "similarities" between the generated text and (a set of) reference texts.  
The difference between the two lies in the task they target, with BLEU being a *precision-oriented* metric and ROUGE being a *recall-oriented* one (we will see that, in reality, there are multiple - blurred - implementations).  
In short, BLEU checks whether the "components" of the generated text (e.g., words, or n-grams more generally) appear in the reference text. The higher that proportion, the better. ROUGE, on the other hand, checks how many of the "components" of the reference text are in the generated text. This makes BLEU a metric typically used in machine translation (i.e., I want my translation to be precise) and ROUGE a metric used in summarisation: since the reference text (i.e., one of the possible summaries of a document) contains all the information I need, I would like that information to be present in the generated text.  
  
All of this may sound (very) cryptic, but let's review it step by step.

## BLEU
First of all, let me list all the available metrics in HuggingFace `evaluate`.

In [5]:
evaluate.list_evaluation_modules()

['lvwerra/test',
 'precision',
 'code_eval',
 'roc_auc',
 'cuad',
 'xnli',
 'rouge',
 'pearsonr',
 'mse',
 'super_glue',
 'comet',
 'cer',
 'sacrebleu',
 'mahalanobis',
 'wer',
 'competition_math',
 'f1',
 'recall',
 'coval',
 'mauve',
 'xtreme_s',
 'bleurt',
 'ter',
 'accuracy',
 'exact_match',
 'indic_glue',
 'spearmanr',
 'mae',
 'squad',
 'chrf',
 'glue',
 'perplexity',
 'mean_iou',
 'squad_v2',
 'meteor',
 'bleu',
 'wiki_split',
 'sari',
 'frugalscore',
 'google_bleu',
 'bertscore',
 'matthews_correlation',
 'seqeval',
 'trec_eval',
 'rl_reliability',
 'jordyvl/ece',
 'angelina-wang/directional_bias_amplification',
 'cpllab/syntaxgym',
 'lvwerra/bary_score',
 'kaggle/amex',
 'kaggle/ai4code',
 'hack/test_metric',
 'yzha/ctc_eval',
 'codeparrot/apps_metric',
 'mfumanelli/geometric_mean',
 'daiyizheng/valid',
 'poseval',
 'erntkn/dice_coefficient',
 'mgfrantz/roc_auc_macro',
 'mathemakitten/harness_sentiment',
 'mathemakitten/sentiment',
 'Vlasta/pr_auc',
 'gorkaartola/metric_for_tp

The standard syntax for using a metric is the following:
* Load the metric, e.g., `evaluate.load(<metric_name>)`
* Compute the metric, e.g., `<loaded_metric>.compute(...)`

In [6]:
bleu = evaluate.load("bleu")


I will now define a `generated_text` sample sentence and a reference sentence (wrapped in a nested list, since each prediction can have several references).

In [7]:
generated_text = ["Paris is the capital of France"]
reference_sentences = [["Paris is the biggest French city"]]

Before jumping into any implementation and talking about *n-grams*, clipping, etc., let me show you a very simple exercise which represents the key idea of BLEU.

In [8]:
print(nltk.word_tokenize(generated_text[0]))
print(nltk.word_tokenize(reference_sentences[0][0]))

['Paris', 'is', 'the', 'capital', 'of', 'France']
['Paris', 'is', 'the', 'biggest', 'French', 'city']


We can see that our generated text is made of 6 words. If we calculate BLEU at the word level, we simply need to check whether each word in the generated text is also present in the reference text; if that is the case, we mark it as a hit.  
In our case:
* *Paris* is in the reference text > it's a hit!
* *is* is in the reference text > it's a hit!
* *the* is in the reference text > it's a hit!
* *capital* is **not** in the reference text
* *of* is **not** in the reference text
* *France* is **not** in the reference text
  
Out of 6 generated words, 3 are in the reference text. Therefore, our BLEU score is 3/6, i.e., 0.5.  
Easy, no?
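
The hit counting above can be sketched in a few lines. This is only an illustration: it assumes a plain whitespace split instead of `nltk.word_tokenize` (the two agree on these punctuation-free sentences), and the variable names are mine.

```python
# Minimal sketch of word-level precision; whitespace tokenisation is an assumption.
generated = "Paris is the capital of France".split()
reference = "Paris is the biggest French city".split()

# A generated word is a hit if it appears anywhere in the reference.
hits = [word for word in generated if word in reference]
unigram_precision = len(hits) / len(generated)

print(hits)               # ['Paris', 'is', 'the']
print(unigram_precision)  # 0.5
```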

## Fooling BLEU - Super short sentences
The most attentive readers may have already noticed that the approach above can easily inflate the metric value without producing a meaningful generated text.  
For example, if I change the generated text to just *Paris*, I would achieve a perfect score (1 hit out of 1 generated word).  
  
That is why BLEU comes with a **brevity penalty** factor. If the generated text is shorter than the reference text(s), the BLEU score is multiplied by a lower-than-1 penalty term.  
You might object that brevity is not bad per se. That is true - at least partially - but, especially in the case of machine translation, we would not expect the generated translation to be much shorter than the reference text (there might be a difference of a few words/tokens but, as per the example above, only using *Paris* would not be enough).

## Fooling BLEU - Repeating right words
Brevity is a problem? No problem (pun intended)! Let me simply repeat *Paris* six times. As the generated text now has the same length as the reference text, there would be no brevity penalty and the BLEU score should be perfect, i.e., 1.  
Obviously, this does not work either. That is why the count of each individual token in the generated text is **capped** at the count of that token in the reference text.  
  
In our case, as *Paris* is only present once in the reference text, repeating it 6 times in the generated text earns only one hit.  
The BLEU score for the string `Paris Paris Paris Paris Paris Paris` would be $1/6$ (and not 1).
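
A small sketch of the capping (clipping) idea, again assuming whitespace tokenisation. The helper name `clipped_unigram_precision` is hypothetical and not part of any library.

```python
from collections import Counter

def clipped_unigram_precision(generated_tokens, reference_tokens):
    """Unigram precision where each token's hits are capped at its reference count."""
    gen_counts = Counter(generated_tokens)
    ref_counts = Counter(reference_tokens)
    clipped_hits = sum(min(count, ref_counts[token]) for token, count in gen_counts.items())
    return clipped_hits / len(generated_tokens)

reference = "Paris is the biggest French city".split()

# "Paris" appears once in the reference, so only one of the six copies counts.
print(clipped_unigram_precision(["Paris"] * 6, reference))  # 1/6
```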


## BLEU in Hugging Face
Ok, it is time to calculate BLEU using the Hugging Face implementation.

In [9]:
bleu_score = bleu.compute(predictions=generated_text, references=reference_sentences, tokenizer=nltk.word_tokenize, max_order=1)
pd.DataFrame.from_dict(bleu_score, orient="index", columns=["value"])

| | value |
|---|---|
| bleu | 0.5 |
| precisions | [0.5] |
| brevity_penalty | 1.0 |
| length_ratio | 1.0 |
| translation_length | 6 |
| reference_length | 6 |


As expected, the BLEU score is equal to 0.5. Let's check out the fooling cases.

In [10]:
bleu_score = bleu.compute(predictions=["Paris"], references=reference_sentences, tokenizer=nltk.word_tokenize, max_order=1)
pd.DataFrame.from_dict(bleu_score, orient="index", columns=["value"])

| | value |
|---|---|
| bleu | 0.006738 |
| precisions | [1.0] |
| brevity_penalty | 0.006738 |
| length_ratio | 0.166667 |
| translation_length | 1 |
| reference_length | 6 |


As you can see, precision is indeed 1.0 (one hit out of one generated token). However, the final BLEU score is obtained by multiplying this precision by the brevity penalty, which is very small here.  
The brevity penalty is calculated as the minimum between 1.0 (i.e., no penalty) and $e^{1 - len\_ref / len\_gen}$, where $len\_ref$ is the length of the reference sentence and $len\_gen$ is the length of the generated text (in our case it is $e^{1-6/1} = e^{-5} \approx 0.006738$).
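
The brevity penalty formula above is easy to check by hand. The function below is a sketch of that formula only; the name `brevity_penalty` and the handling of an empty generation are my assumptions, not the library's code.

```python
import math

def brevity_penalty(len_gen, len_ref):
    """min(1, exp(1 - len_ref / len_gen)), as in the formula above."""
    if len_gen == 0:
        return 0.0  # assumption: an empty generation gets the maximum penalty
    return min(1.0, math.exp(1 - len_ref / len_gen))

print(f"{brevity_penalty(1, 6):.6f}")  # 0.006738
print(brevity_penalty(6, 6))           # 1.0
```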

In [11]:
bleu_score = bleu.compute(predictions=["Paris Paris Paris Paris Paris Paris"], references=reference_sentences, tokenizer=nltk.word_tokenize, max_order=1)
pd.DataFrame.from_dict(bleu_score, orient="index", columns=["value"])

| | value |
|---|---|
| bleu | 0.166667 |
| precisions | [0.16666666666666666] |
| brevity_penalty | 1.0 |
| length_ratio | 1.0 |
| translation_length | 6 |
| reference_length | 6 |


In this case, the brevity penalty is 1 (i.e., no penalty). However, precision is - as expected - $1/6$, as the word *Paris* is present only once in the reference text.

## Fooling BLEU - Again?
So far, we have calculated the BLEU score for single words only. If we were to randomly swap the words' order (e.g., *France is the capital of Paris* as generated text), the single-word BLEU score wouldn't change.  
As you can imagine, however, changing the word order may lead to significantly different outputs. That is why we usually calculate the BLEU score for multiple **n-grams** (typically from 1 (i.e., single tokens) to 4).  
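
To make the word-order point concrete, here is a sketch comparing the original and the swapped sentence on uni-grams and bi-grams. It assumes whitespace tokenisation, and the helpers `ngrams` and `clipped_precision` are hypothetical names used only for this illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(generated_tokens, reference_tokens, n):
    """Clipped n-gram precision of the generated text against one reference."""
    gen = Counter(ngrams(generated_tokens, n))
    ref = Counter(ngrams(reference_tokens, n))
    hits = sum(min(count, ref[gram]) for gram, count in gen.items())
    return hits / max(sum(gen.values()), 1)

reference = "Paris is the biggest French city".split()
original = "Paris is the capital of France".split()
swapped = "France is the capital of Paris".split()

for name, tokens in [("original", original), ("swapped", swapped)]:
    print(name, clipped_precision(tokens, reference, 1), clipped_precision(tokens, reference, 2))
```

The uni-gram precision is 0.5 for both sentences, while the bi-gram precision drops from 0.4 to 0.2 for the swapped one: only longer n-grams notice the change in order.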
  
Please note that, from now on, I will use `sacrebleu`, a BLEU implementation in which the tokenisation step is already included.

In [12]:
# Uncomment it if you haven't run it before
!pip install sacrebleu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sacrebleu
  Downloading sacrebleu-2.2.1-py3-none-any.whl (116 kB)
Collecting portalocker
  Downloading portalocker-2.5.1-py2.py3-none-any.whl (15 kB)
Collecting colorama
  Downloading colorama-0.4.5-py2.py3-none-any.whl (16 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.5 portalocker-2.5.1 sacrebleu-2.2.1


In [13]:
bleu = evaluate.load("sacrebleu")


Please note that `sacrebleu` includes a smoothing option to increase (by a certain amount) the count of n-grams with zero matches. This is to avoid extreme penalisation for missing (long) n-grams in the generated text.  
In this case, I am setting the smoothing value to 0 to get the plain vanilla BLEU calculation.

In [14]:
bleu_score = bleu.compute(predictions=generated_text, references=reference_sentences, smooth_method="floor", smooth_value=0)
pd.DataFrame.from_dict(bleu_score, orient="index", columns=["value"])

| | value |
|---|---|
| score | 0.0 |
| counts | [3, 2, 1, 0] |
| totals | [6, 5, 4, 3] |
| precisions | [50.0, 40.0, 25.0, 0.0] |
| bp | 1.0 |
| sys_len | 6 |
| ref_len | 6 |


In [15]:
print(generated_text)
print(reference_sentences)

['Paris is the capital of France']
[['Paris is the biggest French city']]


Let's inspect the scores above.  
First of all, we now have multiple lists of values. That is because, by default, BLEU is calculated across n-grams up to n = 4 (i.e., single word, bi-gram, tri-gram and quad-gram).  
  
Let me walk through each *precisions* value (note that `sacrebleu` reports them as percentages):
* The first precision is 50.0, i.e., 0.5. This is equivalent to the single-word precision we calculated before (*Paris*, *is*, *the* - 3 tokens - out of 6 words > 0.5)
* The second precision is measured using bi-grams (or 2-grams). The matching bi-grams are *Paris is* and *is the* out of five total bi-grams. Precision is then $2/5 = 0.4$
* The third precision is measured using tri-grams (or 3-grams). The only matching tri-gram is *Paris is the*, out of four total tri-grams. Precision is therefore $1/4 = 0.25$
* The fourth precision is measured using - you should get it by now - quad-grams (or 4-grams). There are no matching quad-grams out of three possible options. Precision is zero.  
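
The counts and totals in the list above can be reproduced by hand. The sketch below assumes whitespace tokenisation (which may differ from `sacrebleu`'s own tokeniser on other inputs) and uses a hypothetical helper `ngram_counts`.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

generated = "Paris is the capital of France".split()
reference = "Paris is the biggest French city".split()

counts, totals = [], []
for n in range(1, 5):
    gen = ngram_counts(generated, n)
    ref = ngram_counts(reference, n)
    # Matching n-grams, each capped at its count in the reference.
    counts.append(sum(min(count, ref[gram]) for gram, count in gen.items()))
    totals.append(sum(gen.values()))

print(counts)  # [3, 2, 1, 0]
print(totals)  # [6, 5, 4, 3]
```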
  
The overall BLEU score is an average of the n-gram precisions. More precisely, it is calculated as the **geometric mean** of all the precisions, multiplied by the brevity penalty (in this case there is no penalty, as the generated and reference texts have the same length). Because we have no matching 4-grams, one of the factors is zero and so is the geometric mean: plain BLEU is extremely conservative here.  
Let me re-calculate it by setting the smoothing value to 1, i.e., if the count of matching n-grams is zero, it will be "floored" to 1.



In [16]:
bleu_score = bleu.compute(predictions=generated_text, references=reference_sentences, smooth_method="floor", smooth_value=1)
pd.DataFrame.from_dict(bleu_score, orient="index", columns=["value"])

| | value |
|---|---|
| score | 35.930411 |
| counts | [3, 2, 1, 0] |
| totals | [6, 5, 4, 3] |
| precisions | [50.0, 40.0, 25.0, 33.333333333333336] |
| bp | 1.0 |
| sys_len | 6 |
| ref_len | 6 |


The first three precisions are unchanged, but the last one, i.e., the one for 4-grams, is now $1/3$ (reported as 33.33), since its zero count has been floored to 1. The score is calculated as the geometric mean of all the precisions.

In [17]:
# np.prod replaces np.product, which is deprecated and was removed in NumPy 2.0
print(f'BLEU score: {bleu_score["bp"] * np.prod(bleu_score["precisions"])**(1/len(bleu_score["precisions"])):.6f}')

BLEU score: 35.930411
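
For reference, the calculation in the cell above can be written as a formula. This is the usual BLEU definition with uniform weights, where $p_n$ are the (possibly smoothed) n-gram precisions and $BP$ is the brevity penalty; the symbols are my notation, not the library's:

$$
\text{BLEU} = BP \cdot \Big(\prod_{n=1}^{N} p_n\Big)^{1/N} = BP \cdot \exp\Big(\frac{1}{N}\sum_{n=1}^{N} \ln p_n\Big)
$$

With $N = 4$, $BP = 1$ and precisions $(50, 40, 25, 33.33)$, this gives

$$
(50 \cdot 40 \cdot 25 \cdot 33.33)^{1/4} \approx 35.93
$$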


### Multiple references
Let me now add another reference sentence, similar to the other one.

In [18]:
reference_sentences[0].append("Paris is the largest city in France")

In [19]:
print(generated_text)
print(reference_sentences)

['Paris is the capital of France']
[['Paris is the biggest French city', 'Paris is the largest city in France']]


In [20]:
bleu_score = bleu.compute(predictions=generated_text, references=reference_sentences, smooth_method="floor", smooth_value=1)
pd.DataFrame.from_dict(bleu_score, orient="index", columns=["value"])

| | value |
|---|---|
| score | 38.60974 |
| counts | [4, 2, 1, 0] |
| totals | [6, 5, 4, 3] |
| precisions | [66.66666666666667, 40.0, 25.0, 33.33333333333... |
| bp | 1.0 |
| sys_len | 6 |
| ref_len | 6 |


In [21]:
print(f"Precisions: {bleu_score['precisions']}")

Precisions: [66.66666666666667, 40.0, 25.0, 33.333333333333336]


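As you can see, the precision scores are almost identical. The 1-gram precision, however, is higher (66.67 vs 50.0). This is because the word *France* is now present in the second reference. The matching 1-gram count is then 4 (and not 3), leading to a 1-gram BLEU precision of $4/6$.  
  
In the case of multiple references, the brevity penalty compares the length of the generated text with a single reference length: `sacrebleu` picks the reference whose length is closest to that of the generated text (preferring the shorter one on ties), whereas the Hugging Face `bleu` module used earlier takes the **shortest** reference. Here both rules give a reference length of 6.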
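
The jump from 3 to 4 matching uni-grams can be reproduced with a short sketch. It follows the standard BLEU rule that a token's cap is its maximum count across all references, and it assumes whitespace tokenisation; variable names are mine.

```python
from collections import Counter

generated = "Paris is the capital of France".split()
references = [
    "Paris is the biggest French city".split(),
    "Paris is the largest city in France".split(),
]

gen_counts = Counter(generated)

# For each token, the cap is its maximum count over all references.
max_ref_counts = Counter()
for ref in references:
    for token, count in Counter(ref).items():
        max_ref_counts[token] = max(max_ref_counts[token], count)

hits = sum(min(count, max_ref_counts[token]) for token, count in gen_counts.items())
print(hits, len(generated))  # 4 6
```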

## ROUGE
As already mentioned in the *Introduction*, BLEU is a meaningful metric for tasks where precision matters, e.g., machine translation. In such cases, *recall* is relatively less important.  
  
In other cases, like summarisation, **recall** is more important. The intuition is quite simple. If a reference text is a summary of a document, and we want to generate a summary too, the more of the information in the reference text we are able to recover, the better.  
  
In practice, the ROUGE implementation is quite similar to BLEU's. The only difference - the same as that between precision and recall - lies in the denominator of the hit ratio. In the case of ROUGE, we use the count of n-grams in the reference text.  
  
Let me show you a few examples.

In [22]:
!pip install rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24955 sha256=14b6aa1397d6e6135f778d5343acc64c42fc5bc636e4e2b9201689c494be423c
  Stored in directory: /root/.cache/pip/wheels/84/ac/6b/38096e3c5bf1dc87911e3585875e21a3ac610348e740409c76
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


Let me slightly change the generated text (see below).

In [23]:
generated_text = ["Paris is not the capital of France"]
reference_sentences = [["Paris is the biggest French city"]]
print(generated_text)
print(reference_sentences)

['Paris is not the capital of France']
[['Paris is the biggest French city']]


For the two sentences above, we would expect the ROUGE-1 score (i.e., the ROUGE score calculated on uni-gram) to be:
* Count of reference unigrams also present in generated text (3, i.e., *Paris*, *is* and *the*)
* Divided by total number of unigrams in reference text  (i.e., 6)
  
In numbers, this would be 3/6, i.e., 0.5. Let's verify whether that is the case using the Hugging Face library.

In [24]:
rouge = evaluate.load("rouge")
rouge_score = rouge.compute(predictions=generated_text, references=reference_sentences, rouge_types=["rouge1"])


Mmm, that turns out not to be what we expected: the returned score is roughly 0.46 rather than 0.5. ROUGE, in fact, has been "adjusted" so that it does not simply measure recall while discarding precision. Otherwise, somebody could get a high ROUGE score with a long, gibberish generated text that happens to contain many of the n-grams in the reference text.  
The actual ROUGE calculation has been modified to be the harmonic mean (an F1-score) of recall (as calculated above, i.e., our 0.5) and precision, i.e., a BLEU-style n-gram precision without the brevity penalty (named `unclipped_bleu1` in the next cell).  
  
Let me show you this in detail.

In [25]:
# unclipped BLEU-1 score, i.e., unigram precision without the brevity penalty
unclipped_bleu1 = 3/7 # "Paris", "is" and "the" out of 7 generated unigrams
rouge1 = 3/6 # "Paris", "is" and "the" out of 6 reference unigrams

manual_rouge_score = 2 * (unclipped_bleu1 * rouge1) / (unclipped_bleu1 + rouge1)
print(f"Adjusted ROUGE-1 score: {manual_rouge_score:.8f}") 

# compare with a tolerance rather than exact float equality
assert np.isclose(manual_rouge_score, rouge_score["rouge1"])

Adjusted ROUGE-1 score: 0.46153846


Please note that the *brevity penalty* is not included in the calculation above. A brevity penalty is, in effect, already built in, because we consider both precision and recall. A perfectly precise 1-word generated sentence will have a very low recall score (against a longer reference text).  
Let me quickly show you this with an example.

In [27]:
rouge_score = rouge.compute(predictions=["Paris"], references=reference_sentences, rouge_types=["rouge1"])
print(rouge_score)

{'rouge1': 0.2857142857142857}


In [36]:
# unclipped BLEU-1 score for single-word generated text
unclipped_bleu1 = 1/1 # "Paris" is the only generated unigram, and it matches
rouge1 = 1/6 # only "Paris" is matched, out of 6 reference unigrams

manual_rouge_score = 2 * (unclipped_bleu1 * rouge1) / (unclipped_bleu1 + rouge1)
print(f"Adjusted ROUGE-1 score: {manual_rouge_score:.8f}") 

# compare with a tolerance rather than exact float equality
assert np.isclose(manual_rouge_score, rouge.compute(predictions=["Paris"], references=reference_sentences)["rouge1"])

Adjusted ROUGE-1 score: 0.28571429


### Multiple ROUGE versions
In the default implementation of HuggingFace ROUGE, four ROUGE metrics are actually reported. They are:
* ROUGE-1 (what we have seen above)
* ROUGE-2 (i.e., ROUGE for bi-grams)
* ROUGE-L
* ROUGE-Lsum
  
I will go over ROUGE-2 quickly and focus on ROUGE-L and ROUGE-Lsum later.

#### ROUGE-2
Very simply, ROUGE-2 is calculated using bi-grams. Let me calculate it for our examples.

In [37]:
print(generated_text)
print(reference_sentences)

['Paris is not the capital of France']
[['Paris is the biggest French city']]


In [93]:
# unclipped BLEU-2 score for the full generated sentence
unclipped_bleu2 = 1/6 # "Paris is" is the only matching bigram, out of 6 generated bigrams
rouge2 = 1/5 # "Paris is" out of 5 reference bigrams

manual_rouge_score = 2 * (unclipped_bleu2 * rouge2) / (unclipped_bleu2 + rouge2)
print(f"Adjusted ROUGE-2 score: {manual_rouge_score:.8f}") 

# compare with a tolerance rather than exact float equality
assert np.isclose(manual_rouge_score, rouge.compute(predictions=generated_text, references=reference_sentences)["rouge2"])

Adjusted ROUGE-2 score: 0.18181818
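
The two manual calculations (ROUGE-1 and ROUGE-2) follow the same recipe, so they can be generalised. The function below is a sketch under the assumption of whitespace tokenisation (the real `rouge_score` package applies its own tokenisation and lower-casing); `rouge_n` is a hypothetical helper, not a library function.

```python
from collections import Counter

def rouge_n(generated_tokens, reference_tokens, n):
    """Return (precision, recall, f1) computed from overlapping n-gram counts."""
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    gen, ref = ngram_counts(generated_tokens), ngram_counts(reference_tokens)
    overlap = sum(min(count, gen[gram]) for gram, count in ref.items())
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

generated = "Paris is not the capital of France".split()
reference = "Paris is the biggest French city".split()

for n in (1, 2):
    p, r, f = rouge_n(generated, reference, n)
    print(f"ROUGE-{n}: precision={p:.4f} recall={r:.4f} f1={f:.8f}")
```

On these two sentences the F1 values agree with the manual scores above (0.46153846 and 0.18181818).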


#### Longest Common Subsequence (LCS)
Another common ROUGE variant uses - instead of n-gram counts - the *longest common subsequence* (or LCS). With LCS, a ROUGE score is calculated by dividing the LCS length by the length of the generated text (a kind of precision) and by the length of the reference text (a kind of recall), and then combining the two in a sort of F1-score. By doing this, we (sort of) normalise across samples of different lengths (by definition, a longer sentence may have a longer LCS simply because there are more words!).

**IMPORTANT** >>> when calculating the longest common subsequence, we are not looking for the longest run of consecutive words, but rather for the longest series of words that appear in the same order in both texts.
In our case, the longest common subsequence is *Paris is ... the*. The fact that there is a *not* in between does not matter. 
Let's calculate the two metrics $precision_{LCS}$ and $recall_{LCS}$.

In [120]:
precision_LCS = 3/7 # "Paris is ... the" (LCS length 3) out of 7 generated tokens
recall_LCS = 3/6 # "Paris is the" (LCS length 3) out of 6 reference tokens

As a quick reminder on beta in F-score:
* if $beta = 1$, we have the harmonic mean between precision and recall
* if $beta > 1$, precision carries less weight than recall
* if $beta < 1$, precision carries more weight than recall
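
To see the effect of beta in numbers, here is a small sketch using the LCS precision and recall from this example. `f_beta` is a hypothetical helper written for this illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

precision, recall = 3 / 7, 3 / 6
for beta in (0.5, 1, 2):
    print(beta, round(f_beta(precision, recall, beta), 4))
```

Since recall (0.5) is higher than precision (about 0.43) here, the score moves up towards recall as beta grows and down towards precision as beta shrinks.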

In [125]:
print(f"Precision LCS:\t{precision_LCS:.4f}")
print(f"Recall LCS:\t{recall_LCS:.4f}")
beta = 1
rouge_LCS = (1+beta**2) * (precision_LCS * recall_LCS) / (precision_LCS * (beta**2) + recall_LCS)
print(f"ROUGE-LCS:\t{rouge_LCS:.8f}")

# compare with a tolerance rather than exact float equality
assert np.isclose(rouge.compute(predictions=generated_text, references=reference_sentences)["rougeL"], rouge_LCS)

Precision LCS:	0.4286
Recall LCS:	0.5000
ROUGE-LCS:	0.46153846
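
The LCS length was counted by eye above. A standard dynamic-programming sketch can compute it instead; it assumes whitespace tokenisation, and `lcs_length` is a hypothetical helper rather than the routine used inside the library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, token_a in enumerate(a, start=1):
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

generated = "Paris is not the capital of France".split()
reference = "Paris is the biggest French city".split()

lcs = lcs_length(generated, reference)
precision_lcs = lcs / len(generated)
recall_lcs = lcs / len(reference)
f1_lcs = 2 * precision_lcs * recall_lcs / (precision_lcs + recall_lcs)
print(lcs, f"{f1_lcs:.8f}")  # 3 0.46153846
```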


In the case of multi-sentence texts, the `rouge_score` package behind this metric offers two LCS variants:
* rougeL > newlines are ignored and the LCS is computed over the whole text as a single sequence
* rougeLsum > the text is split into sentences on newlines and a summary-level (union) LCS is computed across them  
  
In practice, we look at all ROUGE metrics to have a better idea of model performance.