ROUGE compares the n-grams of a model-generated summary against those of a human-written reference summary. The main ROUGE variants are:

•	ROUGE-N: Overlap of n-grams [2] between the system and reference summaries.

•	ROUGE-1 refers to the overlap of unigrams (single words) between the system and reference summaries.

•	ROUGE-2 refers to the overlap of bigrams [2] between the system and reference summaries.

•	ROUGE-L: Longest Common Subsequence (LCS) [3] based statistics. The LCS naturally takes sentence-level structure similarity into account and automatically identifies the longest in-sequence co-occurring n-grams.

•	ROUGE-W: Weighted LCS-based statistics that favor consecutive LCSes.

•	ROUGE-S: Skip-bigram [3] based co-occurrence statistics. A skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps between them.

•	ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.
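The variants listed above can be sketched in plain Python to make the definitions concrete. This is a minimal illustration, not the `rouge_score` implementation: it assumes pre-tokenized, lowercased input, skips stemming, uses an unweighted F1 throughout, and all function names (`ngrams`, `prf`, `rouge_n`, `lcs_length`, `rouge_l`, `rouge_s`) are hypothetical helpers introduced here.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, reference_total, candidate_total):
    """Turn an overlap count into (precision, recall, F1)."""
    precision = overlap / candidate_total if candidate_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rouge_n(reference, candidate, n):
    """ROUGE-N: clipped n-gram overlap between reference and candidate."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())  # Counter intersection clips counts
    return prf(overlap, sum(ref.values()), sum(cand.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(reference, candidate):
    """ROUGE-L: LCS length relative to each sequence's length."""
    return prf(lcs_length(reference, candidate), len(reference), len(candidate))


def rouge_s(reference, candidate):
    """ROUGE-S: overlap of skip-bigrams (ordered word pairs, any gap)."""
    ref = Counter(combinations(reference, 2))
    cand = Counter(combinations(candidate, 2))
    overlap = sum((ref & cand).values())
    return prf(overlap, sum(ref.values()), sum(cand.values()))


reference = "the quick brown fox jumps over the lazy dog".split()
candidate = "the quick brown dog jumps on the log".split()
print("ROUGE-1:", rouge_n(reference, candidate, 1))
print("ROUGE-2:", rouge_n(reference, candidate, 2))
print("ROUGE-L:", rouge_l(reference, candidate))
print("ROUGE-S:", rouge_s(reference, candidate))
```

Using `Counter` intersection keeps repeated words from being credited more often than they occur in either text. ROUGE-W and ROUGE-SU are left out to keep the sketch short.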

Our papers report ROUGE-1 (unigram) F1, ROUGE-2 (bigram) F1, and ROUGE-L (longest common subsequence) F1.

If a score is reported simply as ROUGE without naming the statistic, assume it refers to the F1 score.

In [1]:
# Simple Rouge
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown dog jumps on the log.')

scores

{'rouge1': Score(precision=0.75, recall=0.6666666666666666, fmeasure=0.7058823529411765),
 'rouge2': Score(precision=0.2857142857142857, recall=0.25, fmeasure=0.26666666666666666),
 'rougeL': Score(precision=0.625, recall=0.5555555555555556, fmeasure=0.5882352941176471)}
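
The ROUGE-1 numbers above can be checked by hand. This is a worked sketch, assuming the text is lowercased and punctuation is stripped before matching (stemming does not change the overlap for this pair). The first string passed to `score` is the 9-word target and the second is the 8-word prediction; they share six unigrams ("the" twice, "quick", "brown", "dog", "jumps").

```latex
\begin{align*}
P &= \frac{\text{overlapping unigrams}}{\text{unigrams in prediction}} = \frac{6}{8} = 0.75 \\
R &= \frac{\text{overlapping unigrams}}{\text{unigrams in target}} = \frac{6}{9} \approx 0.667 \\
F_1 &= \frac{2PR}{P + R}
     = \frac{2 \cdot \frac{3}{4} \cdot \frac{2}{3}}{\frac{3}{4} + \frac{2}{3}}
     = \frac{12}{17} \approx 0.706
\end{align*}
```

The same reasoning gives ROUGE-2 (two shared bigrams out of 7 and 8, so 2/7 and 2/8) and ROUGE-L (an LCS of length 5, "the quick brown jumps the", so 5/8 and 5/9).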

# Rouge with Pegasus

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

In [1]:
# Prepping pegasus
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail")
# Move the model to the same device the tokenized batch is sent to later
model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-cnn_dailymail").to(device)


In [3]:
from datasets import load_dataset, load_metric

In [6]:
# Prepping the data
raw_datasets = load_dataset("cnn_dailymail", '3.0.0')
# metric = load_metric("rouge")

Downloading and preparing dataset cnn_dailymail/3.0.0 (download: 558.32 MiB, generated: 1.28 GiB, post-processed: Unknown size, total: 1.82 GiB) to /Users/acmonnin/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234...


Dataset cnn_dailymail downloaded and prepared to /Users/acmonnin/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234. Subsequent calls will reuse this data.



In [67]:
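# Note: 'document' is an XSum-style field name; cnn_dailymail, as loaded above,
# uses 'article' for the text and 'highlights' for the summary.
doc = raw_datasets['validation'][0]['document']
doc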

'The country\'s consumer watchdog has taken Apple to court for false advertising because the tablet computer does not work on Australia\'s 4G network.\nApple\'s lawyers said they were willing to publish a clarification.\nHowever the company does not accept that it misled customers.\nThe Australian Competition and Consumer Commission (ACCC) said on Tuesday: "Apple\'s recent promotion of the new \'iPad with wi-fi + 4G\' is misleading because it represents to Australian consumers that the product can, with a sim card, connect to a 4G mobile data network in Australia, when this is not the case."\nThe watchdog then lodged a complaint at the Federal Court in Melbourne.\nAt a preliminary hearing, Apple lawyer Paul Anastassiou said Apple had never claimed the device would work fully on the current 4G network operated by Telstra.\nApple says the new iPad works on what is globally accepted to be a 4G network.\nThe matter will go to a full trial on 2 May.\nThe Apple iPad\'s third version went on 

In [23]:
# In this case we are just using a single document but we can pass multiple to the tokenizer
batch = tokenizer(doc, truncation=True, padding='longest', return_tensors="pt").to(device)

In [68]:
translated = model.generate(**batch)
translated

tensor([[    0,  1814,   131,   116,  5113,   243,   157,   195,  2747,   112,
          5442,   114, 19259,   110,   107,   106,   159,   301,   358,   146,
          2217,   120,   126, 45520,   527,   110,   107,   106,   159,   841,
           138,   275,   112,   114,   357,  2498,   124,   280,   913,   110,
           107,     1]])

In [73]:
# Converting from tokens back into strings
summary=tokenizer.batch_decode(translated, skip_special_tokens=True)
print("Summary from pegasus:")
summary

Summary from pegasus:


["Apple's lawyers said they were willing to publish a clarification.<n>The company does not accept that it misled customers.<n>The matter will go to a full trial on 2 May."]
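
The decoded summary above still contains Pegasus's literal `<n>` sentence markers. A tokenizer that strips punctuation would likely turn each marker into a stray `n` token, which lengthens the summary slightly and may shift the scores. A minimal sketch of cleaning them up before scoring follows; `clean_pegasus_summary` is a hypothetical helper, not part of `transformers` or `rouge_score`.

```python
def clean_pegasus_summary(text: str) -> str:
    """Replace literal '<n>' sentence markers with newlines."""
    # Hypothetical helper: split on the marker, trim whitespace, drop empties.
    sentences = [part.strip() for part in text.split("<n>")]
    return "\n".join(sentence for sentence in sentences if sentence)


raw = (
    "Apple's lawyers said they were willing to publish a clarification."
    "<n>The company does not accept that it misled customers."
    "<n>The matter will go to a full trial on 2 May."
)
cleaned = clean_pegasus_summary(raw)
print(cleaned)
```

Putting one sentence per line also matches the input layout that summary-level LCS variants such as `rougeLsum` expect.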

In [74]:
# Summary as provided by a human reader
golden_summary=raw_datasets['validation'][0]['summary']
print("Summary from human reader:")
golden_summary

Summary from human reader:


'US technology firm Apple has offered to refund Australian customers who felt misled about the 4G capabilities of the new iPad.'

In [71]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True) # These are the metrics used in our paper
# score() expects (target, prediction): the human summary is the target
scores = scorer.score(golden_summary, summary[0])
scores

{'rouge1': Score(precision=0.18181818181818182, recall=0.2857142857142857, fmeasure=0.2222222222222222),
 'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0),
 'rougeL': Score(precision=0.12121212121212122, recall=0.19047619047619047, fmeasure=0.14814814814814814)}

### Options for Metric Calculation
load_metric from the datasets package - This will likely scale the best

rouge_scorer from the rouge_score package - Great for getting a sense of how ROUGE works

In [75]:
from datasets import load_metric
from rouge_score import rouge_scorer

metric = load_metric("rouge")