# Introduction to BART with CNN/DailyMail

## 1. Import libraries + cnn_dailymail dataset

In [1]:
from transformers import pipeline, set_seed

import matplotlib.pyplot as plt
import numpy as np

import pandas as pd
from datasets import load_dataset, load_metric
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/clementgillet/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
dataset = load_dataset("cnn_dailymail", version="3.0.0")

print(f"Features in cnn_dailymail : {dataset['train'].column_names}")

Found cached dataset cnn_dailymail (/Users/clementgillet/.cache/huggingface/datasets/cnn_dailymail/default/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de)



Features in cnn_dailymail : ['article', 'highlights', 'id']


In [3]:
sample = dataset["train"][1]
print(f"""
Article (excerpt of 500 characters, total length: {len(sample["article"])}):
""")
print(sample["article"][:500])
print(f'\nSummary (length: {len(sample["highlights"])}):')
print(sample["highlights"])


Article (excerpt of 500 characters, total length: 4051):

Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events. Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial. MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor." Here, inmates with the most s

Summary (length: 281):
Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change .


In [4]:
sample_text = dataset["train"][1]["article"][:1000]

## 2. BART (fine-tuned on CNN Dailymail)

BART is a **denoising autoencoder** for pretraining seq2seq models: the input text is corrupted with an arbitrary noising function, and the model learns to reconstruct the original text.
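To make the corruption step concrete, here is a minimal sketch of one possible noising function, random token masking. The function name `mask_tokens`, the 30% masking rate, the `<mask>` string, and the whitespace tokenization are all illustrative assumptions; this is not BART's actual preprocessing, which operates on subword tokens and also uses other corruptions such as span infilling and sentence permutation.

```python
import random


def mask_tokens(tokens, mask_prob=0.3, mask_token="<mask>", seed=0):
    """Replace each token with `mask_token` with probability `mask_prob`.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the corruption is reproducible
    return [mask_token if rng.random() < mask_prob else tok for tok in tokens]


original = "the ninth floor is dubbed the forgotten floor".split()
corrupted = mask_tokens(original)

# A denoising training pair: the model reads `corrupted`
# and is trained to generate `original`.
training_pair = (corrupted, original)
print(" ".join(corrupted))
print(" ".join(original))
```

The key point is the shape of the training pair: the noisy sequence goes into the encoder and the clean sequence is the decoder target.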

It uses a Transformer with a **bidirectional encoder** (like BERT) and a **left-to-right decoder** (like GPT).<br/>
This means that the **encoder attention mask is fully visible** and the **decoder attention mask is causal**.
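The two mask shapes can be sketched in plain Python. This is only an illustration for a toy sequence of length 4 (an assumed value), where `1` means "position i may attend to position j"; it is not how the `transformers` library builds its masks internally.

```python
seq_len = 4  # toy sequence length, chosen for illustration

# Encoder: fully visible, every position attends to every position.
encoder_mask = [[1] * seq_len for _ in range(seq_len)]

# Decoder: causal, position i attends only to positions j <= i.
decoder_mask = [
    [1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)
]

print("encoder mask:")
for row in encoder_mask:
    print(row)

print("decoder mask:")
for row in decoder_mask:
    print(row)
```

The decoder mask is lower triangular, so a token can never look at tokens that come after it.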

In [5]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample_text)

In [6]:
summary = '\n'.join(sent_tokenize(pipe_out[0]['summary_text']))

In [7]:
print("Ground truth summary:\n\n",sample["highlights"],"\n\n")
print("BART-generated summary :\n\n",pipe_out[0]["summary_text"])

Ground truth summary:

 Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change . 


BART-generated summary :

 Miami-Dade pretrial detention facility is dubbed the "forgotten floor" Here, inmates with the most severe mental illnesses are incarcerated. Most often, they face drug charges or charges of assaulting an officer. Judge Steven Leifman says the arrests often result from confrontations with police.


## 3. Evaluation (SacreBLEU & ROUGE)

- **BLEU** measures **precision**: how many of the n-grams in the machine-generated summary also appear in the gold summary.

- **ROUGE** measures **recall**: how many of the n-grams in the gold summary also appear in the machine-generated summary. <br/><br/> In summarization, high recall is usually considered more important than high precision.
<br/><br/>
- ROUGE-N $\rightarrow$ Measures the n-gram overlap between the model output and the gold reference (ROUGE-1 for unigrams, ROUGE-2 for bigrams).
- ROUGE-L $\rightarrow$ Based on the Longest Common Subsequence (LCS) between the model output and the gold reference: the longest sequence of tokens that appears in the same order, though not necessarily contiguously, in both summaries. In the `rouge_score` package that backs this metric, newlines are ignored and each summary is treated as a single sequence.
- ROUGE-Lsum $\rightarrow$ In contrast, it treats newlines as sentence boundaries, computes the LCS at the sentence level, and aggregates the matches into a summary-level score.
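The longest common subsequence behind ROUGE-L can be computed with standard dynamic programming. The sketch below uses a hypothetical helper `lcs_length` and two made-up toy sentences with whitespace tokenization; real ROUGE implementations add their own tokenization and combine precision and recall into an F-measure.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


reference = "the cat sat on the mat".split()
candidate = "the cat lay on a mat".split()

lcs = lcs_length(reference, candidate)
lcs_recall = lcs / len(reference)      # share of the reference that is covered
lcs_precision = lcs / len(candidate)   # share of the candidate that matches
print(lcs, lcs_precision, lcs_recall)
```

Here the shared subsequence is "the cat on mat": the tokens keep their order but do not need to be adjacent.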

(n-gram = a sequence of n tokens)
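The difference between n-gram precision and n-gram recall can be seen on a toy pair of sentences. This is a simplified sketch: the helpers `ngrams` and `overlap` are hypothetical, tokenization is plain whitespace splitting, and full BLEU and ROUGE add further steps (for example BLEU's brevity penalty and geometric mean over n-gram orders).

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap(candidate, reference, n):
    """Clipped n-gram precision and recall of candidate against reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    # Counter intersection keeps the minimum count per n-gram ("clipping").
    matches = sum((cand & ref).values())
    precision = matches / max(sum(cand.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    return precision, recall


reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

p1, r1 = overlap(candidate, reference, 1)  # unigrams
p2, r2 = overlap(candidate, reference, 2)  # bigrams
print(f"unigram precision={p1:.2f} recall={r1:.2f}")
print(f"bigram  precision={p2:.2f} recall={r2:.2f}")
```

A very short candidate can reach perfect precision while covering only part of the reference, which is why recall matters for judging a summary.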

### a) BLEU

In [8]:
bleu_metric = load_metric("sacrebleu")



In [9]:
bleu_metric.add(prediction = [summary], reference = [sample["highlights"]] )

results = bleu_metric.compute(smooth_method = 'floor', smooth_value = 0 )

results['precision'] = [np.round(p, 2) for p in results['precisions'] ]
print("\n\nBLEU results on BART-large-cnn:")
pd.DataFrame.from_dict(results, orient = 'index', columns = ['Value'] )



BLEU results on BART-large-cnn:


| | Value |
|---|---|
| score | 10.887081 |
| counts | [19, 7, 5, 2] |
| totals | [55, 54, 53, 52] |
| precisions | [34.54545454545455, 12.962962962962964, 9.4339... |
| bp | 0.96429 |
| sys_len | 55 |
| ref_len | 57 |
| precision | [34.55, 12.96, 9.43, 3.85] |
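The reported score can be reproduced by hand from the `counts`, `totals`, `sys_len`, and `ref_len` values in the results above. The sketch below assumes the standard BLEU definition (geometric mean of the four n-gram precisions multiplied by a brevity penalty); it does not call sacrebleu and ignores smoothing, which has no effect here because none of the counts is zero.

```python
import math

# Values copied from the BLEU results above.
counts = [19, 7, 5, 2]      # matched 1- to 4-grams
totals = [55, 54, 53, 52]   # 1- to 4-grams in the generated summary
sys_len, ref_len = 55, 57

# Modified n-gram precisions.
precisions = [c / t for c, t in zip(counts, totals)]

# Geometric mean of the precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Brevity penalty: penalizes outputs shorter than the reference.
bp = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)

bleu = 100 * bp * geo_mean
print(f"bp={bp:.5f}  BLEU={bleu:.2f}")
```

The generated summary is two tokens shorter than the reference, so the brevity penalty slightly lowers the final score.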


### b) ROUGE

In [10]:
rouge_metric = load_metric("rouge")

In [11]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

rouge_metric.add(prediction = [summary], reference = [sample["highlights"]] )
score = rouge_metric.compute()
rouge_dict =  dict((rn, score[rn].mid.fmeasure) for rn in rouge_names )
print(score["rougeL"].mid)
print("\n\nROUGE results on BART-large-cnn:")
print(rouge_dict)
pd.DataFrame.from_dict([rouge_dict])

Score(precision=0.20454545454545456, recall=0.1836734693877551, fmeasure=0.1935483870967742)


ROUGE results on BART-large-cnn:
{'rouge1': 0.3655913978494624, 'rouge2': 0.13186813186813184, 'rougeL': 0.1935483870967742, 'rougeLsum': 0.1935483870967742}


| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| 0 | 0.365591 | 0.131868 | 0.193548 | 0.193548 |
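The values in the table are F-measures (`score[rn].mid.fmeasure`), not pure recall. As a quick check, the ROUGE-L value can be recovered from the precision and recall printed in the `Score(...)` output above, assuming the usual harmonic-mean definition of the F-measure.

```python
# Precision and recall copied from the printed rougeL Score above.
precision = 0.20454545454545456
recall = 0.1836734693877551

# F-measure: harmonic mean of precision and recall.
fmeasure = 2 * precision * recall / (precision + recall)
print(round(fmeasure, 6))
```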
