## ProjF3 - Baseline Model

Use this document as a template to provide the evaluation of your baseline model. You are welcome to go into as much depth as needed.

Make sure you keep the sections specified in this template, but you are welcome to add more cells with your code or explanation as needed.

In [1]:
import nltk
from nltk.tokenize import sent_tokenize

### 1. Load and Prepare Data

This should illustrate your code for loading the dataset and the split into training, validation and testing. You can add steps like pre-processing if needed.

In [2]:
from datasets import load_dataset

In [3]:
# Load CNN/DailyMail dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")

In [4]:
from transformers import pipeline, set_seed

In [5]:
# Keep only the first 2,000 characters of one training article as the demo input
sample_text = dataset["train"][1]["article"][:2000]
summaries = {}
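
Slicing with `[:2000]` counts characters, not tokens, so the cut can land in the middle of a word. If that matters for the demo input, one possible alternative is to back up to the previous space. The helper below, `truncate_at_word`, is a hypothetical name introduced here for illustration and is not part of the original notebook.

```python
def truncate_at_word(text, max_chars=2000):
    """Cut text to at most max_chars characters without splitting the last word."""
    # Nothing to repair if the text is short enough or the cut falls on whitespace
    if len(text) <= max_chars or text[max_chars].isspace():
        return text[:max_chars]
    cut = text[:max_chars]
    # Drop the trailing partial word; keep the raw cut if there is no space at all
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut

demo = truncate_at_word("hello world again", 8)
print(demo)  # hello
```

In the notebook this would be called as `truncate_at_word(dataset["train"][1]["article"])`; the summarization pipeline itself still operates on tokens, so this only tidies the displayed input.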

### 2. Prepare your Baseline Model

Here you can have your code to either train (e.g., if you are building it from scratch) or load (e.g., if you are loading a pre-trained model) your model. These steps may require other packages or Python files; you can just call them here, and you don't have to include them in your submission. Remember that we will be looking at the saved outputs in the notebook and will not run the entire notebook.

##### BART

In [6]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

### Example of BART on one training example:

In [7]:
# original article from the train dataset
sample_text

'Editor\'s note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events. Here, Soledad O\'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial. MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor." Here, inmates with the most severe mental illnesses are incarcerated until they\'re ready to appear in court. Most often, they face drug charges or charges of assaulting an officer --charges that Judge Steven Leifman says are usually "avoidable felonies." He says the arrests often result from confrontations with police. Mentally ill people often won\'t do what they\'re told when police arrive on the scene -- confrontation seems to exacerbate their illness and they become more paranoid, delusional, and less likely to foll

#### Generated Abstractive Summary


In [8]:
summaries

{'bart': 'Mentally ill inmates are housed on the "forgotten floor" of Miami-Dade jail.\nMost often, they face drug charges or charges of assaulting an officer.\nJudge Steven Leifman says the arrests often result from confrontations with police.\nHe says about one-third of all people in the county jails are mentally ill.'}

### 3. Baseline Performance
#### Metric: ROUGE

Make sure to include the following:
- Performance on the training set
- Performance on the test set
- Provide some screenshots of your output (e.g., pictures, text output, or a histogram of predicted values in the case of tabular data). Any visualization of the predictions is welcome.
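
For intuition about what the ROUGE scores in this section measure, here is a minimal sketch of ROUGE-N as clipped n-gram overlap between a generated summary and a reference. It assumes lowercasing and plain whitespace tokenization, which is a simplification: the `rouge` metric loaded in this notebook uses its own tokenizer, so its numbers will differ. The names `ngrams` and `rouge_n` are illustrative only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(prediction, reference, n=1):
    """Simplified ROUGE-N: precision, recall, and F-measure of n-gram overlap."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram to the smaller of its two counts
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    fmeasure = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}

unigram = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
bigram = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)
print(round(unigram["fmeasure"], 4), round(bigram["fmeasure"], 4))  # 0.8333 0.6
```

ROUGE-1 and ROUGE-2 correspond to `n=1` and `n=2`; ROUGE-L is based on the longest common subsequence instead and is not covered by this sketch.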

In [9]:
from datasets import load_metric
rouge = load_metric("rouge")



In [10]:
import pandas as pd

actual = dataset["train"][1]["highlights"]
records = []
rge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

In [11]:
from tqdm import tqdm

In [12]:
def chunks(list_ele, batch_size):
    """Yield successive batch_size-sized slices of list_ele."""
    for i in range(0, len(list_ele), batch_size):
        yield list_ele[i : i + batch_size]
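
As a quick sanity check of the batching helper, the sketch below repeats the `chunks` definition so that it runs on its own, then applies it to a toy list. The last batch is shorter whenever the input length is not a multiple of `batch_size`.

```python
def chunks(list_ele, batch_size):
    """Yield successive batch_size-sized slices of list_ele."""
    for i in range(0, len(list_ele), batch_size):
        yield list_ele[i : i + batch_size]

batches = list(chunks(list(range(10)), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# 1,000 examples with batch_size=8 give 125 batches
n_batches = len(list(chunks(range(1000), 8)))
print(n_batches)  # 125
```

The count of 125 matches the number of steps shown in the evaluation progress bars later in this notebook.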

In [13]:
def evaluate_bart(dataset, metric, model, tokenizer,
                  batch_size=16, device=None,
                  column_text="article",
                  column_summary="highlights"):
    """Summarize each article in `dataset` with `model` and return the aggregated metric scores."""
    # Default to the device the model is already on so inputs and weights always match
    device = device or model.device

    # Split the articles and reference summaries into aligned batches
    article_batches = list(chunks(dataset[column_text], batch_size))
    target_batches = list(chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(zip(article_batches, target_batches), total=len(article_batches)):
        # Truncate each article to 1,024 tokens and pad every batch to that length
        inputs = tokenizer(article_batch, max_length=1024, truncation=True, padding="max_length", return_tensors="pt")

        # Generate summaries with beam search (4 beams, at most 128 tokens)
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                                   attention_mask=inputs["attention_mask"].to(device),
                                   max_length=128,
                                   num_beams=4,
                                   length_penalty=2.0,
                                   early_stopping=True)

        # Decode token IDs back to text and add the batch to the metric
        decoded_summaries = [tokenizer.decode(summary, skip_special_tokens=True, clean_up_tokenization_spaces=True) for summary in summaries]
        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    # Compute ROUGE over all accumulated batches
    score = metric.compute()
    return score

In [14]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"

### Performance on Train Dataset
(on a random sample of 1,000 training examples)

In [15]:
train_sampled = dataset["train"].shuffle(seed=1234).select(range(1000))

In [17]:
# hide_output
from transformers import BartForConditionalGeneration, BartTokenizer

model_ckpt = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_ckpt)
model = BartForConditionalGeneration.from_pretrained(model_ckpt).to(device)


In [18]:
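# Pass the device explicitly so the inputs are moved to wherever the model was loaded
score = evaluate_bart(train_sampled, rouge, model, tokenizer, batch_size=8, device=device)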

100%|███████████████████████████████████████| 125/125 [2:36:58<00:00, 75.35s/it]


In [19]:
rouge_dict = {rn: score[rn].mid.fmeasure for rn in score.keys()}
pd.DataFrame(rouge_dict, index=["bart"])

| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| bart | 0.408997 | 0.194298 | 0.294226 | 0.352474 |


### Performance on Test Dataset
(on a random sample of 1,000 test examples)

In [20]:
test_sampled = dataset["test"].shuffle(seed=1234).select(range(1000))

In [23]:
# hide_output
from transformers import BartForConditionalGeneration, BartTokenizer

model_ckpt = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_ckpt)
model = BartForConditionalGeneration.from_pretrained(model_ckpt).to(device)
score = evaluate_bart(test_sampled, rouge, model, tokenizer,
                      batch_size=8, device=device)
rouge_dict = {rn: score[rn].mid.fmeasure for rn in score.keys()}
pd.DataFrame(rouge_dict, index=["bart"])


100%|███████████████████████████████████████| 125/125 [1:55:23<00:00, 55.39s/it]


| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| bart | 0.427265 | 0.208488 | 0.300528 | 0.364702 |
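
To compare the two samples side by side, the scores can be put into one small structure. The sketch below copies the F-measures from the train-sample and test-sample outputs in this section and computes the difference (test minus train); nothing is re-evaluated here.

```python
# ROUGE F-measures copied from the saved outputs of the two evaluation runs
train = {"rouge1": 0.408997, "rouge2": 0.194298, "rougeL": 0.294226, "rougeLsum": 0.352474}
test = {"rouge1": 0.427265, "rouge2": 0.208488, "rougeL": 0.300528, "rougeLsum": 0.364702}

# Difference per ROUGE variant, rounded to the precision of the reported scores
delta = {name: round(test[name] - train[name], 6) for name in train}

for name in train:
    print(f"{name:<10} train={train[name]:.4f}  test={test[name]:.4f}  delta={delta[name]:+.4f}")
```

On these samples every ROUGE variant is slightly higher on the test sample than on the train sample, by about 0.006 to 0.018.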


### Screenshots of Baseline BART Summary Output on a Single Test Example

In [21]:
sample_test_text = dataset["test"][102]["article"][:2000]
sample_test_text

'(CNN)Anyone who has given birth -- or been an observer of the event -- knows how arduous it can be. But to do it live on the Internet? With two hooves sticking out for several minutes in the midst of labor? Luckily, Katie -- a giraffe at the Dallas Zoo -- is a champ. In an hour-long labor captured by 10 cameras and streamed live by Animal Planet, Katie gave birth to a not-so-little baby (about 6 feet tall) early Friday evening. There was no immediate word on the newborn\'s gender or condition. But there were good signs, as seen on the live stream and Dallas Zoo\'s Twitter feed -- like its ears moving, its efforts to stand, and its nursing (or at least trying to nurse) from mom. "We\'re so proud," the zoo tweeted. The newcomer\'s debut was a long time coming, especially when you count for Katie\'s 15-month gestation period -- average for a giraffe, according to Animal Planet. The baby joins a sister, 4-year-old calf Jamie. It wasn\'t immediately known how many people online saw Katie g

In [22]:
pipe_out = pipe(sample_test_text)
print("\n".join(sent_tokenize(pipe_out[0]["summary_text"])))

Katie, a giraffe at the Dallas Zoo, gave birth to a not-so-little baby early Friday evening.
There was no immediate word on the newborn's gender or condition.
But there were good signs, as seen on the live stream and Dallas Zoo's Twitter feed.
