# Introduction to Transformers - Use Cases Using HuggingFace Transformers Library

This notebook is based on Chapter 6, **Summarization**, of the book **Natural Language Processing with Transformers**. The original notebook can be found [here](https://nbviewer.org/github/nlp-with-transformers/notebooks/blob/main/06_summarization.ipynb).

## Imports & Inits

In [None]:
%load_ext autoreload
%autoreload 2
%config IPCompleter.greedy=True

import pdb, pickle, sys, warnings, tqdm, time, torch
warnings.filterwarnings(action='ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from tqdm.notebook import tqdm_notebook  # tqdm._tqdm_notebook is a deprecated module path
tqdm_notebook.pandas()

from transformers import pipeline, set_seed, AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset, load_metric
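import nltk
from nltk.tokenize import sent_tokenize
# sent_tokenize needs the Punkt tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt', quiet=True)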

set_seed(42)

## Load Dataset

The dataset we are using for this task is the CNN/DailyMail dataset:
* It consists of around 300,000 pairs of news articles and their corresponding summaries
* The summaries are composed of the bullet points that CNN and the DailyMail attach to their articles
* The summaries are abstractive rather than extractive: they consist of new sentences instead of simple excerpts

This dataset can be found on the Hugging Face Hub [here](https://huggingface.co/datasets/cnn_dailymail).

In [None]:
art_idx = 40767
# '3.0.0' is the dataset configuration name, passed as the second positional argument
dataset = load_dataset('cnn_dailymail', '3.0.0')
print(f"Features: {dataset['train'].column_names}")

sample_text = dataset['train'][art_idx]
print(f"Article (excerpt of 500 characters, total length: {len(sample_text['article'])}):")
print(sample_text['article'][:500])
print(f"\nSummary (length: {len(sample_text['highlights'])}):")
print(sample_text['highlights'])

We limit the article to its first 2,000 characters so that every model receives the same input and so that we stay within memory limits.

In [None]:
sample_text = dataset['train'][art_idx]['article'][:2000]
summaries = {}

## Generate Summaries using Different Models

### Naive Baseline

In [None]:
def three_sentence_summary(text):
  return '\n'.join(sent_tokenize(text)[:3])

summaries['baseline'] = three_sentence_summary(sample_text)

### GPT-2

Appending `TL;DR:` to the end of the article prompts the GPT-2 model to generate a summary instead of generating free text.
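
The prompt construction and the slicing used in the next cell can be sketched without loading a model. This is a minimal illustration: `build_tldr_prompt` and `strip_prompt` are hypothetical helper names, and `fake_output` is a hand-written stand-in for the `generated_text` field that a text-generation pipeline returns (which echoes the prompt before the continuation).

```python
def build_tldr_prompt(article):
    """Append the TL;DR cue that nudges the model towards a summary."""
    return article + '\nTL;DR:\n'

def strip_prompt(prompt, generated_text):
    """Drop the echoed prompt so that only the continuation remains."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text

prompt = build_tldr_prompt('Stocks fell on Friday after a weak jobs report.')
# Hand-written stand-in for pipe(prompt)[0]['generated_text'] (an assumption, not real model output)
fake_output = prompt + 'Markets dropped on weak jobs data.'
summary = strip_prompt(prompt, fake_output)
print(summary)
```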

In [None]:
pipe = pipeline('text-generation', model='gpt2-xl')
gpt2_query = sample_text + '\nTL;DR:\n' 
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)
summaries['gpt2'] = '\n'.join(sent_tokenize(pipe_out[0]['generated_text'][len(gpt2_query) :]))

### T5

T5 is a universal transformer architecture that formulates all tasks as text-to-text tasks. T5 checkpoints are trained on a mixture of unsupervised data (to reconstruct masked words) and supervised data for several tasks, including summarization.
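
To illustrate what "text-to-text" means in practice, the sketch below formats inputs for two tasks by prepending a task prefix, which is how T5 distinguishes tasks. The helper name `to_text_to_text` and the example sentences are ours; the `summarize` and `translate English to German` prefixes are the ones commonly used with the public T5 checkpoints.

```python
def to_text_to_text(task_prefix, text):
    """Format an input as '<task prefix>: <text>' for a text-to-text model."""
    return f"{task_prefix}: {text.strip()}"

examples = [
    to_text_to_text('summarize', 'The storm closed schools along the coast on Monday.'),
    to_text_to_text('translate English to German', 'That is good.'),
]
for example in examples:
    print(example)
```

Every task then shares the same interface: a string goes in and a string comes out, so one model and one loss function can cover all of them.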

In [None]:
pipe = pipeline('summarization', model='t5-large')
pipe_out = pipe(sample_text)
summaries['t5'] = '\n'.join(sent_tokenize(pipe_out[0]['summary_text']))

### BART

BART also uses an encoder-decoder architecture and is trained to reconstruct corrupted inputs. It combines the pretraining schemes of BERT and GPT-2.
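
As a rough picture of "reconstructing corrupted inputs", the toy function below replaces one contiguous span of tokens with a single mask token and returns the original sequence as the target. It is a simplified sketch of one kind of corruption only: the function name `infill_corrupt`, the fixed span length, and the `<mask>` string are assumptions made for illustration, not BART's actual preprocessing code.

```python
import random

def infill_corrupt(tokens, span_len=2, mask_token='<mask>', seed=0):
    """Replace one contiguous span of tokens with a single mask token."""
    rng = random.Random(seed)
    start = rng.randrange(0, len(tokens) - span_len + 1)
    corrupted = tokens[:start] + [mask_token] + tokens[start + span_len:]
    # The encoder sees the corrupted sequence; the decoder learns to emit the original.
    return corrupted, tokens

tokens = 'the quick brown fox jumps over the lazy dog'.split()
corrupted, target = infill_corrupt(tokens)
print(' '.join(corrupted))
print(' '.join(target))
```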

In [None]:
pipe = pipeline('summarization', model='facebook/bart-large-cnn')
pipe_out = pipe(sample_text)
summaries['bart'] = '\n'.join(sent_tokenize(pipe_out[0]['summary_text']))

### PEGASUS

PEGASUS is also an encoder-decoder architecture. It is based on the premise that the closer the pretraining objective is to the downstream task, the more effective it is. During pretraining on a very large corpus, the sentences that contain most of the content of their surrounding paragraphs are masked and the model learns to reconstruct them, which yields a SOTA model for text summarization.
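
The idea of masking the most "summary-like" sentence can be sketched as follows. This is a toy approximation under stated assumptions: sentence importance is scored by plain word overlap with the rest of the document rather than by the scoring used for the real model, and the helper names and the `<mask_1>` placeholder are chosen for illustration.

```python
def select_gap_sentence(sentences):
    """Return the index of the sentence that overlaps most with the rest of the document."""
    def overlap(i):
        words = set(sentences[i].lower().split())
        rest = {w for j, s in enumerate(sentences) if j != i for w in s.lower().split()}
        return len(words & rest) / max(len(words), 1)
    return max(range(len(sentences)), key=overlap)

def make_gap_sentence_example(sentences, mask_token='<mask_1>'):
    """Mask the selected sentence in the source and use it as the generation target."""
    idx = select_gap_sentence(sentences)
    source = ' '.join(mask_token if i == idx else s for i, s in enumerate(sentences))
    return source, sentences[idx]

doc = [
    'The storm hit the coast on Monday.',
    'The storm flooded roads on the coast and closed schools.',
    'Officials will reopen schools next week.',
]
source, target = make_gap_sentence_example(doc)
print('SOURCE:', source)
print('TARGET:', target)
```

Because the target is a whole sentence that has to be generated from the remaining context, the pretraining task already resembles abstractive summarization.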

In [None]:
pipe = pipeline('summarization', model='google/pegasus-cnn_dailymail')
pipe_out = pipe(sample_text)
summaries['pegasus'] = pipe_out[0]['summary_text'].replace(' .<n>', '.\n')

## Comparing Generated Summaries

In [None]:
print('GROUND TRUTH')
print(dataset['train'][art_idx]['highlights'])  # reference summary of the same article we summarized
print('')

for model_name in summaries:
  print(model_name.upper())
  print(summaries[model_name])
  print('')

## Evaluating using ROUGE Metric

The ROUGE score was developed for applications like summarization where high recall is more important than precision. ROUGE is calculated based on how many `n`-grams in the reference text also occur in the generated text.
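
To make the n-gram counting concrete, here is a toy ROUGE-N computation. It is a simplified sketch, not the implementation behind `load_metric('rouge')`: it assumes lowercased whitespace tokenization and no stemming, and the helper names `ngrams` and `rouge_n` are ours.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, prediction, n=1):
    """Toy ROUGE-N based on clipped n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    pred = ngrams(prediction.lower().split(), n)
    overlap = sum((ref & pred).values())  # each n-gram counts at most as often as in both texts
    recall = overlap / max(sum(ref.values()), 1)       # share of reference n-grams recovered
    precision = overlap / max(sum(pred.values()), 1)   # share of generated n-grams that match
    fmeasure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {'recall': recall, 'precision': precision, 'fmeasure': fmeasure}

reference = 'the cat sat on the mat'
prediction = 'the cat lay on the mat'
rouge1 = rouge_n(reference, prediction, n=1)
rouge2 = rouge_n(reference, prediction, n=2)
print(rouge1)
print(rouge2)
```

For this pair, 5 of the 6 reference unigrams and 3 of the 5 reference bigrams are matched, so ROUGE-1 recall is 5/6 and ROUGE-2 recall is 3/5.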

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

def evaluate_summaries_baseline(dataset, metric, column_text='article', column_summary='highlights'):
  summaries = [three_sentence_summary(text) for text in dataset[column_text]]
  metric.add_batch(predictions=summaries, references=dataset[column_summary])    
  score = metric.compute()
  return score

def chunks(list_of_elements, batch_size):
  """
  Yield successive batch-sized chunks from list_of_elements.
  """
  for i in range(0, len(list_of_elements), batch_size):
    yield list_of_elements[i : i + batch_size]

def evaluate_summaries_pegasus(dataset, metric, model, tokenizer,
                               batch_size=8, device=device, column_text='article',
                               column_summary='highlights'):
  article_batches = list(chunks(dataset[column_text], batch_size))
  target_batches = list(chunks(dataset[column_summary], batch_size))
  for article_batch, target_batch in tqdm_notebook(zip(article_batches, target_batches), total=len(article_batches)):
    inputs = tokenizer(article_batch, max_length=1024,  truncation=True, padding='max_length', return_tensors='pt')
    summaries = model.generate(input_ids=inputs['input_ids'].to(device),
                               attention_mask=inputs['attention_mask'].to(device),
                               length_penalty=0.8, num_beams=8, max_length=128)

    decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                          clean_up_tokenization_spaces=True)
                         for s in summaries]
    decoded_summaries = [d.replace('<n>', ' ') for d in decoded_summaries]
    metric.add_batch(predictions=decoded_summaries, references=target_batch)

  score = metric.compute()
  return score
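
As a quick sanity check of the batching helper, here is `chunks` applied to a small list. The definition is repeated so that the snippet runs on its own. Note that the last batch is shorter when the list length is not a multiple of `batch_size`.

```python
def chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

batches = list(chunks(['a', 'b', 'c', 'd', 'e'], batch_size=2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```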

In [None]:
rouge_metric = load_metric('rouge', cache_dir=None)
rouge_names = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']

In [None]:
test_sampled = dataset['test'].shuffle(seed=42).select(range(250))

score = evaluate_summaries_baseline(test_sampled, rouge_metric)
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
rogue_scores = pd.DataFrame.from_dict(rouge_dict, orient='index', columns=['baseline']).T

In [None]:
model_ckpt = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)
score = evaluate_summaries_pegasus(test_sampled, rouge_metric, model, tokenizer, batch_size=4)
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
rogue_scores = pd.concat([rogue_scores, pd.DataFrame(rouge_dict, index=["pegasus"])])
rogue_scores

## Training a Summarization Model