# Introduction to Transformers - Autosummarization Use Case

The first part of this notebook is based on chapter 6, **Summarization**, of the book **Natural Language Processing with Transformers**; the original notebook can be found [here](https://nbviewer.org/github/nlp-with-transformers/notebooks/blob/main/06_summarization.ipynb).

## Imports, Inits, and Functions

In [1]:
%load_ext autoreload
%autoreload 2
%config IPCompleter.greedy=True

import pdb, pickle, sys, warnings, tqdm, time, torch
warnings.filterwarnings(action='ignore')
sys.path.insert(0, '../scripts')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from tqdm.notebook import tqdm_notebook  # tqdm._tqdm_notebook is a deprecated alias
tqdm_notebook.pandas()

device = 'cuda' if torch.cuda.is_available() else 'cpu'
from transformers import pipeline, set_seed
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          AutoModelForQuestionAnswering)
from transformers import DataCollatorForSeq2Seq, TrainingArguments, Trainer
from transformers.data.processors.squad import SquadV1Processor
from datasets import load_dataset, load_metric
import nltk
from nltk.tokenize import sent_tokenize

set_seed(42)

In [2]:
def evaluate_summaries_baseline(dataset, metric, column_text='article', column_summary='abstract'):
  summaries = [three_sentence_summary(text) for text in dataset[column_text]]
  metric.add_batch(predictions=summaries, references=dataset[column_summary])    
  score = metric.compute()
  return score

def chunks(list_of_elements, batch_size):
  """
  Yield successive batch-sized chunks from list_of_elements.
  """
  for i in range(0, len(list_of_elements), batch_size):
    yield list_of_elements[i : i + batch_size]

def evaluate_summaries_pegasus(dataset, metric, model, tokenizer,
                               batch_size=8, device=device, column_text='article',
                               column_summary='abstract'):
  article_batches = list(chunks(dataset[column_text], batch_size))
  target_batches = list(chunks(dataset[column_summary], batch_size))
  for article_batch, target_batch in tqdm_notebook(zip(article_batches, target_batches), total=len(article_batches)):
    inputs = tokenizer(article_batch, max_length=1024,  truncation=True, padding='max_length', return_tensors='pt')
    summaries = model.generate(input_ids=inputs['input_ids'].to(device),
                               attention_mask=inputs['attention_mask'].to(device),
                               length_penalty=0.8, num_beams=8, max_length=128)

    decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                          clean_up_tokenization_spaces=True)
                         for s in summaries]
    decoded_summaries = [d.replace('<n>', ' ') for d in decoded_summaries]
    metric.add_batch(predictions=decoded_summaries, references=target_batch)

  score = metric.compute()
  return score

def convert_examples_to_features(example_batch):
  input_encodings = tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)

  with tokenizer.as_target_tokenizer():
    target_encodings = tokenizer(example_batch['summary'], max_length=128, truncation=True)

  return {'input_ids': input_encodings['input_ids'],
          'attention_mask': input_encodings['attention_mask'],
          'labels': target_encodings['input_ids']}

## HuggingFace's Summarization Pipeline

### Load Data

The dataset we are using for this task is the PubMed Summarization dataset, which consists of 119,924 pairs of articles and their corresponding abstracts.

This dataset can be found on the Hugging Face Hub [here](https://huggingface.co/datasets/ccdv/pubmed-summarization).

In [3]:
# art_idx = 40767
art_idx = 2
# dataset = load_dataset('cnn_dailymail', version='3.0.0')
dataset = load_dataset('ccdv/pubmed-summarization')
print(f"Features: {dataset['train'].column_names}")

sample_text = dataset['train'][art_idx]
print(f"Article (excerpt of 500 characters, total length: {len(sample_text['article'])}):")
print(sample_text['article'][:500])
print(f"\nSummary (length: {len(sample_text['abstract'])}):")
print(sample_text['abstract'])

Features: ['article', 'abstract']
Article (excerpt of 500 characters, total length: 7419):
tardive dystonia ( td ) , a rarer side effect after longer exposure to antipsychotics , is characterized by local or general , sustained , involuntary contraction of a muscle or muscle group , with twisting movements , generally slow , which may affect the limbs , trunk , neck , or face . 
 td has been shown to develop in about 3% of patients who have had long - term exposure to antipsychotics . 
 . the low risk of td for atypical antipsychotics is thought to result from their weak affinity for 

Summary (length: 1009):
tardive dystonia ( td ) is a serious side effect of antipsychotic medications , more with typical antipsychotics , that is potentially irreversible in affected patients . 
 studies show that newer atypical antipsychotics have a lower risk of td . as a result , many clinicians may have developed a false sense of security when prescribing these medications . 
 we report a case of 20

We truncate the article to its first 2,000 characters so that every model receives the same input and to stay within memory limits.

In [4]:
sample_text = dataset['train'][art_idx]['article'][:2000]
summaries = {}

### Generate Summaries using Different Models

### Naive Baseline

In [5]:
def three_sentence_summary(text):
  return '\n'.join(sent_tokenize(text)[:3])

summaries['baseline'] = three_sentence_summary(sample_text)

### GPT-2

Appending `TL;DR:` to the end of the article prompts the GPT-2 model to generate a summary instead of free text.
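
The prompting trick can be sketched without loading a model. This is only an illustration: the helper names `build_tldr_prompt` and `strip_prompt` are hypothetical and not part of the notebook or of `transformers`, and a hand-written string stands in for the model output.

```python
def build_tldr_prompt(article: str) -> str:
    """Append the TL;DR marker so a causal LM continues with a summary."""
    return article + '\nTL;DR:\n'

def strip_prompt(generated_text: str, prompt: str) -> str:
    """Keep only the continuation that follows the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text

prompt = build_tldr_prompt('Some long article text.')
# Stand-in for a model output (assumption: the output is prompt + continuation).
fake_output = prompt + 'A short summary.'
print(strip_prompt(fake_output, prompt))
```

The cell below does the same thing inline by slicing the generated text at `len(gpt2_query)`.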

In [6]:
pipe = pipeline('text-generation', model='gpt2-xl')
gpt2_query = sample_text + '\nTL;DR:\n' 
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)
summaries['gpt2'] = '\n'.join(sent_tokenize(pipe_out[0]['generated_text'][len(gpt2_query) :]))

### T5

T5 is a universal transformer architecture that formulates all tasks as text-to-text tasks. T5 checkpoints are trained on a mixture of unsupervised data (to reconstruct masked words) and supervised data for several tasks, including summarization.
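
To make the text-to-text idea concrete, here is a minimal sketch of how different tasks reduce to plain input strings by prepending a task prefix. The prefixes follow the style used by the original T5 checkpoints; the `TASK_PREFIXES` dictionary and the `to_text_to_text` helper are hypothetical names used only for illustration.

```python
# Task prefixes in the style of the original T5 checkpoints (illustrative subset).
TASK_PREFIXES = {
    'summarization': 'summarize: ',
    'en_to_de': 'translate English to German: ',
}

def to_text_to_text(task: str, text: str) -> str:
    """Turn a (task, input) pair into a single input string for the model."""
    return TASK_PREFIXES[task] + text

print(to_text_to_text('summarization', 'tardive dystonia is a rare side effect ...'))
print(to_text_to_text('en_to_de', 'The house is small.'))
```

Because every task shares this string-in, string-out format, one model and one decoding procedure can serve all of them.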

In [None]:
pipe = pipeline('summarization', model='t5-large')
pipe_out = pipe(sample_text)
summaries['t5'] = '\n'.join(sent_tokenize(pipe_out[0]['summary_text']))

Token indices sequence length is longer than the specified maximum sequence length for this model (526 > 512). Running this sequence through the model will result in indexing errors


### BART

BART also uses an encoder-decoder architecture and is trained to reconstruct corrupted inputs. It combines pretraining schemes of BERT and GPT-2.

In [8]:
pipe = pipeline('summarization', model='facebook/bart-large-cnn')
pipe_out = pipe(sample_text)
summaries['bart'] = '\n'.join(sent_tokenize(pipe_out[0]['summary_text']))

### PEGASUS

PEGASUS is also an encoder-decoder architecture. It is based on the premise that the closer the pretraining objective is to the downstream task, the more effective it is. During pretraining on a very large corpus, the sentences that contain most of the content of their surrounding paragraphs are masked, and the model is trained to reconstruct them, which yields a SOTA model for text summarization.
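
The sentence-selection idea can be sketched in a few lines. This is a simplified illustration, not the PEGASUS implementation: it scores each sentence by a set-based unigram F1 against the rest of the text, and the function name `select_gap_sentence` and the `[MASKED SENTENCE]` placeholder are assumptions made for the example.

```python
import re

def select_gap_sentence(sentences):
    """Return the index of the sentence that overlaps most with the rest of the text."""
    def tokens(s):
        return set(re.findall(r'\w+', s.lower()))

    best_idx, best_score = 0, -1.0
    for i, sent in enumerate(sentences):
        sent_tokens = tokens(sent)
        rest_tokens = set()
        for j, other in enumerate(sentences):
            if j != i:
                rest_tokens |= tokens(other)
        overlap = len(sent_tokens & rest_tokens)
        total = len(sent_tokens) + len(rest_tokens)
        # Set-based unigram F1 between the sentence and the remaining text.
        score = 2 * overlap / total if total else 0.0
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

sentences = [
    'The weather was sunny.',
    'Antipsychotics can cause tardive dystonia in rare cases.',
    'Tardive dystonia is rare.',
    'Antipsychotics are widely prescribed.',
]
idx = select_gap_sentence(sentences)
# The selected sentence becomes the target; the input has it masked out.
masked_input = ' '.join('[MASKED SENTENCE]' if i == idx else s
                        for i, s in enumerate(sentences))
target = sentences[idx]
print(masked_input)
print(target)
```

The resulting (masked input, target) pair resembles a summarization example, which is why this pretraining objective transfers well to the downstream task.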

In [9]:
pipe = pipeline('summarization', model='google/pegasus-pubmed')
pipe_out = pipe(sample_text)
summaries['pegasus'] = pipe_out[0]['summary_text'].replace(' .<n>', '.\n')

### Comparing Generated Summaries

In [10]:
from termcolor import colored

In [11]:
print(colored('GROUND TRUTH', 'red'))
print(colored(dataset['train'][art_idx]['abstract'], 'green'))
print('')

for model_name in summaries:
  print(colored(model_name.upper(), 'red'))
  print(colored(summaries[model_name], 'blue'))
  print('')

GROUND TRUTH
tardive dystonia ( td ) is a serious side effect of antipsychotic medications , more with typical antipsychotics , that is potentially irreversible in affected patients . 
 studies show that newer atypical antipsychotics have a lower risk of td . as a result , many clinicians may have developed a false sense of security when prescribing these medications . 
 we report a case of 20-year - old male with hyperthymic temperament and borderline intellectual functioning , who developed severe td after low dose short duration exposure to atypical antipsychotic risperidone and then olanzapine . 
 the goal of this paper is to alert the reader to be judicious and cautious before using casual low dose second generation antipsychotics in patient with no core psychotic features , hyperthymic temperament , or borderline intellectual functioning suggestive of organic brain damage , who are more prone to develop adverse effects such as td and monitor the onset of td in patie

### Evaluating using ROUGE Metric

The ROUGE score was developed for applications like summarization where high recall is more important than precision. ROUGE is calculated based on how many `n`-grams in the reference text also occur in the generated text.
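
As a rough illustration of the n-gram counting behind ROUGE-N, the sketch below computes recall, precision, and F-measure on whitespace tokens. It is a simplified stand-in for the metric loaded in the next cell (no stemming, no ROUGE-L, no bootstrap aggregation), and the function names `ngrams` and `rouge_n` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, prediction, n=1):
    ref = ngrams(reference.lower().split(), n)
    pred = ngrams(prediction.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both texts.
    overlap = sum((ref & pred).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(pred.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {'precision': precision, 'recall': recall, 'fmeasure': f1}

scores = rouge_n('the cat sat on the mat', 'the cat lay on the mat', n=2)
print(scores)
```

In this example three of the five reference bigrams also occur in the prediction, so recall is 3/5. The scores reported in this notebook are the F-measure, which balances recall against precision.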

In [12]:
rouge_metric = load_metric('rouge', cache_dir=None)
rouge_names = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']

test_sampled = dataset['test'].shuffle(seed=42).select(range(250))

score = evaluate_summaries_baseline(test_sampled, rouge_metric)
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
rogue_scores = pd.DataFrame.from_dict(rouge_dict, orient='index', columns=['baseline']).T


In [13]:
model_ckpt = "google/pegasus-pubmed"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)
score = evaluate_summaries_pegasus(test_sampled, rouge_metric, model, tokenizer, batch_size=4)
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
# DataFrame.append was removed in pandas 2.0; pd.concat produces the same result
rogue_scores = pd.concat([rogue_scores, pd.DataFrame(rouge_dict, index=["pegasus"])])
rogue_scores

| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| baseline | 0.270288 | 0.090744 | 0.168766 | 0.244628 |
| pegasus | 0.350126 | 0.154301 | 0.226754 | 0.296896 |


# Introduction to Transformers - Question Answering / Machine Reading Comprehension Use Case

The second and final part of this notebook is adapted from Hugging Face's QA pipeline.

## HuggingFace's QA pipeline

### Load dev dataset using HuggingFace data processors

The dev dataset we are using for this task is the SQuAD-1.1 MRC dataset, which consists of 10,570 question-answer pairs drawn from 48 Wikipedia articles.

This dataset can be found on the Hugging Face Hub [here](https://huggingface.co/datasets/squad).

Hugging Face provides the [Processors](https://huggingface.co/transformers/main_classes/processors.html) library for facilitating basic processing tasks with some canonical NLP datasets. The processors can be used for loading datasets and converting their examples to features for direct use in the model. We'll be using the [SQuAD processors](https://huggingface.co/transformers/main_classes/processors.html#squad). 

In [14]:
path_data = "/net/kdinxidk03/opt/NFS/collab_dir/SIGIR2022/"
processor = SquadV1Processor()
examples = processor.get_dev_examples(path_data, filename="squad/test_squad.json")

100%|███████████████████████████████████████████| 48/48 [00:02<00:00, 16.81it/s]


In [15]:
print('An example sample in the SQuAD-1.1 development set: ',
      '\n\nQuestion: ', examples[10].__dict__['question_text'],
      '\n\nContext: ', examples[10].__dict__['context_text'],
      '\n\nAnswer: ', examples[10].__dict__['answers'][0]['text'],
     )

An example sample in the SQuAD-1.1 development set:  

Question:  What day was the Super Bowl played on? 

Context:  Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50. 

Answer:  February 7, 2016


In [16]:
def display_example(qid):    
  idx = qid_to_example_index[qid]
  q = examples[idx].question_text
  c = examples[idx].context_text
  a = [answer['text'] for answer in examples[idx].answers]

  print(f'Example {idx} of {len(examples)}\n---------------------')
  print(f"Q: {q}\n")
  print("Context:")
  print(c)
  print(f"\nTrue Answers:\n{a}")

In [17]:
qid_to_example_index = {example.qas_id: i for i, example in enumerate(examples)}
answer_qids = list(qid_to_example_index.keys())
display_example(answer_qids[120])

Example 120 of 10570
---------------------
Q: What venue in Miami was a candidate for the site of Super Bowl 50?

Context:
The league eventually narrowed the bids to three sites: New Orleans' Mercedes-Benz Superdome, Miami's Sun Life Stadium, and the San Francisco Bay Area's Levi's Stadium.

True Answers:
['Sun Life Stadium', 'Sun Life Stadium', 'Sun Life Stadium']


### Load a transformer model fine-tuned on SQuAD-1.1 for inference

In [18]:
model_name_or_path = "csarron/bert-base-uncased-squad-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForQuestionAnswering.from_pretrained(model_name_or_path)

### Get prediction and evaluate the prediction using EM & F1 metrics

* Metrics for QA:

1. Exact Match (EM): For each question+answer pair, if the _characters_ of the model's prediction exactly match the characters of (one of) the True Answer(s), EM = 1, otherwise EM = 0. This is a strict all-or-nothing metric; being off by a single character results in a score of 0.

2. F1: F1 score is a common metric for classification problems, and widely used in QA. It is appropriate when we care equally about precision and recall. In this case, it's computed over the individual _words_ in the prediction against those in the True Answer. The number of shared words between the prediction and the truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the _prediction_, and recall is the ratio of the number of shared words to the total number of words in the _ground truth_.
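
A small worked example shows how the two metrics diverge on a near-miss. The strings are made up for illustration and the sketch skips the normalization step (lowercasing, removing punctuation and articles) that the evaluation code applies.

```python
prediction = 'denver broncos team'.split()
truth = 'denver broncos'.split()

# Exact match is all-or-nothing.
exact_match = int(prediction == truth)

# F1 gives partial credit based on shared words.
shared = len(set(prediction) & set(truth))   # 2 shared words
precision = shared / len(prediction)         # 2 / 3
recall = shared / len(truth)                 # 2 / 2
f1 = 2 * precision * recall / (precision + recall)

print(exact_match, round(f1, 2))  # 0 0.8
```

The extra word costs the prediction its exact match, while F1 still credits the two correct words.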

In [19]:
def get_prediction(qid):
    # given a question id (qas_id or qid), load the example, get the model outputs and generate an answer
    question = examples[qid_to_example_index[qid]].question_text
    context = examples[qid_to_example_index[qid]].context_text

    inputs = tokenizer.encode_plus(question, context, return_tensors='pt')

    outputs = model(**inputs)
    answer_start = torch.argmax(outputs[0])  # get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(outputs[1]) + 1 

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))

    return answer
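
Taking the start and end argmax independently, as `get_prediction` does, can occasionally place the end before the start, which produces an empty answer. A minimal sketch of one common alternative is to pick the highest-scoring pair with `start <= end` and a bounded length. Plain Python lists stand in for the model's logits here, and `best_span` and `max_answer_len` are hypothetical names, not part of the notebook or of `transformers`.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair, end inclusive, with the highest combined score."""
    best, best_score = (0, 0), float('-inf')
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start_scores = [0.1, 0.2, 3.0, 0.3]
end_scores = [2.5, 0.1, 0.4, 2.0]
# Independent argmax would give start=2, end=0, which is not a valid span.
print(best_span(start_scores, end_scores))  # (2, 3)
```

The returned end index is inclusive, so slicing the token ids would use `end + 1`, matching the `+ 1` in `get_prediction`.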

In [20]:
# these functions are heavily influenced by the HF squad_metrics.py script
def normalize_text(s):
    """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
    import string, re

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_exact_match(prediction, truth):
    return int(normalize_text(prediction) == normalize_text(truth))

def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    
    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)
    
    # count shared tokens as a multiset so repeated words are not over- or under-counted
    from collections import Counter
    num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    
    # if there are no common tokens then f1 = 0
    if num_common == 0:
        return 0
    
    prec = num_common / len(pred_tokens)
    rec = num_common / len(truth_tokens)
    
    return 2 * (prec * rec) / (prec + rec)

def get_gold_answers(example):
    """helper function that retrieves all possible true answers from a squad2.0 example"""
    
    gold_answers = [answer["text"] for answer in example.answers if answer["text"]]

    # if gold_answers doesn't exist it's because this is a negative example - 
    # the only correct answer is an empty string
    if not gold_answers:
        gold_answers = [""]
        
    return gold_answers

In [21]:
idx = 200
prediction = get_prediction(answer_qids[idx])
example = examples[qid_to_example_index[answer_qids[idx]]]
gold_answers = get_gold_answers(example)

em_score = max((compute_exact_match(prediction, answer)) for answer in gold_answers)
f1_score = max((compute_f1(prediction, answer)) for answer in gold_answers)

print(f"\nQuestion: {example.question_text}")
print(f"\nPrediction: {prediction}")
print(f"\nTrue Answers: {gold_answers}")
print(f"\nPerformance Scores: EM: {em_score} \t F1: {f1_score}")


Question: Who had the best record in the NFC?

Prediction: carolina panthers

True Answers: ['Carolina Panthers', 'the Panthers', 'Carolina']

Performance Scores: EM: 1 	 F1: 1.0
