# Assignment 3: Summarization Tests

**Description:** This assignment covers summarization outputs. You will compare three different solutions, all of which use an encoder-decoder architecture. Along the way you should develop an intuition for:


* How well summarization systems work
* The effects of using different pre-training and fine-tuning checkpoints on outcomes
* The effects of hyperparameters on outcomes



You do not need your GCP instance for this notebook, as generating summaries does not require a GPU to work in a timely fashion. Instead, run it on Google Colab. By default, Colab will not configure a GPU when you open the notebook, and that is fine. Summarization commands can take up to five minutes to run, depending on the hyperparameters you use.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2022-summer-main/blob/master/assignment/a3/Summarization_test.ipynb)

The overall assignment structure is as follows:

1. T5 for summarization

2. Pegasus for summarization

3. BART for summarization




**INSTRUCTIONS:**

* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1 and a2.

* **### YOUR CODE HERE** indicates that you are supposed to write code.



In [1]:
!pip install -q sentencepiece

In [2]:
!pip install -q transformers

In [3]:
!pip install -q datasets

Let's leverage the pre-trained and fine-tuned models on HuggingFace to demonstrate some capabilities. They include models/checkpoints that were fine-tuned on a particular dataset. We can leverage the datasets library to look at some of their outputs.

In [28]:
#let's make longer output readable without scrolling
from pprint import pprint

We'll use this same toy article as the input to all of our summarization attempts.  That way we have the ability to compare.

In [3]:
ARTICLE_TO_SUMMARIZE = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were "
    "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow. "
    "The record breaking drought has made the current conditions even worse than in previous years. It exponentially "
    "increases the probability of large scale wildfires."
)
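One detail worth checking when a long string is built this way: Python joins adjacent string literals with no separator, so any space between sentences has to live inside one of the pieces. A minimal sketch with made-up strings (not the article above):

```python
# Adjacent string literals inside parentheses are concatenated exactly as written.
without_space = (
    "First sentence."
    "Second sentence."
)
with_space = (
    "First sentence. "  # the trailing space keeps the two sentences apart
    "Second sentence."
)

print(without_space)  # First sentence.Second sentence.
print(with_space)     # First sentence. Second sentence.
```

A missing space fuses two words into one, which changes how the tokenizer splits the input.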


### 1. T5 for summarization

T5 is an encoder-decoder architecture that has been trained on multiple tasks, not only summarization.  You can read more about it [here](https://huggingface.co/docs/transformers/model_doc/t5).

In [4]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

model = TFT5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")
tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-base")

In [None]:
model.summary()

Since T5 can perform multiple tasks, we need to tell it what kind of output we want.  To do that, we prepend a "prompt" (a task prefix) to our article text to make sure it does the right thing.
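As a small illustration, the task is selected purely by the text placed in front of the input. The `summarize: ` prefix is the one used in this notebook; the translation prefix follows the format described for the original T5 checkpoints and appears here only for contrast. The helper name `with_task_prefix` and the sample sentence are hypothetical.

```python
def with_task_prefix(prefix: str, text: str) -> str:
    """Prepend a T5-style task prefix to the input text."""
    return prefix + text


sample = "High winds are forecast amid dry conditions."

# Same input text, two different requested tasks.
print(with_task_prefix("summarize: ", sample))
print(with_task_prefix("translate English to German: ", sample))
```

Whether a given checkpoint responds to a prefix depends on how that checkpoint was trained, so treat the prefix as a convention to verify rather than a guarantee.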

In [5]:
PROMPT = 'summarize: '
T5ARTICLE_TO_SUMMARIZE = PROMPT + ARTICLE_TO_SUMMARIZE

In [6]:
inputs = tokenizer(T5ARTICLE_TO_SUMMARIZE, max_length=1024, truncation=True, return_tensors="tf")

In [7]:
inputs

In [8]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"])
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

In [9]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"], 
                              num_beams=1,
                              no_repeat_ngram_size=1,
                              min_length=10,
                              max_length=20)
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

Let's experiment with the four hyperparameters shown in the cell above.  Please experiment in the cell below.  The num_beams value sets the width of the beam search: the number of candidate sequences the decoder keeps alive at each step before returning the highest-scoring one (a value of 1 amounts to greedy decoding).  The no_repeat_ngram_size is designed to help reduce repetition in the output by preventing any n-gram of that size from appearing more than once.  min_length and max_length set lower and upper bounds on the length of the summary, measured in tokens.
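To make the beam idea concrete, here is a toy beam search over a hand-written table of next-token probabilities. The table, the tokens, and the function name `beam_search` are all invented for illustration; the real `generate` method scores subword tokens with the model and applies further options (length penalties, early stopping) that this sketch leaves out.

```python
import math

# Toy next-token distributions (made-up numbers), keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"fire": 0.5, "wind": 0.5},
    "a": {"blackout": 0.9, "fire": 0.1},
    "fire": {"</s>": 1.0},
    "wind": {"</s>": 1.0},
    "blackout": {"</s>": 1.0},
}


def beam_search(num_beams, max_length=5):
    """Return the best (tokens, log_prob) pair found with the given beam width."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished: carry forward unchanged
                continue
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]


for width in (1, 2):
    tokens, score = beam_search(width)
    print(f"num_beams={width}:", " ".join(tokens[1:-1]))
# num_beams=1: the fire
# num_beams=2: a blackout
```

With a width of 1 the search commits to "the" because it is the likeliest first token, and never sees that "a blackout" has a higher overall probability (0.36 versus 0.30). A wider beam keeps the second-best start alive long enough to find it.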

*There is no one correct answer to these questions.  There are ranges that tend to work better than others.  The goal is to have you experiment to help build intuition.  Please enter the values that you think are generating the most readable output.*

**QUESTION:**

1.1 What num_beams value gives you the most readable output?

1.2 Which no_repeat_ngram_size gives the most readable output?

1.3 What min_length value gives you the most readable output?

1.4 Which max_length value gives the most readable output?

In [10]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"],
### YOUR CODE HERE
### END YOUR CODE
                             )
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

### 2. Pegasus for summarization 

Pegasus is an encoder-decoder architecture that has been trained as an abstractive summarizer.  You can read more about it [here](https://huggingface.co/docs/transformers/model_doc/pegasus).

We'll use the google/pegasus-xsum checkpoint.  It is trained on a summarization task that reads a news article and then emits a headline as a summary.  This doesn't mean that it is limited in its output.  It does mean that it works well with news article type inputs.
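One rough way to see what "abstractive" means in practice is to count how many of a summary's word n-grams never occur in the source text. The sketch below is an assumption-laden illustration, not part of the assignment: it splits on whitespace instead of using the model's tokenizer, and the function names and example sentences are made up.

```python
def ngrams(words, n):
    """Return the set of n-grams (as tuples) found in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def novel_ngram_fraction(source, summary, n=2):
    """Fraction of the summary's n-grams that never appear in the source."""
    source_ngrams = ngrams(source.lower().split(), n)
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams - source_ngrams) / len(summary_ngrams)


source = "the utility scheduled blackouts to reduce the risk of wildfires"
copied = "the utility scheduled blackouts"            # lifted straight from the source
reworded = "power cuts planned over wildfire fears"   # new wording throughout

print(novel_ngram_fraction(source, copied))    # 0.0
print(novel_ngram_fraction(source, reworded))  # 1.0
```

A summary that copies spans from the input scores near 0, while one that rephrases scores closer to 1. A high score says nothing about whether the rephrasing is accurate, so it complements reading the output rather than replacing it.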

In [30]:
from transformers import PegasusTokenizer, TFPegasusForConditionalGeneration

model = TFPegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")

In [31]:
model.summary()

In [32]:
inputs = tokenizer(ARTICLE_TO_SUMMARIZE, max_length=1024, truncation=True, return_tensors="tf")

In [21]:
inputs

In [33]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"])
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

In [34]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"], 
                              num_beams=1,
                              no_repeat_ngram_size=1,
                              min_length=10,
                              max_length=20)
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

Let's experiment with the same hyperparameters for the Pegasus system.  It is designed for abstractive summarization.
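When choosing a no_repeat_ngram_size, it can help to see which next tokens the setting rules out. The following is a simplified sketch of the idea, working on word strings instead of the token ids the library actually uses; the function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i:i + ngram_size - 1]) == prefix:
            banned.add(generated[i + ngram_size - 1])
    return banned


so_far = ["the", "fire", "and", "the"]
print(banned_next_tokens(so_far, 2))          # {'fire'}
print(sorted(banned_next_tokens(so_far, 1)))  # ['and', 'fire', 'the']
```

With a size of 2, only "fire" is blocked, because "the fire" has already been produced. With a size of 1, every token seen so far is blocked, which also forbids reusing common words such as "the". That is worth keeping in mind when judging the output of the cell above.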

**QUESTION:**

2.1 What num_beams value gives you the most readable output?

2.2 Which no_repeat_ngram_size gives the most readable output?

2.3 What min_length value gives you the most readable output?

2.4 Which max_length value gives the most readable output?

In [37]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"], 
### YOUR CODE HERE                        
### END YOUR CODE                             
                             )
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

### 3. BART for conditional generation

BART is an encoder-decoder architecture that uses a transformer like BERT as its encoder and a language generator like GPT2 as its decoder.  It is designed as a translator that takes symbols in and then generates symbols out.  It has not been explicitly trained as an abstractive summarizer, but it is able to generate text.  You can read more about it [here](https://huggingface.co/docs/transformers/model_doc/bart).

In [17]:
from transformers import BartTokenizer, TFBartForConditionalGeneration

model = TFBartForConditionalGeneration.from_pretrained("facebook/bart-large")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")



In [20]:
model.summary()

In [22]:
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, truncation=True, return_tensors="tf")


In [23]:
inputs

In [24]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"])
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False), compact=True)

In [25]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"], 
                              num_beams=1,
                              no_repeat_ngram_size=1,
                              min_length=10,
                              max_length=20)
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False), compact=True)

Let's experiment with the same hyperparameters for the BART system.  It is designed as a translator, taking words in and generating words as its output.
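Trying combinations one at a time gets tedious. One possible way to run several in a single pass is sketched below. It is only an illustration: `sweep` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that returns a string. In the notebook you could pass a small function that calls `model.generate` with the given settings and decodes the result.

```python
from itertools import product


def sweep(generate_fn, grid):
    """Call generate_fn once for every combination of settings in `grid`."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        settings = dict(zip(names, values))
        # Skip combinations where the lower length bound exceeds the upper one.
        if settings.get("min_length", 0) > settings.get("max_length", float("inf")):
            continue
        results.append((settings, generate_fn(**settings)))
    return results


def fake_generate(**settings):
    """Stand-in for a real generation call; only echoes its settings."""
    return f"summary with {settings}"


grid = {
    "num_beams": [1, 4],
    "no_repeat_ngram_size": [1, 2],
    "min_length": [10],
    "max_length": [20, 40],
}
for settings, text in sweep(fake_generate, grid):
    print(settings, "->", text)
```

Each generation call can take a while, so keeping the grid small is advisable.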

**QUESTION:**

3.1 What num_beams value gives you the most readable output?

3.2 Which no_repeat_ngram_size gives the most readable output?

3.3 What min_length value gives you the most readable output?

3.4 Which max_length value gives the most readable output?

In [29]:
# Generate Summary
summary_ids = model.generate(inputs["input_ids"],
### YOUR CODE HERE                             
### END YOUR CODE
                             )
pprint(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False), compact=True)

Okay, you're done.  

Which model do you think produced the best summaries, keeping in mind that "best" is in the eye of the reader?