# Assignment 3: Summarization Tests

**Description:** This assignment covers summarization outputs. You will compare three different types of solutions, all using an encoder-decoder architecture. You should also be able to develop an intuition for:


* How well summarization systems work
* The effects of using different pre-training and fine-tuning checkpoints on outcomes
* The effects of hyperparameters on outcomes
* Evaluation of output using ROUGE



This notebook should be run on Google Colab, but it does not require a GPU. By default, when you open the notebook in Colab it will NOT configure a GPU.  Summarization commands can take up to five minutes to run, depending on the hyperparameters you use. This notebook will NOT run on your GCP instance because the summarization models are larger than the available memory.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2023-summer-main/blob/master/assignment/a3/Summarization_test.ipynb)

The overall assignment structure is as follows:

 Setup

1. T5 for generic summarization

2. Pegasus for headline summarization

3. Pegasus for longer generation




**INSTRUCTIONS:**

* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1 and a2.

* **### YOUR CODE HERE** indicates that you are supposed to write code.




## Setup

In [1]:
!pip install -q sentencepiece

In [2]:
!pip install -q transformers

In [3]:
!pip install -q evaluate
import evaluate

In [4]:
!pip install -q rouge_score

In [5]:
#let's make longer output readable without horizontal scrolling
from pprint import pprint

Let's leverage the pre-trained and fine-tuned models on HuggingFace to demonstrate some capabilities of abstractive summarization and language generation.  These include checkpoints that were fine-tuned on a particular dataset.  In our case we'll focus on one dataset that emphasizes one-line output and another that emphasizes multi-line output.

We'll use the same toy article as the input to all of our summarization attempts, so that we can compare the results directly. We'll also create two references for evaluation.  These are the targets you are trying to meet.  One reference is for the longer output. The second reference is the short one for the one-line output.

In [6]:

ARTICLE_TO_SUMMARIZE = (
    "Nearly 800 thousand customers are scheduled to be affected by the shutoffs which are expected to last through at least midday tomorrow. "
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. "
    "The aim is to reduce the risk of wildfires. "
    "If Pacific Gas & Electric Co, a unit of PG&E Corp, goes through with another public safety power shutoff, "
    "it would be the fourth round of mass blackouts imposed by the utility since Oct. 9, when some 730,000 customers were left in the dark. "
    "The recent wave of precautionary shutoffs have drawn sharp criticism from Governor Gavin Newsom, state regulators and consumer activists as being overly broad in scale. "
    "Newsom blames PG&E for doing too little to properly maintain and secure its power lines against wind damage. "
    "Utility executives have acknowledged room for improvement while defending the sprawling scope of the power cutoffs as a matter of public safety. "
    "The record breaking drought has made the current conditions even worse than in previous years. "
    "It exponentially increases the probability of large scale wildfires. "
)

LONG_REFERENCE = (
    "Many PG&E customers could be affected by public safety power shutoffs in response to forecasts for high winds and dry conditions. "
    "The record breaking drought exponentially increases the probability of large scale wildfires. "
    "Despite being criticized by Governor Newsom for being overly broad, company officials defend the cutoffs as a matter of public safety. "
)

SHORT_REFERENCE = (
    "California's largest utility is set to turn off power to hundreds of thousands of customers in an effort to reduce the risk of wildfires. "
)

How long is the article we want to summarize?  The cell below counts its whitespace-separated words. Our summary should obviously be shorter, since it is supposed to be "abridged."

In [7]:
len(ARTICLE_TO_SUMMARIZE.split())

## 1. T5 for Generic Summarization

T5 is an encoder-decoder architecture that has been trained on multiple tasks, not only summarization.  You can read more about it [here](https://huggingface.co/docs/transformers/model_doc/t5).

In [8]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

t5model = TFT5ForConditionalGeneration.from_pretrained("t5-base")
t5tokenizer = T5Tokenizer.from_pretrained("t5-base")

In [9]:
t5model.summary()

Since T5 can perform multiple tasks, we need to tell it what kind of output we want.  We do that by prepending a "prompt" (a task prefix) to our article text so that the model performs summarization.

In [10]:
PROMPT = 'summarize: '
T5ARTICLE_TO_SUMMARIZE = PROMPT + ARTICLE_TO_SUMMARIZE

In [11]:
inputs = t5tokenizer(T5ARTICLE_TO_SUMMARIZE, max_length=1024, truncation=True, return_tensors="tf")

What do the inputs look like?  How do they compare with what we've seen from BERT?

In [12]:
inputs

Let's just run T5 using its default hyperparameters and see what happens.  We'll hold on to the output in the `candidate` variable.  What do you think about the output?

In [13]:
# Generate Summary
summary_ids = t5model.generate(inputs["input_ids"],
                               max_new_tokens=30
)
candidate = t5tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
pprint(candidate[0], compact=True)

### 1.a Checkpoint Configuration

We're using the `t5-base` checkpoint, and we know it can summarize out of the box, which means some of its hyperparameters already have default values.  These may or may not be the values we want to use.  How do we know which values are set as defaults?

HuggingFace provides access to the default hyperparameters via the AutoConfig object which we call below.  We simply pass in the name of the checkpoint we're using -- `t5-base` in this case.


In [14]:
from transformers import AutoConfig

config = AutoConfig.from_pretrained("t5-base")

config

Look at the `task_specific_params` entry for summarization. You can see that this `t5-base` checkpoint sets values such as `min_length` and `max_length` as well as `no_repeat_ngram_size` and `num_beams`.  You can affect the size and content of the output by modifying these parameters, which you will do below.
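If you would rather pull out just the summarization defaults than scan the whole printed config, something like the sketch below may help. The helper name `summarization_defaults` and the stand-in `toy_config` (with made-up values) are our own illustrations, not part of the assignment. The sketch assumes the config exposes `task_specific_params` as either `None` or a dict keyed by task name.

```python
from types import SimpleNamespace


def summarization_defaults(config):
    """Return a copy of the summarization defaults stored on a config object.

    Assumes `config.task_specific_params` is None or a dict keyed by task name.
    Returns an empty dict when no summarization entry is present.
    """
    params = getattr(config, "task_specific_params", None) or {}
    return dict(params.get("summarization", {}))


# Stand-in config with made-up values so the sketch runs without a download.
toy_config = SimpleNamespace(
    task_specific_params={"summarization": {"num_beams": 2, "min_length": 5}}
)
print(summarization_defaults(toy_config))
```

In the notebook you could try calling `summarization_defaults(config)` on the object returned by `AutoConfig.from_pretrained("t5-base")`. Checkpoints that store their defaults elsewhere would simply give you an empty dict.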

You can also look at the full set of possible parameters in the [TFGenerationMixin](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation#transformers.generation_tf_utils.TFGenerationMixin) class available to all of the pre-trained models.

HuggingFace has also written [a very helpful blog post](https://huggingface.co/blog/how-to-generate) that explains and discusses various strategies for text generation and how to manipulate the hyperparameters.  They discuss two approaches: beam search (which we have discussed in the async and live sessions) and sampling (which randomly draws the next word from the model's probability distribution, for example restricted to the k most probable choices).

**Please read the blog post before you proceed.**

For your reference, here's a more complex, technical, and thorough [HuggingFace guide](https://huggingface.co/docs/transformers/main/en/generation_strategies) for controlling generation of text.  The blog post above is all you need to read to complete the assignment.
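To build some intuition for what the number of beams changes, here is a toy sketch of beam search over a hand-written table of next-word probabilities. The probabilities, the `NEXT_WORD_PROBS` table, and the `beam_search` function are all made up for illustration. A real model conditions on the whole prefix rather than only the previous word, and HuggingFace's implementation has additional features (length penalties, early stopping) that are not shown here.

```python
import math

# Made-up next-word probabilities keyed by the previous word (illustration only).
NEXT_WORD_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def beam_search(start, steps, num_beams):
    """Keep the `num_beams` highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (words so far, summed log-probability)
    for _ in range(steps):
        expanded = []
        for words, score in beams:
            for word, prob in NEXT_WORD_PROBS[words[-1]].items():
                expanded.append((words + [word], score + math.log(prob)))
        # Sort all extensions by score and keep only the best `num_beams`.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:num_beams]
    return beams[0]


# num_beams=1 is greedy search: it commits to "nice" and never sees "dog has".
greedy_words, greedy_score = beam_search("The", steps=2, num_beams=1)
beam_words, beam_score = beam_search("The", steps=2, num_beams=2)
print(greedy_words, round(math.exp(greedy_score), 2))
print(beam_words, round(math.exp(beam_score), 2))
```

With a single beam the search locks in the locally best first word. With two beams it keeps the second-best first word alive long enough to discover a continuation with a higher overall probability.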

### 1.b ROUGE for summarization evaluation

ROUGE is the metric that has traditionally been used to evaluate summarization results.  The ROUGE metric expects a reference as input, and it evaluates a candidate against that reference.  ROUGE-1 is based on the number of words (unigrams) in the reference that also occur in the candidate.  ROUGE-2 performs the same calculation for bigrams. ROUGE-L is based on the longest common subsequence of words shared by the reference and the candidate.
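To make these definitions concrete, here is a simplified sketch of ROUGE-N and ROUGE-L expressed as F1 scores. It assumes plain lowercased whitespace tokenization and a single reference. The function names are our own, and the real library applies its own tokenization and options, so its numbers will generally not match this sketch exactly.

```python
from collections import Counter


def _f1(matches, candidate_total, reference_total):
    """Combine precision and recall into an F1 score (0.0 when nothing matches)."""
    if matches == 0:
        return 0.0
    precision = matches / candidate_total
    recall = matches / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """F1 over n-gram overlap, counting each n-gram at most as often as it occurs."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((cand_ngrams & ref_ngrams).values())
    return _f1(overlap, sum(cand_ngrams.values()), sum(ref_ngrams.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """F1 based on the longest common subsequence of words."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return _f1(lcs_length(cand, ref), len(cand), len(ref))


toy_candidate = "the cat sat on the mat"
toy_reference = "the cat is on the mat"
print(rouge_n(toy_candidate, toy_reference, n=1))
print(rouge_n(toy_candidate, toy_reference, n=2))
print(rouge_l(toy_candidate, toy_reference))
```

In this toy pair, five of the six words overlap and three of the five bigrams overlap. The longest common subsequence ("the cat on the mat") has five words, so ROUGE-1 and ROUGE-L agree here while ROUGE-2 is lower.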

HuggingFace provides a wrapper around [a library](https://huggingface.co/spaces/evaluate-metric/rouge) to calculate ROUGE metrics which you will use below.  Let's calculate the ROUGE score for the candidate you produced above.

In [15]:
rouge = evaluate.load('rouge')
predictions = candidate
references = [SHORT_REFERENCE]
results = rouge.compute(predictions=predictions,
                        references=references)
print(results)

Let's experiment with the hyperparameters shown above.  Please experiment in the cell below.  `num_beams` sets the beam width for beam search: the number of candidate sequences the model keeps at each decoding step before showing you its best output.  `no_repeat_ngram_size` is designed to help reduce repetition in the output by preventing any n-gram of that size from appearing twice.  `min_length` and `max_length` (or now `max_new_tokens`) set boundaries on the size of the summary. You are free to use other hyperparameters as described in the [blog post](https://huggingface.co/blog/how-to-generate).
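The idea behind `no_repeat_ngram_size` can be sketched in a few lines. Given the tokens generated so far, we find every token that would complete an n-gram that has already appeared, and forbid it at the next step. The function `banned_next_tokens` is a hypothetical illustration that works on words for readability. The library applies the same idea to token ids inside the generation loop, and its actual implementation may differ in detail.

```python
def banned_next_tokens(generated, no_repeat_ngram_size):
    """Return tokens that would repeat an n-gram already present in `generated`."""
    n = no_repeat_ngram_size
    if n <= 0 or len(generated) < n:
        return set()
    # The last n-1 tokens form the start of the n-gram we are about to complete.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        # If an earlier n-gram starts with the same n-1 tokens, ban its last token.
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


toy_tokens = "the power shutoffs and the power".split()
print(banned_next_tokens(toy_tokens, 3))
print(banned_next_tokens(toy_tokens, 4))
```

With a size of 3, the output already contains "the power shutoffs" and currently ends in "the power", so "shutoffs" is banned as the next word. With a size of 4 nothing is banned yet, because no four-word sequence is about to repeat.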

*There is no one correct answer to these questions.  There are ranges that tend to work better than others.  The goal is to have you experiment to help build intuition.  Please enter the values that you think are generating the most readable output.*

*Your readable output should consist of at least one complete sentence, though it does not have to end with a period. It must also have a ROUGE-1 score above 0.30 and a ROUGE-L score equal to or above 0.25 when compared with the short reference.*
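If you find yourself checking the thresholds by eye after every run, a small helper along these lines might save some time. It is only a convenience sketch. The name `meets_criteria` is our own, and it assumes the results dictionary uses the keys `rouge1` and `rougeL`, which is how the `evaluate` ROUGE wrapper names its scores.

```python
def meets_criteria(results, rouge1_above=0.30, rougeL_at_least=0.25):
    """Check a ROUGE results dict against the score thresholds stated above.

    ROUGE-1 must be strictly above its threshold. ROUGE-L may equal its threshold.
    """
    return bool(
        results["rouge1"] > rouge1_above and results["rougeL"] >= rougeL_at_least
    )


# Made-up scores, only to show the boundary behavior.
print(meets_criteria({"rouge1": 0.31, "rougeL": 0.25}))
print(meets_criteria({"rouge1": 0.30, "rougeL": 0.40}))
```

The helper checks only the scores. Whether the output is readable and forms a complete sentence is still your call.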

You can use the two cells below to come up with your answer.

In [16]:
# Generate Summary
summary_ids = t5model.generate(inputs["input_ids"],
### YOUR CODE HERE
### END YOUR CODE
)

candidate = t5tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
pprint(candidate[0], compact=True)

In [17]:
predictions = candidate
references = [SHORT_REFERENCE]
results = rouge.compute(predictions=predictions,
                        references=references)
print(results)

**QUESTION:**

1.1 What num_beams value gives you the most readable output that meets the score criteria?

1.2 Which no_repeat_ngram_size gives the most readable output that meets the score criteria?

1.3 What min_length value gives you the most readable output that meets the score criteria?

1.4 Which max_new_tokens value gives the most readable output that meets the score criteria?

1.5 What is the ROUGE-L score associated with your most readable candidate?

In [18]:
# Free the memory used by these large language models so we don't consume all of the memory available in Colab
del t5model
del t5tokenizer


## 2. Pegasus for Headline Summarization

Pegasus is an encoder-decoder architecture that has been explicitly pre-trained as an abstractive summarizer.  You can read more about it [here](https://huggingface.co/docs/transformers/model_doc/pegasus) and [here](https://arxiv.org/pdf/1912.08777.pdf).

We'll first use the `google/pegasus-xsum` checkpoint.  It is trained on a [summarization task](https://aclanthology.org/D18-1206.pdf) that reads a news article and then [emits a one-line summary](https://huggingface.co/datasets/xsum).  This doesn't mean that it is limited in its output length.  It does mean that it works well with news-article-style inputs and tends toward shorter outputs.

In [19]:
from transformers import PegasusTokenizer, TFPegasusForConditionalGeneration

pmodel = TFPegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
ptokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")

In [20]:
pmodel.summary()

Let's see what kinds of default parameters are configured into this checkpoint.

In [21]:
config = AutoConfig.from_pretrained("google/pegasus-xsum")

config

Generate the inputs using the pegasus tokenizer for this checkpoint.

In [22]:
inputs = ptokenizer(ARTICLE_TO_SUMMARIZE, max_length=1024, truncation=True, return_tensors="tf")

Let's get some output using just the default values and see what we're working with.

In [23]:
# Generate Summary
summary_ids = pmodel.generate(inputs["input_ids"]
)
pprint(ptokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

Let's experiment with the same set of hyperparameters for the Pegasus system.  It is designed for abstractive summarization. Remember that the checkpoint we are using was trained on data that generates a one line summary for the input article.

*Your readable output should consist of at least one complete sentence, though it does not have to end with a period. It must also have a ROUGE-1 score above 0.30 and a ROUGE-L score equal to or above 0.25 when compared with the short reference.*

You can use the two cells below to experiment with hyperparameters and to generate and score your outputs in order to answer questions 2.1 - 2.5 in your answers file.

**QUESTION:**

2.1 What num_beams value gives you the most readable output that meets the score criteria?

2.2 Which no_repeat_ngram_size gives the most readable output that meets the score criteria?

2.3 What min_length value gives you the most readable output that meets the score criteria?

2.4 Which max_new_tokens value gives the most readable output that meets the score criteria?

2.5 What is the ROUGE-L score associated with your most readable candidate?

In [24]:
# Generate Summary
summary_ids = pmodel.generate(inputs["input_ids"],
### YOUR CODE HERE
### END YOUR CODE
)
candidate = ptokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
pprint(candidate[0], compact=True)

In [25]:
rouge = evaluate.load('rouge')
predictions = candidate
references = [SHORT_REFERENCE]
results = rouge.compute(predictions=predictions,
                        references=references)
print(results)

Delete that Pegasus model and tokenizer so we can load the next one.

In [26]:
del pmodel
del ptokenizer

## 3. Pegasus for Longer Generation

Now let's try to produce a longer summary of our article.  In order to do that we are going to use a different fine-tuned checkpoint for Pegasus.  This checkpoint is fine-tuned on the [CNN/Daily Mail](https://huggingface.co/datasets/cnn_dailymail) set of news articles.  The references are on the order of several sentences long.

In [27]:
from transformers import PegasusTokenizer, TFPegasusForConditionalGeneration

cnnmodel = TFPegasusForConditionalGeneration.from_pretrained("google/pegasus-cnn_dailymail", from_pt=True)
cnntokenizer = PegasusTokenizer.from_pretrained("google/pegasus-cnn_dailymail")

Let's see how this checkpoint is configured by default:

In [28]:
config = AutoConfig.from_pretrained("google/pegasus-cnn_dailymail")

config

Let's tokenize our input for this checkpoint.

In [29]:
cnninputs = cnntokenizer(ARTICLE_TO_SUMMARIZE, max_length=1024, truncation=True, return_tensors="tf")

Run the summarizer with the defaults and let's see what it looks like.

In [30]:
# Generate Summary
summary_ids = cnnmodel.generate(cnninputs["input_ids"]
)

pprint(cnntokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

Let's again experiment with the same set of hyperparameters (but possibly with different values) for the Pegasus system.  It is designed for abstractive summarization, and this checkpoint was fine-tuned on multi-line summaries.  We'll evaluate it against the long reference.

*Your readable multi-line output must have a ROUGE-1 score above 0.25 and a ROUGE-L score above 0.15.*

You can use the two cells below to experiment with hyperparameters and to generate and score your outputs in order to answer questions 3.1 - 3.5 in your answers file.

**QUESTION:**

3.1 What num_beams value gives you the most readable output that meets the score criteria?

3.2 Which no_repeat_ngram_size gives the most readable output that meets the score criteria?

3.3 What min_length value gives you the most readable output that meets the score criteria?

3.4 Which max_new_tokens value gives you the most readable output that meets the score criteria?

3.5 What is the ROUGE-L score associated with your most readable candidate?

In [31]:
# Generate Summary
summary_ids = cnnmodel.generate(cnninputs["input_ids"],
### YOUR CODE HERE
### END YOUR CODE
                             )
candidate = cnntokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
pprint(candidate[0], compact=True)

In [32]:
rouge = evaluate.load('rouge')
predictions = candidate
references = [LONG_REFERENCE]
results = rouge.compute(predictions=predictions,
                        references=references)
print(results)

Okay, you're done.

Which model do you think produced the best summaries, keeping in mind that "best" is in the eye of the reader?