Summarization Tests

**Description:** Three different summarization approaches, each using an encoder-decoder architecture, with the output evaluated using ROUGE.



This notebook should be run on Google Colab, but it does not require a GPU. By default, when you open the notebook in Colab it will NOT configure a GPU. Summarization commands can take up to five minutes to run depending on the hyperparameters you use. This notebook will NOT run on your GCP instance, as the summarization models are larger than the available memory.

 Setup

1. T5 for generic summarization

2. Pegasus for headline summarization

3. Pegasus for longer generation

## Setup

In [None]:
!pip install -q sentencepiece

In [None]:
!pip install -q transformers

In [None]:
!pip install -q evaluate
import evaluate

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ibis-framework 7.1.0 requires pyarrow<15,>=2, but you have pyarrow 15.0.0 which is incompatible.

In [None]:
!pip install -q rouge_score

  Preparing metadata (setup.py) ... done
  Building wheel for rouge_score (setup.py) ... done


In [None]:
#let's make longer output readable without horizontal scrolling
from pprint import pprint

We use pretrained Hugging Face models: one checkpoint fine-tuned on a dataset with one-line summaries, and another fine-tuned on a dataset with multi-line summaries.

Use this same toy article as the input to all of our summarization attempts for comparison.

In [None]:

ARTICLE_TO_SUMMARIZE = (
    "Nearly 800 thousand customers are scheduled to be affected by the shutoffs which are expected to last through at least midday tomorrow. "
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. "
    "The aim is to reduce the risk of wildfires. "
    "If Pacific Gas & Electric Co, a unit of PG&E Corp, goes through with another public safety power shutoff, "
    " it would be the fourth round of mass blackouts imposed by the utility since Oct. 9, when some 730,000 customers were left in the dark. "
    "The recent wave of precautionary shutoffs have drawn sharp criticism from Governor Gavin Newsom, state regulators and consumer activists as being overly broad in scale."
    "Newsom blames PG&E for doing too little to properly maintain and secure its power lines against wind damage."
    "Utility executives have acknowledged room for improvement while defending the sprawling scope of the power cutoffs as a matter of public safety."
    "The record breaking drought has made the current conditions even worse than in previous years. "
    "It exponentially increases the probability of large scale wildfires. "
)

LONG_REFERENCE = (
    "Many PG&E customers could be affected by public safety power shutoffs in response to forecasts for high winds and dry conditions. "
    "The record breaking drought exponentially increases the probability of large scale wildfires. "
    "Despite being criticized by Governor Newsom for being overly broad, company officials defend the cutoffs as a matter of public safety. "
)

SHORT_REFERENCE = (
    "California's largest utility is set to turn off power to hundreds of thousands of customers in an effort to reduce the risk of wildfires. "
)

Length of the article to summarize, in whitespace-separated words:

In [None]:
len(ARTICLE_TO_SUMMARIZE.split())

177

## 1. T5 for Generic Summarization



In [None]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

t5model = TFT5ForConditionalGeneration.from_pretrained("t5-base")
t5tokenizer = T5Tokenizer.from_pretrained("t5-base")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.



For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
t5model.summary()

Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (Embedding)          multiple                  24674304  
                                                                 
 encoder (TFT5MainLayer)     multiple                  109628544 
                                                                 
 decoder (TFT5MainLayer)     multiple                  137949312 
                                                                 
Total params: 222903552 (850.31 MB)
Trainable params: 222903552 (850.31 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


T5 is a multitask model, so we need to prepend a task prefix (a "prompt") to the input to tell it which task to perform.

In [None]:
PROMPT = 'summarize: '
T5ARTICLE_TO_SUMMARIZE = PROMPT + ARTICLE_TO_SUMMARIZE

In [None]:
inputs = t5tokenizer(T5ARTICLE_TO_SUMMARIZE, max_length=1024, truncation=True, return_tensors="tf")

In [None]:
inputs

{'input_ids': <tf.Tensor: shape=(1, 241), dtype=int32, numpy=
array([[21603,    10, 10455,   120,  8640,  7863,   722,    33,  5018,
           12,    36,  4161,    57,     8,  6979,  1647,     7,    84,
           33,  1644,    12,   336,   190,    44,   709,  2076,  1135,
         5721,     5,     3,  7861,   184,   427,  4568,    34,  5018,
            8,  1001,   670,     7,    16,  1773,    12,  7555,     7,
           21,   306, 13551, 18905,  2192,  1124,     5,    37,  2674,
           19,    12,  1428,     8,  1020,    13,  3645,  6608,     7,
            5,   156,  5824,  6435,     3,   184,  8666,   638,     6,
            3,     9,  1745,    13,     3,  7861,   184,   427, 10052,
            6,  1550,   190,    28,   430,   452,  1455,   579,  6979,
         1647,     6,    34,   133,    36,     8,  4509,  1751,    13,
         3294,  1001,   670,     7,     3, 16068,    57,     8,  6637,
          437,  6416,     5,  9902,   116,   128,   489, 17093,   722,
          130, 

Running T5 with the default generation hyperparameters, apart from capping the output at 30 new tokens.

In [None]:
# Generate Summary
summary_ids = t5model.generate(inputs["input_ids"],
                               max_new_tokens=30
)
candidate = t5tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
pprint(candidate[0], compact=True)

('PG&e shuts off power to 800 thousand customers . the shutoffs are scheduled '
 'to last through at least midday tomorrow .')


### 1.a Checkpoint Configuration



Hugging Face provides access to a checkpoint's default hyperparameters via the `AutoConfig` object, which we load below.


In [None]:
from transformers import AutoConfig

config = AutoConfig.from_pretrained("t5-base")

config

T5Config {
  "_name_or_path": "t5-base",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
   

For a technical reference on controlling text generation, see the Hugging Face [generation strategies guide](https://huggingface.co/docs/transformers/main/en/generation_strategies).
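The `task_specific_params` entry in the printed configuration bundles a text prefix with suggested generation settings for each task. As a minimal sketch (the dictionary below is copied by hand from the printout above, and the helper name `split_prefix_and_generate_kwargs` is ours, not part of `transformers`), the prefix can be separated from the settings meant for `generate()`:

```python
# Values copied by hand from the "summarization" entry of the printed T5 config.
T5_SUMMARIZATION_PARAMS = {
    "early_stopping": True,
    "length_penalty": 2.0,
    "max_length": 200,
    "min_length": 30,
    "no_repeat_ngram_size": 3,
    "num_beams": 4,
    "prefix": "summarize: ",
}


def split_prefix_and_generate_kwargs(task_params):
    """Separate the text prefix from the keyword arguments meant for generate().

    Hypothetical helper for illustration; not a transformers API.
    """
    kwargs = dict(task_params)         # copy so the caller's dict is not mutated
    prefix = kwargs.pop("prefix", "")  # the prefix is prepended to the text itself
    return prefix, kwargs


prefix, generate_kwargs = split_prefix_and_generate_kwargs(T5_SUMMARIZATION_PARAMS)
print(repr(prefix))
print(generate_kwargs)
```

In this notebook the input would presumably be `config.task_specific_params["summarization"]`, and the resulting dictionary could be passed as `t5model.generate(inputs["input_ids"], **generate_kwargs)`. We have not run that variant here, so treat it as an assumption rather than a tested recipe.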

### 1.b ROUGE for summarization evaluation

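ROUGE-1 measures the overlap of unigrams (single words) between the candidate and the reference.  
ROUGE-2 performs the same calculation for bigrams.  
ROUGE-L is based on the longest common subsequence of words shared by the candidate and the reference.  
The scores reported below are F-measures, which combine precision (overlap relative to the candidate length) and recall (overlap relative to the reference length).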

HuggingFace provides a wrapper around [a library](https://huggingface.co/spaces/evaluate-metric/rouge) to calculate ROUGE metrics.
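To make the three metrics concrete, here is a simplified pure-Python sketch of ROUGE-N and ROUGE-L as F-measures. It assumes lowercase whitespace tokenization, which is cruder than what the wrapped library does, so its numbers will not necessarily match `rouge.compute` exactly. All function names are ours.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplification of real ROUGE tokenizers)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, candidate_total, reference_total):
    """Harmonic mean of precision (overlap/candidate) and recall (overlap/reference)."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F-measure based on clipped n-gram matches."""
    cand, ref = ngrams(tokenize(candidate), n), ngrams(tokenize(reference), n)
    overlap = sum((cand & ref).values())  # Counter intersection clips repeated matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F-measure based on the longest common subsequence of tokens."""
    cand, ref = tokenize(candidate), tokenize(reference)
    return f1(lcs_length(cand, ref), len(cand), len(ref))


candidate_text = "the cat sat on the mat"
reference_text = "the cat lay on the mat"
print(rouge_n(candidate_text, reference_text, 1))
print(rouge_n(candidate_text, reference_text, 2))
print(rouge_l(candidate_text, reference_text))
```

In the toy pair, five of six unigrams and three of five bigrams match, and the longest common subsequence has five words. ROUGE-L can exceed ROUGE-2 because a subsequence does not need to be contiguous.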

In [None]:
rouge = evaluate.load('rouge')
predictions = candidate
references = [SHORT_REFERENCE]
results = rouge.compute(predictions=predictions,
                        references=references)
print(results)


{'rouge1': 0.2666666666666666, 'rouge2': 0.0930232558139535, 'rougeL': 0.22222222222222224, 'rougeLsum': 0.22222222222222224}


Let's experiment with the hyperparameters:

In [None]:
# Generate Summary
summary_ids = t5model.generate(inputs["input_ids"],
                               max_new_tokens=32,
                               num_beams=4,
                               no_repeat_ngram_size=2,
                               min_length=30
)

candidate = t5tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
pprint(candidate[0], compact=True)

('PG&e shuts off power to 800 thousand customers in response to forecasts for '
 'high winds . the aim is to reduce the risk of wildfire')


In [None]:
predictions = candidate
references = [SHORT_REFERENCE]
results = rouge.compute(predictions=predictions,
                        references=references)
print(results)

{'rouge1': 0.48, 'rouge2': 0.2916666666666667, 'rougeL': 0.4000000000000001, 'rougeLsum': 0.4000000000000001}
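The `no_repeat_ngram_size` setting used above prevents any n-gram of that size from appearing twice in the output. The sketch below illustrates the idea on plain word lists. It is our own simplified reconstruction, not the library's code, which operates on token ids and masks the banned tokens' scores during decoding.

```python
def banned_next_tokens(generated, no_repeat_ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    n = no_repeat_ngram_size
    if n <= 0 or len(generated) + 1 < n:
        return set()
    # The last n-1 tokens form the prefix of whatever n-gram comes next.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        if ngram[:-1] == prefix:   # same prefix seen earlier in the sequence
            banned.add(ngram[-1])  # so its continuation may not be repeated
    return banned


words = ["the", "risk", "of", "wildfires", "and", "the", "risk"]
print(banned_next_tokens(words, 2))
print(banned_next_tokens(words, 3))
print(banned_next_tokens(words, 4))
```

With a size of 2 or 3 the word "of" is blocked as the next token, because "risk of" and "the risk of" already occur. With a size of 4 nothing is blocked, since the last three words have not appeared together before.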



Best ROUGE result for T5, obtained with:

num_beams: 4

no_repeat_ngram_size: 2

min_length: 30

max_new_tokens: 32

ROUGE-L score: 0.40
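The manual comparison above could be automated with a small grid search. The sketch below is generic: `summarize_fn` and `score_fn` are hypothetical callables, and the stand-ins used here are toys so that the example runs without loading a model.

```python
from itertools import product


def grid_search(summarize_fn, score_fn, grid):
    """Try every combination in `grid` and return (best_score, best_params)."""
    names = sorted(grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(summarize_fn(**params))
        if score > best_score:  # keep the first combination on ties
            best_score, best_params = score, params
    return best_score, best_params


# Hypothetical stand-ins so the sketch runs without a model.
def fake_summarize(num_beams, min_length):
    return "word " * (min_length + num_beams)


def fake_score(summary):
    # Toy score that peaks when the summary has exactly 34 words.
    return -abs(len(summary.split()) - 34)


grid = {"num_beams": [1, 4], "min_length": [10, 30]}
best_score, best_params = grid_search(fake_summarize, fake_score, grid)
print(best_score, best_params)
```

In the notebook, `summarize_fn` would wrap `t5model.generate` plus `batch_decode`, and `score_fn` would return `rouge.compute(...)["rougeL"]`. Keep in mind that tuning against a single reference summary, as done here, says little about how the settings generalize to other articles.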

In [None]:
# free up the memory we're using for these large language models
del t5model
del t5tokenizer


## 2. Pegasus for Headline Summarization

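Pegasus is an encoder-decoder architecture that has been explicitly pre-trained as an abstractive summarizer. The `google/pegasus-xsum` checkpoint used here is fine-tuned on the XSum dataset, whose reference summaries are a single sentence.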

In [None]:
from transformers import PegasusTokenizer, TFPegasusForConditionalGeneration

pmodel = TFPegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
ptokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")


All model checkpoint layers were used when initializing TFPegasusForConditionalGeneration.

Some layers of TFPegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['final_logits_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
pmodel.summary()

Model: "tf_pegasus_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 model (TFPegasusMainLayer)  multiple                  569748480 
                                                                 
 final_logits_bias (BiasLay  multiple                  96103     
 er)                                                             
                                                                 
Total params: 569844583 (2.12 GB)
Trainable params: 569748480 (2.12 GB)
Non-trainable params: 96103 (375.40 KB)
_________________________________________________________________


Check the default parameters at this checkpoint:

In [None]:
config = AutoConfig.from_pretrained("google/pegasus-xsum")

config

PegasusConfig {
  "_name_or_path": "google/pegasus-xsum",
  "activation_dropout": 0.1,
  "activation_function": "relu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "PegasusForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 16,
  "decoder_start_token_id": 0,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 16,
  "eos_token_id": 1,
  "extra_pos_embeddings": 0,
  "force_bos_token_to_be_generated": false,
  "forced_eos_token_id": 1,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,


Generate the inputs:

In [None]:
inputs = ptokenizer(ARTICLE_TO_SUMMARIZE, max_length=1024, truncation=True, return_tensors="tf")

Output using just the default values:

In [None]:
# Generate Summary
summary_ids = pmodel.generate(inputs["input_ids"]
)
pprint(ptokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

("California's largest utility has announced plans to cut power to hundreds of "
 'thousands of customers in a bid to reduce the risk of wildfires.')


Experiment with a set of hyperparameters for the Pegasus system. The following values were used, with the resulting ROUGE-L score against the short reference:

num_beams value: 5

no_repeat_ngram_size: 2

min_length value: 50

max_new_tokens: 100

ROUGE-L score: 0.51351

In [None]:
# Generate Summary
summary_ids = pmodel.generate(inputs["input_ids"],
                              num_beams=5,
                              no_repeat_ngram_size=2,
                              min_length=50,
                              max_new_tokens=100
)
candidate = ptokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
pprint(candidate[0], compact=True)

("California's largest utility has announced plans to cut power to hundreds of "
 'thousands of customers in a bid to reduce the risk of wildfires as the state '
 'endures its worst drought in living memory, according to state officials and '
 'industry executives who spoke on the condition of anonymity.')


In [None]:
rouge = evaluate.load('rouge')
predictions = candidate
references = [SHORT_REFERENCE]
results = rouge.compute(predictions=predictions,
                        references=references)
print(results)

{'rouge1': 0.5135135135135135, 'rouge2': 0.4166666666666667, 'rougeL': 0.5135135135135135, 'rougeLsum': 0.5135135135135135}


Delete Pegasus model and tokenizer to free memory

In [None]:
del pmodel
del ptokenizer

## 3. Pegasus for Longer Generation

Produce a longer summary of the article.  Use a different fine-tuned checkpoint for Pegasus.  This checkpoint is fine-tuned on the [CNN/Daily Mail](https://huggingface.co/datasets/cnn_dailymail) set of news articles. 

In [None]:
from transformers import PegasusTokenizer, TFPegasusForConditionalGeneration

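# from_pt=True loads the PyTorch weights and converts them for use in TensorFlow
cnnmodel = TFPegasusForConditionalGeneration.from_pretrained("google/pegasus-cnn_dailymail", from_pt=True)
# from_pt concerns model weights only, so the tokenizer does not need it
cnntokenizer = PegasusTokenizer.from_pretrained("google/pegasus-cnn_dailymail")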


All PyTorch model weights were used when initializing TFPegasusForConditionalGeneration.

Some weights or buffers of the TF 2.0 model TFPegasusForConditionalGeneration were not initialized from the PyTorch model and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Check the default configuration of this checkpoint:

In [None]:
config = AutoConfig.from_pretrained("google/pegasus-cnn_dailymail")

config

PegasusConfig {
  "_name_or_path": "google/pegasus-cnn_dailymail",
  "activation_dropout": 0.1,
  "activation_function": "relu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "PegasusForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 16,
  "decoder_start_token_id": 0,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 16,
  "eos_token_id": 1,
  "extra_pos_embeddings": 1,
  "forced_eos_token_id": 1,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "length_penalty": 0.8,
  "max_length": 128,
  "max_position_embeddings": 1024,
  "min_length": 3

Tokenize our input:

In [None]:
cnninputs = cnntokenizer(ARTICLE_TO_SUMMARIZE, max_length=1024, truncation=True, return_tensors="tf")

Run the summarizer:

In [None]:
# Generate Summary (use the inputs produced by this checkpoint's own tokenizer)
summary_ids = cnnmodel.generate(cnninputs["input_ids"]
)

pprint(cnntokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

('Nearly 800 thousand customers are scheduled to be affected by the shutoffs '
 'which are expected to last through at least midday tomorrow .<n>PG&E stated '
 'it scheduled the blackouts in response to forecasts for high winds amid dry '
 'conditions .<n>The aim is to reduce the risk of wildfires .')



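Experiment with a set of hyperparameters for this checkpoint, evaluating against the long reference. The following values were used, with the resulting ROUGE-L score:

num_beams value: 5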

no_repeat_ngram_size: 2

min_length value: 100

max_new_tokens value: 1000

ROUGE-L: 0.36

In [None]:
# Generate Summary
summary_ids = cnnmodel.generate(cnninputs["input_ids"],
                                num_beams=5,
                                no_repeat_ngram_size=2,
                                min_length=100,
                                max_new_tokens=1000
)
candidate = cnntokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
pprint(candidate[0], compact=True)

('Nearly 800 thousand customers are scheduled to be affected by the shutoffs '
 'which are expected to last through at least midday tomorrow .<n>PG&E stated '
 'it scheduled the blackouts in response to forecasts for high winds amid dry '
 'conditions, the aim is to reduce the risk of wildfires ,<n>The record '
 'breaking drought has made the current conditions even worse than in previous '
 'years. It exponentially increases the probability of large scale wildfires, '
 'according to the California Department of Forestry and Fire Prevention.The '
 'state is in the midst of one of the worst wildfire seasons on record.')


In [None]:
rouge = evaluate.load('rouge')
predictions = candidate
references = [LONG_REFERENCE]
results = rouge.compute(predictions=predictions,
                        references=references)
print(results)

{'rouge1': 0.4025974025974026, 'rouge2': 0.27631578947368424, 'rougeL': 0.3636363636363636, 'rougeLsum': 0.3636363636363636}
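The decoded summaries from this checkpoint contain literal `<n>` markers, which appear to stand for line breaks between sentences. A minimal sketch of a post-processing step follows, assuming that reading of `<n>` is correct. The helper name `pegasus_newlines` is ours.

```python
def pegasus_newlines(text):
    """Replace the literal "<n>" marker with a newline and drop empty lines."""
    lines = [line.strip() for line in text.replace("<n>", "\n").split("\n")]
    return "\n".join(line for line in lines if line)


raw = ("The aim is to reduce the risk of wildfires .<n>PG&E stated it scheduled "
       "the blackouts .")
print(pegasus_newlines(raw))
```

Applying this to `candidate[0]` before scoring may change the `rougeLsum` value, since that variant is documented as splitting summaries on newlines. The other ROUGE scores could shift slightly as well, depending on how the scorer tokenizes the marker. We have not rerun the evaluation with this step.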
