## Lesson 11 Session Notebook

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # Hide the GPU before TensorFlow loads to avoid resource-exhaustion errors. OK for a quick demo

import tensorflow as tf
import transformers

from transformers import PegasusTokenizer, TFPegasusModel, TFPegasusForConditionalGeneration
from transformers import T5Tokenizer, TFT5Model, TFT5ForConditionalGeneration


In [2]:
!pip freeze | grep -i transformers

In [3]:
# From the Hugging Face example:

ARTICLE_TO_SUMMARIZE = ( "PG&E stated it scheduled the blackouts in response to forecasts for high winds \
            amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were \
            scheduled to be affected by the shutoffs which were expected to last through \
            at least midday tomorrow.")
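
One detail worth noticing: the backslash continuations sit inside the string literal, so the indentation of each continued line ends up in the text as a run of spaces. The summaries in this notebook look unaffected by it, but if clean input matters, a minimal sketch of normalising the whitespace with the standard library could look like this (`raw` and `clean` are illustrative names, and the text is shortened):

```python
# Same layout as ARTICLE_TO_SUMMARIZE: a backslash continuation inside the string,
# which keeps the indentation of the next line as part of the text.
raw = ("high winds \
            amid dry conditions.")

# split() with no arguments breaks on any run of whitespace;
# join() then puts exactly one space between the words.
clean = " ".join(raw.split())

print(repr(raw))
print(repr(clean))
```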

### Pegasus

See: https://huggingface.co/transformers/model_doc/pegasus.html#tfpegasusforconditionalgeneration

Let us first check out Pegasus. We start by creating the model:

In [4]:
pegasus_model = TFPegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum')
pegasus_tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')

pegasus_model.summary()

All model checkpoint layers were used when initializing TFPegasusForConditionalGeneration.

All the layers of TFPegasusForConditionalGeneration were initialized from the model checkpoint at google/pegasus-xsum.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFPegasusForConditionalGeneration for predictions without further training.


Model: "tf_pegasus_for_conditional_generation"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
model (TFPegasusMainLayer)   multiple                  569748480 
Total params: 569,844,583
Trainable params: 569,748,480
Non-trainable params: 96,103
_________________________________________________________________


570 million parameters. Not tiny for sure.

Next we create the input tokens:

In [5]:
pegasus_inputs = pegasus_tokenizer([ARTICLE_TO_SUMMARIZE], return_tensors='tf')

In [6]:
pegasus_inputs['input_ids'].shape

TensorShape([1, 55])

Now we check how Pegasus works and how well it performs. We also test the impact of a couple of generation parameters:

In [7]:
pegasus_summary_ids = pegasus_model.generate(pegasus_inputs['input_ids'])

print([pegasus_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) 
       for g in pegasus_summary_ids])

["California's largest utility has cut power to hundreds of thousands of customers in an effort to reduce the risk of wildfires."]


In [8]:
pegasus_summary_ids = pegasus_model.generate(pegasus_inputs['input_ids'], 
                                    min_length=20,
                                    max_length=40,
                                    early_stopping=True)

print([pegasus_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) 
       for g in pegasus_summary_ids])

["California's largest utility has cut power to hundreds of thousands of customers in an effort to reduce the risk of wildfires."]


In [9]:
pegasus_summary_ids = pegasus_model.generate(pegasus_inputs['input_ids'], 
                                    no_repeat_ngram_size=2,
                                    min_length=20,
                                    max_length=40,
                                    early_stopping=True)

print([pegasus_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) 
       for g in pegasus_summary_ids])

["California's largest utility has cut power to hundreds of thousands of customers in an effort to reduce the risk of wildfires."]


No change so far. What if we force it to go a bit longer?

In [10]:
pegasus_summary_ids = pegasus_model.generate(pegasus_inputs['input_ids'], 
                                    no_repeat_ngram_size=2,
                                    min_length=35,
                                    max_length=80,
                                    early_stopping=True)

print([pegasus_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) 
       for g in pegasus_summary_ids])

["California's largest electricity provider has turned off power to hundreds of thousands of customers in an effort to reduce the risk of wildfires, the company said in a statement on Monday."]


Still good! But...

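#### ... Question: *where are the tokens "California's", "company", "Monday" coming from?*

Interesting. Solid summary, obviously with some external knowledge applied. Note that none of these calls set `num_beams`, so they do not isolate the effect of beam search. `no_repeat_ngram_size` and the 20 to 40 length bounds left the output unchanged; only the cell with `min_length=35` and `max_length=80` produced a different summary.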
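
One rough way to make the question concrete is to list the summary words that never occur in the article. The sketch below is a plain word-level comparison (the model works on subword tokens, so this is only an approximation), and `novel_words` is a hypothetical helper, not part of `transformers`:

```python
import re

def novel_words(source, summary):
    """Return the lowercased summary words that never occur in the source, in order."""
    def words(text):
        # Keep letters, digits, '&' and apostrophes together so "PG&E" stays one word.
        return re.findall(r"[a-z0-9&']+", text.lower())

    source_vocab = set(words(source))
    result = []
    for word in words(summary):
        if word not in source_vocab and word not in result:
            result.append(word)
    return result

article = ("PG&E stated it scheduled the blackouts in response to forecasts for high winds "
           "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand "
           "customers were scheduled to be affected by the shutoffs which were expected to last "
           "through at least midday tomorrow.")

pegasus_summary = ("California's largest electricity provider has turned off power to hundreds "
                   "of thousands of customers in an effort to reduce the risk of wildfires, "
                   "the company said in a statement on Monday.")

print(novel_words(article, pegasus_summary))
```

A long list of novel words suggests an abstractive summary; a near-empty list suggests the model is mostly copying from the input.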



### T5

See: https://huggingface.co/transformers/model_doc/t5.html#tft5forconditionalgeneration

Now we do the same for T5:

In [11]:
t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-large')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-large')

t5_model.summary()


All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Model: "tf_t5for_conditional_generation"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
shared (TFSharedEmbeddings)  multiple                  32899072  
_________________________________________________________________
encoder (TFT5MainLayer)      multiple                  302040576 
_________________________________________________________________
decoder (TFT5MainLayer)      multiple                  402728448 
Total params: 737,668,096
Trainable params: 737,668,096
Non-trainable params: 0
_________________________________________________________________


About 738 million parameters, so a bit larger than Pegasus.

When we create the input, we need to tell T5 what we want it to do, so we prefix the input text with 'summarize: ':

In [12]:
t5_input_text = "summarize: " + ARTICLE_TO_SUMMARIZE

In [13]:
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
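
T5 was trained with text prefixes of this kind, so the same checkpoint can be pointed at a different task by changing the prefix. A minimal sketch of a helper that builds the input string follows; `with_task_prefix` is a made-up name for illustration, and which prefixes a given checkpoint actually understands depends on how it was trained:

```python
def with_task_prefix(task, text):
    """Build a T5-style input of the form '<task>: <text>'."""
    # Tolerate callers who already typed the colon or added stray whitespace.
    task = task.strip().rstrip(":")
    return f"{task}: {text.strip()}"

print(with_task_prefix("summarize", "PG&E stated it scheduled the blackouts."))
```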

And again, we will look at the generated output for the same text and also explore some generation parameters:

In [14]:
# Generate Summary
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'])

print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['PG&E said it scheduled the blackouts in response to forecasts for high winds']


Pretty short, and it stops mid-sentence, probably because generation hit the default `max_length`. Can we force it to do more?

In [15]:
# Generate Summary
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], min_length=20,  max_length=40)

print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])


['PG&E said it scheduled the blackouts in response to forecasts for high winds amid dry conditions . aim is to reduce the risk of wildfires .']
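
Conceptually, `min_length` can be enforced by making the end-of-sequence token impossible to pick until enough tokens have been produced. The following is a simplified sketch of that idea in plain Python, not the `transformers` implementation; `suppress_eos`, `eos_id` and the toy logits are assumptions for illustration:

```python
import math

def suppress_eos(logits, cur_len, min_length, eos_id):
    """Return a copy of `logits` in which EOS cannot win while cur_len < min_length."""
    masked = list(logits)
    if cur_len < min_length:
        # A score of -inf can never be the argmax and gets zero probability after softmax.
        masked[eos_id] = -math.inf
    return masked

toy_logits = [0.1, 2.0, 0.3]   # pretend index 1 is the EOS token and is currently the favourite
print(suppress_eos(toy_logits, cur_len=5, min_length=20, eos_id=1))
print(suppress_eos(toy_logits, cur_len=20, min_length=20, eos_id=1))
```

`max_length` works from the other side: generation simply stops once that many tokens exist, whether or not the sentence is finished.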


Ok, that's longer. Still pretty extractive though. Can we do better with beams?

In [16]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                    num_beams=3,
                                    no_repeat_ngram_size=1,
                                    min_length=20,
                                    max_length=40)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['PG&E said it scheduled the blackouts in response to forecast for high winds amid dry conditions and aim is reduce risk of wildfire.']
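
The dropped words ("aim is reduce risk") are what one would expect from `no_repeat_ngram_size=1`: with a size of 1, no token may appear twice, so a second "the" or "to" is ruled out. A minimal sketch of the banning rule is shown here on whole words for readability; the real generation code operates on subword token ids, and `banned_next_tokens` is a hypothetical helper:

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) + 1 < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the n-gram being built.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

so_far = "PG&E said it scheduled the blackouts in response to".split()
print(banned_next_tokens(so_far, 1))   # size 1: every word used so far is banned
print(banned_next_tokens(so_far, 2))   # size 2: only repeats of a seen word pair are banned
```

A size of 2 or 3 is usually a gentler choice, since it blocks repeated phrases without forbidding common function words.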


What if we force it to go even longer?

In [17]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                    num_beams=3,
                                    no_repeat_ngram_size=1,
                                    min_length=35,
                                    max_length=80)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['PG&E said it scheduled the blackouts in response to forecast for high winds amid dry conditions and aim is reduce risk of wildfire. nearly 800,000 customers were due be affected by shutoff which was expected last through at least midday tomorrow morning']


Hmm... almost back to the original now.
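
For reference, `num_beams=3` in the last two T5 cells switches generation from greedy decoding to beam search, which keeps several partial outputs alive instead of committing to the single best next token. The toy sketch below shows the difference on a hand-made probability table; `TABLE` and `beam_search` are invented for illustration and leave out details such as length penalties and end-of-sequence handling:

```python
import math

# Toy "model": probability of the next token given only the previous one.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "w": 0.1},
}

def beam_search(table, start, num_beams, steps):
    """Return the highest-scoring token sequence found with `num_beams` beams."""
    beams = [([start], 0.0)]  # (tokens, summed log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in table[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the best `num_beams` partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]

print(beam_search(TABLE, "<s>", num_beams=1, steps=2))  # greedy decoding
print(beam_search(TABLE, "<s>", num_beams=2, steps=2))  # beam search
```

With one beam the search commits to "a" because it looks best at the first step; with two beams it keeps "b" around and finds the overall more probable continuation.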

### Summary

Based on these quick tests it appears that Pegasus is doing a better job, but...

#### Question: *Why may this be an unfair comparison?*