# Pegasus Summarizer 

This notebook test-runs Google's Pegasus summarization model. It does not investigate the internals of the model; the goal is a functional baseline that returns an output. We take a block of text, run it through the model, and check whether the model returns an abstractive summary or an extractive one. We will explore more of the model's capabilities as we expand this repo.

In [1]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch

#### CUDA and Model name

Set the model name to the summarizer checkpoint you want to use. To try another Pegasus model, change the model name. This notebook explicitly uses `google/pegasus-xsum`. We also select CUDA if a GPU is available and fall back to the CPU otherwise.

In [2]:
model_name = 'google/pegasus-xsum'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'

#### Download the Model 

This section downloads the tokenizer and model weights from the public internet, which takes about 5-10 minutes depending on your connection. The model is then moved to the CUDA device if one is available; otherwise it runs on the CPU.

In [3]:
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

#### Text to Summarize

Ironically, we will be summarizing a summary of Christopher Nolan's movie **Dunkirk**. Let's create a variable to pass into the model. The source text needs to be a list of strings.

In [6]:
plot = ["In May 1940, Germany advanced into France, trapping Allied troops on the beaches of Dunkirk. Under air and ground cover from British and French forces, troops were slowly and methodically evacuated from the beach using every serviceable naval and civilian vessel that could be found. At the end of this heroic mission, 330,000 French, British, Belgian and Dutch soldiers were safely evacuated."]

In [14]:
len(plot[0])  # Length, in characters (not tokens), of the text we are passing into the model.

393
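The count above is in characters, while the model's input limit is measured in tokens. As a minimal sketch, the helpers below count tokens by calling the tokenizer on a single string and compare the result with the tokenizer's `model_max_length` attribute. The function names `count_tokens` and `fits_model` are hypothetical and chosen for illustration.

```python
def count_tokens(tokenizer, text):
    """Number of token IDs the tokenizer produces for a single string."""
    # Calling a Hugging Face tokenizer on one string returns a flat
    # list of IDs under the "input_ids" key.
    return len(tokenizer(text)["input_ids"])


def fits_model(tokenizer, text):
    """True if the text fits within the tokenizer's maximum input length."""
    return count_tokens(tokenizer, text) <= tokenizer.model_max_length
```

With the tokenizer loaded in this notebook, `count_tokens(tokenizer, plot[0])` would give the token count for the plot text.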

### Start Preparing the Model

Let's prepare the model input in this step. We call the tokenizer on the movie plot summary defined above to build a batch of tensors. Two parameters to note: truncation is set to **True**, and padding is set to **longest**. If you are looking to summarize very large text, check the length of the longest text against the model's maximum input length first, and truncate or pad as needed. Otherwise, you may be chopping off text that may be critical to the summary.

In [7]:
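# prepare_seq2seq_batch is deprecated in recent transformers releases,
# so call the tokenizer directly with the same arguments.
batch = tokenizer(
    plot,
    truncation=True,
    padding='longest',
    return_tensors='pt'
).to(torch_device)  # Move the input tensors to the selected device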
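For text that would exceed the model's input limit, one option (not used in this notebook) is to split the text into chunks and summarize each chunk separately instead of letting truncation drop the tail. The sketch below groups sentences greedily under a token budget. `chunk_sentences` is a hypothetical helper, and `count_tokens` is any callable that returns a token count for a string; summing per-sentence counts only approximates the count for the joined chunk.

```python
def chunk_sentences(sentences, count_tokens, max_tokens):
    """Greedily pack sentences into chunks of at most max_tokens tokens each.

    A single sentence longer than max_tokens still gets its own chunk,
    so it would be truncated by the tokenizer later.
    """
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Usage with a whitespace word count standing in for a real tokenizer.
example = chunk_sentences(
    ["a b c", "d e", "f g h i", "j"],
    count_tokens=lambda s: len(s.split()),
    max_tokens=5,
)
print(example)
```

Each resulting chunk can then be passed to the tokenizer as one element of the input list.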

#### Generate a Batch Result 

Let's generate a batch result, then decode the summary from the output.

In [8]:
translated = model.generate(**batch)

In [9]:
translated  ## Token IDs of the generated summary, one row per input text

tensor([[    0,   139,  7467,   113, 69119,   140,   156,   113,   109,  1458,
         10083, 10931,   113,   894,  1981,  2508,   107,     1]])

In [10]:
output = tokenizer.batch_decode(translated, skip_special_tokens=True)
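The tokenize, generate, and decode steps can be collected into one function. This is a sketch rather than part of the original notebook: `summarize` is a hypothetical name, and any extra keyword arguments are forwarded unchanged to `model.generate`, which accepts generation options such as `num_beams` and `max_length`.

```python
def summarize(texts, tokenizer, model, device="cpu", **generate_kwargs):
    """Summarize a list of strings with a seq2seq model and its tokenizer."""
    # Tokenize the batch and move the tensors to the target device.
    batch = tokenizer(
        texts,
        truncation=True,
        padding="longest",
        return_tensors="pt",
    ).to(device)
    # Generate summary token IDs; extra options go straight to generate().
    token_ids = model.generate(**batch, **generate_kwargs)
    # Convert token IDs back to text, dropping padding and end-of-sequence tokens.
    return tokenizer.batch_decode(token_ids, skip_special_tokens=True)
```

With the objects defined in this notebook, a call would look like `summarize(plot, tokenizer, model, torch_device, num_beams=4)`; the beam count here is only an illustrative value.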

In [12]:
output ## Summary Output

['The Battle of Dunkirk was one of the bloodiest battles of World War Two.']
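To check whether this summary is extractive or abstractive, one rough heuristic is to test whether the summary appears verbatim in the source and to measure how many of its word bigrams are absent from the source. The sketch below is an illustrative check, not a standard metric implementation; `ngrams` and `novel_ngram_fraction` are hypothetical helper names.

```python
import re


def ngrams(text, n=2):
    """Set of lowercase word n-grams in text, ignoring punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_fraction(source, summary, n=2):
    """Fraction of the summary's n-grams that do not occur in the source."""
    summary_ngrams = ngrams(summary, n)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams - ngrams(source, n)) / len(summary_ngrams)


source = (
    "In May 1940, Germany advanced into France, trapping Allied troops on the "
    "beaches of Dunkirk. Under air and ground cover from British and French "
    "forces, troops were slowly and methodically evacuated from the beach using "
    "every serviceable naval and civilian vessel that could be found. At the end "
    "of this heroic mission, 330,000 French, British, Belgian and Dutch soldiers "
    "were safely evacuated."
)
summary = "The Battle of Dunkirk was one of the bloodiest battles of World War Two."

print(summary in source)                      # verbatim copy of a source span?
print(novel_ngram_fraction(source, summary))  # share of bigrams not in the source
```

A fraction near 0 suggests the summary was copied from the source (extractive), while a fraction near 1 suggests newly generated wording (abstractive). The check says nothing about whether the new wording is supported by the source.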