# Supervised Summarization with T5

## Introduction

In this notebook, we use Google's T5 model to summarize the dialogues in the SAMSum corpus. We start by installing and loading the necessary packages, models and metrics, then move on to the analysis.

In [None]:
!pip install transformers==4.4.2

!pip install datasets

!pip install py7zr

!pip install sentencepiece==0.1.94

!pip install rouge_score

!pip install Rouge


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import nltk

import torch

import transformers
from datasets import load_dataset, list_datasets, load_metric
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

from rouge import Rouge

import string

#import spacy
#from spacy.lang.en.stop_words import STOP_WORDS
#from spacy.lang.en import English

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Let's load our dataset directly from the Hugging Face datasets library and take a brief look at its structure.

In [None]:
samsum_dataset = load_dataset('samsum')

print(samsum_dataset['train'][0])
print('-'*50)
print(samsum_dataset)



Downloading and preparing dataset samsum/samsum (download: 2.81 MiB, generated: 10.04 MiB, post-processed: Unknown size, total: 12.85 MiB) to /root/.cache/huggingface/datasets/samsum/samsum/0.0.0/3f7dba43be72ab10ca66a2e0f8547b3590e96c2bd9f2cbb1f6bb1ec1f1488ba6...



Dataset samsum downloaded and prepared to /root/.cache/huggingface/datasets/samsum/samsum/0.0.0/3f7dba43be72ab10ca66a2e0f8547b3590e96c2bd9f2cbb1f6bb1ec1f1488ba6. Subsequent calls will reuse this data.
{'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)", 'id': '13818513', 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}
--------------------------------------------------
DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})


As mentioned in the report, the dataset comes pre-split into train, test and validation sets, so we don't need to spend any time splitting it ourselves.

The first dialogue-summary pair above shows that the dialogues can contain line-break characters such as '\r' and '\n'. The summary of the first training dialogue looks fine, but we may need to check more of them later.

Let's look at the dialogues in the validation and test sets as well, this time in a more compact format.


In [None]:
train = samsum_dataset['train']
val = samsum_dataset['validation']
test = samsum_dataset['test']

In [None]:
train[0].keys()

dict_keys(['dialogue', 'id', 'summary'])

In [None]:
pprint(train['dialogue'][:3], compact= True)
print('-'*50)
pprint(test['dialogue'][:3], compact= True)
print('-'*50)
pprint(val['dialogue'][:3], compact= True)

['Amanda: I baked  cookies. Do you want some?\r\n'
 'Jerry: Sure!\r\n'
 "Amanda: I'll bring you tomorrow :-)",
 'Olivia: Who are you voting for in this election? \r\n'
 'Oliver: Liberals as always.\r\n'
 'Olivia: Me too!!\r\n'
 'Oliver: Great',
 "Tim: Hi, what's up?\r\n"
 'Kim: Bad mood tbh, I was going to do lots of stuff but ended up '
 'procrastinating\r\n'
 'Tim: What did you plan on doing?\r\n'
 'Kim: Oh you know, uni stuff and unfucking my room\r\n'
 "Kim: Maybe tomorrow I'll move my ass and do everything\r\n"
 "Kim: We were going to defrost a fridge so instead of shopping I'll eat some "
 'defrosted veggies\r\n'
 'Tim: For doing stuff I recommend Pomodoro technique where u use breaks for '
 'doing chores\r\n'
 'Tim: It really helps\r\n'
 "Kim: thanks, maybe I'll do that\r\n"
 'Tim: I also like using post-its in kaban style']
--------------------------------------------------
["Hannah: Hey, do you have Betty's number?\n"
 'Amanda: Lemme check\n'
 'Hannah: <file_gif>\n'
 "Amanda: 

## Data Cleaning

The first dialogues we printed all contain '\r\n' or '\n' line breaks between turns. These characters carry no content, so it is best to replace them with spaces.
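As an illustration of the cleanup idea, here is a minimal sketch of a helper that collapses every run of whitespace, line breaks included, into a single space. The name `normalize_line_breaks` is hypothetical and is not used elsewhere in this notebook. Unlike the character-by-character replacement used in the cells below, it also removes doubled spaces.

```python
import re


def normalize_line_breaks(text: str) -> str:
    """Collapse every run of whitespace, line breaks included, into one space."""
    return re.sub(r"\s+", " ", text).strip()


# First training dialogue, as printed earlier in the notebook.
sample = (
    "Amanda: I baked  cookies. Do you want some?\r\n"
    "Jerry: Sure!\r\n"
    "Amanda: I'll bring you tomorrow :-)"
)
print(normalize_line_breaks(sample))
```

Because the pattern matches any whitespace run, the same helper handles dialogues that use '\r\n' and dialogues that use a bare '\n'.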

We will also create three corresponding pandas DataFrames to display our data in a neater form.


In [None]:
df_train = pd.DataFrame()
df_val = pd.DataFrame()
df_test = pd.DataFrame()

for elem in ['id', 'dialogue', 'summary']:
  df_train[elem] = train[elem]
  df_val[elem] = val[elem]
  df_test[elem] = test[elem]

In [None]:
# Replace carriage returns and newlines with spaces in every split
for df in (df_train, df_val, df_test):
  for elem in ['dialogue', 'summary']:
    df[elem] = df[elem].str.replace('\r', ' ', regex=False)
    df[elem] = df[elem].str.replace('\n', ' ', regex=False)

In [None]:
pd.set_option('display.max_colwidth', 500)

display(df_train.head())

Unnamed: 0,id,dialogue,summary
0,13818513,Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you tomorrow :-),Amanda baked cookies and will bring Jerry some tomorrow.
1,13728867,Olivia: Who are you voting for in this election? Oliver: Liberals as always. Olivia: Me too!! Oliver: Great,Olivia and Olivier are voting for liberals in this election.
2,13681000,"Tim: Hi, what's up? Kim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating Tim: What did you plan on doing? Kim: Oh you know, uni stuff and unfucking my room Kim: Maybe tomorrow I'll move my ass and do everything Kim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies Tim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores Tim: It really helps Kim: thanks, maybe I'll do that Tim: I also li...",Kim may try the pomodoro technique recommended by Tim to get more stuff done.
3,13730747,"Edward: Rachel, I think I'm in ove with Bella.. rachel: Dont say anything else.. Edward: What do you mean?? rachel: Open your fu**ing door.. I'm outside",Edward thinks he is in love with Bella. Rachel wants Edward to open his door. Rachel is outside.
4,13728094,"Sam: hey overheard rick say something Sam: i don't know what to do :-/ Naomi: what did he say?? Sam: he was talking on the phone with someone Sam: i don't know who Sam: and he was telling them that he wasn't very happy here Naomi: damn!!! Sam: he was saying he doesn't like being my roommate Naomi: wow, how do you feel about it? Sam: i thought i was a good rommate Sam: and that we have a nice place Naomi: that's true man!!! Naomi: i used to love living with you before i moved in ...","Sam is confused, because he overheard Rick complaining about him as a roommate. Naomi thinks Sam should talk to Rick. Sam is not sure what to do."


The text is not perfectly clean, but the unwanted characters appear to be gone. We can now move on to T5.

## Initial Trials with T5

Let's load the T5 tokenizer and model and check the output for a couple of tokenized sentences.

In [None]:
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
#device = torch.device('cpu')





In [None]:
print(type(tokenizer)) # Sanity Check
print()
print(type(model)) # Sanity Check

<class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>

<class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>


In [None]:
test_sent = "Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you tomorrow :-)"
test_sent_2 = "Olivia: Who are you voting for "
pprint(tokenizer([test_sent,test_sent_2], max_length=100, truncation=True), compact = True)

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                     1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1]],
 'input_ids': [[21542, 10, 27, 13635, 5081, 5, 531, 25, 241, 128, 58, 16637, 10,
                10625, 55, 21542, 10, 27, 31, 195, 830, 25, 5721, 3, 10, 18, 61,
                1],
               [25051, 10, 2645, 33, 25, 10601, 21, 1]]}


Here we define a function that transforms our dataset for the summarization task. It takes the text of both the dialogues and the summaries, cleans it, prepends the task prefix to each dialogue and tokenizes the result.

It returns a dictionary consisting of the dialogue token ids, their attention mask and the summary token ids (stored under the labels key).

In [None]:
max_input_length = 512
max_target_length = 512
prefix = "summarize: "

def clean_text(doc):
    # Turn '\r\n', lone '\r' and bare '\n' line breaks into a single space,
    # so that consecutive turns are never glued together.
    return doc.replace('\r\n', ' ').replace('\r', ' ').replace('\n', ' ')

def preprocess_function(examples):
    # Prepend the T5 task prefix to every cleaned dialogue
    inputs = [prefix + clean_text(doc) for doc in examples["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for the targets (the cleaned summaries)
    targets = [clean_text(doc) for doc in examples["summary"]]
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
preprocess_function(train[:2])

{'input_ids': [[21603, 10, 21542, 10, 27, 13635, 5081, 5, 531, 25, 241, 128, 58, 16637, 10, 10625, 55, 21542, 10, 27, 31, 195, 830, 25, 5721, 3, 10, 18, 61, 1], [21603, 10, 25051, 10, 2645, 33, 25, 10601, 21, 16, 48, 4356, 58, 15865, 10, 18587, 7, 38, 373, 5, 25051, 10, 1212, 396, 1603, 15865, 10, 1651, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[21542, 13635, 5081, 11, 56, 830, 16637, 128, 5721, 5, 1], [25051, 11, 20373, 5144, 33, 10601, 21, 10215, 7, 16, 48, 4356, 5, 1]]}

Let's map this function to the whole samsum dataset.

In [None]:
tokenized_datasets = samsum_dataset.map(preprocess_function, batched=True)





Let's also define the training arguments. We evaluate once per epoch and use a batch size of 16. The important arguments here are the number of training epochs, the batch size and the learning rate, with the first probably being the most important of all.
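To see how the batch size and the number of epochs translate into optimizer steps, here is a small arithmetic sketch. It assumes a single device and no gradient accumulation, and the helper name `optimizer_steps` is hypothetical. The results can be compared with the `global_step` values reported in the training outputs further down.

```python
import math


def optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps on one device, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


train_rows = 14732  # size of the SAMSum training split shown earlier
for epochs in (1, 10, 7):
    print(epochs, optimizer_steps(train_rows, 16, epochs))
# prints:
# 1 921
# 10 9210
# 7 6447
```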

In [None]:

batch_size = 16
args = Seq2SeqTrainingArguments(
    "test-summarization",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
)

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
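The tokenized examples have different lengths, so the collator pads every batch to the length of its longest member. Below is a minimal pure-Python sketch of that idea, not the library's implementation. It assumes T5's padding token id of 0 for the inputs and the value -100 for the labels, which the loss ignores. The helper name `pad_batch` is hypothetical. The real collator does more, for example returning tensors instead of lists.

```python
from typing import Dict, List

PAD_ID = 0           # assumed: T5's padding token id
LABEL_PAD_ID = -100  # assumed: label value that the loss ignores


def pad_batch(features: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Right-pad inputs, attention masks and labels to the longest item in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_ID] * n_in)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_in)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * n_lab)
    return batch


# Toy features with made-up token ids; 1 marks the end of each sequence.
toy = [
    {"input_ids": [5, 6, 7, 1], "attention_mask": [1, 1, 1, 1], "labels": [8, 1]},
    {"input_ids": [9, 1], "attention_mask": [1, 1], "labels": [3, 4, 1]},
]
print(pad_batch(toy))
```

Padding the labels with -100 rather than 0 is also why the metric function has to swap -100 back to the padding token id before decoding.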

Let's define another function that computes the ROUGE scores during training and returns the results.
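For intuition about what is being measured, here is a simplified sketch of a ROUGE-1 F1 score as unigram overlap between a prediction and a reference. It assumes lowercasing and whitespace tokenization, with no stemming. The helper name `rouge1_f1` is hypothetical. The packages used in this notebook differ in tokenization and counting details, so their numbers will not match this sketch exactly.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Amanda baked cookies and will bring Jerry some tomorrow."
print(rouge1_f1("Amanda baked cookies for Jerry", reference))
```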

In [None]:
metric = load_metric("rouge")

def compute_metrics(eval_pred):

    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Compute ROUGE with the datasets metric and keep the mid F-measure of each score
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}





We can finally start the training. We will train for a single epoch on the training set and evaluate on the validation set.

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len,Runtime,Samples Per Second
1,2.0136,1.866449,39.0967,17.224,33.0163,36.4219,16.4976,71.5946,11.425


TrainOutput(global_step=921, training_loss=2.030467477605863, metrics={'train_runtime': 345.7049, 'train_samples_per_second': 2.664, 'total_flos': 2417202840772608.0, 'epoch': 1.0, 'init_mem_cpu_alloc_delta': 4040153, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 63278290, 'init_mem_gpu_peaked_delta': 484207616, 'train_mem_cpu_alloc_delta': 314772, 'train_mem_gpu_alloc_delta': 484107776, 'train_mem_cpu_peaked_delta': 66885090, 'train_mem_gpu_peaked_delta': 5353555968})

### Second Run

Let's increase the number of training epochs to see whether the results change significantly. If they do, we will tune the other arguments (such as the learning rate) extensively; if they don't, we won't go down that path.

In [None]:
batch_size = 16
args = Seq2SeqTrainingArguments(
    "test-summarization",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
)

In [None]:
metric = load_metric("rouge")
def compute_metrics(eval_pred):

    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Keep only the F1 score of ROUGE-1, ROUGE-2 and ROUGE-L
    result = Rouge().get_scores(decoded_preds, decoded_labels, avg=True)
    final_result = {}
    for k, v in result.items():
        if k in ('rouge-1', 'rouge-2', 'rouge-l'):
            final_result[k] = round(v['f'], 4)

    # Uncomment the lines below if you'd like to compare predictions and true
    # values manually.

    #print(decoded_preds[:2])
    #print(decoded_labels[:2])

    return final_result

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge-1,Rouge-2,Rouge-l,Runtime,Samples Per Second
1,1.8477,1.802188,0.3764,0.1689,0.3754,70.7407,11.563
2,1.9494,1.7851,0.3812,0.1753,0.3805,70.809,11.552
3,1.9176,1.769705,0.386,0.1781,0.3835,70.7436,11.563
4,1.8993,1.761391,0.3827,0.1771,0.3797,70.6725,11.575
5,1.8762,1.753501,0.3827,0.1752,0.3792,70.8189,11.551
6,1.8651,1.750863,0.3832,0.1776,0.3806,71.15,11.497
7,1.8625,1.745924,0.3882,0.1811,0.3857,71.2204,11.485
8,1.8386,1.740326,0.3895,0.1799,0.3857,71.2446,11.482
9,1.8424,1.740314,0.3891,0.181,0.3862,71.2017,11.488
10,1.842,1.738942,0.3881,0.1803,0.3857,71.4683,11.446


TrainOutput(global_step=9210, training_loss=1.8777457702172824, metrics={'train_runtime': 3462.6253, 'train_samples_per_second': 2.66, 'total_flos': 2.403353891050291e+16, 'epoch': 10.0, 'init_mem_cpu_alloc_delta': 176883, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 909964, 'train_mem_gpu_alloc_delta': 484111872, 'train_mem_cpu_peaked_delta': 68787832, 'train_mem_gpu_peaked_delta': 5353551872})

The results on the validation set did not change much when we increased the number of training epochs from 1 to 10: in this run, the ROUGE-1 F1 score only moved from 0.3764 after the first epoch to 0.3881 after the tenth. As a consequence, we won't tune our arguments extensively. We still need to check our model's performance on the test set, though.

## Test Set Results

The validation results show that no single number of epochs maximizes the F1 scores of all three ROUGE metrics. We pick 7 epochs, since it gives the highest ROUGE-2 score (0.1811) while staying close to the best values of the other two. Note that the cells in this section reuse the same model object, so this run continues from the weights fine-tuned in the previous runs rather than starting again from the pretrained t5-small checkpoint.
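The choice of epoch can be checked with a short sketch that looks up the best epoch for each metric. The scores are copied from the validation table of the second run above. The helper name `best_epoch` is hypothetical.

```python
# Validation F1 scores per epoch, copied from the second run's output table.
validation_scores = {
    1: {"rouge-1": 0.3764, "rouge-2": 0.1689, "rouge-l": 0.3754},
    2: {"rouge-1": 0.3812, "rouge-2": 0.1753, "rouge-l": 0.3805},
    3: {"rouge-1": 0.3860, "rouge-2": 0.1781, "rouge-l": 0.3835},
    4: {"rouge-1": 0.3827, "rouge-2": 0.1771, "rouge-l": 0.3797},
    5: {"rouge-1": 0.3827, "rouge-2": 0.1752, "rouge-l": 0.3792},
    6: {"rouge-1": 0.3832, "rouge-2": 0.1776, "rouge-l": 0.3806},
    7: {"rouge-1": 0.3882, "rouge-2": 0.1811, "rouge-l": 0.3857},
    8: {"rouge-1": 0.3895, "rouge-2": 0.1799, "rouge-l": 0.3857},
    9: {"rouge-1": 0.3891, "rouge-2": 0.1810, "rouge-l": 0.3862},
    10: {"rouge-1": 0.3881, "rouge-2": 0.1803, "rouge-l": 0.3857},
}


def best_epoch(scores: dict, metric: str) -> int:
    """Return the earliest epoch with the highest value of the given metric."""
    return max(scores, key=lambda epoch: scores[epoch][metric])


for metric in ("rouge-1", "rouge-2", "rouge-l"):
    print(metric, best_epoch(validation_scores, metric))
# prints:
# rouge-1 8
# rouge-2 7
# rouge-l 9
```

Each metric peaks at a different epoch, which is why a single criterion (here ROUGE-2) has to be chosen.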

In [None]:
batch_size = 16
args = Seq2SeqTrainingArguments(
    "test-summarization",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=7,
    predict_with_generate=True,
    fp16=True,
)

In [None]:
def compute_metrics(eval_pred):

    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Keep only the F1 score of ROUGE-1, ROUGE-2 and ROUGE-L
    result = Rouge().get_scores(decoded_preds, decoded_labels, avg=True)
    final_result = {}
    for k, v in result.items():
        if k in ('rouge-1', 'rouge-2', 'rouge-l'):
            final_result[k] = round(v['f'], 4)

    # Uncomment the lines below if you'd like to compare predictions and true
    # values manually.

    #print(decoded_preds[:2])
    #print(decoded_labels[:2])

    return final_result

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()


["Betty's number is Lemme.\nHannah and Larry are at the park together.", "Rob and Eric are going to watch some of Eric's stand-ups on youtube"]
["Hannah needs Betty's number but Amanda doesn't have it.\nShe needs to contact Larry.", 'Eric and Rob are going to watch a stand-up on youtube.']

Epoch,Training Loss,Validation Loss,Rouge-1,Rouge-2,Rouge-l,Runtime,Samples Per Second
1,1.6759,1.73867,0.3887,0.1683,0.3789,74.6164,10.976
2,1.7914,1.733005,0.389,0.1703,0.3819,77.4896,10.569
3,1.7799,1.728448,0.3881,0.1724,0.3828,75.108,10.904
4,1.7736,1.72513,0.3886,0.1729,0.383,77.423,10.578
5,1.7639,1.724511,0.3898,0.1728,0.3846,77.0997,10.623
6,1.7633,1.724133,0.3885,0.1735,0.3832,76.4514,10.713
7,1.7667,1.721247,0.3911,0.1742,0.3849,76.0599,10.768


["Betty's number is Lemme.\nAmanda asks Larry to call her last", "Rob and Eric are going to watch some of Eric's stand-ups on youtube"]
["Hannah needs Betty's number but Amanda doesn't have it.\nShe needs to contact Larry.", 'Eric and Rob are going to watch a stand-up on youtube.']


TrainOutput(global_step=6447, training_loss=1.7646830307155945, metrics={'train_runtime': 2475.1679, 'train_samples_per_second': 2.605, 'total_flos': 1.6828091617714176e+16, 'epoch': 7.0, 'init_mem_cpu_alloc_delta': 221470, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 596842, 'init_mem_gpu_peaked_delta': 14651034112, 'train_mem_cpu_alloc_delta': 608660, 'train_mem_gpu_alloc_delta': 484167168, 'train_mem_cpu_peaked_delta': 68451250, 'train_mem_gpu_peaked_delta': 5349498880})

The output is a bit distorted (we ran it twice to get a cleaner one), but Colab managed to run the training to the end.

The ROUGE scores are not the best one can achieve, but they are not bad either when compared with the metrics reported in research papers on dialogue summarization. The task itself is quite challenging, but the results could be improved given more resources and time.