In [1]:
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import torch
import pandas as pd
import numpy as np

import re
import nltk
from nltk.tokenize import sent_tokenize

In [2]:
train_df = pd.read_json('corpus/train.json')
test_df = pd.read_json('corpus/test.json')
val_df = pd.read_json('corpus/val.json')
train_df.head()

,id,summary,dialogue
0,13818513,Amanda baked cookies and will bring Jerry some...,Amanda: I baked cookies. Do you want some?\r\...
1,13728867,Olivia and Olivier are voting for liberals in ...,Olivia: Who are you voting for in this electio...
2,13681000,Kim may try the pomodoro technique recommended...,"Tim: Hi, what's up?\r\nKim: Bad mood tbh, I wa..."
3,13730747,Edward thinks he is in love with Bella. Rachel...,"Edward: Rachel, I think I'm in ove with Bella...."
4,13728094,"Sam is confused, because he overheard Rick com...",Sam: hey overheard rick say something\r\nSam:...


In [3]:
train_df.shape, test_df.shape, val_df.shape

((14732, 3), (819, 3), (818, 3))

In [4]:
sample_text = train_df['dialogue'][:1000]
summaries = {}

Baseline Model

We use extractive summarization as our baseline. Each sentence of a dialogue is turned into a bag-of-words count vector, and we compute the cosine similarity between every pair of sentences. A sentence's score is the sum of its similarities to all sentences in the dialogue, so sentences that share vocabulary with many others rank highest. The top N sentences by score form the summary.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

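def baseline_summary_extractive(text, num_sentences=2):
    # Split the dialogue into sentences (needs the NLTK 'punkt' tokenizer data).
    sentences = sent_tokenize(text)
    # Bag-of-words count vector for each sentence.
    vectors = CountVectorizer().fit_transform(sentences).toarray()
    # Pairwise cosine similarity between all sentences.
    similarity_matrix = cosine_similarity(vectors)
    # Score each sentence by its total similarity to every sentence in the text.
    sentence_scores = similarity_matrix.sum(axis=1)
    # Keep the num_sentences highest-scoring sentences, in ascending score order.
    ranked_sentences = [sentences[i] for i in sentence_scores.argsort()[-num_sentences:]]
    return "\n".join(ranked_sentences)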

In [23]:
summaries['baseline'] = sample_text[0:5].apply(baseline_summary_extractive)
summaries['baseline']

0    Amanda: I baked  cookies.\nAmanda: I'll bring ...
1           Oliver: Great\nOliver: Liberals as always.
2    Kim: Oh you know, uni stuff and unfucking my r...
3    Edward: Rachel, I think I'm in ove with Bella....
4    Naomi: i used to love living with you before i...
Name: dialogue, dtype: object
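
To make the scoring step concrete, here is a minimal pure-Python sketch of the same idea on a toy input: bag-of-words vectors, pairwise cosine similarity, and a per-sentence score equal to the row sum of the similarity matrix. The helper names (`bow`, `cosine`, `rank_sentences`) and the whitespace tokenizer are illustrative assumptions; `CountVectorizer` tokenizes differently.

```python
import math
from collections import Counter

def bow(sentence):
    # Assumption: lowercase whitespace tokens stand in for CountVectorizer's tokenizer.
    return Counter(sentence.lower().split())

def cosine(a, b):
    # Dot product over shared tokens divided by the product of the vector norms.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_sentences(sentences, num_sentences=2):
    vectors = [bow(s) for s in sentences]
    # Score = sum of similarities to every sentence, including the sentence itself.
    scores = [sum(cosine(v, w) for w in vectors) for v in vectors]
    top = sorted(range(len(sentences)), key=lambda i: scores[i])[-num_sentences:]
    return [sentences[i] for i in top], scores

toy = ["i baked cookies", "do you want cookies", "see you tomorrow"]
selected, scores = rank_sentences(toy)
print(selected[-1])  # the highest-scoring sentence comes last
```

The middle sentence shares a word with each of the other two, so it receives the highest score. Short filler turns that share no words with the rest of the dialogue score lowest.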

Abstractive Summarization

For abstractive summarization we use pretrained Transformer encoder-decoder (sequence-to-sequence) models, in which the decoder attends to the encoded dialogue while generating the summary token by token.

HuggingFace Pipelines

BART

BART is a denoising autoencoder for pretraining sequence-to-sequence models. It is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.
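
As a toy illustration of the corrupt-then-reconstruct idea, the sketch below applies one noising scheme, text infilling, in which a span of tokens is replaced by a single mask token. The function name `text_infill`, the fixed span position, and the example sentence are our own assumptions for illustration; actual pretraining samples spans randomly and operates on subword tokens.

```python
def text_infill(tokens, start, length, mask_token="<mask>"):
    """Replace tokens[start:start + length] with a single mask token."""
    return tokens[:start] + [mask_token] + tokens[start + length:]

original = "Amanda baked cookies for Jerry".split()
corrupted = text_infill(original, start=1, length=2)

encoder_input = " ".join(corrupted)   # what the bidirectional encoder reads
decoder_target = " ".join(original)   # what the left-to-right decoder must reproduce
print(encoder_input)
print(decoder_target)
```

Because the whole span collapses into one mask token, the model also has to infer how many tokens are missing.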

In [7]:
pipe = pipeline('summarization', model='philschmid/bart-large-cnn-samsum', max_length=100, min_length=5)
pipe_out = pipe(sample_text[0:5].tolist())
summaries['bart'] = [out['summary_text'] for out in pipe_out]
summaries['bart']

Your max_length is set to 100, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 100, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
Your max_length is set to 100, but your input_length is only 50. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)


['Amanda baked cookies and will bring them to Jerry tomorrow.',
 'Olivia and Oliver are voting for Liberals in this election.',
 'Kim is in a bad mood. She was going to do a lot of stuff but she procrastinated. She will do uni stuff and clean her room tomorrow. Tim recommends Pomodoro technique where you use breaks for doing chores.',
 'Edward is in love with Bella. Rachel is outside waiting for him to open the door.',
 "Sam overheard Rick saying that he doesn't like being Sam's roommate. Naomi used to live with Sam before she moved in with her boyfriend."]

PEGASUS

PEGASUS is an encoder-decoder Transformer pretrained specifically for abstractive summarization. Its self-supervised objective is gap-sentence generation (GSG): whole sentences that appear important to the document are removed from the input and replaced with a mask token, and the model is trained to generate the removed sentences as a single output sequence from the remaining text. This resembles the denoising objective used in BART, but the target is a pseudo-summary made of the masked sentences rather than the full original document, so no human-written summaries are needed for pretraining. The checkpoint used here, google/pegasus-xsum, is fine-tuned on the XSum news summarization dataset.
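
PEGASUS's published pretraining objective is gap-sentence generation: a sentence judged important is masked out of the document and becomes the generation target. The sketch below builds one such training pair on a toy document. The function `gap_sentence_example`, the word-overlap importance score, and the toy sentences are simplifying assumptions; the original method selects sentences with a ROUGE-based score and masks several of them.

```python
def gap_sentence_example(sentences, mask_token="<mask_1>"):
    def overlap(i):
        # Fraction of this sentence's words that also occur in the other sentences.
        sent = set(sentences[i].lower().split())
        rest = set(w for j, s in enumerate(sentences) if j != i for w in s.lower().split())
        return len(sent & rest) / max(len(sent), 1)

    # Assumption: plain word overlap stands in for the ROUGE-based selection.
    gap = max(range(len(sentences)), key=overlap)
    source = " ".join(mask_token if i == gap else s for i, s in enumerate(sentences))
    return source, sentences[gap]

doc = [
    "Kim is in a bad mood.",
    "Tim suggests the pomodoro technique.",
    "Kim will try the technique tomorrow.",
]
source, target = gap_sentence_example(doc)
print(source)
print(target)
```

The last sentence overlaps most with the rest of the toy document, so it is masked in the source and used as the target.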

In [8]:
pipe = pipeline('summarization', model='google/pegasus-xsum', max_length=100, min_length=5)
pipe_out = pipe(sample_text[0:5].tolist())
summaries['pegasus'] = [out['summary_text'] for out in pipe_out]
summaries['pegasus']

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Your max_length is set to 100, but your input_length is only 25. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)
Your max_length is set to 100, but your input_length is only 26. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)
Your max_length is set to 100, but your input_length is only 48. Since this is a summarization task, where outputs shorter than the input are typically wanted, you m

['Jerry: Hello, Amanda.',
 "Olivia: Hi, I'm Olivia from Newsround and I'm here to answer your questions.",
 "Kim: Hi Tim, what's up?",
 "Rachel: I'm outside.",
 "In this week's episode of The Only Way Is Essex, Sam and Naomi are having a bit of a problem with each other."]

T5

T5 is an encoder-decoder Transformer that casts every NLP task as text-to-text: both the input and the output are plain strings. It is pretrained on the C4 web corpus with a span-corruption objective, in which contiguous spans of the input are replaced by sentinel tokens and the decoder predicts the dropped spans. Unlike BART, which reconstructs the full original text, T5 generates only the missing spans. Here we use a LongT5 checkpoint, a T5 variant with transient-global attention (the "tglobal" in the model name) that accepts inputs of up to 16,384 tokens, fine-tuned for book summarization.
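
T5's pretraining objective is span corruption: dropped spans are replaced by numbered sentinel tokens in the input, and the target lists each sentinel followed by the tokens it replaced. The sketch below builds one such pair. The function `span_corrupt` and the hand-picked span positions are assumptions for illustration; real pretraining samples spans randomly over subword tokens.

```python
def span_corrupt(tokens, spans):
    """Build a (source, target) pair.

    spans: sorted, non-overlapping (start, length) pairs of tokens to drop.
    """
    source, target, pos = [], [], 0
    for k, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        # Source keeps the text before the span, then a sentinel in its place.
        source += tokens[pos:start] + [sentinel]
        # Target repeats the sentinel followed by the dropped tokens.
        target += [sentinel] + tokens[start:start + length]
        pos = start + length
    source += tokens[pos:]
    target.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(source), " ".join(target)

tokens = "Thank you for inviting me to your party last week".split()
src, tgt = span_corrupt(tokens, [(2, 2), (8, 1)])
print(src)
print(tgt)
```

The target contains only the dropped spans, which keeps output sequences short compared with reconstructing the whole input.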

In [9]:
pipe = pipeline('summarization', model='pszemraj/long-t5-tglobal-base-16384-book-summary', max_length=100, min_length=5)
pipe_out = pipe(sample_text[0:5].tolist())
summaries['long-t5'] = [out['summary_text'] for out in pipe_out]
summaries['long-t5']

Your max_length is set to 100, but your input_length is only 28. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=14)
Your max_length is set to 100, but your input_length is only 27. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)
Your max_length is set to 100, but your input_length is only 55. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=27)


['Jerry bakes cookies for her. She bakes them the next day.',
 "Olivia asks Oliver who he's voting for in the election. Oliver says Liberals, as always.",
 'The next morning, Kim is in a bad mood and decides to do some housework instead of shopping. Tim suggests that she use a break from doing her chores to help her.',
 "Edward tells Rachel that he's in love with Bella and wants to marry her immediately. She doesn't want to talk about it to anyone else so she leaves the room.",
 "Sam overhears a conversation between the two of them. It turns out that Mr. Rigby is upset about being left behind in London and is not happy with his new job. He's also unhappy at the fact that he has to live with another man who doesn't approve of him."]

Evaluation Metrics

ROUGE and BLEU

BLEU measures precision: how many of the words (and/or n-grams) in the machine-generated summary also appear in the human reference summary.

ROUGE measures recall: how many of the words (and/or n-grams) in the human reference summary also appear in the machine-generated summary.

Precision can be seen as a measure of quality and recall as a measure of coverage. High precision means that most of what the model wrote is supported by the reference; high recall means that most of the reference content was reproduced, whether or not the model also added irrelevant text. The ROUGE scores reported below are F-measures, which combine both.
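
The difference between the two views shows up on a small example. The sketch below counts clipped unigram matches and divides by the candidate length for precision and by the reference length for recall. The helper `unigram_overlap` and the toy sentences are illustrative assumptions, not the implementation behind the metrics used later.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection clips each word's matches to its count in the other text.
    matches = sum((cand & ref).values())
    precision = matches / max(sum(cand.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    return precision, recall

p, r = unigram_overlap("amanda baked cookies for jerry today", "amanda baked cookies")
print(p, r)
```

The candidate covers every reference word, so recall is perfect, but half of its words are not in the reference, so precision is only one half.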

ROUGE

The ROUGE score was developed specifically for tasks such as summarization, where covering the reference content (recall) matters more than precision alone.

ROUGE-N

With ROUGE-N, the N represents the n-gram that we are using. For ROUGE-1 we would be measuring the match-rate of unigrams between our model output and reference. ROUGE-2 and ROUGE-3 would use bigrams and trigrams respectively.
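
A minimal sketch of ROUGE-N, assuming whitespace tokenization and a single reference: count the n-grams shared by candidate and reference, then derive precision, recall, and F-measure. The function names `ngrams` and `rouge_n` are our own; library implementations add their own tokenization and optional stemming, so their numbers can differ.

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all contiguous n-token windows.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": f}

r1 = rouge_n("the cat sat on the mat", "the cat is on the mat", n=1)
r2 = rouge_n("the cat sat on the mat", "the cat is on the mat", n=2)
print(r1["fmeasure"], r2["fmeasure"])
```

Changing a single word leaves five of six unigrams matching but only three of five bigrams, which is why ROUGE-2 is usually much lower than ROUGE-1.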

ROUGE-L

ROUGE-L measures the longest common subsequence (LCS) between our model output and the reference: the longest sequence of tokens that appears in both texts in the same order, though not necessarily contiguously.
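
The LCS length can be computed with a standard dynamic-programming table. The sketch below is a minimal version assuming whitespace tokenization and a single reference; `lcs_length` and `rouge_l` are hypothetical helper names, not a library API.

```python
def lcs_length(a, b):
    # dp[i][j] = LCS length of the first i tokens of a and the first j tokens of b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    precision = lcs / max(len(c), 1)
    recall = lcs / max(len(r), 1)
    f = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": f}

rl = rouge_l("the cat sat on the mat", "the cat is on the mat")
print(rl["fmeasure"])
```

Here the common subsequence is "the cat on the mat" (five tokens), even though "sat" and "is" interrupt it in the middle.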

In [10]:
from datasets import load_metric

rouge_metric = load_metric("rouge")

In [18]:
rouge_names = ["rouge1", "rouge2", "rougeL"]

reference = train_df['summary'][:1000]
sample_ref = reference[0:5]

records = []

for model_name in summaries:
    rouge_metric.add(prediction=summaries[model_name], reference=sample_ref)
    score = rouge_metric.compute()
    # Keep the F-measure of the mid (bootstrap-aggregated) estimate for each ROUGE variant.
    rouge_dict = {rn: score[rn].mid.fmeasure for rn in rouge_names}
    print('rouge_dict ', rouge_dict)
    records.append(rouge_dict)

pd.DataFrame.from_records(records, index=summaries.keys())



rouge_dict  {'rouge1': 0.0034776645768025078, 'rouge2': 0.0009306882194464854, 'rougeL': 0.0027429467084639503}
rouge_dict  {'rouge1': 0.558659217877095, 'rouge2': 0.2937853107344633, 'rougeL': 0.4134078212290503}
rouge_dict  {'rouge1': 0.3076923076923077, 'rouge2': 0.015625, 'rougeL': 0.1384615384615385}
rouge_dict  {'rouge1': 0.3529411764705882, 'rouge2': 0.06392694063926942, 'rougeL': 0.20814479638009048}


model,rouge1,rouge2,rougeL
baseline,0.003478,0.000931,0.002743
bart,0.558659,0.293785,0.413408
pegasus,0.307692,0.015625,0.138462
long-t5,0.352941,0.063927,0.208145


BLEU

In [17]:
import evaluate

bleu_metric = evaluate.load('bleu')

In [40]:
records = []

for model_name in summaries:
    results = bleu_metric.compute(predictions=summaries[model_name], references=sample_ref)
    records.append(results)

pd.DataFrame.from_records(records, index=summaries.keys())

model,bleu,precisions,brevity_penalty,length_ratio,translation_length,reference_length
baseline,0.0,"[0.12916666666666668, 0.01702127659574468, 0.0...",1.0,2.727273,240,88
bart,0.197958,"[0.44036697247706424, 0.2403846153846154, 0.15...",1.0,1.238636,109,88
pegasus,0.0,"[0.26666666666666666, 0.01818181818181818, 0.0...",0.627089,0.681818,60,88
long-t5,0.031352,"[0.23026315789473684, 0.04081632653061224, 0.0...",1.0,1.727273,152,88
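
The brevity_penalty column can be reproduced from the two length columns. As a sketch of the standard BLEU definition, the penalty is 1 when the generated text is longer than the reference and exp(1 - r/c) otherwise, where c is the generated length and r the reference length. `brevity_penalty` is our own helper name, not part of the `evaluate` API.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    # No penalty when the candidate is longer than the reference.
    if candidate_len > reference_len:
        return 1.0
    # An empty candidate gets the maximum penalty.
    if candidate_len == 0:
        return 0.0
    return math.exp(1 - reference_len / candidate_len)

bp_pegasus = brevity_penalty(60, 88)   # translation_length, reference_length from the table
bp_bart = brevity_penalty(109, 88)
print(round(bp_pegasus, 6), bp_bart)
```

PEGASUS generated 60 tokens against 88 reference tokens, which gives a penalty of about 0.627, matching the table. The other three models produced more tokens than the reference and are not penalized.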


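On these five samples, BART scores highest on every metric (ROUGE-1 0.56, ROUGE-2 0.29, ROUGE-L 0.41, BLEU 0.20). This is not surprising: its checkpoint name indicates fine-tuning on SAMSum dialogues, while the PEGASUS and Long-T5 checkpoints target news (XSum) and book summaries. We will use BART for our final model.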