# Building a text summarizer for business cases

### Work by Group C: 
Guillermo Chacon, 
Diego Cuartas,
Esteban Delgado,
Aayush Kerjiwal,
Mohamed Khanafer,
Mariana Naváez, and 
Valentina Premoli
___
___
# Table of Content

___
## 0. Importing libraries
___

## 1. Goal Description 
### 1.1 Datasets and Metrics used
     1.1.1 The ROUGE Metric
     1.1.2 The Texts used
___

## 2. Extractive type of Summarizers
___
### 2.1 Basic First model 
     2.1.1 Preprocess the text and Frequency table
     2.1.2 Tokenizing and scoring each sentence according to the word frequencies
     2.1.3 Putting it all together and getting the summary
     2.1.4 Computing Rouge
     2.1.5 Conclusion on the first approach
___
### 2.2 Adding the effect of TF-IDF and more cleaning
     2.2.1 Cleaning and Tokenizing the text
     2.2.2 Getting the Frequency Matrix for Words in Sentence
     2.2.3 The TF Matrix and the IDF matrix
     2.2.4 Assigning scores to the sentences
     2.2.5 Seeting the treshold and getting the summary
     2.2.6 Computing ROUGE
     2.2.7 Concluding the second approach
___
### 2.3 Adding in the Cosine similarity of the sentences
     2.3.1 Text Preprocessing
     2.3.2 Tokenization and TF-IDF
     2.3.3 Computing cosine-similarities values between each pair of sentences
     2.3.4 Getting the summary
     2.3.5 Computing ROUGE
     2.3.5 Conclusing remarks for the Cosine Similarity model
___
### 2.4 Looking at the TextRank algorithm
     2.4.1 Downloading Glove
     2.4.2 Preprocessing the text
     2.4.3 Vector Representation of Sentences
     2.4.5 Preparing the Similarity Matrix
     2.4.6 Applying the Page Rank algorithm
     2.4.7 Getting the summary
     2.4.8 Computing ROUGE
     2.4.9 Conclusion on the Model with TextRank
___

## 3. Abstractive types of Summarizers 
### 3.1 A simple implementation of Abstractive Summarization using T5
     3.1.1 Importing the pre-trained model
     3.1.2 Preprocessing the text for our model
     3.1.3 Running the T5 Summarizer Model
     3.1.4 Analyzing the output of the model and adding new texts
     3.1.5 Computing the ROUGE metrics for comparison with the Extractive Summarizers
         3.1.5.1 The ROUGE scores for the basic_summarizer
         3.1.5.2 The ROUGE scores for the text_rank_summarizer
         3.1.5.3 The ROUGE scores for the Abstractive T5
         3.1.5.4 Comparing the scores and outputs
___

## 4. Trying the output of a combined model
### 4.1 Concluding on the approach 
___

## 5. Explanation of the final model chosen
___

## 6. Further Future Work
___
___

## 0. Importing libraries

In [125]:
# Basic Libraries:
import pandas as pd
import os
import sys
import numpy as np
from numpy.core._rational_tests import denominator
import re
import math

# NLTK:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.cluster.util import cosine_distance
from nltk.util import ngrams
stop=set(stopwords.words('english'))

# Other libraries:
from operator import itemgetter
import heapq 
import networkx as nx
from collections import defaultdict
from collections import Counter
import random
import string
from tqdm import tqdm
from scipy import sparse

# sklearn: 
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Keras: 
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM,Dense, SpatialDropout1D, Dropout
from keras.initializers import Constant
from keras.optimizers import Adam
            
# For T5:
import torch
import json 
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

# From ROUGE:
from rouge import Rouge

# For printing the scores properly:
def pretty(d, indent=0):
   for key, value in d.items():
      print('\t' * indent + str(key))
      if isinstance(value, dict):
         pretty(value, indent+1)
      else:
         print('\t' * (indent+1) + str(value))

## 1. Goal Description 

In the following markdown, we explore different summarizing approaches that are available in NLP before chosing a final model for our business application. 

In NLP, there are many methods for text summarization. The most common way of classifying them is according to two basic approaches to the summary, based on their outputs type: extractive ones and abstractive ones. 
In the extractive summaries, the summary is simply the extraction of the most important sentences which are then joined to get the brief summary. On the other hand, the abstract summary algorithms generate new sentences with the most useful information from the original text (a more human-like approach). 

In the following notebook, we first take a look at the Extractive Approach to summarization before looking at the Abstractive Approach. We try to develop a simple yet accurate model using the most knowledge we have seen in our class. We base our reasonning on what have already be done in the field and try to improve on them.

The final goal is to be able to build a proper pipeline of clean code that we can use in a Web Application also developped for the matter.

We first start by building a basic pipeline for Extractive Summarizers before building on top of it.

### 1.1 Datasets and Metrics used

To be able to compare the evolution of the different models, more than our common sense and reading the text and comparing the outputs, it helps to have a metric benchmark to base our comparison on. 

That is why we will be using here the ROUGE metric for comparing the output of our model to actual summaries of texts we have.

#### 1.1.1 The ROUGE Metric
ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation is basically a set of metrics for evaluating the summarization of texts by various models and could as well be used to assess machine translations.

What it does is comparing the automatically produced summary to a set of reference summaries, which are usually made by humans. 

To get a quantitative value, we will compute the precision, recall, and F-measure using the overlap between those 2 texts. 

- **Recall** here in the context of ROUGE, tries to assess how much of the reference summary the model summary is capturing. But it is not necessarily telling us everything. The model's summary can be very long, capturing all the words in the reference summary. However, many of the words in the model summary could be useless, which would make our summary unnecessarily verbose.

- **Precision** is then relevant. What this measures is how much of the model summary is in fact needed or relevant. This measure is important when we are trying to generate summaries that are concise.

The library we use here to compute this metric outputs 3 different granularity of text measures:

- **ROUGE-1** which refers to overlap of unigrams between our model summary and the original summary; 
- **ROUGE-2** which refers to overlap of bigrams between our model summary and the original summary; 
- **ROUGE-L** which measures the longest matching sequence of words using LCS (which stands for Longest Common Subsequence). The advantage here is that LCS does not require consecutive matches but rather in-sequence matches that actually reflect some sentence level word order. We thus here do not need to specify the n-gram we are looking for.
(we used [freeCodeCamp](https://www.freecodecamp.org/news/what-is-rouge-and-how-it-works-for-evaluation-of-summaries-e059fb8ac840/)'s great explanations).

#### 1.1.2 The Texts used
As mentionned above, in order to assess the performance of our models, we should have some texts and a corresponding human made summary. We actually found a dataset containing over 400 news articles of the BBC from 2004 to 2005. And for each articles, summaries are provided. This dataset can be found on Kaggle [here](https://www.kaggle.com/pariza/bbc-news-summary). 

Because our use case focuses on summarizing texts related to business, we chose articles in the business area as well. 
Those are:

In [370]:
text = '''
Call centre users 'lose patience'

Customers trying to get through to call centres are getting impatient and quicker to hang up, a survey suggests.

Once past the welcome message, callers on average hang up after just 65 seconds of listening to canned music. The drop in patience comes as the number of calls to call centres is growing at a rate of 20% every year. "Customers are getting used to the idea of an 'always available' society," says Cara Diemont of IT firm Dimension Data, which commissioned the survey. However, call centres also saw a sharp increase of customers simply abandoning calls, she says, from just over 5% in 2003 to a record 13.3% during last year. When automated phone message systems are taken out of the equation, where customers have to pick their way through multiple options and messages, the number of abandoned calls is even higher - a sixth of all callers give up rather than wait. One possible reason for the lack in patience, Ms Diemont says, is the fact that more customers are calling 'on the move' using their mobile phones.

The surge in customers trying to get through to call centres is also a reflection of the centres' growing range of tasks. "Once a call centre may have looked after mortgages, now its agents may also be responsible for credit cards, insurance and current accounts," Ms Diemont says. Problems are occurring because increased responsibility is not going hand-in-hand with more training, the survey found.

In what Dimension Data calls an "alarming development", the average induction time for a call centre worker fell last year from 36 to just 21 days, leaving "agents not equipped to deal with customers". This, Ms Diemont warns, is "scary" and not good for the bottom line either. Poor training frustrates both call centre workers and customers. As a result, call centres have a high "churn rate", with nearly a quarter of workers throwing in the towel every year, which in turn forces companies to pay for training new staff. Resolution rates - the number of calls where a customer's query is resolved to mutual satisfaction - are running at just 50%. When the query is passed on to a second or third person - a specialist or manager - rates rise to about 70%, but that is still well below the industry target of an 85% resolution rate.

Suggestions that "outsourcing" - relocating call centres to low-cost countries like India or South Africa - is to blame are wrong, Ms Diemont says.

There are "no big differences in wait time and call resolution" between call centres based in Europe or North America and those in developing countries around the world. "You can make call centres perform anywhere if you have good management and the right processes in place," she says. However, companies that decide to "offshore" their operations are driven not just by cost considerations. Only 42% of them say that saving money is the main consideration when closing domestic call centre operations. Half of them argue that workers in other countries offer better skills for the money. But not everybody believes that outsourcing and offshoring are the solution. Nearly two-thirds of all firms polled for the survey have no plans to offshore their call centres. They give three key reasons for not making the move:

- call centre operations are part of their business "core function", 
 - they are worried about the risk of going abroad, 
 - they fear that they will damage their brand if they join the offshoring drive. The survey was conducted by Sunovate on behalf of Dimension Data, and is based on in-depth questionnaires of 166 call centres in 24 countries and five continents. What are your experiences with call centres? Are you happy to listen to Vivaldi or Greensleeves, or do you want an immediate response? And if you work in a call centre: did your training prepare you for your job?
'''

text2 = '''
Yukos loses US bankruptcy battle

A judge has dismissed an attempt by Russian oil giant Yukos to gain bankruptcy protection in the US.

Yukos filed for Chapter 11 protection in Houston in an unsuccessful attempt to halt the auction of its Yugansk division by the Russian authorities. The court ruling is a blow to efforts to get damages for the sale of Yugansk, which Yukos claims was illegally sold. Separately, former Yukos boss Mikhail Khodorkovsky began testimony on Friday in his trial for fraud and tax evasion.

Mr Khodorkovsky - who has been in jail for more than a year - pleaded not guilty to the charges brought against him and denied involvement in any criminal activities. "I pride myself on heading for 15 years a number of successful companies and helping other enterprises rise from their knees," he told a Russian court.

Yugansk was auctioned to help pay off $27.5bn (£14.5bn) in unpaid taxes. It was bought for $9.4bn by a previously-unknown group, which was in turn bought up almost immediately by state-controlled oil company Rosneft.

Texas Judge Letitia Clark said Yukos did not have enough of a US presence to establish US jurisdiction. "The vast majority of the business and financial activities of Yukos continue to occur in Russia," Judge Clark said in her ruling. "Such activities require the continued participation of the Russian government." Yukos had argued that a US court was entitled to declare it bankrupt before its Yugansk unit was sold, since it has local bank accounts and its chief finance officer Bruce Misamore lives in Houston. Yukos claimed it sought help in the US because other forums - Russian courts and the European Court of Human Rights - were either unfriendly or offered less protection. Russia had indicated it would in any case not abide by the rulings of the US courts.
'''

text3 = '''
BMW cash to fuel Mini production

Less than four years after the new Mini was launched, German car maker BMW has announced £100m of new investment.

Some 200 new jobs are to be created at the Oxford factory, including modernised machinery and a new body shell production building. The result of the investment could be to raise output to more than 200,000 cars from 2007. The rise, from 189,000 last year, is a response to rapidly-rising demand and could help wipe out waiting lists. Before Wednesday's announcement, BMW had invested some £280m in Mini production.

Since its launch during summer 2001, the new Mini has gone from strength to strength.

Last year, almost one in six cars sold by the BMW group was a Mini. The company admits that the success of the brand came despite scepticism from many in the industry. "Our decision to produce a new Mini was not received well right away," said Norbert Reithofer, a member of the BMW management board. Initially, BMW said it would produce 100,000 Mini models a year at its vast Cowley factory on the outskirts of Oxford, but the target was quickly reached, then raised, time and time again. Not everyone is convinced that the boom can continue. "The risk is that after they've invested massively in the brand, demand tapers off like it did with the new VW Beetle," said Brad Wernle, from Automotive News Europe.

The price of the car has also gone up. When it was launched, the cheapest Mini cost just more than £10,000. These days, buyers will have to fork out almost £11,500 to own a new Mini One, or even more for the Cooper S which costs up to £17,730. The Mini Convertible, which was launched last spring, costs up to £15,690 for the top model, and there is even a waiting list. Second-hand Minis are not cheap either. A Mini One bought when the model was launched should still fetch at least £8,000 for the cheapest model, while a used Cooper S is likely to be priced from £12,556, according to the-car buying website Parker's. The consumers' association Which operates with slightly different numbers, yet it confirms that the Mini Cooper 1.6 depreciates slower than any other car, other than the Mercedes Benz C180 SE and the BMW 1 Series 116i SE.

The Cowley factory, which initially seemed far too large a production plant for just 100,000 Minis, is increasingly being put to good use.

There are plans to tear down old buildings and build new ones and there are rumours that a new paint shop could be included in the plans. BMW's Mini adventure has made good much of what went wrong during its stewardship of the UK car maker Rover which it sold for £10 five years ago to the Phoenix consortium. In 1999, when BMW still owned Rover, the Oxford factory was producing the award-winning Rover 75. During that year, 3,500 people produced 56,000 cars. Last year, in the same factory, almost four times as many vehicles were produced by just 4,500 Mini-workers. The Mini factory's current output is equally impressive when compared with the main Rover factory in Longbridge, which in 1999 produced 180,000 Rover cars. Last year, MG Rover, which employs more than 6,000 people, produced just 110,000 cars, though it hopes to land a deal with Shanghai Automotive Industry Corporation (SAIC) that could help double the number of cars produced at Longbridge. Indeed, Mini is not only producing more cars than MG Rover does; it remains ahead even when the current sales of Land Rovers and Range Rovers (which are made by the former Rover unit that BMW sold to Ford) are taken into account.

'''

And their corresponding summaries are:

In [188]:
summ = '''The drop in patience comes as the number of calls to call centres is growing at a rate of 20% every year.Poor training frustrates both call centre workers and customers.In what Dimension Data calls an "alarming development", the average induction time for a call centre worker fell last year from 36 to just 21 days, leaving "agents not equipped to deal with customers".And if you work in a call centre: did your training prepare you for your job?There are "no big differences in wait time and call resolution" between call centres based in Europe or North America and those in developing countries around the world.Customers trying to get through to call centres are getting impatient and quicker to hang up, a survey suggests.Suggestions that "outsourcing" - relocating call centres to low-cost countries like India or South Africa - is to blame are wrong, Ms Diemont says.The surge in customers trying to get through to call centres is also a reflection of the centres' growing range of tasks.What are your experiences with call centres?However, call centres also saw a sharp increase of customers simply abandoning calls, she says, from just over 5% in 2003 to a record 13.3% during last year.The survey was conducted by Sunovate on behalf of Dimension Data, and is based on in-depth questionnaires of 166 call centres in 24 countries and five continents.As a result, call centres have a high "churn rate", with nearly a quarter of workers throwing in the towel every year, which in turn forces companies to pay for training new staff.'''
summ2 = ''' The court ruling is a blow to efforts to get damages for the sale of Yugansk, which Yukos claims was illegally sold.Yukos had argued that a US court was entitled to declare it bankrupt before its Yugansk unit was sold, since it has local bank accounts and its chief finance officer Bruce Misamore lives in Houston.A judge has dismissed an attempt by Russian oil giant Yukos to gain bankruptcy protection in the US.Texas Judge Letitia Clark said Yukos did not have enough of a US presence to establish US jurisdiction.Yukos claimed it sought help in the US because other forums - Russian courts and the European Court of Human Rights - were either unfriendly or offered less protection.Yukos said it would consider its options in light of the ruling."The vast majority of the business and financial activities of Yukos continue to occur in Russia," Judge Clark said in her ruling.Yukos filed for Chapter 11 protection in Houston in an unsuccessful attempt to halt the auction of its Yugansk division by the Russian authorities.The US court's jurisdiction had been challenged by Deutsche Bank and Gazpromneft, a former unit of Russian gas monopoly Gazprom which is due to merge with Rosneft.Russia had indicated it would in any case not abide by the rulings of the US courts.'''
summ3 = '''Less than four years after the new Mini was launched, German car maker BMW has announced £100m of new investment.Last year, almost one in six cars sold by the BMW group was a Mini.Initially, BMW said it would produce 100,000 Mini models a year at its vast Cowley factory on the outskirts of Oxford, but the target was quickly reached, then raised, time and time again.When it was launched, the cheapest Mini cost just more than £10,000.These days, buyers will have to fork out almost £11,500 to own a new Mini One, or even more for the Cooper S which costs up to £17,730.The Mini Convertible, which was launched last spring, costs up to £15,690 for the top model, and there is even a waiting list.Last year, MG Rover, which employs more than 6,000 people, produced just 110,000 cars, though it hopes to land a deal with Shanghai Automotive Industry Corporation (SAIC) that could help double the number of cars produced at Longbridge.The Mini factory's current output is equally impressive when compared with the main Rover factory in Longbridge, which in 1999 produced 180,000 Rover cars.BMW's Mini adventure has made good much of what went wrong during its stewardship of the UK car maker Rover which it sold for £10 five years ago to the Phoenix consortium."Our decision to produce a new Mini was not received well right away," said Norbert Reithofer, a member of the BMW management board.In 1999, when BMW still owned Rover, the Oxford factory was producing the award-winning Rover 75.Before Wednesday's announcement, BMW had invested some £280m in Mini production. '''

## 2. Extractive type of Summarizers
As mentionned above, the goal here is to remove those part of the text that do not change the overall meaning of the content studied.

Many libraries are available in Python to build a text summarizer. We will use the NLTK library which we have been using in our course to develop our first basic model. Other libraries such as Spacy also could do similar tasks.


Most of the basic summarizer model have been using a similar base pipeline: one that is choose the most relevant sentences to keep in the summary based on their Word Frequency Distribution (by scoring each sentence according to how many times each word is repeated in the text studied).

### 2.1 Basic First model 
For our first model, we thus will follow the following pipeline:

1. Preprocess the text (removing stopwords, and punctuations);
2. Build a Frequency table of words;
3. Assign a Score to each sentence according to the words in the frequency table;
4. Generate the summary by joining the sentences above a certain treshold that we set.

This will be the first model we try before improving on it in the following stages (we were inspired here by the great work of [Himanshu Sharma](https://www.presentslide.in/2019/08/text-summarization-python-spacy-library.html) and [Jesus Saves](https://jcharistech.wordpress.com/2018/12/31/text-summarization-using-spacy-and-python/)).

#### 2.1.1 Preprocess the text and Frequency table
We will first make sure to tokenize the text and remove stopwords before building our frequency table with the following lines:

In [189]:
stop_words = set(stopwords.words("english"))

frequency_of_words = {}  
for word in nltk.word_tokenize(text):  
    if word not in stop_words:
        if word not in frequency_of_words.keys():
             frequency_of_words[word] = 1
        else:
            frequency_of_words[word] += 1

list(frequency_of_words.items())[:10] # We can see what the table looks like:

[('Call', 1),
 ('centre', 7),
 ('users', 1),
 ("'lose", 1),
 ("patience'", 1),
 ('Customers', 2),
 ('trying', 2),
 ('get', 2),
 ('call', 18),
 ('centres', 12)]

After getting the frequency table, we can get the word with the highest frequency and use it to divide the other words in the text. This is to get the weighted frequencies and reduce bias in long sentences. We do this in this way:

In [190]:
max_frequency = max(frequency_of_words.values())

for word in frequency_of_words.keys():  
    frequency_of_words[word] = (frequency_of_words[word]/max_frequency)

list(frequency_of_words.items())[:10]

[('Call', 0.03333333333333333),
 ('centre', 0.23333333333333334),
 ('users', 0.03333333333333333),
 ("'lose", 0.03333333333333333),
 ("patience'", 0.03333333333333333),
 ('Customers', 0.06666666666666667),
 ('trying', 0.06666666666666667),
 ('get', 0.06666666666666667),
 ('call', 0.6),
 ('centres', 0.4)]

We can see now that those frequencies are now weighted.

#### 2.1.2 Tokenizing and scoring each sentence according to the word frequencies
Now having a value for each word, we can score each sentences in our text like this:

In [191]:
sentence_list = nltk.sent_tokenize(text) # We are tokenizing each sentence
score_of_sentence = {}  # We are creating a sentence score dictionnary
for sent in sentence_list:  # Looping through our dictionnary 
     for word in nltk.word_tokenize(sent.lower()): # We make each words lower case to be able to capture all words equally
        if word in frequency_of_words.keys():
            if len(sent.split(' ')) < 50:  # If the number of sentences is lower than let's say 50, we assign a score to each
                if sent not in score_of_sentence.keys():
                     score_of_sentence[sent] = frequency_of_words[word]
                else:
                    score_of_sentence[sent] += frequency_of_words[word]

In [192]:
list(score_of_sentence.items())[:3]

[("\nCall centre users 'lose patience'\n\nCustomers trying to get through to call centres are getting impatient and quicker to hang up, a survey suggests.",
  4.533333333333333),
 ('Once past the welcome message, callers on average hang up after just 65 seconds of listening to canned music.',
  2.366666666666667),
 ('The drop in patience comes as the number of calls to call centres is growing at a rate of 20% every year.',
  2.9000000000000004)]

We see that in this last step, each sentence has a been assigned a score.

#### 2.1.3 Putting it all together and getting the summary
Now that each sentence has been assigned, we can join them to form the summary of the text. To do so, we will use the heapq librrary which allows to get a list of the n largest elements:

In [193]:
summary_sentences = heapq.nlargest(5, score_of_sentence, key=score_of_sentence.get)
summary_sentences

['As a result, call centres have a high "churn rate", with nearly a quarter of workers throwing in the towel every year, which in turn forces companies to pay for training new staff.',
 'However, call centres also saw a sharp increase of customers simply abandoning calls, she says, from just over 5% in 2003 to a record 13.3% during last year.',
 'They give three key reasons for not making the move:\n\n- call centre operations are part of their business "core function", \n - they are worried about the risk of going abroad, \n - they fear that they will damage their brand if they join the offshoring drive.',
 'In what Dimension Data calls an "alarming development", the average induction time for a call centre worker fell last year from 36 to just 21 days, leaving "agents not equipped to deal with customers".',
 '"Once a call centre may have looked after mortgages, now its agents may also be responsible for credit cards, insurance and current accounts," Ms Diemont says.']

We can play here with the number of sentences we want our summarizer to return. To be able to reuse this model later on for comparison with others, we can compile all these steps into a function as such:

In [312]:
def basic_summarizer(text):
    stop_words = set(stopwords.words("english"))
    frequency_of_words = {}  
    for word in nltk.word_tokenize(text):  
        if word not in stop_words:
            if word not in frequency_of_words.keys():
                frequency_of_words[word] = 1
            else:
                frequency_of_words[word] += 1
    max_frequency = max(frequency_of_words.values())
    for word in frequency_of_words.keys():  
        frequency_of_words[word] = (frequency_of_words[word]/max_frequency)
    list_of_sentence = nltk.sent_tokenize(text)
    score_of_sentence = {}  
    for sent in list_of_sentence:  
        for word in nltk.word_tokenize(sent.lower()):
            if word in frequency_of_words.keys():
                if len(sent.split(' ')) < 30:
                    if sent not in score_of_sentence.keys():
                        score_of_sentence[sent] = frequency_of_words[word]
                    else:
                        score_of_sentence[sent] += frequency_of_words[word]
    summary_sentences = heapq.nlargest(7, score_of_sentence, key=score_of_sentence.get)
    summary_output = ' '.join(summary_sentences)  
    return summary_output

#### 2.1.4 Computing Rouge
We can now assess the outputs of our model to the provided summaries:

In [207]:
# we get the summaries for each text:
model1_text1 = basic_summarizer(text)
model1_text2 = basic_summarizer(text2)
model1_text3 = basic_summarizer(text3)

# We compute rouge for each text:
rouge = Rouge()

scores_m1_t1 = rouge.get_scores(model1_text1, summ)
scores_m1_t2 = rouge.get_scores(model1_text2, summ2)
scores_m1_t3 = rouge.get_scores(model1_text3, summ3)

In [208]:
# The score for the first text:
pretty(scores_m1_t1[0])

rouge-1
	f
		0.4755244708355204
	p
		0.6335403726708074
	r
		0.3805970149253731
rouge-2
	f
		0.3419203700212254
	p
		0.45625
	r
		0.27340823970037453
rouge-l
	f
		0.4837545077793272
	p
		0.5826086956521739
	r
		0.41358024691358025


In [209]:
# The score for the second text:
pretty(scores_m1_t2[0])

rouge-1
	f
		0.5788113646332687
	p
		0.6871165644171779
	r
		0.5
rouge-2
	f
		0.46233765746318095
	p
		0.5493827160493827
	r
		0.3991031390134529
rouge-l
	f
		0.5748987804487863
	p
		0.6173913043478261
	r
		0.5378787878787878


In [211]:
# The score for the third text:
pretty(scores_m1_t3[0])

rouge-1
	f
		0.5056947561513276
	p
		0.6809815950920245
	r
		0.40217391304347827
rouge-2
	f
		0.3524027413297447
	p
		0.47530864197530864
	r
		0.28
rouge-l
	f
		0.4726027349099268
	p
		0.5847457627118644
	r
		0.39655172413793105


#### 2.1.5 Conclusion on the first approach
Depending on the text we analyzed, the accuracy achieved is around 50%. We could say that being the output of an extractive summarizer, the result is already not that bad.

As we said, here the usage of frequency is based on the idea that if a word’s frequency in a text is high, we could then conclude that this word has a somewhat significant effect on the content of the text studied. And thus, the frequently repeated words will increase the score of sentences they are in. 

In the next step we go a step further and build a second model, this time considering TF-IDF.

### 2.2 Adding the effect of TF-IDF and more cleaning
From considering only the term frequencies, we now consider the TF-IDF approach, which is the multiplication of the term frequency and the inverse document frequency. 

1. The term frequency is how often a word appears in the text, divided by the total words;
2. The inverse document frequency is how unique or rare a word is in the text.

We also add to this model a cleaning step (we were here inspired by the great work of [Akash Panchal](https://towardsdatascience.com/text-summarization-using-tf-idf-e64a0644ace3)).

#### 2.2.1 Cleaning and Tokenizing the text
We first start by cleaning and tokenizing the input text by dealing with the contractions we might have in the text:

In [133]:
def cleaning(data):
    data = re.sub(r'https?:\/\/t.co\/[A-Za-z0-9]+', '', data) # we remove the URLs in the text
    data = re.sub('<.*?>+', '', data) # we remove the HTML tags
    data = re.sub("[\(\[].*?[\)\]]", "", data)
    return data

def replace_contractions(data):
    data = re.sub(r"he's", "he is", data)
    data = re.sub(r"there's", "there is", data)
    data = re.sub(r"We're", "We are", data)
    data = re.sub(r"That's", "That is", data)
    data = re.sub(r"won't", "will not", data)
    data = re.sub(r"they're", "they are", data)
    data = re.sub(r"Can't", "Cannot", data)
    data = re.sub(r"wasn't", "was not", data)
    data = re.sub(r"aren't", "are not", data)
    data = re.sub(r"isn't", "is not", data)
    data = re.sub(r"What's", "What is", data)
    data = re.sub(r"haven't", "have not", data)
    data = re.sub(r"hasn't", "has not", data)
    data = re.sub(r"There's", "There is", data)
    data = re.sub(r"He's", "He is", data)
    data = re.sub(r"It's", "It is", data)
    data = re.sub(r"You're", "You are", data)
    data = re.sub(r"I'M", "I am", data)
    data = re.sub(r"shouldn't", "should not", data)
    data = re.sub(r"wouldn't", "would not", data)
    data = re.sub(r"i'm", "I am", data)
    data = re.sub(r"I'm", "I am", data)
    data = re.sub(r"Isn't", "is not", data)
    data = re.sub(r"Here's", "Here is", data)
    data = re.sub(r"you've", "you have", data)
    data = re.sub(r"we're", "we are", data)
    data = re.sub(r"what's", "what is", data)
    data = re.sub(r"couldn't", "could not", data)
    data = re.sub(r"we've", "we have", data)
    data = re.sub(r"who's", "who is", data)
    data = re.sub(r"y'all", "you all", data)
    data = re.sub(r"would've", "would have", data)
    data = re.sub(r"it'll", "it will", data)
    data = re.sub(r"we'll", "we will", data)
    data = re.sub(r"We've", "We have", data)
    data = re.sub(r"he'll", "he will", data)
    data = re.sub(r"Y'all", "You all", data)
    data = re.sub(r"Weren't", "Were not", data)
    data = re.sub(r"Didn't", "Did not", data)
    data = re.sub(r"they'll", "they will", data)
    data = re.sub(r"they'd", "they would", data)
    data = re.sub(r"they've", "they have", data)
    data = re.sub(r"i'd", "I would", data)
    data = re.sub(r"should've", "should have", data)
    data = re.sub(r"where's", "where is", data)
    data = re.sub(r"we'd", "we would", data)
    data = re.sub(r"i'll", "I will", data)
    data = re.sub(r"weren't", "were not", data)
    data = re.sub(r"They're", "They are", data)
    data = re.sub(r"let's", "let us", data)
    data = re.sub(r"it's", "it is", data)
    data = re.sub(r"can't", "cannot", data)
    data = re.sub(r"don't", "do not", data)
    data = re.sub(r"you're", "you are", data)
    data = re.sub(r"i've", "I have", data)
    data = re.sub(r"that's", "that is", data)
    data = re.sub(r"doesn't", "does not", data)
    data = re.sub(r"i'd", "I would", data)
    data = re.sub(r"didn't", "did not", data)
    data = re.sub(r"ain't", "am not", data)
    data = re.sub(r"you'll", "you will", data)
    data = re.sub(r"I've", "I have", data)
    data = re.sub(r"Don't", "do not", data)
    data = re.sub(r"I'll", "I will", data)
    data = re.sub(r"I'd", "I would", data)
    data = re.sub(r"Let's", "Let us", data)
    data = re.sub(r"you'd", "You would", data)
    data = re.sub(r"It's", "It is", data)
    data = re.sub(r"Ain't", "am not", data)
    data = re.sub(r"Haven't", "Have not", data)
    data = re.sub(r"Could've", "Could have", data)
    data = re.sub(r"youve", "you have", data)  
    return data

In [212]:
text = cleaning(text)
text = replace_contractions(text)
sentences = sent_tokenize(text)
total_documents = len(sentences)

Now that we have the text pre cleaned and tokenized, we can focus on building the TF-IDF matrices.

#### 2.2.2 Getting the Frequency Matrix for Words in Sentence

Here, we create a function that first creates a 'frequency table' of words in each sentence. In doing this, we convert all the words to lower case so there is no duplicate for words starting with a capital letter, and also get the stem word to again focus on meaning instead of different word forms. This is stored in a dictionary format, with words and their corresponding frequencies.

Once we have a frequency table for each sentence, the 'frequency matrix' is created as another dictionary, with the sentences as the key and their corresponding frequency tables as the value.

In [213]:
def create_frequency_matrix(sentences):
    frequency_matrix = {}
    stopWords = set(stopwords.words("english"))
    ps = PorterStemmer()
    
    for sent in sentences:
        freq_table = {}
        words = word_tokenize(sent)
        for word in words:
            word = word.lower()
            word = ps.stem(word)
            if word in stopWords:
                continue

            if word in freq_table:
                freq_table[word] += 1
            else:
                freq_table[word] = 1

        frequency_matrix[sent[:15]] = freq_table

    return frequency_matrix
freq_matrix = create_frequency_matrix(sentences)

#### 2.2.3 The TF Matrix and the IDF matrix

In [214]:
def create_tf_matrix(freq_matrix):
    tf_matrix = {}

    for sent, f_table in freq_matrix.items():
        tf_table = {}

        count_words_in_sentence = len(f_table)
        for word, count in f_table.items():
            tf_table[word] = count / count_words_in_sentence

        tf_matrix[sent] = tf_table

    return tf_matrix

tf_matrix = create_tf_matrix(freq_matrix)

First, the TF matrix is created, which represents the frequency of a term in a sentence. Just like above, it is done in a 2-step manner, by first calculating and creating TF tables for each sentence, and then combining such tables for all sentences in a dictionary format. The term frequencies are calculated as the number of times a given term occurs in a sentence divided by the total number of words in the same sentence.

In [215]:
def create_documents_per_words(freq_matrix):
    word_per_doc_table = {}

    for sent, f_table in freq_matrix.items():
        for word, count in f_table.items():
            if word in word_per_doc_table:
                word_per_doc_table[word] += 1
            else:
                word_per_doc_table[word] = 1

    return word_per_doc_table
count_doc_per_words = create_documents_per_words(freq_matrix)

In the above function, we create a dictionary telling us the number of documents, or in our case sentences, does each word occur in. Here, we do not care about the frequency of occurence within a document (sentence).

In [216]:
def create_idf_matrix(freq_matrix, count_doc_per_words, total_documents):
    idf_matrix = {}

    for sent, f_table in freq_matrix.items():
        idf_table = {}

        for word in f_table.keys():
            idf_table[word] = math.log10(total_documents / float(count_doc_per_words[word]))

        idf_matrix[sent] = idf_table

    return idf_matrix
idf_matrix = create_idf_matrix(freq_matrix, count_doc_per_words, total_documents)

Now, we calculate the IDF matrix. The output of the above function, like the TF matrix, is a dictionary with each sentence being a key. And the corresponding value to each sentence is another dictionary, which contains the words and their corresponding IDF scores. 

In [217]:
def create_tf_idf_matrix(tf_matrix, idf_matrix):
    tf_idf_matrix = {}

    for (sent1, f_table1), (sent2, f_table2) in zip(tf_matrix.items(), idf_matrix.items()):

        tf_idf_table = {}

        for (word1, value1), (word2, value2) in zip(f_table1.items(),
                                                    f_table2.items()):  # here, keys are the same in both the table
            tf_idf_table[word1] = float(value1 * value2)

        tf_idf_matrix[sent1] = tf_idf_table

    return tf_idf_matrix
tf_idf_matrix = create_tf_idf_matrix(tf_matrix, idf_matrix)

Finally, based on the TF matrix and the IDF matrix, we have the TF-IDF matrix, which contains the TF-IDF scores for each word in all sentences. This is calculated by simply multiplying the TF and IDF scores for each word.

#### 2.2.4 Assigning scores to the sentences

Now that we have the TF-IDF scores for each word in the document, we compute the total TF-IDF score of all words per sentence. This way, we can get some sense of score of sentences and their importance. To be fair, the scores are divided by the lenght of the sentence, or else longer sentences would have a bias towards having higher scores. 

Once we have this, we can then decide upon a threshold to determine which sentences to include in our summary.

In [218]:
def score_sentences(tf_idf_matrix) -> dict:
    sentenceValue = {}

    for sent, f_table in tf_idf_matrix.items():
        total_score_per_sentence = 0

        count_words_in_sentence = len(f_table)
        for word, score in f_table.items():
            total_score_per_sentence += score

        sentenceValue[sent] = total_score_per_sentence / count_words_in_sentence

    return sentenceValue
sentence_scores = score_sentences(tf_idf_matrix)

#### 2.2.5 Seeting the treshold and getting the summary

Here we have calculated a variable called threshold which is the average sentence value from the above step. However, while generasting the summary, it is possible to adjust this by multiplying the threshold by a constant, to either make the summary longer or shorter.

In [223]:
def find_average_score(sentenceValue) -> int:
    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    average = (sumValues / len(sentenceValue))

    return average
threshold = find_average_score(sentence_scores)

def generate_summary(sentences, sentenceValue, threshold):
    sentence_count = 0
    summary = ''

    for sentence in sentences:
        if sentence[:15] in sentenceValue and sentenceValue[sentence[:15]] >= (threshold):
            summary += " " + sentence
            sentence_count += 1

    return summary
summary = generate_summary(sentences, sentence_scores, 1 * threshold)
summary

' Once past the welcome message, callers on average hang up after just 65 seconds of listening to canned music. Problems are occurring because increased responsibility is not going hand-in-hand with more training, the survey found. Poor training frustrates both call centre workers and customers. Half of them argue that workers in other countries offer better skills for the money. But not everybody believes that outsourcing and offshoring are the solution. Nearly two-thirds of all firms polled for the survey have no plans to offshore their call centres. What are your experiences with call centres? Are you happy to listen to Vivaldi or Greensleeves, or do you want an immediate response? And if you work in a call centre: did your training prepare you for your job?'

#### 2.2.6 Computing ROUGE
We can now, as in the first model compute the ROUGE metrics and compare them:

In [221]:
# we get the summaries for each text:
model2_text1 = summary
#model2_text2 = summary2
#model2_text3 = summary3

# We compute rouge for each text:
scores_m2_t1 = rouge.get_scores(model2_text1, summ)
#scores_m2_t2 = rouge.get_scores(model2_text2, summ2)
#scores_m2_t3 = rouge.get_scores(model2_text3, summ3)

# The score for the first text:
pretty(scores_m2_t1[0])

rouge-1
	f
		0.33502537636012264
	p
		0.5238095238095238
	r
		0.2462686567164179
rouge-2
	f
		0.14285713851324983
	p
		0.224
	r
		0.10486891385767791
rouge-l
	f
		0.2868217007535605
	p
		0.3854166666666667
	r
		0.22839506172839505


#### 2.2.7 Concluding the second approach
This seems to be equally accurate than just taking the frequency of the words as a discriminant factor for the sentences as there is not a major improvement in the accuracy.

Another approach to go even further, is to consider the cosine similarity between the sentences. To do so, we implement the following code:

### 2.3 Adding in the Cosine similarity of the sentences
The next step now would be to look at even a better way to choosing between the sentences that is why we look at one implementation with the cosine similarity.

To implement the text summarizer, we based our work on the cosine similarity that considers the tf-idf matrix. By representing the sentences of a text  in a vector space, the idea of tf-idf is to reduce the weightage of frequent occurring words by comparing its proportional frequency in the document collection. 

The cosine similarity between two vectors (or two documents on the vector space) is a measure that calculates the cosine of the angle between them. This metric is a measurement of orientation and not magnitude. The range of values we can get go from 0 to 1, an represent how similar are two normalized vectors.

The code for the implementation of the cosine similarities using the TF-IDF approach was difficult to figure out. We ran into errors while trying to implement it using the sklearn library. That is why here we use the code manullay built by Ashish Singhal, which he beautifully explains in his [article](https://medium.com/datapy-ai/nlp-building-text-summarizer-part-1-902fec337b81). 

The pipeline of this process is the following:

1. Preprocessing the input text
2. Tokenizing the sentences and the Words based on N-grams
3. Calculating the TF - IDF score of each word in the sentence
4. Computing cosine-similarities values between each pair of sentences using the TF-IDF and compute average for each sentence;
5. Taking the top K sentences based on cosine-similarity average
6. Sorting the indexes in ascending order and build the summary

We here provide the code to do so:

#### 2.3.1 Text Preprocessing
The following functions are define for the preprocessing of the inputs:

In [359]:
class TextProcessing:
    
    stopWords = []
    wordLemmatizer = []
    def __init__(self):
        self.stopWords = set(stopwords.words('english'))
        self.wordLemmatizer = WordNetLemmatizer()
    
    def preprocessText(self, text):
        processedText = self.remove_special_characters(str(text))
        processedText = re.sub(r'\d+', '', processedText)
        return processedText
    
    def remove_special_characters(self,text):
        regex = r'[^a-zA-Z0-9\s]'
        text = re.sub(regex,'',text)
        return text
    
    def tokenizing(self,text):
        tokenized_words_with_stopwords = word_tokenize(text)
        tokenized_words = [word for word in tokenized_words_with_stopwords if word not in self.stopWords]
        tokenized_words = [word for word in tokenized_words if len(word) > 1]
        tokenized_words = [word.lower() for word in tokenized_words]
        return tokenized_words
    
    def sentenceTokenization(self,text):
        return sent_tokenize(text)
    
    def removingStopWordsFromSentence(self,sentence):
        tokenize_words_with_stopwords = word_tokenize(sentence)
        tokenized_words = [word for word in tokenize_words_with_stopwords if word not in self.stopWords]
        tokenized_words = [word for word in tokenized_words if len(word) > 1]
        tokenized_words = [word.lower() for word in tokenized_words]
        sentence = ' '.join(tokenized_words)
        return sentence

#### 2.3.2 Tokenization and TF-IDF
Here we define further functions for tokenization and getting the TF_IDF matrices:

In [360]:
class Ngrams:
    def generate_ngrams(self,listOfWords, n):
        nGramsList = []
        for num in range(0, len(listOfWords)):
            nGramWord = ' '.join(listOfWords[num:num+n])
            nGramsList.append(nGramWord)
        return nGramsList
    
class TfIdf:
    cleaningData =None
    sizeOfEachRow = None
    nGrams = None
    nGramNumber = None
    
    def __init__(self, nGramNumber):
        self.cleaningData = TextProcessing()
        self.nGrams = Ngrams()
        self.nGramNumber = nGramNumber
            
    def calculateTfIdfMatrix(self,words,sentences):
        wordFreq = self.frequencyOfWords(words)
        
        #Clean the tokenized Sentence
        sentences = self.cleanSentence(sentences)
        #print(sentences)
        self.sizeOfEachRow = self.countSizeOfEachRow(sentences)
        tfIdfMatrix = np.empty([len(sentences),self.sizeOfEachRow])
        
        cnt = 0
        for sent in sentences:
            tfIdfMatrix[cnt] = self.sentenceTFIDFValues(sent,wordFreq,sentences)
            cnt = cnt + 1
        
        return tfIdfMatrix
    
    def sentenceTFIDFValues(self,sent,wordFreq,sentences):
        eachRowOfMatrix = np.zeros(self.sizeOfEachRow)
        posTaggedSentence = self.posTagging(sent)
        cnt = 0
        for word in posTaggedSentence:
            if word.lower() not in self.cleaningData.stopWords and word not in self.cleaningData.stopWords and len(word) > 1:
                word = word.lower()
                #word = nltk.WordNetLemmatizer.lemmatize(word)
                eachRowOfMatrix[cnt] = self.wordTFID(wordFreq, word, sent, sentences)
                cnt  = cnt + 1
        #print(eachRowOfMatrix)
        return eachRowOfMatrix
    
    def countSizeOfEachRow(self,sentences):
        maxSize = 0
        for eachSentence in sentences:
            size = len(re.findall(r'\w+', eachSentence))
            if(size > maxSize):
                maxSize = size
        return maxSize
    
    def wordTFID(self,wordFreqDict, eachWord, eachSentence, sentences):
        tf = self.tfScore(eachWord, eachSentence)
        idf = self.idfScore(eachWord,sentences)
        return tf*idf
    
    def tfScore(self, word, sentence):
        wordFreqInSentence = 0
        sentenceInNGrams = self.nGrams.generate_ngrams(sentence.split(), self.nGramNumber)
        for eachWord in sentenceInNGrams:
            if word == eachWord:
                wordFreqInSentence = wordFreqInSentence + 1
        tf = wordFreqInSentence / len(sentenceInNGrams)
        return tf
            
    def idfScore(self,word, sentences):
        noOfSentencesContainingWord = 0
        for eachSentence in sentences:
            eachSentenceInNGrams = self.nGrams.generate_ngrams(eachSentence.split(), self.nGramNumber)
            for eachWord in eachSentenceInNGrams:
                if word == eachWord:
                      noOfSentencesContainingWord = noOfSentencesContainingWord + 1
                      break
        idf = math.log10(len(sentences)/noOfSentencesContainingWord)
        return idf

    def posTagging(self,text):
        posTag = nltk.pos_tag(self.nGrams.generate_ngrams(text.split(), self.nGramNumber))
        posTaggedNounVerb = []
        for word,tag in posTag:
            if tag == "NN" or tag == "NNP" or tag == "NNS" or tag == "VB" or tag == "VBD" or tag == "VBG" or tag == "VBN" or tag == "VB"or tag == "VBZ":
                posTaggedNounVerb.append(word)
        return posTaggedNounVerb
    def frequencyOfWords(self,words):
        words = [word.lower() for word in words]
        dict_freq = {}
        words_unique = []
        for word in words:
            if word not in words_unique:
                words_unique.append(word)
        for word in words_unique:
            dict_freq[word] = words.count(word)
        return dict_freq
    def cleanSentence(self,sentences):
        returnSentences = []
        for eachSentence in sentences:
            eachSentence = self.cleaningData.preprocessText(eachSentence)
            eachSentence = self.cleaningData.removingStopWordsFromSentence(eachSentence)
            returnSentences.append(eachSentence)
        return returnSentences

#### 2.3.3 Computing cosine-similarities values between each pair of sentences

In [361]:
class CosineSimilarity:
    def calculateCosineSimilarity(self,tfIdMatrix):
        print("Nos of rows",np.shape(tfIdMatrix)[0])
        dim = (np.shape(tfIdMatrix)[0], np.shape(tfIdMatrix)[0])
        resultMatrix = np.zeros(dim)
        rowCount = 0
        columnCount = 0
        for eachRow in tfIdMatrix:
            A = eachRow
            columnCount = 0
            for B in tfIdMatrix:
                abDotProduct = np.dot(A,B)
                denominator = np.sqrt(np.dot(A,A)) * np.sqrt(np.dot(B,B))
                cosTheta = abDotProduct/denominator
                resultMatrix[rowCount][columnCount] = cosTheta
                columnCount = columnCount + 1
            rowCount = rowCount + 1
        #print(resultMatrix)
        sentenceImportance = {}
        cnt = 0
        for eachRow in resultMatrix:
            sentenceImportance[cnt] = (np.sum(eachRow)/len(eachRow))
            cnt = cnt + 1
        print(sentenceImportance)
        return sentenceImportance

#### 2.3.4 Getting the summary
Putting it all together, we get:

In [367]:
file = "/Users/mohamedkhanafer/Desktop/NLP Summarizer/020.txt"
file = open(file, 'r')
text = file.read()

nGramNumber = 3
textProcessing = TextProcessing()
tokenizedSentence = textProcessing.sentenceTokenization(text)

noOfSentences = 0.40*len(tokenizedSentence)
text = textProcessing.preprocessText(text)
tokenizedWords = textProcessing.tokenizing(text)

nGrams = Ngrams()
tokenizedWordsWithNGrams = nGrams.generate_ngrams(tokenizedWords, nGramNumber)

tfId = TfIdf(nGramNumber)
tfIdMatrix = tfId.calculateTfIdfMatrix(tokenizedWordsWithNGrams, tokenizedSentence)

#print(tfIdMatrix)
outF = open('output.txt',"w")
outF.write(tfIdMatrix.__str__())

cosine = CosineSimilarity()
sentenceImportanceValues = cosine.calculateCosineSimilarity(tfIdMatrix)

sentenceImportanceValues = sorted(sentenceImportanceValues.items(), key = itemgetter(1), reverse=True)
cnt = 0
sentenceNo = []

for sentence_prob in sentenceImportanceValues:
    if cnt <= noOfSentences:
        sentenceNo.append(sentence_prob[0])
        cnt = cnt + 1
    else:
        break

sentenceNo.sort()

summary = []

for value in sentenceNo:
    summary.append(tokenizedSentence[value])

model3_text1 = " ".join(summary)

Nos of rows 29
{0: 0.803524809645824, 1: 0.8064916060205923, 2: 0.8515745819233851, 3: 0.8366853215999798, 4: 0.8546733257918234, 5: 0.821334572905506, 6: 0.827300234195767, 7: 0.8266287975931284, 8: 0.848329910125715, 9: 0.8457048556998461, 10: 0.7820208659368677, 11: 0.8064916060205923, 12: 0.7593703930814184, 13: 0.8213345729055062, 14: 0.836277411127503, 15: 0.8213345729055062, 16: 0.8631625764987642, 17: 0.8366853215999798, 18: 0.843353610011902, 19: 0.8064916060205923, 20: 0.8515221456574806, 21: 0.8064916060205923, 22: 0.8064916060205923, 23: 0.8090949706066456, 24: 0.7284703316847729, 25: 0.8454505339320116, 26: 0.6934715030658963, 27: 0.7585736469165062, 28: 0.6971928498777684}


#### 2.3.5 Computing ROUGE

In [369]:
# We compute rouge for the text:
scores_m3_t1 = rouge.get_scores(model3_text1, summ)

# The score for the first text:
pretty(scores_m2_t1[0])

rouge-1
	f
		0.33502537636012264
	p
		0.5238095238095238
	r
		0.2462686567164179
rouge-2
	f
		0.14285713851324983
	p
		0.224
	r
		0.10486891385767791
rouge-l
	f
		0.2868217007535605
	p
		0.3854166666666667
	r
		0.22839506172839505


#### 2.3.5 Conclusing remarks for the Cosine Similarity model

As before, the ROUGE metric seems to indicate a below average accuracy, similar to the first 2 models. 

What we could say is that even if this scoring method is quite accurate, there are some issues. A common one is that words that are semantically similar, are not being leveraged separately.

This is why considering a semantic embeddings approach looks to overcome this shortcoming. We thus next turn on to trying to implement the TextRank algorithm which takes into consideration the word embeddings as well as the cosine similarities between the sentences.

### 2.4 Looking at the TextRank algorithm
The TextRank algorithm is based on the the famous PageRank algorithm in which a matrix calculates the probability that a given user will move from one page to the next. 

In our case here with TextRank, we calculate a cosine similarity matrix where we get the similarity of each sentence compared to the others. Then, a graph is generated from this cosine similarity matrix we calculated. The PageRank ranking algorithm is then applied to the graph to get the scores for each sentence in the text. 

To be able to use this approach, we will have to generate sentence embeddings. We will do so using GloVe. This generates word embeddings in the form of vector representation of words. 

The difference with the TF-IDF approach is that those methods ignore the order of the words which is not the case here.

Using the sentence embeddings we get, we will thencreate a cosine similarity matrix that is used to build the graph. The TextRank algorithm will then be applied to the graph to evaluate the importance of each sentence. We finally decide on the number of sentences to generate the summary.

#### 2.4.1 Downloading Glove
We first download the word embeddings that we will be using:

In [None]:
# !wget http://nlp.stanford.edu/data/glove.6B.zip

In [142]:
# Extract word vectors
word_embeddings = {}
f = open('glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
f.close()

Glove has word vectors for 400,000 different terms. These are stored in the dictionary ‘word_embeddings’.

#### 2.4.2 Preprocessing the text
Like in previous approaches, we start by pre-processing the text, removing special characters, stop words and making everything lowercase.

In [252]:
# split the the text in the articles into sentences
sentences = nltk.sent_tokenize(text)

In [253]:
# remove punctuations, numbers and special characters
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ")

# make alphabets lowercase
clean_sentences = [s.lower() for s in clean_sentences]

# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

#### 2.4.3 Vector Representation of Sentences
We can now create vectors for sentences in our data with the help of the GloVe word vectors.

We can create vectors for the sentences we have. We will first take vectors (of 100 elements) for the constituent words in the different sentences. Then, we'll take the average of those vectors to arrive at a consolidated vector for the sentence.

In [254]:
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)

#### 2.4.5 Preparing the Similarity Matrix
Next, we find similarities between the sentences in our text, using the cosine similarity approach. We create an empty similarity matrix that we populate with the different cosine similarities of the sentences.

In [255]:
sentence_vectors[0]

array([-0.00666804,  0.15979251,  0.25394633, -0.11617418, -0.24096921,
        0.06779418, -0.4080162 ,  0.1292868 ,  0.1334635 , -0.24081503,
        0.05120679,  0.02733329,  0.10570864, -0.24322556,  0.15362376,
       -0.25209716, -0.08903558,  0.083077  , -0.3836247 ,  0.34754878,
        0.14932856,  0.14116205, -0.07525749, -0.33506837,  0.05384732,
       -0.13523349,  0.03456487, -0.24192825,  0.1694555 , -0.04628104,
       -0.06076189,  0.4836448 ,  0.02332111, -0.04432758,  0.33846784,
        0.11788228, -0.13479716,  0.03488627,  0.10562545, -0.1381767 ,
       -0.27287388, -0.25508165, -0.08647952, -0.47215042, -0.30106494,
       -0.08949995,  0.07946592, -0.09022649,  0.26494446, -0.6004221 ,
        0.06406631, -0.0465796 , -0.14319856,  0.7679332 , -0.00201588,
       -1.5300213 ,  0.01108424, -0.11983108,  1.3407743 ,  0.09720675,
       -0.06271952,  0.24565583, -0.17076507, -0.03160147,  0.38494658,
        0.23424493,  0.14032683,  0.1130302 ,  0.39326158, -0.12

In [256]:
# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]

#### 2.4.6 Applying the Page Rank algorithm
Here, we convert the similarity matrix into a graph. The nodes of the graph represent the sentences and the edges  represent the similarity scores between the sentences. 
This is where we will use the PageRank algorithm to get to the sentence rankings.

In [257]:
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

#### 2.4.7 Getting the summary
We can now get the top N sentences summary based on the rankings.

In [258]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
#saving the output for the first text
model4_text1 =ranked_sentences[0][1] + ranked_sentences[1][1] + ranked_sentences[2][1] + ranked_sentences[3][1] + ranked_sentences[4][1] + ranked_sentences[5][1] + ranked_sentences[6][1]

# Extract top n sentences as the summary
for i in range(7):
    print(ranked_sentences[i][1])

They give three key reasons for not making the move:

- call centre operations are part of their business "core function", 
 - they are worried about the risk of going abroad, 
 - they fear that they will damage their brand if they join the offshoring drive.
The surge in customers trying to get through to call centres is also a reflection of the centres' growing range of tasks.
However, call centres also saw a sharp increase of customers simply abandoning calls, she says, from just over 5% in 2003 to a record 13.3% during last year.
In what Dimension Data calls an "alarming development", the average induction time for a call centre worker fell last year from 36 to just 21 days, leaving "agents not equipped to deal with customers".
As a result, call centres have a high "churn rate", with nearly a quarter of workers throwing in the towel every year, which in turn forces companies to pay for training new staff.
One possible reason for the lack in patience, Ms Diemont says, is the fact tha

As the code suggest, we are actually getting the sentences with the highest scores. We could take a look at the scores of the sentences below:

In [259]:
scores

{0: 0.03570772533591336,
 1: 0.03123396699017595,
 2: 0.03583310616428398,
 3: 0.03549663962568466,
 4: 0.03640903360326148,
 5: 0.035951171105981124,
 6: 0.035965782284599374,
 7: 0.03645660203141455,
 8: 0.03517052569714105,
 9: 0.03527483026970801,
 10: 0.03636229457702479,
 11: 0.03241994793795661,
 12: 0.03416691718539042,
 13: 0.03598004026570948,
 14: 0.03385965948706644,
 15: 0.035248358394444124,
 16: 0.03539477076716891,
 17: 0.03567909798673109,
 18: 0.03589941272978062,
 19: 0.03370493115669051,
 20: 0.035796803652664105,
 21: 0.03554486440720591,
 22: 0.027686905026556363,
 23: 0.033415916744826885,
 24: 0.03665215935123124,
 25: 0.03318339698363179,
 26: 0.03164998417525024,
 27: 0.030008550485803857,
 28: 0.03384660557670305}

We can now create a function that would run the model to reuse it easily later on:

In [293]:
def text_rank_summarizer(text):
    sentences = nltk.sent_tokenize(text) 
    clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ") 
    clean_sentences = [s.lower() for s in clean_sentences] 
    
    def remove_stopwords(sen): 
        sen_new = " ".join([i for i in sen if i not in stop_words])
        return sen_new
    
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
    sentence_vectors = []
    for i in clean_sentences:
        if len(i) != 0:
            v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
        else:
            v = np.zeros((100,))
        sentence_vectors.append(v)
    
    sim_mat = np.zeros([len(sentences), len(sentences)]) 
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
    
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)
    
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
    
    summary =ranked_sentences[0][1] + ranked_sentences[1][1] + ranked_sentences[2][1] + ranked_sentences[3][1] + ranked_sentences[4][1] + ranked_sentences[5][1] + ranked_sentences[6][1]
    return summary

#### 2.4.8 Computing ROUGE
We can now compute the ROUGE metrics for the texts generated by our TextRank algorithm:

In [251]:
model4_text2 = text_rank_summarizer(text2)

In [260]:
model4_text3 = text_rank_summarizer(text3)

In [261]:
scores_m4_t1 = rouge.get_scores(model4_text1, summ)
scores_m4_t2 = rouge.get_scores(model4_text2, summ2)
scores_m4_t3 = rouge.get_scores(model4_text3, summ3)
# The score for the first text:
pretty(scores_m4_t1[0])

rouge-1
	f
		0.58799999502592
	p
		0.6336206896551724
	r
		0.5485074626865671
rouge-2
	f
		0.46987951309841774
	p
		0.5064935064935064
	r
		0.43820224719101125
rouge-l
	f
		0.57053291036173
	p
		0.5796178343949044
	r
		0.5617283950617284


In [263]:
pretty(scores_m4_t2[0])

rouge-1
	f
		0.7753086370316721
	p
		0.8674033149171271
	r
		0.7008928571428571
rouge-2
	f
		0.729528531037073
	p
		0.8166666666666667
	r
		0.6591928251121076
rouge-l
	f
		0.7888446165273567
	p
		0.8319327731092437
	r
		0.75


In [264]:
pretty(scores_m4_t3[0])

rouge-1
	f
		0.7458333284458334
	p
		0.8774509803921569
	r
		0.6485507246376812
rouge-2
	f
		0.6694560620590501
	p
		0.7881773399014779
	r
		0.5818181818181818
rouge-l
	f
		0.761006284352676
	p
		0.8402777777777778
	r
		0.6954022988505747


#### 2.4.9 Conclusion on the Model with TextRank
We see from the metrics that the accuracy is significantly higher than in our previous models.

## 3. Abstractive types of Summarizers 
Those types of summarizers are the more interesting ones. Here, our model generates new sentences from the original text we give it. This is opposed to the extractive approach we have been following so far where we used only the sentences that were present in the input text. 

Building an abstractive summarizer from scratch like we have been doing is a more challenging taks although feasible.

But here, we will first look at the implementation of an abstractive summarizer using Transfer Learning to better understand how this approach summarizes the input text before exploring further. 


**Note on Transfer Learning:**
We here mean that we will be using a pre-trained model as the starting point of our model which usually requires very significant computing and time resources to develop it.


### 3.1 A simple implementation of Abstractive Summarization using T5

After the recent success in various of their released algorithms like BERT, Google recently published another very powerful neural network pre-trained model, the [“Text-to-Text Transfer Transformer” (T5)](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html).

T5 is basically a very large neural network model that has been trained on a bunch of unlabeled text (the C4 collection of English web text) as well as labeled data from some popular NLP tasks. This model is then fine-tuned 
for each subsequent tasks that we want to solve. 

The performance those pre-trained models are able to achieve set the T5 as a state of the art tool for many NLP tasks such as text classification, question-answering and summarization tasks.

We will thus try implementing it here to try to see how well it would perform (we were influenced by the great article and explanation of [Ramsri Goutham](https://towardsdatascience.com/simple-abstractive-text-summarization-with-pretrained-t5-text-to-text-transfer-transformer-10f6d602c426).

#### 3.1.1 Importing the pre-trained model

In [279]:
model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')
device = torch.device('cpu')

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=242136741.0, style=ProgressStyle(descri…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=791656.0, style=ProgressStyle(descripti…




#### 3.1.2 Preprocessing the text for our model
We make sure our text is ready to be fed in the model:

In [284]:
# We preprocess the text first, making it ready for T5:
text_preprocessed = text.strip().replace("\n","")
text_for_T5 = "summarize: " + text_preprocessed

# We can take a look at our original text: 
print ("The original text to summarize : \n", text_preprocessed)

# We now tokenize our text to be able to give it as input:
text_tokenized = tokenizer.encode(text_for_T5, return_tensors="pt").to(device)

Token indices sequence length is longer than the specified maximum sequence length for this model (854 > 512). Running this sequence through the model will result in indexing errors


The original text to summarize : 
 Call centre users 'lose patience'Customers trying to get through to call centres are getting impatient and quicker to hang up, a survey suggests.Once past the welcome message, callers on average hang up after just 65 seconds of listening to canned music. The drop in patience comes as the number of calls to call centres is growing at a rate of 20% every year. "Customers are getting used to the idea of an 'always available' society," says Cara Diemont of IT firm Dimension Data, which commissioned the survey. However, call centres also saw a sharp increase of customers simply abandoning calls, she says, from just over 5% in 2003 to a record 13.3% during last year. When automated phone message systems are taken out of the equation, where customers have to pick their way through multiple options and messages, the number of abandoned calls is even higher - a sixth of all callers give up rather than wait. One possible reason for the lack in patience, Ms Diem

#### 3.1.3 Running the T5 Summarizer Model
We are now ready to run the model. As mentionned, we know that the length of the summaries we have of our articles are between 90 and 180 words. We thus tweak our model to make it generate a summary with a similar length to be able to properly assess it:

In [296]:
summary_sentences = model.generate(text_tokenized, num_beams=4, no_repeat_ngram_size=2, min_length=90, max_length=180, early_stopping=True)

summary_model5_t1 = tokenizer.decode(summary_sentences[0], skip_special_tokens=True)
print ("\n\nThe summary of our Abstractive T5 model: \n",summary_model5_t1)




The summary of our Abstractive T5 model: 
 call centres are growing at a rate of 20% every year. the drop in patience comes as the number of calls to call centre is rising at 20% each year, according to the survey, commissioned by dimension data. more customers are calling 'on the move' using their mobile phones, she says. it's also reflected in the centres' growing range of tasks, such as mortgages, credit cards, insurance and current accounts.


We can create a function to run this model easily:

In [300]:
def T5_model(text):
    text_preprocessed = text.strip().replace("\n","")
    text_for_T5 = "summarize: " + text_preprocessed
    text_tokenized = tokenizer.encode(text_for_T5, return_tensors="pt").to(device)
    summary_sentences = model.generate(text_tokenized, num_beams=4, no_repeat_ngram_size=2, min_length=90, max_length=180, early_stopping=True)
    summary_output = tokenizer.decode(summary_sentences[0], skip_special_tokens=True)
    return summary_output

#### 3.1.4 Analyzing the output of the model and adding new texts

After playing around with the parameters and the number of words in the output, we see that the model is not able to formulate large summaries. The maximum words it is giving us is 70 words for this text. And when we set the minimum number of words to 30, it would rather give shorter summaries. 

To be able to compare the outputs of this model and our preceding models, it would make more sense to analyze a shorter text then. To do this, we use 2 other shorter texts along with their summaries from the BBC dataset too:

In [298]:
text4 = '''
Ask Jeeves tips online ad revival

Ask Jeeves has become the third leading online search firm this week to thank a revival in internet advertising for improving fortunes.

The firm's revenue nearly tripled in the fourth quarter of 2004, exceeding $86m (£46m). Ask Jeeves, once among the best-known names on the web, is now a relatively modest player. Its $17m profit for the quarter was dwarfed by the $204m announced by rival Google earlier in the week. During the same quarter, Yahoo earned $187m, again tipping a resurgence in online advertising.

The trend has taken hold relatively quickly. Late last year, marketing company Doubleclick, one of the leading providers of online advertising, warned that some or all of its business would have to be put up for sale. But on Thursday, it announced that a sharp turnaround had brought about an unexpected increase in profits. Neither Ask Jeeves nor Doubleclick thrilled investors with their profit news, however. In both cases, their shares fell by some 4%. Analysts attributed the falls to excessive expectations in some quarters, fuelled by the dramatic outperformance of Google on Tuesday.'''

text5 = '''Budget Aston takes on Porsche

British car maker Aston Martin has gone head-to-head with Porsche's 911 sports cars with the launch of its cheapest model yet.

With a price tag under £80,000, the V8 Vantage is tens of thousands of pounds cheaper than existing Aston models. The Vantage is "the most important car in the history of our company", said Aston's chief executive Ulrich Bez. Aston - whose cars were famously used by James Bond - will unveil the Vantage at the Geneva Motor Show on Thursday. Mr Bez - himself a former executive at rival Porsche - said the new car was the company's "most affordable car ever and makes the brand accessible". This in turn would make Aston Martin "globally visible, but still very, very exclusive", he added.

First shown as a concept car at the 2003 North American International Auto Show in Detroit, the V8 Vantage will be available in the UK in late summer. Development costs for the Vantage have been kept low by sharing a platform with Aston's DB9, which Mr Bez described as "the previous most important car for our company". There is currently an 18 months waiting list for the DB9, Mr Bez said. The Vantage will be built at the new Aston factory in Gaydon, near Warwick, and should more than double Aston's total output from about 2,000 presently.

'''

summ4 = '''
Ask Jeeves has become the third leading online search firm this week to thank a revival in internet advertising for improving fortunes.Its $17m profit for the quarter was dwarfed by the $204m announced by rival Google earlier in the week.During the same quarter, Yahoo earned $187m, again tipping a resurgence in online advertising.Neither Ask Jeeves nor Doubleclick thrilled investors with their profit news, however.Ask Jeeves, once among the best-known names on the web, is now a relatively modest player.'''

summ5 = '''
The Vantage is "the most important car in the history of our company", said Aston's chief executive Ulrich Bez.Development costs for the Vantage have been kept low by sharing a platform with Aston's DB9, which Mr Bez described as "the previous most important car for our company".Mr Bez - himself a former executive at rival Porsche - said the new car was the company's "most affordable car ever and makes the brand accessible".Aston - whose cars were famously used by James Bond - will unveil the Vantage at the Geneva Motor Show on Thursday.'''

#### 3.1.5 Computing the ROUGE metrics for comparison with the Extractive Summarizers
We re-run the first basic model as well as the TextRank model on those 2 texts, and we compare their outputs to the outputs of our T5 model:

##### 3.1.5.1 The ROUGE scores for the basic_summarizer
We tweak the function we created above to return 4 sentences rather than 7, to be able to better compare:

In [314]:
def basic_summarizer2(text):
    stop_words = set(stopwords.words("english"))
    frequency_of_words = {}  
    for word in nltk.word_tokenize(text):  
        if word not in stop_words:
            if word not in frequency_of_words.keys():
                frequency_of_words[word] = 1
            else:
                frequency_of_words[word] += 1
    max_frequency = max(frequency_of_words.values())
    for word in frequency_of_words.keys():  
        frequency_of_words[word] = (frequency_of_words[word]/max_frequency)
    list_of_sentence = nltk.sent_tokenize(text)
    score_of_sentence = {}  
    for sent in list_of_sentence:  
        for word in nltk.word_tokenize(sent.lower()):
            if word in frequency_of_words.keys():
                if len(sent.split(' ')) < 30:
                    if sent not in score_of_sentence.keys():
                        score_of_sentence[sent] = frequency_of_words[word]
                    else:
                        score_of_sentence[sent] += frequency_of_words[word]
    summary_sentences = heapq.nlargest(4, score_of_sentence, key=score_of_sentence.get)
    summary_output = ' '.join(summary_sentences)  
    return summary_output

In [315]:
# Running the models:
model1_text4 = basic_summarizer2(text4)
model1_text5 = basic_summarizer2(text5)

# Scoring the models:
scores_m1_t4 = rouge.get_scores(model1_text4, summ4)
scores_m1_t5 = rouge.get_scores(model1_text5, summ5)

pretty(scores_m1_t4[0])

rouge-1
	f
		0.7354838659929241
	p
		0.7916666666666666
	r
		0.6867469879518072
rouge-2
	f
		0.6666666616925115
	p
		0.7183098591549296
	r
		0.6219512195121951
rouge-l
	f
		0.7107437966696264
	p
		0.7543859649122807
	r
		0.671875


In [316]:
pretty(scores_m1_t5[0])

rouge-1
	f
		0.8085106333029652
	p
		0.8351648351648352
	r
		0.7835051546391752
rouge-2
	f
		0.7849462315643427
	p
		0.8111111111111111
	r
		0.7604166666666666
rouge-l
	f
		0.7971014442753624
	p
		0.7971014492753623
	r
		0.7971014492753623


##### 3.1.5.2 The ROUGE scores for the text_rank_summarizer
We also here change the parameters of the function to be able to better compare:

In [318]:
def text_rank_summarizer2(text):
    sentences = nltk.sent_tokenize(text) 
    clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ") 
    clean_sentences = [s.lower() for s in clean_sentences] 
    def remove_stopwords(sen): 
        sen_new = " ".join([i for i in sen if i not in stop_words])
        return sen_new
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
    sentence_vectors = []
    for i in clean_sentences:
        if len(i) != 0:
            v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
        else:
            v = np.zeros((100,))
        sentence_vectors.append(v)
    sim_mat = np.zeros([len(sentences), len(sentences)]) 
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
    summary =ranked_sentences[0][1] + ranked_sentences[1][1] + ranked_sentences[2][1] + ranked_sentences[3][1]
    return summary

In [322]:
# Running the models:
model4_text4 = text_rank_summarizer2(text4)
model4_text5 = text_rank_summarizer2(text5)

# Scoring the models:
scores_m4_t4 = rouge.get_scores(model4_text4, summ4)
scores_m4_t5 = rouge.get_scores(model4_text5, summ5)

pretty(scores_m4_t4[0])

rouge-1
	f
		0.5423728763752435
	p
		0.5106382978723404
	r
		0.5783132530120482
rouge-2
	f
		0.45714285216261225
	p
		0.43010752688172044
	r
		0.4878048780487805
rouge-l
	f
		0.5147058773702423
	p
		0.4861111111111111
	r
		0.546875


In [323]:
pretty(scores_m4_t5[0])

rouge-1
	f
		0.6666666616863872
	p
		0.6272727272727273
	r
		0.711340206185567
rouge-2
	f
		0.556097555995717
	p
		0.5229357798165137
	r
		0.59375
rouge-l
	f
		0.6666666617147252
	p
		0.6071428571428571
	r
		0.7391304347826086


##### 3.1.5.3 The ROUGE scores for the Abstractive T5

In [324]:
# Running the models:
model5_text4 = T5_model(text4)
model5_text5 = T5_model(text5)

# Scoring the models:
scores_m5_t4 = rouge.get_scores(model5_text4, summ4)
scores_m5_t5 = rouge.get_scores(model5_text5, summ5)

pretty(scores_m5_t4[0])

rouge-1
	f
		0.6891891842631483
	p
		0.7846153846153846
	r
		0.6144578313253012
rouge-2
	f
		0.5890410909664103
	p
		0.671875
	r
		0.524390243902439
rouge-l
	f
		0.6837606788048799
	p
		0.7547169811320755
	r
		0.625


In [331]:
pretty(scores_m5_t5[0])

rouge-1
	f
		0.32335328854387035
	p
		0.38571428571428573
	r
		0.27835051546391754
rouge-2
	f
		0.19393938907327835
	p
		0.2318840579710145
	r
		0.16666666666666666
rouge-l
	f
		0.34645668795089596
	p
		0.3793103448275862
	r
		0.3188405797101449


##### 3.1.5.4 Comparing the scores and outputs

Looking at the ROUGE metrics, the model with the highest scores is the most basic extractor type. The TextRank model comes next and the T5 model seems to perform very bad on the text5.

However, here is where we have to look at the limitations of the ROUGE metric and add our opinion:

- The ROUGE metrics only assess the content selection and do not really take into consideration quality aspects, like the fluency, grammar, or coherence;
- As we mentionned above, to measure the content selection, they rely mostly on the lexical overlap between the text generated and the reference summary. And here an abstractive summary is actually expressing the same content as the reference text without any lexical overlap;

That is why we should be careful in our tracking of progress between our models. 

We add our subjective opinion here by looking at the reference summary as well as the output of our model:

In [337]:
print("The reference summary is: \n \n" + summ4)

The reference summary is: 
 

Ask Jeeves has become the third leading online search firm this week to thank a revival in internet advertising for improving fortunes.Its $17m profit for the quarter was dwarfed by the $204m announced by rival Google earlier in the week.During the same quarter, Yahoo earned $187m, again tipping a resurgence in online advertising.Neither Ask Jeeves nor Doubleclick thrilled investors with their profit news, however.Ask Jeeves, once among the best-known names on the web, is now a relatively modest player.


In [336]:
print("The output of the T5 Abstractive Summarizer is: \n \n" + model5_text4)

The output of the T5 Abstractive Summarizer is: 
 
the firm's revenue nearly tripled in the fourth quarter of 2004, exceeding $86m (£46m) it is now the third leading online search firm to thank a revival in internet advertising for improving fortunes. its $17m profit for the quarter was dwarfed by the $204m announced by rival Google earlier this week. During the same quarter, Yahoo earned $187m, again tipping an resurgence in online advertising.


In [339]:
print("The output of the Basic Summarizer is: \n \n" + model1_text4)

The output of the Basic Summarizer is: 
 
During the same quarter, Yahoo earned $187m, again tipping a resurgence in online advertising. The firm's revenue nearly tripled in the fourth quarter of 2004, exceeding $86m (£46m). Ask Jeeves, once among the best-known names on the web, is now a relatively modest player. 
Ask Jeeves tips online ad revival

Ask Jeeves has become the third leading online search firm this week to thank a revival in internet advertising for improving fortunes.


In [340]:
print("The output of the TextRank Summarizer is: \n \n" + model4_text4)

The output of the TextRank Summarizer is: 
 
Late last year, marketing company Doubleclick, one of the leading providers of online advertising, warned that some or all of its business would have to be put up for sale.Its $17m profit for the quarter was dwarfed by the $204m announced by rival Google earlier in the week.But on Thursday, it announced that a sharp turnaround had brought about an unexpected increase in profits.
Ask Jeeves tips online ad revival

Ask Jeeves has become the third leading online search firm this week to thank a revival in internet advertising for improving fortunes.


We directly see that the grammatical flow of the abstractive summarizer is better than the basic extractor one. The TextRank has managed to select more meaningful sentences than the basic one but still somehow has a poor flow in its tructure when compared to the output of T5.

## 4. Trying the output of a combined model

Out of curiosity, we wanted to see how would the an abstractive model that uses the output of an extractive summarizer perform. So here we combine the TextRank model with the T5 model and compare its output with the output of a simple T5 model:

In [371]:
model6_TextRank_text1 = text_rank_summarizer(text)
model6_Combined_text1 = T5_model(model6_TextRank_text1)

In [374]:
print("The output summary of the combined models is: \n \n" + model6_Combined_text1)

The output summary of the combined models is: 
 
the surge in customers trying to get through to call centres is also a reflection of the centres' growing range of tasks. however, call centre operations are part of their business "core function" they fear that they will damage their brand if they join the offshoring drive, says the centre's chief executive, who says they are 'not equipped to deal with customers' the average induction time for an employee fell last year from 36 to just 21 days, leaving "agents not equipped"


In [376]:
model_T5_t1 = T5_model(text)
print("The output summary of the T5 abstractive model on the same text is: \n \n" + model_T5_t1)

Token indices sequence length is longer than the specified maximum sequence length for this model (854 > 512). Running this sequence through the model will result in indexing errors


The output summary of the T5 abstractive model on the same text is: 
 
call centres are growing at a rate of 20% every year. the drop in patience comes as the number of calls to call centre is rising at 20% each year, according to the survey, commissioned by dimension data. more customers are calling 'on the move' using their mobile phones, she says. it's also reflected in the centres' growing range of tasks, such as mortgages, credit cards, insurance and current accounts.


In [377]:
print("The output summary of the T5 abstractive model on the same text is: \n \n" + model6_TextRank_text1)


The output summary of the T5 abstractive model on the same text is: 
 
They give three key reasons for not making the move:

- call centre operations are part of their business "core function", 
 - they are worried about the risk of going abroad, 
 - they fear that they will damage their brand if they join the offshoring drive.The surge in customers trying to get through to call centres is also a reflection of the centres' growing range of tasks.However, call centres also saw a sharp increase of customers simply abandoning calls, she says, from just over 5% in 2003 to a record 13.3% during last year.In what Dimension Data calls an "alarming development", the average induction time for a call centre worker fell last year from 36 to just 21 days, leaving "agents not equipped to deal with customers".As a result, call centres have a high "churn rate", with nearly a quarter of workers throwing in the towel every year, which in turn forces companies to pay for training new staff.One possible

### 4.1 Concluding on the approach 

An abastractive model seems to perform better on the original text rather than processed text coming from the TextRank model. This would be explained by the fact that the Abstractive model would better assess the text given a broader context and the output would be better built.

Probably the best way to combine the 2 models would be to use them together rather than combining them. For instance the abstractive model could be used as a general overview with a good wording while the extractor one would be used for a more in depth analysis.

## 5. Explanation of the final model chosen for the application

To briefly summarize our work, here are the models we have considered and built on:

1. Model 1 basic summarizer: this model assigned a score to each sentence according to the words in the frequency table;
2. Model 2 TF-IDF approach: this model did not consider the words frequency alone but computed the product of the term frequency and the inverse document frequency;
3. Model 3 Cosine Similarity: this model is based on cosine similarities between the sentences, and instead of considering the magnitude, the cosine similarities consider rather the orientation;
4. TextRank Algorithm: this model is based on a graph generated from a cosine similarity matrix built using sentence embeddings.

To be able to compare the performance of our extractor summarizers, we run the T5 Abstractive summarizer to see if our approach was solid.

We ended up approving on the TextRank model due to the following reasons:

- The model is an easy one to build which offered a better performance than Models 1,2,3 in terms of the ROUGE metric;
- When compared to the other models, this model offered a better readability when it comes to the output;
- It takes into account semantically similar words by using the embeddings approach;
- It is able to scale to big texts and not only give a brief summary. As noted above, the T5 abstractive model is  limited in terms of the length of the output. When we tried it with a lenghty text, it summarized it accurately in just 2-3 sentences but did not give more than 70 words. This could be an advantage when we have short texts; but when talking about lengthy business cases we believe a 10-20 sentences summary would be better, which is what our model has been able to accurately give us;
- As also explained, abstractive summarizers are harder to build and require an advanced understanding of Neural Networks. Even implementing a pre-trained model like the T5 took a long time to set up and download.

Given all those reasons and knowing we wanted to implement our model into a Web Application, the choice of the TextRank algorithm seemed the most appropriate.

The model that we will use in the deployed Web Application will thus be the following one:

In [None]:
import nltk
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

def text_rank_summarizer(text):
    stop_words = set(stopwords.words("english"))
    word_embeddings = {}
    f = open('glove.6B.100d.txt', encoding='utf-8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
    f.close()
    sentences = nltk.sent_tokenize(text) 
    clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ") 
    clean_sentences = [s.lower() for s in clean_sentences] 
    
    def remove_stopwords(sen): 
        sen_new = " ".join([i for i in sen if i not in stop_words])
        return sen_new
    
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
    sentence_vectors = []
    for i in clean_sentences:
        if len(i) != 0:
            v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
        else:
            v = np.zeros((100,))
        sentence_vectors.append(v)
    
    sim_mat = np.zeros([len(sentences), len(sentences)]) 
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
    
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)
    
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
    
    summary =ranked_sentences[0][1] + ranked_sentences[1][1] + ranked_sentences[2][1] + ranked_sentences[3][1] + ranked_sentences[4][1] + ranked_sentences[5][1] + ranked_sentences[6][1] + ranked_sentences[7][1] + ranked_sentences[8][1] + ranked_sentences[9][1]
    return summary

## 6. Further Future Work

Our plan was to not only stop at text summarization but continue and build on the output we have. We wanted to implement a feature that would use the keywords of our texts to either be able to answer questions on them or provide links for further information based on pieced of information in the studied text.

These were not implemented due to time constraint but we plan on continue working on this project.

In terms of the robustness of the model, a good way to add to what we already have would be to be able to combine the extractor model as well as the abstractive model we have shown. In this way the model would use the output of the abstractive summarizer as a general and quick overview of the text, one that is easy to read. And the output of the extractor would then be used as the detailed summary of the text.