## Text Summarization using the Gensim summarizer module

This is a simple method to tackle the problem of text summarization. It is an extractive approach, which selects the sentences that best represent the source text and arranges them to form a summary. *Extractive summary extract the important and meaningful sentences from the original text and placing them into summary without any changes* [1].
As described in an excellent review of extractive summarization, cited in the references section:

*Extractive techniques generally generate summaries through 3 phases or it essentially based on them. These phases are preprocessing step,
processing step and generation step:
1) Preprocessing step: the representation space dimensionality of the original text is reduced to involve a new structure representation. It usually includes:

    a. Stop-word elimination: Common words without semantics that do not collect information relevant to the task (for example, "the", "a", "an", "in") are eliminated.
    
    b. Stemming: Acquire the stem of each word by bringing the word to its base form.
    
    c. Part of speech tagging: The process of identifying and classifying words of the text on the basis of part of speech category they belong (nouns, verbs, adverbs, adjectives).
    
2) Processing step: It uses an algorithm with the help of features generated in the preprocessing step to convert the text structure to the summary structure. In which, the sentences are scored.

3) Generation step: sentences are ranked. Then, it pick up the most important sentences from the ranked structure to generate the final required summary.*
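
The three phases quoted above can be illustrated with a minimal sketch in plain Python. It is not the gensim method used later in this notebook: the stop-word list, the frequency-based scoring and the function names (`preprocess`, `score_sentences`, `extract_summary`) are assumptions made only for this example, and stemming and part-of-speech tagging are left out.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; real pipelines use a much larger one.
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "and", "is", "are", "to"}

def preprocess(sentence):
    """Preprocessing step: lowercase, tokenize and drop stop words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def score_sentences(sentences):
    """Processing step: score a sentence by the average frequency of its words."""
    tokenized = [preprocess(s) for s in sentences]
    freq = Counter(t for tokens in tokenized for t in tokens)
    return [sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0
            for tokens in tokenized]

def extract_summary(text, n_sentences=1):
    """Generation step: rank the sentences and keep the best ones in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scores = score_sentences(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

text = ("The city council met on Monday. "
        "The council approved the school budget. "
        "The school budget increases teacher pay.")
summary = extract_summary(text, n_sentences=1)
print(summary)
```

The summary is made only of sentences copied from the input, which is what makes the approach extractive.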

These techniques are very popular in the industry because they are easy to implement. They reuse existing natural language phrases and are reasonably accurate. They are also fast: as unsupervised algorithms, they need no training phase, so there is no loss function to compute at every step.


### Installing and importing the libraries

In [None]:
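# Install the gensim package (first time only). The gensim.summarization module
# used in this notebook was removed in gensim 4.0, so a 3.x release is required.
#!pip install "gensim<4.0"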

I will use the rouge library to calculate the ROUGE metrics and evaluate the results. This library is independent of the "official" ROUGE script (ROUGE-1.5.5), so results may differ slightly, but it is very easy to use.


In [70]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.0-py3-none-any.whl (14 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [71]:
import pandas as pd
import numpy as np
import random
#Load the gensim modules
from gensim.summarization import summarize
from gensim.summarization import keywords
from gensim.summarization.textcleaner import split_sentences
from gensim.summarization import summarizer
#Module for splitting the dataset
from sklearn.model_selection import train_test_split
#Import library to calculate the evaluation metric
from rouge import FilesRouge


### Load the dataset

In [None]:
# Run only when new datafiles have been stored in GS
#%%bash
#gsutil cp gs://mlend_bucket/data/news_summary/news_summary_more.csv ../data/

In [2]:
summary = pd.read_csv('../data/news_summary.csv', encoding='iso-8859-1')
#raw = pd.read_csv('../data/news_summary_more.csv', encoding='iso-8859-1')
summary.head(5)

Unnamed: 0,author,date,headlines,read_more,text,ctext
0,Chhavi Tyagi,"03 Aug 2017,Thursday",Daman & Diu revokes mandatory Rakshabandhan in...,http://www.hindustantimes.com/india-news/raksh...,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...
1,Daisy Mowke,"03 Aug 2017,Thursday",Malaika slams user who trolled her for 'divorc...,http://www.hindustantimes.com/bollywood/malaik...,Malaika Arora slammed an Instagram user who tr...,"From her special numbers to TV?appearances, Bo..."
2,Arshiya Chopra,"03 Aug 2017,Thursday",'Virgin' now corrected to 'Unmarried' in IGIMS...,http://www.hindustantimes.com/patna/bihar-igim...,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...
3,Sumedha Sehra,"03 Aug 2017,Thursday",Aaj aapne pakad liya: LeT man Dujana before be...,http://indiatoday.intoday.in/story/abu-dujana-...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4,Aarushi Maheshwari,"03 Aug 2017,Thursday",Hotel staff to get training to spot signs of s...,http://indiatoday.intoday.in/story/sex-traffic...,Hotels in Maharashtra will train their staff t...,Hotels in Mumbai and other Indian cities are t...


In [3]:
#Drop duplicate rows
summary.drop_duplicates(subset=["ctext"],inplace=True)
#Drop rows with null values in any column
summary.dropna(inplace=True)
summary.reset_index(drop=True,inplace=True)
# we use the headlines variable as the reference summary and ctext as the source text
dataset = summary[['headlines','ctext']].copy()
dataset.columns = ['summary','text']
dataset.head(5)

Unnamed: 0,summary,text
0,Daman & Diu revokes mandatory Rakshabandhan in...,The Daman and Diu administration on Wednesday ...
1,Malaika slams user who trolled her for 'divorc...,"From her special numbers to TV?appearances, Bo..."
2,'Virgin' now corrected to 'Unmarried' in IGIMS...,The Indira Gandhi Institute of Medical Science...
3,Aaj aapne pakad liya: LeT man Dujana before be...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4,Hotel staff to get training to spot signs of s...,Hotels in Mumbai and other Indian cities are t...


### Data preprocessing and cleaning

Let's look at the cleaning steps we need to apply to our dataset:
- Expand contractions
- Punctuation separation
- Remove multiple spaces


In [77]:
import re
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", 
                       "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", 
                       "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", 
                       "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", 
                       "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", 
                       "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", 
                       "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", 
                       "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", 
                       "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", 
                       "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
                       "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", 
                       "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", 
                       "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", 
                       "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", 
                       "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", 
                       "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is",
                       "they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", 
                       "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", 
                       "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", 
                       "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  
                       "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", 
                       "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", 
                       "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", 
                       "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", 
                       "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have",
                       "y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", 
                       "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }

punct = "/-'.,?!#$%\'()*+-/:;<=>@[\\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\×™√²—–&'

def expand_contractions(text):
    ''' Expand some well-known contractions in a text'''
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in text.split(" ")])
    return text

def remove_mult_spaces(text):
    re_mult_space = re.compile(r"  *") # replace multiple spaces with just one
    return re_mult_space.sub(r' ', text)

def sep_punctuation(text, punct):
    ''' Add a whitespace after every punctuation character found in punct'''
    for p in punct:
        text = text.replace(p, f'{p} ')

    return text

def remove_CTL(text):
    ''' Replace newline characters with a space'''
    newline = re.compile(r'\n')
    return newline.sub(r' ',text)

def clean_text(text):
    new_text=text
    new_text=new_text.apply(lambda x : expand_contractions(x))
    new_text=new_text.apply(lambda x : sep_punctuation(x,punct))
    new_text=new_text.apply(lambda x : remove_mult_spaces(x))
    new_text=new_text.apply(lambda x : remove_CTL(x))
    return new_text
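
To see what the cleaning steps do, here is a small self-contained sketch that applies them to one sentence. The two-entry mapping, the short `punct` string and the sample sentence are made up for the example; the functions mirror the ones defined above but take the mapping as an argument so the block can run on its own.

```python
import re

# Hypothetical, tiny subset of the contraction mapping
contractions = {"don't": "do not", "it's": "it is"}
punct = ".,!?"

def expand_contractions(text, mapping):
    # Normalize typographic apostrophes before looking up the contractions
    for s in ["’", "‘", "´", "`"]:
        text = text.replace(s, "'")
    return " ".join(mapping.get(t, t) for t in text.split(" "))

def sep_punctuation(text, punct):
    # Add a space after every punctuation character
    for p in punct:
        text = text.replace(p, f"{p} ")
    return text

def remove_mult_spaces(text):
    # Collapse runs of spaces into a single space
    return re.sub(r" +", " ", text)

sample = "I don’t know,   but it's late!"
cleaned = remove_mult_spaces(sep_punctuation(expand_contractions(sample, contractions), punct))
print(repr(cleaned))
```

Note that a trailing space remains after the final punctuation mark, and that a contraction glued to another token (for example `know,it's`) would not be expanded, because the lookup works on space-separated tokens.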

In [78]:
dataset_text=clean_text(dataset['text'])
dataset_summary=clean_text(dataset['summary'])

### Splitting the dataset into a train and a test set


Although it is not strictly necessary, we will divide the dataset into a train set and a test set in order to use the latter to evaluate the performance of the algorithm. The gensim summarizer only works on the source text we provide to produce the summary; it does not use any other data or information, so it does not matter whether or how we split the data.

Let's hold out 20% of the data as the test set.

In [79]:
train_x, test_x, train_y, test_y = train_test_split(dataset_text, dataset_summary, test_size=0.2, random_state=42, shuffle=True)
print('Length train set: ',len(train_x),len(train_y))
print('Length test set: ',len(test_x),len(test_y))
train_x.reset_index(drop=True, inplace=True)
train_y.reset_index(drop=True, inplace=True)
test_x.reset_index(drop=True, inplace=True)
test_y.reset_index(drop=True, inplace=True)


Length train set:  3472 3472
Length test set:  869 869


### Build the summarizer with gensim

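We build the summarizer on gensim's `summarize` function, which implements a variation of the TextRank algorithm: sentences are ranked by how similar they are to the other sentences of the text, and the top-ranked ones are returned. For every source text we request a summary of about `summary_length` words, skip the texts that have too few sentences and record the texts for which the summarizer raises an error.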
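
To give an intuition of graph-based sentence ranking, the idea behind TextRank (which gensim's summarizer is a variation of), here is a simplified sketch. It is not gensim's implementation, which differs in its details, for example in the similarity function. The helper names, the damping factor and the number of iterations are assumptions chosen for the example.

```python
import math
import re

def tokenize(sentence):
    return set(re.findall(r"\w+", sentence.lower()))

def similarity(a, b):
    """Word overlap between two sentences, normalized by their lengths."""
    wa, wb = tokenize(a), tokenize(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids a zero denominator, since log(1) = 0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def textrank_scores(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence similarity graph (power iteration)."""
    n = len(sentences)
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0)
            for i in range(n)
        ]
    return scores

sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse on the mat.",
    "The mouse hid under the mat.",
    "Stocks fell sharply today.",
]
scores = textrank_scores(sentences)
best = max(range(len(sentences)), key=lambda i: scores[i])
print(sentences[best])
```

The last sentence shares no word with the others, so it receives no votes and keeps only the base score `1 - damping`; a summary would be built from the highest-scoring sentences.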

In [80]:
#Set the desired length (in words) of the summaries to produce
summary_length = 25

In [81]:
def generate_summary(data_text, summary_length):
    ''' Generate a summary for every element in data_text using the
        gensim summarizer method.
        
        Input:
           - data_text: list of strings to summarize
           - summary_length: how many words the summary will contain
        Output:
           - summaries: a list of strings, the summary for every string in data_text
           - errors: number of strings that could not be summarized
           - no_summarizable: number of strings with too few sentences to summarize
    '''
    
    summaries=[]
    errors=0
    no_summarizable=0
    # Set the minimum number of sentences a text needs to be summarized
    summarizer.INPUT_MIN_LENGTH = 3
    # for every string in the input list
    for source_text in data_text:
        # a source text with fewer than two sentences cannot be summarized
        if len(split_sentences(source_text))> 1:
            try:
                # Sometimes the sentences in the source text are not suitable
                # to produce a summary
                summaries.append(summarize(source_text, word_count = summary_length))
            except Exception:
                summaries.append('Error')
                errors +=1
        else:
            summaries.append('No Summarizable')
            no_summarizable +=1

    return summaries, errors, no_summarizable


In [82]:
# Generate the summaries for the train and test sets
print("\nGenerating the summary for the train set\n")
train_preds,errors, no_summarizable=generate_summary(train_x, summary_length)
print('Errors: ',errors)
print('No Summarizable: ',no_summarizable)
print("\nGenerating the summary for the test set\n")
test_preds,errors, no_summarizable=generate_summary(test_x, summary_length)
print('Errors: ',errors)
print('No Summarizable: ',no_summarizable)


Generating the summary for the train set

Errors:  1
No Summarizable:  33

Generating the summary for the test set

Errors:  0
No Summarizable:  13


For the test set there are no errors, but 13 texts could not be summarized. Let's explore some of the summaries we have extracted:

In [83]:
import random
#Print some summaries to analyze them
print('Examples: \n')
for i in random.sample(range(50),10):
    print('i: ',i,' : ', test_preds[i],'\n')
    

Examples: 

i:  25  :  While Deoband is known for its historical significance and Darul Uloom Deoband has been a centre of Islamic learning since the Mughal era, Brijesh Singh had demanded that the city be renamed Deovrind after BJP came into power. 

i:  23  :  Maharashtra' s bar dancers have waltzed up to the Supreme Court and argued that the prevailing ban- like atmosphere in the state could push them into activities like prostitution. 

i:  48  :   

i:  28  :   

i:  36  :  The MMRDA said it fixed 353 potholes on the Western Express Highway, which were reported between July 1 and July 10, last week. 

i:  49  :   

i:  1  :   

i:  42  :  ALSO READ: Ratan Tata does not speak the truth: Cyrus Mistry In the next few weeks, at least six Tata Group companies are expected to hold such meetings to remove Cyrus Mistry from the position of director. 

i:  12  :  New Delhi, Feb 19 ( PTI) FAQs about menstruation will now be answered at Delhi government schools with an NGO conducting " perio

After inspecting the results, we observe three kinds of errors or incoherent results:
- Errors, when the gensim summarizer raises an error
- No summarizable, when the summarizer cannot be applied because the text has too few sentences
- Null, when the summary obtained is empty or null

So we need to discard these rows from our predicted summaries.
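
The filtering described above can also be sketched in a single pass over the predictions and their reference summaries. This is only an illustration: `drop_invalid` is a hypothetical helper and the toy lists are made up, while the sentinel strings are the ones produced by `generate_summary`.

```python
# Sentinel values that mark an unusable prediction
INVALID = {"Error", "No Summarizable", ""}

def drop_invalid(predictions, references):
    """Keep only the (prediction, reference) pairs whose prediction is usable."""
    kept = [(p, r) for p, r in zip(predictions, references) if p not in INVALID]
    preds = [p for p, _ in kept]
    refs = [r for _, r in kept]
    return preds, refs

# Toy data, made up for illustration
preds = ["A short summary.", "Error", "", "No Summarizable", "Another summary."]
refs = ["ref 0", "ref 1", "ref 2", "ref 3", "ref 4"]
clean_preds, clean_refs = drop_invalid(preds, refs)
print(len(clean_preds), len(clean_refs))
```

Filtering both lists together keeps every prediction aligned with its reference, which the ROUGE evaluation relies on.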

In [84]:
#Search for the Error rows in the test set
test_errors = np.where(np.array(test_preds)=='Error')[0]
print('Errors: ',len(test_errors))
#Search for the No Summarizable texts in the test set
test_no_summa = np.where(np.array(test_preds)=='No Summarizable')[0]
print('No summarizable: ',len(test_no_summa))
#Search for the null summaries in the test set
test_nulls = np.where(np.array(test_preds)=='')[0]
print('Nulls summaries: ',len(test_nulls))

#Discard the incorrect summaries, keeping predictions and labels aligned
incorrect_preds=np.concatenate((test_errors,test_no_summa,test_nulls))
predicted_summaries = [test_preds[i] for i in range(len(test_preds)) if i not in incorrect_preds]
labeled_summaries =  [test_y[i] for i in range(len(test_y)) if i not in incorrect_preds]

Errors:  0
No summarizable:  13
Nulls summaries:  66


In [95]:
len(labeled_summaries),len(predicted_summaries)

(790, 790)

### Evaluating the results using ROUGE metrics 

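ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [2] is a family of metrics that compare a generated summary against a reference summary by counting overlapping units such as n-grams. The scores are interpreted in more detail at the end of this section.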

#### Create predicted and real summaries file

In [109]:
def save_textfile(filename, strings):
    ''' Save the content of a list of strings to a file called filename
    
        Input:
           - filename: name of the file to save the strings
           - strings: a list of strings to save to disk
    '''
    
    with open(filename, 'w') as f:
        for item in strings:
            #Remove any \n in the string
            item = remove_CTL(item)
            f.write("%s\n" % item)


In [104]:
#Save the files with the predicted and the labeled summaries
save_textfile('predicted_summaries.txt',predicted_summaries)
save_textfile('labeled_summaries.txt',labeled_summaries)

Once the files with the predicted and real summaries are stored, with one summary per line, we can call the rouge library to get the metrics for our evaluation.
Let's try it:

In [107]:
#Create the rouge object and compute the average ROUGE scores
files_rouge = FilesRouge()
scores = files_rouge.get_scores('predicted_summaries.txt', 'labeled_summaries.txt', avg=True)

In [113]:
#print('Average Results on the Test set:\n')
print(scores)

{'rouge-1': {'f': 0.15058549183980122, 'p': 0.10255481089937296, 'r': 0.306434167098725}, 'rouge-2': {'f': 0.03799941783852152, 'p': 0.02541178949636445, 'r': 0.08221898003543593}, 'rouge-l': {'f': 0.13370336655643658, 'p': 0.09253319907707971, 'r': 0.2580442272214432}}


There are two aspects that may impact the need for human post-processing:
- Does the summary sound fluent?
- Is the summary adequate? I.e., is the length appropriate and does it cover the most important information of the text it summarizes?

ROUGE doesn't try to assess how fluent the summary is. It only tries to assess adequacy, by simply counting how many n-grams in your generated summary match the n-grams in your reference summary (or summaries, as ROUGE supports multi-reference corpora). This is the process for one document-summary pair. You repeat the process for all documents and average all the scores, which gives you a ROUGE-N score. So a higher score means that, on average, there is a high overlap of n-grams between your summaries and the references.
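
The counting and averaging just described can be sketched in a few lines. This is a simplified illustration, not the code of the rouge library or of the official script: tokens are split on whitespace, no stemming is applied, and the function names and toy sentences are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Precision, recall and F1 of the n-gram overlap for one summary pair."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # matches, clipped to the reference counts
    p = overlap / sum(cand.values()) if cand else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

def average_rouge_n(candidates, references, n=1):
    """Average the per-pair scores over all the document-summary pairs."""
    scores = [rouge_n(c, r, n) for c, r in zip(candidates, references)]
    return {k: sum(s[k] for s in scores) / len(scores) for k in ("p", "r", "f")}

# Toy data, made up for illustration
candidates = ["the cat sat on the mat today", "dogs bark"]
references = ["the cat sat", "dogs bark loudly"]
avg = average_rouge_n(candidates, references, n=1)
print(avg)
```

In the first pair the candidate is much longer than the reference, so recall is perfect while precision is low. The same pattern shows up when full extracted sentences are compared against short headlines.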

Example:

- S1. police killed the gunman
- S2. police kill the gunman
- S3. the gunman kill police

S1 is the reference and S2 and S3 are candidates. Note that S2 and S3 both have one overlapping bigram with the reference, so they have the same ROUGE-2 score, although S2 should be better. An additional ROUGE-L score deals with this, where L stands for Longest Common Subsequence. In S2, the first word and the last two words match the reference, so it scores 3/4, whereas S3 only matches the bigram, so it scores 2/4. See the paper [2] for more details.
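
The numbers in this example can be checked with a short sketch. It computes the bigram overlap and the longest common subsequence (LCS) with plain Python; the helper names are made up, and the LCS value is divided by the reference length, which gives the 3/4 and 2/4 quoted above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

s1 = "police killed the gunman".split()  # reference
s2 = "police kill the gunman".split()    # candidate
s3 = "the gunman kill police".split()    # candidate

# Both candidates share exactly one bigram with the reference
overlap_s2 = len(bigrams(s1) & bigrams(s2))
overlap_s3 = len(bigrams(s1) & bigrams(s3))

# The LCS-based score tells them apart
rouge_l_s2 = lcs_length(s1, s2) / len(s1)
rouge_l_s3 = lcs_length(s1, s3) / len(s1)
print(overlap_s2, overlap_s3, rouge_l_s2, rouge_l_s3)
```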

Because ROUGE is based only on content overlap, it can determine if the same general concepts are discussed between an automatic summary and a reference summary, but it cannot determine if the result is coherent or the sentences flow together in a sensible manner. High-order n-gram ROUGE measures try to judge fluency to some degree.

**What is the best way to really understand what a ROUGE score actually measures?**

In short and approximately:

- ROUGE-n recall=40% means that 40% of the n-grams in the reference summary are also present in the generated summary.
- ROUGE-n precision=40% means that 40% of the n-grams in the generated summary are also present in the reference summary.
- ROUGE-n F1-score=40% is more difficult to interpret, like any F1-score.

ROUGE is more interpretable than BLEU, one of whose known deficiencies is "Scores hard to interpret". I said approximately because the original ROUGE implementation from the paper that introduced ROUGE [2] may perform a few more things such as stemming.
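
One reason an averaged F1-score is hard to read is that the average of per-pair F1-scores is generally not the F1-score of the average precision and recall. The sketch below shows this on two made-up pairs, and then compares the ROUGE-1 values printed earlier. That the library averages the per-pair F1-scores is an assumption here, not something checked against its source code.

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Two toy (precision, recall) pairs, made up for illustration
pairs = [(0.9, 0.1), (0.1, 0.9)]
mean_of_f = sum(f1(p, r) for p, r in pairs) / len(pairs)
mean_p = sum(p for p, _ in pairs) / len(pairs)
mean_r = sum(r for _, r in pairs) / len(pairs)
f_of_means = f1(mean_p, mean_r)
print(mean_of_f, f_of_means)

# ROUGE-1 averages copied from the scores printed earlier in this section
p_avg, r_avg, f_reported = 0.10255481089937296, 0.306434167098725, 0.15058549183980122
f_from_averages = f1(p_avg, r_avg)
print(f_from_averages, f_reported)
```

The F1-score recomputed from the average precision and recall is slightly higher than the reported one, which is consistent with the library averaging F1 pair by pair.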

The following code checks the number of lines in the predicted summaries file:

In [106]:
fname = "predicted_summaries.txt"
count = 0
with open(fname, 'r') as f:
    for line in f:
        count += 1
print("Total number of lines is:", count)

Total number of lines is: 790


#### References

[1] El-Refaiy, Ahmed & Abas, A.R. & Elhenawy, I. (2018). Review of recent techniques for extractive text summarization. Journal of Theoretical and Applied Information Technology. 96. 7739-7759.


[2] Lin, Chin-Yew. "Rouge: A package for automatic evaluation of summaries." In Text summarization branches out: Proceedings of the ACL-04 workshop, vol. 8. 2004. https://scholar.google.com/scholar?cluster=2397172516759442154&hl=en&as_sdt=0,5 ; http://anthology.aclweb.org/W/W04/W04-1013.pdf