# Unsupervised Text Summarisation using Text Rank Algorithm

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
import re
from nltk.corpus import stopwords
stopWords = stopwords.words('english')
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

## Load and Clean Data

In [2]:
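# Helper functions

def clean(email):
    # Strip header-like lines ("Key: value"), forwarded-message banners,
    # newlines and leading ">" quote markers
    return re.sub(r"^.*:\s.*|^-.*\n?.*-$|\n|^>", '', email, count=0, flags=re.MULTILINE)

def removeStopwords(sentence):
    # `sentence` is a list of words; returns the non-stop words joined by spaces
    newSentence = " ".join([i for i in sentence if i not in stopWords])
    return newSentence

def prettySentences(sentence):
    # Print each sentence followed by a blank line
    for s in sentence:
        print(s)
        print()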

The dataset contains 50,000+ emails, so we read only the first chunk of 2,000 emails to test the algorithm.

In [3]:
chunk = pd.read_csv('../input/enron-email-dataset/emails.csv', chunksize=2000)
data = next(chunk)
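
To illustrate how `chunksize` behaves, here is a minimal sketch on a tiny in-memory CSV. The rows and the `file` column are made up for illustration; only the `message` column name comes from this notebook. With `chunksize` set, `pd.read_csv` returns an iterator of DataFrames, and `next` parses just the first chunk.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for emails.csv (contents are hypothetical).
csv_text = "file,message\na,first email\nb,second email\nc,third email\n"

# With chunksize set, read_csv returns an iterator instead of one DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
first_chunk = next(reader)  # only the first 2 rows are parsed here

print(len(first_chunk))              # 2
print(list(first_chunk["message"]))  # ['first email', 'second email']
```

If only the first rows are ever needed, passing `nrows=2000` to `pd.read_csv` should give a comparable subset without keeping an iterator around.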

## Preprocess Email

In [4]:
print(data.message[9])

Message-ID: <30795301.1075855687494.JavaMail.evans@thyme>
Date: Mon, 16 Oct 2000 06:44:00 -0700 (PDT)
From: phillip.allen@enron.com
To: zimam@enron.com
Subject: FW: fixed forward or other Collar floor gas price terms
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: zimam@enron.com
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

---------------------- Forwarded by Phillip K Allen/HOU/ECT on 10/16/2000 
01:42 PM ---------------------------


"Buckner, Buck" <buck.buckner@honeywell.com> on 10/12/2000 01:12:21 PM
To: "'Pallen@Enron.com'" <Pallen@Enron.com>
cc:  
Subject: FW: fixed forward or other Collar floor gas price terms


Phillip,

> As discussed  during our phone conversation, In a Parallon 75 microturbine
> power generation deal for a national accounts customer, I am developing a
> proposal to sell power to customer at fixed or collar/floor pr

* Naively clean the email by removing the metadata, newlines and some syntax which occurs when the email is forwarded

In [5]:
myData = []
for sentence in data.message:
    myData.append(clean(sentence))
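
To see what each part of the `clean` pattern removes, here is a small sketch that applies the same regular expression to a made-up email. The sample text and the names `PATTERN` and `sample` are assumptions for illustration only.

```python
import re

# Same pattern as clean(), split into its four alternatives.
PATTERN = re.compile(
    r"^.*:\s.*"      # header-like lines containing ": " (e.g. "From: ...")
    r"|^-.*\n?.*-$"  # "---- Forwarded by ... ----" banners, possibly over two lines
    r"|\n"           # newlines
    r"|^>",          # quote marker at the start of a line
    re.MULTILINE,
)

# Hypothetical email, not taken from the dataset.
sample = (
    "From: alice@example.com\n"
    "Subject: status\n"
    "\n"
    "---- Forwarded by Bob ----\n"
    "> The report is ready.\n"
    "> Please review it.\n"
)

cleaned = PATTERN.sub("", sample)
print(repr(cleaned))  # ' The report is ready. Please review it.'
```

Because newlines are deleted rather than replaced by a space, text on adjacent lines can be glued together, which is why `PM` and `Phillip` end up as `PMPhillip` in the cleaned email.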

In [6]:
print(myData[9])

"Buckner, Buck" <buck.buckner@honeywell.com> on 10/12/2000 01:12:21 PMPhillip, As discussed  during our phone conversation, In a Parallon 75 microturbine power generation deal for a national accounts customer, I am developing a proposal to sell power to customer at fixed or collar/floor price. To do so I need a corresponding term gas price for same. Microturbine is an onsite generation product developed by Honeywell to generate electricity on customer site (degen). using natural gas. In doing so,  I need your best fixed price forward gas price deal for 1, 3, 5, 7 and 10 years for annual/seasonal supply to microturbines to generate fixed kWh for customer. We have the opportunity to sell customer kWh 's using microturbine or sell them turbines themselves. kWh deal must have limited/ no risk forward gas price to make deal work. Therein comes Sempra energy gas trading, truly you. We are proposing installing 180 - 240 units across a large number of stores (60-100) in San Diego. Store number

## Split into Sentences

Use the NLTK library to split each email into sentences. I am following an extractive approach to text summarisation, so the goal is to extract the most important sentences from each email.

In [7]:
from nltk.tokenize import sent_tokenize

sentences = []

for sentence in myData:
    sentences.append(sent_tokenize(sentence))

In [8]:
prettySentences(sentences[9])

"Buckner, Buck" <buck.buckner@honeywell.com> on 10/12/2000 01:12:21 PMPhillip, As discussed  during our phone conversation, In a Parallon 75 microturbine power generation deal for a national accounts customer, I am developing a proposal to sell power to customer at fixed or collar/floor price.

To do so I need a corresponding term gas price for same.

Microturbine is an onsite generation product developed by Honeywell to generate electricity on customer site (degen).

using natural gas.

In doing so,  I need your best fixed price forward gas price deal for 1, 3, 5, 7 and 10 years for annual/seasonal supply to microturbines to generate fixed kWh for customer.

We have the opportunity to sell customer kWh 's using microturbine or sell them turbines themselves.

kWh deal must have limited/ no risk forward gas price to make deal work.

Therein comes Sempra energy gas trading, truly you.

We are proposing installing 180 - 240 units across a large number of stores (60-100) in San Diego.

Sto

## GloVe Embeddings

GloVe embeddings are available on Kaggle and are attached to this notebook. Here I load the 100-dimensional version. Using higher-dimensional embeddings currently causes problems with the convergence of the Text Rank algorithm.

In [9]:
wordEmbeddings = {}

with open('../input/glove-global-vectors-for-word-representation/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        key = values[0]
        wordEmbeddings[key] = np.asarray(values[1:], dtype='float32')

In [10]:
len(wordEmbeddings)

400000
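
The GloVe file is plain text with one word per line followed by its vector components. Here is a minimal sketch of the same parsing step on two made-up lines (the values are not real GloVe coefficients, and `glove_text` is a hypothetical stand-in for the file):

```python
import io
import numpy as np

# Two made-up lines in the "word v1 v2 ... vN" layout used by GloVe text files.
glove_text = "gas 0.1 0.2 0.3\nprice -0.4 0.5 0.6\n"

embeddings = {}
for line in io.StringIO(glove_text):
    word, *coeffs = line.split()
    # The string components are converted to float32 by NumPy.
    embeddings[word] = np.asarray(coeffs, dtype="float32")

print(len(embeddings))          # 2
print(embeddings["gas"].shape)  # (3,)
```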

## Preprocess noise in sentences

We remove all non-alphabetic characters (digits, punctuation and symbols) from each sentence and convert the remaining characters to lower case.

In [11]:
cleanSentences = []

# Replace non-letters with spaces, collapse repeated whitespace, then lowercase
for email in sentences:
    email = [re.sub(r"[^a-zA-Z]", " ", s) for s in email]
    email = [re.sub(r"\s+", " ", s) for s in email]
    cleanSentences.append([s.lower() for s in email])

In [12]:
prettySentences(cleanSentences[9])

 buckner buck buck buckner honeywell com on pmphillip as discussed during our phone conversation in a parallon microturbine power generation deal for a national accounts customer i am developing a proposal to sell power to customer at fixed or collar floor price 

to do so i need a corresponding term gas price for same 

microturbine is an onsite generation product developed by honeywell to generate electricity on customer site degen 

using natural gas 

in doing so i need your best fixed price forward gas price deal for and years for annual seasonal supply to microturbines to generate fixed kwh for customer 

we have the opportunity to sell customer kwh s using microturbine or sell them turbines themselves 

kwh deal must have limited no risk forward gas price to make deal work 

therein comes sempra energy gas trading truly you 

we are proposing installing units across a large number of stores in san diego 

store number varies because of installation hurdles face at small percent 
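
The three steps can be sketched on a single sentence. The helper name `normalise` and the sample sentence are hypothetical; the two regular expressions are the ones used in the cell above.

```python
import re

def normalise(sentence):
    """Keep only letters, collapse whitespace, and lowercase."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    single_spaced = re.sub(r"\s+", " ", letters_only)
    return single_spaced.lower()

result = normalise("Deal for 1, 3, 5, 7 and 10 years (60-100 stores).")
print(repr(result))  # 'deal for and years stores '
```

Note that the digits disappear entirely and a trailing space can remain. Calling `.strip()` on the result would remove it if that matters downstream.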

## Remove Stop Words

Remove stop words, which occur often in sentences but add little value or information.

In [13]:
for i in range(len(cleanSentences)):
    cleanSentences[i] = [removeStopwords(r.split()) for r in cleanSentences[i]]

In [14]:
prettySentences(cleanSentences[9])

buckner buck buck buckner honeywell com pmphillip discussed phone conversation parallon microturbine power generation deal national accounts customer developing proposal sell power customer fixed collar floor price

need corresponding term gas price

microturbine onsite generation product developed honeywell generate electricity customer site degen

using natural gas

need best fixed price forward gas price deal years annual seasonal supply microturbines generate fixed kwh customer

opportunity sell customer kwh using microturbine sell turbines

kwh deal must limited risk forward gas price make deal work

therein comes sempra energy gas trading truly

proposing installing units across large number stores san diego

store number varies installation hurdles face small percent

gas requirement microturbines mmcf per year gas likely consumed may september peak electric period

need detail breakout commodity transport cost firm interruptible

additional questions give call

let assure real 
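
Here is a minimal sketch of the filtering step. It uses a tiny hand-picked stop-word set (`STOP_WORDS` is an assumption for illustration, not NLTK's English list) so that it runs without downloading any corpus.

```python
# Tiny illustrative subset; the notebook uses NLTK's full English list instead.
STOP_WORDS = {"to", "do", "so", "i", "a", "for", "same"}

def remove_stopwords(sentence):
    """Drop stop words from a whitespace-separated sentence."""
    return " ".join(w for w in sentence.split() if w not in STOP_WORDS)

filtered = remove_stopwords("to do so i need a corresponding term gas price for same")
print(filtered)  # need corresponding term gas price
```

Storing the stop words in a `set` makes each membership test constant-time, which may be worth considering when filtering many emails.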

## Represent Sentence as Vectors (Taking average of all words)

Each sentence becomes a node in the Text Rank graph, so we need a vector for it. We build one by averaging the GloVe embeddings of all the words in the sentence; words missing from the vocabulary contribute a zero vector.

In [15]:
sentenceVectors = []

for email in cleanSentences:
    temp = []
    for s in email:
        if len(s) != 0:
            v = sum([ wordEmbeddings.get(w, np.zeros((100,))) for w in s.split()])/(len(s.split()) + 0.001)
        else:
            v = np.zeros((100,))
        temp.append(v)
    sentenceVectors.append(temp)

In [16]:
sentenceVectors[9][0]

array([-0.00468579,  0.11893937,  0.27804963,  0.01830821,  0.01522299,
       -0.33332529, -0.11906888, -0.09412961,  0.07365905, -0.06874301,
        0.05856576, -0.02399825,  0.03514577, -0.06001841,  0.07035317,
       -0.22505107,  0.01556205,  0.18181841,  0.00168186, -0.07575724,
        0.07189374, -0.04741132,  0.00657123, -0.01528418,  0.19881609,
       -0.16628451, -0.12298737, -0.13709352, -0.04477246,  0.05795908,
        0.20846984,  0.57821922,  0.09934372, -0.07212303,  0.05853643,
        0.22405299,  0.00982063, -0.02549743,  0.14813818, -0.17511261,
        0.11814637, -0.28899234,  0.02520481, -0.07032932, -0.26279486,
        0.16866876, -0.17404244, -0.24757794,  0.08289612, -0.44547772,
        0.03211591,  0.00291385, -0.08135813,  0.45512722,  0.0210353 ,
       -1.24673548,  0.02554253, -0.15740101,  1.16909775,  0.30779748,
        0.09638102,  0.09968338, -0.11361894, -0.0354592 ,  0.35974519,
       -0.14989971,  0.31196655,  0.1526222 ,  0.25323844,  0.05
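
The averaging can be checked by hand on toy data. This sketch uses made-up 3-dimensional vectors (`toy_embeddings` is hypothetical, not real GloVe values) and divides by the exact word count, whereas the cell above adds 0.001 to the denominator.

```python
import numpy as np

DIM = 3  # toy dimension; the notebook uses 100
toy_embeddings = {  # hypothetical vectors, not real GloVe values
    "gas": np.array([1.0, 0.0, 2.0]),
    "price": np.array([3.0, 2.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a sentence; empty sentences map to zeros."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    # Unknown words contribute a zero vector, as in the notebook.
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    return np.sum(vectors, axis=0) / len(words)

v = sentence_vector("gas price", toy_embeddings, DIM)
print(v)  # [2. 1. 1.]
```

One consequence of the zero-vector fallback is that sentences with many out-of-vocabulary words are pulled towards the origin, since those words still count in the denominator.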

## Similarity Matrix

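Create a similarity matrix for each email by computing the cosine similarity between every pair of its sentence vectors. The diagonal is left at zero so that a sentence is not compared with itself.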

In [17]:
similarityMatrix = []
for i in range(len(cleanSentences)):
    email = cleanSentences[i]
    temp = np.zeros((len(email), len(email)))
    j_range = temp.shape[0]
    k_range = temp.shape[1]
    for j in range(j_range):
        for k in range(k_range):
            if j != k:
                temp[j][k] = cosine_similarity(sentenceVectors[i][j].reshape(1, 100),
                                                             sentenceVectors[i][k].reshape(1, 100))[0][0]
    similarityMatrix.append(temp)

In [18]:
similarityMatrix[9]

array([[0.        , 0.8349139 , 0.78126698, 0.69079213, 0.87709036,
        0.79730623, 0.88519434, 0.75732063, 0.76894329, 0.7990599 ,
        0.76434235, 0.8418759 , 0.81483853, 0.78216   , 0.86432732],
       [0.8349139 , 0.        , 0.74710993, 0.82935303, 0.93127769,
        0.76576298, 0.90654916, 0.81927234, 0.6963799 , 0.79733682,
        0.88611868, 0.82626361, 0.75397122, 0.72586042, 0.74677434],
       [0.78126698, 0.74710993, 0.        , 0.75911861, 0.83364256,
        0.86083816, 0.79159838, 0.75093062, 0.71146822, 0.7326766 ,
        0.75762932, 0.78197226, 0.61383603, 0.58328325, 0.76195926],
       [0.69079213, 0.82935303, 0.75911861, 0.        , 0.77505106,
        0.7343154 , 0.77175152, 0.82399732, 0.68721616, 0.69503987,
        0.81327026, 0.72392225, 0.58932388, 0.55777133, 0.66403179],
       [0.87709036, 0.93127769, 0.83364256, 0.77505106, 0.        ,
        0.86539281, 0.9327848 , 0.80254686, 0.75058013, 0.83409691,
        0.8945377 , 0.87441206, 0.74700147, 
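
The nested loop calls `cosine_similarity` once per sentence pair. As a possible alternative, the whole matrix can be computed in one step. The sketch below does this with plain NumPy; the function name `cosine_similarity_matrix` and the toy vectors are assumptions, and all-zero vectors are given similarity 0 here.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between rows, with a zero diagonal."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0     # avoid dividing by zero for all-zero vectors
    unit = X / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # no self-similarity, matching the j != k check
    return sim

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 4))
```

scikit-learn's `cosine_similarity` also accepts a full 2-D array of row vectors, so stacking an email's sentence vectors and calling it once should give the same off-diagonal values.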

## Apply Text Rank Algorithm

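Apply the Text Rank algorithm to each email: the similarity matrix is turned into a weighted graph and PageRank assigns a score to every sentence.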

In [19]:
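scores = []
for i in similarityMatrix:
    nxGraph = nx.from_numpy_array(i)
    # Note: nx.pagerank_numpy was removed in NetworkX 3.0;
    # on newer versions use nx.pagerank(nxGraph) instead.
    scores.append(nx.pagerank_numpy(nxGraph))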

In [20]:
scores[9]

{0: 0.06987283336106877,
 1: 0.06990240015904793,
 2: 0.06560796468151953,
 3: 0.06371707952682155,
 4: 0.07185698852358498,
 5: 0.06671443267897388,
 6: 0.07215122896321827,
 7: 0.06457557936742557,
 8: 0.0636749618424344,
 9: 0.06636958348390279,
 10: 0.06759823666337764,
 11: 0.06804275555118505,
 12: 0.06265334312626286,
 13: 0.06148249517499241,
 14: 0.06578011689618454}
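
To make the scoring step less of a black box, here is a minimal sketch of weighted PageRank by power iteration. It is not the NetworkX implementation. The function name, the damping factor of 0.85, the tolerance and the toy matrix are all assumptions for illustration.

```python
import numpy as np

def pagerank_power_iteration(sim, damping=0.85, tol=1e-8, max_iter=200):
    """Score nodes of a weighted graph given its similarity (adjacency) matrix."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    row_sums = sim.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Row-normalise edge weights; nodes without edges jump uniformly.
    transition = np.where(row_sums > 0, sim / safe_sums, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * (transition.T @ rank)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Toy graph: sentence 0 is similar to both others, which are unrelated to each other.
toy_sim = [[0.0, 1.0, 1.0],
           [1.0, 0.0, 0.0],
           [1.0, 0.0, 0.0]]
toy_scores = pagerank_power_iteration(toy_sim)
print(toy_scores)
```

In this toy graph the central sentence receives the highest score and the other two tie. In the email above the scores are much closer together, because every pair of sentences has a similarity of roughly 0.55 to 0.93.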

Sort the sentences in descending order of their scores.

In [21]:
rankedSentences = []

for i in range(len(scores)):
    rankedSentences.append(sorted(((scores[i][j],s) for j,s in enumerate(sentences[i])), reverse=True))

Show the top 2 sentences of each email. If the email has 2 or fewer sentences, print all of them.
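
The selection rule above can be sketched as a small function. This is a variation rather than the notebook's exact behaviour: the hypothetical `summarise` helper returns the selected sentences in their original order instead of score order, which may read more naturally. The sentences and scores are made up.

```python
def summarise(sentences, scores, k=2):
    """Return the k highest-scoring sentences in their original order."""
    top = sorted(range(len(sentences)), key=lambda j: scores[j], reverse=True)[:k]
    return [sentences[j] for j in sorted(top)]

toy_sentences = ["A first point.", "Some filler.", "The key message."]
toy_ranks = {0: 0.30, 1: 0.20, 2: 0.50}  # hypothetical TextRank scores

summary = summarise(toy_sentences, toy_ranks)
print(summary)  # ['A first point.', 'The key message.']
```

Emails with fewer than `k` sentences need no special case, because slicing simply returns everything available.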

## Example of the Summarised Text

In [22]:
myData[108]

"   \tEnron North America Corp.\t\tSturm/HOU/ECT@ECT, Larry May/Corp/Enron@Enron, Kate Fraser/HOU/ECT@ECT, Zimin Lu/HOU/ECT@ECT, Greg Couch/HOU/ECT@ECT, John Griffith/Corp/Enron@Enron, Sandra F Brawner/HOU/ECT@ECT, John J Lavorato/Corp/Enron@Enron, Hunter S Shively/HOU/ECT@ECT, Phillip K Allen/HOU/ECT@ECT, Scott Neal/HOU/ECT@ECT, Thomas A Martin/HOU/ECT@ECT, Steve Jackson/HOU/ECT@ECTPlease plan to attend a meeting on Friday, August 11 at 11:15 a.m. in 30C1 to discuss the transportation model.  Now that we have had several traders managing transportation positions for several months, I would like to discuss any issues you have with the way the model works.   I have asked Zimin Lu (Research), Mark Breese and John Griffith (Structuring) to attend so they will be available to answer any technical questions.   The point of this meeting is to get all issues out in the open and make sure everyone is comfortable with using the model and position manager, and to make sure those who are managing

In [23]:
example = 108

print(rankedSentences[example][0][1])
print(rankedSentences[example][1][1])

The point of this meeting is to get all issues out in the open and make sure everyone is comfortable with using the model and position manager, and to make sure those who are managing the books believe in the model's results.
Now that we have had several traders managing transportation positions for several months, I would like to discuss any issues you have with the way the model works.


In [24]:
for i in range(len(rankedSentences)):
    print("--------------------------------")
    if len(rankedSentences[i]) > 2:
        print(rankedSentences[i][0][1])
        print(rankedSentences[i][1][1])
    else:
        for j in range(len(rankedSentences[i])):
            print(rankedSentences[i][j][1])
    print("--------------------------------")
    print()

--------------------------------
Here is our forecast
--------------------------------

--------------------------------
I would even try and get some honest opinions on whether a trip is even desired or necessary.As far as the business meetings, I think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not.
I would suggest holding the business plan meetings here then take a trip without any formal business meetings.
--------------------------------

--------------------------------
way to go!!
test successful.
--------------------------------

--------------------------------
Plus your thoughts on any changes that need to be made.
Randy, Can you send me a schedule of the salary and level of everyone in the scheduling group.
--------------------------------

--------------------------------
Let's shoot for Tuesday at 11:45.
--------------------------------

--------------------------------
Greg, How about either n