This notebook follows the tutorial found at:
https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/


In [2]:
import mailparser
import sys
from os import listdir
from os.path import isfile, join
import pandas as pd
import email
import numpy as np
from talon.signature.bruteforce import extract_signature
import nltk
from nltk.tokenize import sent_tokenize
import re
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

#one time executions
#nltk.download('punkt')
#nltk.download('stopwords')
#! wget http://nlp.stanford.edu/data/glove.6B.zip
#! unzip glove*.zip


Load the dataframe produced by the Data_Wrangling notebook.

In [3]:
#Load Data
PICKLE_LOC = "../data/dataframes/wrangled_enron_df.pkl"
enron_df = pd.read_pickle(PICKLE_LOC)

Restrict the emails to a two-week window, October 1 through October 14, 2001.

In [46]:
start_date = '2001-10-01 00:00:00'
end_date = '2001-10-14 23:59:59'
summarization_mask = (enron_df['Date'] >= start_date) & (enron_df['Date'] <= end_date)
enron_masked_df = enron_df.loc[summarization_mask]
print("Total number of emails to summarize: " + str(len(enron_masked_df)))

Total number of emails to summarize: 89
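
Before picking a window it can help to see what date range the dataframe actually covers. The sketch below uses a small made-up frame (`toy_df` and its values are hypothetical, not Enron data) and assumes the `Date` column holds pandas datetimes. `Series.between` is inclusive on both ends by default, so it selects the same rows as the `>=` / `<=` mask.

```python
import pandas as pd

# Toy stand-in for enron_df; the real frame is loaded from the pickle above.
toy_df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2001-09-30 23:59:59",
        "2001-10-01 00:00:00",
        "2001-10-07 12:30:00",
        "2001-10-14 23:59:59",
        "2001-10-15 00:00:00",
    ]),
    "Subject": ["a", "b", "c", "d", "e"],
})

# Earliest and latest timestamps in the frame.
earliest, latest = toy_df["Date"].min(), toy_df["Date"].max()
print(earliest, latest)

# Inclusive window, equivalent to (Date >= start) & (Date <= end).
window = toy_df["Date"].between("2001-10-01 00:00:00", "2001-10-14 23:59:59")
toy_masked = toy_df.loc[window]
print(len(toy_masked))
```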


In [5]:
#Flatten the per-email sentence lists into [email position, sentence] pairs so each
#summary sentence can be reassociated with its original email.
sentences = []
sentences_list = enron_masked_df.Extractive_Sentences.tolist()
for counter, sublist in enumerate(sentences_list):
    for item in sublist:
        sentences.append([counter, item])
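
As a small illustration of the flattening step, the sketch below runs the same idea on made-up data (`toy_sentences_list` is hypothetical). The stored number is the row position within the masked dataframe, which is why it can later be passed to `.iloc`. An email with no sentences contributes no entries.

```python
# Hypothetical per-email sentence lists standing in for Extractive_Sentences.
toy_sentences_list = [
    ["First email, sentence one.", "First email, sentence two."],
    [],
    ["Third email, only sentence."],
]

# Each entry is (row position, sentence).
toy_sentences = [
    (position, sentence)
    for position, sublist in enumerate(toy_sentences_list)
    for sentence in sublist
]
print(toy_sentences)
```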

In [6]:
#Pull out clean tokenized sentences. 
clean_sentences = enron_masked_df.Tokenized_Body.tolist()
#flatten list
clean_sentences = [y for x in clean_sentences for y in x]

In [9]:
#Extract word vectors from the 300-dimensional GloVe file
word_embeddings = {}
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
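
Each line of the GloVe text file is a word followed by its vector components, separated by spaces. The sketch below parses two made-up lines from an in-memory buffer so the format is visible without the full download. `load_embeddings` is a hypothetical helper name and the three-dimensional vectors are invented for illustration.

```python
import io
import numpy as np

def load_embeddings(handle):
    """Parse GloVe-style lines ("word v1 v2 ...") into a dict of float32 arrays."""
    embeddings = {}
    for line in handle:
        values = line.split()
        if not values:
            continue  # skip blank lines
        embeddings[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings

# Two invented 3-dimensional entries; real GloVe lines here have 300 components.
toy_file = io.StringIO("enron 0.1 0.2 0.3\nsurvey -0.4 0.5 0.6\n")
toy_embeddings = load_embeddings(toy_file)
print(toy_embeddings["enron"])
```

The same helper would accept a real file handle, for example one opened with `open('glove.6B.300d.txt', encoding='utf-8')`.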

In [10]:
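#Create a vector for each sentence by averaging its word vectors.
#Words missing from GloVe count as zero vectors; the 0.001 keeps the divisor nonzero.
sentence_vectors = []
for sentence in clean_sentences:
    if len(sentence) != 0:
        words = sentence.split()
        v = sum([word_embeddings.get(w, np.zeros((300,))) for w in words])/(len(words)+0.001)
    else:
        v = np.zeros((300,))
    sentence_vectors.append(v)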

In [11]:
len(sentence_vectors)

1158
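
To make the averaging concrete, here is a minimal sketch on a toy vocabulary. The three-dimensional vectors and the `sentence_vector` helper are hypothetical; the `0.001` term is copied from the cell above. Note that an out-of-vocabulary word ("fell" here) adds nothing to the sum but still counts in the divisor, so it pulls the average toward zero.

```python
import numpy as np

DIM = 3  # toy dimension; the notebook uses 300-dimensional GloVe vectors
toy_embeddings = {
    "stock": np.array([1.0, 0.0, 0.0]),
    "price": np.array([0.0, 1.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a sentence; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    return total / (len(words) + 0.001)

vec = sentence_vector("stock price fell", toy_embeddings, DIM)
empty = sentence_vector("", toy_embeddings, DIM)
print(vec, empty)
```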

In [12]:
#Similarity matrix: one row and one column per sentence
sim_mat = np.zeros([len(sentences), len(sentences)])
#Fill the off-diagonal entries with pairwise cosine similarity scores.
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,300), sentence_vectors[j].reshape(1,300))[0,0]
#Build a weighted graph from the matrix and rank the sentences with PageRank.
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
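
The nested loop calls `cosine_similarity` once per sentence pair, which gets slow as the sentence count grows. A possible alternative, sketched here on four made-up two-dimensional vectors (`toy_vectors` is hypothetical), is to pass the whole stack of vectors to `cosine_similarity` in one call and then zero the diagonal. This is an illustration of the idea, not a drop-in replacement tested against the notebook's data.

```python
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# Three vectors pointing in similar directions and one outlier.
toy_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],
])

# All pairwise similarities in one call, then remove self-similarity.
toy_sim = cosine_similarity(toy_vectors)
np.fill_diagonal(toy_sim, 0.0)

# Weighted undirected graph; zero entries produce no edge.
toy_graph = nx.from_numpy_array(toy_sim)
toy_scores = nx.pagerank(toy_graph)
print(toy_scores)
```

In this toy graph the outlier is only weakly connected to the other three sentences, so it should receive the lowest PageRank score.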

In [13]:
#Pair each sentence with its PageRank score, then sort from highest to lowest.
ranked_sentences = sorted(((scores[i],s[0],s[1]) for i,s in enumerate(sentences)), reverse=True)
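
Each ranked entry is a tuple of (score, email position, sentence text), which is why later cells index it with `[1]` and `[2]`. The sketch below shows that layout on made-up scores (`toy_scores` and `toy_sentences` are hypothetical). Sorting with an explicit `key` on the score is one option if ties should not fall back to comparing positions and sentence text.

```python
# Hypothetical PageRank scores keyed by flattened sentence index.
toy_scores = {0: 0.2, 1: 0.5, 2: 0.3}
# Hypothetical (email position, sentence) pairs in the same order.
toy_sentences = [(0, "alpha"), (0, "beta"), (1, "gamma")]

toy_ranked = sorted(
    ((toy_scores[i], position, text)
     for i, (position, text) in enumerate(toy_sentences)),
    key=lambda item: item[0],  # sort on the score only
    reverse=True,
)

# Keep the two highest-scoring sentences.
top_n = toy_ranked[:2]
print(top_n)
```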

In [43]:
#Color scheme to help distinguish summarization text.
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

# Set the number of summary sentences to one tenth of the total emails.
sn = len(enron_masked_df) // 10

# Generate summary
for i in range(sn):
    #pull date and subject from original email
    email_date = str(enron_masked_df['Date'].iloc[ranked_sentences[i][1]])
    email_subject = enron_masked_df['Subject'].iloc[ranked_sentences[i][1]]
    #print summaries 
    print( bcolors.BOLD + "Date: "+ email_date  + 
          " Subject: " + email_subject + bcolors.ENDC +
          "\nSummary: " + ranked_sentences[i][2])

Date: 2001-10-02 21:24:19 Subject: Lay It On the Line Survey Results
Summary: Work environment: pay and benefits true meritocracy unlimited opportunities The results of our survey confirmed for the Management Committee and me the great things you feel about Enron and also what you see as the issues we must address to move Enron forward.
Date: 2001-10-02 19:04:24 Subject: Jeffrey Skilling, your October E-lert is now available
Summary: A complete story on the origin of Halloween will be on the Business Cafe Web site October 2531 at http://www.BusinessCafeOnline.com Your October 2001 issue of Elert for personal development in business is now available on http://www.BusinessCafeOnline.com Included in the October 2001 issue are these three articles: Flying Off Over Office Politics What Men Can Learn
Date: 2001-10-04 03:13:05 Subject: h: Eyeforenergy Briefing
Summary: EDITORIAL A Focus on the latest developments in Europe ARTICLES European Utility Sector: Survival of 

In [55]:
#Example of a full email: the body of the email containing the top-ranked sentence.
enron_masked_df['Body'].iloc[ranked_sentences[0][1]]

'Recently, I asked you to "Lay It On the Line" and tell me what you believe are the most critical issues that we need to address. Many of you responded, and I want to share the survey results with you. Additionally, I want to share with you the things we are currently doing or plan to do to move Enron forward. First, let\'s take a look at the survey results. In just three days, more than 4,000 employees completed the survey. I sincerely appreciate your participation. I. We asked you to identify the top five issues facing Enron today. The responses were: 1. Stock price 2. The PRC 3. Employee/internal morale 4. External reputation and image 5. "Walking the talk" of integrity and respect II. We asked you general questions about working at Enron, comparing 2001 to 2000. % of employees responding strongly agree/agree: I am committed to staying at Enron. 64.1% Enron is a company I would invest my money in. 60.1% I am treated with respect. 60.0% I would recommend Enron as a place to work for 