This notebook follows the tutorial found at:
https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/


In [1]:
import mailparser
import sys
from os import listdir
from os.path import isfile, join
import pandas as pd
import email
import email.utils  #used below for date parsing and formatting
import numpy as np
from talon.signature.bruteforce import extract_signature

The following code performs the cleanup described in the data wrangling notebook.

In [2]:
def clean_body(mail_body):
    delimiters = ["-----Original Message-----","To:","From"]
    
    #Split on whichever delimiter appears earliest, i.e. the one that leaves the shortest leading text.
    old_len = sys.maxsize
    
    for delimiter in delimiters:
        split_body = mail_body.split(delimiter,1)
        new_len = len(split_body[0])
        if new_len <= old_len:
            old_len = new_len
            final_split = split_body
            
    #Then pull chain message
    if (len(final_split) == 1):
        mail_chain = None
    else:
        mail_chain = final_split[1] 
    
    #The following uses Talon to try to get a clean body and separate out the signature.
    #The result is named body so it does not shadow the function name.
    body, sig = extract_signature(final_split[0])
    
    return {'Body': body, 'Chain' : mail_chain, 'Signature': sig}
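
The delimiter logic can be checked in isolation. The following is a minimal sketch that reproduces only the splitting step, with Talon left out; the helper name `split_at_earliest_delimiter` and the sample message are hypothetical and not part of the original notebook.

```python
import sys

# Same delimiters as clean_body above.
DELIMITERS = ["-----Original Message-----", "To:", "From"]

def split_at_earliest_delimiter(mail_body, delimiters=DELIMITERS):
    """Hypothetical helper: return (body, chain); chain is None if no delimiter is found."""
    shortest = sys.maxsize
    best_split = [mail_body]
    for delimiter in delimiters:
        parts = mail_body.split(delimiter, 1)
        # The delimiter that occurs earliest leaves the shortest leading part.
        if len(parts[0]) <= shortest:
            shortest = len(parts[0])
            best_split = parts
    chain = best_split[1] if len(best_split) > 1 else None
    return best_split[0], chain

# Made-up message for illustration only.
body, chain = split_at_earliest_delimiter(
    "See you at noon.\n-----Original Message-----\nFrom: a@example.com\nLunch?"
)
print(repr(body))
print(repr(chain))
```

Here the "Original Message" marker occurs before "From", so everything after it lands in the chain.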

The following function parses a single email and collects the parsed pieces for a pandas dataframe. Rows are first accumulated in a list, where each item is a dict of parsed email values representing one row; the list is converted to a dataframe afterwards.


In [3]:
enron_email_list = []
def process_email(email_path, employee, folder, email_name):
    mail = mailparser.parse_from_file(email_path)
    
    full_body = email.message_from_string(mail.body)
    #Multipart messages are skipped; only single-part bodies are processed
    if full_body.is_multipart():
        return
    else:
        mail_body = full_body.get_payload()    
    
    split_body = clean_body(mail_body)
    headers = mail.headers
    #Reformatting date to be more pandas readable
    date_time = headers.get('Date')
    
    if date_time:
        date_time = email.utils.format_datetime(email.utils.parsedate_to_datetime(date_time))
    
    email_dict = {
                "Employee" : employee,
                "Email Folder": folder,
                "Message-ID": headers.get('Message-ID'),
                "Date" : date_time,
                "From" : headers.get('From'),
                "Subject": headers.get('Subject'),
                "Body" : split_body['Body'],
                "Chain" : split_body['Chain'],
                "Signature": split_body['Signature'],
                "Full_Email_Path" : email_path #for debug purposes. 
    }
    
    #Append row to the list that later becomes the dataframe. 
    enron_email_list.append(email_dict)
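
The date handling in `process_email` is a parse-then-format round trip through `email.utils`. The sketch below shows that step on its own, using a date string taken from the sample output later in this notebook; the variable names are illustrative.

```python
from datetime import timedelta
from email.utils import format_datetime, parsedate_to_datetime

# A Date header value like the ones in the Enron messages.
raw_date = "Wed, 16 May 2001 21:18:00 -0700"

# Parse into a timezone-aware datetime, then write it back out in RFC 2822 form.
parsed = parsedate_to_datetime(raw_date)
normalized = format_datetime(parsed)

print(parsed.utcoffset())
print(normalized)
```

The round trip keeps the UTC offset, so the string handed to pandas still carries timezone information.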

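Now go through each folder in a mailbox, then load each email to be cleaned and added to the list. Only one mailbox (skilling-j) is processed here for testing; the commented-out lines would loop over every mailbox instead.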

In [4]:
mail_dir = '../data/maildir/'
mailbox = "skilling-j" #testing
#mailboxes = listdir(mail_dir)

#for mailbox in mailboxes:
inbox = listdir(mail_dir + mailbox)
for folder in inbox:
    path = mail_dir + mailbox + "/" + folder
    emails = listdir(path)
    for single_email in emails:
        full_path = path + "/" + single_email
        process_email(full_path, mailbox, folder, single_email)
enron_email_list_df = pd.DataFrame(enron_email_list)
enron_email_list_df.head()

Unnamed: 0,Body,Chain,Date,Email Folder,Employee,From,Full_Email_Path,Message-ID,Signature,Subject
0,"Simon, I was unable to open the attached docum...",:\tSimone La rose/ENRON@enronXgate on 05/16/20...,"Wed, 16 May 2001 21:18:00 -0700",sent_items,skilling-j,jeff.skilling@enron.com,../data/maildir/skilling-j/sent_items/1.,<25701656.1075840162944.JavaMail.evans@thyme>,"Thanks,\nJoannie","Re: EnronOnline Executive Summary for May 15, ..."
1,"Erica,\n\nDue to the fact that Jeff is unable ...","\t""'jeff.skilling@enron.com'"" <jeff.skilling@e...","Mon, 14 May 2001 15:20:00 -0700",sent_items,skilling-j,jeff.skilling@enron.com,../data/maildir/skilling-j/sent_items/16.,<8541273.1075840163402.JavaMail.evans@thyme>,,Re: Telephone Call with Jerry Murdock
2,"PLEASE PLAN TO ATTEND\n\n""NEW BUSINESS CR...",,"Wed, 01 Aug 2001 12:25:58 -0700",sent_items,skilling-j,jeff.skilling@enron.com,../data/maildir/skilling-j/sent_items/110.,<20911721.1075852686924.JavaMail.evans@thyme>,,Please Plan to Attend
3,"The Stanford Business School, in collaboration...",,"Thu, 03 May 2001 18:30:00 -0700",sent_items,skilling-j,jeff.skilling@enron.com,../data/maildir/skilling-j/sent_items/28.,<20013668.1075840163679.JavaMail.evans@thyme>,,Stanford/McKinsey GLOBE Survey
4,test,,"Tue, 15 May 2001 14:09:00 -0700",sent_items,skilling-j,jeff.skilling@enron.com,../data/maildir/skilling-j/sent_items/5.,<30255134.1075840163074.JavaMail.evans@thyme>,,test


The Date column needs to be converted to a pandas datetime. 

In [5]:
enron_email_list_df['Date'] = pd.to_datetime(enron_email_list_df.Date)

In [6]:
enron_email_list_df.dtypes

Body                       object
Chain                      object
Date               datetime64[ns]
Email Folder               object
Employee                   object
From                       object
Full_Email_Path            object
Message-ID                 object
Signature                  object
Subject                    object
dtype: object

Describing the table shows that the Subject column has a large number of blanks (the most frequent Subject value is the empty string, occurring 198 times) that should be set to null. All the other columns seem to make sense. 

In [7]:
enron_email_list_df.describe()

Unnamed: 0,Body,Chain,Date,Email Folder,Employee,From,Full_Email_Path,Message-ID,Signature,Subject
count,4139,1349,4139,4139,4139,4139,4139,4139,643,4139.0
unique,2386,684,2506,12,1,919,4139,4139,243,1960.0
top,The report named: Violation/Notification Memo ...,All Enron Worldwide\ncc: \nSubject: Successi...,2000-07-27 14:31:00,inbox,skilling-j,sherri.sera@enron.com,../data/maildir/skilling-j/notes_inbox/66.,<16800043.1075863609768.JavaMail.evans@thyme>,"Regards,\nSherri Sera\nAssistant to Jeff Skill...",
freq,42,14,8,1252,4139,662,1,1,56,198.0
first,,,1980-01-01 00:00:00,,,,,,,
last,,,2001-11-19 21:34:59,,,,,,,


All the blanks will be replaced with NaN to keep values consistent.

In [8]:
enron_email_list_df = enron_email_list_df.replace('', np.nan)  #np.NaN was removed in NumPy 2.0
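
As a quick illustration of what the replacement does, here is a small sketch on a made-up dataframe (the `toy` frame and its values are assumptions for demonstration, not Enron data):

```python
import numpy as np
import pandas as pd

# Toy frame with blank strings standing in for missing values.
toy = pd.DataFrame({
    "Subject": ["test", "", "Re: call"],
    "Body": ["hello", "hi", ""],
})

# Exact-match replacement: only cells equal to '' become NaN.
cleaned = toy.replace("", np.nan)

# Blank cells are now counted as missing.
print(cleaned.isna().sum())
```

After the replacement, pandas methods such as `info()`, `isna()` and `dropna()` treat the former blanks as missing values.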

In [9]:
enron_email_list_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4139 entries, 0 to 4138
Data columns (total 10 columns):
Body               4115 non-null object
Chain              1348 non-null object
Date               4139 non-null datetime64[ns]
Email Folder       4139 non-null object
Employee           4139 non-null object
From               4139 non-null object
Full_Email_Path    4139 non-null object
Message-ID         4139 non-null object
Signature          643 non-null object
Subject            3941 non-null object
dtypes: datetime64[ns](1), object(9)
memory usage: 323.4+ KB


Taking a look at some examples.

Checking earliest and latest date. 

In [10]:
print(enron_email_list_df.Date.min())
print(enron_email_list_df.Date.max())

1980-01-01 00:00:00
2001-11-19 21:34:59


In [11]:
#Select the emails from a two-week window (1-14 October 2001) for the summarization test
start_date = '2001-10-01 00:00:00'
end_date = '2001-10-14 23:59:59'
summarization_mask = (enron_email_list_df['Date'] >= start_date) & (enron_email_list_df['Date'] <= end_date)
enron_summary_test = enron_email_list_df.loc[summarization_mask]
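
The same kind of inclusive date window can be tried on a tiny made-up frame. This sketch uses `Series.between`, which is inclusive on both ends by default, as an alternative to combining two comparisons; the `toy` data is hypothetical.

```python
import pandas as pd

# Toy frame with dates on both sides of the window boundaries.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2001-09-30 23:59:59",
        "2001-10-01 00:00:00",
        "2001-10-14 23:59:59",
        "2001-10-15 00:00:00",
    ]),
    "Subject": ["too early", "first day", "last day", "too late"],
})

start = pd.Timestamp("2001-10-01 00:00:00")
end = pd.Timestamp("2001-10-14 23:59:59")

# Boolean mask that keeps rows with start <= Date <= end.
mask = toy["Date"].between(start, end)
window = toy.loc[mask]
print(window)
```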


Importing NLP libraries.

In [12]:
import nltk
from nltk.tokenize import sent_tokenize
import re
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

#one time executions
#nltk.download('punkt')
#nltk.download('stopwords')
#! wget http://nlp.stanford.edu/data/glove.6B.zip
#! unzip glove*.zip


In [13]:
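#This function removes stopwords from a list of tokens and rejoins the rest into a string.
#Note that stop_words is a global defined in a later cell, before this function is first called.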
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
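
To see the function's input and output shapes without downloading the NLTK corpus, the sketch below uses a small hand-written stopword set. That set is a stand-in and an assumption; the notebook itself uses `stopwords.words('english')`.

```python
# Hypothetical stand-in for stopwords.words('english').
stop_words = {"the", "is", "a", "of", "to"}

def remove_stopwords(tokens):
    # Takes a list of tokens and returns a single space-joined string.
    return " ".join(t for t in tokens if t not in stop_words)

result = remove_stopwords("the migration is a total outage".split())
print(result)
```

The function expects an already tokenized and lowercased sentence, which is why the later cell calls it with `r.split()`.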

In [14]:
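#Splitting the text in emails into sentences (rows with a missing Body are skipped)
sentences = []
for s in enron_summary_test['Body'].dropna():
    sentences.append(sent_tokenize(s))
#Flatten the per-email lists into one list of sentences
sentences = [y for x in sentences for y in x]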

In [15]:
sentences[:5]

['Please ignore.',
 '_________________________________________________________________\nGet your FREE download of MSN Explorer at http://explorer.msn.com/intl.asp',
 'Good Monday Morning - Comments',
 'During the weekend of October 6 and 7, 2001 the Enterprise Storage Team will be migrating all production users off the current hardware (Solar) that houses their home and application directories (no production databases are affected, but client software will be) to new hardware.',
 'This migration requires a total system outage of approximately 6 hours.']

In [16]:
#removing punctuation, numbers and special characters. Then lowercasing. 
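#regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal string by default
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)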
clean_sentences = [s.lower() for s in clean_sentences]
stop_words = stopwords.words('english')
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
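
The character filter in the cell above is a regular-expression substitution. A minimal stdlib-only sketch of the same step, using `re.sub` instead of the pandas string accessor, is shown below; the helper name `clean_sentence` is hypothetical.

```python
import re

def clean_sentence(sentence):
    # Replace every character that is not an ASCII letter with a space, then lowercase.
    return re.sub(r"[^a-zA-Z]", " ", sentence).lower()

cleaned = clean_sentence(
    "This migration requires a total system outage of approximately 6 hours."
)
print(cleaned.split())
```

Digits and punctuation become spaces, so splitting on whitespace afterwards yields only alphabetic tokens.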

In [17]:
#Extract word vectors
word_embeddings = {}
#The with block closes the file even if parsing fails partway through
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [18]:
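#Build one vector per sentence by averaging its word vectors.
#Words missing from GloVe count as zero vectors; the +0.001 guards against division by zero.
sentence_vectors = []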
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)
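
The averaging step can be followed by hand with a toy vocabulary. The sketch below assumes 3-dimensional made-up embeddings in place of the 100-dimensional GloVe vectors; `toy_embeddings` and `sentence_vector` are hypothetical names.

```python
import numpy as np

DIM = 3  # toy dimension; the notebook uses 100-d GloVe vectors

# Made-up embeddings for illustration only.
toy_embeddings = {
    "system": np.array([1.0, 0.0, 0.0]),
    "outage": np.array([0.0, 1.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim):
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    # Unknown words contribute a zero vector but still count in the denominator.
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    return total / (len(words) + 0.001)

v = sentence_vector("system outage unknownword", toy_embeddings, DIM)
print(v)
```

Because unknown words are counted in the denominator, sentences with many out-of-vocabulary words end up with shorter vectors. Cosine similarity ignores vector length, so this does not affect the similarity scores.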

In [19]:
len(sentence_vectors)

1167

In [20]:
# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
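# Build a graph whose edge weights are the pairwise similarities, then score each sentence with PageRank
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
# Sort sentences by score, highest first
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)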

# Specify number of sentences to form the summary
sn = 10

# Generate summary
for i in range(sn):
  print(ranked_sentences[i][1])
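
The nested loop above calls `cosine_similarity` once per sentence pair, which is slow for roughly a thousand sentences. As a possible alternative, the sketch below computes the whole matrix in one step with plain NumPy. The function name `cosine_similarity_matrix` is hypothetical, and all-zero vectors are assumed to have similarity 0 with everything.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities with the diagonal set to zero."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by zero
    unit = X / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # a sentence is not compared with itself
    return sim

# Toy vectors: two parallel, one orthogonal, one all-zero.
sim_mat = cosine_similarity_matrix([[1, 0], [2, 0], [0, 3], [0, 0]])
print(sim_mat)
```

The resulting array has the same shape and meaning as `sim_mat` in the cell above, so it could be passed to `nx.from_numpy_array` in the same way.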



[IMAGE]     EDITORIAL [IMAGE] A Focus on the latest developments in Europe   ARTICLES [IMAGE] European Utility Sector: Survival of the Fittest Through Mergers & Acquisition [IMAGE] EU Deregulation: Out of Chaos Comes Order   NEWS ROUND UP [IMAGE] Commodities Now honours Women in the Energy Industry  [IMAGE] SPL WorldGroup Supports Siebel 7 for Competitive Energy Markets Worldwide >  [IMAGE] BG Enters, Enron Leaves Indian Upstream Market in $388 Million Deal  [IMAGE] Congress Puts Wind Energy Tax Credit Back on the Table  [IMAGE] e-Acumen launches WeatherBook, a breakthrough weather risk management platform with industry-leading analytics  [IMAGE] Two New WindPower Projects Deliver Renewable Energy to California; Expected to Produce Enough Energy to Power More Than 40,000 Homes  [IMAGE] The Structure Group named Houston's sixth fastest-growing private company at annual Houston 100
 awards luncheon    EVENTS [IMAGE]  Eyeforenergy Asia 2001   [IMAGE]  Green Convention 2001   [IMAGE]  EEO 

In [21]:
enron_summary_test['Body']

394     Please ignore.\n\n____________________________...
415                        Good Monday Morning - Comments
424     During the weekend of October 6 and 7, 2001 th...
428     [IMAGE]\n\t\t\t[IMAGE]\t[IMAGE]\t\n\t[IMAGE]\t...
443     Learn more about\n"The Perks & Pitfalls of Cre...
451                                - kudlow100901 (2).doc
455     Attached is a the agenda, minutes of previous ...
460     The migration of UNIX home directories and app...
479     THE SOURCE\nThe eSource Bulletin\tOctober 2001...
498     Brother Betas,\n\nI am George Pereira from the...
518     Woody,  I like this one!!  Will be in all week...
519                                          Jeff. \nFYI-
529     This message is a follow-up to the one we sent...
536     INCREDIBLY LOW TRAVEL COST TO CALGARY \nZIFF E...
557     Many of you have requested more time in which ...
580     Enron in Action can be accessed through the ne...
610     [IMAGE] \t\n[IMAGE]\t [IMAGE]     Dear Jeff Sk...
618     Bares 