The goal of this data wrangling notebook is to clean the data for a minimum viable system that will accomplish the following: given a single person's inbox and a time range, give a summary of the emails.

The initial approach involves the following ideas:
    1) I will start out using extractive summarization methods, which reuse sentences that were already written.
    2) There will be a fixed ratio of original sentences to summary sentences. For example, for every 5 email sentences, 1 summary sentence will be generated.
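
As a rough illustration of the two ideas above, here is a minimal sketch of a frequency-based extractive summarizer. The function name `extractive_summary`, the regex sentence splitter, and the word-frequency scoring are all my own assumptions for illustration; the notebook has not yet chosen a summarization method.

```python
import math
import re
from collections import Counter


def extractive_summary(text, ratio=5):
    """Return roughly one original sentence per `ratio` input sentences."""
    # Naive sentence splitter (assumption): break after ., ! or ? plus whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    n_keep = math.ceil(len(sentences) / ratio)

    # Score each sentence by the summed frequency of its words in the whole text.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scores = [
        sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()))
        for s in sentences
    ]

    # Keep the top-scoring sentences, restored to their original order.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return [sentences[i] for i in sorted(top)]
```

Because the output reuses sentences verbatim, every returned item is one of the input sentences, and 10 input sentences yield 2 summary sentences at the default ratio.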



In [1]:
import mailparser
import sys
from os import listdir
from os.path import isfile, join
import pandas as pd
import email
from talon.signature.bruteforce import extract_signature

First I will get the list of each mail directory in the dataset. There should be 150 different people in the dataset. 

In [2]:
mail_dir = '../data/maildir/'
mailboxes = listdir(mail_dir)
print(len(mailboxes))

150


Temporarily start with one person. This will be done for each directory later on.

In [3]:
#mailbox = mailboxes[0]
mailbox = "skilling-j"
#inbox = listdir(mail_dir + mailbox)
#print(inbox)

The following function tries to split the body of the email from the rest of the chain. 

We need to start with a list that will hold the rows of the dataframe. Each item in the list is a dict of parsed email values.

In [4]:
enron_email_list = []

The following splits the chain emails from the actual body of the email. It seems like "-----Original Message-----" indicates a forwarded part of an email, so this is one of the delimiters.

Some emails do not seem to be part of a chain but still reference another email, which means more email metadata appears in the body. If this metadata is found first, the email is split there instead. I'm trying to extract only the message written by the original sender.

Lastly, Talon will be used to extract the signatures.

In [5]:
def clean_body(mail_body):
    delimiters = ["-----Original Message-----","To:","From"]
    
    #Split at whichever delimiter occurs earliest, i.e. the split that leaves the shortest leading text. 
    old_len = sys.maxsize
    
    for delimiter in delimiters:
        split_body = mail_body.split(delimiter,1)
        new_len = len(split_body[0])
        if new_len <= old_len:
            old_len = new_len
            final_split = split_body
            
    #Everything after the delimiter is the chain (empty if no delimiter was found)
    if (len(final_split) == 1):
        mail_chain = ""
    else:
        mail_chain = final_split[1] 
    
    #Use Talon to separate the signature from the remaining body text. 
    #The local name is 'body' so it does not shadow the function name clean_body.
    body, sig = extract_signature(final_split[0])
    
    return {'Body': body, 'Chain' : mail_chain, 'Signature': sig}
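
To make the earliest-delimiter logic in the function above easier to check in isolation, here is a small sketch that leaves Talon out. The helper name `split_at_earliest` and the sample message are made up for illustration; the delimiters are the ones used in clean_body.

```python
def split_at_earliest(text, delimiters):
    """Split `text` at whichever delimiter occurs first; return (head, tail)."""
    # Position of each delimiter that actually occurs in the text.
    hits = [(text.find(d), d) for d in delimiters if d in text]
    if not hits:
        return text, ""
    pos, delim = min(hits)
    return text[:pos], text[pos + len(delim):]


# Made-up message for illustration only.
sample = (
    "Please resend the file.\n"
    "Thanks\n"
    "-----Original Message-----\n"
    "From: someone@example.com\n"
    "To: someone.else@example.com\n"
)
head, tail = split_at_earliest(sample, ["-----Original Message-----", "To:", "From"])
```

Here `head` holds the two lines written by the sender and `tail` holds the quoted headers. Note that the "From" delimiter has no colon, so it would also match the word "From" in ordinary prose.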

Now put each email into a pandas dataframe. process_email builds a dict for one email (one row) and appends it to enron_email_list, which is converted to a dataframe afterwards.

In [6]:
def process_email(email_path, employee, folder, email_name):
    mail = mailparser.parse_from_file(email_path)
    
    full_body = email.message_from_string(mail.body)
    mail_body = full_body.get_payload()    
    split_body = clean_body(mail_body)
    
    email_dict = {
                "Employee" : employee,
                "Folder" : folder,
                "Email-Name" : email_name,
                "Body" : split_body['Body'],
                "Chain" : split_body['Chain'],
                "Signature": split_body['Signature'],
                "Full_Email_Path" : email_path #for debug purposes. 
    }
    
    #Put rest of header into the row. 
    email_dict.update(mail.headers)
    enron_email_list.append(email_dict)
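
For reference, here is a minimal sketch of how the standard-library `email` module separates headers from the payload, which is the same idea process_email relies on when it merges headers into the row. It does not use mailparser, and the message text is made up for illustration.

```python
import email

# Made-up raw message in the same general shape as the files in the dataset.
raw_message = (
    "Message-ID: <example-1@example.com>\n"
    "Date: Wed, 16 May 2001 21:18:00 -0700 (PDT)\n"
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: Example subject\n"
    "\n"
    "Could you please resend the document?\n"
)

msg = email.message_from_string(raw_message)

# Body first, then every header becomes its own key, one dict per email.
row = {"Body": msg.get_payload()}
row.update(dict(msg.items()))
```

Each header name becomes a dict key, which is why the resulting dataframe ends up with one column per header.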
    

In [7]:
#process_email(full_path, mailbox, folder[0], single_email[0])

#for mailbox in mailboxes:
inbox = listdir(mail_dir + mailbox)
for folder in inbox:
    path = mail_dir + mailbox + "/" + folder
    emails = listdir(path)
    for single_email in emails:
        full_path = path + "/" + single_email
        process_email(full_path, mailbox, folder, single_email)
enron_email_list_df = pd.DataFrame(enron_email_list)
enron_email_list_df.head()

Unnamed: 0,Bcc,Body,Cc,Chain,Content-Transfer-Encoding,Content-Type,Date,Email-Name,Employee,Folder,...,Signature,Subject,To,X-FileName,X-Folder,X-From,X-Origin,X-To,X-bcc,X-cc
0,,"Simon, I was unable to open the attached docum...",,:\tSimone La rose/ENRON@enronXgate on 05/16/20...,quoted-printable,text/plain; charset=us-ascii,"Wed, 16 May 2001 21:18:00 -0700 (PDT)",1.0,skilling-j,sent_items,...,"Thanks,\nJoannie","Re: EnronOnline Executive Summary for May 15, ...",simone.rose@enron.com,jskillin.pst,\jskillin\Sent Items,Jeff Skilling,SKILLING-J,Simone La rose <Simone La rose/ENRON@enronXgate>,,
1,,"Erica,\n\nDue to the fact that Jeff is unable ...",,"\t""'jeff.skilling@enron.com'"" <jeff.skilling@e...",7bit,text/plain; charset=us-ascii,"Mon, 14 May 2001 15:20:00 -0700 (PDT)",16.0,skilling-j,sent_items,...,,Re: Telephone Call with Jerry Murdock,eharris@insightpartners.com,jskillin.pst,\jskillin\Sent Items,Jeff Skilling,SKILLING-J,Erica Harris <EHarris@insightpartners.com>,Sherri Sera <Sherri Sera/Corp/Enron@ENRON>,
2,"kenneth.lay@enron.com, christie.patrick@enron.com","PLEASE PLAN TO ATTEND\n\n""NEW BUSINESS CR...","kenneth.lay@enron.com, christie.patrick@enron.com",,7bit,text/plain; charset=us-ascii,"Wed, 1 Aug 2001 12:25:58 -0700 (PDT)",110.0,skilling-j,sent_items,...,,Please Plan to Attend,"k..allen@enron.com, frank.bay@enron.com, sally...",JKING6 (Non-Privileged).pst,\JSKILLIN (Non-Privileged)\Sent Items,"Skilling, Jeff </O=ENRON/OU=NA/CN=RECIPIENTS/C...",King-J,"Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENT...",,"Lay, Kenneth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=..."
3,,"The Stanford Business School, in collaboration...",,,7bit,text/plain; charset=us-ascii,"Thu, 3 May 2001 18:30:00 -0700 (PDT)",28.0,skilling-j,sent_items,...,,Stanford/McKinsey GLOBE Survey,jeff_donahue@enron.net,jskillin.pst,\jskillin\Sent Items,Jeff Skilling,SKILLING-J,jeff_donahue <jeff_donahue@enron.net>,,
4,,test,,,7bit,text/plain; charset=us-ascii,"Tue, 15 May 2001 14:09:00 -0700 (PDT)",5.0,skilling-j,sent_items,...,,test,joannie.williamson@enron.com,jskillin.pst,\jskillin\Sent Items,Jeff Skilling,SKILLING-J,Joannie Williamson <Joannie Williamson/Corp/En...,,
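
The loop above assumes every entry inside a folder is an email file. If a mailbox contains nested folders, a recursive walk may be safer. The following is a sketch only; the helper name `iter_email_files` is hypothetical and not part of the notebook.

```python
import os


def iter_email_files(mailbox_dir):
    """Yield (folder, file_name, full_path) for every file under a mailbox."""
    for root, _dirs, files in os.walk(mailbox_dir):
        # Folder path relative to the mailbox, e.g. "inbox" or "inbox/subfolder".
        folder = os.path.relpath(root, mailbox_dir)
        for file_name in sorted(files):
            yield folder, file_name, os.path.join(root, file_name)


# Possible use with the names from this notebook:
# for folder, name, path in iter_email_files(mail_dir + mailbox):
#     process_email(path, mailbox, folder, name)
```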


Check and cast datatypes to appropriate types. 

In [8]:
enron_email_list_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4139 entries, 0 to 4138
Data columns (total 24 columns):
Bcc                          921 non-null object
Body                         4139 non-null object
Cc                           921 non-null object
Chain                        4139 non-null object
Content-Transfer-Encoding    4139 non-null object
Content-Type                 4139 non-null object
Date                         4139 non-null object
Email-Name                   4139 non-null object
Employee                     4139 non-null object
Folder                       4139 non-null object
From                         4139 non-null object
Full_Email_Path              4139 non-null object
Message-ID                   4139 non-null object
Mime-Version                 4139 non-null object
Signature                    643 non-null object
Subject                      4139 non-null object
To                           3745 non-null object
X-FileName                   4139 non-null obj

Next is a description of the dataset. I will use it to identify irrelevant header columns and drop them.

In [9]:
# None disables column-width truncation (pandas >= 1.0; older versions used -1).
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns',500)
enron_email_list_df.describe()

Unnamed: 0,Bcc,Body,Cc,Chain,Content-Transfer-Encoding,Content-Type,Date,Email-Name,Employee,Folder,From,Full_Email_Path,Message-ID,Mime-Version,Signature,Subject,To,X-FileName,X-Folder,X-From,X-Origin,X-To,X-bcc,X-cc
count,921,4139,921,4139.0,4139,4139,4139,4139.0,4139,4139,4139,4139,4139,4139.0,643,4139.0,3745,4139,4139,4139,4139,4139,4139.0,4139.0
unique,329,2386,329,684.0,2,2,2506,1413.0,1,12,919,4139,4139,1.0,243,1960.0,621,6,27,1200,4,898,6.0,412.0
top,sherri.sera@enron.com,The report named: Violation/Notification Memo <http://erv.corp.enron.com/link,sherri.sera@enron.com,,7bit,text/plain; charset=us-ascii,"Thu, 27 Jul 2000 07:31:00 -0700 (PDT)",1.0,skilling-j,inbox,sherri.sera@enron.com,../data/maildir/skilling-j/inbox/53.,<29311992.1075845487422.JavaMail.evans@thyme>,1.0,"Regards,\nSherri Sera\nAssistant to Jeff Skilling\n713.853.5984\n713.646.8381 (fax)\nsherri.sera@enron.com",,jeff.skilling@enron.com,jskillin.nsf,\JSKILLIN (Non-Privileged)\Inbox,Sherri Sera,SKILLING-J,"Skilling, Jeff </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSKILLIN>",,
freq,71,42,71,2791.0,3863,3550,8,12.0,4139,1252,662,1,1,4139.0,56,198.0,1128,2335,727,662,2795,329,4134.0,3209.0


The column 'Mime-Version' has only one value, which should not affect email summarization, so it can safely be dropped.

In [10]:
enron_email_list_df['Mime-Version'].value_counts()

1.0    4139
Name: Mime-Version, dtype: int64

In [11]:
enron_email_list_df = enron_email_list_df.drop(columns=['Mime-Version'])
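
Single-valued columns like this one can also be found programmatically. Below is a sketch that counts distinct values per key over a list of row dicts, the same shape as enron_email_list. The helper name `distinct_counts` and the sample rows are made up, and the sketch assumes all values are hashable (for example strings).

```python
def distinct_counts(rows):
    """Map each key to the number of distinct values seen across `rows`."""
    seen = {}
    for row in rows:
        for key, value in row.items():
            seen.setdefault(key, set()).add(value)
    return {key: len(values) for key, values in seen.items()}


# Made-up rows for illustration only.
rows = [
    {"Mime-Version": "1.0", "Subject": "test", "Folder": "inbox"},
    {"Mime-Version": "1.0", "Subject": "Re: Guidance", "Folder": "inbox"},
    {"Mime-Version": "1.0", "Subject": "PRC memo", "Folder": "sent_items"},
]
counts = distinct_counts(rows)
constant_keys = [key for key, n in counts.items() if n == 1]
```

On the dataframe itself, `enron_email_list_df.nunique()` should give comparable per-column counts.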

In [12]:
enron_email_list_df['Content-Transfer-Encoding'].value_counts()

7bit                3863
quoted-printable    276 
Name: Content-Transfer-Encoding, dtype: int64

In [13]:
enron_email_list_df = enron_email_list_df.drop(columns=['Content-Transfer-Encoding'])

In [14]:
enron_email_list_df['Content-Type'].value_counts()

text/plain; charset=us-ascii          3550
text/plain; charset=ANSI_X3.4-1968    589 
Name: Content-Type, dtype: int64

In [15]:
enron_email_list_df = enron_email_list_df.drop(columns=['Content-Type'])

The X-* metadata does not seem to be relevant to the goal of summarizing email bodies. I explore a few of these columns below, then drop all the X-* columns from the pandas dataframe.

In [16]:
enron_email_list_df['X-Origin'].value_counts()

SKILLING-J    2795
Dasovich-J    676 
Skilling-J    601 
King-J        67  
Name: X-Origin, dtype: int64

In [17]:
enron_email_list_df['X-FileName'].value_counts()

jskillin.nsf                     2335
JDASOVIC (Non-Privileged).pst    676 
JSKILLIN (Non-Privileged).pst    601 
jskillin.pst                     456 
JKING6 (Non-Privileged).pst      67  
Skilling, Jeff.pst               4   
Name: X-FileName, dtype: int64

In [18]:
enron_email_list_df = enron_email_list_df.drop(columns=['X-FileName'])

In [19]:
enron_email_list_df['X-Folder'].value_counts()

\JSKILLIN (Non-Privileged)\Inbox                               727
\Jeffrey_Skilling_Dec2000\Notes Folders\All documents          508
\JSKILLIN (Non-Privileged)\Deleted Items                       483
\Jeffrey_Skilling_Dec2000\Notes Folders\Discussion threads     421
\jskillin\Inbox                                                406
\Jeff_Skilling_Oct2001\Notes Folders\All documents             325
\Jeff_Skilling_Oct2001\Notes Folders\Discussion threads        230
\Jeffrey_Skilling_Dec2000\Notes Folders\Notes inbox            141
\Jeffrey_Skilling_Dec2000\Notes Folders\Sent                   139
\Jeff_Skilling_Oct2001\Notes Folders\'sent mail                139
\Jeff_Skilling_Oct2001\Notes Folders\Sent                      137
\Jeffrey_Skilling_Dec2000\Notes Folders\'sent mail             136
\JSKILLIN (Non-Privileged)\Skilling, Jeff\Inbox                115
\Jeff_Skilling_Oct2001\Notes Folders\Notes inbox               102
\jskillin\Sent Items                                          

In [21]:
# Drop the remaining X-* columns in one call ('X-FileName' was dropped earlier).
enron_email_list_df = enron_email_list_df.drop(
    columns=['X-Folder', 'X-From', 'X-Origin', 'X-To', 'X-bcc', 'X-cc'])
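
Since the X-* drops above all share a prefix, the column list could also be built by name rather than typed out. This is only a sketch; the helper name `x_header_columns` is hypothetical.

```python
def x_header_columns(columns):
    """Return the column names that look like X-* extension headers."""
    return [name for name in columns if name.startswith("X-")]


# Illustrative column names taken from the dataframe in this notebook.
columns = ["Body", "Date", "Subject", "X-Folder", "X-From", "X-bcc", "X-cc"]
to_drop = x_header_columns(columns)

# With the dataframe from this notebook, the same list could be passed to drop:
# enron_email_list_df = enron_email_list_df.drop(
#     columns=x_header_columns(enron_email_list_df.columns))
```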

In [22]:
#enron_email_list_df.iloc[0]

In [23]:
#enron_email_list_df['Body'].iloc[0]

In [24]:
enron_email_list_df.describe()

Unnamed: 0,Bcc,Body,Cc,Chain,Date,Email-Name,Employee,Folder,From,Full_Email_Path,Message-ID,Signature,Subject,To
count,921,4139,921,4139.0,4139,4139.0,4139,4139,4139,4139,4139,643,4139.0,3745
unique,329,2386,329,684.0,2506,1413.0,1,12,919,4139,4139,243,1960.0,621
top,sherri.sera@enron.com,The report named: Violation/Notification Memo <http://erv.corp.enron.com/link,sherri.sera@enron.com,,"Thu, 27 Jul 2000 07:31:00 -0700 (PDT)",1.0,skilling-j,inbox,sherri.sera@enron.com,../data/maildir/skilling-j/inbox/53.,<29311992.1075845487422.JavaMail.evans@thyme>,"Regards,\nSherri Sera\nAssistant to Jeff Skilling\n713.853.5984\n713.646.8381 (fax)\nsherri.sera@enron.com",,jeff.skilling@enron.com
freq,71,42,71,2791.0,8,12.0,4139,1252,662,1,1,56,198.0,1128


In [25]:
enron_email_list_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4139 entries, 0 to 4138
Data columns (total 14 columns):
Bcc                921 non-null object
Body               4139 non-null object
Cc                 921 non-null object
Chain              4139 non-null object
Date               4139 non-null object
Email-Name         4139 non-null object
Employee           4139 non-null object
Folder             4139 non-null object
From               4139 non-null object
Full_Email_Path    4139 non-null object
Message-ID         4139 non-null object
Signature          643 non-null object
Subject            4139 non-null object
To                 3745 non-null object
dtypes: object(14)
memory usage: 452.8+ KB
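
The info() output above shows that Date is still stored as object (a string). Since the goal involves a time range, the dates will need to be parsed at some point. Here is a minimal sketch using only the standard library; the helper name `in_time_range` and the chosen range are assumptions for illustration, while the date strings come from the first rows shown earlier.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def in_time_range(date_header, start, end):
    """True if the Date header falls within [start, end)."""
    sent = parsedate_to_datetime(date_header)
    return start <= sent < end


# Hypothetical range: 14 May 2001 up to (not including) 17 May 2001, UTC-7.
pacific = timezone(timedelta(hours=-7))
start = datetime(2001, 5, 14, tzinfo=pacific)
end = datetime(2001, 5, 17, tzinfo=pacific)

dates = [
    "Wed, 16 May 2001 21:18:00 -0700 (PDT)",
    "Mon, 14 May 2001 15:20:00 -0700 (PDT)",
    "Wed, 1 Aug 2001 12:25:58 -0700 (PDT)",
]
selected = [d for d in dates if in_time_range(d, start, end)]
```

The parsed values are timezone-aware, so the range bounds need a timezone as well for the comparison to work.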


The following code looks at an example of what a summary might contain, ignoring date.

In [26]:
enron_email_list_df['Body'].head(15)

0     Simon, I was unable to open the attached document for Jeff Skilling.  Could you please resend.

In [27]:
enron_email_list_df['Subject'].head(15)

0     Re: EnronOnline Executive Summary for May 15, 2001
1     Re: Telephone Call with Jerry Murdock             
2     Please Plan to Attend                             
3     Stanford/McKinsey GLOBE Survey                    
4     test                                              
5     Re: Washington Pictures                           
6     PRC memo                                          
7     Re:                                               
8     Re: Information for Jeff Skilling from Kevin Scott
9     Re:                                               
10    RE:                                               
11    FINAL - Nevis Itinerary                           
12    RE: Dinner Monday, April 23                       
13    Re: My Resignation, and a Question                
14    Re: Guidance                                      
Name: Subject, dtype: object

In [28]:
enron_email_list_df['Full_Email_Path'].head(15)

0     ../data/maildir/skilling-j/sent_items/1.  
1     ../data/maildir/skilling-j/sent_items/16. 
2     ../data/maildir/skilling-j/sent_items/110.
3     ../data/maildir/skilling-j/sent_items/28. 
4     ../data/maildir/skilling-j/sent_items/5.  
5     ../data/maildir/skilling-j/sent_items/179.
6     ../data/maildir/skilling-j/sent_items/22. 
7     ../data/maildir/skilling-j/sent_items/46. 
8     ../data/maildir/skilling-j/sent_items/35. 
9     ../data/maildir/skilling-j/sent_items/38. 
10    ../data/maildir/skilling-j/sent_items/47. 
11    ../data/maildir/skilling-j/sent_items/4.  
12    ../data/maildir/skilling-j/sent_items/39. 
13    ../data/maildir/skilling-j/sent_items/23. 
14    ../data/maildir/skilling-j/sent_items/19. 
Name: Full_Email_Path, dtype: object