The goal of this notebook is to clean the Enron email data set so it can be used for email text summarization: given a single person’s folder and a time range, produce a summary for each email.

I plan to use an extractive summarization approach, where key sentences are taken from the input email to form the summary.

My approach to cleaning the data involves the following steps:

1. I have unpacked the maildir directory from the Enron data set. I will be focusing on each person’s inbox, which means all other folders will be ignored for now.

2. I will process each inbox email to add the following information to a Pandas data frame:

Employee: The username of the email owner. 
Body: Cleaned body of the email. 
Subject: The title of the email. 
From: The original sender of the email.
Message-ID: Used to remove duplicate emails, as each email has a unique ID. 
Chain: The email chain parsed out of an email that was forwarded.
Signature: The extracted signature from the body.
Date: Time the email was sent. 

3. During email processing, the full body of each email will be cleaned using a combination of delimiter splitting and the Talon email cleaning library, which is used to extract signatures.

4. Null values will be set or cleaned up. 

5. Finally, the Date column will be converted from strings to datetime values.

In [1]:
import mailparser
import sys
from os import listdir
from os.path import isfile, join
import pandas as pd
import email
from talon.signature.bruteforce import extract_signature

First, define the function that cleans up the email body. It splits the chained emails from the actual body of the email. It appears that "-----Original Message-----" marks the start of a forwarded part of an email, so this is one of the delimiters.

There are also emails that don't seem to be part of a chain but reference another email, which means more email metadata is found in the body. If this metadata is found first, the email will be split there instead. The goal is to cut out as much email forwarding and related metadata as possible, so that only the message written by the original sender remains.

Lastly, Talon will be used to extract the signatures.

In [2]:
def clean_body(mail_body):
    # Markers that usually introduce forwarded or quoted email metadata.
    # Note: "From" has no trailing colon, so it also matches the plain word.
    delimiters = ["-----Original Message-----", "To:", "From"]

    # Split at whichever delimiter occurs earliest, i.e. keep the shortest leading text.
    old_len = sys.maxsize

    for delimiter in delimiters:
        split_body = mail_body.split(delimiter, 1)
        new_len = len(split_body[0])
        if new_len <= old_len:
            old_len = new_len
            final_split = split_body

    # Everything after the chosen delimiter is treated as the email chain.
    if len(final_split) == 1:
        mail_chain = None
    else:
        mail_chain = final_split[1]

    # Use Talon to separate the signature from the remaining body text.
    # The result is named body_text so it does not shadow this function's name.
    body_text, sig = extract_signature(final_split[0])

    return {'Body': body_text, 'Chain': mail_chain, 'Signature': sig}
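
To make the splitting rule easier to see in isolation, here is a minimal sketch that leaves Talon out. The helper name `split_at_earliest_delimiter` and the sample message are made up for illustration and are not part of the notebook's pipeline. Unlike the list above, the sketch uses "From:" with a colon.

```python
def split_at_earliest_delimiter(text, delimiters):
    """Split text at whichever delimiter occurs first.

    Returns (head, tail); tail is None when no delimiter is present.
    """
    # Record the position of every delimiter that actually occurs in the text.
    found = [(text.find(d), d) for d in delimiters if text.find(d) != -1]
    if not found:
        return text, None
    position, delimiter = min(found)  # earliest occurrence wins
    return text[:position], text[position + len(delimiter):]


# Illustrative sample only; not taken from the Enron data.
sample = (
    "Sounds good, see you then.\n\nJeff\n"
    "-----Original Message-----\n"
    "From: someone@example.com\n"
    "To: jeff@example.com\n"
    "Lunch on Friday?"
)
head, tail = split_at_earliest_delimiter(
    sample, ["-----Original Message-----", "To:", "From:"]
)
```

Here `head` holds only the text written by the sender, and `tail` holds everything after the first marker, which corresponds to the Chain value.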

The following function processes an email and appends the parsed pieces to a list, which is later converted into a pandas data frame. Each item in the list is a dict of parsed email values and represents one row of the data frame.


In [3]:
enron_email_list = []
def process_email(email_path, employee, folder, email_name):
    mail = mailparser.parse_from_file(email_path)
    
    full_body = email.message_from_string(mail.body)
    mail_body = full_body.get_payload()    
    split_body = clean_body(mail_body)
    headers = mail.headers
    
    email_dict = {
                "Employee" : employee,
                "Email Folder": folder,
                "Message-ID": headers.get('Message-ID'),
                "Date" : headers.get('Date'),
                "From" : headers.get('From'),
                "Subject": headers.get('Subject'),
                "Body" : split_body['Body'],
                "Chain" : split_body['Chain'],
                "Signature": split_body['Signature'],
                "Full_Email_Path" : email_path #for debug purposes. 
    }
    
    #Append the row to the list that will later become the dataframe.
    enron_email_list.append(email_dict)
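
As a rough illustration of the header and body extraction, the sketch below uses only the standard library `email` module on a made-up message. The message text and the `row` dict are assumptions for illustration; the function above relies on `mailparser` for the real files.

```python
import email

# Made-up message in the same general shape as an Enron maildir file.
raw_message = (
    "Message-ID: <123.example@thyme>\n"
    "Date: Wed, 16 May 2001 21:18:00 -0700 (PDT)\n"
    "From: sender@example.com\n"
    "Subject: Re: Example\n"
    "\n"
    "Body text here.\n"
)

msg = email.message_from_string(raw_message)

# Headers are looked up by name; a missing header returns None.
row = {
    "Message-ID": msg.get("Message-ID"),
    "Date": msg.get("Date"),
    "From": msg.get("From"),
    "Subject": msg.get("Subject"),
    "Body": msg.get_payload(),  # a plain string for non-multipart messages
}
```

For a multipart message `get_payload()` returns a list of parts instead of a string, which this sketch does not handle.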

Now go through each folder of a person's mailbox (only one mailbox for now, for testing), then load each email to be cleaned and added to the list that becomes the dataframe.

In [5]:
mail_dir = '../data/maildir/'
mailbox = "skilling-j" #testing

#for mailbox in mailboxes:
#Every folder in the mailbox is processed, not only the inbox.
folders = listdir(join(mail_dir, mailbox))
for folder in folders:
    path = join(mail_dir, mailbox, folder)
    emails = listdir(path)
    for single_email in emails:
        full_path = join(path, single_email)
        process_email(full_path, mailbox, folder, single_email)
enron_email_list_df = pd.DataFrame(enron_email_list)
enron_email_list_df.head()

Unnamed: 0,Body,Chain,Date,Email Folder,Employee,From,Full_Email_Path,Message-ID,Signature,Subject
0,"Simon, I was unable to open the attached docum...",:\tSimone La rose/ENRON@enronXgate on 05/16/20...,"Wed, 16 May 2001 21:18:00 -0700 (PDT)",sent_items,skilling-j,jeff.skilling@enron.com,../data/maildir/skilling-j/sent_items/1.,<25701656.1075840162944.JavaMail.evans@thyme>,"Thanks,\nJoannie","Re: EnronOnline Executive Summary for May 15, ..."
1,"Erica,\n\nDue to the fact that Jeff is unable ...","\t""'jeff.skilling@enron.com'"" <jeff.skilling@e...","Mon, 14 May 2001 15:20:00 -0700 (PDT)",sent_items,skilling-j,jeff.skilling@enron.com,../data/maildir/skilling-j/sent_items/16.,<8541273.1075840163402.JavaMail.evans@thyme>,,Re: Telephone Call with Jerry Murdock
2,"PLEASE PLAN TO ATTEND\n\n""NEW BUSINESS CR...",,"Wed, 1 Aug 2001 12:25:58 -0700 (PDT)",sent_items,skilling-j,jeff.skilling@enron.com,../data/maildir/skilling-j/sent_items/110.,<20911721.1075852686924.JavaMail.evans@thyme>,,Please Plan to Attend
3,"The Stanford Business School, in collaboration...",,"Thu, 3 May 2001 18:30:00 -0700 (PDT)",sent_items,skilling-j,jeff.skilling@enron.com,../data/maildir/skilling-j/sent_items/28.,<20013668.1075840163679.JavaMail.evans@thyme>,,Stanford/McKinsey GLOBE Survey
4,test,,"Tue, 15 May 2001 14:09:00 -0700 (PDT)",sent_items,skilling-j,jeff.skilling@enron.com,../data/maildir/skilling-j/sent_items/5.,<30255134.1075840163074.JavaMail.evans@thyme>,,test
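
The loop above only covers a single mailbox. One possible way to extend it to every mailbox is sketched below with `pathlib`. The generator name `iter_email_paths` is hypothetical, and skipping nested sub-folders is an assumption on my part.

```python
from pathlib import Path


def iter_email_paths(mail_dir):
    """Yield (employee, folder, path) for each file in <mail_dir>/<employee>/<folder>/."""
    for employee_dir in sorted(Path(mail_dir).iterdir()):
        if not employee_dir.is_dir():
            continue
        for folder_dir in sorted(employee_dir.iterdir()):
            if not folder_dir.is_dir():
                continue
            for email_path in sorted(folder_dir.iterdir()):
                # Assumption: nested sub-folders are skipped rather than walked.
                if email_path.is_file():
                    yield employee_dir.name, folder_dir.name, str(email_path)
```

Each yielded tuple has the pieces that `process_email` expects, so the body of the loop could stay the same.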


Convert each column from object to its appropriate type.

In [29]:
enron_email_list_df[['Body']] = enron_email_list_df[['Body']].astype(str)

A quick look shows that the Subject column has a large number of blanks that should be set to null: in the summary below, the most frequent Subject value is an empty string, which appears 198 times. All the other columns seem to make sense.

In [30]:
enron_email_list_df.describe()

Unnamed: 0,Body,Chain,Date,Email Folder,Employee,From,Full_Email_Path,Message-ID,Signature,Subject
count,4139,1349,4139,4139,4139,4139,4139,4139,643,4139.0
unique,2386,684,2506,12,1,919,4139,4139,243,1960.0
top,The report named: Violation/Notification Memo ...,All Enron Worldwide\ncc: \nSubject: Successi...,"Thu, 27 Jul 2000 07:31:00 -0700 (PDT)",inbox,skilling-j,sherri.sera@enron.com,../data/maildir/skilling-j/all_documents/211.,<28134358.1075845505384.JavaMail.evans@thyme>,"Regards,\nSherri Sera\nAssistant to Jeff Skill...",
freq,42,14,8,1252,4139,662,1,1,56,198.0


In [31]:
enron_email_list_df.dtypes

Body               object
Chain              object
Date               object
Email Folder       object
Employee           object
From               object
Full_Email_Path    object
Message-ID         object
Signature          object
Subject            object
dtype: object
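
Since Date is still an object column, here is a minimal sketch of how the header strings could be parsed. It uses the standard library's `email.utils.parsedate_to_datetime`. The helper name `parse_email_date` and the choice to normalize to UTC are my own assumptions, not something the notebook has settled on.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_email_date(raw_date):
    """Parse an email Date header into a UTC datetime, or None if it cannot be parsed."""
    if not raw_date:
        return None
    try:
        # Assumes the header carries a UTC offset, as the samples above do.
        return parsedate_to_datetime(raw_date).astimezone(timezone.utc)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError and newer ones ValueError on bad input.
        return None


# Date string taken from the first row of the output shown earlier.
parsed = parse_email_date("Wed, 16 May 2001 21:18:00 -0700 (PDT)")
```

Applied to the dataframe, this could look something like `enron_email_list_df['Date'] = pd.to_datetime(enron_email_list_df['Date'].apply(parse_email_date), utc=True)`, though I have not run that against the full data set.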

All blank values will be replaced with NaN so that missing values are represented consistently.

In [26]:
#enron_email_list_df.replace(r'', None, regex=False)

In [23]:
#enron_email_list_df['Subject'].value_counts()
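
The two cells above are still commented out. As a tentative sketch of where this is heading, the snippet below shows the blank replacement and the Message-ID de-duplication on a small made-up dataframe. The names `toy_df` and `deduped_df` and the toy values are purely illustrative, and treating whitespace-only strings as blank is an assumption.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for enron_email_list_df (illustrative values only).
toy_df = pd.DataFrame({
    "Message-ID": ["<1@example>", "<2@example>", "<2@example>", "<3@example>"],
    "Subject": ["Budget", "", "", "   "],
})

# Treat empty or whitespace-only subjects as missing values.
toy_df["Subject"] = toy_df["Subject"].replace(r"^\s*$", np.nan, regex=True)

# Keep only the first copy of each Message-ID.
deduped_df = toy_df.drop_duplicates(subset="Message-ID").reset_index(drop=True)
```

The same `replace` call could be applied to the other text columns if they turn out to contain blanks as well.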