The goal of this notebook is to clean the Enron email data set so it can be used for email text summarization: given a single person’s folder and a time range, produce a summary for each email. 

My approach to cleaning the data involves the following steps:

1. I have unpacked the maildir directory from the Enron data set. I will be focusing on each person’s inbox, which means all other folders will be ignored for now.

2. I will process each inbox email to add the following information to a Pandas data frame:

Employee: The username of the email owner. <br>
Body: Cleaned body of the email. <br>
Subject: The title of the email. <br>
From: The original sender of the email. <br>
Message-ID: The unique ID of each email, which can be used to identify duplicates. <br>
Chain: The email chain parsed out of an email that was forwarded. <br>
Signature: The extracted signature from the body.<br>
Date: Time the email was sent. <br>

3. During email processing, the full body of each email will be cleaned using a combination of delimiter splitting and the Talon email-cleaning library, which extracts signatures. 

4. Null values and column types will be set or cleaned up.

5. The body of each email will be tokenized into separate sentences. The unmodified sentences (the extractive sentences) are the candidates used to build the full summary, while the cleaned, tokenized body is the input to the ML model. 

6. The pandas dataframe will be saved for future notebooks. 

In [1]:
import mailparser
import sys
from os import listdir
from os.path import isfile, join
import pandas as pd
import email
import email.utils
import numpy as np
from talon.signature.bruteforce import extract_signature
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import re

In [2]:
#Local filepaths
mail_dir = '../data/maildir/'
mailbox = "skilling-j" #Start with single mailbox. 
PICKLE_LOC = "../data/dataframes/wrangled_enron_df.pkl"

First, define the function that will clean up the email body. It splits the chain emails from the actual body of the email. It appears that "-----Original Message-----" marks the forwarded part of an email, so this is one of the delimiters. 

There are also emails that don't seem to be part of a chain but still reference another email, which means more email metadata is found in the body. If this metadata is found first, the email is split there instead. The goal is to cut out as much forwarded text and related metadata as possible, leaving only the message written by the original sender. 

Lastly, Talon will be used to extract the signatures. 

In [3]:
def clean_body(mail_body):
    delimiters = ["-----Original Message-----","To:","From"]
    
    #Split at whichever delimiter occurs earliest, so the shortest leading text is kept as the body. 
    old_len = sys.maxsize
    
    for delimiter in delimiters:
        split_body = mail_body.split(delimiter,1)
        new_len = len(split_body[0])
        if new_len <= old_len:
            old_len = new_len
            final_split = split_body
            
    #Then pull chain message
    if (len(final_split) == 1):
        mail_chain = None
    else:
        mail_chain = final_split[1] 
    
    #Use Talon to separate the signature from the remaining body text. 
    body, sig = extract_signature(final_split[0])
    
    return {'Body': body, 'Chain' : mail_chain, 'Signature': sig}
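
To make the splitting rule concrete, here is a minimal sketch of the earliest-delimiter logic on a made-up message. The helper name `split_at_earliest_delimiter` and the sample text are illustrative assumptions, and Talon's signature extraction is left out so that the example runs with the standard library only.

```python
import sys

# Same delimiters as in clean_body above.
DELIMITERS = ["-----Original Message-----", "To:", "From"]

def split_at_earliest_delimiter(mail_body, delimiters=DELIMITERS):
    """Return (body, chain), cutting at whichever delimiter appears first."""
    shortest = sys.maxsize
    best_split = [mail_body]
    for delimiter in delimiters:
        parts = mail_body.split(delimiter, 1)
        # A shorter leading part means this delimiter occurs earlier in the text.
        if len(parts[0]) <= shortest:
            shortest = len(parts[0])
            best_split = parts
    chain = best_split[1] if len(best_split) == 2 else None
    return best_split[0], chain

# Made-up message: a short reply followed by a forwarded chain.
sample = (
    "Let's meet on Friday.\n"
    "-----Original Message-----\n"
    "From: someone@example.com\n"
    "To: someone.else@example.com\n"
    "Can we meet this week?"
)
body, chain = split_at_earliest_delimiter(sample)
```

Here `body` holds only the reply line and `chain` holds everything after the "-----Original Message-----" marker. A message that contains none of the delimiters comes back unchanged with `chain` set to `None`.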

The following function processes a single email and appends the parsed pieces to a list. Each item in the list is a dict of parsed email values and will become one row of the pandas dataframe.


In [4]:
enron_email_list = []
def process_email(email_path, employee, folder, email_name):
    mail = mailparser.parse_from_file(email_path)
    full_body = email.message_from_string(mail.body)
    
    #Skip multipart messages; only single-part payloads are processed
    if full_body.is_multipart():
        return
    else:
        mail_body = full_body.get_payload()    
    
    split_body = clean_body(mail_body)
    headers = mail.headers
    #Reformatting date to be more pandas readable
    date_time = headers.get('Date')
    
    if date_time:
        date_time = email.utils.format_datetime(email.utils.parsedate_to_datetime(date_time))
    
    email_dict = {
                "Employee" : employee,
                "Email Folder": folder,
                "Message-ID": headers.get('Message-ID'),
                "Date" : date_time,
                "From" : headers.get('From'),
                "Subject": headers.get('Subject'),
                "Body" : split_body['Body'],
                "Chain" : split_body['Chain'],
                "Signature": split_body['Signature'],
                "Full_Email_Path" : email_path #for debug purposes. 
    }
    
    #Append the row to the list that will become the dataframe. 
    enron_email_list.append(email_dict)
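
The date handling above can be tried in isolation. This is a small sketch using only the standard library; the header value is made up, but it follows the shape of the `Date` headers in the data set, including the trailing timezone name in parentheses.

```python
from email.utils import format_datetime, parsedate_to_datetime

# Made-up header value for illustration.
raw_date = "Mon, 14 May 2001 16:39:00 -0700 (PDT)"

# Parse into a timezone-aware datetime, then write it back out
# in a normalized form without the trailing "(PDT)" comment.
parsed = parsedate_to_datetime(raw_date)
normalized = format_datetime(parsed)
```

The normalized string keeps the numeric UTC offset, so `pd.to_datetime` can later turn it into a timezone-aware timestamp.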

In [5]:
#Go through every folder in the mailbox, then load up each email to be cleaned and added to the dataframe. 

#mailboxes = listdir(mail_dir)
#for mailbox in mailboxes:
folders = listdir(join(mail_dir, mailbox))
for folder in folders:
    path = join(mail_dir, mailbox, folder)
    emails = listdir(path)
    for single_email in emails:
        full_path = join(path, single_email)
        #Skip nested subfolders; only files are parsed as emails. 
        if isfile(full_path):
            process_email(full_path, mailbox, folder, single_email)
enron_email_list_df = pd.DataFrame(enron_email_list)
#enron_email_list_df.head()

In [6]:
#Convert date to pandas datetime.
enron_email_list_df['Date'] = pd.to_datetime(enron_email_list_df.Date)

#Blanks are replaced with NaN in the whole dataframe. Then rows with a 'NaN' in the body will be dropped. 
enron_email_list_df = enron_email_list_df.replace('',np.nan)
enron_email_list_df = enron_email_list_df.dropna(subset=['Body'])

#Remove duplicate emails, identified here by identical bodies 
enron_email_list_df = enron_email_list_df.drop_duplicates(subset='Body')
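
As a quick check of the null handling and de-duplication steps, the same calls can be run on a toy dataframe. The column values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one empty body and one repeated body.
toy_df = pd.DataFrame({
    "Message-ID": ["<1>", "<2>", "<3>", "<4>"],
    "Body": ["Hello team", "", "Hello team", "See you Friday"],
})

# Blank strings become NaN, rows without a body are dropped,
# and only the first row of each repeated body is kept.
toy_df = toy_df.replace("", np.nan)
toy_df = toy_df.dropna(subset=["Body"])
toy_df = toy_df.drop_duplicates(subset="Body")
```

Only the rows with IDs `<1>` and `<4>` remain. Note that de-duplicating on `Body` also removes distinct emails that happen to share the same text, even though their `Message-ID` values differ.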

Remove strings related to attachments and certain non-alphanumeric characters. This improves the extractive summary output. 

In [7]:
patterns = [r"\[IMAGE\]", "-", "_", r"\*"]
for pattern in patterns:
    #regex=True is set explicitly because pandas 2.0+ treats the pattern as a literal string by default. 
    enron_email_list_df['Body'] = enron_email_list_df['Body'].str.replace(pattern, "", regex=True)
    
#Collapse runs of whitespace into a single space. 
enron_email_list_df.Body = enron_email_list_df.Body.replace(r'\s+', ' ', regex=True)
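
The effect of these patterns is easier to see on a couple of made-up strings. The sketch below applies the same four patterns and the whitespace collapse to a small `Series`; the sample text is an assumption for illustration.

```python
import pandas as pd

# Made-up bodies containing the characters targeted above.
bodies = pd.Series([
    "[IMAGE] Q3 results -- see *attached* file_v2",
    "Line one\n\n   line two",
])

# Remove each pattern, then collapse any run of whitespace to one space.
for pattern in [r"\[IMAGE\]", "-", "_", r"\*"]:
    bodies = bodies.str.replace(pattern, "", regex=True)
bodies = bodies.str.replace(r"\s+", " ", regex=True)
```

Two side effects are worth knowing about: removing `_` and `-` joins the surrounding characters (`file_v2` becomes `filev2`), and a leading space is left where `[IMAGE]` used to be.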

In [8]:
#Fully tokenize the Body of each email. 

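#Build the stop word set once instead of on every call.
STOP_WORDS = set(stopwords.words('english'))

#This function removes stopwords from a list of words and rejoins the rest. 
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in STOP_WORDS])
    return sen_new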

#This function splits up the body into sentence tokens and removes stop words. 
def tokenize_email(text):
    clean_sentences = sent_tokenize(text, language='english')
    #Removing punctuation, numbers and special characters, then lowercasing. 
    clean_sentences = [re.sub(r'[^a-zA-Z ]', '',s) for s in clean_sentences]
    clean_sentences = [s.lower() for s in clean_sentences]
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
    return clean_sentences
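
The per-sentence cleaning can be illustrated without the NLTK data files. In this sketch the sentences are already split and a tiny hand-written stop-word set stands in for NLTK's English list; both are assumptions made so the example is self-contained, and `clean_sentence` is a hypothetical helper name.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's full English list.
STOP_WORDS_DEMO = {"the", "is", "at", "on", "a", "we", "will"}

def clean_sentence(sentence, stop_words=STOP_WORDS_DEMO):
    """Keep letters and spaces only, lowercase, then drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z ]", "", sentence).lower()
    return " ".join(w for w in letters_only.split() if w not in stop_words)

# Sentences as they might come out of sent_tokenize.
sentences = ["We will meet at 3pm on Friday.", "The Q3 report is attached!"]
cleaned = [clean_sentence(s) for s in sentences]
```

Because digits are stripped before the words are split, `3pm` is reduced to `pm` and `Q3` to `q`, which is something to keep in mind when numbers carry meaning in an email.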


In [11]:
#These unmodified sentence tokens are the candidates that may be chosen to form the email summaries. 
enron_email_list_df['Extractive_Sentences'] = enron_email_list_df['Body'].apply(sent_tokenize)

#Splitting the text in emails into cleaned sentences
enron_email_list_df['Tokenized_Body'] = enron_email_list_df['Body'].apply(tokenize_email)

In [12]:
#Store the cleaned DataFrame to disk. 
enron_email_list_df.to_pickle(PICKLE_LOC)
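
A later notebook could reload the pickle and select one folder and time range, which is the input the summarization goal describes. The sketch below is only an assumption about how that might look: it round-trips a small invented dataframe through a temporary file rather than `PICKLE_LOC`, and the folder names and dates are made up.

```python
import os
import tempfile

import pandas as pd

# Invented rows using the same column names as the notebook.
demo_df = pd.DataFrame({
    "Employee": ["skilling-j"] * 3,
    "Email Folder": ["inbox", "inbox", "sent"],
    "Date": pd.to_datetime(["2001-05-01", "2001-06-15", "2001-05-20"]),
    "Body": ["first", "second", "third"],
})

# Save and reload, as a follow-up notebook would do with PICKLE_LOC.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_df.pkl")
    demo_df.to_pickle(demo_path)
    loaded = pd.read_pickle(demo_path)

# Select one folder and one time range.
start, end = pd.Timestamp("2001-05-01"), pd.Timestamp("2001-05-31")
in_range = loaded[
    (loaded["Email Folder"] == "inbox") & loaded["Date"].between(start, end)
]
```

Pickle preserves the column types, so the `Date` column comes back as datetimes and can be filtered directly.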