The goal of this notebook is to clean the Enron email data set so it can be used for email text summarization: given a single person's folder and a time range, produce a summary for each email.

My approach to cleaning the data involves the following steps:

1. I have unpacked the maildir directory from the Enron data set. I will be focusing on each person’s inbox, which means all other folders will be ignored for now.

2. I will process each inbox email to add the following information to a Pandas data frame:

Employee: The username of the email owner. <br>
Email Folder: The mailbox folder the email was read from. <br>
Body: Cleaned body of the email. <br>
Subject: The title of the email. <br>
From: The original sender of the email. <br>
Message-ID: Used to remove duplicate emails, as each email has a unique ID. <br>
Chain: The email chain parsed out of an email that was forwarded. <br>
Signature: The signature extracted from the body. <br>
Date: Time the email was sent. <br>
Full_Email_Path: Path to the source file, kept for debugging. <br>

3. During email processing, the full body of the email will be cleaned using a combination of delimiter splitting and the Talon email cleaning library, which extracts signatures.

4. Null values and column types will be set or cleaned up.

5. The body of the email will be tokenized into separate sentences, and two columns will be added to the dataframe:

Extractive_Sentences: Used to help generate the full summary. <br>
Tokenized_Body: Used as an input for the ML model. <br>

6. The pandas dataframe will be saved for future notebooks. 

-------------

In [1]:
import mailparser
import sys
from os import listdir
from os.path import isfile, join
import pandas as pd
import email
import email.utils
import numpy as np
from talon.signature.bruteforce import extract_signature
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import re
import xml.etree.ElementTree as ET



In [2]:
#Local filepaths
mail_dir = '../data/maildir/'
#Used to look at single mailbox. 
#mailbox = "skilling-j" 
#PICKLE_LOC = "../data/dataframes/wrangled_enron_df.pkl"
#following location is for full dataframe
ENRON_PICKLE_LOC = "../data/dataframes/wrangled_enron_full_df.pkl"
BC3_EMAIL_PICKLE_LOC = "../data/dataframes/wrangled_BC3_email_df.pkl"
BC3_SUMMARY_PICKLE_LOC = "../data/dataframes/wrangled_BC3_summary_df.pkl"

First, define the function that will clean up the email body. It splits any chained emails from the actual body of the email. It appears that "-----Original Message-----" marks the forwarded part of an email, so this is one of the delimiters.

Some emails don't seem to be part of a chain but still reference another email, which means more email metadata ends up in the body. If this metadata appears first, the email is split there instead. The goal is to cut out as much forwarding and related metadata as possible, leaving only the message written by the original sender.

Lastly, Talon will be used to extract the signatures.

In [3]:
def clean_body(mail_body):
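    #Delimiters that mark the start of a forwarded chain or a quoted header block.
    #Note: "From" has no trailing colon, so it also matches the word in running text.
    delimiters = ["-----Original Message-----","To:","From"]
    
    #Split at whichever delimiter appears earliest, so the shortest leading segment is kept.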
    old_len = sys.maxsize
    
    for delimiter in delimiters:
        split_body = mail_body.split(delimiter,1)
        new_len = len(split_body[0])
        if new_len <= old_len:
            old_len = new_len
            final_split = split_body
            
    #Then pull chain message
    if (len(final_split) == 1):
        mail_chain = None
    else:
        mail_chain = final_split[1] 
    
    #Use Talon to separate the signature from the rest of the body text.
    #The result is named `body` to avoid shadowing the function name.
    body, sig = extract_signature(final_split[0])
    
    return {'Body': body, 'Chain' : mail_chain, 'Signature': sig}
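
To make the splitting rule easier to follow, here is a minimal standalone sketch of the "earliest delimiter wins" step, without Talon. The helper name `split_at_earliest_delimiter` and the sample message are made up for illustration and are not part of the notebook.

```python
def split_at_earliest_delimiter(mail_body, delimiters):
    """Return (leading_text, chain), split at the delimiter that occurs first."""
    best = [mail_body]  # unsplit body is the fallback
    for delimiter in delimiters:
        parts = mail_body.split(delimiter, 1)
        # A shorter leading segment means this delimiter appears earlier.
        if len(parts[0]) < len(best[0]):
            best = parts
    chain = best[1] if len(best) == 2 else None
    return best[0], chain


# Hypothetical message used only to illustrate the behaviour.
sample = "See you at noon.\n-----Original Message-----\nFrom: Bob\nTo: Alice\nLunch?"
body, chain = split_at_earliest_delimiter(
    sample, ["-----Original Message-----", "To:", "From"]
)
print(repr(body))
print(repr(chain))
```

Everything after the first delimiter is treated as the chain, so later delimiters inside the chain have no effect on where the body ends.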

The following functions process each email and collect the parsed pieces for a pandas dataframe. We start with a list in which each item is a dict of parsed email values; each dict becomes one row of the dataframe.


In [4]:
def process_date(date_time):
    try:
        date_time = email.utils.format_datetime(email.utils.parsedate_to_datetime(date_time))
    except Exception:
        #Missing or unparseable dates are stored as None.
        date_time = None
    return date_time
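
As a quick illustration of what this normalization does, the sketch below runs the same two `email.utils` calls on a well-formed and a malformed date string. The function name `normalize_date` and both inputs are illustrative only, and the narrower exception tuple is a choice of this sketch.

```python
from email.utils import format_datetime, parsedate_to_datetime


def normalize_date(raw):
    """Round-trip an RFC 2822 style date string; return None if it cannot be parsed."""
    try:
        return format_datetime(parsedate_to_datetime(raw))
    except (TypeError, ValueError):
        # Depending on the Python version, a bad date raises one or the other.
        return None


good = normalize_date("Mon, 14 May 2001 16:39:00 -0700")
bad = normalize_date("not a date")
print(good)
print(bad)
```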

In [5]:
enron_email_list = []
def process_email(email_path, employee, folder):
    mail = mailparser.parse_from_file(email_path)
    full_body = email.message_from_string(mail.body)
    
    #Skip multipart bodies; only single-part payloads are processed.
    if full_body.is_multipart():
        return
    else:
        mail_body = full_body.get_payload()    
    
    split_body = clean_body(mail_body)
    headers = mail.headers
    #Reformatting the date to be more pandas readable
    date_time = process_date(headers.get('Date'))

    email_dict = {
                "Employee" : employee,
                "Email Folder": folder,
                "Message-ID": headers.get('Message-ID'),
                "Date" : date_time,
                "From" : headers.get('From'),
                "Subject": headers.get('Subject'),
                "Body" : split_body['Body'],
                "Chain" : split_body['Chain'],
                "Signature": split_body['Signature'],
                "Full_Email_Path" : email_path #for debug purposes. 
    }
    
    #Append the row to the list that will later become the dataframe.
    enron_email_list.append(email_dict)

In [6]:
#Email body cleaning at dataframe level
def clean_email_df(df):
    #Removing strings related to attachments and certain punctuation characters.
    #Raw strings avoid invalid escape warnings, and regex=True is explicit because
    #newer pandas versions treat the pattern as a literal string by default.
    patterns = [r"\[IMAGE\]", r"-", r"_", r"\*"]
    for pattern in patterns:
        df['Body'] = df['Body'].str.replace(pattern, "", regex=True)
    
    #Collapse runs of whitespace into a single space.
    df['Body'] = df['Body'].replace(r'\s+', ' ', regex=True)

    #Blanks are replaced with NaN in the whole dataframe. Then rows with a NaN in the body are dropped.
    df = df.replace('', np.nan)
    df = df.dropna(subset=['Body'])

    #Remove all Duplicate emails 
    df = df.drop_duplicates(subset='Body')
    return df
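
The same substitutions can be tried on a single string with the standard `re` module, which makes their side effects easy to see. This is only a sketch: `clean_text` is a hypothetical helper, and the first input string is invented for the example.

```python
import re

# Same patterns as in clean_email_df, written as raw strings.
PATTERNS = [r"\[IMAGE\]", r"-", r"_", r"\*"]


def clean_text(text):
    """Strip attachment markers and a few punctuation characters, then collapse whitespace."""
    for pattern in PATTERNS:
        text = re.sub(pattern, "", text)
    return re.sub(r"\s+", " ", text)


print(clean_text("Hi [IMAGE] there -- *bold*   text__"))
print(clean_text("Monday-Monday"))
```

Because every hyphen is removed, hyphenated words are fused together, which is why strings such as "MondayMonday" show up in the cleaned bodies.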

In [7]:
#Fully tokenize the Body of each email. 
#This function removes stopwords
def remove_stopwords(sen):
    stop_words = stopwords.words('english')
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

#This function splits up the body into sentence tokens and removes stop words. 
def tokenize_email(text):
    clean_sentences = sent_tokenize(text, language='english')
    #removing punctuation, numbers and special characters. Then lowercasing. 
    clean_sentences = [re.sub('[^a-zA-Z ]', '',s) for s in clean_sentences]
    clean_sentences = [s.lower() for s in clean_sentences]
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
    return clean_sentences
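
To show the shape of the output without requiring the NLTK data files, here is a dependency-free approximation of `tokenize_email`. The regex sentence splitter and the tiny stop-word set are stand-ins assumed for this sketch; they are not NLTK's `sent_tokenize` or its English stop-word list, so results on real emails will differ.

```python
import re

# Hypothetical, deliberately tiny stop-word set (NLTK's list is much longer).
STOP_WORDS = {"the", "a", "is", "to", "and", "i", "will", "you"}


def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (a rough stand-in for sent_tokenize)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize_sketch(text):
    cleaned = []
    for sentence in naive_sent_split(text):
        # Keep letters and spaces only, then lowercase.
        sentence = re.sub(r"[^a-zA-Z ]", "", sentence).lower()
        # Drop stop words and rejoin the remaining words.
        cleaned.append(" ".join(w for w in sentence.split() if w not in STOP_WORDS))
    return cleaned


result = tokenize_sketch("The meeting is at 3pm. I will send the agenda to you!")
print(result)
```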

In [None]:
#Go through every folder in each person's mailbox, then load each email to be cleaned and collected.
mailboxes = listdir(mail_dir)
for mailbox in mailboxes:
    folders = listdir(join(mail_dir, mailbox))
    for folder in folders:
        path = join(mail_dir, mailbox, folder)
        emails = listdir(path)
        for single_email in emails:
            full_path = join(path, single_email)
            process_email(full_path, mailbox, folder)

#Build the dataframe once, after all mailboxes have been processed.
enron_email_list_df = pd.DataFrame(enron_email_list)

#Convert date to pandas datetime. utc=True normalizes the mixed timezone offsets, as is done for the BC3 emails below.
enron_email_list_df['Date'] = pd.to_datetime(enron_email_list_df.Date, utc=True)
enron_email_list_df = clean_email_df(enron_email_list_df)


In [None]:
#Sentence-tokenize each body; these sentences are the candidates from which the email summaries are formed.
enron_email_list_df['Extractive_Sentences'] = enron_email_list_df['Body'].apply(sent_tokenize)

#Splitting the text in emails into cleaned sentences
enron_email_list_df['Tokenized_Body'] = enron_email_list_df['Body'].apply(tokenize_email)

#Tokenizing the bodies might have revealed more duplicate emails, or extremely similar emails that should be dropped.
enron_email_list_df = enron_email_list_df.loc[enron_email_list_df.astype(str).drop_duplicates(subset='Tokenized_Body').index]
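
The `astype(str)` step is there because the `Tokenized_Body` cells are lists, and lists cannot be hashed for duplicate detection. The idea can be sketched in plain Python as follows; `drop_duplicate_rows` and the sample rows are hypothetical and only illustrate keeping the first occurrence of each stringified value.

```python
def drop_duplicate_rows(rows, key):
    """Keep the first row for each distinct value of row[key], compared as a string."""
    seen = set()
    unique = []
    for row in rows:
        marker = str(row[key])  # a list is unhashable, its string form is hashable
        if marker not in seen:
            seen.add(marker)
            unique.append(row)
    return unique


# Made-up rows for illustration.
rows = [
    {"id": 1, "Tokenized_Body": ["send agenda", "thanks"]},
    {"id": 2, "Tokenized_Body": ["send agenda", "thanks"]},
    {"id": 3, "Tokenized_Body": ["meeting moved"]},
]
kept = drop_duplicate_rows(rows, "Tokenized_Body")
print([row["id"] for row in kept])
```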

In [None]:
enron_email_list_df.head(1)

In [None]:
#Store the cleaned DataFrame to disk. 
enron_email_list_df.to_pickle(ENRON_PICKLE_LOC)

# Part 2, BC3 Corpus 

Part 2 of this notebook wrangles the BC3 Email Corpus, which consists of 40 email threads (3222 sentences) from the W3C corpus. This set of emails comes with human-written summaries of each email, which will be useful for evaluating the performance of the text summarization algorithm.

The dataset is split into two XML files. One contains the original emails split line by line, and the other contains the summarizations created by the annotators. Each email may have several summarizations from different annotators, and a summarization may also cover several emails.

I will link the two dataframes using the thread number in combination with the email number.

1. The first dataframe will contain the wrangled original emails with the following information:

Listno: Thread identifier. <br>
Email_num: Position of the email in the thread sequence. <br>
From: The original sender of the email. <br>
To: The recipient of the email. <br>
Date: Time the email was received. <br>
Subject: Title of the email. <br>
Body: Body of the original email, cleaned the same way as the Enron emails. <br>

2. The second dataframe contains the summarizations of each email, with the following information:

Annotator: Person who created the summarization. <br>
Email_num: Position of the email in the thread sequence. <br>
Listno: Thread identifier. <br>
Summary: Human summarization of the email. <br>
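
The linking described above can be sketched in plain Python as a lookup keyed on the pair (Listno, Email_num). One detail worth noting is that the email parser below produces integer email numbers while the summary parser produces strings, so this sketch converts both sides to strings before matching. The helper `link_summaries` and all sample values are hypothetical.

```python
from collections import defaultdict


def link_summaries(emails, summaries):
    """Attach every matching summary to its email, keyed on (Listno, Email_num)."""
    by_key = defaultdict(list)
    for s in summaries:
        by_key[(s["Listno"], str(s["Email_num"]))].append(s["Summary"])
    linked = []
    for e in emails:
        key = (e["Listno"], str(e["Email_num"]))
        linked.append(dict(e, Summaries=by_key.get(key, [])))
    return linked


# Made-up rows: Email_num is an int on one side and a str on the other.
emails = [
    {"Listno": "thread-A", "Email_num": 1},
    {"Listno": "thread-A", "Email_num": 2},
]
summaries = [
    {"Listno": "thread-A", "Email_num": "2", "Annotator": "A1", "Summary": "First view."},
    {"Listno": "thread-A", "Email_num": "2", "Annotator": "A2", "Summary": "Second view."},
]
linked = link_summaries(emails, summaries)
print(linked[1]["Summaries"])
```

An email can end up with several summaries (one per annotator) or none at all, so the linked value is a list.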

In [36]:
def parse_bc3_emails(root):
    BC3_email_list = []
    #The emails are separated by threads.
    for thread in root:
        email_num = 0
        #Iterate through the thread elements <name, listno, Doc>
        for thread_element in thread:
            #Getting the listno allows us to link the summaries to the correct emails
            if thread_element.tag == "listno":
                listno = thread_element.text
            #Each Doc element is a single email
            if thread_element.tag == "DOC":
                email_num += 1
                email_metadata = []
                for email_attribute in thread_element:
                    #If the email attribute is Text, then each child contains a line from the body of the email
                    if email_attribute.tag == "Text":
                        email_body = ""
                        for sentence in email_attribute:
                            #Guard against empty elements, whose text is None.
                            email_body += sentence.text or ""
                    else:
                        #The remaining attributes <Received, From, To, Subject> are appended in this order. 
                        email_metadata.append(email_attribute.text)
                        
                #Use same enron cleaning methods on the body of the email
                split_body = clean_body(email_body)
                    
                email_dict = {
                    "Listno" : listno,
                    "Date" : process_date(email_metadata[0]),
                    "From" : email_metadata[1],
                    "To" : email_metadata[2],
                    "Subject" : email_metadata[3],
                    "Body" : split_body['Body'],
                    "Email_num": email_num
                }
                
                BC3_email_list.append(email_dict)           
    return pd.DataFrame(BC3_email_list)
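
For readers without the corpus at hand, the traversal can be tried on a small inline document. The XML below is an assumed, simplified layout based only on the tags the parser checks (`listno`, `DOC`, `Text` and its child sentences); the root tag, the `Sent` tag name, and all values are invented for this sketch. Joining sentences with a space is also a choice of the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus; the real file has more fields and threads.
SAMPLE = """<root>
  <thread>
    <name>Example thread</name>
    <listno>000-0000000</listno>
    <DOC>
      <Subject>Hello</Subject>
      <Text><Sent id="1.1">Hello all.</Sent><Sent id="1.2">See you soon.</Sent></Text>
    </DOC>
    <DOC>
      <Subject>Re: Hello</Subject>
      <Text><Sent id="2.1">Sounds good.</Sent></Text>
    </DOC>
  </thread>
</root>"""


def iter_thread_emails(root):
    """Collect one dict per DOC element, numbered by position within its thread."""
    rows = []
    for thread in root:
        listno = thread.findtext("listno")
        for email_num, doc in enumerate(thread.findall("DOC"), start=1):
            sentences = [s.text or "" for s in doc.find("Text")]
            rows.append({
                "Listno": listno,
                "Email_num": email_num,
                "Subject": doc.findtext("Subject"),
                "Body": " ".join(sentences),
            })
    return rows


rows = iter_thread_emails(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```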

In [37]:
def parse_bc3_summaries(root):
    BC3_summary_list = []
    for thread in root:
        #Iterate through the thread elements <listno, name, annotation>
        for thread_element in thread:
            if thread_element.tag == "listno":
                listno = thread_element.text
            #Each annotation element holds one annotator's summaries for the thread
            if thread_element.tag == "annotation":
                for annotation in thread_element:
                    #If the child is a summary, then each of its children contains a summarization line
                    if annotation.tag == "summary":
                        summary_dict = {}
                        for summary in annotation:
                            #Generate the set of emails the summary sentence belongs to (often a single email)
                            email_nums = summary.attrib['link'].split(',')
                            s = set()
                            for num in email_nums:
                                s.add(num.split('.')[0].strip()) 
                            #Remove empty strings, since they summarize whole threads instead of emails. 
                            s = [x for x in set(s) if x]
                            for email_num in s:
                                if email_num in summary_dict:
                                    summary_dict[email_num] += ' ' + summary.text
                                else:
                                    summary_dict[email_num] = summary.text
                    #get annotator description
                    elif annotation.tag == "desc":
                        annotator = annotation.text
                #For each email summarization create an entry
                for email_num, summary in summary_dict.items():
                    email_dict = {
                        "Listno" : listno,
                        "Annotator" : annotator,
                        "Email_num" : email_num,
                        "Summary" : summary
                    }      
                    BC3_summary_list.append(email_dict)
    return pd.DataFrame(BC3_summary_list)
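
The handling of the `link` attribute is the least obvious part of this parser, so here it is in isolation. Judging from the code, each link is a comma-separated list of sentence ids of the form `email.sentence`, and only the email part is kept. The helper name `emails_for_link` and the sample strings are illustrative assumptions.

```python
def emails_for_link(link):
    """Return the sorted, de-duplicated email numbers referenced by a link attribute."""
    email_nums = set()
    for sentence_id in link.split(","):
        # "2.3" refers to sentence 3 of email 2; keep only the email number.
        email_nums.add(sentence_id.split(".")[0].strip())
    # An empty string means the summary line is not tied to a specific email.
    return sorted(num for num in email_nums if num)


print(emails_for_link("1.1, 1.2,2.3"))
print(emails_for_link(""))
```

The email numbers stay strings here, which matches how `Email_num` is stored in the summary dataframe.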

In [40]:
#load BC3 Email Corpus 
parsedXML = ET.parse( "../data/BC3_Email_Corpus/corpus.xml" )
root = parsedXML.getroot()

#Clean up BC3 email bodies the same way as the Enron emails
bc3_email_df = parse_bc3_emails(root)
bc3_email_df = clean_email_df(bc3_email_df)
bc3_email_df['Date'] = pd.to_datetime(bc3_email_df.Date, utc=True)

#Tokenize emails 
bc3_email_df['Extractive_Sentences'] = bc3_email_df['Body'].apply(sent_tokenize)
bc3_email_df['Tokenized_Body'] = bc3_email_df['Body'].apply(tokenize_email)

#Tokenizing the bodies might have revealed more duplicate emails, or extremely similar emails that should be dropped. 
bc3_email_df = bc3_email_df.loc[bc3_email_df.astype(str).drop_duplicates(subset='Tokenized_Body').index]
bc3_email_df.head(1)

Unnamed: 0,Body,Date,Email_num,From,Listno,Subject,To,Extractive_Sentences,Tokenized_Body
0,"The IETF meetings tend to become too large, creating logistics and planning problems. I suggest that future meetings are held for two weeks, with applications and user services issues the first week, and all other issues the second week. Those who so wish could attend both weeks, and other people could attend only one week. Those who choose to attend both weeks would be able to cover more groups and do better liaisons between the different areas. The Friday of the first week could discuss applications issues which might be of special interest to the other areas, and the Monday of the second week would schedule other groups which might be of special interest to applications people, so some people could attend MondayMonday or FridayFriday. Jacob Palme &lt;jpalme@dsv.su.se&gt; (Stockholm University and KTH) for more info see URL: http://www.dsv.su.se/~jpalme",1998-12-08 15:30:52+00:00,1,Jacob Palme <jpalme@dsv.su.se>,007-7484738,Extending IETF meetings to two weeks?,discuss@apps.ietf.org,"[The IETF meetings tend to become too large, creating logistics and planning problems., I suggest that future meetings are held for two weeks, with applications and user services issues the first week, and all other issues the second week., Those who so wish could attend both weeks, and other people could attend only one week., Those who choose to attend both weeks would be able to cover more groups and do better liaisons between the different areas., The Friday of the first week could discuss applications issues which might be of special interest to the other areas, and the Monday of the second week would schedule other groups which might be of special interest to applications people, so some people could attend MondayMonday or FridayFriday., Jacob Palme &lt;jpalme@dsv.su.se&gt; (Stockholm University and KTH) for more info see URL: http://www.dsv.su.se/~jpalme]","[ietf meetings tend become large creating logistics planning problems, suggest future meetings held two weeks 
applications user services issues first week issues second week, wish could attend weeks people could attend one week, choose attend weeks would able cover groups better liaisons different areas, friday first week could discuss applications issues might special interest areas monday second week would schedule groups might special interest applications people people could attend mondaymonday fridayfriday, jacob palme ltjpalmedsvsusegt stockholm university kth info see url httpwwwdsvsusejpalme]"


In [41]:
#Load summaries and process
parsedXML = ET.parse( "../data/BC3_Email_Corpus/annotation.xml" )
root = parsedXML.getroot()
bc3_summary_df = parse_bc3_summaries(root)

In [42]:
#Store dataframes to disk
bc3_email_df.to_pickle(BC3_EMAIL_PICKLE_LOC)
bc3_summary_df.to_pickle(BC3_SUMMARY_PICKLE_LOC)

In [43]:
bc3_summary_df.head()

Unnamed: 0,Annotator,Email_num,Listno,Summary
0,Annotator7-Part2,1,067-11978590,"Wendy states that they are moving forward with a face to face meeting in Bristol, England on Oct 5-6. The Authoring tool group will join them to discuss revisions to the WCAG on the 6th. She asks the group if they like to discuss the draft they have now and asks if they want to work on a draft for 2 full days."
1,Annotator7-Part2,2,067-11978590,William responds that it is mandatory to discuss the draft to avoid conflicts and oversights. He thinks it'll take greater than 2 days to work on the draft. He is looking forward to the meeting in Bristol.
2,Annotator7-Part2,3,067-11978590,"Jonathan asks he is attending the venue, what the agenda is and if it would be useful if he attended with a presentation."
3,Annotator7-Part2,4,067-11978590,Charles responds and says it suits him to meet and discuss the draft.
4,Annotator7-Part2,5,067-11978590,"Wendy answers all of Jonathan's questions: registration has not started, there is no agenda yet, and yes it would be worthwhile if Jonathan attended with a presentation."


In [24]:
#Look at single thread email. 
bc3_email_df.iloc[1]['Body']

'> The IETF meetings tend to become too large, creating logistics and planning problems. ... My problem over the past year or so is that there are only a few session I wish to attend, but I cannot know for sure when they will be scheduled, so I cannot make reasonable travel arrangements (a week in Orlando for 6 hours of meetings is hard to sell to management). Now I know there is a rationale here, and that one is encouraged to participate broadly. And I am hopeful that new activities (my own and in the IETF) will give me many more reasons to attend. But firmer scheduling would be a big win. regards, Terry Terry Allen Electronic Commerce and Publishing Consultant tallen[at]sonic.net http://www.sonic.net/~tallen/ DocBook: http://www.ora.com/davenport/index.html Common Business Library: http://www.veosystems.com/'

In [25]:
#look at summaries 
#None means no truncation; the older value -1 is deprecated in recent pandas versions.
pd.set_option('display.max_colwidth', None)
bc3_summary_df['Summary'].loc[(bc3_summary_df['Listno'] == bc3_email_df.iloc[0]['Listno']) & (bc3_summary_df['Email_num'] == '2')]

118    Terry supported Jacob's idea and suggested a firmer scheduling.             
124    Some argue that it'd be more useful to prepare the meeting schedule earlier.
Name: Summary, dtype: object