# Summarizing Emails using Machine Learning
## Table of Contents
1. Imports & Initialization <br>
2. Data Input <br>
    A. Enron Email Dataset <br>
    B. BC3 Corpus <br>
3. Preprocessing <br>
    A. Delete bad data. <br>
    B. Sentence Cleaning <br>
    C. Tokenizing <br>
4. Store Data <br>
    A. Locally as pickle <br>
    B. Into database <br>
5. Data Exploration

The goal of this notebook is to clean both the Enron Email and BC3 Corpus datasets so they can be used for email text summarization. The BC3 Corpus contains human-written summarizations that can be used to calculate ROUGE metrics, which measure how closely generated summaries match the human ones. The Enron dataset is far larger, but lacks summaries to test against. 
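
As a rough illustration of what a ROUGE score measures, the sketch below computes ROUGE-1 (unigram overlap) between a generated summary and a human reference. The function name `rouge_1`, the whitespace tokenization, and the sample sentences are assumptions made for illustration; a dedicated ROUGE package would normally be used for reported results.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Return ROUGE-1 precision, recall and F1 using lowercase whitespace tokens."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word is matched at most as often as it occurs in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical candidate and reference summaries.
scores = rouge_1("the meeting moved to friday",
                 "meeting is moved to friday afternoon")
print(scores)
```

Recall answers "how much of the human summary was recovered", while precision answers "how much of the generated summary was relevant".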

## Imports & Initialization

In [1]:
#File system / database libraries
import sys
from os import listdir
from os.path import isfile, join
import configparser
from sqlalchemy import create_engine

#Data science tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Email cleaning 
import email
import mailparser
import xml.etree.ElementTree as ET
from talon.signature.bruteforce import extract_signature
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import re

#Parallelization 
import dask.dataframe as dd
from distributed import Client
import multiprocessing as mp



In [2]:
#Set local location of emails. 
mail_dir = '../data/maildir/'
#mail_dir = '../data/testdir/'

## Data Input: 
### A. Enron Email Dataset
The raw Enron email dataset contains a maildir directory with one folder per employee, and each of those folders holds that employee's emails. The following processes the raw text of each email into a dataframe with the following columns: 

Employee: The username of the email owner. <br>
Body: Cleaned body of the email. <br>
Subject: The title of the email. <br>
From: The original sender of the email <br>
Message-ID: Used to remove duplicate emails, as each email has a unique ID. <br>
Chain: The email chain parsed out of an email that was forwarded or replied to. <br>
Signature: The extracted signature from the body.<br>
Date: Time the email was sent. <br>

In [3]:
def process_email(index):
    #This function attempts to split a raw email into constituent parts that can be used as features. 
    email_path = index[0]
    employee = index[1]
    folder = index[2]
    
    mail = mailparser.parse_from_file(email_path)
    full_body = email.message_from_string(mail.body)
    
    #Only retrieve the body of the email. 
    if full_body.is_multipart():
        return
    else:
        mail_body = full_body.get_payload()    
    
    split_body = clean_body(mail_body)
    headers = mail.headers
    #Reformating date to be more pandas readable
    date_time = process_date(headers.get('Date'))

    email_dict = {
                "employee" : employee,
                "email_folder": folder,
                "message_id": headers.get('Message-ID'),
                "date" : date_time,
                "from" : headers.get('From'),
                "subject": headers.get('Subject'),
                "body" : split_body['body'],
                "chain" : split_body['chain'],
                "signature": split_body['signature'],
                "full_email_path" : email_path #for debug purposes. 
    }
    
    #Return the parsed fields; the caller collects these dicts into a dataframe. 
    return email_dict

In [4]:
def clean_body(mail_body):
    delimiters = ["-----Original Message-----","To:","From"]
    
    #Split the string at whichever delimiter appears earliest in the body. 
    old_len = sys.maxsize
    
    for delimiter in delimiters:
        split_body = mail_body.split(delimiter,1)
        new_len = len(split_body[0])
        if new_len <= old_len:
            old_len = new_len
            final_split = split_body
            
    #Then pull chain message
    if (len(final_split) == 1):
        mail_chain = None
    else:
        mail_chain = final_split[1] 
    
    #The following uses Talon to try to get a clean body, and separate out the signature. 
    #The result is named body so it does not shadow this function's own name. 
    body, sig = extract_signature(final_split[0])
    
    return {'body': body, 'chain' : mail_chain, 'signature': sig}
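
To make the delimiter logic above concrete, here is a minimal, dependency-free sketch of the "split at the earliest delimiter" step on its own, without the Talon signature extraction. The helper name `split_at_earliest_delimiter` and the sample email text are hypothetical.

```python
def split_at_earliest_delimiter(text, delimiters):
    """Split text at whichever delimiter occurs first; return (head, tail or None)."""
    best = [text]
    for delimiter in delimiters:
        parts = text.split(delimiter, 1)
        # A shorter head means this delimiter appears earlier in the text.
        if len(parts[0]) < len(best[0]):
            best = parts
    head = best[0]
    tail = best[1] if len(best) == 2 else None
    return head, tail

# Hypothetical reply that quotes an earlier message.
sample = ("Sounds good, see you then.\n\n"
          "-----Original Message-----\n"
          "From: Jane\nTo: Bob\nCan we meet at 3?")
head, tail = split_at_earliest_delimiter(
    sample, ["-----Original Message-----", "To:", "From"])
print(repr(head))
print(repr(tail))
```

One caveat worth keeping in mind: because the `"From"` delimiter has no trailing colon, it also matches the ordinary word "From" at the start of a sentence, which can cut a body short.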

In [5]:
def process_date(date_time):
    try:
        date_time = email.utils.format_datetime(email.utils.parsedate_to_datetime(date_time))
    except Exception: #Missing or malformed Date header. 
        date_time = None
    return date_time
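
The date handling above can be tried in isolation with the standard library alone. The sketch below uses a hypothetical helper name, `normalize_date`, and a sample header string written in the style of an email `Date` header.

```python
from email.utils import format_datetime, parsedate_to_datetime

def normalize_date(raw_date):
    """Parse an RFC 2822 style date header and re-emit it in a uniform format."""
    try:
        return format_datetime(parsedate_to_datetime(raw_date))
    except (TypeError, ValueError):
        # Unparseable headers are mapped to None so they can be dropped later.
        return None

print(normalize_date("Mon, 14 May 2001 16:39:00 -0700 (PDT)"))
print(normalize_date("not a date"))
```

The trailing timezone comment such as `(PDT)` is dropped, which leaves a string that `pd.to_datetime` can parse consistently.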

In [6]:
def generate_email_paths(mail_dir):
    #Generator to list each email path
    mailboxes = listdir(mail_dir)
    for mailbox in mailboxes:
        inbox = listdir(mail_dir + mailbox)
        for folder in inbox:
            path = mail_dir + mailbox + "/" + folder
            emails = listdir(path)
            for single_email in emails:
                full_path = path + "/" + single_email
                if isfile(full_path): #Skip directories.
                    yield (full_path, mailbox, folder)
    

In [None]:
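#Use multiprocessing to speed up initial data load and processing. The CPU count is reused later to partition the Dask dataframe. 
try:
    cpus = mp.cpu_count()
except NotImplementedError:
    cpus = 2
print("CPUS: " + str(cpus))

indexes = generate_email_paths(mail_dir)
#The context manager shuts the worker pool down once the map completes. 
with mp.Pool(processes=cpus) as pool:
    parsed_emails = pool.map(process_email,indexes)
#process_email returns None for multipart emails, so drop those before building the dataframe. 
enron_email_df = pd.DataFrame([e for e in parsed_emails if e is not None])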

In [None]:
enron_email_df.describe()

## Data Input: 
### B. BC3 Corpus

This dataset is split into two XML files. One contains the original emails split line by line, and the other contains the summarizations created by the annotators. Each email may have several summarizations from different annotators, and a single summary sentence may also cover several emails. I will create a dataframe for each XML file, then join them using the thread number in combination with the email number to produce a single final dataframe. 

The first dataframe will contain the wrangled original emails containing the following information:

Listno: Thread identifier <br>
Email_num: Email in thread sequence <br>
From: The original sender of the email <br>
To: The recipient of the email. <br>
Received: Time the email was received. <br>
Subject: Title of email. <br>
Body: Original body. <br>

In [None]:
def parse_bc3_emails(root):
    BC3_email_list = []
    #The emails are separated by threads.
    for thread in root:
        email_num = 0
        #Iterate through the thread elements <name, listno, Doc>
        for thread_element in thread:
            #Getting the listno allows us to link the summaries to the correct emails
            if thread_element.tag == "listno":
                listno = thread_element.text
            #Each Doc element is a single email
            if thread_element.tag == "DOC":
                email_num += 1
                email_metadata = []
                for email_attribute in thread_element:
                    #If the email attribute is Text, then each child contains a line from the body of the email
                    if email_attribute.tag == "Text":
                        email_body = ""
                        for sentence in email_attribute:
                            email_body += sentence.text
                    else:
                        #The remaining attributes of the email <Received, From, To, Subject> are appended in this order. 
                        email_metadata.append(email_attribute.text)
                        
                #Use same enron cleaning methods on the body of the email
                split_body = clean_body(email_body)
                    
                email_dict = {
                    "listno" : listno,
                    "date" : process_date(email_metadata[0]),
                    "from" : email_metadata[1],
                    "to" : email_metadata[2],
                    "subject" : email_metadata[3],
                    "body" : split_body['body'],
                    "email_num": email_num
                }
                
                BC3_email_list.append(email_dict)           
    return pd.DataFrame(BC3_email_list)

In [None]:
#load BC3 Email Corpus. Much smaller dataset has no need for parallel processing. 
parsedXML = ET.parse( "../data/BC3_Email_Corpus/corpus.xml" )
root = parsedXML.getroot()

#Clean up BC3 emails the same way as the Enron emails. 
bc3_email_df = parse_bc3_emails(root)

In [None]:
bc3_email_df.info()

In [None]:
bc3_email_df.head(3)

The second dataframe contains the summarizations of each email:

Annotator: Person who created summarization. <br>
Email_num: Email in thread sequence. <br>
Listno: Thread identifier. <br>
Summary: Human summarization of the email. <br>

In [None]:
def parse_bc3_summaries(root):
    BC3_summary_list = []
    for thread in root:
        #Iterate through the thread elements <listno, name, annotation>
        for thread_element in thread:
            if thread_element.tag == "listno":
                listno = thread_element.text
            #Each annotation element holds one annotator's summary of the thread
            if thread_element.tag == "annotation":
                for annotation in thread_element:
                    #If the tag is summary, then each child contains a summarization line
                    if annotation.tag == "summary":
                        summary_dict = {}
                        for summary in annotation:
                            #Generate the set of emails the summary sentence belongs to (often a single email)
                            email_nums = summary.attrib['link'].split(',')
                            s = set()
                            for num in email_nums:
                                s.add(num.split('.')[0].strip()) 
                            #Remove empty strings, since they summarize whole threads instead of emails. 
                            s = [x for x in set(s) if x]
                            for email_num in s:
                                if email_num in summary_dict:
                                    summary_dict[email_num] += ' ' + summary.text
                                else:
                                    summary_dict[email_num] = summary.text
                    #get annotator description
                    elif annotation.tag == "desc":
                        annotator = annotation.text
                #For each email summarization create an entry
                for email_num, summary in summary_dict.items():
                    email_dict = {
                        "listno" : listno,
                        "annotator" : annotator,
                        "email_num" : email_num,
                        "summary" : summary
                    }      
                    BC3_summary_list.append(email_dict)
    return pd.DataFrame(BC3_summary_list)
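
The `link` handling inside the loop above can be isolated as a small helper. This is only a sketch: the name `emails_from_link` and the sample link strings are hypothetical, and it assumes, as the parsing code does, that each comma-separated entry has the form `email.sentence`.

```python
def emails_from_link(link):
    """Map a link such as '1.2,1.3,2.1' to the email numbers it references."""
    email_nums = {part.split('.')[0].strip() for part in link.split(',')}
    # An empty entry means the sentence summarizes the whole thread, not one email.
    email_nums.discard('')
    return sorted(email_nums, key=int)

print(emails_from_link("1.2, 1.3,2.1"))
print(emails_from_link(""))
```

Note that the email numbers come back as strings, exactly as they do in `parse_bc3_summaries`.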

In [None]:
#Load summaries and process
parsedXML = ET.parse( "../data/BC3_Email_Corpus/annotation.xml" )
root = parsedXML.getroot()
bc3_summary_df = parse_bc3_summaries(root)

In [None]:
bc3_summary_df.head(3)
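
The join on thread number and email number described earlier could look like the sketch below, shown on toy data with made-up values. One detail to watch: `parse_bc3_emails` stores `email_num` as an integer while `parse_bc3_summaries` keeps it as a string, so the key is cast to a common type first (pandas generally refuses to merge an integer column with a string column).

```python
import pandas as pd

# Toy stand-ins for bc3_email_df and bc3_summary_df; all values are made up.
emails = pd.DataFrame({
    "listno": ["thread-1", "thread-1"],
    "email_num": [1, 2],
    "body": ["First email body.", "Second email body."],
})
summaries = pd.DataFrame({
    "listno": ["thread-1", "thread-1", "thread-1"],
    "email_num": ["1", "1", "2"],
    "annotator": ["Annotator1", "Annotator2", "Annotator1"],
    "summary": ["Summary A.", "Summary B.", "Summary C."],
})

# Align the key dtype before merging.
summaries["email_num"] = summaries["email_num"].astype(int)
merged = emails.merge(summaries, on=["listno", "email_num"], how="inner")
print(merged)
```

Because an email can have several annotators, the merged frame has one row per (email, annotator) pair rather than one row per email.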

## Preprocessing: 
### A. Cleaning bad data. 

In [None]:
#Convert date to pandas datetime.
enron_email_df['date'] = pd.to_datetime(enron_email_df['date'], utc=True)
bc3_email_df['date'] = pd.to_datetime(bc3_email_df.date, utc=True)

#Look at the timeframe
start_date = str(enron_email_df.date.min())
end_date =  str(enron_email_df.date.max())
print("Start Date: " + start_date)
print("End Date: " + end_date)

In [None]:
#Since the data was collected in May 2002 according to Wikipedia, it's a bit strange to see emails past that date. 
#Reading some of these emails suggests they are mostly spam. 
enron_email_df[(enron_email_df.date > '2003-01-01')]

In [None]:
#Quick look at emails before 1999, 
enron_email_df[(enron_email_df.date < '1999-01-01')].date.value_counts()

In [None]:
enron_email_df[(enron_email_df.date == '1980-01-01')].head()

In [None]:
#The emails seem legitimate, but there seems to be a glut of emails dated exactly on 1980-01-01. 
#Keep emails dated after Jan 1st 1998 and before June 1st 2002. 
enron_email_df = enron_email_df[(enron_email_df.date > '1998-01-01') & (enron_email_df.date < '2002-06-01')]
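
The effect of this date window can be checked on a handful of made-up timestamps. The sketch below uses the same bounds as the filter above, with explicit timezone-aware `pd.Timestamp` bounds because the `date` column was converted with `utc=True`; the toy dates are hypothetical.

```python
import pandas as pd

# Made-up dates: a placeholder-looking date, two plausible ones, and one after collection.
toy = pd.DataFrame({"date": pd.to_datetime(
    ["1980-01-01", "2000-08-15", "2001-11-30", "2004-02-02"], utc=True)})

start = pd.Timestamp("1998-01-01", tz="UTC")
end = pd.Timestamp("2002-06-01", tz="UTC")

# Both bounds are exclusive, matching the > and < comparisons in the filter.
kept = toy[(toy["date"] > start) & (toy["date"] < end)]
print(kept)
```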

### B. Sentence Cleaning

The raw Enron email corpus tends to contain a large number of unneeded characters that can interfere with tokenization, so it's best to do a bit more cleaning.

In [None]:
def clean_email_df(df):
    #Work on a copy so the caller's dataframe (often a filtered slice) is not modified. 
    df = df.copy()
    #Removing strings related to attachments and certain non-numerical characters.
    #Raw strings with an explicit regex=True behave the same across pandas versions. 
    patterns = [r"\[IMAGE\]", "-", "_", r"\*", r"\+", r'"."']
    for pattern in patterns:
        df['body'] = df['body'].str.replace(pattern, "", regex=True)
    
    #Collapse runs of whitespace into a single space. 
    df['body'] = df['body'].replace(r'\s+', ' ', regex=True)

    #Blanks are replaced with NaN in the whole dataframe. Then rows with a NaN in the body will be dropped. 
    df = df.replace('',np.nan)
    df = df.dropna(subset=['body'])

    #Remove all duplicate emails 
    df = df.drop_duplicates(subset='body')
    return df

In [None]:
#Apply clean to both datasets. 
enron_email_df = clean_email_df(enron_email_df)
bc3_email_df = clean_email_df(bc3_email_df)
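
To see what the character cleaning does to a single body, the sketch below applies the same patterns to one string with the standard `re` module instead of pandas. The helper name `clean_text` and the sample text are hypothetical.

```python
import re

# Same patterns as the dataframe cleaning step, written as raw-string regexes.
PATTERNS = [r"\[IMAGE\]", r"-", r"_", r"\*", r"\+", r'"."']

def clean_text(text):
    """Strip noise patterns, then collapse runs of whitespace to one space."""
    for pattern in PATTERNS:
        text = re.sub(pattern, "", text)
    return re.sub(r"\s+", " ", text)

sample = "Q3 results [IMAGE] are  up ***\n\n+5% -- see attached_file"
print(clean_text(sample))
```

The output also shows a side effect: removing underscores and hyphens glues the neighboring words together (`attached_file` becomes `attachedfile`), which may or may not be desirable for the later tokenization.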

### C. Tokenizing

It's important to split sentences into their constituent parts for the ML algorithm that will be used for text summarization. This will be applied to both the Enron and BC3 datasets. 

In [None]:
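#Load the stopword list once as a set; rebuilding it for every sentence is slow. 
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(sen):
    #This function removes stopwords from a list of words and rejoins the rest. 
    sen_new = " ".join([i for i in sen if i not in STOP_WORDS])
    return sen_new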

def tokenize_email(text):
    #This function splits up the body into sentence tokens and removes stop words. 
    clean_sentences = sent_tokenize(text, language='english')
    #removing punctuation, numbers and special characters. Then lowercasing. 
    clean_sentences = [re.sub('[^a-zA-Z ]', '',s) for s in clean_sentences]
    clean_sentences = [s.lower() for s in clean_sentences]
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
    return clean_sentences
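
For readers without the NLTK data files installed, the same pipeline can be approximated with the standard library. This is a simplified stand-in, not the method used in this notebook: a regex replaces `sent_tokenize`, and `STOP_WORDS_SAMPLE` is a tiny hypothetical subset rather than NLTK's English stopword list.

```python
import re

# Tiny illustrative subset; NLTK's English stopword list is much longer.
STOP_WORDS_SAMPLE = {"the", "is", "at", "a", "to", "and", "we", "will"}

def simple_tokenize_email(text):
    """Split into sentences, keep letters only, lowercase, and drop stopwords."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = []
    for sentence in sentences:
        letters_only = re.sub(r"[^a-zA-Z ]", "", sentence).lower()
        words = [w for w in letters_only.split() if w not in STOP_WORDS_SAMPLE]
        cleaned.append(" ".join(words))
    return cleaned

print(simple_tokenize_email("The meeting is at 3pm. We will review the Q3 budget!"))
```

The output shows that stripping digits leaves fragments behind (`3pm` becomes `pm`, `Q3` becomes `q`), which is worth remembering when reading the word frequencies later on.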

Starting with the Enron dataset. 

In [None]:
#This tokenizing will be the extracted sentences that may be chosen to form the email summaries. 
enron_email_df['extractive_sentences'] = enron_email_df['body'].apply(sent_tokenize)
#Splitting the text in emails into cleaned sentences
enron_email_df['tokenized_body'] = enron_email_df['body'].apply(tokenize_email)
#Tokenizing the bodies might have revealed more duplicate emails that should be dropped. 
enron_email_df = enron_email_df.loc[enron_email_df.astype(str).drop_duplicates(subset='tokenized_body').index]

Now working on the BC3 Dataset. 

In [None]:
bc3_email_df['extractive_sentences'] = bc3_email_df['body'].apply(sent_tokenize)
bc3_email_df['tokenized_body'] = bc3_email_df['body'].apply(tokenize_email)
bc3_email_df = bc3_email_df.loc[bc3_email_df.astype(str).drop_duplicates(subset='tokenized_body').index]

## Store Data
### A. Locally as pickle

In [None]:
#Local locations for pickle files. 
ENRON_PICKLE_LOC = "../data/dataframes/wrangled_enron_full_df.pkl"
BC3_EMAIL_PICKLE_LOC = "../data/dataframes/wrangled_BC3_email_df.pkl"
BC3_SUMMARY_PICKLE_LOC = "../data/dataframes/wrangled_BC3_summary_df.pkl"

In [None]:
#Store dataframes to disk
#enron_email_df.to_pickle(ENRON_PICKLE_LOC)
#bc3_email_df.to_pickle(BC3_EMAIL_PICKLE_LOC)
#bc3_summary_df.to_pickle(BC3_SUMMARY_PICKLE_LOC)

### B. Into a Postgres database

In [None]:
#Configure postgres database
config = configparser.ConfigParser()
config.read('config_notebook.ini')

#database_config = 'LOCAL_POSTGRES'
database_config = 'AWS_POSTGRES'

POSTGRES_ADDRESS = config[database_config]['POSTGRES_ADDRESS']
POSTGRES_USERNAME = config[database_config]['POSTGRES_USERNAME']
POSTGRES_PASSWORD = config[database_config]['POSTGRES_PASSWORD']
POSTGRES_DBNAME = config[database_config]['POSTGRES_DBNAME']

#now create database connection
postgres_str = ('postgresql+psycopg2://{username}:{password}@{ipaddress}/{dbname}'
                .format(username=POSTGRES_USERNAME, 
                        password=POSTGRES_PASSWORD,
                        ipaddress=POSTGRES_ADDRESS,
                        dbname=POSTGRES_DBNAME))

cnx = create_engine(postgres_str)
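
The layout that `config_notebook.ini` needs can be inferred from the keys read above. The sketch below parses an equivalent in-memory config and builds the connection string; every value in it is a placeholder, and percent-encoding the credentials with `quote_plus` is an added precaution (not part of the original cell) for passwords that contain characters such as `@` or `/`.

```python
import configparser
from urllib.parse import quote_plus

# Placeholder values only; the real file holds the actual host and credentials.
SAMPLE_INI = """
[LOCAL_POSTGRES]
POSTGRES_ADDRESS = localhost:5432
POSTGRES_USERNAME = example_user
POSTGRES_PASSWORD = p@ss/word
POSTGRES_DBNAME = example_db
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_INI)
section = config["LOCAL_POSTGRES"]

# Percent-encode credentials so special characters do not break the URL.
url = "postgresql+psycopg2://{user}:{password}@{address}/{dbname}".format(
    user=quote_plus(section["POSTGRES_USERNAME"]),
    password=quote_plus(section["POSTGRES_PASSWORD"]),
    address=section["POSTGRES_ADDRESS"],
    dbname=section["POSTGRES_DBNAME"],
)
print(url)
```

Keeping credentials in a separate `.ini` file that stays out of version control means the notebook itself can be shared safely.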

In [None]:
#Store data. 
#enron_email_df.to_sql('full_enron_emails', cnx)

## Data Exploration

In [None]:
#Dask can help speed up exploration computations. 

In [None]:
client = Client(processes = True)
client.cluster

In [None]:
#Make into dask dataframe. 
enron_email_df = dd.from_pandas(enron_email_df, npartitions=cpus)
enron_email_df.columns

In [None]:
#Used to create a describe summary of the dataset. Ignoring tokenized columns. 
enron_email_df[['body', 'chain', 'date', 'email_folder', 'employee', 'from', 'full_email_path', 'message_id', 'signature', 'subject']].describe().compute()

In [None]:
#Get word frequencies from tokenized word lists
def get_word_freq(df):
    freq_words=dict()
    for tokens in df.tokenized_words.compute():
        for token in tokens:
            if token in freq_words:
                freq_words[token] += 1
            else: 
                freq_words[token] = 1
    return freq_words     

In [None]:
def tokenize_word(sentences):
    tokens = []
    for sentence in sentences:
        #Extend rather than reassign, so words from every sentence are kept. 
        tokens.extend(word_tokenize(sentence))
    return tokens

In [None]:
#Tokenize the sentences 
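#The column stays lazy here; meta tells Dask the result holds Python objects. 
enron_email_df['tokenized_words'] = enron_email_df['tokenized_body'].apply(tokenize_word, meta=('tokenized_words', 'object'))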

In [None]:
#Creating word dictionary to understand word frequencies. 
freq_words = get_word_freq(enron_email_df)
print('Unique words: {:,}'.format(len(freq_words)))
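
The counting loop in `get_word_freq` can also be expressed with `collections.Counter` from the standard library. The sketch below is an alternative on toy token lists, not a change to the notebook's function; the helper name `count_tokens` and the sample tokens are hypothetical.

```python
from collections import Counter

def count_tokens(token_lists):
    """Count how often each token occurs across an iterable of token lists."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts

# Toy token lists standing in for the tokenized_words column.
counts = count_tokens([["meeting", "budget"], ["budget", "review"], []])
print(counts.most_common(1))
print('Unique words: {:,}'.format(len(counts)))
```

`Counter.most_common(n)` returns the top `n` words already sorted by frequency, which removes the need for a separate sorting step before plotting.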

In [None]:
word_data = []
#Sort dictionary by highest word frequency. 
for key, value in sorted(freq_words.items(), key=lambda item: item[1], reverse=True):
    word_data.append([key, value])

#Prepare to plot bar graph of top words. 
#Create dataframe with Word and Frequency, then sort in Descending order. 
freq_words_df = pd.DataFrame.from_dict(freq_words, orient='index').reset_index()
freq_words_df = freq_words_df.rename(columns={"index": "Word", 0: "Frequency"})
freq_words_df = freq_words_df.sort_values(by=['Frequency'],ascending = False)
freq_words_df.reset_index(drop = True, inplace=True)
freq_words_df.head(30).plot(x='Word', kind='bar', figsize=(20,10))