# Exploring the Enron Emails Dataset

The Enron corpus is the largest public-domain database of real e-mails in the world.  This version of the dataset contains over 500,000 e-mails from about 150 users, mostly senior management at Enron.  The corpus is valuable for research because it provides a rich example of how a real organization uses e-mail, and it has had a widespread influence on today's fraud-detection software.  See the [Wikipedia article on the Enron scandal](https://en.wikipedia.org/wiki/Enron_scandal) to learn more.

The purpose of this project is to explore the data to check for signs of fraud and to see what sort of information was leaked in the e-mails.  Check out the dataset [here](https://www.cs.cmu.edu/~./enron/).

## 1. Looking At The Data

In [110]:
import pandas as pd
import numpy as np

filepath = "data/emails.csv"
# Read the data into a pandas dataframe called emails
emails = pd.read_csv(filepath)
print("Successfully loaded {} rows and {} columns!".format(emails.shape[0], emails.shape[1]))
# Print column names
print(emails.columns)
# Store column headers 
headers = list(emails.columns)
# Print the first 5 rows of the dataset
print(emails.head())

Successfully loaded 517401 rows and 2 columns!
Index([u'file', u'message'], dtype='object')
                       file                                            message
0     allen-p/_sent_mail/1.  Message-ID: <18782981.1075855378110.JavaMail.e...
1    allen-p/_sent_mail/10.  Message-ID: <15464986.1075855378456.JavaMail.e...
2   allen-p/_sent_mail/100.  Message-ID: <24216240.1075855687451.JavaMail.e...
3  allen-p/_sent_mail/1000.  Message-ID: <13505866.1075863688222.JavaMail.e...
4  allen-p/_sent_mail/1001.  Message-ID: <30922949.1075863688243.JavaMail.e...


NumPy and pandas were imported, then the CSV file containing the e-mails was read into a dataframe called **`emails`**.  Reading the file may take a while because of its size.  Next, the shape of the dataset, the column names and the first five rows were printed.  There are 517,401 rows and 2 columns.

**`file`** - contains the original directory and filename of each email. The root level of this path is the employee (surname first followed by first name initial) to whom the emails belong. 

**`message`** - contains the email text

### E-mails are MIME formatted

Here is a sample of a standard e-mail found in the data.  It contains a list of headers followed by a message body.  Note the header called "Mime-Version", which indicates that the e-mails in this dataset are MIME formatted.  MIME stands for Multipurpose Internet Mail Extensions, and virtually all human-written e-mail is transmitted in MIME format.  Python has a built-in [MIME handling package](https://docs.python.org/2/library/email.html), and this is what will be used to extract the needed data from each e-mail.

In [117]:
print(emails.loc[0]["message"])

Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast
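
As a minimal sketch of how the built-in `email` package splits such a message into headers and body, the snippet below parses a shortened copy of the sample above. The `raw_message` string is abbreviated for illustration (the `X-` headers are omitted), and the code is written for Python 3.

```python
import email

# A shortened copy of the sample message: headers, a blank line, then the body.
raw_message = (
    "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
    "To: tim.belden@enron.com\n"
    "Subject: \n"
    "Mime-Version: 1.0\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Here is our forecast\n\n "
)

# Parse the raw string into a Message object.
parsed = email.message_from_string(raw_message)

# Header lookup is case-insensitive.
sender = parsed["From"]
recipient = parsed["to"]

# items() returns (header name, value) pairs in their original order.
header_names = [name for name, _ in parsed.items()]

# For a non-multipart message, get_payload() returns the body as a string.
body = parsed.get_payload()

print(sender)
print(recipient)
print(header_names)
print(repr(body))
```

For multipart messages `get_payload()` returns a list of sub-messages instead of a string, so code that calls string methods on the payload may want to check `parsed.is_multipart()` first.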

 


## 2. Data Cleaning

Here is the list of steps that need to be performed on the data before it can be analyzed:
* Check for missing values
* Tokenization
* Feature Engineering
* Remove unwanted characters 

### Missing Values

The `emails` dataframe was checked for missing values.  In this case, there were no missing values.

In [118]:
# Check for null values
null_values = emails.isnull().values.any()
if not null_values: print("No NaN values")

No NaN values


### Introducing the Bag-of-words model

For the computer to make inferences about the e-mails, it has to be able to interpret the text through a numerical representation of it.  One way to do this is with a [**Bag-of-words model**](https://en.wikipedia.org/wiki/Bag-of-words_model), which takes each e-mail as a string and converts it into a numerical vector.  In this case, each string will be converted into a 1-dimensional array of 0s and 1s.  The first step in creating a Bag-of-words model is called tokenization: each e-mail string is split into a list of words.
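
To make the idea concrete, here is a small sketch of a binary bag-of-words encoding on three toy sentences. The sentences and the helper name `to_binary_vector` are illustrative assumptions, not part of this project's pipeline.

```python
# Three toy documents standing in for e-mail bodies (illustrative data only).
documents = [
    "here is our forecast",
    "the forecast is late",
    "please send the report",
]

# Tokenize each document on whitespace.
tokenized = [doc.split() for doc in documents]

# The vocabulary is the sorted set of all distinct tokens.
vocabulary = sorted({token for tokens in tokenized for token in tokens})


def to_binary_vector(tokens, vocabulary):
    """Return a list with 1 where a vocabulary word occurs in tokens, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


vectors = [to_binary_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one position per vocabulary word, so all documents map to arrays of the same length regardless of how many words they contain.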

### Tokenization

In this step, the MIME handling python package mentioned earlier was used to extract both the headers and the messages found within each e-mail.  The data found within the headers section of each e-mail will be added to the `emails` dataframe as new features.  They are stored in the `new_features` dictionary.  All tokens are stored in `tokenized_messages` for further processing.
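
One detail worth keeping in mind when turning headers into dataframe columns: if some messages lack a given header, per-header lists built by simple appending can end up with different lengths. Whether that happens in this dataset is not stated here, so the following is only a sketch of one way to guard against it, using two made-up messages and padding missing headers with `None`.

```python
import email

# Two made-up messages; the second has no "To" header (illustrative data only).
raw_messages = [
    "From: a@example.com\nTo: b@example.com\nSubject: hi\n\nBody one",
    "From: c@example.com\nSubject: no recipient\n\nBody two",
]

parsed = [email.message_from_string(raw) for raw in raw_messages]

# Collect every header name that appears in any message, in first-seen order.
header_names = []
for message in parsed:
    for name in message.keys():
        if name not in header_names:
            header_names.append(name)

# One list per header with exactly one entry per message;
# Message.get() returns None when a header is absent.
new_features = {
    name: [message.get(name) for message in parsed]
    for name in header_names
}

print(new_features)
```

Because every list has one entry per message, each can be assigned directly as a column of a dataframe with the same number of rows.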

**Why lowercase the message body?**

A human knows that "Forecast" and "forecast" mean the same thing, but the computer does not.  When the matrix is built with the bag-of-words model, lowercasing also prevents the same word from being entered more than once under different capitalizations.
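
A quick illustration of the effect on vocabulary size, using a made-up string of tokens:

```python
# Made-up tokens containing the same two words in different capitalizations.
tokens = "Forecast forecast FORECAST report Report".split()

# Without lowercasing, every capitalization counts as a separate word.
vocab_case_sensitive = sorted(set(tokens))

# After lowercasing, the variants collapse into one entry per word.
vocab_lowercased = sorted(set(token.lower() for token in tokens))

print(len(vocab_case_sensitive), vocab_case_sensitive)
print(len(vocab_lowercased), vocab_lowercased)
```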

**Note**: Running the code below may take a few minutes to complete

In [120]:
# MIME handling package
import email

# List of tokens
tokenized_messages = []
# Used to store data for new features
new_features = {}

for item in emails["message"]: 
    # Return a message object structure from a string
    e = email.message_from_string(item)
    # A list of tuples containing the header keys and values
    header_list = e.items()
    # Add data to dictionary 
    for key, value in header_list:
        # Append to the list for this header, creating the list on first use
        new_features.setdefault(key, []).append(value)
    # get message body  
    message_body = e.get_payload()
    # lower case messages
    message_body = message_body.lower()
    # split message into tokens
    tokens = message_body.split(" ")
    tokenized_messages.append(tokens)
print(tokenized_messages[0])

['here', 'is', 'our', 'forecast\n\n', '']
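
The output above still contains `'forecast\n\n'` and an empty string because the body was split on single spaces only. As a hedged alternative (not what the cell above does), calling `split()` with no argument splits on any run of whitespace and drops empty tokens:

```python
# The body of the first e-mail, as shown in the sample message.
body = "Here is our forecast\n\n "

# Splitting on a single space keeps newlines inside tokens
# and produces an empty trailing token.
space_tokens = body.lower().split(" ")

# split() with no argument treats any run of whitespace as one separator.
whitespace_tokens = body.lower().split()

print(space_tokens)
print(whitespace_tokens)
```

Which behaviour is preferable depends on whether later steps rely on the newline characters; the cleaning step in this notebook removes `"\n"` explicitly.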


### Adding new columns

In [41]:
#create new column for employee name using data from the "file" column
employees = []

# list of paths that do not have exactly three components (employee/folder/file)
outlier_dir = []
# extract "file" column
file_info = emails[headers[0]]
for row in file_info:
    tokens = row.split("/")
    if len(tokens) != 3:
        outlier_dir.append(tokens)
    employees.append(tokens[0])
# create column and set its values
emails["employee"] = employees 

In [42]:
# shows the dataframe with the appended column
print(emails.head())

                       file  \
0     allen-p/_sent_mail/1.   
1    allen-p/_sent_mail/10.   
2   allen-p/_sent_mail/100.   
3  allen-p/_sent_mail/1000.   
4  allen-p/_sent_mail/1001.   

                                             message employee  
0  Message-ID: <18782981.1075855378110.JavaMail.e...  allen-p  
1  Message-ID: <15464986.1075855378456.JavaMail.e...  allen-p  
2  Message-ID: <24216240.1075855687451.JavaMail.e...  allen-p  
3  Message-ID: <13505866.1075863688222.JavaMail.e...  allen-p  
4  Message-ID: <30922949.1075863688243.JavaMail.e...  allen-p  


In [43]:
# show the ten employee directories that contain the most e-mails
print(emails["employee"].value_counts()[:10])

kaminski-v      28465
dasovich-j      28234
kean-s          25351
mann-k          23381
jones-t         19950
shackleton-s    18687
taylor-m        13875
farmer-d        13032
germany-c       12436
beck-s          11830
Name: employee, dtype: int64


In [44]:
print(len(outlier_dir))

26555


The e-mails can be filtered by employee name, which can be retrieved from the path found in the `file` column.  Although the subfolders in the path could also be used as filters, they will remain untouched for now.

An empty list was created to hold the values for the `employee` column.  The `file` column from the `emails` dataframe was extracted into `file_info`, which was then looped over to get the path string in each row.  Each string was split on "/" into tokens, with the employee name located at index 0.  Each name was added to the `employees` list, which was then assigned as the values of the new `employee` column.

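A sample of five rows was printed to show that the new column was added.  The ten employee directories holding the most e-mails were also printed, along with their counts.  Note that these counts cover every e-mail stored under an employee's directory, not only the ones that employee sent.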

Note that 26,555 paths did not have exactly three components (employee, folder and file).  Although this figure may seem large, it represents only about 5.1% of the 517,401 paths in the dataset.
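
The same bookkeeping can be sketched with the standard library alone. In the snippet below only the first path mirrors a row shown above; the other paths are hypothetical and exist purely to exercise the two-component and four-component cases. The last line recomputes the outlier share from the two counts reported in this section.

```python
from collections import Counter

# Only the first path comes from the data shown above; the rest are hypothetical.
paths = [
    "allen-p/_sent_mail/1.",
    "allen-p/inbox/12.",
    "kean-s/archiving/untitled/45.",  # hypothetical: four components
    "beck-s/7.",                      # hypothetical: two components
]

# The employee is everything before the first "/".
employees = [path.split("/", 1)[0] for path in paths]

# Count how many paths have each number of components.
depth_counts = Counter(len(path.split("/")) for path in paths)

# Paths that do not follow the employee/folder/file pattern.
outliers = [path for path in paths if len(path.split("/")) != 3]

# Share of outlier paths reported for the full dataset, in percent.
reported_share = round(100.0 * 26555 / 517401, 1)

print(employees)
print(depth_counts)
print(outliers)
print(reported_share)
```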


### Remove unwanted HTML markup, punctuation and emoticons

Take a look at the sample e-mail message below and you will see that it contains HTML markup, punctuation, possible emoticons and other unwanted characters.  While it may be useful to retain some punctuation and emoticons, most of them carry no useful information for this analysis.  For simplicity, every character in the `unwanted_characters` list is removed; because that list includes ":" and ")", emoticons such as ":)" are stripped as well.

In [75]:
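# Assumption: tokenized_emails is not defined in the cells shown above; here it is
# taken to be each raw message (headers included) split on single spaces
tokenized_emails = [message.split(" ") for message in emails["message"]]
print(tokenized_emails[0])
print("")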
unwanted_characters = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", 
                       "<", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ">", "@", 
                       "(", ")", '\\', "~", "{", "}", "*", "^", "\n", "xto"]

cleaned_tokenized_emails = []
for item in tokenized_emails[:1]:
    tokens = []
    for token in item:
        token = token.lower()
        for punc in unwanted_characters:
            token = token.replace(punc, "")
        tokens.append(token)
    cleaned_tokenized_emails.append(tokens)
print(cleaned_tokenized_emails)

['Message-ID:', '<18782981.1075855378110.JavaMail.evans@thyme>\nDate:', 'Mon,', '14', 'May', '2001', '16:39:00', '-0700', '(PDT)\nFrom:', 'phillip.allen@enron.com\nTo:', 'tim.belden@enron.com\nSubject:', '\nMime-Version:', '1.0\nContent-Type:', 'text/plain;', 'charset=us-ascii\nContent-Transfer-Encoding:', '7bit\nX-From:', 'Phillip', 'K', 'Allen\nX-To:', 'Tim', 'Belden', '<Tim', 'Belden/Enron@EnronXGate>\nX-cc:', '\nX-bcc:', '\nX-Folder:', '\\Phillip_Allen_Jan2002_1\\Allen,', 'Phillip', "K.\\'Sent", 'Mail\nX-Origin:', 'Allen-P\nX-FileName:', 'pallen', '(Non-Privileged).pst\n\nHere', 'is', 'our', 'forecast\n\n', '']

[['messageid', 'javamailevansthymedate', 'mon', '', 'may', '', '', '', 'pdtfrom', 'phillipallenenroncomto', 'timbeldenenroncomsubject', 'mimeversion', 'contenttype', 'textplain', 'charset=usasciicontenttransferencoding', 'bitxfrom', 'phillip', 'k', 'allen', 'tim', 'belden', 'tim', 'beldenenronenronxgatexcc', 'xbcc', 'xfolder', 'phillip_allen_jan_allen', 'phillip', 'ksent', 
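
As an alternative to calling `replace` once per character, the same stripping can be sketched with a single compiled regular expression. This is an illustration rather than the notebook's method: `UNWANTED` and `clean_token` are hypothetical names, only the single-character entries of `unwanted_characters` are included (the multi-character entry "xto" is left out), and the sample tokens are taken from the output above.

```python
import re

# Single characters from the unwanted_characters list above;
# "\u2019" is the typographic apostrophe. The entry "xto" is omitted here.
UNWANTED = ",:;.'\"\u2019?/-+&<>0123456789@()\\~{}*^\n"

# Build one character class that matches any unwanted character.
pattern = re.compile("[" + re.escape(UNWANTED) + "]")


def clean_token(token):
    """Lowercase a token and delete every unwanted character from it."""
    return pattern.sub("", token.lower())


# A few tokens copied from the printed sample.
sample = ["Message-ID:", "Mon,", "14", "May", "(PDT)\nFrom:", "forecast\n\n", ""]

cleaned = [clean_token(token) for token in sample]

# Tokens made up only of unwanted characters become empty strings; drop them.
non_empty = [token for token in cleaned if token]

print(cleaned)
print(non_empty)
```

Deleting `"\n"` outright glues neighbouring words together, which is why tokens such as `pdtfrom` appear in the output above. Replacing newlines with a space before splitting would keep those words separate, if that matters for the analysis.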