# Exploring the Enron Emails Dataset

The Enron corpus is the largest public-domain database of real e-mails in the world.  This version of the dataset contains over 500,000 e-mails from about 150 users, mostly senior management at Enron.  The corpus is valuable for research because it provides a rich example of how a real organization uses e-mail, and it has had a widespread influence on today's fraud-detection software.  See the [Wikipedia article on the Enron scandal](https://en.wikipedia.org/wiki/Enron_scandal) to learn more.

The purpose of this project is to explore the data to check for fraud and to see what sort of information was leaked in the e-mails.  The dataset is available [here](https://www.cs.cmu.edu/~./enron/).

## 1. Looking At The Data

In [196]:
import pandas as pd
import numpy as np
from IPython.display import display

filepath = "data/emails.csv"
# Read the data into a pandas dataframe called emails
emails = pd.read_csv(filepath)
emails = emails.iloc[:1000] # testing 
print("Successfully loaded {} rows and {} columns!".format(emails.shape[0], emails.shape[1]))
# Print column names
print(emails.columns)
# Store column headers 
headers = [header for header in emails.columns]
# Print the first 5 rows of the dataset
display(emails.head())

Successfully loaded 1000 rows and 2 columns!
Index([u'file', u'message'], dtype='object')


Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


NumPy and pandas were imported, then the CSV file containing the e-mails was read into a dataframe called **`emails`**.  Reading may take a while due to the size of the file.  Next, the shape of the dataset, the column names and a sample of five rows were printed.  The full dataset has 517,401 rows and 2 columns; the output above reports 1,000 rows because only the first 1,000 were kept for testing.

**`file`** - contains the original directory and filename of each email. The root level of this path is the employee (surname first followed by first name initial) to whom the emails belong. 

**`message`** - contains the email text

### E-mails are MIME formatted

Here is a sample of a typical e-mail found in the data.  It contains a list of headers followed by a message body.  Note the header labelled "Mime-Version", which signifies that the e-mails in this dataset are MIME formatted.  MIME stands for Multipurpose Internet Mail Extensions, and virtually all human-written e-mail is transmitted in MIME format.  Python has a built-in [MIME handling package](https://docs.python.org/2/library/email.html), which will be used to extract the needed data from each e-mail.
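As a minimal sketch of what the `email` package does, the snippet below parses a small hand-written message (the addresses and text are made up for illustration, not taken from the corpus) and reads its headers and body.

```python
import email

# A tiny hand-written message in the same shape as the corpus entries
# (hypothetical addresses and text)
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Forecast\n"
    "Mime-Version: 1.0\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Here is our forecast\n"
)

msg = email.message_from_string(raw)

# Header lookup is case-insensitive
sender = msg["From"]
subject = msg["subject"]
# items() returns (name, value) pairs in the order they appear
header_names = [name for name, _ in msg.items()]
# For a non-multipart message, get_payload() returns the body as a string
body = msg.get_payload()

print(sender, subject, header_names)
print(repr(body))
```

The blank line separates the headers from the body, which is why everything after it ends up in the payload.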

In [186]:
print(emails.loc[0]["message"])

Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 


## 2. Data Cleaning

Here is the list of steps that need to be performed on the data:
* Check for missing values
* Tokenization
* Feature Engineering
* Remove unwanted characters 

### Missing Values

The `emails` dataframe was checked for missing values.  In this case, there were no missing values.

In [187]:
# Check for null values
null_values = emails.isnull().values.any()
if not null_values:
    print("No NaN values")

No NaN values


### Introducing the Bag-of-words model

For the computer to make inferences about the e-mails, it has to be able to interpret the text through a numerical representation of it.  One way to do this is with a [**Bag-of-words model**](https://en.wikipedia.org/wiki/Bag-of-words_model).  It takes each e-mail as a string and converts it into a numerical vector.  In this case, each string will be converted into a 1-dimensional array of word counts.  The first step in creating a Bag-of-words model is called tokenization: each e-mail string is split into a list of words.
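The idea can be sketched by hand on two made-up sentences. This is only an illustration of the representation, not the code used in this notebook: each distinct word gets a column index, and each sentence becomes a vector whose entries say how often that word occurs (capping every entry at 1 would give a binary variant).

```python
from collections import Counter

# Two made-up documents (not from the corpus)
docs = ["here is our forecast", "the forecast is the forecast"]

# Tokenize: lowercase, then split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every distinct word mapped to a column index (sorted for stability)
all_words = sorted({word for tokens in tokenized for word in tokens})
vocab = {word: index for index, word in enumerate(all_words)}

def vectorize(tokens, vocab):
    """Return one count per vocabulary word, ordered by column index."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in sorted(vocab, key=vocab.get)]

vectors = [vectorize(tokens, vocab) for tokens in tokenized]

print(vocab)
print(vectors)
```

Word order is discarded; only the counts survive, which is where the "bag" in the name comes from.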

### Tokenization

In this step, the MIME handling Python package mentioned earlier is used to extract both the headers and the message body from each e-mail.  The header values are collected in the `header_data` dictionary and will later be added to the `emails` dataframe as new features.  All tokens are stored in `tokenized_messages` for further processing.

**Why lowercase the message body?**

A human knows that "Forecast" and "forecast" mean the same thing, but the computer does not.  When building the bag-of-words matrix, lowercasing prevents the same word from being entered more than once as separate vocabulary entries.

**Note**: Running the code below may take a few minutes to complete

In [197]:
# MIME handling package
import email

# List of tokens
tokenized_messages = []
# Used to store data for new features
header_data = {}

for item in emails["message"]: 
    # Return a message object structure from a string
    e = email.message_from_string(item)
    # A list of tuples containing the header keys and values
    header_list = e.items()
    # Add data to dictionary 
    for key, value in header_list:
        # Append to this header's list, creating the list on first sight
        header_data.setdefault(key, []).append(value)
    # get message body  
    message_body = e.get_payload()
    # lower case messages
    message_body = message_body.lower()
    # split message into tokens
    tokens = message_body.split(" ")
    tokenized_messages.append(tokens)
print(tokenized_messages[0])

['here', 'is', 'our', 'forecast\n\n', '']
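Splitting on a single space leaves newline characters attached to words and produces empty tokens, as the output above shows. One possible alternative, sketched here under the assumption that only runs of ASCII letters count as words, is a regular-expression tokenizer; the `tokenize` helper is hypothetical and not part of the notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and return each run of ASCII letters as a token."""
    return re.findall(r"[a-z]+", text.lower())

tokens = tokenize("Here is our forecast\n\n ")
print(tokens)
```

This also drops digits and punctuation in the same pass, so it changes what the later cleaning step would need to do. Note that it splits contractions such as "don't" into two tokens.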


### Adding new columns

Here is the list of keys pulled from the `header_data` dictionary, which will be used as labels for the new columns in the `emails` dataframe.

In [198]:
headers = list(header_data.keys())
print(headers)

['Cc', 'X-cc', 'From', 'Subject', 'X-Folder', 'Content-Transfer-Encoding', 'X-bcc', 'Bcc', 'To', 'X-Origin', 'X-FileName', 'X-From', 'Date', 'X-To', 'Message-ID', 'Content-Type', 'Mime-Version']


In [200]:
def add_column(df, data):
    # Return our updated dataframe with the added columns
    for key, value in data.items():
        df[key] = pd.Series(value)
    return df
emails = add_column(emails, header_data)
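One caveat with this approach: the lists in `header_data` can differ in length, because not every message carries every header (the sample message shown earlier has no `Cc` or `Bcc`). Wrapping each list in `pd.Series` then lines values up by position in the list rather than by the message they came from. A possible way to keep them aligned, sketched below on two made-up messages, is to build one dictionary per message so that a missing header stays missing for that row.

```python
import email

# Two made-up messages; only the first has a Cc header
raw_messages = [
    "From: a@example.com\nTo: b@example.com\nCc: c@example.com\n\nfirst\n",
    "From: b@example.com\nTo: a@example.com\n\nsecond\n",
]

# One dict of headers per message keeps every value tied to its own row.
# Caveat: dict() keeps only the last value if a header name is repeated.
rows = []
for raw in raw_messages:
    msg = email.message_from_string(raw)
    rows.append(dict(msg.items()))

# Union of all header names, then one aligned row per message
all_headers = sorted({name for row in rows for name in row})
table = [[row.get(name) for name in all_headers] for row in rows]

print(all_headers)
print(table)
```

With pandas, passing such a list of dictionaries to `pd.DataFrame(rows)` builds the same table, filling absent headers with `NaN`.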

### Here is a summary of the new `emails` dataset containing all the new columns.  

In [201]:
print(emails.shape)
display(emails.head(1))

(1000, 19)


Unnamed: 0,file,message,Cc,X-cc,From,Subject,X-Folder,Content-Transfer-Encoding,X-bcc,Bcc,To,X-Origin,X-FileName,X-From,Date,X-To,Message-ID,Content-Type,Mime-Version
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,"john.lavorato@enron.com, hunter.shively@enron.com",,phillip.allen@enron.com,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",7bit,,"john.lavorato@enron.com, hunter.shively@enron.com",tim.belden@enron.com,Allen-P,pallen (Non-Privileged).pst,Phillip K Allen,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",Tim Belden <Tim Belden/Enron@EnronXGate>,<18782981.1075855378110.JavaMail.evans@thyme>,text/plain; charset=us-ascii,1.0


### Remove unwanted HTML markup, punctuation, digits and emoticons

In [236]:
unwanted_characters = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", 
                       "<", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ">", "@", 
                       "(", ")", '\\', "~", "{", "}", "*", "^", "!", "\n"]

cleaned_tokenized_emails = []
for item in tokenized_messages:
    tokens = []
    for token in item:
        for punc in unwanted_characters:
            token = token.replace(punc, " ")
        tokens.append(token)
    cleaned_tokenized_emails.append(tokens)
print(cleaned_tokenized_emails[0])

['here', 'is', 'our', 'forecast  ', '']
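Replacing each unwanted character with a space leaves trailing whitespace on some tokens and keeps empty tokens, as seen above. A possible tidier variant is sketched below with a translation table (Python 3 `str.maketrans`). The character set used here, ASCII punctuation plus digits and newlines, is an assumption and differs slightly from the hand-written list; the `clean` helper is hypothetical.

```python
import string

# Characters to blank out: ASCII punctuation, digits and newlines
# (an assumed set, not identical to the notebook's list)
to_blank = string.punctuation + string.digits + "\n"
blank_table = str.maketrans(to_blank, " " * len(to_blank))

def clean(tokens):
    """Blank out unwanted characters, trim whitespace, drop empty tokens."""
    cleaned = (token.translate(blank_table).strip() for token in tokens)
    return [token for token in cleaned if token]

print(clean(['here', 'is', 'our', 'forecast\n\n', '']))
```

A single `translate` call per token replaces the inner loop over every unwanted character.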


## 3. Construct a Bag-of-words model

### Count words

Now that the data has been cleaned, it is time to construct a bag-of-words model to get the word counts.  Scikit-learn has a `CountVectorizer` class that does just that.  It takes a list of strings (documents); `fit_transform` returns a sparse matrix of word counts, and the fitted `vocabulary_` attribute is a dictionary mapping each word to its integer column index.  In the code below, all cleaned tokens are joined into a single document.

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
words = ', '.join([str(x) for x in cleaned_tokenized_emails])
docs = np.array([words])
bag = count.fit_transform(docs)
print(count.vocabulary_)

Here we convert the sparse count matrix to a dense array, where each column index corresponds to an entry in the `CountVectorizer` vocabulary.  There is a single row because all e-mails were joined into one document.

In [264]:
bag = bag.toarray()
print(bag.shape)
print(bag)

(1, 8058)
[[2 1 3 ..., 2 4 2]]
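The array above has one row because the whole corpus was passed to `CountVectorizer` as a single string. As a sketch of the more common layout, the example below passes one string per document, using three made-up sentences, so that each e-mail would get its own row of counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents standing in for three e-mails
docs = [
    "here is our forecast",
    "forecast meeting at noon",
    "see the forecast here",
]

count = CountVectorizer()
# One row per document, one column per vocabulary word
bag = count.fit_transform(docs).toarray()

print(bag.shape)
print(count.vocabulary_)
```

For the notebook's data, an equivalent input could be built with something like `[" ".join(tokens) for tokens in cleaned_tokenized_emails]`; that line is a suggestion and was not run here.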


### Word Relevance using term frequency-inverse document frequency

Apply term frequency-inverse document frequency (tf-idf) to downweight words that appear frequently across the e-mails but carry little useful information.  Scikit-learn has a transformer called `TfidfTransformer` to do this.  `TfidfTransformer` also normalizes the tf-idf values using L2 normalization, which scales each document's vector to unit length so that long and short e-mails are comparable.

In [265]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[  1.75562649e-04   8.77813243e-05   2.63343973e-04 ...,   1.75562649e-04
    3.51125297e-04   1.75562649e-04]]
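The values above are proportional to the raw counts (the entries for counts 2, 1 and 3 are in the ratio 2:1:3). That is what one would expect with a single document: every word then has the same document frequency, so the idf factor is identical for all of them and only the L2 normalization changes the numbers. The sketch below reimplements the calculation in plain Python, assuming the smoothed formula scikit-learn documents for `smooth_idf=True`, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The `tfidf` helper is hypothetical and meant only to illustrate the arithmetic.

```python
import math

def tfidf(counts):
    """counts: one row per document, each row a list of term counts."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents containing each term
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1.0 + n_docs) / (1.0 + df[j])) + 1.0 for j in range(n_terms)]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # L2 normalization: scale the row to unit length (guard against all-zero rows)
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result

# One document: idf is 1 for every term, so the ratios match the counts
single = tfidf([[2, 1, 3]])
# Two documents: the second term appears in only one, so it is weighted up
multi = tfidf([[1, 1], [1, 0]])

print(single)
print(multi)
```

With one row per e-mail, words that occur in nearly every message would receive a lower weight than words concentrated in a few messages.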
