## Exploring the Enron Emails Dataset

The Enron corpus is one of the largest publicly available collections of real e-mails.  This version of the dataset contains over 500,000 e-mails from about 150 users, mostly senior management at Enron.  The corpus is valuable for research because it shows how a real organization uses e-mail, and it has had a widespread influence on today's fraud-detection software.  See the [Wikipedia article on the Enron scandal](https://en.wikipedia.org/wiki/Enron_scandal) to learn more.  

The purpose of this project is to explore the data for signs of fraud and to see what sort of information was leaked in the e-mails.  The dataset is available [here](https://www.cs.cmu.edu/~./enron/).  

## 1. Looking at the Data

In [25]:
import pandas as pd
import numpy as np

filepath = "data/emails.csv"
# read the data into a pandas dataframe called emails
emails = pd.read_csv(filepath)
print("Successfully loaded {} rows and {} columns!".format(emails.shape[0], emails.shape[1]))
# print column names
print(emails.columns)
# print the first 5 rows of the dataset
print(emails.head())

Successfully loaded 517401 rows and 2 columns!
Index([u'file', u'message'], dtype='object')
                       file                                            message
0     allen-p/_sent_mail/1.  Message-ID: <18782981.1075855378110.JavaMail.e...
1    allen-p/_sent_mail/10.  Message-ID: <15464986.1075855378456.JavaMail.e...
2   allen-p/_sent_mail/100.  Message-ID: <24216240.1075855687451.JavaMail.e...
3  allen-p/_sent_mail/1000.  Message-ID: <13505866.1075863688222.JavaMail.e...
4  allen-p/_sent_mail/1001.  Message-ID: <30922949.1075863688243.JavaMail.e...


NumPy and pandas were imported, then the CSV file containing the e-mails was read into a DataFrame called **emails**.  Reading may take a while because of the size of the file.  Next, the shape of the dataset, the column names, and the first five rows were printed.  There are 517,401 rows and 2 columns.  
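
Because the full load is slow, it can help to prototype on a small sample first. The sketch below uses the `nrows` argument of `pd.read_csv` to read only the first few rows. The in-memory CSV text and the sample size of 2 are stand-ins for `data/emails.csv` and are assumptions made for illustration only.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for data/emails.csv (hypothetical content).
csv_text = (
    "file,message\n"
    '"a/inbox/1.","hello"\n'
    '"a/inbox/2.","world"\n'
    '"b/sent/1.","bye"\n'
)

# nrows limits how many data rows are parsed, so a sample loads quickly.
sample = pd.read_csv(io.StringIO(csv_text), nrows=2)
print("Loaded {} rows and {} columns".format(sample.shape[0], sample.shape[1]))
```

With the real file, the same call would be `pd.read_csv(filepath, nrows=...)` with whatever sample size is convenient.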

**`file`** - contains the original directory and filename of each email. The root level of this path is the employee to whom the emails belong.

**`message`** - contains the raw text of each email, headers and body together.

Notice that the values in the "file" column contain information that could be used to sort the e-mails by employee, directory, and filename.  This should be useful given the size of the dataset and the likely need to filter the e-mails for specific records.  The "file" column will be split to create three new engineered columns holding the employee, the directory, and the path.  
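
One possible way to derive these columns is sketched below. It assumes each value has the form `employee/directory/filename`, as in the sample rows; any extra nested folders would be kept inside the directory part. The helper `split_path` and the column names `employee`, `directory`, and `filename` are illustrative and do not come from the original notebook.

```python
import pandas as pd


def split_path(path):
    """Split 'employee/dir/.../filename' into (employee, directory, filename)."""
    parts = path.split("/")
    # First component: mailbox owner. Last component: message file name.
    # Everything in between is treated as the folder.
    return parts[0], "/".join(parts[1:-1]), parts[-1]


# Two rows copied from the sample output, standing in for the full dataframe.
sample = pd.DataFrame({"file": ["allen-p/_sent_mail/1.", "allen-p/_sent_mail/10."]})

# Apply the helper to every path and unzip the resulting tuples into columns.
employee, directory, filename = zip(*sample["file"].map(split_path))
sample["employee"] = list(employee)
sample["directory"] = list(directory)
sample["filename"] = list(filename)
print(sample)
```

Once such columns exist, filtering by mailbox or folder becomes a simple boolean mask, for example `sample[sample["employee"] == "allen-p"]`.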

In [50]:
# True if any cell in the dataframe is missing
null_values = emails.isnull().values.any()
if not null_values:
    print("No NaN values")
    


In [3]:
# drop any rows with missing values; the shape is unchanged if there are none
emails = emails.dropna()
print(emails.shape)

(517401, 2)


In [5]:
# print the full text of the first message, then the first 5 rows
print(emails.loc[0]["message"])
print(emails.head())


Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 
                       file                                            message
0     allen-p/_sent_mail/1.  Message-ID: <18782981.1075855378110.JavaMail.e...
1    allen-p/_sent_mail/10.  Message-ID: <15464986.1075855378456.JavaMail.e...
2   allen-p/_sent_mail/100.  Message-ID: <24216240.1075855687451.JavaMail.e...
3  allen-p/_sent_mail/1000.  Message-ID: <13505866.1075863688222.JavaMail.e...
4  allen-p/_sent_mail/1001.  Message-ID: <30922949.1075863688243.JavaMail.e...
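
Each message shown above is a raw string: a block of headers, a blank line, then the body. A minimal sketch of separating the two with Python's standard `email` module follows. The `raw` string is a shortened copy of the first message printed above, and the variable names are illustrative, not part of the original notebook.

```python
from email import message_from_string

# Shortened copy of the first message in the dataset (a few headers plus body).
raw = (
    "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
    "To: tim.belden@enron.com\n"
    "Subject: \n"
    "\n"
    "Here is our forecast\n"
)

# Parse the headers; everything after the first blank line becomes the payload.
msg = message_from_string(raw)
sender = msg["From"]
recipient = msg["To"]
body = msg.get_payload()

print("From: {}".format(sender))
print("To: {}".format(recipient))
print("Body: {}".format(body.strip()))
```

Applied to the whole `message` column, the same parsing could populate sender, recipient, and body columns for later filtering.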
