## Exploring the Enron Emails Dataset

The Enron corpus is widely described as the largest publicly available collection of real e-mails. This version of the dataset contains over 500,000 e-mails from about 150 users, mostly senior management at Enron. The corpus is valuable for research because it provides a rich example of how a real organization uses e-mail, and it has had a widespread influence on today's fraud detection software. See the [Wikipedia article](https://en.wikipedia.org/wiki/Enron_scandal) to learn more about the Enron scandal.

The purpose of this project is to explore the data for signs of fraud and to see what sort of information was leaked in the e-mails. The dataset is available [here](https://www.cs.cmu.edu/~./enron/).

## 1. Looking At The Data

In [71]:
import pandas as pd
import numpy as np

filepath = "data/emails.csv"
# read the data into a pandas dataframe called emails
emails = pd.read_csv(filepath)
print("Successfully loaded {} rows and {} columns!".format(emails.shape[0], emails.shape[1]))
# print column names
print(emails.columns)
# store column headers as a plain list
headers = list(emails.columns)
# print the first 5 rows of the dataset
print(emails.head())

Successfully loaded 517401 rows and 2 columns!
Index([u'file', u'message'], dtype='object')
                       file                                            message
0     allen-p/_sent_mail/1.  Message-ID: <18782981.1075855378110.JavaMail.e...
1    allen-p/_sent_mail/10.  Message-ID: <15464986.1075855378456.JavaMail.e...
2   allen-p/_sent_mail/100.  Message-ID: <24216240.1075855687451.JavaMail.e...
3  allen-p/_sent_mail/1000.  Message-ID: <13505866.1075863688222.JavaMail.e...
4  allen-p/_sent_mail/1001.  Message-ID: <30922949.1075863688243.JavaMail.e...


NumPy and pandas were imported, then the CSV file containing the e-mails was read into a dataframe called **`emails`**. Reading may take a while because of the size of the file. Next, the shape of the dataset, the column names and the first five rows were printed. There are 517,401 rows and 2 columns.

**`file`** - contains the original directory and filename of each email. The root level of this path is the employee (surname followed by first initial, e.g. `allen-p`) to whom the email belongs.

**`message`** - contains the raw email text, including headers such as `Message-ID`
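
Because reading the full file can be slow, it may help to take a quick look at a few rows first. The sketch below is only an illustration: it uses a small in-memory CSV with made-up rows (the `user-x` paths are hypothetical) instead of `data/emails.csv`, and shows the `nrows` and `chunksize` options of `pd.read_csv`.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for data/emails.csv (contents are made up)
csv_text = u"file,message\n" + u"".join(
    u"user-x/inbox/{0}.,body {0}\n".format(i) for i in range(10)
)

# nrows limits how many data rows are parsed, which keeps a first look fast
preview = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(preview.shape)

# chunksize returns an iterator of dataframes instead of one large frame
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=4)
total_rows = sum(len(chunk) for chunk in chunks)
print(total_rows)
```

With the real file, the same two arguments could be passed alongside `filepath`; whether that is worthwhile depends on available memory.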

## 2. Data Cleaning

### Missing Values

In [72]:
# check for null values
null_values = emails.isnull().values.any()
if not null_values:
    print("No NaN values")

No NaN values


Before creating the new column, the `emails` dataframe was checked for missing values.  In this case, there were no missing values.
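
The check above gives a single yes/no answer. If missing values had been present, a per-column count would show where they are. A minimal sketch on a toy dataframe (the rows are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the `emails` dataframe (values are made up)
toy = pd.DataFrame({
    "file": ["allen-p/_sent_mail/1.", "allen-p/_sent_mail/10.", None],
    "message": ["Message-ID: <1>", None, None],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single True/False answer, the same kind of check used in the cell above
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```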

### Adding in values

In [73]:
#create new column for employee name using data from the "file" column
employees = []

# empty list to store paths that do not split into exactly three components
outlier_dir = []
# extract "file" column
file_info = emails[headers[0]]
for row in file_info:
    tokens = row.split("/")
    if len(tokens) != 3:
        outlier_dir.append(tokens)
    employees.append(tokens[0])
# create column and set its values
emails["employee"] = employees 

In [74]:
# shows the dataframe with the appended column
print(emails.head())

                       file  \
0     allen-p/_sent_mail/1.   
1    allen-p/_sent_mail/10.   
2   allen-p/_sent_mail/100.   
3  allen-p/_sent_mail/1000.   
4  allen-p/_sent_mail/1001.   

                                             message employee  
0  Message-ID: <18782981.1075855378110.JavaMail.e...  allen-p  
1  Message-ID: <15464986.1075855378456.JavaMail.e...  allen-p  
2  Message-ID: <24216240.1075855687451.JavaMail.e...  allen-p  
3  Message-ID: <13505866.1075863688222.JavaMail.e...  allen-p  
4  Message-ID: <30922949.1075863688243.JavaMail.e...  allen-p  


In [75]:
# show the ten employees with the most emails in their mailboxes
print(emails["employee"].value_counts()[:10])

kaminski-v      28465
dasovich-j      28234
kean-s          25351
mann-k          23381
jones-t         19950
shackleton-s    18687
taylor-m        13875
farmer-d        13032
germany-c       12436
beck-s          11830
Name: employee, dtype: int64


In [76]:
print(len(outlier_dir))

26555


The e-mails could be filtered by employee name, which can be retrieved from the path found in the `file` column. Although the subfolders in the path could also be used as filters, they will remain untouched for now.

An empty list was created to hold the values for the `employee` column. The `file` column from the `emails` dataframe was extracted into a Series called `file_info`, which was then looped over to get the path string in each row. Each string was split on `/` into tokens, with the employee name located at index 0. Each name was then appended to the `employees` list, which was finally assigned as the new `employee` column.
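
The same column can also be built without an explicit loop by using pandas string methods. This is a sketch of an alternative, not the approach used in the cells above; the toy paths for `kaminski-v` and `kean-s` are hypothetical examples.

```python
import pandas as pd

# Toy stand-in for the `file` column (the last two paths are made up)
toy = pd.DataFrame({
    "file": [
        "allen-p/_sent_mail/1.",
        "kaminski-v/inbox/12.",
        "kean-s/archiving/untitled/30.",
    ]
})

# Split each path on "/" and keep the first component (the mailbox owner)
toy["employee"] = toy["file"].str.split("/").str[0]
print(toy["employee"].tolist())
```

On a dataframe of this size the vectorized form is usually shorter to write, though it does not collect the unusual paths the way the loop does.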

A sample of five rows was printed to show that the new column was added. The ten employees with the most e-mails were also printed. Note that these counts cover every folder in an employee's mailbox, not only sent mail.

Note that there were also 26,555 paths that did not split into exactly three components, meaning they contained fewer or more than three levels. Although this figure may seem large, it represents only about 5.1% of the 517,401 paths listed in the dataset.
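
As a quick check, the share of unusual paths can be recomputed from the two counts printed earlier, and the paths themselves can be found by counting separators. The sketch below assumes a toy `file` column; the `user-x` and `user-y` paths are hypothetical.

```python
import pandas as pd

# Toy stand-in for the `file` column (the last two paths are made up)
toy = pd.DataFrame({
    "file": [
        "allen-p/_sent_mail/1.",
        "allen-p/_sent_mail/10.",
        "user-x/5.",
        "user-y/folder/subfolder/7.",
    ]
})

# Number of path components = number of "/" separators + 1
depth = toy["file"].str.count("/") + 1
outliers = toy[depth != 3]
print(len(outliers))

# Share of unusual paths, using the counts printed in the cells above
outlier_share = 26555 / 517401.0 * 100
print(round(outlier_share, 1))
```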
