The first step is to load in the email and financial data with all of the features. I will convert the dictionary into a pandas dataframe for easier cleaning and manipulation.

In [257]:
import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tester
%matplotlib inline


### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi". 
payment_data = ['salary',
                'bonus',
                'long_term_incentive',
                'deferred_income',
                'deferral_payments',
                'loan_advances',
                'other',
                'expenses',                
                'director_fees', 
                'total_payments']

stock_data = ['exercised_stock_options',
              'restricted_stock',
              'restricted_stock_deferred',
              'total_stock_value']

email_data = ['to_messages',
              'from_messages',
              'from_poi_to_this_person',
              'from_this_person_to_poi',
              'shared_receipt_with_poi']
              
              
features_list = ['poi'] + payment_data + stock_data + email_data
                 # You will need to use more features

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

df = pd.DataFrame.from_dict(data_dict, orient='index')
df = df.replace('NaN', np.nan)
df = df[features_list]

In [258]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 20 columns):
poi                          146 non-null bool
salary                       95 non-null float64
bonus                        82 non-null float64
long_term_incentive          66 non-null float64
deferred_income              49 non-null float64
deferral_payments            39 non-null float64
loan_advances                4 non-null float64
other                        93 non-null float64
expenses                     95 non-null float64
director_fees                17 non-null float64
total_payments               125 non-null float64
exercised_stock_options      102 non-null float64
restricted_stock             110 non-null float64
restricted_stock_deferred    18 non-null float64
total_stock_value            126 non-null float64
to_messages                  86 non-null float64
from_messages                86 non-null float64
from_poi_to_this_person      86 non-null float

I want to convert all of the data types to floating point numbers except for the poi column which can remain as a boolean. 

According to the official documentation for the dataset, values of NaN in the financial dataset represent 0 and not unknown quantities. However, for the email data, NaNs stand for unknown information. Therefore, I will replace any financial data that is NaN with 0 but will fill in the NaNs for the email data with the mean of the column grouped by person of interest. In other words, if a person has a NaN value for 'to_messages', and they are a person of interest, I will fill in that value with the mean value of 'to_messages' for a person of interest.

In [259]:
df[payment_data] = df[payment_data].fillna(0)
df[stock_data] = df[stock_data].fillna(0)

In [260]:
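from sklearn.preprocessing import Imputer

# Note: Imputer was removed in scikit-learn 0.22; newer versions provide
# sklearn.impute.SimpleImputer instead
imp = Imputer(missing_values='NaN', strategy = 'mean', axis=0)

# Copy each group so the assignments below modify independent dataframes
df_poi = df[df['poi'] == True].copy()
df_nonpoi = df[df['poi'] == False].copy()

# Fill missing email values with the column mean within each group
df_poi.loc[:, email_data] = imp.fit_transform(df_poi.loc[:, email_data])
df_nonpoi.loc[:, email_data] = imp.fit_transform(df_nonpoi.loc[:, email_data])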


In [261]:
df = pd.concat([df_poi, df_nonpoi])
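The group-wise fill can also be sketched with pandas alone, which avoids splitting and recombining the dataframe. This is a minimal illustration on a made-up toy frame (the values and the index labels are invented, not taken from the Enron data), assuming the same rule as above: a missing email value is replaced by the mean of that column within the person's own `poi` group.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; all values are invented.
toy = pd.DataFrame(
    {
        "poi": [True, True, False, False, False],
        "to_messages": [100.0, np.nan, 10.0, 30.0, np.nan],
    },
    index=["A", "B", "C", "D", "E"],
)

# For every row, look up the mean of to_messages within that row's poi group.
# NaNs are ignored when the group mean is computed.
group_means = toy.groupby("poi")["to_messages"].transform("mean")

# Fill only the missing entries; observed values are left untouched.
toy["to_messages"] = toy["to_messages"].fillna(group_means)
print(toy)
```

Because `transform` returns a result aligned with the original index, the row order of the dataframe is preserved.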

One simple way to check for outliers or incorrect data is to add up all of the payment-related columns for each person and check whether the sum equals the total payment recorded for that individual. I can do the same for the stock columns. If the data were entered by hand, I would expect at least a few errors. 

In [262]:
df[df[payment_data[:-1]].sum(axis='columns') != df['total_payments']]

Unnamed: 0,poi,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,from_messages,from_poi_to_this_person,from_this_person_to_poi,shared_receipt_with_poi
BELFER ROBERT,False,0.0,0.0,0.0,0.0,-102500.0,0.0,0.0,0.0,3285.0,102500.0,3285.0,0.0,44093.0,-44093.0,2007.111111,668.763889,58.5,36.277778,1058.527778
BHATNAGAR SANJAY,False,0.0,0.0,0.0,0.0,0.0,0.0,137864.0,0.0,137864.0,15456290.0,2604490.0,-2604490.0,15456290.0,0.0,523.0,29.0,0.0,1.0,463.0


To correct the discrepancies, which most likely arise from incorrect data entry, I can use the official financial data gathered by FindLaw and [available through Udacity's GitHub](https://github.com/udacity/ud120-projects/blob/master/final_project/enron61702insiderpay.pdf). 
For Robert Belfer, the financial data has been shifted one column to the right, and for Sanjay Bhatnagar, the financial data has been shifted one column to the left. 

In [263]:
# The 14 financial columns sit directly after 'poi' in the dataframe
financial_columns = payment_data + stock_data

# Retrieve the incorrect data for Belfer
belfer_financial = df.loc['BELFER ROBERT', financial_columns].tolist()
# Delete the first element to shift left and add on a 0 to end as indicated in financial data
belfer_financial.pop(0)
belfer_financial.append(0)
# Reinsert corrected data
df.loc['BELFER ROBERT', financial_columns] = belfer_financial

# Retrieve the incorrect data for Bhatnagar
bhatnagar_financial = df.loc['BHATNAGAR SANJAY', financial_columns].tolist()
# Delete the last element to shift right and add on a 0 to beginning
bhatnagar_financial.pop(-1)
bhatnagar_financial = [0] + bhatnagar_financial
# Reinsert corrected data
df.loc['BHATNAGAR SANJAY', financial_columns] = bhatnagar_financial

In [264]:
len(df[df[payment_data[:-1]].sum(axis='columns') != df['total_payments']])

0

In [265]:
len(df[df[stock_data[:-1]].sum(axis='columns') != df['total_stock_value']])

0

Correcting the shifted financial data eliminated two errors. However, there may still be outliers in the dataset that need to be removed. Looking through the official financial PDF, I can see that I need to remove 'TOTAL' because it is entered as an individual. The figures are correct, but it is not a person and will be of no value when trying to identify persons of interest. Likewise, there is an entry for 'THE TRAVEL AGENCY IN THE PARK', which according to the documentation was a company co-owned by the sister of Enron's former chairman and is clearly not an individual that should be included in the dataset.  

In [266]:
df.drop(axis=0, labels=['TOTAL','THE TRAVEL AGENCY IN THE PARK'], inplace=True)

I can now look for individual outliers. However, I need to be conservative about removing outliers because the dataset is already small for machine learning. Moreover, the outliers might actually be important, as they could represent patterns in the data that would aid in identifying persons of interest. Using the [official definition of a mild outlier](http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm) as a value more than 1.5 times the interquartile range (IQR) below the first quartile or more than 1.5 times the IQR above the third quartile, I will count the number of columns in which each individual is an outlier. 

In [267]:
IQR = df.quantile(q=0.75) - df.quantile(q=0.25)

In [268]:
first_quartile = df.quantile(q=0.25)
third_quartile = df.quantile(q=0.75)

In [269]:
outliers = df[(df>(third_quartile + 1.5*IQR) ) | (df<(first_quartile - 1.5*IQR) )].count(axis=1)
outliers.sort_values(axis=0, ascending=False, inplace=True)
outliers.head()

LAY KENNETH L         15
FREVERT MARK A        12
BELDEN TIMOTHY N       9
SKILLING JEFFREY K     9
BAXTER JOHN C          8
dtype: int64
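To make the fence arithmetic concrete, here is a small sketch of the same counting rule on a toy dataframe. The column values and index labels are invented for illustration; only the rule itself (first quartile minus 1.5 × IQR, third quartile plus 1.5 × IQR) comes from the text above.

```python
import pandas as pd

# Toy data: person E is far above everyone else in both columns.
toy = pd.DataFrame(
    {
        "salary": [10.0, 12.0, 11.0, 13.0, 500.0],
        "bonus": [1.0, 2.0, 3.0, 2.0, 90.0],
    },
    index=["A", "B", "C", "D", "E"],
)

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Mild-outlier fences, computed separately for each column.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# True wherever a value falls outside its column's fences.
is_outlier = (toy < lower_fence) | (toy > upper_fence)

# Number of columns in which each person is an outlier.
outlier_counts = is_outlier.sum(axis=1).sort_values(ascending=False)
print(outlier_counts)
```

For the salary column the quartiles are 11 and 13, so the fences are 8 and 16 and only the value 500 falls outside them.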

At this point, I need to do some research before blindly deleting outliers, especially if the outliers are persons of interest. Based on the small number of persons of interest initially in the dataset, I will choose not to remove any individuals who are persons of interest regardless of the number of outliers they may have. An outlier could be a sign of fraudulent activity, as it could be evidence that someone is laundering illegal funds through the company payroll or that an accomplice is being paid to remain quiet about the activity. I will examine the top seven outliers, which is around 5% of the total dataset. 

In [270]:
outliers = outliers[:7].index.tolist()

In [271]:
outliers

['LAY KENNETH L',
 'FREVERT MARK A',
 'BELDEN TIMOTHY N',
 'SKILLING JEFFREY K',
 'BAXTER JOHN C',
 'LAVORATO JOHN J',
 'DELAINEY DAVID W']

In [272]:
df_outliers = df.loc[outliers, :]

In [273]:
df_outliers

Unnamed: 0,poi,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,from_messages,from_poi_to_this_person,from_this_person_to_poi,shared_receipt_with_poi
LAY KENNETH L,True,1072321.0,7000000.0,3600000.0,-300000.0,202911.0,81525000.0,10359729.0,99832.0,0.0,103559793.0,34348384.0,14761694.0,0.0,49110078.0,4273.0,36.0,123.0,16.0,2411.0
FREVERT MARK A,False,1060932.0,2000000.0,1617011.0,-3367011.0,6426990.0,2000000.0,7427621.0,86987.0,0.0,17252530.0,10433518.0,4188667.0,0.0,14622185.0,3275.0,21.0,242.0,6.0,2979.0
BELDEN TIMOTHY N,True,213999.0,5249999.0,0.0,-2334434.0,2144013.0,0.0,210698.0,17355.0,0.0,5501630.0,953136.0,157569.0,0.0,1110705.0,7991.0,484.0,228.0,108.0,5521.0
SKILLING JEFFREY K,True,1111258.0,5600000.0,1920000.0,0.0,0.0,0.0,22122.0,29336.0,0.0,8682716.0,19250000.0,6843672.0,0.0,26093672.0,3627.0,108.0,88.0,30.0,2042.0
BAXTER JOHN C,False,267102.0,1200000.0,1586055.0,-1386055.0,1295738.0,0.0,2660303.0,11200.0,0.0,5634343.0,6680544.0,3942714.0,0.0,10623258.0,2007.111111,668.763889,58.5,36.277778,1058.527778
LAVORATO JOHN J,False,339288.0,8000000.0,2035380.0,0.0,0.0,0.0,1552.0,49537.0,0.0,10425757.0,4158995.0,1008149.0,0.0,5167144.0,7259.0,2585.0,528.0,411.0,3962.0
DELAINEY DAVID W,True,365163.0,3000000.0,1294981.0,0.0,0.0,0.0,1661.0,86174.0,0.0,4747979.0,2291113.0,1323148.0,0.0,3614261.0,3093.0,3069.0,66.0,609.0,2097.0


There are a few considerations to make here:
1. Kenneth Lay, [the CEO of Enron from 1986-2001](http://www.biography.com/people/kenneth-lay-234611), presided over many of the illegal business activities and hence is one of the most important persons of interest. 
2. Mark Frevert served as chief executive of [Enron Europe from 1986-2000 and was appointed as chairman of Enron in 2001](http://www.risk.net/risk-management/2123422/ten-years-after-its-collapse-enron-lives-energy-markets). He was a major player in the firm, although not a person of interest. I believe that he is not representative of the average employee at Enron during this time because of his substantial compensation and will remove him from the dataset. 
3. Timothy Belden was the [former head of trading for Enron](http://articles.latimes.com/2007/feb/15/business/fi-enron15) who developed the strategy to illegally raise energy prices in California. He was a person of interest and will definitely remain in the dataset. 
4. Jeffrey Skilling [replaced Kenneth Lay as CEO of Enron in 2001 and orchestrated much of the fraud](http://www.biography.com/people/jeffrey-skilling-235386) that destroyed Enron. As a person of interest, he will remain in the dataset. 
5. John Baxter was a former Enron vice chairman and [died of an apparent self-inflicted gunshot](https://www.wsws.org/en/articles/2002/01/enro-j28.html) before he was able to testify against other Enron executives. I will remove him from the dataset as he is not a person of interest. 
6. John Lavorato was a top executive in the energy-trading branch of Enron and received large bonuses to [keep him from leaving Enron](http://www.nytimes.com/2002/06/18/business/officials-got-a-windfall-before-enron-s-collapse.html). As he was not a person of interest, and the large bonus ended up skewing his total pay towards the top of the range, I think it would be appropriate to remove him from the dataset. 
7. Lawrence Whalley [served as the president of Enron](http://www.corpwatch.org/article.php?id=13194) and fired Andrew Fastow once the severity of Enron's situation became apparent. He was investigated thoroughly but not identified as a person of interest and will therefore be removed from the dataset.  

In total, that is four people to remove from the dataset. I believe these removals are justified primarily because none of these individuals were persons of interest and they all were upper-level executives with pay levels far above the average employee. I hesitate to remove any samples from the data, but I believe that removing these individuals will improve the quality of the classifier. I can try with and without removing these individuals and measure the accuracy, precision, and recall of the classifier to determine if my choice was justified.  
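One way such a with-and-without comparison could be run is sketched below. It is an assumption-laden illustration, not the project's `tester.py`: the `evaluate` helper is hypothetical, the data is synthetic, and the four dropped rows simply stand in for the removed executives. It assumes a scikit-learn version that provides `sklearn.model_selection.StratifiedShuffleSplit`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier


def evaluate(features, labels, n_splits=20, seed=42):
    """Hypothetical helper: pool predictions over stratified splits and score them."""
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.3,
                                      random_state=seed)
    y_true, y_pred = [], []
    for train_idx, test_idx in splitter.split(features, labels):
        clf = DecisionTreeClassifier(criterion="entropy", random_state=seed)
        clf.fit(features[train_idx], labels[train_idx])
        y_true.extend(labels[test_idx])
        y_pred.extend(clf.predict(features[test_idx]))
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }


# Synthetic stand-in data: 12 positives and 48 negatives with 3 features.
rng = np.random.RandomState(0)
labels = np.array([1] * 12 + [0] * 48)
features = rng.normal(size=(60, 3))
features[labels == 1, 0] += 2.0  # make the positive class separable on feature 0

# Hypothetical removal: drop four negative rows, mimicking the four non-POIs.
keep = np.ones(len(labels), dtype=bool)
keep[-4:] = False

scores_all = evaluate(features, labels)
scores_trimmed = evaluate(features[keep], labels[keep])
print("all rows:     ", scores_all)
print("after removal:", scores_trimmed)
```

With the real data, `features` and `labels` would come from the cleaned dataframe before and after the `drop` call, and the two score dictionaries could be compared side by side.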

In [274]:
df.drop(axis=0, labels=['FREVERT MARK A', 'LAVORATO JOHN J', 'WHALLEY LAWRENCE G', 'BAXTER JOHN C'], inplace=True)

In [275]:
len(df)

140

In [276]:
df['poi'].value_counts()

False    122
True      18
Name: poi, dtype: int64

In [277]:
df.isnull().sum().sum()

0

In [278]:
df[df==0].count().sum()

1150

There are a total of 2800 observations of financial and email data in the set now that the data cleaning has been finished. Of these, __1150 or 41%__ are 0 financial values. There are 18 persons of interest, comprising __12.9%__ of the individuals. 
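The percentages quoted above follow directly from the counts printed in the preceding cells. The short check below reproduces the arithmetic; the variable names are mine, and the numbers are the ones reported by `len(df)`, `df.info()`, and the zero count.

```python
n_people = 140    # rows remaining after cleaning
n_columns = 20    # columns in the dataframe, including 'poi'
n_zeros = 1150    # cells equal to 0
n_poi = 18        # persons of interest

total_cells = n_people * n_columns

# Multiply by 100.0 so the division is floating point in Python 2 and 3 alike.
zero_pct = 100.0 * n_zeros / total_cells
poi_pct = 100.0 * n_poi / n_people

print(total_cells)         # 2800
print(round(zero_pct, 1))  # 41.1
print(round(poi_pct, 1))   # 12.9
```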

The next step is to begin training some classifiers with default parameters in order to identify existing features that are most predictive of persons of interest. After that, if the classifier performance is low, I will try to devise additional features and then fine-tune the algorithms. 

In [286]:
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import scale
from sklearn.tree import DecisionTreeClassifier

df.dropna(inplace=True)
df.iloc[:, 1:] = scale(df.iloc[:, 1:])
data_dict = df.to_dict(orient='index')

my_dataset = data_dict
data = featureFormat(my_dataset, features_list, sort_keys=True)
labels, features = targetFeatureSplit(data)

# Create the classifier: a decision tree that splits on entropy, otherwise default parameters
clf = DecisionTreeClassifier(criterion='entropy')
dump_classifier_and_data(clf, my_dataset, features_list)
# tester.main() prints the evaluation scores and returns None, so clf is None afterwards
clf = tester.main()

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	Accuracy: 0.86429	Precision: 0.52941	Recall: 0.45000	F1: 0.48649	F2: 0.46392
	Total predictions:  140	True positives:    9	False positives:    8	False negatives:   11	True negatives:  112
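The scores printed by the tester can be reproduced from the four confusion counts in the last output line. The sketch below uses the standard definitions of accuracy, precision, recall, and the F-beta score; the helper name `f_beta` is my own.

```python
# Confusion counts copied from the tester output above.
tp, fp, fn, tn = 9.0, 8.0, 11.0, 112.0

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # share of predicted POIs that really are POIs
recall = tp / (tp + fn)      # share of real POIs that were found


def f_beta(p, r, beta):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


f1 = f_beta(precision, recall, 1)
f2 = f_beta(precision, recall, 2)

print(round(accuracy, 5), round(precision, 5), round(recall, 5))
print(round(f1, 5), round(f2, 5))
```

The rounded values agree with the tester's line: 0.86429, 0.52941, 0.45, 0.48649, and 0.46392.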



In [287]:
type(clf)

NoneType

In [288]:
len(df)

140