The first step is to load in the email and financial data with all of the features. I will convert the dictionary into a pandas dataframe for easier cleaning and manipulation.

In [1]:
import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tester
%matplotlib inline


### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi". 
payment_data = ['salary',
                'bonus',
                'long_term_incentive',
                'deferred_income',
                'deferral_payments',
                'loan_advances',
                'other',
                'expenses',                
                'director_fees', 
                'total_payments']

stock_data = ['exercised_stock_options',
              'restricted_stock',
              'restricted_stock_deferred',
              'total_stock_value']

email_data = ['to_messages',
              'from_messages',
              'from_poi_to_this_person',
              'from_this_person_to_poi',
              'shared_receipt_with_poi']
              
              
features_list = ['poi'] + payment_data + stock_data + email_data
                 # You will need to use more features

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

df = pd.DataFrame.from_dict(data_dict, orient='index')
df = df.replace('NaN', np.nan)
df = df[features_list]

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 20 columns):
poi                          146 non-null bool
salary                       95 non-null float64
bonus                        82 non-null float64
long_term_incentive          66 non-null float64
deferred_income              49 non-null float64
deferral_payments            39 non-null float64
loan_advances                4 non-null float64
other                        93 non-null float64
expenses                     95 non-null float64
director_fees                17 non-null float64
total_payments               125 non-null float64
exercised_stock_options      102 non-null float64
restricted_stock             110 non-null float64
restricted_stock_deferred    18 non-null float64
total_stock_value            126 non-null float64
to_messages                  86 non-null float64
from_messages                86 non-null float64
from_poi_to_this_person      86 non-null float

I want to convert all of the data types to floating point numbers except for the poi column which can remain as a boolean. 

According to the official documentation for the dataset, values of NaN in the financial dataset represent 0 and not unknown quantities. However, for the email data, NaNs stand for unknown information. Therefore, I will replace any financial data that is NaN with 0 but will fill in the NaNs for the email data with the mean of the column grouped by person of interest. In other words, if a person has a NaN value for 'to_messages', and they are a person of interest, I will fill in that value with the mean value of 'to_messages' for persons of interest.

In [3]:
df[payment_data] = df[payment_data].fillna(0)
df[stock_data] = df[stock_data].fillna(0)

In [4]:
from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy = 'mean', axis=0)

df_poi = df[df['poi'] == True]
df_nonpoi = df[df['poi']==False]

df_poi.ix[:, email_data] = imp.fit_transform(df_poi.ix[:,email_data])
df_nonpoi.ix[:, email_data] = imp.fit_transform(df_nonpoi.ix[:,email_data])


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [5]:
df = df_poi.append(df_nonpoi)
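
The group-wise fill above can also be expressed without splitting the frame into two pieces. The following is a minimal sketch on a made-up toy frame, assuming the per-group column mean is the desired fill value; the names `toy` and `fill_by_group_mean` are illustrative and not part of the project code.

```python
import numpy as np
import pandas as pd

def fill_by_group_mean(frame, group_col, value_cols):
    """Return a copy of frame with NaNs in value_cols replaced by the group mean."""
    filled = frame.copy()
    # transform('mean') returns a frame aligned with the original rows,
    # holding the mean of each row's group (NaNs are skipped)
    group_means = filled.groupby(group_col)[value_cols].transform('mean')
    filled[value_cols] = filled[value_cols].fillna(group_means)
    return filled

# Toy data: two persons of interest and three others, with one gap in each group
toy = pd.DataFrame({
    'poi': [True, True, False, False, False],
    'to_messages': [100.0, np.nan, 10.0, 30.0, np.nan],
})
toy_filled = fill_by_group_mean(toy, 'poi', ['to_messages'])
print(toy_filled)
```

Working on a copy and assigning whole columns sidesteps the chained-assignment pattern that triggers pandas' "copy of a slice" warning.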

One simple way to check for outliers/incorrect data is to add up all of the payment related columns for each person and see if that is equal to the total payment recorded for the individual. I can also do the same for stock payments. If the data was entered by hand, I would expect that there would be at least a few errors. 

In [6]:
df[df[payment_data[:-1]].sum(axis='columns') != df['total_payments']]

Unnamed: 0,poi,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,from_messages,from_poi_to_this_person,from_this_person_to_poi,shared_receipt_with_poi
BELFER ROBERT,False,0.0,0.0,0.0,0.0,-102500.0,0.0,0.0,0.0,3285.0,102500.0,3285.0,0.0,44093.0,-44093.0,2007.111111,668.763889,58.5,36.277778,1058.527778
BHATNAGAR SANJAY,False,0.0,0.0,0.0,0.0,0.0,0.0,137864.0,0.0,137864.0,15456290.0,2604490.0,-2604490.0,15456290.0,0.0,523.0,29.0,0.0,1.0,463.0


In order to correct the discrepancies, which most likely arise from incorrect data entry, I can use the official financial data gathered by FindLaw and [available through Udacity's GitHub](https://github.com/udacity/ud120-projects/blob/master/final_project/enron61702insiderpay.pdf). 
For Robert Belfer, the financial data has been shifted one column to the right, and for Sanjay Bhatnagar, the financial data has been shifted one column to the left. 

In [7]:
# Retrieve the incorrect data for Belfer
belfer_financial = df.ix['BELFER ROBERT', 1:15].tolist()
# Delete the first element to shift left and add on a 0 to end as indicated in financial data
belfer_financial.pop(0)
belfer_financial.append(0)
# Reinsert corrected data
df.ix['BELFER ROBERT', 1:15] = belfer_financial

# Retrieve the incorrect data for Bhatnagar
bhatnagar_financial = df.ix['BHATNAGAR SANJAY', 1:15].tolist()
# Delete the last element to shift right and add on a 0 to beginning
bhatnagar_financial.pop(-1)
bhatnagar_financial = [0] + bhatnagar_financial
# Reinsert corrected data
df.ix['BHATNAGAR SANJAY', 1:15] = bhatnagar_financial

In [8]:
len(df[df[payment_data[:-1]].sum(axis='columns') != df['total_payments']])

0

In [9]:
len(df[df[stock_data[:-1]].sum(axis='columns') != df['total_stock_value']])

0

Correcting the shifted financial data eliminated two errors. However, there may still be outliers in the dataset that need to be removed. Looking through the official financial PDF, I can see that I need to remove 'TOTAL' as it is entered as an individual (even though this is correct data, it is not a person and will be of no value when trying to identify persons of interest). Likewise, there is an entry for 'THE TRAVEL AGENCY IN THE PARK', which according to the documentation was a company co-owned by Enron's former Chairman's sister and is clearly not an individual that should be included in the dataset.  

In [10]:
df.drop(axis=0, labels=['TOTAL','THE TRAVEL AGENCY IN THE PARK'], inplace=True)

I can now look for individual outliers. However, I will need to be conservative in terms of removing the outliers because the dataset is rather small for machine learning in the first place. Moreover, the outliers might actually be important as they could represent patterns in the data that would aid in the identification of persons of interest. Using the [official definition of a mild outlier](http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm) as a value more than 1.5 times the interquartile range (IQR) below the first quartile or more than 1.5 times the IQR above the third quartile, I will count the number of columns in which each individual is an outlier. 

In [11]:
IQR = df.quantile(q=0.75) - df.quantile(q=0.25)

In [12]:
first_quartile = df.quantile(q=0.25)
third_quartile = df.quantile(q=0.75)

In [13]:
outliers = df[(df>(third_quartile + 1.5*IQR) ) | (df<(first_quartile - 1.5*IQR) )].count(axis=1)
outliers.sort_values(axis=0, ascending=False, inplace=True)
outliers.head(12)

LAY KENNETH L         15
FREVERT MARK A        12
BELDEN TIMOTHY N       9
SKILLING JEFFREY K     9
BAXTER JOHN C          8
LAVORATO JOHN J        8
DELAINEY DAVID W       7
KEAN STEVEN J          7
HAEDICKE MARK E        7
WHALLEY LAWRENCE G     7
RICE KENNETH D         6
KITCHEN LOUISE         6
dtype: int64
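
The outlier count above can be wrapped in a small reusable function. This is a sketch under the assumption that every column in the frame is numeric; the function name `count_iqr_outliers`, the multiplier argument `k`, and the toy values are illustrative only.

```python
import pandas as pd

def count_iqr_outliers(frame, k=1.5):
    """Count, for each row, the columns whose value lies outside the IQR fences."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparing a frame with a Series aligns the Series index with the frame's columns
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum(axis=1).sort_values(ascending=False)

# Toy data: row E is far from the rest in both columns
toy = pd.DataFrame(
    {'salary': [1.0, 2.0, 3.0, 4.0, 100.0],
     'bonus': [10.0, 11.0, 12.0, 13.0, 500.0]},
    index=['A', 'B', 'C', 'D', 'E'])
counts = count_iqr_outliers(toy)
print(counts)
```

Raising `k` to 3.0 would restrict the count to what the same NIST page calls extreme outliers.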

At this point, I need to do some research before blindly deleting outliers, especially if the outliers are persons of interest. Based on the small number of persons of interest initially in the dataset, I will choose to not remove any individuals who are persons of interest regardless of the number of outliers they may have. An outlier could be a sign of fraudulent activity, as it could be evidence that someone is laundering illegal funds through the company payroll, or that an accomplice is being paid to remain quiet about the activity. I will pull the top ten individuals by outlier count and examine seven of them in detail, which is around 5% of the total dataset. 

In [14]:
outliers = outliers[:10].index.tolist()

In [15]:
outliers

['LAY KENNETH L',
 'FREVERT MARK A',
 'BELDEN TIMOTHY N',
 'SKILLING JEFFREY K',
 'BAXTER JOHN C',
 'LAVORATO JOHN J',
 'DELAINEY DAVID W',
 'KEAN STEVEN J',
 'HAEDICKE MARK E',
 'WHALLEY LAWRENCE G']

In [16]:
df_outliers = df.ix[outliers, :]

In [17]:
df_outliers

Unnamed: 0,poi,salary,bonus,long_term_incentive,deferred_income,deferral_payments,loan_advances,other,expenses,director_fees,total_payments,exercised_stock_options,restricted_stock,restricted_stock_deferred,total_stock_value,to_messages,from_messages,from_poi_to_this_person,from_this_person_to_poi,shared_receipt_with_poi
LAY KENNETH L,True,1072321.0,7000000.0,3600000.0,-300000.0,202911.0,81525000.0,10359729.0,99832.0,0.0,103559793.0,34348384.0,14761694.0,0.0,49110078.0,4273.0,36.0,123.0,16.0,2411.0
FREVERT MARK A,False,1060932.0,2000000.0,1617011.0,-3367011.0,6426990.0,2000000.0,7427621.0,86987.0,0.0,17252530.0,10433518.0,4188667.0,0.0,14622185.0,3275.0,21.0,242.0,6.0,2979.0
BELDEN TIMOTHY N,True,213999.0,5249999.0,0.0,-2334434.0,2144013.0,0.0,210698.0,17355.0,0.0,5501630.0,953136.0,157569.0,0.0,1110705.0,7991.0,484.0,228.0,108.0,5521.0
SKILLING JEFFREY K,True,1111258.0,5600000.0,1920000.0,0.0,0.0,0.0,22122.0,29336.0,0.0,8682716.0,19250000.0,6843672.0,0.0,26093672.0,3627.0,108.0,88.0,30.0,2042.0
BAXTER JOHN C,False,267102.0,1200000.0,1586055.0,-1386055.0,1295738.0,0.0,2660303.0,11200.0,0.0,5634343.0,6680544.0,3942714.0,0.0,10623258.0,2007.111111,668.763889,58.5,36.277778,1058.527778
LAVORATO JOHN J,False,339288.0,8000000.0,2035380.0,0.0,0.0,0.0,1552.0,49537.0,0.0,10425757.0,4158995.0,1008149.0,0.0,5167144.0,7259.0,2585.0,528.0,411.0,3962.0
DELAINEY DAVID W,True,365163.0,3000000.0,1294981.0,0.0,0.0,0.0,1661.0,86174.0,0.0,4747979.0,2291113.0,1323148.0,0.0,3614261.0,3093.0,3069.0,66.0,609.0,2097.0
KEAN STEVEN J,False,404338.0,1000000.0,300000.0,0.0,0.0,0.0,1231.0,41953.0,0.0,1747522.0,2022048.0,4131594.0,0.0,6153642.0,12754.0,6759.0,140.0,387.0,3639.0
HAEDICKE MARK E,False,374125.0,1150000.0,983346.0,-934484.0,2157527.0,0.0,52382.0,76169.0,0.0,3859065.0,608750.0,524169.0,-329825.0,803094.0,4009.0,1941.0,180.0,61.0,1847.0
WHALLEY LAWRENCE G,False,510364.0,3000000.0,808346.0,0.0,0.0,0.0,301026.0,57838.0,0.0,4677574.0,3282960.0,2796177.0,0.0,6079137.0,6019.0,556.0,186.0,24.0,3920.0


There are a few considerations to make here:
1. Kenneth Lay, [the CEO of Enron from 1986-2001](http://www.biography.com/people/kenneth-lay-234611), presided over many of the illegal business activities and hence is one of the most important persons of interest. 
2. Mark Frevert served as chief executive of [Enron Europe from 1986-2000 and was appointed as chairman of Enron in 2001](http://www.risk.net/risk-management/2123422/ten-years-after-its-collapse-enron-lives-energy-markets). He was a major player in the firm, although not a person of interest. I believe that he is not representative of the average employee at Enron during this time because of his substantial compensation and will remove him from the dataset. 
3. Timothy Belden was the [former head of trading for Enron](http://articles.latimes.com/2007/feb/15/business/fi-enron15) who developed the strategy to illegally raise energy prices in California. He was a person of interest and will definitely remain in the dataset. 
4. Jeffrey Skilling [replaced Kenneth Lay as CEO of Enron in 2001 and orchestrated much of the fraud](http://www.biography.com/people/jeffrey-skilling-235386) that destroyed Enron. As a person of interest, he will remain in the dataset. 
5. John Baxter was a former Enron vice chairman and [died of an apparent self-inflicted gunshot](https://www.wsws.org/en/articles/2002/01/enro-j28.html) before he was able to testify against other Enron executives. I will remove him from the dataset as he is not a person of interest. 
6. John Lavorato was a top executive in the energy-trading branch of Enron and received large bonuses to [keep him from leaving Enron](http://www.nytimes.com/2002/06/18/business/officials-got-a-windfall-before-enron-s-collapse.html). As he was not a person of interest, and the large bonus ended up skewing his total pay towards the top of the range, I think it would be appropriate to remove him from the dataset. 
7. Lawrence Whalley [served as the president of Enron](http://www.corpwatch.org/article.php?id=13194) and fired Andrew Fastow once the severity of Enron's situation was apparent. He was investigated thoroughly but not identified as a person of interest and therefore will be removed from the dataset.  

In total, that is four people to remove from the dataset. I believe these removals are justified primarily because none of these individuals were persons of interest and they all were upper-level executives with pay levels far above the average employee. I hesitate to remove any samples from the data, but I believe that removing these individuals will improve the quality of the classifier. I can try with and without removing these individuals and measure the accuracy, precision, and recall of the classifier to determine if my choice was justified.  

In [18]:
df.drop(axis=0, labels=['FREVERT MARK A', 'LAVORATO JOHN J', 'WHALLEY LAWRENCE G', 'BAXTER JOHN C'], inplace=True)

In [19]:
len(df)

140

In [20]:
df['poi'].value_counts()

False    122
True      18
Name: poi, dtype: int64

In [21]:
df.isnull().sum().sum()

0

In [22]:
df[df==0].count().sum()

1150

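There are a total of 2800 values in the set (140 individuals across 20 columns) now that the data cleaning has been finished. Of these, __1150 or 41%__ are zeros. Note that this count is taken over every column, so it also includes `False` entries in the poi column and zero-valued email counts, not only financial fields. There are 18 persons of interest, comprising __12.9%__ of the individuals. 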

The next step is to begin training some classifiers with default parameters in order to identify the existing features that are most predictive of persons of interest. After that, if the classifier performance is low, I will try to devise additional features and then fine-tune the algorithms. 

### First Algorithm Testing

The four algorithms I have selected for initial testing are Gaussian Naive Bayes (GaussianNB), DecisionTreeClassifier, Support Vector Classifier (SVC), and KMeans. I will run all of the algorithms with the default parameters except I will alter the kernel used in the Support Vector Machine to be linear. I will also select 2 to be the number of clusters for KMeans as I know in advance that there are two categories I want to identify. Although accuracy would seem to be the obvious choice for evaluating the quality of a classifier, accuracy can be a crude measure at times and is not suited for some datasets including this one. For example, if a classifier were to guess that all of the samples in my cleaned dataset were _not_ persons of interest, it would have an accuracy of 87.1%. However, this clearly would not satisfy the objective of this investigation which is to create a classifier that can identify persons of interest. Therefore, different metrics are needed to evaluate the tuned algorithm (a classifier is the algorithm plus the parameters selected) to gauge its effectiveness. The two selected for this project are [Precision and Recall](https://en.wikipedia.org/wiki/Precision_and_recall).
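
The accuracy figure quoted above for an always-negative guesser can be checked with a few lines of arithmetic. This is only a sketch using the counts from the cleaned dataset (140 individuals, 18 persons of interest); the variable names are illustrative.

```python
n_total = 140   # individuals left after cleaning
n_poi = 18      # persons of interest among them

# A "classifier" that labels everyone as not a person of interest
# gets every non-POI right and every POI wrong
correct = n_total - n_poi
accuracy = correct / float(n_total)
pois_found = 0

print('accuracy = {:.3f}, POIs found = {}'.format(accuracy, pois_found))
```

High accuracy with zero persons of interest found is exactly the failure mode that accuracy alone cannot reveal on an imbalanced dataset.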

* __Precision__ is the number of correct positive results over the total number of positive labels assigned. In other words, it is the fraction of persons of interest predicted by the algorithm that are truly persons of interest. Mathematically, precision is defined as

$$ precision = \frac{true\ positives}{true\ positives + false\ positives} $$ 

* __Recall__ is the number of correct positive results divided by the number of positive results that should have been identified. In other words, it is the fraction of the total number of persons of interest that the classifier correctly labels. Mathematically, recall is defined as

$$ recall = \frac{true\ positives}{true\ positives + false\ negatives} $$ 

Precision is also known as positive predictive value while recall is the sensitivity of the classifier. A combined measure of precision and recall is the [__F1 score__](https://en.wikipedia.org/wiki/F1_score). It is the harmonic mean of precision and recall. Mathematically, the F1 score is defined as:

$$ F1\ Score = \frac{2 \times precision \times recall}{precision + recall} $$

For this project, the objective was a precision and a recall greater than 0.3. However, I believe it is possible to do much better than that with the right feature selection and algorithm tuning. 
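
As a worked example of the three formulas, the sketch below plugs in the confusion-matrix counts that the tester reports for the GaussianNB run on the original features in this notebook (17 true positives, 38 false positives, 3 false negatives). The helper name `precision_recall_f1` is illustrative, and it assumes at least one positive prediction and one true positive so that no denominator is zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts reported by the tester for GaussianNB on the original features
p, r, f1 = precision_recall_f1(tp=17, fp=38, fn=3)
print('precision = {:.5f}, recall = {:.5f}, F1 = {:.5f}'.format(p, r, f1))
```

The three values agree with the precision, recall, and F1 figures the tester prints for that run.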

The only preparation I will do for this initial testing of the different algorithms is to scale the data such that each feature has zero mean and unit variance. This process is a form of [normalization](http://www.analytictech.com/ba762/handouts/normalization.htm) usually called standardization, and it is accomplished using the scale function from the sklearn preprocessing module. 
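
To make the scaling step concrete, here is a minimal NumPy sketch of the same transformation: subtract each column's mean and divide by its population standard deviation. The function name `standardize` and the sample salaries are made up, and unlike sklearn's scale function this sketch assumes no column is constant (a constant column would cause a division by zero here).

```python
import numpy as np

def standardize(x):
    """Scale each column to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    # np.std defaults to the population standard deviation (ddof=0)
    return (x - x.mean(axis=0)) / x.std(axis=0)

salaries = np.array([[200000.0], [300000.0], [400000.0]])
z = standardize(salaries)
print(z.ravel())
```

After the transformation, a value of 0 means "average for this feature", which puts dollar amounts and email counts on a comparable footing.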

In [23]:
from sklearn.preprocessing import scale
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier
from tester import test_classifier

scaled_df = df.copy()
scaled_df.ix[:,1:] = scale(scaled_df.ix[:,1:])
data_dict = scaled_df.to_dict(orient='index')

my_dataset = data_dict
data = featureFormat(my_dataset, features_list, sort_keys=True)
labels, features = targetFeatureSplit(data)

# Create the classifier, GaussianNB has no parameters to tune
clf = GaussianNB()
dump_classifier_and_data(clf, my_dataset, features_list)
clf = tester.main()

	Accuracy: 0.70714	Precision: 0.30909	Recall: 0.85000	F1: 0.45333	F2: 0.62963
	Total predictions:  140	True positives:   17	False positives:   38	False negatives:    3	True negatives:   82



In [24]:
len(df)

140

In [25]:
clf = DecisionTreeClassifier()
clf = test_classifier(clf, my_dataset, features_list)
clf.feature_importances_

	Accuracy: 0.82857	Precision: 0.40000	Recall: 0.40000	F1: 0.40000	F2: 0.40000
	Total predictions:  140	True positives:    8	False positives:   12	False negatives:   12	True negatives:  108



array([ 0.04772727,  0.22272727,  0.19772727,  0.10738636,  0.        ,
        0.        ,  0.02160393,  0.11853147,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.17897727,
        0.        ,  0.        ,  0.        ,  0.10531915])

In [26]:
features_list

['poi',
 'salary',
 'bonus',
 'long_term_incentive',
 'deferred_income',
 'deferral_payments',
 'loan_advances',
 'other',
 'expenses',
 'director_fees',
 'total_payments',
 'exercised_stock_options',
 'restricted_stock',
 'restricted_stock_deferred',
 'total_stock_value',
 'to_messages',
 'from_messages',
 'from_poi_to_this_person',
 'from_this_person_to_poi',
 'shared_receipt_with_poi']

In [27]:
clf = SVC(kernel='linear')
dump_classifier_and_data(clf, my_dataset, features_list)
tester.main()

	Accuracy: 0.86429	Precision: 0.57143	Recall: 0.20000	F1: 0.29630	F2: 0.22989
	Total predictions:  140	True positives:    4	False positives:    3	False negatives:   16	True negatives:  117



In [28]:
clf = KMeans(n_clusters=2)
dump_classifier_and_data(clf, my_dataset, features_list)
tester.main()

	Accuracy: 0.72143	Precision: 0.12000	Recall: 0.15000	F1: 0.13333	F2: 0.14286
	Total predictions:  140	True positives:    3	False positives:   22	False negatives:   17	True negatives:   98



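The results from running the four classifiers on the entire feature set with no algorithm tuning are summarized in the table below. DecisionTreeClassifier and KMeans were run without a fixed random seed, so their scores change from run to run, which is why the table rows for those two do not match the cell outputs printed above.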

| Classifier            | Precision | Recall  | F1 Score | Accuracy |
|-----------------------|-----------|---------|----------|----------|
| GaussianNB            | 0.30909   | 0.85000 | 0.45333  | 0.70714  |
| DecisionTree          | 0.5000    | 0.45000 | 0.47368  | 0.85714  |
| SVC (kernel='linear') | 0.57143   | 0.2000  | 0.29630  | 0.86429  |
| KMeans (n_clusters=2) | 0.17647   | 0.15000 | 0.16216  | 0.77857  |

From the first run through the four algorithms, I can see that the decision tree performed best by F1 score, followed by the Gaussian Naive Bayes, support vector machine, and KMeans clustering. In fact, the decision tree and naive Bayes classifiers both perform well enough to meet the standards for the project. Nonetheless, there is much work that can be done to improve these metrics. 

### New Features

The standard features for the dataset perform adequately but not exceptionally. I want to define new features that will allow for more accurate predictions of poi. Additionally, I want to perform feature reduction using PCA in order to choose the dimensions that have the greatest variance and, hopefully, predictive power. 

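Three new features can be created from the emails at the moment. The first, to_poi_ratio, is the ratio of emails an individual received from persons of interest to all emails addressed to that person. The second, from_poi_ratio, is the ratio of emails an individual sent to persons of interest to all emails sent by that person. The third, shared_poi_ratio, is the ratio of email receipts shared with a person of interest to all emails addressed to that individual. (Despite their names, to_poi_ratio is built from from_poi_to_this_person and from_poi_ratio from from_this_person_to_poi.) 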

In [29]:
df['to_poi_ratio'] = df['from_poi_to_this_person'] / df['to_messages']
df['from_poi_ratio'] = df['from_this_person_to_poi'] / df['from_messages']
df['shared_poi_ratio'] = df['shared_receipt_with_poi'] / df['to_messages']
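
Ratio features like these are undefined when the denominator is zero: pandas returns NaN for 0/0 and infinity for a nonzero value divided by 0. The sketch below shows one way to guard against both on a made-up toy frame. The helper name `safe_ratio` and the choice of 0 as the fallback value are assumptions for illustration, not part of the project code.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise ratio that falls back to 0 where the result is undefined."""
    ratio = numerator / denominator
    # x/0 gives +/-inf and 0/0 gives NaN; map both to 0
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

toy = pd.DataFrame({
    'from_poi_to_this_person': [10.0, 0.0, 5.0],
    'to_messages': [100.0, 0.0, 0.0],
})
toy['to_poi_ratio'] = safe_ratio(toy['from_poi_to_this_person'], toy['to_messages'])
print(toy)
```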

In [30]:
features_list.append('to_poi_ratio')
features_list.append('from_poi_ratio')
features_list.append('shared_poi_ratio')

In [31]:
features_list

['poi',
 'salary',
 'bonus',
 'long_term_incentive',
 'deferred_income',
 'deferral_payments',
 'loan_advances',
 'other',
 'expenses',
 'director_fees',
 'total_payments',
 'exercised_stock_options',
 'restricted_stock',
 'restricted_stock_deferred',
 'total_stock_value',
 'to_messages',
 'from_messages',
 'from_poi_to_this_person',
 'from_this_person_to_poi',
 'shared_receipt_with_poi',
 'to_poi_ratio',
 'from_poi_ratio',
 'shared_poi_ratio']

In [32]:
df['total_payments']

BELDEN TIMOTHY N                 5501630.0
BOWEN JR RAYMOND M               2669589.0
CALGER CHRISTOPHER F             1639297.0
CAUSEY RICHARD A                 1868758.0
COLWELL WESLEY                   1490344.0
DELAINEY DAVID W                 4747979.0
FASTOW ANDREW S                  2424083.0
GLISAN JR BEN F                  1272284.0
HANNON KEVIN P                    288682.0
HIRKO JOSEPH                       91093.0
KOENIG MARK E                    1587421.0
KOPPER MICHAEL J                 2652612.0
LAY KENNETH L                  103559793.0
RICE KENNETH D                    505050.0
RIEKER PAULA H                   1099100.0
SHELBY REX                       2003885.0
SKILLING JEFFREY K               8682716.0
YEAGER F SCOTT                    360300.0
ALLEN PHILLIP K                  4484442.0
BADUM JAMES P                     182466.0
BANNANTINE JAMES M                916197.0
BAY FRANKLIN R                    827696.0
BAZELIDES PHILIP J                860136.0
BECK SALLY 

At this point I will also create some new features using the financial data. I have a few theories that I formed from my initial data exploration and reading about the Enron case. I think that people receiving large bonuses may be more likely to be persons of interest because the bonuses could be a result of fraudulent activity, or perhaps a bribe to keep someone quiet. Whatever the case may be, I will create two new features that are the bonus in relation to the salary, and the bonus in relation to total payments. I am creating lots of extra features at this point and now have a total of 25 entries in features_list, counting the poi label. However, I will perform feature reduction/selection eventually so I am not worried about the large number of features. Moreover, the algorithms I am using are able to train relatively quickly even with the large number of features because the total number of data samples is small. 

In [33]:
df['bonus_to_salary'] = df['bonus'] / df['salary']
df['bonus_to_total'] = df['bonus'] / df['total_payments']
features_list.append('bonus_to_salary')
features_list.append('bonus_to_total')
                          

In [34]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import f_classif, SelectKBest

In [35]:
len(df)

140

In [36]:
# Fill any NaN values (for example, ratios computed as 0/0) with a 0
df.fillna(value= 0, inplace=True)

# Create a copy of the dataframe and normalize it to zero mean and unit variance
scaled_df = df.copy()
scaled_df.ix[:,1:] = scale(scaled_df.ix[:,1:])
data_dict = scaled_df.to_dict(orient='index')

# Extract the labels and features in the format expected by the tester
my_dataset = data_dict
data = featureFormat(my_dataset, features_list, sort_keys=True)
labels, features = targetFeatureSplit(data)

# Create the classifier, GaussianNB has no parameters to tune
clf = GaussianNB()
clf = test_classifier(clf, my_dataset, features_list)

	Accuracy: 0.72857	Precision: 0.32000	Recall: 0.80000	F1: 0.45714	F2: 0.61538
	Total predictions:  140	True positives:   16	False positives:   34	False negatives:    4	True negatives:   86



In [37]:
clf = DecisionTreeClassifier()
clf = test_classifier(clf, my_dataset, features_list)

	Accuracy: 0.90714	Precision: 0.68421	Recall: 0.65000	F1: 0.66667	F2: 0.65657
	Total predictions:  140	True positives:   13	False positives:    6	False negatives:    7	True negatives:  114



In [38]:
clf = SVC(kernel='linear')
clf = test_classifier(clf, my_dataset, features_list)

	Accuracy: 0.83571	Precision: 0.33333	Recall: 0.15000	F1: 0.20690	F2: 0.16854
	Total predictions:  140	True positives:    3	False positives:    6	False negatives:   17	True negatives:  114



In [39]:
clf = KMeans(n_clusters=2)
clf = test_classifier(clf, my_dataset, features_list)

	Accuracy: 0.71429	Precision: 0.16667	Recall: 0.25000	F1: 0.20000	F2: 0.22727
	Total predictions:  140	True positives:    5	False positives:   25	False negatives:   15	True negatives:   95



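After adding in the five additional features, I retested the algorithms with all of the features. The results are summarized in the table below. The table values do not match the cell outputs printed above exactly and appear to come from a separate run.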

| Classifier            | Precision | Recall  | F1 Score | Accuracy |
|-----------------------|-----------|---------|----------|----------|
| GaussianNB            | 0.35556   | 0.80000 | 0.49321  | 0.76429  |
| DecisionTree          | 0.60000   | 0.60000 | 0.60000  | 0.88571  |
| SVC (kernel='linear') | 0.71429   | 0.25000 | 0.37037  | 0.87857  |
| KMeans (n_clusters=2) | 0.12500   | 0.25000 | 0.16667  | 0.64286  |

The F1 score improved for all four of the classifiers with the addition of the five created features. The F1 score for the decision tree is still the highest, followed by the Gaussian Naive Bayes. At this point, neither the SVC with the linear kernel nor the KMeans clustering passes the standard of 0.3 for both precision and recall. I will drop the latter two algorithms, and I will also drop the GaussianNB in favor of [AdaBoost](http://rob.schapire.net/papers/explaining-adaboost.pdf) because GaussianNB does not have any tunable parameters, and therefore I would not be able to improve its precision or recall beyond what I gain by altering the features. AdaBoost fits a sequence of classifiers on a dataset and increases the weights of incorrectly classified instances with each iteration to concentrate on the difficult-to-classify samples. 
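
The reweighting idea behind AdaBoost can be illustrated with a single boosting round. The sketch below follows the classic two-class update rule; scikit-learn's AdaBoostClassifier uses a related multi-class variant (SAMME), so this is not its exact computation. The function name `adaboost_reweight` and the five-sample example are illustrative, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One round of the classic AdaBoost sample-weight update."""
    total = sum(weights)
    # Weighted error rate of the current weak learner
    err = sum(w for w, miss in zip(weights, misclassified) if miss) / total
    # Learner weight: larger when the weighted error is smaller
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Increase the weights of misclassified samples, decrease the rest
    updated = [w * math.exp(alpha if miss else -alpha)
               for w, miss in zip(weights, misclassified)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]

# Five equally weighted samples; the weak learner gets only the first one wrong
alpha, new_weights = adaboost_reweight([0.2] * 5, [True, False, False, False, False])
print(alpha, new_weights)
```

After one round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.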

The next step I want to take is feature selection. Looking at the feature importances for both AdaBoost and DecisionTree, I can see that there are some features that have zero importance. I will use SelectKBest to select the best-performing features and use GridSearchCV to pick the optimum number of features to select. The candidate features include shared_poi_ratio, the ratio of messages an individual received that were shared with a person of interest to the total number of emails received by that individual. 

In [40]:
len(features_list)

25

At this point I have 25 features. I know that not all of them are going to be contributing to the accuracy of my model. Therefore, it is time to perform dimensionality reduction. First, I will simply use the .feature_importances\_ attribute of the DecisionTreeClassifier and the AdaBoostClassifier to determine the most important features for each algorithm. 

In [41]:
clf_tree = DecisionTreeClassifier()
clf_tree = test_classifier(clf_tree, my_dataset, features_list)

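# feature_importances_ has one value per input feature and excludes the 'poi' label,
# so it is paired with features_list[1:]. ([0] + ndarray adds element-wise rather than
# prepending; the listing recorded below used that pairing, so its names are shifted by one.)
tree_feature_importances = clf_tree.feature_importances_
tree_features = zip(tree_feature_importances, features_list[1:])
tree_features = sorted(tree_features, key= lambda x:x[0], reverse=True)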

for i in range(10):
    print('{} : {:.4f}'.format(tree_features[i][1], tree_features[i][0]))

	Accuracy: 0.90714	Precision: 0.68421	Recall: 0.65000	F1: 0.66667	F2: 0.65657
	Total predictions:  140	True positives:   13	False positives:    6	False negatives:    7	True negatives:  114

to_poi_ratio : 0.3782
from_this_person_to_poi : 0.2485
other : 0.2476
loan_advances : 0.0665
shared_receipt_with_poi : 0.0592
poi : 0.0000
salary : 0.0000
bonus : 0.0000
long_term_incentive : 0.0000
deferred_income : 0.0000
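
One thing to watch when pairing importances with names: `feature_importances_` is a NumPy array with one entry per input feature, so it is one element shorter than a feature list that starts with the label. The sketch below, using made-up importance values and a shortened feature list (`toy_features` is a hypothetical name), shows that adding a list to an array does not prepend an element and shows a pairing that skips the label instead.

```python
import numpy as np

toy_features = ['poi', 'salary', 'bonus', 'to_poi_ratio']
importances = np.array([0.2, 0.0, 0.8])   # made-up values, one per non-label feature

# Pitfall: list + ndarray is element-wise addition, not concatenation,
# so the result still has three elements
broadcast = [0] + importances
print(broadcast.shape)

# Pair importances with the feature names after the label, then rank them
ranked = sorted(zip(importances, toy_features[1:]),
                key=lambda pair: pair[0], reverse=True)
for importance, name in ranked:
    print('{} : {:.4f}'.format(name, importance))
```

A quick sanity check on a ranked listing is that the label name (here 'poi') should never appear in it.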


In [42]:
dump_classifier_and_data(clf_tree, my_dataset, features_list)
tester.main()

	Accuracy: 0.89286	Precision: 0.63158	Recall: 0.60000	F1: 0.61538	F2: 0.60606
	Total predictions:  140	True positives:   12	False positives:    7	False negatives:    8	True negatives:  113



In [43]:
clf_ada = AdaBoostClassifier()
clf_ada = test_classifier(clf_ada, my_dataset, features_list)
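# Pair each importance with its feature name (skipping the 'poi' label) and drop
# zero-importance features without removing items from a list while iterating over it.
# The listing recorded below predates this pairing, so its names are shifted by one.
ada_feature_importances = clf_ada.feature_importances_
ada_features = [pair for pair in zip(ada_feature_importances, features_list[1:])
                if round(pair[0], 4) != 0.0]
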
ada_features = sorted(ada_features, key=lambda x:x[0], reverse=True)
for i in range(10):
    print('{} : {:.4f}'.format(ada_features[i][1], ada_features[i][0]))

	Accuracy: 0.94286	Precision: 0.83333	Recall: 0.75000	F1: 0.78947	F2: 0.76531
	Total predictions:  140	True positives:   15	False positives:    3	False negatives:    5	True negatives:  117

from_this_person_to_poi : 0.1200
total_payments : 0.1000
from_poi_to_this_person : 0.1000
shared_receipt_with_poi : 0.1000
long_term_incentive : 0.0800
to_messages : 0.0800
to_poi_ratio : 0.0800
loan_advances : 0.0600
restricted_stock_deferred : 0.0600
from_poi_ratio : 0.0400


It is interesting to compare the feature importances for the DecisionTree and the AdaBoost classifiers. They definitely are not in close agreement for the top ten even though both manage a respectable F1 Score greater than 0.5. The next step is to make a pipeline and then let GridSearchCV do the tough work of selecting the optimal number of features to keep. 

In [44]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import classification_report

data_dict = featureFormat(my_dataset, features_list)
labels, features = targetFeatureSplit(data_dict)

cv = StratifiedShuffleSplit(random_state = 42)
# Each iteration rebuilds the lists, so only the final split is kept for the grid search
for train_idx, test_idx in cv.split(features, labels): 
    features_train = []
    features_test  = []
    labels_train   = []
    labels_test    = []
    for ii in train_idx:
        features_train.append( features[ii] )
        labels_train.append( labels[ii] )
    for jj in test_idx:
        features_test.append( features[jj] )
        labels_test.append( labels[jj] )

In [45]:
from sklearn.model_selection import GridSearchCV

n_features = np.arange(1, len(features_list))
pipe = Pipeline([
    ('select_features', SelectKBest()),
    ('classify', DecisionTreeClassifier())
])



param_grid = [
    {
        'select_features__k': n_features
    }
]

tree_clf= GridSearchCV(pipe, param_grid=param_grid, scoring='f1', cv = 10)
tree_clf.fit(features_train, labels_train)



GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(steps=[('select_features', SelectKBest(k=10, score_func=<function f_classif at 0x000000000C4F0D68>)), ('classify', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'select_features__k': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1', verbose=0)

In [46]:
tree_clf.best_score_

0.62959183673469388

In [47]:
tree_clf.best_params_

{'select_features__k': 19}

According to the grid search performed with SelectKBest, with the number of features ranging from 1 to 24 (every entry in features_list except the 'poi' label), the optimal number of features for the decision tree classifier is 19. 

In [51]:
tree_clf = Pipeline([
    ('select_features', SelectKBest(k=19)),
    ('classify', DecisionTreeClassifier()),
])

dump_classifier_and_data(tree_clf, my_dataset, features_list)
tester.main()

	Accuracy: 0.91429	Precision: 0.68182	Recall: 0.75000	F1: 0.71429	F2: 0.73529
	Total predictions:  140	True positives:   15	False positives:    7	False negatives:    5	True negatives:  113



Running the DecisionTreeClassifier with SelectKBest and k = 19 yields an F1 score of 0.71429. I am very pleased with that result but will still aim higher. I have not forgotten about the AdaBoostClassifier, and I will try to reduce the number of features for that algorithm in the same manner. 
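
As a quick check on how the reported scores relate to each other, the sketch below recomputes them from the confusion counts printed by tester.py above. The helper name `scores_from_counts` is my own for illustration and is not part of tester.py, and it assumes none of the denominators are zero.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Return accuracy, precision, recall, F1 and F2 from confusion counts."""
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)

    def f_beta(beta):
        # F-beta treats recall as beta times as important as precision
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return accuracy, precision, recall, f_beta(1), f_beta(2)

# Counts copied from the tester.py output above
accuracy, precision, recall, f1, f2 = scores_from_counts(tp=15, fp=7, fn=5, tn=113)
print('Accuracy: {:.5f}  Precision: {:.5f}  Recall: {:.5f}  F1: {:.5f}  F2: {:.5f}'.format(
    accuracy, precision, recall, f1, f2))
```

Because F1 is the harmonic mean of precision and recall, it sits closer to the lower of the two, which is why it is a stricter target than accuracy on a dataset with so few POIs.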

In [52]:
n_features = np.arange(1, len(features_list))
pipe = Pipeline([
    ('select_features', SelectKBest()),
    ('classify', AdaBoostClassifier())
])

param_grid = [
    {
        'select_features__k': n_features
    }
]

ada_clf= GridSearchCV(pipe, param_grid=param_grid, scoring='f1')
ada_clf.fit(features_train, labels_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('select_features', SelectKBest(k=10, score_func=<function f_classif at 0x000000000C4F0D68>)), ('classify', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'select_features__k': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1', verbose=0)

In [53]:
ada_clf.best_params_

{'select_features__k': 20}

In [54]:
ada_clf.best_score_

0.68994708994708986

In [58]:
dump_classifier_and_data(ada_clf, my_dataset, features_list)
tester.main()

	Accuracy: 0.90000	Precision: 0.66667	Recall: 0.60000	F1: 0.63158	F2: 0.61224
	Total predictions:  140	True positives:   12	False positives:    6	False negatives:    8	True negatives:  114



The ideal number of features found by the grid search for the AdaBoostClassifier was 20. The best cross-validated F1 score in the grid search was 0.690, and tester.py reported an F1 score of 0.632, slightly lower than the decision tree's but still quite high. 

At this point I could also perform Principal Component Analysis, but I think that the performance I am seeing is high enough. PCA creates new features that do not necessarily represent actual quantifiable values in the dataset, and I like the idea that I know exactly all the features I am putting into the model. This is one way that I try to combat the [black box problem](http://www.nature.com/news/can-we-open-the-black-box-of-ai-1.20731) in machine learning. If I at least know what is going in, then I can try to understand why the model returned a certain classification and maybe it can inform my thinking for future machine learning systems. 
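
One way to keep track of exactly which features go into the model is to ask the fitted SelectKBest step which columns it kept. The sketch below is only an illustration: it uses synthetic data from `make_classification` and made-up names like `feature_0` in place of the Enron features, and the choice of `k=3` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Enron features (hypothetical data and names)
X, y = make_classification(n_samples=140, n_features=8, n_informative=3,
                           random_state=42)
feature_names = ['feature_{}'.format(i) for i in range(X.shape[1])]

pipe = Pipeline([
    ('select_features', SelectKBest(k=3)),
    ('classify', DecisionTreeClassifier(random_state=42)),
])
pipe.fit(X, y)

# get_support() returns a boolean mask over the original columns
mask = pipe.named_steps['select_features'].get_support()
selected = [name for name, keep in zip(feature_names, mask) if keep]
print(selected)
```

With the real data, `feature_names` would be `features_list[1:]`, so the selected names map directly back to columns in the dataset. PCA components, by contrast, are linear combinations of every column.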

The next step is to begin tuning the classifiers. I will use GridSearchCV again, and I will input a wide variety of parameters to try in the parameter grid. The decision tree classifier will be up first. Looking at the scikit-learn [documentation for the DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), there are a number of parameters that can be tuned. However, I will focus on just a few. The first will be __criterion__ for splitting, either 'gini' for Gini impurity or 'entropy' to maximize the [information gain](https://en.wikipedia.org/wiki/Information_gain_ratio). The other three I will try will be __min_samples_split__, __max_depth__, and __max_features__. I will continue to use cross-validation with 10 folds in the grid search, and the scoring criterion will remain set at F1 because that is what I am trying to maximize.
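
To get a feel for the size of this search, the sketch below counts the parameter combinations and the number of model fits they imply. The candidate values are copied from the grid used in this notebook. The counting itself uses plain `itertools` and is only an estimate, since GridSearchCV also refits the best model once at the end.

```python
from itertools import product

# Candidate values copied from the parameter grid used in this notebook
param_grid = {
    'classify__criterion': ['gini', 'entropy'],
    'classify__min_samples_split': [2, 4, 6, 8, 10],
    'classify__max_depth': [None, 5, 10, 15, 20],
    'classify__max_features': [None, 'sqrt', 'log2', 'auto'],
}
n_folds = 10

# Every combination of one value per parameter is a separate candidate model
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * n_folds
print('{} combinations x {} folds = {} fits'.format(
    len(combinations), n_folds, n_fits))
```

Each extra value added to any one parameter multiplies the total, so it pays to keep the grid modest even on a dataset this small.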

In [59]:
tree_pipe = Pipeline([
    ('select_features', SelectKBest(k=19)),
    ('classify', DecisionTreeClassifier()),
])

param_grid = dict(classify__criterion = ['gini', 'entropy'] , 
                  classify__min_samples_split = [2, 4, 6, 8, 10],
                  classify__max_depth = [None, 5, 10, 15, 20],
                  classify__max_features = [None, 'sqrt', 'log2', 'auto'])

tree_clf = GridSearchCV(tree_pipe, param_grid = param_grid, scoring='f1', cv=10)
tree_clf.fit(features_train, labels_train)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(steps=[('select_features', SelectKBest(k=19, score_func=<function f_classif at 0x000000000C4F0D68>)), ('classify', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'classify__max_features': [None, 'sqrt', 'log2', 'auto'], 'classify__min_samples_split': [2, 4, 6, 8, 10], 'classify__criterion': ['gini', 'entropy'], 'classify__max_depth': [None, 5, 10, 15, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1', verbose=0)

In [60]:
tree_clf.best_score_

0.82645502645502644

In [61]:
tree_clf.best_params_

{'classify__criterion': 'entropy',
 'classify__max_depth': None,
 'classify__max_features': None,
 'classify__min_samples_split': 8}

I will now implement the best parameters selected by the grid search and cross-validate them using the tester.py script. I am seeing quite high F1 scores, and I wonder whether I am making some large mistake, such as training on the same data that I test on. I don't think so, because I am using cross-validation in the grid search, but it only takes a few seconds to test, so I might as well. 
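
To illustrate why testing on the training data would be a large mistake, the sketch below scores an unpruned decision tree both ways. It uses a synthetic, imbalanced dataset from `make_classification` as a stand-in for the POI data, so the numbers say nothing about the Enron results. The 5-fold setting is an arbitrary choice for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the POI data (hypothetical)
X, y = make_classification(n_samples=140, n_features=10, n_informative=3,
                           weights=[0.85, 0.15], random_state=42)

clf = DecisionTreeClassifier(random_state=42)

# Scoring on the training data itself: an unpruned tree memorizes it
train_f1 = f1_score(y, clf.fit(X, y).predict(X))

# Scoring on held-out folds: each fold is predicted by a model that never saw it
cv_f1 = cross_val_score(clf, X, y, scoring='f1', cv=5).mean()

print('training F1: {:.3f}'.format(train_f1))
print('cross-validated F1: {:.3f}'.format(cv_f1))
```

A gap between the two numbers is the signature of evaluating on data the model has already seen, which is what the cross-validation in GridSearchCV and tester.py guards against.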

In [62]:
tree_clf = Pipeline([
    ('select_features', SelectKBest(k=20)),
    ('classify', DecisionTreeClassifier(criterion='entropy', max_depth=5, max_features=None, min_samples_split=10))
])

dump_classifier_and_data(tree_clf, my_dataset, features_list)
tester.main()

	Accuracy: 0.94286	Precision: 1.00000	Recall: 0.60000	F1: 0.75000	F2: 0.65217
	Total predictions:  140	True positives:   12	False positives:    0	False negatives:    8	True negatives:  120



According to the cross-validation in tester.py, my F1 score is indeed very high with the recommended parameters. Now, I am satisfied with the recall and precision score of the DecisionTreeClassifier. I will play around with the AdaBoostClassifier using the same approach because I am genuinely curious to see if I can beat an F1 score of 0.750. It's time to look at the scikit-learn [documentation for the AdaBoost Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) to see the parameters available to tune. Again, the AdaBoost classifier works by fitting a sequence of simpler classifiers and adjusting the weight given to each training sample based on whether the previous classifier labeled it correctly. The weights are adjusted with each iteration so that later classifiers focus on the samples that earlier ones got wrong, improving the overall classifier over time. I do not know much about this classifier, but I can read the documentation and automate the optimal selection of parameters using GridSearch Cross Validation. 
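
The reweighting idea can be sketched in a few lines of plain Python. This follows the classic discrete AdaBoost update from the textbook description and is not necessarily the exact variant scikit-learn implements. The function name `adaboost_reweight` is my own, and the sketch assumes the weighted error is strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, is_correct):
    """One round of the classic discrete AdaBoost sample-weight update."""
    # Weighted error rate of the current weak classifier
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # Vote strength of this classifier: larger when the error is smaller
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Shrink the weights of correctly labeled samples, grow the others
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    # Normalize so the weights sum to 1 again
    return [w / total for w in updated], alpha

# Ten equally weighted samples; the weak classifier gets the last two wrong
weights = [0.1] * 10
is_correct = [True] * 8 + [False] * 2
new_weights, alpha = adaboost_reweight(weights, is_correct)
print('alpha = {:.4f}'.format(alpha))
print(['{:.4f}'.format(w) for w in new_weights])
```

After the update, the two misclassified samples carry half of the total weight between them, so the next weak classifier is pushed to get them right.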

The AdaBoostClassifier boosts another 'base' classifier, which by default is a decision tree. I can alter this using the __base_estimator__ parameter to try out a random forest and Gaussian naive Bayes classification. Ideally, AdaBoost is used on weak classifiers, or those that perform only slightly better than random. One example would be a [decision stump](http://stackoverflow.com/questions/12097155/weak-classifier), which is a decision tree with only a single layer. However, I will try stronger classifiers for the base as well. The other parameters I can change are __n_estimators__, which is how many weak models to fit, and __learning_rate__, which scales the contribution of each classifier.
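
As a minimal sketch of boosting a decision stump, the example below fits AdaBoost on a synthetic dataset standing in for the POI data and reports a cross-validated F1 score. The data, the 5-fold setting, and the parameter values are assumptions for the illustration, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the POI data (hypothetical)
X, y = make_classification(n_samples=140, n_features=10, n_informative=3,
                           weights=[0.85, 0.15], random_state=42)

# A decision stump: a tree limited to a single split
stump = DecisionTreeClassifier(max_depth=1)

# The base classifier is passed positionally because its keyword name
# differs between scikit-learn releases (base_estimator vs. estimator)
ada = AdaBoostClassifier(stump, n_estimators=50, learning_rate=1.0,
                         random_state=42)

scores = cross_val_score(ada, X, y, scoring='f1', cv=5)
print('mean F1 over 5 folds: {:.3f}'.format(scores.mean()))
```

Swapping `stump` for a stronger model, or changing `n_estimators` and `learning_rate`, is exactly what the grid search over these parameters automates.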

In [63]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

In [None]:
ada_pipe = Pipeline([('select_features', SelectKBest(k=20)),
                     ('classify', AdaBoostClassifier())
                    ])

param_grid = dict(classify__base_estimator=[DecisionTreeClassifier(), RandomForestClassifier(), GaussianNB()],
                  classify__n_estimators = [30, 50, 70, 120],
                  classify__learning_rate = [0.5, 1, 1.5, 2, 4])

ada_clf = GridSearchCV(ada_pipe, param_grid=param_grid, scoring='f1', cv=10)
ada_clf.fit(features, labels)

In [None]:
ada_clf.best_score_

In [None]:
ada_clf.best_params_