### Identify Fraud from Enron Email

<p>Machine learning techniques were applied to data on 146 Enron employees.  Each employee had 21 features, for 3066 data points in total.  The features fell into 3 categories: financial, email, and the poi label.  The poi label indicates whether or not an employee was a person of interest (poi) in the Enron case.  18 employees were labeled as poi and 128 were not.  5 machine learning algorithms were used to predict an employee's poi label from their financial and email features.  1 algorithm was then selected and tuned for the final analysis.</p>

### Data Structure

The data was stored in a dictionary of dictionaries.  For the outer dict, each key was a person's name and each value was an inner dict holding that person's features.  For the inner dict, each key was a feature name and each value was the value of that feature.

```Python
datadict['SKILLING JEFFREY K']['salary']
```
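
As a minimal sketch of this layout, the snippet below builds a toy dictionary of dictionaries and reads from it. The names and values are placeholders invented for illustration, not real Enron data; only the variable name `datadict` and the feature names come from the text above.

```python
# Toy data dictionary: outer keys are people, inner keys are feature names.
# All names and values are invented placeholders, not real Enron data.
datadict = {
    "DOE JANE": {"salary": 100000, "bonus": "NaN", "poi": False},
    "ROE RICHARD": {"salary": "NaN", "bonus": 50000, "poi": True},
}

# Look up one feature for one person.
print(datadict["DOE JANE"]["salary"])

# Walk the outer dict and read the label from each inner dict.
poi_names = [name for name, features in datadict.items() if features["poi"]]
print(poi_names)
```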

###### Financial Features

<p>['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all units are in US dollars)</p>

###### Email Features

<p>['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'] (units are generally number of email messages; the notable exception is 'email_address', which is a text string)</p>

###### POI Label

<p>['poi'] (boolean, represented as integer)</p>

### Missing Values

<p>The number of 'NaN' (missing) values for each feature:</p>

POI: 0  
Salary: 50  
Deferred Income: 96  
Loan Advances: 141  
Other: 53  
Long Term Incentive: 79  
Percent Exercised Stock: 44  
Percent Restricted Stock: 38  
Percent Restricted Stock Deferred: 130  
Percent to POI: 58  
Percent from POI: 58  
Percent Shared with POI: 58  
Percent Deferral Payments: 106  
Percent Expenses: 50  
Percent Director Fees: 130  
Percent Bonus: 63  
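
Counts like these could be produced by walking the nested dictionary and tallying 'NaN' entries per feature. The sketch below assumes missing values are stored as the string 'NaN' and uses invented placeholder records; `count_missing` is a hypothetical helper, not a function from the original project.

```python
from collections import Counter

# Placeholder records in the dict-of-dicts layout; values are invented.
datadict = {
    "DOE JANE": {"salary": 100000, "bonus": "NaN", "poi": False},
    "ROE RICHARD": {"salary": "NaN", "bonus": "NaN", "poi": True},
    "POE ALEX": {"salary": 80000, "bonus": 20000, "poi": False},
}


def count_missing(data):
    """Return a Counter mapping feature name -> number of 'NaN' entries."""
    missing = Counter()
    for features in data.values():
        for name, value in features.items():
            # Adding 0 still creates the key, so complete features are listed too.
            missing[name] += 1 if value == "NaN" else 0
    return missing


missing_counts = count_missing(datadict)
for feature, count in sorted(missing_counts.items()):
    print(f"{feature}: {count}")
```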


### Outliers

<p>Many outliers in this dataset were important because they helped identify persons of interest.  However, some outliers did not correspond to a person.  The maximum value of the financial features came from a 'TOTAL' key rather than from a person's key.  A second non-person key, 'THE TRAVEL AGENCY IN THE PARK', was also identified.  Both keys were removed from the data dictionary.</p>
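
Removing the two non-person keys can be sketched as follows. The key names are the ones given above; the surrounding records are invented placeholders.

```python
# Placeholder data; only the two non-person key names come from the text.
datadict = {
    "DOE JANE": {"salary": 100000},
    "TOTAL": {"salary": 180000},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": "NaN"},
}

non_person_keys = ["TOTAL", "THE TRAVEL AGENCY IN THE PARK"]
for key in non_person_keys:
    # pop with a default avoids a KeyError if the key is already gone.
    datadict.pop(key, None)

print(sorted(datadict))
```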

### Feature Selection

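<p>SelectKBest was used to score every feature against the poi label.  Each score measures the association between a feature and the label.  The values below are consistent with SelectKBest's default scoring function, the ANOVA F-value (f_classif); the chi squared scorer (chi2) accepts only non-negative features.  The four features with the highest scores were selected.  In the features list below, the first element 'poi' is the label rather than a feature.  Feature scaling was not required for the decision tree algorithm that was deployed.</p>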

```Python
features_list = ['poi','exercised_stock_options', 'total_stock_value', 'bonus', 'salary']
```  

exercised_stock_options: 24.815079733218194  
total_stock_value: 24.182898678566879  
bonus: 20.792252047181535  
salary: 18.289684043404513  
deferred_income: 11.458476579280369  
long_term_incentive: 9.9221860131898225  
restricted_stock: 9.2128106219771002  
total_payments: 8.7727777300916756  
shared_receipt_with_poi: 8.589420731682381  
loan_advances: 7.1840556582887247  
expenses: 6.0941733106389453  
from_poi_to_this_person: 5.2434497133749582  
other: 4.1874775069953749  
from_this_person_to_poi: 2.3826121082276739  
director_fees: 2.1263278020077054  
to_messages: 1.6463411294420076  
deferral_payments: 0.22461127473600989  
from_messages: 0.16970094762175533  
restricted_stock_deferred: 0.065499652909942141
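
A ranking like the one above can be sketched with scikit-learn's `SelectKBest`. The data and feature names below are invented, and the use of `f_classif` (the scikit-learn default scoring function) is an assumption about the original setup rather than something the project states.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Invented data: column 0 separates the two classes, column 1 does not.
X = np.array([[1.0, 5.0], [2.0, 3.0], [1.5, 4.0],
              [10.0, 4.0], [11.0, 5.0], [12.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])
feature_names = ["feature_a", "feature_b"]  # hypothetical names

# Assumption: the default ANOVA F-value scorer, keeping the single best feature.
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X, y)

# Pair each name with its score and sort from highest to lowest.
ranked = sorted(zip(feature_names, selector.scores_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score}")

# get_support() returns a boolean mask over the input columns.
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print(selected)
```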

### Feature Creation

<p>Two additional features, 'percent_exercised_stock' and 'percent_bonus', were created, but their scores were low compared with the original features.  'percent_exercised_stock' was the percentage of 'total_stock_value' that was exercised, and 'percent_bonus' was an employee's bonus expressed as a percentage of 'salary'.</p>

exercised_stock_options: 21.153646538437151  
total_stock_value: 20.492888346982209  
bonus: 17.326074648455403  
salary: 14.579307471130718  
percent_bonus: 8.5695799793786058  
percent_exercised_stock: 0.6724199516795144  
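
One way the two derived features might be computed is sketched below. The records are invented, `percent_of` is a hypothetical helper, and the result is stored as a ratio between 0 and 1; whether the original multiplied by 100, and how it handled missing values, is not stated, so both choices here are assumptions.

```python
def percent_of(numerator, denominator):
    """Return numerator / denominator, or 'NaN' when the ratio is undefined."""
    if numerator == "NaN" or denominator == "NaN" or denominator == 0:
        return "NaN"
    return float(numerator) / denominator


# Invented placeholder records, not real Enron data.
datadict = {
    "DOE JANE": {"exercised_stock_options": 50, "total_stock_value": 200,
                 "bonus": 30000, "salary": 100000},
    "ROE RICHARD": {"exercised_stock_options": "NaN", "total_stock_value": 500,
                    "bonus": 10000, "salary": "NaN"},
}

for features in datadict.values():
    features["percent_exercised_stock"] = percent_of(
        features["exercised_stock_options"], features["total_stock_value"])
    features["percent_bonus"] = percent_of(features["bonus"], features["salary"])

print(datadict["DOE JANE"]["percent_exercised_stock"])
print(datadict["ROE RICHARD"]["percent_bonus"])
```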

### Fitting classifiers without validation

<p>A variety of classifiers were fit to all of the features and labels in the data set.  The features were then used to predict labels, and the predicted labels were compared with the true labels.  The accuracy, precision, and recall of each algorithm were measured.  Precision is the number of true positives divided by the sum of true positives and false positives.  In other words, it is the number of correct 'poi' predictions divided by all 'poi' predictions, whether or not they are correct.  Recall is the number of true positives divided by the sum of true positives and false negatives.  It is the probability that the algorithm correctly identifies an actual 'poi'.  Because no data was held out for validation, these scores describe how well each classifier fits the data it was trained on, not how well it generalizes.</p>
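
The three metrics can be written directly in terms of confusion-matrix counts. In the sketch below the counts are illustrative and are not taken from the project's output; they were chosen because they are consistent with the GaussianNB figures reported below.

```python
def precision(tp, fp):
    """True positives divided by all predicted positives."""
    return tp / (tp + fp)


def recall(tp, fn):
    """True positives divided by all actual positives."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Illustrative counts (an assumption, not reported by the source).
tp, fp, fn, tn = 6, 5, 12, 107

print("accuracy:", accuracy(tp, tn, fp, fn))
print("precision:", precision(tp, fp))
print("recall:", recall(tp, fn))
```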

###### GaussianNB

accuracy: 0.869230769231  
precision: 0.545454545455  
recall: 0.333333333333

###### Decision Tree

accuracy: 1.0  
precision: 1.0  
recall: 1.0  

###### SVC

accuracy: 1.0  
precision: 1.0  
recall: 1.0  

###### KNeighbors

accuracy: 0.907692307692  
precision: 0.8  
recall: 0.444444444444  

###### AdaBoost

accuracy: 1.0  
precision: 1.0  
recall: 1.0 

### GridSearchCV

###### Cross Validation

<p>A decision tree algorithm was tuned using GridSearchCV.  Because GridSearchCV performs cross validation internally, the whole data set was passed to it instead of being split once into training and testing sets.  A single train/test split forces a tradeoff: a larger training set gives the best learning outcome and a larger test set gives the best validation, but every data point used for the test set cannot be used for training, and vice versa.  Cross validation instead partitions the data into bins of equal size and runs the learning experiment once for every bin.  In each run, one bin is held out as the test set and the remaining bins are used for training.  The test results from all runs are then averaged.  This takes more computing time but gives a more reliable performance estimate than a single train/test split.</p>
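
The partitioning described above can be sketched in plain Python. This is a simplified illustration with contiguous bins and no shuffling or stratification; `k_fold_indices`, `cross_validate`, and the `evaluate` callback are hypothetical names, not part of scikit-learn or the original project.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k nearly equal, contiguous bins."""
    indices = list(range(n_samples))
    # Spread any remainder over the first bins so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, k, evaluate):
    """Average evaluate(train, test) over all k bins."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)

# Placeholder scorer: a real one would fit on `train` and score on `test`.
mean_score = cross_validate(10, 5, lambda train, test: len(test) / 10)
print(mean_score)
```

Each index lands in exactly one test bin, so every data point is used for both training and testing across the runs.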

###### Parameter Tuning

<p>Failing to tune the parameters of an algorithm, or tuning them poorly, can lower the accuracy of the classifier.  Poor tuning can also raise accuracy and precision while lowering recall.  The 'splitter' parameter was varied between "best" and "random", and the 'random_state' parameter was varied over the values None, 20, 30, 40, 50, and 60.  The 'max_depth' parameter was initially adjusted as well, but it produced poor results.</p>
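
A grid search over these two parameters might look like the sketch below. It runs on synthetic data from `make_classification` rather than the Enron features, and the `scoring="f1"` and `cv=5` settings are assumptions; the original scoring and fold configuration are not stated. The import path is the one used by current scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected features and poi labels.
X, y = make_classification(n_samples=120, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

# Parameter values taken from the text above.
param_grid = {
    "splitter": ["best", "random"],
    "random_state": [None, 20, 30, 40, 50, 60],
}

# Assumed settings: F1 scoring and 5-fold cross validation.
search = GridSearchCV(DecisionTreeClassifier(), param_grid, scoring="f1", cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```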

###### Decision Tree

best_parameters: {'splitter': "random", 'random_state': 40}  
accuracy: 0.79592  
precision: 0.33367   
recall: 0.32750 

### Effects of Additional Features

<p>The decision tree algorithm was run again with the created feature 'percent_exercised_stock' in place of 'exercised_stock_options' and 'total_stock_value'.  Accuracy and precision were slightly lower, while recall was slightly higher.</p>

```Python
features_list = ['poi','percent_exercised_stock', 'bonus', 'salary']
```

###### Decision Tree

best_parameters: {'splitter': "random", 'random_state': 55}  
accuracy: 0.78108  
precision: 0.30825    
recall: 0.34000    

### Improving Feature Selection

<p>The decision tree algorithm was run again with only the top 3 features from the SelectKBest scores above.  The results were slightly better on all three measures.</p>

```Python
features_list = ['poi','exercised_stock_options', 'total_stock_value', 'bonus']
```

###### Decision Tree

best_parameters: {'splitter': "random", 'random_state': 50}  
accuracy: 0.81223  
precision: 0.39249    
recall: 0.40250    