# Project Explanation
The goal of this project was to determine whether a machine learning algorithm could be trained to correctly identify people who likely committed fraud, labeled "Persons of Interest" or "POIs", given some information about them. The data sources for this project were generated by Katie Malone.

The dataset had numerous columns. It can reasonably be surmised that people who committed fraud did so for financial gain, so studying the financial data could give a good indication of who was aware of or committed fraudulent activity. The data also included non-financial features such as email_address. The dataset contained 148 rows and 21 columns, and 18 of the rows were persons of interest. This made the dataset fairly imbalanced, with only 12% of rows being POIs. An initial exploration showed two outliers, Total and The Travel Agency; both were removed from the dataset. Finally, the dataset had numerous missing values. Inspecting the source PDF made it clear that these indicated a zero value. The exception was the email_address column; however, this column was recoded and eventually removed, as discussed below.
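
The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration on made-up records, not the project's actual code: the row labels, column subset, and values are hypothetical, and pandas is assumed as the tool.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records standing in for the real dataset; values are made up.
raw = pd.DataFrame(
    {
        "salary": [200000.0, np.nan, 350000.0, 900000.0, np.nan],
        "bonus": [np.nan, 50000.0, 1000000.0, 1500000.0, np.nan],
        "email_address": ["a@example.com", None, "c@example.com", None, None],
        "poi": [False, False, True, False, False],
    },
    index=["PERSON A", "PERSON B", "PERSON C", "TOTAL", "THE TRAVEL AGENCY"],
)

# Drop the two non-person rows identified as outliers (labels are assumed).
outliers = ["TOTAL", "THE TRAVEL AGENCY"]
cleaned = raw.drop(index=outliers, errors="ignore")

# Treat missing financial values as zero; email_address is left untouched here.
financial_cols = ["salary", "bonus"]
cleaned[financial_cols] = cleaned[financial_cols].fillna(0)

print(cleaned)
```

Filling with zero only makes sense because the source PDF indicated that blanks meant no payment; for other datasets a missing value may need different handling.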

## Feature Selection
Feature selection was done through exploratory analysis, hand-selecting variables, and feature engineering. Boxplots of all the features were compared between the POI and non-POI groups. From the boxplots it was judged that the distribution of salaries differed distinctly between the two groups.

When studying the missing values, it became apparent that the proportion of missing features differed between POIs and non-POIs. For instance, in the Other column the POIs had no missing values, whereas 43% of the non-POIs were missing values. The Other, Expenses, and Bonus features were recoded into booleans, with 1 indicating a value and 0 indicating no value. These three features, along with the Salary column, were selected for use in the model. In the final model, Principal Component Analysis was then performed to reduce the four features to two. Lastly, all features were scaled using a Standard Scaler before the model was fitted.
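
As a rough sketch of the recoding and reduction steps above, the following uses made-up values. The ordering of the two transforms is an assumption: the text does not pin down whether scaling happened before or after PCA, and this sketch standardizes first, which is a common choice. The `_bool` suffix follows the feature names that appear in the score table later in this document.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical values; zeros stand for "no value reported".
df = pd.DataFrame(
    {
        "salary": [200000.0, 0.0, 350000.0, 120000.0, 410000.0, 0.0],
        "other": [1500.0, 0.0, 90000.0, 0.0, 250000.0, 300.0],
        "expenses": [0.0, 4000.0, 12000.0, 0.0, 30000.0, 0.0],
        "bonus": [0.0, 0.0, 1000000.0, 50000.0, 2000000.0, 0.0],
    }
)

# Recode three columns as indicators: 1 if a value was present, 0 otherwise.
for col in ["other", "expenses", "bonus"]:
    df[col + "_bool"] = (df[col] != 0).astype(int)

features = df[["salary", "other_bool", "expenses_bool", "bonus_bool"]]

# Assumed ordering: standardize, then project the four features onto two components.
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
reduced = reducer.fit_transform(features)

print(reduced.shape)
```

Wrapping both steps in a pipeline means the scaler and PCA are fitted only on whatever data is passed to `fit`, which helps avoid leaking test data into the transforms.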

In initial tests, attempts were made to use automated feature selection algorithms such as SelectKBest and SelectPercentile, but those were ultimately not used in the final predictor.

## Algorithm
A number of algorithms were spot-checked at the start of modeling. In no particular order, these were the ones tested:

* Random Forest
* Decision Tree
* Gaussian Naive Bayes
* K Means
* SVC with rbf kernel
* SVC with linear kernel
* LinearSVC (A different implementation in sklearn)
* AdaBoost
* Logistic Regression

All models were tested by splitting the dataset into training and test sets. GridSearchCV was then used to fit multiple parameter combinations with the F1 score as the objective. In these initial tests, the LinearSVC, K Means, and Logistic Regression models showed the most promise. The other models either had zero recall or precision, or had poorer scores on both measures.
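
A spot check of this kind might look like the sketch below. It runs on synthetic data, and the parameter grid is hypothetical, since the values actually searched are not listed in this write-up. Logistic Regression is used only as one example of the candidate models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (about 12% positives) standing in for the real features.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.88, 0.12], random_state=0,
)

# Stratify so both splits keep roughly the same share of positives.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hypothetical grid; the values actually searched are not stated in this write-up.
param_grid = {"C": [0.01, 0.1, 1, 10], "class_weight": [None, "balanced"]}

# Rank parameter combinations by cross-validated F1 on the training set only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="f1", cv=5
)
search.fit(X_train, y_train)

print(search.best_params_)
print("held-out F1:", search.score(X_test, y_test))
```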

### Final Model Scores

The models were evaluated based on precision and recall due to the unbalanced nature of the dataset. Simply predicting that every row is a non-POI would yield roughly 88% accuracy, since only about 12% of rows are POIs. However, this would make the model completely ineffective at identifying POIs.
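
To make the point about accuracy concrete, here is a small sketch that scores an "always non-POI" predictor. The labels are rebuilt from the counts quoted earlier (148 rows, 18 POIs); the exact figure in practice depends on which rows end up in the evaluation set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Labels built from the counts quoted earlier: 148 rows, 18 of them POIs.
y_true = np.array([1] * 18 + [0] * 130)

# A "model" that always predicts non-POI.
y_pred = np.zeros_like(y_true)

# Accuracy looks respectable, but not a single POI is found.
print("accuracy:", round(accuracy_score(y_true, y_pred), 3))
print("recall:", recall_score(y_true, y_pred))
```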

**Due to the random nature of testing, actual values might be slightly different.**

| Model        | Precision | Recall | F1 Score |
| ------------ |:---------:|:------:|:--------:|
| KMeans       | .176      | .256   | .31      |
| LinearSVC    | .24       | .23    | .20      |
| Logistic Reg | .31       | .35    | .33      |

# Tuning
Tuning the parameters of a model essentially means adjusting how the model turns data into predictions. The models were initially tuned by testing various parameters specific to each model using the GridSearchCV class of scikit-learn. The models were ranked by their F1 score, which uses both recall and precision in its calculation. Scoring was done using StratifiedShuffleSplit with 1000 iterations on a training set. The best estimator was then scored on a holdout test set.

The three chosen models were tuned more finely using further GridSearchCV iterations, as well as hand tuning with the tester script. The parameter that had the largest effect was weighting the features for the Logistic Regression and LinearSVC models.

# Feature Selection
Feature selection was done through PCA and univariate feature selection with the SelectKBest method. The number of components and the k features were selected through GridSearch testing. In testing, the engineered features showed more value than the original features; for instance, other_bool had a higher score than the original other column.

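```python
import pandas as pd

# Load the saved SelectKBest results and list features from highest to lowest score.
pd.read_pickle("Kbestdf.p").sort_values('score', ascending=False)
```

| Feature                 | Score     | p-value  |
| ----------------------- | --------- | -------- |
| exercised_stock_options | 14.510657 | 0.000243 |
| total_stock_value       | 12.905193 | 0.000515 |
| salary                  | 8.500972  | 0.0044   |
| other_bool              | 8.13037   | 0.005309 |
| expenses_bool           | 7.733091  | 0.006502 |
| deferred_income         | 6.178484  | 0.014624 |
| bonus_bool              | 4.944041  | 0.028477 |
| bonus                   | 4.048657  | 0.046951 |
| expenses                | 3.316696  | 0.071631 |
| long_term_incentive     | 2.895653  | 0.091989 |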
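
A table of scores and p-values like the one above could be produced along the following lines. This is a sketch on synthetic data with hypothetical column names. `f_classif` is the default scoring function of `SelectKBest`; whether the project used the default is an assumption.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data with hypothetical column names; not the project's features.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])

# k="all" keeps every feature so that each one receives a score and p-value.
selector = SelectKBest(score_func=f_classif, k="all").fit(X, y)

# Collect the univariate scores into a frame sorted from best to worst.
kbest_df = pd.DataFrame(
    {"score": selector.scores_, "pvalue": selector.pvalues_}, index=X.columns
).sort_values("score", ascending=False)

print(kbest_df)
```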


# Validation

Validation is the act of checking model performance against a set of known results. The classic mistake is to fit and validate the model against the same dataset, thereby overestimating the model's accuracy and fit. The strategy used in this analysis was twofold. For the initial fit, the various models were validated using a holdout test dataset. For the final model, the entire dataset was split using a shuffled K-fold split over 1000 iterations, and aggregate metrics were compiled across the iterations.
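
One way to compile aggregate metrics over many shuffled splits is sketched below. It assumes `StratifiedShuffleSplit` as the splitter and pools confusion counts across iterations; the exact splitter and aggregation used in the project may differ. The data is synthetic and the classifier settings are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic imbalanced data standing in for the real features.
X, y = make_classification(
    n_samples=150, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.88, 0.12], random_state=0,
)

# 100 iterations keeps the sketch fast; the write-up mentions 1000.
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=42)

tp = fp = fn = 0
for train_idx, test_idx in splitter.split(X, y):
    # Fit on the training part only, then predict the unseen part.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    truth = y[test_idx]
    # Pool the confusion counts across iterations.
    tp += int(np.sum((pred == 1) & (truth == 1)))
    fp += int(np.sum((pred == 1) & (truth == 0)))
    fn += int(np.sum((pred == 0) & (truth == 1)))

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Pooling counts before dividing avoids undefined per-split scores when a single small test split happens to contain no positive predictions.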

# Evaluation Metrics
For this project, precision and recall were used to determine the final efficacy of the effort. Precision is the number of correctly identified POIs divided by the total number of positive predictions. Recall is the number of correctly identified POIs divided by the total number of actual POIs. Accuracy was not used because the dataset was very imbalanced. For a crime case such as this one, high recall is desirable because it means most fraudulent individuals are flagged for investigation, even if the overall precision of the model is low.
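
The definitions above translate directly into a few lines of arithmetic. The counts in this sketch are hypothetical and chosen only to illustrate the calculation; the helper name `precision_recall_f1` is likewise made up for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 6 POIs caught, 14 false alarms, 12 POIs missed.
p, r, f1 = precision_recall_f1(tp=6, fp=14, fn=12)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.3 0.333 0.316
```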

# Useful References
* http://scikit-learn.org/stable/auto_examples/grid_search_digits.html
* http://scikit-learn.org/stable/tutorial/machine_learning_map/
* http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html