#### Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?  

The Enron dataset contains data for 146 Enron employees with 21 features, including one feature called POI (person of interest) that marks employees who conducted fraudulent activities. The features can be divided into 3 categories:  
    
   - 14 financial features that describe an employee's financial characteristics, e.g. salary, bonus and total stock value;
   - 6 email features, mostly counts of the emails a person sent and received overall and to/from POIs, e.g. the number of emails from POIs to this person and the number of emails from this person to POIs;
   - 1 label feature, the POI column, which indicates whether the employee is a POI or not. There are 18 employees labelled as POIs and 128 employees labelled as non-POIs.
   
Apart from the email address field and the label field, all features are numeric.

The goal of this project is to use a variety of machine learning algorithms to classify employees as POI or non-POI based on the financial and email features. The classification result of each algorithm is then validated against the label feature, and different metrics are used to measure and compare the quality of each classification model.

An outlier named 'TOTAL' was identified while plotting the salary vs. bonus scatter plot and removed, since it is the aggregate row for all employees in the dataset. A manual check of the data also turned up a row named 'THE TRAVEL AGENCY IN THE PARK', which was removed because it does not appear to be the name of a person.  
   
After removing these outliers, there are 18 POIs and 126 non-POIs.
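
The removal described above can be sketched as follows. This is only an illustration, assuming the data is held in a dictionary keyed by employee name; `data_dict`, `drop_rows` and the toy values are made up here and are not the project's actual data or code.

```python
# Toy stand-in for the project's data dictionary (name -> feature dict).
# All values are invented for illustration; the real dataset has 146 rows.
data_dict = {
    "EMPLOYEE A": {"salary": 200000, "bonus": 500000, "poi": False},
    "EMPLOYEE B": {"salary": 350000, "bonus": 1200000, "poi": True},
    "TOTAL": {"salary": 550000, "bonus": 1700000, "poi": False},
    "THE TRAVEL AGENCY IN THE PARK": {"salary": "NaN", "bonus": "NaN", "poi": False},
}

# Rows that do not describe an individual person.
NON_PERSON_ROWS = ["TOTAL", "THE TRAVEL AGENCY IN THE PARK"]


def drop_rows(records, keys):
    """Return a copy of `records` without the given keys (missing keys are ignored)."""
    return {name: feats for name, feats in records.items() if name not in keys}


cleaned = drop_rows(data_dict, NON_PERSON_ROWS)

# Recount the class balance after the removal.
n_poi = sum(1 for feats in cleaned.values() if feats["poi"])
n_non_poi = len(cleaned) - n_poi
print(len(cleaned), n_poi, n_non_poi)
```

Building a filtered copy rather than deleting keys in place keeps the original dictionary available for comparison during EDA.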

While doing data wrangling and EDA (see Wrangling and EDA.ipynb), I found NaN values in both the financial features and the email features. Because the dataset is small (144 data points after outlier removal), I did not remove these data points. The histograms show that most features have heavily skewed distributions, so NaNs are replaced with the median value of the feature, which is less sensitive to extreme values than the mean.
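
A minimal sketch of median imputation, assuming the features have been loaded into a pandas DataFrame with missing values stored as `NaN`. The column values below are invented to show a skewed feature; they are not taken from the Enron data.

```python
import numpy as np
import pandas as pd

# Invented, skewed toy values; one very large entry per column.
df = pd.DataFrame({
    "salary": [200000.0, np.nan, 350000.0, 1000000.0],
    "bonus": [np.nan, 500000.0, 600000.0, 8000000.0],
})

medians = df.median()         # column-wise medians, NaNs are skipped
imputed = df.fillna(medians)  # each NaN is replaced by its column's median

# For comparison: the mean is pulled upward by the extreme values.
means = df.mean()
print(imputed)
print(means - medians)
```

With skewed columns like these, the mean sits well above the median, which is why filling with the mean would place imputed rows among the unusually high earners.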



#### What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.  

The email address is removed from the feature list because the text of an email address does not help in classifying employees.   
Intuitively, the proportion of a person's messages that are sent to or received from a POI may be informative: the more a person interacts with POIs, the higher the chance that this person is also a POI. So I created a column called 'poi_messages_total_messages_ratio', computed as
<br>
```
    (from_poi_to_this_person + from_this_person_to_poi) / (to_messages + from_messages) 
```
<br>

A boxplot (see Wrangling and EDA.ipynb for details) shows that POIs generally have a higher poi_messages_total_messages_ratio than non-POIs, meaning that POIs have proportionally more email interactions with other POIs than non-POIs do. I therefore kept this engineered feature in the feature list passed to the automated feature selection step of each algorithm's pipeline.
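
The ratio defined above could be computed per employee roughly as follows. The function name `poi_message_ratio` and the fallback of 0.0 for missing inputs or an empty mailbox are assumptions made for this sketch; the write-up does not state how those cases were handled.

```python
def poi_message_ratio(record):
    """Share of a person's messages that were exchanged with POIs.

    Computes (from_poi_to_this_person + from_this_person_to_poi)
             / (to_messages + from_messages).
    Returns 0.0 if any input is missing or the denominator is zero
    (an assumption for this sketch, not the documented behaviour).
    """
    keys = ("from_poi_to_this_person", "from_this_person_to_poi",
            "to_messages", "from_messages")
    values = [record.get(key) for key in keys]
    if any(value is None or value == "NaN" for value in values):
        return 0.0
    from_poi, to_poi, to_messages, from_messages = values
    total = to_messages + from_messages
    return (from_poi + to_poi) / total if total else 0.0


# Hypothetical employee: 50 of 1000 messages involve a POI.
example = {
    "from_poi_to_this_person": 30,
    "from_this_person_to_poi": 20,
    "to_messages": 800,
    "from_messages": 200,
}
ratio = poi_message_ratio(example)
print(ratio)
```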

StandardScaler, SelectKBest and PCA were combined with KNeighborsClassifier in a Pipeline object. Scaling is needed because the model uses both financial features and email features, and the financial features have much larger variance than the email features. Without scaling, PCA would most likely ignore the email features, even though they can still have a strong influence on whether an employee is a POI. KNeighborsClassifier is also distance-based, so unscaled features with large ranges would dominate the distance computation.
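
The effect of scaling on PCA can be illustrated with two synthetic columns, one on a dollar scale and one on an email-count scale. The data below is randomly generated for this sketch and is not the Enron data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical columns: a dollar-valued feature and an email count.
salary = rng.normal(300_000, 100_000, size=200)
emails = rng.normal(50, 20, size=200)
X = np.column_stack([salary, emails])

# First principal component without scaling: dominated by the salary column.
raw_pc1 = PCA(n_components=1).fit(X).components_[0]

# First principal component after standardizing both columns.
X_scaled = StandardScaler().fit_transform(X)
scaled_pc1 = PCA(n_components=1).fit(X_scaled).components_[0]

print("unscaled weights:", np.round(np.abs(raw_pc1), 4))
print("scaled weights:  ", np.round(np.abs(scaled_pc1), 4))
```

Without scaling, the weight on the email column in the first component is close to zero; after standardization both columns contribute comparably.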

Using GridSearchCV, I gave the pipeline a range of values for k and let the search find the best number of features and the features themselves. GridSearchCV selected k = 13, meaning that the 13 highest-scoring features gave the best F1 score.
These 13 features and their associated feature scores and p-values are listed below. The newly created feature 'poi_messages_total_messages_ratio' was selected as one of the best features.

```python
from IPython.display import HTML, display
import tabulate

# SelectKBest scores and p-values for the 13 selected features.
table = [['Feature Name', 'Feature Score', 'Feature P-Value'],
         ['exercised_stock_options', '27.45', '0.000'],
         ['total_stock_value', '23.67', '0.000'],
         ['bonus', '15.80', '0.000'],
         ['salary', '10.90', '0.001'],
         ['deferred_income', '10.29', '0.002'],
         ['restricted_stock', '8.46', '0.004'],
         ['total_payments', '8.41', '0.004'],
         ['long_term_incentive', '8.36', '0.004'],
         ['shared_receipt_with_poi', '7.48', '0.007'],
         ['from_poi_to_this_person', '4.28', '0.040'],
         ['other', '3.96', '0.049'],
         ['poi_messages_total_messages_ratio', '3.87', '0.051'],
         ['loan_advances', '3.85', '0.052']]
display(HTML(tabulate.tabulate(table, tablefmt='html')))
```

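| Feature Name | Feature Score | Feature P-Value |
|---|---|---|
| exercised_stock_options | 27.45 | 0.000 |
| total_stock_value | 23.67 | 0.000 |
| bonus | 15.80 | 0.000 |
| salary | 10.90 | 0.001 |
| deferred_income | 10.29 | 0.002 |
| restricted_stock | 8.46 | 0.004 |
| total_payments | 8.41 | 0.004 |
| long_term_incentive | 8.36 | 0.004 |
| shared_receipt_with_poi | 7.48 | 0.007 |
| from_poi_to_this_person | 4.28 | 0.040 |
| other | 3.96 | 0.049 |
| poi_messages_total_messages_ratio | 3.87 | 0.051 |
| loan_advances | 3.85 | 0.052 |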
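
A search over k and a score table of this kind could be produced along the following lines. This is a sketch on synthetic data, not the project's code: the step names, the `f_classif` score function (the SelectKBest default), the grid values, `n_components`, `n_neighbors` and the cross-validation splitter are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 144 x 20 feature matrix (not the Enron data).
X, y = make_classification(n_samples=144, n_features=20, n_informative=5,
                           weights=[0.8], random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

# Candidate numbers of features to keep (hypothetical grid).
param_grid = {"select__k": [5, 9, 13]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

# Read the scores and p-values of the features kept by the best pipeline.
selector = search.best_estimator_.named_steps["select"]
kept = selector.get_support()
ranked = sorted(
    ((name, score, pvalue)
     for name, score, pvalue, keep
     in zip(feature_names, selector.scores_, selector.pvalues_, kept)
     if keep),
    key=lambda row: row[1],
    reverse=True,
)

print("best k:", search.best_params_["select__k"])
for name, score, pvalue in ranked:
    print(f"{name:12s} {score:8.2f} {pvalue:6.3f}")
```

Keeping the selector inside the pipeline means the feature scores are recomputed on each training fold, so the held-out fold does not influence which features are chosen.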


ExtraTreesClassifier was the other machine learning algorithm used. No automated feature selection was applied, since the algorithm computes feature importances itself and, at each node, picks the candidate split that best separates the class labels.

Here are the feature importances calculated by ExtraTreesClassifier. The added feature 'poi_messages_total_messages_ratio' has the highest importance of all features.

```python
from IPython.display import HTML, display
import tabulate

# Feature importances reported by ExtraTreesClassifier, highest first.
table_feature_importances = [
         ['Ranking', 'Feature Name', 'Feature Importance'],
         [1, 'poi_messages_total_messages_ratio', 0.120736988642],
         [2, 'total_stock_value', 0.0934400583374],
         [3, 'long_term_incentive', 0.0851866900613],
         [4, 'restricted_stock', 0.0790503262437],
         [5, 'bonus', 0.0765667638712],
         [6, 'deferred_income', 0.069378440718],
         [7, 'exercised_stock_options', 0.0667708268178],
         [8, 'expenses', 0.0621512010884],
         [9, 'total_payments', 0.0601142662556],
         [10, 'salary', 0.0596834338248],
         [11, 'shared_receipt_with_poi', 0.0485913063034],
         [12, 'to_messages', 0.0418112765536],
         [13, 'from_poi_to_this_person', 0.0400984895026],
         [14, 'other', 0.0389731907082],
         [15, 'from_messages', 0.0212271791413],
         [16, 'deferral_payments', 0.0183765027787],
         [17, 'from_this_person_to_poi', 0.0109602498195],
         [18, 'director_fees', 0.00619003984389],
         [19, 'restricted_stock_deferred', 0.000692769488946],
         [20, 'loan_advances', 0.0]]
display(HTML(tabulate.tabulate(table_feature_importances, tablefmt='html')))
```

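| Ranking | Feature Name | Feature Importance |
|---|---|---|
| 1 | poi_messages_total_messages_ratio | 0.120736988642 |
| 2 | total_stock_value | 0.0934400583374 |
| 3 | long_term_incentive | 0.0851866900613 |
| 4 | restricted_stock | 0.0790503262437 |
| 5 | bonus | 0.0765667638712 |
| 6 | deferred_income | 0.069378440718 |
| 7 | exercised_stock_options | 0.0667708268178 |
| 8 | expenses | 0.0621512010884 |
| 9 | total_payments | 0.0601142662556 |
| 10 | salary | 0.0596834338248 |
| 11 | shared_receipt_with_poi | 0.0485913063034 |
| 12 | to_messages | 0.0418112765536 |
| 13 | from_poi_to_this_person | 0.0400984895026 |
| 14 | other | 0.0389731907082 |
| 15 | from_messages | 0.0212271791413 |
| 16 | deferral_payments | 0.0183765027787 |
| 17 | from_this_person_to_poi | 0.0109602498195 |
| 18 | director_fees | 0.00619003984389 |
| 19 | restricted_stock_deferred | 0.000692769488946 |
| 20 | loan_advances | 0.0 |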
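
A ranking like this one can be read from a fitted model's `feature_importances_` attribute. The sketch below uses synthetic data and assumed settings (`n_estimators=100`, a fixed `random_state`, placeholder feature names); the parameters actually used in the project are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the 144 x 20 feature matrix (not the Enron data).
X, y = make_classification(n_samples=144, n_features=20, n_informative=5,
                           weights=[0.8], random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranking = sorted(zip(feature_names, clf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)

for rank, (name, importance) in enumerate(ranking, start=1):
    print(f"{rank:2d} {name:12s} {importance:.4f}")
```

The importances are normalized to sum to 1, so each value can be read as a feature's share of the total impurity reduction across the ensemble.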


#### What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?

I tried two classification algorithms: ExtraTreesClassifier and KNeighborsClassifier.

KNeighborsClassifier gave the better classification performance as measured by precision and recall. KNeighborsClassifier also ran faster over its parameter grid than ExtraTreesClassifier did over its own.
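
One way to compare the two classifiers on precision, recall and fit time is cross-validation with several scorers. This is a sketch on synthetic data with assumed hyperparameters and an assumed 5-fold stratified split; it does not reproduce the project's evaluation or its results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 144 x 20 feature matrix (not the Enron data).
X, y = make_classification(n_samples=144, n_features=20, n_informative=5,
                           weights=[0.8], random_state=42)

# Hypothetical settings for both models.
models = {
    "KNeighborsClassifier": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    "ExtraTreesClassifier": ExtraTreesClassifier(
        n_estimators=100, random_state=42),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=["precision", "recall"])
    results[name] = {
        "precision": scores["test_precision"].mean(),
        "recall": scores["test_recall"].mean(),
        "fit_time": scores["fit_time"].mean(),
    }

for name, summary in results.items():
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

Stratified folds keep the POI share similar in every split, which matters when the positive class is as small as it is here.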

1. Meaning of parameter tuning and why it is important
2. At least one important parameter tuned with at least 3 settings investigated systematically, or any of the following are true:
   - GridSearchCV used for parameter tuning
   - Several parameters tuned
   - Parameter tuning incorporated into algorithm selection (i.e. parameters tuned for more than one algorithm, and best algorithm-tune combination selected for final analysis).

#### What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well?  How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).  [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]



#### What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?  [relevant rubric items: “discuss validation”, “validation strategy”]



#### Give at least 2 evaluation metrics and your average performance for each of them.  Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]

1. At least two appropriate metrics are used to evaluate algorithm performance (e.g. precision and recall), and the student articulates what those metrics measure in context of the project task.

2. Response addresses what validation is and why it is important.

3. Performance of the final algorithm selected is assessed by splitting the data into training and testing sets or through the use of cross validation, noting the specific type of validation performed.

4. When tester.py is used to evaluate performance, precision and recall are both at least 0.3.