# P5 - Identify Fraud from Enron Email

## Motivation

This work is part of the Udacity Data Analyst Nanodegree and is the final project for the Machine Learning course. The goal of this project is to use machine learning algorithms to build an efficient classifier that can spot a PoI (Person of Interest) in the Enron dataset provided by the Udacity team.

## Dataset

The dataset used in this project was extracted by the Udacity team from the much larger Enron email corpus. The original corpus contains data from about 150 users, mostly senior management of Enron, organized into folders, for a total of about 500,000 messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.

https://www.cs.cmu.edu/~./enron/

The Udacity team extracted some useful features related to email correspondence for 145 users and enhanced it with financial data for these 145 users (salary, stock options, etc.).

## Summary

The first part analyzes the features, draws insights about the dataset, and spots the main outliers. Some feature engineering is done by adding 3 new features built from existing ones. Their impact on classifier performance seems to be positive.

Tree-based feature selection is implemented to reduce the number of features and limit the risk of overfitting. When measuring the impact of this selection on a Random Forest classifier, the performance gain turns out to be limited. As a consequence, the entire feature list is used in the rest of the work, combined with a Principal Component Analysis to reduce its dimensionality.

Different basic machine learning algorithms are then tested, with limited success. These algorithms depend heavily on their hyperparameters, which can, for instance, play a strong role in overfitting. To determine the best parameters for this Enron dataset, a grid search is run on different classifiers, using as scorer an F-score similar to the one calculated by the tester.py file.

To get the best out of these classifiers, feature preparation is integrated into a pipeline estimator. More specifically, features are first scaled with a Standard Scaler, then their dimension is reduced with a Principal Component Analysis before they are passed to the classification algorithm.

Each classifier's performance is calculated with the test_classifier function from the tester.py file. This function runs a 1000-fold cross-validation with the given classifier and the given feature list. It prints out different performance metrics (accuracy, precision, recall, F-scores) that help assess the classifier's performance.

In the end, the best classifier that emerges is an optimized Logistic Regression classifier using an L1 penalty. This is the algorithm implemented in my poi_id.py file.

## Limitations and potentials

One of the major limitations of this work was the size of the dataset and the numerous missing values it contains. There were only 145 usable observations to build a classifier, and most of them were missing values for one or more features. This considerably increased the risk of overfitting.

To tackle this issue, the StratifiedShuffleSplit class from the scikit-learn cross-validation module was used to implement 10-fold to 1000-fold cross-validation when developing and testing the different algorithms.
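To illustrate what a stratified shuffle split does, here is a minimal pure-Python sketch. It is an illustrative stand-in, not scikit-learn's implementation: the function name `stratified_shuffle_split`, its parameters, and the toy labels are all assumptions made for this example. Each split shuffles the indices of every class separately and sends the same fraction of each class to the test set, so the rare PoI class is represented in every test fold.

```python
from __future__ import division, print_function
import random


def stratified_shuffle_split(labels, n_splits=10, test_size=0.1, seed=42):
    """Yield (train_idx, test_idx) pairs that keep the class ratio in each test set.

    Illustrative stand-in only; this is not scikit-learn's implementation.
    """
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for label in sorted(by_class):
            indices = list(by_class[label])
            rng.shuffle(indices)
            # At least one sample of each class goes to the test set.
            n_test = max(1, int(round(test_size * len(indices))))
            test_idx.extend(indices[:n_test])
            train_idx.extend(indices[n_test:])
        yield sorted(train_idx), sorted(test_idx)


# Toy labels with roughly the class skew described in this project (about 12% PoI).
toy_labels = [1] * 18 + [0] * 127
splits = list(stratified_shuffle_split(toy_labels, n_splits=5))
for train_idx, test_idx in splits:
    n_poi = sum(toy_labels[i] for i in test_idx)
    print("test size:", len(test_idx), "PoI in test:", n_poi)
```

Because every split is drawn independently, the same observation can appear in several test sets, which is what allows many more "folds" than there are observations.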

To further improve the classifier's performance, it would be interesting to try ensemble classifiers that combine, for example, logistic regression with SVC or decision trees.


In [10]:
import sys
import pickle
from time import time
import numpy as np
import pandas as pd
from tester import dump_classifier_and_data, test_classifier

from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.decomposition import RandomizedPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline

sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

def scorer_f(clf, X, y):
    '''
    Evaluate a classifier on a given feature & label dataset and return the
    F2 score as computed in the tester.py file. It is used as the scoring
    function of the grid search algorithm to select the best fit.
    The code is extracted from the tester.py file.
    '''
    true_negatives = 0
    false_negatives = 0
    true_positives = 0
    false_positives = 0
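    # NOTE: the estimator is refit here on the X, y passed to the scorer, so the
    # predictions below are made on the same data the estimator was just fit on.
    clf.fit(X, y)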
    predictions = clf.predict(X)
    for prediction, truth in zip(predictions, y):
        if prediction == 0 and truth == 0:
            true_negatives += 1
        elif prediction == 0 and truth == 1:
            false_negatives += 1
        elif prediction == 1 and truth == 0:
            false_positives += 1
        elif prediction == 1 and truth == 1:
            true_positives += 1
    try:
        precision = 1.0*true_positives/(true_positives+false_positives)
        recall = 1.0*true_positives/(true_positives+false_negatives)
        f1 = 2.0 * true_positives/(2*true_positives + false_positives+false_negatives)
        f2 = (1+2.0*2.0) * precision*recall/(4*precision + recall)
        return f2
    except ZeroDivisionError:
        # print "Got a divide by zero when trying out:", clf
        # print "Precision or recall may be undefined due to a lack of true positive predictions."
        return 0

# Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)
    

## Data exploration

In [11]:
# Import dataset into a pandas dataframe to easily manipulate the data
df = pd.DataFrame.from_dict(data_dict, orient="index")
print df.head(5) 
print 

# Total number of points
print "Total number of observations (including potential outliers): ", len(df)
print


# Allocation across classes
print "Number of PoI: {0} ({1:.1f}%)".format(sum(df.poi), 100.0*sum(df.poi)/len(df.poi))
print

# Total number of features
print "Total number of features: ", len(df.columns)-1
print

# Missing values
for feat in df.columns:
    try:
        n_NaN = sum(df[feat]=="NaN")
        print "Number of NaN values for {0}: {1} ({2:.1f}%)".format(feat, n_NaN, 100.0 * n_NaN / len(df.poi))
    except TypeError:
        pass
print

# features_list is a list of strings, each of which is a feature name.
# The first feature must be "poi".

features_list = ['poi']
for feat in df.columns:
    if feat not in ['poi', 'email_address']:
        features_list.append(feat)

print "Features selected to be used by the classifiers: "
print features_list

                    salary to_messages deferral_payments total_payments  \
ALLEN PHILLIP K     201955        2902           2869717        4484442   
BADUM JAMES P          NaN         NaN            178980         182466   
BANNANTINE JAMES M     477         566               NaN         916197   
BAXTER JOHN C       267102         NaN           1295738        5634343   
BAY FRANKLIN R      239671         NaN            260455         827696   

                   exercised_stock_options    bonus restricted_stock  \
ALLEN PHILLIP K                    1729541  4175000           126027   
BADUM JAMES P                       257817      NaN              NaN   
BANNANTINE JAMES M                 4046157      NaN          1757552   
BAXTER JOHN C                      6680544  1200000          3942714   
BAY FRANKLIN R                         NaN   400000           145796   

                   shared_receipt_with_poi restricted_stock_deferred  \
ALLEN PHILLIP K                       1407  

There are only 146 observations in the dataset, including potential outliers. Moreover, the class distribution is skewed, as only 12% of the observations are PoI. A large share of the values is also missing.


## Identify and remove outliers

In [3]:
# check each feature for outliers based on the 2 standard deviation rule.
for feat in df.columns:
    try:
        lim_h = np.mean(df[feat][df[feat] != "NaN"]) + 2*np.std(df[feat][df[feat] != "NaN"])
        print df[(df[feat] > lim_h) & (df[feat] != "NaN")][feat]
        print
    except TypeError:
        pass

TOTAL    26704229
Name: salary, dtype: object

BECK SALLY W          7315
BELDEN TIMOTHY N      7991
KEAN STEVEN J        12754
KITCHEN LOUISE        8305
LAVORATO JOHN J       7259
SHAPIRO RICHARD S    15149
Name: to_messages, dtype: object

TOTAL    32083396
Name: deferral_payments, dtype: object

LAY KENNETH L    103559793
TOTAL            309886585
Name: total_payments, dtype: object

TOTAL    311764000
Name: exercised_stock_options, dtype: object

TOTAL    97343619
Name: bonus, dtype: object

TOTAL    130322299
Name: restricted_stock, dtype: object

BELDEN TIMOTHY N      5521
KEAN STEVEN J         3639
KITCHEN LOUISE        3669
LAVORATO JOHN J       3962
SHAPIRO RICHARD S     4527
WHALLEY LAWRENCE G    3920
Name: shared_receipt_with_poi, dtype: object

BHATNAGAR SANJAY    15456290
Name: restricted_stock_deferred, dtype: object

TOTAL    434509511
Name: total_stock_value, dtype: object

TOTAL    5235198
Name: expenses, dtype: object

Series([], Name: loan_advances, dtype: object)


The previous code checked, for each numerical feature, which observations lie more than 2 standard deviations above the mean. This helped identify observations that repeatedly fall outside the range of the most common values.
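The rule can be sketched on a single feature without pandas. This is a minimal illustration under stated assumptions: the helper name `flag_high_outliers` and the salary figures are made up for the example, and the population standard deviation is used to match numpy's default.

```python
from __future__ import division, print_function
import math


def flag_high_outliers(values_by_name, n_std=2.0):
    """Return names whose value exceeds mean + n_std * std, skipping "NaN" entries."""
    numeric = {name: v for name, v in values_by_name.items() if v != "NaN"}
    if not numeric:
        return []
    mean = sum(numeric.values()) / len(numeric)
    # Population standard deviation, matching numpy's default (ddof=0).
    variance = sum((v - mean) ** 2 for v in numeric.values()) / len(numeric)
    limit = mean + n_std * math.sqrt(variance)
    return sorted(name for name, v in numeric.items() if v > limit)


# Made-up salaries for illustration; an aggregate "TOTAL" row dwarfs the rest.
toy_salaries = {
    "PERSON A": 200000,
    "PERSON B": 250000,
    "PERSON C": "NaN",
    "PERSON D": 300000,
    "PERSON E": 220000,
    "PERSON F": 270000,
    "PERSON G": 240000,
    "TOTAL": 1480000,
}
print(flag_high_outliers(toy_salaries))
```

An aggregate row is flagged on almost every financial feature at once, which is the pattern that makes it stand out from genuinely high earners.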

This is how I saw that the dataset includes a "TOTAL" observation, which appears to aggregate the other observations. This row is not useful for the classifier and may even mislead it. Hence, it is dropped from the dataset.

In [12]:
# remove the TOTAL entry, as it clearly appears as a general outlier.
data_dict.pop("TOTAL")

{'bonus': 97343619,
 'deferral_payments': 32083396,
 'deferred_income': -27992891,
 'director_fees': 1398517,
 'email_address': 'NaN',
 'exercised_stock_options': 311764000,
 'expenses': 5235198,
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 83925000,
 'long_term_incentive': 48521928,
 'other': 42667589,
 'poi': False,
 'restricted_stock': 130322299,
 'restricted_stock_deferred': -7576788,
 'salary': 26704229,
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 309886585,
 'total_stock_value': 434509511}

## Working on the features

### Adding new features

I wanted to add 3 new features. The “to_messages_poi_ratio” feature is the ratio between the number of emails a person sent to PoIs and the total number of emails sent by this person. This ratio seems more relevant to me than the raw count, as I would expect PoIs to send emails to other PoIs more often.

Along the same lines, I added the “from_messages_poi_ratio” and “shared_messages_poi_ratio” features, which calculate similar ratios from the emails a person received from PoIs and the emails whose receipt was shared with PoIs, each divided by the total number of emails received by this person.
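As a small worked example of these ratios, the sketch below computes them for one record. The helper `poi_ratio` and the message counts are hypothetical; only the field names come from the dataset. Unlike a single try/except around all three ratios, this variant handles each ratio separately and also guards against a zero denominator, which is a design choice of the sketch rather than something the project code does.

```python
from __future__ import division, print_function


def poi_ratio(part, whole):
    """Return part / whole as a float, or "NaN" when a value is missing or whole is 0."""
    if part == "NaN" or whole == "NaN" or whole == 0:
        return "NaN"
    return float(part) / whole


# Hypothetical record using the dataset's field names; the counts are made up.
person = {
    "from_messages": 200,
    "to_messages": 1000,
    "from_this_person_to_poi": 50,
    "from_poi_to_this_person": 30,
    "shared_receipt_with_poi": "NaN",
}

person["to_messages_poi_ratio"] = poi_ratio(
    person["from_this_person_to_poi"], person["from_messages"])
person["from_messages_poi_ratio"] = poi_ratio(
    person["from_poi_to_this_person"], person["to_messages"])
person["shared_messages_poi_ratio"] = poi_ratio(
    person["shared_receipt_with_poi"], person["to_messages"])

for key in sorted(k for k in person if k.endswith("_ratio")):
    print(key, person[key])
```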


In [13]:
# Task 3: Create new feature(s)
# Adding the ratio of messages received, shared and sent to POIs
for name in data_dict.keys():
    try:
        data_dict[name]["to_messages_poi_ratio"] = \
            1.0*data_dict[name]["from_this_person_to_poi"]/data_dict[name]["from_messages"]
        data_dict[name]["from_messages_poi_ratio"] = \
            1.0*data_dict[name]["from_poi_to_this_person"]/data_dict[name]["to_messages"]
        data_dict[name]["shared_messages_poi_ratio"] = \
            1.0*data_dict[name]["shared_receipt_with_poi"]/data_dict[name]["to_messages"]
    except TypeError:
        data_dict[name]["to_messages_poi_ratio"] = "NaN"
        data_dict[name]["from_messages_poi_ratio"] = "NaN"
        data_dict[name]["shared_messages_poi_ratio"] = "NaN"

features_list.append("to_messages_poi_ratio")
features_list.append("from_messages_poi_ratio")
features_list.append("shared_messages_poi_ratio")
        
# Store to my_dataset for easy export below.
my_dataset = data_dict

# Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys=True)
labels, features = targetFeatureSplit(data)


In [6]:
# initial feature list to be compared with the extended one
features_list_init = ['poi']
for feat in df.columns:
    if feat not in ['poi', 'email_address']:
        features_list_init.append(feat)

# Using a Random Forest Classifier to study the impact of adding these new features
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
print "Classification performance with the initial feature list: "
print features_list_init
test_classifier(clf, my_dataset, features_list_init)

print "Classification performance with the extended feature list: "
print features_list
test_classifier(clf, my_dataset, features_list)

Classification performance with the initial feature list: 
['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'from_this_person_to_poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'from_poi_to_this_person']
RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
	Accuracy: 0.85680	Precision: 0.33772	Recall: 0.07700	F1: 0.12541	F2: 0.09106
	Total predictions: 15000	True positives:  154	False positives:  302	False negatives: 1846	True negatives: 12698

Classification performance 

Adding these new features seems to provide more usable information. Tuning the hyperparameters may even increase the usefulness of these added features.

### Tree-based Feature Selection 

The following chunk of code runs a random forest classifier to determine the importance of each feature. I use a 100-fold stratified shuffle split to cross-validate my selection and limit overfitting issues.

If a feature's importance is lower than 1%, the feature is deleted from the features list. The algorithm then restarts with the new features list, until only important features are left.
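The pruning loop itself can be sketched independently of the classifier. In the sketch below, `prune_features` and `importance_fn` are hypothetical names, and the scores are made up; in practice `importance_fn` would return importances averaged over the cross-validation splits. Each pass builds a new list of kept features instead of deleting by position while iterating, since removing an item from a list shifts the positions of all the items after it.

```python
from __future__ import division, print_function


def prune_features(feature_names, importance_fn, threshold=0.01):
    """Repeatedly drop features whose importance is below threshold.

    importance_fn is a hypothetical callable that takes a list of feature
    names (without the 'poi' label) and returns one importance per name.
    """
    selected = list(feature_names)
    while selected:
        importances = importance_fn(selected)
        kept = [name for name, imp in zip(selected, importances) if imp >= threshold]
        if len(kept) == len(selected):
            break  # nothing was removed in this pass
        selected = kept
    return selected


# Made-up raw scores, renormalised over the remaining features at each pass.
toy_scores = {"bonus": 50.0, "salary": 30.0, "expenses": 19.5,
              "other": 0.3, "director_fees": 0.2}


def toy_importance(names):
    total = sum(toy_scores[n] for n in names)
    return [toy_scores[n] / total for n in names]


selected = prune_features(
    ["bonus", "salary", "expenses", "other", "director_fees"], toy_importance)
print(selected)
```

The loop stops as soon as a full pass removes nothing, which mirrors the stopping condition described above.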

In [28]:
# features list to be reduced
features_list_selected = list(features_list)
print features_list_selected

# StratifiedShuffleSplit

n_feat_del = 0
stop = False
while not(stop):
    n_feat_del_ini = n_feat_del
    # Extract features and labels from dataset for local testing
    data = featureFormat(my_dataset, features_list_selected, sort_keys=True)
    labels_sel, features_sel = targetFeatureSplit(data)
    
    # Using a 100-fold stratified shuffle split to do feature selection with cross-validation
    folds = 100
    cv = StratifiedShuffleSplit(labels_sel, folds, random_state = 42)
    
    # one accumulator per feature ('poi' is the label, not a feature)
    feat_imp = [0.0] * (len(features_list_selected) - 1)
    for train_idx, test_idx in cv:
        features_train = [features_sel[i] for i in train_idx]
        features_test  = [features_sel[i] for i in test_idx]
        labels_train   = [labels_sel[i] for i in train_idx]
        labels_test    = [labels_sel[i] for i in test_idx]

        # Choose the most important features with a random forest classifier
        clf = RandomForestClassifier(random_state=42)
        clf.fit(features_train, labels_train)
        feat_imp = [feat_imp[i] + clf.feature_importances_[i]/folds for i in range(len(clf.feature_importances_))]
        
    print
    print "deleting features if their importance is lower than 0.01%"
    for i in range(len(feat_imp)):
        if feat_imp[i] < 0.01:
            n_feat_del += 1
            print "deleting ", features_list_selected[i+1], " ", clf.feature_importances_[i]
            del features_list_selected[i+1]
    if n_feat_del_ini == n_feat_del:
        print "End"
        stop = True

print
print "Tree-based selected list of features"
print features_list_selected
    

['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'from_this_person_to_poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'from_poi_to_this_person', 'to_messages_poi_ratio', 'from_messages_poi_ratio', 'shared_messages_poi_ratio']

deleting features if their importance is lower than 0.01%
deleting  deferral_payments   0.0096149068323
deleting  total_stock_value   0.0
deleting  other   0.0
deleting  from_poi_to_this_person   0.0

deleting features if their importance is lower than 0.01%
deleting  restricted_stock_deferred   0.0099537037037
deleting  from_messages   0.0
deleting  long_term_incentive   0.0

deleting features if their importance is lower than 0.01%
deleting  loan_advances   0.0076171875
deleting  deferred_income   0.0

deleting features if their importan

The performance brought by this selected features list is then compared to the one brought by the entire feature list using a Random Forest Classifier. 

In [29]:
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
print "testing: ", clf

# test the classifier with the testing function provided
print "Random Forest trained with the entire feature list: "
test_classifier(clf, my_dataset, features_list)

print "Random Forest trained with the Tree-based selected feature list: "
test_classifier(clf, my_dataset, features_list_selected)
print

testing:  RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
Random Forest trained with the entire feature list: 
RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
	Accuracy: 0.86540	Precision: 0.48148	Recall: 0.12350	F1: 0.19658	F2: 0.14507
	Total predictions: 15000	True positives:  247	False positives:  266	False negatives: 1753	True negatives: 12734

Random Forest trained with the Tree-based selec

Using this feature selection seems to only slightly increase the performance for the standard random forest classifier.

For the rest of my work, **I will continue with the entire features list**, but I will also implement a Principal Component Analysis to reduce the dimensionality of my features and tackle the overfitting risk.

## Trying some basic estimators

### Standard SVC, Random Forest and Logistic Regression

In [14]:
# Creating a pipeline to include scaling when using SVC
estimators_svc = [
    ('scaler',StandardScaler()),
    ('svc', SVC())]
pipe_svc = Pipeline(estimators_svc)

# Creating a pipeline to include scaling when using Logistic regression
estimators_logreg = [
    ('scaler',StandardScaler()),
    ('logreg', LogisticRegression())]
pipe_logreg = Pipeline(estimators_logreg)


for clf in [pipe_svc, RandomForestClassifier(), pipe_logreg]:
    # test the classifier with the testing function provided
    # a 1000-fold cross-validation is done and different metrics are calculated (accuracy, precision, recall, custom f-scores)
    # to assess the classifier performance.
    test_classifier(clf, my_dataset, features_list)
    print
    

true negatives:  13000
false negatives:  2000
false positives:  0
true positives:  0
Got a divide by zero when trying out: Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
Precision or recall may be undefined due to a lack of true positive predicitons.

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
	Accuracy: 0.86140	Precision: 0.43857	Recall: 0.14100	F1: 0.21339	F2: 0.16314
	Total predictions: 15000	True positive

Using the standard classification algorithms with their default hyperparameters and without dimensionality reduction does not provide a sufficient level of performance.

The Random Forest classifier gives the best results when looking at the F1 and F2 scores. As it is an ensemble classifier, it may be better suited to limiting overfitting by default.

The SVC classifier performs poorly, as it classifies all the users as non-PoI.
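The confusion-matrix counts printed above show why accuracy alone is misleading on such a skewed dataset. The sketch below recomputes the usual metrics from raw counts; the function name `classification_metrics` is an assumption of this example, and undefined ratios are reported as 0.0 by convention here.

```python
from __future__ import division, print_function


def classification_metrics(tp, fp, fn, tn, beta=2.0):
    """Return accuracy, precision, recall, F1 and F-beta from confusion-matrix counts.

    Undefined ratios (zero denominators) are reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    b2 = beta ** 2
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "f_beta": ratio((1 + b2) * precision * recall, b2 * precision + recall),
    }


# Counts printed above for the default SVC pipeline: every prediction is non-PoI.
svc = classification_metrics(tp=0, fp=0, fn=2000, tn=13000)
# Counts printed earlier for the random forest on the initial feature list.
forest = classification_metrics(tp=154, fp=302, fn=1846, tn=12698)

for name, metrics in [("svc", svc), ("forest", forest)]:
    print(name, sorted((k, round(v, 5)) for k, v in metrics.items()))
```

Predicting non-PoI for everyone still yields an accuracy of about 0.867 while the recall is 0, which is why the F-scores are the metrics to watch here.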

## Hyperparameter tuning

Machine learning algorithms depend heavily on their hyperparameters. These parameters can, for instance, play a strong role when dealing with overfitting. The goal of this chapter is to optimize the main parameters of the previous algorithms in order to increase the classification performance. To determine the best parameters for this Enron dataset, a grid search is run on different classifiers.

The grid search trains a classifier for each combination of hyperparameter values and cross-validates it on the provided dataset with a 100-iteration stratified shuffle split to select the best estimator. An F-score similar to the one calculated by the tester.py file is used to compare the different classifiers. I then used the test_classifier function from the tester.py file to calculate the classifier performance.
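To give a sense of the cost of an exhaustive grid search, the sketch below enumerates every combination of a parameter grid with `itertools.product`. The helper `expand_grid` is a hypothetical name for this illustration and is not part of scikit-learn; the grid has the same shape as the SVC grid used in this notebook (2 kernels, 7 values of C, 7 values of gamma).

```python
from __future__ import division, print_function
import itertools


def expand_grid(param_grid):
    """Return every parameter combination in param_grid as a list of dicts."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


# Powers of ten from 1e-3 to 1e3, the same values as np.logspace(-3, 3, 7).
powers_of_ten = [10.0 ** k for k in range(-3, 4)]
svc_grid = {
    "svc__kernel": ["sigmoid", "rbf"],
    "svc__C": powers_of_ten,
    "svc__gamma": powers_of_ten,
}
candidates = expand_grid(svc_grid)
n_splits = 100  # number of cross-validation splits used for each candidate
print(len(candidates), "candidates,", len(candidates) * n_splits, "fits")
```

The number of fits grows with the product of the grid sizes and the number of splits, which explains why adding a PCA parameter to the grid lengthens the search considerably.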

In addition to optimizing the classification algorithm itself, I also work on pipeline estimators that include a standard scaler and a PCA transformation to increase the performance of the classification algorithm.

### Grid Search Optimization: Scaling and SVC


In [15]:
# Creating a pipeline to include scaling when using SVC
estimators = [
    ('scaler',StandardScaler()),
    ('svc', SVC(class_weight="balanced", degree=2))]
pipe = Pipeline(estimators)

# Creating a stratified shuffle split to replace the standard Grid Search cross-validation
cv = StratifiedShuffleSplit(labels, 100, random_state = 42)

print "Fitting the classifier to the training set"
t0 = time()

# The linear and polynomial kernels are not included here. In some first tests, the test_classifier function
# or the grid search algorithm got stuck. Without PCA, the classification may not be done properly with linear kernels.
# The polynomial kernel did not work with the PCA either.
param_grid = {
    'svc__kernel':['sigmoid', 'rbf'], 
    'svc__C': np.logspace(-3, 3, 7),
    'svc__gamma':np.logspace(-3, 3, 7)
          }

clf = GridSearchCV(pipe, param_grid, scoring=scorer_f, cv=cv)
clf = clf.fit(features, labels)
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_

print "Testing the best estimator"
t0 = time()
test_classifier(clf.best_estimator_, my_dataset, features_list)
print "done in %0.3fs" % (time() - t0)

Fitting the classifier to the training set
done in 44.050s
Best estimator found by grid search:
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=0.10000000000000001, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=2, gamma=1.0, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
Testing the best estimator
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=0.10000000000000001, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=2, gamma=1.0, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
	Accuracy: 0.27440	Precision: 0.15523	Recall: 1.00000	F1: 0.26874	F2: 0.47884
	Total predictions: 15000	True positives: 2000	False positives: 10884	False negatives:    0	True negatives: 2116

done 

### Grid Search Optimization: Scaling, PCA and SVC

In [16]:
estimators = [
    ('scaler',StandardScaler()),
    ('pca', RandomizedPCA(whiten=True)), 
    ('svc', SVC(class_weight="balanced", degree=2))]
pipe = Pipeline(estimators)

# Creating a stratified shuffle split to replace the standard Grid Search cross-validation
cv = StratifiedShuffleSplit(labels, 100, random_state = 42)

param_grid = {
    'pca__n_components' : [4, 6, 10, 15, 20],
    'svc__kernel':['linear', 'sigmoid'], 
    'svc__C': np.logspace(-3, 3, 7),
    'svc__gamma':np.logspace(-3, 3, 7)
             }

#    'svc__kernel':['linear', 'sigmoid', 'rbf', 'poly'], 
clf = GridSearchCV(pipe, param_grid, scoring=scorer_f, cv=cv)

print "Fitting the classifier to the training set"
t0 = time()
clf.fit(features, labels)
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_

print "Testing the best estimator"
t0 = time()
test_classifier(clf.best_estimator_, my_dataset, features_list)
print "done in %0.3fs" % (time() - t0)

Fitting the classifier to the training set
done in 1206.864s
Best estimator found by grid search:
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', RandomizedPCA(copy=True, iterated_power=3, n_components=10, random_state=None,
       whiten=True)), ('svc', SVC(C=10.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=2, gamma=0.001, kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
Testing the best estimator
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', RandomizedPCA(copy=True, iterated_power=3, n_components=10, random_state=None,
       whiten=True)), ('svc', SVC(C=10.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=2, gamma=0.001, kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))

### Grid Search Optimization: Random Forest

In [19]:
# Creating a stratified shuffle split to replace the standard Grid Search cross-validation
cv = StratifiedShuffleSplit(labels, 100, random_state = 42)

print "Fitting the classifier to the training set"
t0 = time()
param_grid = {
         'n_estimators': [1, 5, 10, 15, 20],
         'min_samples_leaf' : [1, 2, 3] 
          }

clf = GridSearchCV(RandomForestClassifier(class_weight='balanced', random_state=42, n_jobs=-1), 
                   param_grid, scoring=scorer_f, cv=cv)
clf = clf.fit(features, labels)
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_

print "Testing the best estimator"
t0 = time()
test_classifier(clf.best_estimator_, my_dataset, features_list)
print "done in %0.3fs" % (time() - t0)

Fitting the classifier to the training set
done in 606.900s
Best estimator found by grid search:
RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=-1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
Testing the best estimator
RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=-1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
	Accuracy: 0.86673	Precision: 0.50082	Recall: 0.15250	F1: 0.23381	F2: 0.17714
	Total predictions: 15000	True positives:  305	False positives:  304	False negatives: 1695	True ne

### Grid Search Optimization: PCA & Random Forest 

In [24]:
estimators = [
    ('scaler',StandardScaler()),
    ('pca', RandomizedPCA(whiten=True)), 
    ('randomforest', RandomForestClassifier(class_weight='balanced', random_state=42, n_jobs=-1))]
pipe = Pipeline(estimators)

# Creating a stratified shuffle split to replace the standard Grid Search cross-validation
cv = StratifiedShuffleSplit(labels, 100, random_state = 42)

param_grid = {
    'pca__n_components':[4, 6, 10, 15, 20],
    'randomforest__n_estimators':[1, 5, 10, 15, 20],
    'randomforest__min_samples_leaf':[1, 2, 3] 
             }
clf = GridSearchCV(pipe, param_grid, scoring=scorer_f, cv=cv)

print "Fitting the classifier to the training set"
t0 = time()
clf = clf.fit(features, labels)
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_

print "Testing the best estimator"
t0 = time()
test_classifier(clf.best_estimator_, my_dataset, features_list)
print "done in %0.3fs" % (time() - t0)

Fitting the classifier to the training set
done in 2873.080s
Best estimator found by grid search:
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', RandomizedPCA(copy=True, iterated_power=3, n_components=10, random_state=None,
       whiten=True)), ('randomforest', RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max...timators=20, n_jobs=-1,
            oob_score=False, random_state=42, verbose=0, warm_start=False))])
Testing the best estimator
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', RandomizedPCA(copy=True, iterated_power=3, n_components=10, random_state=None,
       whiten=True)), ('randomforest', RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max...timators=20, n_jobs=-1,
            oob_score=False, random_state=42, verbose=0, warm_start=False))])
	Accuracy: 0.83733	Precision: 0.23236	Recall: 0

### Grid Search Optimization: Logistic Regression

In [17]:
# Creating a pipeline to include scaling when using Logistic regression
estimators = [('scaler',StandardScaler()),
              ('logreg', LogisticRegression(class_weight="balanced", random_state=42))]
pipe = Pipeline(estimators)

# Creating a stratified shuffle split to replace the standard Grid Search cross-validation
cv = StratifiedShuffleSplit(labels, 100, random_state = 42)

print "Fitting the classifier to the training set"
t0 = time()
param_grid = {
         'logreg__C': np.logspace(-7, 7, 15),
         'logreg__penalty' : ["l1", "l2"] 
          }

clf = GridSearchCV(pipe, param_grid, scoring=scorer_f, cv=cv)
clf = clf.fit(features, labels)
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_

print "Testing the best estimator"
t0 = time()
test_classifier(clf.best_estimator_, my_dataset, features_list)
print "done in %0.3fs" % (time() - t0)


Fitting the classifier to the training set
done in 90.614s
Best estimator found by grid search:
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logreg', LogisticRegression(C=10.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])
Testing the best estimator
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logreg', LogisticRegression(C=10.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])
	Accuracy: 0.79853	Precision: 0.33301	Recall: 0.50950	F1: 0.40277	F2: 0.46067
	Total predictions: 15000	True positives: 1019	False positi

### Grid Search Optimization: Scaling, PCA and Logistic Regression

In [18]:
estimators = [('scaler',StandardScaler()),
              ('pca', RandomizedPCA(whiten=True)), 
              ('logreg', LogisticRegression(class_weight="balanced", random_state=42, n_jobs=-1))]
pipe = Pipeline(estimators)

# Creating a stratified shuffle split to replace the standard Grid Search cross-validation
cv = StratifiedShuffleSplit(labels, 100, random_state = 42)

# Grid: 5 PCA sizes x 15 values of C (1e-7 to 1e7) x 2 penalties = 150 candidates
param_grid = {
    'pca__n_components': [4, 6, 10, 15, 20],
    'logreg__C': np.logspace(-7, 7, 15),
    'logreg__penalty': ["l1", "l2"]
}
clf = GridSearchCV(pipe, param_grid, scoring=scorer_f, cv=cv)

print "Fitting the classifier to the training set"
t0 = time()
clf.fit(features, labels)
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_

print "Testing the best estimator"
t0 = time()
test_classifier(clf.best_estimator_, my_dataset, features_list)
print "done in %0.3fs" % (time() - t0)

Fitting the classifier to the training set
done in 186.484s
Best estimator found by grid search:
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', RandomizedPCA(copy=True, iterated_power=3, n_components=15, random_state=None,
       whiten=True)), ('logreg', LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=-1, penalty='l2', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])
Testing the best estimator
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', RandomizedPCA(copy=True, iterated_power=3, n_components=15, random_state=None,
       whiten=True)), ('logreg', LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=-1, penalty='l2', random_st

The test_classifier function runs a 1000-fold cross-validation with the given classifier and the given feature list. It prints several performance metrics (accuracy, precision, recall, F-scores) that help assess the classifier's performance. Precision and recall are the two main metrics I was trying to maximize: the goal was to achieve precision and recall values higher than 0.3.
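The tester output above reports a single set of counts ("Total predictions: 15000"), which suggests that the predictions of all folds are pooled before the metrics are computed, rather than averaging per-fold scores. Below is a minimal pure-Python sketch of that pooling, assuming each fold yields a pair of true and predicted label lists. The helper name `pooled_counts` and the toy folds are illustrative and are not taken from tester.py.

```python
def pooled_counts(folds):
    """Sum confusion-matrix counts over all folds.

    `folds` is an iterable of (y_true, y_pred) pairs, where 1 marks a PoI
    and 0 a non-PoI. Hypothetical helper, not the tester.py code.
    """
    tp = fp = fn = tn = 0
    for y_true, y_pred in folds:
        for truth, pred in zip(y_true, y_pred):
            if truth == 1 and pred == 1:
                tp += 1
            elif truth == 0 and pred == 1:
                fp += 1
            elif truth == 1 and pred == 0:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


# Two toy folds with made-up labels, only to show the mechanics.
toy_folds = [
    ([1, 0, 0, 1], [1, 0, 1, 0]),
    ([0, 1, 0, 0], [0, 1, 0, 0]),
]
tp, fp, fn, tn = pooled_counts(toy_folds)
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
print("TP=%d FP=%d FN=%d TN=%d" % (tp, fp, fn, tn))
print("precision=%.3f recall=%.3f" % (precision, recall))
```

Pooling the counts avoids undefined per-fold values when a small test fold contains no predicted PoI.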

In this specific case:
- Recall is the ratio of correctly identified PoIs to the total number of actual PoIs.
- Precision is the ratio of correctly identified PoIs to the total number of persons flagged as PoIs.
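These definitions, together with the F-scores printed by the tester, fit in a few lines of Python. The sketch below uses the standard F-beta formula; the function names and the counts passed to `precision_recall` are illustrative. Plugging in the precision and recall reported above for the optimized Logistic Regression (0.33301 and 0.50950) reproduces the reported F1 and F2 values up to rounding.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive
    and false negative counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-beta score: recall is weighted beta times as much as precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, only to illustrate the two ratios.
print(precision_recall(tp=10, fp=20, fn=10))

# Precision and recall reported for the optimized Logistic Regression.
p, r = 0.33301, 0.50950
print("F1: %.5f" % f_beta(p, r, beta=1))  # F1: 0.40277
print("F2: %.5f" % f_beta(p, r, beta=2))  # F2: 0.46067
```

F2 weights recall more heavily than precision, which is why it is higher than F1 here, where recall exceeds precision.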

In general, scaling the features and reducing their dimensionality should help improve classifier performance.

But to my surprise, the best classification performance is provided by the simple optimized Logistic Regression. Moreover, this algorithm is much faster to train than the SVC and the Random Forest.

I suspect that, because the C hyperparameter of logistic regression (the inverse of the regularization strength) is designed to control overfitting, tuning it may mitigate the need for a dimensionality reduction step. The result also suggests that logistic regression does not need scaling to be properly trained, although the pipeline selected here still includes the StandardScaler step.
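For reference, and assuming the formulation given in the scikit-learn documentation for binary labels $y_i \in \{-1, 1\}$, the L1- and L2-penalized problems explored by the grid search can be written as follows, where $w$ is the weight vector and $c$ the intercept (the class weights introduced by `class_weight="balanced"` are omitted for readability):

$$
\min_{w, c} \; \|w\|_1 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^T w + c\right)\right)\right)
$$

$$
\min_{w, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^T w + c\right)\right)\right)
$$

A small C gives the penalty term more weight relative to the data term and shrinks the coefficients (with the L1 penalty, some of them to exactly zero), which is why tuning C can play a role similar to reducing the number of input dimensions.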

As a consequence, I decided to implement this optimized algorithm in the poi_id file. 


## Bibliography


- scikit-learn examples: http://scikit-learn.org/ (RandomForestClassifier, AdaBoostClassifier, LogisticRegression, Pipeline, GridSearchCV, StratifiedShuffleSplit, RandomizedPCA, SVC)
- Python Machine Learning, Sebastian Raschka, 2015, Packt Publishing
- Udacity Nanoprogram - Machine Learning Course