# Detect Enron POI

In choosing the features, the following strategy was used:
- Start with all features
- Run SelectKBest for all features, plot results
- Omit features that do not seem to affect the model
- Run the model and see the test results
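
The scoring step of this strategy can be sketched with scikit-learn directly. This is a minimal illustration, not the notebook's `getFeatureImportance` helper: the data is synthetic, the feature names are hypothetical, and `f_classif` is assumed as the scoring function.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the real data: 100 samples, 5 features,
# with an imbalanced binary label (20 positives).
rng = np.random.RandomState(42)
labels = np.array([0] * 80 + [1] * 20)
features = rng.normal(size=(100, 5))
features[:, 0] += 2.0 * labels  # only the first feature carries signal

feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']  # hypothetical names

# Score every feature, then keep the k best ones.
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(features, labels)

# Rank features by F-score, highest first.
ranking = sorted(zip(feature_names, selector.scores_, selector.pvalues_),
                 key=lambda item: item[1], reverse=True)
for name, score, p_value in ranking:
    print('%s  F=%.2f  p=%.4f' % (name, score, p_value))

# transform() keeps only the k selected columns.
reduced = selector.transform(features)
```

Features with a low score and a high p-value are the candidates to omit before re-running the model.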

Additionally, since the dataset has many features relative to its number of samples, PCA was also tried:
- Run PCA with different numbers of components
- Run the model and see the test results
- Choose the number of PCA components that gives the best results
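
As a rough sketch of that search, the loop below tries several component counts and keeps the one with the best cross-validated score. It is an illustration under assumptions: the data is synthetic, `GaussianNB` stands in for the notebook's `ClassifyNB` helper, F1 is an arbitrary choice of metric, and the `sklearn.model_selection` module requires scikit-learn 0.18 or newer.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: 100 samples, 6 features, 20 positives.
rng = np.random.RandomState(0)
labels = np.array([0] * 80 + [1] * 20)
features = rng.normal(size=(100, 6))
features[:, 0] += 2.0 * labels  # only the first feature carries signal

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for n_components in range(1, 5):
    # Fitting PCA inside the pipeline means it only sees training folds.
    model = make_pipeline(PCA(n_components=n_components), GaussianNB())
    scores = cross_val_score(model, features, labels, cv=cv, scoring='f1')
    results[n_components] = scores.mean()

best_n = max(results, key=results.get)
print(results)
print('best n_components: %d' % best_n)
```

Putting PCA inside the pipeline avoids fitting the projection on data that is later used for scoring.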

In [None]:
# Create a list of feature names, handy for plot labeling etc.
features_only_list = list(features_list)
features_only_list.remove('poi')

labels, features = targetFeatureSplit(data)

features_train, features_test, labels_train, labels_test = \
    stratifiedShuffleSplit(features, labels)

# Get feature importance:
importance, selector = getFeatureImportance(features_train, labels_train, features_only_list, k = 8)
plt.figure()
plotFeatureImportance(importance, features_only_list, plt)
plt.show()

### First try with all features: 
print 'P_values of the features:'
pprint.pprint([i[2] for i in importance])
print '\n'
# Do some hyperparam validation:
best_svc, svc_grid_scores = ClassifySVM.gridsearch(features_train, labels_train)

# Do fits based on hyperparam validation
nbfit = ClassifyNB.train(features_train, labels_train)
svmfit = ClassifySVM.train(features_train, labels_train, best_svc)

### Probably better to test with precision and recall:
print 'Naive bayes:'
test_classifier(nbfit, data)
print 'SVM:'
test_classifier(svmfit, data)


The p-values alone already suggest that quite a few variables are unlikely to be good features for the model. At a 5% significance level (p < 0.05), the 8 best features appear to be the ones worth keeping.
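
The choice of `k` can be derived from the p-values instead of being read off the plot. The sketch below assumes each entry of `importance` is a `(name, score, p_value)` tuple, which matches the `i[2]` indexing used above but is otherwise a guess about the helper's output. The names and values are invented for illustration.

```python
# Hypothetical (name, score, p_value) tuples; the numbers are made up.
importance = [
    ('feature_a', 25.1, 0.000002),
    ('feature_b', 9.3, 0.003),
    ('feature_c', 4.2, 0.043),
    ('feature_d', 2.1, 0.15),
    ('feature_e', 0.2, 0.66),
]

ALPHA = 0.05  # 5% significance level

# Keep only the features whose p-value is below the threshold.
significant = [name for name, score, p_value in importance
               if p_value < ALPHA]
k = len(significant)

print(significant)
print(k)
```

The resulting `k` could then be passed to the selector instead of a hard-coded value.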

Let's try with the selector:

In [None]:
### Next try with selector
selector_train = selector.transform(features_train)

# Do some hyperparam validation:
best_svc, svc_grid_scores = ClassifySVM.gridsearch(selector_train, labels_train)

# Do fits based on hyperparam validation
nbfit = ClassifyNB.train(selector_train, labels_train)
svmfit = ClassifySVM.train(selector_train, labels_train, best_svc)

### Probably better to test with precision and recall:
print 'Naive bayes:'
test_classifier(nbfit, data)
print 'SVM:'
test_classifier(svmfit, data)

Quite a bit better, especially with SVM.

PCA was used in the following way:

In [None]:
labels, features = targetFeatureSplit(data)

features_train, features_test, labels_train, labels_test = \
    stratifiedShuffleSplit(features, labels)

### Do some PCA
pca = PCA.doPCA(features_train, n = 3)
transformed_train = pca.transform(features_train)
transformed_test = pca.transform(features_test)

features_only_list = ['pca'+str(i) for i in range(len(transformed_train[0]))]

# Do some hyperparam validation:
best_svc, svc_grid_scores = ClassifySVM.gridsearch(transformed_train, labels_train)

# Do fits based on hyperparam validation
nbfit = ClassifyNB.train(transformed_train, labels_train)
svmfit = ClassifySVM.train(transformed_train, labels_train, best_svc)

### Probably better to test with precision and recall:
print 'Naive bayes:'
test_classifier(nbfit, data)
print 'SVM:'
test_classifier(svmfit, data)

PCA yielded results that are exactly the same as those from SelectKBest.

Let's try combining both:

In [None]:
from sklearn.decomposition import RandomizedPCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion

pca = RandomizedPCA(n_components=1)
selector = SelectKBest(score_func=f_classif, k=1)

# Build estimator from PCA and Univariate selection:

combined_features = FeatureUnion([("pca", pca), ("f_classif", selector)])

# Use combined features to transform dataset:
combined_train = combined_features.fit(features_train, labels_train).transform(features_train)

# Do some hyperparam validation:
best_svc, svc_grid_scores = ClassifySVM.gridsearch(combined_train, labels_train)

# Do fits based on hyperparam validation
nbfit = ClassifyNB.train(combined_train, labels_train)
svmfit = ClassifySVM.train(combined_train, labels_train, best_svc)

### Probably better to test with precision and recall:
print 'Naive bayes:'
test_classifier(nbfit, data)
print 'SVM:'
test_classifier(svmfit, data)

Unsurprisingly, the results are the same.

Now, the precision of the model should be raised above 0.3 in order to pass the evaluation requirements. To do that, let's construct a new feature.

Let's try to get a bit more bang for our buck from the email features. Let's see how a ratio feature affects the model by constructing the following feature:
$$X_{email\_from\_poi\_ratio} = \frac{X_{from\_poi\_to\_this\_person}+X_{to\_poi\_from\_this\_person}}{X_{from\_messages}+X_{to\_messages}}$$
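
A minimal sketch of how this ratio could be computed for one person is shown below. It assumes each person is a dict keyed by the names in the formula's subscripts, and that missing values appear as `None` or the string `'NaN'`. Both the record layout and the choice to return 0.0 for missing data are assumptions made for illustration, and `email_poi_ratio` is a hypothetical helper name.

```python
def email_poi_ratio(person):
    """Share of a person's messages that were exchanged with POIs.

    Returns 0.0 when any input is missing or when the person has no
    messages at all (an assumed convention, not a requirement).
    """
    keys = ('from_poi_to_this_person', 'to_poi_from_this_person',
            'from_messages', 'to_messages')
    values = []
    for key in keys:
        value = person.get(key)
        if value is None or value == 'NaN':
            return 0.0
        values.append(float(value))

    from_poi, to_poi, sent, received = values
    total = sent + received
    if total == 0:
        return 0.0  # avoid dividing by zero
    return (from_poi + to_poi) / total


# Made-up example record: 15 of 150 messages involve a POI.
example = {
    'from_poi_to_this_person': 10,
    'to_poi_from_this_person': 5,
    'from_messages': 50,
    'to_messages': 100,
}
ratio = email_poi_ratio(example)
print(ratio)
```

Because the denominator counts all messages, the ratio stays between 0 and 1 as long as the POI counts are subsets of the totals.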