The starter code can be found in the final_project directory of the codebase that you downloaded for use with the mini-projects. Some relevant files: 

**poi_id.py** : Starter code for the POI identifier; you will write your analysis here. You will also submit a version of this file for your evaluator to verify your algorithm and results. 

**final_project_dataset.pkl** : The dataset for the project, more details below. 

**tester.py** : When you turn in your analysis for evaluation by Udacity, you will submit the algorithm, dataset and list of features that you use (these are created automatically in **poi_id.py**). The evaluator will then use this code to test your result, to make sure we see performance that’s similar to what you report. You don’t need to do anything with this code, but we provide it for transparency and for your reference. 

**emails_by_address** : This directory contains many text files, each of which contains all the messages to or from a particular email address. It is for your reference, if you want to create more advanced features based on the details of the emails dataset. You do not need to process the email corpus in order to complete the project.

# Steps to Success
We will provide you with starter code that reads in the data, takes your features of choice, then puts them into a numpy array, which is the input form that most sklearn functions assume. Your job is to engineer the features, pick and tune an algorithm, and to test and evaluate your identifier. Several of the mini-projects were designed with this final project in mind, so be on the lookout for ways to use the work you’ve already done.

As preprocessing to this project, we've combined the Enron email and financial data into a dictionary, where each key-value pair in the dictionary corresponds to one person. The dictionary key is the person's name, and the value is another dictionary, which contains the names of all the features and their values for that person. The features in the data fall into three major types, namely financial features, email features and POI labels.
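To make the nested layout concrete, here is a minimal sketch using a toy stand-in for the real dataset. The names and values are made up for illustration; the only detail taken from the project data is that missing entries are stored as the string 'NaN'.

```python
# Toy stand-in for the real dataset: the names and values below are made up.
data_dict = {
    "DOE JOHN": {"salary": 250000, "bonus": "NaN", "to_messages": 1200, "poi": False},
    "ROE JANE": {"salary": "NaN", "bonus": 500000, "to_messages": 800, "poi": True},
}

# Each key is a person's name; each value maps feature name -> feature value.
for name, person in sorted(data_dict.items()):
    print("{}: poi={}, salary={}".format(name, person["poi"], person["salary"]))

# Count how many people are labelled as persons of interest.
poi_count = sum(1 for person in data_dict.values() if person["poi"])

# Count missing entries per feature (missing values are the string 'NaN').
missing = {}
for person in data_dict.values():
    for feature_name, value in person.items():
        if value == "NaN":
            missing[feature_name] = missing.get(feature_name, 0) + 1
```

Running the same two counts on the real dictionary is a quick way to see how unbalanced the POI label is and which features are mostly empty.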

**financial features**: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all units are in US dollars)

**email features**: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'] (units are generally number of email messages; the notable exception is 'email_address', which is a text string)

**POI label**: ['poi'] (boolean, represented as integer)
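The conversion from this dictionary to numeric rows, with the label in the first column, can be pictured roughly as follows. This is a sketch under assumptions: `to_rows` is a hypothetical helper, not the course's featureFormat, which may differ in details such as which rows it drops. The final step of wrapping the rows in a numpy array is omitted to keep the example dependency-free.

```python
def to_rows(dataset, feature_list):
    """Turn the nested dictionary into a list of numeric rows, one per person.

    Hypothetical helper for illustration only.
    """
    rows = []
    for name in sorted(dataset):
        person = dataset[name]
        row = []
        for feature in feature_list:
            value = person[feature]
            # Missing values appear as the string 'NaN'; treat them as 0 here.
            row.append(0.0 if value == "NaN" else float(value))
        rows.append(row)
    return rows


# Toy records with made-up values.
toy_dataset = {
    "DOE JOHN": {"poi": False, "salary": 250000, "bonus": "NaN"},
    "ROE JANE": {"poi": True, "salary": "NaN", "bonus": 500000},
}

rows = to_rows(toy_dataset, ["poi", "salary", "bonus"])
labels = [row[0] for row in rows]      # first column is the POI label
features = [row[1:] for row in rows]   # remaining columns are the inputs
```

A text feature such as 'email_address' cannot go through `float`, so it has to be left out of the feature list in a conversion like this.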

You are encouraged to make, transform or rescale new features from the starter features. If you do this, you should store the new feature to my_dataset, and if you use the new feature in the final algorithm, you should also add the feature name to my_feature_list, so your evaluator can access it during testing. For a concrete example of a new feature that you could add to the dataset, refer to the lesson on Feature Selection.
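As a sketch of that workflow, the example below adds one ratio feature to a toy `my_dataset` and registers it in `my_feature_list`. The feature name `fraction_to_poi`, the helper `safe_fraction`, and the records themselves are assumptions made for illustration.

```python
def safe_fraction(numerator, denominator):
    """Return numerator / denominator, or 0.0 if either is missing ('NaN') or zero."""
    if numerator in ("NaN", 0) or denominator in ("NaN", 0):
        return 0.0
    return float(numerator) / denominator


# Toy records with made-up values; the real dataset has many more features.
my_dataset = {
    "PERSON A": {"poi": True, "from_messages": 40, "from_this_person_to_poi": 10},
    "PERSON B": {"poi": False, "from_messages": "NaN", "from_this_person_to_poi": "NaN"},
}
my_feature_list = ["poi", "from_messages"]

# Store the engineered feature on every person ...
for person in my_dataset.values():
    person["fraction_to_poi"] = safe_fraction(
        person["from_this_person_to_poi"], person["from_messages"])

# ... and register it so the evaluator can find it during testing.
my_feature_list.append("fraction_to_poi")
```

Guarding against missing or zero denominators matters here because a person with no recorded messages would otherwise raise an error or produce a meaningless ratio.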

In addition, we advise that you keep notes as you work through the project. As part of your project submission, you will compose answers to a series of questions (given on the next page) to understand your approach towards different aspects of the analysis. Your thought process is, in many ways, more important than your final project and we will be trying to probe your thought process in these questions.

Free form questions 
https://docs.google.com/document/d/1NDgi1PrNJP7WTbfSUuRUnz8yzs5nGVTSzpO7oeNTEWA/pub?embedded=true

Rubric
https://review.udacity.com/#!/rubrics/27/view

A list of Web sites, books, forums, blog posts, GitHub repositories, etc. that you referred to or used in this submission (add N/A if you did not use such resources). Please carefully read the following statement and include it in your document: “I hereby confirm that this submission is my work. I have cited above the origins of any parts of the submission that were taken from Websites, books, forums, blog posts, github repositories, etc.”

# Findings
- deferred_income, director_fees, restricted_stock_deferred give precision and recall > 0.3

In [20]:
%matplotlib inline
from sklearn.cross_validation import StratifiedShuffleSplit

import time
def print_time(name, start_time):
    print "{} time: {}s".format(name, round(time.time() - start_time, 3))

DISPLAY_STRING = "{:>0.{display_precision}f}\t\t"

PERF_FORMAT_STRING = "\
Accuracy\tPrecision\tRecall\t\tF1\t\tF2\n\
{0}{0}{0}{0}{0}".format(
    DISPLAY_STRING
)

RESULTS_FORMAT_STRING = "\
Total predictions, \tTrue positives, \tFalse positives, \tFalse negatives, \tTrue negatives \n\
{:4d}\t\t\t{:4d}\t\t\t{:4d}\t\t\t{:4d}\t\t\t{:4d}"

def ratio(numerator, denominator):
    if numerator == 0 or denominator == 0:
        return 0
    
    return float(numerator) / denominator


def get_quadrant(true_negatives, false_negatives, true_positives, false_positives, predictions, labels_test):
    for prediction, truth in zip(predictions, labels_test):
        if prediction == 0 and truth == 0:
            true_negatives += 1
        elif prediction == 0 and truth == 1:
            false_negatives += 1
        elif prediction == 1 and truth == 0:
            false_positives += 1
        elif prediction == 1 and truth == 1:
            true_positives += 1
        else:
            print "Warning: Found a predicted label not == 0 or 1."
            print "All predictions should take value 0 or 1."
            print "Evaluating performance for processed predictions:"
            break
    
    return true_negatives, false_negatives, true_positives, false_positives


def get_stats(true_negatives, false_negatives, true_positives, false_positives):
    total_predictions = true_negatives + false_negatives + false_positives + true_positives
    accuracy = ratio(true_positives + true_negatives, total_predictions)
    precision = ratio(true_positives, true_positives + false_positives)
    recall = ratio(true_positives, true_positives + false_negatives)
    f1 = ratio(2.0 * true_positives, 2 * true_positives + false_positives + false_negatives)
    f2 = ratio(5 * precision * recall, (4 * precision) + recall)

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'f2': f2,
        'total_predictions': total_predictions,
        'true_positives': true_positives,
        'false_positives': false_positives,
        'false_negatives': false_negatives,
        'true_negatives': true_negatives
    }

def get_labels_and_features(dataset, feature_list):
    data = featureFormat(dataset, feature_list, sort_keys = True)
    return targetFeatureSplit(data)

def test_classifier(clf, dataset, feature_list, folds = 1000):
        
    labels, features = get_labels_and_features(dataset, feature_list)
    cv = StratifiedShuffleSplit(labels, folds, random_state = 42)

    true_negatives = 0
    false_negatives = 0
    true_positives = 0
    false_positives = 0
    

    for train_idx, test_idx in cv:
        #print "counting {}, {}, {}, {}".format(true_negatives, false_negatives, true_positives, false_positives)
        
        features_train = []
        features_test  = []
        labels_train   = []
        labels_test    = []
        for ii in train_idx:
            features_train.append( features[ii] )
            labels_train.append( labels[ii] )
        for jj in test_idx:
            features_test.append( features[jj] )
            labels_test.append( labels[jj] )
        
        clf.fit(features_train, labels_train)
        
        predictions = clf.predict(features_test)
        
        true_negatives, false_negatives, true_positives, false_positives = get_quadrant(
            true_negatives, false_negatives, true_positives, false_positives, predictions, labels_test
        )
        
    return get_stats(true_negatives, false_negatives, true_positives, false_positives)

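def test_classifier_fast(clf, dataset, feature_list):
    # Single train/test split (no cross-validation), for quick timing checks.
    from sklearn.cross_validation import train_test_split

    labels, features = get_labels_and_features(dataset, feature_list)
    # Split the data built from feature_list, rather than relying on
    # features_train / labels_train globals defined elsewhere in the notebook.
    features_train, features_test, labels_train, labels_test = \
        train_test_split(features, labels, test_size=0.3, random_state=42)

    t0 = time.time()
    clf.fit(features_train, labels_train)
    print_time("training", t0)

    t0 = time.time()
    predictions = clf.predict(features_test)
    print_time("predictions", t0)

    true_negatives, false_negatives, true_positives, false_positives = get_quadrant(
        0, 0, 0, 0, predictions, labels_test
    )

    return get_stats(true_negatives, false_negatives, true_positives, false_positives)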


CLF_PICKLE_FILENAME = "my_classifier.pkl"
DATASET_PICKLE_FILENAME = "my_dataset.pkl"
FEATURE_LIST_FILENAME = "my_feature_list.pkl"


def load_classifier_and_data():
    # Pickle files are binary, so open them in binary mode.
    with open(CLF_PICKLE_FILENAME, "rb") as clf_infile:
        clf = pickle.load(clf_infile)
    with open(DATASET_PICKLE_FILENAME, "rb") as dataset_infile:
        dataset = pickle.load(dataset_infile)
    with open(FEATURE_LIST_FILENAME, "rb") as featurelist_infile:
        feature_list = pickle.load(featurelist_infile)
    return clf, dataset, feature_list

    
#test_classifier(clf, my_dataset, features_list)
#print(PERF_FORMAT_STRING)

In [4]:
from multiprocessing import Pool
from sklearn.naive_bayes import GaussianNB
import time
import itertools
import pandas as pd
import pickle

import numpy as np

STORE_LOCATION = "intermediate_results/"

def store_dictionary(dictionary, file_name):
    with open(STORE_LOCATION + file_name, 'wb') as f:
        pickle.dump(dictionary, f)
    
def dict_to_df(dictionary):
    return pd.DataFrame.from_dict(dictionary, orient = 'index')

def read_dictionary(dict_file_name):
    with open(STORE_LOCATION + dict_file_name, 'rb') as f:
        tmp_dict = pickle.load(f)
    return tmp_dict, dict_to_df(tmp_dict)

def result_for_features(dictionary, features):
    return dictionary[feature_list_to_key(features)]

def get_rows_above_threshold(df, 
                             threshold_precision = 0.3,
                             threshold_recall = 0.3, 
                             threshold_accuracy = 0.8, 
                             metrics = ['precision', 'recall', 'accuracy']):
    condition = ((df.precision >= threshold_precision) & 
                 (df.recall >= threshold_recall) & 
                 (df.accuracy > threshold_accuracy))
    return df[condition].loc[:, metrics]
#     return df[condition].loc[:, :]


def get_combination_features(n):
    for combination_length in xrange(1, n + 1):
        for comb in itertools.combinations(all_features, combination_length):
            yield ['poi'] + list(comb)
            
def get_number_combinations(n):
    return len(list(get_combination_features(n))) - 1

def feature_list_to_key(_features_list):
    copy = _features_list[:]
    copy.remove('poi')
    return ",".join(sorted(copy))

def estimate_time(current, total, time_taken):
    if current == 0:
        return 0
    return (time_taken * total) / float(current)

def report_time(current, total, begin_time, estimates, how_often = 1):
    estimates[:-10] = []
    time_taken = time.time() - begin_time
    
    start_estimate = estimate_time(current, total, time_taken)
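    # With no earlier estimates, np.mean([]) would return nan, so use the raw estimate.
    if current == 0 or not estimates: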
        estimate = start_estimate
    else:
        estimate = np.mean([start_estimate, np.mean(estimates)])
    estimates.append(estimate)
    
    if (current > 0) and (current % how_often == 0):
        print("reached {:5d} / {:5d} || done {:6.2f} %  || Time taken {:10.2f} / {:10.2f}".format(
                current, total, ratio(time_taken * 100, estimate), time_taken, estimate
            ))
        
def at_most_features(df, n):
    return df[df.index.map(lambda x: len(x.split(',')) <= n)]

def above_threshold_at_most(df, p, r, a, n):
    return at_most_features(get_rows_above_threshold(df, p, r, a, 
                                                     ['precision', 'recall', 'accuracy', 'f1', 'f2']), n)
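`get_combination_features` enumerates every feature subset up to a given size, so the cost of the brute-force search is driven by how many subsets there are. Below is a minimal sketch of counting them without building the whole list, assuming 19 candidate features (the length of `all_features` as defined later in this notebook). The names `n_choose_k` and `count_feature_subsets` are illustrative, not part of the project code.

```python
import itertools
from math import factorial


def n_choose_k(n, k):
    """Binomial coefficient: number of ways to pick k items out of n."""
    return factorial(n) // (factorial(k) * factorial(n - k))


def count_feature_subsets(n_features, max_length):
    """Number of non-empty feature subsets with at most max_length features."""
    return sum(n_choose_k(n_features, k) for k in range(1, max_length + 1))


# Cross-check the formula against explicit enumeration on a small toy list.
toy_features = ["a", "b", "c", "d", "e"]
enumerated = sum(
    1
    for length in range(1, 4)
    for _ in itertools.combinations(toy_features, length)
)

# Subsets of up to 4 features out of 19 candidates.
subsets_up_to_4 = count_feature_subsets(19, 4)
print(subsets_up_to_4)
```

Since every subset is scored with 1000 cross-validation folds by default, the subset count multiplies directly into the number of model fits, which is why the time estimates printed by `report_time` are useful.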

In [5]:
import sklearn
sklearn.__version__

'0.17.1'


---


Let me start by copying the code from the main file and breaking it up here, so that I can rearrange it and experiment with it.

In [6]:
import sys

projects_home = '/home/aseem/projects/ud120-projects'
final_project_home = projects_home + '/final_project/'
sys.path.append(final_project_home)
sys.path.append(projects_home + '/tools/')

In [24]:
import sys
import pickle

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data#, test_classifier

with open(final_project_home + "final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)
    
    #remove outliers
    del data_dict['TOTAL']
    
    #Replace NaN with 0
    for key, value in data_dict.iteritems():
        for k, v in value.iteritems():
            if v == 'NaN':
                value[k] = 0

    #remove features with too few values (and the text-valued email_address)
    columns_to_remove = ['email_address', 'loan_advances', 'restricted_stock_deferred', 'director_fees']
    for key, value in data_dict.iteritems():
        for column in columns_to_remove:
            del value[column]
    
    #Create new features
    for key, value in data_dict.iteritems():
        value['fraction_poi_to_this_person'] = ratio(value['from_poi_to_this_person'], value['to_messages'])
        value['fraction_from_this_person_to_poi'] = ratio(value['from_this_person_to_poi'], value['from_messages'])
        value['total_income'] = value['salary'] + value['bonus'] + value['long_term_incentive'] + \
                value['other'] + value['expenses']
        
#Store to my Dataset for easy export
my_dataset = data_dict

#Make final classifier for export
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

steps = [
    ('scaler', MinMaxScaler()),
    ('classifier', KNeighborsClassifier(n_neighbors = 1))
]

from sklearn.pipeline import Pipeline
clf = Pipeline(steps)


features_list = [
    'poi',
    'exercised_stock_options',
    'fraction_from_this_person_to_poi',
    'from_messages',
    'from_poi_to_this_person'
]

#dump_classifier_and_data(clf, my_dataset, features_list)
#clf, my_dataset, feature_list = load_classifier_and_data()

test_classifier(clf, my_dataset, features_list)

{'accuracy': 0.88,
 'f1': 0.6232339089481946,
 'f2': 0.6062919975565059,
 'false_negatives': 809,
 'false_positives': 631,
 'precision': 0.6536772777167947,
 'recall': 0.5955,
 'total_predictions': 12000,
 'true_negatives': 9369,
 'true_positives': 1191}
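The reported metrics follow directly from the four confusion-matrix counts. A quick arithmetic check, using the same formulas as `get_stats` and the counts printed above:

```python
# Confusion-matrix counts copied from the result above.
true_positives, false_positives = 1191, 631
false_negatives, true_negatives = 809, 9369

total = true_positives + false_positives + false_negatives + true_negatives
accuracy = float(true_positives + true_negatives) / total
precision = float(true_positives) / (true_positives + false_positives)
recall = float(true_positives) / (true_positives + false_negatives)
f1 = 2.0 * true_positives / (2 * true_positives + false_positives + false_negatives)
# F2 weights recall more heavily than precision (beta = 2).
f2 = 5 * precision * recall / (4 * precision + recall)

print("accuracy={:.4f} precision={:.4f} recall={:.4f} f1={:.4f} f2={:.4f}".format(
    accuracy, precision, recall, f1, f2))
```

Accuracy is dominated by the 9369 true negatives, so precision and recall say more about how well the POIs themselves are identified.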

In [53]:
### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
all_features = [
#     'poi',
     'salary',
     'to_messages',
     'deferral_payments',
     'total_payments',
     'exercised_stock_options',
     'bonus',
     'restricted_stock',
     'shared_receipt_with_poi',
#      'restricted_stock_deferred',
     'total_stock_value',
     'expenses',
#      'loan_advances',
     'from_messages',
     'other',
     'from_this_person_to_poi',
#      'director_fees',
     'deferred_income',
     'long_term_incentive',
    'from_poi_to_this_person',
    'fraction_poi_to_this_person',
    'fraction_from_this_person_to_poi',
    'total_income'
]

features_list = [
    'poi',
    'exercised_stock_options',
    'fraction_from_this_person_to_poi',
    'from_messages',
    'from_poi_to_this_person'
]

feature_list_to_key(features_list)

'exercised_stock_options,fraction_from_this_person_to_poi,from_messages,from_poi_to_this_person'

In [44]:
### Task 3: Create new feature(s)
### Store to my_dataset for easy export below.
my_dataset = data_dict
    

### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
print len(data)
labels, features = targetFeatureSplit(data)

120


In [45]:
### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines. For more info:
### http://scikit-learn.org/stable/modules/pipeline.html

# Provided to give you a starting point. Try a variety of classifiers.
from sklearn.naive_bayes import GaussianNB
#clf = GaussianNB()

from sklearn import svm
#clf = svm.SVC(kernel="linear")

from sklearn.neighbors import KNeighborsClassifier
#clf = KNeighborsClassifier(n_neighbors = 1)

In [46]:
### Task 5: Tune your classifier to achieve better than .3 precision and recall 
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info: 
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html

# Example starting point. Try investigating other evaluation techniques!
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)
    
### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.

dump_classifier_and_data(clf, my_dataset, features_list)
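The Task 5 comment mentions stratified shuffle split cross-validation. The idea can be sketched in plain Python: shuffle the indices of each class separately, then carve the same fraction out of every class for the test set, so that rare labels such as POIs appear in each fold. This is an illustration only, not scikit-learn's implementation; the function name, the `test_fraction` default, and the rounding rule are assumptions.

```python
import random


def stratified_shuffle_split(labels, n_splits, test_fraction=0.1, seed=42):
    """Yield (train_idx, test_idx) pairs that keep class proportions roughly equal.

    Illustrative sketch; rounding and shuffling details differ from scikit-learn.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for label in sorted(by_class):
            indices = by_class[label][:]
            rng.shuffle(indices)
            # Keep at least one test sample per class so rare classes are evaluated.
            n_test = max(1, int(round(len(indices) * test_fraction)))
            test_idx.extend(indices[:n_test])
            train_idx.extend(indices[n_test:])
        yield sorted(train_idx), sorted(test_idx)


# Toy labels: 4 positives and 16 negatives.
toy_labels = [1] * 4 + [0] * 16
splits = list(stratified_shuffle_split(toy_labels, n_splits=3, test_fraction=0.25))
```

With these toy labels, every test fold holds one positive and four negatives, mirroring the 1:4 class ratio of the full list.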

In [69]:
def do_neighbors1(number_features = 1):
    
    k_neigh_result_simple_test = {}
    estimates = []
    errors = []
    
    total_iterations = get_number_combinations(number_features)
    
    t0 = time.time()
    for i, tmp_list in enumerate(get_combination_features(number_features)):
        n_neighbors = 1
    #     for n_neighbors in xrange(1, 3):

        clf = KNeighborsClassifier(n_neighbors = n_neighbors)
        #print "features_list is {}".format(features_list)

        start = time.time()
        #result = test_classifier_fast(clf, my_dataset, features_list)
        result = test_classifier(clf, my_dataset, tmp_list)
        #result['n_neighbors'] = n_neighbors
        result['time_taken'] = time.time() - start
        
        
        key = "{}".format(feature_list_to_key(tmp_list))
    #     key = "{},{}".format(feature_list_to_key(tmp_list), n_neighbors)
        k_neigh_result_simple_test[key] = result
        
        if i > 0 and ((i <= 100 and i % 5 == 0) or i % 25 == 0):
            report_time(i, total_iterations, t0, estimates)
        
    else:
        print("*" * 30)
        print("total   {:4d} at {:10.4f}".format(i, time.time() - t0))
    
    return k_neigh_result_simple_test
    
#k_neigh_result_simple_test = do_neighbors1(5)

In [5]:
k_neigh_results, k_neigh_df = read_dictionary('features_5_raw_features_k_neigh.pkl')

In [8]:


def print_required1(df):
    print above_threshold_at_most(df, 0.3, 0.3, 0.8, 3).describe()

print '**** naive bayes'
print ""
#print_required1(naive_bayes_df)
print ""
print "**** k neigh neighbors"
print ""
#print_required1(k_neigh_df)

**** naive bayes


**** k neigh neighbors



In [220]:
#Use features found using Naive Bayes brute force
list_of_features_list = []
for index in get_rows_above_threshold(naive_bayes_df, 0.4).index:
    list_of_features_list.append(['poi'] + index.split(','))
    
len(list_of_features_list)

292

In [54]:
from sklearn import svm
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.metrics import make_scorer, f1_score, precision_score

def do_grid_search1():
    feature_list = ['poi', 'salary']

    t0 = time.time()

    param_grid = {
        'C': [1e-1, 1e0, 1e1]
    }
    clf = GridSearchCV(
        svm.SVC(kernel='linear'), 
        scoring = make_scorer(precision_score),
        param_grid=param_grid, 
        n_jobs=3, 
        verbose = 4
    )
    print_time("creation", t0)

    t0 = time.time()
    labels, features = get_labels_and_features(my_dataset, feature_list)
    features_train, features_test, labels_train, labels_test = \
        train_test_split(features, labels, test_size=0.3, random_state=42)
    print_time("split", t0)

    t0 = time.time()
    clf.fit(features_train, labels_train)
    print_time("training", t0)
    print "best estimator {}".format(clf.best_estimator_)

    t0 = time.time()
    predictions = clf.predict(features_test)
    print_time("predictions", t0)

    return get_stats(*get_quadrant(
        0, 0, 0, 0, predictions, labels_test
    ))
    
#do_grid_search1()

In [55]:
def do_svc_simple_test():
    svc_result_simple_test = {}

    for features_list in list_of_features_list:

        print "features_list is {}".format(features_list)

        start = time.time()
        result = test_classifier_fast(clf, my_dataset, features_list)
        result['time_taken'] = time.time() - start

        print "time_taken was {}".format(result['time_taken'])
        svc_result_simple_test[feature_list_to_key(features_list)] = result

    return svc_result_simple_test

In [18]:
dict_to_df(svc_result_simple_test)

Unnamed: 0,f1,f2,recall,precision,time_taken,accuracy
"deferred_income,exercised_stock_options,from_messages",0.363636,0.3125,0.285714,0.5,28.559094,0.758621
"deferred_income,exercised_stock_options,from_messages,from_poi_to_this_person",0.363636,0.3125,0.285714,0.5,33.051872,0.758621
"deferred_income,exercised_stock_options,from_messages,from_this_person_to_poi",0.363636,0.3125,0.285714,0.5,31.269867,0.758621
"deferred_income,exercised_stock_options,long_term_incentive,salary",0.363636,0.3125,0.285714,0.5,33.846827,0.758621
"deferred_income,exercised_stock_options,long_term_incentive,total_stock_value",0.363636,0.3125,0.285714,0.5,37.181807,0.758621
"deferred_income,expenses,from_messages,total_stock_value",0.363636,0.3125,0.285714,0.5,41.274096,0.758621
"deferred_income,from_messages,long_term_incentive,total_stock_value",0.363636,0.3125,0.285714,0.5,39.73673,0.758621


In [39]:
store_dictionary(svc_result_simple_test, "svc_on_naive_bayes_top_features.pkl")

In [None]:
svc_results = {}
for features_list in list_of_features_list:
    folds = 10
    start = time.time()
    
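    # SVC is reached through the svm module imported earlier in the notebook.
    clf = svm.SVC(kernel='linear')
    
    # verbose / verbose_at belong to an earlier revision of test_classifier
    # (see the logged output below); the version defined above accepts only folds.
    result = test_classifier(clf, my_dataset, features_list, folds=folds, verbose=True, verbose_at=1)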

    result['folds'] = folds
    result['time_taken'] = time.time() - start
    
    print "time taken was {}".format(result['time_taken'])
    print "\n"
    svc_results[feature_list_to_key(features_list)] = result

features list is ['poi', 'deferred_income', 'exercised_stock_options', 'from_messages']
clf = SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False), feature_list = ['poi', 'deferred_income', 'exercised_stock_options', 'from_messages'], folds = 10, verbose_at = 1


In [None]:
dict_to_df(svc_results)

In [128]:
# def add_results_to_map(clf, dictionary, features_list, my_dataset = data_dict):
#     current_key = feature_list_to_key(features_list)

#     dictionary[current_key] = test_classifier(clf, my_dataset, features_list)

def get_dict(clf, number_features = 2):
    dictionary = {}
    estimates = []
    start = time.time()
    
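    #Tried using multiprocessing.Pool to make it faster. It was faster,
    # but the resulting dictionary was empty when accessed from outside
    # the function, most likely because Pool runs its workers in separate
    # processes, so entries added inside a worker never reach the parent.
    # Maybe TODO later: return the results from the workers instead.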
    
    #number_threads = 3
    #pool = Pool(number_threads)
    #pool.map(add_results_to_map, get_combination_features(number_features))

    total_iterations = get_number_combinations(number_features)
    for i, tmp_list in enumerate(get_combination_features(number_features)):
        
        current_key = feature_list_to_key(tmp_list)
        dictionary[current_key] = test_classifier(clf, my_dataset, tmp_list)
        
        report_time(i, total_iterations, start, estimates, 2)
    else:
        print("*" * 30)
        print("total   {:4d} at {:10.4f}".format(i, time.time() - start))
    
    return dictionary

def main():
    
    main_start = time.time()
    
    clf = GaussianNB()

    from sklearn.svm import SVC
    #clf = SVC(kernel="linear")

    return get_dict(clf, number_features = 1)
    
#temp_results_all = main()

reached     2 /    17 || done    17 %  || Time taken          2 /         10
reached     4 /    17 || done    29 %  || Time taken          3 /         10
reached     6 /    17 || done    41 %  || Time taken          4 /         10
reached     8 /    17 || done    53 %  || Time taken          5 /         10
reached    10 /    17 || done    65 %  || Time taken          7 /         10
reached    12 /    17 || done    73 %  || Time taken          8 /         10
reached    14 /    17 || done    84 %  || Time taken          9 /         10
reached    16 /    17 || done    95 %  || Time taken         10 /         10
******************************
total     17 at    10.4362


In [116]:
above_threshold_at_most(dict_to_df(temp_results_all), 0.3, 0.3, 0.7, 4)

Unnamed: 0,precision,recall,accuracy,f1,f2
exercised_stock_options,0.460545,0.321,0.904091,0.378315,0.341707


In [4]:
naive_bayes_results, naive_bayes_df = read_dictionary('features_4_raw_features_naive_bayes.pkl')

In [117]:
above_threshold_at_most(naive_bayes_df, 0.3, 0.3, 0.7, 1)

Unnamed: 0,precision,recall,accuracy,f1,f2
exercised_stock_options,0.460545,0.321,0.904091,0.378315,0.341707


In [30]:
def get_dict(clf, number_features = 2, report_on = 2):
    dictionary = {}
    estimates = []
    start = time.time()

    total_iterations = get_number_combinations(number_features)
    for i, tmp_list in enumerate(get_combination_features(number_features)):
        
        current_key = feature_list_to_key(tmp_list)
        dictionary[current_key] = test_classifier(clf, my_dataset, tmp_list)
        
        report_time(i, total_iterations, start, estimates, report_on)
    else:
        print("*" * 30)
        print("total   {:4d} at {:10.4f}".format(i, time.time() - start))
    
    return dictionary

def main():
    
    main_start = time.time()
    
    from sklearn.neighbors import KNeighborsClassifier
    algorithm = KNeighborsClassifier(n_neighbors = 1)
    
    from sklearn.preprocessing import MinMaxScaler
#     from sklearn.decomposition import PCA
    steps = [
        ('scaler', MinMaxScaler()),
        ('classifier', algorithm)
    ]

    from sklearn.pipeline import Pipeline
    clf = Pipeline(steps)

    return get_dict(clf, number_features = 4, report_on = 5)
    
#temp_results_all = main()

In [51]:
above_threshold_at_most(dict_to_df(temp_results_all), 0.55, 0.55, 0.85, 4)

Unnamed: 0,precision,recall,accuracy,f1,f2
"exercised_stock_options,fraction_from_this_person_to_poi,from_messages,from_poi_to_this_person",0.653677,0.5955,0.88,0.623234,0.606292
"exercised_stock_options,fraction_from_this_person_to_poi,from_poi_to_this_person",0.644087,0.58,0.876583,0.610366,0.591776


In [29]:
#store_dictionary(temp_results_all, "min_max_kneighbors_4_features_with_3_new_features.pkl")