# Homework 3
## Prediction of non-funded projects using cross-validation
Machine Learning for Public Policy

Camilo Arias

- Load, clean, and transform functions in pipeline.py

- Classifiers in classifiers.py

- Functions to run cross-validation in prediction.py

In [6]:
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pipeline as ppln
import classifiers as classif
import prediction
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit


## Parameters of the model

In [7]:
seed = 1234
params = {
    'days': 60,
    'test_size': 0.3,
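    'work_with_sample': False,  # False to use all rows, or a fraction passed to DataFrame.sample
    'thresholds': [0.5, 0.7, 0.8, 0.9, 0.95, 0.98, 0.99],  # cut-offs for every model except svm (see run log)
    'svm_scores': [-1, -0.8, -0.6, -0.4, -0.3, -0.1, 0],  # cut-offs used for svm (see run log)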
    'models_to_run': ['KNN', 'decision_tree', 'logistic_reg', 'svm', 'random_forest', 'gradient_boost'],
    'KNN': {'k': 100,
            'weights': 'uniform',
            'metric': 'euclidean'},

    'decision_tree': {'criterion': 'gini',
                      'max_depth': 25},

    'logistic_reg': {'C': 1,
                     'penalty': 'l1',
                     'fit_intercept': True,
                     'seed': seed},

    'svm': {'C': 1,
            'seed': seed},
    'random_forest': {'criterion': 'gini',
                      'max_depth': 25,
                      'n_estimators': 100,
                      'seed': seed},

    'gradient_boost': {'max_depth': 25,
                       'n_estimators': 100,
                       'loss': 'deviance',  # renamed 'log_loss' in scikit-learn 1.1
                       'seed': seed},
    'out_csv': 'results1.csv'
}
outcome_var = "not_funded_in_{}_days".format(params['days'])
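
The parameters above define two sets of cut-offs: `thresholds`, which lie between 0.5 and 0.99, and `svm_scores`, which lie between -1 and 0. One plausible reading, not confirmed by classifiers.py or prediction.py, is that the first set is applied to predicted probabilities and the second to SVM decision scores, which are not bounded to [0, 1]. The sketch below illustrates that idea on synthetic data. The helper `classify_at_threshold` and the use of `LinearSVC` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic stand-in data; this is not the projects data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=1234)


def classify_at_threshold(scores, threshold):
    """Label as positive (1) every row whose score exceeds the threshold."""
    return (np.asarray(scores) > threshold).astype(int)


# Probabilistic model: scores are estimates of P(y = 1), so cut-offs lie in [0, 1].
logit = LogisticRegression(max_iter=1000).fit(X, y)
proba = logit.predict_proba(X)[:, 1]
pred_by_proba = {t: classify_at_threshold(proba, t) for t in [0.5, 0.7, 0.9]}

# Linear SVM: scores are signed values relative to the separating hyperplane,
# so cut-offs are placed around 0 instead of inside [0, 1].
svm = LinearSVC(C=1, random_state=1234, max_iter=5000).fit(X, y)
margin = svm.decision_function(X)
pred_by_margin = {s: classify_at_threshold(margin, s) for s in [-1, -0.4, 0]}

for cutoff, labels in {**pred_by_proba, **pred_by_margin}.items():
    print(f"cut-off {cutoff}: {labels.sum()} rows labeled positive")
```

Raising the cut-off can only shrink the set of rows labeled positive, which is why recall tends to fall as the threshold grows.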

## Loading data and cleaning

In [8]:
projects_df = ppln.load_from_csv('projects_2012_2013.csv')
projects_df = ppln.create_outcome_var(projects_df, params['days'])
initial_length = projects_df.shape[1]
to_discrete = ['total_price_including_optional_support', 'students_reached']
to_dummy = ['school_state', 'school_metro', 'teacher_prefix', 'primary_focus_subject',
            'primary_focus_area', 'secondary_focus_subject', 'secondary_focus_area',
            'resource_type', 'poverty_level', 'grade_level', 'total_price_including_optional_support',
            'students_reached']
for col in to_discrete:
    projects_df[col] = ppln.discretize(projects_df[col], 4, string=True)
projects_df = ppln.make_dummies_from_categorical(projects_df, to_dummy)
projects_df['semester'], semesters = ppln.set_semester(projects_df['date_posted'])
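
The cell above discretizes two numeric columns into 4 bins and then expands the categorical columns into dummies. The pipeline.py source is not shown here, so the following is only a rough sketch of what such helpers might look like, assuming quantile bins via `pd.qcut` and one-hot encoding via `pd.get_dummies`. The function names and the toy data are hypothetical.

```python
import pandas as pd


def discretize_into_quantiles(series, n_bins, string=True):
    """Cut a numeric series into n_bins quantile bins (hypothetical helper)."""
    binned = pd.qcut(series, q=n_bins, duplicates="drop")
    # Return readable interval labels, or integer bin codes if string=False.
    return binned.astype(str) if string else binned.cat.codes


def dummies_from_categorical(df, columns):
    """Replace each listed column with one indicator column per category."""
    return pd.get_dummies(df, columns=columns)


# Toy data that borrows a few column names from the notebook.
toy = pd.DataFrame({
    "school_charter": [0, 1, 0, 0, 1, 0, 0, 1],
    "students_reached": [10, 25, 40, 80, 120, 300, 15, 60],
    "school_metro": ["urban", "rural", "urban", "urban",
                     "suburban", "rural", "urban", "urban"],
})
toy["students_reached"] = discretize_into_quantiles(toy["students_reached"], 4)
toy = dummies_from_categorical(toy, ["school_metro", "students_reached"])
print(toy.columns.tolist())
```

With `pd.get_dummies(df, columns=...)`, the encoded columns are dropped and their indicator columns are appended after the untouched ones, which matters for any later step that selects features by column position.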


### Optionally run the model on a random sample (fraction set by params['work_with_sample'])

In [9]:
if params['work_with_sample']:
    projects_df = projects_df.sample(frac=params['work_with_sample'],
                                     random_state=seed)
projects_df.shape                                                      

(124976, 166)

### Setting X and Y

In [10]:
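# Keep only columns created after loading. This assumes make_dummies_from_categorical
# drops the to_dummy columns and appends the new dummies at the end.
# Caution: 'semester' is also created after loading, so this slice appears to include it.
features = list(projects_df.columns[initial_length - len(to_dummy):])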
features += ['school_charter', 'school_magnet', 'eligible_double_your_impact_match']
y = projects_df[outcome_var]
x = projects_df[features]

In [7]:
dict_results = prediction.run_(x, y, projects_df['semester'], params)

Begining cross k: 1
Train set has 26386 rows, with semester values of [0]
Test set has 32771 rows, with semester values of [1]

Fitting KNN

Classifying model KNN with threshold 0.5
Classifying model KNN with threshold 0.7
Classifying model KNN with threshold 0.8
Classifying model KNN with threshold 0.9
Classifying model KNN with threshold 0.95
Classifying model KNN with threshold 0.98
Classifying model KNN with threshold 0.99

Fitting decision_tree

Classifying model decision_tree with threshold 0.5
Classifying model decision_tree with threshold 0.7
Classifying model decision_tree with threshold 0.8
Classifying model decision_tree with threshold 0.9
Classifying model decision_tree with threshold 0.95
Classifying model decision_tree with threshold 0.98
Classifying model decision_tree with threshold 0.99

Fitting logistic_reg

Classifying model logistic_reg with threshold 0.5
Classifying model logistic_reg with threshold 0.7
Classifying model logistic_reg with threshold 0.8
Classifying model logistic_reg with threshold 0.9
Classifying model logistic_reg with threshold 0.95
Classifying model logistic_reg with threshold 0.98
Classifying model logistic_reg with threshold 0.99

Fitting svm

Classifying model svm with threshold -1
Classifying model svm with threshold -0.8
Classifying model svm with threshold -0.6
Classifying model svm with threshold -0.4
Classifying model svm with threshold -0.3
Classifying model svm with threshold -0.1
Classifying model svm with threshold 0

Fitting random_forest

Classifying model random_forest with threshold 0.5
Classifying model random_forest with threshold 0.7
Classifying model random_forest with threshold 0.8
Classifying model random_forest with threshold 0.9
Classifying model random_forest with threshold 0.95
Classifying model random_forest with threshold 0.98
Classifying model random_forest with threshold 0.99

Fitting gradient_boost

Classifying model gradient_boost with threshold 0.5
Classifying model gradient_boost with threshold 0.7
Classifying model gradient_boost with threshold 0.8
Classifying model gradient_boost with threshold 0.9
Classifying model gradient_boost with threshold 0.95
Classifying model gradient_boost with threshold 0.98
Classifying model gradient_boost with threshold 0.99
Begining cross k: 2
Train set has 59157 rows, with semester values of [0 1]
Test set has 21774 rows, with semester values of [2]

Fitting KNN

Classifying model KNN with threshold 0.5
Classifying model KNN with threshold 0.7
Classifying model KNN with threshold 0.8
Classifying model KNN with threshold 0.9
Classifying model KNN with threshold 0.95
Classifying model KNN with threshold 0.98
Classifying model KNN with threshold 0.99

Fitting decision_tree

Classifying model decision_tree with threshold 0.5
Classifying model decision_tree with threshold 0.7
Classifying model decision_tree with threshold 0.8
Classifying model decision_tree with threshold 0.9
Classifying model decision_tree with threshold 0.95
Classifying model decision_tree with threshold 0.98
Classifying model decision_tree with threshold 0.99

Fitting logistic_reg

Classifying model logistic_reg with threshold 0.5
Classifying model logistic_reg with threshold 0.7
Classifying model logistic_reg with threshold 0.8
Classifying model logistic_reg with threshold 0.9
Classifying model logistic_reg with threshold 0.95
Classifying model logistic_reg with threshold 0.98
Classifying model logistic_reg with threshold 0.99

Fitting svm

Classifying model svm with threshold -1
Classifying model svm with threshold -0.8
Classifying model svm with threshold -0.6
Classifying model svm with threshold -0.4
Classifying model svm with threshold -0.3
Classifying model svm with threshold -0.1
Classifying model svm with threshold 0

Fitting random_forest

Classifying model random_forest with threshold 0.5
Classifying model random_forest with threshold 0.7
Classifying model random_forest with threshold 0.8
Classifying model random_forest with threshold 0.9
Classifying model random_forest with threshold 0.95
Classifying model random_forest with threshold 0.98
Classifying model random_forest with threshold 0.99

Fitting gradient_boost

Classifying model gradient_boost with threshold 0.5
Classifying model gradient_boost with threshold 0.7
Classifying model gradient_boost with threshold 0.8
Classifying model gradient_boost with threshold 0.9
Classifying model gradient_boost with threshold 0.95
Classifying model gradient_boost with threshold 0.98
Classifying model gradient_boost with threshold 0.99
Begining cross k: 3
Train set has 80931 rows, with semester values of [0 1 2]
Test set has 44045 rows, with semester values of [3]

Fitting KNN

Classifying model KNN with threshold 0.5
Classifying model KNN with threshold 0.7
Classifying model KNN with threshold 0.8
Classifying model KNN with threshold 0.9
Classifying model KNN with threshold 0.95
Classifying model KNN with threshold 0.98
Classifying model KNN with threshold 0.99

Fitting decision_tree

Classifying model decision_tree with threshold 0.5
Classifying model decision_tree with threshold 0.7
Classifying model decision_tree with threshold 0.8
Classifying model decision_tree with threshold 0.9
Classifying model decision_tree with threshold 0.95
Classifying model decision_tree with threshold 0.98
Classifying model decision_tree with threshold 0.99

Fitting logistic_reg

Classifying model logistic_reg with threshold 0.5
Classifying model logistic_reg with threshold 0.7
Classifying model logistic_reg with threshold 0.8
Classifying model logistic_reg with threshold 0.9
Classifying model logistic_reg with threshold 0.95
Classifying model logistic_reg with threshold 0.98
Classifying model logistic_reg with threshold 0.99

Fitting svm

Classifying model svm with threshold -1
Classifying model svm with threshold -0.8
Classifying model svm with threshold -0.6
Classifying model svm with threshold -0.4
Classifying model svm with threshold -0.3
Classifying model svm with threshold -0.1
Classifying model svm with threshold 0

Fitting random_forest

Classifying model random_forest with threshold 0.5
Classifying model random_forest with threshold 0.7
Classifying model random_forest with threshold 0.8
Classifying model random_forest with threshold 0.9
Classifying model random_forest with threshold 0.95
Classifying model random_forest with threshold 0.98
Classifying model random_forest with threshold 0.99

Fitting gradient_boost

Classifying model gradient_boost with threshold 0.5
Classifying model gradient_boost with threshold 0.7
Classifying model gradient_boost with threshold 0.8
Classifying model gradient_boost with threshold 0.9
Classifying model gradient_boost with threshold 0.95
Classifying model gradient_boost with threshold 0.98
Classifying model gradient_boost with threshold 0.99
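
The log above shows an expanding-window scheme: fold k trains on all semesters before the k-th and tests on the k-th, so the training set grows from 26386 to 59157 to 80931 rows. The code in prediction.py and `ppln.set_semester` is not shown, so the sketch below is only one way such a split could be written. The half-year labeling rule and both function names are assumptions.

```python
import numpy as np
import pandas as pd


def semester_labels(dates):
    """Map dates to 0-based half-year indices, counted from the earliest year (assumed rule)."""
    dates = pd.to_datetime(pd.Series(dates))
    second_half = (dates.dt.month > 6).astype(int)
    return ((dates.dt.year - dates.dt.year.min()) * 2 + second_half).to_numpy()


def expanding_window_splits(semester):
    """Yield (k, train_mask, test_mask): train on earlier semesters, test on the k-th."""
    semester = np.asarray(semester)
    ordered = np.sort(np.unique(semester))
    for k in range(1, len(ordered)):
        train_mask = semester < ordered[k]
        test_mask = semester == ordered[k]
        yield k, train_mask, test_mask


# Hypothetical posting dates covering four semesters.
dates = ["2012-02-01", "2012-05-10", "2012-09-15",
         "2013-01-20", "2013-03-03", "2013-11-30"]
semester = semester_labels(dates)

for k, train_mask, test_mask in expanding_window_splits(semester):
    print(f"cross k: {k}, "
          f"train semesters {np.unique(semester[train_mask])}, "
          f"test semesters {np.unique(semester[test_mask])}")
```

Splitting by time in this way keeps every test row strictly later than the rows used for training, which a random split would not guarantee.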


In [8]:
results = pd.DataFrame(dict_results)

In [9]:
results.head(30)

| | model | cross_k | threshold | precision | recall | AUC ROC |
|---|---|---|---|---|---|---|
| 0 | KNN | 1 | 0.5 | 0.46792 | 0.050261 | 0.65134 |
| 1 | KNN | 1 | 0.7 | 0.0 | 0.0 | 0.65134 |
| 2 | KNN | 1 | 0.8 | 0.0 | 0.0 | 0.65134 |
| 3 | KNN | 1 | 0.9 | 0.0 | 0.0 | 0.65134 |
| 4 | KNN | 1 | 0.95 | 0.0 | 0.0 | 0.65134 |
| 5 | KNN | 1 | 0.98 | 0.0 | 0.0 | 0.65134 |
| 6 | KNN | 1 | 0.99 | 0.0 | 0.0 | 0.65134 |
| 7 | decision_tree | 1 | 0.5 | 0.322044 | 0.370604 | 0.578744 |
| 8 | decision_tree | 1 | 0.7 | 0.311539 | 0.238356 | 0.578744 |
| 9 | decision_tree | 1 | 0.8 | 0.298312 | 0.205798 | 0.578744 |
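
In the results above, the AUC ROC value repeats across all thresholds of a model and fold, while precision and recall drop to 0.0 for KNN from threshold 0.7 onward. A likely explanation, assuming the thresholds are applied to predicted scores, is that AUC depends only on how the scores rank the rows, whereas precision and recall are computed after thresholding and fall to zero once no row is labeled positive. The toy example below illustrates this with scikit-learn's metric functions. The scores and labels are made up, and this is not the code in prediction.py.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Made-up labels and scores; no score exceeds 0.7, as happens for KNN above.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
scores = np.array([0.62, 0.55, 0.48, 0.40, 0.35, 0.30, 0.20, 0.10])

# AUC uses only the ranking of the scores, so it is computed once per model.
auc = roc_auc_score(y_true, scores)

metrics = {}
for threshold in [0.5, 0.7]:
    y_pred = (scores > threshold).astype(int)
    metrics[threshold] = {
        # zero_division=0 reports 0.0 when no row is predicted positive.
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "auc": auc,
    }

for threshold, row in metrics.items():
    print(threshold, row)
```

A precision of 0.0 at a high threshold therefore need not mean the model ranks projects badly; it may only mean that no predicted score reached that threshold.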


In [10]:
multi_index = results.sort_values(['model', 'cross_k', 'threshold']).set_index(['model', 'cross_k', 'AUC ROC', 'threshold'])

In [11]:
with pd.option_context("display.max_rows", 200, "display.max_columns", 10):
    print(multi_index)

                                           precision    recall
model          cross_k AUC ROC  threshold                     
KNN            1       0.651340  0.50       0.467920  0.050261
                                 0.70       0.000000  0.000000
                                 0.80       0.000000  0.000000
                                 0.90       0.000000  0.000000
                                 0.95       0.000000  0.000000
                                 0.98       0.000000  0.000000
                                 0.99       0.000000  0.000000
               2       0.663391  0.50       0.614679  0.009741
                                 0.70       0.000000  0.000000
                                 0.80       0.000000  0.000000
                                 0.90       0.000000  0.000000
                                 0.95       0.000000  0.000000
                                 0.98       0.000000  0.000000
                                 0.99       0.000000  0