## Homework 5
## Improving the predictions from Homework 3
## Predicting non-funded projects using cross-validation
Machine Learning for Public Policy

Camilo Arias

- Load, clean, and transform functions are in pipeline.py

- Classifiers are in classifiers.py

- Functions to run cross-validation are in prediction.py


## Improvements

- Feature generation after split
- Includes Bagging
- Calculates precision and recall at the top k% of scores rather than at an absolute threshold. The plots do the same.
- Uses one function called run_model to build any classifier
- Runs every classifier with different parameters
- Leaves 6 months for temporal holdouts.
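
To make the top-k% metric in the list above concrete, here is a minimal sketch. It is not the code in prediction.py; the function name `precision_recall_at_k` and the tie-handling choice are assumptions made for illustration.

```python
import numpy as np


def precision_recall_at_k(y_true, y_scores, k):
    """Precision and recall when the top k fraction of scores is flagged positive.

    Hypothetical helper: a sketch of the idea, not the notebook's implementation.
    """
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    # Number of observations flagged as positive (at least one).
    n_flagged = max(1, int(round(k * len(y_scores))))
    # Indices of the highest scores; a stable sort keeps ties in input order.
    top_idx = np.argsort(-y_scores, kind="stable")[:n_flagged]
    true_positives = y_true[top_idx].sum()
    precision = true_positives / n_flagged
    total_positives = y_true.sum()
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall
```

Because the cutoff is a share of the population and not a score value, models whose scores sit on different scales can be compared at the same intervention size.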

In [1]:
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pipeline as ppln
import classifiers as classif
import prediction
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
import warnings; warnings.simplefilter('ignore')

## Parameters of the model

In [12]:
params = {
    'days': 60,
    'test_days': 180,
    'cross_ks': 3,
    'test_size': 0.3,
    'discretize_bins': 4,
    'work_with_sample': 1,
    'seed': 1234,
    'n_bins': 4,
    'top_ks': [0.01, 0.05, 0.1, 0.2, 0.3],
    'id_columns': ['projectid', 'teacher_acctid', 'schoolid'],
    'cols_to_drop': ['datefullyfunded'],
    'model_params': {
        'KNN': {'k': [5, 10],
                'weights': ['uniform', 'distance'],
                'metric': ['euclidean', 'manhattan', 'minkowski']},
        'decision_tree': {'criterion': ['gini', 'entropy'],
                          'max_depth': [20, 30, 40]},
        'logistic_reg': {'C': [10**-2, 1, 10**2],
                         'penalty': ['l1', 'l2'],
                         'fit_intercept': [True, False]},
        'svm': {'C': [10**-2, 10**-1, 1, 10, 10**2]},
        'random_forest': {'criterion': ['gini', 'entropy'],
                          'max_depth': [10, 15],
                          'n_estimators': [80, 100, 150]},
        'gradient_boost': {'max_depth': [10, 15],
                           'n_estimators': [80, 100, 150],
                           'loss': ['deviance', 'exponential']},
        'bagging': {'base_estimator': [None],
                    'n_estimators': [80, 100, 150]},
    },
}
outcome_var = "not_funded_in_{}_days".format(params['days'])

In [3]:
models = ppln.get_all_combinations(params['model_params'])
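
The body of `ppln.get_all_combinations` is not shown in this notebook. As a rough sketch of what expanding a parameter grid like `params['model_params']` could look like, the hypothetical `expand_grid` below builds every combination with `itertools.product`. The name and the return shape (model name mapped to a list of parameter dicts) are assumptions.

```python
from itertools import product


def expand_grid(model_params):
    """Map each model name to the list of all its parameter combinations.

    Hypothetical stand-in for illustration; not the code in pipeline.py.
    """
    combos = {}
    for model, grid in model_params.items():
        names = list(grid)
        # product varies the last parameter fastest.
        combos[model] = [dict(zip(names, values))
                         for values in product(*(grid[n] for n in names))]
    return combos
```

With the KNN grid above this yields 2 x 2 x 3 = 12 specifications, with the last parameter varying fastest.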

## Loading data and cleaning

In [4]:
projects_df = ppln.load_from_csv('projects_2012_2013.csv')
projects_df = ppln.create_outcome_var(projects_df, params['days'])
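
The outcome column is named `not_funded_in_{days}_days`, so one plausible way to build it is to compare `datefullyfunded` with `date_posted`. The sketch below is an assumption about that logic, not the code in pipeline.py: the function name `add_not_funded_flag`, the strict `>` comparison, and the treatment of missing funding dates are all illustrative choices.

```python
import pandas as pd


def add_not_funded_flag(df, days):
    """Add a 0/1 column marking projects not fully funded within `days` days.

    Hypothetical helper; the boundary rule (> days) is an assumption.
    """
    df = df.copy()
    posted = pd.to_datetime(df['date_posted'])
    funded = pd.to_datetime(df['datefullyfunded'])
    elapsed = (funded - posted).dt.days
    # A missing funding date is treated as "not funded in time".
    flag = (elapsed > days) | funded.isna()
    df['not_funded_in_{}_days'.format(days)] = flag.astype(int)
    return df
```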

### Optionally run the models on a random sample (fraction set by params['work_with_sample'])

In [6]:
if params['work_with_sample']:
    projects_df = projects_df.sample(frac=params['work_with_sample'],
                                     random_state=params['seed'])
projects_df.shape

(124976, 27)

## Setting up bimesters

In [7]:
bimester_serie, bimesters = ppln.group_by_days(projects_df['date_posted'], 61)
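
`ppln.group_by_days` is defined in pipeline.py and is not reproduced here. A minimal sketch of the underlying idea, assigning each posting date to a consecutive block of a fixed number of days, could look like the following. The name `bucket_by_days` is hypothetical, and unlike the real function it returns only the per-row group ids.

```python
import pandas as pd


def bucket_by_days(dates, days):
    """Return an integer group id per date: 0 for the first `days` days, 1 for the next, etc.

    Hypothetical helper for illustration; not the code in pipeline.py.
    """
    dates = pd.to_datetime(dates)
    # Whole days elapsed since the earliest date in the series.
    offset = (dates - dates.min()).dt.days
    return offset // days
```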

In [8]:
test_size = params['test_days']//params['days']
test_size

3
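
A test size of 3 groups means each temporal holdout spans three bimesters, roughly six months. The sketch below reconstructs an expanding-window split that is consistent with the group values printed in the training log (train on groups 0 to 2, skip one group, test on groups 4 to 6, and so on). It is inferred from that output, not taken from prediction.py, and `temporal_splits` is a hypothetical name.

```python
def temporal_splits(groups, test_size, wait_size, num_of_trains):
    """Return (train_groups, test_groups) pairs for an expanding training window.

    Hypothetical reconstruction; the real logic lives in prediction.py.
    """
    groups = sorted(groups)
    splits = []
    # Work backwards: the last test block ends at the last available group.
    for i in range(num_of_trains):
        test_end = len(groups) - i * test_size
        test_start = test_end - test_size
        # Leave `wait_size` groups unused between training and test data.
        train_end = test_start - wait_size
        if train_end <= 0:
            break
        splits.append((groups[:train_end], groups[test_start:test_end]))
    # Return the splits in chronological order.
    return splits[::-1]
```

The gap of `wait_size` groups leaves time for the outcome of the last training rows to be observed before the test period begins.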

### Setting X and Y

In [13]:
y = projects_df[outcome_var]
x = projects_df.drop(outcome_var, axis=1)
x = x.drop(params['id_columns'], axis=1)
x = x.drop(params['cols_to_drop'], axis=1)

In [14]:
x.columns

Index(['school_ncesid', 'school_latitude', 'school_longitude', 'school_city',
       'school_state', 'school_metro', 'school_district', 'school_county',
       'school_charter', 'school_magnet', 'teacher_prefix',
       'primary_focus_subject', 'primary_focus_area',
       'secondary_focus_subject', 'secondary_focus_area', 'resource_type',
       'poverty_level', 'grade_level',
       'total_price_including_optional_support', 'students_reached',
       'eligible_double_your_impact_match', 'date_posted'],
      dtype='object')
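
The columns above are still raw because features are generated after the split. As an illustration of why that matters for discretization, the sketch below learns quantile bin edges from the training rows only and then applies them to the test rows. The helper name `discretize_train_test` and the use of quantile bins are assumptions, not a description of pipeline.py.

```python
import numpy as np
import pandas as pd


def discretize_train_test(train_col, test_col, n_bins):
    """Bin a numeric column using quantile edges learned from the training data only.

    Hypothetical helper for illustration.
    """
    # Quantile edges come from the training data, so no test information leaks in.
    _, edges = pd.qcut(train_col, q=n_bins, retbins=True, duplicates='drop')
    edges = np.array(edges, dtype=float)
    # Open the outer edges so unseen test values still fall in a bin.
    edges[0], edges[-1] = -np.inf, np.inf
    train_binned = pd.cut(train_col, bins=edges, labels=False)
    test_binned = pd.cut(test_col, bins=edges, labels=False)
    return train_binned, test_binned
```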

## Running models

In [15]:
first_models = {k: models[k] for k in ['KNN', 'decision_tree', 'logistic_reg', 'svm']}
second_models = {k: models[k] for k in ['random_forest', 'gradient_boost', 'bagging']}
results_df = pd.DataFrame()
dict_results_1 = prediction.run(x=x, y=y, groups_serie=bimester_serie,
                                  test_size=test_size, wait_size=1,
                                  num_of_trains=params['cross_ks'],
                                  models_dict=first_models,
                                  seed=params['seed'],
                                  top_ks=params['top_ks'],
                                  n_bins=params['discretize_bins'])
results_df_1 = pd.DataFrame(dict_results_1)
results_df_1.to_csv('results_part1.csv')
dict_results_2 = prediction.run(x=x, y=y, groups_serie=bimester_serie,
                                  test_size=test_size, wait_size=1,
                                  num_of_trains=params['cross_ks'],
                                  models_dict=second_models,
                                  seed=params['seed'],
                                  top_ks=params['top_ks'],
                                  n_bins=params['discretize_bins'])
results_df_2 = pd.DataFrame(dict_results_2)
results_df_2.to_csv('results_part2.csv')

results_df = pd.concat([results_df_1, results_df_2])
results_df.to_csv('results_complete.csv')

Begining cross k: 1
Train set has 26617 rows, with group values of [0, 1, 2]
Test set has 33269 rows, with group values of [[4, 5, 6], [7, 8, 9], [10, 11, 12]]

Fitting KNN

Built model KNN with specification {'k': 5, 'weights': 'uniform', 'metric': 'euclidean'}
Built model KNN with specification {'k': 5, 'weights': 'uniform', 'metric': 'manhattan'}
Built model KNN with specification {'k': 5, 'weights': 'uniform', 'metric': 'minkowski'}
Built model KNN with specification {'k': 5, 'weights': 'distance', 'metric': 'euclidean'}
Built model KNN with specification {'k': 5, 'weights': 'distance', 'metric': 'manhattan'}
Built model KNN with specification {'k': 5, 'weights': 'distance', 'metric': 'minkowski'}
Built model KNN with specification {'k': 10, 'weights': 'uniform', 'metric': 'euclidean'}
Built model KNN with specification {'k': 10, 'weights': 'uniform', 'metric': 'manhattan'}
Built model KNN with specification {'k': 10, 'weights': 'uniform', 'metric': 'minkowski'}
Built model KNN wit

Built model decision_tree with specification {'criterion': 'gini', 'max_depth': 30}
Built model decision_tree with specification {'criterion': 'gini', 'max_depth': 40}
Built model decision_tree with specification {'criterion': 'entropy', 'max_depth': 20}
Built model decision_tree with specification {'criterion': 'entropy', 'max_depth': 30}
Built model decision_tree with specification {'criterion': 'entropy', 'max_depth': 40}

Fitting logistic_reg

Built model logistic_reg with specification {'C': 0.01, 'penalty': 'l1', 'fit_intercept': True, 'seed': 1234}
Built model logistic_reg with specification {'C': 0.01, 'penalty': 'l1', 'fit_intercept': False, 'seed': 1234}
Built model logistic_reg with specification {'C': 0.01, 'penalty': 'l2', 'fit_intercept': True, 'seed': 1234}
Built model logistic_reg with specification {'C': 0.01, 'penalty': 'l2', 'fit_intercept': False, 'seed': 1234}
Built model logistic_reg with specification {'C': 1, 'penalty': 'l1', 'fit_intercept': True, 'seed': 1234}

Built model gradient_boost with specification {'max_depth': 15, 'n_estimators': 100, 'loss': 'exponential', 'seed': 1234}
Built model gradient_boost with specification {'max_depth': 15, 'n_estimators': 150, 'loss': 'deviance', 'seed': 1234}
Built model gradient_boost with specification {'max_depth': 15, 'n_estimators': 150, 'loss': 'exponential', 'seed': 1234}

Fitting bagging

Built model bagging with specification {'base_estimator': None, 'n_estimators': 80, 'seed': 1234}
Built model bagging with specification {'base_estimator': None, 'n_estimators': 100, 'seed': 1234}
Built model bagging with specification {'base_estimator': None, 'n_estimators': 150, 'seed': 1234}
Begining cross k: 3
Train set has 80959 rows, with group values of [0, 1, 2, 3, 4, 5, 6, 7, 8]
Test set has 31702 rows, with group values of [[4, 5, 6], [7, 8, 9], [10, 11, 12]]

Fitting random_forest

Built model random_forest with specification {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 80, 'seed': 1234}
Bui

In [19]:
results_df.groupby(['model', 'top_k']).agg({'precision':'mean'})

| model | top_k | precision |
|---|---|---|
| KNN | 0.01 | 0.255133 |
| KNN | 0.05 | 0.425126 |
| KNN | 0.1 | 0.404581 |
| KNN | 0.2 | 0.390206 |
| KNN | 0.3 | 0.375266 |
| bagging | 0.01 | 0.422838 |
| bagging | 0.05 | 0.424903 |
| bagging | 0.1 | 0.415995 |
| bagging | 0.2 | 0.401992 |
| bagging | 0.3 | 0.389976 |


In [21]:
results_df.groupby(['model', 'parameters']).agg({'precision':'mean'}).sort

| model | parameters | precision |
|---|---|---|
| KNN | {'k': 10, 'weights': 'distance', 'metric': 'euclidean'} | 0.334989 |
| KNN | {'k': 10, 'weights': 'distance', 'metric': 'manhattan'} | 0.334089 |
| KNN | {'k': 10, 'weights': 'distance', 'metric': 'minkowski'} | 0.334989 |
| KNN | {'k': 10, 'weights': 'uniform', 'metric': 'euclidean'} | 0.421239 |
| KNN | {'k': 10, 'weights': 'uniform', 'metric': 'manhattan'} | 0.421239 |
| KNN | {'k': 10, 'weights': 'uniform', 'metric': 'minkowski'} | 0.421239 |
| KNN | {'k': 5, 'weights': 'distance', 'metric': 'euclidean'} | 0.303932 |
| KNN | {'k': 5, 'weights': 'distance', 'metric': 'manhattan'} | 0.303987 |
| KNN | {'k': 5, 'weights': 'distance', 'metric': 'minkowski'} | 0.303932 |
| KNN | {'k': 5, 'weights': 'uniform', 'metric': 'euclidean'} | 0.420371 |


In [27]:
results_df.groupby(['model', 'cross_k']).agg({'precision':'mean'}).sort_values('precision', ascending = False)

| model | cross_k | precision |
|---|---|---|
| random_forest | 2 | 0.533716 |
| logistic_reg | 2 | 0.520248 |
| svm | 2 | 0.504736 |
| random_forest | 3 | 0.498621 |
| gradient_boost | 2 | 0.493661 |
| logistic_reg | 3 | 0.48623 |
| svm | 3 | 0.475416 |
| bagging | 2 | 0.464664 |
| gradient_boost | 3 | 0.444144 |
| logistic_reg | 1 | 0.431884 |
