## Gradient Boosting

The purpose of this notebook is to train a gradient-boosted model to predict whether a DonorsChoose project will be funded.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('bmh')

%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score, roc_auc_score, f1_score
from sklearn.preprocessing import StandardScaler
import pickle

Load the features and text data engineered in the previous notebook.

In [12]:
with open('Data/main_df.pkl', 'rb') as f:
    main_df = pickle.load(f)

In [13]:
with open('Data/word_freqs_titles.pkl', 'rb') as f:
    word_freqs_titles = pickle.load(f)

In [14]:
with open('Data/word_freqs_essays.pkl', 'rb') as f:
    word_freqs_essays = pickle.load(f)

In [15]:
with open('Data/word_freqs_needs.pkl', 'rb') as f:
    word_freqs_needs = pickle.load(f)

Combine the text and non-text data.

In [16]:
from scipy.sparse import hstack
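# Drop the identifier columns and the target, keeping only the engineered features
non_text_features = main_df.drop(['Project ID', 'School ID', 'Teacher ID', 'Funded?'],
                                 axis='columns').values

# Column order: title word frequencies, non-text features,
# essay word frequencies, need-statement word frequencies
use_in_models = hstack((word_freqs_titles, non_text_features,
                        word_freqs_essays, word_freqs_needs))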
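
As a minimal sketch of the stacking step above, using small made-up matrices rather than the real features (the names `titles`, `tabular`, and `essays` are illustrative only), `scipy.sparse.hstack` joins blocks column-wise as long as they share a row count. Passing `format='csr'` is an optional choice in this sketch, not something the notebook does; it yields a matrix that supports row slicing.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins for the real inputs (assumed shapes, 2 rows each)
titles = csr_matrix(np.array([[1, 0, 2], [0, 0, 3]]))   # sparse word counts
tabular = np.array([[0.5, 1.0], [1.5, 2.0]])            # dense numeric features
essays = csr_matrix(np.array([[0, 4], [5, 0]]))         # sparse word counts

# Stack column-wise; the dense block is wrapped so every block is sparse
combined = hstack([titles, csr_matrix(tabular), essays], format='csr')

print(combined.shape)  # (2, 7)
```

The column count of the result is the sum of the column counts of the blocks, so each row keeps its text and non-text features side by side.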

Split the data into training, validation, and test sets for modeling.

In [17]:
X_train_whole, X_test, y_train_whole, y_test = train_test_split(use_in_models,main_df['Funded?'],
                                                  test_size=0.2,random_state=42)

In [18]:
X_train, X_val, y_train, y_val = train_test_split(X_train_whole,y_train_whole,
                                                  test_size=0.2,random_state=42)
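
Two successive 20% hold-outs leave 64% of the rows for training, 16% for validation, and 20% for testing. The sketch below checks this arithmetic on 100 dummy rows; the toy arrays and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 dummy rows, 2 features
y = np.array([0] * 80 + [1] * 20)    # dummy binary labels

# First split: hold out 20% of all rows as the test set
X_tmp, X_te, y_tmp, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
# Second split: hold out 20% of the remaining 80 rows as the validation set
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.2, random_state=42)

split_sizes = (len(y_tr), len(y_va), len(y_te))
print(split_sizes)  # (64, 16, 20)
```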

## Gradient Boost

The final model is shown below. Hyperparameters were tuned to achieve the best results on the validation data; the test data was not used until the final iteration.

In [9]:
import xgboost as xgb

gbm = xgb.XGBClassifier( 
                        n_estimators=30000,
                        max_depth=7,
                        objective='binary:logistic', # logistic loss for binary classification
                        learning_rate=.02, 
                        subsample=.1,
                        min_child_weight=4,
                        colsample_bytree=.8,
                        scale_pos_weight=3.67
                       )

eval_set=[(X_train,y_train),(X_val,y_val)]
fit_model = gbm.fit( 
                    X_train, y_train, 
                    eval_set=eval_set,
                    eval_metric='logloss', # evaluate with log loss (AUC or classification error are alternatives)
                    early_stopping_rounds=50,
                    verbose=False
                   )
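
The `scale_pos_weight` parameter reweights the positive class to offset class imbalance, and a common heuristic is the ratio of negative to positive training examples. Whether the value 3.67 above was derived this way is an assumption; the hypothetical helper `neg_pos_ratio` below only sketches that heuristic on made-up labels.

```python
import numpy as np

def neg_pos_ratio(y):
    """Return (#negative labels) / (#positive labels) for binary 0/1 labels."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive examples")
    return n_neg / n_pos

# Made-up labels: 367 negatives and 100 positives
toy_labels = [0] * 367 + [1] * 100
ratio = neg_pos_ratio(toy_labels)
print(ratio)  # 3.67
```

In practice the ratio would be computed on the training labels only, so that the validation and test sets do not influence the setting.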



In [10]:
y_predgbm=gbm.predict(X_val, ntree_limit=gbm.best_ntree_limit)
y_train_gbm=gbm.predict(X_train, ntree_limit=gbm.best_ntree_limit)



In [11]:
print("Train GB Accuracy: "+str(accuracy_score(y_train, y_train_gbm)))
print("Train GB Recall: "+str(recall_score(y_train, y_train_gbm)))
print("Train GB Precision: "+str(precision_score(y_train, y_train_gbm)))
print("Train GB F1: "+str(f1_score(y_train, y_train_gbm)))

print("Val GB Accuracy: "+str(accuracy_score(y_val, y_predgbm)))
print("Val GB Recall: "+str(recall_score(y_val, y_predgbm)))
print("Val GB Precision: "+str(precision_score(y_val, y_predgbm)))
print("Val GB F1: "+str(f1_score(y_val, y_predgbm)))

Train GB Accuracy: 0.700386469755304
Train GB Recall: 0.7996670294361525
Train GB Precision: 0.41515633565502236
Train GB F1: 0.5465598427306075
Val GB Accuracy: 0.6785737272317693
Val GB Recall: 0.7525912456163139
Val GB Precision: 0.38949462900471893
Val GB F1: 0.5133243559304015
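
F1 is the harmonic mean of precision and recall, so the reported F1 scores can be cross-checked from the other two numbers. The helper name `f1_from_pr` is hypothetical; the inputs are the validation precision and recall printed above.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Validation precision and recall as reported in the output above
val_f1 = f1_from_pr(0.38949462900471893, 0.7525912456163139)
print(round(val_f1, 4))  # 0.5133
```

The wide gap between recall (about 0.75) and precision (about 0.39) is what pulls F1 down to roughly 0.51.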


In [12]:
filename = 'Data/gbm_model_3.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(fit_model, f)

## Test Set

In [23]:
with open('Data/gbm_model_3.sav', 'rb') as f:
    final_model = pickle.load(f)



In [24]:
y_pred_test = final_model.predict(X_test)

In [25]:
print("Test GB Accuracy: "+str(accuracy_score(y_test, y_pred_test)))
print("Test GB Recall: "+str(recall_score(y_test, y_pred_test)))
print("Test GB Precision: "+str(precision_score(y_test, y_pred_test)))
print("Test GB F1: "+str(f1_score(y_test, y_pred_test)))

Test GB Accuracy: 0.6801570918350629
Test GB Recall: 0.7554955808731123
Test GB Precision: 0.3937529527981789
Test GB F1: 0.5176923999971764
