In [24]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [25]:
%reload_ext autoreload

## Part II: Model Building

Here you try your hand at model building to predict appointment no shows.

### Preprocessing

Package 'proj2_lib' now includes code to carry out preprocessing steps from part I. Here's how to use it:

In [26]:
import proj2_lib.util as utils

First, it includes a dictionary used for configuring path and file names
used through the project

In [27]:
utils.file_config

{'feature_pipeline_file': 'feature_pipeline.pkl',
 'labels_pipeline_file': 'labels_pipeline.pkl',
 'objstore_path': 'objects',
 'processed_data_path': 'processed_data',
 'raw_data_csv': 'KaggleV2-May-2016.csv',
 'raw_data_path': 'data',
 'test_csv': 'test_set.csv',
 'train_csv': 'train_set.csv'}

`feature_pipeline_file`: file storing the preprocessing pipeline used for preparing the feature matrix

`labels_pipeline_file`: file storing the preprocessing pipeline used for
preparing labels

`objstore_path`: directory to store python objects to disk

`processed_data_path`: directory containing processed data

`raw_data_csv`: name of the csv download from Kaggle

`raw_data_path`: directory containing raw data

`test_csv`: name of csv file containing test set

`train_csv`: name of csv file containing train set

You can change these paths and names to suit your project directory structure if you need so. E.g.,

In [28]:
file_config = utils.file_config
#config['raw_data_path'] = "some_other_directory"

First step is to create train test sets. Code is in file `proj2_lib/util.py` function `make_train_test_sets`. You
can edit that function as needed to include your own part I code if you so desire. The result will be to 
create files `train_set.csv` and `test_set.csv` in your `processed_data` directory (unless you change any of the entries in the configuration directory as above).

In [31]:
# ONLY NEED TO RUN THIS STEP ONCE (switch this to True to run it)
RUN_MAKE_TRAIN_TEST_FILES = True
if RUN_MAKE_TRAIN_TEST_FILES:
    utils.make_train_test_sets(config=file_config)

Next step is to fit the preprocessing pipelines. This is done in file `proj2_lib/preprocess.py`. Again you can edit code as needed in that file to incorporate your part I solution as you wish. The result will be to create files `feature_pipeline.pkl` and `labels_pipeline.pkl` containing the fit preprocessing pipelines we can then use to preprocess data.

In [33]:
import proj2_lib.preprocess as preprocess

# ONLY NEED TO RUN THIS STEP ONCE
RUN_FIT_PREPROCESSING = True
if RUN_FIT_PREPROCESSING:
    preprocess.fit_save_pipelines(config=file_config)

Finally, once we do that, we can get a training matrix and labels:

In [34]:
train_X, train_y = preprocess.load_train_data(config=file_config)

In [35]:
print(train_X.shape)
print(train_y.shape)

(90514, 105)
(90514,)


### Model Building

Using `sklearn` fit:
    - DecisionTree classifier
    - RandomForest classifier
    - Linear SVM classifier
    - SVM with Radial Basis Kernel classifier
    
Use default parameters for now.
Using 10-fold cross validation report both accuracy and AUC for each of the above four models.

QUESTION: Should you use accuracy or AUC for this task as a performance metric?

_ANSWER HERE_

## Decision Tree Model
### Here I am building a decision tree model

In [87]:
# build your models here


from sklearn.tree import DecisionTreeClassifier
DT_clf = DecisionTreeClassifier()
DT_clf_fit=DT_clf.fit(train_X, train_y)

n_nodes = DT_clf.tree_.node_count
children_left = DT_clf.tree_.children_left
children_right = DT_clf.tree_.children_right
feature = DT_clf.tree_.feature
threshold = DT_clf.tree_.threshold
DT_clf_fit

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

## Random Forest Model
### Here I am building a random forest model

In [45]:
from sklearn.ensemble import RandomForestClassifier
RF_clf = RandomForestClassifier()
RF_clf_fit=RF_clf.fit(train_X,train_y)
RF_clf_fit

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [51]:
print(RF_clf.feature_importances_)
len((RF_clf.feature_importances_))

[  1.31643899e-02   1.54275489e-02   1.41307162e-02   1.41712178e-02
   1.46843735e-02   1.35188237e-04   0.00000000e+00   1.70640277e-01
   4.06676865e-02   8.49516830e-03   1.04989206e-02   6.20693191e-03
   1.02438855e-02   1.28090233e-02   7.64134572e-03   7.24926614e-03
   2.00223006e-02   3.23069374e-01   3.61826596e-02   3.26648364e-03
   3.22063794e-03   6.12383562e-04   1.12244360e-04   2.01911745e-05
   1.06230440e-05   5.89537448e-03   1.22038377e-03   1.26463829e-03
   1.66524650e-03   4.89645338e-03   3.02962307e-03   1.52521339e-03
   5.62873310e-03   6.65332537e-03   7.01928165e-03   1.37378412e-03
   2.90140739e-03   4.09596628e-03   4.64727072e-03   5.08045070e-03
   1.29544365e-03   1.65142051e-03   1.86806229e-03   2.73066273e-03
   1.10446070e-03   2.29967432e-03   2.76422802e-03   3.92518714e-03
   1.01548037e-03   2.94202070e-03   3.66952737e-03   5.24031114e-03
   1.00698454e-03   3.04076671e-03   4.28108019e-03   8.72856296e-05
   6.75824715e-05   4.53158075e-03

105

## Linear SVM Model
### Here I am building a linear SVM model

In [52]:
from sklearn.svm import LinearSVC
SVM_clf = LinearSVC()
SVM_clf_fit=SVM_clf.fit(train_X,train_y)
SVM_clf_fit

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [55]:
print(SVM_clf.coef_)
print(SVM_clf.intercept_)

[[ -1.56375836e-02  -1.84660885e-02  -4.19392575e-02  -5.15276295e-02
   -3.00777533e-02  -8.62549894e-02   0.00000000e+00   1.44237597e-02
   -4.76703468e-01  -2.29575874e-02   1.18749987e-01   1.37007767e-01
    7.10344759e-02  -4.04979143e-02   3.72598514e-02   9.48207261e-02
   -8.64056566e-02  -5.97024591e-02   1.08833684e-03  -1.76417626e-01
   -1.35153303e-01  -1.33883643e-01   7.74319340e-02   1.24119336e-01
   -4.56272161e-01   4.28796084e-02  -7.20388795e-02   3.00010592e-02
    4.72895891e-02   5.04051688e-02   3.70521404e-03  -5.26479596e-02
    1.66961686e-02   4.67088366e-02   1.64575397e-02  -5.41775963e-02
    3.45089100e-02  -2.31272047e-02   5.70372120e-03   3.74835441e-03
   -1.67998525e-01  -7.09494354e-02  -3.37505821e-02  -9.58989276e-02
   -2.07910446e-02  -3.30467896e-04   4.46146953e-03  -4.63316417e-02
   -1.89386828e-03  -4.33300473e-02   1.38578349e-02   6.39902286e-02
    7.78606761e-02   6.22534308e-02  -2.96033562e-02  -2.32816184e-01
    1.01938614e-01  

## RBF SVM Model
### Here I am building a SVM with radial basis kernel  model

In [58]:
from sklearn import svm
RBF_SVM_clf= svm.SVC(kernel="rbf")
RBF_SVM_clf_fit= RBF_SVM_clf.fit(train_X,train_y)
RBF_SVM_clf_fit


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

### Here we look at mean and standard deviation of accuracy and roc-auc score for the 4 models we fit above. We use 10 fold cross validation. From the results we can see that for all the models accuracy doesn't change a lot. Roc-auc score is the one changing a lot and it seems that roc-auc score is more important here. Based on the results of this section Linear SVM and Random Forest have higher roc-auc scores and we choose these two as the better models. 

In [70]:
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import cross_val_score

for clf in (DT_clf,RF_clf,SVM_clf,RBF_SVM_clf):
    accuracy_scores = cross_val_score(clf, train_X, train_y, 
                        scoring="accuracy", cv=10)
    roc_scores = cross_val_score(clf, train_X, train_y, 
                        scoring="roc_auc", cv=10)
    print("clf", clf, "Scores:", "accuracy","Mean: %0.2f (+/- %0.2f)" % (accuracy_scores.mean(), accuracy_scores.std() * 2))
    print("clf", clf, "Scores:", "AUC","Mean: %0.2f (+/- %0.2f)" % (roc_scores.mean(), roc_scores.std() * 2))
    
  



clf DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best') Scores: accuracy Mean: 0.74 (+/- 0.01)
clf DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best') Scores: AUC Mean: 0.59 (+/- 0.01)
clf RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators

### Model Tuning

Based on the above, choose two methods and fit a tuned model:
    - use 5-fold cross validation for model selection
    - use 10-fold cross validation for model assessment (based on appropriate performance metric)

Report estimated performance for both tuned classifiers

### Here we use 5 fold cross validation to select the best model based on a range of parameters. The parameters for best models are shown in the results. 

In [78]:
# tune your models here
import numpy as np
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 30, 60,90,100], 'max_features': [5, 10, 18, 25, 30]},
    {'bootstrap': [False], 'n_estimators': [3, 30, 60,90,100], 'max_features': [5, 10, 18, 25, 30]}
]

RF_gs = GridSearchCV(estimator=RF_clf, param_grid=param_grid, cv= 5, scoring="roc_auc")
RF_gs.fit(train_X, train_y)
print ("Random_Forest_Best_params_AUC: ", RF_gs.best_params_)
print("Best RF AUC score is {}".format(RF_gs.best_score_))



Cs = [0.001, 0.01, 0.1, 1, 10]
param_grid = {'C': Cs}
SVM_gs = GridSearchCV(estimator=SVM_clf, param_grid=param_grid, cv=5, scoring="roc_auc" )

SVM_gs.fit(train_X, train_y)
print ("Linear_SVM_Best_params_AUC: ", SVM_gs.best_params_) 
print("Best Linear SVM AUC score is {}".format(SVM_gs.best_score_))



Random_Forest_Best_params_AUC:  {'max_features': 18, 'n_estimators': 90}
Best RF AUC score is 0.7241072475754243
Linear_SVM_Best_params_AUC:  {'C': 0.01}
Best Linear SVM AUC score is 0.7233435445332445


### Here we are using 10 fold cross validation for model assessment. We use the best estimator in each model and then find the mean and standard deviation.

In [79]:
for clf in (RF_gs.best_estimator_ ,SVM_gs.best_estimator_  ):
    
    roc_scores = cross_val_score(clf, train_X, train_y, 
                        scoring="roc_auc", cv=10)
    
    print("clf", clf, "Scores:", "AUC","best_gs_Mean: %0.2f (+/- %0.2f)" % (roc_scores.mean(), roc_scores.std() * 2))
    
    

        

clf RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=18, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=90, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False) Scores: AUC best_gs_Mean: 0.73 (+/- 0.01)
clf LinearSVC(C=0.01, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0) Scores: AUC best_gs_Mean: 0.72 (+/- 0.01)


### Linear SVM with Gradient Descent

In [107]:
import numpy as np

# initialize model parameters w and b
# intializing to 0 is not a good idea
# it should be a random vector see np.random.randn
# YOU NEED TO IMPLEMENT THIS
def _initialize_parameters(nfeatures):
    w = np.full((nfeatures), np.random.randn())
    b = np.full((1), np.random.randn())
    return w, b

# this is a vectorized version of positive_part operation
# we can use this for hinge loss as post_part(1.0 - y*f)
pos_part = np.vectorize(lambda u: u if u > 0. else 0.)

# compute the value of the linear SVM objective function
# given current signed distances, and parameter vector w
def _get_objective(f, y, w, lam):
    loss = np.sum(pos_part(1.0 - y*f))
    penalty = lam * np.dot(w.transpose(),w)
    return loss + penalty

# compute the signed distances
# based on current model estimates
# w and b
# YOU NEED TO IMPLEMENT THIS
def _get_signed_distances(X, w, b):
    nobs = X.shape[0]
    f = np.full(nobs, 0.0)
    w_T=w.copy()
    w_len=X.shape[1]
    w_T.resize(w_len,1) 
    #f=np.multiply(X,w_T)+b
    f=np.mat(X)*np.mat(w_T)+b
    #print("f:",f)
    return f
    

# compute gradients with respect to w and b
# YOU NEEED TO IMPLEMENT THIS
subgrad = np.vectorize(lambda u: 0. if u >= 1. else -1.)

def _get_gradients(f, X, y, w, b, lam):
    yf = y * f
    t = subgrad(yf)
    ty = t * y
    
    gw = np.sum(np.dot(X.T, ty.T))+lam*w
    gb = np.sum(ty)
    return gw, gb

# fit an SVM using gradient descent
# X: matrix of feature values
# y: labels (-1 or 1)
# n_iter: numer of iterations
# eta: learning rate
def fit_svm(X, y, lam, n_iter=100, eta=.4):
    nexamples, nfeatures = X.shape
    
    w, b = _initialize_parameters(nfeatures)
    
    for k in range(n_iter):
        f = _get_signed_distances(X, w, b)
        
        # print information and 
        # update the learning rate
        if k % 10 == 0:
            obj = _get_objective(f, y, w, lam)
            eta = eta / 2.0
            print("it: %d, obj %.2f" % (k, obj))
        
        gw, gb = _get_gradients(f, X, y, w, b, lam)
        #print("gw:",gw)
        w = w - eta * gw
        b = b - eta * b
    return w, b

In [109]:
w,b = fit_svm(train_X, train_y, 1.0, n_iter=100)

it: 0, obj 165.89
it: 10, obj 1.91
it: 20, obj 0.23
it: 30, obj 0.08
it: 40, obj 0.05
it: 50, obj 0.04
it: 60, obj 0.03
it: 70, obj 0.03
it: 80, obj 0.03
it: 90, obj 0.03
