# Part III: Ensembles and Final Result

## AdaBoost

Train an AdaBoost classifier and compare its performance to the results obtained in Part II using 10-fold CV.

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
import noshow_lib.util as utils
import noshow_lib.preprocess as preprocess

In [6]:
file_config = utils.file_config
file_config['raw_data_path'] = 'C:/Users/yazdan/Desktop/cmsc643_noshow-main/cmsc643_noshow-master'
file_config['processed_data_path'] = "Processed"

In [7]:
utils.file_config

{'raw_data_path': 'C:/Users/yazdan/Desktop/cmsc643_noshow-main/cmsc643_noshow-master',
 'raw_data_csv': 'KaggleV2-May-2016.csv',
 'processed_data_path': 'Processed',
 'train_csv': 'train_set.csv',
 'test_csv': 'test_set.csv',
 'objstore_path': 'objects',
 'feature_pipeline_file': 'feature_pipeline.pkl',
 'labels_pipeline_file': 'labels_pipeline.pkl'}

In [8]:
train_X, train_y = preprocess.load_train_data(config=file_config)

In [6]:
# AdaBoost code goes here

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import cross_val_score, GridSearchCV


from sklearn.ensemble import AdaBoostClassifier


AB_model = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2))
AB_model.fit(train_X, train_y)

#Accuracy Eval
AB_model_accuracy = cross_val_score(AB_model, train_X, train_y, scoring="accuracy", cv=10)
AB_model_auc = cross_val_score(AB_model, train_X, train_y, scoring="roc_auc", cv=10)

print(AB_model_accuracy)
print(AB_model_auc)


[0.79807799 0.79818845 0.7977466  0.79730476 0.79597923 0.79675246
 0.79794521 0.79794521 0.79794521 0.7942996 ]
[0.72655342 0.72599615 0.73008018 0.73593013 0.72343734 0.73179628
 0.72824432 0.72474657 0.73303143 0.72261238]


## XGBoost

Train an XGBoost classifier and compare its performance to the Part II results using 10-fold CV. `sklearn` includes a gradient boosting model, http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html, which you can use. The `xgboost` package, https://xgboost.readthedocs.io/en/latest/python/python_intro.html, also has a wrapper you can use with sklearn: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn. The latter is more efficient at training time.
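
A minimal sketch of the sklearn-style wrapper mentioned above, assuming the `xgboost` package is installed; if it is not, the sketch falls back to sklearn's `GradientBoostingClassifier` so it still runs. The synthetic data (`demo_X`, `demo_y`) and the hyperparameter values are illustrative assumptions, not the no-show dataset or tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

try:
    # sklearn-compatible wrapper shipped with the xgboost package
    from xgboost import XGBClassifier as BoostedTrees
except ImportError:
    # fallback so the sketch still runs without xgboost installed
    from sklearn.ensemble import GradientBoostingClassifier as BoostedTrees

# Synthetic stand-in for train_X / train_y (assumption, not the real data)
demo_X, demo_y = make_classification(n_samples=500, n_features=10, random_state=0)

# Both classes accept these three hyperparameters
booster = BoostedTrees(n_estimators=50, max_depth=3, learning_rate=0.1)

# The wrapper plugs into cross_val_score like any other sklearn classifier
demo_auc = cross_val_score(booster, demo_X, demo_y, scoring="roc_auc", cv=10)
print(demo_auc.mean(), demo_auc.std())
```

Because the wrapper follows the sklearn estimator interface, it could be swapped into the evaluation cell below without changing the `cross_val_score` calls.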

In [7]:
# gradient boosting code here (uses sklearn's GradientBoostingClassifier rather than the xgboost package)

from sklearn.ensemble import GradientBoostingClassifier

XB_model = GradientBoostingClassifier()
XB_model.fit(train_X, train_y)

#Accuracy Eval
XB_model_accuracy = cross_val_score(XB_model, train_X, train_y, scoring="accuracy", cv=10)
XB_model_auc = cross_val_score(XB_model, train_X, train_y, scoring="roc_auc", cv=10)

print(XB_model_accuracy)
print(XB_model_auc)

[0.79796752 0.79863029 0.7977466  0.79851983 0.79840937 0.79730476
 0.79871852 0.79783473 0.79805568 0.79761379]
[0.72981495 0.73023025 0.73317226 0.73895891 0.7267114  0.73508897
 0.73467984 0.72839343 0.74008282 0.72581275]
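
To make the comparison between the two boosted models easier to read, the per-fold scores printed above can be condensed into a mean and standard deviation per metric. The sketch below copies the reported fold scores verbatim and uses only the standard library; the dictionary layout and the `summarize` helper are illustrative choices.

```python
from statistics import mean, stdev

# Per-fold scores copied from the outputs above
reported = {
    "AdaBoost": {
        "accuracy": [0.79807799, 0.79818845, 0.7977466, 0.79730476, 0.79597923,
                     0.79675246, 0.79794521, 0.79794521, 0.79794521, 0.7942996],
        "roc_auc": [0.72655342, 0.72599615, 0.73008018, 0.73593013, 0.72343734,
                    0.73179628, 0.72824432, 0.72474657, 0.73303143, 0.72261238],
    },
    "GradientBoosting": {
        "accuracy": [0.79796752, 0.79863029, 0.7977466, 0.79851983, 0.79840937,
                     0.79730476, 0.79871852, 0.79783473, 0.79805568, 0.79761379],
        "roc_auc": [0.72981495, 0.73023025, 0.73317226, 0.73895891, 0.7267114,
                    0.73508897, 0.73467984, 0.72839343, 0.74008282, 0.72581275],
    },
}


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of fold scores."""
    return mean(scores), stdev(scores)


for model_name, metrics in reported.items():
    for metric_name, scores in metrics.items():
        avg, spread = summarize(scores)
        print(f"{model_name:>16} {metric_name:>8}: {avg:.4f} +/- {spread:.4f}")
```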


## Stacking

Choose a set of 5 or so classifiers. Write a function that trains an ensemble using stacking.

In [33]:

def build_stack_ensemble(X, y):
    
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.linear_model import LogisticRegression
    # create train/validation sets
    # using StratifiedShuffleSplit
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1234)
    
    # n_splits=1, so take the single train/validation split directly
    tr_index, ts_index = next(splitter.split(X, y))
    X_tr, y_tr = X[tr_index], y[tr_index]
    X_ts, y_ts = X[ts_index], y[ts_index]
    
    
    
    # train classifiers in ensemble using train set
    
    DT_model = DecisionTreeClassifier()
    RF_model = RandomForestClassifier()
    LSVM_model = LinearSVC()
    AB_model = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2))
    XB_model = GradientBoostingClassifier()
    
    classifiers_list=[DT_model, RF_model, LSVM_model, AB_model, XB_model]
    classifier_names=['DT_model', 'RF_model', 'LSVM_model', 'AB_model', 'XB_model']
    
    # Define Prediction Set
    predictions = []
    
    # fit each classifier on the train split and predict on the validation split
    for clf in classifiers_list:
        clf.fit(X_tr, y_tr)
        predictions.append(clf.predict(X_ts))
    
    
    # create new feature matrix for validation
    # set by getting predictions from the ensemble
    # classifiers: one row per sample, one column per classifier
    # (passing the raw list gives shape (n_classifiers, n_samples), which makes
    # fit raise "Found input variables with inconsistent numbers of samples")
    stacked_X = np.column_stack(predictions)
    
    
    # train logistic regression classifier on
    # new feature matrix
    LR_model = LogisticRegression()
    LR_model.fit(stacked_X, y_ts)
    
    
    # cross-validate the blender on the stacked features, not on the raw features
    EN_accuracy = cross_val_score(LR_model, stacked_X, y_ts, scoring="accuracy", cv=10)
    EN_model_auc = cross_val_score(LR_model, stacked_X, y_ts, scoring="roc_auc", cv=10)
    
   
    # return all trained classifiers 
    return classifiers_list, LR_model, X_tr, y_tr, X_ts, y_ts
    

Use 10-fold cross-validation to measure performance of your stacked classifier. See Part II solution to see how to roll your own sklearn classifier along with http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator

In [32]:
#Running Stacking

trained_classifiers, LR_model,train_X, train_y, test_X, test_y  = build_stack_ensemble(train_X,train_y)

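One possible way to get a 10-fold CV estimate for the whole stacked ensemble is to wrap the procedure above in a small sklearn-compatible estimator, so that `cross_val_score` refits the base classifiers and the blender inside every fold. The sketch below is an assumption-laden illustration, not the Part II solution: the class name `StackedClassifier`, its parameters, the reduced set of base models, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier


class StackedClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: base models feed a logistic-regression blender."""

    def __init__(self, base_estimators, val_size=0.2, random_state=1234):
        self.base_estimators = base_estimators
        self.val_size = val_size
        self.random_state = random_state

    def _stack(self, X):
        # One column of class predictions per fitted base model
        return np.column_stack([m.predict(X) for m in self.base_estimators_])

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        splitter = StratifiedShuffleSplit(
            n_splits=1, test_size=self.val_size, random_state=self.random_state
        )
        tr_idx, val_idx = next(splitter.split(X, y))
        # Base models see only the train split; the blender sees only the validation split
        self.base_estimators_ = [
            clone(m).fit(X[tr_idx], y[tr_idx]) for m in self.base_estimators
        ]
        self.blender_ = LogisticRegression().fit(self._stack(X[val_idx]), y[val_idx])
        return self

    def predict(self, X):
        return self.blender_.predict(self._stack(np.asarray(X)))

    def predict_proba(self, X):
        return self.blender_.predict_proba(self._stack(np.asarray(X)))


# Synthetic stand-in for the training data (assumption, not the real dataset)
demo_X, demo_y = make_classification(n_samples=400, n_features=10, random_state=0)

stack = StackedClassifier(
    base_estimators=[
        DecisionTreeClassifier(max_depth=3, random_state=0),
        RandomForestClassifier(n_estimators=20, random_state=0),
    ]
)
stack_accuracy = cross_val_score(stack, demo_X, demo_y, scoring="accuracy", cv=10)
stack_auc = cross_val_score(stack, demo_X, demo_y, scoring="roc_auc", cv=10)
print(stack_accuracy.mean(), stack_auc.mean())
```

Keeping the split, the base-model fitting, and the blender fitting inside `fit` means no fold's evaluation data leaks into any part of the ensemble.
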
## Final Result

Choose a single model based on all previous project steps. Train this model on the complete training dataset and measure its performance on the held-out test set.

Compare to the 10-fold CV estimate you got previously.
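
The comparison described here can be sketched as follows: estimate performance with 10-fold CV on the training data only, then fit once on all of the training data and score the untouched hold-out set. The synthetic data, the `demo_`/`holdout_` names, and the choice of an RBF SVM are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the project data (assumption, not the real dataset)
demo_X, demo_y = make_classification(n_samples=600, n_features=10, random_state=0)
fit_X, holdout_X, fit_y, holdout_y = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=0
)

final_model = SVC(kernel="rbf")

# 10-fold CV estimate, computed on the training portion only
cv_accuracy = cross_val_score(final_model, fit_X, fit_y, scoring="accuracy", cv=10).mean()

# Fit once on the complete training portion, then score the hold-out set
final_model.fit(fit_X, fit_y)
holdout_accuracy = accuracy_score(holdout_y, final_model.predict(holdout_X))
holdout_auc = roc_auc_score(holdout_y, final_model.decision_function(holdout_X))

print(f"CV accuracy:       {cv_accuracy:.4f}")
print(f"Hold-out accuracy: {holdout_accuracy:.4f}")
print(f"Hold-out AUC:      {holdout_auc:.4f}")
```

A large gap between the CV estimate and the hold-out score would suggest that the CV estimate was optimistic, for example because of model selection on the same folds.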

In [None]:
# final result goes here


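#RBF SVM
# NOTE: the stacking cell above rebinds train_X, train_y, test_X and test_y to its
# internal 80/20 split; reload the complete training set and the held-out test set first.

RBSVM_model = SVC(kernel="rbf")
RBSVM_model.fit(train_X, train_y)

#Accuracy Eval: score the fitted model on the test set instead of re-running CV on it
RBSVM_model_accuracy = RBSVM_model.score(test_X, test_y)
RBSVM_model_auc = roc_auc_score(test_y, RBSVM_model.decision_function(test_X))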

print(RBSVM_model_accuracy)
print(RBSVM_model_auc)
