## Ensemble modeling
- In [this notebook](hypermerameter_optimized_modeling.ipynb) we estimated the bias and variance of each model (the mean and standard deviation of AUC and F1 under cross-validation)
- Use a **bagging ensemble** for models that had high variance: **Logistic regression, KNN**
- Use a **boosting ensemble** for models that had high bias: **using XGBoost features**
- Use a **voting ensemble** on our top 3 models to see if we can improve on them: **XGBoost, SVC, Random Forest**
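
As a rough illustration of the first bullet, the mean and spread of AUC and F1 across folds can be collected in one pass with `cross_validate`. This is only a sketch: the data is synthetic (`make_classification`), and the model and its settings are placeholders rather than the ones tuned in the linked notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the project data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=123)

kfold = KFold(n_splits=10, shuffle=True, random_state=123)
cv_scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo, y_demo,
    cv=kfold,
    scoring=["roc_auc", "f1"],
)

# A low mean score hints at high bias; a large standard deviation
# across folds hints at high variance.
summary = {
    metric: (cv_scores[f"test_{metric}"].mean(), cv_scores[f"test_{metric}"].std())
    for metric in ("roc_auc", "f1")
}
for metric, (mean, std) in summary.items():
    print(f"{metric}: mean={mean:.3f}, std={std:.3f}")
```

Models with a large fold-to-fold spread are the natural candidates for bagging, and models with a low mean are the candidates for boosting.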

In [1]:
# Bagging ensemble for classification - necessary dependencies
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.metrics import make_scorer, roc_auc_score

# boosting ensemble (there are others)
from sklearn.ensemble import GradientBoostingClassifier

# voting ensemble
from sklearn.ensemble import VotingClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

In [2]:
# read in data 
data = pd.read_csv('all_model_data.csv', index_col = 0)

## Bagging - Logistic Regression 

In [9]:
# Segregate the features from the labels
X = data[['ProdRelPageRatio_Scaled_Bin','totalFracAdmin_Scaled','Administrative_Duration_Scaled',
          'BounceRates_Norm_Scaled', 'ExitRates_Scaled','SpecialDay_1.0']]
Y = data.Revenue

In [10]:
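# use 10-fold cross-validation to validate the results
kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle=True)
# initialize the base model with our best params
# (the original passed class_weight=dict, i.e. the built-in type rather than a
# weight mapping; None, meaning no class reweighting, is assumed here)
cart = LogisticRegression(solver='lbfgs', C=5, class_weight=None, dual=False, random_state=123, max_iter=90,
                          verbose=0, warm_start=True)

# bag 100 copies of the base model, each fit on a bootstrap sample
num_trees = 100
# initialize the bagging ensemble
# (scikit-learn >= 1.2 names this argument `estimator` instead of `base_estimator`)
model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=123)
# use cross-validation to validate the results (default scoring: accuracy)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())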

0.8852392538523925


- 10-fold cross-validation accuracy **improved** from 0.848 to 0.8852

## Bagging - KNN

In [11]:
# Segregate the features from the labels
X = data[['PageValues_Norm_Scaled','ExitRates_Scaled','totalFracProd_Scaled']]
Y = data.Revenue

In [12]:
# use 10-fold cross-validation to validate the results
kfold = model_selection.KFold(n_splits=10, random_state=123, shuffle=True)
# initialize the base model with our best params
cart = KNeighborsClassifier(n_jobs=-1, n_neighbors=3)

# bag 100 copies of the base model, each fit on a bootstrap sample
num_trees = 100
# initialize the bagging ensemble
# (scikit-learn >= 1.2 names this argument `estimator` instead of `base_estimator`)
model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=123)
# use cross-validation to validate the results (default scoring: accuracy)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.8820762368207623


- 10-fold cross-validation accuracy **improved** from 0.871 to 0.882
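
A before/after comparison like the ones above is easiest to trust when the base model and its bagged version are scored on exactly the same folds. The following is a minimal sketch on synthetic data; the estimator count and the dataset are illustrative assumptions, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the project data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=123)

# A fixed random_state makes both calls below use identical splits.
kfold = KFold(n_splits=10, shuffle=True, random_state=123)

base = KNeighborsClassifier(n_neighbors=3)
# The base estimator is passed positionally so the call works whether the
# keyword is `base_estimator` (older scikit-learn) or `estimator` (newer).
bagged = BaggingClassifier(base, n_estimators=25, random_state=123)

base_scores = cross_val_score(base, X_demo, y_demo, cv=kfold)
bagged_scores = cross_val_score(bagged, X_demo, y_demo, cv=kfold)

print(f"base:   mean={base_scores.mean():.3f}, std={base_scores.std():.3f}")
print(f"bagged: mean={bagged_scores.mean():.3f}, std={bagged_scores.std():.3f}")
```

Reporting the standard deviation next to the mean also shows whether bagging reduced the variance it was meant to address.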

## Stochastic Gradient Boosting Classification

- As we have mixed features, we cannot feed the model all of our features
- Instead we will use the features that our best model (XGBoost) uses
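
This notebook does not show how the XGBoost feature list was obtained. One common way to derive such a list is to rank features by a fitted model's `feature_importances_` and keep the top few, sketched below. `GradientBoostingClassifier` stands in for XGBoost here, and the data, the feature names, and the cut-off `top_k` are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the project data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=123
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

# Fit a boosted model purely to rank the features.
ranker = GradientBoostingClassifier(n_estimators=50, random_state=123)
ranker.fit(X_demo, y_demo)

top_k = 4  # illustrative cut-off
order = np.argsort(ranker.feature_importances_)[::-1]  # most important first
selected = [feature_names[i] for i in order[:top_k]]
print(selected)
```

Ranking features on the full dataset and then cross-validating on it can make scores look slightly optimistic, so the ranking is better done inside each training fold when time allows.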

In [7]:
# Segregate the features from the labels
X = data[['PageValues_Norm_Scaled', 'AdminBounceRatio_Norm_Scaled', 'ProdRelExitRatio_Norm_Scaled',
          'Month_bin_4', 'Month_bin_2', 'VisitorType_bin_2', 'Informational_Duration_Scaled', 'totalFracProd_Bin']]
Y = data.Revenue

In [13]:
# AUC scorer, left disabled here: cross_val_score below uses the default accuracy
#scorer = make_scorer(roc_auc_score)
num_trees = 100
seed = 123

# use 10-fold cross-validation to validate the results
kfold = model_selection.KFold(n_splits=10, random_state=seed, shuffle=True)

# initialize the model
# (note: with the default subsample=1.0 this is plain gradient boosting;
# subsample < 1.0 gives the stochastic variant)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
# validate our results
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.894322789943228


- Good results, but **our best 10-fold score is still the random forest: 0.917**

## Voting-based Ensemble (XGBoost, SVC, Random Forest)
- Remember to use the best params for each model
- The code below only works when all models use the same features
- If your models perform best with different features, a more manual/weighted voting approach is needed
    - Here is a great article: https://sebastianraschka.com/Articles/2014_ensemble_classifier.html
- We did not have time for that, so we took the less rigorous approach of using the features that worked best for our top model (XGBoost)
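
One possible way to let each model vote from its own feature subset, without hand-writing the voting logic, is to wrap every model in a pipeline that first selects its columns. This is a sketch under assumptions: the data is synthetic, the column indices, the models, and the `weights` are arbitrary, and `on_columns` is a hypothetical helper rather than part of scikit-learn.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the project data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=123)


def on_columns(columns, model):
    """Hypothetical helper: wrap `model` so it only sees the given columns."""
    selector = ColumnTransformer([("keep", "passthrough", columns)])
    return make_pipeline(selector, model)


estimators = [
    ("logreg", on_columns([0, 1, 2, 3], LogisticRegression(max_iter=1000))),
    ("forest", on_columns([2, 3, 4, 5, 6, 7],
                          RandomForestClassifier(n_estimators=50, random_state=123))),
]
# Soft voting averages predicted probabilities; weights favour the forest 2:1.
ensemble = VotingClassifier(estimators, voting="soft", weights=[1, 2])

kfold = KFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(ensemble, X_demo, y_demo, cv=kfold)
print(scores.mean())
```

With a pandas DataFrame as input, the column lists can hold column names instead of integer positions.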

In [5]:
# Segregate the features from the labels
X = data[['PageValues_Norm_Scaled', 'AdminBounceRatio_Norm_Scaled', 'ProdRelExitRatio_Norm_Scaled',
          'Month_bin_4', 'Month_bin_2', 'VisitorType_bin_2', 'Informational_Duration_Scaled', 'totalFracProd_Bin']]
Y = data.Revenue

In [6]:
# random_state
seed = 123
# use 10-fold cross-validation
kfold = model_selection.KFold(n_splits=10, random_state=seed, shuffle=True)
# `scorer` was used below but never defined in this notebook; it is assumed to be
# the AUC scorer from the line commented out in the boosting cell
scorer = make_scorer(roc_auc_score)

# create the sub models
estimators = []  # this will be used in the voting classifier

# create our 3 models
# (the original also passed loss='deviance' and max_leaf_nodes=1, which are
# scikit-learn GradientBoostingClassifier arguments, not XGBoost parameters)
model1 = XGBClassifier(random_state=seed, learning_rate=0.3, max_depth=11,
                       n_estimators=110, subsample=1.0)
estimators.append(('xgboost', model1))  # append it to the estimators

model2 = SVC(kernel='rbf', gamma=1.0672387970376063, class_weight='balanced', C=0.8914369396699439,
             probability=True, random_state=seed)
estimators.append(('svc', model2))

model3 = RandomForestClassifier(bootstrap=False, class_weight='balanced', criterion='entropy', max_depth=20,
                                max_features=0.4, max_leaf_nodes=5, min_samples_leaf=20, min_samples_split=14,
                                n_estimators=100, random_state=seed)
estimators.append(('rf', model3))

# create the voting ensemble model (default voting='hard': majority vote on predicted labels)
ensemble = VotingClassifier(estimators)
# validate with cross-validation
results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold, scoring=scorer)
print(results.mean())

0.8744525547445257


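- The voting ensemble did not improve our results
- Note that this cell passes `scoring=scorer` instead of using the default accuracy, so (assuming `scorer` is the AUC scorer, `make_scorer(roc_auc_score)`) this score is not directly comparable with the accuracy figures above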