# Ensemble Learning
___

In this section we combine our top three models using ensemble learning:
* Logistic Regression
* Random Forest Classifier
* XGBoost Classifier

_Note:_ We use the pipeline 2 dataset, which gave the best overall results for average weighted F1 score and average AUC score.

In [2]:
#Load dataset
import pandas as pd
df_pipeline2 = pd.read_csv("pipeline_2.csv")

In [3]:
# Define Features and Target variables
X = df_pipeline2.iloc[:, :-1]  # Features: all columns in the dataframe except the last one
Y = df_pipeline2.iloc[:, -1]   # Target: the last column in the dataframe, 'Revenue'

### Voting Ensemble

In [4]:
# Voting Ensemble for Classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.ensemble import VotingClassifier
from sklearn import model_selection
import warnings

# ignore warnings
warnings.filterwarnings("ignore")

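# Unshuffled 10-fold split; random_state has no effect without shuffle=True
# (and raises an error in recent scikit-learn versions), so it is omitted here
kfold = model_selection.KFold(n_splits=10)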
# create the sub models
estimators = []

# note pipeline 2 hyper-parameters for LogisticRegression
model1 = LogisticRegression(C=4714.8663634573895, dual=False, fit_intercept=True, max_iter=40, multi_class='multinomial', penalty='l2', solver='sag')
estimators.append(('logisticregression', model1))

# note pipeline 2 hyper-parameters for Random Forest
# (max_features='sqrt' is what 'auto' meant for classifiers; 'auto' was removed in recent scikit-learn versions)
model2 = RandomForestClassifier(bootstrap=False, criterion='entropy', max_depth=20, max_features='sqrt', min_samples_leaf=1, min_samples_split=2, n_estimators=210, random_state=2019)
estimators.append(('randomforestclassifier', model2))

# note pipeline 2 hyper-parameters for XGBoost
model3 = xgb.XGBClassifier(colsample_bytree=1.0, gamma=6, max_depth=4, min_child_weight=1, subsample=1.0)
estimators.append(('xgboostclassifier', model3))

# create the ensemble model
#ensemble = VotingClassifier(estimators)

# scoring='roc_auc' needs class probabilities (predict_proba), which VotingClassifier
# only provides when voting='soft', so we use soft voting instead of the default hard voting
# source: https://www.oreilly.com/library/view/machine-learning-for/9781783980284/47c32d8b-7b01-4696-8043-3f8472e3a447.xhtml
# In hard voting (also known as majority voting), every individual classifier votes for a class, and the majority wins. 
# In statistical terms, the predicted target label of the ensemble is the mode of the distribution of individually predicted labels.

# In soft voting, every individual classifier provides a probability value that a specific data point belongs to a particular target class. 
# The predictions are weighted by the classifier's importance and summed up. Then the target label with the greatest sum of weighted probabilities wins the vote.

# source: https://stackoverflow.com/questions/51465682/roc-auc-in-votingclassifier-randomforestclassifier-in-scikit-learn-sklearn
ensemble = VotingClassifier(estimators,voting='soft')

result_f1_weighted = model_selection.cross_val_score(ensemble, X, Y, cv=kfold,  scoring='f1_weighted')
result_auc_weighted = model_selection.cross_val_score(ensemble, X, Y, cv=kfold, scoring='roc_auc')

print('--------------------------------------------------------------------------')
print('All the weighted f1 results:')
print('--------------------------------------------------------------------------')
print(result_f1_weighted, '\n')

print('--------------------------------------------------------------------------')
print('Average of all the weighted f1 results:')
print('--------------------------------------------------------------------------')
print(result_f1_weighted.mean(), '\n')

print('--------------------------------------------------------------------------')
print('All the AUC scores:')
print('--------------------------------------------------------------------------')
print(result_auc_weighted, '\n')

print('--------------------------------------------------------------------------')
print('Average of all the AUC scores:')
print('--------------------------------------------------------------------------')
print(result_auc_weighted.mean(), '\n')

# reset the warnings for future code
warnings.resetwarnings()

--------------------------------------------------------------------------
All the weighted f1 results:
--------------------------------------------------------------------------
[0.95414527 0.92910042 0.93849473 0.94086302 0.89193247 0.86825904
 0.86778652 0.85381105 0.87176977 0.85355366] 

--------------------------------------------------------------------------
Average of all the weighted f1 results:
--------------------------------------------------------------------------
0.8969715957082037 

--------------------------------------------------------------------------
All the AUC scores:
--------------------------------------------------------------------------
[0.98717385 0.97446085 0.98449731 0.97407905 0.91533097 0.87751228
 0.88680978 0.87788358 0.8809565  0.86928115] 

--------------------------------------------------------------------------
Average of all the AUC scores:
--------------------------------------------------------------------------
0.922798531796707 



<b> Take-away: </b>
<br> In this case, the Voting Ensemble did <b> NOT </b> yield a major improvement.
* The average weighted F1 score increased only slightly, from 0.8914 (our best single model, Random Forest) to 0.8970
* The average AUC score stayed approximately the same, at ~0.923
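
The difference between hard and soft voting described in the code comments above can be illustrated with a small, library-free sketch. The probabilities below are hypothetical values chosen for illustration, not outputs of the three models in this notebook, and the helper names `hard_vote` and `soft_vote` are our own. With equal weights, soft voting reduces to averaging the class probabilities.

```python
from collections import Counter

# Hypothetical predicted probabilities for ONE sample from three classifiers
# (illustrative numbers only, not taken from the notebook's models).
probas = [
    [0.45, 0.55],  # classifier A: [P(class 0), P(class 1)]
    [0.48, 0.52],  # classifier B
    [0.90, 0.10],  # classifier C
]

def hard_vote(probas):
    """Each classifier votes for its most probable class; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas, weights=None):
    """Take the (optionally weighted) average of class probabilities; the largest wins."""
    weights = weights or [1.0] * len(probas)
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [
        sum(w * p[k] for w, p in zip(weights, probas)) / total
        for k in range(n_classes)
    ]
    return max(range(n_classes), key=avg.__getitem__), avg

hard_label = hard_vote(probas)
soft_label, soft_avg = soft_vote(probas)
print("hard vote:", hard_label)
print("soft vote:", soft_label, soft_avg)
```

Here two classifiers lean weakly towards class 1 while the third is confident in class 0, so the two schemes disagree: the majority vote picks class 1, whereas the averaged probabilities favour class 0. Soft voting therefore lets a confident classifier outweigh several uncertain ones.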