## DS 862 Machine Learning for Business Analysts Fall 2020

### Ensemble Methods

#### Submitted by:
* Di Wang

In the Python demonstration, I showed you how to conduct ensemble modeling with regression. For this assignment, you will apply ensemble methods to classification instead. Almost all the functions that we have covered have a classification version that you can directly apply (with a few changes occasionally, of course).

The dataset you will be using is a bank churn modeling dataset found on [Kaggle](https://www.kaggle.com/shrutimechlearn/churn-modelling). The goal is to use the given information to predict whether a bank customer will churn or not.

In [1]:
import pandas as pd
import numpy as np

import warnings # Suppress warnings because they are annoying
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

data = pd.read_csv('bank_churn.csv')

In [2]:
data.head()

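| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |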


First, do some preprocessing. Remove the following columns: RowNumber, CustomerId, Surname. Convert categorical data into dummy variables (dropping the first level of each). Split the data into 80%/20% train/test sets.

In [3]:
# Remove the following columns
data.drop(['RowNumber', 'CustomerId', 'Surname'], axis = 1, inplace = True)

In [4]:
# Convert categorical data into dummy variables 
data = pd.get_dummies(data, columns = ['Geography', 'Gender'], drop_first = True)
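
To illustrate what `drop_first = True` does in the cell above, here is a minimal sketch on a toy data frame. The rows are hypothetical and are not taken from `bank_churn.csv`; only the column names mirror the assignment data.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from bank_churn.csv)
toy = pd.DataFrame({
    'Age': [42, 41, 39],
    'Geography': ['France', 'Spain', 'Germany'],
})

# drop_first=True drops the first category in sorted order ('France'),
# so France becomes the baseline that is encoded as all zeros
encoded = pd.get_dummies(toy, columns=['Geography'], drop_first=True)
print(list(encoded.columns))
print(encoded)
```

Dropping one level avoids perfectly collinear dummy columns, which matters most for linear models such as logistic regression.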

In [5]:
# Split data into features (X) and target (y)
X = data.drop('Exited', axis = 1)
y = data['Exited']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

In [6]:
# Scale the data
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

You will now practice each of the ensemble methods we discussed. Your job is to build an ensemble classifier using each method, and provide me your evaluation (accuracy) on the test set. Remember to conduct the appropriate preprocessing (if needed). You may tune your models if you want.

### Voting (Soft) Classifier
Use 5 classifiers, two of which must be the same type of model but with different hyperparameters. You can choose your own classifiers. Build a soft voting classifier.

In [7]:
from sklearn.linear_model import LogisticRegression #Logistic 
from sklearn.naive_bayes import GaussianNB #Naive Bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.ensemble import RandomForestClassifier #Random Forest

from sklearn.ensemble import VotingClassifier # This function is to perform voting classifier

# Define the individual models
LR = LogisticRegression()
GB = GaussianNB()
DT = DecisionTreeClassifier(random_state=123) 
RF1 = RandomForestClassifier(n_estimators=50, random_state=123)
RF2 = RandomForestClassifier(max_features=8, random_state=123)

# Fit the individual classifier
LR.fit(X_train_s, y_train)
GB.fit(X_train_s, y_train)
DT.fit(X_train_s, y_train)
RF1.fit(X_train_s, y_train)
RF2.fit(X_train_s, y_train)

# Fit the voting classifier
vr = VotingClassifier(estimators = [('LR', LR), ('GB', GB), ('DT', DT), 
                                    ('RF1', RF1), ('RF2', RF2)], n_jobs = 2, voting='soft')
vr.fit(X_train_s, y_train)

VotingClassifier(estimators=[('LR', LogisticRegression()), ('GB', GaussianNB()),
                             ('DT', DecisionTreeClassifier(random_state=123)),
                             ('RF1',
                              RandomForestClassifier(n_estimators=50,
                                                     random_state=123)),
                             ('RF2',
                              RandomForestClassifier(max_features=8,
                                                     random_state=123))],
                 n_jobs=2, voting='soft')

In [8]:
# Evaluation (accuracy) on the test set -- voting soft
from sklearn.metrics import accuracy_score

print('Voting soft:', accuracy_score(y_test, vr.predict(X_test_s)))

Voting soft: 0.8575


### Voting (Hard) Classifier
Now do the same, but with a hard voting classifier. Compare the result with the soft classifier.

In [9]:
# Fit the voting classifier
vr2 = VotingClassifier(estimators = [('LR', LR), ('GB', GB), ('DT', DT), 
                                    ('RF1', RF1), ('RF2', RF2)], n_jobs = 2, voting='hard')
vr2.fit(X_train_s, y_train)

VotingClassifier(estimators=[('LR', LogisticRegression()), ('GB', GaussianNB()),
                             ('DT', DecisionTreeClassifier(random_state=123)),
                             ('RF1',
                              RandomForestClassifier(n_estimators=50,
                                                     random_state=123)),
                             ('RF2',
                              RandomForestClassifier(max_features=8,
                                                     random_state=123))],
                 n_jobs=2)

In [10]:
# Evaluation (accuracy) on the test set -- voting hard
print('Voting hard:', accuracy_score(y_test, vr2.predict(X_test_s)))

Voting hard: 0.86


**Your Observation:** 
- For soft voting, every individual classifier outputs class probabilities. These are averaged with equal weight (20% for each model), and the class with the highest average probability is predicted.
- For hard voting, every individual classifier votes for a class. In this case, the final result is the label predicted by 3 or more of the 5 models.
- The accuracy is 85.75% for soft voting and 86% for hard voting, so the two are quite similar.
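
The difference between the two voting rules can be sketched with a few lines of NumPy. The probabilities below are hypothetical values chosen so that the two rules disagree; they are not outputs of the models fitted above.

```python
import numpy as np

# Hypothetical P(class = 1) from five classifiers for a single customer
proba_class1 = np.array([0.45, 0.45, 0.45, 0.95, 0.90])

# Hard voting: each model casts one vote (1 if its probability is >= 0.5),
# and the majority label wins
votes = (proba_class1 >= 0.5).astype(int)
hard_pred = int(votes.sum() > len(votes) / 2)

# Soft voting: average the probabilities with equal weights, then threshold
mean_proba = proba_class1.mean()
soft_pred = int(mean_proba >= 0.5)

print('Hard vote:', hard_pred)
print('Soft vote:', soft_pred, '(mean probability:', round(float(mean_proba), 2), ')')
```

Here three lukewarm "no churn" votes outnumber two confident "churn" votes under hard voting, while soft voting lets the confident models tip the average above 0.5.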

As a comparison, let's also evaluate the individual models and see if there's really an improvement with the voting classifier.

In [11]:
# Evaluation (accuracy) on the test set -- individual models
print('Logistic Regression:', accuracy_score(y_test, LR.predict(X_test_s)))
print('Gaussian Naive Bayes:', accuracy_score(y_test, GB.predict(X_test_s)))
print('Decision Tree:', accuracy_score(y_test, DT.predict(X_test_s)))
print('RandomForest 1:', accuracy_score(y_test, RF1.predict(X_test_s)))
print('RandomForest 2:', accuracy_score(y_test, RF2.predict(X_test_s)))

Logistic Regression: 0.813
Gaussian Naive Bayes: 0.8255
Decision Tree: 0.8005
RandomForest 1: 0.865
RandomForest 2: 0.863


**Your Observation:** 
- Both voting models perform better than logistic regression, Gaussian naive Bayes, and the decision tree, but the two random forest models have a higher accuracy than either voting model. I think this makes sense because a random forest is itself an ensemble: it fits a number of decision trees on different bootstrap sub-samples of the data and averages their predictions to improve accuracy.
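
The bootstrap-and-average idea behind a random forest can be sketched by hand. This is a simplified illustration on synthetic data from `make_classification` (an assumption, standing in for the churn features), not how `RandomForestClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the churn features (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=123)

rng = np.random.default_rng(123)
n_trees = 25
tree_probas = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Each tree also considers a random subset of features at every split
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    tree_probas.append(tree.predict_proba(X_demo)[:, 1])

# Average the per-tree probabilities, then threshold at 0.5
avg_proba = np.mean(tree_probas, axis=0)
forest_like_pred = (avg_proba >= 0.5).astype(int)
print('Training accuracy of the averaged trees:', (forest_like_pred == y_demo).mean())
```

Because each tree sees a different resample, their individual errors partly cancel when averaged, which is why the forest tends to beat a single decision tree.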

### Bagged Logistic Regression
Now fit a bagged model, using logistic regression as base. You can choose what logistic regression to use.

In [12]:
from sklearn.ensemble import BaggingClassifier

# Instantiate the model
bag_lr = BaggingClassifier(LogisticRegression(), n_estimators = 100)

# Fit the model
bag_lr.fit(X_train_s, y_train)

# Prediction
print('Prediction:', bag_lr.predict(X_test_s)) #2000 observations
print('Accuracy:', bag_lr.score(X_test_s, y_test))

Prediction: [0 0 0 ... 0 0 0]
Accuracy: 0.8125


**Your Observation:** 
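- The accuracy is 81.25%, which is essentially the same as the single logistic regression model (81.3%) from task 1. This is plausible because bagging mainly reduces variance, and logistic regression is already a fairly stable model.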

### XGBoost
Now do the same, but with XGBoost.

In [13]:
from xgboost.sklearn import XGBClassifier

# Get a validation set, 20% from training, don't change the testing set
X1_train, X1_valid, y1_train, y1_valid = train_test_split(X_train, y_train, test_size = 0.2, 
                                                          random_state = 123)
# Scale the data
X1_train_s = scaler.fit_transform(X1_train)
X1_valid_s = scaler.transform(X1_valid)
X_test_s = scaler.transform(X_test)

# tune parameter
param = {'max_depth':range(3,8), 'min_child_weight':range(1,6),  
        'n_estimators':[30,50], 'learning_rate':[0.1,0.5,1]}

# Instantiate the model
xgb_cla = GridSearchCV(XGBClassifier(random_state = 123), param, n_jobs = -2)

# Fit the model
xgb_cla.fit(X1_train_s, y1_train, eval_set = [(X1_valid_s, y1_valid)], 
            early_stopping_rounds = 3) 
print('Best Parameters:', xgb_cla.best_params_)

# Prediction
print('Accuracy:', accuracy_score(y_test, xgb_cla.predict(X_test_s)))

[0]	validation_0-error:0.15687
Will train until validation_0-error hasn't improved in 3 rounds.
[1]	validation_0-error:0.14625
[2]	validation_0-error:0.14938
[3]	validation_0-error:0.14750
[4]	validation_0-error:0.14375
[5]	validation_0-error:0.13937
[6]	validation_0-error:0.14000
[7]	validation_0-error:0.13937
[8]	validation_0-error:0.13625
[9]	validation_0-error:0.13625
[10]	validation_0-error:0.13562
[11]	validation_0-error:0.13562
[12]	validation_0-error:0.13500
[13]	validation_0-error:0.13562
[14]	validation_0-error:0.13437
[15]	validation_0-error:0.13312
[16]	validation_0-error:0.13312
[17]	validation_0-error:0.13250
[18]	validation_0-error:0.13375
[19]	validation_0-error:0.13750
[20]	validation_0-error:0.13312
Stopping. Best iteration:
[17]	validation_0-error:0.13250

Best Parameters: {'learning_rate': 0.5, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 30}
Accuracy: 0.861


**Your Observation:** 
- After tuning, the best parameters are a learning rate of 0.5, a max depth of 3, a min child weight of 3, and 30 estimators. The accuracy on the test set is 86.1%.

### Light GBM
Repeat with Light GBM

In [14]:
from lightgbm import LGBMClassifier

# tune parameter
param = {'max_depth':range(3,8), 'min_child_weight':range(1,6), 
        'n_estimators':[30,50], 'learning_rate':[0.1,0.5,1]}

# Instantiate the model
lgbm_cla = GridSearchCV(LGBMClassifier(random_state = 123), param, n_jobs = -2)

# Fit the model
lgbm_cla.fit(X1_train_s, y1_train, eval_set = [(X1_valid_s, y1_valid)], 
            early_stopping_rounds = 3) 
print('Best Parameters:', lgbm_cla.best_params_)

# Prediction
print('Accuracy:', accuracy_score(y_test, lgbm_cla.predict(X_test_s)))


[1]	valid_0's binary_logloss: 0.471383
Training until validation scores don't improve for 3 rounds
[2]	valid_0's binary_logloss: 0.449895
[3]	valid_0's binary_logloss: 0.433233
[4]	valid_0's binary_logloss: 0.420436
[5]	valid_0's binary_logloss: 0.40954
[6]	valid_0's binary_logloss: 0.400593
[7]	valid_0's binary_logloss: 0.393311
[8]	valid_0's binary_logloss: 0.38663
[9]	valid_0's binary_logloss: 0.381325
[10]	valid_0's binary_logloss: 0.376507
[11]	valid_0's binary_logloss: 0.37183
[12]	valid_0's binary_logloss: 0.367364
[13]	valid_0's binary_logloss: 0.364379
[14]	valid_0's binary_logloss: 0.360544
[15]	valid_0's binary_logloss: 0.357906
[16]	valid_0's binary_logloss: 0.356517
[17]	valid_0's binary_logloss: 0.354766
[18]	valid_0's binary_logloss: 0.352231
[19]	valid_0's binary_logloss: 0.351109
[20]	valid_0's binary_logloss: 0.35023
[21]	valid_0's binary_logloss: 0.349013
[22]	valid_0's binary_logloss: 0.347853
[23]	valid_0's binary_logloss: 0.346957
[24]	valid_0's binary_logloss: 0.

**Your Observation:** 
- After tuning, the best parameters are a learning rate of 0.1, a max depth of 6, a min child weight of 3, and 50 estimators. The accuracy on the test set is 86.35%.
- Light GBM performs slightly better than XGBoost (86.35% vs. 86.1%).

### Stacking
Lastly, do this with Stacking. You may use the same models you used from the voting classifiers. Choose your own blender function.

In [15]:
# Redefining the base learners -- same models from task 1 voting
LR = LogisticRegression()
GB = GaussianNB()
DT = DecisionTreeClassifier(random_state=123) 
RF1 = RandomForestClassifier(n_estimators=50, random_state=123)
RF2 = RandomForestClassifier(max_features=8, random_state=123)

In [16]:
# First we will define the base learners. We will put them in a dictionary
models = {'LR': LR, 'GB': GB, 'DT': DT, 'RF1': RF1, 'RF2': RF2}

# Also define the blender
from sklearn.svm import SVC 
blender = SVC()

In [17]:
# Split the training data into 2 parts: one to train the weak learners, the other to train the blender
X_train_1, X_train_2, y_train1, y_train2 = train_test_split(X_train, y_train, 
                                                              test_size = 0.5, random_state = 123)

# Scale the data
X_train_s1 = scaler.fit_transform(X_train_1)
X_train_s2 = scaler.transform(X_train_2)
X_test_s1 = scaler.transform(X_test)

In [18]:
# Train the weak learners
for name, model in models.items():
    model.fit(X_train_s1, y_train1)

In [19]:
# Train the blender
# Get the prediction
predictions = pd.DataFrame() # Set up a dataframe to store the predictions
for name, model in models.items():
    predictions[name] = model.predict_proba(X_train_s2)[:,1]

# Get the blender
scaler_blend = StandardScaler() # Scale the predictions for SVC
predictions_scale = scaler_blend.fit_transform(predictions)
blender.fit(predictions_scale, y_train2)

SVC()

In [20]:
predictions

| | LR | GB | DT | RF1 | RF2 |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.037972 | 0.021811 | 0.0 | 0.00 | 0.00 |
| 1 | 0.259700 | 0.234523 | 0.0 | 0.26 | 0.20 |
| 2 | 0.307226 | 0.326384 | 0.0 | 0.04 | 0.00 |
| 3 | 0.262999 | 0.253475 | 1.0 | 0.74 | 0.70 |
| 4 | 0.035705 | 0.020135 | 0.0 | 0.00 | 0.00 |
| ... | ... | ... | ... | ... | ... |
| 3995 | 0.544092 | 0.647458 | 1.0 | 0.46 | 0.43 |
| 3996 | 0.125885 | 0.275384 | 1.0 | 0.82 | 0.90 |
| 3997 | 0.087630 | 0.099164 | 0.0 | 0.06 | 0.10 |
| 3998 | 0.124042 | 0.142009 | 1.0 | 0.28 | 0.14 |


In [21]:
# Perform evaluation
# First send the data through the weak learners
predictions = pd.DataFrame() # Set up a dataframe to store the predictions
for name, model in models.items():
    predictions[name] = model.predict_proba(X_test_s1)[:,1]
    
# Prediction through the blender, and evaluate
predictions_scale = scaler_blend.transform(predictions)
accuracy_score(y_test, blender.predict(predictions_scale))

0.8575
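
As an alternative to the manual hold-out blending above, scikit-learn also provides `StackingClassifier`, which trains the blender on out-of-fold predictions from cross-validation instead of a single 50/50 split. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption), with a reduced set of base learners; it is not the procedure used to obtain the 0.8575 result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the churn data (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

# Base learners; scaling is wrapped in a pipeline so it is refit inside each fold
base_learners = [
    ('LR', make_pipeline(StandardScaler(), LogisticRegression())),
    ('GB', GaussianNB()),
    ('RF', RandomForestClassifier(n_estimators=50, random_state=123)),
]

# Out-of-fold probabilities from 5-fold CV become the blender's inputs
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=SVC(),
    stack_method='predict_proba',
    cv=5,
)
stack.fit(X_tr, y_tr)
stack_acc = stack.score(X_te, y_te)
print('Stacking accuracy on synthetic data:', stack_acc)
```

The cross-validated version lets every training row contribute to both the base learners and the blender, at the cost of fitting each base learner several times.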

**Your Observation:** 

| Dataset | Voting soft | Voting hard | Bagged Logistic Regression | XGBoost | Light GBM | Stacking |
| --- | --- | --- | --- | --- | --- | --- |
| Churn Modelling | .8575 | .8600 | .8125 | .8610 | .8635 | .8575 |

- The results of the different ensemble methods are quite similar: the best is 86.35% from Light GBM, and the worst is 81.25% from bagged logistic regression.
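
To judge how much weight these small differences can bear, a rough check is the binomial standard error of an accuracy measured on the 2,000 test observations. The sketch below uses the accuracies reported above; the helper name `accuracy_std_error` is made up for illustration, and the calculation treats the models as if they were evaluated independently, which is only an approximation since they share one test set.

```python
import math

n_test = 2000        # size of the test set (20% of the data)
acc_lgbm = 0.8635    # Light GBM test accuracy reported above
acc_soft = 0.8575    # soft voting test accuracy reported above


def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy estimated on n samples."""
    return math.sqrt(acc * (1 - acc) / n)


se = accuracy_std_error(acc_lgbm, n_test)
half_width = 1.96 * se   # approximate 95% interval half-width
gap = acc_lgbm - acc_soft

print(f'Standard error: {se:.4f}')
print(f'Approx. 95% half-width: {half_width:.4f}')
print(f'Gap between Light GBM and soft voting: {gap:.4f}')
```

The gap between most of the ensembles is smaller than the interval half-width, so on a single split their ranking should be read with caution. Only the bagged logistic regression is clearly behind.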

### Thank you