<a href="https://colab.research.google.com/github/cherotich/ensemble_learning/blob/main/Copy_of_Data_Science_Ensemble_Learning_with_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="blue">To use this notebook on Google Colaboratory, you will need to make a copy of it. Go to **File** > **Save a Copy in Drive**. You can then use the new copy that will appear in the new tab.</font>

# AfterWork Data Science: Ensemble Learning with Python

## Importing the Necessary Libraries

In [None]:
# We will start by importing the necessary libraries
# ---
# 
import pandas as pd                # Pandas for data manipulation
import numpy as np                 # Numpy for scientific computations

## Examples

### <font color="blue">Classification</font>

In [None]:
# Example 
# ---
# Question: Will John, 60 years old with a salary of 1500, buy a car?
# ---
# Dataset url = http://bit.ly/SocialNetworkAdsDataset
# ---

##### Data Importation and Exploration

In [None]:
# Loading and previewing our dataset
# ---
# 
social_df = pd.read_csv('http://bit.ly/SocialNetworkAdsDataset')
social_df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [None]:
# Determining the size of our dataset
# (records, columns)
# ---
# 
social_df.shape

(400, 5)

##### Data Preparation

In [None]:
# Normally during this stage we would perform quite a number of 
# procedures, but because our focus is only on learning about the 
# different modeling algorithms, we will perform only one 
# essential step on our dataset. We will perform encoding,
# which transforms the categorical values in our 
# dataset into numerical values. 
# Let's see what happens when we encode the Gender variable 
# so that it has only numerical values (Male = 1, Female = 0). 
# ---
#
social_df["Gender"] = np.where(social_df["Gender"] == "Male", 1, 0)
social_df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,1,19,19000,0
1,15810944,1,35,20000,0
2,15668575,0,26,43000,0
3,15603246,0,27,57000,0
4,15804002,1,19,76000,0
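
As a hedged alternative to the encoding step above, an explicit mapping dictionary makes it visible which category becomes which number and lets us check for unexpected values. The sketch below uses a small made-up frame (`toy_df`) rather than the real dataset; the names `toy_df`, `gender_codes` and `GenderEncoded` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the Gender column (hypothetical values, not the real dataset)
toy_df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# An explicit dictionary documents exactly which category becomes which number
gender_codes = {"Male": 1, "Female": 0}
toy_df["GenderEncoded"] = toy_df["Gender"].map(gender_codes)

# Any value missing from the dictionary would become NaN, so we count those
unmapped = int(toy_df["GenderEncoded"].isna().sum())
encoded_values = toy_df["GenderEncoded"].tolist()
print(encoded_values, "unmapped:", unmapped)
```

A non-zero `unmapped` count would signal a category (for example a misspelling) that the dictionary does not cover.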


##### Data Modeling

In [None]:
# Preparing our dataset for training
# ---
# We first divide our data into attributes and labels.
# You can think of this as splitting our dataset into independent and dependent variables,
# where Gender, Age and EstimatedSalary are the independent variables and Purchased is the dependent/label variable.
# ---
# 
X = social_df.iloc[:, [1, 2, 3]].values  # Independent/predictor variables
y = social_df.iloc[:, 4].values          # Dependent/label variable

In [None]:
# Splitting the dataset into a training set and test set
# ---
# We will split our dataset into training data and test data. 
# Training data will be used to train our models and test data will be used to evaluate them.
# Because we'll use sklearn to split our data, we import train_test_split from sklearn.model_selection.
# ---
# 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
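
As a hedged variation on the split above, passing `stratify` to `train_test_split` keeps the class proportions the same in the training and test sets. The sketch below uses placeholder arrays with made-up class counts (`toy_X`, `toy_y`), not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 400 rows, like the dataset, with made-up class counts
toy_X = np.arange(800).reshape(400, 2)
toy_y = np.array([0] * 300 + [1] * 100)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0, stratify=toy_y
)

# 25% of the 400 rows go to the test set; stratify keeps the 3:1 class ratio in both parts
print(toy_X_train.shape, toy_X_test.shape)
print("positives in test set:", int(toy_y_test.sum()))
```

Stratifying is optional here; it mainly helps when one class is rare enough that a purely random split could leave too few examples of it in the test set.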

In [None]:
# Feature Scaling / Standardisation
# ---
# We then standardise our features so that each has zero mean and unit variance.
# Here, scaling is important because Age and EstimatedSalary are on very different scales,
# which matters for distance-based and gradient-based models such as KNN, SVM and logistic regression.
# ---
# 

# We import our scaler from sklearn
from sklearn.preprocessing import StandardScaler

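# We make an instance sc_X of the class StandardScaler.
# You can think of an instance as a concrete object created from the class.
sc_X = StandardScaler()

# We then fit the scaler on X_train only and use it to transform both X_train and X_test,
# so that no information from the test set leaks into training.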
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
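
To see what `StandardScaler` does to the numbers, here is a minimal sketch on made-up Age / EstimatedSalary rows (the arrays `toy_train` and `toy_test` are assumptions for illustration). It checks that the scaled training columns have zero mean and unit standard deviation, and that the test row is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy Age / EstimatedSalary rows (made-up numbers for illustration only)
toy_train = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0], [50.0, 80000.0]])
toy_test = np.array([[35.0, 50000.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from the training rows
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses those training statistics

train_means = toy_train_scaled.mean(axis=0)
train_stds = toy_train_scaled.std(axis=0)
print("means:", train_means, "stds:", train_stds)
print("scaled test row:", toy_test_scaled)
```

The test row equals the training means, so it maps to zeros. Note that standardised values are not confined to the range 0 to 1; the smallest training value here is negative.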

In [None]:
# In this example, we will compare how 
# the different classification models perform.
# ---
#
from sklearn.linear_model import LogisticRegression # Logistic Regression Classifier
from sklearn.tree import DecisionTreeClassifier     # Decision Tree Classifier
from sklearn.svm import SVC                         # SVM Classifier
from sklearn.naive_bayes import GaussianNB          # Naive Bayes Classifier
from sklearn.neighbors import KNeighborsClassifier  # KNN Classifier

# We will also use our ensemble classifiers
from sklearn.ensemble import BaggingClassifier           # Bagging Meta-Estimator Classifier
from sklearn.ensemble import RandomForestClassifier      # RandomForest Classifier 
from sklearn.ensemble import AdaBoostClassifier          # AdaBoost Classifier
from sklearn.ensemble import GradientBoostingClassifier  # Gradient Boosting Classifier
import xgboost as xgb                                    # Importing the XGBoost library

# Below, we make instances of the classes LogisticRegression, 
# DecisionTreeClassifier, SVC, KNeighborsClassifier and GaussianNB.
# As we will get to see, each of the classifiers takes different parameters.
# ---
# 
logistic_classifier = LogisticRegression(random_state = 0, solver='lbfgs')
decision_classifier = DecisionTreeClassifier()
svm_classifier = SVC()
knn_classifier = KNeighborsClassifier(n_neighbors=5)
naive_classifier = GaussianNB() 

# We start implementing ensemble methods by first using Bagging Classifiers
# ---
bagging_meta_classifier = BaggingClassifier()
random_forest_classifier = RandomForestClassifier()

# Boosting Classifiers
# ---
ada_boost_classifier = AdaBoostClassifier()
gbm_classifier = GradientBoostingClassifier() 
xg_boost_classifier = xgb.XGBClassifier() 

# Now using these classifiers to fit our data, X_train and y_train.
# By fitting we mean we train our classifiers based on the train dataset.
# ---
# Upon running this cell, we should have classifiers that can predict 
# whether a person will buy a car or not.
# ---
# Don't worry about the output: we get XGBClassifier because the XGBoost classifier
# is the last one to be fitted.
# ---
#
logistic_classifier.fit(X_train, y_train)
decision_classifier.fit(X_train, y_train)
svm_classifier.fit(X_train, y_train)
knn_classifier.fit(X_train, y_train)
naive_classifier.fit(X_train, y_train)

# Bagging Classifiers
# ---
bagging_meta_classifier.fit(X_train, y_train)
random_forest_classifier.fit(X_train, y_train)

# Boosting Classifiers
# ---
ada_boost_classifier.fit(X_train, y_train)
gbm_classifier.fit(X_train, y_train)
xg_boost_classifier.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [None]:
# We now predict the test set results. 
# This will help us determine whether our classifiers made the correct predictions.
# ---
# No expected output here.
# ---
logistic_y_prediction = logistic_classifier.predict(X_test) 
decision_y_prediction = decision_classifier.predict(X_test) 
svm_y_prediction = svm_classifier.predict(X_test) 
knn_y_prediction = knn_classifier.predict(X_test) 
naive_y_prediction = naive_classifier.predict(X_test) 

# Bagging Classifiers
# ---
bagging_y_classifier = bagging_meta_classifier.predict(X_test) 
random_forest_y_classifier = random_forest_classifier.predict(X_test) 

# Boosting Classifiers
# ---
ada_boost_y_classifier = ada_boost_classifier.predict(X_test)
gbm_y_classifier = gbm_classifier.predict(X_test)
xg_boost_y_classifier = xg_boost_classifier.predict(X_test)

In [None]:
# We then import evaluation metrics to determine the accuracy of classifiers
# ---
# 
from sklearn.metrics import classification_report, accuracy_score 

# The accuracy score is the simplest way to evaluate a classifier.
# However, we note that it can be misleading for a highly imbalanced dataset. 
# By imbalanced we mean that our dataset has very
# unequal numbers of 1's and 0's.
# ---
print("Logistic Regression Classifier", accuracy_score(y_test, logistic_y_prediction))
print("Decision Trees Classifier", accuracy_score(y_test, decision_y_prediction))
print("SVM Classifier", accuracy_score(y_test, svm_y_prediction))
print("KNN Classifier", accuracy_score(y_test, knn_y_prediction))
print("Naive Bayes Classifier", accuracy_score(y_test, naive_y_prediction))

# Bagging Classifiers
# ---
print("Bagging Classifier", accuracy_score(y_test, bagging_y_classifier))
print("Random Forest Classifier", accuracy_score(y_test, random_forest_y_classifier))

# Boosting Classifiers
# ---
print("Ada Boost Classifier", accuracy_score(y_test, ada_boost_y_classifier))
print("GBM Classifier", accuracy_score(y_test, gbm_y_classifier))
print("XGBoost Classifier", accuracy_score(y_test, xg_boost_y_classifier))

Logistic Regression Classifier 0.9
Decision Trees Classifier 0.91
SVM Classifier 0.93
KNN Classifier 0.93
Naive Bayes Classifier 0.91
Bagging Classifier 0.92
Random Forest Classifier 0.92
Ada Boost Classifier 0.92
GBM Classifier 0.91
XGBoost Classifier 0.93
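
To illustrate why accuracy alone can mislead on imbalanced data, the following sketch scores a hypothetical "model" that always predicts the majority class. The labels are made up (95 negatives, 5 positives) and are not taken from the dataset above.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 95 negatives and 5 positives (a heavily imbalanced set)
toy_y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
toy_y_pred = [0] * 100

toy_accuracy = accuracy_score(toy_y_true, toy_y_pred)
toy_recall = recall_score(toy_y_true, toy_y_pred)  # recall for the positive class (label 1)
print("accuracy:", toy_accuracy, "recall for class 1:", toy_recall)
```

The accuracy looks high even though not a single positive case is found, which is why per-class metrics are worth checking as well.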


In [None]:
# We now print the classification report, 
# which is more informative for a highly imbalanced dataset
# because it reports precision, recall and f1-score for each class.
# 
# ---
# The precision is "of the samples predicted as a class, how many actually belong to it".
# The recall is "of the samples that actually belong to a class, how many were found".
# The f1-score is the harmonic mean of precision and recall.
# The support is the number of occurrences of the given class in the test set.
# ---
# 
print('Logistic classifier:')
print(classification_report(y_test, logistic_y_prediction))

print('Decision Tree classifier:')
print(classification_report(y_test, decision_y_prediction))

print('SVM Classifier:')
print(classification_report(y_test, svm_y_prediction))

print('KNN Classifier:')
print(classification_report(y_test, knn_y_prediction))

print('Naive Bayes Classifier:')
print(classification_report(y_test, naive_y_prediction)) 

# Bagging Classifiers
# ---
print('Bagging Meta Classifier:')
print(classification_report(y_test, bagging_y_classifier)) 

print('Random Forest Classifier:')
print(classification_report(y_test, random_forest_y_classifier)) 


# Boosting Classifiers
# ---
print('Ada Boost Classifier:')
print(classification_report(y_test, ada_boost_y_classifier)) 

print('GBM Classifier:')
print(classification_report(y_test, gbm_y_classifier)) 

print('XGBoost Classifier:')
print(classification_report(y_test, xg_boost_y_classifier)) 

# Remember, we can further apply model optimization techniques, e.g. 
# data cleaning, feature engineering, checking model assumptions, etc., 
# to arrive at the best classifier.

Logistic classifier:
              precision    recall  f1-score   support

           0       0.90      0.96      0.93        68
           1       0.89      0.78      0.83        32

    accuracy                           0.90       100
   macro avg       0.90      0.87      0.88       100
weighted avg       0.90      0.90      0.90       100

Decision Tree classifier:
              precision    recall  f1-score   support

           0       0.94      0.93      0.93        68
           1       0.85      0.88      0.86        32

    accuracy                           0.91       100
   macro avg       0.89      0.90      0.90       100
weighted avg       0.91      0.91      0.91       100

SVM Classifier:
              precision    recall  f1-score   support

           0       0.96      0.94      0.95        68
           1       0.88      0.91      0.89        32

    accuracy                           0.93       100
   macro avg       0.92      0.92      0.92       100
weighted av
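
The report columns can be reproduced by hand from confusion-matrix counts. The sketch below uses hypothetical counts for class 1, chosen so that the rounded results agree with the class 1 row of the logistic classifier report; the counts themselves are an assumption and were not read from the notebook output.

```python
# Hypothetical confusion-matrix counts for class 1 (assumed, not read from the notebook)
true_positives = 25   # predicted 1, actually 1
false_positives = 3   # predicted 1, actually 0
false_negatives = 7   # predicted 0, actually 1

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1_score = 2 * precision * recall / (precision + recall)
support = true_positives + false_negatives  # number of true class-1 samples

print(round(precision, 2), round(recall, 2), round(f1_score, 2), support)
```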

In [None]:
# Answering our question
# ---
# We then make a new prediction & compare results.
# Note that we would only use the best optimized classifier for this case.
# ---
# Predict whether John, 60 years old with a salary of 1500, will buy a car or not.
# ---
# Dataset limitation: This is not a practical dataset, so it lacks essential features/variables.
# In a real case scenario, we would work with many kinds of datasets that require transformation,
# e.g. data cleaning, feature engineering, etc.
# ---
# Our classifiers were trained on scaled features, so we scale the new case with the same scaler.
# Features are ordered as [Gender, Age, EstimatedSalary], with Male encoded as 1.
#
new_case = sc_X.transform([[1, 60, 1500]])

print("Logistic Regression Classifier", logistic_classifier.predict(new_case))
print("Decision Tree Classifier",decision_classifier.predict(new_case))
print("SVM Classifier", svm_classifier.predict(new_case))
print("KNN Classifier", knn_classifier.predict(new_case))
print("Naive Bayes Classifier", naive_classifier.predict(new_case))

# Bagging Classifiers
# ---
print("Bagging Meta Classifier", bagging_meta_classifier.predict(new_case))
print("Random Forest Classifier", random_forest_classifier.predict(new_case))

# Boosting Classifiers
# ---
print("Ada Boosting Classifier", ada_boost_classifier.predict(new_case))
print("GBM Classifier", gbm_classifier.predict(new_case))
print("XGBoost Classifier", xg_boost_classifier.predict(new_case)) 

Logistic Regression Classifier [1]
Decision Tree Classifier [1]
SVM Classifier [1]
KNN Classifier [1]
Naive Bayes Classifier [1]
Bagging Meta Classifier [1]
Random Forest Classifier [1]
Ada Boosting Classifier [1]
GBM Classifier [1]
XGBoost Classifier [1]
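
One way to make sure new cases always receive the same preprocessing as the training data is to bundle the scaler and the classifier in a scikit-learn pipeline. This is a minimal sketch on made-up rows in the order [Gender, Age, EstimatedSalary]; the data and the names `toy_X`, `toy_y` and `toy_pipeline` are assumptions, and logistic regression is used only as an example estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training rows: [Gender, Age, EstimatedSalary] (illustrative only)
toy_X = np.array([
    [1, 22, 20000], [0, 25, 30000], [1, 28, 25000], [0, 30, 35000],
    [1, 45, 90000], [0, 50, 110000], [1, 55, 120000], [0, 60, 100000],
])
toy_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# The pipeline fits the scaler and the classifier together
toy_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
toy_pipeline.fit(toy_X, toy_y)

# Raw, unscaled inputs can be passed directly; the pipeline scales them first
toy_new_cases = np.array([[1, 24, 22000], [0, 58, 115000]])
toy_predictions = toy_pipeline.predict(toy_new_cases)
print(toy_predictions)
```

With this design there is no separate scaling step to forget at prediction time.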


### <font color="blue">Regression</font>

##### <font color="blue">Example</font>

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Example
# --- 
# Questions: Create a decision tree regression model using the following dataset.
# ---
# Dataset url = http://bit.ly/FishDatasetClean
# NB: This dataset is a clean version of the one 
# we used in the multiple regression example.
# ---
# OUR CODE GOES BELOW
# 

##### Step 1. Loading our Data 

In [None]:
# Reading our data
# ---
# 
df = pd.read_csv('http://bit.ly/FishDatasetClean')
df.head()

Unnamed: 0,Weight,Length1,Length2,Length3,Height,Width
0,242.0,23.2,25.4,30.0,11.52,4.02
1,290.0,24.0,26.3,31.2,12.48,4.3056
2,340.0,23.9,26.5,31.1,12.3778,4.6961
3,363.0,26.3,29.0,33.5,12.73,4.4555
4,430.0,26.5,29.0,34.0,12.444,5.134


In [None]:
# Describing our dataset
# ---
# 
df.describe()

Unnamed: 0,Weight,Length1,Length2,Length3,Height,Width
count,159.0,159.0,159.0,159.0,159.0,159.0
mean,398.326415,26.24717,28.415723,31.227044,8.970994,4.417486
std,357.978317,9.996441,10.716328,11.610246,4.286208,1.685804
min,0.0,7.5,8.4,8.8,1.7284,1.0476
25%,120.0,19.05,21.0,23.15,5.9448,3.38565
50%,273.0,25.2,27.3,29.4,7.786,4.2485
75%,650.0,32.7,35.5,39.65,12.3659,5.5845
max,1650.0,59.0,63.4,68.0,18.957,8.142


##### Steps 2, 3, 4: Checking, Cleaning and Exploratory Analysis have already been performed on our dataset.

##### Step 5. Implementation and Evaluation

In [None]:
# Let's now split our dataset
# ---
# 
# Firstly, importing our train_test_split function
# ---
#
from sklearn.model_selection import train_test_split

X = df[['Length1', 'Length2', 'Length3', 'Height', 'Width']]
y = df['Weight']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)

In [None]:
# Let's now train our regressors
# ---
#  

from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import BaggingRegressor 
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb


# Creating our regressors. We'll just use the decision tree regressor as our single (non-ensemble) model this time
# ---
# 
regressor = DecisionTreeRegressor()

# Then creating our ensemble regressors
# ---

# Bagging Regressors
# ---
bagging_est_regressor = BaggingRegressor()
random_forest_regressor = RandomForestRegressor()

# Boosting Regressors
# ---
ada_boost_regressor = AdaBoostRegressor()
gbm_regressor = GradientBoostingRegressor()
xgboost_regressor = xgb.XGBRegressor(objective ='reg:squarederror') # We specify the objective function explicitly (squared error)

# Fitting our data to our regressors 
# ---
# Decision Tree Regressor
regressor.fit(X_train, y_train)

# Bagging Regressors
# ---
bagging_est_regressor.fit(X_train, y_train)
random_forest_regressor.fit(X_train, y_train)

# Boosting Regressors
# ---
ada_boost_regressor.fit(X_train, y_train)
gbm_regressor.fit(X_train, y_train)
xgboost_regressor.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:squarederror',
             random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
             seed=None, silent=None, subsample=1, verbosity=1)
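
To make the idea behind the bagging regressors more concrete, here is a minimal sketch of bagging done by hand: fit several trees on bootstrap samples and average their predictions. It runs on synthetic data and is an illustration of the general technique, not a description of how `BaggingRegressor` is implemented internally; all names and numbers are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): a noisy linear relationship y = 3x + noise
toy_X = rng.uniform(0, 10, size=(200, 1))
toy_y = 3.0 * toy_X[:, 0] + rng.normal(0, 1.0, size=200)

n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    sample_idx = rng.integers(0, len(toy_X), size=len(toy_X))
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(toy_X[sample_idx], toy_y[sample_idx])
    trees.append(tree)

# The bagged prediction is the average of the individual tree predictions
toy_queries = np.array([[2.0], [8.0]])
all_predictions = np.array([tree.predict(toy_queries) for tree in trees])
bagged_prediction = all_predictions.mean(axis=0)
print(bagged_prediction)
```

Averaging over trees trained on different resamples tends to smooth out the noise that any single fully grown tree would fit.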

In [None]:
# Making predictions using our models
# ---
#  
y_pred = regressor.predict(X_test)

# Bagging Regressors
# ---
bag_est_y_pred = bagging_est_regressor.predict(X_test)
random_forest_y_pred = random_forest_regressor.predict(X_test)

# Boosting Regressors
# ---
ada_boost_y_pred = ada_boost_regressor.predict(X_test)
gbm_y_pred = gbm_regressor.predict(X_test)
xgboost_y_pred = xgboost_regressor.predict(X_test)

In [None]:
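# Next, we compare actual output values for X_test with the predicted values
# of the decision tree regressor. We use a new name so that we do not overwrite our dataset df.
# ---
#
decision_tree_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
decision_tree_df.head(5)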

Unnamed: 0,Actual,Predicted
7,390.0,430.0
40,0.0,140.0
95,170.0,150.0
45,160.0,160.0
110,556.0,690.0


In [None]:
# We make predictions for the random forest regressor
# ---
# Next, we compare actual output values for X_test with the predicted values
# ---
random_forest_df = pd.DataFrame({'Actual': y_test, 'Predicted': random_forest_y_pred})
random_forest_df.head(5)

Unnamed: 0,Actual,Predicted
7,390.0,425.69
40,0.0,127.0
95,170.0,163.31
45,160.0,156.87
110,556.0,666.9


In [None]:
# We make predictions for the adaboost regressor
# ---
# Next, we compare actual output values for X_test with the predicted values
# ---
ada_boost_df = pd.DataFrame({'Actual': y_test, 'Predicted': ada_boost_y_pred})
ada_boost_df.head(5)

Unnamed: 0,Actual,Predicted
7,390.0,396.366667
40,0.0,142.230769
95,170.0,146.0
45,160.0,150.941176
110,556.0,640.0


In [None]:
# We make predictions for the gbm regressor
# ---
# Next, we compare actual output values for X_test with the predicted values
# ---
gbm_df = pd.DataFrame({'Actual': y_test, 'Predicted': gbm_y_pred})
gbm_df.head(5)

Unnamed: 0,Actual,Predicted
7,390.0,434.394032
40,0.0,130.844108
95,170.0,162.803669
45,160.0,152.770039
110,556.0,695.941056


In [None]:
# We also predict for the XGBoost regressor
# ---
# Next, we compare actual output values for X_test with the predicted values
# ---
xgboost_df = pd.DataFrame({'Actual': y_test, 'Predicted': xgboost_y_pred})
xgboost_df.head(5)

Unnamed: 0,Actual,Predicted
7,390.0,444.598541
40,0.0,130.755463
95,170.0,162.332397
45,160.0,149.509171
110,556.0,701.653015


In [None]:
# Finally, we evaluate the models
# ---  
# NB: The closer the RMSE is to 0, the better the model.
#  
from sklearn.metrics import mean_squared_error

print('Decision Tree - Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))

# Bagging Regressors
# ---
print('Bagging Estimator - Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, bag_est_y_pred)))
print('Random Forest - Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, random_forest_y_pred)))

# Boosting Regressors
# ---
print('Ada Boost - Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, ada_boost_y_pred)))
print('GBM - Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, gbm_y_pred)))
print('XGBoost - Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, xgboost_y_pred)))

Decision Tree - Root Mean Squared Error: 187.11269609961442
Bagging Estimator - Root Mean Squared Error: 157.89943361994685
Random Forest - Root Mean Squared Error: 159.7836546309702
Ada Boost - Root Mean Squared Error: 177.6687256431936
GBM - Root Mean Squared Error: 149.67887602143222
XGBoost - Root Mean Squared Error: 144.584484457552
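
As a worked example of the metric, the sketch below computes the RMSE by hand for just the first five rows of the decision tree comparison table and checks it against the `mean_squared_error` route used above. Because it covers only five rows, the value differs from the full test-set RMSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# First five actual/predicted weights from the decision tree comparison table
actual = np.array([390.0, 0.0, 170.0, 160.0, 556.0])
predicted = np.array([430.0, 140.0, 150.0, 160.0, 690.0])

# RMSE: square the errors, average them, then take the square root
errors = actual - predicted
manual_rmse = np.sqrt(np.mean(errors ** 2))
sklearn_rmse = np.sqrt(mean_squared_error(actual, predicted))
print(manual_rmse, sklearn_rmse)
```

Because the errors are squared before averaging, the two large misses (140 and 134) dominate the result, which is worth keeping in mind when comparing models by RMSE.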


## <font color="green">Challenges</font>

### <font color="green">Challenge 1</font>

In [None]:
# Challenge 1
# ---
# Question: A cancer medical research institution would like to make predictions on two different 
# cancer types, benign and malignant. Build a model to predict the breast cancer type 
# (0 = benign or 1 = malignant) given the following dataset. In addition, make a prediction.
# NB: Remember to record your observations and also implement ensemble techniques with the goal of improving accuracy.
# Make a recommendation on the best model to use for this problem.
# ---
# Dataset url = http://bit.ly/BreastCancersDataset
# ---
# OUR CODE GOES BELOW
#

### <font color="green">Challenge 2</font>

In [None]:
# Challenge 2
# ---
# Predict the price of cars comparing your models using the following dataset. 
# Make a recommendation on the best model to use for this problem.
# ---
# Dataset url = http://bit.ly/CarPriceDataset
# ---
# OUR CODE GOES BELOW
#