<a href="https://colab.research.google.com/github/besherh/Machine-Learning-Course/blob/master/EnsembleLearning/Ensemble_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#A case study
The dataset you are going to be using for this case study is popularly known as the Wisconsin Breast Cancer dataset. The task related to it is Classification.

The dataset contains a total number of 10 features labeled in either benign or malignant classes. The features have 699 instances out of which 16 feature values are missing. The dataset only contains numeric values.

The dataset can be downloaded from our Github page.
https://github.com/besherh/Machine-Learning-Course/blob/master/EnsembleLearning/datasets/breast-cancer.csv

You will implement the Ensembles using the mighty scikit-learn library.

Let's first import all the Python dependencies you will be needing for this case study.



In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler


Let's load the dataset in a DataFrame object.



In [3]:
data = pd.read_csv('breast-cancer.csv')
data.head()


Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


The column "Sample code number" is just an indicator and it's of no use in the modeling. So, let's drop it:



In [4]:
data.drop(['Sample code number'],axis = 1, inplace = True)
data.head()


Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,5,1,1,1,2,1,3,1,1,2
1,5,4,4,5,7,10,3,2,1,2
2,3,1,1,1,2,2,3,1,1,2
3,6,8,8,1,3,4,3,7,1,2
4,4,1,1,3,2,1,3,1,1,2


You can see that the column is dropped now. Let's get some statistics about the data using Panda's describe() and info() functions:



In [None]:
data.describe()


Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bland Chromatin,Normal Nucleoli,Mitoses,Class
count,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0
mean,4.41774,3.134478,3.207439,2.806867,3.216023,3.437768,2.866953,1.589413,2.689557
std,2.815741,3.051459,2.971913,2.855379,2.2143,2.438364,3.053634,1.715078,0.951273
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
25%,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0
50%,4.0,1.0,1.0,1.0,2.0,3.0,1.0,1.0,2.0
75%,6.0,5.0,5.0,4.0,4.0,5.0,4.0,1.0,4.0
max,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


In [None]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Clump Thickness              699 non-null    int64 
 1   Uniformity of Cell Size      699 non-null    int64 
 2   Uniformity of Cell Shape     699 non-null    int64 
 3   Marginal Adhesion            699 non-null    int64 
 4   Single Epithelial Cell Size  699 non-null    int64 
 5   Bare Nuclei                  699 non-null    object
 6   Bland Chromatin              699 non-null    int64 
 7   Normal Nucleoli              699 non-null    int64 
 8   Mitoses                      699 non-null    int64 
 9   Class                        699 non-null    int64 
dtypes: int64(9), object(1)
memory usage: 54.7+ KB


As mentioned earlier, the dataset contains missing values. The column named "Bare Nuclei" contains them. Let's verify.



In [None]:
data['Bare Nuclei'].to_numpy()



array(['1', '10', '2', '4', '1', '10', '10', '1', '1', '1', '1', '1', '3',
       '3', '9', '1', '1', '1', '10', '1', '10', '7', '1', '?', '1', '7',
       '1', '1', '1', '1', '1', '1', '5', '1', '1', '1', '1', '1', '10',
       '7', '?', '3', '10', '1', '1', '1', '9', '1', '1', '8', '3', '4',
       '5', '8', '8', '5', '6', '1', '10', '2', '3', '2', '8', '2', '1',
       '2', '1', '10', '9', '1', '1', '2', '1', '10', '4', '2', '1', '1',
       '3', '1', '1', '1', '1', '2', '9', '4', '8', '10', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '6', '10', '5', '5', '1', '3',
       '1', '3', '10', '10', '1', '9', '2', '9', '10', '8', '3', '5', '2',
       '10', '3', '2', '1', '2', '10', '10', '7', '1', '10', '1', '10',
       '1', '1', '1', '10', '1', '1', '2', '1', '1', '1', '?', '1', '1',
       '5', '5', '1', '?', '8', '2', '1', '10', '1', '10', '5', '3', '1',
       '10', '1', '1', '?', '10', '10', '1', '1', '3', '?', '2', '10',
       '1', '1', '1', '1', '1', '1', '10', '10'

You can spot some "?"s in it, right? Well, these are your missing values, and you will be imputing them with Mean Imputation. But first, you will replace those "?"s with 0's.



In [None]:
data.replace('?',0, inplace=True)


In [None]:
data['Bare Nuclei'].to_numpy()


array(['1', '10', '2', '4', '1', '10', '10', '1', '1', '1', '1', '1', '3',
       '3', '9', '1', '1', '1', '10', '1', '10', '7', '1', 0, '1', '7',
       '1', '1', '1', '1', '1', '1', '5', '1', '1', '1', '1', '1', '10',
       '7', 0, '3', '10', '1', '1', '1', '9', '1', '1', '8', '3', '4',
       '5', '8', '8', '5', '6', '1', '10', '2', '3', '2', '8', '2', '1',
       '2', '1', '10', '9', '1', '1', '2', '1', '10', '4', '2', '1', '1',
       '3', '1', '1', '1', '1', '2', '9', '4', '8', '10', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '6', '10', '5', '5', '1', '3',
       '1', '3', '10', '10', '1', '9', '2', '9', '10', '8', '3', '5', '2',
       '10', '3', '2', '1', '2', '10', '10', '7', '1', '10', '1', '10',
       '1', '1', '1', '10', '1', '1', '2', '1', '1', '1', 0, '1', '1',
       '5', '5', '1', 0, '8', '2', '1', '10', '1', '10', '5', '3', '1',
       '10', '1', '1', 0, '10', '10', '1', '1', '3', 0, '2', '10', '1',
       '1', '1', '1', '1', '1', '10', '10', '10', '1',

The "?"s are replaced with 0's now. Let's do the missing value treatment now.


For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning. A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known part of the data. 

In [None]:
from sklearn.impute import SimpleImputer

# Convert the DataFrame object into NumPy array otherwise you will not be able to impute
values = data.values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Now impute it
imputedData = imputer.fit_transform(values)


Now if you take a look at the dataset itself, you will see that all the ranges of the features of the dataset are not the same. This may cause a problem. A small change in a feature might not affect the other. To address this problem, you will normalize the ranges of the features to a uniform range, in this case, 0 - 1.



In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
normalizedData = scaler.fit_transform(imputedData)


You have performed all the preprocessing that was required in order to perform your Ensembling experiments.

You will start with Bagging based Ensembling. In this case, you will use a Bagged Decision Tree.



In [None]:
# Bagged Decision Trees for Classification - necessary dependencies

from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier


You have imported the dependencies for the Bagged Decision Trees.



In [None]:
# Segregate the features from the labels
X = normalizedData[:,0:9]
Y = normalizedData[:,9]


Remember, in bagging we need to divide the data set into diffrenet subsets. We are going apply this using a new technique called k-Fold Cross-Validation
#k-Fold Cross-Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

The general procedure is as follows:

Shuffle the dataset randomly.


Split the dataset into k groups.

For each unique group:

*   Take the group as a hold out or test data set
*   Take the remaining groups as a training data set
*   Fit a model on the training set and evaluate it on the test set
*   Retain the evaluation score and discard the model    
   
Summarize the skill of the model using the sample of model evaluation scores


To learn more about k-flod refer to this link :
https://machinelearningmastery.com/k-fold-cross-validation/ 
or this:

https://www.youtube.com/watch?v=CRqLeHpACVI


In [None]:
kfold = model_selection.KFold(n_splits=10, random_state=7)
cart = DecisionTreeClassifier()
num_trees = 100
model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=7)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())




0.9585714285714285


Let's see what you did in the above cell.

First, you initialized a 10-fold cross-validation fold. After that, you instantiated a Decision Tree Classifier with 100 trees and wrapped it in a Bagging-based Ensemble. Then you evaluated your model.

You model performed pretty well. It yielded an accuracy of 95.71%.

Brilliant! Let's implement the other ones.



In [None]:
# AdaBoost Classification

from sklearn.ensemble import AdaBoostClassifier
seed = 7
num_trees = 70
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())




0.9557142857142857


In this case, you did an AdaBoost classification (with 70 trees) which is based on Boosting type of Ensembling. The model gave you an accuracy of 95.57% for 10-fold cross-validation.

Finally, it's time for you to implement the Voting-based Ensemble technique.



In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

kfold = model_selection.KFold(n_splits=10, random_state=seed)
# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))
# create the ensemble model
ensemble = VotingClassifier(estimators)
results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())

0.9571221532091098




Take it further:
Try other Boosting-based Ensemble techniques viz. Gradient Boosting, XGBoost, etc.
Play with the different parameter settings that scikit-learn offers in Ensembles and then try to find why a particular setting performed well. This will make your understanding even stronger. link
Try Ensemble learning on a variety of datasets to understand where you should and where you should not apply Ensemble learning. For finding datasets Kaggle, UCI Repository, etc. are good places to search.
