# DS4023 Machine Learning : Ensemble Learning Exercise

In this exercise, you'll explore different ensemble methods and how ensembling improves the performance of a machine learning model. The exercise has three parts:
1. Simple ensemble strategy: majority voting
2. Bagging Method
3. Boosting Method: AdaBoost

The dataset we use for this exercise is a cancer dataset with 699 instances and 9 features. Each instance is labeled as either benign or malignant (0 for benign, 1 for malignant). The dataset contains only numeric values and has been normalized.

Many of the methods used here rely on a random number generator, e.g., the train-test split, the decision tree model, and bootstrap sample generation in bagging. We therefore fix the seed to a constant so that the results are reproducible.
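As a minimal sketch of why a fixed seed matters, the snippet below shuffles row indices into a train/test split using only Python's standard `random` module. The helper name `shuffled_split` and the 20% test fraction are assumptions made for illustration; they are not part of scikit-learn or of this exercise.

```python
import random

def shuffled_split(n_rows, test_fraction, seed):
    """Shuffle row indices with a seeded generator and cut off a test set."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_a, test_a = shuffled_split(699, 0.2, seed=7)
train_b, test_b = shuffled_split(699, 0.2, seed=7)
train_c, test_c = shuffled_split(699, 0.2, seed=8)

print(test_a == test_b)  # same seed: identical split
print(test_a == test_c)  # different seed: a different split is expected
```

The same idea is what passing ``random_state=seed`` to the scikit-learn estimators below is meant to achieve.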

## Load Dataset

In [1]:
import numpy as np
import pandas as pd
from sklearn import model_selection
data = pd.read_csv("cancer_normalized.csv")
data.describe()

| | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 0.379749 | 0.237164 | 0.245271 | 0.200763 | 0.246225 | 0.346352 | 0.270863 | 0.207439 | 0.06549 | 0.344778 |
| std | 0.31286 | 0.339051 | 0.330213 | 0.317264 | 0.246033 | 0.364071 | 0.270929 | 0.339293 | 0.190564 | 0.475636 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.111111 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.1 | 0.111111 | 0.0 | 0.0 | 0.0 |
| 50% | 0.333333 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.1 | 0.222222 | 0.0 | 0.0 | 0.0 |
| 75% | 0.555556 | 0.444444 | 0.444444 | 0.333333 | 0.333333 | 0.5 | 0.444444 | 0.333333 | 0.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [2]:
# Split the data into features (all columns but the last) and labels (the last column).
x = data.iloc[:,:-1]
y = data.iloc[:,-1]

## 1.  Simple Ensemble Strategies

In this section, we will look at a simple ensemble technique for classification: majority voting. In this method, multiple models make a prediction for each data instance, and each model's prediction counts as a **vote**. The label predicted by the majority of the models is used as the final prediction.

Scikit-Learn provides us with some handy functions that we can use to accomplish this.
- The ``VotingClassifier`` takes a list of different estimators and a voting method as arguments. The ``hard`` voting method uses the predicted labels and a majority-rule system, while the ``soft`` voting method predicts the label with the largest sum of predicted probabilities.
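The difference between the two voting rules can be sketched in plain Python. This is an illustration of the idea only, not scikit-learn's implementation; the function names `hard_vote` and `soft_vote` and the probability values are made up for the example.

```python
from collections import Counter

def hard_vote(probabilities):
    """Each model votes for its most probable class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probabilities):
    """Sum the class probabilities over models and pick the largest total."""
    n_classes = len(probabilities[0])
    totals = [sum(p[c] for p in probabilities) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)

# Made-up [P(benign), P(malignant)] outputs of three models for one instance.
probabilities = [[0.55, 0.45], [0.55, 0.45], [0.10, 0.90]]

print(hard_vote(probabilities))  # 0: two of the three models prefer class 0
print(soft_vote(probabilities))  # 1: summed probabilities are 1.20 vs 1.80
```

In this toy case the two rules disagree because one confident model outweighs two marginal ones under soft voting.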

Here, we use three models, *Decision Tree*, *SVM*, and *Logistic Regression*, for voting and adopt 10-fold cross-validation. **Report the mean accuracy of the individual classifiers and of the ensemble using the majority voting strategy (hard voting).** Compare the performance.
- Note: In the ``DecisionTreeClassifier()`` implementation, the features are always randomly permuted at each split. Therefore, the best split found may vary, even with the same training data, unless ``random_state`` is fixed.
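To make the 10-fold cross-validation above concrete, here is a minimal sketch of how 699 rows can be divided into 10 folds and how per-fold accuracies are averaged. It uses plain contiguous folds; as far as the scikit-learn documentation describes, `cross_val_score` with an integer `cv` uses stratified folds for classifiers, so this is a simplification. The helper names `kfold_indices` and `mean_accuracy` are hypothetical.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional row.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size

def mean_accuracy(fold_accuracies):
    """Average the per-fold accuracies, as `.mean()` does on the score array."""
    return sum(fold_accuracies) / len(fold_accuracies)

fold_sizes = [len(test) for _, test in kfold_indices(699, 10)]
print(fold_sizes)  # nine folds of 70 rows and one of 69
```

Each model is trained on the nine remaining folds and scored on the held-out fold, so every row is used for testing exactly once.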

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

seed = 7

# your implementation here
# the three base models
dt = DecisionTreeClassifier(random_state=seed)
lr = LogisticRegression(random_state=seed)
svm = SVC(random_state=seed)

# voting classifier
voting_clf = VotingClassifier(estimators=[
    ('dt', dt),
    ('lr', lr),
    ('svm', svm)
    ], voting = 'hard')

In [4]:
from sklearn.model_selection import cross_val_score

# mean accuracy over the ten folds (cv=10)
dt_accuracy = cross_val_score(dt,x,y,cv=10).mean()
lr_accuracy = cross_val_score(lr,x,y,cv=10).mean()
svm_accuracy = cross_val_score(svm,x,y,cv=10).mean()
vote_accuracy = cross_val_score(voting_clf,x,y,cv=10).mean()
print(dt_accuracy)
print(lr_accuracy)
print(svm_accuracy)
print(vote_accuracy)

0.9485093167701864
0.9628571428571429
0.9685507246376812
0.9642650103519671


## 2. Bagging Method

In this section, we will explore the bagging method using a decision tree as the base learning algorithm. Scikit-Learn provides the ``BaggingClassifier``, to which we pass the base learning model and the number of estimators. Set the number of estimators to 100 and report the mean accuracy of the ensemble using 10-fold cross-validation. Compare the performance with a single decision tree model.
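Bagging trains each base model on a bootstrap sample, i.e., rows drawn with replacement from the training set. The sketch below, which uses only the standard library and a hypothetical helper `bootstrap_sample`, draws one such sample and measures how many rows are left out of the bag. It illustrates the sampling idea and is not how ``BaggingClassifier`` is implemented internally.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows indices with replacement; return them plus the OOB rows."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(7)
n_rows = 699
in_bag, out_of_bag = bootstrap_sample(n_rows, rng)

oob_fraction = len(out_of_bag) / n_rows
expected = (1 - 1 / n_rows) ** n_rows  # close to 1/e, about 0.368
print(round(oob_fraction, 3), round(expected, 3))
```

Roughly a third of the rows are unseen by any given tree, which is what makes an out-of-bag accuracy estimate possible.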

In [5]:
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

num_trees = 100

# your implementation here
bag = BaggingClassifier(dt,
                        n_estimators = num_trees,
                        bootstrap = True,
                        n_jobs = -1,
                        oob_score = True,
                        random_state = seed)
bagging_accuracy = cross_val_score(bag,x,y,cv=10)

In [6]:
print(bagging_accuracy.mean())

0.9557142857142857


We first created a single decision tree classifier, then wrapped it in a bagging ensemble of 100 trees and evaluated the ensemble with 10-fold cross-validation. The mean accuracy improved from 94.85% for the single decision tree to 95.57%.

Scikit-Learn also provides the ``RandomForestClassifier``, which extends bagged decision trees by also randomizing the features considered at each split. Use a random forest model with the number of trees set to 100 and report the mean accuracy using 10-fold cross-validation.

**Compare the performance of RandomForestClassifier with bagged decision tree and give the analysis.**

In [7]:
from sklearn.ensemble import RandomForestClassifier
# your implementation here...
# n_estimators is set explicitly to match the exercise (100 trees).
rf = RandomForestClassifier(n_estimators = 100, random_state = seed)
rf_accuracy = cross_val_score(rf,x,y,cv=10)

In [8]:
print(rf_accuracy.mean())

0.9671428571428571


**Comparison and analysis**:

**Random Forest** performs better than the **bagged decision tree ensemble** (96.71% vs. 95.57% mean accuracy).

1. The randomness in bagging comes only from sample perturbation. A random forest additionally introduces attribute perturbation, which increases the diversity among the individual learners and can therefore further improve the generalization performance of the final model.

2. Random forests are usually more efficient to train than bagging, because each tree in bagging examines all features at every split, while a random forest considers only a subset of the features.

3. Compared with bagging, random forests tend to perform worse when the ensemble is small, but as the number of individual learners increases, they usually converge to a lower error.
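One way to see why more diverse learners help is the standard formula for the variance of an average of identically distributed, correlated predictions: with per-learner variance `sigma2`, pairwise correlation `rho`, and `n_learners` learners, the variance is `rho * sigma2 + (1 - rho) * sigma2 / n_learners`. The sketch below evaluates it for made-up correlation values. It is a simplified model: it assumes every learner has the same variance and does not account for individual random-forest trees being weaker on their own.

```python
def ensemble_variance(rho, sigma2, n_learners):
    """Variance of the mean of n_learners predictions that each have variance
    sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_learners

# Made-up correlations: bagged trees are assumed more correlated than
# random-forest trees because every split sees the same candidate features.
for label, rho in [("more correlated", 0.6), ("less correlated", 0.3)]:
    for n_learners in (1, 10, 100):
        print(label, n_learners, round(ensemble_variance(rho, 1.0, n_learners), 3))
```

As the number of learners grows, the second term vanishes and the variance approaches `rho * sigma2`, so lowering the correlation lowers the floor that the ensemble can reach.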



## 3. AdaBoost Method

In this section, you will use AdaBoost classification by boosting a ``decision stump`` (**one-level decision tree**). Set the number of boosting rounds to 100 and report the performance of the ensemble. Compare the performance with a single decision tree model.
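To illustrate what one boosting round does, here is a minimal sketch of the classic discrete AdaBoost update for binary labels in {-1, +1}. It assumes the sample weights sum to 1 and the weighted error lies strictly between 0 and 1. The function name `adaboost_round` and the toy data are made up; scikit-learn's ``AdaBoostClassifier`` follows a multi-class variant and may differ in detail.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost with labels in {-1, +1}.

    Returns the learner's vote weight (alpha) and the renormalized
    sample weights for the next round.
    """
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Made-up toy data: five instances, the stump misclassifies the last one.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
weights = [0.2] * 5

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 4))
print([round(w, 3) for w in new_weights])
```

After the update, the misclassified instance carries half of the total weight, so the next stump is pushed to get it right.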

In [9]:
from sklearn.ensemble import AdaBoostClassifier
# your implementation here...
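adaboost = AdaBoostClassifier(
    # The stump is passed positionally because this parameter is named
    # ``base_estimator`` in older scikit-learn releases and ``estimator`` in newer ones.
    DecisionTreeClassifier(max_depth=1, random_state = seed),
    n_estimators = 100,
    random_state = seed)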
adaboost_accuracy = cross_val_score(adaboost,x,y,cv=10)



In [10]:
print(adaboost_accuracy.mean())

0.9585507246376814
