# Ensemble Methods

This lab demonstrates two types of ensemble methods: bagging and boosting. Recall that some algorithms (decision trees in this activity) can suffer from high variance: their results can vary drastically depending on the training data provided. The general idea behind ensembles is therefore to build multiple models on multiple subsets of the data and combine them to yield a more stable and accurate solution.

### Exercise 1: What is Bootstrap Sampling?

From a statistical point of view, bootstrap sampling is a method that involves sampling data repeatedly with replacement from a data source to estimate a population parameter. To break this down further: <br />
**Sampling** - The process of selecting a subset of items from a vast collection of items in order to estimate a certain characteristic <br />
**Sampling with replacement** - A data point that is drawn can reappear in future drawn samples. The data point is not removed from the data source <br />
**Parameter Estimation** - A parameter is some measurable characteristic associated with a given population. This could be something like mean, standard deviation, or from a machine learning perspective, some sort of learned coefficient. <br />

In statistics this is done because measuring an entire population is often infeasible. Instead, many small random samples are drawn, the statistic of interest is computed on each one, and the results are averaged.
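
To make the difference between sampling with and without replacement concrete, here is a minimal sketch using `resample` from `sklearn.utils` on a toy array. The array `data` and the variable names are illustrative assumptions, not part of the lab.

```python
import numpy as np
from sklearn.utils import resample

data = np.arange(10)  # a tiny "population" of ten distinct values

# Without replacement: every drawn value is unique.
without = resample(data, n_samples=10, replace=False, random_state=0)

# With replacement (a bootstrap sample): values may repeat,
# and some values may not be drawn at all.
with_repl = resample(data, n_samples=10, replace=True, random_state=0)

print("without replacement:", np.sort(without))
print("with replacement:   ", np.sort(with_repl))
print("distinct values in bootstrap sample:", len(np.unique(with_repl)))
```

A bootstrap sample of the same size as the data usually contains fewer distinct values than the data itself, because some points are drawn more than once.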

**Do This**: In the cell below, we compute the average of 100 random numbers. Instead of computing this average directly, use the `resample` function in `sklearn.utils` to sample 10% of the data 20, 100, and 1000 times. Each time you take a 10% sample, store the average of those numbers, then report the average of all the averages you stored for each experiment (repeating 20 times, 100 times, and 1000 times).

In [14]:
from random import seed
from random import random
import numpy as np
import pandas as pd
from sklearn.utils import resample
import matplotlib.pyplot as plt

size = 100
randNums = np.zeros(size)

# generate some random numbers
seed(1)
for i in range(0, size):
    randNums[i] = random()

actual_avg = np.average(randNums)
print('Actual average: {:.4f}'.format(actual_avg))

# Bootstrap: draw a 10% sample with replacement, record its mean,
# then average the recorded means over n_iterations draws
for n_iterations in (20, 100, 1000):
    sample_means = []
    for i in range(n_iterations):
        # replace=True is what makes this a bootstrap sample
        sample = resample(randNums, n_samples=10, replace=True)
        sample_means.append(np.average(sample))
    estimated_avg = np.average(sample_means)
    print('Estimated Average with {} iterations: {:.4f}'.format(n_iterations, estimated_avg))

Actual average: 0.5124
Estimated Average with 20 iterations: 0.5281
Estimated Average with 100 iterations: 0.5124
Estimated Average with 1000 iterations: 0.5130


As the number of iterations increases, the estimated average tends to converge to the actual average. In other words, the variance of the estimated average decreases as the number of iterations increases. This is the fundamental idea behind bootstrapping and the driving principle behind the exercises that follow.
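
The claim about decreasing variance can be checked by repeating the whole experiment many times and measuring how much the estimate moves around. The following is a rough sketch under stated assumptions: it uses NumPy's `Generator.choice` instead of `resample` for speed, and `n_repeats` and `bootstrap_estimate` are hypothetical names introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.random(100)  # stand-in for the 100 random numbers in the exercise

def bootstrap_estimate(values, n_iterations, rng):
    """Average the means of n_iterations bootstrap samples of 10 values each."""
    # Each row is one bootstrap sample drawn with replacement.
    samples = rng.choice(values, size=(n_iterations, 10), replace=True)
    return samples.mean(axis=1).mean()

n_repeats = 200  # how many times the whole experiment is repeated
spreads = {}
for n_iterations in (20, 100, 1000):
    estimates = [bootstrap_estimate(data, n_iterations, rng) for _ in range(n_repeats)]
    # The standard deviation across repeats measures how stable the estimate is.
    spreads[n_iterations] = float(np.std(estimates))
    print("{:>4} iterations: std of estimate = {:.4f}".format(n_iterations, spreads[n_iterations]))
```

The spread should shrink roughly in proportion to one over the square root of the number of iterations.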

### Exercise 2: Bootstrap Aggregating (Bagging)

Bagging is an ensemble method that builds on bootstrapping: it trains multiple models, each on a different bootstrap sample of the training data, and combines their predictions (by majority vote for classification or by averaging for regression). The benefit is a reduction in variance, which typically yields more accurate predictions than any individual model. <br />
For the next few examples we will use the Sonar dataset from the UCI Machine Learning repository. This dataset describes sonar chirp returns bouncing off different surfaces. The 60 input variables measure the energy of the returns within different frequency bands. It is a binary classification problem that requires a model to differentiate rocks from metal cylinders (mines). There are 208 observations, labeled "M" for mine and "R" for rock.
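
Before using scikit-learn's built-in bagging, it may help to see the idea written out by hand. The sketch below is an illustration under assumptions, not the lab's method: it uses synthetic data from `make_classification` rather than the Sonar data, and `n_trees`, `trees`, and `votes` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic stand-in data with labels 0 and 1 (an assumption; not the Sonar dataset).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_trees = 25  # odd, so a majority vote over two classes cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap: sample the training set with replacement, same size as the original.
    X_boot, y_boot = resample(X_train, y_train, replace=True, random_state=i)
    trees.append(DecisionTreeClassifier(random_state=i).fit(X_boot, y_boot))

# Aggregate: every tree votes and the majority class wins.
votes = np.array([tree.predict(X_test) for tree in trees])  # shape (n_trees, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
single_acc = float(np.mean(single_tree.predict(X_test) == y_test))
bagged_acc = float(np.mean(bagged_pred == y_test))
print("single tree accuracy: {:.4f}".format(single_acc))
print("bagged trees accuracy: {:.4f}".format(bagged_acc))
```

`BaggingClassifier` packages the same two steps (bootstrap, then aggregate) behind a single estimator.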

In [2]:
from sklearn import preprocessing
def LoadData(sfile):
    # Load samples
    df = pd.read_csv(
        filepath_or_buffer=sfile, 
        header=None)
    return df

def ParseData(df):
    # Convert mine and rock label to integers
    le = preprocessing.LabelEncoder()
    le.fit(df[60])
    df[60] = le.transform(df[60])
    labels = df.iloc[:,60].values
    df = df.drop([60], axis=1)
    samples = df.iloc[:,:].values
    return samples, labels

df = LoadData('data/sonar.all-data.csv')
[samples, labels] = ParseData(df)
print('Sonar Data')
df

Sonar Data


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0200 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.0180 | 0.0084 | 0.0090 | 0.0032 | 1 |
| 1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.0140 | 0.0049 | 0.0052 | 0.0044 | 1 |
| 2 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.2280 | 0.2431 | 0.3771 | 0.5598 | 0.6194 | ... | 0.0232 | 0.0166 | 0.0095 | 0.0180 | 0.0244 | 0.0316 | 0.0164 | 0.0095 | 0.0078 | 1 |
| 3 | 0.0100 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 | 0.1264 | ... | 0.0121 | 0.0036 | 0.0150 | 0.0085 | 0.0073 | 0.0050 | 0.0044 | 0.0040 | 0.0117 | 1 |
| 4 | 0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.0590 | 0.0649 | 0.1209 | 0.2467 | 0.3564 | 0.4459 | ... | 0.0031 | 0.0054 | 0.0105 | 0.0110 | 0.0015 | 0.0072 | 0.0048 | 0.0107 | 0.0094 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 203 | 0.0187 | 0.0346 | 0.0168 | 0.0177 | 0.0393 | 0.1630 | 0.2028 | 0.1694 | 0.2328 | 0.2684 | ... | 0.0116 | 0.0098 | 0.0199 | 0.0033 | 0.0101 | 0.0065 | 0.0115 | 0.0193 | 0.0157 | 0 |
| 204 | 0.0323 | 0.0101 | 0.0298 | 0.0564 | 0.0760 | 0.0958 | 0.0990 | 0.1018 | 0.1030 | 0.2154 | ... | 0.0061 | 0.0093 | 0.0135 | 0.0063 | 0.0063 | 0.0034 | 0.0032 | 0.0062 | 0.0067 | 0 |
| 205 | 0.0522 | 0.0437 | 0.0180 | 0.0292 | 0.0351 | 0.1171 | 0.1257 | 0.1178 | 0.1258 | 0.2529 | ... | 0.0160 | 0.0029 | 0.0051 | 0.0062 | 0.0089 | 0.0140 | 0.0138 | 0.0077 | 0.0031 | 0 |
| 206 | 0.0303 | 0.0353 | 0.0490 | 0.0608 | 0.0167 | 0.1354 | 0.1465 | 0.1123 | 0.1945 | 0.2354 | ... | 0.0086 | 0.0046 | 0.0126 | 0.0036 | 0.0035 | 0.0034 | 0.0079 | 0.0036 | 0.0048 | 0 |


In [3]:
# Function that utilizes cross validation to test accuracy of model
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from numpy import mean
from numpy import std
def evaluate_model(model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, samples, labels, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: {:.4f}'.format(scores.mean()))
    print('Cross validation accuracies')
    print(scores)
    return scores

**Do This**: In the two empty cells below, compare the `DecisionTreeClassifier` and the `BaggingClassifier` models using the `evaluate_model` function above. Set `random_state` to 0 for both models and `n_estimators` to 100 for the `BaggingClassifier` model.

##### Classic Decision Tree - no ensemble method

In [16]:
from sklearn.tree import DecisionTreeClassifier

# evaluate_model, samples, and labels are defined in the cells above
clf = DecisionTreeClassifier(random_state=0)
score = evaluate_model(clf)

Accuracy: 0.7240
Cross validation accuracies
[0.80952381 0.66666667 0.76190476 0.61904762 0.76190476 0.76190476
 0.71428571 0.71428571 0.7        0.7        0.61904762 0.71428571
 0.76190476 0.66666667 0.71428571 0.85714286 0.76190476 0.71428571
 0.65       0.75       0.66666667 0.76190476 0.57142857 0.80952381
 0.80952381 0.80952381 0.76190476 0.76190476 0.75       0.6       ]


##### Bagged Decision Tree

In [17]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# evaluate_model, samples, and labels are defined in the cells above
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was removed in 1.4)
clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=100, random_state=0)
score = evaluate_model(clf)

Accuracy: 0.8045
Cross validation accuracies
[0.76190476 0.80952381 0.80952381 0.71428571 0.66666667 0.85714286
 0.9047619  0.85714286 0.85       0.85       0.85714286 0.80952381
 0.9047619  0.71428571 0.71428571 0.85714286 0.76190476 0.85714286
 0.8        0.85       0.71428571 0.85714286 0.66666667 0.85714286
 0.76190476 0.9047619  0.80952381 0.85714286 0.85       0.65      ]


**Note** <br/>
There are several decision tree hyperparameters that have not been optimized in the above example, such as the maximum depth, the minimum number of samples per leaf, and the split criterion (the measure of class purity). Even with these optimized, a bagged ensemble of decision trees will generally outperform a single tree.

### Exercise 3: Random Forest

The issue with bagged decision trees is that every tree considers the same full set of features at every split. As a result, although many trees are produced, they tend to be highly correlated with each other. The problem is exacerbated when a few dominant features yield the best class purity and are therefore almost always chosen as split points. To combat this, a random forest considers only a random subset of the features at each split, which diversifies the trees and reduces the structural similarity between them.
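
One rough way to see the correlation between trees is to measure how often pairs of trees in an ensemble agree on their predictions. The sketch below is only an illustration under assumptions: it uses synthetic data rather than the Sonar data, and `mean_pairwise_agreement` is a hypothetical helper, not a scikit-learn function. The outcome depends on the data, so no particular result is guaranteed.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Sonar dataset).
X, y = make_classification(n_samples=400, n_features=30, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def mean_pairwise_agreement(ensemble, X):
    """Average fraction of samples on which two member trees predict the same class."""
    preds = [tree.predict(X) for tree in ensemble.estimators_]
    return float(np.mean([np.mean(a == b) for a, b in combinations(preds, 2)]))

# Bagged trees: every tree may use all 30 features at every split.
bagging = BaggingClassifier(n_estimators=30, random_state=0).fit(X_train, y_train)

# Random forest: every split only considers a random subset of the features.
forest = RandomForestClassifier(n_estimators=30, max_features="sqrt", random_state=0).fit(X_train, y_train)

bagging_agreement = mean_pairwise_agreement(bagging, X_test)
forest_agreement = mean_pairwise_agreement(forest, X_test)
print("bagging agreement: {:.4f}".format(bagging_agreement))
print("forest agreement:  {:.4f}".format(forest_agreement))
```

Lower agreement between individual trees means the ensemble members are more diverse, which is the property the random feature subsets are meant to encourage.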

#### Default Random Forest

The sklearn library provides a random forest class (`RandomForestClassifier`) with defaults for a handful of hyperparameters.

In [6]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0)
scores = evaluate_model(rf)

Accuracy: 0.8265
Cross validation accuracies
[0.80952381 0.9047619  0.80952381 0.71428571 0.71428571 0.80952381
 0.95238095 0.95238095 0.9        0.7        0.9047619  0.9047619
 0.85714286 0.71428571 0.80952381 0.9047619  0.85714286 0.85714286
 0.75       0.75       0.80952381 0.76190476 0.76190476 0.80952381
 0.85714286 0.9047619  0.85714286 0.85714286 0.85       0.75      ]


#### Tuning Random Forest
Several hyperparameters can be tuned here, including the number of bootstrap samples, the number of features considered at each split, the number of trees, and the tree depth. In random forests the trees are generally grown to full depth without pruning. Counterintuitively, this usually does not lead to severe overfitting, because averaging many decorrelated trees reduces variance. Tuning these hyperparameters may therefore yield only marginal gains. Below we will experiment with the number of features.

**Do This**: Experiment with setting `max_features` to values in the range $[1,16]$ in a `RandomForestClassifier` model. For each model, set `random_state` to 0 and print the average accuracy of the model.

In [10]:
from sklearn.ensemble import RandomForestClassifier

def evaluate_model(model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, samples, labels, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: {:.4f}'.format(scores.mean()))
    return scores

for i in range(1, 17):
    print("max_features = ", i)
    rf = RandomForestClassifier(random_state=0, max_features = i)
    scores = evaluate_model(rf)

max_features =  1
Accuracy: 0.8379
max_features =  2
Accuracy: 0.8329
max_features =  3
Accuracy: 0.8318
max_features =  4
Accuracy: 0.8252
max_features =  5
Accuracy: 0.8410
max_features =  6
Accuracy: 0.8218
max_features =  7
Accuracy: 0.8265
max_features =  8
Accuracy: 0.8219
max_features =  9
Accuracy: 0.8250
max_features =  10
Accuracy: 0.8475
max_features =  11
Accuracy: 0.8332
max_features =  12
Accuracy: 0.8139
max_features =  13
Accuracy: 0.8103
max_features =  14
Accuracy: 0.8217
max_features =  15
Accuracy: 0.8074
max_features =  16
Accuracy: 0.8205


This suggests that there might be a better choice for the number of features than the default $\sqrt{m}$, where $m$ is the number of features (60 in this dataset). However, caution should be exercised here: the random state of the classifier is fixed at 0 and the differences between settings are small. Changing the seed will lead to different results, so more experimentation would be needed to justify changing the default number of features. <br/>
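
As a quick check of what the default means for 60 features, the sketch below fits a small forest on synthetic data with 60 columns (an assumption standing in for the Sonar data) and reads back the value each tree actually uses. It passes `max_features="sqrt"` explicitly so the behavior does not depend on the installed version's default.

```python
import math
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

n_features = 60
print("sqrt(60) = {:.2f}".format(math.sqrt(n_features)))

# Synthetic stand-in data with 60 features (an assumption; not the Sonar dataset).
X, y = make_classification(n_samples=100, n_features=n_features, random_state=0)
rf = RandomForestClassifier(n_estimators=5, max_features="sqrt", random_state=0).fit(X, y)

# Each fitted tree exposes the integer number of features it considers per split.
resolved = rf.estimators_[0].max_features_
print("features considered at each split:", resolved)
```

The square root is truncated to an integer, so the default corresponds to `max_features = 7` here. This is consistent with the output above, where `max_features = 7` reproduces the accuracy of the default random forest (0.8265).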

**Note** <br/>
- As the bagged decision trees are constructed, we can calculate how much the error function drops for a variable at each split point. <br/>
- In regression problems this may be the drop in the sum of squared errors, and in classification it might be the drop in Gini impurity. <br/>
- These drops in error can be averaged across all decision trees and output to provide an estimate of the importance of each input variable. The greater the drop when the variable was chosen, the greater the importance. <br/>
- These outputs can help identify subsets of input variables that may be most or least relevant to the problem and suggest possible feature selection experiments in which some features are removed from the dataset.
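
In scikit-learn, the averaged impurity decrease described in the note above is exposed as the `feature_importances_` attribute of a fitted forest. The sketch below is illustrative only: it uses synthetic data in which, by construction, only the first three columns carry signal (an assumption made so that the ranking is easy to interpret).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the 3 informative features are columns 0, 1, 2
# and the remaining 7 columns are noise (an assumption; not the Sonar dataset).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, averaged over all trees and normalized to sum to 1.
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important feature first

for idx in ranking[:5]:
    print("feature {:>2}: importance {:.4f}".format(idx, importances[idx]))
```

Features near the bottom of the ranking are natural candidates for the feature selection experiments mentioned in the note.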

### Exercise 4: Boosting

Boosting ensemble algorithms create a sequence of models, each of which attempts to correct the mistakes of the models before it in the sequence. Once created, the models make predictions, which may be weighted by their demonstrated accuracy, and the results are combined to create a final output prediction.

A couple of things to be aware of: <br/>
**Outliers** - Outliers will force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic. These should be removed prior to training. <br/>
**Quality Labels** - Because the ensemble method attempts to correct misclassifications in the training data, mislabeled examples are treated as mistakes to fix. Be careful that the data is labeled as accurately as possible. <br/>

We will explore a boosting method called **Adaptive Boosting (AdaBoost)**. <br/>
AdaBoost was the first really successful boosting algorithm developed for binary classification. In principle it can boost the performance of any machine learning algorithm, but it is best used with weak learners: models that achieve accuracy just above random chance on a classification problem. The best suited, and therefore most common, weak learner used with AdaBoost is a decision tree with one level. Because these trees are so short and contain only one decision for classification, they are often called decision stumps.
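
To show how stumps are chained together, here is a minimal sketch of the classic discrete AdaBoost procedure for two classes labeled -1 and +1. It is an illustration under assumptions, not scikit-learn's internal implementation: the data is synthetic, and `n_rounds`, `stumps`, `alphas`, and `ensemble_predict` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption), relabeled from {0, 1} to {-1, +1}.
X, y01 = make_classification(n_samples=400, n_features=10, random_state=0)
y = 2 * y01 - 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_rounds = 50
weights = np.full(len(y_train), 1.0 / len(y_train))  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Fit a decision stump on the currently weighted training data.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_train, y_train, sample_weight=weights)
    pred = stump.predict(X_train)

    # Weighted error of this stump, clipped to avoid log(0) and division by zero.
    err = np.clip(np.sum(weights[pred != y_train]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # more accurate stumps get a larger say

    # Raise the weights of misclassified samples, lower the rest, then renormalize.
    weights = weights * np.exp(-alpha * y_train * pred)
    weights = weights / weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def ensemble_predict(X):
    """Sign of the alpha-weighted vote of all stumps."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)

test_acc = float(np.mean(ensemble_predict(X_test) == y_test))
print("AdaBoost (from scratch) test accuracy: {:.4f}".format(test_acc))
```

Each round focuses the next stump on the samples that the previous stumps got wrong, which is what "adaptive" refers to.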

#### Default AdaBoost

Just like for random forests, the sklearn library provides an AdaBoost class (`AdaBoostClassifier`) with defaults for a handful of hyperparameters.

**Do This**: Use the `evaluate_model` function to evaluate the `AdaBoostClassifier` with `random_state` set to 0.

In [12]:
from sklearn.ensemble import AdaBoostClassifier

def evaluate_model(model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, samples, labels, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: {:.4f}'.format(scores.mean()))
    print('Cross validation accuracies')
    print(scores)
    return scores

clf = AdaBoostClassifier(random_state=0)
score = evaluate_model(clf)

Accuracy: 0.8218
Cross validation accuracies
[0.85714286 0.95238095 0.85714286 0.76190476 0.80952381 0.95238095
 0.9047619  0.85714286 0.75       0.85       0.76190476 0.80952381
 0.9047619  0.66666667 0.76190476 0.76190476 0.85714286 0.80952381
 0.85       0.7        0.85714286 0.85714286 0.66666667 0.80952381
 0.80952381 0.95238095 0.85714286 0.80952381 0.9        0.7       ]


#### Tuning AdaBoost
Several hyperparameters can be tuned here, including the number of trees, the depth of the weak learner, the learning rate (the contribution of each model to the ensemble prediction), and even the type of weak learner, such as KNN (note that scikit-learn's `AdaBoostClassifier` requires a weak learner that supports sample weights, which `KNeighborsClassifier` does not). Below we will experiment with the number of trees and the depth of the weak learner.
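
When more than one hyperparameter is tuned at once, a grid search can replace hand-written loops. The following is a small sketch under assumptions: it uses synthetic data, and the grid values are arbitrary choices for illustration rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (an assumption; not the Sonar dataset).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical grid: every combination of these values is cross-validated.
param_grid = {
    "n_estimators": [10, 50],
    "learning_rate": [0.1, 1.0],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=cv,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy: {:.4f}".format(search.best_score_))
```

The number of trees and the learning rate interact (a smaller learning rate usually needs more trees), which is one reason to search over them jointly.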

**Do This**: Tune the number of trees (`n_estimators`) parameter in `AdaBoostClassifier`, testing values in the set $\{10,50,100,500,1000,5000\}$. For each model, print the average cross-validated accuracy.

In [14]:
from sklearn.ensemble import AdaBoostClassifier

def evaluate_model(model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, samples, labels, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: {:.4f}'.format(scores.mean()))
    return scores

arr = [10, 50, 100, 500, 1000, 5000]
for i in arr:
    print(i)
    clf = AdaBoostClassifier(random_state=0, n_estimators = i)
    score = evaluate_model(clf)

10
Accuracy: 0.7677
50
Accuracy: 0.8218
100
Accuracy: 0.8410
500
Accuracy: 0.8456
1000
Accuracy: 0.8540
5000
Accuracy: 0.8507


**Do This**: Tune the maximum depth (`max_depth`) parameter of the `DecisionTreeClassifier` that is used as the `base_estimator` of an `AdaBoostClassifier` model, testing values in the range $[1,7]$. For each model, print the average cross-validated accuracy.

In [15]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

for i in range(1, 8):
    print(i)
    x = DecisionTreeClassifier(max_depth = i)
    # Note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was removed in 1.4)
    clf = AdaBoostClassifier(random_state=0, base_estimator = x)
    score = evaluate_model(clf)

1
Accuracy: 0.8218
2
Accuracy: 0.8130
3
Accuracy: 0.8013
4
Accuracy: 0.8414
5
Accuracy: 0.8350
6
Accuracy: 0.7486
7
Accuracy: 0.7311


This hyperparameter tuning shows some promising results. The accuracy on this dataset can be improved by adjusting the number of trees and the depth of the weak learner: in the runs above, 1000 trees (0.8540) and a depth of 4 (0.8414) both beat the default configuration (0.8218). Further progress might be possible if other parameters are tuned. As mentioned before, the seed is fixed at 0, so be careful to ensure that these improvements hold when the stochastic nature of the algorithm isn't fixed.
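
One way to check whether an improvement survives a change of seed is to repeat the evaluation over several seeds and compare the spread with the size of the improvement. The sketch below is an illustration under assumptions: it uses synthetic data, `accuracy_over_seeds` is a hypothetical helper, and the seed is applied to both the model and the cross-validation splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption; not the Sonar dataset).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

def accuracy_over_seeds(n_estimators, seeds):
    """Mean and standard deviation of cross-validated accuracy across seeds."""
    accuracies = []
    for seed in seeds:
        model = AdaBoostClassifier(n_estimators=n_estimators, random_state=seed)
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
        accuracies.append(scores.mean())
    return float(np.mean(accuracies)), float(np.std(accuracies))

results = {}
for n_estimators in (10, 50):
    results[n_estimators] = accuracy_over_seeds(n_estimators, seeds=range(5))
    print("n_estimators = {:>2}: accuracy {:.4f} +/- {:.4f}".format(n_estimators, *results[n_estimators]))
```

If the difference between two settings is smaller than the variation across seeds, the apparent improvement may be noise.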

### References

1.) https://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/ <br/>
2.) https://machinelearningmastery.com/random-forest-ensemble-in-python/ <br/>
3.) https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/ <br/>
4.) https://machinelearningmastery.com/implement-bagging-scratch-python/ <br/>
5.) https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/ <br/>
6.) https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/ <br/>
7.) https://www.analyticsvidhya.com/blog/2020/02/what-is-bootstrap-sampling-in-statistics-and-machine-learning/ <br/>
8.) https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/ <br/>
9.) https://machinelearningmastery.com/adaboost-ensemble-in-python/