# Ensemble Learning

## Import the data

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AML2021Bas/main/datasets/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
data

# BAGGING Algorithms

## Bagged Decision Trees

Bagging performs best with algorithms that have **high variance**.
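
The resampling step behind bagging can be illustrated with a few lines of NumPy. This is only a sketch with made-up values (the dataset size `n_samples` and the seed are assumptions, not part of the notebook): each bootstrap sample draws row indices with replacement, so on average only about 63% of the distinct rows end up in any one sample, and every base model sees a slightly different dataset.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
n_samples = 10_000               # hypothetical dataset size

# One bootstrap sample: draw n_samples row indices *with replacement*.
indices = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct rows that made it into this bootstrap sample.
unique_fraction = np.unique(indices).size / n_samples
print(f"unique rows in the bootstrap sample: {unique_fraction:.3f}")
```

High-variance models such as unpruned decision trees react strongly to these differences between samples, which is why averaging many of them tends to help.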


The example below uses the `BaggingClassifier` (more info [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)) with the Classification and Regression Trees algorithm (`DecisionTreeClassifier`) as the base estimator. A total of 100 trees are created.

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
#
from sklearn.ensemble import BaggingClassifier                     # <---
#
from sklearn.tree import DecisionTreeClassifier                    # <---

In [None]:
array = data.values
X = array[:,0:8]
Y = array[:,8]

seed = 42

First, try a single decision tree classifier as a rough baseline.

In [None]:
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)   # fix the shuffle so reruns use the same folds
model = DecisionTreeClassifier(random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Then try to do better with Bagging.

In [None]:
# Bagged Decision Trees for Classification
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
cart = DecisionTreeClassifier(random_state=seed)
num_trees = 100
# The base estimator is passed positionally: its keyword was renamed from
# `base_estimator` to `estimator` in recent scikit-learn releases.
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed, bootstrap=True)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

*(NOTE: running the cell above may take longer than usual.)*

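Now try changing one parameter above: set `bootstrap=False`, rerun the cell, and see what happens.

With bootstrapping enabled, each tree is trained on a different resample of the data, and the combined ensemble typically gives a more robust estimate of model accuracy than a single tree.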
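
The comparison can also be run side by side. The sketch below is not part of the original notebook: it uses a synthetic dataset from `make_classification` as a stand-in for the diabetes data (an assumption, so that the cell runs without a download), and the names `X_demo`, `y_demo` and `candidates` are illustrative. It reports the mean and the spread of the fold accuracies for a single tree and for bagging with and without bootstrapping.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (an assumption of this sketch).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Base estimator passed positionally, since its keyword name
    # differs between scikit-learn releases.
    "bagging, bootstrap=True": BaggingClassifier(
        DecisionTreeClassifier(random_state=0),
        n_estimators=50, bootstrap=True, random_state=0,
    ),
    "bagging, bootstrap=False": BaggingClassifier(
        DecisionTreeClassifier(random_state=0),
        n_estimators=50, bootstrap=False, random_state=0,
    ),
}

scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    scores[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name:26s} mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

With `bootstrap=False` and the default `max_samples=1.0`, every tree is fit on the same full training set, so the ensemble loses most of the diversity that bagging relies on.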

## Random Forest

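Random Forest is **an extension of bagged decision trees**: each tree is still trained on a bootstrap sample, but only a random subset of the features is considered at each split, which reduces the correlation between the trees.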


You can construct a Random Forest model for classification using the `RandomForestClassifier` class (documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)). The example below demonstrates using Random Forest for classification with 100 trees and split points chosen from a random selection of 3 features.

In [None]:
from sklearn.ensemble import RandomForestClassifier                    # <---

In [None]:
# Random Forest Classification
num_trees = 100
max_features = 3   # number of features considered at each split
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

*(NOTE: running the cell above may take longer than usual.)*

Running the example provides a mean estimate of classification accuracy.
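
To get a feeling for the `max_features` setting, one option is to sweep it and compare the cross-validated accuracy. The sketch below is an illustration only: it runs on a synthetic dataset (an assumption, standing in for the diabetes data), and the tried values 1, 3 and 8 are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with 8 features, like the diabetes data (an assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

rf_scores = {}
for m in (1, 3, 8):   # 8 = all features, i.e. plain bagging of trees
    forest = RandomForestClassifier(n_estimators=50, max_features=m, random_state=0)
    rf_scores[m] = cross_val_score(forest, X_demo, y_demo, cv=cv).mean()
    print(f"max_features={m}: mean accuracy = {rf_scores[m]:.3f}")
```

Which value works best depends on the data, so the same sweep is worth repeating on the real `X` and `Y`.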

## Extra Trees

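Extra Trees (extremely randomized trees) are **another modification of bagging**: split thresholds are drawn at random for each candidate feature instead of being optimized, and in scikit-learn each tree is built on the whole training set by default (`bootstrap=False`).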

You can construct an Extra Trees model for classification using the `ExtraTreesClassifier` class (documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)).

The example below provides a demonstration of extra trees with the number of trees set to 100 and splits chosen from 7 random features.

In [None]:
from sklearn.ensemble import ExtraTreesClassifier                    # <---

In [None]:
# Extra Trees Classification
num_trees = 100
max_features = 7   # number of features considered at each split
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

*(NOTE: running the cell above may take longer than usual.)*

Running the example provides a mean estimate of classification accuracy.
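
One practical difference between the two tree ensembles in scikit-learn is their default resampling behaviour, which can be checked directly from the estimator parameters. The short sketch below only inspects defaults and does not train anything; the variable names are illustrative.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Read the default value of the `bootstrap` parameter of each estimator.
rf_bootstrap = RandomForestClassifier().get_params()["bootstrap"]
et_bootstrap = ExtraTreesClassifier().get_params()["bootstrap"]

print("RandomForestClassifier bootstrap default:", rf_bootstrap)
print("ExtraTreesClassifier   bootstrap default:", et_bootstrap)
```

Random Forest resamples the rows for every tree, while Extra Trees relies on randomized split thresholds for diversity unless `bootstrap=True` is set explicitly.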

# BOOSTING Algorithms

## AdaBoost

You can construct an AdaBoost model for classification using the `AdaBoostClassifier` class (documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)).

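The example below demonstrates the construction of 30 decision trees in sequence using the AdaBoost algorithm. By default each of them is a depth-1 tree (a decision stump), and each new tree concentrates on the samples that the previous ones misclassified.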

In [None]:
from sklearn.ensemble import AdaBoostClassifier                    # <---

In [None]:
# AdaBoost Classification
num_trees = 30
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Running the example provides a mean estimate of classification accuracy.
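
Because boosting adds trees one after another, it can be instructive to watch the accuracy evolve as the ensemble grows. The sketch below uses `staged_score`, which yields the score after each boosting iteration. It is an illustration on synthetic data (an assumption, with hypothetical names such as `X_demo` and `booster`), not a result for the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise (an assumption of this sketch).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

booster = AdaBoostClassifier(n_estimators=30, random_state=0)
booster.fit(X_train, y_train)

# Test accuracy after 1, 2, ... boosting iterations.
staged = list(booster.staged_score(X_test, y_test))
for i in sorted({0, len(staged) // 2, len(staged) - 1}):
    print(f"after {i + 1:2d} trees: accuracy = {staged[i]:.3f}")
```

The curve is not guaranteed to increase monotonically; plotting `staged` against the number of trees is a simple way to pick `n_estimators`.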

## Stochastic Gradient Boosting

You can construct a Gradient Boosting model for classification using the `GradientBoostingClassifier` class (documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)).

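The example below demonstrates Gradient Boosting for classification with 100 trees. Note that with the default `subsample=1.0` each tree is fit on the full training set; setting `subsample` to a value below 1.0 is what makes the procedure *stochastic*.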

In [None]:
from sklearn.ensemble import GradientBoostingClassifier                    # <---

In [None]:
# Stochastic Gradient Boosting Classification
num_trees = 100
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Running the example provides a mean estimate of classification accuracy.
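
The stochastic variant is obtained by fitting each tree on a random fraction of the training rows, controlled by the `subsample` parameter. The sketch below compares the two settings on synthetic data (an assumption, standing in for the diabetes dataset); the value 0.8 is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption of this sketch).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

gb_scores = {}
for subsample in (1.0, 0.8):   # 1.0 = plain gradient boosting, <1.0 = stochastic
    model = GradientBoostingClassifier(
        n_estimators=100, subsample=subsample, random_state=0
    )
    gb_scores[subsample] = cross_val_score(model, X_demo, y_demo, cv=cv).mean()
    print(f"subsample={subsample}: mean accuracy = {gb_scores[subsample]:.3f}")
```

Whether subsampling helps depends on the dataset, so it is best treated as one more hyperparameter to tune.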

# VOTING Algorithms

Voting is **one of the simplest ways of combining the predictions from multiple ML algorithms**.
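
As a minimal illustration of the idea, hard (majority) voting can be written by hand in NumPy. The prediction matrix below is made up for the example: three hypothetical classifiers (rows) predict a binary label for four samples (columns), and the ensemble outputs the label chosen by most of them.

```python
import numpy as np

# Hypothetical 0/1 predictions: one row per classifier, one column per sample.
predictions = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
])

# Count the votes for class 1 and keep it where it has a strict majority.
votes_for_one = predictions.sum(axis=0)
majority = (votes_for_one > predictions.shape[0] / 2).astype(int)
print(majority)   # [0 1 0 0]
```

Only the second sample receives at least two votes for class 1, so it is the only one labelled 1 by the ensemble.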


You can create a voting ensemble model for classification using the `VotingClassifier` class (documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)).

The code below provides an example of combining the predictions of logistic regression, classification and regression trees and support vector machines together for a classification problem.


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier                    # <---

In [None]:
# Voting Ensemble for Classification

kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

# create the sub models
estimators = []
#model1 = LogisticRegression()
model1 = LogisticRegression(C=10, tol=0.01, solver='lbfgs', max_iter=10000)
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier(random_state=seed)
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))

# create the ensemble model (hard voting, i.e. majority vote, is the default)
ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())

Running the example provides a mean estimate of classification accuracy.
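
`VotingClassifier` also supports soft voting, which averages the predicted class probabilities instead of counting votes. The sketch below compares both modes on synthetic data (an assumption, standing in for the diabetes dataset); `make_estimators` is a hypothetical helper introduced here only to build fresh sub-models for each run. Soft voting needs probability estimates from every sub-model, hence `SVC(probability=True)`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption of this sketch).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

def make_estimators():
    """Hypothetical helper: return a fresh list of (name, model) pairs."""
    return [
        ("logistic", LogisticRegression(max_iter=1000)),
        ("cart", DecisionTreeClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),  # probabilities for soft voting
    ]

voting_scores = {}
for voting in ("hard", "soft"):
    ensemble = VotingClassifier(make_estimators(), voting=voting)
    voting_scores[voting] = cross_val_score(ensemble, X_demo, y_demo, cv=cv).mean()
    print(f"{voting} voting: mean accuracy = {voting_scores[voting]:.3f}")
```

Soft voting gives more weight to confident sub-models, but it only helps when their probability estimates are reasonably well calibrated.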

# Summary

What we did:

* We explored ensemble ML algorithms for improving the performance of models:
  * Bagging ensembles, including Bagged Decision Trees, Random Forest and Extra Trees;
  * Boosting ensembles, including AdaBoost and Stochastic Gradient Boosting;
  * Voting ensembles, which combine the predictions of arbitrary models.

# Exercises

## <font color=red>Exercise 1</font>

1. Load the *MNIST* data and split it into a training set, a validation set, and a test set (e.g. use 50k instances for training, and 10k each for validation and testing).
2. Then train various classifiers, such as a **Random Forest** classifier, an **Extra-Trees** classifier, and an **SVM** classifier, for example. If you want, pick more.
3. Next, try to combine them into an **ensemble** that outperforms each individual classifier on the validation set, using **soft or hard voting**.
4. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

### <font color='green'>Solution</font>

In [None]:
# type your code below

_Credits: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Edition) by Aurélien Géron, O'Reilly Media Inc., 2019_