#  Bagging Ensembles

- bagged decision tree
- random forest
- exremely randomized tree

Bagging is a bootstrap ensemble method that creates individuals for its ensemble by training each classifier on a random redistribution of the training set. Each classifier's training set is generated by randomly drawing, with replacement, N examples - where N is the size of the original training set; many of the original examples may be repeated in the resulting training set while others may be left out. Each individual classifier in the ensemble is generated with a different random sampling of the training set.


## Decision Tree (high variance) and Bagged Decision Tree

In [2]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn import metrics

In [4]:
iris = load_iris()
X = pd.DataFrame(iris.data[:, :], columns = iris.feature_names[:])
y = pd.DataFrame(iris.target, columns =["Species"])

In [8]:
# Splitting Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 20, random_state = 100)

# Defining the stump
stump = DecisionTreeClassifier(max_depth = 1)

# Creating an ensemble 
ensemble = BaggingClassifier(base_estimator = stump, n_estimators = 1000,bootstrap = False)

In [9]:
# Training classifiers
stump.fit(X_train, np.ravel(y_train))
ensemble.fit(X_train, np.ravel(y_train))

# Making predictions
y_pred_stump = stump.predict(X_test)
y_pred_ensemble = ensemble.predict(X_test)

In [10]:
# Determine performance
stump_accuracy = metrics.accuracy_score(y_test, y_pred_stump)
ensemble_accuracy = metrics.accuracy_score(y_test, y_pred_ensemble)

# Print message to user
print(f"The accuracy of the stump is {stump_accuracy*100:.1f} %")
print(f"The accuracy of the ensemble is {ensemble_accuracy*100:.1f} %")


The accuracy of the stump is 55.0 %
The accuracy of the ensemble is 55.0 %


We created 1000 decision stumps that were exactly the same. It’s like we asked a single person what their favorite food was 1000 times and, not surprisingly, obtained the same answer 1000 times.

## Random Forest Classifier (medium variance)

Random forest models reduce the risk of overfitting by introducing randomness by:
- building multiple trees (n_estimators)
- drawing observations with replacement (i.e., a bootstrapped sample)
- splitting nodes on the best split among a random subset of the features selected at every node

In [22]:
# Defining the stump
stump = DecisionTreeClassifier(max_depth = 1, splitter = "best", max_features = "sqrt")

# Create Random Forest 
ensemble = BaggingClassifier(base_estimator=stump,n_estimators=1000,bootstrap=True)

In [23]:
# Training classifiers
stump.fit(X_train, np.ravel(y_train))
ensemble.fit(X_train, np.ravel(y_train))

# Making predictions
y_pred_tree = stump.predict(X_test)
y_pred_ensemble = ensemble.predict(X_test)

In [24]:
# Determine performance
stump_accuracy = metrics.accuracy_score(y_test, y_pred_stump)
ensemble_accuracy = metrics.accuracy_score(y_test, y_pred_ensemble)

# Print message to user
print(f"The accuracy of the stump is {stump_accuracy*100:.1f} %")
print(f"The accuracy of the Random Forest is {ensemble_accuracy*100:.1f} %")

The accuracy of the stump is 55.0 %
The accuracy of the Random Forest is 90.0 %


So by simply introducing variation, we were able to obtain an accuracy of 95 %. In other words, decision stumps with low accuracies were used to build a forest. Variation was introduced among the stumps by building them on bootstraps

## Extra Tree Classifier (low variance)

Extremely Randomized Trees Classifier(Extra Trees Classifier) is a type of bagged decision tree ensemble method. The random trees are constructed from the samples of the training dataset. 

Extra Trees is like RF, in that it builds multiple trees and splits nodes using random subsets of features, but with two key differences: it does not bootstrap observations (meaning it samples without replacement), and nodes are split on random splits, not best splits.

In [25]:
# Defining the stump
stump = DecisionTreeClassifier(max_depth = 1, splitter = "random", max_features = "sqrt")

# Create Extra Trees
ensemble = BaggingClassifier(base_estimator=stump,n_estimators=1000,bootstrap=False)

In [26]:
# Training classifiers
stump.fit(X_train, np.ravel(y_train))
ensemble.fit(X_train, np.ravel(y_train))

# Making predictions
y_pred_tree = stump.predict(X_test)
y_pred_ensemble = ensemble.predict(X_test)

In [27]:
# Determine performance
stump_accuracy = metrics.accuracy_score(y_test, y_pred_stump)
ensemble_accuracy = metrics.accuracy_score(y_test, y_pred_ensemble)

# Print message to user
print(f"Stump Accuracy {stump_accuracy*100:.1f} %")
print(f"ExtraTree Accuracy {ensemble_accuracy*100:.1f} %")

Stump Accuracy 55.0 %
ExtraTree Accuracy 95.0 %


Decision Trees show high variance, Random Forests show medium variance, and Extra Trees show low variance.

# Random Forest (RF optimizes splits on trees)

RF is a bagged Decision Tree (precisely Decision stumps) Ensemble method. The RF algorihm is essentially the combination of two independent ideas: bagging, and random selection of features.

DTs work well when they are of small depth. Higher depth DTs are more prone to overfitting and thus lead to higher variance in the model. This shortcoming of DT is explored by the Random Forest model. The Purpose of Random Forest is to improve Prediction Accuracy.


### Parameter tunings

As you increase max_depth you increase variance and decrease bias. On the other hand, as you increase min_samples_leaf you decrease variance and increase bias. So, these parameters will control the level of regularization when growing the trees. In summary, decreasing any of the max* parameters and increasing any of the min* parameters will increase regularization.

- a) max_depth: The maximum depth that you allow the tree to grow to. The deeper you allow, the more complex your model will become. Increasing max_depth, training error will always go down (or at least not go up).The deeper the tree, the more splits it has and it captures more information about the data.
- b) min_samples_split: min_sample_split is the minimum no. of samples or data points required for a split. For instance, if min_sample_split = 6 and there are 4 sample in the node, then the split will not happen (regardless of entropy).
- c) min_sample_leaf: the minimum number of data points required in each leaf to continue the splits.This parameter is similar to min_samples_splits, however, this describe the minimum number of samples of samples at the leafs, the base of the tree.
- d) max_leaf_nodes: the maximum number of (supposedly best) leaf nodes.

### OUT OF BAG Error

RF technique involves sampling of the input data with replacement (bootstrap sampling). In this sampling, about one thrird of the data is not used for training and can be used to testing.These are called the out of bag samples. Error estimated on these out of bag samples is the out of bag error.
The OOB error is often used for assessing the prediction performance of RF. An advantage of the OOB error is that the complete original sample is used both for constructing the RF classifier and for error estimation.

## What is the difference between Bagging and Random Forest?

bagging is not same as random forest.

Bagging creates randomized samples of the data set and grows trees on a different sample of the original data. The remaining 1/3 of the sample is used to estimate unbiased OOB error. It considers all the features at a node (for splitting). Once the trees are fully grown, it uses averaging or voting to combine the resultant predictions.

The need for random forest surfaced after discovering that the bagging algorithm results in correlated trees when faced with a dataset having strong predictors. Unfortunately, averaging several highly correlated trees doesn't lead to a large reduction in variance.

But how do correlated trees emerge? Good question! Let's say a data set has a very strong predictor, along with other moderately strong predictors. In bagging, a tree grown every time would consider the very strong predictor at its root node, thereby resulting in trees similar to each other.

The main difference between random forest and bagging is that random forest considers only a subset of predictors at a split. This results in trees with different predictors at top split, thereby resulting in decorrelated trees and more reliable average output. That's why we say random forest is robust to correlated predictors.

Decision stump , a decision tree classifier with depth set to one (two splits)

# DT vs RF

The first is the number of features to consider when looking for the best split at each tree node: while DT considers all the features, RF considers a random subset of them, of size equal to the parameter max_features.

The second is that, while DT considers the whole training set, a single RF tree considers only a bootstrapped sub-sample of it.

Averaging ensembles such as a RandomForestClassifier and ExtraTreesClassifier are meant to tackle the variance problems (lack of robustness with respect to small changes in the training set) of individual DecisionTreeClassifier instances.

#### to make single decisin tree using RF

```python
clf = RandomForestClassifier(n_estimators=1, max_features=None, bootstrap=False)

```

in which case neither bootstrap sampling nor random feature selection will take place, and the performance should be roughly equal to that of a single decision tree.

# RF vs ET

ERT do not resample observations when building a tree. (They do not perform bagging.)

ERT do not use the “best split.”
Like a RF, ERT select a random subset of predictors for each split. 
Instead of the “best split” for the predictors, ERT makes a small number of randomly chosen splits-points for each of the selected predictors. In the original method, this value was 1. (A tuning parameter: numRandomCuts)
ERT then selects the “best split” from this small number of choices.

Random forest that develop each decision tree from a bootstrap sample of the training dataset, the Extra Trees algorithm fits each decision tree on the whole training dataset.