# Bagging
The idea behind bagging is to combine the predictions of multiple models (say, decision trees) to get a result that generalizes better than any single model.

If you train a bunch of models on the same set of data and combine them, will it be useful? There's a high chance that these models will all give the same result since they are getting the same input, so how can we solve this problem?

#### Bootstrapping
Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, **with replacement**. Each subset is the same size as the original dataset, so a subset typically contains some observations more than once and leaves others out.
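A minimal sketch of drawing one bootstrap sample, using only Python's standard library. The helper name `bootstrap_sample` and the toy dataset of 1000 integers are assumptions made for illustration; they are not part of the original notebook.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))

rng = random.Random(42)      # fixed seed so the sketch is repeatable
data = list(range(1000))     # toy dataset: 1000 distinct observations
sample = bootstrap_sample(data, rng)

# Same size as the original, but some observations repeat and others are missing.
unique_fraction = len(set(sample)) / len(data)
print(len(sample))
print(round(unique_fraction, 3))
```

For a large dataset the expected share of distinct observations in a bootstrap sample is about 63% (1 - 1/e), which is why each bag looks slightly different from the others.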

#### Bagging (Bootstrap Aggregating)
Bagging is a technique that uses these subsets (bags) to get a fair idea of the distribution of the complete set. How it works:

1. Multiple subsets are created from the original dataset, selecting observations with replacement
![1](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/05/image20-768x289.png)

2. A base model (weak model) is created on each of these subsets

3. The models run in parallel and are independent of each other

4. The final predictions are determined by combining the predictions from all the models (typically a majority vote for classification or an average for regression)
![4](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/05/Screenshot-from-2018-05-08-13-11-49-768x580.png)
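The four steps above can be sketched by hand in plain Python. This is an illustration under stated assumptions, not how scikit-learn implements bagging: the base model is a hypothetical one-feature threshold classifier (a "stump"), the toy data is made up, and all function names are invented for this sketch.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a threshold classifier on (x, label) pairs: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def predict_stump(threshold, x):
    return int(x > threshold)

def fit_bagged_stumps(points, n_models, rng):
    models = []
    for _ in range(n_models):
        bag = rng.choices(points, k=len(points))  # step 1: bootstrap sample
        models.append(fit_stump(bag))             # steps 2-3: one independent model per bag
    return models

def predict_bagged(models, x):
    # Step 4: combine the individual predictions with a majority vote.
    votes = Counter(predict_stump(threshold, x) for threshold in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
points = [(x, int(x >= 10)) for x in range(20)]   # toy data: label is 1 when x >= 10
models = fit_bagged_stumps(points, n_models=25, rng=rng)

print(predict_bagged(models, 2))
print(predict_bagged(models, 17))
```

An odd number of models is used so that a two-class majority vote cannot end in a tie.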

Bagging performs best with algorithms that have high variance, like decision trees constructed without pruning.
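One standard way to see why is a textbook result that is not part of the original notebook. Assume each of the $B$ base models $f_b$ has prediction variance $\sigma^2$ and that any two of them have correlation $\rho$. The variance of their average is then:

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as more models are added, so there is more to gain when $\sigma^2$ is large. The first term does not shrink with $B$, which is why lowering the correlation between the models can help further.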

# Example Implementations
* Bagged Decision Trees
* Random Forest
* Extra Trees

## Bagged Decision Trees

The example below uses scikit-learn's `BaggingClassifier` with 100 trees, using `DecisionTreeClassifier` (an implementation of Classification and Regression Trees, or CART) as the base model:

In [3]:
# Import packages
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [14]:
# Import data
URL = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

features = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

df = pd.read_csv(URL, names=features)

X = df.iloc[:, 0:8]
y = df.iloc[:, 8]

In [6]:
# Instantiate the models

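# random_state is left out: it only applies when shuffle=True, and newer scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=10)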

cart = DecisionTreeClassifier()
num_trees = 100

In [7]:
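# Bag it up
# The base model is passed positionally because its keyword was renamed from base_estimator to estimator in scikit-learn 1.2
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=42)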

In [9]:
# Check the results
results = cross_val_score(model, X, y, cv=kfold)

print(results)
print(results.mean())

[0.71428571 0.83116883 0.75324675 0.63636364 0.80519481 0.83116883
 0.80519481 0.85714286 0.71052632 0.77631579]
0.7720608339029391


Running this simple example gave us a mean cross-validated accuracy of about 77% for the bagged model.

## Random Forest
A random forest is an extension of bagged decision trees.

Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between individual classifiers. Specifically, rather than searching all features for the best split point, each split considers only a random subset of the features.
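To make the per-split feature sampling concrete, here is a minimal standard-library sketch. It assumes 8 input features, as in the dataset above, and 3 candidate features per split; the helper `candidate_features` is hypothetical and only mimics the sampling step, not the split search itself.

```python
import random

def candidate_features(n_features, max_features, rng):
    """Pick the feature indices that a single split is allowed to consider."""
    return sorted(rng.sample(range(n_features), max_features))

rng = random.Random(0)
n_features, max_features = 8, 3

# Each split draws its own subset, so different trees end up using different features.
splits = [candidate_features(n_features, max_features, rng) for _ in range(5)]
for features in splits:
    print(features)
```

Because the subset is redrawn at every split, a single strong feature cannot dominate every tree, which is what lowers the correlation between them.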

Below, sklearn's `RandomForestClassifier` is used with 100 trees and split points chosen from a random selection of 3 features, on the same dataset as above.

In [10]:
# Import packages
from sklearn.ensemble import RandomForestClassifier

In [11]:
# Instantiate the model (cross_val_score fits it on each fold)
max_features = 3

rfc = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)

In [12]:
# Get accuracy
results = cross_val_score(rfc, X, y, cv=kfold)

print(results.mean())

0.7642857142857143


## Extra Trees
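Extra Trees (extremely randomized trees) are another variation on this idea: each split considers a random subset of features and draws its candidate thresholds at random instead of searching for the best one. Note that sklearn's `ExtraTreesClassifier` fits each tree on the whole training set by default (`bootstrap=False`) rather than on bootstrap samples. We'll construct one with 100 trees and 7 random features, using the same dataset.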

In [13]:
from sklearn.ensemble import ExtraTreesClassifier

max_features = 7

etc = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)

results = cross_val_score(etc, X, y, cv=kfold)

print(results.mean())

0.7642173615857827
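As a rough illustration of how an Extra Trees split differs from an ordinary decision tree split, the sketch below draws one random threshold per candidate feature and keeps the best of those draws. The function `random_split`, the toy rows, and the misclassification-count score are all assumptions made for this sketch; scikit-learn scores splits with an impurity criterion such as Gini.

```python
import random

def random_split(rows, labels, feature_indices, rng):
    """Draw one random threshold per candidate feature and keep the best one."""
    best = None
    for j in feature_indices:
        column = [row[j] for row in rows]
        # The threshold is drawn uniformly between the feature's min and max, not searched.
        threshold = rng.uniform(min(column), max(column))
        left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[j] > threshold]
        # Score: labels that a majority vote on each side would get wrong.
        errors = sum(min(side.count(0), side.count(1)) for side in (left, right))
        if best is None or errors < best[0]:
            best = (errors, j, threshold)
    return best

rng = random.Random(1)
rows = [[1.0, 5.0], [2.0, 3.0], [8.0, 4.0], [9.0, 6.0]]  # toy data, two features
labels = [0, 0, 1, 1]

errors, feature, threshold = random_split(rows, labels, [0, 1], rng)
print(errors, feature, round(threshold, 2))
```

Skipping the exhaustive threshold search makes each tree cheaper to build and adds more randomness, which is the trade-off Extra Trees makes.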
