# Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms. Another approach is to use the same training algorithm for every predictor, but to train each one on a different random subset of the training set. When sampling is performed _with_ replacement, this method is called __bagging__ (short for _bootstrap aggregating_). When sampling is performed _without_ replacement, it is called __pasting__.

In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
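The difference between the two sampling schemes can be sketched with NumPy index sampling. This is a minimal illustration rather than how any particular library implements it; the toy sizes (`n_instances`, `n_predictors`, `subset_size`) and the seed are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed is arbitrary, for repeatability only
n_instances = 10   # size of a toy training set (hypothetical)
n_predictors = 3   # number of predictors in the ensemble (hypothetical)
subset_size = 6    # instances drawn for each predictor (hypothetical)

indices = np.arange(n_instances)

# Bagging: sample WITH replacement, so an index may repeat within one subset.
bagging_subsets = [
    rng.choice(indices, size=subset_size, replace=True)
    for _ in range(n_predictors)
]

# Pasting: sample WITHOUT replacement, so each subset holds distinct indices.
pasting_subsets = [
    rng.choice(indices, size=subset_size, replace=False)
    for _ in range(n_predictors)
]

for name, subsets in [("bagging", bagging_subsets), ("pasting", pasting_subsets)]:
    for i, subset in enumerate(subsets):
        print(f"{name} predictor {i}: {np.sort(subset)}")
```

With pasting, every printed subset contains six distinct indices, yet the same index can still show up in the subsets of different predictors. With bagging, an index can additionally repeat inside a single subset.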

```mermaid

flowchart TD
    A[Training set] -->|Random sampling| DATA_A[Subset A]
    A -->|Random sampling| DATA_B[Subset B]
    A -->|Random sampling| DATA_C[Subset C]
    A -->|Random sampling| DATA_D[Subset D]

    DATA_A -->|Training| B[Predictor A]
    DATA_B -->|Training| C[Predictor B]
    DATA_C -->|Training| D[Predictor C]
    DATA_D -->|Training| E[Predictor D]
```

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the _statistical mode_ (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. The net result is generally an ensemble with a similar bias but a lower variance than a single predictor trained on the original training set.
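The two aggregation functions can be sketched in a few lines of NumPy. The prediction arrays and the helper name `hard_vote` below are hypothetical, made up for illustration; they are not part of scikit-learn.

```python
import numpy as np

# Hypothetical class predictions: one row per predictor, one column per instance.
class_preds = np.array([
    [0, 1, 1, 2],
    [0, 1, 2, 2],
    [1, 1, 2, 0],
])

def hard_vote(predictions):
    """Return the most frequent class in each column (the statistical mode)."""
    n_classes = predictions.max() + 1
    # Vote counts per class and instance: shape (n_classes, n_instances).
    counts = np.array([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)  # ties resolve to the lowest class label

# Hypothetical regression predictions: same layout as above.
regression_preds = np.array([
    [2.0, 3.5],
    [2.4, 3.1],
    [2.2, 3.3],
])

voted = hard_vote(class_preds)            # classification: majority vote
averaged = regression_preds.mean(axis=0)  # regression: plain average

print(voted)
print(averaged)
```

For the third instance, for example, the predictors vote 1, 2, and 2, so the ensemble predicts class 2.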

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
```

```python
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # 1.0 if Iris virginica, else 0.0

# Default split is 75% train / 25% test; it is unseeded, so results vary between runs
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

```python
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,  # train 500 decision trees
    max_samples=100,   # each tree is trained on 100 sampled instances
    bootstrap=True,    # sample with replacement (bagging)
    n_jobs=-1,         # use all available CPU cores
)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
```

```python
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
```

```
0.9736842105263158
```
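To try pasting instead of bagging, the same `BaggingClassifier` can be configured with `bootstrap=False`, so that each tree's 100 instances are drawn without replacement. The sketch below repeats the data preparation so that it runs on its own. The name `paste_clf` and the `random_state=42` arguments are assumptions added for repeatability and are not part of the original run, so the accuracy it prints is not directly comparable to the figure above.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # 1.0 if Iris virginica, else 0.0

# random_state is an assumption added here so the split is repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

paste_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,   # must not exceed the training set size when bootstrap=False
    bootstrap=False,   # sample without replacement (pasting)
    n_jobs=-1,
    random_state=42,   # assumption: fixed seed for repeatable sampling
)
paste_clf.fit(X_train, y_train)

paste_accuracy = accuracy_score(y_test, paste_clf.predict(X_test))
print(paste_accuracy)
```

Because sampling is done without replacement, `max_samples` has to stay at or below the number of training instances (112 with the default 75/25 split of the 150 iris samples).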