<h1 align='center'> Bagging Classifier </h1>

- One way to get a diverse set of classifiers is to use very different training algorithms.
- Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set. When sampling is performed __with replacement__, this method is called <code>bagging</code> (short for bootstrap aggregating).
- When sampling is performed __without replacement__, it is called <code>pasting</code>. 

- In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
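
The difference between the two sampling schemes can be seen directly with NumPy. This is only a minimal sketch on a toy set of 10 row indices; the names `n_rows`, `n_predictors`, `sample_size` and `rng` are illustrative assumptions and do not appear elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_predictors, sample_size = 10, 3, 6

# Bagging: each predictor draws rows WITH replacement,
# so the same row can appear more than once in one sample.
bagging_samples = [rng.choice(n_rows, size=sample_size, replace=True)
                   for _ in range(n_predictors)]

# Pasting: each predictor draws rows WITHOUT replacement,
# so every row appears at most once in one sample.
pasting_samples = [rng.choice(n_rows, size=sample_size, replace=False)
                   for _ in range(n_predictors)]

for i, (b, p) in enumerate(zip(bagging_samples, pasting_samples)):
    print(f"predictor {i}: bagging={np.sort(b)} pasting={np.sort(p)}")
```

In both lists a row can show up for several predictors, but repeated indices inside a single sample can only occur in the bagging list.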

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification


In [2]:
X,y = make_classification(n_samples=10000, n_features=10, n_informative= 3)

### Train Test Split

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

### Decision Tree 

In [5]:
from sklearn.tree import DecisionTreeClassifier

In [6]:
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)


#### Performance metrics

In [7]:
from sklearn.metrics import accuracy_score

In [8]:
print('Decision Tree Accuracy',accuracy_score(y_test,y_pred) )

Decision Tree Accuracy 0.9305


### <font color='Blue'>Bagging</font>

### Bagging Classifier using Decision Tree

In [9]:
from sklearn.ensemble import BaggingClassifier

In [10]:
bag = BaggingClassifier(
        DecisionTreeClassifier(),  # passed positionally: the keyword is base_estimator before scikit-learn 1.2 and estimator afterwards
        n_estimators=500,
        max_samples=0.5,
        bootstrap=True,
        random_state=42 )

In [11]:
bag.fit(X_train,y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=0.5,
                  n_estimators=500, random_state=42)

In [12]:
y_pred = bag.predict(X_test)
accuracy_score(y_test,y_pred)

0.961

In [13]:
bag.estimators_samples_[0].shape

(4000,)

In [14]:
bag.estimators_features_[0].shape

(10,)

### Bagging Classifier using SVM

In [15]:
from sklearn.svm import SVC

In [16]:
bag = BaggingClassifier(
    SVC(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    random_state=42
)

In [17]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Bagging using SVM",accuracy_score(y_test,y_pred))

Bagging using SVM 0.9455


### <font color ='orange'>Pasting </font>

- In pasting we do __row sampling without replacement__.
- For this we set the parameter __bootstrap=False__.

In [18]:
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=False,
    random_state=42,
    verbose = 1,
    n_jobs=-1
)

In [19]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Pasting classifier",accuracy_score(y_test,y_pred))

[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done   2 out of  16 | elapsed:    8.1s remaining:   57.6s
[Parallel(n_jobs=16)]: Done  16 out of  16 | elapsed:    9.2s finished
[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done   2 out of  16 | elapsed:    0.0s remaining:    0.6s


Pasting classifier 0.96


[Parallel(n_jobs=16)]: Done  16 out of  16 | elapsed:    0.2s finished


In [20]:
bag.estimators_samples_[0].shape

(2000,)

### <font color = 'orange'>Random Subspace </font>

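- In random subspace we sample columns (features) instead of rows: every predictor sees all the training rows but only a subset of the features.
- It is useful when the data has a large number of features.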
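
Column sampling can be sketched with plain NumPy indexing. The toy array `X_toy` and the names `cols_distinct`, `cols_bootstrap` and `X_subspace` are assumptions made for illustration; this is not how `BaggingClassifier` is implemented internally, only a picture of the idea.

```python
import numpy as np

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(6, 10))      # 6 rows, 10 features (toy data)
n_features = X_toy.shape[1]
n_selected = int(0.5 * n_features)    # mirrors max_features=0.5

# Without replacement: 5 distinct columns.
cols_distinct = rng.choice(n_features, size=n_selected, replace=False)

# With replacement (the idea behind bootstrap_features=True): columns may repeat.
cols_bootstrap = rng.choice(n_features, size=n_selected, replace=True)

# Every row is kept, only the set of columns changes.
X_subspace = X_toy[:, cols_distinct]
print(cols_distinct, cols_bootstrap, X_subspace.shape)
```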

In [21]:
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=1.0,
    bootstrap=False,
    max_features=0.5,
    bootstrap_features=True,
    random_state=42
)

In [22]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print('Random subspace classifier',accuracy_score(y_test,y_pred))

Random subspace classifier 0.9495


In [23]:
bag.estimators_samples_[0].shape

(8000,)

In [24]:
bag.estimators_features_[0].shape

(5,)

### <font color='orange'>Random Patches</font>

In [25]:
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    max_features=0.5,
    bootstrap_features=True,
    random_state=42
)

In [26]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Random Patches classifier",accuracy_score(y_test,y_pred))

Random Patches classifier 0.9465


### Out Of Bag Score

- When we do row sampling with replacement, each base model receives some rows more than once, while other rows are never drawn for that model.
- For a bootstrap sample of the same size as the training set, each base model sees on average about 63% of the distinct rows, and the remaining 37% are never given to it.
- That means each base model has about 37% of the rows that it has never seen. These are different rows for each base model.
- That is why they are called __out of bag__ (OOB) samples.

So the idea is: since each classifier has rows it never saw during training, why not use them as a testing mechanism?

This is done by setting __oob_score=True__. After fitting, the <code>oob_score_</code> attribute gives a rough estimate of the accuracy of the model, using the OOB samples as test data.
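
The 63% / 37% split follows from a short calculation: a given row is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below checks this numerically; `n` is set to 8000 to match the size of `X_train`, and the variable names are illustrative.

```python
import numpy as np

n = 8000  # same size as X_train in this notebook
rng = np.random.default_rng(0)

# Closed form: chance that one given row is missed by all n draws.
p_oob_exact = (1 - 1 / n) ** n

# Simulation: draw one bootstrap sample and count the rows it never picked.
sample = rng.integers(0, n, size=n)
p_oob_simulated = 1 - np.unique(sample).size / n

print(f"exact     : {p_oob_exact:.4f}")
print(f"simulated : {p_oob_simulated:.4f}")
print(f"1/e       : {np.exp(-1):.4f}")
```

With a smaller sample, such as `max_samples=0.25`, each base model draws fewer rows, so its out-of-bag fraction is larger: roughly e^(-0.25) ≈ 0.78 by the same argument.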

In [27]:
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    oob_score=True,
    random_state=42
)

In [28]:
bag.fit(X_train,y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=0.25,
                  n_estimators=500, oob_score=True, random_state=42)

In [29]:
bag.oob_score_

0.95675

In [30]:
y_pred = bag.predict(X_test)
print("Accuracy",accuracy_score(y_test,y_pred))

Accuracy 0.958


### Bagging Tips

- __Bagging__ generally gives better results than __Pasting__
- Good results come around the __25%__ to __50%__ row sampling mark
- <code>Random patches</code> and <code>subspaces</code> should be used while dealing with __high dimensional__ data
- To find good hyperparameter values we can use __GridSearchCV__ or __RandomizedSearchCV__
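
A randomized search samples a fixed number of configurations instead of trying every combination. The sketch below is a hedged illustration, not part of the original run: the small dataset `X_small`, `y_small`, the ranges in `param_distributions`, and the values `n_iter=5` and `cv=3` are assumptions chosen only to keep it fast.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset so the search finishes quickly.
X_small, y_small = make_classification(
    n_samples=500, n_features=10, n_informative=3, random_state=42
)

param_distributions = {
    "n_estimators": randint(10, 50),      # integers from 10 to 49
    "max_samples": uniform(0.1, 0.9),     # uniform(loc, scale): range [0.1, 1.0]
    "max_features": uniform(0.1, 0.9),
    "bootstrap": [True, False],
}

random_search = RandomizedSearchCV(
    BaggingClassifier(DecisionTreeClassifier(), random_state=42),
    param_distributions,
    n_iter=5,          # number of sampled configurations
    cv=3,
    random_state=42,
)
random_search.fit(X_small, y_small)
print(random_search.best_params_, random_search.best_score_)
```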

### GridSearchCV

In [33]:
from sklearn.model_selection import GridSearchCV

In [34]:
parameters = {
    'n_estimators': [50,100,500], 
    'max_samples': [0.1,0.4,0.7,1.0],
    'bootstrap' : [True,False],
    'max_features' : [0.1,0.4,0.7,1.0]
    }

In [35]:
search = GridSearchCV(BaggingClassifier(), parameters, cv=5)

In [36]:
search.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=BaggingClassifier(),
             param_grid={'bootstrap': [True, False],
                         'max_features': [0.1, 0.4, 0.7, 1.0],
                         'max_samples': [0.1, 0.4, 0.7, 1.0],
                         'n_estimators': [50, 100, 500]})

In [37]:
search.best_score_

0.9568749999999999

In [38]:
search.best_params_

{'bootstrap': False,
 'max_features': 1.0,
 'max_samples': 0.4,
 'n_estimators': 500}