#  Bagging

In [1]:
import numpy as np
import matplotlib.pyplot as plt

## Bagging

A single decision tree often generalizes poorly because it tends to overfit.  One remedy is to construct multiple trees and combine them, which reduces variance.  If every tree learned exactly the same thing, we would simply end up with identical trees, so we need to inject some differences between them (i.e., make them as diverse as possible while still letting them see some overlapping samples).  One simple idea is to train each tree on a **bootstrap sample** of the training data and then aggregate the trees' decisions.

The process has the following steps:

1. Sample $m$ times **with replacement** from the original training data
2. Repeat $B$ times to generate $B$ "bootstrapped" training datasets $D_1, D_2, \cdots, D_B$
3. Train $B$ trees using the training datasets $D_1, D_2, \cdots, D_B$ 

Bootstrapping the data and then aggregating the predictions (by averaging or majority vote) is called **bootstrap aggregation**, or **bagging**.
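
The sampling steps and the aggregation can be sketched with NumPy alone. This is only a minimal illustration, not the implementation used later in this notebook: the helper names `bootstrap_indices` and `majority_vote` are made up for this example, the seed is arbitrary, and class labels are assumed to be the integers $0, 1, 2, \ldots$

```python
import numpy as np

def bootstrap_indices(m, B, rng):
    """Return a (B, m) array; row b holds the row indices that form D_b."""
    # Each row is m draws with replacement from {0, ..., m-1}
    return rng.integers(0, m, size=(B, m))

def majority_vote(predictions):
    """Aggregate a (B, n_samples) array of integer class labels by majority vote."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # Count the votes per class for every sample (column), then pick the top class
    counts = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
idx = bootstrap_indices(m=4, B=3, rng=rng)
print(idx.shape)                 # (3, 4)

votes = np.array([[0, 1, 2],
                  [0, 1, 1],
                  [1, 1, 2]])    # predictions of 3 trees for 3 samples
print(majority_vote(votes))      # [0 1 2]
```

In this sketch a tie goes to the smallest class label, because `argmax` returns the first maximum.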

*Example*:

Assume that we have a training set where $m=4$, and $n=2$:

$$D = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)\}$$

We generate, say, $B = 3$ datasets by bootstrapping:

$$D_1 = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_3, y_3)\}$$
$$D_2 = \{(x_1, y_1), (x_4, y_4), (x_4, y_4), (x_3, y_3)\}$$
$$D_3 = \{(x_1, y_1), (x_1, y_1), (x_2, y_2), (x_2, y_2)\}$$

We can then train 3 trees.

Note: When sampling is performed **without** replacement, it is called **pasting**.  In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows a training instance to be sampled several times for the same predictor.
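
The difference between the two sampling schemes can be seen directly with NumPy's `Generator.choice`. The sizes and the seed below are illustrative assumptions, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
m = 10                           # number of training instances (illustrative)
sample_size = 8                  # instances drawn for one predictor (illustrative)

# Bagging: indices are drawn with replacement, so repeats are possible
bagging_idx = rng.choice(m, size=sample_size, replace=True)

# Pasting: indices are drawn without replacement, so every index is unique
pasting_idx = rng.choice(m, size=sample_size, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
print("unique indices in the pasting sample:", len(np.unique(pasting_idx)))
```

Pasting requires `sample_size <= m`, since an index cannot be drawn twice for the same predictor.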

Let's try to code this from scratch.  To make our life easier, we shall use `DecisionTreeClassifier` from the sklearn library (since we already coded a decision tree from scratch in the previous class).

### Scratch

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                test_size=0.3, shuffle=True, random_state=42)



In [3]:
from sklearn.tree import DecisionTreeClassifier
import random
from scipy import stats
from sklearn.metrics import classification_report,accuracy_score

class Bagging:
    def __init__(self,B,bootstrap_ratio,with_no_replacement = True):
        self.B = B  # number of trees
        self.bootstrap_ratio = bootstrap_ratio  # fraction of the training set drawn for each tree
        self.with_no_replacement = with_no_replacement  # True = pasting, False = bagging
        self.tree_params = {'max_depth': 2, 'criterion':'gini', 'min_samples_split': 5}
        self.models = [DecisionTreeClassifier(**self.tree_params) for _ in range(B)]  # B independent decision trees
    
    def fit(self,X,y):
        m, n = X.shape
        # print(m,n)

        #sample size for each tree
        sample_size = int(self.bootstrap_ratio * len(X))
        print('sample_size',sample_size)

        xsamples = np.zeros((self.B, sample_size, n)) 
        ysamples = np.zeros((self.B, sample_size))
        '''
        xsamples = (#trees, sample_size, features)
        ysamples = (#trees, sample_size)
        '''
        
        xsamples_oob = [] # list because length is not known
        ysamples_oob = []
        # print(ysamples.shape)

        # Draw the subsample for each model
        for i in range(self.B):
            # in_bag_idx holds the row indices drawn for tree i. With replacement
            # (bagging) an index can repeat; without replacement (pasting) it cannot.
            in_bag_idx = []
            for j in range(sample_size):
                idx = random.randrange(m)
                if self.with_no_replacement:
                    # Redraw until we get an index not yet used by this tree
                    # (this needs sample_size <= m, i.e. bootstrap_ratio <= 1)
                    while idx in in_bag_idx:
                        idx = random.randrange(m)
                in_bag_idx.append(idx)
                xsamples[i, j, :] = X[idx]
                ysamples[i, j] = y[idx]
            # Rows never drawn for tree i are its out-of-bag (oob) samples
            mask = np.zeros(m, dtype=bool)
            mask[in_bag_idx] = True
            xsamples_oob.append(X[~mask])
            ysamples_oob.append(y[~mask])

        # Fit each estimator on its own subsample and score it on its own oob samples
        oob_score = 0
        for i, model in enumerate(self.models):
            _X = xsamples[i, :]
            _y = ysamples[i, :]
            model.fit(_X, _y)

            # Accuracy of tree i on the samples it never saw during training
            _X_test = np.asarray(xsamples_oob[i])
            _y_test = np.asarray(ysamples_oob[i])
            yhat = model.predict(_X_test)
            tree_oob_score = accuracy_score(_y_test, yhat)
            oob_score += tree_oob_score  # running sum over all trees
            print(f"Tree {i}", tree_oob_score)
        self.avg_oob_score = oob_score/len(self.models)
        print("===== Average out of bag score =====")
        print(self.avg_oob_score)

    # Predict class labels by majority vote over the B trees
    def predict(self,X): #<== X_test
        predictions = np.zeros((self.B, X.shape[0]))  # one row of predictions per tree
        for i, model in enumerate(self.models):
            yhat = model.predict(X)
            predictions[i, :] = yhat
        # The mode along axis 0 is the majority vote for each sample. np.ravel gives a
        # 1-D result whether SciPy returns the mode with shape (1, n) (older releases)
        # or (n,) (newer releases).
        return np.ravel(stats.mode(predictions, axis=0)[0])
    

model = Bagging(B=5, bootstrap_ratio =0.8,with_no_replacement = False )
model.fit(X_train,y_train)
yhat = model.predict(X_test)
# print(yhat)
print(classification_report(y_test, yhat))



sample_size 84
Tree 0 0.9777777777777777
Tree 1 0.9387755102040817
Tree 2 0.9166666666666666
Tree 3 0.8823529411764706
Tree 4 0.9423076923076923
===== Average out of bag score =====
0.9315761176265378
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45



### Sklearn

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

tree = DecisionTreeClassifier()

'''
To do this in sklearn, we can use the BaggingClassifier API.
Pasting can be done using BaggingClassifier, setting bootstrap=False
'''

bag = BaggingClassifier(tree, n_estimators=5, max_samples=0.99)

bag.fit(X_train, y_train)
yhat = bag.predict(X_test)
print(classification_report(y_test, yhat))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
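
As a sketch of the pasting variant mentioned in the cell above, `bootstrap=False` makes `BaggingClassifier` draw each tree's rows without replacement. The values `max_samples=0.8` and `random_state=42` are choices made for this example only, and the data is reloaded so that the snippet runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)

# bootstrap=False switches from bagging to pasting:
# each tree gets 80% of the training rows, drawn without replacement
paste = BaggingClassifier(DecisionTreeClassifier(), n_estimators=5,
                          max_samples=0.8, bootstrap=False, random_state=42)
paste.fit(X_train, y_train)

pasting_acc = accuracy_score(y_test, paste.predict(X_test))
print(pasting_acc)
```

With pasting, `max_samples` should stay below 1.0; otherwise every tree would be trained on the full training set and the ensemble would lose most of its diversity.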



### ===Classwork===

#### Out of Bag Evaluation

Well, it seems like our bagging technique works quite well.  One interesting observation is that each tree only sees a subset of the dataset.  Any data that a particular tree did not see is called **out of bag** (oob).  Note that the oob samples are not the same for all predictors.
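
How much data ends up out of bag can be estimated with a short calculation: when drawing with replacement, a given instance is missed by one draw with probability $1 - 1/m$, so it is missed by all draws with probability $(1 - 1/m)^{\text{sample size}}$. The sketch below checks this against one simulated sample; the function name `expected_oob_fraction` and the seed are assumptions made for this example.

```python
import numpy as np

def expected_oob_fraction(m, sample_size):
    """Probability that one given instance is never drawn in
    sample_size draws with replacement from m instances."""
    return (1 - 1 / m) ** sample_size

rng = np.random.default_rng(0)   # arbitrary seed
m, sample_size = 105, 84         # sizes from the run above (80% of 105 training rows)

# Empirical check: draw one bootstrap sample and count the rows never drawn
drawn = rng.integers(0, m, size=sample_size)
empirical = 1 - len(np.unique(drawn)) / m

print(expected_oob_fraction(m, sample_size))
print(empirical)

# When sample_size equals m and m is large, the fraction approaches 1/e
print(expected_oob_fraction(10_000, 10_000), np.exp(-1))
```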

Since a tree never sees its oob samples during training, they can serve as a kind of validation set for that tree.  So after we fit each tree, we can measure its accuracy on its own oob samples and then average the accuracies over all trees.
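
For comparison, sklearn's `BaggingClassifier` offers a built-in oob estimate through `oob_score=True` and the fitted attribute `oob_score_`. As far as I understand it, sklearn computes this differently from the per-tree average described above: for every training instance it aggregates the predictions of only those trees for which the instance was out of bag, and then scores the aggregated predictions. The settings `n_estimators=50`, `max_depth=2` and `random_state=42` are choices made for this sketch, and the data is reloaded so that the snippet runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)

# oob_score=True is only available with bootstrap=True (the default)
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=50,
                        bootstrap=True, oob_score=True, random_state=42)
bag.fit(X_train, y_train)

print(bag.oob_score_)             # oob estimate of the accuracy
print(bag.score(X_test, y_test))  # accuracy on the held-out test set
```

Using more estimators than in the scratch example makes it likely that every training instance is out of bag for at least one tree, which the oob estimate needs.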

<strong>Your work: Let's modify the above scratch code to</strong>
    <ol>
        <li>Calculate the oob evaluation for each bootstrapped dataset, and also the average score</li>
        <li>Change the code to "without replacement"</li>
        <li>Put everything into a class <code>Bagging</code>.  It should have at least two methods, <code>fit(X_train, y_train)</code>, and <code>predict(X_test)</code></li>
    </ol>
No score, no pressure, only intrinsic motivation