# Bagging Machine Learning Algorithm

### **B**ootstrap **Agg**regat**ing** or [Bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating)
* [Scikit-learn Reference](http://scikit-learn.org/stable/modules/ensemble.html#bagging)
* Bootstrap sampling: Sampling with replacement
* Combine by averaging the output (regression)
* Combine by voting (classification)
* Can be applied to many base models, including ANN, CART, etc.
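
The two ingredients listed above, bootstrap sampling and aggregation, can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the helper names (`bootstrap_sample`, `aggregate_votes`, `aggregate_mean`) are made up for this sketch and are not part of scikit-learn.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def aggregate_votes(predictions):
    """Classification: return the most common predicted label."""
    return Counter(predictions).most_common(1)[0][0]


def aggregate_mean(predictions):
    """Regression: return the average of the predictions."""
    return mean(predictions)


rng = random.Random(42)          # fixed seed so the sketch is repeatable
data = list(range(10))
sample = bootstrap_sample(data, rng)

# Same size as the original data, but usually with repeated items
print(len(sample), len(set(sample)))

# Combine the outputs of several hypothetical models
print(aggregate_votes([1, 0, 1, 1, 0]))    # majority label
print(aggregate_mean([2.0, 3.0, 4.0]))     # averaged output
```

In a real bagging ensemble, each model is trained on its own bootstrap sample and the aggregation step is applied to the models' predictions for each new record.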

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

In [None]:
df = sns.load_dataset('titanic')

In [None]:
df.shape

In [None]:
df.head()

The 'survived' column is our target variable, and the other columns (such as passenger class, sex, and age) are our independent features.

In [None]:
# drop every record that contains a NaN value in any column
df.dropna(inplace=True)

In [None]:
# which values does the 'pclass' feature take?
df['pclass'].unique()

In [None]:
# how many records fall into each 'pclass' value?
df['pclass'].value_counts()

In [None]:
df['sex'].unique()

In [None]:
df['sex'].value_counts()

In [None]:
# the trailing semicolon suppresses the text output of the plotting call
df['age'].hist(bins=50, figsize= (50,10));

## Data Pre-processing

In [None]:
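# take three features for our predictive model
# .copy() gives an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning
X = df[['pclass', 'sex', 'age']].copy()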

In [None]:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()

In [None]:
# binarize the 'sex' feature to 0/1 values and store it back in 'X'
# fit_transform returns an (n, 1) array for a two-class column, so flatten it with ravel()
X['sex'] = lb.fit_transform(X['sex']).ravel()
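
To make the binarization step concrete, here is a plain-Python sketch of the mapping. It assumes the behavior `LabelBinarizer` has for a two-class column (classes kept in sorted order, the second class encoded as 1); `binarize` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def binarize(labels):
    """Map a two-class label list to 0/1, with classes in sorted order."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    positive = classes[1]  # the second class in sorted order becomes 1
    encoded = [1 if label == positive else 0 for label in labels]
    return encoded, classes


encoded, classes = binarize(['male', 'female', 'female', 'male'])
print(classes)   # ['female', 'male']
print(encoded)   # [1, 0, 0, 1]
```

Under that assumption, 'female' maps to 0 and 'male' maps to 1 in the 'sex' column.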

In [None]:
# see what it looks like now
X.head()

In [None]:
X.shape

In [None]:
# summary statistics; 'sex' is now 0/1, so its mean is the fraction of records encoded as 1
X.describe()

In [None]:
X.info()

In [None]:
# specify the target variable and store it as 'y'
y = df['survived']

In [None]:
y.value_counts()

## Fit Model

In [None]:
# using a decision tree classifier
from sklearn.tree import DecisionTreeClassifier
# taking advantage of ensemble methods
from sklearn.ensemble import BaggingClassifier

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# in IPython/Jupyter, append a question mark to a name to see its documentation (instead of pressing Shift+Tab several times)
train_test_split?

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
# cross-validation helpers
from sklearn.model_selection import cross_val_score, cross_val_predict
# metrics used for reporting
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# since we are going to train several models, define one function that reports accuracy and related metrics
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    '''
    Print the accuracy score, classification report and confusion matrix of a fitted classifier.
    '''
    if train:
        # training performance
        y_pred = clf.predict(X_train)
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_train, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, y_pred)))

        # 10-fold cross-validation on the training set
        res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))

    else:
        # test performance
        y_pred = clf.predict(X_test)
        print("Test Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, y_pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_test, y_pred)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, y_pred)))

# this helper can be reused with any other fitted classifier
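
For readers who want to see what the reported numbers mean, here is a small plain-Python sketch of accuracy and a 2x2 confusion matrix. It follows the row/column convention of scikit-learn's `confusion_matrix` (rows are true labels, columns are predicted labels); the helper names and the toy labels are invented for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_2x2(y_true, y_pred):
    """Rows are true labels (0, 1); columns are predicted labels (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# toy labels, not taken from the Titanic data
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

acc = accuracy(y_true, y_pred)
cm = confusion_2x2(y_true, y_pred)
print("accuracy: {0:.4f}".format(acc))   # 4 of 6 predictions are correct
print(cm)                                # [[2, 1], [1, 2]]
```

The diagonal of the matrix counts correct predictions, so accuracy is the diagonal sum divided by the total number of records.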

## Decision Tree

In [None]:
# create the decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

In [None]:
# fit it with the default hyper-parameters
clf.fit(X_train, y_train)

In [None]:
print_score(clf, X_train, y_train, X_test, y_test, train=True)

In [None]:
print_score(clf, X_train, y_train, X_test, y_test, train=False) # Test

***

## Bagging (oob_score=False)

**oob_score** stands for: **out-of-bag score**

In [None]:
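# base model: the decision tree classifier defined earlier
# n_estimators=1000: train 1000 trees
# n_jobs=-1: use all CPU cores
# bootstrap sampling with a random state of 42
# clf is passed positionally because newer scikit-learn releases renamed the base_estimator keyword to estimator

bag_clf = BaggingClassifier(clf, n_estimators=1000,
                            bootstrap=True, n_jobs=-1,
                            random_state=42)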

In [None]:
bag_clf.fit(X_train, y_train)

In [None]:
print_score(bag_clf, X_train, y_train, X_test, y_test, train=True)

In [None]:
print_score(bag_clf, X_train, y_train, X_test, y_test, train=False)

***

## Bagging (oob_score=True)

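Use out-of-bag samples to estimate the generalization accuracy.

**It's a good deal:** each tree is trained on a bootstrap sample that leaves out roughly 37% of the training records on average, so those left-out (out-of-bag) records act as a built-in validation set at no extra cost.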
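
To see why out-of-bag samples exist at all, here is a small standard-library simulation. It draws one bootstrap sample of record indices and measures the share of records that were never drawn, which should sit near (1 - 1/n)^n, about 0.37 for large n. The function name `oob_fraction` and the sample size are choices made for this sketch only.

```python
import math
import random


def oob_fraction(n_samples, rng):
    """Share of records never drawn in one bootstrap sample of size n_samples."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


rng = random.Random(0)            # fixed seed so the sketch is repeatable
n = 10_000
observed = oob_fraction(n, rng)
expected = (1 - 1 / n) ** n       # approaches 1/e as n grows

print("observed out-of-bag share: {0:.4f}".format(observed))
print("expected out-of-bag share: {0:.4f}".format(expected))
print("1/e:                       {0:.4f}".format(math.exp(-1)))
```

Each estimator in the ensemble gets its own bootstrap sample, so every training record is out-of-bag for some of the trees, and those trees can score it without having seen it during training.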

In [None]:
# clf is passed positionally because newer scikit-learn releases renamed the base_estimator keyword to estimator
bag_clf = BaggingClassifier(clf, n_estimators=1000,
                            bootstrap=True, oob_score=True,
                            n_jobs=-1, random_state=42)

In [None]:
bag_clf.fit(X_train, y_train)

In [None]:
bag_clf.oob_score_

In [None]:
print_score(bag_clf, X_train, y_train, X_test, y_test, train=True)

In [None]:
print_score(bag_clf, X_train, y_train, X_test, y_test, train=False)

***