In [0]:
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier, BaggingClassifier, \
    AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Introduction

Until now we have learned how to fit a predictive model to our data. To fit a model we used a sample of the data called the **train** set, and to assess it we used the rest of the data, the **test** set. Relying on a single model fitted to a single sample of the data can be problematic, since its predictions may depend heavily on that particular sample. We can often improve a model and reduce the variance of its predictions by fitting multiple models to the same problem and combining their outcomes into a final decision. Such a collection of models is called an **ensemble**, and it comes in two main flavors: averaging and boosting.

The **averaging** approach is very intuitive. You fit many models independently and make a decision based on all of their results. Because the models do not depend on each other, this approach can be easily parallelized. In this context we will learn about **Voting** and **Bagging** (**B**ootstrap **agg**regation).

The **boosting** approach is more complicated. A boosting algorithm starts with a **weak** version of a given **base** model and improves it iteratively: each new model in the sequence is fitted so as to correct the errors made by the models before it, and the final prediction combines all of them. In this context we will learn about **AdaBoost** (**Ada**ptive **boost**ing) and **Gradient boosting**.

All the ensemble methods are gathered under the module [_sklearn.ensemble_][ensemble], and they all share the usual estimator API (`fit`, `predict`, `score`), which in this case wraps much more complex operations.

[ensemble]: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble "sklearn.ensemble module"

# Averaging methods

## Voting

Voting is the most intuitive ensemble method, as it combines the results of different estimators in a straightforward manner.

In [2]:
import sys
if 'google.colab' in sys.modules:
    from google.colab import files
    uploaded = files.upload()

Saving spambase.csv to spambase.csv


In [0]:
spam = read_csv("spambase.csv", index_col=0)

X = spam[spam.columns[:-1]]
y = spam.spam

X_train, X_test, y_train, y_test = train_test_split(X, y)

Let's train 3 different classifiers.

In [0]:
clf1 = LogisticRegression()
clf2 = DecisionTreeClassifier(max_depth=5)
clf3 = SVC(C=0.1)

classifiers = [('LR', clf1), ('DT', clf2), ('SVM', clf3)]

In [5]:
results = y_train.to_frame()
for clf_name, clf in classifiers:
    clf.fit(X_train, y_train)
    results[clf_name] = clf.predict(X_train)
    print("{:3} classifier:\n \
        \ttrain accuracy: {:.2f}\n \
        \ttest accuracy: {:.2f}"\
        .format(clf_name, 
                clf.score(X_train, y_train), 
                clf.score(X_test, y_test)))



LR  classifier:
         	train accuracy: 0.93
         	test accuracy: 0.92
DT  classifier:
         	train accuracy: 0.92
         	test accuracy: 0.91
SVM classifier:
         	train accuracy: 0.86
         	test accuracy: 0.75


> **Note:** The method `to_frame()` converts a Series to a DataFrame. I used it for making the `results` DataFrame which will gather our various predictions.

In [6]:
results.head()

Unnamed: 0,spam,LR,DT,SVM
10,spam,spam,spam,non-spam
2299,non-spam,non-spam,non-spam,non-spam
2984,non-spam,non-spam,non-spam,non-spam
2431,non-spam,non-spam,non-spam,spam
4406,non-spam,non-spam,non-spam,non-spam


The ensemble classifier is implemented by the [_VotingClassifier_][1] class. The voting itself may be **hard**, in which case every estimator casts one vote and the majority class wins, or **soft**, in which case the predicted class is the argmax of the sums of the predicted probabilities.

[1]: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html "VotingClassifier class"
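
To make the difference between the two voting modes concrete, here is a minimal pure-Python sketch. The probabilities are made up for illustration, and the helper names `hard_vote` and `soft_vote` are hypothetical (they are not part of Scikit-learn). It only mimics the idea, not the actual implementation of `VotingClassifier`.

```python
from collections import Counter

# Hypothetical predicted probabilities for ONE sample from three classifiers.
# Each dict maps class label -> probability (made-up numbers).
probas = [
    {"spam": 0.45, "non-spam": 0.55},   # classifier A leans non-spam
    {"spam": 0.40, "non-spam": 0.60},   # classifier B leans non-spam
    {"spam": 0.95, "non-spam": 0.05},   # classifier C is very sure it is spam
]

def hard_vote(probas):
    """Each classifier casts one vote for its most likely class."""
    votes = [max(p, key=p.get) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Sum the probabilities per class and return the argmax."""
    totals = Counter()
    for p in probas:
        totals.update(p)   # adds the probabilities class by class
    return max(totals, key=totals.get)

hard_result = hard_vote(probas)
soft_result = soft_vote(probas)
print(hard_result, soft_result)
```

Here the two modes disagree: two of the three classifiers vote non-spam, but the third one is confident enough to tip the summed probabilities towards spam. Note that in Scikit-learn `voting='soft'` requires every estimator to provide `predict_proba`, so an `SVC` would have to be created with `probability=True`.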

In [0]:
classifiers = [('LR', clf1), ('DT', clf2), ('SVM', clf3)]

In [8]:
clf_voting = VotingClassifier(estimators=classifiers,
                              voting='hard')
clf_voting.fit(X_train, y_train)

VotingClassifier(estimators=[('LR', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)), ('DT', Decision...f', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=None, voting='hard', weights=None)

In [9]:
print("{:3} classifier:\n \
    \ttrain accuracy: {:.2f}\n \
    \ttest accuracy: {:.2f}"\
    .format('Voting', 
            clf_voting.score(X_train, y_train), 
            clf_voting.score(X_test, y_test)))

Voting classifier:
     	train accuracy: 0.94
     	test accuracy: 0.92


In [10]:
results['Voting'] = clf_voting.predict(X_train)
results.head()

Unnamed: 0,spam,LR,DT,SVM,Voting
10,spam,spam,spam,non-spam,spam
2299,non-spam,non-spam,non-spam,non-spam,non-spam
2984,non-spam,non-spam,non-spam,non-spam,non-spam
2431,non-spam,non-spam,non-spam,spam,non-spam
4406,non-spam,non-spam,non-spam,non-spam,non-spam


## Bagging (Bootstrap Aggregation)

In English the term bootstrapping means "to get something out of a situation using existing resources", and in statistics it refers to randomly resampling the data (with replacement) in order to create a collection of models. This means that bagging is very much like voting, except that instead of different models you choose a single type of model (called the **base model**) and fit it many times, each time to a different random subsample of your data.

In Scikit-learn this meta-classifier is implemented by the [BaggingClassifier][1] class, and its main arguments are `base_estimator` (renamed to `estimator` in scikit-learn 1.2) and `n_estimators`.

[1]: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html "BaggingClassifier class"
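
The resample-fit-aggregate loop described above can be sketched in a few lines of pure Python. This is only an illustration under simplifying assumptions: the data are made up, the base model is a toy one-dimensional threshold rule, and the helper names (`fit_stump`, `bagging_fit`, `bagging_predict`) are hypothetical rather than Scikit-learn functions.

```python
import random
from collections import Counter

def fit_stump(points):
    """Toy base model: pick the threshold on x that best separates labels 0/1."""
    best_threshold, best_correct = None, -1
    for threshold, _ in points:
        # The stump predicts 1 when x >= threshold and 0 otherwise.
        correct = sum((x >= threshold) == bool(label) for x, label in points)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def bagging_fit(points, n_estimators, rng):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_estimators):
        sample = [rng.choice(points) for _ in range(len(points))]
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Aggregate the stumps' predictions by majority vote."""
    votes = Counter(int(x >= threshold) for threshold in models)
    return votes.most_common(1)[0][0]

# Made-up 1D data: the label is 1 when x is large.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
models = bagging_fit(data, n_estimators=25, rng=random.Random(0))
print(bagging_predict(models, 0.7), bagging_predict(models, 4.2))
```

Each stump sees a slightly different sample and may therefore choose a slightly different threshold; the majority vote smooths these differences out, which is where the variance reduction comes from.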

### Decision tree as a base model

In [0]:
clf_base = DecisionTreeClassifier(max_depth=5)

In [12]:
clf_bagging = BaggingClassifier(base_estimator=clf_base,
                                n_estimators=100, max_samples=0.1)
clf_bagging.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=0.1, n_estimators=100, n_jobs=None, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [14]:
print("{:3} classifier:\n \
    \ttrain accuracy: {:.2f}\n \
    \ttest accuracy: {:.2f}"\
    .format('DT bagging', 
            clf_bagging.score(X_train, y_train), 
            clf_bagging.score(X_test, y_test)))

DT bagging classifier:
     	train accuracy: 0.93
     	test accuracy: 0.92


In [15]:
results['Bagging DTs'] = clf_bagging.predict(X_train)
results.head()

Unnamed: 0,spam,LR,DT,SVM,Voting,Bagging DTs
10,spam,spam,spam,non-spam,spam,spam
2299,non-spam,non-spam,non-spam,non-spam,non-spam,non-spam
2984,non-spam,non-spam,non-spam,non-spam,non-spam,non-spam
2431,non-spam,non-spam,non-spam,spam,non-spam,non-spam
4406,non-spam,non-spam,non-spam,non-spam,non-spam,non-spam


> **Note:** The most common bagging ensemble uses decision trees as base models. Adding one more source of randomness, namely considering only a random subset of the features at each split, gives the method known as **Random Forest**. It is so common that Scikit-learn provides a separate class for it, quite naturally called [RandomForestClassifier][1].

[1]: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html "RandomForestClassifier class"

### Logistic regression as a base model

In [0]:
clf_base = LogisticRegression()

In [17]:
clf_bagging = BaggingClassifier(base_estimator=clf_base,
                                n_estimators=100)
clf_bagging.fit(X_train, y_train)



BaggingClassifier(base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=100, n_jobs=None, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [18]:
print("{:3} classifier:\n \
    \ttrain accuracy: {:.2f}\n \
    \ttest accuracy: {:.2f}"\
    .format('LR bagging', 
            clf_bagging.score(X_train, y_train), 
            clf_bagging.score(X_test, y_test)))

LR bagging classifier:
     	train accuracy: 0.93
     	test accuracy: 0.92


In [19]:
results['Bagging LRs'] = clf_bagging.predict(X_train)
results.head()

Unnamed: 0,spam,LR,DT,SVM,Voting,Bagging DTs,Bagging LRs
10,spam,spam,spam,non-spam,spam,spam,spam
2299,non-spam,non-spam,non-spam,non-spam,non-spam,non-spam,non-spam
2984,non-spam,non-spam,non-spam,non-spam,non-spam,non-spam,non-spam
2431,non-spam,non-spam,non-spam,spam,non-spam,non-spam,non-spam
4406,non-spam,non-spam,non-spam,non-spam,non-spam,non-spam,non-spam


> **Your turn 1:**

> * Part I - Create a simple k-nearest neighbors classifier for the spam problem.
> * Part II - Add the model to the voting classifier we've made earlier.
> * Part III - Create a Bagging classifier using your kNN model as the base model.

## Boosting methods

### AdaBoost

The core principle of AdaBoost is to fit a sequence of _weak_ learners (i.e., base models that are only slightly better than random guessing) on repeatedly modified versions of the **data**. At each boosting iteration the samples that were misclassified in the previous iteration are given higher weights, while the correctly-classified samples are given lower weights.

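In Scikit-learn AdaBoost is implemented by the [AdaBoostClassifier][adaboost] class, and its main arguments are the `base_estimator` (renamed to `estimator` in scikit-learn 1.2), the maximum number of iterations `n_estimators`, and the `learning_rate`, which scales the contribution of each classifier to the final combination. There is a trade-off between the last two: a smaller learning rate usually calls for more estimators.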

It is worth noting that since AdaBoost inherently focuses on problematic samples, it is relatively sensitive to noisy data and outliers.

[adaboost]: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html "AdaBoost class"
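
The reweighting step can be illustrated with a single boosting round in pure Python. This sketch uses the classic discrete AdaBoost update for binary labels in {-1, +1} and made-up predictions; it is an assumption-laden illustration, and the algorithm variants used by Scikit-learn differ in their details.

```python
import math

# Made-up labels and weak-learner predictions in {-1, +1} for five samples.
y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, -1, -1, -1, +1]   # the second sample is misclassified

# Start from uniform sample weights.
weights = [1 / len(y_true)] * len(y_true)

# Weighted error of the weak learner (here 0.2).
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# The learner's say in the final vote: larger when the error is smaller.
alpha = 0.5 * math.log((1 - error) / error)

# Increase the weights of misclassified samples, decrease the rest,
# then normalize so that the weights sum to 1 again.
weights = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
total = sum(weights)
weights = [w / total for w in weights]

print([round(w, 3) for w in weights])
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is strongly pushed to get it right. This also shows why noisy or mislabeled samples can end up dominating the process.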

In [0]:
clf_base = DecisionTreeClassifier(max_depth=3)

In [0]:
clf_adaboost = AdaBoostClassifier(base_estimator=clf_base,
                                  n_estimators=200,
                                  learning_rate=0.01)
clf_adaboost.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=0.01, n_estimators=200, random_state=None)

In [0]:
print("{:3} classifier:\n \
    \ttrain accuracy: {:.2f}\n \
    \ttest accuracy: {:.2f}"\
    .format('DT ADA boosting', 
            clf_adaboost.score(X_train, y_train), 
            clf_adaboost.score(X_test, y_test)))

DT ADA boosting classifier:
     	train accuracy: 0.94
     	test accuracy: 0.94


### Gradient boosting

The gradient boosting method builds the model by iteratively minimizing a **loss function**. In every iteration a new weak learner is fitted to the negative gradient of the loss with respect to the current predictions (for squared loss these are simply the residuals), and a scaled version of it is added to the ensemble. This concept is closely related to the more general optimization concept of [gradient descent optimization][gd].

In Scikit-learn the gradient boosting algorithm is implemented by the [GradientBoostingClassifier][gbc] class, and like the AdaBoost class it supports the arguments `n_estimators` and `learning_rate`.

[gd]: https://en.wikipedia.org/wiki/Gradient_descent "Gradient descent - Wikipedia"
[gbc]: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html "GradientBoostingClassifier class"
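
The following pure-Python sketch shows the idea on a tiny regression problem with squared loss, where the negative gradient equals the residual. It is a simplification chosen for readability: the data are made up, the weak learner is a one-split regression stump, and the helper names (`fit_stump`, `gradient_boost`, `mse`) are hypothetical. `GradientBoostingClassifier` itself works on a classification loss and on full decision trees.

```python
def fit_stump(xs, residuals):
    """Regression stump: the split on x that minimizes squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[1:]:   # skip the smallest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_estimators):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of the model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up 1D regression data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

few = gradient_boost(xs, ys, n_estimators=5)
many = gradient_boost(xs, ys, n_estimators=100)
print(mse(few, xs, ys) > mse(many, xs, ys))
```

The training error shrinks as more stumps are added, each one taking a small step (controlled by `learning_rate`) in the direction that reduces the loss. This is the same interplay between `n_estimators` and `learning_rate` that the Scikit-learn class exposes.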

> **NOTE:** Although the concept of gradient boosting is general, the _GradientBoostingClassifier_ works only with decision trees. As a result, there is no need to define a base model, and the tree hyperparameters (such as `max_depth`) are passed directly to the _GradientBoostingClassifier_.

In [0]:
clf_GB = GradientBoostingClassifier(max_depth=3,
                                    n_estimators=200,
                                    learning_rate=0.01)
clf_GB.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.01, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=200,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

In [0]:
print("{:3} classifier:\n \
    \ttrain accuracy: {:.2f}\n \
    \ttest accuracy: {:.2f}"\
    .format('DT gradient boosting', 
            clf_GB.score(X_train, y_train), 
            clf_GB.score(X_test, y_test)))

DT gradient boosting classifier:
     	train accuracy: 0.93
     	test accuracy: 0.93


> **Note:** Gradient boosting is also implemented by the commonly used [**XGBoost**](http://xgboost.readthedocs.io/en/latest/) and [**LightGBM**](https://github.com/microsoft/LightGBM) packages.