<a href="https://colab.research.google.com/github/anyuanay/INFO213/blob/main/INFO213_Week8_ensemble_lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 213: Data Science Programming 2
___

## Week 8: Combining Different Models for Ensemble Learning


**Overview:**
- [Learning with ensembles](#Learning-with-ensembles)
- [Combining classifiers via majority vote](#Combining-classifiers-via-majority-vote)
    - [Implementing a simple majority vote classifier](#Implementing-a-simple-majority-vote)
    - [Using the majority voting principle to make predictions](#Using-the-majority-voting-principle-to-make-predictions)
    - [Evaluating and tuning the ensemble classifier](#Evaluating-and-tuning-the-ensemble-classifier)
- [Bagging – building an ensemble of classifiers from bootstrap samples](#Bagging----Building-an-ensemble-of-classifiers-from-bootstrap-samples)
    - [Bagging in a nutshell](#Bagging-in-a-nutshell)
    - [Applying bagging to classify examples in the Wine dataset](#Applying-bagging-to-classify-examples-in-the-Wine-dataset)
- [Leveraging weak learners via adaptive boosting](#Leveraging-weak-learners-via-adaptive-boosting)
    - [How boosting works](#How-boosting-works)
    - [Applying AdaBoost using scikit-learn](#Applying-AdaBoost-using-scikit-learn)
- [Gradient boosting -- training an ensemble based on loss gradients](#Gradient-boosting----training-an-ensemble-based-on-loss-gradients)
  - [Comparing AdaBoost with gradient boosting](#Comparing-AdaBoost-with-gradient-boosting)
  - [GradientBoostingClassifier in Scikit Learn](#GradientBoost-in-Scikit-Learn)
  - [Using XGBoost](#Use-XGBoost)

# Motivation:
- We have focused on the best practices for tuning and evaluating different models.
- We will build upon those techniques and explore different methods for constructing a set of classifiers that can often have a better predictive performance than any of its individual members.
- We will do the following:
    - Make predictions based on majority voting.
    - Use bagging to reduce overfitting by drawing random combinations of the training dataset with repetition.
    - Apply boosting to build powerful models from weak learners that learn from their mistakes.

# Learning with ensembles
- Goal: To combine different classifiers into a meta-classifier that has better generalization performance than each individual classifier alone.
- Majority vs. plurality voting:

<img src="https://github.com/rasbt/machine-learning-book/blob/main/ch07/figures/07_01.png?raw=true" width="600px" />


## Majority Voting:

- Using the training dataset, we start by training $m$ different classifiers ($C_1, \ldots, C_m$).
- The ensemble can be built from different classification algorithms, for example, decision trees, support vector machines, logistic regression classifiers, and so on.
- Alternatively, we can use the same base classification algorithm and fit it to different subsets of the training dataset, for example, the random forest algorithm, which combines different decision tree classifiers.

<img src="https://github.com/rasbt/machine-learning-book/blob/main/ch07/figures/07_02.png?raw=true" width="600px" />

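## Why does a majority-vote ensemble work?
- Assume that all $n$ base classifiers for a binary classification task have an equal error rate, $\epsilon$, and that their errors are independent.
- We can then express the error probability of the ensemble as a sum over the probability mass function of a binomial distribution, where $y$ is the number of base classifiers that are wrong and $k = \lceil n/2 \rceil$ is the smallest number of wrong votes that makes the majority wrong:

$$
P(y \geq k) = \sum_{i=k}^{n}\binom{n}{i}\epsilon^i(1-\epsilon)^{n-i}
$$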


- Let us take a look at a more concrete example of 11 base
classifiers (n = 11), where each classifier has an error rate of 0.25 ($\epsilon = 0.25$).
- As we can see, the error rate of the ensemble (0.034) is much lower than the error rate of each individual
classifier (0.25) if all the assumptions are met.

```python
from scipy.special import comb
import math


def ensemble_error(n_classifier, error):
    k_start = int(math.ceil(n_classifier / 2.))
    probs = [comb(n_classifier, k) * error**k * (1-error)**(n_classifier - k)
             for k in range(k_start, n_classifier + 1)]
    return sum(probs)
```

```python
ensemble_error(n_classifier=11, error=0.25)
```

- Let us plot the ensemble errors vs. individual errors for n=11

```python
import numpy as np


error_range = np.arange(0.0, 1.01, 0.01)
ens_errors = [ensemble_error(n_classifier=11, error=error)
              for error in error_range]
```

```python
import matplotlib.pyplot as plt


plt.plot(error_range,
         ens_errors,
         label='Ensemble error',
         linewidth=2)

plt.plot(error_range,
         error_range,
         linestyle='--',
         label='Base error',
         linewidth=2)

plt.xlabel('Base error')
plt.ylabel('Base/Ensemble error')
plt.legend(loc='upper left')
plt.grid(alpha=0.5)
#plt.savefig('figures/07_03.png', dpi=300)
plt.show()
```

**Observation**: The error probability of the ensemble is lower than the error
of an individual base classifier, as long as the base classifiers perform better than random guessing
($\epsilon < 0.5$) and the assumptions above hold.
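
As a cross-check on the observation above, the following sketch recomputes the ensemble error with `scipy.stats.binom.sf` (the binomial survival function) instead of the hand-written sum, and compares it with the base error on a grid of values. The grid, the variable names, and the choice of `binom.sf` are assumptions made for illustration; the equal-error-rate and independence assumptions still apply.

```python
import numpy as np
from scipy.stats import binom

n_clf = 11
k_min = int(np.ceil(n_clf / 2))          # 6 wrong votes flip the majority

# Illustrative grid of base error rates (an assumption, not from the lecture)
base_errs = np.linspace(0.05, 0.95, 19)

# P(y >= k_min) = P(y > k_min - 1), i.e. the survival function at k_min - 1
ens_check = binom.sf(k_min - 1, n_clf, base_errs)

for eps, ens in zip(base_errs, ens_check):
    relation = '<' if ens < eps else '>='
    print(f'base error {eps:.2f} -> ensemble error {ens:.4f} '
          f'({relation} base error)')
```

For base error rates above 0.5 the relationship reverses, so the ensemble becomes worse than its individual members.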

# Combining classifiers via majority vote

- We will implement an algorithm to combine different classification algorithms associated with individual weights for confidence.
- Our goal is to build a stronger meta-classifier that balances out the individual classifiers' weaknesses on a particular dataset.

Let $C_1, C_2, \ldots, C_m$ be $m$ classifiers with weights $\mathbf{w}=\{w_1, w_2, \ldots, w_m\}$ in an ensemble. Let $A$ be the set of class labels, for example, $\{0, 1\}$. Let $\hat{y}$ be the prediction of the ensemble. Let $\chi_A$ be an indicator function: $\chi_A(C_i(\mathbf{x})=j)=1$ if classifier $C_i$ predicts class $j \in A$, and $0$ otherwise.

The weighted majority vote is:
$$
\hat{y} = \arg \max_j \sum_{i=1}^{m} w_i \chi_A(C_i(\mathbf{x})=j)
$$

For all equal weights,
$$
\hat{y} = mode\{C_1(\mathbf{x}), C_2(\mathbf{x}),..., C_m(\mathbf{x})\}
$$

### A manual example:

Let $C_1(\mathbf{x})=0$, $C_2(\mathbf{x})=0$, and $C_3(\mathbf{x})=1$ be $3$ classifiers with weights $\mathbf{w}=\{0.2, 0.2, 0.6\}$ in an ensemble.

$$
\hat{y} = \arg \max_j \sum_{i=1}^{m} w_i \chi_A(C_i(\mathbf{x})=j) =
\arg \max_j [\underbrace{0.2\times 1 + 0.2\times 1}_{j=0},\ \underbrace{0.6\times 1}_{j=1}] = 1
$$

## Implementing a simple majority vote
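- We can use NumPy's convenient `argmax` and `bincount` functions. With the `weights` argument, `bincount` sums the weights of the votes for each class label (without it, it counts the number of occurrences of each label). The `argmax` function then returns the index position of the largest total, which is the predicted class label.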

```python
import numpy as np

np.argmax(np.bincount([0, 0, 1],
                      weights=[0.2, 0.2, 0.6]))
```
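
The same weighting idea carries over from class labels to predicted class probabilities, which is what the `vote='probability'` option and the fold-wise probability averaging elsewhere in this notebook rely on. The sketch below uses made-up probabilities for three classifiers (hypothetical values chosen only for illustration) and combines them with `np.average`.

```python
import numpy as np

# Hypothetical class-membership probabilities [P(class 0), P(class 1)]
# predicted by three classifiers for a single example
example_probas = np.array([[0.9, 0.1],
                           [0.8, 0.2],
                           [0.4, 0.6]])

# Weighted average over the classifier axis
avg_p = np.average(example_probas, axis=0, weights=[0.2, 0.2, 0.6])
print(avg_p)             # weighted class probabilities
print(np.argmax(avg_p))  # predicted class label
```

With these numbers the probability-based vote picks class 0, whereas the label-based vote above picks class 1, so the two voting schemes can disagree.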

## Implementing and evaluating a majority vote ensemble in Python

1. Load the Iris data from `sklearn.datasets`. Use only sepal width and petal length to make the classification task more challenging for illustration purposes, and classify only flower examples from the Iris-versicolor and Iris-virginica classes.
2. Split the data into training and test sets.
3. Create three classifiers: LogisticRegression, DecisionTree, and KNN.
4. Make pipelines for the classifiers that require feature scaling.
5. Fit and evaluate the individual classifiers via 10-fold cross-validation.
6. Collect the AUC scores of the ensembled classifiers through 10-fold cross-validation.
7. Evaluate the ensemble classifier by AUC.

#### Loading and pre-processing the data

```python
from sklearn import datasets
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[50:, [1, 2]], iris.target[50:]
le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test =\
       train_test_split(X, y,
                        test_size=0.5,
                        random_state=1,
                        stratify=y)
```

```python
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```

```python
y_train
```

#### Creating 3 classifiers: LogisticRegression, DecisionTree, and KNN

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier


clf1 = LogisticRegression(penalty='l2',
                          C=0.001,
                          solver='lbfgs',
                          random_state=1)

clf2 = DecisionTreeClassifier(max_depth=1,
                              criterion='entropy',
                              random_state=0)

clf3 = KNeighborsClassifier(n_neighbors=1,
                            p=2,
                            metric='minkowski')
```

#### Making 2 pipelines for LogisticRegression and KNN
- Both classifiers are sensitive to feature scales, so we standardize the features with `StandardScaler`. Wrapping the scaler and the classifier in a pipeline ensures the scaling is fit on the training folds only.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe1 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf1]])
pipe3 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf3]])
```

#### Making a list of classifier labels

```python
clf_labels = ['Logistic regression', 'Decision tree', 'KNN']
```

#### Fitting and evaluating the classifiers via 10-fold cross-validation
- We will then evaluate the model performance of each classifier via 10-fold cross-validation on the training dataset before we combine them into an ensemble classifier.

```python
from sklearn.model_selection import cross_val_score

print('10-fold cross validation:\n')
for clf, label in zip([pipe1, clf2, pipe3], clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='roc_auc')
    print(f'ROC AUC: {scores.mean():.2f} '
          f'(+/- {scores.std():.2f}) [{label}]')
```

#### Collecting AUC scores of the ensembled classifiers through 10-fold cross-validation

```python
from sklearn.model_selection import StratifiedKFold
```

```python
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
```

```python
from sklearn.metrics import roc_auc_score
```

```python
auc_scores = []

for train_idx, val_idx in kf.split(X_train, y_train):
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]

    # Collect predicted probabilities from each classifier
    probas = []
    for clf in [pipe1, clf2, pipe3]:
        clf.fit(X_tr, y_tr)
        probas.append(clf.predict_proba(X_val)[:, 1])


    # Average probabilities across classifiers
    avg_proba = np.mean(probas, axis=0)

    # Compute AUC score for this fold
    auc = roc_auc_score(y_val, avg_proba)
    auc_scores.append(auc)
```

```python
auc_scores = np.array(auc_scores)
auc_scores
```

#### Evaluating the ensemble classifier by AUC

```python
print(f'ROC AUC: {auc_scores.mean():.2f} '
          f'(+/- {auc_scores.std():.2f}) ["Ensemble"]')
```

## Implement a MajorityVoteClassifier

```python
from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.base import clone
from sklearn.pipeline import _name_estimators
import numpy as np
import operator


class MajorityVoteClassifier(ClassifierMixin, BaseEstimator):

    """ A majority vote ensemble classifier

    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers]
      Different classifiers for the ensemble

    vote : str, {'classlabel', 'probability'} (default='classlabel')
      If 'classlabel' the prediction is based on the argmax of
        class labels. Else if 'probability', the argmax of
        the sum of probabilities is used to predict the class label
        (recommended for calibrated classifiers).

    weights : array-like, shape = [n_classifiers], optional (default=None)
      If a list of `int` or `float` values are provided, the classifiers
      are weighted by importance; Uses uniform weights if `weights=None`.

    """
    def __init__(self, classifiers, vote='classlabel', weights=None):

        self.classifiers = classifiers
        self.named_classifiers = {key: value for key, value
                                  in _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):
        """ Fit classifiers.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_examples, n_features]
            Matrix of training examples.

        y : array-like, shape = [n_examples]
            Vector of target class labels.

        Returns
        -------
        self : object

        """
        if self.vote not in ('probability', 'classlabel'):
            raise ValueError(f"vote must be 'probability' or 'classlabel'"
                             f"; got (vote={self.vote})")

        if self.weights and len(self.weights) != len(self.classifiers):
            raise ValueError(f'Number of classifiers and weights must be equal'
                             f'; got {len(self.weights)} weights,'
                             f' {len(self.classifiers)} classifiers')

        # Use LabelEncoder to ensure class labels start with 0, which
        # is important for np.argmax call in self.predict
        self.lablenc_ = LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        for clf in self.classifiers:
            fitted_clf = clone(clf).fit(X, self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self

    def predict(self, X):
        """ Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_examples, n_features]
            Matrix of training examples.

        Returns
        ----------
        maj_vote : array-like, shape = [n_examples]
            Predicted class labels.

        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X), axis=1)
        else:  # 'classlabel' vote

            #  Collect results from clf.predict calls
            predictions = np.asarray([clf.predict(X)
                                      for clf in self.classifiers_]).T

            maj_vote = np.apply_along_axis(
                                      lambda x:
                                      np.argmax(np.bincount(x,
                                                weights=self.weights)),
                                      axis=1,
                                      arr=predictions)
        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote

    def predict_proba(self, X):
        """ Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_examples, n_features]
            Training vectors, where n_examples is the number of examples and
            n_features is the number of features.

        Returns
        ----------
        avg_proba : array-like, shape = [n_examples, n_classes]
            Weighted average probability for each class per example.

        """
        probas = np.asarray([clf.predict_proba(X)
                             for clf in self.classifiers_])
        avg_proba = np.average(probas, axis=0, weights=self.weights)
        return avg_proba

    def get_params(self, deep=True):
        """ Get classifier parameter names for GridSearch"""
        if not deep:
            return super().get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            for name, step in self.named_classifiers.items():
                for key, value in step.get_params(deep=True).items():
                    out[f'{name}__{key}'] = value
            return out
```

## Using the majority voting principle to make predictions

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


iris = datasets.load_iris()
X, y = iris.data[50:, [1, 2]], iris.target[50:]
le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test =\
       train_test_split(X, y,
                        test_size=0.5,
                        random_state=1,
                        stratify=y)
```

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score


clf1 = LogisticRegression(penalty='l2',
                          C=0.001,
                          solver='lbfgs',
                          random_state=1)

clf2 = DecisionTreeClassifier(max_depth=1,
                              criterion='entropy',
                              random_state=0)

clf3 = KNeighborsClassifier(n_neighbors=1,
                            p=2,
                            metric='minkowski')

pipe1 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf1]])
pipe3 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf3]])

clf_labels = ['Logistic regression', 'Decision tree', 'KNN']

print('10-fold cross validation:\n')
for clf, label in zip([pipe1, clf2, pipe3], clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='roc_auc')
    print(f'ROC AUC: {scores.mean():.2f} '
          f'(+/- {scores.std():.2f}) [{label}]')
```

```python
# Majority Rule (hard) Voting

mv_clf = MajorityVoteClassifier(classifiers=[pipe1, clf2, pipe3])

clf_labels += ['Majority voting']
all_clf = [pipe1, clf2, pipe3, mv_clf]

for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='roc_auc')
    print(f'ROC AUC: {scores.mean():.2f} '
          f'(+/- {scores.std():.2f}) [{label}]')
```
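
For comparison, scikit-learn ships its own ensemble of this kind, `sklearn.ensemble.VotingClassifier`. The following is a minimal sketch, assuming the same Iris subset and the same three base classifiers as above. The `_vc` variable names are introduced here only to avoid overwriting the notebook's variables, and `voting='soft'` (averaging predicted probabilities) is chosen so that ROC AUC can be computed.

```python
from sklearn import datasets
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same subset as above: versicolor vs. virginica, sepal width and petal length
iris_vc = datasets.load_iris()
X_vc = iris_vc.data[50:, [1, 2]]
y_vc = iris_vc.target[50:] - 1           # relabel classes 1/2 as 0/1

X_train_vc, X_test_vc, y_train_vc, y_test_vc = train_test_split(
    X_vc, y_vc, test_size=0.5, random_state=1, stratify=y_vc)

voting_clf = VotingClassifier(
    estimators=[
        ('lr', make_pipeline(StandardScaler(),
                             LogisticRegression(C=0.001, random_state=1))),
        ('dt', DecisionTreeClassifier(max_depth=1, criterion='entropy',
                                      random_state=0)),
        ('knn', make_pipeline(StandardScaler(),
                              KNeighborsClassifier(n_neighbors=1))),
    ],
    voting='soft')                       # average the predicted probabilities

scores_vc = cross_val_score(voting_clf, X_train_vc, y_train_vc,
                            cv=10, scoring='roc_auc')
print(f'ROC AUC: {scores_vc.mean():.2f} (+/- {scores_vc.std():.2f}) '
      f'[VotingClassifier, soft voting]')
```

Writing the ensemble by hand, as in `MajorityVoteClassifier`, is still useful for seeing how the vote is computed; the built-in class is a convenient alternative once the mechanics are clear.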

<br>
<br>

# Evaluating and tuning the ensemble classifier
- We compute the ROC curves from the test dataset to check that
MajorityVoteClassifier generalizes well with unseen data.

```python
all_clf
```

```python
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

import matplotlib.pyplot as plt

colors = ['black', 'orange', 'blue', 'green']
linestyles = [':', '--', '-.', '-']
for clf, label, clr, ls \
        in zip(all_clf,
               clf_labels, colors, linestyles):

    # assuming the label of the positive class is 1
    y_pred = clf.fit(X_train,
                     y_train).predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_true=y_test,
                                     y_score=y_pred)
    roc_auc = auc(x=fpr, y=tpr)
    plt.plot(fpr, tpr,
             color=clr,
             linestyle=ls,
             label=f'{label} (auc = {roc_auc:.2f})')

plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1],
         linestyle='--',
         color='gray',
         linewidth=2)

plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.grid(alpha=0.5)
plt.xlabel('False positive rate (FPR)')
plt.ylabel('True positive rate (TPR)')


#plt.savefig('figures/07_04', dpi=300)
plt.show()
```

## Visualize the decision boundaries of individual and ensemble classifiers

```python
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
```

```python
from itertools import product


all_clf = [pipe1, clf2, pipe3, mv_clf]

x_min = X_train_std[:, 0].min() - 1
x_max = X_train_std[:, 0].max() + 1
y_min = X_train_std[:, 1].min() - 1
y_max = X_train_std[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(nrows=2, ncols=2,
                        sharex='col',
                        sharey='row',
                        figsize=(7, 5))

for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        all_clf, clf_labels):
    clf.fit(X_train_std, y_train)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.3)

    axarr[idx[0], idx[1]].scatter(X_train_std[y_train==0, 0],
                                  X_train_std[y_train==0, 1],
                                  c='blue',
                                  marker='^',
                                  s=50)

    axarr[idx[0], idx[1]].scatter(X_train_std[y_train==1, 0],
                                  X_train_std[y_train==1, 1],
                                  c='green',
                                  marker='o',
                                  s=50)

    axarr[idx[0], idx[1]].set_title(tt)

plt.text(-3.5, -5.,
         s='Sepal width [standardized]',
         ha='center', va='center', fontsize=12)
plt.text(-12.5, 4.5,
         s='Petal length [standardized]',
         ha='center', va='center',
         fontsize=12, rotation=90)

#plt.savefig('figures/07_05', dpi=300)
plt.show()
```

## Access the individual parameters for GridSearch

```python
mv_clf.get_params()
```

## Hyperparameter tuning for MajorityVoteClassifier
- Based on the values returned by the get_params method, we now know how to access the individual classifier’s attributes.
- Let’s now tune the inverse regularization parameter, C, of the logistic regression classifier and the decision tree depth via a grid search for demonstration purposes:

```python
from sklearn.model_selection import GridSearchCV


params = {'decisiontreeclassifier__max_depth': [1, 2],
          'pipeline-1__clf__C': [0.001, 0.1, 100.0]}

grid = GridSearchCV(estimator=mv_clf,
                    param_grid=params,
                    cv=10,
                    scoring='roc_auc')
grid.fit(X_train, y_train)

# Use a separate name for each result so the `params` grid above is not overwritten
for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    mean_score = grid.cv_results_['mean_test_score'][r]
    std_dev = grid.cv_results_['std_test_score'][r]
    combo = grid.cv_results_['params'][r]
    print(f'{mean_score:.3f} +/- {std_dev:.2f} {combo}')
```

```python
print(f'Best parameters: {grid.best_params_}')
print(f'ROC AUC: {grid.best_score_:.2f}')
```

**Note**  
The default setting for `refit` in `GridSearchCV` is `True` (i.e., `GridSearchCV(..., refit=True)`), which means that we can use the fitted `GridSearchCV` estimator to make predictions via the `predict` method, for example:

    grid = GridSearchCV(estimator=mv_clf,
                        param_grid=params,
                        cv=10,
                        scoring='roc_auc')
    grid.fit(X_train, y_train)
    y_pred = grid.predict(X_test)

In addition, the "best" estimator can directly be accessed via the `best_estimator_` attribute.

```python
grid.best_estimator_.classifiers
```

```python
mv_clf = grid.best_estimator_
```

```python
mv_clf.set_params(**grid.best_estimator_.get_params())
```

```python
mv_clf
```

<br>
<br>

# Bagging -- Building an ensemble of classifiers from bootstrap samples

- Bagging is an ensemble learning technique that is closely related to the MajorityVoteClassifier that
we implemented in the previous section.
- However, instead of using the same training dataset to fit the
individual classifiers in the ensemble, we draw bootstrap samples (random samples with replacement)
from the initial training dataset, which is why bagging is also known as bootstrap aggregating.

The concept of bagging is summarized below:

<img src = "https://github.com/rasbt/machine-learning-book/blob/main/ch07/figures/07_06.png?raw=true" width=600 />

## Bagging in a nutshell

- Each classifier receives a random subset of examples from the training dataset. We denote these random samples obtained via bagging as Bagging round 1, Bagging round 2,
and so on.
- Each subset contains a certain portion of duplicates and some of the original examples don’t
appear in a resampled dataset at all due to sampling with replacement.
- Once the individual classifiers
are fit to the bootstrap samples, the predictions are combined using majority voting.

<img src="https://github.com/rasbt/machine-learning-book/blob/main/ch07/figures/07_07.png?raw=true" width=800 />
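
To make the sampling step concrete, here is a small sketch of drawing bootstrap samples with NumPy. The dataset size, the number of rounds, and the variable names are arbitrary choices for illustration. For each round it reports how many distinct examples were drawn and what fraction of the original examples was left out.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_examples = 1000                      # illustrative dataset size (an assumption)
example_ids = np.arange(n_examples)

left_out_fractions = []
for bagging_round in range(1, 4):
    # Sample with replacement, keeping the size of the original dataset
    boot_ids = rng.choice(example_ids, size=n_examples, replace=True)
    n_unique = np.unique(boot_ids).size
    left_out = 1 - n_unique / n_examples
    left_out_fractions.append(left_out)
    print(f'Bagging round {bagging_round}: {n_unique} unique examples, '
          f'{left_out:.1%} of the originals not drawn')
```

For large datasets, the left-out fraction approaches $(1 - 1/n)^n \approx e^{-1} \approx 0.368$, so roughly a third of the examples are missing from any given bootstrap sample.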

## Applying bagging to classify examples in the Wine dataset
- To see bagging in action, let’s create a more complex classification problem using the Wine dataset.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)

df_wine = wine.data
df_wine
```

```python
y = wine.target.values
y
```

```python
df_wine.columns
```

```python
X = df_wine[['alcohol', 'od280/od315_of_diluted_wines']].values
X
```

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test =\
            train_test_split(X, y,
                             test_size=0.2,
                             random_state=1,
                             stratify=y)
```

```python
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```

## Using Scikit-Learn BaggingClassifier
- A BaggingClassifier algorithm is already implemented in scikit-learn, which we can import from the
ensemble submodule.
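- Here, we use the decision tree defined below (a stump with `max_depth=1`) as the base classifier and create an ensemble of 500 such trees, each fit on a different bootstrap sample of the training dataset. Bagging is usually most beneficial for high-variance base learners such as unpruned trees (`max_depth=None`).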

```python
# A base classifier
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='entropy',
                              max_depth=1,
                              random_state=1)
```

```python
# create a bagging
from sklearn.ensemble import BaggingClassifier
bag = BaggingClassifier(estimator=tree,
                        n_estimators=500,
                        max_samples=1.0,
                        max_features=1.0,
                        bootstrap=True,
                        bootstrap_features=False,
                        n_jobs=1,
                        random_state=1)
```

```python
from sklearn.metrics import accuracy_score


tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print(f'Decision tree train/test accuracies '
      f'{tree_train:.3f}/{tree_test:.3f}')
```

```python
bag = bag.fit(X_train, y_train)
y_train_pred = bag.predict(X_train)
y_test_pred = bag.predict(X_test)

bag_train = accuracy_score(y_train, y_train_pred)
bag_test = accuracy_score(y_test, y_test_pred)
print(f'Bagging train/test accuracies '
      f'{bag_train:.3f}/{bag_test:.3f}')
```

```python
import numpy as np
import matplotlib.pyplot as plt


x_min = X_train[:, 0].min() - 1
x_max = X_train[:, 0].max() + 1
y_min = X_train[:, 1].min() - 1
y_max = X_train[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(nrows=1, ncols=2,
                        sharex='col',
                        sharey='row',
                        figsize=(8, 3))


for idx, clf, tt in zip([0, 1],
                        [tree, bag],
                        ['Decision tree', 'Bagging']):
    clf.fit(X_train, y_train)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    # Plot every class; the Wine data loaded above has three class labels
    for cls, color, marker in zip([0, 1, 2],
                                  ['blue', 'green', 'red'],
                                  ['^', 'o', 's']):
        axarr[idx].scatter(X_train[y_train == cls, 0],
                           X_train[y_train == cls, 1],
                           c=color, marker=marker)

    axarr[idx].set_title(tt)

axarr[0].set_ylabel('OD280/OD315 of diluted wines', fontsize=12)

plt.tight_layout()
plt.text(0, -0.2,
         s='Alcohol',
         ha='center',
         va='center',
         fontsize=12,
         transform=axarr[1].transAxes)

#plt.savefig('figures/07_08.png', dpi=300, bbox_inches='tight')
plt.show()
```

<br>
<br>

# Leveraging weak learners via adaptive boosting

- In boosting, the ensemble consists of very simple base classifiers, also often referred to as weak learners, which often have only a slight performance advantage over random guessing. A typical example of a weak learner is a decision tree stump.
- The key concept behind boosting is to focus on training examples that are hard to classify, that is, to let the weak learners subsequently learn from misclassified training examples to improve the performance of the ensemble.

## How boosting works

In contrast to bagging, the initial formulation of the boosting algorithm uses random subsets of training examples drawn from the training dataset without replacement; the original boosting procedure can be summarized in the following four key steps:
1. Draw a random subset (sample) of training examples, d1, without replacement from the training dataset, D, to train a weak learner, C1.
2. Draw a second random training subset, d2, without replacement from the training dataset and add 50 percent of the examples that were previously misclassified to train a weak learner, C2.
3. Find the training examples, d3, in the training dataset, D, which C1 and C2 disagree upon, to train a third weak learner, C3.
4. Combine the weak learners C1, C2, and C3 via majority voting.

## Concepts behind AdaBoost
- Subfigure 1 represents a training dataset for binary classification where all training examples are assigned equal weights.
- In subfigure 2, we assign a larger weight to the two previously misclassified examples.
- Similarly, in subfigure 3, previously misclassified examples get larger weights.
- In subfigure 4, we combine
the three weak learners trained on different re-weighted training subsets by a weighted majority vote.

<img src="https://github.com/rasbt/machine-learning-book/blob/main/ch07/figures/07_09.png?raw=true" width=600 />
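
The re-weighting in the subfigures can be sketched numerically. The example below runs one round of the classic binary AdaBoost update with labels in $\{-1, 1\}$. The ten labels and predictions are made up for illustration, and the update rule shown is the textbook formulation, which is an assumption here since the text above does not spell it out (scikit-learn's `AdaBoostClassifier` may differ in details).

```python
import numpy as np

# Hypothetical true labels and weak-learner predictions for ten examples
y_true_ada = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
y_hat_ada = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1])

# Round 1: all examples start with equal weights
w_old = np.full(10, 0.1)

# Weighted error rate of the weak learner
missed = y_hat_ada != y_true_ada
epsilon = np.dot(w_old, missed)

# Coefficient of the weak learner in the final weighted vote
alpha = 0.5 * np.log((1 - epsilon) / epsilon)

# Increase weights of misclassified examples, decrease the others
w_new = w_old * np.exp(-alpha * y_true_ada * y_hat_ada)
w_new /= w_new.sum()                     # renormalize to sum to 1

print(f'weighted error: {epsilon:.3f}, alpha: {alpha:.3f}')
print('updated weights:', np.round(w_new, 3))
```

After normalization, the misclassified examples carry half of the total weight, which pushes the next weak learner to focus on them.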

## Applying AdaBoost using scikit-learn

Create a weak learner, here a decision tree stump:

```python
tree = DecisionTreeClassifier(criterion='entropy',
                              max_depth=1,
                              random_state=1)
```

Create an `AdaBoostClassifier` using scikit-learn:

```python
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(estimator=tree,
                         n_estimators=500,
                         learning_rate=0.1,
                         random_state=1)
```

Evaluate the weak learner:

```python
tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print(f'Decision tree train/test accuracies '
      f'{tree_train:.3f}/{tree_test:.3f}')
```

Fit and evaluate the AdaBoostClassifier:

```python
ada = ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)

ada_train = accuracy_score(y_train, y_train_pred)
ada_test = accuracy_score(y_test, y_test_pred)
print(f'AdaBoost train/test accuracies '
      f'{ada_train:.3f}/{ada_test:.3f}')
```

```python
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(1, 2, sharex='col', sharey='row', figsize=(8, 3))


for idx, clf, tt in zip([0, 1],
                        [tree, ada],
                        ['Decision tree', 'AdaBoost']):
    clf.fit(X_train, y_train)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    # Plot every class; the Wine data loaded above has three class labels
    for cls, color, marker in zip([0, 1, 2],
                                  ['blue', 'green', 'red'],
                                  ['^', 'o', 's']):
        axarr[idx].scatter(X_train[y_train == cls, 0],
                           X_train[y_train == cls, 1],
                           c=color, marker=marker)
    axarr[idx].set_title(tt)

axarr[0].set_ylabel('OD280/OD315 of diluted wines', fontsize=12)

plt.tight_layout()
plt.text(0, -0.2,
         s='Alcohol',
         ha='center',
         va='center',
         fontsize=12,
         transform=axarr[1].transAxes)

# plt.savefig('figures/07_11.png', dpi=300, bbox_inches='tight')
plt.show()
```

# Gradient boosting -- training an ensemble based on loss gradients

- Gradient boosting is another variant of the boosting concept introduced in the previous section, that is, successively training weak learners to create a strong ensemble.
- Gradient boosting is an extremely important topic because it forms the basis of popular machine learning algorithms such as XGBoost, which is well-known for winning Kaggle competitions.

## Comparing AdaBoost with gradient boosting

- AdaBoost trains decision tree stumps based on errors of the previous decision tree stump.
- Gradient boosting fits decision trees in an iterative fashion using prediction errors. However, gradient boosting trees are usually deeper than decision tree stumps.
- Also, in contrast to AdaBoost, gradient boosting
does not use the prediction errors for assigning sample weights; they are used directly to form the
target variable for fitting the next tree.
- Moreover, instead of having an individual weighting term for
each tree, like in AdaBoost, gradient boosting uses a global learning rate that is the same for each tree.
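
The differences listed above can be illustrated with a minimal from-scratch sketch. It assumes a regression task with squared-error loss on synthetic data (both assumptions made to keep the example short; the classifiers used in this notebook optimize a classification loss instead). Each new tree is fit to the current residuals rather than to re-weighted examples, and every tree's contribution is scaled by the same learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative assumption)
rng_gb = np.random.default_rng(1)
X_gb = np.sort(rng_gb.uniform(0, 6, size=200)).reshape(-1, 1)
y_gb = np.sin(X_gb).ravel() + rng_gb.normal(scale=0.1, size=200)

learning_rate = 0.1                    # one global rate shared by all trees
n_rounds = 50

# Start from a constant prediction (the mean of the targets)
pred_gb = np.full_like(y_gb, y_gb.mean())
mse_history = [np.mean((y_gb - pred_gb) ** 2)]
trees_gb = []

for _ in range(n_rounds):
    # For squared-error loss, the negative gradient is the residual
    residuals = y_gb - pred_gb
    # The residuals become the target variable of the next tree
    tree_gb = DecisionTreeRegressor(max_depth=3, random_state=1)
    tree_gb.fit(X_gb, residuals)
    trees_gb.append(tree_gb)
    # Add the new tree's prediction, scaled by the learning rate
    pred_gb += learning_rate * tree_gb.predict(X_gb)
    mse_history.append(np.mean((y_gb - pred_gb) ** 2))

print(f'training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}')
```

The trees here have depth 3 rather than depth 1, reflecting the point above that gradient boosting trees are usually deeper than stumps.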

## GradientBoost in Scikit Learn

- In scikit-learn, gradient boosting is implemented as sklearn.ensemble.GradientBoostingClassifier.
- It is important to note that gradient boosting
is a sequential process that can be slow to train.
- However, in recent years a more popular implementation
of gradient boosting has emerged, namely, XGBoost.

```python
from sklearn.ensemble import GradientBoostingClassifier
```

```python
clf = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.01,
    max_depth=4, random_state=1).fit(X_train, y_train)
```

```python
clf.score(X_test, y_test)
```

## Use XGBoost

```python
import xgboost as xgb
```

```python
xgb.__version__
```

```python
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted
model = xgb.XGBClassifier(n_estimators=1000, learning_rate=0.01,
                          max_depth=4, random_state=1)


gbm = model.fit(X_train, y_train)
```

```python
gbm.score(X_test, y_test)
```

<br>
<br>