# Bagging in Scikit-Learn

## Table of Content 

- [Imports](#imports)
- [Classification](#classification)
  - [Data](#data)
  - [Convert Text To Vectors](#convert-text-to-vectors)
  - [Bagging](#bagging)
  - [Accuracy Compared to A Single Decision Tree](#accuracy-compared-to-a-single-decision-tree)
- [Regression](#regression)
  - [Data](#data)
  - [Pipeline](#pipeline)
  - [Bagging](#bagging)
  - [Accuracy Compared to A Single Decision Tree](#accuracy-compared-to-a-single-decision-tree)
  - [Bagged Model Performance On Training Data](#bagged-model-performance-on-training-data)
  - [Bagged Model Performance On Test Data](#bagged-model-performance-on-test-data)
  - [Single Decision Tree Performance On Training Data](#single-decision-tree-performance-on-training-data)
  - [Single Decision Tree Performance On Test Data](#single-decision-tree-performance-on-test-data)
  - [Performance Table](#performance-table)

## Imports

In [11]:
# Interactive shell
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import Markdown, display

InteractiveShell.ast_node_interactivity = "all"

# Data wrangling and standard library
import os
import pandas as pd
import numpy as np


# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error
from sklearn.datasets import fetch_20newsgroups, fetch_20newsgroups_vectorized
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin


# Utilities
import joblib

## Classification

### Data

The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon a messages posted before and after a specific date.

In [2]:
# Returns a bunch object
newsgroups = fetch_20newsgroups(subset="train", random_state=1227)

In [61]:
# Attributes of bunch
newsgroups.__dir__()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [62]:
# Unique target values
np.unique(newsgroups["target"])
np.unique(newsgroups["target_names"])

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

array(['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
       'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
       'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
       'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt',
       'sci.electronics', 'sci.med', 'sci.space',
       'soc.religion.christian', 'talk.politics.guns',
       'talk.politics.mideast', 'talk.politics.misc',
       'talk.religion.misc'], dtype='<U24')

In [69]:
# Description of the data
display(Markdown(newsgroups["DESCR"][25:1086]))



The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    =================   ==========
    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features                  text
    =================   ==========


### Convert Text To Vectors

In [44]:
# Convert a collection of raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer()
X_train, y_train = vectorizer.fit_transform(newsgroups.data), newsgroups.target

In [45]:
X_train.shape
y_train.shape

(11314, 130107)

(11314,)

### Bagging

When random subsets of the dataset are drawn as random subsets of the samples (*without replacement*), then this algorithm is known as **Pasting**. If samples are drawn *with replacement*, then the method is known as **Bagging**. When random subsets of the dataset are drawn as random subsets of the features, then the method is known as **Random Subspaces**. Finally, when base estimators are built on subsets of both samples and features, then the method is known as **Random Patches**.

In [7]:
bag_clf = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),
    # Number of base estimators in the ensemble
    n_estimators=300,
    # Bootstrap samples with replacement
    bootstrap=True,
    # Use out-of-bag samples to estimate the generalization error
    oob_score=True,
    random_state=1227,
    n_jobs=-1,
    verbose=1,
)

For the `BaggingClassifier` meta-estimator, parameters `max_samples` and `max_features` control the size of the subsets (in terms of samples and features), while `bootstrap` and `bootstrap_features` control whether samples and features are drawn with or without replacement. 

In [8]:
bag_clf.fit(X_train, y_train)

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed: 25.2min remaining: 75.5min
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed: 25.4min finished


BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=300,
                  n_jobs=-1, oob_score=True, random_state=1227, verbose=1)

In [9]:
joblib.dump(bag_clf, "../models/bagging/bagging_classifier_newsgroup.pkl")

['../models/bagging/bagging_classifier_newsgroup.pkl']

### Out-Of-Bag

The out-of-bag score for the bagged model is as follow:

In [10]:
bag_clf.oob_score_

0.7534028637086795

The decision function computed with out-of-bag estimate on the training set is stored in `oob_decision_function_`. If `n_estimators` is small, it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN. 

In [11]:
# Check for data points that were never left out
np.isnan(bag_clf.oob_decision_function_).sum()

0

In [12]:
bag_clf.oob_decision_function_.shape

(11314, 20)

Since the base estimator `DecisionTreeClassifier` has the `predict_proba` method, which predicts class probabilities of the input samples `X`. The decision function also returns the class probabilities for each training instance. For the first training sample, the bagged classifier predicts with probability 0.205607476635514:

In [13]:
bag_clf.oob_decision_function_[:1, :]
print("The max is", bag_clf.oob_decision_function_[:1, :].max())

array([[0.02803738, 0.04672897, 0.09345794, 0.02803738, 0.04672897,
        0.        , 0.00934579, 0.20560748, 0.10280374, 0.01869159,
        0.04672897, 0.04672897, 0.12149533, 0.05607477, 0.04672897,
        0.00934579, 0.02803738, 0.00934579, 0.02803738, 0.02803738]])

The max is 0.205607476635514


For the second training sample, the bagged model predicts with probability 0.388889:

In [118]:
bag_clf.oob_decision_function_[1:2, :]
print("The max is", bag_clf.oob_decision_function_[1:2, :].max())

array([[0.        , 0.38888889, 0.11111111, 0.05555556, 0.02777778,
        0.        , 0.02777778, 0.02777778, 0.        , 0.        ,
        0.        , 0.02777778, 0.16666667, 0.        , 0.05555556,
        0.02777778, 0.        , 0.        , 0.        , 0.08333333]])

The max is 0.3888888888888889


And so on....

### Accuracy Compared to A Single Decision Tree

In [39]:
# Load test data
X_test, y_test = fetch_20newsgroups_vectorized(subset="test", return_X_y=True)

In [40]:
X_test.shape
y_test.shape

(7532, 130107)

(7532,)

The bagged model achieved an accuracy score of:

In [41]:
y_pred_bagged = bag_clf.predict(X_test)

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    4.9s remaining:   14.7s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    6.2s finished


In [43]:
print(accuracy_score(y_test, y_pred_bagged))

0.66064790228359


Although this is still subpar. This is in contrast to a single decision tree, which has an accuracy of:

In [46]:
# Single tree
tree_clf = DecisionTreeClassifier(random_state=1227)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))

DecisionTreeClassifier(random_state=1227)

0.516064790228359


This means that the bagged model has improved the accuracy by $(\frac{0.66064790228359}{0.516064790228359} - 1) \times 100 \approx 28.0165 \%$.

## Regression

### Data

In [4]:
# Training data
X_train, y_train = (
    joblib.load("../../../california_housing_price_project/dataset/training_X.pkl"),
    joblib.load("../../../california_housing_price_project/dataset/training_y.pkl"),
)

In [5]:
X_train
y_train

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
8088,-118.21,33.82,34.0,1719.0,398.0,1444.0,372.0,2.8438,NEAR OCEAN
15259,-117.27,33.03,25.0,1787.0,311.0,1108.0,311.0,3.9826,NEAR OCEAN
710,-122.08,37.68,26.0,1167.0,370.0,253.0,137.0,2.4196,NEAR BAY
12828,-121.45,38.70,24.0,2159.0,369.0,1141.0,355.0,3.9853,INLAND
18294,-122.10,37.39,35.0,2471.0,349.0,881.0,342.0,7.6229,NEAR BAY
...,...,...,...,...,...,...,...,...,...
8285,-118.14,33.77,51.0,2812.0,621.0,1171.0,566.0,3.8750,NEAR OCEAN
20569,-121.76,38.66,17.0,5320.0,984.0,2866.0,928.0,4.1997,INLAND
12632,-121.48,38.49,26.0,3165.0,806.0,2447.0,752.0,1.5908,INLAND
5135,-118.26,33.97,46.0,1521.0,352.0,1100.0,334.0,1.5500,<1H OCEAN


8088     139300.0
15259    215800.0
710      275000.0
12828     90400.0
18294    500001.0
           ...   
8285     342900.0
20569    133400.0
12632     78600.0
5135     100600.0
2786     118800.0
Name: median_house_value, Length: 16511, dtype: float64

### Pipeline

In [6]:
# Column names
col_names = ["total_rooms", "total_bedrooms", "population", "households"]
# List comprehension and unpack to get column indexes, which are essentially the integer locations of these columns
rooms_id, bedrooms_id, population_id, households_id = [
    X_train.columns.get_loc(col) for col in col_names
]


# Custom transformer
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    # Constructor
    def __init__(self, add_bedrooms_per_room=True):
        # Hyperparameter for controlling feature engineering
        # If true, then add the new attribute 'bedrooms_per_room'
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Create new attributes
        rooms_per_household = X[:, rooms_id] / X[:, households_id]
        population_per_household = X[:, population_id] / X[:, households_id]
        # If the hyperparameter self.add_bedrooms_per_room = True
        if self.add_bedrooms_per_room:
            # Create this attribute
            bedrooms_per_room = X[:, bedrooms_id] / X[:, rooms_id]
            # Include these additional attributes in the transformed feature matrix
            return np.c_[
                X, rooms_per_household, population_per_household, bedrooms_per_room
            ]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

In [7]:
# Pipeline for preprocessing the numerical attributes
numerical_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("attr_adder", CombinedAttributesAdder(add_bedrooms_per_room=True)),
    ]
)

In [8]:
# Wrapping a data frame object in a list() function returns a list of column names
numerical_attributes = list(X_train.select_dtypes(include=np.number))
categorical_attributes = list(X_train.select_dtypes(include=object))
numerical_attributes, categorical_attributes

(['longitude',
  'latitude',
  'housing_median_age',
  'total_rooms',
  'total_bedrooms',
  'population',
  'households',
  'median_income'],
 ['ocean_proximity'])

In [9]:
# Pipeline
pipeline = ColumnTransformer(
    transformers=[
        ("numerical", numerical_pipeline, numerical_attributes),
        (
            "categorical",
            OneHotEncoder(categories="auto", handle_unknown="ignore"),
            categorical_attributes,
        ),
    ],
    remainder="drop",
)

In [10]:
X_train = pipeline.fit_transform(X_train)
X_train
X_train.shape

array([[-118.21,   33.82,   34.  , ...,    0.  ,    0.  ,    1.  ],
       [-117.27,   33.03,   25.  , ...,    0.  ,    0.  ,    1.  ],
       [-122.08,   37.68,   26.  , ...,    0.  ,    1.  ,    0.  ],
       ...,
       [-121.48,   38.49,   26.  , ...,    0.  ,    0.  ,    0.  ],
       [-118.26,   33.97,   46.  , ...,    0.  ,    0.  ,    0.  ],
       [-118.45,   37.25,   20.  , ...,    0.  ,    0.  ,    0.  ]])

(16511, 16)

### Bagging

In [318]:
bag_reg = BaggingRegressor(
    base_estimator=DecisionTreeRegressor(),
    # Number of base estimators in the ensemble
    n_estimators=500,
    # Bootstrap samples with replacement
    bootstrap=True,
    # Use out-of-bag samples to estimate the generalization error
    oob_score=True,
    random_state=1227,
    n_jobs=-1,
    verbose=0,
)

In [319]:
bag_reg.fit(X_train, y_train)



BaggingRegressor(base_estimator=DecisionTreeRegressor(), n_estimators=500,
                 n_jobs=-1, oob_score=True, random_state=1227)

In [320]:
joblib.dump(bag_reg, "../models/bagging/bagging_regressor_ca_housing.pkl")

['../models/bagging/bagging_regressor_ca_housing.pkl']

### Out-Of-Bag

In [13]:
bag_reg.oob_score_

0.8146932417823896

In [14]:
np.isnan(bag_reg.oob_prediction_).sum()

0

In [15]:
bag_reg.oob_prediction_
bag_reg.oob_prediction_.shape

array([150874.34554974, 166508.99470899, 207767.69154229, ...,
        73166.14583333,  95226.08695652, 113365.57377049])

(16511,)

### Accuracy Compared to A Single Decision Tree

In [16]:
X_test, y_test = (
    joblib.load("../../../california_housing_price_project/dataset/test_X.pkl"),
    joblib.load("../../../california_housing_price_project/dataset/test_y.pkl"),
)

### Bagged Model Performance On Training Data

The bagged model has the following values for $\hat{\sigma}^{2}$ (mean squared error) and root mean squared error:

In [17]:
# Predictions
y_pred_bagged = bag_reg.predict(X_train)
# Obtain MSE and RMSE
MSE, RMSE = (
    mean_squared_error(y_true=y_train, y_pred=y_pred_bagged, squared=True),
    mean_squared_error(y_true=y_train, y_pred=y_pred_bagged, squared=False),
)

In [18]:
f"MSE={MSE} and RMSE={RMSE}"

'MSE=336456540.8859869 and RMSE=18342.751726117513'

### Bagged Model Performance On Test Data

One the test set, the bagged model has the following performance:

In [19]:
# Apply pipeline to test data
X_test = pipeline.fit_transform(X_test)

In [20]:
# Predictions
y_pred_bagged_test = bag_reg.predict(X_test)
# Obtain MSE and RMSE
MSE_test, RMSE_test = (
    mean_squared_error(y_true=y_test, y_pred=y_pred_bagged_test, squared=True),
    mean_squared_error(y_true=y_test, y_pred=y_pred_bagged_test, squared=False),
)

In [21]:
f"MSE={MSE_test} and RMSE={RMSE_test}"

'MSE=2336630343.16881 and RMSE=48338.70440101607'

### Single Decision Tree Performance On Training Data

On the other hand, the single decision tree grossly overfits. On the training data, the mean square error is zero.

In [22]:
# Single tree
tree_reg = DecisionTreeRegressor(random_state=1227)
tree_reg.fit(X_train, y_train)
y_pred_tree = tree_reg.predict(X_train)
# Obtain MSE and RMSE
MSE_tree, RMSE_tree = (
    mean_squared_error(y_true=y_train, y_pred=y_pred_tree, squared=True),
    mean_squared_error(y_true=y_train, y_pred=y_pred_tree, squared=False),
)

DecisionTreeRegressor(random_state=1227)

In [23]:
f"MSE={MSE_tree} and RMSE={RMSE_tree}"

'MSE=0.0 and RMSE=0.0'

### Single Decision Tree Performance On Test Data

On the test set, the single decision tree performs worse than the bagged model:

In [24]:
y_pred_tree_test = tree_reg.predict(X_test)
# Obtain MSE and RMSE
MSE_tree_test, RMSE_tree_test = (
    mean_squared_error(y_true=y_test, y_pred=y_pred_tree_test, squared=True),
    mean_squared_error(y_true=y_test, y_pred=y_pred_tree_test, squared=False),
)

In [25]:
f"MSE={MSE_tree_test} and RMSE={RMSE_tree_test}"

'MSE=4770561447.170502 and RMSE=69069.25109750722'

### Performance Table

In [26]:
perf_table = pd.DataFrame(
    {
        "Performance": ["RMSE (Training Data)", "RMSE (Test Data)"],
        "BaggingRegressor": [RMSE, RMSE_test],
        "DecisionTreeRegressor": [RMSE_tree, RMSE_tree_test],
    }
)
perf_table

Unnamed: 0,Performance,BaggingRegressor,DecisionTreeRegressor
0,RMSE (Training Data),18342.751726,0.0
1,RMSE (Test Data),48338.704401,69069.251098
