# Bagged Decision Trees

In [51]:
import pandas as pd
import numpy as np

from numpy.random import choice, normal

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score

from dataloader import DataLoader

In [46]:
dataloader = DataLoader('data/loan_data_clean.csv',
                        feature_names=['person_income', 'loan_int_rate',
                                       'loan_percent_income', 'loan_intent'],
                        target_names=['loan_status'])
dataloader.train_test_split()

To apply bagging to decision trees, we create bootstrap samples from our training data by repeatedly sampling with replacement, then train one decision tree on each of these samples, and create an ensemble prediction by averaging over the predictions of the different trees.
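The three steps above can be sketched by hand. This is a minimal illustration rather than the notebook's method: it assumes synthetic data from `make_classification` in place of the loan data, and the names `n_trees`, `trees`, and `bagged_proba` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the loan data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

rng = np.random.default_rng(0)
n_trees = 25  # hypothetical ensemble size
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Train one unpruned tree on this bootstrap sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Ensemble prediction: average the class-1 probabilities across all trees
bagged_proba = np.mean([t.predict_proba(X_test)[:, 1] for t in trees], axis=0)
bagged_pred = (bagged_proba >= 0.5).astype(int)
```

Averaging probabilities is one way to combine classifiers; a majority vote over the predicted labels is a common alternative.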

Bagged decision trees are usually grown large: they have many levels and leaf nodes and are left unpruned, so each tree has low bias but high variance. Averaging their predictions then reduces this variance. Bagging has been shown to substantially improve predictive performance by combining hundreds or even thousands of trees trained on bootstrap samples.
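The variance reduction can be checked with a small simulation. The sketch below rests on assumptions that are not part of this notebook: a hypothetical sine-shaped regression target with Gaussian noise, 30 independently drawn training sets, and a fixed grid of test points. It compares how much the predictions of a single unpruned tree and of a bagged ensemble change from one training set to the next.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
x_test = np.linspace(0, 5, 50).reshape(-1, 1)  # fixed evaluation points

def draw_training_set(n=100):
    """Draw a fresh noisy sample from a hypothetical sine-shaped target."""
    x = rng.uniform(0, 5, size=(n, 1))
    y = np.sin(x).ravel() + rng.normal(scale=0.5, size=n)
    return x, y

tree_preds, bagged_preds = [], []
for rep in range(30):
    x, y = draw_training_set()
    # Single unpruned tree: low bias, high variance
    tree = DecisionTreeRegressor(random_state=rep).fit(x, y)
    # Bagged ensemble of the same kind of unpruned tree
    bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=20,
                              random_state=rep).fit(x, y)
    tree_preds.append(tree.predict(x_test))
    bagged_preds.append(bagged.predict(x_test))

# Variance of the prediction at each test point across training sets, averaged
tree_var = np.var(tree_preds, axis=0).mean()
bagged_var = np.var(bagged_preds, axis=0).mean()
print(f"single tree: {tree_var:.3f}, bagged: {bagged_var:.3f}")
```

Under these assumptions the bagged ensemble should show the smaller variance; the exact figures depend on the noise level, the sample size, and the number of trees.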

To apply bagging to a classification tree, we can use the `BaggingClassifier` meta-estimator provided by `sklearn` (a `BaggingRegressor` counterpart exists for regression trees). It trains a user-defined base estimator on subsets of the data that are drawn according to parameters that specify the sampling strategy:

- `max_samples` and `max_features` control the size of the subsets drawn from the rows and the columns, respectively
- `bootstrap` and `bootstrap_features` determine whether each of these samples is drawn with or without replacement
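A short sketch of how these four parameters might be set. The values (80% of the rows, half of the columns) and the synthetic data are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 8 features (an assumption for illustration)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

bagging = BaggingClassifier(
    DecisionTreeClassifier(),   # base estimator, passed positionally
    n_estimators=10,
    max_samples=0.8,            # each tree sees 80% as many rows as the training set
    max_features=0.5,           # each tree sees half of the columns
    bootstrap=True,             # rows are drawn with replacement
    bootstrap_features=False,   # columns are drawn without replacement
    random_state=0,
).fit(X, y)

# Column indices used by the first tree in the ensemble
first_features = bagging.estimators_features_[0]
print(len(bagging.estimators_), sorted(first_features))
```

The fitted attribute `estimators_features_` records which columns each tree was given, which helps when checking that the sampling settings behave as intended.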

In [61]:
train_size = 250
test_size = 500
reps = 100
max_depth = 10
n_estimators = 10

# Define the decision tree classifier and the bagging classifier
tree = DecisionTreeClassifier(max_depth=max_depth)
bagged_tree = BaggingClassifier(estimator=tree, n_estimators=n_estimators)

# Train the bagged decision trees
bagged_tree.fit(dataloader.features, np.ravel(dataloader.loan_status))

# Predict the probability of the positive class (default)
default_proba = bagged_tree.predict_proba(dataloader.features_test)[:,1]

# Measure model performance on the test set with the ROC AUC score
roc_auc_score(dataloader.loan_status_test, default_proba)

0.9286083507814444