<h1>Boosted Decision Trees</h1>

The premise behind boosting is similar to bagging -- we take a set of decision trees that might not be great predictors on their own and combine them into a more reliable predictor. The main difference is that while bagging fits all of the models independently (and averages them), boosting fits the models sequentially. Instead of each tree being trained on a different bootstrap sample of the data, each tree is trained on the same data, but with the target slightly modified by the trees that came before it.

The goal of a boosted tree is to learn slowly. We train a single tree and produce residuals -- the differences between the predicted values and the true values. We use these residuals to train the next tree in the sequence. In effect, we are using the residuals instead of the observations to train the next tree. We add this new tree into the fitted function (stack the trees on top of each other) and produce new residuals. And repeat.

The algorithm is laid out in steps below:
- fit a tree to the training data
- compute the residuals
- fit a new tree to the residuals
- add the new tree to the trees fitted so far
- repeat from the second step

The final model is the sum of all of the trees we have stacked on top of each other.
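The steps above can be sketched from scratch for a regression problem. This is a minimal illustration, not the implementation scikit-learn uses: the function names `fit_boosted_trees` and `predict_boosted_trees`, the `shrinkage` argument (the $\lambda$ weight discussed in the text), and the synthetic sine data are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_trees=100, shrinkage=0.1, max_depth=2):
    """Fit small regression trees one after another, each on the current residuals."""
    # The model starts at zero, so the first residuals are just the targets.
    residuals = y.astype(float).copy()
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # Only a shrunken fraction of the new tree is added to the model,
        # so the residuals shrink slowly from one round to the next.
        residuals -= shrinkage * tree.predict(X)
        trees.append(tree)
    return trees


def predict_boosted_trees(trees, X, shrinkage=0.1):
    """The final model is the (shrunken) sum of all fitted trees."""
    return shrinkage * sum(tree.predict(X) for tree in trees)


# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

trees = fit_boosted_trees(X_demo, y_demo)
mse_one = np.mean((y_demo - predict_boosted_trees(trees[:1], X_demo)) ** 2)
mse_all = np.mean((y_demo - predict_boosted_trees(trees, X_demo)) ** 2)
print(f"training MSE with 1 tree: {mse_one:.4f}")
print(f"training MSE with {len(trees)} trees: {mse_all:.4f}")
```

Each individual tree is weak (depth 2), but the training error of the summed model keeps falling as trees are added. Training error alone says nothing about overfitting, which is why the number of trees is treated as a tuning parameter.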

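The idea is that each subsequent tree focuses on the parts of the data that the previous trees got wrong. We weight each new tree by a shrinkage parameter $\lambda$, which controls how much we learn from the residuals in each round; a small $\lambda$ means slower learning and usually calls for more trees. The other parameters involved are the number of trees and the depth of each tree. In scikit-learn these three correspond to `learning_rate`, `n_estimators` and `max_depth`.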
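Written out, one round of this procedure might look like the following. The notation is an assumption borrowed from common textbook presentations rather than from this notebook: $\hat{f}$ is the model built so far, $\hat{f}^{b}$ is the $b$-th tree, $r_i$ is the current residual for observation $i$, and $B$ is the total number of trees.

```latex
% Start with \hat{f}(x) = 0 and r_i = y_i for every training observation.
% For b = 1, ..., B: fit a tree \hat{f}^{b} to the pairs (x_i, r_i), then update
\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^{b}(x),
\qquad
r_i \leftarrow r_i - \lambda \hat{f}^{b}(x_i)

% After B rounds the boosted model is the shrunken sum of the trees
\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^{b}(x)
```

This is the squared-error regression version. A classifier such as `GradientBoostingClassifier` fits its trees to the gradient of a log-loss rather than to raw residuals, but the sequential structure is the same.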

Let's see how this works:

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import resample
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

In [9]:
df = pd.read_csv("heart.csv")
print(df.shape)
df.head()

(303, 14)


,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [10]:
# Separate the target column from the features
y = df['output']
X = df.drop('output', axis=1)
# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [11]:
# Baseline: gradient boosting with scikit-learn's default hyperparameters
clf = GradientBoostingClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.7802197802197802


What hyperparameters should we try tuning?

In [12]:
# Many more trees (2000) and deeper trees (max_depth=6) than the defaults of 100 and 3
clf = GradientBoostingClassifier(learning_rate=0.1,
                                 n_estimators=2000,
                                 subsample=1.0,
                                 min_samples_split=2,
                                 min_samples_leaf=1,
                                 max_depth=6,
                                 random_state=42,
                                 ccp_alpha=0.0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.7472527472527473
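With this many trees it can help to look at how accuracy evolves as trees are added, instead of only scoring the final model. The sketch below uses `staged_predict`, which yields the ensemble's predictions after each boosting round. Since `heart.csv` is not bundled here, it runs on a synthetic dataset from `make_classification` with the same number of rows (an assumption made purely so the example is self-contained); the names `X_syn`, `staged_clf` and `staged_acc` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data: 303 rows, 13 features (assumption).
X_syn, y_syn = make_classification(n_samples=303, n_features=13,
                                   n_informative=5, random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42)

staged_clf = GradientBoostingClassifier(n_estimators=300, max_depth=3,
                                        random_state=42)
staged_clf.fit(Xs_train, ys_train)

# One accuracy value per boosting round: after 1 tree, 2 trees, ..., 300 trees.
staged_acc = [accuracy_score(ys_test, pred)
              for pred in staged_clf.staged_predict(Xs_test)]

best_n = int(np.argmax(staged_acc)) + 1
print(f"best accuracy {staged_acc[best_n - 1]:.3f} after {best_n} trees")
print(f"accuracy with all {len(staged_acc)} trees: {staged_acc[-1]:.3f}")
```

Picking the number of trees from test-set accuracy like this is optimistic; in practice the curve would be computed on a separate validation split or with cross-validation.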


In [13]:
# Same settings as the previous cell, but require at least 10 samples in every leaf
clf = GradientBoostingClassifier(learning_rate=0.1,
                                 n_estimators=2000,
                                 subsample=1.0,
                                 min_samples_split=2,
                                 min_samples_leaf=10,
                                 max_depth=6,
                                 random_state=42,
                                 ccp_alpha=0.0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.8021978021978022


In [14]:
# Back to min_samples_leaf=1, this time with shallower trees (max_depth=3)
clf = GradientBoostingClassifier(learning_rate=0.1,
                                 n_estimators=2000,
                                 subsample=1.0,
                                 min_samples_split=2,
                                 min_samples_leaf=1,
                                 max_depth=3,
                                 random_state=42,
                                 ccp_alpha=0.0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.7582417582417582
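Changing one setting at a time and comparing test accuracy, as in the cells above, is easy to follow but reuses the test set for every decision. One possible alternative is a cross-validated grid search over the same hyperparameters. The sketch below is an assumption about how that could be set up, not part of the original analysis: it uses synthetic data from `make_classification` in place of `heart.csv` so that it runs on its own, and the grid values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the heart data: 303 rows, 13 features (assumption).
X_syn, y_syn = make_classification(n_samples=303, n_features=13,
                                   n_informative=5, random_state=42)
Xg_train, Xg_test, yg_train, yg_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42)

# Illustrative grid over the settings varied by hand in the cells above.
param_grid = {
    "max_depth": [3, 6],
    "min_samples_leaf": [1, 10],
    "n_estimators": [100, 300],
}

search = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1, random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation on the training split only
    scoring="accuracy",
)
search.fit(Xg_train, yg_train)

print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
# The test split is touched once, after the hyperparameters are chosen.
print(f"test accuracy: {search.score(Xg_test, yg_test):.3f}")
```

To apply this to the heart data, `X_train`, `y_train`, `X_test` and `y_test` from the earlier cell would take the place of the synthetic splits.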
