# Preface

In this notebook, we demonstrate classification with decision trees, followed by model ensembling (bagging and boosting) built on decision trees.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.set_context('notebook', font_scale=1.25, rc={"lines.linewidth": 2.5})
sns.set_style("darkgrid")
np.random.seed(123)  # For reproducibility

# Diabetes Dataset

This dataset is originally from the *National Institute of Diabetes and Digestive and Kidney Diseases*. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, `Outcome` (1 if the patient is diabetic and 0 if not). Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

In [None]:
dataset = pd.read_csv('./data/diabetes.csv')

In [None]:
dataset.head()

## Train Test Split

We split the dataset into training and test sets as usual. This time, we apply cross-validation to the training set to evaluate our model for the purpose of model selection, and use the test data only for the final model evaluation. This prevents *overfitting the test set*.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x, y = dataset[dataset.columns[:-1]], dataset[dataset.columns[-1]]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)

# Decision Tree Classification

We first fit a decision tree using `DecisionTreeClassifier` from `sklearn`.

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(x_train, y_train)

## Cross Validation Scoring

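We now use cross-validation scoring on the training set. Note that `cross_val_score` fits a fresh copy of `clf` on each fold, so it measures how well this model configuration generalizes to held-out data. This gives a better measure of actual performance than the training accuracy does.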

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score(clf, x_train, y_train, cv=10)
print(f'Mean accuracy: {np.mean(scores)}')
print(f'Std accuracy: {np.std(scores)}')
sns.histplot(scores, kde=True)  # `distplot` is deprecated in recent seaborn releases
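
To make the scoring step above concrete, here is a minimal sketch of what `cross_val_score` does for a classifier when `cv` is an integer: it splits the data into stratified folds, fits a fresh copy of the model on all folds but one, and scores it on the held-out fold. The data here is a synthetic stand-in (`X_demo`, `y_demo` are hypothetical names, not the diabetes data), and `random_state=0` is an assumption added so that both computations agree exactly.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (x_train, y_train); NOT the diabetes data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# For a classifier and an integer `cv`, the folds are stratified and not shuffled.
folds = StratifiedKFold(n_splits=10)

manual_scores = []
for fit_idx, val_idx in folds.split(X_demo, y_demo):
    model = clone(tree)  # fresh, unfitted copy for every fold
    model.fit(X_demo[fit_idx], y_demo[fit_idx])
    manual_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The library call should reproduce the manual loop.
library_scores = cross_val_score(tree, X_demo, y_demo, cv=10)
print(f'Manual mean accuracy:  {manual_scores.mean()}')
print(f'Library mean accuracy: {library_scores.mean()}')
```

Because every fold trains its own copy, the model fitted earlier with `clf.fit` is never the one being scored here.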

Observe that the cross-validated accuracy reflects test performance better than the training accuracy does!

In [None]:
print(f'Train accuracy: {clf.score(x_train, y_train)}')
print(f'Mean CV accuracy: {np.mean(scores)}')
print(f'Test accuracy: {clf.score(x_test, y_test)}')

## Visualizing the Decision Tree

As mentioned in the lecture, one advantage of a decision tree is that it can be visualized to see how it arrives at a decision. Let us see how our tree model arrives at a diagnosis of diabetes.

We will use `plot_tree` from `sklearn.tree` to achieve this. Alternatively, you can use the [`graphviz` package](https://www.graphviz.org).

In [None]:
from sklearn.tree import plot_tree

In [None]:
plt.figure(figsize=(30, 15))
plot_tree(
    clf,
    feature_names=list(dataset.columns[:-1]),  # a plain list is accepted by all sklearn versions
    class_names=['Negative', 'Positive'],
    filled=True,
    fontsize=25
);
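
When a figure is inconvenient (for example in a terminal or a log file), the same decision rules can be printed as text. The sketch below uses `export_text` from `sklearn.tree` on a synthetic stand-in dataset; the feature names `f0` to `f3` are hypothetical placeholders, not columns of the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; NOT the diabetes data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical feature names

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_demo, y_demo)

# One line per node: internal nodes show the split rule, leaves show the predicted class.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```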

## Overfitting

Let us now fit a sequence of decision trees of increasing depth. We observe two things from the results below:
  1. As depth increases, we overfit: training accuracy keeps increasing, but test accuracy does not.
  2. Generally, CV accuracy is a better predictor of test accuracy than training accuracy is.

In [None]:
results = []
for depth in range(2, 8):
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(x_train, y_train)
    train_acc = clf.score(x_train, y_train)
    test_acc = clf.score(x_test, y_test)
    cv_acc = np.mean(cross_val_score(clf, x_train, y_train, cv=10))
    results.append([depth, train_acc, test_acc, cv_acc])
results = pd.DataFrame(
    data=results,
    columns=['depth', 'train accuracy', 'test accuracy', 'cv accuracy'],
)
results = pd.melt(
    results,
    id_vars=['depth'],
    var_name='type',
    value_name='accuracy'
)  # Melt dataframe for easier plotting

In [None]:
sns.lineplot(
    x='depth',
    y='accuracy',
    hue='type',
    data=results,
)
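
As an alternative to the hand-written loop, the training and cross-validation curves of a depth sweep can be sketched with `validation_curve` from `sklearn.model_selection`. Note that this helper only reports scores on the fitting folds and the held-out folds; it does not touch a separate test set. The data below is a synthetic stand-in, and `random_state=0` is an assumption added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (x_train, y_train); NOT the diabetes data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

depths = list(range(2, 8))

# Both returned arrays have shape (number of depths, number of folds).
fit_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    param_name='max_depth',
    param_range=depths,
    cv=10,
)

for depth, fit_acc, val_acc in zip(depths, fit_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f'depth={depth}  train accuracy={fit_acc:.3f}  cv accuracy={val_acc:.3f}')
```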

# Random Forest (Bagging)

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf = RandomForestClassifier(n_estimators=100, max_depth=3)
clf.fit(x_train, y_train)

In [None]:
print(f'Train accuracy: {clf.score(x_train, y_train)}')
scores = cross_val_score(clf, x_train, y_train, cv=10)
print(f'Mean CV accuracy: {np.mean(scores)}')
print(f'Test accuracy: {clf.score(x_test, y_test)}')

In [None]:
results_rf = []
for depth in range(2, 8):
    clf = RandomForestClassifier(n_estimators=100, max_depth=depth)
    clf.fit(x_train, y_train)
    train_acc = clf.score(x_train, y_train)
    test_acc = clf.score(x_test, y_test)
    cv_acc = np.mean(cross_val_score(clf, x_train, y_train, cv=10))
    results_rf.append([depth, train_acc, test_acc, cv_acc])
results_rf = pd.DataFrame(
    data=results_rf,
    columns=['depth', 'train accuracy', 'test accuracy', 'cv accuracy'],
)
results_rf = pd.melt(
    results_rf,
    id_vars=['depth'],
    var_name='type',
    value_name='accuracy'
)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

sns.lineplot(
    x='depth',
    y='accuracy',
    hue='type',
    data=results,
    ax=ax[0]
)
ax[0].set_title('Decision Tree')

sns.lineplot(
    x='depth',
    y='accuracy',
    hue='type',
    data=results_rf,
    ax=ax[1]
)
ax[1].set_title('Random Forest')

# AdaBoost (Boosting)

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=5,
)

In [None]:
clf.fit(x_train, y_train)

In [None]:
print(f'Train accuracy: {clf.score(x_train, y_train)}')
scores = cross_val_score(clf, x_train, y_train, cv=10)
print(f'Mean CV accuracy: {np.mean(scores)}')
print(f'Test accuracy: {clf.score(x_test, y_test)}')

## Hyper-parameter Tuning

Observe that the `AdaBoost` classifier has many adjustable settings. So far, we have left most of them at their default values.

In [None]:
clf

In practice, however, to obtain good performance we should perform *hyper-parameter tuning*. This means that we should pick the parameters (e.g. `max_depth`, `criterion`, `learning_rate`, `n_estimators`, etc.) that maximize performance.

How do we judge performance? We use cross-validation on the training set!

In [None]:
from sklearn.model_selection import GridSearchCV

First, we check what parameters are adjustable using `clf.get_params()`. 

**Note: make sure you understand what these parameters mean!**

In [None]:
clf.get_params()

Next we set up parameter grids and apply `GridSearchCV`. This will take some time...

In [None]:
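# The nested tree is exposed as `estimator` in scikit-learn >= 1.2
# and as `base_estimator` in older releases, so pick the prefix at run time.
prefix = 'estimator' if 'estimator' in clf.get_params() else 'base_estimator'
param_grid = {
    f'{prefix}__max_depth': [1, 2, 3],
    f'{prefix}__criterion': ['gini', 'entropy'],
    'n_estimators': [5, 25, 50],
    'learning_rate': [0.01, 0.1, 1.0],
}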

In [None]:
clf_grid = GridSearchCV(estimator=clf, param_grid=param_grid, cv=4)

Here the `cv` argument is the number of cross-validation folds.
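
To see why the search takes some time, it helps to count the model fits. The sketch below uses `ParameterGrid` to enumerate a grid with the same value lists as above (the variable name `param_grid_demo` is hypothetical, and the key names do not affect the count; the nested prefix is `base_estimator` in older scikit-learn releases).

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as the grid above; key names do not change the count.
param_grid_demo = {
    'estimator__max_depth': [1, 2, 3],
    'estimator__criterion': ['gini', 'entropy'],
    'n_estimators': [5, 25, 50],
    'learning_rate': [0.01, 0.1, 1.0],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # 3 * 2 * 3 * 3 combinations
n_folds = 4
n_fits = n_candidates * n_folds

print(f'Candidates: {n_candidates}')  # 54
print(f'Cross-validation fits: {n_fits}')  # 216
```

With the default `refit=True`, `GridSearchCV` performs one more fit at the end, training the best candidate on the whole training set.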

In [None]:
clf_grid.fit(x_train, y_train)

In [None]:
clf_grid.best_params_

In [None]:
print(f'Train accuracy: {clf_grid.score(x_train, y_train)}')
print(f'Test accuracy: {clf_grid.score(x_test, y_test)}')
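
Since the search judges candidates by cross-validation, it is also worth looking at the cross-validated scores themselves. The sketch below reads `best_score_` and `cv_results_` from a fitted `GridSearchCV`. To keep it fast and self-contained, it swaps in a plain decision tree with a small hypothetical grid on synthetic data, rather than the AdaBoost search above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (x_train, y_train); NOT the diabetes data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [1, 2, 3, 4]},  # small illustrative grid
    cv=4,
)
search.fit(X_demo, y_demo)

# One row per candidate, with the mean and spread of its held-out fold scores.
cv_table = pd.DataFrame(search.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(cv_table.sort_values('rank_test_score'))

# Mean cross-validated accuracy of the selected candidate.
print(f'Best CV accuracy: {search.best_score_}')
```

In `cv_results_`, the word "test" refers to the held-out fold of each split, not to the final test set.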

# Final Remarks

So we have improved our results with a cross-validated grid search over some hyper-parameters. What if the parameter space is so large that an exhaustive grid search is impractical? In that case, consider `sklearn.model_selection.RandomizedSearchCV`, which samples a fixed number of parameter settings instead of trying them all.
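
A minimal sketch of such a randomized search follows, assuming synthetic stand-in data and illustrative sampling ranges (the ranges, `n_iter=8`, and `random_state=0` are assumptions chosen for the example, not recommendations). Only top-level `AdaBoostClassifier` parameters are sampled here.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for (x_train, y_train); NOT the diabetes data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Distributions to sample from, instead of fixed lists of values.
param_distributions = {
    'n_estimators': randint(5, 51),          # integers from 5 to 50 inclusive
    'learning_rate': loguniform(1e-2, 1.0),  # log-uniform between 0.01 and 1
}

search = RandomizedSearchCV(
    estimator=AdaBoostClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,  # number of sampled settings, fixed regardless of the size of the space
    cv=4,
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f'Best CV accuracy: {search.best_score_}')
```

The cost is controlled by `n_iter` rather than by the size of the parameter space, which is the main reason to prefer it over a grid when there are many parameters.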

Another boosting algorithm, *gradient boosting*, often gives state-of-the-art results on a variety of tasks. For details, check the following resources:
   * [Gradient boosting on sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
   * [xgboost](https://xgboost.readthedocs.io/en/latest/python/python_intro.html): another implementation, often faster than sklearn's
   
Remember to perform hyperparameter tuning! The defaults can be very bad for some applications.
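
As a starting point for the gradient boosting mentioned above, here is a minimal sketch that scores sklearn's `GradientBoostingClassifier` with the same 10-fold cross-validation used throughout this notebook. The data is a synthetic stand-in, and the parameter values are illustrative assumptions that would themselves need tuning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (x_train, y_train); NOT the diabetes data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative, untuned settings.
gb = GradientBoostingClassifier(
    n_estimators=50,
    learning_rate=0.1,
    max_depth=2,
    random_state=0,
)

gb_scores = cross_val_score(gb, X_demo, y_demo, cv=10)
print(f'Mean CV accuracy: {np.mean(gb_scores)}')
print(f'Std CV accuracy: {np.std(gb_scores)}')
```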