<a href="https://colab.research.google.com/github/damianodea/hpc_2025/blob/main/%5BPUBLIC%5D_HPQC_HandsOn_DT_Ensemble.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Decision Trees** (DT)

Import the common modules, make sure matplotlib renders figures inline, and check that Python 3 and scikit-learn ≥ 0.20 are installed (Python 2.x may work, but it is deprecated in Colab, so it is better to move to Python 3).

In [None]:
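import sys
assert sys.version_info >= (3,)  # Python 3 or later is required

# Scikit-Learn ≥0.20 is required
import sklearn
from packaging import version  # compares versions numerically, not as strings
assert version.parse(sklearn.__version__) >= version.parse("0.20")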

# Common imports
import numpy as np
import os

# to make the notebook's output stable across subsequent runs
seed = 42
np.random.seed(seed)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)  # instantiate a DT classifier
tree_clf.fit(X, y)  # train it on the two petal features

In [None]:
from graphviz import Source
from sklearn.tree import export_graphviz

export_graphviz(
        tree_clf,
        out_file="iris_tree.dot",  # written to the current working directory
        feature_names=iris.feature_names[2:],
        class_names=iris.target_names,
        rounded=True,
        filled=True
    )
print(iris.target_names)

Source.from_file("iris_tree.dot")

In [None]:
tree_clf.predict_proba([[5, 1.5]])

In [None]:
tree_clf.predict([[5, 1.5]]) # Versicolor
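
The two calls above are closely related: a decision tree estimates class probabilities from the class proportions of the training instances in the leaf that the sample falls into, and `predict` returns the class with the highest proportion. The following is a minimal sketch of that relationship, assuming the same petal features and `max_depth=2` tree as above; the variable names `sample`, `leaf_id`, `leaf_value` and `leaf_proba` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width, as above
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

sample = np.array([[5.0, 1.5]])

# apply() returns the index of the leaf each sample ends up in
leaf_id = tree_clf.apply(sample)[0]

# tree_.value holds per-class counts (or fractions, depending on the
# scikit-learn version) for every node; normalizing covers both cases
leaf_value = tree_clf.tree_.value[leaf_id, 0]
leaf_proba = leaf_value / leaf_value.sum()

print(leaf_proba)                                # class proportions in the leaf
print(tree_clf.predict_proba(sample)[0])         # what predict_proba reports
print(iris.target_names[np.argmax(leaf_proba)])  # class with the highest proportion
```

Reading the proportions off the leaf by hand is only meant to show where the numbers come from; in practice `predict_proba` is the call to use.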

## <font color=red>Exercise</font>

Train and fine-tune a Decision Tree on the iris petal features (`X`, `y`) loaded above by following these steps:


1.   Use [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split) (from `sklearn.model_selection`) to split the dataset into a training set and a test set (20\% test).
2.   Use grid search with cross-validation (with the help of the [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class) to find good hyperparameter values for a `DecisionTreeClassifier` (hint: try various values for `max_leaf_nodes`).
3.   Train it on the full training set using these hyperparameters, and measure your model's performance on the test set.

You should get roughly 97\% accuracy.

In [None]:
# type your code below

### <font color='green'>Solution</font>

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

In [None]:
from sklearn.model_selection import GridSearchCV

params = {
    'max_leaf_nodes': list(range(2, 50)),
    'max_depth': list(range(1, 7)),
    'min_samples_split': [2, 3, 4]
}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42),
                              params,
                              cv=3)

grid_search_cv.fit(X_train, y_train)

print("Best estimator is:")
grid_search_cv.best_estimator_

In [None]:
from sklearn.metrics import accuracy_score

y_pred = grid_search_cv.predict(X_test)
accuracy_score(y_test, y_pred)
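
Step 3 of the exercise needs no separate training call in the solution because `GridSearchCV` uses `refit=True` by default: once the search finishes, the best hyperparameter combination is retrained on the whole training set and stored in `best_estimator_`. The sketch below illustrates this with a deliberately small grid; `small_params` and its values are assumptions made here to keep the run short, not the grid used above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A small illustrative grid (hypothetical values, chosen for speed)
small_params = {"max_leaf_nodes": [2, 3, 4, 8], "max_depth": [1, 2, 3]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), small_params, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)  # best hyperparameter combination found
print(search.best_score_)   # mean cross-validated accuracy for that combination

# With refit=True (the default), best_estimator_ has already been retrained
# on the whole training set, and search.predict() delegates to it
test_acc = accuracy_score(y_test, search.predict(X_test))
same_preds = bool(
    (search.predict(X_test) == search.best_estimator_.predict(X_test)).all())
print(test_acc, same_preds)
```

Note that `best_score_` is a cross-validation estimate on the training set, while `test_acc` is measured on data the search never saw, so the two numbers need not agree.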

# Ensemble models

## Bagging algorithms

### Bagged Decision Trees

The example below uses the [`BaggingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) with Classification and Regression Trees (CART) as the base estimator. A total of 100 trees are created.


In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
#
from sklearn.ensemble import BaggingClassifier                     # <---

First, evaluate a single decision tree classifier as a baseline.

In [None]:
kfold = KFold(n_splits=10, shuffle=True)
model = DecisionTreeClassifier(random_state=seed)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())

Then try to do better with Bagging.

In [None]:
# Bagged Decision Trees for Classification
kfold = KFold(n_splits=10, shuffle=True)
cart = DecisionTreeClassifier(random_state=seed)
num_trees = 100
model = BaggingClassifier(estimator=cart, n_estimators=num_trees, random_state=seed, bootstrap = False)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())

*(NOTE: running the cell above takes longer than the previous cells.)*

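Now change one parameter above to `bootstrap = True`, rerun the cell, and compare the result.

With `bootstrap = True`, each tree is trained on a different random sample of the training set drawn with replacement, so the trees differ from one another and aggregating their votes reduces the variance of the prediction. With `bootstrap = False` (and the default `max_samples=1.0`), every tree sees the same full training set, so the ensemble behaves much like a single tree.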
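
To make the effect of the `bootstrap` flag concrete, here is a small NumPy sketch of the two sampling schemes. It does not call scikit-learn at all; `n_samples` is a hypothetical training-set size chosen for illustration, and the index arrays only mimic what a bagging ensemble would hand to each tree.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000  # hypothetical training-set size, chosen for illustration

# bootstrap=True: n_samples indices drawn WITH replacement,
# so some instances repeat and others are left out
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
unique_fraction = np.unique(bootstrap_idx).size / n_samples

# bootstrap=False with all samples: drawing WITHOUT replacement is just a
# permutation, so every instance appears exactly once for every tree
no_bootstrap_idx = rng.permutation(n_samples)
full_fraction = np.unique(no_bootstrap_idx).size / n_samples

print(unique_fraction)  # close to 1 - 1/e, i.e. about 0.632
print(full_fraction)
```

Each bootstrap sample therefore covers only about 63% of the distinct training instances, which is what makes the individual trees differ.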

### Random Forest

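Random Forest is **an extension of bagged decision trees**: in addition to training each tree on a bootstrap sample, it considers only a random subset of the features when searching for each split.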

You can construct a Random Forest model for classification using the [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) class. The example below demonstrates using Random Forest for classification with 100 trees, where each split point is chosen from 1 randomly selected feature out of the 2 petal features in `X`.

In [None]:
from sklearn.ensemble import RandomForestClassifier                    # <---

# Random Forest Classification
num_trees = 100
max_features = 1  # X has only 2 features (petal length and width), so 3 would exceed n_features
kfold = KFold(n_splits=10, shuffle=True)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())

*(NOTE: running the cell above takes longer than the previous cells.)*

Running the example provides a mean estimate of classification accuracy.
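
Since a Random Forest extends bagged decision trees, one way to see the connection is to build a similar ensemble by hand: a `BaggingClassifier` whose base trees pick a random feature subset at each split. The sketch below compares the two on the same petal features. The choice of `max_features="sqrt"` and the fixed `random_state` values are assumptions made here for the comparison; the two ensembles are similar in spirit, not guaranteed to give identical scores.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Bagging over trees that each consider a random feature subset per split
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", random_state=42),
    n_estimators=100, bootstrap=True, random_state=42)

# The dedicated class; max_features is set explicitly to match the trees above
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=42)

bag_score = cross_val_score(bagged_trees, X, y, cv=kfold).mean()
forest_score = cross_val_score(forest, X, y, cv=kfold).mean()
print(bag_score, forest_score)
```

`RandomForestClassifier` is usually the more convenient choice because it is optimized for trees and exposes extras such as `feature_importances_`.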

## Boosting Algorithms

### AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier                    # <---

# AdaBoost Classification
num_trees = 30
kfold = KFold(n_splits=10, shuffle=True)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())

Running the example provides a mean estimate of classification accuracy.
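
Boosting trains its estimators sequentially, each one focusing on the instances its predecessors got wrong, so it can be informative to look at accuracy after every boosting round instead of only at the end. The sketch below uses the `staged_score` method of `AdaBoostClassifier` on a single train/test split; the split and the variable name `stage_scores` are assumptions introduced here, not part of the cross-validation run above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = AdaBoostClassifier(n_estimators=30, random_state=42)
model.fit(X_train, y_train)

# staged_score yields the test accuracy after each boosting round;
# boosting may stop early, so there can be fewer than 30 stages
stage_scores = list(model.staged_score(X_test, y_test))

print(len(stage_scores))
print(stage_scores[0], stage_scores[-1])  # first round vs. final ensemble
```

Plotting `stage_scores` against the round number is a simple way to judge whether adding more estimators still helps.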

### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier                    # <---

# Gradient Boosting Classification (stochastic only if subsample < 1.0; the default is 1.0)
num_trees = 100
kfold = KFold(n_splits=10, shuffle=True)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())
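
With default settings, `GradientBoostingClassifier` fits every tree on the full training set. The stochastic variant fits each tree on a random fraction of the training instances, controlled by the `subsample` parameter. The sketch below shows that variant on the same petal features; `subsample=0.8` is an illustrative value assumed here, not a tuned one.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# subsample < 1.0 makes each tree train on a random 80% of the instances
sgb = GradientBoostingClassifier(
    n_estimators=100, subsample=0.8, random_state=42)
sgb_score = cross_val_score(sgb, X, y, cv=kfold).mean()
print(sgb_score)
```

Subsampling trades a little bias for lower variance and faster training, so it is worth comparing against the default `subsample=1.0` on your own data.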

### Extreme gradient boosting (XGBoost)

In [None]:
import xgboost as xgb     # <--- Notice the different library

num_trees = 100
kfold = KFold(n_splits=10, shuffle=True)
model = xgb.XGBClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())

## Credits

_Credits: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Edition) by Aurélien Géron, O'Reilly Media Inc., 2019_