# Leveraging tree-based methods

This lecture will focus on techniques to create **ensembles** of different tree models in order to enhance performance. Remember that we defined regression/classification trees as weak learners that are just slightly better than a random guess in predicting/classifying data.

**REMARK** An ensemble, i.e., a blend of different models used to produce a prediction/classification, is a very general notion and is not tied to a specific model. We present ensembles under the umbrella of tree-based methods because tree ensembles are known to perform well on many tasks.

**TAKEAWAY** You could construct ensembles by mixing different types of models (a regression tree and a neural network), but we try to keep it simple here.

The [scikit-learn documentation](https://scikit-learn.org/stable/modules/ensemble.html) has a good tutorial on ensemble methods that you can use as a reference for this lecture.

We start with some theory, then move to a credit card dataset and apply both techniques (bagging and boosting) to the same problem.

## Bagging 

**Bagging**, also called bootstrap aggregation, is a technique for reducing the variance of a model's output. *Bagging works especially well for high-variance, low-bias procedures, such as trees*. **For regression**, we simply fit the same regression tree many times to bootstrap-sampled versions of the training data (random samples of the same size, drawn with replacement) and average the results. **For classification**, each tree in a committee casts a vote for the predicted class.

**Boosting**, which we are going to explore later, was initially proposed as a committee method as well, although, unlike bagging, the committee of weak learners evolves over time, and the members cast a weighted vote. Boosting appears to dominate bagging on most problems and has become the preferred choice.

**Random forests** is a modification of bagging that builds a large collection of **de-correlated trees** and then **averages** them. On many problems, the performance of random forests is very similar to boosting, and they are simpler to train and tune. As a consequence, random forests are popular and are implemented in a variety of packages.


### Random Forests
The idea is to train many approximately unbiased models and, hence, reduce the variance by averaging their noise. Trees are ideal candidates for bagging since they can capture complex interaction structures in the data and, if grown sufficiently deep, have relatively low bias. 

However, being noisy, they greatly benefit from averaging. Moreover, since each tree generated in bagging is identically distributed (i.d.), the expectation of an average of $B$ such trees is the same as the expectation of any one of them. This means the bias of bagged trees is the same as that of the individual (bootstrap) trees, and the only hope of improvement is through variance reduction. *This is in contrast to boosting, where the trees are grown in an adaptive way to remove bias and hence are not i.d.*

An average of $B$ i.i.d. random variables, each with variance $\sigma^{2}$, has variance $\frac{1}{B} \sigma^{2}$. If the variables are simply i.d. (identically distributed, but not necessarily independent) with positive pairwise correlation $\rho$, the variance of the average is 
$$
\rho \sigma^{2}+\frac{1-\rho}{B} \sigma^{2}
$$
As $B$ increases, the second term disappears, but the first remains, and hence, the size of the correlation of pairs of bagged trees limits the benefits of averaging. 
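The variance formula can be checked numerically. The sketch below is an illustration rather than part of the original lecture: it draws $B$ equicorrelated Gaussian variables and compares the empirical variance of their average with $\rho \sigma^{2}+\frac{1-\rho}{B} \sigma^{2}$. The helper names `variance_of_average` and `simulate_average_variance` are hypothetical, and the Gaussian construction is an assumption made only for the simulation.

```python
import numpy as np


def variance_of_average(rho, sigma2, B):
    """Theoretical variance of the mean of B i.d. variables with pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) / B * sigma2


def simulate_average_variance(rho, sigma2, B, n_draws=50_000, seed=0):
    """Empirical variance of the mean of B equicorrelated Gaussian variables."""
    rng = np.random.default_rng(seed)
    # One shared component plus one independent component per variable
    # yields variance sigma2 and pairwise correlation rho.
    shared = rng.normal(size=(n_draws, 1))
    own = rng.normal(size=(n_draws, B))
    draws = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own)
    return draws.mean(axis=1).var()


for B in (1, 10, 100):
    theory = variance_of_average(rho=0.3, sigma2=1.0, B=B)
    empirical = simulate_average_variance(rho=0.3, sigma2=1.0, B=B)
    print(f"B={B:>3}  theory={theory:.4f}  simulation={empirical:.4f}")
```

As $B$ grows, both columns should approach $\rho \sigma^{2}$ rather than zero, which is the floor that the correlation between trees imposes.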

The idea in random forests is to improve the variance reduction of bagging by reducing the correlation between the trees, without increasing the variance too much. This is achieved in the tree-growing process through random selection of the input variables.

<img src="images/rf_algo.png" width="600">


When growing a tree on a bootstrapped dataset:

Before each split, select $m \leq p$ of the input variables at random as candidates for splitting. Typically, values for $m$ are $\sqrt{p}$ or even as low as 1.


After $B$ such trees $\left\{T\left(x; \Theta_{b}\right)\right\}_{1}^{B}$ are grown, the random forest (regression) predictor is
$$
\hat{f}_{\mathrm{rf}}^{B}(x)=\frac{1}{B} \sum_{b=1}^{B} T\left(x ; \Theta_{b}\right) .
$$
We observe that $\Theta_{b}$ characterizes the $b$-th random forest tree in terms of split variables, cutpoints at each node, and terminal-node values. Intuitively, reducing $m$ will reduce the correlation between any pair of trees in the ensemble.
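As a quick illustration of the averaging formula, the sketch below fits a scikit-learn `RandomForestRegressor` on synthetic data and reproduces its prediction by averaging the individual trees stored in `estimators_`. The dataset and the choice `max_features=3` (playing the role of $m$) are assumptions made for the example, not settings from the lecture.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem with p = 9 features.
X, y = make_regression(n_samples=300, n_features=9, noise=10.0, random_state=0)

# max_features=3 plays the role of m (here m = p / 3); n_estimators is B.
forest = RandomForestRegressor(n_estimators=50, max_features=3, random_state=0)
forest.fit(X, y)

# Average the B individual tree predictions by hand.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest prediction should coincide with the hand-computed average.
matches = np.allclose(manual_average, forest.predict(X))
print(matches)
```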

**Not all estimators can be improved by shaking up the data like this. It seems that highly nonlinear estimators, such as trees, benefit the most.** 
*For bootstrapped trees, $\rho$ is typically small ($0.05$ or lower), while $\sigma^{2}$ is not much larger than the variance of the original tree. On the other hand, bagging does not change linear estimates, such as the sample mean, and hence does not change their variance either; the pairwise correlation between bootstrapped means is about $50 \%$.*



**Random forests do remarkably well, with very little tuning required.**



When used for classification, a random forest obtains a class vote from each tree and then classifies using a majority vote (as a committee). When used for regression, the predictions from each tree at a target point $x$ are simply averaged. 

In addition, consider that:
- For classification, the default value for $m$ is $\lfloor\sqrt{p}\rfloor$ and the minimum node size is one.
- For regression, the default value for $m$ is $\lfloor p / 3\rfloor$ and the minimum node size is five.

**In practice, the best values for these parameters will depend on the problem**, and they should be treated as tuning parameters. 
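A minimal sketch of treating $m$ as a tuning parameter is shown below, assuming synthetic data and scikit-learn's `RandomForestClassifier`, where `max_features` corresponds to $m$ and `min_samples_leaf` is used here as a rough stand-in for the minimum node size. The candidate values are arbitrary choices for illustration. Note that scikit-learn's own defaults may differ from the ones listed above (for example, recent versions of `RandomForestRegressor` consider all features by default), so it is worth setting these parameters explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification problem with p = 16 features, 4 of them informative.
X, y = make_classification(
    n_samples=500, n_features=16, n_informative=4, random_state=0
)

scores = {}
for m in (1, 4, 16):  # m = 1, m = sqrt(p), and m = p (plain bagging)
    forest = RandomForestClassifier(
        n_estimators=50, max_features=m, min_samples_leaf=1, random_state=0
    )
    # Mean 5-fold cross-validated accuracy for this value of m.
    scores[m] = cross_val_score(forest, X, y, cv=5).mean()

print(scores)
```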


When the number of variables is large, but the fraction of relevant variables is small, random forests are likely to perform poorly with small $m$. At each split, the chance can be small that the relevant variables will be selected. When the number of relevant variables increases, the performance of random forests is surprisingly robust to an increase in the number of noise variables (look at the example provided in the book). 


There is the claim that random forests "cannot overfit" the data. It is certainly true that increasing $B$ does not cause the random forest sequence to overfit; like bagging, the random forest estimate approximates the expectation.
$$
\hat{f}_{\mathrm{rf}}(x)=\mathrm{E}_{\Theta} T(x ; \Theta)=\lim _{B \rightarrow \infty} \hat{f}_{\mathrm{rf}}^{B}(x)
$$
with an average over $B$ realizations of $\Theta$. The distribution of $\Theta$ here is conditional on the training data. However, this limit can overfit the data; the average of fully grown trees can result in too rich a model and incur unnecessary variance. 


## Boosting

Boosting was originally designed for classification problems, but as will be seen in this chapter, it can profitably be extended to regression as well. 

The most popular boosting algorithm is **AdaBoost**. Consider a two-class problem, with the output variable coded as $Y \in\{-1,1\}$. Given a vector of predictor variables $X$, a classifier $G(X)$ produces a prediction taking one of the two values $\{-1,1\}$. The error rate on the training sample is
$$
\overline{\mathrm{err}}=\frac{1}{N} \sum_{i=1}^{N} I\left(y_{i} \neq G\left(x_{i}\right)\right),
$$
and the expected error rate on future predictions is $\mathrm{E}_{X Y} I(Y \neq G(X))$.
A weak classifier is one whose error rate is only slightly better than random guessing. The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers $G_{m}(x), m=1,2, \ldots, M$.

<img src="images/ab_algo.png" width="600">

The predictions from all of them are then combined through a weighted majority vote to produce the final prediction:
$$
G(x)=\operatorname{sign}\left(\sum_{m=1}^{M} \alpha_{m} G_{m}(x)\right)
$$
Here $\alpha_{1}, \alpha_{2}, \ldots, \alpha_{M}$ are computed by the boosting algorithm and weight the contribution of each respective $G_{m}(x)$. Their effect is to give a higher influence to the more accurate classifiers in the sequence.

The data modifications at each boosting step consist of applying weights $w_{1}, w_{2}, \ldots, w_{N}$ to each of the training observations $\left(x_{i}, y_{i}\right), i=1,2, \ldots, N$. Initially, all of the weights are set to $w_{i}=1 / N$ so that the first step simply trains the classifier on the data in the usual manner. For each successive iteration $m=2,3, \ldots, M$, the observation weights are individually modified, and the classification algorithm is reapplied to the weighted observations. At step $m$, those observations that were misclassified by the classifier $G_{m-1}(x)$ induced at the previous step have their weights increased, whereas the weights are decreased for those that were classified correctly. Thus, as iterations proceed, observations that are difficult to classify correctly receive ever-increasing influence. Each successive classifier is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence.

<img src="images/ab_algo_code.png" width="600">

The details of the AdaBoost algorithm are shown in the algorithm above. The current classifier $G_{m}(x)$ is induced on the weighted observations at line 2a. The resulting weighted error rate is computed at line 2b. Line 2c calculates the weight $\alpha_{m}$ given to $G_{m}(x)$ in producing the final classifier $G(x)$ (line 3). The individual weights of each of the observations are updated for the next iteration at line 2d. Observations misclassified by $G_{m}(x)$ have their weights scaled by a factor $\exp \left(\alpha_{m}\right)$, increasing their relative influence for inducing the next classifier $G_{m+1}(x)$ in the sequence.
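The steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the base learner is a depth-one scikit-learn tree (a stump), the classifier weight is taken as $\alpha_{m}=\log \left(\left(1-\mathrm{err}_{m}\right) / \mathrm{err}_{m}\right)$ as in the usual statement of AdaBoost.M1, and the weights are renormalized after each round purely for numerical stability. The function names `discrete_adaboost` and `adaboost_predict` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def discrete_adaboost(X, y, n_rounds=50):
    """Fit n_rounds stumps on reweighted data; y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # step 1: uniform observation weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        # 2a: fit the weak classifier to the weighted observations
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        # 2b: weighted error rate (clipped to avoid log(0) or division by zero)
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)
        # 2c: weight of this classifier in the final vote
        alpha = np.log((1.0 - err) / err)
        # 2d: scale the weights of misclassified observations by exp(alpha)
        w = w * np.exp(alpha * miss)
        w = w / w.sum()  # renormalization, added for numerical stability
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)


def adaboost_predict(stumps, alphas, X):
    """Step 3: sign of the alpha-weighted vote (ties are assigned to +1)."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(votes >= 0, 1, -1)


X, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y01 - 1  # recode {0, 1} as {-1, +1}

stumps, alphas = discrete_adaboost(X, y, n_rounds=50)
boosted_acc = np.mean(adaboost_predict(stumps, alphas, X) == y)
stump_acc = np.mean(stumps[0].predict(X) == y)
print(f"single stump: {stump_acc:.3f}  boosted committee: {boosted_acc:.3f}")
```

Both accuracies are measured on the training data, so they illustrate how the committee improves the fit, not how well it generalizes.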

*The presented algorithm is known as "Discrete AdaBoost" because the base classifier $G_{m}(x)$ returns a discrete class label.* If the base classifier instead returns a real-valued prediction (e.g., a probability mapped to the interval $[-1,1]$), AdaBoost can be modified appropriately; the resulting algorithm is called "Real AdaBoost."

You can have a look at the toy example in the book to have an idea about the power of boosting techniques in improving the performances of weak learners.

**Why is boosting so powerful?**

It is a way of fitting an additive expansion in a set of elementary "basis" functions. Here the basis functions are the individual classifiers $G_{m}(x) \in\{-1,1\}$. More generally, basis function expansions take the form
$$
f(x)=\sum_{m=1}^{M} \beta_{m} b\left(x ; \gamma_{m}\right),
$$
where $\beta_{m}, m=1,2, \ldots, M$ are the expansion coefficients, and $b(x ; \gamma) \in \mathbb{R}$ are usually simple functions of the multivariate argument $x$, characterized by a set of parameters $\gamma$. We discuss basis expansions in some detail in Chapter 5.

Additive expansions like this are at the heart of many of the learning techniques covered in this book:
- In single-hidden-layer neural networks, $b(x ; \gamma)=\sigma\left(\gamma_{0}+\gamma_{1}^{T} x\right)$, where $\sigma(t)=1 /\left(1+e^{-t}\right)$ is the sigmoid function, and $\gamma$ parameterizes a linear combination of the input variables.
- In signal processing, wavelets are a popular choice with $\gamma$ parameterizing the location and scale shifts of a "mother" wavelet.
- Multivariate adaptive regression splines use truncated power spline basis functions where $\gamma$ parameterizes the variables and values for the knots.
- For trees, $\gamma$ parameterizes the split variables and split points at the internal nodes and the predictions at the terminal nodes.

Typically, these models are fit by minimizing a loss function averaged over the training data, such as the squared-error or a likelihood-based loss function,
$$
\min _{\left\{\beta_{m}, \gamma_{m}\right\}_{1}^{M}} \sum_{i=1}^{N} L\left(y_{i}, \sum_{m=1}^{M} \beta_{m} b\left(x_{i} ; \gamma_{m}\right)\right) .
$$
For many loss functions $L(y, f(x))$ and/or basis functions $b(x; \gamma)$, this requires computationally intensive numerical optimization techniques. However, a simple alternative often can be found when it is feasible to rapidly solve the subproblem of fitting just a single basis function,
$$
\min _{\beta, \gamma} \sum_{i=1}^{N} L\left(y_{i}, \beta b\left(x_{i} ; \gamma\right)\right) .
$$

Other variants of boosting algorithms for trees are implemented in `scikit-learn`, and some variations of the same approach are more suitable for certain problems. The key concept that you should be familiar with at this point is the difference between **bagging** and **boosting**, keeping in mind that neither one necessarily outperforms the other. This is even more true in the world of finance.

### AdaBoost and Gradient Boosting: A Comparison

Both AdaBoost and Gradient Boosting are ensemble methods that aim to improve the predictive performance of weak learners by combining them into a single strong learner. However, there are significant differences in their approaches and underlying principles. Below, I outline their similarities and differences, specifically in the context of how they are implemented in Scikit-learn and XGBoost.

**AdaBoost**


*AdaBoost (Adaptive Boosting)* starts by fitting a base learner—often a decision tree with a single split, also known as a "stump"—to the original dataset. In each subsequent iteration, it adjusts the weights of the training instances according to the errors made in the previous iteration. The base learner is then refitted to this reweighted data.

*Weighting Errors*
In classification, misclassified samples receive higher weights so that subsequent base learners pay more attention to them. In regression, instances with larger errors have their weights increased.

*Combination*
The final prediction is derived from a weighted majority vote in classification tasks, and from a weighted sum in regression tasks.

*Model Complexity*
Typically employs simple base learners, like decision stumps.

*Objective Function*
Aims to minimize the weighted error rate, which could pertain to classification error or some continuous error in regression.

*Scikit-learn Implementation*
Available as `AdaBoostClassifier` for classification and `AdaBoostRegressor` for regression.



**Gradient Boosting**

*Gradient Boosting* also initiates by fitting a base learner to the original dataset. It then iteratively adds new models that focus on correcting the residuals (in regression) or the gradients of the loss function (in classification) for the combined ensemble of existing models.

*Gradient Descent*
Rather than altering instance weights as in AdaBoost, Gradient Boosting fits the new model to the residuals or gradients, performing a form of gradient descent in the function space.

*Combination*
The final prediction is a weighted sum of the base learners' predictions, whether for classification probabilities or regression outputs.

*Model Complexity*
Allows for more complex base learners, such as larger decision trees.

*Objective Function*
Typically optimized for a differentiable loss function, offering flexibility for different types of problems, be it classification or regression.

*Scikit-learn Implementation*
Implemented as `GradientBoostingClassifier` for classification and `GradientBoostingRegressor` for regression.
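To make the "fit the new model to the residuals" idea concrete, here is a minimal from-scratch sketch for squared-error regression, where the negative gradient of the loss is exactly the residual. It is an illustration under stated assumptions (synthetic data, depth-three trees, a fixed learning rate of 0.1), not the procedure used inside `GradientBoostingRegressor`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

learning_rate = 0.1
n_rounds = 100

# Start from the best constant model under squared error: the mean of y.
prediction = np.full(len(y), y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For the loss 0.5 * (y - f)^2 the negative gradient is the residual y - f.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    # Take a small step in function space along the fitted tree.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE: {initial_mse:.1f} -> {final_mse:.1f}")
```

To predict on new data, one would add `y.mean()` to `learning_rate` times the sum of the tree predictions.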

*XGBoost* is an optimized distributed gradient boosting library that is designed for high efficiency, flexibility, and portability. It often outperforms scikit-learn's Gradient Boosting in both speed and adaptability to various problem types.


**GradientBoosting vs. XGBoost**

There are some key differences between those two:
- **Algorithmic Enhancements**: XGBoost incorporates several algorithmic enhancements for tree pruning, regularization, and handling of missing data, among other things.
- **Optimization**: XGBoost is designed for speed and performance. It is engineered to be distributed and can be parallelized across clusters, which scikit-learn's Gradient Boosting is not natively designed for.
- **Flexibility**: XGBoost is generally more flexible, allowing for custom objective functions and evaluation criteria, among other features.
- **Regularization**: XGBoost has an additional regularization term in the objective function, which helps to reduce overfitting. Scikit-learn's implementation does not have this feature by default.
- **Handling Missing Data**: XGBoost has a built-in routine to handle missing data, while in scikit-learn, you would typically need to handle missing data during the preprocessing stage.
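- **Early Stopping**: XGBoost allows for early stopping during the training process. In scikit-learn's `GradientBoostingClassifier` and `GradientBoostingRegressor`, early stopping is disabled by default and has to be switched on through the `n_iter_no_change` parameter.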
- **Support for Various Types of Problems**: XGBoost can be used for regression, classification, ranking, and user-defined prediction problems. Scikit-learn's Gradient Boosting is not as flexible for different kinds of specialized problems.
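On the early-stopping point, scikit-learn's `GradientBoostingClassifier` does expose an opt-in mechanism: when `n_iter_no_change` is set, a fraction of the training data (`validation_fraction`) is held out and training stops once the validation score stops improving. The sketch below uses synthetic data and arbitrary settings chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of boosting rounds
    learning_rate=0.1,
    validation_fraction=0.1,  # share of the training data held out internally
    n_iter_no_change=5,       # stop after 5 rounds without improvement
    random_state=0,
)
gb.fit(X, y)

# n_estimators_ reports how many rounds were actually fitted.
print(gb.n_estimators_)
```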



#### Summary

- **Commonality**: Both are boosting algorithms that combine multiple weak learners to create a strong learner.

- **Key Difference**: AdaBoost focuses on training instances that are hard to predict, whereas Gradient Boosting focuses on correcting the errors of the combined ensemble.

- **Flexibility**: Gradient Boosting is generally more flexible and can be optimized for a variety of loss functions. This makes it applicable to a wider array of problems compared to AdaBoost.

By understanding these differences and similarities, you can make a more informed choice between AdaBoost, Gradient Boosting in scikit-learn, and XGBoost based on your specific requirements.


# Detecting Credit Card Default with Tree Ensembles 

In [None]:
# Standard Libraries
import warnings
import random
from io import StringIO

# Data Manipulation and Analysis
import numpy as np
import pandas as pd

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
import missingno

# Preprocessing and Feature Engineering
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import category_encoders as ce

# Modeling and Evaluation
from sklearn.tree import DecisionTreeClassifier, export_graphviz, plot_tree
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier
import sklearn.metrics as metrics
from sklearn.model_selection import (
    GridSearchCV, 
    RandomizedSearchCV, 
    cross_val_score, 
    cross_validate, 
    StratifiedKFold
)
from xgboost import XGBClassifier

# Graphing and Visualization Tools
import pydotplus

# Global Settings
warnings.simplefilter(action="ignore", category=FutureWarning)


## Exploratory Data Analysis

[UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)

In [None]:
df = pd.read_excel("data/credit_card_default.xls", skiprows=1, index_col=0)

In [None]:
df

Get summary statistics for numeric variables:

In [None]:
df.describe().transpose().round(2)

Plot the distribution of age and split it by gender:

In [None]:
# Create a Violin Plot using Seaborn
sns.violinplot(x="SEX", y="AGE", data=df, inner="quartile")
plt.title("Distribution of Age by Gender")
plt.xlabel("Gender")
plt.ylabel("Age")

In [None]:
df['SEX'].value_counts()

In [None]:
df['AGE'].unique()

In [None]:
# Create a histogram using Seaborn's displot
ax = sns.displot(
    data=df,
    # bins=20,
    x='AGE',
    kind='hist',
    hue='SEX',
    palette={1: "blue", 2: "red"} 
)
ax.set(title='Distribution of Age by Gender', xlabel='Age', ylabel='Frequency')

We notice some spikes appearing every ~10 years and the reason for this is the binning. Below, we create the same histogram using `sns.countplot`. By doing so, each value of age has a separate bin and we can inspect the plot in detail. There are no such spikes in the following plot:

In [None]:
plot_ = sns.countplot(x=df['AGE'], color="blue")

for ind, label in enumerate(plot_.get_xticklabels()):
    if int(float(label.get_text())) % 10 == 0:
        label.set_visible(True)
    else:
        label.set_visible(False)

Plot a `pairplot` of selected variables:

In [None]:
df.columns

In [None]:
pair_plot = sns.pairplot(df[["AGE", "LIMIT_BAL", "PAY_2"]])
pair_plot.fig.suptitle("Pairplot of selected variables", y=1.05)

Additionally, we can separate the genders by specifying the `hue` argument:

In [None]:
pair_plot = sns.pairplot(
    data=df, 
    x_vars = ["AGE", "LIMIT_BAL", "PAY_2"],
    y_vars = ["AGE", "LIMIT_BAL", "PAY_2"],
    hue="SEX",
    palette={1: "blue", 2: "red"} 
)

Plot the correlation heatmap:

In [None]:
def plot_correlation_matrix(corr_mat):
    """
    Function for plotting the correlation heatmap. It masks the irrelevant fields.

    Parameters
    ----------
    corr_mat : pd.DataFrame
        Correlation matrix of the features.
    """

    # temporarily change style
    sns.set(style="white")
    # mask the upper triangle
    mask = np.zeros_like(corr_mat, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    # set up the matplotlib figure
    fig, ax = plt.subplots(figsize=(10, 8))
    # set up custom diverging colormap
    cmap = sns.diverging_palette(240, 10, n=9, as_cmap=True)
    # plot the heatmap
    sns.heatmap(
        corr_mat,
        mask=mask,
        cmap=cmap,
        vmax=0.3,
        center=0,
        square=True,
        linewidths=0.5,
        cbar_kws={"shrink": 0.5},
        ax=ax,
    )
    ax.set_title("Correlation Matrix", fontsize=16)
    # change back to darkgrid style
    sns.set(style="darkgrid")

In [None]:
corr_mat = df.select_dtypes(include="number").corr()
plot_correlation_matrix(corr_mat)

We can also directly inspect the correlation between the features (numerical) and the target:

In [None]:
df.select_dtypes(include="number").corr()[['default payment next month']]

Plot the distribution of limit balance for each gender and education level:

In [None]:
ax = sns.violinplot(x="EDUCATION", y="LIMIT_BAL", hue="SEX", split=True, data=df)
ax.set_title("Distribution of limit balance per education level", fontsize=16)

The following code plots the same information, without splitting the violin plots.

In [None]:
ax = sns.violinplot(x='EDUCATION', y='LIMIT_BAL',
                    hue='SEX', data=df)
ax.set_title('Distribution of limit balance per education level',
             fontsize=16);

Investigate the distribution of the target variable per gender and education level:

In [None]:
ax = sns.countplot(x="default payment next month", hue="SEX", data=df, orient="h")
ax.set_title("Distribution of the target variable", fontsize=16)


In [None]:
df['EDUCATION'].value_counts()

In [None]:
ax = sns.countplot(x="default payment next month", hue="EDUCATION", data=df, orient="h")
ax.set_title("Distribution of the target variable", fontsize=16)


## Splitting the data into training and test sets

Separate the features from the target:

In [None]:
X = df.copy()
y = X.pop("default payment next month")

In [None]:
y

In [None]:
y.value_counts()/y.shape[0]

Split the data into training and test sets:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5475, stratify=None
)

In [None]:
X.head()

In [None]:
X_train.head()

Is it ok to split this way?

In [None]:
y_train.value_counts()/y_train.shape[0]

In [None]:
X.sort_values("SEX")

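Split again with shuffling (`shuffle=True` is the default); no `random_state` is set here, so the split changes from run to run: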

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

In [None]:
X.head()

In [None]:
X_train.head()

Verify that the ratio of the target is preserved:

In [None]:
y_train.value_counts(normalize=True)

In [None]:
y_test.value_counts(normalize=True)
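If the class ratio should be preserved exactly rather than only approximately, `train_test_split` accepts a `stratify` argument. The sketch below uses a small synthetic target with an arbitrary positive rate of about 22% (an assumption for illustration, not a statistic of the credit card data); the `_demo`, `_tr` and `_te` names are hypothetical and chosen so that they do not overwrite the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced binary target and a single dummy feature.
y_demo = pd.Series((rng.random(1000) < 0.22).astype(int))
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})

# stratify=y_demo keeps the class proportions equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(f"overall: {y_demo.mean():.3f}")
print(f"train:   {y_tr.mean():.3f}")
print(f"test:    {y_te.mean():.3f}")
```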

### If we want a validation set

In [None]:
# # define the size of the validation and test sets
# VALID_SIZE = 0.1
# TEST_SIZE = 0.2

# # create the initial split - training and temp
# X_train, X_temp, y_train, y_temp = train_test_split(X, y,
#                                                     test_size=(VALID_SIZE + TEST_SIZE),
#                                                     random_state=42)

# # calculate the new test size
# NEW_TEST_SIZE = np.around(TEST_SIZE / (VALID_SIZE + TEST_SIZE), 2)

# # create the valid and test sets
# X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp,
#                                                     test_size=NEW_TEST_SIZE,
#                                                     random_state=42)

## Fitting a single decision tree classifier

In [None]:
def performance_evaluation_report(
    model, X_test, y_test, show_plot=False, labels=None, show_pr_curve=False
):
    """
    Function for creating a performance report of a classification model.

    Parameters
    ----------
    model : scikit-learn estimator
        A fitted estimator for classification problems.
    X_test : pd.DataFrame
        DataFrame with features matching y_test
    y_test : array/pd.Series
        Target of a classification problem.
    show_plot : bool
        Flag whether to show the plot
    labels : list
        List with the class names.
    show_pr_curve : bool
        Flag whether to also show the PR-curve. For this to take effect,
        show_plot must be True.

    Returns
    -------
    stats : dict
        A dictionary with the most important evaluation metrics
    """

    y_pred = model.predict(X_test)
    y_pred_prob = model.predict_proba(X_test)[:, 1]

    cm = metrics.confusion_matrix(y_test, y_pred)
    tn, fp, fn, tp = cm.ravel()

    fpr, tpr, threshold = metrics.roc_curve(y_test, y_pred_prob)
    roc_auc = metrics.auc(fpr, tpr)

    precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_pred_prob)
    pr_auc = metrics.auc(recall, precision)

    if show_plot:
        if labels is None:
            labels = ["Negative", "Positive"]

        N_SUBPLOTS = 3 if show_pr_curve else 2
        PLOT_WIDTH = 15 if show_pr_curve else 12
        PLOT_HEIGHT = 5 if show_pr_curve else 6

        fig, ax = plt.subplots(1, N_SUBPLOTS, figsize=(PLOT_WIDTH, PLOT_HEIGHT))
        fig.suptitle("Performance Evaluation", fontsize=16)

        sns.heatmap(
            cm,
            annot=True,
            fmt="d",
            linewidths=0.5,
            cmap="BuGn_r",
            square=True,
            cbar=False,
            ax=ax[0],
            annot_kws={"ha": "center", "va": "center"},
        )
        ax[0].set(
            xlabel="Predicted label", ylabel="Actual label", title="Confusion Matrix"
        )
        ax[0].xaxis.set_ticklabels(labels)
        ax[0].yaxis.set_ticklabels(labels)

        ax[1].plot(fpr, tpr, "b-", label=f"ROC-AUC = {roc_auc:.2f}")
        ax[1].set(
            xlabel="False Positive Rate", ylabel="True Positive Rate", title="ROC Curve"
        )
        ax[1].plot(
            fp / (fp + tn), tp / (tp + fn), "ro", markersize=8, label="Decision Point"
        )
        ax[1].plot([0, 1], [0, 1], "r--")
        ax[1].legend(loc="lower right")

        if show_pr_curve:
            ax[2].plot(recall, precision, label=f"PR-AUC = {pr_auc:.2f}")
            ax[2].set(
                xlabel="Recall", ylabel="Precision", title="Precision-Recall Curve"
            )
            ax[2].legend()

    stats = {
        "accuracy": metrics.accuracy_score(y_test, y_pred),
        "precision": metrics.precision_score(y_test, y_pred),
        "recall": metrics.recall_score(y_test, y_pred),
        "specificity": (tn / (tn + fp)),
        "f1_score": metrics.f1_score(y_test, y_pred),
        "cohens_kappa": metrics.cohen_kappa_score(y_test, y_pred),
        "roc_auc": roc_auc,
        "pr_auc": pr_auc,
    }

    return stats

Create an instance of the model, fit it to the training data, and generate predictions:

In [None]:
tree_classifier = DecisionTreeClassifier(random_state=42)
tree_classifier.fit(X_train, y_train)
y_pred = tree_classifier.predict(X_test)

In [None]:
y_pred

Evaluate the results:

In [None]:
LABELS = ["No Default", "Default"]
tree_perf = performance_evaluation_report(
    tree_classifier, X_test, y_test, labels=LABELS, show_plot=True
)

In [None]:
tree_perf

Plot the Decision Tree:

In [None]:
small_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
small_tree.fit(X_train, y_train)


plt.figure(figsize=(12, 8))
plot_tree(small_tree, feature_names=list(X_train.columns), class_names=LABELS, rounded=True, proportion=False, precision=2, filled=True)

In [None]:
y_pred_prob = tree_classifier.predict_proba(X_test)[:,1]

In [None]:
precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_pred_prob)

In [None]:
ax = plt.subplot()
ax.plot(recall, precision, label=f"PR-AUC = {metrics.auc(recall, precision):.2f}")
ax.set(title="Precision-Recall Curve", xlabel="Recall", ylabel="Precision")
ax.legend()

## Fitting a Random Forest

Create an instance of the model, fit it to the training data, and generate predictions:

In [None]:
X_train.shape

In [None]:
rf_classifier = RandomForestClassifier(
    n_estimators=100, max_features=10, n_jobs=-1, random_state=42
)
%time rf_classifier.fit(X_train, y_train)
y_pred_rf = rf_classifier.predict(X_test)

Evaluate the results:

In [None]:
rf_perf = performance_evaluation_report(
    rf_classifier, X_test, y_test, labels=LABELS, show_plot=True
)

plt.tight_layout()

In [None]:
rf_perf

## Fitting AdaBoost

In [None]:
adaboost_classifier = AdaBoostClassifier(
    n_estimators=50, learning_rate=0.1, random_state=0
)
%time adaboost_classifier.fit(X_train, y_train)
y_pred_adaboost = adaboost_classifier.predict(X_test)

In [None]:
adaboost_perf = performance_evaluation_report(
    adaboost_classifier, X_test, y_test, labels=LABELS, show_plot=True
)

In [None]:
adaboost_perf

## Fitting a Gradient Boosting algorithm

In [None]:
boost_classifier = GradientBoostingClassifier(
    learning_rate=0.1, max_depth=3, n_estimators=50, random_state=0
)
%time boost_classifier.fit(X_train, y_train)
y_pred_boost = boost_classifier.predict(X_test)

In [None]:
boost_perf = performance_evaluation_report(
    boost_classifier, X_test, y_test, labels=LABELS, show_plot=True
)

In [None]:
boost_perf

In [None]:
xgb_classifier = XGBClassifier(
    learning_rate=0.1, max_depth=3, n_estimators=50, random_state=0
)
%time xgb_classifier.fit(X_train, y_train)
y_pred_xgb = xgb_classifier.predict(X_test)

In [None]:
xgboost_perf = performance_evaluation_report(
    xgb_classifier, X_test, y_test, labels=LABELS, show_plot=True
)

In [None]:
xgboost_perf

## Tuning hyperparameters 

### Cross-validation

The code below is designed to visualize how various cross-validation (CV) strategies from `scikit-learn` work. In this specific example, the same X, y, and groups are used to demonstrate the different behaviors of CV strategies. The idea is to show how each CV strategy divides the same dataset into training and test sets, given the same target labels (y) and group labels (groups).

The target labels (y) and group labels (groups) are visualized at the bottom of each plot to show what they look like relative to the CV splits. This makes it easier to understand how each CV strategy works in the context of the class labels and groupings. It also helps to show how strategies like StratifiedKFold or GroupKFold use this additional information (y and groups respectively) to create splits.

For example:

`StratifiedKFold` ensures that the proportion of each class in the target variable is the same in both the training and test sets.

`GroupKFold` ensures that the same group is not present in both training and test sets.

By using the same y and groups for each CV strategy, it becomes easier to compare and contrast how they each work and how they handle the classes and groups. In real-world applications, you would choose the most appropriate CV strategy based on the specific characteristics of your data (e.g., if you have imbalanced classes, grouped data, time series data, etc.).

**What are those groups?**


In some machine learning applications, the data might have a group structure. For example, if you have medical data, multiple samples might come from the same patient. Or in finance, multiple data points might come from the same company or the same time period. This introduces correlation among the samples.

In such scenarios, it's important to ensure that all samples corresponding to a single group are either in the training set or in the test set but not both, in order to get an unbiased estimate of the generalization performance. This is known as group-wise or group-based cross-validation.

GroupKFold, GroupShuffleSplit, and other such cross-validation techniques from scikit-learn expect an additional groups array that specifies the group labels for each sample. The groups parameter ensures that the same group is not represented in both the training and test sets. This helps to prevent data leakage and results in a more robust model evaluation.

For instance, if you have data from 10 patients and each patient has 10 samples, the groups array would look something like [1, 1, 1, ..., 2, 2, ..., 10, 10], where the number indicates the patient ID for each sample.

To summarize, the groups parameter is used to specify which samples belong to the same "group," so that during the splitting process, samples from the same group are kept together, either entirely in the training set or in the test/validation set.
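
As a minimal sketch of the patient example above, the snippet below builds a hypothetical `demo_groups` array (10 patients, 10 samples each; the names and the dummy feature matrix are assumptions made for illustration) and checks that `GroupKFold` never places the same patient on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 10 patients with 10 samples each
n_patients, samples_per_patient = 10, 10
demo_groups = np.repeat(np.arange(1, n_patients + 1), samples_per_patient)
demo_X = np.zeros((demo_groups.size, 3))  # feature values do not affect the split

group_cv = GroupKFold(n_splits=5)
shared_groups = []
for train_idx, test_idx in group_cv.split(demo_X, groups=demo_groups):
    train_groups = set(demo_groups[train_idx].tolist())
    test_groups = set(demo_groups[test_idx].tolist())
    # Patients that appear on both sides of the split (should be none)
    shared_groups.append(train_groups & test_groups)
    print(f"test patients: {sorted(test_groups)}")
```

Every entry of `shared_groups` should be an empty set, which is exactly the guarantee that protects against leakage between correlated samples.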

**What is the stratification?**

In machine learning, it is often important that the training and test sets have similar properties. One such property is the distribution of the target labels (y).

In a stratified sampling approach, the training and test sets are constructed so that they have approximately the same distribution of target labels as the complete dataset. This is particularly useful when the target labels are imbalanced; for example, in binary classification problems where one class is much less frequent than the other.

In StratifiedKFold, each fold is made by preserving the percentage of samples for each class. The algorithm ensures that each fold has the same distribution of the target labels as the entire dataset. To accomplish this, it needs access to the target labels (y) to know how to perform the stratified sampling.

So, when using StratifiedKFold or any stratified sampling technique, you must provide y (the target labels) so that the cross-validator can arrange the folds to have a similar distribution of classes.
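
To make the effect of stratification concrete, here is a small sketch on hypothetical, deliberately sorted labels (`demo_y`, 90 negatives followed by 10 positives; an assumption for illustration, not the lecture data). It compares the share of positives in each test fold for plain `KFold` and for `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives
demo_y = np.array([0] * 90 + [1] * 10)
demo_X = np.zeros((demo_y.size, 1))  # feature values do not affect the split


def positive_share_per_fold(cv):
    """Return the fraction of positive labels in each test fold."""
    return [float(demo_y[test_idx].mean()) for _, test_idx in cv.split(demo_X, demo_y)]


plain_shares = positive_share_per_fold(KFold(n_splits=5))
stratified_shares = positive_share_per_fold(StratifiedKFold(n_splits=5))
print("KFold:          ", plain_shares)
print("StratifiedKFold:", stratified_shares)
```

Without shuffling, `KFold` cuts the sorted labels into consecutive blocks, so most test folds contain no positives at all, while `StratifiedKFold` keeps the 10% share in every fold.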

In [None]:
from sklearn.model_selection import (TimeSeriesSplit, RepeatedKFold, KFold, ShuffleSplit,
                                     StratifiedKFold, GroupShuffleSplit,
                                     GroupKFold, StratifiedShuffleSplit)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
np.random.seed(3)
cmap_data = plt.cm.Paired
cmap_cv = plt.cm.coolwarm
n_splits = 5

# Generate the class/group data
n_points = 100
X = np.random.randn(n_points, 10)

# Share of samples in each class (10%, 30%, 60%)
percentiles_classes = [.1, .3, .6]
y = np.hstack([[ii] * int(n_points * perc)
               for ii, perc in enumerate(percentiles_classes)])

# Evenly spaced groups repeated once
groups = np.hstack([[ii] * 10 for ii in range(10)])

def plot_cv_indices(cv, X, y, group, ax, n_splits, lw=10):
    """Create a sample plot for indices of a cross-validation object."""

    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=group)):
        # Fill in indices with the training/test groups
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        # Visualize the results
        ax.scatter(range(len(indices)), [ii + .5] * len(indices),
                   c=indices, marker='_', lw=lw, cmap=cmap_cv,
                   vmin=-.2, vmax=1.2)
        
    yticklabels = list(range(n_splits))
    
    add_yticks = 0
    if y is not None:
        # Plot the data classes and groups at the end
        ax.scatter(range(len(X)), [ii + 1.5] * len(X),
                   c=y, marker='_', lw=lw, cmap=cmap_data)
        yticklabels = yticklabels + ['class']
        add_yticks = add_yticks + 1
    
    if group is not None:
        ax.scatter(range(len(X)), [ii + 2.5] * len(X),
                   c=group, marker='_', lw=lw, cmap=cmap_data)
        yticklabels = yticklabels + ['group']
        add_yticks = add_yticks + 1

    # Formatting

    ax.set(yticks=np.arange(n_splits+add_yticks) + .5, yticklabels=yticklabels,
           xlabel='Sample index', ylabel="CV iteration",
           ylim=[n_splits+add_yticks+0.2, -.2], xlim=[0, 100])
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)
    ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.02))],
              ['Testing set', 'Training set'], loc=(1.02, .8))
    return ax

cvs = [KFold, GroupKFold, ShuffleSplit, StratifiedKFold,
       GroupShuffleSplit, StratifiedShuffleSplit, TimeSeriesSplit]


for cv in cvs:
    this_cv = cv(n_splits=n_splits)
    fig, ax = plt.subplots(figsize=(18, 6))
    # plot_cv_indices already draws the legend on the axis
    plot_cv_indices(this_cv, X, y, groups, ax, n_splits)

    # Make the legend fit
    plt.tight_layout()
    fig.subplots_adjust(right=.7)
plt.show()

In [None]:
k_fold = KFold(5, shuffle=True, random_state=42)

In [None]:
k_fold

In [None]:
rf_classifier = RandomForestClassifier(
    n_estimators=100, max_features=10, n_jobs=-1, random_state=42
)

In [None]:
rf_classifier

Evaluate the random forest classifier using cross-validation (the same can be done for any of the classifiers we have used so far):

In [None]:
cross_val_score(rf_classifier, X_train, y_train, cv=k_fold, scoring="recall")

Add extra metrics to cross-validation:

In [None]:
cross_validate(
    rf_classifier,
    X_train,
    y_train,
    cv=k_fold,
    scoring=["accuracy", "precision", "recall", "roc_auc"],
)
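
`cross_validate` returns a dictionary with one array per metric (keys of the form `test_<metric>`, plus `fit_time` and `score_time`), holding one value per fold. A minimal sketch of how this output could be summarized is given below; it runs on a synthetic dataset from `make_classification` as a stand-in for the credit card data, and all `demo_*` names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the credit card data (assumption, not the lecture dataset)
demo_X, demo_y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=42
)

demo_scores = cross_validate(
    RandomForestClassifier(n_estimators=20, random_state=42),
    demo_X,
    demo_y,
    cv=KFold(5, shuffle=True, random_state=42),
    scoring=["accuracy", "precision", "recall", "roc_auc"],
)

# Keep only the metric arrays and report mean and standard deviation across folds
demo_summary = {
    name: (values.mean(), values.std())
    for name, values in demo_scores.items()
    if name.startswith("test_")
}
for name, (mean, std) in demo_summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Looking at the spread across folds, and not only at the mean, gives a sense of how stable each metric is.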

### Grid search

Define the parameter grid:

In [None]:
rf_classifier = RandomForestClassifier(
    max_features=10, n_jobs=-1, random_state=42
)

In [None]:
param_grid = {
    "criterion": ["entropy", "gini"],
    "max_depth": range(3, 6),
    # "min_samples_leaf": range(2, 6),
    "n_estimators": [10, 20],
}

In [None]:
param_grid

Run Grid Search:

In [None]:
classifier_gs = GridSearchCV(
    rf_classifier, param_grid, scoring="recall", cv=k_fold, n_jobs=-1, verbose=1
)

classifier_gs.fit(X_train, y_train)

In [None]:
print(f"Best parameters: {classifier_gs.best_params_}")
# best_score_ is the mean cross-validated recall of the best parameter set
print(f"Recall (cross-validation): {classifier_gs.best_score_:.4f}")
print(
    f"Recall (Test set): {metrics.recall_score(y_test, classifier_gs.predict(X_test)):.4f}"
)
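
Beyond the single best configuration, the fitted search object stores the scores of every candidate in `cv_results_`. The sketch below, run on synthetic data as a stand-in for the credit card dataset (the `demo_*` names and the reduced grid are assumptions for illustration), lists the mean and standard deviation of the cross-validated recall per candidate and shows that `best_score_` is the mean validation score of the best candidate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the credit card data (assumption, not the lecture dataset)
demo_X, demo_y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=42
)

demo_gs = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"max_depth": [3, 5], "n_estimators": [10, 20]},  # small illustrative grid
    scoring="recall",
    cv=KFold(5, shuffle=True, random_state=42),
)
demo_gs.fit(demo_X, demo_y)

# One entry per parameter combination, averaged over the validation folds
demo_results = demo_gs.cv_results_
for params, mean, std in zip(
    demo_results["params"],
    demo_results["mean_test_score"],
    demo_results["std_test_score"],
):
    print(f"{params}: recall = {mean:.3f} +/- {std:.3f}")

# best_score_ is the mean validation score at best_index_
demo_best_mean = demo_results["mean_test_score"][demo_gs.best_index_]
```

Comparing the candidates this way helps to judge whether the winner is clearly better or merely within the fold-to-fold noise of its neighbours.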

Evaluate the performance of the Grid Search:

In [None]:
LABELS = ["No Default", "Default"]
tree_gs_perf = performance_evaluation_report(
    classifier_gs, X_test, y_test, labels=LABELS, show_plot=True
)

plt.tight_layout()
plt.savefig("images/ch8_im20.png")
plt.show()

In [None]:
tree_gs_perf

Run Randomized Search:

In [None]:
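# param_grid holds only 12 combinations (2 x 3 x 2), so n_iter has to stay below 12
# for the search to sample a subset instead of repeating the full grid search
classifier_rs = RandomizedSearchCV(rf_classifier, param_grid, scoring='recall',
                                   cv=k_fold, n_jobs=-1, verbose=1,
                                   n_iter=8, random_state=42)
classifier_rs.fit(X_train, y_train)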

In [None]:
print(f'Best parameters: {classifier_rs.best_params_}')
# best_score_ is the mean cross-validated recall of the best parameter set
print(f'Recall (cross-validation): {classifier_rs.best_score_:.4f}')
print(f'Recall (Test set): {metrics.recall_score(y_test, classifier_rs.predict(X_test)):.4f}')

In [None]:
tree_rs_perf = performance_evaluation_report(classifier_rs, X_test,
                                             y_test, labels=LABELS,
                                             show_plot=True)

In [None]:
tree_rs_perf
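
Randomized search becomes most useful when the search space is too large to enumerate. As a hedged sketch, the example below replaces the fixed lists with `scipy.stats.randint` distributions, from which `RandomizedSearchCV` draws `n_iter` candidates. It runs on synthetic data, and the ranges and `demo_*` names are assumptions chosen for illustration, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for the credit card data (assumption, not the lecture dataset)
demo_X, demo_y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=42
)

demo_param_distributions = {
    "criterion": ["entropy", "gini"],  # lists are sampled uniformly
    "max_depth": randint(3, 11),       # integers 3..10 (upper bound is exclusive)
    "n_estimators": randint(10, 51),   # integers 10..50
}

demo_rs = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    demo_param_distributions,
    n_iter=8,  # number of sampled candidates
    scoring="recall",
    cv=KFold(5, shuffle=True, random_state=42),
    random_state=42,
)
demo_rs.fit(demo_X, demo_y)
print(f"Best parameters: {demo_rs.best_params_}")
print(f"Recall (cross-validation): {demo_rs.best_score_:.4f}")
```

The computational budget is controlled directly through `n_iter`, independently of how many values each hyperparameter can take.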