[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/francisco-ortin/data-science-course/blob/main/classification/ensemble.ipynb)
[![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# Ensemble methods

This notebook covers two topics: ensemble methods and how to compare the performance of different models. Ensemble methods are machine learning techniques that combine several models to improve on the performance of a single model. To compare the performance of different models, we have to use statistical tests, since the models are trained with stochastic procedures.
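
As a minimal illustration of what "combining several models" can mean, the sketch below applies hard majority voting to made-up predictions from three hypothetical classifiers. It is only one simple combination scheme, and the data are invented for the example. The Random Forest and XGBoost models used later in this notebook combine decision trees in more elaborate ways (bagging and boosting, respectively).

```python
import numpy as np

# Made-up ground truth and 0/1 predictions from three hypothetical
# classifiers (rows) for six samples (columns). Each classifier makes
# exactly one mistake, on a different sample.
y_true = np.array([1, 0, 1, 1, 0, 0])
predictions = np.array([
    [0, 0, 1, 1, 0, 0],  # wrong on sample 0
    [1, 1, 1, 1, 0, 0],  # wrong on sample 1
    [1, 0, 0, 1, 0, 0],  # wrong on sample 2
])

def majority_vote(predictions):
    """Predict 1 for each sample when more than half of the models predict 1."""
    n_models = predictions.shape[0]
    return (predictions.sum(axis=0) > n_models / 2).astype(int)

ensemble = majority_vote(predictions)
for i, single in enumerate(predictions):
    print(f"Model {i} accuracy: {np.mean(single == y_true):.3f}")
print(f"Majority-vote accuracy: {np.mean(ensemble == y_true):.3f}")
```

In this idealized example the errors of the individual models do not coincide, so the vote corrects all of them. With real models the errors are usually correlated and the gain is smaller.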

We use the [Titanic Disaster dataset](https://www.kaggle.com/c/titanic/data?select=test.csv) stored in `data/titanic.csv`. The dataset has the following features:
- PassengerId: unique identifier for each passenger.
- Survived: target variable (0 = No, 1 = Yes).
- Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
- Name: name of the passenger.
- Sex: "male" or "female".
- Age: age in years.
- SibSp: number of siblings/spouses aboard.
- Parch: number of parents/children aboard.
- Ticket: ticket number.
- Fare: passenger fare.
- Cabin: cabin number.
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

In [14]:
# make sure the required packages are installed
%pip install pandas seaborn matplotlib scikit-learn xgboost --quiet
# if running in colab, install the required packages and copy the necessary files
directory='data-science-course/classification'
if get_ipython().__class__.__module__.startswith('google.colab'):
    !git clone https://github.com/francisco-ortin/data-science-course.git  2>/dev/null
    !cp --update {directory}/*.py .
    !mkdir -p img data
    !cp {directory}/data/* data/.

# import the required modules
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.utils import resample
from time import time

import utils

random_state = 42

Note: you may need to restart the kernel to use updated packages.


## Dataset

Let's load and clean the dataset.

In [15]:
dataset_file_name = 'data/titanic.csv'
independent_vars = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked']
dependent_var = 'Survived'
class_names = ['Not Survived', 'Survived']
# load and clean the dataset
dataset, independent_vars = utils.load_clean_titanic_dataset(dataset_file_name, independent_vars, dependent_var)
# Split the dataset into training and testing sets
(X_train, y_train), (X_test, y_test) = utils.split_dataset(dataset, independent_vars, dependent_var,
                            0.6, random_state)  # test size of 0.6 (train size of 0.4), chosen on purpose to have higher variance

## Different binary classification models

We create different binary classification models with the default hyperparameters. Note that we are not tuning the hyperparameters, so the performance of the models could probably be improved. In particular, the Random Forest and XGBoost models have many hyperparameters that could be tuned.

In [16]:
print(f"{'-'*5} Single-value evaluation method {'-'*5}")
time_before = time()
# Let's train different decision-tree-based models
dt_model = DecisionTreeClassifier(random_state=random_state)
dt_model.fit(X_train, y_train)

rf_model = RandomForestClassifier(random_state=random_state)
rf_model.fit(X_train, y_train)

xgb_model = xgb.XGBClassifier(random_state=random_state)
xgb_model.fit(X_train, y_train)

# Evaluate the models
utils.evaluate_models([dt_model, rf_model, xgb_model], X_test, y_test)
print(f"Time elapsed: {time() - time_before:.2f} seconds.")

----- Single-value evaluation method -----
Model: DecisionTreeClassifier.
	Accuracy: 0.7495.
	F1 Score: 0.6854.
Model: RandomForestClassifier.
	Accuracy: 0.7664.
	F1 Score: 0.7126.
Model: XGBClassifier.
	Accuracy: 0.7757.
	F1 Score: 0.7260.
Time elapsed: 0.29 seconds.


## ✨ Questions ✨

1. Can we state that XGBoost is the best model?
2. Set the random_state to None and run the code many times. Is XGBoost always the best model? What is happening?
3. What could we do to improve the comparison?

### Answers

*Write your answers here.*



## Model comparison

### Method 1: Confidence intervals

The first method we use to compare models is to train and evaluate them N times and compute the 95% confidence intervals of their metrics.

In [17]:
# Evaluate the performance of the models n_times and store the evaluation results in accuracies and f1_scores
print(f"\n{'-'*5} Re-train and re-evaluate method {'-'*5}")
time_before = time()
n_times = 30
accuracies = dict()
f1_scores = dict()
for _ in range(n_times):
    (temp_X_train, temp_y_train), (temp_X_test, temp_y_test) = utils.split_dataset(dataset, independent_vars, dependent_var,
                                                               0.6, random_state=None)
    dt_model = DecisionTreeClassifier(random_state=None)
    dt_model.fit(temp_X_train, temp_y_train)
    rf_model = RandomForestClassifier(random_state=None)
    rf_model.fit(temp_X_train, temp_y_train)
    xgb_model = xgb.XGBClassifier(random_state=None)
    xgb_model.fit(temp_X_train, temp_y_train)
    models = [dt_model, rf_model, xgb_model]
    metrics = utils.evaluate_models(models, temp_X_test, temp_y_test, verbose=False)
    for model, (accuracy, f1_score_value) in zip(models, metrics):
        accuracies.setdefault(model.__class__.__name__, []).append(accuracy)
        f1_scores.setdefault(model.__class__.__name__, []).append(f1_score_value)


----- Re-train and re-evaluate method -----


We now compute and print the confidence intervals of each model and metric.

In [18]:
for model_name in accuracies:
    accuracy_mean, accuracy_confidence_interval = utils.confidence_interval(accuracies[model_name], 0.95)
    f1_score_mean, f1_score_confidence_interval = utils.confidence_interval(f1_scores[model_name], 0.95)
    print(f"Model: {model_name}.")
    print(f"\tAccuracy mean: {accuracy_mean:.4f}. CI: {accuracy_confidence_interval}.")
    print(f"\tF1 Score: {f1_score_mean:.4f}. CI: {f1_score_confidence_interval}.")

print(f"Time elapsed: {time() - time_before:.2f} seconds.")

Model: DecisionTreeClassifier.
	Accuracy mean: 0.7680. CI: (0.7591361987943405, 0.776938567560799).
	F1 Score: 0.6821. CI: (0.6683912694960683, 0.6957462759811113).
Model: RandomForestClassifier.
	Accuracy mean: 0.7720. CI: (0.7642338617189826, 0.7796913719258778).
	F1 Score: 0.6929. CI: (0.6825579938557793, 0.7031504764987067).
Model: XGBClassifier.
	Accuracy mean: 0.7829. CI: (0.7763415583106815, 0.7893905289167328).
	F1 Score: 0.7062. CI: (0.6975814564985595, 0.714759473542269).
Time elapsed: 7.43 seconds.
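
The implementation of `utils.confidence_interval` is not shown in this notebook. As a rough sketch of how a confidence interval for the mean of N scores can be computed, the hypothetical function `mean_confidence_interval` below uses Student's t distribution from SciPy. This is an assumption about one reasonable approach, not necessarily what the course helper does, and the scores are made up.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, (lower, upper)) for the mean of `values`, using Student's t."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = stats.sem(values)  # standard error of the mean (sample std, ddof=1)
    # Two-sided critical value of the t distribution with n - 1 degrees of freedom
    t_critical = stats.t.ppf((1 + confidence) / 2, df=len(values) - 1)
    half_width = t_critical * sem
    return mean, (mean - half_width, mean + half_width)

# Made-up accuracies, standing in for the values collected over N runs
scores = [0.76, 0.78, 0.77, 0.75, 0.79]
mean, (lower, upper) = mean_confidence_interval(scores)
print(f"mean={mean:.4f}, CI=({lower:.4f}, {upper:.4f})")
```

The half-width is proportional to the standard error, so the interval narrows as the number of runs grows or as the scores vary less between runs.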


## ✨ Questions ✨

4. Are there differences with the previous values?
5. Why?
6. What method do you think is better?
7. Does the ranking of the best, middle, and worst models change?
8. Are there significant differences between the models?

### Answers

*Write your answers here.*



### Method 2: Bootstrap

We do not retrain the model. Instead, we use bootstrap on the test set to estimate the confidence intervals.

*Note*. This method is used when retraining the models is so expensive that we cannot afford to do it N times. A common example is a deep learning model that takes hours to train. In that case, we keep the trained model fixed and compute the confidence intervals of the metrics by bootstrapping the test set.
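
As a complementary sketch (not the code used in this notebook), one common way to turn bootstrap scores into an interval is the percentile method: take the empirical 2.5% and 97.5% quantiles of the resampled scores. The hypothetical function `percentile_bootstrap_ci` below also assumes that the predictions are computed once and that only the indices are resampled, which avoids calling `predict` inside the loop. The labels and predictions are made up.

```python
import numpy as np

def percentile_bootstrap_ci(y_true, y_pred, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for accuracy, from fixed predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.empty(n_resamples)
    for i in range(n_resamples):
        # Sample n indices with replacement; labels and predictions stay paired
        idx = rng.integers(0, n, size=n)
        scores[i] = np.mean(y_true[idx] == y_pred[idx])
    alpha = (1 - confidence) / 2
    lower, upper = np.quantile(scores, [alpha, 1 - alpha])
    return scores.mean(), (lower, upper)

# Made-up test set: 100 samples, 80 of them predicted correctly
y_true = np.array([1] * 50 + [0] * 50)
y_pred = np.array([1] * 40 + [0] * 10 + [0] * 40 + [1] * 10)
mean, (lower, upper) = percentile_bootstrap_ci(y_true, y_pred)
print(f"accuracy={mean:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

The percentile interval describes the spread of the metric across resampled test sets. That is a different quantity from a confidence interval for the mean of the bootstrap scores, which shrinks as the number of resamples increases.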

In [19]:
print(f"\n{'-'*5} Bootstrapping method {'-'*5}")
time_before = time()

dt_model = DecisionTreeClassifier(random_state=random_state)
dt_model.fit(X_train, y_train)
rf_model = RandomForestClassifier(random_state=random_state)
rf_model.fit(X_train, y_train)
xgb_model = xgb.XGBClassifier(random_state=random_state)
xgb_model.fit(X_train, y_train)
models = [dt_model, rf_model, xgb_model]

# Perform bootstrapping
n_bootstrap_samples = 1000
bootstrap_accuracies = dict()
bootstrap_f1_scores = dict()
for _ in range(n_bootstrap_samples):
    # Resample with replacement, both X_test and the corresponding y_test at the same time (important)
    resampled_X_test, resampled_y_test = resample(X_test, y_test, replace=True)
    # Calculate the metrics for each model and store it in the two dictionaries
    for model in models:
        resampled_y_pred = model.predict(resampled_X_test)
        accuracy = accuracy_score(resampled_y_test, resampled_y_pred)
        f1_score_value = f1_score(resampled_y_test, resampled_y_pred)
        bootstrap_accuracies.setdefault(model.__class__.__name__, []).append(accuracy)
        bootstrap_f1_scores.setdefault(model.__class__.__name__, []).append(f1_score_value)

# Compute and visualize the confidence intervals of each model and metric
for model_name in bootstrap_accuracies:
    accuracy_mean, accuracy_confidence_interval = utils.confidence_interval(bootstrap_accuracies[model_name], 0.95)
    f1_score_mean, f1_score_confidence_interval = utils.confidence_interval(bootstrap_f1_scores[model_name], 0.95)
    print(f"Model: {model_name}.")
    print(f"\tAccuracy mean: {accuracy_mean:.4f}. CI: {accuracy_confidence_interval}.")
    print(f"\tF1 Score: {f1_score_mean:.4f}. CI: {f1_score_confidence_interval}.")

print(f"Time elapsed: {time() - time_before:.2f} seconds.")


----- Bootstrapping method -----
Model: DecisionTreeClassifier.
	Accuracy mean: 0.7495. CI: (0.7483864180492062, 0.7506528342872422).
	F1 Score: 0.6853. CI: (0.6838199170049631, 0.6868258975189289).
Model: RandomForestClassifier.
	Accuracy mean: 0.7655. CI: (0.764376426259316, 0.7665880597219923).
	F1 Score: 0.7114. CI: (0.7099940157216071, 0.7129007509626097).
Model: XGBClassifier.
	Accuracy mean: 0.7753. CI: (0.7742390269187925, 0.7764376085952261).
	F1 Score: 0.7255. CI: (0.7240700411891269, 0.7269247001146079).
Time elapsed: 25.79 seconds.


## ✨ Questions ✨

9. Are there differences with the previous values? Why?
10. Which method takes longer? Why?
11. In deep learning scenarios, do you think there will be similar execution-time differences?

### Answers

*Write your answers here.*

