# Machine Learning 2023-2024 - UMONS

# Tree-based methods 

In this lab, our objective is to predict whether female patients suffer from diabetes based on a series of medical attributes. To do so, we will apply several tree-based methods to the Pima Indians dataset.

Here is a description of the dataset's attributes:
  - **Num_pregnant** : The number of pregnancies the patient had. 
  - **glucose_con** : Patient's plasma glucose concentration.
  - **blood_pressure** : Patient's diastolic blood pressure (mmHg).
  - **triceps_thickness** : Patient's triceps skin-fold thickness (mm).
  - **insulin** : Patient's 2-h serum insulin (mu U/mL).
  - **bmi** : Patient's body mass index (kg/m^2).
  - **dpf** : Patient's diabetes pedigree function.
  - **age** : Patient's age. 
  - **diabetes** : Whether the patient has diabetes (1) or not (0).

**Load the necessary libraries** 

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score, confusion_matrix, ConfusionMatrixDisplay,
    roc_curve, roc_auc_score, mean_squared_error,
)
from sklearn.model_selection import train_test_split, cross_validate, RandomizedSearchCV
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import math
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier, plot_tree, DecisionTreeRegressor
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
    BaggingClassifier, BaggingRegressor,
)
from sklearn.utils import resample
```

**1) Load the dataset, get its general information, and check for missing values.** 
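
The inspection step could look like the sketch below. The file name of the dataset is not given here, so the sketch builds a small synthetic DataFrame with the same column names (an assumption for illustration only, including the injected missing values); in the lab, `df` would instead come from reading the actual file, for example with `pd.read_csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lab data (assumption): same column names, random values.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Num_pregnant": rng.integers(0, 10, n),
    "glucose_con": rng.normal(120, 30, n),
    "blood_pressure": rng.normal(70, 12, n),
    "triceps_thickness": rng.normal(25, 8, n),
    "insulin": rng.normal(100, 50, n),
    "bmi": rng.normal(32, 6, n),
    "dpf": rng.uniform(0.1, 1.5, n),
    "age": rng.integers(21, 70, n),
    "diabetes": rng.integers(0, 2, n),
})
# Inject a few missing values so that the check below has something to report.
df.loc[rng.choice(n, 15, replace=False), "insulin"] = np.nan

df.info()                        # column types and non-null counts
print(df.describe())             # summary statistics per column
missing_counts = df.isna().sum() # number of NaN entries per column
print(missing_counts)
```

Some distributions of this dataset reportedly encode missing measurements as zeros rather than `NaN`; whether that applies to the file used in this lab is not stated, so it may be worth counting zeros per column as well.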

# Decision Trees

**2) Select 'diabetes' as the target variable, and all the remaining columns as predictors.**

**Create a pipeline containing the preprocessing steps (missing-value imputer, scaler, ...) and a `DecisionTreeClassifier` with a maximum depth of 3 (set through the `max_depth` argument). Use entropy as the split criterion.** 
- **Do you think scaling the variables is necessary?**


**Fit this pipeline to the data (do not split the dataset for the time being), and plot the decision tree. How do you interpret it?** 

**You'll need the `plot_tree` function from scikit-learn (`sklearn.tree`). You can access the pipeline's classifier through the `named_steps['classifier']` attribute. You'll also need to pass the names of the features (predictors) to the function using the `feature_names` argument.**
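
A minimal sketch of such a pipeline and plot is given below. It runs on synthetic data from `make_classification` (an assumption, since the lab file is not loaded here), uses median imputation as one possible imputer, and names the steps `imputer` and `classifier`; only the `classifier` name comes from the text above. The scaler is left out of the sketch because threshold-based splits are unaffected by a monotone rescaling of a feature.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier, plot_tree

feature_names = ["Num_pregnant", "glucose_con", "blood_pressure",
                 "triceps_thickness", "insulin", "bmi", "dpf", "age"]

# Synthetic stand-in for the predictors and the 'diabetes' target (assumption).
X_arr, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                               random_state=0)
X = pd.DataFrame(X_arr, columns=feature_names)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # imputation choice is an assumption
    ("classifier", DecisionTreeClassifier(max_depth=3, criterion="entropy",
                                          random_state=0)),
])
pipe.fit(X, y)  # fitted on the full data, no split yet

fig, ax = plt.subplots(figsize=(14, 6))
plot_tree(pipe.named_steps["classifier"], feature_names=feature_names,
          class_names=["0", "1"], filled=True, ax=ax)
```

Each node in the resulting figure shows the split rule, the entropy, the number of samples reaching the node, and the class counts, which is the information needed to interpret the tree.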

**3) Let's see how the model's performance evolves as a function of the tree's maximum depth.**

**To this end, apply the following steps:**
- **Split the dataset into a training and a test set following a 0.8/0.2 partition.**
- **For maximum depths varying from 1 to 20, evaluate a `DecisionTreeClassifier` on the *training* data using 10-fold cross-validation with the AUROC as the metric.**
- **Plot the mean training and validation AUROCs across folds as a function of the maximum depth.**
- **Compute the standard error of the mean at each depth, and add a band of one standard error around each mean to the plot as a shaded grey area.**
    - **What can you conclude regarding the model's performance, as well as the uncertainty of the in-sample and out-of-sample AUROC estimates?**
- **Identify the depth that would, a priori, lead to the best out-of-sample performance. Using this depth, fit a decision tree to the training split and report the training AUROC and the test AUROC.** 
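
The steps above can be sketched as follows, again on synthetic data (an assumption) and with a bare `DecisionTreeClassifier` instead of the full pipeline, for brevity. The standard error is taken here as the sample standard deviation of the fold scores divided by the square root of the number of folds, and the "best" depth is simply the one with the highest mean validation AUROC; other selection rules (such as the one-standard-error rule) are possible.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (assumption).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

depths = np.arange(1, 21)
n_folds = 10
# key -> (list of means, list of standard errors), one entry per depth
stats = {"train_score": ([], []), "test_score": ([], [])}

for depth in depths:
    cv = cross_validate(
        DecisionTreeClassifier(max_depth=int(depth), random_state=0),
        X_train, y_train, cv=n_folds, scoring="roc_auc",
        return_train_score=True)
    for key, (means, errors) in stats.items():
        means.append(cv[key].mean())
        errors.append(cv[key].std(ddof=1) / np.sqrt(n_folds))

fig, ax = plt.subplots()
for key, label in (("train_score", "training"), ("test_score", "validation")):
    mean, se = np.array(stats[key][0]), np.array(stats[key][1])
    ax.plot(depths, mean, label=f"{label} AUROC")
    ax.fill_between(depths, mean - se, mean + se, color="grey", alpha=0.3)
ax.set_xlabel("maximum depth")
ax.set_ylabel("AUROC")
ax.legend()

# Depth with the highest mean validation AUROC.
val_mean = np.array(stats["test_score"][0])
best_depth = int(depths[np.argmax(val_mean)])

best_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
best_tree.fit(X_train, y_train)
train_auc = roc_auc_score(y_train, best_tree.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, best_tree.predict_proba(X_test)[:, 1])
print(f"depth={best_depth}  train AUROC={train_auc:.3f}  test AUROC={test_auc:.3f}")
```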

# Bagging 

**4) Implement your own bagging algorithm by fitting a decision tree to each bootstrap sample. To this end, perform the following steps:**
- **Draw 30 bootstrap samples from the training set with replacement. Each bootstrap sample should contain the same number of observations as the training set. Use the `resample()` function from `sklearn.utils`.** 
- **For each bootstrap sample, do:**
  - **Fit a `DecisionTreeClassifier` to the bootstrap sample. Reuse the pipeline defined previously. The maximum depth of each tree should be fixed to 5.**
  - **Using the fitted decision tree, predict the class and the probabilities on the test set, and save them in a list.**
- **You will now use two different aggregation strategies to get a single prediction from the ensemble:** 
  - **Majority vote strategy: predict the class that receives the most votes across the individual trees.**
  - **Average probability strategy: predict the class whose average predicted probability across the trees is the highest.** 
- **Plot the confusion matrix of the predictions for both aggregation strategies separately.**

# Random Forest

**5) Perform a `RandomizedSearchCV` on a specified grid of hyper-parameters to find the best configuration for a `RandomForestClassifier`. Use the AUROC as the scoring function and limit the number of combinations to try to 10.**

**Fit the best model found in the previous procedure to the training data, and predict on the test set. Report the test AUROC and display the ROC curve.**
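
One way this search could be set up is sketched below. The data are synthetic and the hyper-parameter grid is purely illustrative (both are assumptions, as the lab does not specify the grid); `n_iter=10` limits the number of sampled combinations and `scoring="roc_auc"` selects the AUROC.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the lab data (assumption).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Illustrative grid (assumption): any sensible set of values would do.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, 10, None],
    "max_features": ["sqrt", "log2", None],
    "min_samples_leaf": [1, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0), param_distributions,
    n_iter=10, scoring="roc_auc", cv=5, random_state=0)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already refitted
# on the whole training set using the best configuration.
best_rf = search.best_estimator_
test_proba = best_rf.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_proba)
print(search.best_params_, f"test AUROC={test_auc:.3f}")

fpr, tpr, _ = roc_curve(y_test, test_proba)
fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"random forest (AUROC={test_auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax.set_xlabel("false positive rate")
ax.set_ylabel("true positive rate")
ax.legend()
```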

# Boosting

**7) Fit a `GradientBoostingClassifier` to the training data, and report the training and test AUROCs.**
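
A short sketch of this step, assuming synthetic data and the default hyper-parameters of `GradientBoostingClassifier`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (assumption).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

gb = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# AUROC is computed from the predicted probability of the positive class.
train_auc = roc_auc_score(y_train, gb.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, gb.predict_proba(X_test)[:, 1])
print(f"train AUROC={train_auc:.3f}  test AUROC={test_auc:.3f}")
```

Comparing the two values gives a first indication of how much the boosted model overfits the training split.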

**8) For a `DecisionTreeClassifier`, a `BaggingClassifier`, a `RandomForestClassifier` and a `GradientBoostingClassifier`, perform a `RandomizedSearchCV` on a predefined grid of hyper-parameters. Among all model and hyper-parameter combinations, select the best model and report the best *validation* AUROC. The random search should be performed on the *training* data, and you can set the number of combinations to try per model to 5.**

**For the best model found, report the training and test AUROCs, and display the training and test ROC curves.**
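
The comparison could be organised as in the sketch below, which loops over a dictionary of models and runs one random search per model. The data are synthetic and the grids are illustrative placeholders (assumptions); the best model is the one whose search reaches the highest mean cross-validated AUROC (`best_score_`).

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (assumption).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# name -> (estimator, illustrative grid); the grids are assumptions.
models = {
    "tree": (DecisionTreeClassifier(random_state=0),
             {"max_depth": [2, 3, 5, 8, None], "min_samples_leaf": [1, 5, 10, 20]}),
    "bagging": (BaggingClassifier(random_state=0),
                {"n_estimators": [10, 25, 50], "max_samples": [0.5, 0.75, 1.0]}),
    "forest": (RandomForestClassifier(random_state=0),
               {"n_estimators": [25, 50, 100], "max_depth": [3, 5, None]}),
    "boosting": (GradientBoostingClassifier(random_state=0),
                 {"n_estimators": [50, 100], "learning_rate": [0.01, 0.05, 0.1],
                  "max_depth": [2, 3]}),
}

searches = {}
for name, (estimator, grid) in models.items():
    search = RandomizedSearchCV(estimator, grid, n_iter=5, scoring="roc_auc",
                                cv=5, random_state=0)
    searches[name] = search.fit(X_train, y_train)  # search on training data only

best_name = max(searches, key=lambda k: searches[k].best_score_)
best_search = searches[best_name]
print(best_name, best_search.best_params_,
      f"validation AUROC={best_search.best_score_:.3f}")

fig, ax = plt.subplots()
aucs = {}
for split, X_s, y_s in (("train", X_train, y_train), ("test", X_test, y_test)):
    proba = best_search.best_estimator_.predict_proba(X_s)[:, 1]
    aucs[split] = roc_auc_score(y_s, proba)
    fpr, tpr, _ = roc_curve(y_s, proba)
    ax.plot(fpr, tpr, label=f"{split} (AUROC={aucs[split]:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", color="grey")
ax.set_xlabel("false positive rate")
ax.set_ylabel("true positive rate")
ax.legend()
```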

# Regression 

**9) Select *bmi* as the target variable and all the remaining columns in the dataframe as the predictors, with the exception of *diabetes*. Split your dataset into a training and a test set, fit a `DecisionTreeRegressor` to the training data, and report the MSE on the training and test sets. What do you observe?**
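
A minimal sketch of this regression step is given below, using synthetic data from `make_regression` with seven predictors (an assumption standing in for the real columns) and a `DecisionTreeRegressor` with default settings, which grows the tree without any depth limit.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in (assumption): 7 predictors, continuous target playing the role of bmi.
X, y = make_regression(n_samples=300, n_features=7, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

reg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, reg.predict(X_train))
test_mse = mean_squared_error(y_test, reg.predict(X_test))
print(f"train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

With no constraint on depth, the tree can keep splitting until each leaf holds a single training observation, which is worth keeping in mind when comparing the two errors.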

**10) For a `DecisionTreeRegressor`, a `BaggingRegressor`, a `RandomForestRegressor` and a `GradientBoostingRegressor`, perform a `RandomizedSearchCV` on a predefined grid of hyper-parameters. Among all model and hyper-parameter combinations, select the best model and report the best *validation* MSE. The random search should be performed on the *training* data, and you can set the number of combinations to try per model to 5.**

**For the best model found, report the training and test MSE.**
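
The regression search can follow the same pattern as the classification one, as sketched below on synthetic data with illustrative grids (both assumptions). scikit-learn scorers follow a "higher is better" convention, so the MSE is requested as `scoring="neg_mean_squared_error"` and the validation MSE is the negative of `best_score_`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (BaggingRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in (assumption): 7 predictors, continuous target playing the role of bmi.
X, y = make_regression(n_samples=300, n_features=7, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# name -> (estimator, illustrative grid); the grids are assumptions.
models = {
    "tree": (DecisionTreeRegressor(random_state=0),
             {"max_depth": [2, 3, 5, 8, None], "min_samples_leaf": [1, 5, 10, 20]}),
    "bagging": (BaggingRegressor(random_state=0),
                {"n_estimators": [10, 25, 50], "max_samples": [0.5, 0.75, 1.0]}),
    "forest": (RandomForestRegressor(random_state=0),
               {"n_estimators": [25, 50, 100], "max_depth": [3, 5, None]}),
    "boosting": (GradientBoostingRegressor(random_state=0),
                 {"n_estimators": [50, 100], "learning_rate": [0.01, 0.05, 0.1],
                  "max_depth": [2, 3]}),
}

searches = {}
for name, (estimator, grid) in models.items():
    search = RandomizedSearchCV(estimator, grid, n_iter=5,
                                scoring="neg_mean_squared_error",
                                cv=5, random_state=0)
    searches[name] = search.fit(X_train, y_train)  # search on training data only

# Highest negated MSE corresponds to the lowest validation MSE.
best_name = max(searches, key=lambda k: searches[k].best_score_)
best_search = searches[best_name]
best_val_mse = -best_search.best_score_

best_model = best_search.best_estimator_
train_mse = mean_squared_error(y_train, best_model.predict(X_train))
test_mse = mean_squared_error(y_test, best_model.predict(X_test))
print(best_name, best_search.best_params_)
print(f"validation MSE={best_val_mse:.3f}  train MSE={train_mse:.3f}  "
      f"test MSE={test_mse:.3f}")
```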