# Machine learning 101

This workshop aims to introduce the basics of supervised machine learning algorithms. Using a breast cancer dataset, we will employ classification algorithms to predict whether tumors are malignant or benign.

Applied to our dataset and in a simplified manner, the goal of supervised machine learning is to provide features (input data) to a model so that it predicts an output (nature of the tumor: malignant or benign).
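
As a minimal sketch of this idea, the toy example below gives a model one feature per instance and asks it to predict a class. The values are invented for illustration (they are not taken from the breast cancer dataset), and the names `toy_features`, `toy_labels` and `toy_model`, as well as the choice of a decision tree, are assumptions made for this sketch. It requires scikit-learn, which is installed later in this workshop.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data (invented for illustration): one feature per instance
toy_features = [[1.0], [2.0], [8.0], [9.0]]
# Known output (class 0 or 1) for each instance
toy_labels = [0, 0, 1, 1]

# Training: the model learns a rule linking features to labels
toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(toy_features, toy_labels)

# Prediction: the model outputs a class for unseen instances
toy_predictions = toy_model.predict([[1.5], [8.5]])
print(toy_predictions)
```

The same `fit` then `predict` pattern is used with every scikit-learn classifier in the rest of the workshop.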

# Jupyter Notebook


**Skip this section if you know how to use Jupyter Notebooks**


Jupyter Notebooks are files in the `.ipynb` format. They provide an interface that combines code and note-taking (similar to RMarkdown with R), which is very convenient for data analysis or machine learning because you can easily go back, modify a cell, and rerun only that cell instead of the whole notebook.

## Environments

`.ipynb` files require a special environment to be read, edited, and executed. Here are the 4 main ones:

* Jupyter Notebook (local web platform)
* JupyterLab
* Visual Studio Code (Codium on the campus' computers)
* Google Colab

We will work with Jupyter Notebook in a web instance. There are two ways to launch a Jupyter Notebook:

Via *graphical interface*: double-click on the `.ipynb` file in the file explorer.

Via the *command line*: open a terminal and navigate to the folder containing the notebook with `cd`. Then type `jupyter notebook`, which opens a local web interface, and click on the notebook to open it.

## Cells

Cells are blocks that can be executed. There are two types: Python code cells and text cells.

You have various buttons to modify the cells: move a cell up/down, cut, delete, etc.

**Cells can be executed by first selecting the cell of interest with the mouse, then either clicking the button with an arrow or using the shortcut CTRL + Enter.**

### Code cell

Python code cells allow you to execute Python code directly.

You can specify that a command should be executed in the shell (bash on Unix systems or PowerShell on Windows) by using `!` e.g., `!echo "hello world"`.

### Text cell

Text cells allow you to insert text in Markdown format (see Markdown syntax) to provide information.

## Saving changes

Don't forget to save your changes regularly, with the shortcut CTRL + S or via File > Save, to avoid losing work in case of a crash.


# Installing libraries

We install Python libraries using the official Python library manager, pip.

* Scikit-Learn for machine learning models
* Matplotlib for creating graphs
* Seaborn for simpler graphing
* Pandas for CSV file manipulation (dataframes)

In [None]:
!pip install scikit-learn
!pip install matplotlib
!pip install seaborn
!pip install pandas

***Restart the kernel after installing the libraries:***

(top toolbar) Kernel > Restart Kernel > Restart

If you encounter any issues, save the notebook, close the page, and reopen the notebook.


# Importing libraries

We import the libraries and the function that loads the dataset.



In [None]:
# We'll work with breast cancer dataset
from sklearn.datasets import load_breast_cancer

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

***If it does not work, you must restart the kernel:***

(top toolbar) Kernel > Restart Kernel > Restart

# Visualisation of the dataset's content

## Loading the dataset and first handling

Before preprocessing the dataset, it is recommended to perform a visualisation step. This helps in better understanding the dataset and knowing the types of variables for the features (quantitative, qualitative, etc.).

We have two classes: benign (B) and malignant (M). They are stored in the **target** column as binary codes:
* 1 = "benign"
* 0 = "malignant"
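
The encoding can be checked directly on the dataset object. The short sketch below reloads the data so that it is self-contained; the variable name `bunch` is an assumption made for this example.

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset as a dictionary-like object
bunch = load_breast_cancer()

# target_names[i] is the name of the class encoded as i
for code, name in enumerate(bunch.target_names):
    count = int((bunch.target == code).sum())
    print(f"{code} = {name} ({count} instances)")
```

The counts also show that the two classes are not perfectly balanced.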

We extract:
* the features that will allow us to classify the instances
* the labels: a series holding the class of each instance, indexed like the features


In [None]:
# Load data as tuple
features, labels = load_breast_cancer(return_X_y=True, as_frame=True)

# Concatenate features and labels together
dataset = pd.concat([features, labels], axis=1)

# Display the dataset's first lines
display(dataset.head())

The `target` column (0, 1) is converted into a `target_names` column containing strings ("malignant", "benign").

In [None]:
# Loading the dataset from Sklearn library
data = load_breast_cancer()

# Creating a Pandas dataframe with column names corresponding to features
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add a column 'target_names' and replace binary data
df['target_names'] = data.target_names[data.target]

# Display the dataset's first lines
display(df.head())

All features are listed

In [None]:
# Display features' names of the dataset
display(dataset.columns)

# Create a list of columns containing the word "mean"
mean_columns = [col for col in dataset.columns if 'mean' in col]

# Create a list of columns containing the word "error"
error_columns = [col for col in dataset.columns if 'error' in col]

# Create a list of columns containing the word "worst"
worst_columns = [col for col in dataset.columns if 'worst' in col]

# Display the columns containing the word "mean"
display(mean_columns)

# Display the columns containing the word "error"
display(error_columns)

# Display the columns containing the word "worst"
display(worst_columns)

We have 10 base measurements: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension.

Each measurement is summarised by three features (30 features in total):
* the average (mean)
* the standard error (error)
* the mean of the 3 largest values (worst)

In [None]:
def data_info(dataset: pd.DataFrame) -> None:
    """
    Sums up the content of the dataset
    """
    # Info about the dataset's dimensions
    print("DIMENSIONS:\n")
    dims = dataset.shape
    print("\tThe dataset contains {} instances (observations) and {} columns.\n".format(dims[0], dims[1]))
    print("\tThe columns are: {}.\n".format(dataset.columns.tolist()))

    # Info about classes
    print("CLASSES:\n")
    classes = df["target_names"].value_counts().index.tolist()
    print("\tThere are {} classes in the dataset.\n".format(len(classes)))
    print("\tClasses are: {}.\n".format(classes))

    # Info about duplicates
    print("DUPLICATE:\n")
    subset_columns = list(dataset.columns)
    # Checking duplicates in the columns of the subset
    duplicated_rows = dataset.duplicated(subset=subset_columns, keep=False)
    # Counting duplicates
    duplicated_count = duplicated_rows.sum()
    print("\tThere are {} duplicate(s) in the dataset.\n".format(duplicated_count))
    # Display the duplicated rows, if any
    if duplicated_count > 0:
        rows_duplicated = dataset[duplicated_rows]
        print("\tDuplicates are:\n")
        print(rows_duplicated)
    else:
        print("\tNo duplicate in the dataset.\n")

data_info(dataset)

### Global statistics of the features

Let's look at stats per class.


In [None]:
def stats_data(dataset: pd.DataFrame) -> None :

    # Global stats
    print("GLOBAL STATS: ")
    display(dataset.iloc[:, 0:30].describe())

    # Stats per class
    print("\nSTATS PER CLASS")
    for i in dataset["target"].value_counts().index.tolist() :

        print("\n\tClass: {}".format(i))
        display(dataset[dataset["target"] == i].iloc[:, 0:30].describe())


stats_data(dataset)

We observe that the features of the benign class (1) tend to be lower than those of the malignant class (0).

For example:
* The mean of `mean radius` is 12.146524 for the benign class (1) but 17.462830 for the malignant class (0).
* The median of `mean texture` is 17.390000 for the benign class (1) and 21.460000 for the malignant class (0).

Since the table is partially truncated, we are going to show only the features belonging to the `error` group, which corresponds to the standard error.

In [None]:
def stats_data(dataset: pd.DataFrame) -> None:
    """
    Display global stats for columns of the error group
    """
    # Global stats
    print("GLOBAL STATS: ")
    display(dataset[error_columns].describe())

    # Stats per class
    print("\nSTATS PER CLASS")
    for i in dataset["target"].value_counts().index.tolist():
        print("\n\tClass: {}".format(i))
        display(dataset[dataset["target"] == i][error_columns].describe())


stats_data(dataset)

### Exercise 1

Select the most relevant group of features among `error`, `mean`, `worst`. Then display the global stats and stats per class of the chosen group.

Use the variables `mean_columns`, `error_columns` or `worst_columns`.

## Box plots of the features

A more visual way to show the distribution of the dataset is to use box plots, which allow us to observe trends between classes.

Below, we are only looking at features belonging to the `error` group.

In [None]:
def boxplot_data(dataset: pd.DataFrame) -> None :

    # Retrieve classes
    classes = dataset["target_names"].value_counts().index.tolist()

    # Retrieve variable names
    features_names = [col_name for col_name in dataset.columns.tolist() if "error" in col_name and col_name != "target"]

    # Define the subplot counter
    cpt = 1

    # Define a plot
    plt.figure(figsize=(20, 40))

    # For each variable
    for col_name in features_names :
        # On a subplot
        plt.subplot(len(features_names), len(classes), cpt)

        # Display the box plot of the values for each class
        sns.boxplot(data=dataset, x="target_names", y=col_name, hue="target_names")

        # Next subplot
        cpt+=1

    # Adjust layout
    plt.tight_layout(pad=2)

    # Show plot
    plt.show()

boxplot_data(df)

### Exercise 2

Do the same as above but select the feature group corresponding to `mean` and/or `worst` to visualise its box plots.

These box plots confirm our previous observations: instances of the benign class tend to have lower feature values than instances of the malignant class.

## Scatter plots of the features

These scatterplots will allow us to show pairwise relationships between data variables. Each subplot in the grid represents the relationship between two different variables.

In [None]:
def features_distributions(dataset: pd.DataFrame) -> None :

    # Names of the variables to plot
    features_names = ["mean perimeter", "mean texture", "mean area", "mean radius"]

    # Retrieve variable count
    n_features = len(features_names)

    # Define counter for the subplots
    cpt = 1

    # Define a plot
    plt.figure(figsize=(10, 10))

    # Display each variable depending on the other
    for i in features_names :

        for j in features_names :

            # On a subplot
            plt.subplot(n_features, n_features, cpt)

            # Display distribution of classes in the plane (x = i, y = j)
            plt.scatter(dataset.loc[:, i], dataset.loc[:, j], c=dataset["target"])

            # Name axes
            plt.xlabel(i)
            plt.ylabel(j)

            # Next subplot
            cpt+=1

    # Adjust layout
    plt.tight_layout(pad=2)

    # Display plot
    plt.show()

features_distributions(dataset)

From these scatter plots, we can observe that instances of the same class (= same color) tend to cluster together. This is a positive sign for a classification approach.
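
The pairwise relationships displayed in the scatter plots can also be summarised numerically. As a rough sketch, the cell below computes the Pearson correlation between the four plotted features. It reloads the data to stay self-contained, and the names `features_df`, `selected` and `correlations` are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer

# Reload the features as a DataFrame so the sketch is self-contained
features_df = load_breast_cancer(as_frame=True).data

# Pearson correlation between the four features plotted above
selected = ["mean perimeter", "mean texture", "mean area", "mean radius"]
correlations = features_df[selected].corr()
print(correlations.round(2))
```

A value close to 1 between two features (for instance radius and perimeter) means that they carry largely redundant information.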

# Preprocessing

The data preprocessing step is crucial to prepare the dataset for training as models are sensitive to certain characteristics such as missing or duplicate data, outliers, etc. This step aims to homogenise the dataset.

This preprocessing step includes several sub-steps:
* **Cleaning**: removing duplicate/missing data
* **Normalisation**: rescaling the features to a common range (here [0, 1], using min-max scaling)

> Other sub-steps such as **feature extraction** (not presented here) involve extracting meaningful data. For example, in analyzing text, one might extract semantically rich words (tokens) and transform them into quantitative data. This way, textual information is converted into numbers.
> Another sub-step is class **balancing**, which is not performed here. Some algorithms are less sensitive to imbalanced classes.
> In connection with class balancing, a **resampling** sub-step may be necessary, involving reshaping the dataset to make it more balanced. Resampling includes downsampling, upsampling, and interpolation.

Finally, the last step is **splitting** the dataset: creating separate training and test datasets.
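
To make the normalisation step concrete, here is a minimal pure-Python sketch of min-max scaling, x' = (x - min) / (max - min), which is the formula `MinMaxScaler` applies to each column with its default range of [0, 1]. The function name `min_max_scale` and the sample values are assumptions made for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lowest, highest = min(values), max(values)
    # Guard against a constant column (zero range)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

# Hypothetical radius values, for illustration only
radii = [10.0, 15.0, 20.0]
print(min_max_scale(radii))
```

The smallest value maps to 0, the largest to 1, and the others fall proportionally in between.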

In [None]:
# This function normalises data
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Remove duplicates
dataset = dataset.drop_duplicates().reset_index(drop=True)

# Display info
display(dataset.head())
print("dimensions : {}".format(dataset.shape))

In [None]:
# Split the dataset
features, labels, target_names = dataset.iloc[:, 0:30], dataset.iloc[:, 30], df["target_names"]

# Create a MinMaxScaler object
normalizer = MinMaxScaler()

# Normalisation
features = normalizer.fit_transform(features)

# Transform into a dataframe
features = pd.DataFrame(data=features, columns=normalizer.get_feature_names_out())

# Display the results of normalisation
display(features.head())

Since the dataset is relatively clean, there is no need for extensive preprocessing. We can proceed to the training phase.

# Training and evaluation

As the name suggests, the training step involves training a model: the model adjusts its parameters (weights) to best predict the output, here the class (malignant or benign).
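
As an illustration of what "adjusting parameters" means, the sketch below trains a logistic regression and inspects the parameters it has learned. It reloads and rescales the data to stay self-contained; the names `X`, `y`, `X_scaled` and `clf` are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Self-contained reload and rescaling of the data
X, y = load_breast_cancer(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)

# Training adjusts the weights stored inside the model
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_scaled, y)

# One learned weight per feature, plus one intercept
print(clf.coef_.shape)
print(clf.intercept_.shape)
```

Not every model stores its parameters as weights like this: a decision tree, for example, learns split thresholds instead.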

We're going to use the following classifiers:
* Naive Bayes
* Logistic regression
* Support-vector machine
* Decision tree
* Gradient boosting
* Multilayer perceptron

In [None]:
# Splits a dataset into a training set and a test set
from sklearn.model_selection import train_test_split

# All the models we're going to use
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier


# Tools to evaluate models
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, accuracy_score

# Find best hyperparameters
from sklearn.model_selection import GridSearchCV, LeaveOneOut

In [None]:
# Split data into 4 parts
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, stratify=labels, random_state=42)

# Display the sets
## Input
display(x_train.head())
display(x_test.head())
## Output: class to predict
display(y_train.head())
display(y_test.head())

The training dataset corresponds to 80% of the original dataset size, and the test dataset corresponds to the remaining 20%.

The splitting is done randomly, but we can fix the random seed with the `random_state` parameter so that the same split is obtained on every run (reproducibility).
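
The effect of `random_state` can be checked on a small example. In the sketch below, the toy data and the names `toy_x`, `toy_y`, `split_a` and `split_b` are assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split

# Ten toy instances with alternating classes (illustrative data)
toy_x = [[i] for i in range(10)]
toy_y = [0, 1] * 5

# Two calls with the same seed produce the same split
split_a = train_test_split(toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=42)
split_b = train_test_split(toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=42)
print(split_a == split_b)

# 80% of the instances go to training, 20% to test
print(len(split_a[0]), len(split_a[1]))
```

Because of `stratify`, the test set keeps the class proportions of the original data (here, one instance of each class).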

We will evaluate the model's performance using a confusion matrix. A confusion matrix compares the model's predicted classes to the actual classes and counts the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). From this matrix, performance metrics can be calculated:
* Precision: TP / (TP + FP). The higher it is, the fewer false positives there are among the instances predicted positive.
* Recall: TP / (TP + FN). The higher it is, the fewer actual positives are missed (false negatives).
* Accuracy: (TP + TN) / total. The higher it is, the more predictions are correct overall (true positives and true negatives).
* F1-score: the harmonic mean of precision and recall.
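
These metrics can be computed by hand from the four counts of a binary confusion matrix. The sketch below is a plain Python illustration; the function name `classification_metrics` and the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute four metrics from binary confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision (0.800) is lower than recall (0.889) because there are more false positives than false negatives.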

All these metrics are important, but it is especially **precision** and **recall** (= sensitivity) that are of interest since favoring one often comes at the cost of the other (trade-off). Depending on the dataset and the classification objective, one might prefer to emphasise either precision or recall.

For breast cancer prediction, the preference here is for reliable positive predictions: if a tumor is predicted to be malignant, it should very likely be malignant. Therefore, there is a tendency to favor precision, given the severity of treatments. However, in the end, it remains a choice.
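
The trade-off can be illustrated by moving the decision threshold applied to a model's scores. The sketch below uses invented scores and labels, and the helper `precision_recall_at` is a hypothetical name introduced for this example.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l == 1 for p, l in zip(predicted, labels))
    fp = sum(p and l == 0 for p, l in zip(predicted, labels))
    fn = sum((not p) and l == 1 for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores and true labels (1 = positive class)
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0]

for threshold in (0.35, 0.80):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, the low threshold finds every positive but admits a false positive, while the high threshold makes no false positive but misses half of the positives.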

In [None]:
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score

# Define a list of models
learning_algo = [MultinomialNB(),
                 LogisticRegression(random_state=42),
                 SVC(random_state=42),
                 DecisionTreeClassifier(random_state=42),
                 GradientBoostingClassifier(random_state=42),
                 MLPClassifier(random_state=42, max_iter=1000)
                ]

# Initialise variables to track the best models and their metrics
max_accuracy = 0
max_precision = 0
max_recall = 0
best_model_accuracy = None
best_model_precision = None
best_model_recall = None

# Loop through each algorithm
for algo in learning_algo:
    tmp = algo  # Assign the current algorithm to the tmp variable
    print(type(tmp).__name__ + ":")  # Display the name of the algorithm

    # Training the model
    tmp.fit(x_train, y_train)

    # Create a plot to display the confusion matrix
    plt.figure(figsize=(12, 12))

    # Display the confusion matrix for the training set
    ConfusionMatrixDisplay.from_estimator(tmp, x_train, y_train, cmap=plt.cm.Blues, ax=plt.subplot(2, 2, 1))
    plt.title("Confusion matrix - Training set")

    # Display the confusion matrix for the test set
    ConfusionMatrixDisplay.from_estimator(tmp, x_test, y_test, cmap=plt.cm.Blues, ax=plt.subplot(2, 2, 2))
    plt.title("Confusion matrix - Test set")

    plt.show()  # Display the confusion matrix

    # Predict the training and test sets
    y_pred_train = tmp.predict(x_train)
    y_pred_test = tmp.predict(x_test)

    # Compute the performance metrics for the training set
    accuracy_train = balanced_accuracy_score(y_train, y_pred_train)
    precision_train = precision_score(y_train, y_pred_train, average='weighted')
    recall_train = recall_score(y_train, y_pred_train, average='weighted')

    # Compute the performance metrics for the test set
    accuracy_test = balanced_accuracy_score(y_test, y_pred_test)
    precision_test = precision_score(y_test, y_pred_test, average='weighted')
    recall_test = recall_score(y_test, y_pred_test, average='weighted')

    # Display the classification report for the training set
    print("\nTRAINING SET:\n")
    print(classification_report(y_train, y_pred_train))  # (y_true, y_pred)

    # Display the performance metrics for the training set
    print("\nBalanced accuracy of the training set:", accuracy_train)
    print("Precision of the training set:", precision_train)
    print("Recall of the training set:", recall_train)

    # Display the classification report for the test set
    print("\nTEST SET:\n")
    print(classification_report(y_test, y_pred_test))  # (y_true, y_pred)

    # Display the performance metrics for the test set
    print("Balanced accuracy of the test set:", accuracy_test)
    print("Precision of the test set:", precision_test)
    print("Recall of the test set:", recall_test)

    # Update the max values and best model if the current model is the best one
    if accuracy_test > max_accuracy:
        max_accuracy = accuracy_test
        best_model_accuracy = type(tmp).__name__

    if precision_test > max_precision:
        max_precision = precision_test
        best_model_precision = type(tmp).__name__

    if recall_test > max_recall:
        max_recall = recall_test
        best_model_recall = type(tmp).__name__

    print('\n')

# Display the best models and their performance metrics
print(f"Best model (balanced accuracy): {best_model_accuracy} - balanced accuracy: {max_accuracy:.4f}")
print(f"Best model (precision): {best_model_precision} - Precision: {max_precision:.4f}")
print(f"Best model (recall): {best_model_recall} - recall: {max_recall:.4f}")


Subsequently, we will search for the best hyperparameters, meaning those that will yield the best predictions. Hyperparameters are specific parameters of algorithms that influence their behavior.
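
A grid search tries every combination of the listed hyperparameter values. As a small sketch of what this means, the cell below enumerates a grid with scikit-learn's `ParameterGrid`; the grid `example_grid` and the fold count are hypothetical values chosen for illustration, not the ones used in this workshop.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid with two hyperparameters of an SVM
example_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# ParameterGrid enumerates every combination of the listed values
combinations = list(ParameterGrid(example_grid))
print(len(combinations))
for combo in combinations:
    print(combo)

# With k-fold cross-validation, each combination is fitted k times
n_folds = 10
print("Total fits:", len(combinations) * n_folds)
```

The number of fits grows multiplicatively with each added hyperparameter, so it is worth keeping grids small.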

Here, we focus on optimising the hyperparameters of a single model, the SVM (`SVC`):

In [None]:
# Instantiate a SVM
model = SVC()

# Define the parameters to test
# Note: SVC ignores random_state unless probability=True, so the 100 values below
# do not change the scores; they only multiply the number of fits
params = {"C" : [0.001, 0.01, 0.1, 1, 10, 100, 1000],
          "random_state" : [i for i in range(0, 100, 1)]}

# Number of folds used for cross-validation during the hyperparameter search
kfold = 10

# Create a search grid
grid_search = GridSearchCV(estimator=model, param_grid=params, scoring="accuracy", cv=kfold, verbose=3)

# Run the search and display the result
grid_search.fit(x_train, y_train)
print("\nThe best parameters for the model are: {}".format(grid_search.best_params_))

# The best parameters for the model are: {'C': 10, 'random_state': 0} with both kfold = LeaveOneOut() and kfold = 10; to save time we use kfold = 10

In [None]:
# Retrieve the best model
model = grid_search.best_estimator_
print(model)
# Create a plot
plt.figure(figsize=(12, 12))

# Display the confusion matrix for the training set
ConfusionMatrixDisplay.from_estimator(model, x_train, y_train, cmap=plt.cm.Blues, ax=plt.subplot(2, 2, 1))
plt.title("Confusion matrix - Training Set")

# Display the confusion matrix for the test set
ConfusionMatrixDisplay.from_estimator(model, x_test, y_test, cmap=plt.cm.Blues, ax=plt.subplot(2, 2, 2))
plt.title("Confusion matrix - Test Set")

# Show plot
plt.show()

# Display the classification report of the training set
print("\nTRAINING SET:\n")
print(classification_report(y_train, model.predict(x_train)))

# Display the classification report of the test set
print("\nTEST SET:\n")
print(classification_report(y_test, model.predict(x_test)))
print('\n')

# Saving the model

The model is saved to apply it on other datasets.

The model is serialized with Python's `pickle` module and written to a binary file with the `.sav` extension.

In [None]:
# pickle saves (serializes) Python objects to files
import pickle as pk

In [None]:
# Instantiate the model with the best hyperparameters
model = SVC(C=10, random_state=0)

# Train the model on the whole data
model.fit(features, labels)

# Display the confusion matrix
plt.figure(figsize=(6, 6))
ConfusionMatrixDisplay.from_estimator(model, features, labels, cmap=plt.cm.Blues, ax=plt.subplot(1, 1, 1))
plt.title("Final Confusion Matrix")
plt.show()

# Display the classification report
print(classification_report(labels, model.predict(features)))

In [None]:
# Save the preprocessor, it'll come in handy for future data
with open("preprocesseur.sav", 'wb') as file:
    pk.dump(normalizer, file)

# Save the model
with open("model.sav", 'wb') as file:
    pk.dump(model, file)
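
Saved files can later be loaded back to classify new data. The sketch below shows one possible round trip. It refits a scaler and a model so that it runs on its own, writes them to a temporary folder under hypothetical file names (`scaler_demo.sav`, `model_demo.sav`), and uses the first five instances as a stand-in for future data.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Self-contained setup: fit a scaler and a model as in the cells above
X, y = load_breast_cancer(return_X_y=True)
scaler = MinMaxScaler().fit(X)
svm_model = SVC(C=10, random_state=0).fit(scaler.transform(X), y)

# Save both objects to a temporary folder (hypothetical file names)
folder = tempfile.mkdtemp()
scaler_path = os.path.join(folder, "scaler_demo.sav")
model_path = os.path.join(folder, "model_demo.sav")
with open(scaler_path, "wb") as f:
    pickle.dump(scaler, f)
with open(model_path, "wb") as f:
    pickle.dump(svm_model, f)

# Later (for example in another session): load them back
with open(scaler_path, "rb") as f:
    loaded_scaler = pickle.load(f)
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

# New data must go through the same scaler before prediction
new_instances = X[:5]  # stand-in for future, unseen data
predictions = loaded_model.predict(loaded_scaler.transform(new_instances))
print(predictions)
```

Pickle files should only be loaded from trusted sources, and preferably with the same scikit-learn version that was used to save them.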

# Summary exercise

Based on what you've seen, train a model on a mini-dataset (see the cell below). Identify the important functions and train one or more of the six models mentioned earlier.

In this mini-dataset, you need to predict the `y` class (dog or cat) from `x` (containing 2 features: height in cm and weight in kg). There are 6 instances.

In [None]:
x = [[45, 11.34], [56, 13.61], [51, 12.7], [30, 4.5], [25, 3.8], [35, 5.2]] # features
y = ["dog", "dog", "dog", "cat", "cat", "cat"] # class

# Corrections

## Exercise 1

In [None]:
# For the group "worst"

def stats_data(dataset: pd.DataFrame) -> None:
    # Global stats for the columns containing "worst"
    print("GLOBAL STATS: ")
    display(dataset[worst_columns].describe())

    # Stats per class for the columns containing "worst"
    print("\nSTATS PER CLASS")
    for i in dataset["target"].value_counts().index.tolist():
        print("\n\tClass: {}".format(i))
        display(dataset[dataset["target"] == i][worst_columns].describe())


stats_data(dataset)

In [None]:
# For the group "error"

def stats_data(dataset: pd.DataFrame) -> None:
    # Global stats for the columns containing "error"
    print("GLOBAL STATS: ")
    display(dataset[error_columns].describe())

    # Stats per class for the columns containing "error"
    print("\nSTATS PER CLASS")
    for i in dataset["target"].value_counts().index.tolist():
        print("\n\tClass: {}".format(i))
        display(dataset[dataset["target"] == i][error_columns].describe())


stats_data(dataset)

In [None]:
# For the group "mean"

def stats_data(dataset: pd.DataFrame) -> None:
    # Global stats for the columns containing "mean"
    print("GLOBAL STATS: ")
    display(dataset[mean_columns].describe())

    # Stats per class for the columns containing "mean"
    print("\nSTATS PER CLASS")
    for i in dataset["target"].value_counts().index.tolist():
        print("\n\tClass: {}".format(i))
        display(dataset[dataset["target"] == i][mean_columns].describe())


stats_data(dataset)

## Exercise 2

In [None]:
# For the group "worst"

def boxplot_data(dataset: pd.DataFrame) -> None :

    # Retrieve classes
    classes = dataset["target_names"].value_counts().index.tolist()

    # Retrieve variable names
    features_names = [col_name for col_name in dataset.columns.tolist() if "worst" in col_name and col_name != "target"] # Solution: change the string inside "" after the if

    # Define the subplot counter
    cpt = 1

    # Define a plot
    plt.figure(figsize=(20, 40))

    # For each variable
    for col_name in features_names :
        # On a subplot
        plt.subplot(len(features_names), len(classes), cpt)

        # Display the box plot of the values for each class
        sns.boxplot(data=dataset, x="target_names", y=col_name, hue="target_names")

        # Next subplot
        cpt+=1

    # Adjust layout
    plt.tight_layout(pad=2)

    # Show plot
    plt.show()

boxplot_data(df)

In [None]:
# For the group "mean"

def boxplot_data(dataset: pd.DataFrame) -> None :

    # Retrieve classes
    classes = dataset["target_names"].value_counts().index.tolist()

    # Retrieve variable names
    features_names = [col_name for col_name in dataset.columns.tolist() if "mean" in col_name and col_name != "target"] # Solution: change the string inside "" after the if

    # Define the subplot counter
    cpt = 1

    # Define a plot
    plt.figure(figsize=(20, 40))

    # For each variable
    for col_name in features_names :
        # On a subplot
        plt.subplot(len(features_names), len(classes), cpt)

        # Display the box plot of the values for each class
        sns.boxplot(data=dataset, x="target_names", y=col_name, hue="target_names")

        # Next subplot
        cpt+=1

    # Adjust layout
    plt.tight_layout(pad=2)

    # Show plot
    plt.show()

boxplot_data(df)

## Summary exercise

In [None]:
x = [[45, 11.34], [56, 13.61], [51, 12.7], [30, 4.5], [25, 3.8], [35, 5.2]] # features
y = ["dog", "dog", "dog", "cat", "cat", "cat"] # class

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, stratify=y, random_state=42)

from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score

# Define a list of models
learning_algo = [MultinomialNB(),
                 LogisticRegression(random_state=42),
                 SVC(random_state=42),
                 DecisionTreeClassifier(random_state=42),
                 GradientBoostingClassifier(random_state=42),
                 MLPClassifier(random_state=42, max_iter=1000)
                ]

# Initialise variables to track the best models and their metrics
max_accuracy = 0
max_precision = 0
max_recall = 0
best_model_accuracy = None
best_model_precision = None
best_model_recall = None

# Loop through each algorithm
for algo in learning_algo:
    tmp = algo  # Assign the current algorithm to the tmp variable
    print(type(tmp).__name__ + ":")  # Display the name of the algorithm

    # Training the model
    tmp.fit(x_train, y_train)

    # Create a plot to display the confusion matrix
    plt.figure(figsize=(12, 12))

    # Display the confusion matrix for the training set
    ConfusionMatrixDisplay.from_estimator(tmp, x_train, y_train, cmap=plt.cm.Blues, ax=plt.subplot(2, 2, 1))
    plt.title("Confusion matrix - Training set")

    # Display the confusion matrix for the test set
    ConfusionMatrixDisplay.from_estimator(tmp, x_test, y_test, cmap=plt.cm.Blues, ax=plt.subplot(2, 2, 2))
    plt.title("Confusion matrix - Test set")

    plt.show()  # Display the confusion matrix

    # Predict the training and test sets
    y_pred_train = tmp.predict(x_train)
    y_pred_test = tmp.predict(x_test)

    # Compute the performance metrics for the training set
    accuracy_train = balanced_accuracy_score(y_train, y_pred_train)
    precision_train = precision_score(y_train, y_pred_train, average='weighted')
    recall_train = recall_score(y_train, y_pred_train, average='weighted')

    # Compute the performance metrics for the test set
    accuracy_test = balanced_accuracy_score(y_test, y_pred_test)
    precision_test = precision_score(y_test, y_pred_test, average='weighted')
    recall_test = recall_score(y_test, y_pred_test, average='weighted')

    # Display the classification report for the training set
    print("\nTRAINING SET:\n")
    print(classification_report(y_train, y_pred_train))  # (y_true, y_pred)

    # Display the performance metrics for the training set
    print("\nBalanced accuracy of the training set:", accuracy_train)
    print("Precision of the training set:", precision_train)
    print("Recall of the training set:", recall_train)

    # Display the classification report for the test set
    print("\nTEST SET:\n")
    print(classification_report(y_test, y_pred_test))  # (y_true, y_pred)

    # Display the performance metrics for the test set
    print("Balanced accuracy of the test set:", accuracy_test)
    print("Precision of the test set:", precision_test)
    print("Recall of the test set:", recall_test)

    # Update the max values and best model if the current model is the best one
    if accuracy_test > max_accuracy:
        max_accuracy = accuracy_test
        best_model_accuracy = type(tmp).__name__

    if precision_test > max_precision:
        max_precision = precision_test
        best_model_precision = type(tmp).__name__

    if recall_test > max_recall:
        max_recall = recall_test
        best_model_recall = type(tmp).__name__

    print('\n')

# Display the best models and their performance metrics
print(f"Best model (balanced accuracy): {best_model_accuracy} - balanced accuracy: {max_accuracy:.4f}")
print(f"Best model (precision): {best_model_precision} - Precision: {max_precision:.4f}")
print(f"Best model (recall): {best_model_recall} - recall: {max_recall:.4f}")