# Machine Learning 2023-2024 - UMONS

# Classification

In this lab we'll experiment with multiclass classification. We'll consider several models, some of which will be covered later in the course.
We'll be using the [Wine Quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) dataset, which contains several attributes of white wines.
Each observation is associated with a rating between 0 and 10, which will be the label of our classification task.

The columns of the dataframe contain the following information:
* fixed_acidity: amount of tartaric acid in g/dm^3
* volatile_acidity: amount of acetic acid in g/dm^3 
* citric_acid: amount of citric acid in g/dm^3
* residual_sugar: amount of remaining sugar after fermentation stops in g/l
* chlorides: amount of salt in wine 
* free_sulfur_dioxide: amount of free SO2
* total_sulfur_dioxide: amount of free and bound forms of SO2
* density: density of the wine
* pH: pH level of the wine on a scale from 0 to 14
* sulphates: amount of sulphates 
* alcohol: the percent of alcohol content
* quality: quality of the wine (score between 0 and 10)

**Import the necessary libraries**

In [1]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score,
                             confusion_matrix, log_loss, classification_report)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, LabelBinarizer
import warnings

np.random.seed(0)

**We load the dataset 'wine.csv'.**

In [2]:
df = pd.read_csv('data/wine.csv', sep=';')
df.head()

**1) Check the properties of this dataset (length, types, missing values).** 
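
As a hint, here is a minimal sketch of such a check on a small made-up dataframe. The name `toy` and its values are illustrative assumptions, not the wine data; the same calls can be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe standing in for `df`; the values are made up.
toy = pd.DataFrame({
    "alcohol": [9.4, 10.1, np.nan, 11.2],
    "pH": [3.0, 3.3, 3.1, 3.2],
    "quality": [5, 6, 6, 7],
})

n_rows, n_cols = toy.shape   # number of observations and number of columns
dtypes = toy.dtypes          # type of each column
missing = toy.isna().sum()   # number of missing values per column

print(n_rows, n_cols)
print(dtypes)
print(missing)
```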

## Data splitting

**We predict the target 'quality' from all other features. We split the dataset into a training set and a test set following an 80/20 partition.**

In [4]:
ylabel = 'quality'
X = df.drop(ylabel, axis=1)
y = df[ylabel]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, shuffle=True, random_state=0
)

## Data exploration

**2) Look at the distribution of the variable 'quality' in the training set using `sns.countplot`.**

**3) For each continuous feature, plot a boxplot of this feature grouped by label values. Use the `sns.boxenplot` function of the seaborn library. Which features seem to be the most useful to predict the label 'quality'?**

**4) Plot the pairwise relationship of the most useful features using the function `sns.pairplot`. Plot a different color according to the value of the variable 'quality' using the `hue` parameter. What do you observe?**

## Classification metrics

**5) Define a simple pipeline where you first scale the data with `StandardScaler` to have zero mean and unit variance followed by a (linear) logistic regression. Then fit the model.**
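
The structure of such a pipeline can be sketched on synthetic data. The names `X_toy`, `y_toy` and `toy_model`, as well as the choice `max_iter=1000`, are assumptions made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the wine features (an assumption).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Step 1 standardizes each feature, step 2 fits a logistic regression
# on the standardized features.
toy_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# fit() learns the scaling statistics and the classifier from the data it receives.
toy_model.fit(X_toy, y_toy)

toy_predictions = toy_model.predict(X_toy)
toy_accuracy = toy_model.score(X_toy, y_toy)
print(toy_predictions[:5], toy_accuracy)
```

Because the scaler lives inside the pipeline, calling `predict` on new data reuses the mean and variance learned during `fit` instead of recomputing them.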

**6) One of the most useful tools to diagnose a classification model is the confusion matrix. Compute it using `confusion_matrix` and display it using `ConfusionMatrixDisplay`.**

The size of the matrix is $n \times n$, where $n$ is the number of classes. Each row corresponds to an actual class, while each column corresponds to a predicted class. Cell $(i, j)$ contains the number of instances of class $i$ that were predicted as class $j$, so the diagonal counts the correct predictions.
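
To make the row and column convention concrete, here is a small sketch with made-up labels (the lists `y_true` and `y_pred` are hypothetical and unrelated to the wine data).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for three classes.
y_true = [5, 5, 6, 6, 6, 7]
y_pred = [5, 6, 6, 6, 5, 6]
labels = [5, 6, 7]

# Rows follow the actual classes and columns the predicted classes,
# both in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# cm[0, 1] counts the observations of actual class 5 predicted as class 6.
print(cm[0, 1])

# Row sums give the number of actual instances of each class,
# column sums the number of times each class was predicted.
print(cm.sum(axis=1), cm.sum(axis=0))
```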

**7) From the confusion matrix, several performance metrics can be calculated for each class, as well as overall metrics. Using the function `classification_report`, generate a report of these different metrics. Use the argument `zero_division=0` to avoid warnings.**

Here's what each of these terms represents:

1. **Precision**: This is the ratio of correctly predicted positive observations to the total predicted positives. It is an indicator of the accuracy of the positive predictions. For class $i$, precision is calculated as:
   $$
   \text{Precision}_i = \frac{TP_i}{TP_i + FP_i}
   $$
   where $TP_i$ are the true positives and $FP_i$ are the false positives for class $i$.

2. **Recall** (also known as Sensitivity or True Positive Rate): This is the ratio of correctly predicted positive observations to all observations in the actual class. It shows how well the model can find all the positive samples. For class $i$, recall is calculated as:
   $$
   \text{Recall}_i = \frac{TP_i}{TP_i + FN_i}
   $$
   where $FN_i$ are the false negatives for class $i$.

3. **F1-Score**: This is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. It is particularly useful when the class distribution is uneven. F1-score is calculated as:
   $$
   \text{F1-Score}_i = \frac{2}{\frac{1}{\text{Precision}_i} + \frac{1}{\text{Recall}_i}} = 2 \times \frac{\text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}
   $$

4. **Support**: This is the number of actual occurrences of the class in the specified dataset. It doesn’t reflect the model’s performance but is very useful for determining the significance of the classification metrics.

These metrics can be averaged to obtain:
- **Macro average**: This is the average of the precision, recall, and f1-score without taking class imbalance into account. It treats all classes equally, regardless of their support.
- **Weighted average**: This averages the precision, recall, and f1-score, with weighting by support for each class. This means that the influence of each class's score on the overall average is proportional to the number of instances of that class.
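
These definitions can be checked by hand from a confusion matrix. The sketch below uses made-up labels and a hypothetical helper `safe_divide`, which returns 0 when a denominator is 0 to mimic `zero_division=0`; it then compares the result with `precision_recall_fscore_support` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up labels; class 7 is never predicted, so its precision is 0/0.
y_true = [5, 5, 6, 6, 6, 7]
y_pred = [5, 6, 6, 6, 5, 6]
labels = [5, 6, 7]
cm = confusion_matrix(y_true, y_pred, labels=labels)

tp = np.diag(cm).astype(float)   # correct predictions per class
fp = cm.sum(axis=0) - tp         # predicted as the class but actually another one
fn = cm.sum(axis=1) - tp         # actually the class but predicted as another one
support = cm.sum(axis=1)         # number of actual instances per class


def safe_divide(num, den):
    """Element-wise division returning 0 where the denominator is 0."""
    return np.divide(num, den, out=np.zeros_like(num, dtype=float), where=den > 0)


precision = safe_divide(tp, tp + fp)
recall = safe_divide(tp, tp + fn)
f1 = safe_divide(2 * precision * recall, precision + recall)

macro_f1 = f1.mean()                            # every class counts equally
weighted_f1 = np.average(f1, weights=support)   # classes weighted by support

# Reference values computed by scikit-learn.
ref = precision_recall_fscore_support(y_true, y_pred, labels=labels, zero_division=0)
print(precision, recall, f1, macro_f1, weighted_f1)
```

In this toy example the macro average is pulled down by class 7 as much as by any other class, whereas the weighted average gives it little influence because its support is 1.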

### The log loss

The log loss for a multiclass classification model is calculated as follows:
$$
\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log(p_{ik})
$$
where:
- $n$ is the total number of observations.
- $K$ is the number of classes.
- $y_{ik}$ is a binary indicator equal to 1 if class $k$ is the correct label for observation $i$, and 0 otherwise.
- $p_{ik}$ is the predicted probability that observation $i$ belongs to class $k$.
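
A short numeric sketch may help to read the formula. The arrays below are made up, and `manual_log_loss` is a hypothetical helper written for this illustration. Both sets of predictions rank the correct class first, yet the log loss differs because it depends on the probability assigned to the correct class.

```python
import numpy as np


def manual_log_loss(y_onehot, proba):
    """Average over observations of -sum_k y_ik * log(p_ik)."""
    return -np.mean(np.sum(y_onehot * np.log(proba), axis=1))


# One-hot indicators y_ik for n = 2 observations and K = 3 classes.
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])

# Two made-up sets of predicted probabilities p_ik (each row sums to 1).
confident = np.array([[0.90, 0.05, 0.05],
                      [0.05, 0.90, 0.05]])
hesitant = np.array([[0.4, 0.3, 0.3],
                     [0.3, 0.4, 0.3]])

loss_confident = manual_log_loss(y_onehot, confident)  # equals -log(0.9)
loss_hesitant = manual_log_loss(y_onehot, hesitant)    # equals -log(0.4)
print(loss_confident, loss_hesitant)
```

Only the term of the correct class survives in the inner sum, since $y_{ik} = 0$ for every other class.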

Remember from the course that $\operatorname*{arg\,min}_{\theta \in \Theta} \mathbb{E}_{Y \sim p}[-\log p_\theta(Y)] = \operatorname*{arg\,min}_{\theta \in \Theta} \text{KL}(p \,\|\, p_\theta)$, where $p$ is the true distribution and $p_\theta$ is the model.
Since the KL divergence is minimized when $p_\theta$ equals the true distribution, the expected log loss is minimized when the model predicts the true class probabilities for every observation.
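
One way to see this, written for a discrete label $Y$ with true distribution $p$ and model $p_\theta$ (notation assumed here), is to split the expected log loss into two terms:
$$
\mathbb{E}_{Y \sim p}[-\log p_\theta(Y)] = -\sum_{y} p(y) \log p_\theta(y) = \underbrace{-\sum_{y} p(y) \log p(y)}_{H(p)} + \underbrace{\sum_{y} p(y) \log \frac{p(y)}{p_\theta(y)}}_{\text{KL}(p \,\|\, p_\theta)}
$$
The entropy $H(p)$ does not depend on $\theta$, so minimizing the left-hand side over $\theta$ amounts to minimizing the KL term, which is non-negative and equals zero exactly when $p_\theta = p$.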

**8) Predict probabilities using the `predict_proba` method of the logistic regression model. Then calculate the log loss using the `log_loss` function.**

**9) Based on `y_test_binarized`, compute the log loss manually and check that it corresponds to the previous log loss.**

In [12]:
lb = LabelBinarizer()
lb.fit(y_train)
y_test_binarized = lb.transform(y_test)
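

The behaviour of `LabelBinarizer` can be sketched on made-up labels (all names ending in `_toy` and the probabilities below are assumptions for illustration). The binarized matrix has one column per class seen during `fit`, in the order of `classes_`, even when a class is absent from the test labels. In that situation `log_loss` needs the `labels` argument to know which class each probability column refers to.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.preprocessing import LabelBinarizer

# Made-up labels: class 3 appears in the training labels only.
y_train_toy = np.array([3, 4, 5, 5, 6])
y_test_toy = np.array([5, 6, 4])

lb_toy = LabelBinarizer().fit(y_train_toy)
y_bin = lb_toy.transform(y_test_toy)   # one column per class in lb_toy.classes_
print(lb_toy.classes_, y_bin.shape)

# Made-up probabilities, columns ordered like lb_toy.classes_ = [3, 4, 5, 6].
proba_toy = np.array([
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.3, 0.5],
    [0.2, 0.5, 0.2, 0.1],
])

# Manual log loss from the binarized labels.
manual = -np.mean(np.sum(y_bin * np.log(proba_toy), axis=1))

# Reference value; `labels` tells log_loss that there are four classes.
reference = log_loss(y_test_toy, proba_toy, labels=lb_toy.classes_)
print(manual, reference)
```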

## Hyperparameter tuning

**10) We will now experiment with various models for classification. For each one of the following models, design a grid of hyperparameters based on the corresponding scikit-learn documentation:**
- **[KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)**
- **[Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)**
- **[Linear Discriminant Analysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)**
- **[Quadratic Discriminant Analysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html)**
- **[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)**
- **[Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**
- **[Gradient Boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)**
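
As a starting point, the search spaces can be stored next to the corresponding pipelines. The sketch below covers two of the models only; the dictionary name `search_spaces` and the chosen ranges are arbitrary assumptions, not recommended values. With `make_pipeline`, each step is named after its lowercased class name, hence the `step__parameter` keys.

```python
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical search spaces: model name -> (pipeline, parameter distributions).
search_spaces = {
    "knn": (
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {
            # Lists are sampled uniformly by RandomizedSearchCV.
            "kneighborsclassifier__n_neighbors": list(range(1, 51)),
            "kneighborsclassifier__weights": ["uniform", "distance"],
        },
    ),
    "logistic_regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {
            # A continuous distribution, sampled on a logarithmic scale.
            "logisticregression__C": loguniform(1e-3, 1e3),
        },
    ),
}

# Check that every key refers to an existing parameter of its pipeline.
for name, (pipe, space) in search_spaces.items():
    unknown = set(space) - set(pipe.get_params())
    print(name, "unknown parameters:", unknown)
```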

## Model fitting

**11) For each one of these models, select the hyperparameters that give the lowest log loss using the `RandomizedSearchCV` class. Don't forget to normalize the data if necessary. Compute the accuracy and log loss on the test dataset for the best hyperparameters.**
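
The search for one model could look like the following sketch on synthetic data. The data, the variable names, and the values of `n_iter` and `cv` are assumptions chosen to keep the example fast. Scikit-learn maximizes scores, so the log loss is selected through `scoring="neg_log_loss"`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the wine data (an assumption).
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Scaling is part of the pipeline, so it is refitted on each training fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_distributions = {
    "kneighborsclassifier__n_neighbors": list(range(5, 51, 5)),
    "kneighborsclassifier__weights": ["uniform", "distance"],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=8,                 # number of sampled configurations
    scoring="neg_log_loss",   # higher is better, i.e. lower log loss
    cv=3,
    random_state=0,
)
search.fit(X_tr, y_tr)

# The best configuration is refitted on the whole training set by default.
proba = search.predict_proba(X_te)
test_log_loss = log_loss(y_te, proba, labels=search.best_estimator_.classes_)
test_accuracy = accuracy_score(y_te, search.predict(X_te))
print(search.best_params_, test_accuracy, test_log_loss)
```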

**Print the best hyperparameters corresponding to each model and plot a confusion matrix.**

**12) Create a pandas dataframe where each row corresponds to a model. The columns should correspond to the accuracy and log loss.**
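
One possible layout is sketched below. The model names and numbers are placeholders invented for the illustration, not results; they should be replaced by the metrics measured on the test set.

```python
import pandas as pd

# Placeholder values only (hypothetical models, made-up numbers).
results = {
    "model_a": {"accuracy": 0.50, "log_loss": 1.20},
    "model_b": {"accuracy": 0.55, "log_loss": 1.10},
}

# One row per model, one column per metric, sorted by increasing log loss.
summary = pd.DataFrame.from_dict(results, orient="index").sort_values("log_loss")
print(summary)
```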