# Machine Learning 2022-2023 - UMONS

# Classification

In this lab we'll experiment a bit more with the classification task in order to help you get started with the project.
We'll consider several models, some of which will be covered later in the course.
We'll be using the [Wine Quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) dataset, which contains several attributes of white wines.
Each observation is associated with a rating between 0 and 10, which will be the label of our classification task.

The columns of the dataframe contain the following information:
* fixed_acidity: amount of tartaric acid in g/dm^3
* volatile_acidity: amount of acetic acid in g/dm^3 
* citric_acid: amount of citric acid in g/dm^3
* residual_sugar: amount of sugar remaining after fermentation stops in g/dm^3
* chlorides: amount of salt in wine 
* free_sulfur_dioxide: amount of free SO2
* total_sulfur_dioxide: amount of free and bound forms of SO2
* density: density of the wine
* pH: pH level of the wine on a scale from 0 to 14
* sulphates: amount of sulphates 
* alcohol: the percent of alcohol content
* quality: quality of the wine (score between 0 and 10)

**Import the necessary libraries**

```python
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score,
                             confusion_matrix, log_loss)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

sns.set_theme()
np.random.seed(0)
```

## Data exploration

**1) Read the dataset 'wine.csv' and check its properties (length, types, missing values). Handle missing values if needed.** 
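
As a starting point, here is a minimal sketch of the usual inspection steps. It assumes a tiny in-memory CSV as a hypothetical stand-in for 'wine.csv' so that it runs on its own. Dropping incomplete rows is only one possible way to handle missing values.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'wine.csv': a few made-up rows, one with a missing value.
csv_text = """fixed_acidity,alcohol,quality
7.0,8.8,6
6.3,,6
8.1,10.1,5
7.2,9.9,7
"""
df = pd.read_csv(io.StringIO(csv_text))  # with the real file: pd.read_csv("wine.csv")

print(len(df))          # number of observations
print(df.dtypes)        # type of each column
print(df.isna().sum())  # number of missing values per column

# One simple option (an assumption, not the only valid choice): drop incomplete rows.
df_clean = df.dropna().reset_index(drop=True)
```

Imputing the missing entries (for example with the column median) is an alternative when dropping rows would discard too much data.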

## Data splitting

**2) We predict the target 'quality' from all other features. Split your dataset into a training set and a test set following an 80/20 partition using the `train_test_split` function.**
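
A possible sketch of the split, assuming a synthetic dataframe in place of the wine data. The `stratify=y` argument is an optional choice that is not required by the exercise; it keeps the label proportions similar in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine dataframe (values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, size=200),
    "pH": rng.normal(3.2, 0.15, size=200),
    "quality": rng.integers(5, 8, size=200),
})

X = df.drop(columns="quality")  # all features except the target
y = df["quality"]               # the target

# 80/20 partition; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```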

## Data exploration

**3) Look at the distribution of the variable 'quality' in the training set using `sns.countplot`.**
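
The following sketch shows one way to call `sns.countplot` on the training labels. The labels are randomly generated here (an assumption made so the snippet is self-contained), so the resulting shape says nothing about the real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up training labels with an imbalanced distribution.
rng = np.random.default_rng(0)
y_train = pd.Series(
    rng.choice([4, 5, 6, 7, 8], size=300, p=[0.05, 0.30, 0.45, 0.15, 0.05]),
    name="quality",
)

fig, ax = plt.subplots()
sns.countplot(x=y_train, ax=ax)  # one bar per label value
ax.set_xlabel("quality")
ax.set_ylabel("count")

# The same information as a table, useful to spot rare classes.
counts = y_train.value_counts().sort_index()
print(counts)
```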

**4) For each continuous feature, plot a boxplot of this feature grouped by label values. Use the `sns.boxplot` function. Which features seem to be the most useful to predict the label 'quality'?**
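
A hedged sketch of the grouped boxplots, using a synthetic training set in which 'alcohol' is built to depend on the label and 'pH' is not. These relationships are assumptions chosen for illustration, not findings about the wine data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic training set (made-up values).
rng = np.random.default_rng(0)
quality = rng.integers(4, 9, size=300)
train = pd.DataFrame({
    "alcohol": 9.0 + 0.4 * quality + rng.normal(0, 0.8, size=300),  # depends on the label
    "pH": rng.normal(3.2, 0.15, size=300),                          # independent of the label
    "quality": quality,
})

features = [c for c in train.columns if c != "quality"]
fig, axes = plt.subplots(1, len(features), figsize=(5 * len(features), 4))
axes = np.atleast_1d(axes)  # keeps the loop valid even with a single feature

for ax, feature in zip(axes, features):
    # One box per label value: boxes that shift with 'quality' suggest a useful feature.
    sns.boxplot(data=train, x="quality", y=feature, ax=ax)
    ax.set_title(feature)
fig.tight_layout()
```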

**5) Plot the pairwise relationship of the most useful features using the function `sns.pairplot`. Plot a different color according to the value of the variable 'quality' using the `hue` parameter. What do you observe?**

## Data preprocessing

**6) Normalize the continuous features using the `StandardScaler` class. Make sure to fit it on the training dataset to avoid data leakage.**
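
The leakage-free pattern can be sketched as follows, assuming random arrays in place of the real features: the scaler learns its mean and standard deviation from the training set only, and the test set is transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices standing in for the training and test sets.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[10.0, 3.2], scale=[1.2, 0.15], size=(160, 2))
X_test = rng.normal(loc=[10.0, 3.2], scale=[1.2, 0.15], size=(40, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics estimated on the training set only
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and standard deviation

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Calling `fit` or `fit_transform` on the test set would let test-set statistics influence preprocessing, which is the leakage the exercise warns about.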

## Hyperparameter tuning

**7) For each one of the following models, create a grid of hyperparameters based on the corresponding scikit-learn documentation:**
- **KNN**
- **Gaussian Naive Bayes**
- **Linear Discriminant Analysis**
- **Logistic Regression**
- **Random Forest**
- **Gradient Boosting**
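
One way to organize the search spaces is a dictionary mapping each model name to an estimator and its hyperparameter distributions. The ranges below are illustrative assumptions rather than recommended values, and fixing `solver="lsqr"` for LDA is a choice made here so that `shrinkage` can be tuned.

```python
from scipy.stats import loguniform, randint
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Lists are sampled uniformly; scipy distributions are sampled by RandomizedSearchCV.
models_and_grids = {
    "KNN": (
        KNeighborsClassifier(),
        {"n_neighbors": randint(1, 50), "weights": ["uniform", "distance"]},
    ),
    "Gaussian Naive Bayes": (
        GaussianNB(),
        {"var_smoothing": loguniform(1e-11, 1e-5)},
    ),
    "Linear Discriminant Analysis": (
        LinearDiscriminantAnalysis(solver="lsqr"),
        {"shrinkage": [None, "auto", 0.1, 0.5, 0.9]},
    ),
    "Logistic Regression": (
        LogisticRegression(max_iter=1000),
        {"C": loguniform(1e-3, 1e3)},
    ),
    "Random Forest": (
        RandomForestClassifier(random_state=0),
        {
            "n_estimators": randint(50, 300),
            "max_depth": [None, 5, 10, 20],
            "min_samples_leaf": randint(1, 10),
        },
    ),
    "Gradient Boosting": (
        GradientBoostingClassifier(random_state=0),
        {
            "learning_rate": loguniform(1e-3, 0.3),
            "n_estimators": randint(50, 300),
            "max_depth": randint(2, 6),
        },
    ),
}
```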

## Model fitting

**8) For each one of these models, select the hyperparameters that give the lowest log loss using the `RandomizedSearchCV` class. Compute the accuracy and log loss on the test dataset.**

**Print the best hyperparameters corresponding to each model and plot a confusion matrix using the function `confusion_matrix` and the class `ConfusionMatrixDisplay`.**
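
The procedure can be sketched for a single model as follows. The data come from `make_classification` (an assumption, so the snippet runs without the wine file), and KNN with a small search space is used purely as an example. The `scoring="neg_log_loss"` argument makes the search keep the configuration with the lowest cross-validated log loss.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score,
                             confusion_matrix, log_loss)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem standing in for the preprocessed wine data.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={
        "n_neighbors": list(range(1, 41)),
        "weights": ["uniform", "distance"],
    },
    n_iter=10,               # number of sampled configurations
    scoring="neg_log_loss",  # higher is better, so the lowest log loss wins
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_log_loss = log_loss(y_test, search.predict_proba(X_test), labels=search.classes_)
print(search.best_params_, test_accuracy, test_log_loss)

cm = confusion_matrix(y_test, y_pred, labels=search.classes_)
ConfusionMatrixDisplay(cm, display_labels=search.classes_).plot()
```

Looping over a dictionary of models and search spaces applies the same steps to every model in the list.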

## Test metrics

**9) Create a pandas dataframe where each row corresponds to a model. Display the accuracy and log loss computed on the test set in two columns. What are the advantages of the log loss over the accuracy?**
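
To make the comparison concrete, the sketch below builds such a dataframe for two hypothetical binary classifiers whose predicted probabilities are made up. Both classify every observation correctly, so accuracy cannot tell them apart, while the log loss also reflects how confident each model was in the correct class.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([0, 1, 1, 0])

# Hypothetical predicted probabilities of class 1 for two models.
probabilities = {
    "confident model": np.array([0.10, 0.90, 0.80, 0.40]),
    "hesitant model": np.array([0.45, 0.55, 0.60, 0.49]),
}

rows = []
for name, p in probabilities.items():
    y_pred = (p >= 0.5).astype(int)  # hard predictions with a 0.5 threshold
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "log_loss": log_loss(y_true, p),
    })

results = pd.DataFrame(rows).set_index("model")
print(results)
```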