In this lab, we'll experiment a bit more with binary classification. We'll consider four classifiers: Logistic Regression, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Naïve Bayes.

We'll use the 'wine.csv' dataset, which contains several attributes of white wines. Each observation is associated with a binary quality value indicating whether the wine is of superior quality or not. The goal is to use the above classifiers to predict the quality group of a wine from its attributes.

The columns of the dataframe contain the following information :
* fixed_acidity : amount of tartaric acid in g/dm^3
* volatile_acidity : amount of acetic acid in g/dm^3 
* citric_acid : amount of citric acid in g/dm^3
* residual_sugar : amount of remaining sugar after fermentation stops in g/l
* chlorides : amount of salt in wine 
* free_sulfur_dioxide : amount of free SO2
* total_sulfur_dioxide : amount of free and bound forms of SO2
* density : density of the wine
* pH : pH level of the wine on a scale from 0 to 14
* sulphates : amount of sulphates 
* alcohol : the percent of alcohol content
* quality : quality of the wine (1 : superior, 0 : inferior)

**Import necessary libraries**

In [None]:
import numpy as np 
import pandas as pd 
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt
from scipy.stats import norm 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
import math
import seaborn as sns
import statsmodels.api as sm
from patsy import dmatrices


# Data exploration

**1) Read the dataset 'wine.csv', check its properties. Check and handle possible missing values.** 
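As a hint for the mechanics, here is a minimal sketch of inspecting a dataframe and handling missing values. The small frame below is a hypothetical stand-in for the result of `pd.read_csv('wine.csv')`, and dropping incomplete rows is only one possible choice (imputation is another).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv("wine.csv"): a tiny frame with two gaps
df = pd.DataFrame({
    "alcohol": [9.5, 10.1, np.nan, 12.3, 11.0],
    "density": [0.9980, 0.9950, 0.9940, np.nan, 0.9920],
    "quality": [0, 0, 1, 1, 1],
})

df.info()                     # column types and non-null counts
n_missing = df.isna().sum()   # number of missing values per column
print(n_missing)

# One simple option: drop the incomplete rows
df_clean = df.dropna().reset_index(drop=True)
print(df_clean.shape)
```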

**2) Look at the distribution of the target variable 'quality' (using a barplot).**

**Do you notice anything ?**

**3) Plot a boxplot for each of the predictor variables, while separating for the quality level. Use the 'boxplot' method of the seaborn library.**

**From the obtained boxplots, can you spot the 3 predictors that might seem to be most useful in predicting the target variable 'quality' ?**

# Logistic Regression

**4) Select 'quality' as the target variable and 'density' and 'alcohol' as the predictors. Fit a logistic regression model to the data, and output its summary.**

**How do you interpret the obtained coefficients ? Are they significant ? What does it tell you ?** 

**Use the 'Logit' model of the 'statsmodels' library, and create your input matrices using the function 'dmatrices' from the 'patsy' library. Do not split the dataset.**

**5) Refit the same model as above, but introduce an interaction term between the variables 'density' and 'alcohol'.**

**What do you observe ? What happened to the significance of the coefficients ?**

# Linear Discriminant Analysis

**6) Select the predictor variable 'density' and fit an LDA model to classify the target variable 'quality'. Do not split the dataset.**

**Compute the decision boundary of the model. Then, on the same plot, display the densities of the variable 'density' for each class of 'quality', together with this decision boundary. What do you observe ?**

**You can draw a normal distribution using the function 'norm.pdf' from the 'scipy' library. Check the attributes of the class 'LinearDiscriminantAnalysis' to see how you can obtain the priors, the means, and the variance.**

**7) Select the variables 'sulphates' and 'alcohol' as predictor variables, and fit an LDA model to predict the variable 'quality'. Do not split the dataset.**
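A minimal sketch of the one-predictor case, on synthetic data (the class means, spreads, and sizes are assumptions). With a shared variance, equating the two LDA discriminant functions gives the closed-form boundary written in the comment below. Note that `covariance_` is only stored when `store_covariance=True`.

```python
import numpy as np
from scipy.stats import norm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic one-dimensional data: two Gaussian classes with a common spread
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(1.0, 1.0, 700), rng.normal(-1.0, 1.0, 300)])
y = np.concatenate([np.zeros(700, dtype=int), np.ones(300, dtype=int)])

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(x.reshape(-1, 1), y)
pi0, pi1 = lda.priors_          # estimated class priors
mu0, mu1 = lda.means_.ravel()   # estimated class means
var = lda.covariance_[0, 0]     # estimated shared variance

# Boundary: x* = (mu0 + mu1) / 2 + var * log(pi0 / pi1) / (mu1 - mu0)
boundary_formula = (mu0 + mu1) / 2 + var * np.log(pi0 / pi1) / (mu1 - mu0)

# Same boundary read from the fitted linear score coef * x + intercept = 0
boundary_model = -lda.intercept_[0] / lda.coef_[0, 0]

# At the boundary, the prior-weighted class densities are equal
sigma = np.sqrt(var)
w0 = pi0 * norm.pdf(boundary_formula, mu0, sigma)
w1 = pi1 * norm.pdf(boundary_formula, mu1, sigma)
print(boundary_formula, boundary_model)
```

The two boundary values can differ very slightly, since scikit-learn's default solver scales the within-class variance a little differently from the stored `covariance_`.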

**Draw a scatter plot of the predictor variables, while separating for the class 'quality'. Then, on the predictor space, draw the decision boundary.** 

**Check how you can obtain the 'coefficients' and the 'intercept' for the decision boundary [here](https://scikit-learn.org/stable/modules/lda_qda.html#lda-qda-math) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html).**
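With two predictors, the boundary is the line where the linear score `coef_ @ x + intercept_` equals zero. The sketch below uses synthetic two-feature data (the means and covariances are assumptions) and solves that equation for the second feature.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-feature data standing in for the two chosen predictors
rng = np.random.default_rng(1)
cov = [[0.01, 0.0], [0.0, 1.0]]
X0 = rng.multivariate_normal([0.45, 10.0], cov, 600)
X1 = rng.multivariate_normal([0.50, 11.5], cov, 300)
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(600, dtype=int), np.ones(300, dtype=int)])

lda = LinearDiscriminantAnalysis().fit(X, y)
w1, w2 = lda.coef_[0]     # one row of coefficients in the binary case
b = lda.intercept_[0]

# Solve w1 * x1 + w2 * x2 + b = 0 for x2 along a grid of x1 values
x1_grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 50)
x2_boundary = -(b + w1 * x1_grid) / w2

# Sanity check: the decision function vanishes on the boundary
scores = lda.decision_function(np.column_stack([x1_grid, x2_boundary]))
```

The pairs `(x1_grid, x2_boundary)` can then be drawn on top of the scatter plot with `plt.plot`.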

# Classifiers comparison and metrics

**8) Using the three most powerful predictors identified in exercise 3, successively fit a logistic regression model, an LDA, a QDA, and a Naïve Bayes model to predict the target variable 'quality' using 10-fold cross-validation. Evaluate each model on its accuracy.**

**To avoid convergence issues, set the logistic regression solver to 'liblinear', and fit the intercept. The Gaussian Naïve Bayes classifier is called 'GaussianNB' in scikit-learn.**

**Which of the four classifiers performs best in classifying the target variable 'quality' ?**
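One way to organize the comparison is a dictionary of models looped over `cross_validate`. In this sketch the data comes from `make_classification`, which is an assumption standing in for the three selected wine predictors.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for three predictors and a binary target
X, y = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear", fit_intercept=True),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Naive Bayes": GaussianNB(),
}

# Mean accuracy over the 10 folds for each classifier
mean_accuracy = {}
for name, model in models.items():
    cv_results = cross_validate(model, X, y, cv=10, scoring="accuracy")
    mean_accuracy[name] = cv_results["test_score"].mean()
    print(f"{name}: {mean_accuracy[name]:.3f}")
```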


**9) Perform the same experiment as the previous point, but now include all predictor variables. Do you notice a significant gain in performance ?**

**10) Select the best model found above and the 3 most relevant predictors, split your data into training and test sets according to a 0.8/0.2 partition, and fit the model to the training data.**

**Evaluate the model on the accuracy, and display the confusion matrix. You'll need the 'confusion_matrix' function and the 'ConfusionMatrixDisplay' class from the scikit-learn library. Can you think of any problem that might arise when evaluating the model on the accuracy ? Think of the distribution of the target variable.**
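The split-fit-evaluate sequence can be sketched as below. The data is synthetic, LDA is used only as a placeholder for whichever model was best, and `stratify=y` is an optional choice added here to keep the class proportions equal in both sets.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 80% class 0, 20% class 1)
X, y = make_classification(
    n_samples=1000, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)

# 0.8 / 0.2 partition; stratify is an assumption, not required by the exercise
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LinearDiscriminantAnalysis().fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted
print(acc)
print(cm)

# To draw it: ConfusionMatrixDisplay(confusion_matrix=cm).plot()
```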

**11) Compute the True Positive Rate, the True Negative Rate, the False Positive Rate, the False Negative Rate, and the Precision. How do you interpret these metrics ? The TP, FP, TN and FN can be directly obtained from the confusion matrix.**
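For a binary problem, `confusion_matrix(...).ravel()` returns the counts in the order TN, FP, FN, TP. The labels below are made up by hand to keep the arithmetic easy to follow.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels: 4 positives and 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)         # True Positive Rate (sensitivity, recall)
tnr = tn / (tn + fp)         # True Negative Rate (specificity)
fpr = fp / (fp + tn)         # False Positive Rate = 1 - TNR
fnr = fn / (fn + tp)         # False Negative Rate = 1 - TPR
precision = tp / (tp + fp)   # share of predicted positives that are correct

print(tpr, tnr, fpr, fnr, precision)
```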

**Draw the Receiver Operating Characteristic (ROC) curve, and compute the area under the curve (AUC). You'll need the functions 'roc_curve' and 'roc_auc_score'. Does the model appear to have any predictive power ?**
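Both functions take the true labels and a continuous score, not hard class predictions. The four scores below are made up for illustration; with a fitted classifier, they would typically be `model.predict_proba(X_test)[:, 1]`.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and scores (assumed values, not model output)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

auc_value = roc_auc_score(y_true, y_score)
auc_from_curve = auc(fpr, tpr)   # trapezoidal area under the same curve
print(auc_value)

# To draw it: plt.plot(fpr, tpr), plus the diagonal plt.plot([0, 1], [0, 1], "--")
```

An AUC of 0.5 corresponds to the diagonal, that is, a classifier that ranks positives no better than chance.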