# Classification Challenge

Wine experts can identify wines from specific vineyards through smell and taste, but the factors that give different wines their individual characteristics are actually based on their chemical composition.

In this challenge, you must train a classification model to analyze the chemical and visual features of wine samples and classify them based on their cultivar (grape variety).

> **Citation**: The data used in this exercise was originally collected by Forina, M. et al.
>
> PARVUS - An Extendible Package for Data Exploration, Classification and Correlation.
> Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno,
> 16147 Genoa, Italy.
>
> It can be downloaded from the UCI dataset repository (Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science).

## Explore the data

Run the following cell to load a CSV file of wine data, which consists of 13 numeric features and a classification label with the following classes:

- **0** (*variety A*)
- **1** (*variety B*)
- **2** (*variety C*)

In [1]:
import pandas as pd

# load the training dataset
data = pd.read_csv('data/wine.csv')
data.sample(10)

Unnamed: 0,Alcohol,Malic_acid,Ash,Alcalinity,Magnesium,Phenols,Flavanoids,Nonflavanoids,Proanthocyanins,Color_intensity,Hue,OD280_315_of_diluted_wines,Proline,WineVariety
93,12.29,2.83,2.22,18.0,88,2.45,2.25,0.25,1.99,2.15,1.15,3.3,290,1
74,11.96,1.09,2.3,21.0,101,3.38,2.14,0.13,1.65,3.21,0.99,3.13,886,1
140,12.93,2.81,2.7,21.0,96,1.54,0.5,0.53,0.75,4.6,0.77,2.31,600,2
115,11.03,1.51,2.2,21.5,85,2.46,2.17,0.52,2.01,1.9,1.71,2.87,407,1
132,12.81,2.31,2.4,24.0,98,1.15,1.09,0.27,0.83,5.7,0.66,1.36,560,2
84,11.84,0.89,2.58,18.0,94,2.2,2.21,0.22,2.35,3.05,0.79,3.08,520,1
116,11.82,1.47,1.99,20.8,86,1.98,1.6,0.3,1.53,1.95,0.95,3.33,495,1
106,12.25,1.73,2.12,19.0,80,1.65,2.03,0.37,1.63,3.4,1.0,3.17,510,1
38,13.07,1.5,2.1,15.5,98,2.4,2.64,0.28,1.37,3.7,1.18,2.69,1020,0
124,11.87,4.31,2.39,21.0,82,2.86,3.03,0.21,2.91,2.8,0.75,3.64,380,1


Your challenge is to explore the data and train a classification model that achieves an overall *Recall* metric of over 0.95 (95%).
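The challenge text does not say how the per-class recalls are combined into one "overall" figure. The sketch below assumes macro averaging (the unweighted mean of the per-class recalls) and contrasts it with support-weighted averaging. The confusion matrix is hypothetical and is not a result from this notebook.

```python
# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted).
# The counts are invented for illustration only.
confusion = [
    [15, 1, 0],   # actual variety A
    [1, 19, 1],   # actual variety B
    [0, 0, 8],    # actual variety C
]


def per_class_recall(cm):
    """Recall per class: correct predictions divided by actual samples of that class."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


def macro_recall(cm):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(cm)
    return sum(recalls) / len(recalls)


def weighted_recall(cm):
    """Mean of the per-class recalls, weighted by the number of samples per class."""
    total = sum(sum(row) for row in cm)
    return sum(r * sum(row) for r, row in zip(per_class_recall(cm), cm)) / total


print("Per-class recall:", [round(r, 3) for r in per_class_recall(confusion)])
print("Macro recall:", round(macro_recall(confusion), 3))
print("Weighted recall:", round(weighted_recall(confusion), 3))
```

With macro averaging, a small class counts as much as a large one, so a few mistakes on the least common variety can pull the overall figure below the 0.95 target.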

> **Note**: There is no single "correct" solution. A sample solution is provided in [03 - Wine Classification Solution.ipynb](03%20-%20Wine%20Classification%20Solution.ipynb).

## Train and evaluate a model

Add markdown and code cells as required to explore the data, train a model, and evaluate the model's predictive performance.

### Data exploration

In [2]:
labels = ['Variété A', 'Variété B', 'Variété C']

In [3]:
# Your code to evaluate data, and train and evaluate a classification model
data.describe()

Unnamed: 0,Alcohol,Malic_acid,Ash,Alcalinity,Magnesium,Phenols,Flavanoids,Nonflavanoids,Proanthocyanins,Color_intensity,Hue,OD280_315_of_diluted_wines,Proline,WineVariety
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258,0.938202
std,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474,0.775035
min,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0,0.0
25%,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5,0.0
50%,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5,1.0
75%,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0,2.0
max,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0,2.0


#### Checking for missing values

In [4]:
data.isna().sum()

Alcohol                       0
Malic_acid                    0
Ash                           0
Alcalinity                    0
Magnesium                     0
Phenols                       0
Flavanoids                    0
Nonflavanoids                 0
Proanthocyanins               0
Color_intensity               0
Hue                           0
OD280_315_of_diluted_wines    0
Proline                       0
WineVariety                   0
dtype: int64

#### Feature distributions

In [5]:
import plotly.express as px
from IPython.display import display

data_features = [
    "Alcohol",
    "Malic_acid",
    "Ash",
    "Alcalinity",
    "Magnesium",
    "Phenols",
    "Flavanoids",
    "Nonflavanoids",
    "Proanthocyanins",
    "Color_intensity",
    "Hue",
    "OD280_315_of_diluted_wines",
    "Proline",
]
data_label = "WineVariety"

for col in data_features:
    fig = px.box(data, y=col, color=data_label)
    fig.for_each_trace(
        lambda t: t.update(
            name=labels[int(t.name)],
            legendgroup=labels[int(t.name)],
            hovertemplate=t.hovertemplate.replace(t.name, labels[int(t.name)]),
        )
    )
    display(fig)


### Data preparation
#### Splitting the data into training and test sets

In [6]:
from sklearn.model_selection import train_test_split

X, y = data[data_features].values, data[data_label].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(f'Training Set: {X_train.shape[0]} rows')
print(f'Test Set: {X_test.shape[0]} rows')

Training Set: 133 rows
Test Set: 45 rows
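The 133/45 split follows from the defaults of `train_test_split`. As a small arithmetic sketch, assuming the default `test_size` of 0.25 (the call above sets neither `test_size` nor `train_size`) and that the test count is rounded up:

```python
import math

n_samples = 178    # row count reported by data.describe()
test_size = 0.25   # assumption: scikit-learn's default when no size is passed

# The test set size is rounded up; the training set gets the remaining rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"Training Set: {n_train} rows")
print(f"Test Set: {n_test} rows")
```

With only 45 test rows, the smallest class has few examples in the test set. Passing `stratify=y` to `train_test_split` is one option for keeping class proportions similar in both sets, although it would change the split and therefore the reported metrics.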


#### Data preprocessing

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Define preprocessing for numeric columns (scale them).
# X contains only the feature columns, in data_features order,
# so the columns are selected by their position in X.
numeric_features = list(range(len(data_features)))
numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
    ]
)
display(preprocessor)
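To make the scaling step concrete, here is a plain-Python sketch of the calculation `StandardScaler` performs on a single column: subtract the mean and divide by the population standard deviation. The ten Proline values are copied from the `data.sample(10)` output earlier in this notebook; the helper name `standardize` is only illustrative.

```python
from statistics import fmean, pstdev

# Proline values from the ten sampled rows shown earlier in the notebook
proline = [290, 886, 600, 407, 560, 520, 495, 510, 1020, 380]


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


scaled = standardize(proline)
print([round(z, 2) for z in scaled])
print("mean after scaling:", round(fmean(scaled), 6))
print("std after scaling:", round(pstdev(scaled), 6))
```

Inside the pipeline, the mean and standard deviation are learned from the training rows only and then reused for the test rows and for any new samples, which matters here because features such as Proline (hundreds) and Hue (around 1) sit on very different scales.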


### Model training and evaluation
#### LogisticRegression

In [8]:
from sklearn.linear_model import LogisticRegression

# The default regularization strength (C=1.0) is used
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', LogisticRegression(solver='lbfgs'))])

# train a logistic regression model on the training set
model = pipeline.fit(X_train, y_train)
display(model)

In [9]:
from sklearn.metrics import classification_report

predictions = model.predict(X_test)

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      1.00      1.00        21
           2       1.00      1.00      1.00         8

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45



In [10]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

print("Overall Accuracy:", accuracy_score(y_test, predictions))
print("Overall Precision:", precision_score(y_test, predictions, average='macro'))
print("Overall Recall:", recall_score(y_test, predictions, average='macro'))

Overall Accuracy: 1.0
Overall Precision: 1.0
Overall Recall: 1.0


In [11]:
from sklearn.metrics import confusion_matrix

# Rows are the actual varieties, columns are the predicted varieties
cm = confusion_matrix(y_test, predictions)
px.imshow(
    cm,
    text_auto=True,
    x=labels,
    y=labels,
    color_continuous_scale="blues",
    labels={"x": "Predicted Variety", "y": "Actual Variety"},
)


In [12]:
import plotly.graph_objects as go
from sklearn.metrics import roc_curve

# Get class probability scores
y_scores = model.predict_proba(X_test)

# Create an empty figure, and iteratively add new lines
# every time we compute a new class
fig = go.Figure()
fig.add_shape(type="line", line={"dash": "dash"}, x0=0, x1=1, y0=0, y1=1)

# Get ROC metrics for each class
for i in range(len(labels)):
    fpr, tpr, _ = roc_curve(y_test, y_scores[:, i], pos_label=i)
    fig.add_trace(go.Scatter(x=fpr, y=tpr, name=labels[i], mode="lines"))


fig.update_layout(
    title="Multiclass ROC curve",
    xaxis_title="False Positive Rate",
    yaxis_title="True Positive Rate",
    yaxis={"scaleanchor": "x", "scaleratio": 1},
    xaxis={"constrain": "domain"},
    width=700,
    height=500,
)
display(fig)
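Each curve in the plot treats one variety as the positive class and the other two as negative. The following plain-Python sketch shows how the points of such a one-vs-rest curve can be derived by sweeping a threshold over the scores. The labels and scores are hypothetical, and `roc_points` is an illustrative helper, not the `roc_curve` implementation.

```python
def roc_points(y_true, scores, positive):
    """Return (FPR, TPR) pairs for one class vs. the rest, one per distinct threshold."""
    is_pos = [label == positive for label in y_true]
    n_pos = sum(is_pos)
    n_neg = len(is_pos) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, p in zip(scores, is_pos) if s >= threshold and p)
        fp = sum(1 for s, p in zip(scores, is_pos) if s >= threshold and not p)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical true classes and predicted probabilities for class 0
y_true = [0, 0, 1, 2, 1, 2]
class0_scores = [0.9, 0.6, 0.7, 0.2, 0.1, 0.05]

for fpr, tpr in roc_points(y_true, class0_scores, positive=0):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A curve that reaches a true positive rate of 1 while the false positive rate is still 0 means the class is perfectly separated from the rest at some threshold.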


In [13]:
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test, y_scores, multi_class='ovr')
print('Average AUC:', auc)

Average AUC: 1.0
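One way to read an AUC value: for a single class versus the rest, it equals the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score. With `multi_class='ovr'`, `roc_auc_score` computes one such value per class and, by default, reports their unweighted mean. The sketch below illustrates the pairwise interpretation with hypothetical scores; `auc_from_scores` is an illustrative helper.

```python
def auc_from_scores(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted probabilities for "variety A vs. the rest"
variety_a_scores = [0.95, 0.90, 0.60]     # samples that really are variety A
other_scores = [0.70, 0.20, 0.10, 0.05]   # samples from the other varieties

print("AUC:", round(auc_from_scores(variety_a_scores, other_scores), 3))
```

An AUC of 1.0, as reported above, means every sample of each variety was scored higher for its own class than any sample from the other varieties.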


#### SVC

In [14]:
from sklearn.svm import SVC

pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", SVC(probability=True))]
)

# train a support vector classifier on the training set
model = pipeline.fit(X_train, y_train)
display(model)


In [15]:
# Get predictions from test data
predictions = model.predict(X_test)
y_scores = model.predict_proba(X_test)

# Overall metrics
print("Overall Accuracy:", accuracy_score(y_test, predictions))
print("Overall Precision:", precision_score(y_test, predictions, average="macro"))
print("Overall Recall:", recall_score(y_test, predictions, average="macro"))
print("Average AUC:", roc_auc_score(y_test, y_scores, multi_class="ovr"))

# Confusion matrix
cm = confusion_matrix(y_test, predictions)
px.imshow(
    cm,
    text_auto=True,
    x=labels,
    y=labels,
    color_continuous_scale="blues",
    labels={"x": "Predicted Variety", "y": "Actual Variety"},
)


Overall Accuracy: 1.0
Overall Precision: 1.0
Overall Recall: 1.0
Average AUC: 1.0


### Saving the model

In [16]:
import joblib

joblib.dump(model, 'wine_classification.joblib')

['wine_classification.joblib']

## Use the model with new data observations

When you're happy with your model's predictive performance, save it and then use it to predict classes for the following two new wine samples:

- \[13.72,1.43,2.5,16.7,108,3.4,3.67,0.19,2.04,6.8,0.89,2.87,1285\]
- \[12.37,0.94,1.36,10.6,88,1.98,0.57,0.28,0.42,1.95,1.05,1.82,520\]


In [17]:
# Your code to predict classes for the two new samples
import numpy as np

loaded_model = joblib.load("wine_classification.joblib")

X_new = np.array(
    [
        [13.72, 1.43, 2.5, 16.7, 108, 3.4, 3.67, 0.19, 2.04, 6.8, 0.89, 2.87, 1285],
        [12.37, 0.94, 1.36, 10.6, 88, 1.98, 0.57, 0.28, 0.42, 1.95, 1.05, 1.82, 520],
    ]
)

results = loaded_model.predict(X_new)
print("Predictions for the two new wines:")
print(*[f'{p} ({labels[p]})' for p in results], sep="\n")


Predictions for the two new wines:
0 (Variété A)
1 (Variété B)
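Because the pipeline was trained on a plain NumPy array, it identifies features by position only, so each new sample must list its 13 values in the same order as the training columns. A minimal sketch of a sanity check that could run before calling `predict` follows; the helper `check_samples` is hypothetical and not part of the notebook.

```python
# Feature names in the order used for training (same order as data_features)
FEATURES = [
    "Alcohol", "Malic_acid", "Ash", "Alcalinity", "Magnesium", "Phenols",
    "Flavanoids", "Nonflavanoids", "Proanthocyanins", "Color_intensity",
    "Hue", "OD280_315_of_diluted_wines", "Proline",
]


def check_samples(samples, feature_names=FEATURES):
    """Raise ValueError unless every sample has one value per feature.

    Returns the samples as dicts keyed by feature name for easy inspection.
    """
    for i, row in enumerate(samples):
        if len(row) != len(feature_names):
            raise ValueError(
                f"sample {i} has {len(row)} values, expected {len(feature_names)}"
            )
    return [dict(zip(feature_names, row)) for row in samples]


new_samples = [
    [13.72, 1.43, 2.5, 16.7, 108, 3.4, 3.67, 0.19, 2.04, 6.8, 0.89, 2.87, 1285],
    [12.37, 0.94, 1.36, 10.6, 88, 1.98, 0.57, 0.28, 0.42, 1.95, 1.05, 1.82, 520],
]

for named in check_samples(new_samples):
    print(named["Alcohol"], named["Proline"])
```

A check like this catches missing or extra values, but it cannot detect two values swapped within a row, so the ordering still needs to be verified against the source of the measurements.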
