# Multiclass Classification of White Wine Quality with Over-sampling and k-NN

## Wine Data
Data from http://archive.ics.uci.edu/ml/datasets/Wine+Quality

### Citations
<pre>
Dua, D. and Karra Taniskidou, E. (2017). 
UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/index.php]. 
Irvine, CA: University of California, School of Information and Computer Science.
</pre>

<pre>
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
</pre>

Available at:
- [@Elsevier](http://dx.doi.org/10.1016/j.dss.2009.05.016)
- [Pre-press (pdf)](http://www3.dsi.uminho.pt/pcortez/winequality09.pdf)
- [bib](http://www3.dsi.uminho.pt/pcortez/dss09.bib)

## Setup

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

white_wine = pd.read_csv('../../lab_10/data/winequality-white.csv', sep=';')

## EDA

In [None]:
white_wine.head()

In [None]:
white_wine.describe()

In [None]:
white_wine.info()

In [None]:
def plot_quality_scores(df, kind):
    """Plot a horizontal bar chart of quality score counts, annotated with percentages."""
    ax = df.quality.value_counts().sort_index().plot.barh(
        title=f'{kind.title()} Wine Quality Scores', figsize=(12, 3)
    )
    # put the lowest quality score at the top of the chart
    ax.invert_yaxis()
    # label each bar with its share of all wines
    for bar in ax.patches:
        ax.text(
            bar.get_width(),
            bar.get_y() + bar.get_height()/2,
            f'{bar.get_width()/df.shape[0]:.1%}',
            verticalalignment='center'
        )
    ax.set_xlabel('count of wines')
    ax.set_ylabel('quality score')

    # hide the top and right borders
    for spine in ['top', 'right']:
        ax.spines[spine].set_visible(False)

    return ax

plot_quality_scores(white_wine, 'white')

## White wine quality multiclass classification

### Train test split

In [None]:
from sklearn.model_selection import train_test_split

y = white_wine.quality
X = white_wine.drop(columns=['quality'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0, stratify=y
)
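
The `stratify=y` argument keeps the class proportions roughly the same in the training and test sets, which matters when some quality scores are rare. The following is a minimal sketch of that effect on made-up labels; `X_toy` and `y_toy` are hypothetical stand-ins, not the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 / 15 / 5 observations per class
y_toy = np.array([5] * 80 + [6] * 15 + [7] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Compare the class proportions in the full data and in each split
for name, part in [('all', y_toy), ('train', y_tr), ('test', y_te)]:
    labels, counts = np.unique(part, return_counts=True)
    shares = (counts / counts.sum()).round(2).tolist()
    print(name, dict(zip(labels.tolist(), shares)))
```

Without `stratify`, a small test set could end up with no examples of the rarest class at all.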

### Over-sampling

In [None]:
from imblearn.over_sampling import RandomOverSampler

X_train_oversampled, y_train_oversampled = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)
pd.Series(y_train_oversampled).value_counts()
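
To illustrate what random over-sampling does conceptually, here is a minimal NumPy-only sketch. It assumes the default behavior of bringing every class up to the majority class count by drawing existing rows with replacement; `random_oversample` is a hypothetical helper written for this illustration, not the `imblearn` implementation.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Hypothetical helper: duplicate rows at random until all classes match the largest one."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    target = counts.max()
    picked = []
    for label, count in zip(labels, counts):
        members = np.flatnonzero(y == label)
        # keep every original row, then draw the shortfall with replacement
        extra = members[rng.integers(0, count, size=target - count)]
        picked.append(np.concatenate([members, extra]))
    idx = np.concatenate(picked)
    return X[idx], y[idx]

# Made-up data: 6, 3, and 1 observations per class
y_toy = np.array([5] * 6 + [6] * 3 + [7] * 1)
X_toy = np.arange(20).reshape(10, 2)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(dict(zip(*np.unique(y_bal, return_counts=True))))
```

Because the added rows are exact copies, over-sampling is applied to the training data only; the test set keeps its original class distribution.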

### Building a model

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scale', StandardScaler()), 
    ('knn', KNeighborsClassifier())
])

search_space = {
    'knn__n_neighbors': np.arange(1, 10)
}

grid = GridSearchCV(pipeline, search_space, scoring='f1_macro', cv=5).fit(X_train_oversampled, y_train_oversampled)

Check the best hyperparameters:

In [None]:
grid.best_params_
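
Beyond `best_params_`, the fitted search object stores the cross-validated score for every candidate in `cv_results_`, which shows how close the runners-up were. The sketch below runs the same kind of search on synthetic data from `make_classification`; `X_demo`, `y_demo`, and `demo_grid` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the wine features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

demo_grid = GridSearchCV(
    Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())]),
    {'knn__n_neighbors': np.arange(1, 10)},
    scoring='f1_macro', cv=5,
).fit(X_demo, y_demo)

# One row per candidate value of k, best-ranked first
results = pd.DataFrame(demo_grid.cv_results_)[
    ['param_knn__n_neighbors', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(results.head())
```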

### Evaluating the model
Get the predictions:

In [None]:
preds = grid.predict(X_test)

Review the classification report:

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, preds))
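
The `macro avg` row of the report, like the `f1_macro` scorer used in the grid search, is the unweighted mean of the per-class scores, so rare quality scores count as much as common ones. A small sketch with made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true and predicted quality scores
y_true_toy = np.array([5, 5, 5, 5, 6, 6, 7, 7])
y_pred_toy = np.array([5, 5, 5, 6, 6, 6, 7, 5])

# One F1 score per class, in sorted label order
per_class = f1_score(y_true_toy, y_pred_toy, average=None)
# Macro F1: the plain average of the per-class scores
macro = f1_score(y_true_toy, y_pred_toy, average='macro')

print(per_class.round(3), round(macro, 3))
```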

Inspect the confusion matrix:

In [None]:
from ml_utils.classification import confusion_matrix_visual

confusion_matrix_visual(y_test, preds, np.sort(y.unique()))
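
`confusion_matrix_visual` comes from the `ml_utils` helper package. For readers without it, the underlying counts can be computed with scikit-learn directly. This is a sketch on made-up labels, not a reproduction of the helper's plot.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted quality scores
y_true_toy = np.array([5, 5, 5, 5, 6, 6, 7, 7])
y_pred_toy = np.array([5, 5, 5, 6, 6, 6, 7, 5])
labels = [5, 6, 7]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
# Row-normalized version: each row shows where a true class ends up
cm_normalized = confusion_matrix(
    y_true_toy, y_pred_toy, labels=labels, normalize='true'
)

print(cm)
print(cm_normalized.round(2))
```

The diagonal of the row-normalized matrix is the per-class recall shown in the classification report.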

Make precision-recall curves:

In [None]:
from ml_utils.classification import plot_multiclass_pr_curve
plot_multiclass_pr_curve(y_test, grid.predict_proba(X_test))
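
Multiclass precision-recall curves are commonly built one-vs-rest: each class in turn is treated as the positive class and scored with its own probability column. The sketch below assumes that approach (the internals of `plot_multiclass_pr_curve` are not shown here) and uses made-up probabilities whose columns follow the sorted class order, as `predict_proba` columns follow `classes_`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.preprocessing import label_binarize

classes = np.array([5, 6, 7])
y_true_toy = np.array([5, 5, 6, 6, 7, 7])
# Hypothetical predicted probabilities; each row sums to 1
proba_toy = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.5, 0.3],
])

# One indicator column per class: 1 where the true label is that class
y_bin = label_binarize(y_true_toy, classes=classes)

curves = {}
for i, label in enumerate(classes):
    precision, recall, _ = precision_recall_curve(y_bin[:, i], proba_toy[:, i])
    ap = average_precision_score(y_bin[:, i], proba_toy[:, i])
    curves[int(label)] = (precision, recall, ap)
    print(f'class {label}: average precision = {ap:.2f}')
```

Each `(recall, precision)` pair can then be passed to `plt.plot` to draw one curve per class.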
