# Clustering by Wine Color

## Wine Data
Data from http://archive.ics.uci.edu/ml/datasets/Wine+Quality

### Citations
<pre>
Dua, D. and Karra Taniskidou, E. (2017). 
UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/index.php]. 
Irvine, CA: University of California, School of Information and Computer Science.
</pre>

<pre>
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
</pre>

Available at:
- [@Elsevier](http://dx.doi.org/10.1016/j.dss.2009.05.016)
- [Pre-press (pdf)](http://www3.dsi.uminho.pt/pcortez/winequality09.pdf)
- [bib](http://www3.dsi.uminho.pt/pcortez/dss09.bib)

## Setup

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

red_wine = pd.read_csv('../../lab_09/data/winequality-red.csv')
white_wine = pd.read_csv('../../lab_09/data/winequality-white.csv', sep=';')

## EDA

In [None]:
white_wine.head()

In [None]:
red_wine.head()

In [None]:
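def plot_quality_scores(df, kind):
    """Plot a horizontal bar chart of quality score counts, labeled with percentages."""
    ax = df.quality.value_counts().sort_index().plot.barh(
        title=f'{kind.title()} Wine Quality Scores', figsize=(12, 3)
    )
    ax.invert_yaxis()  # show the lowest quality score at the top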
    for bar in ax.patches:
        ax.text(
            bar.get_width(), 
            bar.get_y() + bar.get_height()/2, 
            f'{bar.get_width()/df.shape[0]:.1%}',
            verticalalignment='center'
        )
    plt.xlabel('count of wines')
    plt.ylabel('quality score')

    for spine in ['top', 'right']:
        ax.spines[spine].set_visible(False)

    return ax

plot_quality_scores(white_wine, 'white')

In [None]:
plot_quality_scores(red_wine, 'red')

Combine the wine data:

In [None]:
wine = pd.concat([white_wine.assign(kind='white'), red_wine.assign(kind='red')])
wine.sample(5, random_state=10)

EDA on the wine data as a whole:

In [None]:
wine.info()

In [None]:
wine.describe()

In [None]:
wine.describe(include='object')

In [None]:
wine.kind.value_counts()

## Clustering to Separate Red and White Wines

In [None]:
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

y = wine.kind
X = wine.drop(columns=['quality', 'kind'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

kmeans_pipeline = Pipeline([
    ('scale', StandardScaler()), 
    ('kmeans', KMeans(n_clusters=2, random_state=0))
]).fit(X_train)

### Measure the agreement between predicted wine type and actual

In [None]:
pd.Series(kmeans_pipeline.predict(X_test)).value_counts()

In [None]:
y_test.value_counts()
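
Comparing the two sets of counts only shows whether the cluster sizes resemble the class sizes; it does not show which wines ended up in which cluster. A contingency table of cluster ID against actual kind makes the overlap explicit. The following is a minimal sketch on hypothetical toy labels (`actual` and `cluster` are made up for illustration, not taken from the wine data), using only the standard library:

```python
from collections import Counter

# Hypothetical toy labels for illustration only (not the wine data)
actual = ['white', 'white', 'white', 'white', 'red', 'red', 'red', 'white']
cluster = [0, 0, 0, 1, 1, 1, 1, 0]

# Count how often each (cluster ID, actual kind) combination occurs
contingency = Counter(zip(cluster, actual))

# Find the most common actual kind within each cluster
majority = {}
for cluster_id in sorted(set(cluster)):
    kinds_in_cluster = {
        kind: count
        for (cid, kind), count in contingency.items()
        if cid == cluster_id
    }
    majority[cluster_id] = max(kinds_in_cluster, key=kinds_in_cluster.get)

# Fraction of points whose actual kind matches their cluster's majority kind
matched = sum(majority[cid] == kind for cid, kind in zip(cluster, actual))
purity = matched / len(actual)

print(dict(contingency))
print(majority, purity)
```

In the notebook itself, something like `pd.crosstab(y_test.values, kmeans_pipeline.predict(X_test))` should give the same kind of table for the real data, assuming the pipeline and test split defined above.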

#### Fowlkes-Mallows Index
Values are in the range [0, 1], where 1 is perfect agreement. The index is computed over all pairs of points:
$$ FMI = \frac{TP}{\sqrt{(TP + FP)\times(TP + FN)}} $$
where
- TP = pairs of points that are in the same cluster in both the true labels and the predicted labels
- FP = pairs of points that are in the same cluster in the predicted labels but not in the true labels
- FN = pairs of points that are in the same cluster in the true labels but not in the predicted labels
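
To make the formula concrete, here is a small worked sketch that counts pairs by hand on six hypothetical points (`truth` and `clusters` are made up for illustration). The helper names `pair_counts` and `fmi` are not part of any library. The two "disagreement" counts enter the formula symmetrically, so the score is the same whichever of them is called FP or FN.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_true, labels_pred):
    """Count pairs grouped together in both labelings, only in the
    prediction, and only in the true labels."""
    both = pred_only = true_only = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            both += 1
        elif same_pred:
            pred_only += 1
        elif same_true:
            true_only += 1
    return both, pred_only, true_only


def fmi(labels_true, labels_pred):
    """Fowlkes-Mallows index computed directly from the pair counts."""
    both, pred_only, true_only = pair_counts(labels_true, labels_pred)
    if both == 0:
        return 0.0
    return both / sqrt((both + pred_only) * (both + true_only))


# Hypothetical toy labels for illustration only
truth = ['red', 'red', 'red', 'white', 'white', 'white']
clusters = [0, 0, 1, 1, 1, 1]

print(pair_counts(truth, clusters))  # (4, 3, 2)
print(fmi(truth, clusters))          # 4 / sqrt(7 * 6), about 0.617

# Swapping the cluster IDs leaves the score unchanged
flipped = [1 - c for c in clusters]
print(fmi(truth, flipped) == fmi(truth, clusters))  # True
```

The last check illustrates why the arbitrary numbering of the k-means clusters does not affect the score: only which points are grouped together matters.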

In [None]:
from sklearn.metrics import fowlkes_mallows_score
# encode y_test as 0/1 to compare with the cluster IDs; the score does not depend on which value 'red' is mapped to
fowlkes_mallows_score(np.where(y_test == 'red', 0, 1), kmeans_pipeline.predict(X_test))

### Finding the Centroids

In [None]:
pd.DataFrame(
    kmeans_pipeline.named_steps['kmeans'].cluster_centers_,
    columns=X_train.columns
).T
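
Because the pipeline scales the data before clustering, the centroids above are expressed in standardized units (standard deviations from each column's mean), not in the original measurement units. One way to read them in the original units is to pass them back through the scaler's `inverse_transform()`. The sketch below shows the idea on hypothetical toy data (`toy` and `toy_pipeline` are made up for illustration; the wine pipeline would be handled the same way via its `'scale'` step):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two well-separated groups with columns on different scales
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=[5, 100], scale=[1, 10], size=(50, 2)),
    rng.normal(loc=[15, 300], scale=[1, 10], size=(50, 2)),
])

toy_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('kmeans', KMeans(n_clusters=2, n_init=10, random_state=0))
]).fit(toy)

# Centroids as k-means sees them: in standardized units
scaled_centers = toy_pipeline.named_steps['kmeans'].cluster_centers_

# Undo the scaling to express the centroids in the original units
original_centers = toy_pipeline.named_steps['scale'].inverse_transform(scaled_centers)

print(scaled_centers)
print(original_centers)
```

For the wine data, the equivalent call would presumably be `kmeans_pipeline.named_steps['scale'].inverse_transform(...)` applied to the `cluster_centers_` array shown above.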
