# Exercise: Wine dataset

As with previous exercises, fill in the question marks with the correct code.

Last week you were introduced to the [wine dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality). We have 11 input variables and 1 output variable.

Input variables (based on physicochemical tests):

1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol

Output variable (based on sensory data):

12. quality (score between 0 and 10)

I suggest we look at two broad questions with this dataset:

1. Will dimension reduction reveal variable groupings? Think back to how we interpreted the loadings in the crime dataset.
2. What does clustering the wines tell us?

## Load data and import libraries

In [None]:
#| error: true
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import seaborn as sn?
from sklearn.cluster import KMeans
from sklearn.decomposition import PC?
from sklearn.decomposition import S????ePCA
from sklearn.manifold import TSNE

df = pd.read_excel('data/winequality-red_v2.xlsx')

In [None]:
#| error: true
df.h??d()

In [None]:
#| error: true

# May take a while depending on your computer
# feel free not to run this
sns.pair????(df)

## Dimension reduction

In [None]:
#| error: true

from sklearn.decomposition import PCA

n_components = 2
 
pca = PCA(n_??????????=n_components)
df_pca = pca.fit(df?iloc[:, 0:11])

In [None]:
#| error: true

df_pca_vals = df_pca.???_transform(df.iloc[:, 0:11])
df['c1'] = [item[0] for item in df_pca_????]
df['c2'] = [item[1] for item in df_pca_vals]

In [None]:
#| error: true

sns.scatterplot(data = df, x = ?, y = ?, hue = 'quality')

In [None]:
#| error: true

print(df.columns)
df_pca.components_
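
To help read the output above: each row of components_ is one principal component and each column lines up with one input variable, so pairing the rows with the column names makes the loadings easier to interpret. A minimal sketch on made-up data (the column names 'a' to 'd' and the random values are assumptions for illustration, not the wine data):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Made-up data: 50 rows, 4 hypothetical variables (not the wine data)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=['a', 'b', 'c', 'd'])

toy_pca = PCA(n_components=2).fit(toy)

# One row per component, one column per input variable
toy_loadings = pd.DataFrame(toy_pca.components_,
                            columns=toy.columns,
                            index=['component 1', 'component 2'])
print(toy_loadings)

# Share of the total variance captured by each component
print(toy_pca.explained_variance_ratio_)
```

If the first two ratios add up to only a small share of the variance, a 2D scatterplot of the components will hide much of the structure in the data.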

What about other dimension reduction methods?

## SparsePCA

In [None]:
#| error: true

s_pca = SparsePCA(n_components=n_components)
df_s_pca = s_pca.fit(df.????[:, 0:11])

In [None]:
#| error: true

df_s_pca_vals = s_pca.fit_?????????(df.iloc[:, 0:11])
df['c1 spca'] = [item[0] for item in df_s_pca_vals]
df['c2 spca'] = [item[1] for item in df_s_pca_vals]

In [None]:
#| error: true

sns.scatterplot(data = df, x = 'c1 spca', y = 'c2 spca', hue = 'quality')
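
What SparsePCA changes compared with PCA is the loadings: it adds an L1 penalty (the alpha parameter) that can push some loadings to exactly zero, which may make variable groupings easier to read. A rough sketch on made-up data, assuming alpha=1 and a fixed random_state purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

# Made-up data: 60 rows, 6 hypothetical variables (not the wine data)
rng = np.random.default_rng(0)
toy = rng.normal(size=(60, 6))

dense = PCA(n_components=2).fit(toy)
sparse = SparsePCA(n_components=2, alpha=1, random_state=0).fit(toy)

# Count loadings that are exactly zero in each model
n_zero_dense = int(np.sum(dense.components_ == 0))
n_zero_sparse = int(np.sum(sparse.components_ == 0))
print('zero loadings, PCA:', n_zero_dense)
print('zero loadings, SparsePCA:', n_zero_sparse)
```

Comparing s_pca.components_ with the PCA loadings on the wine data in the same way would show which variables, if any, SparsePCA drops from each component.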

## tSNE

In [None]:
#| error: true

tsne_model = TSNE(n_components=n_components)
df_tsne = tsne_model.fit(df.iloc[:, 0:11])

In [None]:
#| error: true

df_tsne_vals = tsne_model.fit_transform(df.iloc[:, 0:11])
df['c1 tsne'] = [item[0] for item in ??_tsne_vals]
df['c2 tsne'] = [item[1] for item in df_tsne_vals]

In [None]:
#| error: true

# This plot does not look right
# I am not sure why.
sns.scatterplot(data = ??, x = 'c1 tsne', y = 'c1 tsne', hue = 'quality')

That looks concerning - there is a straight line. It looks like something in the data has caused the model to have issues.

Does normalising the data sort out the issue?

In [None]:
#| error: true

from sklearn.preprocessing import MinMaxScaler
col_names = df.columns
scaled_df =  pd.DataFrame(MinMaxScaler().fit_transform(df))
scaled_df.columns = col_names
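
For reference, MinMaxScaler rescales each column to the range 0 to 1 using (x - min) / (max - min). Note that the cell above scales every column of df, including quality and the component columns added earlier. A small sketch with made-up numbers showing that the formula matches what the scaler returns:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: 3 rows, 2 hypothetical variables on different scales
toy = np.array([[1.0, 10.0],
                [2.0, 30.0],
                [4.0, 20.0]])

# Column-wise min-max scaling written out by hand
by_hand = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

# The same thing using scikit-learn
by_sklearn = MinMaxScaler().fit_transform(toy)

print(by_hand)
print(np.allclose(by_hand, by_sklearn))
```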

In [None]:
#| error: true

tsne_model = TSNE(n_components=n_components)

scaled_df_tsne = tsne_model.fit(scaled_df.iloc[:, 0:11])
scaled_df_tsne_vals = tsne_model.fit_transform(scaled_df.iloc[:, 0:11])

scaled_df['c1 tsne'] = [item[0] for item in scaled_df_tsne_vals]
scaled_df['c2 tsne'] = [item[1] for item in scaled_df_tsne_vals]

sns.scatterplot(data = scaled_df, x = 'c1 tsne', y = 'c1 tsne', hue = 'quality')

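Normalising the data makes no difference. It could be that the model is getting stuck somehow. You could check the attributes of the fitted tsne object (for example tsne_model.kl_divergence_ and tsne_model.n_iter_), try using only a few columns, and search online, as this could be a problem others have encountered. It is also worth checking the plotting calls themselves: both scatterplots above pass 'c1 tsne' as x and as y, and a column plotted against itself always gives a straight line.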
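
One thing worth ruling out when a scatterplot shows a perfectly straight line is the plotting call rather than the model. The sketch below fits t-SNE to made-up data (random values, assumed only for illustration) and compares the two output columns. If they differ, plotting the first against the second should not produce a line, whereas plotting the first against itself always will.

```python
import numpy as np
from sklearn.manifold import TSNE

# Made-up data: 100 rows, 5 hypothetical variables (not the wine data)
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 5))

emb = TSNE(n_components=2, random_state=0).fit_transform(toy)
c1, c2 = emb[:, 0], emb[:, 1]

# A column compared with itself: every point has x == y
print(np.array_equal(c1, c1))

# The two t-SNE components compared with each other
print(np.allclose(c1, c2))
```

Running the same comparison on df['c1 tsne'] and df['c2 tsne'] would be a quick way to see whether the wine embedding really is degenerate.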

For now, we will use PCA components.

In [None]:
#| error: true

data = {'columns' : df.iloc[:, 0:11].columns,
        'component 1' : df_pca.components_[0],
        'component 2' : df_pca.components_[1]}


loadings = pd.?????????(data)
loadings_sorted = loadings.sort_values(by=['component 1'], ascending=False)
loadings_sorted.iloc[0:10,:]

In [None]:
#| error: true

loadings_sorted = loadings.sort_values(by=['component 2'], ascending=False)
loadings_sorted.iloc[0:10,:]
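
When reading loadings, a large negative value matters as much as a large positive one, and sorting by the signed value pushes strongly negative loadings to the bottom of the table. A possible alternative is to sort by absolute value. This is a sketch with made-up loadings (the variable names and numbers are hypothetical):

```python
import pandas as pd

# Made-up loadings for four hypothetical variables
toy_loadings = pd.DataFrame({
    'columns': ['var a', 'var b', 'var c', 'var d'],
    'component 1': [0.10, -0.80, 0.55, -0.05],
})

# Sort by the size of the loading, ignoring its sign
by_size = toy_loadings.sort_values(by='component 1',
                                   key=lambda s: s.abs(),
                                   ascending=False)
print(by_size)
```

Here 'var b' comes first even though its loading is negative, because it contributes most to the component.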

## Clustering

In [None]:
#| error: true

from sklearn.cluster import KMeans

ks = range(1, 10)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    ????? = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(df[['c1', 'c2']])
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

import matplotlib.pyplot as plt

plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
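
The plot above is usually read by looking for an "elbow": the value of k after which adding another cluster reduces the inertia only slightly. One way to make that concrete is to print the drop in inertia at each step. A sketch on synthetic blobs (make_blobs with three centres, n_init=10 and a fixed random_state are assumptions for illustration, not the wine data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

ks = range(1, 10)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Reduction in inertia gained by moving from k-1 to k clusters
drops = -np.diff(inertias)
for k, drop in zip(list(ks)[1:], drops):
    print(k, round(float(drop), 1))
```

A sharp fall in the size of the drops suggests a candidate k. If the drops shrink smoothly, the data may not have well-separated clusters.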

In [None]:
#| error: true

k_means_3 = KMeans(n_clusters = 3, init = 'random')
k_means_3.fit(df[['c1', 'c2']])
df['Three clusters'] = pd.Series(k_means_3.???????(df[['c1', 'c2']].values), index = df.index)

In [None]:
#| error: true

sns.scatterplot(data = df, x = 'c1', y = 'c2', hue = 'Three clusters')

Consider:

* Is that useful?
* What might it mean?

Outside of this session you could try standardising the data (centring each variable around its mean), clustering the raw data (rather than the projections from PCA), getting tSNE working, or using different numbers of components.
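
As a starting point for the first two suggestions, here is a minimal sketch that centres and scales the variables and then clusters them directly, without PCA. It uses StandardScaler, which also scales to unit variance, and made-up data with three clusters. These are assumptions for illustration, so treat it as one possible approach rather than the expected answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up data: two hypothetical variables on very different scales
rng = np.random.default_rng(0)
toy = np.column_stack([rng.normal(0, 1, 200), rng.normal(50, 20, 200)])

# Centre each column on its mean and scale to unit variance
scaled = StandardScaler().fit_transform(toy)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))

# Cluster the scaled variables directly (no PCA projection)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(np.bincount(labels))
```

Without scaling, the variable with the larger spread would dominate the distances that KMeans uses.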