# Exercise set 8

>The goal of this exercise is to perform **principal component analysis**
>and **clustering** on a data set with many variables.

## Exercise 8.1

This exercise will explore the [wine data set](https://archive.ics.uci.edu/ml/datasets/Wine), a data set commonly used as an example for classification.
The data set contains the results of
a chemical analysis of wines from a region in Italy. These
wines are made using grapes grown by three different cultivators.
In this first exercise, we will explore the
data set using principal component analysis and investigate
if the results from the chemical analysis can be used to separate
the wines into groups that correspond to the cultivator of the grapes.

The data set contains the following columns:


| Column name                    | Description                                              |
|--------------------------------|----------------------------------------------------------|
| alcohol                        | The alcohol content of the wine.                         | 
| malic_acid                     | The amount of malic acid in the wine (malic acid has an apple aroma).  |
| ash                            | The amount of ash in the wine (ash is the matter that remains after evaporation and incineration).   | 
| alcalinity_of_ash              | The alkalinity of the ash content of the wine.           |
| magnesium                      | The amount of magnesium in the wine.                      |
| total_phenols                  | The total amount of [phenols](https://en.wikipedia.org/wiki/Phenolic_content_in_wine) in the wine. |
| flavanoids                     | The amount of [flavanoids](https://en.wikipedia.org/wiki/Flavonoid) in the wine. |
| nonflavanoid_phenols           | The amount of phenols that are not flavanoids in the wine.   |
| proanthocyanins                | The amount of [proanthocyanins](https://en.wikipedia.org/wiki/Proanthocyanidin) in the wine (important for red/blue/purple colors).   |
| color_intensity                | Color intensity of the wine (measured spectroscopically).  |
| hue                            | Color hue of the wine (measured spectroscopically).         |
| od280/od315_of_diluted_wines   | The protein content of the wine. OD280/OD315 is a method for determining protein concentration.                                     |
| proline                        | The amount of [proline](https://en.wikipedia.org/wiki/Proline) in the wine (proline is the main amino acid found in red wine).   |  
| target                         | The cultivator of the wine, given as 0, 1, or 2.   |

The data can be loaded as follows:

In [None]:
"""Load the wine data set"""
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.datasets import load_wine

sns.set_theme(style="ticks", context="notebook", palette="muted")

%matplotlib inline

# Load the data set as a pandas frame:
data_set = load_wine(as_frame=True)["frame"]
data_set.head()

### 8.1(a)
Begin by exploring the raw data. Here, you should choose
the method yourself. You can, for instance, look at histograms of the
different measured quantities, correlations between the quantities,
or other plots of the raw data (for instance, the 
[scatter plot matrix](https://seaborn.pydata.org/examples/scatterplot_matrix.html) we used in a previous exercise). After looking at the raw data, do any of the
variables seem to distinguish
between the wines produced by the different cultivators?
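If you want to look at correlations numerically before plotting anything, the following is a minimal sketch of one possible approach. The names `wine`, `corr`, and `pairs` are chosen here for illustration only; the sketch just lists the pairs of variables with the largest absolute Pearson correlation and does not replace looking at the plots.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)["frame"]
# Pearson correlation between all pairs of measured quantities:
corr = wine.drop(columns="target").corr()

# Pick out each pair of variables once (upper triangle, no diagonal):
names = corr.columns.to_numpy()
rows, cols = np.triu_indices(len(names), k=1)
pairs = pd.Series(
    corr.to_numpy()[rows, cols],
    index=[f"{names[r]} vs {names[c]}" for r, c in zip(rows, cols)],
)

# Show the pairs with the largest absolute correlation first:
order = pairs.abs().sort_values(ascending=False).index
print(pairs.loc[order].head())
```

A heat map of `corr` (for instance with `sns.heatmap`) gives the same information graphically.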

To make things a bit more interesting (and to show you how to make things slightly more interactive in a
Jupyter notebook), here are two examples that create dropdown selectors for picking variables:

In [None]:
from ipywidgets import Dropdown, interact

# This code shows the distributions for the three targets for one variable:


def show_data(variable):
    fig1, (ax1, ax2) = plt.subplots(
        constrained_layout=True, ncols=2, figsize=(8, 4)
    )
    sns.boxplot(data=data_set, y=variable, x="target", ax=ax1)
    sns.kdeplot(
        data=data_set,
        x=variable,
        hue="target",
        fill=True,
        palette="muted",
        ax=ax2,
    )


variables = [i for i in data_set if i != "target"]

dropdown = Dropdown(options=variables, description="Variable:")
interact(show_data, variable=dropdown)

In [None]:
# This is a 2D plot to show the distribution with two variables:


def show_data2(variable_x, variable_y):
    grid = sns.jointplot(
        data=data_set,
        x=variable_x,
        y=variable_y,
        hue="target",
        palette="muted",
    )


dropdown1 = Dropdown(options=variables, description="Variable X:")
dropdown2 = Dropdown(options=variables, description="Variable Y:")
interact(show_data2, variable_x=dropdown1, variable_y=dropdown2)

In [None]:
# Your code here

#### Your answer to question 8.1(a): Did you find some variables that seem to distinguish between cultivators?
*Double click here*

### 8.1(b)
Perform a PCA on the data set (see the example code 
for this below)
and consider the following:

* (i)  Do you need to scale your data before
  performing PCA in this case (why/why not)?


* (ii)  Should you include the `target` column in the data you use for the PCA?


* (iii)  How many principal components are needed to explain 95 % of the
  variance in the data? Answer this by plotting the explained variance
  as a function of the number of principal components.
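To get a feeling for what scaling does to a PCA, here is a small sketch on synthetic data (not the wine data). It assumes two independent variables, `large` and `small`, that are made up for this illustration and measured on very different scales, and compares the explained variance ratios with and without scaling.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
n_samples = 200
# Two independent synthetic variables on very different scales:
large = rng.normal(loc=500.0, scale=100.0, size=n_samples)
small = rng.normal(loc=1.0, scale=0.1, size=n_samples)
X_toy = np.column_stack([large, small])

# Explained variance ratio without and with scaling:
ratio_raw = PCA().fit(X_toy).explained_variance_ratio_
ratio_scaled = PCA().fit(scale(X_toy)).explained_variance_ratio_
print("Unscaled:", ratio_raw)
print("Scaled:  ", ratio_scaled)
```

Without scaling, the first component is dominated by the variable with the largest numerical spread, even though the two variables are unrelated. Whether this matters for the wine data depends on the ranges of its columns, which you can inspect with `data_set.describe()`.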


Example code for PCA:

In [None]:
"""Load the wine data set and run PCA."""
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

data_set = load_wine(as_frame=True)["frame"]
variables = [i for i in data_set.columns if i != "target"]
X = data_set[variables].to_numpy()

# Uncomment the following line to scale your data:
# X = scale(X)
pca = PCA()
scores = pca.fit_transform(X)

# Print out the percentage of variance explained by each component:
variance_ratio = pca.explained_variance_ratio_
print(variance_ratio)
# To get the cumulative variance explained, you can do the following:
print(np.cumsum(variance_ratio))

#### Your answer to question 8.1(b):
*Double click here*

### 8.1(c)

* (i)  Rerun the PCA with
  the number of components you found in the previous question. Select the number of components with the argument
  `n_components` in `PCA()`, e.g., `pca = PCA(n_components=13)`,
  or, to keep enough components to explain 95 % of the variance, `pca = PCA(n_components=0.95)`.


* (ii)  Obtain the scores, and make a plot of the scores for
  principal component 1 (on the $x$-axis) and principal component 2 (on the $y$-axis).


* (iii)  Do you see any grouping(s) ("clusters") in your scores plot?
  Here, you can choose to color the scores according
  to the cultivator (i.e., by using the values in the `target`
  column in the data set).
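The two ways of choosing `n_components` mentioned in (i) should lead to the same model. The sketch below checks this on synthetic data (six made-up independent variables with decreasing spread, not the wine data): it counts the components needed from the cumulative explained variance and compares with what `PCA(n_components=0.95)` selects.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic data: six independent variables with decreasing spread.
spreads = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.2])
X_toy = rng.normal(size=(300, 6)) * spreads

# Fit all components and accumulate the explained variance:
cumulative = np.cumsum(PCA().fit(X_toy).explained_variance_ratio_)
# Smallest number of components whose cumulative variance exceeds 95 %:
n_needed = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Let PCA pick the number of components from the same threshold:
pca_95 = PCA(n_components=0.95).fit(X_toy)
print("From the cumulative variance:", n_needed)
print("Selected by PCA(n_components=0.95):", pca_95.n_components_)
```

After fitting, the attribute `n_components_` holds the number of components that were actually kept.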

In [None]:
# Example plot for the scores:
#fig, ax = plt.subplots()
#ax.scatter(scores[:, 0], scores[:, 1])  # Plot scores on first and second PC
# Example for coloring:
#fig, ax = plt.subplots()
#sns.scatterplot(x=scores[:, 0], y=scores[:, 1], hue=data_set["target"], palette="muted", ax=ax)

In [None]:
# Your code here

#### Your answer to question 8.1(c):
*Double click here*

### 8.1(d)
Explore the loadings for your PCA model by plotting the
loadings for the variables (on principal component 1 and
principal component 2). Are any of the variables correlated?
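As a hint for how to read a loadings plot, here is a sketch on synthetic data with three made-up variables: `a` and `b` share a common signal and are strongly correlated, while `c` is independent of both. The variable names and noise levels are assumptions for this illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(2)
n_samples = 500
shared = rng.normal(size=n_samples)
# "a" and "b" share a common signal; "c" is independent of both:
a = shared + 0.1 * rng.normal(size=n_samples)
b = shared + 0.1 * rng.normal(size=n_samples)
c = rng.normal(size=n_samples)
X_toy = scale(np.column_stack([a, b, c]))

pca_toy = PCA().fit(X_toy)
# components_ has shape (n_components, n_features). After transposing,
# each row belongs to a variable and each column to a principal component.
loadings = pca_toy.components_.T
for name, row in zip("abc", loadings):
    print(f"{name}: PC1 = {row[0]: .2f}, PC2 = {row[1]: .2f}")
```

In this toy case, the two correlated variables end up close together in the loadings plot, while the independent variable points along a different component. Note that the overall sign of each component is arbitrary.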

In [None]:
# pca.components_ has shape (n_components, n_features), so each row
# holds the loadings for one principal component.
# The loadings for PC1 are:
load1 = pca.components_[0, :]
# The loadings for PC2 are:
load2 = pca.components_[1, :]

# Alternatively:
# loadings = pca.components_.T
# load1 = loadings[:, 0]
# load2 = loadings[:, 1]

# Example plot:
fig, ax = plt.subplots()
ax.axhline(y=0, ls=":", color="black", lw=1)
ax.axvline(x=0, ls=":", color="black", lw=1)
ax.set_xlim(-0.6, 0.6)
ax.set_ylim(-0.6, 0.6)
ax.set_aspect("equal")

# Just plotting the points:
ax.scatter(load1, load2)

# Adding text (name of variables):
for i, variablei in enumerate(variables):
    ax.text(load1[i], load2[i], variablei, fontsize="small")

# Here, you can probably make the plot easier to read. Maybe it should be bigger,
# more colorful, with arrows, or maybe interactive like in the appendix in exercise 7?

In [None]:
# Your code here

#### Your answer to question 8.1(d):
*Double click here*

### 8.1(e)
Save the scores for the first two principal components.
We will use this information in the next part
of the exercise, where we will try to find clusters in our data.
Saving the scores can be
done with `pandas` as follows:

In [None]:
# Assuming that the scores are in the matrix scores, you can
# do the following to save the data (remember to limit to the first
# two PCs):

# 1. Create variable names for the principal components:
pc_name = [f"PC{i+1}" for i in range(scores.shape[1])]
# 2. Create a DataFrame from the scores:
scores_data = pd.DataFrame(scores, columns=pc_name)
scores_data["target"] = data_set["target"]
# 3. Save the scores to a comma separated values-file:
scores_data.to_csv("scores.csv", index=False)

# Note, here you could also save it into many other formats,
# for instance, Excel:
# scores_data.to_excel("scores.xlsx", index=False)
# or maybe as LaTeX for a report:
# print(scores_data.style.to_latex())

After running the code above, the file should be available here: [scores.csv](./scores.csv)

In [None]:
# Let us check that the file is present:
my_data = pd.read_csv("scores.csv")
my_data.head()

## Exercise 8.2

We will continue exploring the wine data set. We will pretend that we do not
know that there are three cultivators in the data set, and we will investigate
what the `KMeans` clustering method can tell us about it. For this
exercise, it is a good idea to read through all points below before
starting, since you will do the same analysis twice (first for the complete data set,
and then for the PCA scores you saved in part [8.1(e)](#8.1(e))).

### 8.2(a)
Outline the steps in the `KMeans` clustering algorithm.
How can we use this algorithm without knowing the number of clusters in the data?

#### Your answer to question 8.2(a):
*Double click here*

### 8.2(b)
Run `KMeans` clustering on the wine data set (see the example code below).
Here, you will have to
select a set of numbers of clusters to look for (limit yourself to
a maximum of 10 clusters).

After running the clustering for your 
data, obtain and plot the following metrics:

* (i) The sum of squared distances of the samples to
  their closest cluster center as a function of the number of clusters considered.
  
  
* (ii) The average silhouette value as a function of the number of clusters considered. (Note:
  if you want to plot the distribution of silhouette values (not required here!), take
  a look at this
  [silhouette example](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_samples.html).)


* (iii) The Gap statistic as a function of the number of clusters considered. (Skip this point if you are unable to install [gapstat](https://github.com/jmmaloney3/gapstat) - see the instructions below).

Explain briefly (with a few lines of text) how you use these plots to identify the "best" number of clusters and use them to decide how many clusters there are in the wine data set.
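To show how such curves can be collected, here is a sketch on synthetic data (three well-separated groups generated with `make_blobs`, not the wine data). The names `ks`, `sse`, and `silhouettes` are chosen for this illustration, and `n_init=10` with a fixed `random_state` is just one possible setting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the wine data):
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

ks = list(range(2, 7))
sse, silhouettes = [], []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    sse.append(model.inertia_)  # Sum of squared distances
    silhouettes.append(silhouette_score(X_toy, model.labels_))

best_k = ks[int(np.argmax(silhouettes))]
print("k with the highest average silhouette:", best_k)
```

Plotting `sse` and `silhouettes` against `ks` (for instance with `ax.plot(ks, sse, marker="o")`) gives the kind of curves asked for in (i) and (ii). Real data are rarely as clear-cut as this toy example.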

The cells below show Python code that runs the clustering and calculates the metrics to get you started.

In [None]:
"""Load the wine data set and run KMeans."""
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import scale

data_set = load_wine(as_frame=True)["frame"]
variables = [i for i in data_set.columns if i != "target"]
X = scale(data_set[variables].to_numpy())
# We scale the variance here (you have probably already
# figured out this is a good idea during the PCA part in 8.1.)

# Define a set of numbers of clusters to run KMeans for:
number_of_clusters = [2, 3, 4, 5]
# Set up variables for storing the results
results = []  # Results for the clustering

for i in number_of_clusters:
    # Set up the KMeans method with i cluster centers:
    cluster_k = KMeans(n_clusters=i, n_init="auto")
    # Run the clustering method:
    cluster_k.fit(X)
    # Store the results:
    results.append(cluster_k)

Note that the `cluster_k` object contains the following results as attributes:
 * `cluster_centers_`: Coordinates of cluster centers.
 * `labels_`: Labels of each sample. Each sample is assigned to a cluster, and the label shows which cluster a sample belongs to. Note that these are just
    labels - the actual numbers (0, 1, ...) do not have any meaning except being a label.
 * `inertia_`: Sum of squared distances of samples to their closest cluster center.
 * `n_iter_`: Number of iterations run.
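To make these attributes concrete, the following sketch fits `KMeans` to two synthetic groups in 2D (not the wine data) and recomputes the sum of squared distances by hand from `cluster_centers_` and `labels_`. The data and the names `X_toy` and `sse_by_hand` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Two synthetic groups in 2D (not the wine data):
X_toy = np.vstack(
    [
        rng.normal(loc=0.0, size=(50, 2)),
        rng.normal(loc=6.0, size=(50, 2)),
    ]
)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# For every sample, look up the center of the cluster it was assigned to:
assigned_centers = model.cluster_centers_[model.labels_]
sse_by_hand = np.sum((X_toy - assigned_centers) ** 2)

print("Shape of cluster_centers_:", model.cluster_centers_.shape)
print("Samples per cluster:", np.bincount(model.labels_))
print("Matches inertia_:", np.isclose(sse_by_hand, model.inertia_))
```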
 
The silhouette values can be calculated with [sklearn.metrics.silhouette_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html), and
the Gap statistic can be obtained via the [gapstat](https://github.com/jmmaloney3/gapstat) package. If you do not have this one installed, you can install it via:

```bash
pip install git+https://github.com/jmmaloney3/gapstat
```

In [None]:
# Uncomment the line below to install gapstat:
# !pip install git+https://github.com/jmmaloney3/gapstat
from gapstat import gapstat_score

In [None]:
# Here is how you can calculate the metrics needed for the plots:
for result in results:
    print(f"Clustering with {result.n_clusters} clusters:")
    sse = result.inertia_  # This is the sum of squared distances
    print(f"\t- SSE = {sse}")
    silhouette = silhouette_score(
        X, result.labels_
    )  # Calculate average silhouette
    print(f"\t- Silhouette = {silhouette}")
    gap = gapstat_score(X, result.labels_, k=result.n_clusters)
    print(f"\t- GAP = {gap}")

In [None]:
# Your code here

#### Your answer to question 8.2(b): What seems to be the best number of clusters to use?
*Double click here*

### 8.2(c)
The clustering you have just done used all the variables. Visualizing the clusters (and potential regions for the different types) in
this 13-dimensional space is difficult! We will therefore use the scores from the principal
component analysis, where we stored just two components. This means
that we now have a 2-dimensional problem!

Rerun the cluster analysis for the scores (again, vary the number of clusters)
and make the same plots as you made in [8.2(b)](#8.2(b)). What is the
best number of clusters to use now? Are your results different from the
cluster analysis on the full data set, and how does it compare to
what we know - that the samples come from three different cultivators of wine?

In [None]:
# Your code here

#### Your answer to question 8.2(c):
*Double click here*

### 8.2(d) Bonus: Showing the decision regions.
Since we reduced the problem to two dimensions in [8.2(c)](#8.2(c)), we
can plot the clusters. Here, we can also plot the so-called decision
regions, which show the areas that belong to each cluster. Use the code below
to show the decision regions for the best clustering you found in [8.2(c)](#8.2(c)).
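For `KMeans`, the decision region of a cluster is the set of points that lie closer to its center than to any other center. The sketch below illustrates this on synthetic 2D data (three made-up groups, not the wine scores) by comparing a hand-made nearest-center assignment on a grid with what `predict` returns.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Three synthetic groups in 2D (not the wine scores):
X_toy = np.vstack(
    [
        rng.normal(loc=(0.0, 0.0), size=(40, 2)),
        rng.normal(loc=(5.0, 0.0), size=(40, 2)),
        rng.normal(loc=(2.5, 4.0), size=(40, 2)),
    ]
)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# A coarse grid of points covering the data:
xs = np.linspace(X_toy[:, 0].min(), X_toy[:, 0].max(), 50)
ys = np.linspace(X_toy[:, 1].min(), X_toy[:, 1].max(), 50)
grid = np.array([(x, y) for x in xs for y in ys])

# Distance from every grid point to every cluster center:
distances = np.linalg.norm(
    grid[:, np.newaxis, :] - model.cluster_centers_[np.newaxis, :, :], axis=2
)
nearest = distances.argmin(axis=1)

# Compare with the cluster that predict() assigns to each grid point:
agreement = np.mean(nearest == model.predict(grid))
print("Fraction of grid points where both agree:", agreement)
```

This is why the boundaries between the regions appear as straight line segments: each boundary consists of points that are equally far from two centers.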

In [None]:
import matplotlib
from sklearn.inspection import DecisionBoundaryDisplay

X = data_set[
    ["alcohol", "flavanoids"]
].to_numpy()  # Replace with the 2D-scores you used in 8.2(c)
cluster = KMeans(n_clusters=3, n_init="auto").fit(
    X
)  # Replace with the best clustering from 8.2(c)

y = cluster.labels_  # Use the assigned labels

fig, ax = plt.subplots(constrained_layout=True)

# Show the samples:
colors = []
for i in sorted(set(y)):
    scat = ax.scatter(X[y == i, 0], X[y == i, 1], label=i)
    colors.append(scat.get_facecolors())  # Store colors, so we can reuse them
# Draw the boundaries:
cmap = matplotlib.colors.ListedColormap(colors)  # Use same colors
DecisionBoundaryDisplay.from_estimator(
    cluster,
    X,
    grid_resolution=200,
    ax=ax,
    cmap=cmap,
    alpha=0.1,
)
ax.legend(title="Cluster no.")