# Exercise set 8


>The goal of this exercise is to perform **principal component analysis**
>and **clustering** on a data set with many variables.

## Exercise 8.1

In this exercise, we will explore the "wine data set", which is a common example
data set used for classification. The data set contains the results of
a chemical analysis of wines from a region in Italy. These
wines are made using grapes grown by three different cultivators.
In this first exercise, we will explore the
data set using principal component analysis, and we will investigate
if the results from the chemical analysis can be used to separate
the wines into groups that correspond to the cultivator of the grapes.

The data set contains the following columns:


| Column name                    | Description                                              |
|--------------------------------|----------------------------------------------------------|
| alcohol                        | The alcohol content of the wine.                         | 
| malic_acid                     | The amount of malic acid in the wine (malic acid has an apple aroma).  |
| ash                            | The amount of ash in the wine (ash is the matter that remains after evaporation and incineration).   | 
| alcalinity_of_ash              | The alkalinity of the ash content of the wine.           |
| magnesium                      | The amount of magnesium in the wine.                      |
| total_phenols                  | The total amount of [phenols](https://en.wikipedia.org/wiki/Phenolic_content_in_wine) in the wine. |
| flavanoids                     | The amount of [flavanoids](https://en.wikipedia.org/wiki/Flavonoid) in the wine. |
| nonflavanoid_phenols           | The amount of phenols in the wine that are not flavanoids.   |
| proanthocyanins                | The amount of [proanthocyanins](https://en.wikipedia.org/wiki/Proanthocyanidin) in the wine (important for red/blue/purple colors).   |
| color_intensity                | Color intensity of the wine (measured spectroscopically).  |
| hue                            | Color hue of the wine (measured spectroscopically).         |
| od280/od315_of_diluted_wines   | The protein content of the wine. OD280/OD315 is a method for determining the protein concentration.                                     |
| proline                        | The amount of proline in the wine (proline is the main amino acid found in red wine).   |  
| target                         | The cultivator of the wine, given as 0, 1, or 2.   |

**(a)**  Begin by exploring the raw data. Here, you should choose
the method yourself. You can, for instance, look at histograms of the
different measured quantities, correlations between the quantities,
or other plots of the raw data (for instance, the 
[scatter plot matrix](https://seaborn.pydata.org/examples/scatterplot_matrix.html) we used in a previous exercise).

To get you started, you can find some Python code below for loading the data and creating
a report using the [pandas-profiling](https://pandas-profiling.ydata.ai/docs/master/rtd/) library.
This report will give you some information about the correlation/distribution of the different
variables in the data set.

After looking at the raw data, are there some of the
variables that seem to be able to distinguish
between the wines produced by the different cultivators?

In [None]:
"""Load the wine data set and print some info."""
from pandas_profiling import ProfileReport
from sklearn.datasets import load_wine
from matplotlib import pyplot as plt

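# Newer matplotlib versions (>= 3.6) renamed this style to "seaborn-v0_8-talk":
if "seaborn-v0_8-talk" in plt.style.available:
    plt.style.use("seaborn-v0_8-talk")
else:
    plt.style.use("seaborn-talk")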


# Load the data set as a pandas frame:
data_set = load_wine(as_frame=True)["frame"]
variables = [i for i in data_set.columns if i != "target"]
data_set

In [None]:
# To display properties grouped by the cultivator, we can do:
class_data = data_set.groupby("target")
print("\nInformation on malic_acid:")
class_data["malic_acid"].describe()

In [None]:
# Create a report for the data set:
from pandas_profiling import ProfileReport

# There is a bug in newer versions of pandas_profiling; this is a workaround (we skip the target variable):
# Ref: https://github.com/ydataai/pandas-profiling/issues/911
data = data_set[variables].copy()
# In jupyter, the profiler will open up figures it is creating with matplotlib notebook,
# we therefore set it to inline when creating the report (we reset it later):
%matplotlib inline
# Make the report
profile = ProfileReport(data, title='Wine data set')
profile.to_file('dataprofile.html')  # Save it as an HTML file, in case you want to download it.
%matplotlib notebook

The report can (if it was successfully created above) be [opened in a new tab here](./dataprofile.html).

In [None]:
# Uncomment this line to show the HTML report in jupyter:
#profile

In [None]:
# Your code here

**Your answer to 8.1(a)**: *Double click here*

**(b)** Perform a PCA on the data set (some example code
for this can be found below)
and plot the explained variance as a function
of the number of principal components.

* (i)  Do you need to scale your data before
  performing PCA in this case (why/why not)?


* (ii)  Should you include the `target` column in the data you use for the PCA?


* (iii)  How many principal components are needed to explain 95\% of the
  variance in the data? 
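
As a hint for (iii), the sketch below shows one possible way to read off the number of components from the cumulative explained variance. The ratios are made-up illustration values, not results for the wine data; with a fitted model you would presumably use `pca.explained_variance_ratio_` instead.

```python
import numpy as np

# Hypothetical explained-variance ratios (illustration only, NOT the wine data).
ratios = np.array([0.50, 0.25, 0.15, 0.06, 0.04])

# Cumulative fraction of variance explained by the first 1, 2, ... components.
cumulative = np.cumsum(ratios)

# Index of the first entry that reaches the threshold, plus one to get a count.
threshold = 0.95
n_needed = int(np.argmax(cumulative >= threshold)) + 1
print(cumulative)
print(f"Components needed for {threshold:.0%}: {n_needed}")
```

Plotting `cumulative` against `range(1, len(ratios) + 1)` gives the kind of explained-variance curve asked for in this question.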


The following code can be used to run the PCA:

In [None]:
"""Load the wine data set and run PCA."""
from sklearn.datasets import load_wine
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA


data_set = load_wine(as_frame=True)["frame"]
variables = [i for i in data_set.columns if i != "target"]
X = data_set[variables].to_numpy()
# Uncomment the following line to scale your data:
# X = scale(X)
pca = PCA()
scores = pca.fit_transform(X)
# Print out the fraction of variance explained by each component:
print(pca.explained_variance_ratio_)

**Your answer to 8.1(b)**: *Double click here*

**(c)**

* (i)  Rerun the PCA with
  the number of components you found in the previous question.
  This can be done by setting the argument
  `n_components` in `PCA()`, e.g. `pca = PCA(n_components=13)`,
  or (for 95 \% of the variance) `pca = PCA(n_components=0.95)`.


* (ii)  Obtain the scores, and make a plot of the scores for
  principal component 1 (on the $x$-axis) and principal component 2 (on the $y$-axis).


* (iii)  Do you see any grouping(s) ("clusters") in your scores plot?
  Here, you can choose to color the scores according
  to the cultivator (i.e. by using the values in the `target`
  column in the data set).

In [None]:
# Your code here

**Your answer to 8.1(c):** *Double click here*

**(d)**  Explore the loadings for your PCA model by plotting the
loadings for the variables (on principal component 1 and
principal component 2). Do any of the variables seem to be correlated?
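
To illustrate what the loadings can reveal, here is a minimal sketch in plain NumPy on synthetic data (three made-up variables, two of them strongly correlated by construction). The rows of `Vt` from the SVD of the scaled data should correspond, up to sign, to `pca.components_` in `sklearn`. All names and data here are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic data: columns 0 and 1 are strongly correlated, column 2 is independent.
X = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

# Scale to zero mean and unit variance (same idea as sklearn's scale()).
X = (X - X.mean(axis=0)) / X.std(axis=0)

# The rows of Vt are the principal axes of the scaled data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)

# Loadings on PC1 and PC2: one row per variable, one column per component.
loadings = Vt[:2].T
for name, (pc1, pc2) in zip(["var0", "var1", "var2"], loadings):
    print(f"{name}: PC1 = {pc1:+.2f}, PC2 = {pc2:+.2f}")
```

In a loadings plot, variables whose points lie close together (such as `var0` and `var1` here) tend to be positively correlated, while points on opposite sides of the origin suggest a negative correlation.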

In [None]:
# Your code here

**Your answer to 8.1(d):** *Double click here*

**(e)** Save the scores for a PCA using only two principal components to a new file.
We will use this information in the next part
of the exercise, where we will try to find clusters in our data.
Saving the scores can be
done with `pandas` as follows:

In [None]:
"""Load the wine data set, run PCA and save scores."""
from sklearn.datasets import load_wine
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
import pandas as pd


data_set = load_wine(as_frame=True)["frame"]
variables = [i for i in data_set.columns if i != "target"]
X = scale(data_set[variables].to_numpy())
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Create variable names for the principal components:
pc_name = [f"PC{i+1}" for i in range(pca.n_components_)]
# Create a DataFrame from the scores:
scores_data = pd.DataFrame(scores, columns=pc_name)
scores_data["target"] = data_set["target"]
# Save the scores to a comma separated values-file:
scores_data.to_csv("scores.csv", index=False)

After running the code above, the file should be available here: [scores.csv](./scores.csv)

## Exercise 8.2

We will continue exploring the wine data set. We will pretend that we do not
know that there are 3 cultivators in the data set, and we will investigate
what the `KMeans` clustering method can tell us about it. For this
exercise, it is a good idea to read through all points below before
starting, as you will find
a link to a specific example you can use to answer most of the questions.

**(a)**  Explain the steps in the `KMeans` clustering algorithm.
How can we use this algorithm without knowing how many clusters
there are in the data?

**Your answer to 8.2(a):** *Double click here*

**(b)** Run `KMeans` clustering on the wine data set. Here, you will have to
select a set of numbers of clusters to look for (limit yourself to
a maximum of 10 clusters). After running the clustering for your 
data, plot the sum of squared distances of samples to their closest
cluster center, as a function of the number of clusters considered. 
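
The quantity to plot can be illustrated with a tiny hand-made example. The sketch below assumes four made-up points and two fixed cluster centers (not the wine data, and no actual clustering is run). It assigns each point to its nearest center and sums the squared distances, which should match the definition of the `inertia_` attribute of a fitted `KMeans` object.

```python
import numpy as np

# Four made-up 2D points and two assumed cluster centers.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared Euclidean distance from every point to every center,
# giving an array of shape (n_points, n_centers).
squared_distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)

# Each point belongs to its closest center.
labels = squared_distances.argmin(axis=1)

# Sum of squared distances of the points to their closest center.
inertia = squared_distances.min(axis=1).sum()
print(labels, inertia)
```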

Explain briefly how this plot can be used to identify the "correct"
number of clusters. 

How many clusters would you say there are in the
data set, based on this plot alone?
      
To get you started, the cell below has some Python code that can be used to run the
clustering and store the results (see also the [silhouette example](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_samples.html)).

Note that the `cluster_km` object contains the following results as attributes:
 * `cluster_centers_`: Coordinates of cluster centers.
 * `labels_`: Labels of each point.
 * `inertia_`: Sum of squared distances of samples to their closest cluster center.
 * `n_iter_`: Number of iterations run.

In [None]:
"""Load the wine data set and run KMeans."""
from sklearn.datasets import load_wine
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans


data_set = load_wine(as_frame=True)["frame"]
variables = [i for i in data_set.columns if i != "target"]
X = scale(data_set[variables].to_numpy())
# Define a set of numbers of clusters to run KMeans for:
number_of_clusters = [2, 3, 4, 5]
# Set up variables for storing the results
results = []  # Results for the clustering
yfit = []  # Predicted clusters for data points in X
for i in number_of_clusters:
    cluster_km = KMeans(
        n_clusters=i,
        init="k-means++",
        n_init=10,  # Set explicitly, since the default differs between sklearn versions.
    )
    y = cluster_km.fit_predict(X)
    results.append(cluster_km)
    yfit.append(y)
# Print out some results:
print("Sum of squared distances of samples to their closest cluster center:")
for i, result in zip(number_of_clusters, results):
    print(f"Clusters: {i}: {result.inertia_}")

In [None]:
# Your code here

**Your answer to 8.2(b):** *Double click here*

**(c)** 
A general method for assessing a clustering
is the silhouette method. It calculates a silhouette
value for each object, which measures how similar the
object is to the cluster it belongs to (cohesion) compared to
the other clusters (separation). This is easy to calculate
with `sklearn`, as there is a function that does exactly this:
[silhouette_samples](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_samples.html) from the module `sklearn.metrics`.
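
To make the definition concrete, the sketch below computes silhouette values by hand with NumPy for four made-up points in two clusters. For each point, $a$ is the mean distance to the other points in its own cluster, $b$ is the smallest mean distance to the points of any other cluster, and the silhouette value is $s = (b - a) / \max(a, b)$. The helper `silhouette_values` is a hypothetical name used for illustration; it assumes Euclidean distances and at least two clusters.

```python
import numpy as np


def silhouette_values(X, labels):
    """Compute s = (b - a) / max(a, b) for each sample (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    values = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # Common convention: s = 0 for a cluster with one sample.
        # Cohesion: mean distance to the other samples in the same cluster.
        a = dist[i, same].sum() / (n_same - 1)
        # Separation: smallest mean distance to the samples of another cluster.
        b = min(
            dist[i, labels == k].mean()
            for k in np.unique(labels)
            if k != labels[i]
        )
        values[i] = (b - a) / max(a, b)
    return values


# Made-up 1D example with two well-separated clusters.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_values(X, labels)
print(s, s.mean())
```

Values close to 1 indicate that a sample sits well inside its own cluster, while values near 0 or below suggest that it lies between clusters or may be assigned to the wrong one.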

Do the following:

* (i)  Calculate the silhouette values for each clustering you have tried.


* (ii)  Plot the average silhouette value as a function of the number of clusters considered.


* (iii)  For each clustering, plot the silhouette values
  grouped by cluster. If you, for instance,
  considered 4 clusters in one of your clusterings, plot the silhouette
  values for each of these 4 clusters. An example of how to do this is available on
  the website
  of [`sklearn`](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html).
   

Using these results (average silhouette values) and the plots of silhouette values, what
is the best number of clusters to use? How does this compare with what we already know -
that the samples come from three different cultivators?

In [None]:
# Your code here

**Your answer to 8.2(c):** *Double click here*

**(d)** The clustering you have just done used all the variables. It is not easy to
visualize the clusters (and potential regions for the different types) in
this 13-dimensional space! We will therefore use the scores from the PCA
where we used just two components. This means
that we now have a 2-dimensional problem!

Do the following:

* (i)  Run a cluster analysis on the scores from the PCA and find the
  best number of clusters. Are your results different from the
  cluster analysis on the full data set, and how does it compare to
  what we know - that the samples come from three different cultivators of wine?


* (ii)  Visualize the clusters by plotting the original scores and coloring
  them according to the cluster they belong to. Also, plot the
  boundaries for the clusters. It is easiest to plot the boundaries
  by creating a bunch of points and then checking which of the cluster
  centers they are closest to.
  You can find an example of how this can be done in the cell below.

In [None]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
import numpy as np

data_set = load_wine(as_frame=True)["frame"]
# Just pick two variables for this example:
X = scale(data_set[["proline", "hue"]].to_numpy())
cluster = KMeans(
    n_clusters=3,
    init="k-means++",
    n_init=10,  # Set explicitly, since the default differs between sklearn versions.
)
y = cluster.fit_predict(X)

fig, ax = plt.subplots()
ax.set(xlabel="proline", ylabel="hue")
for i in sorted(set(y)):
    ax.scatter(X[y == i, 0], X[y == i, 1], label=i)
ax.legend(title="Cluster")  # The labels are KMeans clusters, not the true cultivators.

xlim = ax.get_xlim()
ylim = ax.get_ylim()

# generate a bunch of points to find the regions:
XX, YY = np.meshgrid(
    np.linspace(min(xlim), max(xlim), 200),
    np.linspace(min(ylim), max(ylim), 200),
)
Z = cluster.predict(np.c_[XX.ravel(), YY.ravel()])
Z = Z.reshape(XX.shape)
ax.contourf(XX, YY, Z, alpha=0.2)

In [None]:
# Your code here

**Your answer to 8.2(d):** *Double click here*