**Exercise set 8**
==============

> The goal of this exercise is to perform **principal component analysis**,
> **clustering**, and **classification** on a data set with many variables.


**Exercise 8.1**

In this exercise, we will explore the "wine dataset", which is a common example
dataset used for classification. The dataset contains the results of
a chemical analysis of wines from the same region in Italy, using grapes grown
by three different cultivators. In this first exercise, we will explore this
dataset using principal component analysis.


**(a)** Begin by exploring the raw data. Here, you should choose
the method yourself. You can, for instance, look at histograms of the
different measured quantities, correlations between the quantities,
or other plots of the raw data. It can also be useful to explore
statistical properties like averages and standard deviations. The Python
code in the following cells can be used to load the data set,
and it will print out some summaries of the raw data which you may find
helpful for your exploration.

After looking at the raw data, do any of the variables seem able to distinguish between the wines produced by the different cultivators?
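
As one possible starting point (a sketch, not the only way to explore the data), the per-class mean of every variable and the pairwise correlations can be tabulated with `pandas`. The names `classes`, `class_means`, `class_stds` and `correlations` are illustrative choices and not part of the exercise code.

```python
from sklearn.datasets import load_wine
import pandas as pd

data_set = load_wine()
data = pd.DataFrame(data_set['data'], columns=data_set['feature_names'])
# Map the integer class ids to the class names:
class_names = dict(enumerate(data_set['target_names']))
classes = pd.Series(data_set['target']).map(class_names).to_numpy()
# Mean and standard deviation of every variable within each class:
class_means = data.groupby(classes).mean()
class_stds = data.groupby(classes).std()
print(class_means.T)
# Pairwise (Pearson) correlations between the variables:
correlations = data.corr()
print(correlations.round(2))
```

Variables whose class means differ by much more than the corresponding standard deviations are natural candidates to look at more closely, for instance with histograms.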

In [None]:
"""Load the wine data set and print some info."""
from sklearn.datasets import load_wine
import numpy as np
import pandas as pd


data_set = load_wine()
# Print out some information about the data set:
print('Variables in the data set:')
for i in data_set['feature_names']:
    print(i)
print('\nClasses in the data set (cultivators):')
for i in data_set['target_names']:
    print(i)
# Convert the data set into a pandas DataFrame:
data = pd.DataFrame(data_set['data'], columns=data_set['feature_names'])

In [None]:
# Print a table with a summary for each variable:
data.describe()

In [None]:
# We can also use the class information:
class_data = data_set['target']
class_names = dict(enumerate(data_set['target_names']))
variable = 'color_intensity'
for class_id, class_name in class_names.items():
    print(f'\nInformation about "{variable}" for "{class_name}"')
    idx = np.where(class_data == class_id)[0]
    data_class = data.loc[idx, variable]
    print(data_class.describe())

In [None]:
# Your code here

**Your answer to 8.1(a)**: *Double click here*

**(b)** Perform a PCA on the data set and plot the explained variance as a function
of the number of principal components. Do you need to scale your data before performing
PCA in this case (why/why not)? How many principal components are needed to explain 95%
of the variance in the data? The following code cell can be used to run the PCA:

In [None]:
"""Load the wine data set and run PCA."""
from sklearn.datasets import load_wine
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd


data_set = load_wine()
data = pd.DataFrame(data_set['data'], columns=data_set['feature_names'])
X = data
# Uncomment the following line to scale your data:
#X = scale(data)
pca = PCA()
scores = pca.fit_transform(X)
# Print out the percentage of variance explained by each component:
print(pca.explained_variance_ratio_)
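
To turn the explained variance ratios into an answer to the 95% question, the cumulative sum can be compared against the threshold. The helper `components_needed` below is a hypothetical name introduced for this sketch, and the ratios in the example are made up; it assumes the threshold is actually reached by the supplied ratios.

```python
import numpy as np


def components_needed(explained_variance_ratio, threshold=0.95):
    """Return the smallest number of components reaching the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # np.argmax returns the index of the first True value. This assumes
    # that the threshold is reached by at least one entry.
    return int(np.argmax(cumulative >= threshold)) + 1


# Made-up ratios, only to illustrate the bookkeeping:
example_ratios = [0.6, 0.3, 0.08, 0.02]
print(components_needed(example_ratios))
```

With the fitted model from the cell above, the call would be `components_needed(pca.explained_variance_ratio_)`.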

**Your answer to 8.1(b)**: *Double click here*

**(c)** Rerun the PCA with
the number of components you found in the previous question.
This can be done by passing the argument `n_components` to
`PCA()`, e.g. `pca = PCA(n_components=13)`.

Obtain the scores, and make a plot of the scores for
principal component 1 (on the x-axis) and principal component 2 (on the y-axis).

Do you see any grouping(s) ("clusters") in your scores plot?
Here, you can choose to color the scores according
to the class they belong to (i.e. by using the class
data available in the data set).
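
A scores plot colored by class could be sketched as follows. Scaling the data and keeping two components are assumptions made only to keep the example short; whether scaling is appropriate is the question asked in (b).

```python
from matplotlib import pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
import numpy as np

data_set = load_wine()
# Assumption: scaled data and two components, for illustration only.
scores = PCA(n_components=2).fit_transform(scale(data_set['data']))
classes = data_set['target']

fig, ax = plt.subplots(constrained_layout=True)
# Draw one scatter series per class so that each gets its own color:
for class_id in np.unique(classes):
    mask = classes == class_id
    ax.scatter(
        scores[mask, 0],
        scores[mask, 1],
        label=data_set['target_names'][class_id],
    )
ax.set(xlabel='Principal component 1', ylabel='Principal component 2')
ax.legend(title='Cultivator')
```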

In [None]:
# Your code here

**Your answer to 8.1(c):** *Double click here*

**(d)** 
Explore the loadings for your PCA model by plotting the
loadings for the variables (on principal component 1 and
principal component 2). Do any of the variables seem to be correlated?
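
One way to collect the loadings is to take the rows of `pca.components_` (one common convention for what is called "loadings") and label them with the variable names. This is a sketch under the assumption that the data is scaled; the DataFrame name `loadings` is an illustrative choice.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
import pandas as pd

data_set = load_wine()
pca = PCA(n_components=2)
pca.fit(scale(data_set['data']))  # Scaling is an assumption made here.
# Rows of pca.components_ are components; transpose to get one row
# per variable and one column per component:
loadings = pd.DataFrame(
    pca.components_.T,
    index=data_set['feature_names'],
    columns=['PC1', 'PC2'],
)
print(loadings)
```

A quick plot of these values can then be made with `loadings.plot.scatter(x='PC1', y='PC2')`.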


In [None]:
# Your code here

**Your answer to 8.1(d):** *Double click here*

**(e)** 
Save the scores you have obtained to a new file.
We will use this information in the next part
of the exercise, where we will try to find clusters in our data.

Saving the scores can be done with `pandas` as follows:

In [None]:
"""Load the wine data set, run PCA and save scores."""
from sklearn.datasets import load_wine
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd


data_set = load_wine()
data = pd.DataFrame(data_set['data'], columns=data_set['feature_names'])
X = data
# Uncomment the following line to scale your data:
#X = scale(data)
pca = PCA(n_components=5)
scores = pca.fit_transform(X)
# Create variable names for the principal components:
pc_name = [f'PC{i+1}' for i in range(pca.n_components_)]
# Create a DataFrame from the scores:
scores_data = pd.DataFrame(scores, columns=pc_name)
# Save the scores to a comma separated values-file:
scores_data.to_csv('scores.csv')

**Exercise 8.2**

We will continue exploring the "wine dataset". We will pretend that we do not
know that there are 3 classes in the dataset, and we will investigate
what the `KMeans` clustering method can tell us about it. For this
exercise, it is a good idea to read through all points below before
starting, as you will find a link to a specific example you can use 
to answer most of the questions.

**(a)** Explain the steps in the `KMeans` clustering algorithm.
How can we use this algorithm without knowing how many clusters
there are in the data?
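
To make the steps concrete, here is a minimal sketch of the classic (Lloyd-style) iteration written with NumPy only. It is an illustration under simplifying assumptions (random samples as initial centers, a fixed seed), not the implementation used by `sklearn`, and the function name `kmeans_sketch` is hypothetical.

```python
import numpy as np


def kmeans_sketch(X, n_clusters, max_iter=100, seed=0):
    """Minimal Lloyd-style k-means, for illustration only."""
    rng = np.random.default_rng(seed)
    # Step 1: use randomly chosen samples as the initial centers.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every sample to its closest center.
        diff = X[:, None, :] - centers[None, :, :]
        labels = np.linalg.norm(diff, axis=2).argmin(axis=1)
        # Step 3: move each center to the mean of its samples
        # (a cluster that lost all its samples keeps its old center).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        # Step 4: stop when the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and sum of squared distances (the "inertia"):
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    inertia = float((distances.min(axis=1) ** 2).sum())
    return labels, centers, inertia


# Two well-separated synthetic blobs as a toy data set:
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centers, inertia = kmeans_sketch(X_toy, n_clusters=2)
print(labels)
```

Note that the number of clusters is an input to the algorithm, which is why it has to be run for several candidate values when the number is unknown.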

**Your answer to 8.2(a):** *Double click here*

**(b)**
Run `KMeans` clustering on the wine dataset. Here, you will have to
select a set of numbers of clusters to look for (limit yourself to
a maximum of 10 clusters). After running the clustering for your
data, plot the sum of squared distances of samples to their closest
cluster center as a function of the number of clusters considered.

Explain briefly how this plot can be used to identify the "correct"
number of clusters. 

How many clusters would you say there are in the
dataset, based on this plot alone?
      
To get you started, the cell below has some Python code that can be used to run the
clustering and store the results (see also the [silhouette example](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)):

Note that the `cluster_km` object contains the following results as attributes:
 * `cluster_centers_`: Coordinates of cluster centers.
 * `labels_`: Labels of each point.
 * `inertia_`: Sum of squared distances of samples to their closest cluster center.
 * `n_iter_`: Number of iterations run.

In [None]:
"""Load the wine data set and run KMeans."""
from sklearn.datasets import load_wine
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd


data_set = load_wine()
data = pd.DataFrame(data_set['data'], columns=data_set['feature_names'])
X = scale(data)
# Define a set of numbers of clusters to run KMeans for:
number_of_clusters = [2, 5]
# Set up variables for storing the results
results = []  # Results for the clustering
yfit = []  # Predicted clusters for data points in X
for i in number_of_clusters:
    cluster_km = KMeans(
        n_clusters=i,
        init='k-means++',
    )
    y = cluster_km.fit_predict(X)
    results.append(cluster_km)
    yfit.append(y)
# Print out some results:
print('Sum of squared distances of samples to their closest cluster center:')
for i, result in zip(number_of_clusters, results):
    print('Clusters: {}: {}'.format(i, result.inertia_))
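
The plot of the sum of squared distances against the number of clusters can be sketched as below. To avoid anticipating the result for the wine data, this example uses synthetic toy data from `make_blobs`; the variable names and the choices `n_init=10` and `random_state=0` are assumptions made for reproducibility.

```python
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic toy data with four blobs (not the wine data):
X_toy, _ = make_blobs(n_samples=300, centers=4, random_state=0)
cluster_counts = list(range(1, 11))
inertias = []
for k in cluster_counts:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias.append(model.inertia_)

fig, ax = plt.subplots(constrained_layout=True)
ax.plot(cluster_counts, inertias, marker='o')
ax.set(
    xlabel='Number of clusters',
    ylabel='Sum of squared distances (inertia)',
)
```

The same loop applies to the wine data by replacing `X_toy` with the scaled `X` from the cell above.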

In [None]:
# Your code here

**Your answer to 8.2(b):** *Double click here*

**(c)** 
A general method that can be used to assess the clustering
is the silhouette method. This method calculates a silhouette
value for each object, which is a measure of how similar the
object is to the cluster it belongs to (cohesion) compared to
other clusters (separation). This is rather easy to calculate
with `sklearn`, as there is a function that does just that:
[`silhouette_samples`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_samples.html) from the module `sklearn.metrics`.

Do the following:
 * For each clustering you have considered, i.e. for each number of clusters you tried,
   calculate the silhouette values.
 * Plot the average silhouette value as a function of the number of clusters considered.
 * For each clustering, plot the silhouette values grouped into clusters. If you,
   for instance, considered 4 clusters in one of your clusterings, plot the silhouette
   values for each of these 4 clusters. An example of how to do this is available on
   the website
   of [`sklearn`](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html).
   
Using these results (average silhouette values) and the plots of silhouette values, what
is the best number of clusters to use? How does this compare with what we already know,
namely that the samples come from 3 different cultivators?
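
The silhouette bookkeeping could look like the following sketch, which computes the per-sample values, their overall average, and the average within each cluster. The range of cluster numbers and the settings `n_init=10` and `random_state=0` are assumptions; the plotting itself is left out.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import scale
import numpy as np

X = scale(load_wine()['data'])
average_silhouette = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    values = silhouette_samples(X, labels)  # One value per sample.
    average_silhouette[k] = float(values.mean())
    # Mean silhouette value within each cluster of this clustering:
    per_cluster = [values[labels == c].mean() for c in np.unique(labels)]
    print(k, round(average_silhouette[k], 3), np.round(per_cluster, 3))
```

The overall average can also be obtained directly with `silhouette_score` from `sklearn.metrics`.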

In [None]:
# Your code here

**Your answer to 8.2(c):** *Double click here*

**(d)** 
Rerun your analysis on the scores you stored in the last point of the PCA part,
but use only the scores from principal components 1 and 2.
Do the results from this analysis differ from the cluster analysis on the full data set?

**Note:** As we only consider two of the principal components here, we have 2D-data. This
means that we can plot the clusters more easily. If you are curious, plot the
scores for principal components 1 and 2 and color the points according to the
clustering results you have obtained. Here, you can also show the centers of the
clusters by using the `cluster_centers_` attribute of the `KMeans` object you have
used. This part of the exercise also shows that PCA can be used as an initial 
method to reduce the dimensionality of the original problem. We have here 
combined PCA and KMeans to solve a clustering problem.
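
A sketch of running `KMeans` on the first two principal components is given below. So that it runs on its own, it recomputes the scores instead of reading the file; with the file saved in 8.1(e), `pd.read_csv('scores.csv', index_col=0)` could be used instead (the index column is there because `to_csv` wrote it). Scaling the data and using 3 clusters are assumptions made only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
import pandas as pd

# Recompute the scores so that this sketch is self-contained:
scores = PCA(n_components=5).fit_transform(scale(load_wine()['data']))
scores_data = pd.DataFrame(scores, columns=[f'PC{i+1}' for i in range(5)])
# Keep only the first two principal components:
X_2d = scores_data[['PC1', 'PC2']].to_numpy()

# Assumption: 3 clusters, chosen only for illustration.
cluster_km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = cluster_km.fit_predict(X_2d)
print(cluster_km.cluster_centers_)  # One (PC1, PC2) pair per cluster.
```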

In [None]:
# Your code here

**Your answer to 8.2(d):** *Double click here*

**Exercise 8.3: LDA Example**

Both PCA and KMeans are examples of unsupervised methods: we did not use the class information available to us to
find clusters in our data.
[Linear discriminant analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis) (LDA),
on the other hand, is a supervised method that uses the class
information for *classification*. In your own words, how would you
describe the difference between *classification* and
*clustering*?

LDA is similar to PCA, but rather than looking for latent variables that maximize
the variance in our data, we look for latent variables that maximize the
*class separation*.
Below, you will find a small script that will run LDA on the
wine data set. Run this script and observe the results.
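
One common way to make "class separation" precise is the Fisher criterion. The symbols below ($\mathbf{w}$, $\mathbf{S}_B$, $\mathbf{S}_W$, $\boldsymbol{\mu}_k$, $\boldsymbol{\mu}$, $n_k$) are introduced here for illustration and are not used elsewhere in this exercise. For a single projection direction $\mathbf{w}$, the idea is to maximize the ratio of between-class scatter to within-class scatter:

$$
J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}},
\qquad
\mathbf{S}_B = \sum_{k} n_k (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^\top,
\qquad
\mathbf{S}_W = \sum_{k} \sum_{i \in \text{class } k} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top
$$

Here $\boldsymbol{\mu}_k$ is the mean of the $n_k$ samples in class $k$ and $\boldsymbol{\mu}$ is the mean of all samples. PCA, in contrast, maximizes $\mathbf{w}^\top \mathbf{S} \mathbf{w}$ for the total covariance $\mathbf{S}$ and never looks at the classes.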

Note here the difference when we train the LDA
model: `X_trans = lda.fit_transform(X, y)`.
We are supplying the "y" values (i.e. the classes) which is what we expect
for a supervised method.
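
The difference in how the two models are trained can be seen side by side in this small sketch (the variable names `X_pca` and `X_lda` are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import scale

data_set = load_wine()
X = scale(data_set['data'])
y = data_set['target']

# Unsupervised: PCA only sees X.
X_pca = PCA().fit_transform(X)
# Supervised: LDA needs both X and the classes y.
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X, y)

print(X_pca.shape)  # One component per variable.
print(X_lda.shape)  # At most (number of classes - 1) components.
# Fraction of the training samples that are classified correctly:
print(lda.score(X, y))
```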

For the curious student: Apply LDA to the 2D-example dataset from exercise 7,
where we investigated classification by PCA. Does this classification differ from
the simple rule we found there?

In [None]:
"""Load the wine data set and run LDA."""
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib.cm import tab10
from sklearn.datasets import load_wine
from sklearn.preprocessing import scale
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import numpy as np
import pandas as pd
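# The seaborn styles were renamed in matplotlib 3.6 and the old names
# were later removed, so pick whichever name is available:
if 'seaborn-v0_8-talk' in plt.style.available:
    plt.style.use('seaborn-v0_8-talk')
else:
    plt.style.use('seaborn-talk')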


data_set = load_wine()
data = pd.DataFrame(data_set['data'], columns=data_set['feature_names'])
X = scale(data)
y = data_set['target']  # load the class information
# Run LDA:
lda = LinearDiscriminantAnalysis()
X_trans = lda.fit_transform(X, y)
print('Number of classes:', len(lda.classes_))
# Predict classes for our original points:
y_hat = lda.predict(X)

# Plot the explained variance:
fig1, ax1 = plt.subplots(constrained_layout=True)
comp = list(range(1, len(lda.explained_variance_ratio_) + 1))
ax1.bar(
    comp,
    lda.explained_variance_ratio_,
    label='Variance explained by component'
)
ax1.plot(
    [0] + comp,
    [0] + list(np.cumsum(lda.explained_variance_ratio_)),
    color='black',
    marker='o',
    label='Cumulative variance explained'
)
ax1.set_xticks([0] + comp)
ax1.set(xlabel='LDA component', ylabel='Ratio of variance explained')
ax1.legend()
ax1.axhline(y=1, ls=':', color='black', alpha=0.8)

# Plot the transformed X, this is similar to the scores found in PCA:
fig2, ax2 = plt.subplots(constrained_layout=True)
for i in np.unique(y_hat):
    ax2.scatter(
        X_trans[y_hat == i, 0],
        X_trans[y_hat == i, 1],
        color=tab10.colors[i],
        s=150
    )
ax2.set(xlabel='LDA component 1', ylabel='LDA component 2')
# Plot the class means (the centers of the classes):
for center in lda.transform(lda.means_):
    ax2.scatter(
        center[0],
        center[1],
        s=250,
        color='black',
        marker='X',
        edgecolor='white'
    )

# Now, in order to plot the regions, we would like to have 2D data.
# The classification we have right now, expects 13 variables to
# classify samples. We therefore run a second LDA on the LDA we
# already have performed:
lda2 = LinearDiscriminantAnalysis()
X_trans2 = lda2.fit_transform(X_trans, y_hat)
y_hat2 = lda2.predict(X_trans)
# Show the regions:
fig3, ax3 = plt.subplots(constrained_layout=True)
X_set = X_trans
X1, X2 = np.meshgrid(
    np.linspace(X_trans[:, 0].min() - 1, X_trans[:, 0].max() + 1, 500),
    np.linspace(X_trans[:, 1].min() - 1, X_trans[:, 1].max() + 1, 500)
)
Z = lda2.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape)
ax3.contourf(X1, X2, Z, alpha=0.3,
             cmap=ListedColormap(tab10.colors[:3]))
# Add the original samples. The regions above are drawn in the
# coordinates of the first LDA (X_trans), so we plot the samples
# in the same coordinates:
for i in np.unique(y_hat2):
    ax3.scatter(
        X_trans[y_hat2 == i, 0],
        X_trans[y_hat2 == i, 1],
        color=tab10.colors[i],
        s=150
    )
ax3.set(xlabel='LDA component 1', ylabel='LDA component 2')
# Plot the class means (these are already in the X_trans coordinates):
for center in lda2.means_:
    ax3.scatter(
        center[0],
        center[1],
        s=250,
        color='black',
        marker='X',
        edgecolor='white'
    )

**Your answer to 8.3:** *Double click here*