# Exercise set 4: Principal component analysis and clustering

The main goals of this exercise are to perform principal component analysis (PCA) and k-means clustering.


**Learning Objectives:**

After completing this exercise set, you will be able to:

* Run PCA to reduce the dimensionality of a data set.
* Visualise PCA results by creating score plots (showing data point projections), loading plots (illustrating variable influence), and variance-explained plots (indicating component significance).
* Interpret results from PCA by inspecting the scores and loadings plots to explain groupings and variable contributions.
* Run k-means clustering for a data set and use the [elbow method](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#Elbow_method) to select the best number of clusters.



**To get the exercise approved, complete the following problems:**

- [4.1(a)](#4.1(a)) and [4.1(b)](#4.1(b)): to show that you can perform PCA and plot the scores and the variance explained per principal component.
- [4.2(a)](#4.2(a)) and [4.2(b)](#4.2(b)): to show that you can also plot the loadings from PCA, and interpret the scores and loadings.
- [4.3(a)](#4.3(a)): to show that you can apply k-means clustering to a data set and select the best number of clusters.

## Exercise 4.1 Molecular conformations

We have performed molecular dynamics simulations to model the various conformations a molecule can adopt. We have collected 4004 snapshots, each representing a conformation, and recorded the 3D coordinates of each atom in each conformation.

The file `molecule.csv` contains these coordinates, organized as follows:

* Each row represents a single molecular conformation.
* The columns contain the x, y, and z coordinates of each atom.
* The column labels follow a pattern:
   * `1x`, `1y`, `1z` represent the coordinates of atom 1,
   * `2x`, `2y`, `2z` those of atom 2, and so on, up to atom 22.

Here is a snippet of the data (first three conformations/rows):

|     |    1x |    1y |    1z |    2x | ... |   22x |   22y |   22z |
|----:|------:|------:|------:|------:|:---:|------:|------:|------:|
|   0 | 14.585 | 13.725 | 12.373 | 13.759 | ... | 14.882 | 14.462 | 10.500 |
|   1 | 14.585 | 13.868 | 12.458 | 13.773 | ... | 15.061 | 14.033 | 10.411 |
|   2 | 14.668 | 13.689 | 12.557 | 13.667 | ... | 14.914 | 14.276 | 10.359 |


Our goal is to use Principal Component Analysis (PCA) to determine if we can identify distinct groups or clusters of these molecular conformations based on their atomic coordinate data.
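The column layout described above can be sketched in a few lines. This is only an illustration under the assumption that the columns are ordered `1x`, `1y`, `1z`, `2x`, ..., `22z`; the array `flat` is a hypothetical random stand-in for the real coordinate table, not the data in `molecule.csv`:

```python
import numpy as np

n_atoms = 22
# Column labels in the order described above: 1x, 1y, 1z, 2x, ..., 22z
columns = [f"{atom}{axis}" for atom in range(1, n_atoms + 1) for axis in "xyz"]

# Hypothetical stand-in for the coordinate table: 5 snapshots, 66 columns.
rng = np.random.default_rng(0)
flat = rng.normal(size=(5, len(columns)))

# One (22, 3) block of x, y, z coordinates per snapshot:
coords = flat.reshape(len(flat), n_atoms, 3)

print(columns[:4], "...", columns[-3:])
print(coords.shape)
```

Each conformation is therefore a single point in a 66-dimensional space (22 atoms times 3 coordinates), which is the space PCA operates on.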

In [None]:
# The raw data can be loaded as follows:
import pandas as pd

data1 = pd.read_csv("molecule.csv")
data1.head()

### 4.1(a)

**Task: Run PCA on this data set and plot the variance explained as a function of the principal components, for instance in a bar plot or a line plot. How much of the variance is explained by principal components 1 and 2?**

**Hints:** Assuming that `X` contains our data, a PCA can be carried out as follows:

```python
from sklearn.decomposition import PCA
pca = PCA()
scores = pca.fit_transform(X)
```

This will store the scores in the variable `scores`, which can be used directly in a scatter plot.
It is also useful to inspect how much of the variance each principal component explains.
The fraction of the variance explained by each component can be accessed via:
```python
variance = pca.explained_variance_ratio_
```

**Note:** The raw data has already been scaled so you can use it directly without preprocessing.
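To make "fraction of the variance explained" more concrete, here is a minimal NumPy sketch of how such fractions can be obtained from the singular values of the centred data. The data `X` below is hypothetical (random numbers with different spreads), and the sketch is meant as an illustration of the idea rather than a description of scikit-learn's internals:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 200 samples, 5 variables with very different spreads.
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

X_centred = X - X.mean(axis=0)  # PCA works on mean-centred data
singular_values = np.linalg.svd(X_centred, compute_uv=False)

# Variance along each principal component, and its share of the total:
component_variance = singular_values**2 / (len(X) - 1)
variance_ratio = component_variance / component_variance.sum()
cumulative = np.cumsum(variance_ratio)

for i, (ratio, total) in enumerate(zip(variance_ratio, cumulative), start=1):
    print(f"PC{i}: {ratio:.1%} of the variance, {total:.1%} cumulative")
```

The fractions sum to 1 when all components are kept, so the cumulative sum shows how many components are needed to capture a given share of the variance.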

In [None]:
# Your code here

#### Your answer to question 4.1(a): How much of the variance is explained by principal components 1 and 2?

*Double click here*

### 4.1(b)

**Task: Create a scatter plot where you show the scores for PC1 and PC2 (the data projection onto the first two principal components). Can you see any groups in your data?** 

In [None]:
# Your code here

#### Your answer to question 4.1(b): Do you see any clusters in your plot of the scores?

*Double click here*

### 4.1(c)


**Task: Use [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) to investigate if there are any clusters in the data. Create a scatter plot of the t-SNE scores to visualize the data. Do you see any clusters?**

**Hint:** Assuming that `X` contains our data, dimensionality reduction by t-SNE can be carried out as follows:

```python
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
tsne_scores = tsne.fit_transform(X)
```
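Note that t-SNE is stochastic and that its output depends on the `perplexity` parameter, so it can be worth fixing `random_state` and trying a few perplexity values. The following is a small sketch of that idea; `X_demo` is hypothetical synthetic data and the perplexity values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
# Hypothetical data: two groups of 30 points in 10 dimensions.
X_demo = np.vstack(
    [
        rng.normal(loc=0.0, size=(30, 10)),
        rng.normal(loc=8.0, size=(30, 10)),
    ]
)

for perplexity in (5, 20):
    # random_state makes the embedding reproducible between runs;
    # perplexity must be smaller than the number of samples.
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    tsne_scores = tsne.fit_transform(X_demo)
    print(perplexity, tsne_scores.shape)
```

Distances between well-separated groups in a t-SNE plot are not directly interpretable, so the plot is best used to look for groupings rather than to compare how far apart they are.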

In [None]:
# Your code here

#### Your answer to question 4.1(c): Do you see any clusters when you plot the t-SNE scores?

*Double click here*

## Exercise 4.2 Detection of milk adulteration

[Prabowo](https://doi.org/10.5281/zenodo.13766649) recently investigated the feasibility of using a regular smartphone for milk quality analysis, specifically for the detection of adulteration.

Prabowo used image analysis techniques to extract information from digital images of various milk samples, including pure milk, milk adulterated with rice water, and milk contaminated with lead(II) ions. The images were captured using a smartphone (iPhone 13 Pro) under controlled conditions to ensure consistency in lighting, zoom, and distance. From the images, four numerical values were extracted:

* The intensity of the red, green, and blue colour components from an area in the middle of the sample.
* The amount of grey colour in the same area.

This data can be found in the file [milk.csv](./milk.csv) which contains the following columns:

* `Red`: the red colour component intensity
* `Green`: the green colour component intensity
* `Blue`: the blue colour component intensity
* `Red/Blue`: the ratio of the red to blue colour component intensity
* `Red/Green`: the ratio of the red to green colour component intensity
* `Blue/Green`: the ratio of the blue to green colour component intensity
* `Grey`: the average grey pixel intensity
* `Type`: a text describing the sample (type of milk pictured):
    * `Milk (control)`: Samples of pure milk
    * `Rice water (control)`: Samples of pure rice water mixtures
    * `Milk + rice water`: Samples created by mixing pure milk with rice water. This simulates adulterated milk.
    * `Milk + lead`: Samples created by mixing pure milk with lead of various concentrations. This simulates lead-contaminated milk.

We will investigate if we can use this data to distinguish between the different types by performing principal component analysis.

### 4.2(a)

**Tasks:**
1. **Load the data set and perform PCA to obtain the scores. Scale the data before performing PCA.**
2. **Create scatter plots of the scores (you can investigate different combinations of principal components), colour the samples according to their type and investigate visually if the different sample types appear as distinct clusters.**

**Hints:**

1. In this case, the analysis may benefit from standardisation of the variance (since we may have different units or natural scales for the numbers). Assuming that our data is stored in the matrix `X`, we can standardise it as follows:
```python
from sklearn.preprocessing import scale
X_scaled = scale(X)
```

2. Colouring a scatter plot according to a column in a Pandas data frame can be done with [scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) from seaborn:
```python
import pandas as pd
import seaborn as sns

data2 = pd.read_csv("milk.csv")  # load data

# ... assuming scores contain the PCA scores:
sns.scatterplot(
    data=data2,  # select the data frame
    x=scores[:, 0],  # select data to put on the x-axis
    y=scores[:, 1],  # select data to put on the y-axis
    hue="Type",  # select data to use for colouring (column from data2)
)
```
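As a sketch of what the standardisation in the first hint does: each column has its mean subtracted and is divided by its standard deviation, so that all variables end up with mean 0 and unit variance. The data `X_demo` below is hypothetical. Keep in mind that only numeric columns can be standardised, so a text column such as `Type` would have to be left out of `X`:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: three variables on very different scales.
X_demo = rng.normal(
    loc=[100.0, 1.0, 0.01], scale=[20.0, 0.5, 0.002], size=(50, 3)
)

# Subtract each column's mean and divide by its standard deviation:
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_manual.mean(axis=0).round(6))  # column means after scaling
print(X_manual.std(axis=0).round(6))  # column standard deviations after scaling
```

Without this step, the variable with the largest numerical spread would tend to dominate the first principal component regardless of how informative it is.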

In [None]:
# Your code here

#### Your answer to question 4.2(a): Do the different sample types appear as distinct clusters?

*Double click here*

### 4.2(b)

**Tasks: Interpret the scores plot(s) to identify the variables that contribute most significantly to the observed clusters, specifically:**

1. **Which variables are most influential in discriminating between pure rice water and samples containing milk?**
2. **Which variables are most influential in discriminating between pure milk and lead-contaminated milk samples?**

**(Use loading plots to guide your interpretation.)**


**Hint:** Create a scatter plot of the loadings to show their importance for different principal components and interpret this together with the scores. Scatter plots can be created as follows (assuming that `pca` is a `PCA` object from scikit-learn, and that the variables used are stored in a list `variables`):
```python
from matplotlib import pyplot as plt

loadings = pca.components_.T  # Extract the loadings
variables = [
    "Red",
    "Green",
    "Blue",
    "Red/Blue",
    "Red/Green",
    "Blue/Green",
    "Grey",
]  # Store variable names
fig, ax = plt.subplots()  # Create empty plot
ax.scatter(loadings[:, 0], loadings[:, 1])  # Scatter plot of the loadings

for i, text in enumerate(variables):
    # Add the name of the variable as text next to the scatter points:
    ax.text(loadings[i, 0], loadings[i, 1], text, fontsize="small")  
``` 
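In addition to plotting, the loadings can be inspected numerically. The sketch below uses hypothetical synthetic data with made-up variable names (`a`, `b`, `c`) and ranks the variables by the absolute size of their loading on PC1; it is meant as an illustration of how to read the loadings matrix, not as a prescribed analysis:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Hypothetical data: "a" and "b" are strongly correlated, "c" is weak noise.
a = rng.normal(size=300)
X_demo = np.column_stack(
    [a, a + 0.1 * rng.normal(size=300), 0.1 * rng.normal(size=300)]
)
names = ["a", "b", "c"]

pca_demo = PCA().fit(X_demo)
loadings_demo = pca_demo.components_.T  # rows: variables, columns: components

# Rank the variables by the absolute size of their PC1 loading:
order = np.argsort(-np.abs(loadings_demo[:, 0]))
for i in order:
    print(f"{names[i]}: PC1 loading = {loadings_demo[i, 0]:+.3f}")
```

Variables with large absolute loadings on a component contribute most to the separation seen along that component in the scores plot; the sign tells you in which direction.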

In [None]:
# Your code here

#### Your answer to question 4.2(b): What variables are important for distinguishing between (1) pure rice water and samples containing milk, and (2) samples of pure milk and milk contaminated by lead?

*Double click here*

## Exercise 4.3 Clustering

In [Exercise 4.1](#Exercise-4.1-Molecular-conformations), we analysed molecular conformations using PCA. The file [scores.4.1.csv](./scores.4.1.csv) contains the scores for principal components 1 (column `PC1`) and 2 (column `PC2`). In this exercise, we will investigate if we can find clusters in this data by applying [k-means](https://en.wikipedia.org/wiki/K-means_clustering) clustering.

### 4.3(a)

**Tasks:**
1. **Load the data from [scores.4.1.csv](./scores.4.1.csv) and perform k-means clustering, considering the number of clusters (k) from 1 to 10.**
2. **Plot the within-cluster sum of squared distances of the samples to their closest cluster centre as a function of the number of clusters (k).**
3. **Use the plot created above (the "elbow method") to identify the best number of clusters. Explain your reasoning for selecting the best number of clusters.**

**Hint:** scikit-learn can perform [k-means clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). Here is an example that performs it for 3 clusters:
```python
import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv("scores.4.1.csv")  # Load the data
# Set up the k-means method to look for 3 clusters:
cluster = KMeans(n_clusters=3)  # n_clusters selects the number of clusters
cluster.fit(data)  # Run clustering on our data
# Print out cluster centers:
print(cluster.cluster_centers_)
# Print out the within-cluster sum of squared distances of samples to their closest cluster centre:
print(cluster.inertia_)
```

**Note:** The elbow method is a heuristic, and does not always provide a clear answer.
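One possible way to collect the within-cluster sum of squares for several values of k is to loop over k and store `inertia_` for each fit. The sketch below does this for hypothetical synthetic data generated with `make_blobs`; the settings `n_init=10` and `random_state=0` are assumptions made here for reproducibility, not requirements of the exercise:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: 300 points drawn around 4 centres.
X_demo, _ = make_blobs(
    n_samples=300, centers=4, cluster_std=0.8, random_state=0
)

k_values = list(range(1, 11))
inertias = []
for k in k_values:
    # n_init=10 restarts k-means from 10 initialisations and keeps the best.
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X_demo)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k = {k:2d}: inertia = {inertia:.1f}")
```

The inertia generally keeps decreasing as k grows, so the elbow method looks for the k after which the decrease becomes markedly slower rather than for a minimum.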

In [None]:
# Your code here

#### Your answer to question 4.3(a): What is the best number of clusters, and how did you select it?

*Double click here*

### 4.3(b)

The [silhouette score](https://en.wikipedia.org/wiki/Silhouette_(clustering)) measures how similar a data point is to its own cluster compared to other clusters, and can be used to select the best number of clusters by comparing silhouette values for different clusterings.


**Task: Calculate the mean silhouette score for 2 to 10 clusters. Plot the mean silhouette value as a function of the number of cluster centres. What is the best number of clusters to use, based on this plot? Explain your reasoning for selecting the best number of clusters.**

**Hint:** Given a clustering, you can find the silhouette value as follows:
```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

cluster = KMeans(n_clusters=3)  # n_clusters selects the number of clusters
cluster.fit(data)  # Run clustering on our data
# Get what cluster the different points are assigned to:
cluster_labels = cluster.predict(data)
silhouette_mean = silhouette_score(data, cluster_labels)
print(silhouette_mean)
```

**Note:** The silhouette score is *not defined* for 1 cluster. (Can you explain why?)

**Note:** The silhouette score is also a heuristic, and does not always provide a clear answer.
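To see what goes into the silhouette score, here is a hand calculation for a tiny, hypothetical one-dimensional data set. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to the points of any other cluster, and the silhouette value is `(b - a) / max(a, b)`. This is a sketch of the definition, not a replacement for `silhouette_score`:

```python
import numpy as np

# Hypothetical data: four points on a line, in two obvious groups.
points = np.array([0.0, 1.0, 10.0, 11.0])
labels = np.array([0, 0, 1, 1])

silhouettes = []
for i, (x, own) in enumerate(zip(points, labels)):
    same = labels == own
    same[i] = False  # exclude the point itself
    a = np.abs(points[same] - x).mean()  # mean distance within own cluster
    # b: smallest mean distance to the points of any other cluster
    b = min(
        np.abs(points[labels == other] - x).mean()
        for other in set(labels.tolist())
        if other != own
    )
    silhouettes.append((b - a) / max(a, b))

silhouette_mean = float(np.mean(silhouettes))
print(round(silhouette_mean, 4))
```

Values close to 1 indicate points that are much closer to their own cluster than to any other. Looking at which quantities the calculation needs may also help with the question in the first note above.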

In [None]:
# Your code here

#### Your answer to question 4.3(b): What is the best number of clusters, and how did you select it?

*Double click here*

### 4.3(c)

The [Gap statistic](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#The_gap_statistics) compares the total within-cluster dispersion (often represented as the sum of pairwise distances within each cluster, W) with what we would expect for uniformly randomly distributed points (Ŵ). The "Gap" is the difference log(Ŵ) − log(W), and the optimal number of clusters is the one for which the Gap is largest.


**Task: Obtain and plot the Gap statistic value as a function of the number of cluster centres (consider 1 to 10 clusters). What is the best number of clusters to use, based on this plot? Explain your reasoning for selecting the best number of clusters.**

**Hint:** The Gap statistic can be obtained via the [gapstat](https://github.com/jmmaloney3/gapstat) package. If it is not installed, you can install it by running the following in a terminal:

```bash
pip install git+https://github.com/jmmaloney3/gapstat
```

To install it directly from a Jupyter notebook, you need to add a "!" in front of the command:
```bash
!pip install git+https://github.com/jmmaloney3/gapstat
```

To calculate the Gap statistic:

```python
from gapstat import gapstat_score
from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=3)  # n_clusters selects the number of clusters
cluster.fit(data)  # Run clustering on our data
# Get what cluster the different points are assigned to:
cluster_labels = cluster.predict(data)

gap, _, _, _, error = gapstat_score(
    data, cluster_labels, k=3, calcStats=True
)

# gap = the Gap statistic
# error = standard deviation for the Gap statistic
```

**Note:** The Gap statistic is also a heuristic, and does not always provide a clear answer.
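The idea behind the Gap statistic can also be sketched by hand, independently of the `gapstat` package. The sketch below makes several assumptions: it uses the k-means inertia as the dispersion W, draws the reference points uniformly from the bounding box of the data, and uses 10 reference data sets. The helper `log_dispersion` and the data `X_demo` are hypothetical, and the package may compute these quantities differently:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def log_dispersion(X, k, seed=0):
    """Return log(W), with W taken as the k-means inertia for k clusters."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(model.inertia_)


# Hypothetical data: three tight, well-separated groups.
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

rng = np.random.default_rng(0)
k = 3
n_references = 10  # number of uniform reference data sets (assumed value)
low, high = X_demo.min(axis=0), X_demo.max(axis=0)

# log(W) for uniformly distributed points in the bounding box of the data:
reference_logs = [
    log_dispersion(rng.uniform(low, high, size=X_demo.shape), k)
    for _ in range(n_references)
]

# Gap = mean reference log-dispersion minus the observed log-dispersion:
gap = np.mean(reference_logs) - log_dispersion(X_demo, k)
print(f"Gap({k}) = {gap:.2f}")
```

A large positive Gap means the data are clustered much more tightly for this k than structureless, uniformly distributed points would be.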

In [None]:
# Your code here

#### Your answer to question 4.3(c): What is the best number of clusters, and how did you select it?

*Double click here*

### 4.3(d)

**Task: Repeat [4.3(a)](#4.3(a))-[4.3(c)](#4.3(c)), but use the original data in [molecule.csv](./molecule.csv) instead of the PCA scores. What is the best number of clusters? Explain your reasoning for selecting the best number of clusters.**

In [None]:
# Your code here

#### Your answer to question 4.3(d): What is the best number of clusters, and how did you select it?

*Double click here*

### 4.3(e)

**Task: Repeat the clustering of the data in [scores.4.1.csv](./scores.4.1.csv), but use the density-based method [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). How many clusters were identified by DBSCAN?**

**Hint:** Assuming that the matrix `X` contains our raw data, DBSCAN clustering can be performed with:
```python
from sklearn.cluster import DBSCAN
clustering = DBSCAN(eps=0.5, min_samples=5)
labels = clustering.fit_predict(X)  # cluster label per sample; -1 marks noise
```

**Note:** The results from DBSCAN may depend on the hyperparameters `eps` and `min_samples`. Explore different values for these parameters and investigate how they affect the number of clusters and noise points identified. Consider visualising the clusters or calculating silhouette scores.
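Counting clusters and noise points from the DBSCAN labels could look like the sketch below. DBSCAN labels noise points as `-1`, so this label has to be excluded when counting clusters. The data `X_demo` is hypothetical, and the two `eps` values are arbitrary choices meant to show how strongly the result can depend on this parameter:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Hypothetical data: three tight, well-separated groups.
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.3,
    random_state=0,
)

for eps in (0.05, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_demo)
    # Noise points get the label -1 and do not count as a cluster:
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    print(f"eps = {eps}: {n_clusters} clusters, {n_noise} noise points")
```

A very small `eps` tends to classify many points as noise, while a very large `eps` tends to merge everything into one cluster, so it is worth scanning a range of values.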




In [None]:
# Your code here

#### Your answer to question 4.3(e): How many clusters did you find with DBSCAN? How is this influenced by the hyperparameters?

*Double click here*

## Your feedback for Exercise 4

1. **Time & Difficulty:**
    * Length (1=too short, 5=too long): 1  2  3  4  5
    * Difficulty (1=too easy, 5=too difficult): 1  2  3  4  5
    * Most challenging part: _________________________

2. **Code Examples:**
    * More or less example code?  More  Less  About Right
    * Areas where more examples would be helpful: _________________________

3. **Errors/Inconsistencies:** Did you encounter any?  Yes  No  If yes, please describe: _________________________

4. **Suggestions:** How could this exercise be improved? _________________________