# Spectral clustering

In this notebook we will explore how spectral clustering can be used to cluster neonatal brain MRI.

First, import the basic libraries by running the cell below.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler

### Identify preterm and term scans

We will start by exploring a dataset that contains MRI images of preterm babies. Each baby was scanned twice, and our task is to automatically separate the first scans from the second scans in this database using a clustering method. As features, we have the volumes of 86 brain structures for each scan.

Here is the information about the patients:
* **Preterm:** born before 37 weeks GA
* **First scan**: within 4 weeks of birth
* **Second scan**: between 38 and 43 weeks GA

First we will load the dataset and visualise its structure using `PCA`, the same way as in the previous notebook. Run the cell below to do that. Can you observe two clusters?

In [None]:
# load data
data = pd.read_csv("datasets/structures_first_second_scan.csv",header=None)
structure_volumes = data.to_numpy()

# Create features
X = StandardScaler().fit_transform(structure_volumes[:,1:])

# We have information about the first or second scan for comparison 
y = structure_volumes[:,0]

print('Number of samples: {}  Number of features: {}'.format(X.shape[0],X.shape[1])) 

# Apply PCA to reduce to two dimension and plot the data
from sklearn.decomposition import PCA
pca = PCA( n_components = 2) 
X2 = pca.fit_transform(X)
plt.plot(X2[:,0],X2[:,1],'bo', alpha = 0.8)
plt.title('PCA', fontsize = 16)
plt.xlabel('component 1', fontsize = 12)
plt.ylabel('component 2', fontsize = 12)

At this point we can simply apply the k-means algorithm or a Gaussian mixture model to either the whole dataset or the PCA-transformed dataset. Run the code below to see the result. We also calculate the accuracy against the ground truth labels to check whether the clustering worked. Because the cluster labels are arbitrary, the accuracy is printed for both possible label assignments.

In [None]:
# predict using k-means
from sklearn.cluster import KMeans
y_pred = KMeans(n_clusters=2).fit_predict(X2)

# Calculate accuracy score for both label assignments (cluster labels are arbitrary)
from sklearn.metrics import accuracy_score
print('Accuracy score: ', round(accuracy_score(y,y_pred),2))
print('Accuracy score (labels swapped): ', round(accuracy_score(y,1-y_pred),2))

# Plot
def PlotData(X,y):
    # plot
    plt.plot(X[y==0,0],X[y==0,1],'bo',alpha=0.8, label = 'Cluster 1')
    plt.plot(X[y==1,0],X[y==1,1],'r*',alpha=0.8, label = 'Cluster 2')
    # annotate
    plt.legend()
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.title('Clustering result')

# Plot reduced dataset
PlotData(X2,y_pred)
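
The two accuracy values above differ only in which cluster is called 0 and which is called 1. A minimal sketch of how the two checks could be wrapped into one is shown below. The helper name `best_binary_accuracy` and the toy labels are hypothetical and are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def best_binary_accuracy(y_true, y_pred):
    """Accuracy of a two-cluster labelling, ignoring which cluster is named 0 or 1."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    # Try the labels as given and with 0 and 1 swapped, keep the better match
    return max(accuracy_score(y_true, y_pred),
               accuracy_score(y_true, 1 - y_pred))

# Toy example (made-up labels): the clustering is mostly right but the names are swapped
toy_true = np.array([0, 0, 0, 1, 1, 1])
toy_pred = np.array([1, 1, 1, 0, 0, 1])
toy_score = best_binary_accuracy(toy_true, toy_pred)
print('Best accuracy:', round(toy_score, 2))
```

With more than two clusters the number of possible label assignments grows quickly, so a permutation-invariant score such as `sklearn.metrics.adjusted_rand_score` may be more convenient.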

Now we would like to explore how spectral clustering deals with this dataset. In Scikit-learn, we can do that using the `SpectralClustering` class.

**Activity 1:** Run the code below. You will see that, as expected, spectral clustering also works perfectly. Experiment with the number of components `n_components` of the embedded space to see whether that changes the results. What is the minimum number that works?

**Answer:** 

In [None]:
# predict using spectral clustering
from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2, affinity = 'nearest_neighbors')
y_pred = model.fit_predict(X)

# Calculate accuracy score for both label assignments (cluster labels are arbitrary)
print('Accuracy score: ', round(accuracy_score(y,y_pred),2))
print('Accuracy score (labels swapped): ', round(accuracy_score(y,1-y_pred),2))

# Plot reduced dataset
PlotData(X2,y_pred)

In the code above we used PCA to visualise the results of spectral clustering in a 2D representation of the feature space. We will now replace `PCA` with `SpectralEmbedding`. The cell below prints the parameters of the `SpectralClustering` model that we have just fitted. (If `n_components` is `None`, it is set equal to `n_clusters`; check the default settings in the help.)

In [None]:
print(model.n_clusters)
print(model.n_components)
print(model.affinity)
print(model.n_neighbors)
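
As a rough illustration of how these parameters carry over, the sketch below builds a `SpectralEmbedding` on a synthetic dataset. The data from `make_blobs` and all names ending in `_toy` are assumptions made for this example only. Note that `SpectralEmbedding` chooses its own default for `n_neighbors` when it is left as `None`, so it is passed explicitly here to match the value used by `SpectralClustering`.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import SpectralEmbedding

# Synthetic stand-in data (not the MRI dataset): 200 samples, 5 features
X_toy, _ = make_blobs(n_samples=200, centers=2, n_features=5,
                      cluster_std=3.0, random_state=0)

# Mirror the clustering settings: nearest-neighbour affinity, 10 neighbours,
# and as many embedded components as clusters
embedding_toy = SpectralEmbedding(n_components=2,
                                  affinity='nearest_neighbors',
                                  n_neighbors=10,
                                  random_state=0)

# One row per sample, one column per embedded coordinate
Xe_toy = embedding_toy.fit_transform(X_toy)
print(Xe_toy.shape)
```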

**Activity 2:** Create the spectral embedding with the same parameters as the `SpectralClustering` model in Activity 1 and use it to visualise the clustering instead of the PCA.

In [None]:
# Calculate spectral embedding
from sklearn.manifold import SpectralEmbedding

# Create model
model=None

# Calculate embedded feature matrix
Xe = None

# Plot reduced dataset
PlotData(Xe,y_pred)

**Activity 3:** Fill in the code in the cell below to create a 3D embedded space. Produce three plots, one per embedded coordinate, which
* present the data in a 2D space reduced by `PCA`
* colour-code the data by an embedded coordinate obtained using `SpectralEmbedding`.

In [None]:
# Create 3D embedding
model = None
Xe = None

# function for colorcoding data
def PlotDataColourcoded(X,Xe,dim):
    plt.scatter(X[:,0],X[:,1],c=Xe[:,dim])
    plt.title('Colour-coded by coordinate {}'.format(dim), fontsize = 16)
    plt.xlabel('PCA feature 1', fontsize = 14)
    plt.ylabel('PCA feature 2', fontsize = 14)

plt.figure(figsize=(14,4))
for i in range(3):
    plt.subplot(1,3,i+1)
    PlotDataColourcoded(None,None,i);
plt.tight_layout()

<img src="pictures/brain1.png" width = "100" style="float: right;">

# Exercise 4

### Spectral clustering from precomputed matrices

<img src="pictures/brain2.png" width = "100" style="float: right;"> 

In this exercise we will demonstrate how we can cluster MRI images of babies scanned at 40 weeks GA. The images of 68 term and preterm babies were first co-aligned to the same reference space. After that, the cross-correlation between all pairs of images was calculated to measure their similarity. The matrix of similarities (also called the **affinity matrix**) is available in the file 'babies.csv'.
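
The exact procedure used to compute the cross-correlations is not given here. As a minimal sketch, assuming the similarity is the Pearson correlation between the voxel intensities of two aligned images, a pairwise similarity matrix could be built as follows. The random volumes and the names `toy_images` and `toy_ncc` are hypothetical.

```python
import numpy as np

# Five made-up "aligned volumes" of 8 x 8 x 8 voxels (random numbers, not real MRI)
rng = np.random.default_rng(0)
toy_images = rng.normal(size=(5, 8, 8, 8))

# Flatten each volume into one row of voxel intensities
flat = toy_images.reshape(len(toy_images), -1)

# Correlation between every pair of rows gives one similarity value per image pair
toy_ncc = np.corrcoef(flat)
print(toy_ncc.shape)
```

The result has one row and one column per image and is symmetric, because the similarity of image A to image B equals that of B to A.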

### Load the affinity matrix

<img src="pictures/brain3.png" width = "100" style="float: right;"> 

Load the affinity matrix by running the cell below and inspect it. Which value is on the diagonal and why?

**Answer:** 

In [None]:
import pandas as pd

# read the file
df = pd.read_csv('datasets/babies.csv', header=None)

# print the affinity matrix
df

Next we will convert the matrix from a dataframe object to a numpy array. What is the dimension of this matrix and why?

**Answer:** 

In [None]:
# convert to numpy array
NCC=df.to_numpy()

# print the shape
NCC.shape

### Visualise the dataset 

**Task 4.1:** Visualise the dataset defined by the affinity matrix by performing these steps:
* Calculate the `SpectralEmbedding` with 3 components and a `precomputed` affinity matrix. Check the help to see how to create the embedding model.
* To fit the model, use the affinity matrix rather than the feature matrix in this case.
* Once you have calculated the 3D feature matrix in the embedded space, plot the dataset in 2D using the first 2 embedded coordinates.

In [None]:
# Create the embedding
embedding = None

# Fit the model using the affinity matrix and calculate the feature matrix in the 3D embedded space
Xe = None

# Plot the first two dimensions of the embedded space
plt.plot(None,None,'bo', alpha = 0.8)

# Annotate the plot
plt.title('Spectral Embedding')
plt.xlabel('Embedded component 1')
plt.ylabel('Embedded component 2')
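
To show the general pattern of embedding from a precomputed affinity, here is a small sketch on synthetic data. The points from `make_blobs` and the RBF kernel used to turn them into an affinity matrix are assumptions for this example and are unrelated to `babies.csv`.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.manifold import SpectralEmbedding

# Synthetic points in three groups (stand-in data only)
points, _ = make_blobs(n_samples=60, centers=[[0, 0], [4, 0], [2, 4]],
                       cluster_std=0.6, random_state=0)

# Square, symmetric, non-negative similarity matrix of shape (60, 60)
toy_affinity = rbf_kernel(points, gamma=0.5)

# With affinity='precomputed' the model is fitted on the affinity matrix itself
toy_embedding = SpectralEmbedding(n_components=3, affinity='precomputed',
                                  random_state=0)
toy_coords = toy_embedding.fit_transform(toy_affinity)
print(toy_coords.shape)
```

The first two columns, `toy_coords[:, 0]` and `toy_coords[:, 1]`, are the coordinates one would pass to a 2D plot.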

### Perform spectral clustering

**Task 4.2:** Perform spectral clustering by following these steps:
* Create the `SpectralClustering` model with 3 components and 3 clusters
* Fit the model using the precomputed affinity matrix and predict the labels
* Complete the function `PlotData3` that plots the first two dimensions of the data with 3 clusters
* Plot the clustering result

In [None]:
# Create spectral clustering model
clustering = None

# Fit and predict using the affinity matrix
y_pred = None

# Function for plotting data with three clusters
def PlotData3(X,y):
    # plot
    plt.plot(None,None,'bo',alpha=0.8, label = 'Cluster 1')
    plt.plot(None,None,'r*',alpha=0.8, label = 'Cluster 2')
    plt.plot(None,None,'g^',alpha=0.8, label = 'Cluster 3')
    # annotate
    plt.legend()
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.title('Clustering result')

# Plot
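
The same idea applies to clustering. The sketch below runs `SpectralClustering` on a precomputed affinity built from synthetic points. The data, the RBF kernel and the variable names are assumptions for illustration and do not come from the exercise dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

# Synthetic points in three groups, with known group membership for checking
points, true_groups = make_blobs(n_samples=60,
                                 centers=[[0, 0], [4, 0], [2, 4]],
                                 cluster_std=0.6, random_state=0)
toy_affinity = rbf_kernel(points, gamma=0.5)

# Pass the affinity matrix directly to fit_predict
toy_clustering = SpectralClustering(n_clusters=3, n_components=3,
                                    affinity='precomputed', random_state=0)
toy_labels = toy_clustering.fit_predict(toy_affinity)

# The adjusted Rand index does not depend on how the clusters are numbered
toy_ari = adjusted_rand_score(true_groups, toy_labels)
print('Adjusted Rand index:', round(toy_ari, 2))
```

Selecting one cluster for plotting then follows the pattern used in `PlotData`, for example `toy_labels == 2` as a boolean row index.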


### Interpret the clusters

**Task 4.3:** We will now load the file that stores the gestational age at birth for the 68 babies in our dataset. Your task is to plot the first two dimensions of the embedded dataset using a `scatter` plot colour-coded by the GA at birth, `ages`. Look at the code of the function `PlotDataColourcoded` above to see how the colour-coding is done. How can you interpret the clusters?

**Answer:** 

In [None]:
# Load GA and convert to numpy
df2 = pd.read_csv('datasets/ages.csv',header=None)
ages = df2.to_numpy()

# Scatterplot of the embedded space colour-coded by GA


# annotate the plot
plt.colorbar()
plt.xlabel('embedded coordinate 1')
plt.ylabel('embedded coordinate 2')
plt.title('Embedding colour-coded by age')