# Clustering Example

This task lets you explore different clustering approaches (k-means and hierarchical). First, the data is loaded and you should estimate how many real clusters were used to generate these examples.

You will use a different approach to measure the quality of the clustering approaches: the Silhouette Score.

You will then compare your clustering to the ground truth to see how well it matches the real labels.

Furthermore, some initial plotting functions are introduced along the way.


In [None]:
import sys,os,os.path
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
sns.set_style("white")
import pandas as pd
import sklearn
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.metrics import pairwise_distances


## Loading a dataset

In [None]:
data_file = 'mystery_data_a1.csv'
df = pd.read_csv(data_file,index_col='id')


### Simple plotting

Plot the data (as it is only 2D)

In [None]:
plt.figure(figsize = (10, 10))
ax = sns.scatterplot(x="x1", y="x2",data=df,edgecolor='grey',alpha=0.5)

### k-Means

Do a first k-means clustering using three clusters. Save the generated cluster assignments and score.

In [None]:
X = df[['x1','x2']]

kmeans = KMeans(n_clusters=3, init='random').fit(X)

kmeans_centroids            = kmeans.cluster_centers_
kmeans_labels_k3            = kmeans.labels_
kmeans_labels_cluster_score = kmeans.inertia_
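
The score stored above, `inertia_`, is the sum of squared distances from each sample to the centroid of its assigned cluster. The following is a minimal sketch that recomputes it by hand. It uses synthetic blobs from `make_blobs` as a stand-in for the notebook's data, and `X_demo`, `km` and `manual_inertia` are illustrative names, not part of the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumption: three synthetic 2D Gaussian blobs stand in for the real data.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=3, init='random', n_init=10, random_state=0).fit(X_demo)

# Look up, for every sample, the centroid of the cluster it was assigned to.
assigned_centroids = km.cluster_centers_[km.labels_]

# Sum of squared Euclidean distances to those centroids.
manual_inertia = np.sum((X_demo - assigned_centroids) ** 2)

print(manual_inertia, km.inertia_)
```

Because this quantity always shrinks as more clusters are added, it is not directly comparable across different values of k.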


#### Centroids

Store the centroids in a new DataFrame for plotting.

In [None]:
df_centroids = pd.DataFrame(kmeans_centroids,columns=['x1','x2'])

#### Plotting data and centroids

In [None]:
plt.figure(figsize = (10, 10))
ax = sns.scatterplot(x="x1", y="x2",data=df,edgecolor='grey',alpha=0.5)
ax = sns.scatterplot(x="x1", y="x2",data=df_centroids,linewidth=2.0,marker='+',s=100)


#### A bit more colourful

Same as before, but using the assigned labels for coloring. 

In [None]:
df['labels_k3'] = kmeans_labels_k3
colorPalette='muted'
colors = dict(zip(df['labels_k3'].unique(),sns.color_palette(colorPalette)))


plt.figure(figsize = (10, 10))
ax = sns.scatterplot(x="x1", y="x2",hue='labels_k3',palette=colors,data=df,edgecolor='grey',alpha=0.5)
ax = sns.scatterplot(x="x1", y="x2",data=df_centroids,linewidth=2.0,marker='+',s=100)


### Silhouette Score

The score given by the k-means algorithm (the inertia) is only applicable to k-means and not to other clustering approaches. An alternative is the so-called Silhouette Score (see https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient for more details). For each sample, it takes into account the mean distance to all other points in the same cluster and the mean distance to all points in the next nearest cluster. The score ranges from -1 to 1; the higher the score, the better the clustering. The following loads the required parts and applies it to the example from before.

In [None]:
from sklearn import metrics
from sklearn.metrics import pairwise_distances
metrics.silhouette_score(X, kmeans_labels_k3, metric='euclidean')
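
To make the definition concrete, the following sketch recomputes the silhouette value of a single sample by hand, as s = (b - a) / max(a, b), and prints it next to the value from `silhouette_samples`. The data are synthetic blobs from `make_blobs` used as a stand-in; all variable names here are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

# Assumption: synthetic blobs with hand-picked centres stand in for the real data.
X_demo, _ = make_blobs(n_samples=150, centers=[(-5, -5), (0, 5), (5, -5)],
                       cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

D = pairwise_distances(X_demo, metric='euclidean')
i = 0  # index of the sample we inspect

# a: mean distance to the other members of the sample's own cluster
same_cluster = labels == labels[i]
same_cluster[i] = False  # exclude the sample itself
a = D[i, same_cluster].mean()

# b: smallest mean distance to the members of any other cluster
b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])

s_i = (b - a) / max(a, b)
print(s_i, silhouette_samples(X_demo, labels)[i])
```

`silhouette_score` is simply the mean of these per-sample values.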


## Finding the 'best' number of clusters

The following is a skeleton of an approach that goes through k={1..10} to find the best k.

```python
centroids = {}
cluster_score = {}
for k in range(1, 11):  # k = 1..10 (the upper bound of range is exclusive)

    
    df['cluster_k{}'.format(k)] = 
    cluster_score[k] = kmeans.inertia_ # you might want to use the silhouette score here (it needs k >= 2)

   
```

Please fill in the missing parts and plot the scores with regard to k using the following approach:

```python
df_scores = pd.DataFrame.from_dict(cluster_score,orient='index',columns=['J'])
df_scores['k'] = df_scores.index
ax = sns.scatterplot(x='k', y='J',data=df_scores)
```


I.e., you might want to store your results (from each of the k in the loop) in another DataFrame, so it is easier to plot using ```sns.scatterplot(...)```.
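
One possible way to organise such a loop with the Silhouette Score is sketched below. It is not the reference solution: it runs on synthetic blobs from `make_blobs` rather than on `df`, starts at k=2 because the Silhouette Score is undefined for a single cluster, and the names `X_demo`, `scores`, `df_sil` and `best_k` are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: three well-separated synthetic blobs stand in for the real data.
X_demo, _ = make_blobs(n_samples=300, centers=[(-5, -5), (0, 5), (5, -5)],
                       cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 11):  # the silhouette score needs at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels, metric='euclidean')

# Collect the scores in a DataFrame so they are easy to plot.
df_sil = pd.DataFrame.from_dict(scores, orient='index', columns=['silhouette'])
df_sil['k'] = df_sil.index

best_k = df_sil['silhouette'].idxmax()  # k with the highest score
print(df_sil)
print('best k:', best_k)
```

The resulting `df_sil` can be passed to `sns.scatterplot(x='k', y='silhouette', data=df_sil)` in the same way as `df_scores` above.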

## Comparing to Ground Truth 

The file 'mystery_data_a1.csv' was generated from simple 2D Gaussians. The file 'mystery_data_a1_k.csv' contains the actual label of each example. Can you load the data and compare the ground truth (the actual labels given in the additional column) to the labels you have found using your clustering?

This is not entirely straightforward, because cluster labels are arbitrary: cluster 0 in your result need not correspond to label 0 in the file. You might have to do a bit of manual investigation ...

However, there is a measure for comparing the similarity between two clusterings (here: the ground truth and your clustering). The adjusted Rand index does exactly this and is unaffected by how the clusters are named ( https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html )



In [None]:
# Adjusted Rand score example: 
from sklearn.metrics.cluster import adjusted_rand_score
adjusted_rand_score([0, 0, 0, 1, 1], [1, 1, 0, 2, 2])

Load the fully labeled data and compare your clustering. A general warning: when comparing labels, you have to ensure that each ground truth label belongs to the same example as the cluster label it is compared with (i.e. the order is the same). Here both files have the same ids in the same order.
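
The sketch below illustrates both points on a tiny, hypothetical example: labels are first aligned on their id, then compared with `adjusted_rand_score` and a contingency table from `pd.crosstab`. The ids and label values are made up for illustration, and the column name used in 'mystery_data_a1_k.csv' is not assumed here.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground truth labels, indexed by example id.
truth = pd.Series([0, 0, 0, 1, 1, 2, 2], index=[1, 2, 3, 4, 5, 6, 7], name='truth')

# Hypothetical clustering result: same grouping, but different label names
# and stored in a different row order.
found = pd.Series([1, 0, 2, 2, 1, 0, 2], index=[7, 4, 1, 2, 6, 5, 3], name='found')

# Align the clustering result to the order of the ground truth ids.
found_aligned = found.reindex(truth.index)

# The adjusted Rand index ignores how the clusters are named.
ari = adjusted_rand_score(truth, found_aligned)

# The contingency table shows which found label corresponds to which true label.
table = pd.crosstab(truth, found_aligned)

print(ari)
print(table)
```

In the contingency table, a good clustering has one dominant entry per row, which also reveals the mapping between the two label sets.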

# Hierarchical Clustering

## SciPy

The first approach uses hierarchical clustering from a different module (SciPy). This is mainly because of its ability to produce a nice dendrogram.

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

In case you have added additional information to the original dataframe, only take the original data.



In [None]:
df_hier = df[['x1','x2']]

The call to the clustering is fairly simple. Different linkage methods exist; have a look at: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html

Try out different settings.

In [None]:
linked = linkage(df_hier[['x1','x2']], 'single')

In [None]:
plt.figure(figsize = (10, 10))
dendrogram(linked,
            orientation='top',
            labels=df_hier.index.tolist(),  # plain list of ids as leaf labels
            show_leaf_counts=True)
plt.show()
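
Besides drawing a dendrogram, a linkage matrix can be cut into flat cluster labels with SciPy's `fcluster`. The sketch below compares a few linkage methods this way. It runs on synthetic blobs from `make_blobs` as a stand-in for `df_hier`, and the names `X_demo`, `Z` and `flat` are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Assumption: three well-separated synthetic blobs stand in for the real data.
X_demo, _ = make_blobs(n_samples=90, centers=[(-5, -5), (0, 5), (5, -5)],
                       cluster_std=0.5, random_state=0)

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X_demo, method)  # (n_samples - 1) x 4 linkage matrix
    # Cut the tree so that at most three flat clusters remain (labels start at 1).
    flat = fcluster(Z, t=3, criterion='maxclust')
    print(method, np.bincount(flat)[1:])  # cluster sizes per method
```

On less cleanly separated data the methods tend to disagree, which is what makes trying out different settings worthwhile.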

## Hierarchical clustering using sklearn

For all options, please have a look at:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html



In [None]:
from sklearn.cluster import AgglomerativeClustering

In [None]:
X = df_hier
hclustering = AgglomerativeClustering(linkage='single',n_clusters=3).fit(X)
hclustering_labels_k3  = hclustering.labels_
# Note: AgglomerativeClustering has no inertia_ attribute; use e.g. the Silhouette Score instead



In [None]:
df['labels_k3'] = hclustering_labels_k3
colorPalette='muted'
colors = dict(zip(df['labels_k3'].unique(),sns.color_palette(colorPalette)))


plt.figure(figsize = (10, 10))
ax = sns.scatterplot(x="x1", y="x2",hue='labels_k3',palette=colors,data=df,edgecolor='grey',alpha=0.5)



## Finding the best number of clusters using hierarchical clustering

Can you re-use your approach from above to estimate the best number of clusters? It should be straightforward if you have been using the Silhouette Score from above. If you have not done so, please adapt that part further up in the notebook.
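
Because the Silhouette Score only needs the data and the labels, the same kind of loop works for any clustering method. A minimal sketch with `AgglomerativeClustering` follows; it uses synthetic blobs and Ward linkage purely as assumptions for illustration, and `hier_scores` and `best_k_hier` are hypothetical names.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: three well-separated synthetic blobs stand in for the real data.
X_demo, _ = make_blobs(n_samples=300, centers=[(-5, -5), (0, 5), (5, -5)],
                       cluster_std=0.5, random_state=0)

hier_scores = {}
for k in range(2, 11):  # the silhouette score needs at least two clusters
    model = AgglomerativeClustering(n_clusters=k, linkage='ward')
    labels = model.fit_predict(X_demo)
    hier_scores[k] = silhouette_score(X_demo, labels, metric='euclidean')

# Pick the k with the highest silhouette score.
best_k_hier = max(hier_scores, key=hier_scores.get)
print(hier_scores)
print('best k:', best_k_hier)
```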



# Single cell RNA-seq

This example focuses on a real-world application of clustering. Consider a single cell RNA-seq dataset, taken from the Pollen et al. (2014) study, which consists of 300 single cells (SC), measured across 8686 genes.

Potentially reusing some of your code from before, apply a clustering approach for different numbers of clusters k = [1,...,12]. By determining the most appropriate number of clusters, you should establish how many different cell types this dataset might contain.

First, the data is loaded and processed.

In [None]:
cell_libraries_file = 'CellLibraries.txt'
df_c = pd.read_csv(cell_libraries_file)

pollen_file = 'Pollen2014.txt'
df_p = pd.read_csv(pollen_file)

df_p = df_p.apply(lambda x : np.log2(x+1)) # log transformation of count data
df_p = df_p.transpose() # cells in rows, genes in columns
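
The effect of the two preprocessing steps can be seen on a tiny, made-up count matrix. The gene and cell names below are hypothetical and only serve to show the log2(x+1) transformation and the transpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy count matrix: genes in rows, cells in columns.
counts = pd.DataFrame({'cell_a': [0, 1, 3], 'cell_b': [7, 15, 0]},
                      index=['gene_1', 'gene_2', 'gene_3'])

# log2(x + 1) compresses large counts and keeps zero counts at zero.
log_counts = np.log2(counts + 1)

# After transposing, each row is one cell and each column one gene,
# which is the sample-by-feature layout that scikit-learn expects.
cells_by_genes = log_counts.transpose()
print(cells_by_genes)
```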


In [None]:
df_p.head()

In [None]:
df_c.head()


For a first glance, we can use the SciPy linkage function to compute an initial hierarchical clustering and look at its dendrogram.

In [None]:
linked_p = linkage(df_p, 'single')
plt.figure(figsize = (10, 10))
dendrogram(linked_p,
            orientation='top',
            labels=df_p.index.tolist(),  # plain list of cell names as leaf labels
            show_leaf_counts=True)
plt.show()

## Plotting the data

As this is a high-dimensional dataset, you can use dimensionality reduction methods such as PCA to plot it. Other commonly used approaches are t-SNE and Spectral Embedding. Please note that these methods can require some time and might have additional parameters.

### Plotting using PCA

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(df_p) 
print(pca.explained_variance_ratio_) # Fraction of variance explained by each of the selected components.
df_p_pca = pd.DataFrame(pca.transform(df_p),index=df_p.index,columns=['pca_1','pca_2'])

In [None]:
plt.figure(figsize = (10, 10))

ax = sns.scatterplot(x="pca_1", y="pca_2",data=df_p_pca,edgecolor='grey',alpha=0.5)

### Plotting using Spectral Embedding

In [None]:
from sklearn.manifold import SpectralEmbedding

se = SpectralEmbedding(n_components=2)
# Name the columns after the embedding (these are not principal components)
df_p_se = pd.DataFrame(se.fit_transform(df_p),index=df_p.index,columns=['se_1','se_2'])
plt.figure(figsize = (10, 10))
ax = sns.scatterplot(x="se_1", y="se_2",data=df_p_se,edgecolor='grey',alpha=0.5)
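
### Plotting using t-SNE

t-SNE, mentioned above, follows the same pattern. The sketch below is a minimal example that runs on a random matrix as a stand-in for `df_p`; the `perplexity` value, the `init='pca'` setting and the names `demo` and `df_tsne` are assumptions chosen for illustration, not tuned settings for the Pollen data.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

# Assumption: a random 100 x 50 matrix stands in for the cells-by-genes data.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 50)),
                    index=['cell_{}'.format(i) for i in range(100)])

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30, init='pca', random_state=0)
embedding = tsne.fit_transform(demo.to_numpy())

df_tsne = pd.DataFrame(embedding, index=demo.index, columns=['tsne_1', 'tsne_2'])
print(df_tsne.head())
```

The resulting DataFrame can be plotted with `sns.scatterplot` in the same way as the PCA result. Unlike PCA, t-SNE has no separate `transform` step, so the embedding has to be recomputed when the data changes.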