# **Exercise Sheet 2: Clustering High-dimensional Data**

In [1]:
from sklearn.datasets import make_blobs, make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN, HDBSCAN, AgglomerativeClustering
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score, silhouette_score
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
import numpy as np
import densired as ds

## **Exercise 2-1** *Getting familiar with ClustPy*
The purpose of this exercise is to get familiar with the Python library ClustPy, which implements
many traditional and deep clustering algorithms.

### **a)** Please read the documentation of ClustPy at https://github.com/collinleiber/ClustPy. <br> Which deep clustering algorithms are currently implemented?


- Auto-encoder Based Data Clustering (AEC)
- Deep Clustering Network (DCN)
- Deep Density-based Image Clustering (DDC)
- Deep Embedded Clustering (DEC)
- Improved Deep Embedded Clustering (IDEC)
- Deep Embedded Cluster Tree (DeepECT)
- Deep Embedded Clustering with k-Estimation (DipDECK)
- DipEncoder
- Deep k-Means (DKM)
- Embedded Non-Redundant Clustering (ENRC)
- Autoencoder Centroid-based Deep Cluster (ACeDeC, special case of ENRC)
- Variational Deep Embedding (VaDE)
- Not 2 Deep (N2D)

### **b)** Either install ClustPy following the instructions for users at https://github.com/collinleiber/ClustPy or use Google Colaboratory https://colab.google/. Open and execute the Jupyter notebook provided here https://tinyurl.com/rltutorial2023. This notebook compares standard K-means with several deep clustering methods on an example image data set. What is the clustering accuracy of standard K-means in terms of AMI? What is the clustering accuracy of the best deep clustering algorithm in this notebook?

K-Means
- AMI: 63.80 (0.638 when not scaled by 100)

Best Deep Clustering (DEC)
- AMI: 78.28 (0.7828 when not scaled by 100)

### **c)** Apply one additional deep clustering algorithm of your choice to the image data set and describe its results in comparison to the previous results. Just add this code to the Jupyter notebook available at https://tinyurl.com/rltutorial2023 to try it out. Afterwards, just submit this part as your solution.

## Additional Clustering (ENRC)
Both ENRC clusterings are worse in ACC, ARI, NMI and AMI than the best of the other clusterings, but the second is better than the first.
### First Clustering
ACC: 29.14, ARI: 12.69, NMI: 24.42, AMI: 24.38
### Second Clustering
ACC: 60.11, ARI: 54.46, NMI: 64.57, AMI: 64.55

In [5]:
# This code only works together with the imports, functions and clusterings of the tutorial Jupyter notebook;
# it will raise errors when executed in this notebook.
import os

import joblib
import torch
from clustpy.deep import ENRC

dec_name = "enrc.pt"

TRAIN = False

clustering_lr = 1e-4
if TRAIN:
    # load pretrained autoencoder
    sd = torch.load(model_path)
    ae.load_state_dict(sd)
    ae.to(device)
    ae.eval()

    # ENRC returns one label column per entry in n_clusters (here: two clusterings)
    enrc = ENRC(
        n_clusters=[n_clusters, 6],
        clustering_epochs=150,
        autoencoder=ae,
        clustering_optimizer_params={"lr": clustering_lr},
    )
    enrc.fit(data.cpu().detach().numpy())

    # save with joblib
    joblib.dump(enrc, os.path.join(base_path, dec_name))
else:
    # load with joblib
    enrc = joblib.load(os.path.join(base_path, dec_name))
    enrc.autoencoder.to(device)
print("KMeans - Clustering Result")
evaluate_clustering(labels, kmeans.labels_)
print("\nDCN - Clustering Result")
evaluate_clustering(labels, dcn.labels_)
print("\nDEC - Clustering Result")
evaluate_clustering(labels, dec.labels_)
print("\nIDEC - Clustering Result")
evaluate_clustering(labels, idec.labels_)
print("\nENRC - Clustering Result")
print("Clustering 1")
evaluate_clustering(labels, enrc.labels_[:,0])
print("Clustering 2")
evaluate_clustering(labels, enrc.labels_[:,1])
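
The `evaluate_clustering` helper called above is defined in the tutorial notebook and is not shown here. As a rough illustration of how the four reported metrics could be computed, the following sketch uses scikit-learn for ARI, NMI and AMI, and a Hungarian matching between predicted and true labels for ACC. The function names `clustering_accuracy` and `evaluate_sketch` are hypothetical and are not part of ClustPy or the tutorial. The values are returned in the range up to 1, so they would have to be multiplied by 100 to compare with the numbers reported above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
)


def clustering_accuracy(labels_true, labels_pred):
    """ACC: fraction of points matched under the best one-to-one label mapping."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # Map arbitrary label values to consecutive indices 0..k-1
    true_ids, true_idx = np.unique(labels_true, return_inverse=True)
    pred_ids, pred_idx = np.unique(labels_pred, return_inverse=True)
    # contingency[p, t] = number of points with predicted label p and true label t
    contingency = np.zeros((len(pred_ids), len(true_ids)), dtype=np.int64)
    np.add.at(contingency, (pred_idx, true_idx), 1)
    # Hungarian algorithm: assignment of predicted to true labels with maximal overlap
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    return contingency[rows, cols].sum() / labels_true.size


def evaluate_sketch(labels_true, labels_pred):
    """Hypothetical stand-in for the tutorial's evaluate_clustering helper."""
    return {
        "ACC": clustering_accuracy(labels_true, labels_pred),
        "ARI": adjusted_rand_score(labels_true, labels_pred),
        "NMI": normalized_mutual_info_score(labels_true, labels_pred),
        "AMI": adjusted_mutual_info_score(labels_true, labels_pred),
    }
```

All four metrics are invariant to a renaming of the cluster labels, which is why a clustering can be compared with the ground truth even though its label ids are arbitrary.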

## **Exercise 2-2** *Implement Synchronization-based Clustering within ClustPy*
This exercise focuses on implementing the *SynC* algorithm within ClustPy. Please note that you only have
to implement the basic algorithm given as pseudocode on Slide 15 of the lecture slides. The following materials
might be helpful: the paper describing *SynC*, which you can find in Moodle, and a Java implementation of
*SynC*, available at https://dm.uestc.edu.cn/wp-content/uploads/code/SynC.zip.
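
To make the idea of the basic algorithm concrete, here is a minimal NumPy sketch. It is not the submitted `sync.py`, and it is based on my reading of the SynC paper rather than on the lecture pseudocode. In each step, every point moves towards its ε-neighbors via a sine coupling, and the loop stops once a cluster order parameter is close to 1. Points that end up at (almost) the same location then form a cluster. The function name `sync_sketch`, its parameters and the default thresholds are assumptions made for illustration. Counting each point as its own neighbor is also an assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist


def sync_sketch(X, eps, max_iter=50, order_threshold=1.0 - 1e-6, merge_tol=1e-3):
    """Illustrative SynC-style clustering; O(n^2 * d) memory, small data only."""
    X = np.asarray(X, dtype=float).copy()
    for _ in range(max_iter):
        dist = cdist(X, X)
        neighbors = dist <= eps              # eps-neighborhood (includes the point itself)
        counts = neighbors.sum(axis=1)
        # Cluster order parameter: mean similarity exp(-distance) to the neighbors
        r_c = np.mean((np.exp(-dist) * neighbors).sum(axis=1) / counts)
        if r_c >= order_threshold:
            break
        # diff[i, j] = x_j - x_i; the sine coupling pulls x_i towards its neighbors
        diff = X[None, :, :] - X[:, None, :]
        coupling = np.sin(diff) * neighbors[:, :, None]
        X = X + coupling.sum(axis=1) / counts[:, None]

    # Points that synchronized to (nearly) the same position share a label
    labels = np.full(len(X), -1, dtype=int)
    representatives = []
    for i, x in enumerate(X):
        for label, rep in enumerate(representatives):
            if np.linalg.norm(x - rep) < merge_tol:
                labels[i] = label
                break
        else:
            representatives.append(x)
            labels[i] = len(representatives) - 1
    return labels, X
```

The sketch assumes that `eps` is small relative to the data scale (well below π/2), so that the sine of a coordinate difference always has the same sign as the difference and neighbors attract each other. Rescaling the features beforehand is one way to meet that assumption.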

### **a)** *Implement the basic algorithm **SynC** in ClustPy. Please observe the instructions **for developers** at https://github.com/collinleiber/ClustPy. Write a test for your code. Submit two files, one for the algorithm named **sync.py**, one for the test with name **testsync.py**.*

The algorithm works quite well in ClustPy and is located in clustpy.deep.
The test file is located in the tests subdirectory.
Both the sync.py and the testsync.py file are available in the main directory of the zip file as well (next to this notebook).

### **b)** *Evaluate your implementation (or the Java implementation of **SynC** if your Python implementation is not working) on the synthetic data set that you have created for Exercise 1-2 (at least 3 density-based clusters that cannot be correctly detected by k-Means). Briefly describe the results.*

Results:
- NMI: 0.7446179258502434
- ACC: 0.69
- ARI: 0.6446044838728434
- AMI: 0.7381557666768334

The SynC algorithm falls into many of the same pitfalls as the k-means algorithm and splits the moon clusters into two separate clusters, but it still does quite well on all metrics.
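
For comparison, a k-means baseline on moon-shaped data can be evaluated in a few lines. The data set below is a hypothetical stand-in (two interleaved moons plus one Gaussian blob), not the data set created for Exercise 1-2, so its scores are not comparable to the results above. Because k-means separates clusters with convex cells, it cannot recover the two interleaved moons exactly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical stand-in data: two interleaved moons plus one Gaussian blob
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=0)
X_blob, _ = make_blobs(n_samples=200, centers=[[4.0, 3.0]], cluster_std=0.3, random_state=0)
X = np.vstack([X_moons, X_blob])
y_true = np.concatenate([y_moons, np.full(len(X_blob), 2)])

# k-means with the true number of clusters as a baseline
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
ari = adjusted_rand_score(y_true, kmeans.labels_)
nmi = normalized_mutual_info_score(y_true, kmeans.labels_)
print(f"k-means ARI: {ari:.3f}, NMI: {nmi:.3f}")
```

Running any other algorithm on the same `X` and comparing its labels with `y_true` in the same way gives a direct side-by-side comparison with this baseline.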