#   Kernel k-Means

This notebook implements kernel k-means with an RBF kernel, using `SpectralClustering` from sklearn to reduce the programming effort of a full implementation.
The task is to implement the kernel k-means algorithm with the RBF kernel, that is:

$$k(\mathbf{x}_i,\mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{2\sigma^2}\right)$$

For the $2\sigma^2$ parameter, instead of a constant value, use the mean pairwise squared distance between the data points, that is:

$$2\sigma^2 = \frac{1}{N^2} \sum_{i=1}^{N}\sum_{j=1}^{N} \| \mathbf{x}_i -\mathbf{x}_j\|_2^2$$ 
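As a small worked example of the two formulas above, the following sketch evaluates them with plain NumPy. The three points in `X_toy` are hypothetical and chosen only so the arithmetic is easy to check by hand; they are not taken from `train.csv`.

```python
import numpy as np

# Hypothetical toy data: three collinear 2-D points (not from train.csv).
X_toy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
N = X_toy.shape[0]

# Squared Euclidean distance between every pair of points, via broadcasting.
diff = X_toy[:, None, :] - X_toy[None, :, :]
sq_dists = np.sum(diff ** 2, axis=-1)

# 2*sigma^2 as the double sum over all N^2 ordered pairs.
# The zero diagonal (i == j) is included, exactly as in the formula.
two_sigma_sq = sq_dists.sum() / N ** 2

# RBF kernel matrix built from the squared distances.
K_toy = np.exp(-sq_dists / two_sigma_sq)

print(two_sigma_sq)
print(np.round(K_toy, 4))
```

Here the squared distances are 25, 100 and 25, so the double sum is 300 and $2\sigma^2 = 300/9 \approx 33.33$. The kernel value between the first two points is then $\exp(-25 / 33.33) = \exp(-0.75)$.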

To calculate the pairwise distance matrix, we use the `distance` module from `scipy.spatial`. <br>
Per the `SpectralClustering` documentation, setting `affinity='precomputed'` lets us pass an affinity matrix directly, so we supply the RBF kernel matrix specified above.
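To illustrate the precomputed-affinity workflow in isolation, here is a minimal sketch on synthetic data. The two blobs in `X_blobs`, their sizes, and the choice of two clusters are assumptions made for illustration only.

```python
import numpy as np
from scipy.spatial import distance
from sklearn.cluster import SpectralClustering

# Hypothetical toy data: two well-separated 2-D blobs of 10 points each.
rng = np.random.default_rng(0)
X_blobs = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(10, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(10, 2)),
])

# Full N x N matrix of squared Euclidean distances.
sq_dists_blobs = distance.squareform(distance.pdist(X_blobs, 'sqeuclidean'))

# RBF affinities with 2*sigma^2 set to the mean pairwise squared distance.
affinity_blobs = np.exp(-sq_dists_blobs / np.mean(sq_dists_blobs))

# affinity='precomputed' makes SpectralClustering treat its input as affinities
# rather than as raw feature vectors.
model = SpectralClustering(n_clusters=2, affinity='precomputed', random_state=0)
labels_blobs = model.fit_predict(affinity_blobs)

print(labels_blobs)
```

With blobs this far apart, the points of each blob should share one label. Which blob is numbered 0 and which is numbered 1 is arbitrary.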

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import SpectralClustering
from scipy.spatial import distance

df = pd.read_csv('train.csv', header = None)
X = df.iloc[:1000,1:].to_numpy(copy=True)
Y = df.iloc[:1000,:1].to_numpy(copy=True)

In [2]:
n_clusters = 10 # Number of clusters

# ====================== CODE HERE ======================  
# using random_state=0 in the SpectralClustering function to make the results reproducible.
pairwise_distances = distance.squareform(distance.pdist(X, 'sqeuclidean')) # Pairwise distance calculation

two_sigma_sq = np.mean(pairwise_distances) # Mean over all N^2 entries, which is 2*sigma^2 per the formula
affinity_matrix = np.exp(-pairwise_distances / two_sigma_sq) # RBF kernel matrix
clustering = SpectralClustering(n_clusters=n_clusters, affinity='precomputed', random_state=0) # Spectral Clustering
labels = clustering.fit_predict(affinity_matrix) # Labels for the clusters

centroids_rbf = [] # Array to store centroids
for cluster in range(n_clusters): # For each cluster
    centroids_rbf.append(np.mean(X[labels == cluster], axis=0))  # Calculate the centroid of the cluster

centroids_rbf = np.array(centroids_rbf) # Convert to numpy array
# =================================================================
    
print('Please copy the following result to Question 4 "Sum of centroids = )"') 
print(np.round(np.sum(centroids_rbf),2))


Please copy the following result to Question 4 "Sum of centroids = )"
36.78
