# Number of Clusters

This experiment aims to determine the number of clusters present in the data. We use t-SNE to project the data into two dimensions (dimensionality reduction) for visualization, and an error metric to choose the best number of clusters. The metrics considered are the squared 2-norm distance, the silhouette score, and the Davies-Bouldin index.

In this notebook we use the **squared 2-norm distance**. This metric is the mean squared Euclidean distance between each point and the centroid of the cluster it is assigned to.
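
As a minimal sketch of this metric, assuming each point has already been assigned to a centroid (the toy data and the helper name `mean_squared_distance` are illustrative and not part of this notebook's `src` package):

```python
import numpy as np

def mean_squared_distance(X, centroids, labels):
    """Mean of the squared 2-norm distances between each point and its assigned centroid."""
    diffs = X - centroids[labels]  # shape: (n_samples, n_features)
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

# Toy data: two small groups in 2-D (made up for illustration).
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids_toy = np.array([[0.0, 1.0], [10.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])

J_toy = mean_squared_distance(X_toy, centroids_toy, labels_toy)
print(J_toy)
```

For a fitted scikit-learn `KMeans` model, the same quantity should equal `inertia_` divided by the number of samples, which is how `J` is computed in the elbow loop of this notebook.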

Metrics such as the adjusted Rand index, mutual-information-based scores, homogeneity, completeness, and V-measure cannot be used in this work because they require the ground truth (true labels) of the data.
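
By contrast, the silhouette score and the Davies-Bouldin index need only the data and the predicted labels. The sketch below illustrates this on made-up data (the blobs and the `_toy` names are assumptions for illustration, not the Word2Vec features used in this notebook):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Toy data: two well-separated blobs (made up for illustration).
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                   rng.normal(10.0, 0.5, size=(50, 2))])

labels_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_toy)

# Both scores are computed from the data and the predicted labels only.
sil = silhouette_score(X_toy, labels_toy)      # in [-1, 1], higher is better
dbi = davies_bouldin_score(X_toy, labels_toy)  # >= 0, lower is better
print("silhouette:", sil)
print("Davies-Bouldin:", dbi)
```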

In [1]:
import sys
sys.path.append('../')
from src import reader as r
from src import visualization as v

In [2]:
import numpy as np
import sklearn
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA 
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.metrics import silhouette_score

In [4]:
X = r.readWord2Vec()
print(X[:10])
print(X.shape)

['word2vec.csv', 'bags.csv', 'health.txt', 'health-dataset.zip', 'health-dataset']
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
(13229, 1203)


In [None]:
from MulticoreTSNE import MulticoreTSNE as TSNE

smp_sz = 7000
tsne_bow = TSNE(n_components=2, perplexity=500, verbose=True)
tsne_bow_result = tsne_bow.fit_transform(X[:smp_sz])

## Elbow Method --- K-means++
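This method aims to choose the best number of clusters (K) from the cost function J computed by K-means. J is evaluated for increasing values of K, and the search stops once J changes by at most a small threshold between consecutive values of K.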
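
The stopping rule behind this approach can be sketched as follows. The helper `first_flat_k` and the cost values are hypothetical and serve only to illustrate the idea; they are not results from this dataset:

```python
def first_flat_k(Ks, Js, epsilon):
    """Return the first K whose cost differs from the previous K's cost by at most epsilon."""
    for i in range(1, len(Ks)):
        if abs(Js[i - 1] - Js[i]) <= epsilon:
            return Ks[i]
    return None  # the curve never flattened within the tested range

# Hypothetical cost values, invented for illustration only.
Ks_toy = [2, 12, 22, 32, 42]
Js_toy = [0.90, 0.60, 0.45, 0.40, 0.395]

k_toy = first_flat_k(Ks_toy, Js_toy, epsilon=1e-2)
print(k_toy)
```

A threshold rule like this is only a proxy for reading the elbow by eye, so it is worth checking the result against the plotted curve.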

In [None]:
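Ks = []          # tested values of K
Js = []          # cost J for each tested K
Epsilon = 1e-2   # stop once J changes by at most this amount between consecutive Ks
step = 10        # increment applied to K at every iteration
error = np.inf   # forces the first iteration
J = 0.
it = 0

k = 2
print("###############################")
while error > Epsilon:
    print("Number of Clusters:", k)
    print("Starting K-means++")
    # n_jobs is omitted: it was deprecated in scikit-learn 0.23 and removed in 1.0.
    cluster = KMeans(n_clusters=k, random_state=42)
    cluster_result = cluster.fit(X)
    print("Finished")
    previous_J = J
    # Mean squared distance of every point to its assigned centroid.
    J = cluster_result.inertia_ / X.shape[0]
    print("J =", J)
    error = abs(previous_J - J)
    print("error =", error)
    Ks.append(k)
    Js.append(J)
    k += step
    it += 1
    print("###############################")

# k was incremented once more after the last fit, so step back to the last tested K.
best_K = k - step
print("\nBest k:", best_K)
print("Number of iterations:", it)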

In [None]:
v.plot_cluster_errors(Ks,Js)

After analysing the elbow curve, we conclude that the best number of clusters (K) is 52.

In [None]:
print("#############################")
print("Best K =", best_K)
print("Applying K-means")
best_cluster = KMeans(n_clusters=best_K, random_state=42)  # fixed seed, as in the elbow search
best_cluster_result = best_cluster.fit(X)
print("Finished")
print("#############################")

In [None]:
# Despite the name, these are the K-means cluster assignments, not ground-truth labels.
true_label = best_cluster_result.labels_

In [None]:
v.visualize_sup_scatter(tsne_bow_result, true_label[:smp_sz])

## DBSCAN
In this section we again estimate the number of clusters (K), this time with Density-Based Spatial Clustering of Applications with Noise (DBSCAN). DBSCAN uses proximity and density criteria to decide whether a data point belongs to an existing cluster or whether a new cluster should be created; points that fit neither case are labelled as noise.
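
The density criterion can be illustrated with a small NumPy sketch of the core-point test only, not the full cluster expansion that DBSCAN performs. The helper `core_point_mask` and the toy data are assumptions for illustration:

```python
import numpy as np

def core_point_mask(X, eps, min_samples):
    """Flag points with at least min_samples neighbours (the point itself included) within eps."""
    # Pairwise Euclidean distances, shape (n_samples, n_samples).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbour_counts = np.sum(dists <= eps, axis=1)
    return neighbour_counts >= min_samples

# Toy data: three nearby points and one isolated point (made up for illustration).
X_toy = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [5.0, 5.0]])
mask = core_point_mask(X_toy, eps=0.9, min_samples=2)
print(mask)
```

The dense pairwise distance matrix grows quadratically with the number of samples, so this sketch is meant for small examples rather than the full feature matrix.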

In [None]:
dbscan = DBSCAN(eps=0.9,min_samples=2)
y = dbscan.fit_predict(X)

In [None]:
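# DBSCAN labels noise points as -1, which is not a cluster, so exclude it from the count.
n_clusters = len(np.unique(y[y != -1]))
print("Number of clusters", n_clusters)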
v.visualize_sup_scatter(tsne_bow_result, y[:smp_sz])