# Validating the Clusters
In this experiment, several internal metrics are used to validate the clusters. Metrics such as the adjusted Rand index, mutual-information-based scores, homogeneity, completeness, and V-measure cannot be used in this work because they require the ground truth (true labels) of the data.

Number of clusters used: 32.

## Loading the Data

In [1]:
import sys
import os
sys.path.append('../')
from src import reader as r
from src import visualization as v



In [2]:
%matplotlib notebook
import numpy as np
import pandas as pd
import sklearn
print(sklearn.__version__)
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.metrics import davies_bouldin_score, calinski_harabaz_score

0.20.0


In [3]:
X = r.readWord2Vec()
npX = X
print(npX)
print(X[:10])
print(X.shape)

['word2vec.csv', 'bags.csv', 'health.txt', 'health-dataset.zip', 'health-dataset']
[[ 0.02987077 -0.15110606 -0.02884087 ...  0.02446168 -0.08834651
  -0.09221231]
 [ 0.05298314 -0.05420527  0.02592565 ...  0.01782615 -0.02950471
   0.00508323]
 [ 0.07016749 -0.05757345 -0.13483836 ...  0.10909334 -0.0250241
  -0.0654501 ]
 ...
 [ 0.11720041  0.02071754 -0.10931976 ... -0.05101222  0.00296909
   0.03905441]
 [ 0.00200901 -0.04285163  0.09034279 ...  0.05065811 -0.01281281
  -0.05488863]
 [ 0.0455922   0.00148772  0.06533482 ... -0.13578176 -0.0725346
  -0.13827453]]
[[ 0.02987077 -0.15110606 -0.02884087 ...  0.02446168 -0.08834651
  -0.09221231]
 [ 0.05298314 -0.05420527  0.02592565 ...  0.01782615 -0.02950471
   0.00508323]
 [ 0.07016749 -0.05757345 -0.13483836 ...  0.10909334 -0.0250241
  -0.0654501 ]
 ...
 [ 0.10532002 -0.05241808 -0.02433    ... -0.01405231  0.03333547
   0.01318201]
 [ 0.10429937 -0.1797766  -0.05073992 ...  0.01325834 -0.18105656
  -0.07903843]
 [ 0.13153867 -0.0

In [4]:
from MulticoreTSNE import MulticoreTSNE as TSNE  # replaces the sklearn TSNE imported earlier

smp_sz = X.size
tsne_bow = TSNE(n_components=2, perplexity=10, verbose=True, n_jobs=-1)#500
tsne_bow_result = tsne_bow.fit_transform(X)

## Computing clusters

In [5]:
best_K = 32
print("#############################")
print("Best K =", best_K)
print("Applying K-means")
best_cluster = KMeans(n_clusters=best_K, n_jobs=-1)
best_cluster_result = best_cluster.fit(X)
y_pred = best_cluster_result.labels_
print("Finished")
print("#############################")

#############################
Best K = 82
Applying K-means
Finished
#############################


### Checking whether any cluster has 0 points

In [6]:
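# Count the points assigned to each cluster; a count of 0 means an empty cluster
cluster_sizes = np.bincount(y_pred, minlength=best_K)
for i, size in enumerate(cluster_sizes):
    print("Cluster {} has {} elements".format(i, size))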

Cluster 0 has 24 elements
Cluster 1 has 190 elements
Cluster 2 has 175 elements
Cluster 3 has 182 elements
Cluster 4 has 170 elements
Cluster 5 has 392 elements
Cluster 6 has 51 elements
Cluster 7 has 152 elements
Cluster 8 has 275 elements
Cluster 9 has 351 elements
Cluster 10 has 118 elements
Cluster 11 has 213 elements
Cluster 12 has 94 elements
Cluster 13 has 85 elements
Cluster 14 has 90 elements
Cluster 15 has 272 elements
Cluster 16 has 244 elements
Cluster 17 has 230 elements
Cluster 18 has 247 elements
Cluster 19 has 229 elements
Cluster 20 has 135 elements
Cluster 21 has 215 elements
Cluster 22 has 154 elements
Cluster 23 has 62 elements
Cluster 24 has 16 elements
Cluster 25 has 115 elements
Cluster 26 has 344 elements
Cluster 27 has 201 elements
Cluster 28 has 151 elements
Cluster 29 has 207 elements
Cluster 30 has 180 elements
Cluster 31 has 68 elements
Cluster 32 has 306 elements
Cluster 33 has 173 elements
Cluster 34 has 151 elements
Cluster 35 has 345 elements
Cluster 36

## Applying Metrics

### Davies-Bouldin
This metric was not covered in class, but it can be used to evaluate a model when the ground truth is unknown. The index is defined as the average, over all clusters, of the similarity between each cluster and its most similar one, where similarity is the ratio of within-cluster distances to between-cluster distances. Zero is the lowest possible score, and values closer to zero indicate a better partition.

In [7]:
score_db = davies_bouldin_score(npX,y_pred)
print("Davies Bouldin")
print(score_db)

Davies Bouldin
4.1631135708591644
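
To make the definition above concrete, the following is a minimal NumPy sketch of the Davies-Bouldin index written directly from that definition. It is not the scikit-learn implementation; the function name `davies_bouldin` and the toy data are assumptions introduced only for illustration.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index from its definition (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    # One centroid per cluster
    centroids = np.array([X[labels == k].mean(axis=0) for k in ids])
    # s_i: mean Euclidean distance from the points of cluster i to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == k] - c, axis=1).mean()
        for k, c in zip(ids, centroids)
    ])
    # d_ij: Euclidean distance between the centroids of clusters i and j
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # exclude the i == j pairs
    # R_ij = (s_i + s_j) / d_ij, the similarity between clusters i and j
    ratios = (scatter[:, None] + scatter[None, :]) / dist
    # Average, over clusters, of the similarity to the most similar other cluster
    return ratios.max(axis=1).mean()

# Hypothetical toy data: two tight, well-separated clusters
toy_X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
toy_labels = np.array([0, 0, 1, 1])
print(davies_bouldin(toy_X, toy_labels))
```

In the toy data each cluster has scatter 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2. Compact clusters that lie far apart push the value toward zero.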




### Kullback - Leibler

### Calinski-Harabasz
If the ground truth labels are not known, the Calinski-Harabasz index, also known as the Variance Ratio Criterion, can be used to evaluate the model; a higher score indicates better-defined clusters. In scikit-learn 0.20 the function name is spelled `calinski_harabaz_score`.
To assess this metric, it must be computed for several numbers of clusters and the results analysed visually.

In [8]:
score_ch = calinski_harabaz_score(npX,y_pred)
print("Calinski and Harabaz")
print(score_ch)

Calinski and Harabaz
42.908535907090254
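
As a rough illustration of why a single score is hard to interpret on its own, the sketch below computes the Variance Ratio Criterion from its definition (between-cluster dispersion over within-cluster dispersion, each divided by its degrees of freedom) and compares two partitions of the same points. The function name `calinski_harabasz` and the toy data are assumptions made for this example; this is not the scikit-learn code.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Variance Ratio Criterion from its definition (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    n, k = X.shape[0], ids.size
    overall_mean = X.mean(axis=0)
    between = 0.0  # weighted squared distances of centroids to the overall mean
    within = 0.0   # squared distances of points to their own centroid
    for c in ids:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += members.shape[0] * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical toy data: the same four points partitioned in two ways
toy_X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = calinski_harabasz(toy_X, [0, 0, 1, 1])  # groups nearby points together
poor = calinski_harabasz(toy_X, [0, 1, 0, 1])  # mixes distant points
print(good, poor)
```

The partition that groups nearby points scores 50, while the one that mixes distant points scores 0.08. The same comparison can be made across candidate values of K by scoring the labels of each K-means fit and plotting the score against K.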
