# Validating the Clusters
In this experiment, several internal metrics are used to validate the clusters. External metrics such as the adjusted Rand index, mutual-information-based scores, homogeneity, completeness and V-measure cannot be used in this work because they require the ground truth (true labels) of the data.

Number of clusters used: 72.

## Loading the Data

In [1]:
import sys
import os
sys.path.append('../')
from src import reader as r
from src import visualization as v

In [12]:
%matplotlib notebook
import numpy as np
import pandas as pd
import sklearn
print(sklearn.__version__)
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.metrics import davies_bouldin_score, calinski_harabaz_score

0.20.0


In [3]:
X = r.readBOW()
npX = X.values
print(npX)
print(X.head(10))
print(X.shape)

['health.txt', 'bags.csv', 'word2vec.csv']
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
   0     1     2     3     4     5     6     7     8     9     ...   1193  \
0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0   
1   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0   
2   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0   
3   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0   
4   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0   
5   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0   
6   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0   
7   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0   
8   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0   
9   0.0   0.0   0.0   0.0   0.0

In [4]:
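# Note: this rebinds the name TSNE, shadowing sklearn.manifold.TSNE imported above
from MulticoreTSNE import MulticoreTSNE as TSNE

smp_sz = X.size  # total number of entries (rows x columns) in X
# Project the bag-of-words matrix to 2-D for visualization
tsne_bow = TSNE(n_components=2, perplexity=10, verbose=True, n_jobs=-1)  # 500
tsne_bow_result = tsne_bow.fit_transform(X)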

## Computing clusters

In [5]:
best_K = 72
print("#############################")
print("Best K =", best_K)
print("Applying K-means")
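# No random_state is set, so the assignment (and the scores below) can vary between runs
best_cluster = KMeans(n_clusters=best_K, n_jobs=-1)
best_cluster_result = best_cluster.fit(X)
y_pred = best_cluster_result.labels_  # cluster index assigned to each row of X
print("Finished")
print("#############################")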

#############################
Best K = 72
Applying K-means
Finished
#############################


### Checking whether any cluster has 0 points

In [11]:
for i in range(best_K):
    print("Cluster "+str(i)+" has "+str(npX[y_pred==i].shape[0])+" elements")

Cluster 0 has 143 elements
Cluster 1 has 366 elements
Cluster 2 has 128 elements
Cluster 3 has 40 elements
Cluster 4 has 192 elements
Cluster 5 has 124 elements
Cluster 6 has 1 elements
Cluster 7 has 96 elements
Cluster 8 has 156 elements
Cluster 9 has 1 elements
Cluster 10 has 841 elements
Cluster 11 has 160 elements
Cluster 12 has 118 elements
Cluster 13 has 154 elements
Cluster 14 has 160 elements
Cluster 15 has 109 elements
Cluster 16 has 314 elements
Cluster 17 has 232 elements
Cluster 18 has 46 elements
Cluster 19 has 108 elements
Cluster 20 has 205 elements
Cluster 21 has 69 elements
Cluster 22 has 288 elements
Cluster 23 has 142 elements
Cluster 24 has 171 elements
Cluster 25 has 1 elements
Cluster 26 has 106 elements
Cluster 27 has 91 elements
Cluster 28 has 383 elements
Cluster 29 has 129 elements
Cluster 30 has 1 elements
Cluster 31 has 46 elements
Cluster 32 has 89 elements
Cluster 33 has 124 elements
Cluster 34 has 4798 elements
Cluster 35 has 40 elements
Cluster 36 has 10
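
The per-cluster count above can also be obtained in a single call, which makes it easier to flag empty or single-point clusters programmatically. The following is a minimal sketch on toy labels, not the notebook's data; `cluster_sizes` and `toy_labels` are hypothetical names introduced for illustration, with `toy_labels` standing in for `y_pred`.

```python
import numpy as np

def cluster_sizes(labels, n_clusters):
    """Return the number of points assigned to each cluster id in 0..n_clusters-1."""
    labels = np.asarray(labels)
    # minlength keeps clusters that received no points in the result (count 0)
    return np.bincount(labels, minlength=n_clusters)

# Toy labels standing in for y_pred; cluster 3 is deliberately left empty
toy_labels = np.array([0, 0, 1, 2, 2, 2, 4])
sizes = cluster_sizes(toy_labels, n_clusters=5)

empty = np.flatnonzero(sizes == 0)       # ids of clusters with no points
singletons = np.flatnonzero(sizes == 1)  # ids of clusters with exactly one point
print("sizes:", sizes)
print("empty clusters:", empty)
print("single-point clusters:", singletons)
```

Applied to the notebook, the call would presumably be `cluster_sizes(y_pred, best_K)`. Listing single-point clusters separately is useful here because the output above already shows several clusters with only 1 element.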

## Applying Metrics

### Davies-Bouldin
The Davies-Bouldin index was not covered in class, but it can be used to evaluate a model when the ground truth is not known. It is defined as the average similarity between each cluster and its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Zero is the lowest possible score, and values closer to zero indicate a better partition.

In [8]:
score_db = davies_bouldin_score(npX,y_pred)
print("Davies Bouldin")
print(score_db)

Davies Bouldin
3.416583706740068
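
To make the definition concrete, the index can be written out by hand. The sketch below assumes Euclidean distances, a within-cluster scatter equal to the mean distance of a cluster's points to its centroid, and distinct centroids; the function name `davies_bouldin` and the toy data are hypothetical and are not part of the notebook.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Mean over clusters of the worst (s_i + s_j) / d_ij ratio, with j != i."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # s_i: mean Euclidean distance from the points of cluster i to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[k], axis=1).mean()
        for k, c in enumerate(ids)
    ])
    # d_ij: Euclidean distance between the centroids of clusters i and j
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a cluster is never compared with itself
    ratios = (scatter[:, None] + scatter[None, :]) / dist
    return ratios.max(axis=1).mean()

# Toy data: two tight groups that lie far apart
toy_X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
db_good = davies_bouldin(toy_X, [0, 0, 1, 1])  # labels follow the two groups
db_bad = davies_bouldin(toy_X, [0, 1, 0, 1])   # labels mix the two groups
print(db_good, db_bad)
```

On this toy data the labelling that follows the two groups scores 0.2, while the mixed labelling scores 5.0, which illustrates why values closer to zero indicate a better partition.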


  score = (intra_dists[:, None] + intra_dists) / centroid_distances


### Kullback - Leibler

### Calinski-Harabasz
If the ground truth labels are not known, the Calinski-Harabasz index (also known as the Variance Ratio Criterion, and spelled `calinski_harabaz_score` in the scikit-learn version used here) can be used to evaluate the model. A higher score indicates better-defined clusters.
A single value is hard to interpret on its own, so the metric must be computed for several values of K and the results compared visually.

In [14]:
score_ch = calinski_harabaz_score(npX,y_pred)
print("Calinski and Harabaz")
print(score_ch)

Calinski and Harabaz
30.239542352563117
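
Because the score is only meaningful in comparison, it helps to see how it is built. The sketch below computes the variance ratio by hand, assuming the usual definition: between-cluster dispersion divided by (k - 1), over within-cluster dispersion divided by (n - k). The name `variance_ratio` and the toy data are hypothetical and are used only for illustration.

```python
import numpy as np

def variance_ratio(X, labels):
    """Between-cluster dispersion / (k - 1) over within-cluster dispersion / (n - k)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    n, k = X.shape[0], ids.size
    overall_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in ids:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Squared distance of the centroid to the global mean, weighted by cluster size
        between += members.shape[0] * np.sum((centroid - overall_mean) ** 2)
        # Squared distances of the cluster's points to their own centroid
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# Toy data: two tight groups that lie far apart
toy_X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
ch_good = variance_ratio(toy_X, [0, 0, 1, 1])  # labels follow the two groups
ch_bad = variance_ratio(toy_X, [0, 1, 0, 1])   # labels mix the two groups
print(ch_good, ch_bad)
```

On this toy data the labelling that follows the groups scores 50.0 and the mixed one 0.08. In the notebook, the equivalent comparison would be to refit `KMeans` for a range of K values and plot the resulting `calinski_harabaz_score` against K.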
