# Test Utils for sBERT-RETR study
## Clustering Validation

In [1]:
from sklearn import datasets

In [2]:
import sys
sys.path.append('/content/drive/My Drive/Colab Notebooks/RTER')

In [3]:
import clustering_validation as clus_val

In [4]:
iris = datasets.load_iris()
iris.data.shape

(150, 4)

### External Measures

Let $\mathbf{D}$ be a dataset consisting of $n$ points $\mathbf{x}_{i}$ in a $d$-dimensional space, partitioned into $k$ clusters. Let $y_{i}\in \{1, 2, ..., k\}$ denote the ground-truth cluster membership or label information for each point. The ground-truth clustering is given as $\mathcal{T}=\{T_{1}, T_{2}, ..., T_{k}\}$, where the cluster $T_{j}$ consists of all the points with label $j$, i.e., $T_{j}=\{\mathbf{x}_{i}\in \mathbf{D}\mid y_{i}=j\}$. Also let $\mathcal{C}=\{C_{1}, C_{2}, ..., C_{r}\}$ denote a clustering of the same dataset into $r$ clusters, obtained via some clustering algorithm, and let $\hat{y}_{i}\in \{1, 2, ..., r\}$ denote the cluster label for $\mathbf{x}_{i}$. For clarity, henceforth, we will refer to $\mathcal{T}$ as the ground-truth partitioning, and to each $T_{i}$ as a partition. We will call $\mathcal{C}$ a clustering, with each $C_{i}$ referred to as a cluster. Because the ground truth is assumed to be known, clustering methods are typically run with the correct number of clusters, that is, with $r=k$. However, to keep the discussion more general, we allow $r$ to be different from $k$.

External evaluation measures try to capture the extent to which points from the same partition appear in the same cluster, and the extent to which points from different partitions are grouped in different clusters. There is usually a trade-off between these two goals, which is either explicitly captured by a measure or is implicit in its computation. All of the external measures rely on the $r\times k$ contingency table $\mathbf{N}$ that is induced by a clustering $\mathcal{C}$ and the ground-truth partitioning $\mathcal{T}$, defined as follows:
$$\mathbf{N}(i, j)=n_{ij}=|C_{i}\cap T_{j}|$$
In other words, the count $n_{ij}$ denotes the number of points that are common to cluster $C_{i}$ and ground-truth partition $T_{j}$. Further, for clarity, let $n_{i}=|C_{i}|$ denote the number of points in cluster $C_{i}$, and $m_{j}=|T_{j}|$ denote the number of points in partition $T_{j}$. The contingency table can be computed from $\mathcal{T}$ and $\mathcal{C}$ in $O(n)$ time by examining the partition and cluster labels, $y_{i}$ and $\hat{y}_{i}$, for each point $\mathbf{x}_{i}\in \mathbf{D}$ and incrementing the corresponding count $n_{\hat{y}_{i}y_{i}}$ (row index: cluster label; column index: partition label).
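
The single-pass construction described above can be sketched in plain Python. This is only an illustration, not code from `clustering_validation`: the function name `contingency_table` and the toy label lists are assumptions made for the example. Rows index clusters and columns index ground-truth partitions, matching $\mathbf{N}(i, j)=|C_{i}\cap T_{j}|$.

```python
def contingency_table(y_true, y_pred):
    """Build the r x k table N with N[i][j] = |C_i intersect T_j|.

    Rows index clusters (predicted labels) and columns index
    ground-truth partitions. Labels can be any hashable, sortable
    values; they are mapped to consecutive indices in sorted order.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    clusters = sorted(set(y_pred))
    partitions = sorted(set(y_true))
    row = {c: i for i, c in enumerate(clusters)}
    col = {t: j for j, t in enumerate(partitions)}
    table = [[0] * len(partitions) for _ in clusters]
    # One pass over the points: each point increments exactly one cell.
    for t, c in zip(y_true, y_pred):
        table[row[c]][col[t]] += 1
    return table


# Hypothetical toy labels, not taken from the iris run in this notebook.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [1, 1, 0, 0, 0, 2, 2, 0]

N = contingency_table(y_true, y_pred)
cluster_sizes = [sum(r) for r in N]            # n_i = |C_i| (row sums)
partition_sizes = [sum(c) for c in zip(*N)]    # m_j = |T_j| (column sums)
print(N, cluster_sizes, partition_sizes)
```

The counting loop is linear in $n$; the only extra cost is sorting the distinct labels, which is negligible when $r$ and $k$ are small.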

See also the scikit-learn clustering evaluation section: https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation

In [6]:
from sklearn.cluster import KMeans

In [7]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [8]:
# k = 3 matches the three iris species; no random_state is set, so results can vary between runs
clust = KMeans(n_clusters=3)

In [9]:
# fit() returns the fitted estimator itself; the cluster assignments are in clusters.labels_
clusters = clust.fit(iris.data)

In [10]:
clus_val.purity(iris.target, clusters.labels_)

0.8933333333333333

In [12]:
clus_val.fmeasure(iris.target, clusters.labels_)

0.8917748917748917
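
The source of `clus_val.purity` and `clus_val.fmeasure` is not shown in this notebook, so the following is only a sketch under commonly used definitions, not the module's actual code. It assumes purity is $\frac{1}{n}\sum_{i}\max_{j} n_{ij}$ and that the F-measure is the mean over clusters of $F_{i}=\frac{2\,n_{ij_{i}}}{n_{i}+m_{j_{i}}}$, where $j_{i}$ is the majority partition of cluster $C_{i}$. The function names and the toy table are hypothetical.

```python
def purity_from_table(table):
    """Purity: fraction of points that lie in their cluster's majority partition."""
    n = sum(sum(row) for row in table)
    return sum(max(row) for row in table) / n


def f_measure_from_table(table):
    """Mean over clusters of F_i = 2 * n_ij / (n_i + m_j), j = majority partition.

    Assumes every row (cluster) contains at least one point.
    """
    partition_sizes = [sum(col) for col in zip(*table)]  # m_j (column sums)
    scores = []
    for row in table:
        n_i = sum(row)                                   # cluster size
        j = max(range(len(row)), key=row.__getitem__)    # majority partition index
        # Harmonic mean of precision n_ij / n_i and recall n_ij / m_j.
        scores.append(2 * row[j] / (n_i + partition_sizes[j]))
    return sum(scores) / len(scores)


# Hypothetical 2 x 2 contingency table (rows: clusters, columns: partitions).
toy = [[4, 1],
       [0, 5]]

toy_purity = purity_from_table(toy)
toy_f = f_measure_from_table(toy)
print(toy_purity, toy_f)
```

In the toy table, 9 of the 10 points fall in their cluster's majority partition, so the purity is 0.9; the two per-cluster F-values are 8/9 and 10/11.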

In [13]:
clus_val.clustering_measures(iris.target, clusters.labels_)

{'completeness': 0.7649861514489815,
 'fMeasure': 0.8917748917748917,
 'fowMal': 0.8208080729114153,
 'homogeneity': 0.7514854021988338,
 'mutInfo': 0.7551191675800484,
 'purity': 0.8933333333333333,
 'rand': 0.7302382722834697,
 'vMeasure': 0.7581756800057784}