<h1>DISCERN</h1>
<h2>Diversity-based Selection of Centroids and k-Estimation for Rapid Non-stochastic clustering</h2>
<h3>Applying DISCERN to small multivariate datasets</h3>

<h3>1. Import dependencies</h3>

In [1]:
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn.metrics import silhouette_score

from DISCERN import DISCERN
from utils.scores import purity_score as purity
from utils.SphericalKMeans import spherical_k_means

<h3>2. Import data</h3>

In [2]:
# Load Iris; '?' marks missing values, which are then replaced with 0.
iris_dataset = pd.read_csv('path_to_iris/iris.csv', na_values='?')
iris_dataset = iris_dataset.fillna(0)

# Every column except the last is a feature; the last column is the class label.
X_1 = iris_dataset.iloc[:, :-1].to_numpy()
y_tmp_1 = iris_dataset.iloc[:, -1].to_numpy()
# Encode class labels as integers 0..n_classes-1.
label_encoder_1 = preprocessing.LabelEncoder()
y_1 = label_encoder_1.fit_transform(y_tmp_1)
iris_num_class = len(np.unique(y_1))

In [3]:
# Load Wine; '?' marks missing values, which are then replaced with 0.
wine_dataset = pd.read_csv('path_to_wine/wine.csv', na_values='?')
wine_dataset = wine_dataset.fillna(0)

# Every column except the last is a feature; the last column is the class label.
X_2 = wine_dataset.iloc[:, :-1].to_numpy()
y_tmp_2 = wine_dataset.iloc[:, -1].to_numpy()
# Encode class labels as integers 0..n_classes-1.
label_encoder_2 = preprocessing.LabelEncoder()
y_2 = label_encoder_2.fit_transform(y_tmp_2)
wine_num_class = len(np.unique(y_2))

<h3>3. Running the algorithm</h3>

<h3>Iris</h3>

In [4]:
discern_iris = DISCERN(metric='cosine')

In [5]:
discern_iris.fit(X_1)

In [6]:
iris_silhouette = silhouette_score(X_1, discern_iris.labels_, metric='cosine')
iris_accuracy = purity(y_1, discern_iris.labels_)*100
iris_num_clusters = len(np.unique(discern_iris.labels_))

In [7]:
print("[Unsupervised Performance] Silhouette Score: {}".format(iris_silhouette))
print("[Supervised   Performance] Accuracy: {} %".format(iris_accuracy))
print("Predicted number of clusters: {}".format(iris_num_clusters))
print("Number of classes: {}".format(iris_num_class))

[Unsupervised Performance] Silhouette Score: 0.7489936586396754
[Supervised   Performance] Accuracy: 96.66666666666667 %
Predicted number of clusters: 3
Number of classes: 3


Notice that <b>Iris</b> was clustered using cosine similarity, whereas <b>Wine</b> (below) is clustered using the default metric, Euclidean distance. The accuracy here is 96.67%, while the paper reports 97.3%. For the experiments in the paper, we implemented K-Means ourselves and therefore had full control over the distance metric. In this implementation, instead of changing the metric, we normalize the data before clustering (each sample is divided by its norm so that every sample has unit length):

$X = [ x_1, x_2, \ldots, x_n ]$

$x_i^{\prime} = \frac{x_i}{\| x_i \|_2}$

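Clustering the normalized data with Euclidean distance gives an almost identical result to clustering with cosine distance, because for unit-norm vectors $\| x^{\prime} - y^{\prime} \|_2^2 = 2 - 2\, x^{\prime T} y^{\prime}$, so the two metrics rank pairs of samples in the same order. Slight discrepancies can still occur, for example because K-Means centroids are computed as means and are generally not unit-norm themselves, and that is what we see here.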
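The link between the two metrics on normalized data can be checked numerically. The snippet below is a minimal sketch on random data (the array `A` is made up for illustration and is not one of the datasets in this notebook); it compares squared Euclidean distances between unit-norm rows with $2(1 - \cos)$:

```python
import numpy as np

# Hypothetical toy data, used only to illustrate the identity.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))

# Divide each row by its L2 norm so that every row has unit length.
A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)

# Pairwise cosine similarity is a plain dot product for unit-norm rows.
cos_sim = A_unit @ A_unit.T

# Pairwise squared Euclidean distance between the normalized rows.
diff = A_unit[:, None, :] - A_unit[None, :, :]
sq_dist = np.sum(diff ** 2, axis=-1)

# For unit-norm vectors: ||x - y||^2 = 2 * (1 - cos(x, y)).
identity_holds = np.allclose(sq_dist, 2.0 * (1.0 - cos_sim))
print(identity_holds)
```

The identity only covers distances between unit-norm vectors; a centroid obtained by averaging such vectors usually has norm below 1, which is one place where the two approaches can diverge.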

To summarize, when the metric is set to <b>cosine</b>:
<table style="width: 100%">
    <tr>
        <th>In the paper</th>
        <th>The implementation above</th>
    </tr>
    <tr>
        <td>$d(x, y) = 1 - \frac{x^T y}{\|x\|_2 \|y\|_2}$</td>
        <td>$d(x, y) = \left\| \frac{x}{\|x\|_2} - \frac{y}{\|y\|_2} \right\|_2$</td>
    </tr>
</table>

However, we can reproduce the paper's results by running K-Means with cosine distance manually, after DISCERN has found the initial centroids:

In [8]:
discern_iris.partial_fit(X_1)
cluster_centers = discern_iris.cluster_centers_

In [9]:
labels_, cluster_centers_ = spherical_k_means(X_1, cluster_centers)

In [10]:
iris_spherical_silhouette = silhouette_score(X_1, labels_, metric='cosine')
iris_spherical_accuracy = purity(y_1, labels_)*100

In [11]:
print("[Unsupervised Performance] Silhouette Score: {}".format(iris_spherical_silhouette))
print("[Supervised   Performance] Accuracy: {} %".format(iris_spherical_accuracy))

[Unsupervised Performance] Silhouette Score: 0.7484647230660484
[Supervised   Performance] Accuracy: 97.33333333333334 %


The accuracy now matches the value reported in the paper (97.3%).
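The source of `spherical_k_means` in `utils.SphericalKMeans` is not shown in this notebook. As a rough illustration of what K-Means with cosine distance involves, here is a minimal sketch written from the general definition of spherical K-Means; the function name `spherical_k_means_sketch`, its `max_iter` parameter, and the toy data are assumptions for illustration and may differ from the repository's implementation. It assumes no sample is the zero vector.

```python
import numpy as np

def spherical_k_means_sketch(X, init_centers, max_iter=100):
    """Illustrative spherical K-Means; not the repository's implementation."""
    # Project samples and initial centroids onto the unit sphere.
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = init_centers / np.linalg.norm(init_centers, axis=1, keepdims=True)
    labels = None
    for _ in range(max_iter):
        # Assign each sample to the centroid with the highest cosine similarity.
        new_labels = np.argmax(X_unit @ centers.T, axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for k in range(centers.shape[0]):
            members = X_unit[labels == k]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                # Re-normalize so the centroid stays on the unit sphere.
                centers[k] = total / norm
    return labels, centers

# Hypothetical toy data: two directions in the plane.
X_toy = np.array([[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.2, 3.0]])
init_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_labels, toy_centers = spherical_k_means_sketch(X_toy, init_toy)
```

The key difference from ordinary K-Means is the re-normalization of each centroid after the update step, which keeps every comparison a pure cosine similarity.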

<h3>Wine</h3>

In [12]:
discern_wine = DISCERN()

In [13]:
discern_wine.fit(X_2)

In [14]:
wine_silhouette = silhouette_score(X_2, discern_wine.labels_, metric='sqeuclidean')
wine_accuracy = purity(y_2, discern_wine.labels_)*100
wine_num_clusters = len(discern_wine.cluster_centers_)

In [15]:
print("[Unsupervised Performance] Silhouette Score: {}".format(wine_silhouette))
print("[Supervised   Performance] Accuracy: {} %".format(wine_accuracy))
print("Predicted number of clusters: {}".format(wine_num_clusters))
print("Number of classes: {}".format(wine_num_class))

[Unsupervised Performance] Silhouette Score: 0.7322991109041611
[Supervised   Performance] Accuracy: 70.2247191011236 %
Predicted number of clusters: 3
Number of classes: 3
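
The accuracy values reported for Iris and Wine come from `purity_score` in `utils.scores`, whose source is not shown here. Assuming it follows the standard definition of cluster purity (each cluster is credited with its most frequent true class), a minimal sketch could look like the following; `purity_sketch` and the toy labels are hypothetical and are not the repository's code:

```python
import numpy as np

def purity_sketch(y_true, y_pred):
    """Standard cluster purity; assumed, not taken from utils.scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = 0
    for cluster in np.unique(y_pred):
        # Count how often each true class occurs inside this cluster.
        _, counts = np.unique(y_true[y_pred == cluster], return_counts=True)
        # Credit the cluster with its majority class.
        correct += counts.max()
    return correct / len(y_true)

# Hypothetical labels: cluster ids need not match class ids.
toy_purity = purity_sketch([0, 0, 1, 1, 1], [1, 1, 0, 0, 1])
```

Because purity credits each cluster with its majority class, it does not depend on how cluster ids are numbered, which is why it can be compared against encoded class labels directly.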
