# Validation with ELKI

We wrote an interface to the ELKI CLIQUE implementation, contained in the `compare_elki.py` file. The main difficulty in adhering to scikit-learn's usual `X, y_pred` output format is that ELKI CLIQUE produces multiple overlapping clusterings. Since it reports a clustering for every subspace, the list of outputs can become quite long. To test our implementation, we used standard benchmark datasets from the ELKI homepage.
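
To illustrate why a single `y_pred` vector is not enough, here is a minimal sketch of one way to hold overlapping subspace clusterings: a dict keyed by a `frozenset` of dimension indices, with one label list per subspace and -1 marking noise. This mirrors how `res` and `res_implem` are indexed later in this notebook, but the internals of `compare_elki` are not shown here, so the toy data and the helper `memberships` are assumptions made for illustration only.

```python
# Hypothetical layout: one label vector per subspace, -1 marks noise.
# A point can belong to clusters in several subspaces at once, which a
# single flat y_pred vector cannot express.
subspace_labels = {
    frozenset({0}): [0, 0, 1, -1, 1],
    frozenset({1}): [0, -1, 0, 0, -1],
    frozenset({0, 1}): [0, -1, -1, -1, -1],
}


def memberships(point_idx, labels_by_subspace):
    """Return {subspace: cluster_id} for every subspace where the point is not noise."""
    return {
        dims: labels[point_idx]
        for dims, labels in labels_by_subspace.items()
        if labels[point_idx] != -1
    }


# Point 0 is clustered in all three subspaces, point 3 only in subspace {1}.
print(memberships(0, subspace_labels))
print(memberships(3, subspace_labels))
```

With this layout, comparing two implementations reduces to comparing the label vectors stored under the same key.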

In [2]:
import os
import re
import csv
from subprocess import Popen, PIPE
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import normalized_mutual_info_score as nmi
from collections import defaultdict
from clique import Clique
from compare_elki import compare_elki

## Dataset 1

In [3]:
!wget http://elki.dbs.ifi.lmu.de/datasets/snn/all-relevant/10d.csv.gz

--2022-11-26 01:53:02--  http://elki.dbs.ifi.lmu.de/datasets/snn/all-relevant/10d.csv.gz
Resolving elki.dbs.ifi.lmu.de (elki.dbs.ifi.lmu.de)... 129.187.228.14
Connecting to elki.dbs.ifi.lmu.de (elki.dbs.ifi.lmu.de)|129.187.228.14|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 909881 (889K) [application/octet-stream]
Saving to: ‘10d.csv.gz’


2022-11-26 01:53:03 (2.79 MB/s) - ‘10d.csv.gz’ saved [909881/909881]



In [4]:
!gzip -d 10d.csv.gz

In [4]:
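# 10,000 points with 10 features each; the last column holds the label.
X = np.empty((10000, 10))
Y = np.empty((10000, 1), dtype=int)
with open('10d.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ')
    for i, row in enumerate(reader):
        X[i] = np.array(row[:-1], dtype=np.float32)
        # Drop the label's leading character and keep the numeric part.
        Y[i] = int(row[-1][1:])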

In [5]:
# Round all coordinates to 5 decimals (vectorized, same result as rounding row by row).
X = np.round(X, 5)

In [6]:
elki_output, nmis, res, res_implem = compare_elki(10, 0.3, X)

In [7]:
print(f"Average nmi of our clustering compared to elki: {np.mean(nmis)}")

Average nmi of our clustering compared to elki: 0.9988146811589667
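
For context, the sketch below computes normalized mutual information from scratch for two label vectors, using the arithmetic mean of the two entropies as the normalizer. To our knowledge this is the default normalization of `normalized_mutual_info_score` in recent scikit-learn versions, but treat that as an assumption; the function names and the synthetic labels are hypothetical and not taken from `compare_elki`.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi_arithmetic(a, b):
    """NMI = MI(a, b) / mean(H(a), H(b)); noise (-1) is treated as its own label."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    denom = (entropy(a) + entropy(b)) / 2
    # Both labelings consist of a single label: define the score as 1.0.
    return 1.0 if denom == 0 else mi / denom


# Synthetic example: 10,000 points, one cluster plus noise.
reference = [0] * 7000 + [-1] * 3000
ours = list(reference)
ours[0] = -1  # a single point flips from "cluster" to "noise"

print(nmi_arithmetic(reference, reference))
print(nmi_arithmetic(reference, ours))
```

A single disagreement among 10,000 points already pulls the score slightly below 1, which is the order of magnitude of the deviations discussed in the next section.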


### Why is the NMI not 1.0?

Let's have a look at the list of all NMIs for dataset 1.


In [18]:
nmis

[1.0,
 0.9983915697391257,
 1.0,
 0.9984136803318849,
 0.9984271459381863,
 0.998503250330325,
 0.9986140838724291,
 0.9972094815527658,
 1.0,
 0.9985875998249507]

We can see that in some subspaces the clusterings are exactly equal, while in others a few points are assigned differently. So what is the reason for these differences?

Here we pick one subspace where the NMI is not equal to 1 and search through the cluster predictions to find the points that differ. Since only one point differs in this subspace, the result is the single point 6004. According to ELKI this point belongs to a cluster, whereas our algorithm declares it noise.

In [19]:
# Print every point whose label differs between ELKI (i) and our implementation (j).
for idx, (i, j) in enumerate(zip(res[frozenset({6})], res_implem[frozenset({6})])):
    if i != j:
        print(f"{idx}. point: {i}, {j}")

6004. point: 0, -1


A useful feature of ELKI's CLIQUE output is that it lists the dense units contained in each cluster. As shown below, the cluster in subspace 6 of our example (listed by ELKI under `Dimensions: [7]`, with its units labelled `d6`) consists of 2 dense units that together span the range from -0.0826 to 0.9484. Looking up point 6004 in the corresponding dimension gives the value 0.94832, which is very close to the upper bound of the cluster. In summary, points that lie right at the edge of a cluster can occasionally be classified differently by our algorithm.
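
The boundary effect can be sketched with a toy grid. The snippet assumes half-open units of equal width and uses the four-decimal bounds printed by ELKI (lower bound -0.0826, width 0.5155); the exact grid bounds used internally by either implementation are not known here, and the second width is a hypothetical perturbation chosen only to show how sensitive the assignment is.

```python
import math


def unit_index(value, lower, width):
    """Index k of the half-open unit [lower + k*width, lower + (k+1)*width) containing value."""
    return math.floor((value - lower) / width)


lower = -0.0826   # printed (rounded) lower bound of the first dense unit
x = 0.94832       # coordinate of point 6004 in this dimension

# With the printed width the point falls into the second dense unit (index 1).
print(unit_index(x, lower, 0.5155))

# Hypothetical: a grid whose width differs by only 1e-4 pushes the same
# point into the next unit (index 2), which lies outside the cluster.
print(unit_index(x, lower, 0.5154))
```

A difference in the fourth decimal of the grid is enough to move the point across the boundary, so small numerical differences between two implementations can plausibly turn a cluster member into noise.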

In [20]:
X[6004]

array([-0.20027,  0.46546,  0.7094 ,  1.67448,  0.77875,  0.11428,
        0.94832, -0.26773,  1.0974 , -0.22633])

In [16]:
for cluster in elki_output.split("Dimensions: ["):
    if cluster.startswith('7'):
        print(cluster[:111])

7]
# Coverage: 7289
# Units: 
#    d6:[-0.0826; 0.4329)   3364 objects
#    d6:[0.4329; 0.9484)   3925 objects
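
Reading the unit lines by eye works for a single cluster; for more systematic checks they can be parsed. The following is a minimal sketch that only covers one-dimensional unit lines in exactly the format printed above; the regular expression and the helper `parse_units` are hypothetical and not part of `compare_elki`, and units of higher-dimensional subspaces may be formatted differently.

```python
import re

# Matches lines such as "#    d6:[-0.0826; 0.4329)   3364 objects".
UNIT_RE = re.compile(
    r"d(?P<dim>\d+):\[(?P<lo>-?\d+(?:\.\d+)?); (?P<hi>-?\d+(?:\.\d+)?)\)"
    r"\s+(?P<count>\d+) objects"
)


def parse_units(text):
    """Return a list of (dimension, lower, upper, object_count) tuples."""
    return [
        (int(m["dim"]), float(m["lo"]), float(m["hi"]), int(m["count"]))
        for m in UNIT_RE.finditer(text)
    ]


sample = """# Coverage: 7289
# Units: 
#    d6:[-0.0826; 0.4329)   3364 objects
#    d6:[0.4329; 0.9484)   3925 objects
"""

units = parse_units(sample)
for dim, lo, hi, count in units:
    print(f"d{dim}: [{lo}, {hi}) with {count} objects")

# The unit counts add up to the reported coverage of the cluster.
print(sum(count for _, _, _, count in units))
```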



## Dataset 2

In [None]:
!wget https://github.com/elki-project/elki/raw/master/data/synthetic/ABC-publication/pov.csv

In [46]:
X_pov = np.empty((150, 2))
Y_pov = np.empty((150, 1), dtype=int)

with open('pov.csv') as csvfile: 
    reader = csv.reader(csvfile, delimiter=' ')
    i = 0
    for row in reader:
        if row[0].startswith('#'):
            continue
        X_pov[i] = np.array(row[:-1], dtype=np.float32)
        Y_pov[i] = np.array(row[-1][-1])
        i += 1

In [47]:
# Round all coordinates to 6 decimals (vectorized, same result as rounding row by row).
X_pov = np.round(X_pov, 6)

In [48]:
_, nmis, _, _ = compare_elki(10, 0.2, X_pov)

In [49]:
print(f"Average nmi of our clustering compared to elki: {np.mean(nmis)}")

Average nmi of our clustering compared to elki: 1.0


## Dataset 3

In [None]:
!wget http://elki.dbs.ifi.lmu.de/datasets/snn/all-relevant/20d.csv.gz

In [None]:
!gzip -d 20d.csv.gz

In [28]:
X_20 = np.empty((10000, 20))
Y_20 = np.empty((10000, 1), dtype=int)
with open('20d.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ')
    for i, row in enumerate(reader):
        X_20[i] = np.array(row[:-1], dtype=np.float32)
        Y_20[i] = np.array(row[-1][1:])

In [29]:
# Round all coordinates to 5 decimals (vectorized, same result as rounding row by row).
X_20 = np.round(X_20, 5)

In [30]:
# compare_elki returns four values; the NMI list is the second one.
_, nmis, _, _ = compare_elki(10, 0.4, X_20)

In [31]:
print(f"Average nmi of our clustering compared to elki: {np.mean(nmis)}")

Average nmi of our clustering compared to elki: 0.9993178656162137


## Dataset 4

In [52]:
!wget http://elki.dbs.ifi.lmu.de/datasets/snn/10-relevant/40d.csv.gz

--2022-11-26 17:59:41--  http://elki.dbs.ifi.lmu.de/datasets/snn/10-relevant/40d.csv.gz
Resolving elki.dbs.ifi.lmu.de (elki.dbs.ifi.lmu.de)... 129.187.228.14
Connecting to elki.dbs.ifi.lmu.de (elki.dbs.ifi.lmu.de)|129.187.228.14|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2936528 (2.8M) [application/octet-stream]
Saving to: ‘40d.csv.gz’


2022-11-26 17:59:43 (2.07 MB/s) - ‘40d.csv.gz’ saved [2936528/2936528]



In [53]:
!gzip -d 40d.csv.gz

In [160]:
X_40 = np.empty((10000, 40))
Y_40 = np.empty((10000, 1), dtype=int)
with open('40d.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ')
    for i, row in enumerate(reader):
        X_40[i] = np.array(row[:-1], dtype=np.float32)
        Y_40[i] = np.array(row[-1][1:])

In [161]:
# Round all coordinates to 5 decimals (vectorized, same result as rounding row by row).
X_40 = np.round(X_40, 5)

In [165]:
_, nmis, _, _ = compare_elki(10, 0.15, X_40)

In [167]:
print(f"Average nmi of our clustering compared to elki: {np.mean(nmis)}")

Average nmi of our clustering compared to elki: 0.9812578548662729


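In conclusion, our algorithm closely matches the output of the ELKI reference, with average NMI values between 0.981 and 1.0 across the four datasets. The small differences can possibly be explained by the rounding of the input data or by different floating-point representations in the Java and Python implementations, either of which can shift points that lie right at a unit boundary.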