# Clustering Models

First of all, we are going to import the necessary libraries.

In [3]:
import pandas as pd
from cuml import NearestNeighbors
import numpy as np

First of all, we are going to create a class to compute all the metrics. This class will be used to evaluate the performance of the models using the K Fold method.


In [None]:
class ClusteringMetrics:


Loading the data

In [4]:
df = pd.read_parquet('../data/processed/selected_features_df.parquet')

Let's calculate the Hopkins statistic to check if the data is suitable for clustering.

In [13]:
def hopkins(X):
    d = X.shape[1]
    n = len(X)
    m = int(0.1 * n)
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)

    rand_X = np.random.rand(n, d)
    ujd = []
    wjd = []

    for j in range(m):
        u_dist, _ = nbrs.kneighbors(rand_X[j].reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        w_dist, _ = nbrs.kneighbors(X.iloc[j].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])

    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if np.isnan(H):
        print(ujd, wjd)
        H = 0

    return H

  from .autonotebook import tqdm as notebook_tqdm


The Hopkins statistic ranges from 0 to 1. The closer to 1, the more suitable the data is for clustering. Let's calculate the Hopkins statistic for the data.

In [14]:
hopkins(df)

0.9645847043701986

The Hopkins statistic is close to 1, so the data is suitable for clustering.