# Measuring Consistency

In this notebook, we evaluate how consistent the cluster assignments are across subsets of the data.
This checks the robustness of the model: the clustering should not be overly sensitive to the
particular samples it is fit on. We first establish a baseline by fitting a Bayesian Gaussian
mixture model (BGMM) to the entire dataset. We then partition the dataset into halves, thirds,
and quarters, and cluster each subset separately. Because cluster labels are arbitrary, the labels
from every fit are renumbered by cluster size so that they are comparable across fits. We then
merge the subset labels and compare them with the baseline labels to assess the stability of the
solution. For each number of splits, we compute two metrics: Accuracy and ARI. Accuracy is the
proportion of data points that retain their baseline cluster assignment. The Adjusted Rand Index
(ARI) quantifies the similarity between two clusterings, corrected for chance agreement.
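The difference between the two metrics can be seen on a toy example. The sketch below uses made-up labels (not the notebook's data) in which two labelings describe the same partition with the label names swapped. Accuracy compares label names directly, while ARI only looks at which points are grouped together.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Two labelings of six toy points that describe the same partition,
# but with the label names 0 and 1 swapped (illustrative data only).
baseline = [0, 0, 0, 1, 1, 1]
swapped = [1, 1, 1, 0, 0, 0]

acc = accuracy_score(baseline, swapped)
ari = adjusted_rand_score(baseline, swapped)

print("Accuracy:", acc)  # 0.0, because no label name matches
print("ARI:", ari)       # 1.0, because the grouping is identical
```

This is why Accuracy is only meaningful once the labels from different fits have been brought into a common numbering, whereas ARI needs no such alignment.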

In [1]:
# Importing the Relevant Libraries
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import BayesianGaussianMixture

# Reading in our Datasets
df_1024 = pd.read_csv("data/1024_unscaled.csv")
df_1024_scaled = pd.read_csv("data/1024_scaled.csv")

In [2]:
def scale_data(df, features):
    scaler = MinMaxScaler()
    scaled_values = scaler.fit_transform(df[features])
    df_scaled = pd.DataFrame(scaled_values, columns=features)
    return df_scaled

def fit_predict_model(model, data):
    try:
        labels = model.fit_predict(data)
    except AttributeError:
        model.fit(data)
        labels = model.predict(data)
    return labels

def adjust_labels_by_size(labels, num_clusters):
    """Renumber cluster labels by ascending cluster size (smallest cluster -> 0).

    This makes labels from independent fits comparable. `num_clusters` is currently unused.
    """
    # Count the occurrences of each label
    unique, counts = np.unique(labels, return_counts=True)
    label_counts = dict(zip(unique, counts))

    # Sort the labels by count (size), smallest first
    sorted_labels = sorted(label_counts, key=label_counts.get)

    # Create a mapping from old to new labels based on size rank
    label_mapping = {old_label: new_label for new_label, old_label in enumerate(sorted_labels)}

    # Adjust labels based on the mapping
    adjusted_labels = np.array([label_mapping[label] for label in labels])

    return adjusted_labels


def process_splits(model, df, features, n_splits, num_clusters):
    # Cut the rows into n_splits contiguous blocks of (almost) equal size
    split_size = len(df) // n_splits
    df_splits = [df.iloc[i*split_size:(i+1)*split_size].copy() for i in range(n_splits)]
    df_splits[-1] = df.iloc[(n_splits-1)*split_size:].copy()  # Include any leftover rows in the last split

    # Integer placeholder (-1); every row is overwritten in the loop below
    combined_labels = pd.Series(-1, index=df.index, dtype='int')

    for df_split in df_splits:
        # Scale, cluster, and relabel each split independently of the others
        df_split_scaled = scale_data(df_split, features)
        labels = fit_predict_model(model, df_split_scaled)
        adjusted_labels = adjust_labels_by_size(labels, num_clusters)
        combined_labels.loc[df_split.index] = adjusted_labels

    return combined_labels


def calculate_accuracy(num_clusters, model_name):
    features = ['LSI_all', 'zeta_all', 'd5_all', 'Sk_all', 'q_all', 'Q6_all']
    
    model_mapping = {
        'BGMM': BayesianGaussianMixture(covariance_type="full", n_components=num_clusters),
        'KMeans': KMeans(n_clusters=num_clusters),
        'AgglomerativeClustering': AgglomerativeClustering(n_clusters=num_clusters),
        'DBSCAN': DBSCAN()  # DBSCAN doesn't require specifying the number of clusters
    }
    
    model = model_mapping[model_name]  # Select the clustering model by name
    df_1024_scaled_labels = fit_predict_model(model, scale_data(df_1024_scaled, features))
    df_1024_scaled['labels'] = adjust_labels_by_size(df_1024_scaled_labels, num_clusters)
    
    for n_splits in [2, 3, 4]:
        labels = process_splits(model, df_1024_scaled, features, n_splits, num_clusters)
        accuracy = accuracy_score(df_1024_scaled['labels'], labels)
        ari = adjusted_rand_score(df_1024_scaled['labels'], labels)
        print(f"Accuracy for {n_splits} splits:", accuracy)
        print(f"ARI for {n_splits} splits:", ari)

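The partitioning in `process_splits` cuts the rows into contiguous blocks, with the last block absorbing any leftover rows. The sketch below reproduces only that boundary arithmetic on a small row count; `contiguous_split_bounds` is a hypothetical helper written for illustration and is not part of the notebook.

```python
def contiguous_split_bounds(n_rows, n_splits):
    """Hypothetical helper: (start, stop) row positions of each split."""
    split_size = n_rows // n_splits
    bounds = [(i * split_size, (i + 1) * split_size) for i in range(n_splits)]
    # The last split absorbs any leftover rows.
    bounds[-1] = ((n_splits - 1) * split_size, n_rows)
    return bounds

bounds = contiguous_split_bounds(10, 3)
print(bounds)  # [(0, 3), (3, 6), (6, 10)]
```

Because the blocks are contiguous, each subset reflects the row order of the file. Whether the rows here are ordered in any meaningful way is not stated; if they are, shuffling first (for example with `df.sample(frac=1, random_state=0)`) would be one way to obtain random subsets instead.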
In [3]:
# Measuring the consistency of BGMM with 2 clusters
calculate_accuracy(2, "BGMM")

Accuracy for 2 splits: 0.9996124441964286
ARI for 2 splits: 0.9983479415621945
Accuracy for 3 splits: 0.999588123139881
ARI for 3 splits: 0.9982443552822808
Accuracy for 4 splits: 0.9995329241071429
ARI for 4 splits: 0.9980092217891083
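The Accuracy values depend on the size-based renumbering in `adjust_labels_by_size`, which can pair up the wrong clusters when two clusters have similar sizes. An alternative, shown here only as a sketch and not as the method used above, is to match labels by maximum overlap with the baseline using the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`). The helper name `align_labels` and the toy labels are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import accuracy_score

def align_labels(reference, labels):
    """Hypothetical helper: rename `labels` to best match `reference`."""
    reference = np.asarray(reference)
    labels = np.asarray(labels)
    ref_ids = np.unique(reference)
    new_ids = np.unique(labels)
    # overlap[i, j] = number of points with reference label i and new label j
    overlap = np.array([[np.sum((reference == r) & (labels == c)) for c in new_ids]
                        for r in ref_ids])
    # One-to-one matching that maximizes the total overlap
    rows, cols = linear_sum_assignment(overlap, maximize=True)
    mapping = {new_ids[c]: ref_ids[r] for r, c in zip(rows, cols)}
    # New labels left unmatched (more new clusters than reference clusters) become -1
    return np.array([mapping.get(label, -1) for label in labels])

# Toy labels (illustrative only): same grouping up to one point, names swapped
reference = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [1, 1, 1, 1, 0, 0, 0, 1]

aligned = align_labels(reference, labels)
print(accuracy_score(reference, labels))   # 0.125 before alignment
print(accuracy_score(reference, aligned))  # 0.875 after alignment
```

ARI is unaffected by this choice, since it does not depend on how the labels are named.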
