To do:
- Preprocessing
  - *Compare min-max scaling with normalization*
  - *Compare B&dM subset with all variables?*
- Models
  - *Adapt kmeans and AHC to use Mahalanobis distance?*
- Validity indices
  - Compute validity indices with the manhattan distance: done for silhouette, not possible for the others?
  - Define values for generalized dunn index => https://genieclust.gagolewski.com/genieclust_cluster_validity.html#genieclust.cluster_validity.generalised_dunn_index
- Model selection
  - **Store the validity index optimized by each best model**
  - Implement gap statistic
  - **Compute relative fit criteria for latent models**
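
One possible route for the Mahalanobis item above: the Mahalanobis distance between two points equals the Euclidean distance between the same points after whitening with the Cholesky factor of the covariance matrix. Any Euclidean k-means or AHC could therefore be run on whitened data. The sketch below is only an illustration on random data; `whiten_for_mahalanobis` is a hypothetical helper, not part of this notebook, and it assumes a positive-definite covariance matrix.

```python
import numpy as np
from scipy.spatial.distance import cdist

def whiten_for_mahalanobis(X):
    """Transform X so that Euclidean distances equal Mahalanobis distances.

    Hypothetical helper; assumes the sample covariance is positive definite.
    """
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)
    # Cholesky factor: cov = L @ L.T
    L = np.linalg.cholesky(cov)
    # Each row z_i solves L @ z_i = x_i
    return np.linalg.solve(L, X.T).T

# Toy data with correlated columns (illustration only)
rng_mahal = np.random.default_rng(0)
X_mahal = rng_mahal.normal(size=(50, 4)) @ rng_mahal.normal(size=(4, 4))
Z_mahal = whiten_for_mahalanobis(X_mahal)

# Reference Mahalanobis distances computed directly
VI = np.linalg.inv(np.cov(X_mahal, rowvar=False))
d_mahal = cdist(X_mahal, X_mahal, metric="mahalanobis", VI=VI)
d_eucl = cdist(Z_mahal, Z_mahal, metric="euclidean")
same_distances = np.allclose(d_mahal, d_eucl)
```

With this approach the clustering code itself would not need a new metric, only a different preprocessing step.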

Plan to reorganize the notebook and reduce code redundancy:

**Preparation**
- Fit all models sequentially
  - *Including with different preprocessing?*
  - *Including with reduced and complete dataset?*
- Aggregate results and simplify
  - *Eliminate redundant models (in terms of preprocessing and dataset)?*
  - Eliminate redundant validity indices by checking their correlations

*End up with 1 df named all_models*
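
A minimal sketch of the correlation check on validity indices, assuming a results table with one column per index. The function name `redundant_indices`, the Spearman choice and the 0.95 threshold are assumptions made for illustration, and the numbers in `toy_cvi` are made up.

```python
import pandas as pd

def redundant_indices(df, index_cols, threshold=0.95):
    """List pairs of validity indices whose |Spearman correlation| >= threshold."""
    corr = df[index_cols].corr(method="spearman").abs()
    pairs = []
    for i, a in enumerate(index_cols):
        for b in index_cols[i + 1:]:
            if corr.loc[a, b] >= threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs

# Made-up values, for illustration only
toy_cvi = pd.DataFrame({
    "silhouette_ecd": [0.10, 0.20, 0.30, 0.40, 0.50],
    "silhouette_mnt": [0.11, 0.19, 0.33, 0.41, 0.52],
    "davies_bouldin": [2.0, 1.2, 1.9, 1.1, 1.5],
})
pairs_demo = redundant_indices(toy_cvi, list(toy_cvi.columns))
```

A rank correlation is used here because the indices live on different scales; one index from each flagged pair could then be dropped.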

**Model selection**
- Identify the best models for each model-params and CVI
  - With min/max rules
  - With the elbow method
  - With the gap statistic

*End up with 3 df named abs_candidates, elbow_candidates and gap_candidates, storing the optimal n values for each model-params and CVI*
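
A possible starting point for the gap statistic, written as a sketch rather than a final implementation. It uses scikit-learn's `KMeans` as a stand-in clusterer, uniform reference data drawn over the bounding box of the observed data, and the usual rule "smallest k such that gap(k) >= gap(k+1) - s(k+1)". The function names and the toy blobs are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, labels):
    """Pooled within-cluster sum of squared distances to the cluster means."""
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

def gap_statistic(X, k_range, n_refs=10, random_state=0):
    """Return gap values and their simulation errors for each k in k_range."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, sks = [], []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        log_wk = np.log(within_dispersion(X, km.fit_predict(X)))
        # Same clustering applied to uniform reference datasets
        ref_logs = []
        for _ in range(n_refs):
            ref = rng.uniform(lo, hi, size=X.shape)
            ref_logs.append(np.log(within_dispersion(ref, km.fit_predict(ref))))
        ref_logs = np.array(ref_logs)
        gaps.append(ref_logs.mean() - log_wk)
        sks.append(ref_logs.std() * np.sqrt(1 + 1 / n_refs))
    return np.array(gaps), np.array(sks)

def select_k(k_range, gaps, sks):
    """Smallest k with gap(k) >= gap(k+1) - s(k+1); falls back to the max gap."""
    ks = list(k_range)
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - sks[i + 1]:
            return ks[i]
    return ks[int(np.argmax(gaps))]

# Toy blobs, for illustration only
rng_gap = np.random.default_rng(1)
blob_centers = np.array([[0, 0], [6, 6], [0, 6]])
X_gap = np.vstack([c + rng_gap.normal(scale=0.5, size=(30, 2)) for c in blob_centers])
ks_gap = range(1, 6)
gaps_demo, sks_demo = gap_statistic(X_gap, ks_gap, n_refs=5)
k_demo = select_k(ks_gap, gaps_demo, sks_demo)
```

Swapping in another clusterer would only require replacing the `KMeans` call, provided it exposes `fit_predict`.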

*Comment on the decision rules: min/max will likely have to be abandoned, except for HDBSCAN?*

- Identify the best models for each model class and CVI (based on the min/max rule, regardless of the rule used at the previous stage)

*Merge the 4 df in candidate_models*

*End up with a df storing the CVI values and dummies about the criteria they were selected on (e.g., elbow-silhouette, or gap-dunn)*
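
A sketch of how the selection dummies could be built, so that a model selected by several criteria is merged into one row instead of being dropped as a duplicate. `merge_candidates` and the toy tables are hypothetical; the labels follow the `rule-CVI` pattern mentioned above.

```python
import pandas as pd

def merge_candidates(candidates, key_cols=("model", "params", "n_clust")):
    """Build one row per model with a 0/1 column per selection criterion.

    `candidates` maps a label such as 'elbow-silhouette_ecd' to a DataFrame
    holding the models selected under that criterion.
    """
    key_cols = list(key_cols)
    tagged = [df[key_cols].assign(criterion=label) for label, df in candidates.items()]
    long = pd.concat(tagged, ignore_index=True)
    flags = pd.crosstab([long[c] for c in key_cols], long["criterion"]).clip(upper=1)
    return flags.reset_index()

# Toy candidate tables, for illustration only
elbow_sil_demo = pd.DataFrame({
    "model": ["kmeans"],
    "params": ["distance = euclidean, linkage = mean"],
    "n_clust": [3],
})
gap_dunn_demo = pd.DataFrame({
    "model": ["kmeans", "AHC"],
    "params": ["distance = euclidean, linkage = mean",
               "distance = manhattan, linkage = average"],
    "n_clust": [3, 4],
})
candidate_demo = merge_candidates({
    "elbow-silhouette_ecd": elbow_sil_demo,
    "gap-dunn": gap_dunn_demo,
})
```

The CVI values themselves could then be joined back onto this table using the same key columns.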

*Eliminate redundant models*

- Identify the absolute best models for each CVI

*Store results in a df named best_models*

**Conclusion**

*Depending on the results...*

*Comment the best candidates for each model class, or the absolute best_models*

*Comment the model class successively or all model classes at once*

Notes
- All validity indices are applied to the original dataset instead of the transformed data.

In [None]:
# pip install genieclust
# pip install stepmix
# pip install kneed
# pip install seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from joblib import Parallel, delayed # for parallelization
from itertools import product

# Preprocessing
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

# Clustering
from stepmix.stepmix import StepMix
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering, HDBSCAN
from scipy.spatial.distance import mahalanobis

# Evaluation
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from genieclust import cluster_validity
from collections import Counter
from kneed import KneeLocator
from sklearn.neighbors import BallTree

# Visualization
from sklearn.decomposition import PCA
from scipy.spatial import ConvexHull

# Preparation
## Data

In [None]:
data2004_i = pd.read_parquet("data/data2004_i.parquet") # load imputed data

# Dataset with numeric outcomes
data_n = data2004_i[[
    # Q2
    'clseusa_n', # 'clsetown_n', 'clsestat_n', 'clsenoam_n',
    # Q3
    'ambornin_n', 'amcit_n', 'amlived_n', 'amenglsh_n', 
    'amchrstn_n', 'amgovt_n', 'amfeel_n', # 'amancstr_n',
    # Q4
    'amcitizn_n', 'amshamed_n', 'belikeus_n', 'ambetter_n', 'ifwrong_n', # 'amsports_n', 'lessprd_n',
    # Q5
    'proudsss_n', 'proudgrp_n', 'proudpol_n', 'prouddem_n', 'proudeco_n',
    'proudspt_n', 'proudart_n', 'proudhis_n', 'proudmil_n', 'proudsci_n']]

## Scaling and normalizing
scaler = MinMaxScaler(feature_range=(-1,1))
data_n_scaled = scaler.fit_transform(data_n)

normalizer = StandardScaler()
data_n_norm = normalizer.fit_transform(data_n)

# Dataset with categorical outcomes
data_f = data2004_i[[
    # Q2
    'clseusa_f', # 'clsetown_f', 'clsestat_f', 'clsenoam_f',
    # Q3
    'ambornin_f', 'amcit_f', 'amlived_f', 'amenglsh_f', 
    'amchrstn_f', 'amgovt_f', 'amfeel_f', # 'amancstr_f',
    # Q4
    'amcitizn_f', 'amshamed_f', 'belikeus_f', 'ambetter_f', 'ifwrong_f', # 'amsports_f', 'lessprd_f',
    # Q5
    'proudsss_f', 'proudgrp_f', 'proudpol_f', 'prouddem_f', 'proudeco_f',
    'proudspt_f', 'proudart_f', 'proudhis_f', 'proudmil_f', 'proudsci_f']]

## Integer (label) encoding of the categories; despite the `_oh` suffix, this is not one-hot encoding
data_f_oh = data_f.apply(lambda col: LabelEncoder().fit_transform(col))

# Dataset with controls
controls = data2004_i[[
    'sex', 'race_f', 'born_usa', 'party_fs', 'religstr_f', 
    'reltrad_f', 'region_f']]

## Parameters

In [None]:
max_clust = 16
max_threads = 8

val_indices = ['silhouette_ecd', 'silhouette_mnt', 'calinski_harabasz', 'davies_bouldin', 'dunn', 'gen_dunn']

## Validity indices

In [None]:
# Custom score functions to avoid throwing errors when undefined
def sil_score(data, pred_clust):
    try:
        sil_score_euclidean = silhouette_score(data, pred_clust, metric='euclidean')
    except ValueError:
        sil_score_euclidean = np.nan

    try:
        sil_score_manhattan = silhouette_score(data, pred_clust, metric='manhattan')
    except ValueError:
        sil_score_manhattan = np.nan

    return sil_score_euclidean, sil_score_manhattan

def ch_score(data, pred_clust):
    try:
        ch_score = calinski_harabasz_score(data, pred_clust)
    except ValueError:
        ch_score = np.nan
    return ch_score

def db_score(data, pred_clust):
    try:
        db_score = davies_bouldin_score(data, pred_clust)
    except ValueError:
        db_score = np.nan
    return db_score

def gen_dunn_score(data, pred_clust):
    try:
        gen_dunn_score_11 = cluster_validity.generalised_dunn_index(data, pred_clust, lowercase_d=1, uppercase_d=1)
    except Exception:
        gen_dunn_score_11 = np.nan
    
    try:
        gen_dunn_score_22 = cluster_validity.generalised_dunn_index(data, pred_clust, lowercase_d=2, uppercase_d=2)
    except Exception:
        gen_dunn_score_22 = np.nan

    return gen_dunn_score_11, gen_dunn_score_22

def clust_size(labels):
    cluster_sizes = Counter(labels)
    min_size = min(cluster_sizes.values())
    max_size = max(cluster_sizes.values())
    
    return min_size, max_size

In [None]:
# Function to return all validity indices at once, after filtering noise points
def get_metrics(model, params, n, data, pred_clust, **additional_metrics):

    noise = pred_clust == -1
    # Indices are deliberately computed on the original data (data_n),
    # not on the possibly transformed `data` argument (see Notes).
    denoised_data = data_n[~noise]
    denoised_pred_clust = pred_clust[~noise]

    # Compute each pair of scores once instead of twice
    min_size, max_size = clust_size(pred_clust)
    sil_ecd, sil_mnt = sil_score(denoised_data, denoised_pred_clust)
    dunn, gen_dunn = gen_dunn_score(denoised_data, denoised_pred_clust)

    base_metrics = {
        'model': model,
        'params': params,
        'n_clust': n,
        'min_clust_size': min_size,
        'max_clust_size': max_size,
        'silhouette_ecd': sil_ecd,
        'silhouette_mnt': sil_mnt,
        'calinski_harabasz': ch_score(denoised_data, denoised_pred_clust),
        'davies_bouldin': db_score(denoised_data, denoised_pred_clust),
        'dunn': dunn,
        'gen_dunn': gen_dunn
    }

    base_metrics.update(additional_metrics)
    return base_metrics
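
The noise filtering relies on the convention, used by density-based models such as HDBSCAN, that noise points receive the label -1. A small self-contained illustration of that masking step follows; `drop_noise` and the toy arrays are hypothetical.

```python
import numpy as np

def drop_noise(X, labels, noise_label=-1):
    """Keep only the rows of X (and their labels) not flagged as noise."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    keep = labels != noise_label
    return X[keep], labels[keep]

# Toy data: 5 points, 2 of them flagged as noise
X_noise_demo = np.arange(10).reshape(5, 2)
labels_noise_demo = np.array([0, -1, 1, 1, -1])
X_kept, labels_kept = drop_noise(X_noise_demo, labels_noise_demo)
```

For models that never emit -1, the mask keeps every row, so the same function can be applied to all model classes.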

## Visualization

In [None]:
# Function to display the optimal number of clusters according to each validity index
def elbow_plot(df, val_index):
    res = df.dropna(subset=[val_index])

    x = res["n_clust"]
    y = res[val_index]

    if val_index in ['davies_bouldin', 'entropy']:
        knee_locator = KneeLocator(x, y, curve='concave', direction='increasing')
    else:
        knee_locator = KneeLocator(x, y, curve='convex', direction='decreasing')

    plt.figure(figsize=(8, 4))
    plt.plot(x, y, marker="o", linestyle="-", label=val_index)
    plt.axvline(x=knee_locator.knee, color="r", linestyle="--", label=f"Optimal k={knee_locator.knee}")
    plt.xlabel("Number of Clusters")
    plt.ylabel(f"{val_index} index")
    plt.title(f"Elbow Method for {val_index} index")
    plt.legend()
    plt.show()

In [None]:
# Function to plot datapoints and clusters
def plot_clusters(data, clust_range, pred_clust):
    
    # PCA to define the 2D space
    pca = PCA(n_components=2)
    reduced_space = pca.fit_transform(data_n)

    plt.figure(figsize=(8, 6))
    
    # Collect all hull vertices
    hull_vertices = []
    hull_colors = []
    for i in clust_range:
        cluster_points = reduced_space[pred_clust == i]
        if len(cluster_points) > 2:
            hull = ConvexHull(cluster_points)
            hull_vertices.append((
                cluster_points[hull.vertices, 0],
                cluster_points[hull.vertices, 1]
            ))
            hull_colors.append(i)

    # Plot datapoints
    scatter = plt.scatter(reduced_space[:, 0], reduced_space[:, 1], 
                         c=pred_clust, cmap='tab10', 
                         s=15, edgecolors='k')

    # Plot all hulls using the same colormap
    for vertices, i in zip(hull_vertices, hull_colors):
        plt.fill(vertices[0], vertices[1], 
                 alpha=0.3,
                 color=scatter.cmap(scatter.norm(i)))

    legend = plt.legend(*scatter.legend_elements())
    plt.xlabel("Dim 1")
    plt.ylabel("Dim 2")
    plt.title("Clusters with Convex Hulls")
    plt.show()

In [None]:
# Function to plot response patterns
def plot_cluster_profiles(features, 
                          cluster_labels, 
                          feature_names, 
                          class_names,
                          sd,
                          alpha=0.4):
    """
    Create a profile plot for clustering results, supporting both probabilistic
    (e.g., LCA, GMM) and deterministic (e.g., k-means) clustering methods.
    
    Parameters:
    -----------
    features : array-like or pandas.DataFrame
        The original feature matrix used for clustering (n_samples, n_features)
    cluster_labels : array-like
        Cluster assignments for each sample (n_samples,)
    feature_names : list or None
        Names of the features (if None, DataFrame columns or indices are used)
    class_names : list or None
        Names of the classes (if None, indices are used)
    sd : float
        Number of standard deviations around the mean to plot
    alpha : float, optional
        Base transparency for the scatter points
    """
    # Convert features to numpy array if it's a DataFrame
    if isinstance(features, pd.DataFrame):
        if feature_names is None:
            feature_names = features.columns.tolist()
        features = features.to_numpy()
    
    # Convert cluster_labels to numpy array if it's a Series
    if isinstance(cluster_labels, pd.Series):
        cluster_labels = cluster_labels.to_numpy()
    
    # Handle NaN values
    features = np.nan_to_num(features, nan=np.nanmean(features))
    
    n_features = features.shape[1]
    n_classes = len(np.unique(cluster_labels))
    
    if feature_names is None:
        feature_names = [f'Feature {i+1}' for i in range(n_features)]
    if class_names is None:
        class_names = [f'Class {i+1}' for i in range(n_classes)]
        
    # Create figure
    fig, ax = plt.subplots(figsize=(12,6))
    
    # Calculate class centroids and standard deviations
    centroids = []
    std_devs = []
    
    for class_idx in range(n_classes):
        class_mask = cluster_labels == class_idx
        class_data = features[class_mask]
        
        if len(class_data) > 0:
            # Calculate centroid
            centroid = np.nanmean(class_data, axis=0)
            centroids.append(centroid)
            
            # Calculate standard deviations
            std_dev = np.nanstd(class_data, axis=0)
            std_devs.append(std_dev)
            
        else:
            # Handle empty classes
            centroids.append(np.zeros(n_features))
            std_devs.append(np.zeros(n_features))
    
    # Convert to numpy arrays for vectorized operations
    centroids = np.array(centroids)
    std_devs = np.array(std_devs)
    
    # Plot for each class
    x = np.arange(n_features)
    width = 0.8 / n_classes
    
    for i in range(n_classes):
        # Offset x positions for each class
        x_pos = x - (width * (n_classes-1)/2) + (i * width)
        
        # Plot standard deviation boxes
        for j in range(n_features):
            # Clamp values to [-2, 2] range
            lower = max(-2, centroids[i][j] - std_devs[i][j]*sd/2)
            upper = min(2, centroids[i][j] + std_devs[i][j]*sd/2)
            height = upper - lower

            rect = plt.Rectangle((x_pos[j] - width/2, lower),
                                 width, height,
                                 alpha=0.2, color=f'C{i}')
            ax.add_patch(rect)
        
        # Plot centroids
        ax.scatter(x_pos, centroids[i], color=f'C{i}', 
                  label=class_names[i], marker='*', zorder=5)
    
    # Customize plot
    ax.set_xticks(x)
    ax.set_xticklabels(feature_names, rotation=45, ha='right')
    ax.set_ylabel('Answers')
    ax.legend(title='Clusters')
    ax.grid(True, axis='y', alpha=0.3)
    ax.axhline(y=0, color='grey', linestyle='dashed', linewidth=1)
    ax.set_title(f"Cluster Profile Plot (mean ± {sd/2} standard deviations)")
    plt.tight_layout()
    return fig, ax

# Latent models
With the StepMix package, see: https://github.com/Labo-Lacourse/stepmix

In [None]:
# Parameters
clust_range = range(1, max_clust+1)

opt_params = {
    'method': 'gradient',
    'intercept': True,
    'max_iter': 2500,
}

In [None]:
# Fit models without covariates
def do_StepMix(n, type, data):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)

        latent_mod = StepMix(
            n_components = n, 
            measurement = type, 
            n_init = 3,
            init_params = 'kmeans',
            structural_params = opt_params,
            random_state = 123)
        
        latent_mod.fit(data)
        pred_clust = latent_mod.predict(data)

        model = 'LCA' if type == 'categorical' else 'LPA'
        params = 'without covariates'
        loglik = latent_mod.score(data)
        aic = latent_mod.aic(data)
        bic = latent_mod.bic(data)
        entropy = latent_mod.entropy(data)

    return get_metrics(model, params, n, data, pred_clust, LL = loglik, aic = aic, bic = bic, entropy = entropy)

cat_results = Parallel(n_jobs=max_threads)(delayed(do_StepMix)(n, 'categorical', data_f_oh) for n in clust_range)
LCA_all = pd.DataFrame(cat_results)

num_results = Parallel(n_jobs=max_threads)(delayed(do_StepMix)(n, 'continuous', data_n_scaled) for n in clust_range)
LPA_all = pd.DataFrame(num_results)

In [None]:
for val_index in val_indices + ['aic', 'bic', 'entropy']:
    elbow_plot(LCA_all, val_index)

In [None]:
# Fit models with covariates
def do_StepMix_covar(n, type, data):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        
        latent_mod = StepMix(
            n_components = n,
            measurement = type,
            n_init = 3,
            init_params = 'kmeans',
            structural = 'covariate', 
            n_steps = 1,
            structural_params = opt_params,
            random_state = 123)
        
        latent_mod.fit(data, controls_dum)
        pred_clust = latent_mod.predict(data)
        
        model = 'LCA' if type == 'categorical' else 'LPA'
        params = 'with covariates'
        loglik = latent_mod.score(data)
        aic = latent_mod.aic(data)
        bic = latent_mod.bic(data)
        entropy = latent_mod.entropy(data)

    return get_metrics(model, params, n, data, pred_clust, LL = loglik, aic = aic, bic = bic, entropy = entropy)

controls_dum = pd.get_dummies(controls)

cat_results = Parallel(n_jobs=max_threads)(delayed(do_StepMix_covar)(n, 'categorical', data_f_oh) for n in clust_range)
LCA_covar_all = pd.DataFrame(cat_results)

# Data preprocessing?
num_results = Parallel(n_jobs=max_threads)(delayed(do_StepMix_covar)(n, 'continuous', data_n_scaled) for n in clust_range)
LPA_covar_all = pd.DataFrame(num_results)

## Best latent models

In [None]:
# How to select models based on aic / bic: using their absolute minimum, or an elbow method?
# Absolute minimum yields the model with the most classes, so not appropriate
LCA_aic_min = LCA_all.sort_values('aic', ascending=True).iloc[0]
LCA_bic_min = LCA_all.sort_values('bic', ascending=True).iloc[0]

LPA_aic_min = LPA_all.sort_values('aic', ascending=True).iloc[0]
LPA_bic_min = LPA_all.sort_values('bic', ascending=True).iloc[0]

abs_fit = pd.DataFrame([LCA_aic_min, LCA_bic_min, LPA_aic_min, LPA_bic_min])
abs_fit = abs_fit.drop_duplicates().reset_index(drop=True)
abs_fit

In [None]:
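# Find best models according to relative fit = LRT / BLRT / BVR (LCA only)
# (number of parameters = (n_clust - 1) + (n categories - 1) * n questions * n_clust
#                       = k - 1 + 4*23*k = 93k - 1)
from scipy.stats import chi2

# LRT
def LRT(model_all):
    # Caveat: the LRT needs the total log-likelihood; if "LL" is an average
    # per-observation log-likelihood, rescale it by the sample size first.
    # Rows are assumed to be sorted by n_clust.
    model_all["LRT"] = 2 * (model_all["LL"] - model_all["LL"].shift(1))
    model_all["n_param"] = 93 * model_all["n_clust"] - 1
    model_all["LRT_p_value_chi2deg1"] = 1 - chi2.cdf(model_all["LRT"], 1)
    model_all["LRT_p_value_chi2_deg93"] = 1 - chi2.cdf(model_all["LRT"], 93)
    # No comparison is possible for the 1-class model
    p_cols = ["LRT_p_value_chi2deg1", "LRT_p_value_chi2_deg93"]
    model_all.loc[model_all["n_clust"] == 1, ["LRT"] + p_cols] = np.nan
    
    return model_all[["model", "n_clust", "n_param", "LRT", "LRT_p_value_chi2deg1","LRT_p_value_chi2_deg93"]]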
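
A worked example of a single likelihood-ratio comparison between nested models, to make the arithmetic explicit. It is a sketch under assumptions: the log-likelihoods below are invented, they are taken to be total (not per-observation) log-likelihoods, and `lrt_p_value` is a hypothetical helper. Note also that the chi-square reference distribution is not strictly valid when comparing mixture models with different numbers of classes, which is the usual motivation for the BLRT.

```python
from scipy.stats import chi2

def lrt_p_value(ll_small, ll_big, n_param_small, n_param_big):
    """Likelihood-ratio statistic, degrees of freedom and chi-square p-value.

    ll_small / ll_big: total log-likelihoods of the nested and the larger model.
    """
    stat = 2.0 * (ll_big - ll_small)
    dof = n_param_big - n_param_small
    return stat, dof, chi2.sf(stat, dof)

# Invented values: 1-class vs 2-class LCA with 93k - 1 parameters each
stat_demo, dof_demo, p_demo = lrt_p_value(
    ll_small=-1000.0, ll_big=-900.0,
    n_param_small=93 * 1 - 1, n_param_big=93 * 2 - 1,
)
```

`chi2.sf` is used instead of `1 - chi2.cdf` because it stays accurate for very small p-values.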

In [None]:
LCA_lrt = LRT(LCA_all)
LCA_lrt

In [None]:
LPA_lrt = LRT(LPA_all)
LPA_lrt

In [None]:
# BLRT with StepMix package
from stepmix.bootstrap import blrt_sweep

# LCA
model_LCA = StepMix(n_components = 3, measurement = 'categorical', n_init = 3, init_params = 'kmeans', structural_params = opt_params, random_state = 123, verbose=0, progress_bar=0)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)

    BLRT_LCA = blrt_sweep(model_LCA, data_f_oh, low=1, high=max_clust, n_repetitions=100)

# LPA
model_LPA = StepMix(n_components = 3, measurement = 'continuous', n_init = 3, init_params = 'kmeans', structural_params = opt_params, random_state = 123, verbose=0, progress_bar=0)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)

    BLRT_LPA = blrt_sweep(model_LPA, data_n_scaled, low=1, high=max_clust+1, n_repetitions=100)

In [None]:
# BLRT results for LCA
BLRT_LCA["n_clust"] = BLRT_LCA.index.to_series().str.extract(r"(\d+)").astype(int)
LCA_relat_fit = LCA_lrt.merge(BLRT_LCA, on="n_clust", how="left")
LCA_relat_fit.rename(columns={"p": "BLRT_p_value"}, inplace=True)
LCA_relat_fit

# The p-value here compares n_clust - 1 vs. n_clust classes

In [None]:
# BLRT results for LPA
BLRT_LPA["n_clust"] = BLRT_LPA.index.to_series().str.extract(r"(\d+)").astype(int)
LPA_relat_fit = LPA_lrt.merge(BLRT_LPA, on="n_clust", how="left")
LPA_relat_fit.rename(columns={"p": "BLRT_p_value"}, inplace=True)
LPA_relat_fit

# The p-value here compares n_clust - 1 vs. n_clust classes

In [None]:
# Find the best model for each combination of parameters through the Elbow method
def elbow_method(df, val_index):
    res = df.dropna(subset=[val_index])

    x = res["n_clust"]
    y = res[val_index]

    if val_index in ['davies_bouldin', 'entropy']:
        knee_locator = KneeLocator(x, y, curve='concave', direction='increasing')
    else:
        knee_locator = KneeLocator(x, y, curve='convex', direction='decreasing')
    
    return res[res["n_clust"] == knee_locator.knee]

models = [LCA_all, LPA_all] # + [LCA_covar_all, LPA_covar_all]

params = product(models, val_indices + ['aic', 'bic', 'entropy'])

latent_elbow = pd.DataFrame()
for model, val_index in params:
    best_model = elbow_method(model, val_index)
    latent_elbow = pd.concat([latent_elbow, best_model], ignore_index=True)

In [None]:
# Find absolute best models for each validity index
latent_elbow = latent_elbow.drop_duplicates().reset_index(drop=True)
# Need to add columns indicating which validity index is maximized.
# After that, duplicate models should be merged, not dropped.

best_silhouette_ecd = latent_elbow.sort_values('silhouette_ecd', ascending=False).iloc[0]
best_silhouette_mnt = latent_elbow.sort_values('silhouette_mnt', ascending=False).iloc[0]
best_ch = latent_elbow.sort_values('calinski_harabasz', ascending=False).iloc[0]
best_db = latent_elbow.sort_values('davies_bouldin', ascending=True).iloc[0]
best_dunn = latent_elbow.sort_values('dunn', ascending=False).iloc[0]

best_aic = latent_elbow.sort_values('aic', ascending=True).iloc[0]
best_bic = latent_elbow.sort_values('bic', ascending=True).iloc[0]
best_entropy = latent_elbow.sort_values('entropy', ascending=False).iloc[0]

latent_best = pd.DataFrame([best_silhouette_ecd, best_silhouette_mnt, best_ch, best_db, best_dunn])
latent_best = latent_best.drop_duplicates().reset_index(drop=True)

In [None]:
latent_best

In [None]:
# Refit the best model and display coefficients
data = data_n

with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
        
    latent_mod = StepMix(
        n_components = 3,
        measurement = 'continuous',
        n_init = 3,
        init_params = 'kmeans',
        structural_params = opt_params,
        random_state = 123,
        progress_bar = 0,
        verbose = 1)
        
    latent_mod.fit(data)
    pred_clust = latent_mod.predict(data)

In [None]:
plot_clusters(data, range(latent_mod.n_components), pred_clust)

In [None]:
# Response patterns
fig, ax = plot_cluster_profiles(data_n, pred_clust, feature_names = None, class_names = None, sd = 1.5)

# K-means
No available implementation allows choosing the distance metric and the linkage function (here, the method used to compute cluster centers). Hence a custom class is proposed below.

In [None]:
class FlexibleKMeans:
    """
    K-Means implementation supporting different distance metrics and center computation methods.
    
    Parameters:
    -----------
    n_clusters : int
        Number of clusters
    metric : str, default='euclidean'
        Distance metric: 'euclidean', 'manhattan', 'chebyshev'
    center_method : str, default='mean'
        Method to compute cluster centers: 'mean', 'median', 'medoid'
    max_iter : int, default=100
        Maximum number of iterations
    n_init : int, default=10
        Number of times the k-means algorithm will be run with different centroid seeds.
        The final result will be the best output of n_init consecutive runs in terms of inertia.
    random_state : int or None, default=None
        Random state for reproducibility
    """
    
    def __init__(self, n_clusters, metric='euclidean', center_method='mean', 
                 max_iter=100, n_init=10, random_state=None):
        self.n_clusters = n_clusters
        self.metric = metric
        self.center_method = center_method
        self.max_iter = max_iter
        self.n_init = n_init
        self.random_state = random_state
        
        # Define mapping from user-friendly names to scipy metrics
        self.metric_mapping = {
            'euclidean': 'euclidean',
            'manhattan': 'cityblock',
            'chebyshev': 'chebyshev'
        }
        
        # Validate inputs
        valid_metrics = list(self.metric_mapping.keys())
        if metric not in valid_metrics:
            raise ValueError(f"metric must be one of {valid_metrics}")
            
        valid_centers = ['mean', 'median', 'medoid']
        if center_method not in valid_centers:
            raise ValueError(f"center_method must be one of {valid_centers}")
            
        if self.n_init <= 0:
            raise ValueError("n_init should be > 0")
    
    def _compute_distances(self, X, centers):
        """Compute distances between points and centers using specified metric."""
        return cdist(X, centers, metric=self.metric_mapping[self.metric])
    
    def _compute_centers(self, X, labels):
        """Compute new centers using specified method."""
        new_centers = np.zeros((self.n_clusters, X.shape[1]))
        
        for i in range(self.n_clusters):
            cluster_points = X[labels == i]
            
            if len(cluster_points) == 0:
                continue
                
            if self.center_method == 'mean':
                new_centers[i] = np.mean(cluster_points, axis=0)
            
            elif self.center_method == 'median':
                new_centers[i] = np.median(cluster_points, axis=0)
            
            elif self.center_method == 'medoid':
                # For medoid, find the point that minimizes sum of distances to other points
                distances = self._compute_distances(cluster_points, cluster_points)
                medoid_idx = np.argmin(np.sum(distances, axis=1))
                new_centers[i] = cluster_points[medoid_idx]
        
        return new_centers
    
    def _single_fit(self, X, seed):
        """Perform a single run of k-means with given random seed."""
        if seed is not None:
            np.random.seed(seed)
            
        # Initialize centers randomly
        idx = np.random.choice(len(X), self.n_clusters, replace=False)
        centers = X[idx].copy()
        
        for iteration in range(self.max_iter):
            # Store old centers for convergence check
            old_centers = centers.copy()
            
            # Assign points to nearest center
            distances = self._compute_distances(X, centers)
            labels = np.argmin(distances, axis=1)
            
            # Update centers
            centers = self._compute_centers(X, labels)
            
            # Check for convergence
            if np.allclose(old_centers, centers):
                n_iter = iteration + 1
                break
        else:
            n_iter = self.max_iter
            
        # Compute final inertia
        final_distances = self._compute_distances(X, centers)
        inertia = np.sum(np.min(final_distances, axis=1) ** 2)
        
        return centers, labels, inertia, n_iter
    
    def fit(self, X):
        """Fit the model to the data."""
        # Convert pandas DataFrame to numpy array if necessary
        if isinstance(X, pd.DataFrame):
            X = X.to_numpy()
        X = np.asarray(X)
        
        # Initialize best solution tracking
        best_inertia = np.inf
        best_labels = None
        best_centers = None
        best_n_iter = None
        
        # Run k-means n_init times
        for init in range(self.n_init):
            # Generate seed for this initialization
            if self.random_state is not None:
                seed = self.random_state + init
            else:
                seed = None
                
            # Perform single k-means run
            centers, labels, inertia, n_iter = self._single_fit(X, seed)
            
            # Update best solution if current one is better
            if inertia < best_inertia:
                best_centers = centers
                best_labels = labels
                best_inertia = inertia
                best_n_iter = n_iter
        
        # Store best solution
        self.cluster_centers_ = best_centers
        self.labels_ = best_labels
        self.inertia_ = best_inertia
        self.n_iter_ = best_n_iter
        
        return self
    
    def fit_predict(self, X):
        """Fit the model and return cluster labels."""
        return self.fit(X).labels_
    
    def predict(self, X):
        """Predict the closest cluster for each sample in X."""
        # Convert pandas DataFrame to numpy array if necessary
        if isinstance(X, pd.DataFrame):
            X = X.to_numpy()
        X = np.asarray(X)
        
        distances = self._compute_distances(X, self.cluster_centers_)
        return np.argmin(distances, axis=1)

In [None]:
# Fit the models
def do_kmeans(dist, link, n):
    kmeans = FlexibleKMeans(
        n_clusters = n,
        metric = dist,
        center_method = link,
        n_init = 25,
        random_state = 43)

    pred_clust = kmeans.fit_predict(data)
    
    model = 'kmeans'
    params = f"distance = {dist}, linkage = {link}"
    
    return get_metrics(model, params, n, data, pred_clust)

data = data_n

clust_range = range(1, max_clust+1)
distances = ['euclidean', 'manhattan', 'chebyshev']
linkages = ['mean', 'median', 'medoid']
params = product(distances, linkages, clust_range)

results = Parallel(n_jobs=max_threads)(delayed(do_kmeans)(dist, link, n) for dist, link, n in params)
kmeans_all = pd.DataFrame(results)

In [None]:
kmeans_all

In [None]:
kmeans_all.loc[kmeans_all['params'] == 'distance = euclidean, linkage = mean']

In [None]:
mod_subset = kmeans_all.loc[kmeans_all['params'] == 'distance = euclidean, linkage = mean']
for val_index in val_indices:
    elbow_plot(mod_subset, val_index)

In [None]:
# Find the best model for each combination of parameters through the Elbow method
def elbow_method(dist, link, val_index):
    params = f"distance = {dist}, linkage = {link}"
    res = kmeans_all[kmeans_all['params'] == params]
    
    res = res.dropna(subset=[val_index])

    x = res["n_clust"]
    y = res[val_index]

    if val_index == 'davies_bouldin':
        knee_locator = KneeLocator(x, y, curve='concave', direction='increasing')
    else:
        knee_locator = KneeLocator(x, y, curve='convex', direction='decreasing')
    
    return res[res["n_clust"] == knee_locator.knee]

kmeans_elbow = pd.DataFrame()

distances = ['euclidean', 'manhattan', 'chebyshev']
linkages = ['mean', 'median', 'medoid']
models = product(distances, linkages)

for dist, link in models:
    for val_index in val_indices:
        best_mod = elbow_method(dist, link, val_index)
        kmeans_elbow = pd.concat([kmeans_elbow, best_mod], ignore_index=True)

In [None]:
# Find absolute best models for each validity index
kmeans_elbow = kmeans_elbow.drop_duplicates().reset_index(drop=True)
# Need to add columns indicating which validity index is maximized.
# After that, duplicate models should be merged, not dropped.

best_silhouette_ecd = kmeans_elbow.sort_values('silhouette_ecd', ascending=False).iloc[0]
best_silhouette_mnt = kmeans_elbow.sort_values('silhouette_mnt', ascending=False).iloc[0]
best_ch = kmeans_elbow.sort_values('calinski_harabasz', ascending=False).iloc[0]
best_db = kmeans_elbow.sort_values('davies_bouldin', ascending=True).iloc[0]
best_dunn = kmeans_elbow.sort_values('dunn', ascending=False).iloc[0]

kmeans_best = pd.DataFrame([best_silhouette_ecd, best_silhouette_mnt, best_ch, best_db, best_dunn])
kmeans_best = kmeans_best.drop_duplicates().reset_index(drop=True)

In [None]:
kmeans_best

In [None]:
# Refit and plot the best model
data = data_n_scaled

kmeans = FlexibleKMeans(
    n_clusters = 3,
    metric = 'manhattan',
    center_method = 'mean',
    n_init = 25,
    random_state = 43)

pred_clust = kmeans.fit_predict(data)

plot_clusters(data, range(0,3), pred_clust)

In [None]:
fig, ax = plot_cluster_profiles(data_n, pred_clust, feature_names = None, class_names = None, sd = 1.5)

# AHC

In [None]:
# Fit the models
def do_AHC(n, dist, link):
    ahc = AgglomerativeClustering(
        n_clusters = n,
        metric = dist,
        linkage = link)
    
    ahc.fit(data)
    pred_clust = ahc.labels_

    model = 'AHC'
    params = f"distance = {dist}, linkage = {link}"

    return get_metrics(model, params, n, data, pred_clust)

data = data_n_scaled

clust_range = range(1, max_clust+1)
distances = ['manhattan', 'euclidean', 'chebyshev', 'hamming']
linkages = ['single', 'average', 'complete']
params = product(clust_range, distances, linkages)

results = Parallel(n_jobs=max_threads)(delayed(do_AHC)(n, dist, link) for n, dist, link in params)
results.extend([do_AHC(n, 'euclidean', 'ward') for n in clust_range])
ahc_all = pd.DataFrame(results)

In [None]:
mod_subset = ahc_all.loc[ahc_all['params'] == 'distance = manhattan, linkage = average']
for val_index in val_indices:
    elbow_plot(mod_subset, val_index)

In [None]:
# Find the best model for each combination of parameters through the Elbow method
def elbow_method(dist, link, val_index):
    params = f"distance = {dist}, linkage = {link}"
    res = ahc_all[ahc_all['params'] == params]
    
    res = res.dropna(subset=[val_index])

    x = res["n_clust"]
    y = res[val_index]

    if val_index == 'davies_bouldin':
        knee_locator = KneeLocator(x, y, curve='concave', direction='increasing')
    else:
        knee_locator = KneeLocator(x, y, curve='convex', direction='decreasing')
    
    return res[res["n_clust"] == knee_locator.knee]

ahc_elbow = pd.DataFrame()

distances = ['manhattan', 'euclidean', 'chebyshev']
linkages = ['single', 'average', 'complete']
models = product(distances, linkages)

for dist, link in models:
    for val_index in val_indices:
        best_mod = elbow_method(dist, link, val_index)
        ahc_elbow = pd.concat([ahc_elbow, best_mod], ignore_index=True)
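
For intuition about what a knee search does, here is a minimal numpy sketch of a simpler heuristic: pick the point farthest from the straight line joining the first and last points of the curve. This is not the algorithm that `KneeLocator` implements, and the function name `chord_knee` and the sample values are hypothetical.

```python
import numpy as np

def chord_knee(x, y):
    """Return the x value whose point lies farthest from the chord joining
    the first and last points of the curve (after min-max scaling both axes)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if len(x) < 3 or np.ptp(x) == 0 or np.ptp(y) == 0:
        return None  # no curvature to detect
    xs = (x - x.min()) / np.ptp(x)
    ys = (y - y.min()) / np.ptp(y)
    # Perpendicular distance of each point to the line through the endpoints.
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    dist = np.abs(dy * (xs - xs[0]) - dx * (ys - ys[0])) / np.hypot(dx, dy)
    return x[np.argmax(dist)]

n_clust = [2, 3, 4, 5, 6, 7]
index_values = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]  # made-up, convex and decreasing
knee = chord_knee(n_clust, index_values)
print(knee)
```

As far as I can tell, when `KneeLocator` finds no knee its `knee` attribute is `None`, so the filter `res["n_clust"] == knee_locator.knee` returns an empty frame and that parameter combination silently drops out of `ahc_elbow`. It may be worth counting how many combinations are lost this way.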

In [None]:
# Find absolute best models for each validity index
ahc_elbow = ahc_elbow.drop_duplicates().reset_index(drop=True)
# Need to add columns indicating which validity index each model optimizes.
# After that, duplicate models should be merged, not dropped.

best_silhouette_ecd = ahc_elbow.sort_values('silhouette_ecd', ascending=False).iloc[0]
best_silhouette_mnt = ahc_elbow.sort_values('silhouette_mnt', ascending=False).iloc[0]
best_ch = ahc_elbow.sort_values('calinski_harabasz', ascending=False).iloc[0]
best_db = ahc_elbow.sort_values('davies_bouldin', ascending=True).iloc[0]
best_dunn = ahc_elbow.sort_values('dunn', ascending=False).iloc[0]

ahc_best = pd.DataFrame([best_silhouette_ecd, best_silhouette_mnt, best_ch, best_db, best_dunn])
ahc_best = ahc_best.drop_duplicates().reset_index(drop=True)

In [None]:
ahc_best

AHC yields only one interesting model, that is, one whose smallest cluster is not nearly empty. This model has 4 clusters, but its largest cluster gathers 85% of the individuals, so the other three are very small.
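
A quick way to check cluster-balance statements like this one is to tabulate the share of observations per label. A minimal sketch, in which the function name `cluster_shares` is hypothetical and the toy labels are made-up stand-ins for `ahc.labels_`:

```python
import numpy as np

def cluster_shares(labels):
    """Return {label: percentage of observations} for an array of cluster labels."""
    values, counts = np.unique(np.asarray(labels), return_counts=True)
    return {int(v): 100 * c / counts.sum() for v, c in zip(values, counts)}

# Toy labels: one dominant cluster and three small ones (made-up sizes).
toy_labels = np.repeat([0, 1, 2, 3], [85, 8, 5, 2])
shares = cluster_shares(toy_labels)
print(shares)
```

For HDBSCAN output the same function would report noise under the label `-1`.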

In [None]:
# Refit and plot the best model
data = data_n_scaled

ahc = AgglomerativeClustering(
    n_clusters = 4,
    metric = 'euclidean',
    linkage = 'complete')
    
ahc.fit(data)
pred_clust = ahc.labels_

plot_clusters(data, range(0,4), pred_clust)

In [None]:
fig, ax = plot_cluster_profiles(data_n, pred_clust, feature_names = None, class_names = None, sd = 1.5)

# HDBSCAN

In [None]:
# Fit the models
def do_hdbscan(dist, min_c, min_s):
    if dist == 'mahalanobis':
        cov_matrix = np.cov(data, rowvar=False)  # Compute covariance
        inv_cov_matrix = np.linalg.inv(cov_matrix)  # Compute inverse

        # Define a Mahalanobis distance function
        def mahalanobis_metric(a, b):
            return mahalanobis(a, b, inv_cov_matrix)

        dist_func = mahalanobis_metric
    else:
        dist_func = dist
        
    hdb = HDBSCAN(
        metric = dist_func,
        min_cluster_size = min_c, 
        min_samples = min_s)
        
    pred_clust = hdb.fit_predict(data)

    model = 'HDBSCAN'
    params = f"distance = {dist}, min_cluster_size = {min_c}, min_samples = {min_s}"
    n = len(set(pred_clust[pred_clust != -1]))
    noise_freq = 100 * sum(pred_clust == -1) / len(pred_clust)

    return get_metrics(model, params, n, data, pred_clust, noise = noise_freq)

data = data_n_scaled

distances = ['manhattan', 'euclidean', 'chebyshev', 'mahalanobis', 'hamming']
min_cluster_sizes = range(2, 21)
min_samples_range = range(1, 21)
params = product(distances, min_cluster_sizes, min_samples_range)

results = Parallel(n_jobs=max_threads)(delayed(do_hdbscan)(dist, min_c, min_s) for dist, min_c, min_s in params)
hdbscan_all = pd.DataFrame(results)
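
The Mahalanobis branch above passes a Python callable as the metric, which tends to be slow. One alternative, sketched here with numpy only, is to whiten the data once so that plain Euclidean distances on the transformed data equal Mahalanobis distances on the original data. The helper name `whiten` and the random toy data are assumptions; this illustrates the identity and has not been tested on the notebook's data.

```python
import numpy as np

def whiten(X):
    """Transform X so that Euclidean distance between rows of the output
    equals Mahalanobis distance between the corresponding rows of X."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    # The Cholesky factor L satisfies inv_cov = L @ L.T, so
    # (a - b) @ inv_cov @ (a - b) = ||(a - b) @ L||^2.
    L = np.linalg.cholesky(inv_cov)
    return X @ L

rng = np.random.default_rng(0)
# Correlated toy data (made up).
X = rng.normal(size=(50, 3)) @ np.array([[2.0, 0.5, 0.0],
                                         [0.0, 1.0, 0.3],
                                         [0.0, 0.0, 0.5]])
Z = whiten(X)

# Compare both distances for one pair of observations.
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
diff = X[0] - X[1]
d_mahalanobis = np.sqrt(diff @ inv_cov @ diff)
d_euclidean = np.linalg.norm(Z[0] - Z[1])
print(d_mahalanobis, d_euclidean)
```

The same transformation might also be a route to Mahalanobis-based k-means or AHC, since those models could then be fitted on the whitened data with the Euclidean metric.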

In [None]:
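# The elbow method is inapplicable here because HDBSCAN determines the number of clusters itself.
# We simply select the model optimizing each validity index (highest value, or lowest for Davies-Bouldin).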
best_silhouette_ecd = hdbscan_all.sort_values('silhouette_ecd', ascending=False).iloc[0]
best_silhouette_mnt = hdbscan_all.sort_values('silhouette_mnt', ascending=False).iloc[0]
best_ch = hdbscan_all.sort_values('calinski_harabasz', ascending=False).iloc[0]
best_db = hdbscan_all.sort_values('davies_bouldin', ascending=True).iloc[0]
best_dunn = hdbscan_all.sort_values('dunn', ascending=False).iloc[0]

hdbscan_best = pd.DataFrame([best_silhouette_ecd, best_silhouette_mnt, best_ch, best_db, best_dunn])
hdbscan_best = hdbscan_best.drop_duplicates().reset_index(drop=True)

In [None]:
hdbscan_best

In [None]:
# Histogram of the number of clusters selected by HDBSCAN models, grouping values of 8 and above into one '8+' bin
bins = list(range(0, 9)) + [8.5]
labels = list(range(0, 8)) + ['8+']
plot_data = hdbscan_all['n_clust'].apply(lambda x: x if x <= 8 else 8.5)

plt.figure(figsize=(8, 4))
plt.hist(plot_data, bins=bins, edgecolor='black', align='left', rwidth=0.8)
plt.xticks(bins[:-1], labels)
plt.xlabel('Number of Clusters')
plt.ylabel('Number of Models')
plt.title('Number of clusters selected by HDBSCAN models')
plt.show()

In [None]:
# Refit and plot the best model
data = data_n_scaled

hdb = HDBSCAN(
    metric = 'manhattan', 
    min_cluster_size = 2, 
    min_samples = 1)

pred_clust = hdb.fit_predict(data)
n_clusters = len(set(pred_clust[pred_clust != -1]))

plot_clusters(data, range(0,n_clusters), pred_clust)

# Aggregate results
## Best models
It is unclear whether it makes sense to compare models that work so differently. It seems better to analyze their results separately.

In [None]:
best_mod_list = [kmeans_best, ahc_best, hdbscan_best]
best_models = pd.concat(best_mod_list, ignore_index=True)

In [None]:
best_models

In [None]:
# Selecting the best-performing model on each criterion across model classes eliminates the HDBSCAN models.
# This could mean that HDBSCAN is underperforming,
# or that it is picking non-convex clusters,
# or that the data are not clusterable!
best_mod_list = [kmeans_best, ahc_best, hdbscan_best]
best_models = pd.concat(best_mod_list, ignore_index=True)

best_silhouette_ecd = best_models.sort_values('silhouette_ecd', ascending=False).iloc[0]
best_silhouette_mnt = best_models.sort_values('silhouette_mnt', ascending=False).iloc[0]
best_ch = best_models.sort_values('calinski_harabasz', ascending=False).iloc[0]
best_db = best_models.sort_values('davies_bouldin', ascending=True).iloc[0]
best_dunn = best_models.sort_values('dunn', ascending=False).iloc[0]

best_models = pd.DataFrame([best_silhouette_ecd, best_silhouette_mnt, best_ch, best_db, best_dunn])
best_models = best_models.drop_duplicates().reset_index(drop=True)

In [None]:
best_models
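
The selection above keeps a single winner per index across all classes. If the aim is instead to keep the best model of each class for a given index, a grouped `idxmax` does this in one step. A sketch with made-up values: the function name `best_by_class` is hypothetical, and a `model` column holding the class label is assumed to exist in the frames returned by `get_metrics`.

```python
import pandas as pd

def best_by_class(models, index_name, higher_is_better=True, class_col='model'):
    """Return, for each model class, the row that optimizes index_name."""
    scores = models.dropna(subset=[index_name])
    grouped = scores.groupby(class_col)[index_name]
    row_ids = grouped.idxmax() if higher_is_better else grouped.idxmin()
    return scores.loc[row_ids].reset_index(drop=True)

# Toy data standing in for the concatenated candidate models (values are made up).
toy_models = pd.DataFrame({
    'model': ['kmeans', 'kmeans', 'AHC', 'AHC', 'HDBSCAN'],
    'n_clust': [3, 4, 2, 4, 6],
    'silhouette_ecd': [0.50, 0.45, 0.60, 0.35, 0.20],
    'davies_bouldin': [0.9, 1.0, 0.7, 1.2, 1.6],
})
toy_best_sil = best_by_class(toy_models, 'silhouette_ecd')
toy_best_db = best_by_class(toy_models, 'davies_bouldin', higher_is_better=False)
print(toy_best_sil[['model', 'n_clust']])
```

Keeping one row per class avoids discarding HDBSCAN outright, which fits the idea of analyzing each class separately.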

## Checking the correlation of validity indices

In [None]:
all_mod_list = [LCA_all, LPA_all, kmeans_all, ahc_all, hdbscan_all]
all_models = pd.concat(all_mod_list, ignore_index=True)
all_models = all_models[['silhouette_ecd', 'silhouette_mnt', 'calinski_harabasz', 'davies_bouldin', 'dunn', 'gen_dunn']]

In [None]:
correlations = all_models.corr(method='spearman')

plt.figure(figsize=(7, 7)) 
sns.heatmap(correlations, annot=True, fmt=".2f", cmap="coolwarm", cbar=True, 
            square=True, linewidths=0.5, vmin=-1, vmax=1)
plt.show()
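
To turn the heatmap into a shortlist of redundant indices, one option is to list the pairs whose absolute Spearman correlation exceeds a cutoff. In this sketch the function name `redundant_pairs`, the 0.9 threshold, and the toy values are all assumptions.

```python
import pandas as pd

def redundant_pairs(df, threshold=0.9):
    """List (index_a, index_b, rho) for column pairs with |Spearman rho| >= threshold."""
    corr = df.corr(method='spearman')
    cols = list(corr.columns)
    pairs = []
    # Visit each unordered pair of columns once.
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            rho = corr.loc[a, b]
            if abs(rho) >= threshold:
                pairs.append((a, b, float(rho)))
    return pairs

# Toy values (made up) standing in for the validity indices of all models.
toy_indices = pd.DataFrame({
    'silhouette_ecd': [0.10, 0.20, 0.30, 0.40, 0.50],
    'silhouette_mnt': [0.12, 0.19, 0.33, 0.41, 0.55],   # same ranking as above
    'davies_bouldin': [1.5, 1.4, 1.1, 0.9, 0.8],        # reversed ranking
    'dunn': [0.3, 0.1, 0.5, 0.2, 0.4],                  # weakly related
})
pairs = redundant_pairs(toy_indices)
print(pairs)
```

The absolute value matters because `davies_bouldin` is minimized: a strong negative correlation with an index that is maximized signals redundancy just as a strong positive one does between two maximized indices.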

In [None]:
# Histogram
bins = np.arange(best_models['n_clust'].min() - 0.5, best_models['n_clust'].max() + 1.5, 1)

plt.figure(figsize=(8, 4))
plt.hist(best_models['n_clust'], bins=bins, edgecolor='black', rwidth=0.8)
plt.xlabel('Number of Clusters')
plt.ylabel('Number of Models')
plt.title('Optimal number of clusters according to best models')
plt.show()

# Hopkins Statistic

Function taken from the pyclustertend package, which could not be installed because its dependencies are outdated.
See: https://pyclustertend.readthedocs.io/en/latest/_modules/pyclustertend/hopkins.html
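
For reference, the quantity returned by the function below can be written as follows, where $w_i$ is the distance from the $i$-th sampled observation to its nearest neighbour in the data, $u_i$ is the distance from the $i$-th uniformly simulated point to its nearest observation, and $n$ is the sampling size. These symbols are notation introduced here, not taken from the package.

```latex
\[
H = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} w_i + \sum_{i=1}^{n} u_i}
\]
```

With this orientation, clustered data give small $w_i$ relative to $u_i$, so $H$ tends to 0, while data without structure give $H$ around 0.5. Some references put $u_i$ in the numerator instead, which reverses the reading.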

In [None]:
def hopkins(data_frame, sampling_size):
    """Assess the clusterability of a dataset. The score lies between 0 and 1: a score around 0.5
    indicates no clusterability and a score tending to 0 indicates a high cluster tendency.

    Parameters
    ----------
    data_frame : numpy array
        The input dataset
    sampling_size : int
        The number of observations sampled from the dataset to compute the statistic.

    Returns
    -------
    score : float
        The hopkins score of the dataset (between 0 and 1)
    """
    
    if isinstance(data_frame, np.ndarray):
        data_frame = pd.DataFrame(data_frame)

    # Sample n observations from D:P
    if sampling_size > data_frame.shape[0]:
        raise Exception(
            'The sampling size is bigger than the number of observations in D')

    data_frame_sample = data_frame.sample(n=sampling_size)

    # Get the distance to their nearest neighbors in D:X (k=2 because the first match is the point itself)
    tree = BallTree(data_frame, leaf_size=2)
    dist, _ = tree.query(data_frame_sample, k=2)
    data_frame_sample_distances_to_nearest_neighbours = dist[:, 1]

    # Randomly simulate n points uniformly within the range of each feature of D:Q
    max_data_frame = data_frame.max().to_numpy()
    min_data_frame = data_frame.min().to_numpy()

    # Broadcasting draws one column per feature, each within that feature's [min, max] range
    uniformly_selected_observations = np.random.uniform(
        min_data_frame, max_data_frame, size=(sampling_size, data_frame.shape[1]))

    uniformly_selected_observations_df = pd.DataFrame(uniformly_selected_observations)

    # Get the distance to their nearest neighbors in D:Y
    # (the tree built above is reused, since it indexes the same data)
    dist, _ = tree.query(uniformly_selected_observations_df, k=1)
    uniformly_df_distances_to_nearest_neighbours = dist[:, 0]

    # Return the hopkins score
    x = np.sum(data_frame_sample_distances_to_nearest_neighbours)
    y = np.sum(uniformly_df_distances_to_nearest_neighbours)

    if x + y == 0:
        raise Exception('The denominator of the hopkins statistics is null')

    return x / (x + y)

In [None]:
float(hopkins(data_n.values, data_n.shape[0]))
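
As a sanity check on the orientation of the score, the sketch below reimplements the same ratio with numpy only (brute-force nearest neighbours, so suitable for small data) and compares clearly clustered data with uniform noise. The function name `hopkins_numpy`, the seed, the sampling size, and the toy datasets are assumptions for illustration.

```python
import numpy as np

def hopkins_numpy(X, sampling_size, rng):
    """Ratio sum(w) / (sum(w) + sum(u)): w are nearest-neighbour distances of
    sampled observations, u those of uniform points drawn in the bounding box."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if sampling_size > n:
        raise ValueError('sampling_size exceeds the number of observations')

    sample_idx = rng.choice(n, size=sampling_size, replace=False)

    # w: distance from each sampled observation to its nearest *other* observation.
    dist_real = np.linalg.norm(X[sample_idx, None, :] - X[None, :, :], axis=2)
    dist_real[np.arange(sampling_size), sample_idx] = np.inf  # exclude self-match
    w = dist_real.min(axis=1)

    # u: distance from each uniform point to its nearest observation.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(sampling_size, d))
    u = np.linalg.norm(uniform[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    return w.sum() / (w.sum() + u.sum())

rng = np.random.default_rng(42)
# Two tight, well-separated blobs versus uniform noise on the unit square.
clustered = np.vstack([rng.normal(0, 0.05, size=(100, 2)),
                       rng.normal(5, 0.05, size=(100, 2))])
noise = rng.uniform(0, 1, size=(200, 2))

h_clustered = hopkins_numpy(clustered, 20, rng)
h_noise = hopkins_numpy(noise, 20, rng)
print(round(h_clustered, 3), round(h_noise, 3))
```

The clustered data should score close to 0 and the noise close to 0.5. Drawing a subsample, here 20 of 200 points, is a choice made for this sketch; the notebook's call above uses every row of `data_n` as the sample.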