# Clustering Analysis with Multiple Algorithms
This notebook performs comprehensive clustering analysis using multiple algorithms (KMeans, GMM, Spectral) and logs all results to Weights & Biases.

**Features:**
- Supports multiple k values for clustering
- Computes internal metrics (Silhouette, Davies-Bouldin, Calinski-Harabasz)
- Computes external metrics (NRI, ARI, Purity) if labels are available
- Logs all experiments to wandb
- Handles missing labels gracefully

## 1. Configuration Section
Set all configuration variables here before running the analysis.

In [21]:
# ===========================
# CONFIGURATION VARIABLES
# ===========================

# Path to the PCA-reduced, preprocessed & scaled dataset
DATA_CSV_PATH = "../results/pca/indian_pca.csv"
# DATA_CSV_PATH = "../results/pca/fma_medium_pca.csv"
# DATA_CSV_PATH = "../results/pca/fma_small_pca.csv"
# DATA_CSV_PATH = "../results/pca/gtzan_pca.csv"


# Option 1: Label column inside the same CSV (set to None if not applicable)
LABEL_COLUMN_IN_DATA = None  # e.g., "genre" or "label"

# Option 2: Separate label file (set to None if not applicable)
# LABEL_CSV_PATH = "../results/normalization/fma_medium_labels.csv" 
LABEL_CSV_PATH = "../results/normalization/indian_labels.csv" 
# LABEL_CSV_PATH = "../results/normalization/fma_small_labels.csv" 
# LABEL_CSV_PATH = "../results/normalization/gtzan_labels.csv" 
LABEL_CSV_LABEL_COLUMN = "label"  # Column name in the label CSV

# Weights & Biases configuration
WANDB_PROJECT = "music-clustering-fma"  # Required: your wandb project name
WANDB_ENTITY = None  # Optional: your wandb username/team (None = use default)

# Clustering configuration
RANDOM_STATE = 42  # For reproducibility
K_VALUES = [5, 8, 10, 16]  # Different number of clusters to test

# Display configuration
print("Configuration loaded successfully!")
print(f"Data path: {DATA_CSV_PATH}")
print(f"Label in data: {LABEL_COLUMN_IN_DATA}")
print(f"Label CSV path: {LABEL_CSV_PATH}")
print(f"K values: {K_VALUES}")

Configuration loaded successfully!
Data path: ../results/pca/indian_pca.csv
Label in data: None
Label CSV path: ../results/normalization/indian_labels.csv
K values: [5, 8, 10, 16]


## 2. Import Required Libraries
Import all necessary libraries for data processing, clustering, metrics, and logging.

In [None]:
# ===========================
# IMPORT LIBRARIES
# ===========================

import pandas as pd
import numpy as np
import warnings
import wandb
from pathlib import Path

# Clustering algorithms
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
from sklearn.mixture import GaussianMixture

# Dimensionality reduction for visualization
from sklearn.manifold import TSNE

# Visualization libraries
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import os

# Check if kaleido is available for saving plotly images
try:
    import kaleido
    KALEIDO_AVAILABLE = True
except ImportError:
    print("‚ö† Warning: kaleido not available. Image export for k=10 will be skipped.")
    print("  To enable image export, try: pip install kaleido")
    KALEIDO_AVAILABLE = False

# Internal clustering metrics
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score
)

# External clustering metrics (for labeled data)
from sklearn.metrics import (
    adjusted_rand_score,
    rand_score
)

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("‚úì All libraries imported successfully!")

‚úì All libraries imported successfully!


## 3. Define Helper Functions
Create modular functions for data loading, metric computation, and clustering.

In [23]:
# ===========================
# DATA LOADING FUNCTIONS
# ===========================

def load_data_and_labels(data_path, label_col_in_data=None, 
                         label_csv_path=None, label_csv_col=None):
    """
    Load PCA-reduced dataset and optionally load labels.
    
    Args:
        data_path: Path to the PCA-reduced CSV file
        label_col_in_data: Column name if labels are in the same CSV
        label_csv_path: Path to separate label CSV file
        label_csv_col: Column name in the separate label CSV
    
    Returns:
        X: numpy array of features
        y: numpy array of labels (or None if not available)
        data_df: original dataframe
    """
    # Load the main dataset
    print(f"Loading data from: {data_path}")
    data_df = pd.read_csv(data_path)
    print(f"‚úì Data loaded: {data_df.shape[0]} samples, {data_df.shape[1]} features")
    
    # Extract features (all columns)
    X = data_df.values
    y = None
    
    # Try to load labels - Option 1: Label column in the same CSV
    if label_col_in_data and label_col_in_data in data_df.columns:
        print(f"‚úì Loading labels from column '{label_col_in_data}' in data")
        y = data_df[label_col_in_data].values
        # Remove label column from features
        X = data_df.drop(columns=[label_col_in_data]).values
        print(f"‚úì Labels loaded: {len(np.unique(y))} unique classes")
    
    # Try to load labels - Option 2: Separate label CSV
    elif label_csv_path and Path(label_csv_path).exists():
        print(f"‚úì Loading labels from: {label_csv_path}")
        label_df = pd.read_csv(label_csv_path)
        
        if label_csv_col and label_csv_col in label_df.columns:
            y = label_df[label_csv_col].values
            
            # Ensure label count matches data count
            if len(y) != len(X):
                print(f"‚ö† Warning: Label count ({len(y)}) doesn't match data count ({len(X)})")
                print("  Truncating to minimum length...")
                min_len = min(len(X), len(y))
                X = X[:min_len]
                y = y[:min_len]
            
            print(f"‚úì Labels loaded: {len(np.unique(y))} unique classes")
        else:
            print(f"‚ö† Warning: Column '{label_csv_col}' not found in label CSV")
    
    else:
        print("‚Ñπ No labels provided - will compute internal metrics only")
    
    return X, y, data_df

print("‚úì Data loading function defined")

‚úì Data loading function defined


In [24]:
# ===========================
# METRIC COMPUTATION FUNCTIONS
# ===========================

def compute_purity(y_true, y_pred):
    """
    Compute purity score for clustering results.
    
    Purity measures how "pure" each cluster is with respect to ground truth labels.
    Higher is better (max = 1.0).
    
    Args:
        y_true: Ground truth labels
        y_pred: Predicted cluster labels
    
    Returns:
        purity: float between 0 and 1
    """
    # Create contingency matrix
    contingency_matrix = pd.crosstab(y_pred, y_true)
    
    # Sum of maximum counts in each cluster
    purity = np.sum(np.max(contingency_matrix.values, axis=1)) / len(y_true)
    
    return purity


def compute_internal_metrics(X, labels):
    """
    Compute internal clustering validation metrics.
    These metrics don't require ground truth labels.
    
    Args:
        X: Feature matrix (numpy array)
        labels: Cluster assignments
    
    Returns:
        dict: Dictionary containing all internal metrics
    """
    metrics = {}
    
    # Silhouette Score: measures how similar samples are to their own cluster
    # Range: [-1, 1], higher is better
    metrics['silhouette_score'] = silhouette_score(X, labels)
    
    # Davies-Bouldin Index: ratio of within-cluster to between-cluster distances
    # Range: [0, ‚àû), lower is better
    metrics['davies_bouldin_index'] = davies_bouldin_score(X, labels)
    
    # Calinski-Harabasz Index: ratio of between-cluster to within-cluster variance
    # Range: [0, ‚àû), higher is better
    metrics['calinski_harabasz_index'] = calinski_harabasz_score(X, labels)
    
    return metrics


def compute_external_metrics(y_true, y_pred):
    """
    Compute external clustering validation metrics.
    These metrics require ground truth labels.
    
    Args:
        y_true: Ground truth labels
        y_pred: Predicted cluster labels
    
    Returns:
        dict: Dictionary containing all external metrics
    """
    metrics = {}
    
    # Normalized Rand Index (NRI): measures similarity between two clusterings
    # Range: [0, 1], higher is better
    metrics['nri'] = rand_score(y_true, y_pred)
    
    # Adjusted Rand Index (ARI): adjusted-for-chance version of Rand Index
    # Range: [-1, 1], higher is better (0 = random, 1 = perfect match)
    metrics['ari'] = adjusted_rand_score(y_true, y_pred)
    
    # Purity: percentage of correctly clustered samples
    # Range: [0, 1], higher is better
    metrics['purity'] = compute_purity(y_true, y_pred)
    
    return metrics

print("‚úì Metric computation functions defined")

‚úì Metric computation functions defined


In [None]:
# ===========================
# CLUSTERING ALGORITHM FUNCTIONS
# ===========================

def get_clustering_algorithm(algorithm_name, n_clusters, random_state):
    """
    Factory function to get the appropriate clustering algorithm.
    
    Args:
        algorithm_name: Name of the algorithm ('kmeans', 'agglomerative', 'gmm', 'spectral')
        n_clusters: Number of clusters
        random_state: Random seed for reproducibility
    
    Returns:
        Initialized clustering algorithm object
    """
    algorithms = {
        'kmeans': KMeans(
            n_clusters=n_clusters,
            random_state=random_state,
            init='random',  # Standard random initialization
            n_init=10,
            max_iter=300
        ),
        'agglomerative': AgglomerativeClustering(
            n_clusters=n_clusters,
            linkage='ward',  # Ward linkage minimizes variance
            metric='euclidean'
        ),
        'gmm': GaussianMixture(
            n_components=n_clusters,
            random_state=random_state,
            covariance_type='full',
            max_iter=100
        ),
        'spectral': SpectralClustering(
            n_clusters=n_clusters,
            random_state=random_state,
            affinity='nearest_neighbors',  # Changed from 'rbf' to 'nearest_neighbors'
            n_neighbors=10,  # Use 10 nearest neighbors
            assign_labels='discretize',  # Changed from 'kmeans' to 'discretize'
            n_init=10  # Multiple initializations
        )
    }
    
    if algorithm_name not in algorithms:
        raise ValueError(f"Unknown algorithm: {algorithm_name}")
    
    return algorithms[algorithm_name]


def run_clustering(X, algorithm_name, n_clusters, random_state):
    """
    Run a clustering algorithm on the data.
    
    Args:
        X: Feature matrix
        algorithm_name: Name of the algorithm
        n_clusters: Number of clusters
        random_state: Random seed
    
    Returns:
        labels: Cluster assignments for each sample
    """
    # Get the algorithm
    clusterer = get_clustering_algorithm(algorithm_name, n_clusters, random_state)
    
    # Fit and predict
    # Note: GMM uses predict() while others use fit_predict()
    if algorithm_name == 'gmm':
        clusterer.fit(X)
        labels = clusterer.predict(X)
    else:
        labels = clusterer.fit_predict(X)
    
    return labels

print("‚úì Clustering algorithm functions defined")

‚úì Clustering algorithm functions defined


### t-SNE Visualization Functions
Functions for creating 2D and 3D t-SNE visualizations of clustering results.

In [26]:
# ===========================
# T-SNE VISUALIZATION FUNCTIONS
# ===========================

def compute_tsne_embeddings(X, random_state=42):
    """
    Compute t-SNE embeddings in 2D and 3D for visualization.
    
    Args:
        X: Feature matrix
        random_state: Random seed for reproducibility
    
    Returns:
        tsne_2d: 2D t-SNE embeddings
        tsne_3d: 3D t-SNE embeddings
    """
    print("Computing t-SNE embeddings...")
    
    # Compute 2D t-SNE
    print("  - Computing 2D t-SNE...")
    tsne_2d = TSNE(n_components=2, random_state=random_state, perplexity=30, max_iter=1000)
    X_tsne_2d = tsne_2d.fit_transform(X)
    
    # Compute 3D t-SNE
    print("  - Computing 3D t-SNE...")
    tsne_3d = TSNE(n_components=3, random_state=random_state, perplexity=30, max_iter=1000)
    X_tsne_3d = tsne_3d.fit_transform(X)
    
    print("‚úì t-SNE embeddings computed successfully!")
    return X_tsne_2d, X_tsne_3d


def create_2d_cluster_plot(X_tsne_2d, cluster_labels, true_labels=None, 
                           algorithm_name="", k=0):
    """
    Create a 2D scatter plot of t-SNE embeddings colored by cluster assignments.
    
    Args:
        X_tsne_2d: 2D t-SNE embeddings
        cluster_labels: Predicted cluster labels
        true_labels: Ground truth labels (optional)
        algorithm_name: Name of the clustering algorithm
        k: Number of clusters
    
    Returns:
        fig: Plotly figure object
    """
    # Create DataFrame for plotting
    df = pd.DataFrame({
        'tsne_1': X_tsne_2d[:, 0],
        'tsne_2': X_tsne_2d[:, 1],
        'cluster': cluster_labels.astype(str)
    })
    
    if true_labels is not None:
        df['true_label'] = true_labels.astype(str)
        hover_data = ['cluster', 'true_label']
    else:
        hover_data = ['cluster']
    
    # Create the plot
    fig = px.scatter(
        df, 
        x='tsne_1', 
        y='tsne_2',
        color='cluster',
        title=f'2D t-SNE: {algorithm_name.upper()} (k={k})',
        labels={'tsne_1': 't-SNE Component 1', 'tsne_2': 't-SNE Component 2'},
        hover_data=hover_data,
        color_discrete_sequence=px.colors.qualitative.Set3
    )
    
    fig.update_traces(marker=dict(size=6, opacity=0.7))
    fig.update_layout(
        width=800,
        height=600,
        template='plotly_white',
        legend_title='Cluster'
    )
    
    return fig


def create_3d_cluster_plot(X_tsne_3d, cluster_labels, true_labels=None,
                           algorithm_name="", k=0):
    """
    Create a 3D scatter plot of t-SNE embeddings colored by cluster assignments.
    
    Args:
        X_tsne_3d: 3D t-SNE embeddings
        cluster_labels: Predicted cluster labels
        true_labels: Ground truth labels (optional)
        algorithm_name: Name of the clustering algorithm
        k: Number of clusters
    
    Returns:
        fig: Plotly figure object
    """
    # Create DataFrame for plotting
    df = pd.DataFrame({
        'tsne_1': X_tsne_3d[:, 0],
        'tsne_2': X_tsne_3d[:, 1],
        'tsne_3': X_tsne_3d[:, 2],
        'cluster': cluster_labels.astype(str)
    })
    
    if true_labels is not None:
        df['true_label'] = true_labels.astype(str)
        hover_data = ['cluster', 'true_label']
    else:
        hover_data = ['cluster']
    
    # Create the plot
    fig = px.scatter_3d(
        df,
        x='tsne_1',
        y='tsne_2',
        z='tsne_3',
        color='cluster',
        title=f'3D t-SNE: {algorithm_name.upper()} (k={k})',
        labels={
            'tsne_1': 't-SNE Component 1',
            'tsne_2': 't-SNE Component 2',
            'tsne_3': 't-SNE Component 3'
        },
        hover_data=hover_data,
        color_discrete_sequence=px.colors.qualitative.Set3
    )
    
    fig.update_traces(marker=dict(size=4, opacity=0.7))
    fig.update_layout(
        width=900,
        height=700,
        template='plotly_white',
        legend_title='Cluster'
    )
    
    return fig

print("‚úì t-SNE visualization functions defined")

‚úì t-SNE visualization functions defined


## 4. Load Data
Load the PCA-reduced dataset and labels (if available).

In [27]:
# ===========================
# LOAD DATA AND LABELS
# ===========================

# Load the dataset and labels (if available)
X, y, data_df = load_data_and_labels(
    data_path=DATA_CSV_PATH,
    label_col_in_data=LABEL_COLUMN_IN_DATA,
    label_csv_path=LABEL_CSV_PATH,
    label_csv_col=LABEL_CSV_LABEL_COLUMN
)

# Display summary
print(f"\n{'='*50}")
print(f"Dataset Summary:")
print(f"{'='*50}")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Labels available: {'Yes' if y is not None else 'No'}")
if y is not None:
    print(f"Number of unique labels: {len(np.unique(y))}")
print(f"{'='*50}\n")

Loading data from: ../results/pca/indian_pca.csv
‚úì Data loaded: 500 samples, 40 features
‚úì Loading labels from: ../results/normalization/indian_labels.csv
‚úì Labels loaded: 5 unique classes

Dataset Summary:
Number of samples: 500
Number of features: 40
Labels available: Yes
Number of unique labels: 5



## 4A. Compute t-SNE Embeddings
Compute 2D and 3D t-SNE embeddings for visualization. This is done once before clustering to save computation time.

In [28]:
# ===========================
# COMPUTE T-SNE EMBEDDINGS
# ===========================

# Compute t-SNE embeddings once for all visualizations
X_tsne_2d, X_tsne_3d = compute_tsne_embeddings(X, random_state=RANDOM_STATE)

print(f"\n‚úì 2D t-SNE shape: {X_tsne_2d.shape}")
print(f"‚úì 3D t-SNE shape: {X_tsne_3d.shape}")
print("\nThese embeddings will be used for all clustering visualizations.\n")

Computing t-SNE embeddings...
  - Computing 2D t-SNE...
  - Computing 3D t-SNE...
  - Computing 3D t-SNE...
‚úì t-SNE embeddings computed successfully!

‚úì 2D t-SNE shape: (500, 2)
‚úì 3D t-SNE shape: (500, 3)

These embeddings will be used for all clustering visualizations.

‚úì t-SNE embeddings computed successfully!

‚úì 2D t-SNE shape: (500, 2)
‚úì 3D t-SNE shape: (500, 3)

These embeddings will be used for all clustering visualizations.



## 5A. Login to Weights & Biases (First Time Only)
Run this cell only once to login to wandb. You'll need to get your API key from https://wandb.ai/authorize

In [29]:
# ===========================
# WANDB LOGIN (RUN THIS FIRST!)
# ===========================

# STEP 1: Get your API key from: https://wandb.ai/authorize
# STEP 2: Run this cell and paste your API key when prompted
# STEP 3: This only needs to be done once per machine

import wandb

# Login to wandb
wandb.login()

# Alternative: If you want to login without interactive prompt, use:
# wandb.login(key="YOUR_API_KEY_HERE")

print("‚úì Successfully logged in to Weights & Biases!")
print("You can now run the next cells to start your experiment.")

‚úì Successfully logged in to Weights & Biases!
You can now run the next cells to start your experiment.


## 5B. Initialize Weights & Biases Project
After logging in above, run this cell to initialize your experiment.

In [None]:
# ===========================
# INITIALIZE WANDB
# ===========================

# Note: Make sure to run wandb.login() before this cell if not already logged in

# Close any existing wandb runs first
try:
    if wandb.run is not None:
        print("‚ö† Closing existing WandB run...")
        wandb.finish()
except:
    pass

# Initialize wandb with project configuration and error handling
try:
    wandb.init(
        project=WANDB_PROJECT,
        entity=WANDB_ENTITY,
        name=f"clustering-comparison-{Path(DATA_CSV_PATH).stem}",
        config={
            'dataset': DATA_CSV_PATH,
            'n_samples': X.shape[0],
            'n_features': X.shape[1],
            'k_values': K_VALUES,
            'random_state': RANDOM_STATE,
            'has_labels': y is not None,
            'algorithms': ['kmeans', 'agglomerative', 'gmm', 'spectral']
        },
        reinit=True,  # Allow reinitialization
        settings=wandb.Settings(start_method="thread")  # Use thread mode to avoid fork issues
    )
    
    print("‚úì Weights & Biases initialized successfully!")
    print(f"  Project: {WANDB_PROJECT}")
    print(f"  Run name: {wandb.run.name}")
    
except Exception as e:
    print(f"‚ùå Error initializing WandB: {str(e)}")
    print("  Tip: Try running 'wandb login' or check your internet connection")
    raise

‚úì Weights & Biases initialized successfully!
  Project: music-clustering-fma
  Run name: clustering-comparison-indian_pca


## 6. Run Clustering Experiments
Run all combinations of algorithms and k values, computing metrics and logging to wandb.

In [None]:
# ===========================
# RUN ALL CLUSTERING EXPERIMENTS
# ===========================

# Define algorithms to test
algorithms = ['kmeans', 'agglomerative', 'gmm', 'spectral']

# Storage for all results
all_results = []

# Total number of experiments (all algorithms run for each k value)
total_experiments = len(algorithms) * len(K_VALUES)
experiment_counter = 0

print(f"Starting {total_experiments} clustering experiments...")
print(f"Algorithms to test: {', '.join(algorithms)}")
print(f"K values to test: {K_VALUES}")
print(f"{'='*70}\n")

# Loop through each algorithm
for algorithm_name in algorithms:
    print(f"Algorithm: {algorithm_name.upper()}")
    print(f"{'-'*70}")
    
    # Loop through each k value
    for k in K_VALUES:
        experiment_counter += 1
        
        print(f"  [{experiment_counter}/{total_experiments}] Running {algorithm_name} with k={k}...")
        
        try:
            # Run clustering
            cluster_labels = run_clustering(X, algorithm_name, k, RANDOM_STATE)
            
            # Check if we have valid clusters (at least 2 unique labels)
            unique_labels = np.unique(cluster_labels)
            n_clusters_found = len(unique_labels)
            
            if n_clusters_found < 2:
                print(f"      ‚ö† Warning: Only {n_clusters_found} cluster(s) found. Skipping metrics.")
                continue
            
            # Compute internal metrics (always available)
            internal_metrics = compute_internal_metrics(X, cluster_labels)
            
            # Initialize results dictionary
            result = {
                'algorithm': algorithm_name,
                'k': k,
                'n_clusters_found': n_clusters_found,
                'silhouette_score': internal_metrics['silhouette_score'],
                'davies_bouldin_index': internal_metrics['davies_bouldin_index'],
                'calinski_harabasz_index': internal_metrics['calinski_harabasz_index']
            }
            
            # Compute external metrics (only if labels are available)
            if y is not None:
                external_metrics = compute_external_metrics(y, cluster_labels)
                    
                result.update({
                    'nri': external_metrics['nri'],
                    'ari': external_metrics['ari'],
                    'purity': external_metrics['purity']
                })
                
                print(f"      ‚úì Clusters: {n_clusters_found} | Silhouette: {internal_metrics['silhouette_score']:.4f} | "
                      f"ARI: {external_metrics['ari']:.4f} | "
                      f"Purity: {external_metrics['purity']:.4f}")
            else:
                # Set external metrics to None if labels not available
                result.update({
                    'nri': None,
                    'ari': None,
                    'purity': None
                })
                
                print(f"      ‚úì Clusters: {n_clusters_found} | Silhouette: {internal_metrics['silhouette_score']:.4f} | "
                      f"DB Index: {internal_metrics['davies_bouldin_index']:.4f}")
            
            # Store result with cluster labels for later visualization
            result['cluster_labels'] = cluster_labels
            all_results.append(result)
            
        except Exception as e:
            print(f"      ‚úó Error: {str(e)}")
            import traceback
            traceback.print_exc()
            continue
    
    print()  # Empty line between algorithms

print(f"{'='*70}")
print(f"‚úì All {experiment_counter} experiments completed!")
print(f"‚úì {len(all_results)} successful results collected")
print(f"‚úì Results ready for visualization and logging")
print(f"{'='*70}")

Starting 20 clustering experiments...
Algorithms to test: kmeans, agglomerative, gmm, spectral, hdbscan
K values to test: [5, 8, 10, 16]

Algorithm: KMEANS
----------------------------------------------------------------------
  [1/20] Running kmeans with k=5...
      ‚úì Clusters: 5 | Silhouette: 0.0677 | ARI: 0.1736 | Purity: 0.4760
  [2/20] Running kmeans with k=8...
      ‚úì Clusters: 8 | Silhouette: 0.0684 | ARI: 0.1087 | Purity: 0.4320
  [3/20] Running kmeans with k=10...
      ‚úì Clusters: 10 | Silhouette: 0.0647 | ARI: 0.1008 | Purity: 0.4700
  [4/20] Running kmeans with k=16...
      ‚úì Clusters: 16 | Silhouette: 0.0707 | ARI: 0.0965 | Purity: 0.5100

Algorithm: AGGLOMERATIVE
----------------------------------------------------------------------
  [5/20] Running agglomerative with k=5...
      ‚úì Clusters: 5 | Silhouette: 0.0483 | ARI: 0.1443 | Purity: 0.4580
  [6/20] Running agglomerative with k=8...
      ‚úì Clusters: 8 | Silhouette: 0.0553 | ARI: 0.2094 | Purity: 0.530

## 7. Create Results DataFrame
Aggregate all results into a pandas DataFrame for easy analysis.

## 7A. Visualize Metrics Comparison
Create comprehensive comparison plots showing all algorithms and k values in single graphs.

In [32]:
# ===========================
# CREATE RESULTS DATAFRAME
# ===========================

# Check if we have any results
if not all_results:
    print("\n" + "="*70)
    print("‚ö† WARNING: No results to display!")
    print("="*70)
    print("Please run the clustering experiments (Section 6) first.")
    print("="*70)
else:
    # Convert results list to DataFrame
    results_df = pd.DataFrame(all_results)
    
    # Remove cluster_labels and figure columns for display and logging
    cols_to_drop = ['cluster_labels']
    if 'fig_2d' in results_df.columns:
        cols_to_drop.append('fig_2d')
    if 'fig_3d' in results_df.columns:
        cols_to_drop.append('fig_3d')
    
    results_df_display = results_df.drop(columns=cols_to_drop, errors='ignore')
    
    # Sort by algorithm and k for better readability
    results_df_display = results_df_display.sort_values(['algorithm', 'k']).reset_index(drop=True)
    
    # Display the results
    print("\n" + "="*70)
    print("CLUSTERING RESULTS SUMMARY")
    print("="*70)
    print(results_df_display.to_string(index=False))
    print("="*70)
    
    # Add interpretation notes if we have labels
    if y is not None:
        n_true_labels = len(np.unique(y))
        print("\n" + "="*70)
        print("INTERPRETATION NOTES")
        print("="*70)
        print(f"Number of true labels in dataset: {n_true_labels}")
        print()
        print("üìä Understanding External Metrics:")
        print("  ‚Ä¢ When k ‚â† number of true labels, external metrics can be misleading")
        print(f"  ‚Ä¢ For your data (k={n_true_labels} true labels), compare results at k={n_true_labels}")
        print()
        print("  Metric Interpretation:")
        print("  ‚Ä¢ ARI (Adjusted Rand Index): Most reliable - adjusts for chance")
        print("  ‚Ä¢ Purity: Can be misleading with many clusters (artificially high)")
        print("  ‚Ä¢ NRI: Similar to ARI but less adjusted for chance")
        print()
        print("  Best practice: Focus on ARI and compare k values close to true label count")
        print("="*70)
    
    # Log the summary table to wandb
    wandb.log({"results_table": wandb.Table(dataframe=results_df_display)})
    
    print("\n‚úì Results table logged to wandb!")


CLUSTERING RESULTS SUMMARY
    algorithm  k  n_clusters_found  silhouette_score  davies_bouldin_index  calinski_harabasz_index      nri      ari  purity
agglomerative  5                 5          0.048327              3.016025                29.635680 0.706613 0.144264   0.458
agglomerative  8                 8          0.055256              2.636638                25.035954 0.774092 0.209373   0.530
agglomerative 10                10          0.067424              2.310833                22.685463 0.771984 0.195912   0.530
agglomerative 16                16          0.065352              2.226921                19.000196 0.783319 0.138223   0.554
          gmm  5                 5          0.062725              2.726419                34.347058 0.724224 0.180220   0.474
          gmm  8                 8          0.066393              2.448907                27.904623 0.742477 0.113134   0.450
          gmm 10                10          0.069911              2.419840                

In [None]:
# ===========================
# CREATE EVALUATION METRIC GRAPHS FOR WANDB
# ===========================

if not all_results:
    print("‚ö† No results available. Please run Section 6 first.")
else:
    import matplotlib.pyplot as plt
    import tempfile
    
    print("\n" + "="*70)
    print("CREATING EVALUATION METRIC GRAPHS")
    print("="*70)
    
    # Get unique algorithms and k values
    algorithms = results_df_display['algorithm'].unique()
    k_values = sorted(results_df_display['k'].unique())
    
    # Color mapping for algorithms
    colors = {
        'kmeans': '#1f77b4',
        'agglomerative': '#98df8a',
        'gmm': '#2ca02c',
        'spectral': '#d62728'
    }
    
    # Marker styles for variety
    markers = {
        'kmeans': 'o',
        'agglomerative': '^',
        'gmm': '^',
        'spectral': 'D',
        'spectral': 'D'
    }
    
    # Define metrics to plot
    internal_metrics_list = [
        ('silhouette_score', 'Silhouette Score', 'Higher is Better'),
        ('davies_bouldin_index', 'Davies-Bouldin Index', 'Lower is Better'),
        ('calinski_harabasz_index', 'Calinski-Harabasz Index', 'Higher is Better')
    ]
    
    external_metrics_list = []
    if y is not None:
        external_metrics_list = [
            ('nri', 'Normalized Rand Index (NRI)', 'Higher is Better'),
            ('ari', 'Adjusted Rand Index (ARI)', 'Higher is Better'),
            ('purity', 'Purity Score', 'Higher is Better')
        ]
    
    all_metrics = internal_metrics_list + external_metrics_list
    
    # Create and log each metric graph
    temp_files = []
    
    for metric_col, metric_name, metric_note in all_metrics:
        # Create figure
        fig, ax = plt.subplots(figsize=(10, 6))
        
        # Plot each algorithm
        for algo in algorithms:
            algo_data = results_df_display[results_df_display['algorithm'] == algo]
            ax.plot(algo_data['k'], algo_data[metric_col],
                   marker=markers.get(algo, 'o'),
                   color=colors.get(algo, '#000000'),
                   linewidth=2.5,
                   markersize=10,
                   label=algo.upper(),
                   alpha=0.8)
        
        # Customize plot
        ax.set_xlabel('Number of Clusters (k)', fontsize=12, fontweight='bold')
        ax.set_ylabel(metric_name, fontsize=12, fontweight='bold')
        ax.set_title(f'{metric_name}\n({metric_note})', fontsize=14, fontweight='bold', pad=20)
        ax.legend(loc='best', fontsize=11, framealpha=0.9)
        ax.grid(True, alpha=0.3, linestyle='--')
        ax.set_xticks(k_values)
        
        # Tight layout
        plt.tight_layout()
        
        # Save to temporary file
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.png')
        plt.savefig(temp_file.name, dpi=150, bbox_inches='tight')
        temp_files.append((temp_file.name, metric_col))
        plt.close()
        
        print(f"  ‚úì Created graph for {metric_name}")
    
    # Log all graphs to wandb as media
    print("\n" + "-"*70)
    print("Logging graphs to WandB...")
    print("-"*70)
    
    wandb_log_dict = {}
    for temp_path, metric_col in temp_files:
        wandb_log_dict[f"metrics/{metric_col}"] = wandb.Image(temp_path)
    
    wandb.log(wandb_log_dict)
    
    # Clean up temporary files
    import os
    for temp_path, _ in temp_files:
        os.remove(temp_path)
    
    print(f"\n‚úì All {len(temp_files)} metric graphs logged to WandB as images!")
    print("="*70)


CREATING EVALUATION METRIC GRAPHS
  ‚úì Created graph for Silhouette Score
  ‚úì Created graph for Davies-Bouldin Index
  ‚úì Created graph for Calinski-Harabasz Index
  ‚úì Created graph for Davies-Bouldin Index
  ‚úì Created graph for Calinski-Harabasz Index
  ‚úì Created graph for Normalized Rand Index (NRI)
  ‚úì Created graph for Adjusted Rand Index (ARI)
  ‚úì Created graph for Normalized Rand Index (NRI)
  ‚úì Created graph for Adjusted Rand Index (ARI)
  ‚úì Created graph for Purity Score

----------------------------------------------------------------------
Logging graphs to WandB...
----------------------------------------------------------------------

‚úì All 6 metric graphs logged to WandB as images!
  ‚úì Created graph for Purity Score

----------------------------------------------------------------------
Logging graphs to WandB...
----------------------------------------------------------------------

‚úì All 6 metric graphs logged to WandB as images!


## 7B. Create and Log t-SNE Visualizations to WandB
Generate 2D and 3D t-SNE cluster visualizations and log them to WandB.

In [34]:
# ===========================
# CREATE AND LOG t-SNE VISUALIZATIONS TO WANDB
# ===========================

if not all_results:
    print("‚ö† No results available. Please run Section 6 first.")
else:
    import os
    import tempfile
    import matplotlib.pyplot as plt
    from matplotlib import cm
    import numpy as np
    
    print("\n" + "="*70)
    print("CREATING t-SNE VISUALIZATIONS FOR WANDB")
    print("="*70)
    
    # We'll create visualizations for all k values and all algorithms
    # Both 2D and 3D t-SNE plots as PNG images using matplotlib
    
    temp_viz_files = []
    
    # Group by k value for better organization
    k_values_to_viz = sorted(set([r['k'] for r in all_results]))
    
    for k_val in k_values_to_viz:
        print(f"\nCreating visualizations for k={k_val}:")
        print("-"*70)
        
        # Get results for this k value
        k_results = [r for r in all_results if r['k'] == k_val]
        
        for result in k_results:
            algo_name = result['algorithm']
            cluster_labels = result['cluster_labels']
            
            # Create 2D t-SNE plot using matplotlib
            fig, ax = plt.subplots(figsize=(12, 8))
            
            # Get unique clusters and create colormap
            n_clusters = len(np.unique(cluster_labels))
            colors = cm.tab20(np.linspace(0, 1, n_clusters))
            
            # Plot each cluster
            for cluster_id in np.unique(cluster_labels):
                mask = cluster_labels == cluster_id
                ax.scatter(X_tsne_2d[mask, 0], X_tsne_2d[mask, 1],
                          c=[colors[cluster_id]], label=f'Cluster {cluster_id}',
                          alpha=0.7, s=50, edgecolors='black', linewidth=0.5)
            
            ax.set_xlabel('t-SNE Component 1', fontsize=12, fontweight='bold')
            ax.set_ylabel('t-SNE Component 2', fontsize=12, fontweight='bold')
            ax.set_title(f'2D t-SNE: {algo_name.upper()} (k={k_val})', 
                        fontsize=14, fontweight='bold', pad=20)
            ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)
            ax.grid(True, alpha=0.3, linestyle='--')
            plt.tight_layout()
            
            # Save 2D plot to temporary PNG file
            temp_2d = tempfile.NamedTemporaryFile(delete=False, suffix='.png')
            plt.savefig(temp_2d.name, dpi=150, bbox_inches='tight')
            temp_viz_files.append((temp_2d.name, f"tsne_2d/{algo_name}_k{k_val}"))
            plt.close()
            
            # Create 3D t-SNE plot using matplotlib
            fig = plt.figure(figsize=(12, 9))
            ax = fig.add_subplot(111, projection='3d')
            
            # Plot each cluster
            for cluster_id in np.unique(cluster_labels):
                mask = cluster_labels == cluster_id
                ax.scatter(X_tsne_3d[mask, 0], X_tsne_3d[mask, 1], X_tsne_3d[mask, 2],
                          c=[colors[cluster_id]], label=f'Cluster {cluster_id}',
                          alpha=0.7, s=50, edgecolors='black', linewidth=0.5)
            
            ax.set_xlabel('t-SNE Component 1', fontsize=11, fontweight='bold')
            ax.set_ylabel('t-SNE Component 2', fontsize=11, fontweight='bold')
            ax.set_zlabel('t-SNE Component 3', fontsize=11, fontweight='bold')
            ax.set_title(f'3D t-SNE: {algo_name.upper()} (k={k_val})', 
                        fontsize=14, fontweight='bold', pad=20)
            ax.legend(bbox_to_anchor=(1.15, 1), loc='upper left', fontsize=9)
            ax.grid(True, alpha=0.3)
            plt.tight_layout()
            
            # Save 3D plot to temporary PNG file
            temp_3d = tempfile.NamedTemporaryFile(delete=False, suffix='.png')
            plt.savefig(temp_3d.name, dpi=150, bbox_inches='tight')
            temp_viz_files.append((temp_3d.name, f"tsne_3d/{algo_name}_k{k_val}"))
            plt.close()
            
            print(f"  ‚úì Created 2D and 3D PNG plots for {algo_name.upper()}")
    
    # Log all visualizations to wandb as images
    print("\n" + "-"*70)
    print("Logging t-SNE visualizations to WandB...")
    print("-"*70)
    
    wandb_viz_dict = {}
    for temp_path, wandb_key in temp_viz_files:
        wandb_viz_dict[wandb_key] = wandb.Image(temp_path)
    
    wandb.log(wandb_viz_dict)
    
    # Clean up temporary files
    for temp_path, _ in temp_viz_files:
        os.remove(temp_path)
    
    print(f"\n‚úì All {len(temp_viz_files)} t-SNE visualizations logged to WandB as PNG images!")
    print(f"  ‚Ä¢ {len(temp_viz_files)//2} 2D plots")
    print(f"  ‚Ä¢ {len(temp_viz_files)//2} 3D plots")
    print("="*70)



CREATING t-SNE VISUALIZATIONS FOR WANDB

Creating visualizations for k=5:
----------------------------------------------------------------------
  ‚úì Created 2D and 3D PNG plots for KMEANS
  ‚úì Created 2D and 3D PNG plots for KMEANS
  ‚úì Created 2D and 3D PNG plots for AGGLOMERATIVE
  ‚úì Created 2D and 3D PNG plots for AGGLOMERATIVE
  ‚úì Created 2D and 3D PNG plots for GMM
  ‚úì Created 2D and 3D PNG plots for GMM
  ‚úì Created 2D and 3D PNG plots for SPECTRAL
  ‚úì Created 2D and 3D PNG plots for SPECTRAL
  ‚úì Created 2D and 3D PNG plots for HDBSCAN

Creating visualizations for k=8:
----------------------------------------------------------------------
  ‚úì Created 2D and 3D PNG plots for HDBSCAN

Creating visualizations for k=8:
----------------------------------------------------------------------
  ‚úì Created 2D and 3D PNG plots for KMEANS
  ‚úì Created 2D and 3D PNG plots for KMEANS
  ‚úì Created 2D and 3D PNG plots for AGGLOMERATIVE
  ‚úì Created 2D and 3D PNG plots for 

## 8. Analyze Best Performing Configurations
Find the best algorithm for each k value based on different metrics.

In [35]:
# ===========================
# FIND BEST CONFIGURATIONS
# ===========================

print("\n" + "="*70)
print("BEST PERFORMING CONFIGURATIONS")
print("="*70)

# Best by Silhouette Score (higher is better)
print("\n1. Best by Silhouette Score (Higher is Better):")
print("-"*70)
best_silhouette = results_df_display.loc[results_df_display['silhouette_score'].idxmax()]
print(f"   Algorithm: {best_silhouette['algorithm']}")
print(f"   k: {best_silhouette['k']}")
print(f"   Silhouette Score: {best_silhouette['silhouette_score']:.4f}")

# Best by Davies-Bouldin Index (lower is better)
print("\n2. Best by Davies-Bouldin Index (Lower is Better):")
print("-"*70)
best_db = results_df_display.loc[results_df_display['davies_bouldin_index'].idxmin()]
print(f"   Algorithm: {best_db['algorithm']}")
print(f"   k: {best_db['k']}")
print(f"   Davies-Bouldin Index: {best_db['davies_bouldin_index']:.4f}")

# Best by Calinski-Harabasz Index (higher is better)
print("\n3. Best by Calinski-Harabasz Index (Higher is Better):")
print("-"*70)
best_ch = results_df_display.loc[results_df_display['calinski_harabasz_index'].idxmax()]
print(f"   Algorithm: {best_ch['algorithm']}")
print(f"   k: {best_ch['k']}")
print(f"   Calinski-Harabasz Index: {best_ch['calinski_harabasz_index']:.4f}")

# If external metrics are available
if y is not None:
    print("\n4. Best by Adjusted Rand Index (Higher is Better):")
    print("-"*70)
    best_ari = results_df_display.loc[results_df_display['ari'].idxmax()]
    print(f"   Algorithm: {best_ari['algorithm']}")
    print(f"   k: {best_ari['k']}")
    print(f"   ARI: {best_ari['ari']:.4f}")
    
    print("\n5. Best by Purity (Higher is Better):")
    print("-"*70)
    best_purity = results_df_display.loc[results_df_display['purity'].idxmax()]
    print(f"   Algorithm: {best_purity['algorithm']}")
    print(f"   k: {best_purity['k']}")
    print(f"   Purity: {best_purity['purity']:.4f}")

print("\n" + "="*70)


BEST PERFORMING CONFIGURATIONS

1. Best by Silhouette Score (Higher is Better):
----------------------------------------------------------------------
   Algorithm: hdbscan
   k: 5
   Silhouette Score: 0.0989

2. Best by Davies-Bouldin Index (Lower is Better):
----------------------------------------------------------------------
   Algorithm: agglomerative
   k: 16
   Davies-Bouldin Index: 2.2269

3. Best by Calinski-Harabasz Index (Higher is Better):
----------------------------------------------------------------------
   Algorithm: kmeans
   k: 5
   Calinski-Harabasz Index: 35.8664

4. Best by Adjusted Rand Index (Higher is Better):
----------------------------------------------------------------------
   Algorithm: agglomerative
   k: 8
   ARI: 0.2094

5. Best by Purity (Higher is Better):
----------------------------------------------------------------------
   Algorithm: spectral
   k: 16
   Purity: 0.5640



## 8A. Visualize Best Configurations
Display interactive 2D and 3D t-SNE plots for the best performing configurations.

In [36]:
# ===========================
# VISUALIZE BEST CONFIGURATIONS
# ===========================

print("\n" + "="*70)
print("INTERACTIVE VISUALIZATIONS - BEST CONFIGURATIONS")
print("="*70)

# Find best configuration by Silhouette Score
best_silhouette_idx = results_df_display['silhouette_score'].idxmax()
best_config = all_results[best_silhouette_idx]

print(f"\nShowing visualizations for best configuration:")
print(f"Algorithm: {best_config['algorithm']}")
print(f"k: {best_config['k']}")
print(f"Silhouette Score: {best_config['silhouette_score']:.4f}")
print("-"*70)

# Create and display 2D visualization
print("\nüìä 2D t-SNE Visualization:")
fig_2d_best = create_2d_cluster_plot(
    X_tsne_2d, 
    best_config['cluster_labels'], 
    y, 
    best_config['algorithm'], 
    best_config['k']
)
fig_2d_best.show()

# Create and display 3D visualization
print("\nüìä 3D t-SNE Visualization:")
fig_3d_best = create_3d_cluster_plot(
    X_tsne_3d, 
    best_config['cluster_labels'], 
    y, 
    best_config['algorithm'], 
    best_config['k']
)
fig_3d_best.show()

print("\n" + "="*70)
print("‚úì Interactive plots displayed above")
print("‚úì All plots are also available in your wandb dashboard")
print("="*70)


INTERACTIVE VISUALIZATIONS - BEST CONFIGURATIONS

Showing visualizations for best configuration:
Algorithm: gmm
k: 5
Silhouette Score: 0.0627
----------------------------------------------------------------------

üìä 2D t-SNE Visualization:



üìä 3D t-SNE Visualization:



‚úì Interactive plots displayed above
‚úì All plots are also available in your wandb dashboard


## 8B. Compare All Algorithms Visually
Display a comparison grid of visualizations for all algorithms at a specific k value.

In [37]:
# ===========================
# COMPARE ALL ALGORITHMS VISUALLY
# ===========================

# Choose a k value to compare across all algorithms (use the first k value)
compare_k = K_VALUES[0]

print("\n" + "="*70)
print(f"VISUAL COMPARISON: All Algorithms with k={compare_k}")
print("="*70)

# Filter results for the chosen k value
compare_results = [r for r in all_results if r['k'] == compare_k]

# Display 2D visualizations for each algorithm
print(f"\nüìä 2D t-SNE Comparisons (k={compare_k}):")
print("-"*70)

for result in compare_results:
    print(f"\n{result['algorithm'].upper()}:")
    fig = create_2d_cluster_plot(
        X_tsne_2d,
        result['cluster_labels'],
        y,
        result['algorithm'],
        result['k']
    )
    fig.show()

print("\n" + "="*70)
print("‚úì Visual comparisons displayed")
print(f"  Tip: Change 'compare_k' variable to compare different k values")
print("="*70)


VISUAL COMPARISON: All Algorithms with k=5

üìä 2D t-SNE Comparisons (k=5):
----------------------------------------------------------------------

KMEANS:



AGGLOMERATIVE:



GMM:



SPECTRAL:



HDBSCAN:



‚úì Visual comparisons displayed
  Tip: Change 'compare_k' variable to compare different k values


## 9. Save Results to CSV
Save the results DataFrame to a CSV file for later analysis.

In [38]:
# ===========================
# SAVE RESULTS TO CSV
# ===========================

# Create output directory if it doesn't exist
output_dir = Path("results") / "clustering_comparison"
output_dir.mkdir(parents=True, exist_ok=True)

# Generate output filename based on dataset name
dataset_name = Path(DATA_CSV_PATH).stem
output_file = output_dir / f"{dataset_name}_clustering_results.csv"

# Save results to CSV (without cluster_labels column)
results_df_display.to_csv(output_file, index=False)

print(f"\n‚úì Results saved to: {output_file}")

# Also save to wandb as an artifact
artifact = wandb.Artifact(
    name=f"{dataset_name}_clustering_results",
    type="results",
    description=f"Clustering comparison results for {dataset_name}"
)
artifact.add_file(str(output_file))
wandb.log_artifact(artifact)

print(f"‚úì Results artifact logged to wandb!")


‚úì Results saved to: results/clustering_comparison/indian_pca_clustering_results.csv
‚úì Results artifact logged to wandb!
‚úì Results artifact logged to wandb!


## 10. Finish wandb Run
Close the wandb run to finalize logging.

In [39]:
# ===========================
# FINISH WANDB RUN
# ===========================

# Close the wandb run
wandb.finish()

print("\n" + "="*70)
print("EXPERIMENT COMPLETED SUCCESSFULLY!")
print("="*70)
print(f"‚úì Tested {len(algorithms)} algorithms")
print(f"‚úì Tested {len(K_VALUES)} k values")
print(f"‚úì Total experiments: {len(all_results)}")
print(f"‚úì Results saved to: {output_file}")
print(f"‚úì All metrics logged to wandb")
print("="*70)
print("\nYou can view your results at: https://wandb.ai")
print("="*70)


EXPERIMENT COMPLETED SUCCESSFULLY!
‚úì Tested 5 algorithms
‚úì Tested 4 k values
‚úì Total experiments: 18
‚úì Results saved to: results/clustering_comparison/indian_pca_clustering_results.csv
‚úì All metrics logged to wandb

You can view your results at: https://wandb.ai


---

## Optional: Quick Configuration Examples

Below are some quick configuration examples for different datasets. Uncomment and modify as needed.

In [40]:
# ===========================
# CONFIGURATION EXAMPLES
# ===========================

# Example 1: GTZAN Dataset
# DATA_CSV_PATH = "data/clustering_ready/gtzan_clustering.csv"
# LABEL_CSV_PATH = "data/label_references/gtzan_labels.csv"
# LABEL_CSV_LABEL_COLUMN = "label"

# Example 2: FMA Small Dataset
# DATA_CSV_PATH = "data/clustering_ready/fma_small_clustering.csv"
# LABEL_CSV_PATH = "data/label_references/fma_small_labels.csv"
# LABEL_CSV_LABEL_COLUMN = "label"

# Example 3: FMA Medium Dataset
# DATA_CSV_PATH = "data/clustering_ready/fma_medium_clustering.csv"
# LABEL_CSV_PATH = "data/label_references/fma_medium_labels.csv"
# LABEL_CSV_LABEL_COLUMN = "label"

# Example 4: Instrumental Dataset
# DATA_CSV_PATH = "data/clustering_ready/instrumental_clustering.csv"
# LABEL_CSV_PATH = "data/label_references/instrumental_labels.csv"
# LABEL_CSV_LABEL_COLUMN = "label"

# Example 5: Dataset without labels (internal metrics only)
# DATA_CSV_PATH = "data/clustering_ready/your_dataset.csv"
# LABEL_CSV_PATH = None
# LABEL_CSV_LABEL_COLUMN = None

print("Configuration examples provided above. Modify as needed.")

Configuration examples provided above. Modify as needed.


---

## Installation Requirements

Before running this notebook, make sure you have all required packages installed:

```bash
pip install pandas numpy scikit-learn wandb plotly
```

## How to Use This Notebook

1. **Login to wandb** (run once):
   ```python
   import wandb
   wandb.login()
   ```

2. **Configure your experiment** in Section 1:
   - Set `DATA_CSV_PATH` to your PCA-reduced dataset
   - Set label paths if you have ground truth labels
   - Configure `WANDB_PROJECT` name
   - Adjust `K_VALUES` if needed

3. **Run all cells** sequentially

4. **View results**:
   - Interactive 2D and 3D t-SNE plots will be displayed in the notebook
   - Results CSV will be saved in `results/clustering_comparison/`
   - All metrics and visualizations will be logged to wandb dashboard

## Features

### Algorithms
- **KMeans**: Classic centroid-based clustering
- **GMM**: Probabilistic clustering with Gaussian distributions
- **Spectral**: Graph-based clustering for non-convex clusters

### Visualizations
- **2D t-SNE plots**: Interactive 2D scatter plots showing cluster distributions
- **3D t-SNE plots**: Interactive 3D scatter plots for deeper insights
- **Comparison views**: Side-by-side comparisons of different algorithms
- **Color coding**: Clusters are colored distinctly for easy identification
- **True labels overlay**: If labels are available, hover to see both cluster and true label

### Metrics
All visualizations are logged to wandb along with performance metrics:
- Internal metrics (Silhouette, Davies-Bouldin, Calinski-Harabasz)
- External metrics (NRI, ARI, Purity) if labels are available

## Metrics Explanation

### Internal Metrics (No labels required):
- **Silhouette Score**: Range [-1, 1], higher is better. Measures how similar samples are to their own cluster vs other clusters.
- **Davies-Bouldin Index**: Range [0, ‚àû), lower is better. Ratio of within-cluster to between-cluster distances.
- **Calinski-Harabasz Index**: Range [0, ‚àû), higher is better. Ratio of between-cluster to within-cluster variance.

### External Metrics (Labels required):
- **NRI (Normalized Rand Index)**: Range [0, 1], higher is better. Measures similarity between clusterings.
- **ARI (Adjusted Rand Index)**: Range [-1, 1], higher is better. Adjusted-for-chance version of Rand Index.
- **Purity**: Range [0, 1], higher is better. Percentage of correctly clustered samples.