# FMA Music Genre Clustering - Complete Implementation

## Unsupervised Music Genre Discovery Using Audio Feature Learning

This notebook implements an end-to-end pipeline that groups tracks from the FMA (Free Music Archive) dataset into genre-like clusters, using hand-crafted audio features and unsupervised clustering algorithms.

### Pipeline Overview:
1. **Feature Extraction**: Extract audio features (MFCCs, Chroma, Spectral features, Tempo)
2. **Data Analysis**: Statistical analysis, outlier detection, and data cleaning
3. **Data Preprocessing**: Standardization and dimensionality reduction (PCA)
4. **Clustering**: K-Means, Spectral Clustering, DBSCAN, and GMM
5. **Evaluation**: Internal metrics (Silhouette, Davies-Bouldin, Calinski-Harabasz); external metrics are left as a next step
6. **Visualization**: Results comparison and analysis

---

## 1. Import Required Libraries

In [None]:
# Core libraries
import os
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Audio processing
import librosa
import soundfile

# Machine learning and clustering
from sklearn.cluster import KMeans, MiniBatchKMeans, SpectralClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

# Evaluation metrics
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
    v_measure_score
)
from scipy.optimize import linear_sum_assignment

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Utilities
from tqdm import tqdm
from datetime import datetime

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ All libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Librosa version: {librosa.__version__}")

## 2. Configuration and Setup

Define paths and parameters for the pipeline.

In [None]:
# Configuration
CONFIG = {
    'data_path': 'fma_small',
    'output_dir': 'output/results',
    'sr': 22050,  # Sample rate
    'duration': 30,  # Duration in seconds
    'n_mfcc': 20,  # Number of MFCC coefficients (note: extract_audio_features hard-codes 20)
    'n_clusters': 10,  # Number of clusters
    'n_pca_components': 20,  # PCA components
    'random_state': 42,
    'max_files': None  # Set to a number for testing (e.g., 100)
}

# Create output directory
os.makedirs(CONFIG['output_dir'], exist_ok=True)

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

## 3. Feature Extraction from Audio Files

Each track is summarized by the mean and standard deviation over time of the following frame-level features (tempo is stored as a single value):
- **MFCCs** (Mel-Frequency Cepstral Coefficients): 20 coefficients + delta + delta-delta
- **Chroma Features**: 12 pitch classes
- **Spectral Features**: Centroid, Rolloff, Bandwidth
- **Temporal Features**: Zero Crossing Rate, Tempo, RMS Energy
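
As a quick sanity check on the list above, the length of the resulting feature vector can be worked out from the number of series in each group. This is a minimal sketch: the `groups` dictionary is a hypothetical summary of the feature names built by `extract_audio_features`, assuming a mean and a standard deviation for every frame-level series and a single tempo value.

```python
# Number of frame-level series per feature group (mirrors the list above).
groups = {
    "mfcc": 20,
    "mfcc_delta": 20,
    "mfcc_delta2": 20,
    "chroma": 12,
    "spectral_centroid": 1,
    "spectral_rolloff": 1,
    "spectral_bandwidth": 1,
    "zcr": 1,
    "rms": 1,
}

STATS_PER_SERIES = 2  # mean and standard deviation
SCALAR_FEATURES = 1   # tempo is stored as a single value

n_features = sum(groups.values()) * STATS_PER_SERIES + SCALAR_FEATURES
print(n_features)  # 155
```

Comparing this number with `features_df.shape[1] - 1` after extraction is a cheap way to confirm that no feature group was dropped.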

In [None]:
def extract_audio_features(audio_path, sr=22050, duration=30):
    """
    Extract comprehensive audio features from a single audio file.
    
    Parameters:
    -----------
    audio_path : str
        Path to the audio file
    sr : int
        Sample rate
    duration : int
        Duration in seconds
        
    Returns:
    --------
    dict : Dictionary containing all extracted features
    """
    try:
        # Load audio
        y, sr = librosa.load(audio_path, sr=sr, duration=duration)
        
        features = {}
        
        # 1. MFCCs (20 coefficients)
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
        for i in range(20):
            features[f'mfcc_{i}_mean'] = np.mean(mfccs[i])
            features[f'mfcc_{i}_std'] = np.std(mfccs[i])
        
        # 2. Delta MFCCs (temporal dynamics)
        mfcc_delta = librosa.feature.delta(mfccs)
        for i in range(20):
            features[f'mfcc_delta_{i}_mean'] = np.mean(mfcc_delta[i])
            features[f'mfcc_delta_{i}_std'] = np.std(mfcc_delta[i])
        
        # 3. Delta-Delta MFCCs (acceleration)
        mfcc_delta2 = librosa.feature.delta(mfccs, order=2)
        for i in range(20):
            features[f'mfcc_delta2_{i}_mean'] = np.mean(mfcc_delta2[i])
            features[f'mfcc_delta2_{i}_std'] = np.std(mfcc_delta2[i])
        
        # 4. Chroma features (12 pitch classes)
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)
        for i in range(12):
            features[f'chroma_{i}_mean'] = np.mean(chroma[i])
            features[f'chroma_{i}_std'] = np.std(chroma[i])
        
        # 5. Spectral Centroid
        spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
        features['spectral_centroid_mean'] = np.mean(spectral_centroid)
        features['spectral_centroid_std'] = np.std(spectral_centroid)
        
        # 6. Spectral Rolloff
        spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
        features['spectral_rolloff_mean'] = np.mean(spectral_rolloff)
        features['spectral_rolloff_std'] = np.std(spectral_rolloff)
        
        # 7. Spectral Bandwidth
        spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
        features['spectral_bandwidth_mean'] = np.mean(spectral_bandwidth)
        features['spectral_bandwidth_std'] = np.std(spectral_bandwidth)
        
        # 8. Zero Crossing Rate
        zcr = librosa.feature.zero_crossing_rate(y)
        features['zcr_mean'] = np.mean(zcr)
        features['zcr_std'] = np.std(zcr)
        
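        # 9. Tempo (newer librosa versions may return a 1-element array, so coerce to a float)
        tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
        features['tempo'] = float(np.atleast_1d(tempo)[0])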
        
        # 10. RMS Energy
        rms = librosa.feature.rms(y=y)
        features['rms_mean'] = np.mean(rms)
        features['rms_std'] = np.std(rms)
        
        return features
    
    except Exception as e:
        print(f"Error processing {audio_path}: {str(e)}")
        return None

print("✓ Feature extraction function defined")

In [None]:
def extract_all_features(data_path, max_files=None):
    """
    Extract features from all audio files in the dataset.
    
    Parameters:
    -----------
    data_path : str
        Path to FMA dataset directory
    max_files : int, optional
        Maximum number of files to process
        
    Returns:
    --------
    pd.DataFrame : DataFrame with extracted features
    """
    features_list = []
    audio_files = []
    
    # Collect all MP3 files (sorted so that max_files selects a reproducible subset)
    for root, dirs, files in os.walk(data_path):
        for file in files:
            if file.endswith('.mp3'):
                audio_files.append(os.path.join(root, file))
    audio_files.sort()
    
    if max_files:
        audio_files = audio_files[:max_files]
    
    print(f"Found {len(audio_files)} audio files")
    print("Extracting features...")
    
    # Extract features with progress bar
    for audio_path in tqdm(audio_files):
        features = extract_audio_features(
            audio_path,
            sr=CONFIG['sr'],
            duration=CONFIG['duration']
        )
        
        if features:
            # Add track_id (from filename)
            track_id = os.path.basename(audio_path).replace('.mp3', '')
            features['track_id'] = track_id
            features_list.append(features)
    
    # Create DataFrame
    df = pd.DataFrame(features_list)
    
    # Reorder columns (track_id first)
    cols = ['track_id'] + [col for col in df.columns if col != 'track_id']
    df = df[cols]
    
    print(f"\n✓ Feature extraction complete!")
    print(f"  Total tracks processed: {len(df)}")
    print(f"  Total features per track: {len(df.columns) - 1}")
    
    return df

print("✓ Batch extraction function defined")

### 3.1 Extract Features (Optional - Skip if already extracted)

**Note**: If you have already extracted features and saved them to a CSV file, you can skip this cell and load the existing features in the next section.

In [None]:
# Extract features from audio files
# Uncomment to run feature extraction
# features_df = extract_all_features(CONFIG['data_path'], max_files=CONFIG['max_files'])
# features_df.to_csv(os.path.join(CONFIG['output_dir'], 'extracted_features.csv'), index=False)
# print(f"Features saved to {CONFIG['output_dir']}/extracted_features.csv")

# Load existing features (if already extracted)
features_path = os.path.join(CONFIG['output_dir'], 'extracted_features.csv')
if os.path.exists(features_path):
    # Read track_id as text so zero-padded IDs such as '000002' are preserved
    features_df = pd.read_csv(features_path, dtype={'track_id': str})
    print(f"✓ Loaded existing features from {features_path}")
    print(f"  Shape: {features_df.shape}")
else:
    print(f"⚠ Features file not found at {features_path}")
    print("  Please run feature extraction or check the path")

## 4. Data Analysis and Cleaning

Perform statistical analysis, outlier detection, and data cleaning.

In [None]:
# Basic data information
print("="*70)
print("DATA OVERVIEW")
print("="*70)
print(f"Shape: {features_df.shape}")
print(f"Total tracks: {len(features_df)}")
print(f"Total features: {len(features_df.columns) - 1}")
print(f"\nFirst few columns: {list(features_df.columns[:10])}")
print(f"\nData types:\n{features_df.dtypes.value_counts()}")

# Check for missing values
missing = features_df.isnull().sum()
if missing.sum() > 0:
    print(f"\n⚠ Missing values detected:")
    print(missing[missing > 0])
else:
    print("\n✓ No missing values")

# Display sample
features_df.head()

In [None]:
# Descriptive statistics
print("="*70)
print("DESCRIPTIVE STATISTICS")
print("="*70)

# Exclude track_id column
numeric_cols = [col for col in features_df.columns if col != 'track_id']
stats_df = features_df[numeric_cols].describe()

print(stats_df)

# Show a subset of features for better visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
fig.suptitle('Distribution of Selected Features', fontsize=16)

sample_features = numeric_cols[:6]
for idx, feature in enumerate(sample_features):
    ax = axes[idx // 3, idx % 3]
    ax.hist(features_df[feature], bins=30, edgecolor='black', alpha=0.7)
    ax.set_title(feature)
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    
plt.tight_layout()
plt.show()

In [None]:
# Outlier detection using IQR method
def detect_outliers_iqr(df, columns, threshold=1.5):
    """Detect outliers using IQR method."""
    outlier_indices = set()
    
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        
        lower_bound = Q1 - threshold * IQR
        upper_bound = Q3 + threshold * IQR
        
        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)].index
        outlier_indices.update(outliers)
    
    return list(outlier_indices)

print("="*70)
print("OUTLIER DETECTION")
print("="*70)

numeric_cols = [col for col in features_df.columns if col != 'track_id']
outlier_indices = detect_outliers_iqr(features_df, numeric_cols)

print(f"Total outliers detected: {len(outlier_indices)}")
print(f"Percentage of outliers: {len(outlier_indices)/len(features_df)*100:.2f}%")

# Option 1: Remove outliers
# features_cleaned = features_df.drop(outlier_indices).reset_index(drop=True)

# Option 2: Keep all data (recommended for music features)
features_cleaned = features_df.copy()

print(f"\nCleaned data shape: {features_cleaned.shape}")
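
The choice to keep all rows can be motivated with a small experiment. `detect_outliers_iqr` flags a track as soon as any one of its features falls outside the IQR fence, so the share of flagged tracks grows quickly with the number of features. The sketch below uses synthetic Gaussian data (an assumption, not FMA features) with the same number of columns as the extracted feature set.

```python
import numpy as np

rng = np.random.default_rng(42)
n_tracks, n_features = 2000, 155
# Synthetic, well-behaved data: every column is standard normal.
X_demo = rng.normal(size=(n_tracks, n_features))

# Per-column IQR fences, same rule as detect_outliers_iqr with threshold=1.5.
q1, q3 = np.percentile(X_demo, [25, 75], axis=0)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outside = (X_demo < lower) | (X_demo > upper)  # shape (n_tracks, n_features)
value_rate = outside.mean()                    # fraction of individual values flagged
row_rate = outside.any(axis=1).mean()          # fraction of tracks flagged at least once

print(f"values flagged: {value_rate:.2%}")
print(f"tracks flagged: {row_rate:.1%}")
```

Even though fewer than 1 in 50 individual values is flagged here, a large share of rows contains at least one such value, so dropping every flagged track would discard much of an otherwise clean dataset.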

## 5. Data Preprocessing

Standardize features and apply PCA for dimensionality reduction.

In [None]:
print("="*70)
print("DATA PREPROCESSING")
print("="*70)

# Separate track IDs and features
track_ids = features_cleaned['track_id'].values
X = features_cleaned.drop('track_id', axis=1).values

print(f"Original feature matrix shape: {X.shape}")

# Step 1: Standardization (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"✓ Features standardized (mean=0, std=1)")

# Step 2: PCA for dimensionality reduction
n_components = CONFIG['n_pca_components']
pca = PCA(n_components=n_components, random_state=CONFIG['random_state'])
X_pca = pca.fit_transform(X_scaled)

explained_variance = pca.explained_variance_ratio_.sum()
print(f"✓ PCA applied: {X.shape[1]} features → {n_components} components")
print(f"  Explained variance: {explained_variance*100:.2f}%")

# Visualize explained variance
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(range(1, n_components+1), pca.explained_variance_ratio_, 'bo-')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot - Individual Explained Variance')
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(range(1, n_components+1), np.cumsum(pca.explained_variance_ratio_), 'ro-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance')
plt.axhline(y=0.95, color='g', linestyle='--', label='95% threshold')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

print(f"\n✓ Preprocessed data shape: {X_pca.shape}")
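
If the fixed `n_pca_components` setting falls short of the 95% line in the plot above, scikit-learn can choose the number of components from a variance target instead. The following is a minimal sketch on synthetic data (the latent-factor construction and the `X_demo` names are assumptions for illustration), not the configuration used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: 50 observed features driven by 8 latent factors.
latent = rng.normal(size=(500, 8))
mixing = rng.normal(size=(8, 50))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 50))

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# A float in (0, 1) asks PCA for the smallest number of components whose
# cumulative explained variance exceeds that fraction.
pca_demo = PCA(n_components=0.95, svd_solver="full")
X_demo_pca = pca_demo.fit_transform(X_demo_scaled)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"components kept: {pca_demo.n_components_}")
print(f"cumulative explained variance: {cumulative[-1]:.3f}")
```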

## 6. Clustering Algorithms

Implement multiple clustering algorithms:
1. **K-Means**: Partitioning-based clustering
2. **Spectral Clustering**: Graph-based clustering
3. **DBSCAN**: Density-based clustering
4. **GMM**: Probabilistic model-based clustering

### 6.1 K-Means Clustering

In [None]:
print("="*70)
print("K-MEANS CLUSTERING")
print("="*70)

# Apply K-Means
kmeans = KMeans(
    n_clusters=CONFIG['n_clusters'],
    random_state=CONFIG['random_state'],
    n_init=10,
    max_iter=300
)

labels_kmeans = kmeans.fit_predict(X_pca)

print(f"✓ K-Means clustering complete")
print(f"  Number of clusters: {CONFIG['n_clusters']}")
print(f"  Inertia: {kmeans.inertia_:.2f}")

# Cluster distribution
unique, counts = np.unique(labels_kmeans, return_counts=True)
print(f"\nCluster distribution:")
for cluster, count in zip(unique, counts):
    print(f"  Cluster {cluster}: {count} tracks ({count/len(labels_kmeans)*100:.1f}%)")
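
The inertia reported above is the within-cluster sum of squared distances that K-Means minimizes. A small sketch on synthetic blobs (assumed data, not the FMA features) recomputes it by hand from the fitted centroids and labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

# Inertia = sum of squared Euclidean distances from each point to its assigned centroid.
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X_demo - assigned_centroids) ** 2)

print(f"manual:  {manual_inertia:.2f}")
print(f"sklearn: {km.inertia_:.2f}")
```

Because inertia always decreases as the number of clusters grows, it is useful for comparing runs with the same `n_clusters`, not for choosing `n_clusters` directly.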

### 6.2 Spectral Clustering

In [None]:
print("="*70)
print("SPECTRAL CLUSTERING")
print("="*70)

# Apply Spectral Clustering
spectral = SpectralClustering(
    n_clusters=CONFIG['n_clusters'],
    random_state=CONFIG['random_state'],
    affinity='nearest_neighbors',
    n_neighbors=10
)

labels_spectral = spectral.fit_predict(X_pca)

print(f"✓ Spectral clustering complete")
print(f"  Number of clusters: {CONFIG['n_clusters']}")

# Cluster distribution
unique, counts = np.unique(labels_spectral, return_counts=True)
print(f"\nCluster distribution:")
for cluster, count in zip(unique, counts):
    print(f"  Cluster {cluster}: {count} tracks ({count/len(labels_spectral)*100:.1f}%)")
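
To illustrate why a graph-based method is included alongside K-Means, the sketch below compares the two on the classic two-moons toy dataset, where the clusters are non-convex. The data and the `ari_*` names are illustrative assumptions; the result says nothing about how the algorithms behave on the FMA features.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X_demo, y_demo = make_moons(n_samples=300, noise=0.05, random_state=0)

labels_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)
labels_sc = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
).fit_predict(X_demo)

# Adjusted Rand Index against the known moon membership (1.0 = perfect agreement).
ari_kmeans = adjusted_rand_score(y_demo, labels_km)
ari_spectral = adjusted_rand_score(y_demo, labels_sc)
print(f"K-Means ARI:  {ari_kmeans:.2f}")
print(f"Spectral ARI: {ari_spectral:.2f}")
```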

### 6.3 DBSCAN Clustering

In [None]:
print("="*70)
print("DBSCAN CLUSTERING")
print("="*70)

# Determine optimal eps using k-distance plot
k = 5
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X_pca)
distances, indices = neighbors.kneighbors(X_pca)
distances = np.sort(distances[:, k-1], axis=0)

# Plot k-distance
plt.figure(figsize=(10, 5))
plt.plot(distances)
plt.xlabel('Data Points sorted by distance')
plt.ylabel(f'{k}-NN Distance')
plt.title(f'K-Distance Plot (k={k})')
plt.grid(True)
plt.show()

# Apply DBSCAN with chosen eps and min_samples
eps = 2.5  # Adjust based on the k-distance plot
min_samples = 5

dbscan = DBSCAN(eps=eps, min_samples=min_samples)
labels_dbscan = dbscan.fit_predict(X_pca)

n_clusters = len(set(labels_dbscan)) - (1 if -1 in labels_dbscan else 0)
n_noise = list(labels_dbscan).count(-1)

print(f"✓ DBSCAN clustering complete")
print(f"  eps: {eps}, min_samples: {min_samples}")
print(f"  Number of clusters: {n_clusters}")
print(f"  Number of noise points: {n_noise} ({n_noise/len(labels_dbscan)*100:.1f}%)")

# Cluster distribution (excluding noise)
unique, counts = np.unique(labels_dbscan[labels_dbscan != -1], return_counts=True)
if len(unique) > 0:
    print(f"\nCluster distribution (excluding noise):")
    for cluster, count in zip(unique, counts):
        print(f"  Cluster {cluster}: {count} tracks")
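
One detail of the k-distance plot is worth spelling out. When `kneighbors` is queried with the same points the model was fitted on, each point is returned as its own nearest neighbor at distance 0, so column `k-1` holds the distance to the `(k-1)`-th *other* point. This matches DBSCAN's `min_samples`, which also counts the point itself. The sketch below, on synthetic data standing in for `X_pca`, shows both query styles.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))  # synthetic stand-in for X_pca

k = 5
nn = NearestNeighbors(n_neighbors=k).fit(X_demo)

# Querying with the training points: column 0 is each point's distance to itself.
dist_with_self, _ = nn.kneighbors(X_demo)
# Querying without arguments: a point is not counted as its own neighbor.
dist_without_self, _ = nn.kneighbors()

k_distance_incl_self = np.sort(dist_with_self[:, k - 1])
k_distance_excl_self = np.sort(dist_without_self[:, k - 1])

print(f"max k-distance (self included): {k_distance_incl_self[-1]:.3f}")
print(f"max k-distance (self excluded): {k_distance_excl_self[-1]:.3f}")
```

Either convention works as long as it is applied consistently; the curve that excludes the point itself sits slightly higher, so an `eps` read from it will be slightly larger.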

### 6.4 Gaussian Mixture Model (GMM)

In [None]:
print("="*70)
print("GAUSSIAN MIXTURE MODEL (GMM)")
print("="*70)

# Apply GMM
gmm = GaussianMixture(
    n_components=CONFIG['n_clusters'],
    random_state=CONFIG['random_state'],
    covariance_type='full',
    max_iter=100
)

labels_gmm = gmm.fit_predict(X_pca)

print(f"✓ GMM clustering complete")
print(f"  Number of components: {CONFIG['n_clusters']}")
print(f"  Converged: {gmm.converged_}")
print(f"  BIC: {gmm.bic(X_pca):.2f}")
print(f"  AIC: {gmm.aic(X_pca):.2f}")

# Cluster distribution
unique, counts = np.unique(labels_gmm, return_counts=True)
print(f"\nCluster distribution:")
for cluster, count in zip(unique, counts):
    print(f"  Cluster {cluster}: {count} tracks ({count/len(labels_gmm)*100:.1f}%)")
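
The BIC printed above is most useful when compared across several component counts: lower is better. The sketch below is one way such a sweep might look, using synthetic blobs with hand-picked centers (an assumption for illustration) and also showing the soft assignments returned by `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

# Fit one model per candidate component count and record its BIC.
candidate_k = range(1, 7)
bics = []
for k in candidate_k:
    gmm_k = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gmm_k.fit(X_demo)
    bics.append(gmm_k.bic(X_demo))

best_k = candidate_k[int(np.argmin(bics))]

# Soft assignments: one membership probability per component, each row sums to 1.
gmm_best = GaussianMixture(
    n_components=best_k, covariance_type="full", random_state=0
).fit(X_demo)
proba = gmm_best.predict_proba(X_demo)

print(f"lowest BIC at k = {best_k}")
print(f"membership matrix shape: {proba.shape}")
```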

## 7. Evaluation Metrics

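Evaluate clustering quality using three internal metrics: Silhouette, Davies-Bouldin, and Calinski-Harabasz. For DBSCAN, noise points (label -1) are excluded before scoring, so its scores are not directly comparable to those of algorithms that assign every track to a cluster.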

In [None]:
def evaluate_clustering(X, labels, algorithm_name):
    """
    Evaluate clustering using internal metrics.
    
    Parameters:
    -----------
    X : array-like
        Feature matrix
    labels : array-like
        Cluster labels
    algorithm_name : str
        Name of the algorithm
        
    Returns:
    --------
    dict : Dictionary with evaluation metrics
    """
    # Remove noise points for DBSCAN
    if -1 in labels:
        mask = labels != -1
        X_filtered = X[mask]
        labels_filtered = labels[mask]
        n_noise = np.sum(labels == -1)
    else:
        X_filtered = X
        labels_filtered = labels
        n_noise = 0
    
    n_clusters = len(np.unique(labels_filtered))
    
    # Skip if too few clusters
    if n_clusters < 2:
        return {
            'Algorithm': algorithm_name,
            'N_Clusters': n_clusters,
            'Silhouette': np.nan,
            'Davies-Bouldin': np.nan,
            'Calinski-Harabasz': np.nan,
            'Noise_Points': n_noise
        }
    
    # Calculate metrics
    silhouette = silhouette_score(X_filtered, labels_filtered)
    davies_bouldin = davies_bouldin_score(X_filtered, labels_filtered)
    calinski_harabasz = calinski_harabasz_score(X_filtered, labels_filtered)
    
    return {
        'Algorithm': algorithm_name,
        'N_Clusters': n_clusters,
        'Silhouette': silhouette,
        'Davies-Bouldin': davies_bouldin,
        'Calinski-Harabasz': calinski_harabasz,
        'Noise_Points': n_noise
    }

print("✓ Evaluation function defined")

In [None]:
print("="*70)
print("EVALUATION RESULTS")
print("="*70)

# Evaluate all algorithms
results = []
results.append(evaluate_clustering(X_pca, labels_kmeans, 'K-Means'))
results.append(evaluate_clustering(X_pca, labels_spectral, 'Spectral'))
results.append(evaluate_clustering(X_pca, labels_dbscan, 'DBSCAN'))
results.append(evaluate_clustering(X_pca, labels_gmm, 'GMM'))

# Create results DataFrame
results_df = pd.DataFrame(results)

print("\nInternal Metrics Comparison:")
print(results_df.to_string(index=False))

# Interpretation guide
print("\n" + "="*70)
print("METRIC INTERPRETATION")
print("="*70)
print("Silhouette Score: Range [-1, 1], Higher is better")
print("  > 0.7: Strong structure")
print("  0.5-0.7: Reasonable structure")
print("  0.25-0.5: Weak structure")
print("  < 0.25: No substantial structure")
print("\nDavies-Bouldin Index: Lower is better (indicates better separation)")
print("\nCalinski-Harabasz Index: Higher is better (indicates better defined clusters)")

## 8. Visualization

Visualize clustering results using a 2D projection (the first two PCA components).

In [None]:
# Create 2D visualization using first 2 PCA components
fig, axes = plt.subplots(2, 2, figsize=(16, 14))
fig.suptitle('Clustering Results Visualization (First 2 PCA Components)', fontsize=16)

algorithms = [
    ('K-Means', labels_kmeans),
    ('Spectral', labels_spectral),
    ('DBSCAN', labels_dbscan),
    ('GMM', labels_gmm)
]

for idx, (name, labels) in enumerate(algorithms):
    ax = axes[idx // 2, idx % 2]
    
    # Create scatter plot
    scatter = ax.scatter(
        X_pca[:, 0],
        X_pca[:, 1],
        c=labels,
        cmap='tab10',
        alpha=0.6,
        s=30,
        edgecolors='black',
        linewidth=0.5
    )
    
    ax.set_xlabel('First Principal Component')
    ax.set_ylabel('Second Principal Component')
    ax.set_title(f'{name} Clustering')
    ax.grid(True, alpha=0.3)
    
    # Add colorbar
    plt.colorbar(scatter, ax=ax, label='Cluster')

plt.tight_layout()
plt.show()

In [None]:
# Visualize metrics comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Clustering Metrics Comparison', fontsize=16)

# Filter valid results
valid_results = results_df[results_df['Silhouette'].notna()]

# Silhouette Score
ax1 = axes[0]
bars1 = ax1.bar(valid_results['Algorithm'], valid_results['Silhouette'], color='skyblue', edgecolor='black')
ax1.set_ylabel('Silhouette Score')
ax1.set_title('Silhouette Score (Higher is Better)')
ax1.axhline(y=0.5, color='r', linestyle='--', label='Good threshold (0.5)')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')
for bar in bars1:
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.3f}', ha='center', va='bottom')

# Davies-Bouldin Index
ax2 = axes[1]
bars2 = ax2.bar(valid_results['Algorithm'], valid_results['Davies-Bouldin'], color='salmon', edgecolor='black')
ax2.set_ylabel('Davies-Bouldin Index')
ax2.set_title('Davies-Bouldin Index (Lower is Better)')
ax2.grid(True, alpha=0.3, axis='y')
for bar in bars2:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.3f}', ha='center', va='bottom')

# Calinski-Harabasz Index
ax3 = axes[2]
bars3 = ax3.bar(valid_results['Algorithm'], valid_results['Calinski-Harabasz'], color='lightgreen', edgecolor='black')
ax3.set_ylabel('Calinski-Harabasz Index')
ax3.set_title('Calinski-Harabasz Index (Higher is Better)')
ax3.grid(True, alpha=0.3, axis='y')
for bar in bars3:
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.1f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

## 9. Save Results

In [None]:
# Save clustering results
output_df = pd.DataFrame({
    'track_id': track_ids,
    'kmeans_cluster': labels_kmeans,
    'spectral_cluster': labels_spectral,
    'dbscan_cluster': labels_dbscan,
    'gmm_cluster': labels_gmm
})

output_path = os.path.join(CONFIG['output_dir'], 'clustering_results.csv')
output_df.to_csv(output_path, index=False)
print(f"✓ Clustering results saved to: {output_path}")

# Save evaluation metrics
metrics_path = os.path.join(CONFIG['output_dir'], 'evaluation_metrics.csv')
results_df.to_csv(metrics_path, index=False)
print(f"✓ Evaluation metrics saved to: {metrics_path}")

# Display sample results
print(f"\nSample clustering assignments:")
output_df.head(10)

## 10. Summary and Conclusions

### Key Findings:

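Compare the algorithms using the evaluation table from Section 7. The three internal metrics do not always agree, and DBSCAN is scored on its non-noise points only, so read them together rather than relying on a single number.

**Best Performing Algorithm:** Prefer the algorithm with: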
- Highest Silhouette Score (best cluster cohesion and separation)
- Lowest Davies-Bouldin Index (best cluster separation)
- Highest Calinski-Harabasz Index (best defined clusters)

### Algorithm Characteristics:

1. **K-Means**:
   - Fast and scalable
   - Works well with spherical clusters
   - Requires pre-specified number of clusters
   
2. **Spectral Clustering**:
   - Can capture complex cluster shapes
   - Good for non-convex clusters
   - More computationally expensive
   
3. **DBSCAN**:
   - Density-based approach
   - Can find arbitrary shaped clusters
   - Identifies noise points
   - Does not require pre-specified number of clusters
   
4. **GMM (Gaussian Mixture Model)**:
   - Probabilistic approach
   - Provides soft cluster assignments
   - Flexible covariance structures

### Next Steps:

1. **Fine-tune parameters**: Adjust eps and min_samples for DBSCAN, number of clusters for K-Means, etc.
2. **Feature engineering**: Experiment with different audio features or feature combinations
3. **External validation**: If genre labels are available, evaluate using ARI, NMI, Purity
4. **Domain analysis**: Examine tracks within each cluster to understand musical characteristics
5. **Ensemble methods**: Combine multiple clustering results for more robust assignments
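
The external validation step could be sketched as follows. This is an illustrative helper, not part of the pipeline above: `external_scores` is a hypothetical function name, the label arrays are toy values, and in practice `y_true` would come from genre labels loaded separately and aligned with `track_ids`. Purity is computed from the contingency matrix, and the Hungarian algorithm (`linear_sum_assignment`) gives a one-to-one cluster-to-genre matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix


def external_scores(y_true, y_pred):
    """Return ARI, NMI, purity and Hungarian-matched accuracy for one clustering."""
    cm = contingency_matrix(y_true, y_pred)        # rows: true classes, columns: clusters
    purity = cm.max(axis=0).sum() / cm.sum()       # majority class within each cluster
    row_ind, col_ind = linear_sum_assignment(-cm)  # one-to-one matching maximizing agreement
    matched_accuracy = cm[row_ind, col_ind].sum() / cm.sum()
    return {
        "ARI": adjusted_rand_score(y_true, y_pred),
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "Purity": purity,
        "Matched accuracy": matched_accuracy,
    }


# Toy labels, purely illustrative (not FMA data).
genres = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
clusters = np.array([1, 1, 1, 0, 0, 2, 2, 2, 2])
scores = external_scores(genres, clusters)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Purity rises automatically as the number of clusters grows, so it is best reported next to ARI or NMI, which correct for that effect.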

## Bonus: Advanced Analysis (Optional)

### Elbow Method for Optimal K

In [None]:
# Elbow method to find optimal number of clusters
print("Finding optimal number of clusters using Elbow Method...")

k_range = range(2, 21)
inertias = []
silhouette_scores = []

for k in tqdm(k_range):
    kmeans_temp = KMeans(n_clusters=k, random_state=CONFIG['random_state'], n_init=10)
    labels_temp = kmeans_temp.fit_predict(X_pca)
    inertias.append(kmeans_temp.inertia_)
    silhouette_scores.append(silhouette_score(X_pca, labels_temp))

# Plot results
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Elbow curve
axes[0].plot(k_range, inertias, 'bo-')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia (Within-cluster sum of squares)')
axes[0].set_title('Elbow Method')
axes[0].grid(True)

# Silhouette scores
axes[1].plot(k_range, silhouette_scores, 'ro-')
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Score vs. Number of Clusters')
axes[1].grid(True)

plt.tight_layout()
plt.show()

# Find optimal k based on silhouette score
optimal_k = k_range[np.argmax(silhouette_scores)]
print(f"\nOptimal number of clusters based on Silhouette Score: {optimal_k}")
print(f"Best Silhouette Score: {max(silhouette_scores):.4f}")
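
The cell above picks `optimal_k` from the silhouette curve and leaves the elbow to visual inspection. If an automatic reading of the elbow is wanted, one common heuristic is to take the point farthest from the straight line joining the two ends of the inertia curve. The sketch below is an assumption, not the notebook's method; `elbow_by_chord_distance` is a hypothetical helper and the inertia values are made-up toy numbers.

```python
import numpy as np


def elbow_by_chord_distance(ks, inertias):
    """Return the k whose point lies farthest from the line joining the curve's endpoints."""
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)
    # Rescale both axes to [0, 1] so the result does not depend on units.
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (inertias - inertias.min()) / (inertias.max() - inertias.min())
    # Perpendicular distance from each point to the chord between the first and last point.
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    distances = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(distances)])


# Toy curve: steep drop up to k=4, then a slow linear decline.
toy_ks = list(range(2, 11))
toy_inertias = [1000, 600, 200, 180, 160, 140, 120, 100, 80]
print(elbow_by_chord_distance(toy_ks, toy_inertias))  # 4
```

With the notebook's variables the call would be `elbow_by_chord_distance(list(k_range), inertias)`; on smooth curves without a clear bend the heuristic still returns a value, so it is worth checking against the plot.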

### Cluster Analysis: Feature Importance

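Inspect which standardized features have the highest mean value in each cluster. This is a descriptive profile rather than a formal importance measure, and features with strongly negative cluster means are not shown.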

In [None]:
# Analyze cluster characteristics using K-Means results
cluster_features = pd.DataFrame(X_scaled, columns=features_cleaned.drop('track_id', axis=1).columns)
cluster_features['cluster'] = labels_kmeans

# Calculate mean feature values per cluster
cluster_means = cluster_features.groupby('cluster').mean()

# Visualize top features for each cluster
n_top_features = 10
fig, axes = plt.subplots(2, 5, figsize=(20, 8))
fig.suptitle('Top Features by Cluster (K-Means)', fontsize=16)

for cluster_id in range(min(10, CONFIG['n_clusters'])):
    ax = axes[cluster_id // 5, cluster_id % 5]
    
    # Get top features for this cluster
    cluster_profile = cluster_means.loc[cluster_id].sort_values(ascending=False)[:n_top_features]
    
    ax.barh(range(len(cluster_profile)), cluster_profile.values)
    ax.set_yticks(range(len(cluster_profile)))
    ax.set_yticklabels(cluster_profile.index, fontsize=8)
    ax.set_xlabel('Mean Value (standardized)')
    ax.set_title(f'Cluster {cluster_id}')
    ax.invert_yaxis()

plt.tight_layout()
plt.show()

print("✓ Cluster feature analysis complete")
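
A variant of this profile ranks features by the magnitude of their standardized cluster mean, so that features on which a cluster is unusually low are surfaced as well as those on which it is unusually high. The sketch below uses a small synthetic frame; `top_features_by_magnitude` and the `f0`..`f3` column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f0", "f1", "f2", "f3"])
demo["cluster"] = rng.integers(0, 3, size=300)
# Make cluster 0 clearly low on f2 and cluster 1 clearly high on f0.
demo.loc[demo["cluster"] == 0, "f2"] -= 3.0
demo.loc[demo["cluster"] == 1, "f0"] += 3.0

cluster_means_demo = demo.groupby("cluster").mean()


def top_features_by_magnitude(means, cluster_id, n_top=2):
    """Features whose cluster mean deviates most from zero, in either direction."""
    profile = means.loc[cluster_id]
    order = profile.abs().sort_values(ascending=False).index[:n_top]
    return profile.loc[order]  # signed values, ordered by absolute size


print(top_features_by_magnitude(cluster_means_demo, 0))
print(top_features_by_magnitude(cluster_means_demo, 1))
```

Applied to the notebook's `cluster_means`, the same function could replace the `sort_values(ascending=False)` call when negative deviations are of interest.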

## Conclusion

This notebook demonstrated a complete music genre clustering pipeline using the FMA dataset:

✅ **Feature Extraction**: Extracted 155 audio features per track (MFCC, delta, delta-delta, chroma, spectral, and temporal statistics)  
✅ **Data Preprocessing**: Applied standardization and PCA for dimensionality reduction  
✅ **Clustering**: Implemented 4 different algorithms (K-Means, Spectral, DBSCAN, GMM)  
✅ **Evaluation**: Compared algorithms using Silhouette, Davies-Bouldin, and Calinski-Harabasz metrics  
✅ **Visualization**: Created comprehensive visualizations of results  

### References:
- FMA Dataset: https://github.com/mdeff/fma
- Librosa Documentation: https://librosa.org/
- Scikit-learn Clustering: https://scikit-learn.org/stable/modules/clustering.html

---

**Author**: FMA Music Genre Clustering Project  
**Date**: 2025