# 3 — Analysing K-Means: Choosing **k** (Beginner-Friendly)

**Purpose:** Find a sensible number of clusters (*k*) for Moosic playlists.

**You’ll learn:**
- What the **Elbow** method (inertia) tells us
- What **Silhouette**, **Davies–Bouldin**, and **Calinski–Harabasz** measure
- How to pick a practical `k` (not just the mathematically ‘best’)

> Tip: Use metrics as **guides**. Pair them with listening & editorial judgment.

## 0. Imports & setup

In [None]:

import numpy as np
import pandas as pd
import re
from pathlib import Path

import matplotlib.pyplot as plt

from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

plt.rcParams['figure.figsize'] = (7,5)
RNG = np.random.RandomState(42)


## 1. Load data and select features
Place your CSV at `../data/spotify_5000_songs.csv`. Column names get a light clean: whitespace is collapsed and only the first word of each header is kept.

In [None]:

DATA = Path("../data/spotify_5000_songs.csv")
assert DATA.exists(), f"Missing data at {DATA}. Place your CSV there."

def clean_col(c):
    """Collapse whitespace and keep only the first word of a column name."""
    s = re.sub(r"\s+", " ", str(c)).strip()
    return s.split(" ")[0]

df_raw = pd.read_csv(DATA)
df = df_raw.copy()
df.columns = [clean_col(c) for c in df.columns]

FEATURES = ['danceability','energy','acousticness','instrumentalness','liveness','valence',
            'tempo','speechiness','loudness','duration_ms','key','mode','time_signature']
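# Keep only the features that are actually present in this CSV.
available = [c for c in FEATURES if c in df.columns]
assert available, "None of the expected feature columns were found."
# Coerce to numeric and drop rows with any missing or non-numeric value.
X = df[available].apply(pd.to_numeric, errors='coerce').dropna()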

print("Using features:", available)
X.shape


## 2. Scale features (crucial for distance-based clustering)
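We’ll use **QuantileTransformer** by default: it maps each feature to an approximately normal distribution, which helps with skewed features and outliers. Set `use_quantile = False` to compare against **StandardScaler**.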

In [None]:

use_quantile = True

if use_quantile:
    scaler = QuantileTransformer(output_distribution='normal', n_quantiles=min(1000, len(X)), random_state=42)
else:
    scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)
X_scaled[:2]
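To see why scaling matters for a distance-based method, here is a minimal sketch on three made-up tracks with only two features. The values are hypothetical and chosen purely for illustration; they do not come from the Moosic data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy tracks: columns are [danceability (0-1), tempo (BPM)].
toy = np.array([
    [0.10, 120.0],   # track A
    [0.90, 121.0],   # track B: very different danceability, similar tempo
    [0.12, 140.0],   # track C: similar danceability, different tempo
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw units: tempo dominates, so A looks much closer to B than to C,
# even though A and B differ strongly in danceability.
raw_ab, raw_ac = dist(toy[0], toy[1]), dist(toy[0], toy[2])

# After standardising, both features contribute on a comparable scale.
toy_scaled = StandardScaler().fit_transform(toy)
sc_ab, sc_ac = dist(toy_scaled[0], toy_scaled[1]), dist(toy_scaled[0], toy_scaled[2])

print(f"raw:    d(A,B)={raw_ab:.2f}  d(A,C)={raw_ac:.2f}")
print(f"scaled: d(A,B)={sc_ab:.2f}  d(A,C)={sc_ac:.2f}")
```

In raw units the tempo column decides almost everything; after scaling the two distances are of similar size, so K-Means can weigh both features.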


## 3. Define a `k` sweep and metrics
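We’ll try a range of `k` and record four values for each: **inertia** (within-cluster sum of squared distances; it always falls as `k` grows), **silhouette** (from -1 to 1; higher is better), **Davies–Bouldin** (lower is better), and **Calinski–Harabasz** (higher is better).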

In [None]:

def safe_metrics(Xt, labels, inertia):
    """Return the four metrics; the three scores need at least 2 clusters."""
    uniq = set(labels)
    if len(uniq) < 2:
        return {'inertia': float(inertia), 'silhouette': None, 'davies_bouldin': None, 'calinski_harabasz': None}
    return {
        'inertia': float(inertia),
        'silhouette': float(silhouette_score(Xt, labels)),
        'davies_bouldin': float(davies_bouldin_score(Xt, labels)),
        'calinski_harabasz': float(calinski_harabasz_score(Xt, labels)),
    }

def k_sweep(Xt, k_values, n_init=10, random_state=42):
    rows = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=n_init, random_state=random_state).fit(Xt)
        rows.append({'k': k, **safe_metrics(Xt, km.labels_, km.inertia_)})
    return pd.DataFrame(rows)


## 4. Run the sweep
We’ll try `k` from 4 to 30 (step 2). Adjust if you want more/less granularity.

In [None]:

k_values = list(range(4, 31, 2))
df_k = k_sweep(X_scaled, k_values)
df_k


### Plot: Elbow (Inertia)
Inertia always decreases as `k` grows, so don’t look for a minimum. Look for a **bend** where adding more clusters gives diminishing returns.

In [None]:

plt.figure()
plt.plot(df_k['k'], df_k['inertia'], marker='o')
plt.xlabel('k'); plt.ylabel('Inertia')
plt.title('Elbow Curve (K-Means)')
plt.show()
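If the bend is hard to judge by eye, one possible heuristic is to pick the point that lies farthest from the straight line joining the first and last points of the curve. The sketch below is an assumption-laden helper, not part of the original analysis: the function name `elbow_by_chord_distance` is hypothetical and the inertia values are made up.

```python
import numpy as np

def elbow_by_chord_distance(k_values, inertias):
    """Return the k whose (k, inertia) point lies farthest from the straight
    line joining the first and last points of the curve.

    A heuristic only: it assumes a decreasing, roughly convex curve.
    """
    k = np.asarray(k_values, dtype=float)
    y = np.asarray(inertias, dtype=float)
    # Normalise both axes to [0, 1] so neither dominates the distance.
    kn = (k - k.min()) / (k.max() - k.min())
    yn = (y - y.min()) / (y.max() - y.min())
    # Direction of the chord from the first point to the last point.
    dx, dy = kn[-1] - kn[0], yn[-1] - yn[0]
    # Perpendicular distance of every point from that chord.
    dist = np.abs(dy * (kn - kn[0]) - dx * (yn - yn[0])) / np.hypot(dx, dy)
    return int(k[np.argmax(dist)])

# Made-up inertia values for illustration (not results from the Moosic data).
toy_k = [2, 4, 6, 8, 10, 12]
toy_inertia = [1000, 520, 300, 260, 235, 220]
elbow_k = elbow_by_chord_distance(toy_k, toy_inertia)
print(elbow_k)
```

On the sweep table this could be called as `elbow_by_chord_distance(df_k['k'], df_k['inertia'])`. Treat the result as a starting point for inspection rather than a final answer.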


### Plot: Silhouette vs k
Silhouette ranges from -1 to 1, and higher is better. Peaks suggest compact, well-separated clusters.

In [None]:

plt.figure()
plt.plot(df_k['k'], df_k['silhouette'], marker='o')
plt.xlabel('k'); plt.ylabel('Silhouette')
plt.title('Silhouette vs k (K-Means)')
plt.show()
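The sweep also records Davies–Bouldin and Calinski–Harabasz, which read in opposite directions. As a rough illustration of how the three scores behave, the sketch below uses synthetic data with four clearly separated groups (an assumption made for the demo, not the Moosic data) and checks which `k` each metric prefers.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic data with 4 well-separated groups (an assumption for illustration).
X_demo, _ = make_blobs(
    n_samples=400,
    centers=[[-10, -10], [-10, 10], [10, -10], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = {
        "silhouette": silhouette_score(X_demo, labels),                # higher is better
        "davies_bouldin": davies_bouldin_score(X_demo, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X_demo, labels),  # higher is better
    }

# Pick the preferred k per metric, respecting each metric's direction.
best_sil = max(scores, key=lambda k: scores[k]["silhouette"])
best_db = min(scores, key=lambda k: scores[k]["davies_bouldin"])
best_ch = max(scores, key=lambda k: scores[k]["calinski_harabasz"])
print(best_sil, best_db, best_ch)
```

On clean synthetic data the metrics tend to agree. On real audio features they often disagree, which is why the choice of `k` still needs judgment.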


## 5. Pick a practical `k`
We’ll sort by Silhouette (descending) and inspect the top five candidates. Remember to balance quality vs. practicality.

In [None]:

top = df_k.dropna(subset=['silhouette']).sort_values('silhouette', ascending=False).head(5)
top


**How to choose**
- Pick a `k` near a **Silhouette peak** **and** an **Elbow bend**.
- For Moosic, `k≈20` offers a good balance: enough variety without over-fragmentation.
- You can re-run this with **StandardScaler** to compare.
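One way to combine the metrics when shortlisting is to rank each `k` per metric and average the ranks. This is a sketch under stated assumptions: the helper name `shortlist_k` is hypothetical, equal weighting of the three metrics is an arbitrary choice, and the table below contains made-up numbers rather than Moosic results.

```python
import pandas as pd

def shortlist_k(sweep, top_n=3):
    """Rank each k on three metrics and sort by the average rank (1 = best)."""
    ranks = pd.DataFrame({
        "k": sweep["k"],
        "silhouette": sweep["silhouette"].rank(ascending=False),         # higher is better
        "davies_bouldin": sweep["davies_bouldin"].rank(ascending=True),  # lower is better
        "calinski_harabasz": sweep["calinski_harabasz"].rank(ascending=False),  # higher is better
    })
    metric_cols = ["silhouette", "davies_bouldin", "calinski_harabasz"]
    ranks["mean_rank"] = ranks[metric_cols].mean(axis=1)
    return ranks.sort_values("mean_rank").head(top_n).reset_index(drop=True)

# Made-up numbers for illustration only (not results from the Moosic data).
toy_sweep = pd.DataFrame({
    "k": [4, 8, 12, 16],
    "silhouette": [0.21, 0.25, 0.24, 0.19],
    "davies_bouldin": [1.60, 1.35, 1.40, 1.55],
    "calinski_harabasz": [900.0, 850.0, 700.0, 600.0],
})
shortlist = shortlist_k(toy_sweep)
print(shortlist)
```

The same helper should work on the sweep table via `shortlist_k(df_k)`. The shortlist is only a set of candidates to listen to and compare against the Elbow plot.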

## 6. Save results for reuse
Other notebooks (and the README) can use this sweep table.

In [None]:

OUT = Path("../reports")
OUT.mkdir(parents=True, exist_ok=True)
out_path = OUT / "kmeans_sweep_results.csv"
df_k.to_csv(out_path, index=False)
print(f"Saved: {out_path}")


---
## 7. Next steps
- Compare K-Means vs **Agglomerative** and **DBSCAN** (different cluster assumptions)
- Visualize your chosen `k` in 2D (PCA/UMAP) and show **example tracks per cluster**
- Keep `k` flexible; editors might prefer fewer/more playlists depending on use-case
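
As a starting point for the 2D visualisation, the sketch below projects clustered data onto two principal components and colours points by cluster. It uses synthetic stand-in data and a placeholder `chosen_k`; both are assumptions for the demo, so swap in `X_scaled` and your own `k`.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Stand-in for the scaled feature matrix (synthetic; swap in X_scaled).
X_demo, _ = make_blobs(n_samples=300, n_features=5, centers=4, random_state=0)
chosen_k = 4  # placeholder; use the k you settled on

labels = KMeans(n_clusters=chosen_k, n_init=10, random_state=42).fit_predict(X_demo)

# Project to the two directions of largest variance for plotting only;
# the clustering itself was done in the full feature space.
coords = PCA(n_components=2, random_state=42).fit_transform(X_demo)

plt.figure()
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=12, cmap="tab10")
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.title(f"K-Means clusters in PCA space (k={chosen_k})")
plt.show()
```

Clusters that overlap in this view may still be separated in the full feature space, since two components keep only part of the variance.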