# 1 — Introduction to K‑Means (Moosic)

**Goal of this notebook:**
- Teach the *idea* of K‑Means (quick, visual intuition)
- Apply K‑Means on the real Spotify dataset (5000 songs)
- Produce a *first draft* of playlist clusters for Moosic

**You’ll learn:**
1) What K‑Means optimizes, and why scaling matters
2) How to run K‑Means on Spotify features
3) How to read basic cluster metrics and a 2D visualization

> Tip: This is a **teaching + applied** notebook. For deeper tuning, see the later notebooks.

## 0. Imports & Settings

In [None]:

# Standard scientific stack
import numpy as np
import pandas as pd
from pathlib import Path

# Modeling
from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Plotting
import matplotlib.pyplot as plt

# Reproducibility
RNG = np.random.RandomState(42)
plt.rcParams['figure.figsize'] = (7,5)


## 1. A 2‑minute intuition: what does K‑Means do?
K‑Means tries to split data into **k groups** so that points are close to their group's **centroid**.

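- It starts from initial centroids (scikit-learn seeds them with `k-means++` by default rather than purely at random)
- It repeats *assign → update* until assignments stop changing: each point joins its nearest centroid, then each centroid moves to the mean of its points
- It minimizes **inertia** (the sum of squared distances from each point to its assigned centroid), but only to a local minimum, which is why we run several restarts with `n_init`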
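
The assign → update loop can be sketched in a few lines of NumPy. This is a teaching sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, the toy data, and the choice to seed from randomly picked data points are all assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain assign -> update loop.

    Returns (centroids, labels from the last assign step, inertia per iteration).
    """
    rng = np.random.default_rng(seed)
    # Assumption for this sketch: start from k distinct data points picked at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    history = []
    for _ in range(n_iter):
        # Assign: squared distance from every point to every centroid, shape (n, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        history.append(float(d2.min(axis=1).sum()))  # inertia for the current centroids
        # Update: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels, history

# Toy data: two tight 2D blobs
rng = np.random.default_rng(1)
X_toy = np.vstack([
    rng.normal(loc=[-5, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5, 0], scale=0.3, size=(50, 2)),
])
centroids, labels, history = kmeans_sketch(X_toy, k=2)
print("inertia per iteration:", np.round(history, 2))
```

The printed inertia values never go up from one iteration to the next: both the assign step and the update step can only lower (or keep) the sum of squared distances.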

Let’s see a tiny 2D demo to build intuition (not the Spotify data yet).

In [None]:

# --- Tiny synthetic demo (for intuition only) ---
# 3 blobs in 2D so we can see what's happening
n_per = 150
A = RNG.normal(loc=[-2, 0], scale=[0.7, 0.5], size=(n_per, 2))
B = RNG.normal(loc=[ 2, 0], scale=[0.7, 0.5], size=(n_per, 2))
C = RNG.normal(loc=[ 0, 3], scale=[0.7, 0.5], size=(n_per, 2))
X_demo = np.vstack([A,B,C])

km_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)
labels_demo = km_demo.labels_

plt.scatter(X_demo[:,0], X_demo[:,1], c=labels_demo, s=12)
plt.title("K-Means on a simple 2D dataset")
plt.xlabel("x1"); plt.ylabel("x2"); plt.show()


**Reading the plot:** Points with the same color belong to the same cluster. K‑Means recovered the 3 compact groups we generated.
Real data has more than two features, so later we’ll plot a 2D projection instead of the raw axes.

## 2. Load the Spotify dataset (5000 songs)
Place your CSV at `../data/spotify_5000_songs.csv`.
We’ll do a small column cleanup because the original file may contain extra spaces.

In [None]:

import re
DATA = Path("../data/spotify_5000_songs.csv")
assert DATA.exists(), f"Missing data file at {DATA}. Put your CSV there."

# Load and clean columns (strip extra spaces / keep first token if necessary)
df_raw = pd.read_csv(DATA)

def clean_col(c):
    s = re.sub(r"\s+", " ", str(c)).strip()
    return s.split(" ")[0]  # keep first token (works for columns like 'name   ...')

df = df_raw.copy()
df.columns = [clean_col(c) for c in df.columns]

# Take a peek
df.head(3)
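
To see what the first-token rule in `clean_col` does, here is a small sketch on made-up headers. The header strings are hypothetical, not taken from the Spotify file. It also shows the rule's downside: a header that legitimately contains a space gets truncated.

```python
import re

def clean_col(c):
    # Same rule as in the cell above: collapse whitespace, keep the first token
    s = re.sub(r"\s+", " ", str(c)).strip()
    return s.split(" ")[0]

# Hypothetical raw headers, padded the way a messy CSV export might be
raw_headers = ["name      ", " danceability ", "duration_ms  ", "time signature"]
cleaned = [clean_col(h) for h in raw_headers]
print(cleaned)  # ['name', 'danceability', 'duration_ms', 'time']

# Headers that lost more than padding (worth checking by eye)
changed = {h.strip(): clean_col(h) for h in raw_headers
           if clean_col(h) != " ".join(h.split())}
print(changed)  # {'time signature': 'time'}
```

If a feature shows up under "Missing (ignored)" in the next cell, a truncated header like this is one possible cause.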


### Available features
We’ll focus on Spotify’s numeric audio features commonly used for clustering.

In [None]:

FEATURES = [
    'danceability','energy','acousticness','instrumentalness','liveness','valence',
    'tempo','speechiness','loudness','duration_ms','key','mode','time_signature'
]
available = [c for c in FEATURES if c in df.columns]
missing = [c for c in FEATURES if c not in df.columns]
print("Using features:", available)
print("Missing (ignored):", missing)

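# Coerce to numeric and drop rows with missing or non-numeric values.
# X keeps df's index, so labels can be matched back to the right rows later.
X = df[available].apply(pd.to_numeric, errors='coerce').dropna()
print(X.shape)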
X.describe().T


## 3. Why scaling matters (and what we’ll use)
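Clustering uses **distances**. If one feature has a much larger numeric range (e.g., `duration_ms`, in the hundreds of thousands, next to `danceability`, between 0 and 1), it dominates the distance and the other features barely count.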
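
A minimal sketch of that effect, using made-up numbers (the catalogue and the three tracks below are hypothetical, and `StandardScaler` is used here only because it is the simplest scaler to reason about):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical catalogue: columns are [danceability in 0..1, duration_ms]
catalogue = np.column_stack([
    rng.uniform(0.0, 1.0, size=500),
    rng.normal(220_000, 40_000, size=500),
])
# Three hypothetical tracks to compare
trio = np.array([
    [0.90, 200_000.0],  # A: very danceable
    [0.10, 200_500.0],  # B: not danceable, half a second longer than A
    [0.88, 215_000.0],  # C: as danceable as A, 15 seconds longer
])

def dist(M, i, j):
    """Euclidean distance between rows i and j of M."""
    return float(np.linalg.norm(M[i] - M[j]))

# Raw units: duration_ms swamps danceability, so A looks closer to B than to C
raw_ab, raw_ac = dist(trio, 0, 1), dist(trio, 0, 2)

# Scale with statistics learned from the catalogue, then measure again
scaled = StandardScaler().fit(catalogue).transform(trio)
std_ab, std_ac = dist(scaled, 0, 1), dist(scaled, 0, 2)

print(f"raw:    d(A,B)={raw_ab:.1f}  d(A,C)={raw_ac:.1f}")
print(f"scaled: d(A,B)={std_ab:.2f}  d(A,C)={std_ac:.2f}")
```

In raw units, A's nearest neighbor is B, purely because their durations almost match. After scaling, A is nearer to C, the track that actually sounds similar on danceability.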

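We’ll use **QuantileTransformer** with `output_distribution="normal"`, which maps each feature’s ranks onto a standard normal distribution. That puts all features on a comparable range and tames skewed audio features, at the cost of being a non-linear transform.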
You can try **StandardScaler** too; results may differ.

In [None]:

# Choose a scaler
use_quantile = True

if use_quantile:
    scaler = QuantileTransformer(output_distribution="normal",
                                 n_quantiles=min(1000, len(X)),
                                 random_state=42)
else:
    scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)
X_scaled[:3]


## 4. First real run of K‑Means on Spotify
We’ll start with **k=20** clusters (a reasonable draft number of playlists).
Later notebooks will **tune k** and **compare algorithms**.

In [None]:

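k = 20
km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
labels = km.labels_
# X may have fewer rows than df (rows with missing features were dropped),
# so select the matching rows before attaching the labels
df_clusters = df.loc[X.index].copy()
df_clusters['cluster'] = labels
df_clusters['cluster'].value_counts().sort_index()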


### Quick metrics (how good are these clusters?)
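- **Silhouette** (ranges from -1 to 1; higher = points sit closer to their own cluster than to the nearest other one)
- **Davies–Bouldin** (0 is the best possible; lower = more compact, better-separated clusters)
- **Calinski–Harabasz** (ratio of between-cluster to within-cluster dispersion; higher = better-defined clusters)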

> These don’t know ‘genres’; they only look at distance separation. Use them as a *guideline*, not gospel.

In [None]:

def safe_metrics(Xt, y):
    uniq = set(y)
    if len(uniq) < 2:
        return {"silhouette": None, "davies_bouldin": None, "calinski_harabasz": None}
    return {
        "silhouette": float(silhouette_score(Xt, y)),
        "davies_bouldin": float(davies_bouldin_score(Xt, y)),
        "calinski_harabasz": float(calinski_harabasz_score(Xt, y)),
    }

metrics = safe_metrics(X_scaled, labels)
metrics
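
To get a feel for which way each metric moves, here is a small sketch on synthetic blobs. The helpers `two_blobs` and `report` are made up for this illustration. The same noise is scored twice with its true labels: once with the blob centers far apart, once with them nearly on top of each other.

```python
import numpy as np
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def two_blobs(gap, seed=0, n=200):
    """Two unit-variance 2D blobs whose centers are `gap` apart, plus true labels."""
    rng = np.random.default_rng(seed)
    a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))
    b = rng.normal(loc=[gap, 0.0], scale=1.0, size=(n, 2))
    return np.vstack([a, b]), np.repeat([0, 1], n)

def report(X, y):
    """All three metrics for one labeling."""
    return {
        "silhouette": float(silhouette_score(X, y)),
        "davies_bouldin": float(davies_bouldin_score(X, y)),
        "calinski_harabasz": float(calinski_harabasz_score(X, y)),
    }

separated = report(*two_blobs(gap=10.0))
overlapping = report(*two_blobs(gap=1.0))

for name in separated:
    print(f"{name:18s} separated={separated[name]:10.3f}  overlapping={overlapping[name]:10.3f}")
```

The separated case scores higher on silhouette and Calinski–Harabasz and lower on Davies–Bouldin. Real audio features usually land somewhere between these two extremes.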


## 5. A quick 2D view with PCA
We’ll project the scaled features to **2 principal components** just to visualize clusters.
Note: PCA is only for plotting here (we are **not** clustering in PCA space for this notebook).

In [None]:

pca = PCA(n_components=2, random_state=42)
X2 = pca.fit_transform(X_scaled)

plt.scatter(X2[:,0], X2[:,1], c=labels, s=5)
plt.title(f'K-Means (k={k}) — PCA 2D view')
plt.xlabel('PC1'); plt.ylabel('PC2')
plt.show()


**How to read this:**
- Points with the same color are assigned to the same cluster (potential playlist)
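- Tight, separated color blobs → clusters that are well separated along the two main directions of variation
- Overlap → may just be an artifact of squeezing many features into 2D (expected with k=20); if the metrics are also weak, the playlists may need tuning, more features, or a different algorithm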
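
How much to trust the 2D picture depends on how much variance the two components keep. A minimal sketch on made-up data (`X_fake` is a hypothetical stand-in for the scaled feature matrix); `explained_variance_ratio_` is the scikit-learn attribute that reports it.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for scaled features: 300 rows, 8 features
X_fake = rng.normal(size=(300, 8))
# Make two of the features strongly correlated so PC1 has something to find
X_fake[:, 1] = X_fake[:, 0] + 0.1 * rng.normal(size=300)

pca_fake = PCA(n_components=2).fit(X_fake)
ratios = pca_fake.explained_variance_ratio_   # share of variance per component
kept_2d = float(ratios.sum())                 # share kept by the 2D view

print("variance kept per component:", np.round(ratios, 3))
print("variance kept in the 2D view:", round(kept_2d, 3))
```

In this notebook the same attribute is available on the fitted `pca` object. If the two components keep only a small share of the variance, overlapping colors in the plot say little about overlap in the full feature space.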

## 6. Make it tangible: a few example tracks per cluster
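Listing a few songs per cluster helps editors *feel* the mood. If your data has `name` (or `song_name`) and `artist` columns, we’ll show the first 3 rows of each cluster. These are simply the first rows in file order, not the most representative tracks.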

In [None]:

name_cols = [c for c in ['name', 'song_name'] if c in df_clusters.columns]
artist_cols = [c for c in ['artist'] if c in df_clusters.columns]
mood_cols = ['energy', 'valence', 'tempo']
if not set(mood_cols).issubset(df_clusters.columns):
    mood_cols = []
show_cols = name_cols + artist_cols + mood_cols + ['cluster']

# First 3 rows of each cluster, ordered by cluster (then by name if available)
examples = (df_clusters[show_cols]
            .groupby('cluster', group_keys=False)
            .head(3)
            .sort_values(['cluster'] + name_cols))
examples.head(15)
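
The cell above takes the first rows of each cluster. An optional alternative is to show the tracks closest to each centroid, which tend to be more typical of the cluster. Here is a sketch on toy data: the frame `toy`, its `name` values, and the column `dist_to_centroid` are made up for illustration. `KMeans.transform` returns each row's distance to every centroid.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical scaled features for 60 tracks in two loose groups
X_toy = np.vstack([rng.normal(-3, 1, size=(30, 2)),
                   rng.normal(3, 1, size=(30, 2))])
toy = pd.DataFrame({"name": [f"track_{i}" for i in range(60)]})

km_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
toy["cluster"] = km_toy.labels_

# transform() gives distances to all centroids; keep the one for each row's own cluster
dist_all = km_toy.transform(X_toy)
toy["dist_to_centroid"] = dist_all[np.arange(len(X_toy)), km_toy.labels_]

# The 3 tracks closest to each centroid
representatives = (toy.sort_values("dist_to_centroid")
                      .groupby("cluster")
                      .head(3)
                      .sort_values(["cluster", "dist_to_centroid"]))
print(representatives)
```

On the Spotify data the same idea would use `km.transform(X_scaled)` together with the rows of `df_clusters`, assuming both are in the same row order.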


## 7. Save outputs for later notebooks / presentation
We’ll save the cluster assignments so other notebooks (e.g., evaluation, visualization) can reuse them. The CSV is also handy to include in your GitHub repo.

In [None]:

OUT_DIR = Path("../reports")
OUT_DIR.mkdir(parents=True, exist_ok=True)

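# Name the file after the settings actually used (k and scaler)
scaler_tag = "quantile" if use_quantile else "standard"
assign_path = OUT_DIR / f"kmeans{k}_{scaler_tag}_assignments.csv"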
df_clusters.to_csv(assign_path, index=False)

print(f"Saved: {assign_path}")


---
## 8. What to do next
- Use the **K‑Means analysis** notebook to choose **k** (elbow/silhouette)
- Compare **Agglomerative** and **DBSCAN** to see different structures
- Add **UMAP** visualization for prettier cluster maps

**Key takeaway for Moosic:** K‑Means gives a fast, understandable *first draft* of playlists. Human editors can rename/trim the clusters into final playlists, and later we can add user signals to personalize.