# 2 - Scaling the Spotify Audio Features (Beginner-Friendly)

**Why this notebook?**

Clustering uses distances between songs. If some features have very large ranges or are strongly skewed, they can dominate the distance measurement.
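
To make the point above concrete, here is a minimal sketch with two made-up songs (the values are invented for illustration, not taken from the Moosic dataset). It shows how much each raw feature contributes to the squared Euclidean distance between them.

```python
import numpy as np

# Two hypothetical songs: [danceability (0-1), tempo (BPM), duration_ms]
feature_names = ["danceability", "tempo", "duration_ms"]
song_a = np.array([0.90, 120.0, 200_000.0])
song_b = np.array([0.10, 121.0, 201_000.0])

diff = song_a - song_b
dist = np.linalg.norm(diff)             # Euclidean distance on the raw values
share = diff**2 / np.sum(diff**2)       # each feature's share of the squared distance

print(f"distance: {dist:.2f}")
for name, s in zip(feature_names, share):
    print(f"{name:>13}: {s:.6%}")
```

The two songs differ a lot in danceability and only by one second in length, yet almost all of the distance comes from `duration_ms`, simply because it is measured in milliseconds.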

Here we compare common scalers and **see how scaling affects K-Means quality** on the Moosic dataset.


**You'll learn:**

- When and why to scale features

- What each scaler does (Standard/MinMax/Robust/Quantile)

- How scaling changes clustering metrics for K-Means at a fixed k (we'll use k=20)
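
As a quick preview of what each scaler does, the sketch below applies all four to a toy column containing one outlier. The numbers and the name `toy_scalers` are made up for illustration; `QuantileTransformer` uses `output_distribution="uniform"` here only because the result is easier to read on five points.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, QuantileTransformer
)

# Toy column with one outlier (values invented for illustration)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

toy_scalers = {
    "StandardScaler": StandardScaler(),   # subtract the mean, divide by the std
    "MinMaxScaler": MinMaxScaler(),       # map the min to 0 and the max to 1
    "RobustScaler": RobustScaler(),       # subtract the median, divide by the IQR
    "QuantileTransformer": QuantileTransformer(  # map ranks onto a target distribution
        n_quantiles=5, output_distribution="uniform"
    ),
}

toy = pd.DataFrame(
    {name: s.fit_transform(x).ravel() for name, s in toy_scalers.items()},
    index=x.ravel(),
)
print(toy.round(3))
```

With MinMax, the outlier squeezes the other four points close to 0. RobustScaler keeps those four spread around the median and leaves the outlier far away, while the quantile transform spaces the points evenly by rank.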


> Tip: We keep `k=20` here so we isolate the effect of **scaling** only. In the next notebook, we'll tune `k`.


## 0. Imports & setup

In [None]:

import numpy as np
import pandas as pd
from pathlib import Path
import re

# Scalers & model
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, QuantileTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Plotting
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (7,5)
RNG = np.random.RandomState(42)


## 1. Load the data and choose features
Place your CSV at `../data/spotify_5000_songs.csv`.

In [None]:

DATA = Path("../data/spotify_5000_songs.csv")
assert DATA.exists(), f"Missing data at {DATA}. Place your CSV there."

# Clean column names: collapse spaces and keep first token (handles 'name     ...' exports)
def clean_col(c):
    s = re.sub(r"\s+", " ", str(c)).strip()
    return s.split(" ")[0]

df_raw = pd.read_csv(DATA)
df = df_raw.copy()
df.columns = [clean_col(c) for c in df.columns]

FEATURES = ['danceability','energy','acousticness','instrumentalness','liveness','valence',
            'tempo','speechiness','loudness','duration_ms','key','mode','time_signature']
available = [c for c in FEATURES if c in df.columns]
X = df[available].apply(pd.to_numeric, errors='coerce').dropna()

print("Using features:", available)
X.describe().T
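
To see what the `clean_col` helper above does to header names, here is a small standalone sketch. The function body is copied from the cell above; the example headers are hypothetical and may not match your CSV.

```python
import re

def clean_col(c):
    # Collapse runs of whitespace, trim, then keep only the first token
    s = re.sub(r"\s+", " ", str(c)).strip()
    return s.split(" ")[0]

# Hypothetical headers, invented to show the behaviour
examples = ["  danceability    ", "name     (padded export)", "duration_ms", "time signature"]
cleaned = [clean_col(h) for h in examples]

for before, after in zip(examples, cleaned):
    print(repr(before), "->", repr(after))
```

Because only the first token is kept, a multi-word header such as `time signature` is shortened to `time`. If two headers share the same first word they would collide, so it is worth checking `df.columns` for duplicates after cleaning.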


## 2. Check distribution shapes (skew)
Skewed features benefit from transformations that make them more symmetric (e.g., Quantile).

In [None]:

skew = X.skew(numeric_only=True).sort_values(ascending=False)
skew
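
If the sign and size of the skew values are unfamiliar, the following sketch may help. It uses synthetic columns (not the Spotify data) with known shapes and computes the same `DataFrame.skew()` statistic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns with known shapes (not taken from the dataset)
demo = pd.DataFrame({
    "symmetric": rng.normal(size=5000),          # bell-shaped, skew near 0
    "right_skewed": rng.exponential(size=5000),  # long tail to the right, positive skew
    "left_skewed": -rng.exponential(size=5000),  # long tail to the left, negative skew
})

demo_skew = demo.skew().sort_values(ascending=False)
print(demo_skew.round(2))
```

A common rule of thumb treats an absolute skew above roughly 1 as strongly skewed, but that threshold is a convention rather than a hard rule.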


*(Optional quick look)* Histograms for a couple of features before scaling.
Use this to spot very skewed features like `acousticness` or `instrumentalness`.

In [None]:

# List the typically skewed features first so they survive the [:3] cut
cols_to_plot = [c for c in ['acousticness','instrumentalness','duration_ms','energy','valence','tempo'] if c in X.columns][:3]
for c in cols_to_plot:
    plt.figure()
    X[c].hist(bins=40)
    plt.title(f'Histogram (raw): {c}')
    plt.xlabel(c); plt.ylabel('count')
    plt.show()


## 3. Define scalers and a helper to score clusters
We'll fit **K-Means (k=20)** on each scaled matrix and compute three common metrics.

In [None]:

SCALERS = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler(),
    'QuantileTransformer': QuantileTransformer(output_distribution='normal', n_quantiles=min(1000, len(X)), random_state=42)
}

def kmeans_metrics(Xt, k=20, random_state=42):
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(Xt)
    labels = km.labels_
    # If there's only one cluster (rare here), metrics are undefined
    uniq = set(labels)
    if len(uniq) < 2:
        return {'silhouette': None, 'davies_bouldin': None, 'calinski_harabasz': None, 'inertia': float(km.inertia_)}
    return {
        'silhouette': float(silhouette_score(Xt, labels)),
        'davies_bouldin': float(davies_bouldin_score(Xt, labels)),
        'calinski_harabasz': float(calinski_harabasz_score(Xt, labels)),
        'inertia': float(km.inertia_)
    }


## 4. Run the comparison
For each scaler → scale the data → run K-Means (k=20) → compute metrics.

In [None]:

rows = []
for name, scaler in SCALERS.items():
    Xt = scaler.fit_transform(X)
    met = kmeans_metrics(Xt, k=20, random_state=42)
    rows.append({'scaler': name, **met})

results = pd.DataFrame(rows).sort_values(['silhouette'], ascending=False)
results


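**How to read this table**
- **Higher Silhouette** → tighter, more separated clusters
- **Lower Davies–Bouldin** → better separation
- **Higher Calinski–Harabasz** → more between-cluster spread relative to within-cluster spread (dense, well-separated clusters)
- **Inertia** is measured in the units of each scaled space, so it is not comparable across scalers. The other three metrics are also computed in each scaler's own space, so treat cross-scaler differences as a guide rather than a strict ranking.

👉 If your data is skewed, **QuantileTransformer** often performs well. **RobustScaler** can help when there are outliers.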
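
To get a feel for the direction of each metric, the sketch below scores K-Means on two synthetic datasets: one with tight, well-separated blobs and one with wide, overlapping blobs. The centers, sample size, and the helper name `score_blobs` are assumptions made for this illustration and have nothing to do with the Spotify features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score, davies_bouldin_score, calinski_harabasz_score
)

# Four hypothetical cluster centers on a square
CENTERS = [[0, 0], [10, 0], [0, 10], [10, 10]]

def score_blobs(cluster_std):
    """Generate blobs with the given spread, cluster them, and return the three metrics."""
    Xb, _ = make_blobs(n_samples=600, centers=CENTERS,
                       cluster_std=cluster_std, random_state=0)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Xb)
    return {
        "silhouette": silhouette_score(Xb, labels),
        "davies_bouldin": davies_bouldin_score(Xb, labels),
        "calinski_harabasz": calinski_harabasz_score(Xb, labels),
    }

tight = score_blobs(cluster_std=0.5)   # compact, clearly separated clusters
loose = score_blobs(cluster_std=4.0)   # wide, overlapping clusters

for metric in tight:
    print(f"{metric:>18}: tight={tight[metric]:.3f}  loose={loose[metric]:.3f}")
```

The tight blobs should score a higher Silhouette, a lower Davies–Bouldin, and a higher Calinski–Harabasz than the loose ones, which matches the reading guide above.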

## 5. Visual effect of scaling (side-by-side histograms)
We'll pick one skewed feature and plot it *before vs after* scaling.

In [None]:

# Choose a feature to visualize
feat = 'acousticness' if 'acousticness' in X.columns else available[0]

# Raw
plt.figure()
X[feat].hist(bins=40)
plt.title(f'Histogram (raw): {feat}')
plt.xlabel(feat); plt.ylabel('count')
plt.show()

# Quantile-transformed
qt = SCALERS['QuantileTransformer']
Xt_q = qt.fit_transform(X)
# Work with the single column of interest (numpy is already imported in section 0)
feat_idx = list(X.columns).index(feat)
plt.figure()
pd.Series(Xt_q[:, feat_idx]).hist(bins=40)
plt.title(f'Histogram (Quantile): {feat}')
plt.xlabel(f'{feat} (quantile scaled)'); plt.ylabel('count')
plt.show()


**What to look for**
- If the raw histogram has a long tail or lots of values near 0/1, the quantile-scaled version should look more symmetric.
This helps distance-based clustering treat features more fairly.

## 6. Save this comparison for the report (optional)
Stores the table so later notebooks (or README) can cite it.

In [None]:

OUT = Path("../reports")
OUT.mkdir(parents=True, exist_ok=True)
out_path = OUT / "scaler_comparison_kmeans20.csv"
results.to_csv(out_path, index=False)
print(f"Saved: {out_path}")


---
## 7. Takeaways
- Scaling matters for distance-based clustering: without it, features with large ranges (here `duration_ms`, `tempo`, `loudness`) dominate the distances.
- **QuantileTransformer** often wins on skewed features; **RobustScaler** helps with outliers; **Standard/MinMax** are solid defaults.
- Keep `k` fixed while comparing scalers. Then, in the **next notebook**, sweep `k` to choose the number of playlists.

**Next:** Open `3_analysing_k_means__choosing_k_Spotify_5000.ipynb` (rewrite) to run **Elbow** and **Silhouette vs. k** and pick a good `k`.