# 5 — DBSCAN (Density-Based Clustering) on Spotify (Beginner-Friendly)

**Goal:** Use DBSCAN to discover clusters without pre-setting `k`, and identify **outliers** (noise) among songs.

**You’ll learn:**
- What DBSCAN’s `eps` and `min_samples` mean (intuitively)
- How to grid-search a few settings and interpret **noise%** and **#clusters**
- How to compute Silhouette / Davies–Bouldin / Calinski–Harabasz when valid
- How to visualize a couple of best settings on a PCA 2D map

> DBSCAN shines when clusters are **irregular shapes** and there are **true outliers** to be filtered out.

## 0) Imports & Setup

In [None]:

import numpy as np
import pandas as pd
import re
from pathlib import Path

import matplotlib.pyplot as plt

from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

plt.rcParams['figure.figsize'] = (7,5)
RNG = np.random.RandomState(42)


## 1) Load and prepare the data
We’ll clean column names and focus on Spotify audio features.

In [None]:

DATA = Path("../data/spotify_5000_songs.csv")
assert DATA.exists(), f"Missing data at {DATA}. Place your CSV there."

def clean_col(c):
    """Collapse whitespace and keep only the first token, e.g. 'energy  (0-1)' -> 'energy'."""
    s = re.sub(r"\s+", " ", str(c)).strip()
    return s.split(" ")[0]

df_raw = pd.read_csv(DATA)
df = df_raw.copy()
df.columns = [clean_col(c) for c in df.columns]

FEATURES = ['danceability','energy','acousticness','instrumentalness','liveness','valence',
            'tempo','speechiness','loudness','duration_ms','key','mode','time_signature']
available = [c for c in FEATURES if c in df.columns]
X = df[available].apply(pd.to_numeric, errors='coerce').dropna()
print('Using features:', available, '| Shape:', X.shape)
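To make the column cleaning concrete, here is a minimal sketch that applies the same `clean_col` logic to a few header strings. The messy headers are made up for illustration and are not taken from the CSV.

```python
import re

def clean_col(c):
    # Same logic as the notebook helper: collapse whitespace, keep the first token
    s = re.sub(r"\s+", " ", str(c)).strip()
    return s.split(" ")[0]

# Hypothetical messy headers (assumed, not from the real file)
messy = ["danceability   ", "  energy (0-1)", "duration_ms\t"]
cleaned = [clean_col(c) for c in messy]
print(cleaned)
```

Because only the first token is kept, a header written with a space instead of an underscore (say `time signature`) would be shortened to `time` and would then not match the `FEATURES` list, so it is worth checking the printed `available` list.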


## 2) Scale before DBSCAN
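DBSCAN decides which points are neighbors by distance (Euclidean by default), so unscaled features with large ranges such as `tempo` or `duration_ms` would dominate. We’ll use **QuantileTransformer**, which maps each feature to a normal-shaped distribution and so tames skewed features. Toggle `use_quantile` to compare against **StandardScaler**.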

In [None]:

use_quantile = True

if use_quantile:
    scaler = QuantileTransformer(output_distribution='normal', n_quantiles=min(1000, len(X)), random_state=42)
else:
    scaler = StandardScaler()

Xt = scaler.fit_transform(X)
Xt[:2]
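To see why scaling matters for a distance-based method, the sketch below measures how much of the squared Euclidean distance comes from a millisecond-scale feature before and after scaling. The data is synthetic (random stand-ins for `danceability` and `duration_ms`), and `duration_share` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Synthetic stand-ins, not the Spotify data
toy = np.column_stack([
    rng.uniform(0, 1, 500),            # on a 0..1 scale, like danceability
    rng.normal(200_000, 50_000, 500),  # in milliseconds, like duration_ms
])

def duration_share(A):
    # Fraction of the squared distance between consecutive rows that comes from column 1
    d2 = (A[1:] - A[:-1]) ** 2
    return d2[:, 1].sum() / d2.sum()

raw_share = duration_share(toy)
scaled_share = duration_share(StandardScaler().fit_transform(toy))
print(f"raw: {raw_share:.4f}  scaled: {scaled_share:.4f}")
```

Without scaling, nearly all of the distance comes from the millisecond column, so `eps` would effectively be a threshold on duration alone.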


## 3) DBSCAN parameters (intuition)
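- **`eps`**: neighborhood radius. Two points are neighbors if they lie within `eps` of each other, so a larger `eps` makes more points neighbors → fewer, bigger clusters and less noise
- **`min_samples`**: how many points (counting the point itself) must fall within `eps` for a point to become a core point of a dense region. A larger value means more conservative clustering and more noise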
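The effect of `eps` described above can be sketched on a toy dataset. This example uses scikit-learn's `make_moons` rather than the Spotify features, and `summarize` is a hypothetical helper; the three `eps` values are picked only to show a too-small, a moderate, and a too-large radius.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Toy data: two interleaved half-circles (assumed stand-in for real features)
pts, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

def summarize(eps, min_samples=5):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    n_clusters = len(set(labels) - {-1})          # -1 is noise, not a cluster
    noise_pct = 100.0 * np.mean(labels == -1)
    return n_clusters, noise_pct

results = {eps: summarize(eps) for eps in (0.01, 0.2, 5.0)}
for eps, (k, noise) in results.items():
    print(f"eps={eps}: clusters={k}, noise={noise:.1f}%")
```

A radius far smaller than the typical spacing leaves every point as noise, while a radius larger than the whole dataset merges everything into one cluster; useful settings sit in between.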

**Heuristics:**
- Start with a small `eps` and increase it until clusters appear and noise% drops to a workable level
- Try `min_samples` in `{5, 10, 20}`; higher values require denser neighborhoods before a cluster can form
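A common complement to trial and error is the k-distance heuristic: sort every point's distance to its k-th nearest neighbor and look for the range where the curve bends. The sketch below is not part of the original notebook; it uses synthetic blobs, and `k = 10` is an assumed candidate for `min_samples`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic data standing in for the scaled feature matrix
data, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.6, random_state=0)

k = 10  # assumed candidate for min_samples
# Querying the fitted data returns each point as its own first neighbor (distance 0),
# so the last column is the distance to the k-th nearest point counting the point itself
nn = NearestNeighbors(n_neighbors=k).fit(data)
dist, _ = nn.kneighbors(data)
k_dist = np.sort(dist[:, -1])

# Percentiles of the sorted curve give rough eps candidates to try
for q in (50, 90, 95):
    print(f"{q}th percentile of {k}-distance: {np.percentile(k_dist, q):.3f}")
```

Plotting `k_dist` with `plt.plot(k_dist)` and reading off the value near the bend is the usual way to use it; the percentiles are only a rough shortcut.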


## 4) Try a small grid of settings and score results
We’ll compute **Silhouette**, **Davies–Bouldin**, **Calinski–Harabasz** when at least two clusters are found. We also track **noise%** and **#clusters** (excluding noise).

In [None]:

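def safe_metrics(Xt, labels):
    """Score non-noise points only; return NaN scores when fewer than 2 clusters exist."""
    labels = np.asarray(labels)
    mask = labels != -1                      # drop noise so -1 is not scored as a cluster
    n_valid = len(set(labels[mask]))
    # The scores need at least 2 clusters and more points than clusters
    if n_valid < 2 or mask.sum() <= n_valid:
        return {'silhouette': np.nan, 'davies_bouldin': np.nan, 'calinski_harabasz': np.nan}
    Xc, lc = Xt[mask], labels[mask]
    return {
        'silhouette': float(silhouette_score(Xc, lc)),
        'davies_bouldin': float(davies_bouldin_score(Xc, lc)),
        'calinski_harabasz': float(calinski_harabasz_score(Xc, lc)),
    }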

grid_eps = [0.5, 0.8, 1.0, 1.2]
grid_min = [5, 10, 20]

rows = []
for eps in grid_eps:
    for ms in grid_min:
        model = DBSCAN(eps=eps, min_samples=ms).fit(Xt)
        labels = model.labels_
        noise_pct = float(np.mean(labels==-1)) * 100.0
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        met = safe_metrics(Xt, labels)
        rows.append({
            'eps': eps, 'min_samples': ms,
            'clusters': n_clusters,
            'noise_pct': noise_pct,
            **met
        })

df_grid = pd.DataFrame(rows).sort_values(['silhouette'], ascending=False)
df_grid


**How to read this table**
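- **silhouette**: higher is better (range –1 to 1; only computed when ≥2 clusters)
- **davies_bouldin**: lower is better
- **calinski_harabasz**: higher is better
- **noise_pct**: % of songs labeled as outliers (label `-1`); a very high value means most songs are left unclustered
- **clusters**: if 0–1, DBSCAN didn’t find separate groups at those settings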
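One way to act on these reading rules is to shortlist settings that find at least two clusters while staying under a noise budget. The sketch below builds a small grid on synthetic blobs (not the Spotify data); `MAX_NOISE_PCT` is a hypothetical threshold you would choose for your own use case.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data standing in for the scaled feature matrix
toy_X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=1)

toy_rows = []
for eps in (0.1, 0.3, 0.6, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(toy_X)
    toy_rows.append({
        "eps": eps,
        "clusters": len(set(labels) - {-1}),       # exclude the noise label
        "noise_pct": 100.0 * np.mean(labels == -1),
    })
toy_grid = pd.DataFrame(toy_rows)

MAX_NOISE_PCT = 20.0  # hypothetical noise budget, adjust to taste
shortlist = toy_grid[(toy_grid["clusters"] >= 2) & (toy_grid["noise_pct"] <= MAX_NOISE_PCT)]
print(shortlist)
```

The same two-condition filter can be applied to `df_grid` before ranking by silhouette, so a setting is not picked as best merely because it discards most songs as noise.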

## 5) Visualize a couple of settings on a PCA 2D map
We’ll pick two settings so you can *see* the effect of the parameters:
- the **best** setting by Silhouette (if available)
- the setting with the **highest noise%**, as a contrast

In [None]:

# Prepare a 2D PCA projection for plotting only (PCA is already imported in section 0)
P = PCA(n_components=2, random_state=42).fit_transform(Xt)

def plot_dbscan(eps, ms, title_suffix=''):
    model = DBSCAN(eps=eps, min_samples=ms).fit(Xt)
    labels = model.labels_
    plt.figure()
    plt.scatter(P[:,0], P[:,1], c=labels, s=4)
    plt.title(f'DBSCAN eps={eps}, min_samples={ms} {title_suffix}')
    plt.xlabel('PC1'); plt.ylabel('PC2')
    plt.show()

# Plot best silhouette combo (if exists)
best = df_grid.dropna(subset=['silhouette']).sort_values('silhouette', ascending=False).head(1)
if len(best):
    b = best.iloc[0]
    plot_dbscan(b['eps'], int(b['min_samples']), '(best silhouette)')

# Plot a high-noise case (if available)
hi_noise = df_grid.sort_values('noise_pct', ascending=False).head(1)
if len(hi_noise):
    h = hi_noise.iloc[0]
    plot_dbscan(h['eps'], int(h['min_samples']), '(high noise)')


**Reading the plots**
- Points labeled `-1` (noise) take the lowest color of the colormap (dark purple with Matplotlib’s default `viridis`); large regions of that color indicate many outliers
- A small number of big color islands → larger clusters
- Many little fragments → overly sensitive parameters
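With `c=labels`, noise simply gets whatever color the colormap assigns to `-1`. If you would rather make noise unmistakable, one option is to draw it separately in grey. The sketch below assumes synthetic 2D points with three planted far-away outliers; `plot_with_grey_noise` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two tight synthetic blobs plus three isolated points that should end up as noise
blobs, _ = make_blobs(n_samples=200, centers=[(0, 0), (5, 5)], cluster_std=0.3, random_state=0)
far_points = np.array([[20.0, -20.0], [-20.0, 20.0], [30.0, 30.0]])
pts2d = np.vstack([blobs, far_points])
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(pts2d)

def plot_with_grey_noise(points, labels, ax=None):
    if ax is None:
        _, ax = plt.subplots()
    noise = labels == -1
    # Noise first, in a fixed grey, so cluster colors are drawn on top
    ax.scatter(points[noise, 0], points[noise, 1], c="lightgrey", s=8, label="noise (-1)")
    ax.scatter(points[~noise, 0], points[~noise, 1], c=labels[~noise], s=8, cmap="tab10")
    ax.set_xlabel("PC1"); ax.set_ylabel("PC2")
    ax.legend(loc="best")
    return ax

ax = plot_with_grey_noise(pts2d, labels)
```

In the notebook you would pass the PCA projection `P` and the DBSCAN labels instead of the synthetic points.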

## 6) Save results for your report
We’ll save the grid and a short ‘best rows’ table.

In [None]:

OUT = Path("../reports")
OUT.mkdir(parents=True, exist_ok=True)

grid_path = OUT / "dbscan_grid_results.csv"
df_grid.to_csv(grid_path, index=False)

best_path = OUT / "dbscan_best_rows.csv"
df_grid.dropna(subset=['silhouette']).sort_values('silhouette', ascending=False).head(5).to_csv(best_path, index=False)

print("Saved:", grid_path)
print("Saved:", best_path)


---
## 7) When DBSCAN helps (and when it doesn’t)
- ✅ **Finds outliers** naturally (songs that don’t fit any mood cluster)
- ✅ Can find **arbitrary-shaped** clusters (not just spheres)
- ⚠️ Sensitive to `eps`/`min_samples`; you’ll often get either **too much noise** or **one giant cluster**
- ⚠️ When density is fairly uniform across the data (no dense groups separated by sparse gaps), centroid-based or hierarchical methods tend to give more consistent results

**Moosic takeaway:** Use DBSCAN **in addition** to K-Means/Agglomerative when you want to **flag oddball tracks** and prototype ‘edge-case’ playlists (e.g., experimental, live-only).
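Flagging oddball tracks comes down to mapping the `-1` labels back to rows of the original table. The sketch below shows one way to do that while respecting the index left behind by `dropna()`; the tracks and feature values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
# Made-up tracks: 50 similar songs, one planted outlier, one row with a missing value
songs = pd.DataFrame({
    "name": [f"track_{i}" for i in range(50)] + ["oddball", "missing_value"],
    "energy": np.r_[rng.normal(0, 0.1, 50), 10.0, np.nan],
    "valence": np.r_[rng.normal(0, 0.1, 50), 10.0, 0.0],
})

# dropna() removes incomplete rows but keeps the original index labels
feats = songs[["energy", "valence"]].dropna()
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(feats.to_numpy())

# Use the surviving index to look up the noise rows in the full table
noise_idx = feats.index[labels == -1]
flagged = songs.loc[noise_idx, "name"].tolist()
print(flagged)
```

In the notebook the same pattern would be `df.loc[X.index[labels == -1]]`, since `X` was also built with `dropna()` and its rows may not line up positionally with `df`.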