# 03 - Target & Features with scikit-learn

## Goal
This notebook builds **professional-grade ML datasets** from the **clean layer** (Notebook 02) and trains several models **exclusively within the scikit-learn ecosystem**.
It produces reproducible datasets, saved pipelines/models, and structured reports.

---

## Requirements / Tasks
This notebook covers the following ML use cases:

1. **Track popularity** (regression)
2. **Album popularity** (regression)
3. **Hit prediction** (binary classification)
4. **Explicit / content prediction** (binary classification)
5. **Mood tags** (multi-label classification)
   - Labels are derived from features (rule-based / derived labels)
6. **Artist clustering / community detection** (unsupervised learning)
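
The rule-based mood labels from item 5 can be sketched roughly as follows. This is a minimal illustration, assuming each rule is a `(tag, feature, quantile, direction)` tuple and that the threshold is a quantile of the feature's own distribution. The function `derive_mood_labels`, the `mood_` column prefix, and the toy values are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Hypothetical rules: (tag, source feature, quantile, direction)
RULES = [
    ("energetic", "energy", 0.75, "gt"),
    ("sad", "valence", 0.25, "lt"),
]

def derive_mood_labels(df: pd.DataFrame, rules) -> pd.DataFrame:
    """Return a 0/1 label matrix with one column per mood tag."""
    labels = pd.DataFrame(index=df.index)
    for tag, feature, q, direction in rules:
        values = pd.to_numeric(df[feature], errors="coerce")
        threshold = values.quantile(q)
        hit = values > threshold if direction == "gt" else values < threshold
        # Comparisons against NaN are False, so rows with a missing feature get label 0
        labels[f"mood_{tag}"] = hit.astype("int8")
    return labels

toy = pd.DataFrame({
    "energy": [0.1, 0.2, 0.3, 0.9],
    "valence": [0.05, 0.5, 0.6, 0.8],
})
mood = derive_mood_labels(toy, RULES)
print(mood)
```

Because the labels are deterministic functions of the input features, a model trained on them mostly learns to reproduce the thresholds. That is acceptable for a weak-label demonstration, but scores on such labels should be read with that in mind.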

---

## Input (clean layer from Notebook 02)
Preferred:
- `../data/processed/parquet/*.parquet`

Fallback:
- `../data/processed/clean_csv/*.csv`

---

## Output

### 1) Modeling datasets (Parquet)
- `../data/processed/modeling/track_dataset.parquet`
- `../data/processed/modeling/album_dataset.parquet`
- `../data/processed/modeling/artist_dataset.parquet`

### 2) Saved models & pipelines (joblib)
- `../data/models/03_track_popularity_pipeline.joblib`
- `../data/models/03_album_popularity_pipeline.joblib`
- `../data/models/03_hit_pipeline.joblib`
- `../data/models/03_explicit_pipeline.joblib`
- `../data/models/03_mood_pipeline.joblib`
- `../data/models/03_artist_clustering.joblib`
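
As a rough sketch of what persisting one of these pipelines could look like: the example below fits a small preprocessing-plus-model `Pipeline` on toy data, writes it with `joblib.dump`, and reloads it. The column names, the `Ridge` model choice, and the file name `demo_pipeline.joblib` are assumptions for illustration only; the notebook's actual pipelines are defined in later cells.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; column names are illustrative only
X = pd.DataFrame({
    "energy": [0.2, 0.8, np.nan, 0.5, 0.9, 0.1],
    "album_type": ["album", "single", "album", "single", "album", "single"],
})
y = np.array([10.0, 55.0, 30.0, 40.0, 70.0, 5.0])

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["energy"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["album_type"]),
])
pipe = Pipeline([("prep", preprocess), ("model", Ridge(alpha=1.0))]).fit(X, y)

# Persist preprocessing and model together, then reload and compare predictions
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_pipeline.joblib"
    dump(pipe, path)
    restored = load(path)

same = bool(np.allclose(pipe.predict(X), restored.predict(X)))
print("round trip preserved predictions:", same)
```

Saving the whole pipeline rather than only the estimator keeps imputation, scaling, and encoding bundled with the model, which is what later batch inference needs. Loading should happen with a compatible scikit-learn version.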

### 3) Configuration & reports
- `../data/models/feature_config.json`
- `../data/reports/03_target_and_features/*.json`

---

## Result
After the notebook has run, the following exist:
- modeling-ready Parquet datasets,
- trained and saved scikit-learn pipelines,
- reports/configs for traceable training and later batch inference.


In [112]:
import ast
import json
import math
import time
import platform
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Any

import numpy as np
import pandas as pd

# sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
    roc_auc_score, average_precision_score, f1_score,
    classification_report, confusion_matrix
)

from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

from joblib import dump

## Global Config

In [113]:
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Leakage controls:
# - If True: allow "post-release / popularity-like" proxy features (often boosts scores but less realistic)
# - If False: drop strongest leakage/proxies (recommended for realistic evaluation)
ALLOW_LEAKY_FEATURES = False

# "Main album per track" selection strategy:
# - "earliest_release": choose album with earliest release_date_parsed
# - "deterministic_id": choose smallest album_id (stable fallback)
MAIN_ALBUM_STRATEGY = "earliest_release"

# Hit label definition
HIT_PERCENTILE = 0.80
HIT_FALLBACK_POP_THRESHOLD = 60

# Genre multi-hot size
TOP_K_GENRES = 50

# Mood labels quantile rules (weak-label demonstration)
MOOD_TAGS = [
    ("energetic", "energy", 0.75, "gt"),
    ("danceable", "danceability", 0.75, "gt"),
    ("acoustic", "acousticness", 0.75, "gt"),
    ("instrumental", "instrumentalness", 0.75, "gt"),
    ("happy", "valence", 0.75, "gt"),
    ("sad", "valence", 0.25, "lt"),
    ("chill", "energy", 0.25, "lt"),
]

# Clustering
K_CLUSTERS = 30
TSNE_SAMPLE_MAX = 4000

pd.set_option("display.max_columns", 250)
pd.set_option("display.width", 180)
pd.set_option("display.max_rows", 40)
try:
    pd.options.mode.copy_on_write = True
except Exception:
    pass
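
The two hit-label settings above suggest a percentile threshold with a fixed fallback. The sketch below shows one plausible reading, assuming the fallback applies when the percentile is missing or not positive (for example when most popularity values are zero). The function `make_hit_label` and that fallback rule are assumptions, not the notebook's confirmed logic.

```python
import numpy as np
import pandas as pd

HIT_PERCENTILE = 0.80
HIT_FALLBACK_POP_THRESHOLD = 60

def make_hit_label(popularity: pd.Series) -> pd.Series:
    """Label a track as a hit (1) if its popularity reaches the threshold."""
    pop = pd.to_numeric(popularity, errors="coerce")
    threshold = pop.quantile(HIT_PERCENTILE)
    # Assumed fallback rule: use the fixed threshold when the quantile is unusable
    if pd.isna(threshold) or threshold <= 0:
        threshold = HIT_FALLBACK_POP_THRESHOLD
    # NaN popularity compares as False and therefore ends up as 0
    return (pop >= threshold).astype("int8")

pop = pd.Series([0, 5, 10, 20, 35, 50, 62, 70, 81, 95])
is_hit = make_hit_label(pop)
print(is_hit.tolist())

# Degenerate case: all-zero popularity falls back to the fixed threshold
all_zero = make_hit_label(pd.Series([0, 0, 0, 0]))
print(all_zero.tolist())
```

For a realistic evaluation, the threshold would be computed on the training split only and then reused for validation and test data.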

## Paths

In [114]:
@dataclass(frozen=True)
class Paths:
    clean_parquet_dir: Path = Path("../data/processed/parquet")
    clean_csv_dir: Path = Path("../data/processed/clean_csv")

    modeling_dir: Path = Path("../data/processed/modeling")
    models_dir: Path = Path("../data/models")
    reports_dir: Path = Path("../data/reports/03_target_and_features")

PATHS = Paths()
for p in [PATHS.modeling_dir, PATHS.models_dir, PATHS.reports_dir]:
    p.mkdir(parents=True, exist_ok=True)

RUN_META = {
    "run_ts_unix": int(time.time()),
    "python": platform.python_version(),
    "platform": platform.platform(),
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "random_seed": RANDOM_SEED,
    "allow_leaky_features": ALLOW_LEAKY_FEATURES,
    "main_album_strategy": MAIN_ALBUM_STRATEGY,
    "paths": {k: str(v) for k, v in asdict(PATHS).items()},
}
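
Run metadata like `RUN_META` is typically written next to the other JSON reports. A minimal sketch, assuming a hypothetical helper `write_json_report` and a hypothetical file name `run_meta.json` (neither appears in the notebook):

```python
import json
import tempfile
import time
from pathlib import Path

def write_json_report(payload: dict, path: Path) -> Path:
    """Write payload as pretty-printed JSON, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # default=str converts values json cannot encode (e.g. Path objects) to strings
    text = json.dumps(payload, indent=2, sort_keys=True, default=str)
    path.write_text(text, encoding="utf-8")
    return path

run_meta = {
    "run_ts_unix": int(time.time()),
    "random_seed": 42,
    "paths": {"models_dir": Path("../data/models")},
}

with tempfile.TemporaryDirectory() as tmp:
    out = write_json_report(run_meta, Path(tmp) / "reports" / "run_meta.json")
    loaded = json.loads(out.read_text(encoding="utf-8"))

print(loaded["random_seed"], loaded["paths"]["models_dir"])
```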

## Data Loading

In [115]:
TABLES = [
    "tracks",
    "audio_features",
    "albums",
    "artists",
    "genres",
    "r_albums_tracks",
    "r_track_artist",
    "r_artist_genre",
    "r_albums_artists",
]

def load_table(name: str) -> pd.DataFrame:
    pq = PATHS.clean_parquet_dir / f"{name}.parquet"
    csv = PATHS.clean_csv_dir / f"{name}.csv"

    if pq.exists():
        return pd.read_parquet(pq)
    if csv.exists():
        return pd.read_csv(csv, low_memory=False)
    raise FileNotFoundError(f"Missing {name} in parquet/csv clean layer.")

data: Dict[str, pd.DataFrame] = {}
for t in TABLES:
    pq = PATHS.clean_parquet_dir / f"{t}.parquet"
    csv = PATHS.clean_csv_dir / f"{t}.csv"
    if pq.exists() or csv.exists():
        data[t] = load_table(t)

{k: v.shape for k, v in data.items()}

{'tracks': (294618, 13),
 'audio_features': (294594, 21),
 'albums': (129152, 8),
 'artists': (139608, 6),
 'genres': (5416, 1),
 'r_albums_tracks': (305933, 2),
 'r_track_artist': (391700, 2),
 'r_artist_genre': (169289, 2),
 'r_albums_artists': (142153, 2)}

## Quick integrity sanity check

In [116]:
required = ["tracks", "audio_features", "albums", "artists", "r_albums_tracks", "r_track_artist", "r_artist_genre"]
missing = [t for t in required if t not in data]
assert not missing, f"Missing required tables in clean layer: {missing}"

tracks = data["tracks"].copy()
audio = data["audio_features"].copy()
albums = data["albums"].copy()
artists = data["artists"].copy()
rat = data["r_albums_tracks"].copy()
rta = data["r_track_artist"].copy()
rag = data["r_artist_genre"].copy()
genres = data.get("genres", pd.DataFrame(columns=["id"]))  # optional
raa = data.get("r_albums_artists", pd.DataFrame(columns=["album_id", "artist_id"])).copy()

# PK expectations (guarded)
assert "track_id" in tracks.columns, "tracks must contain track_id"
assert tracks["track_id"].is_unique

assert "id" in audio.columns and audio["id"].is_unique
assert "id" in albums.columns and albums["id"].is_unique
assert "id" in artists.columns and artists["id"].is_unique

if not genres.empty and "id" in genres.columns:
    assert genres["id"].is_unique

print("Clean layer looks consistent.")

Clean layer looks consistent.
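
The cell above checks primary-key uniqueness only. If referential integrity of the relation tables is also of interest, a check along the following lines could be added. This is a sketch on toy data; the helper `orphan_share` is hypothetical and not part of the notebook.

```python
import pandas as pd

def orphan_share(child: pd.DataFrame, fk: str, parent: pd.DataFrame, pk: str) -> float:
    """Share of non-null foreign keys in child without a matching key in parent."""
    keys = child[fk].dropna()
    if keys.empty:
        return 0.0
    return float((~keys.isin(parent[pk])).mean())

toy_tracks = pd.DataFrame({"track_id": ["t1", "t2", "t3"]})
toy_rta = pd.DataFrame({
    "track_id": ["t1", "t2", "t9", "t3"],   # t9 does not exist in toy_tracks
    "artist_id": ["a1", "a1", "a2", "a3"],
})

share = orphan_share(toy_rta, "track_id", toy_tracks, "track_id")
print(f"orphan track_ids in relation table: {share:.2%}")
```

Orphaned keys do not break the left joins used later, but they silently turn into missing values in the aggregated features, so it can help to know their share up front.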


## Helper utilities

In [117]:
def col_or_na(df: pd.DataFrame, col: str, dtype: Optional[str] = None) -> pd.Series:
    """
    Return df[col] if it exists; otherwise return an all-NA Series with the same index.
    Never returns None.
    """
    if df is None or not isinstance(df, pd.DataFrame):
        raise TypeError("col_or_na: df must be a pandas DataFrame")

    if col in df.columns:
        s = df[col]
        if dtype is not None:
            try:
                s = s.astype(dtype)
            except Exception:
                pass
        return s

    return pd.Series(pd.NA, index=df.index)

def safe_len_series(s: pd.Series) -> pd.Series:
    return s.astype("string").fillna("").str.len().astype("int32")

def safe_word_count_series(s: pd.Series) -> pd.Series:
    return s.astype("string").fillna("").str.split().str.len().astype("int32")

def add_release_time_features(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Adds release_year/month/decade from a datetime-like column."""
    df = df.copy()
    dt = pd.to_datetime(col_or_na(df, date_col), errors="coerce")
    df["release_year"] = dt.dt.year.astype("Int64")
    df["release_month"] = dt.dt.month.astype("Int64")
    df["release_decade"] = ((dt.dt.year // 10) * 10).astype("Int64")
    return df

def log1p_numeric(s: pd.Series) -> pd.Series:
    x = pd.to_numeric(s, errors="coerce")
    return np.log1p(x).astype("float64")

def ensure_list_column(s: pd.Series) -> pd.Series:
    """
    Ensure a column contains python lists.
    Accepts:
      - actual lists
      - JSON strings
      - repr strings like "['a','b']"
      - NaN/None
    """
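    def parse_one(v):
        # Parquet readers hand list cells back as numpy arrays, so accept those as well
        if isinstance(v, (list, tuple, np.ndarray)):
            return list(v)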
        if v is None or (isinstance(v, float) and np.isnan(v)):
            return []
        if isinstance(v, str):
            v = v.strip()
            if not v:
                return []
            # try JSON
            try:
                parsed = json.loads(v)
                if isinstance(parsed, list):
                    return parsed
            except Exception:
                pass
            # try python literal
            try:
                parsed = ast.literal_eval(v)
                if isinstance(parsed, list):
                    return parsed
            except Exception:
                pass
        return []
    return s.apply(parse_one)

def top_k_list_counts(list_series: pd.Series, top_k: int) -> List[str]:
    from collections import Counter
    c = Counter()
    for lst in list_series:
        if isinstance(lst, list):
            for x in lst:
                if pd.notna(x):
                    c[str(x)] += 1
    return [k for k, _ in c.most_common(top_k)]

def genres_to_multihot(df: pd.DataFrame, list_col: str, top_genres: List[str], prefix: str) -> pd.DataFrame:
    if not top_genres:
        return pd.DataFrame(index=df.index)
    m = np.zeros((len(df), len(top_genres)), dtype=np.int8)
    idx = {g: i for i, g in enumerate(top_genres)}
    lists = df[list_col]
    for r, lst in enumerate(lists):
        if isinstance(lst, list):
            for g in lst:
                j = idx.get(str(g))
                if j is not None:
                    m[r, j] = 1
    # Reuse df's index so the result stays row-aligned when concatenated with df
    return pd.DataFrame(m, columns=[f"{prefix}genre_{g}" for g in top_genres], index=df.index)

def onehot_encoder_compat() -> OneHotEncoder:
    """Handle sklearn versions where sparse_output may not exist."""
    try:
        return OneHotEncoder(handle_unknown="ignore", sparse_output=True)
    except TypeError:
        return OneHotEncoder(handle_unknown="ignore", sparse=True)

def kmeans_compat(n_clusters: int, random_state: int) -> KMeans:
    """Handle sklearn versions where n_init='auto' may not exist."""
    try:
        return KMeans(n_clusters=n_clusters, n_init="auto", random_state=random_state)
    except TypeError:
        return KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)

def regression_report(y_true, y_pred) -> Dict[str, float]:
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return {
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "RMSE": rmse,
        "R2": float(r2_score(y_true, y_pred)),
    }

def classification_report_binary(y_true, y_proba, threshold=0.5) -> Dict[str, Any]:
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    y_pred = (y_proba >= threshold).astype(int)
    out = {
        "roc_auc": float(roc_auc_score(y_true, y_proba)) if len(np.unique(y_true)) > 1 else None,
        "pr_auc": float(average_precision_score(y_true, y_proba)) if len(np.unique(y_true)) > 1 else None,
        "f1": float(f1_score(y_true, y_pred)) if len(np.unique(y_true)) > 1 else None,
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }
    return out
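
To make the intent of the Top-K genre helpers concrete, here is a small self-contained sketch of the same idea on toy data: count genres, keep the `k` most frequent, and build a 0/1 matrix. It re-implements the steps inline rather than calling the notebook's functions, and the toy frame deliberately uses a non-default index to show why the result should carry the source index before it is concatenated.

```python
from collections import Counter

import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"track_genres": [["rap", "hip hop"], ["rap"], [], ["jazz", "rap", "hip hop"]]},
    index=[10, 11, 12, 13],  # deliberately not 0..n-1
)

# 1) Vocabulary: the k most frequent genres
counts = Counter(g for lst in toy["track_genres"] for g in lst)
top_genres = [g for g, _ in counts.most_common(2)]

# 2) Multi-hot matrix; genres outside the vocabulary are ignored
col_of = {g: j for j, g in enumerate(top_genres)}
m = np.zeros((len(toy), len(top_genres)), dtype=np.int8)
for r, lst in enumerate(toy["track_genres"]):
    for g in lst:
        if g in col_of:
            m[r, col_of[g]] = 1

# 3) Carry the source index so concat aligns row by row
multihot = pd.DataFrame(m, columns=[f"genre_{g}" for g in top_genres], index=toy.index)
wide = pd.concat([toy, multihot], axis=1)
print(wide)
```

To avoid leaking information from held-out rows, the genre vocabulary would normally be computed on the training split only.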

## Track-Level Dataset (one row = one track)

**Goal:** We build a denormalized, ML-ready table in which **each row represents one track**.
To do so, we combine information from several tables (tracks, audio features, albums, artists, genres) and additionally derive **aggregated** and **engineered features**.

### What exactly happens here?

1. **Tracks + audio features (1:1 / left join)**
   - We attach the numeric audio features (e.g. energy, danceability, loudness, tempo) directly to the track.
   - If a track has no audio features, these fields stay `NaN` (left join).

2. **Track → album (many-to-many) and selection of a "main album"**
   - A track can appear on several albums (album, compilation, re-release).
   - For ML, however, we need **one unambiguous album context** per track.
   - We therefore deterministically pick exactly **one album per track** (e.g. the one with the earliest release date).

3. **Attach album metadata to the track**
   - We merge album information (e.g. album_type, release_date, album_popularity) onto the track level.
   - We then derive time features such as `release_year`, `release_month`, `release_decade`.

4. **Track → artists (many-to-many) + aggregation**
   - A track can have several artists (features, collaborations).
   - We store:
     - `artist_ids` as a list (for later analyses)
     - Aggregated artist statistics per track:
       - number of artists (`n_artists`)
       - mean/maximum of artist popularity and followers

5. **Track → genres via artist genres (many-to-many, union)**
   - On Spotify, genres are usually attached to artists, not directly to tracks.
   - We build:
     - `artist_id -> [genre_ids]`
     - `track_id -> union(artist_genres)` as `track_genres` (list)

6. **Feature engineering**
   - From text / metadata:
     - `has_preview`: whether a preview URL is present (0/1)
     - `name_len`, `name_words`: length and word count of the track name
   - Log transforms:
     - `log_duration`: reduces skew in the duration
     - `log_artist_followers_*`: stabilizes heavy-tailed follower counts
   - Quality indicators:
     - `has_audio_features`: whether audio features are present (0/1)

**Result:** `track_df` is a feature matrix at track level.
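
The main-album selection from step 2 can be illustrated on toy data. The sketch below assumes the "earliest release" strategy with `album_id` as tie-breaker; the toy IDs and dates are made up.

```python
import pandas as pd

toy_rat = pd.DataFrame({
    "track_id": ["t1", "t1", "t1", "t2", "t2"],
    "album_id": ["a3", "a1", "a2", "a5", "a4"],
})
toy_albums = pd.DataFrame({
    "album_id": ["a1", "a2", "a3", "a4", "a5"],
    "release_date_parsed": pd.to_datetime(
        ["2005-01-01", "1994-09-13", "1994-09-13", None, None]
    ),
})

picked = (
    toy_rat.merge(toy_albums, on="album_id", how="left")
           # NaT sorts last by default, so dated albums win over undated ones;
           # album_id breaks ties and keeps the choice deterministic
           .sort_values(["track_id", "release_date_parsed", "album_id"])
           .drop_duplicates("track_id", keep="first")[["track_id", "album_id"]]
           .reset_index(drop=True)
)
print(picked)
```

Track `t1` has two albums sharing the earliest date, so the smaller `album_id` wins; track `t2` has no dated album at all and falls back to the smallest `album_id`.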

In [118]:
# ------------------------------------------------------------
# 1) Join: tracks -> audio_features (left join)
# ------------------------------------------------------------
# Why:
#   - Audio features are core ML predictors (danceability, energy, loudness, tempo, ...)
#   - Left join keeps all tracks even if audio features are missing for some rows.
assert "audio_feature_id" in tracks.columns, "tracks must contain audio_feature_id for join with audio_features.id"

# Rename audio PK 'id' to match tracks FK 'audio_feature_id'
audio_small = audio.rename(columns={"id": "audio_feature_id"})

# Merge track metadata + audio features into one wide table
track_df = tracks.merge(audio_small, on="audio_feature_id", how="left", suffixes=("", "_af"))

# ------------------------------------------------------------
# 2) Track -> Album (Many-to-Many) and choose ONE "main album"
# ------------------------------------------------------------
# Problem:
#   - A track can appear on multiple albums (releases, compilations, deluxe editions).
# ML requirement:
#   - We want a single album context per track to avoid duplicate rows / ambiguity.
# Strategy:
#   - Deterministic selection:
#       MAIN_ALBUM_STRATEGY == "earliest_release" -> pick earliest release_date
#       else -> pick smallest album_id (stable fallback)
albums_for_pick = albums.copy().rename(columns={"id": "album_id"})
albums_for_pick["release_date_parsed"] = pd.to_datetime(
    col_or_na(albums_for_pick, "release_date_parsed"), errors="coerce"
)

# Attach album release dates to the relationship table (album_id, track_id)
rat2 = rat.merge(
    albums_for_pick[["album_id", "release_date_parsed"]],
    on="album_id",
    how="left"
)

# Pick main album per track based on strategy
if MAIN_ALBUM_STRATEGY == "earliest_release":
    rat2 = rat2.sort_values(["track_id", "release_date_parsed", "album_id"], ascending=[True, True, True])
    main_album_per_track = rat2.drop_duplicates("track_id", keep="first")[["track_id", "album_id"]]
else:
    rat2 = rat2.sort_values(["track_id", "album_id"], ascending=[True, True])
    main_album_per_track = rat2.drop_duplicates("track_id", keep="first")[["track_id", "album_id"]]

# Merge the selected main album_id into track_df
track_df = track_df.merge(main_album_per_track, on="track_id", how="left")

# ------------------------------------------------------------
# 3) Merge album metadata onto track
# ------------------------------------------------------------
# Why:
#   - album_type / release_date provide useful context
#   - album_popularity is a strong proxy but can be leakage depending on your goal
#     (you can later drop it via ALLOW_LEAKY_FEATURES switch in feature selection)
albums_join = albums_for_pick.copy()

# Rename to avoid name clash with track popularity
rename_map = {}
if "popularity" in albums_join.columns:
    rename_map["popularity"] = "album_popularity"
albums_join = albums_join.rename(columns=rename_map)

track_df = track_df.merge(albums_join, on="album_id", how="left", suffixes=("", "_album"))

# ------------------------------------------------------------
# 4) Add release time features (year/month/decade)
# ------------------------------------------------------------
# Why:
#   - Popularity and audio trends are time-dependent
#   - Helps model capture temporal shift and era effects
track_df = add_release_time_features(track_df, "release_date_parsed")

# ------------------------------------------------------------
# 5) Track -> Artists list (Many-to-Many)
# ------------------------------------------------------------
# Why:
#   - A track can have multiple artists
#   - Keeping a list can be useful for later analysis/debugging
track_to_artists = (
    rta.groupby("track_id")["artist_id"]
       .apply(list)
       .reset_index()
       .rename(columns={"artist_id": "artist_ids"})
)
track_df = track_df.merge(track_to_artists, on="track_id", how="left")

# ------------------------------------------------------------
# 6) Aggregate artist statistics per track
# ------------------------------------------------------------
# Why:
#   - Artist popularity/followers often correlate with track reach
#   - For multi-artist tracks, we aggregate to stable numeric features
artist_feat = artists.rename(
    columns={"id": "artist_id", "popularity": "artist_popularity", "followers": "artist_followers"}
)
rta_art = rta.merge(artist_feat, on="artist_id", how="left")

artist_agg = (
    rta_art.groupby("track_id")
           .agg(
               n_artists=("artist_id", "nunique"),
               artist_popularity_mean=("artist_popularity", "mean"),
               artist_popularity_max=("artist_popularity", "max"),
               artist_followers_mean=("artist_followers", "mean"),
               artist_followers_max=("artist_followers", "max"),
           )
           .reset_index()
)
track_df = track_df.merge(artist_agg, on="track_id", how="left")

# ------------------------------------------------------------
# 7) Track -> Genres (via artist genres), union per track
# ------------------------------------------------------------
# Why:
#   - Genres are usually attached to artists
#   - We derive a track-level genre profile by taking the union across its artists
#
# Note:
#   - We store genre IDs because they are stable keys.
#   - Later you can convert to names or multi-hot encode Top-K genres.
rag2 = rag.copy()
if "genre_id" not in rag2.columns and "id" in rag2.columns:
    rag2 = rag2.rename(columns={"id": "genre_id"})

# Build artist -> [genre_id] list
artist_to_genres = (
    rag2.groupby("artist_id")["genre_id"]
        .apply(lambda x: sorted(set([g for g in x.dropna().tolist()])))
        .reset_index()
        .rename(columns={"genre_id": "artist_genres"})
)

# Join artist genres into track-artist mapping, then union genres per track
rta_gen = rta.merge(artist_to_genres, on="artist_id", how="left")
track_to_genres = (
    rta_gen.groupby("track_id")["artist_genres"]
           .apply(lambda rows: sorted(set([g for lst in rows.dropna()
                                          for g in (lst if isinstance(lst, list) else [])])))
           .reset_index()
           .rename(columns={"artist_genres": "track_genres"})
)
track_df = track_df.merge(track_to_genres, on="track_id", how="left")

# Ensure list type (important when loading from CSV where lists may become strings)
track_df["track_genres"] = ensure_list_column(col_or_na(track_df, "track_genres"))

# ------------------------------------------------------------
# 8) Feature Engineering (binary flags, text-derived, log transforms)
# ------------------------------------------------------------

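# Preview availability: a simple content/availability indicator.
# Derive it only when preview_url exists; otherwise keep a has_preview flag that the
# clean layer already provides instead of overwriting it with zeros.
if "preview_url" in track_df.columns:
    track_df["has_preview"] = track_df["preview_url"].notna().astype("int8")
elif "has_preview" not in track_df.columns:
    track_df["has_preview"] = pd.Series(0, index=track_df.index, dtype="int8")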

# Track name features: cheap but sometimes useful
track_df["name_len"] = safe_len_series(col_or_na(track_df, "name"))
track_df["name_words"] = safe_word_count_series(col_or_na(track_df, "name"))

# Duration robust handling: datasets often have duration_ms instead of duration
dur_col = "duration" if "duration" in track_df.columns else ("duration_ms" if "duration_ms" in track_df.columns else None)
track_df["log_duration"] = log1p_numeric(track_df[dur_col]) if dur_col else pd.Series(np.nan, index=track_df.index)

# Followers are heavy-tailed -> log helps stabilize scale
track_df["log_artist_followers_max"] = log1p_numeric(col_or_na(track_df, "artist_followers_max"))
track_df["log_artist_followers_mean"] = log1p_numeric(col_or_na(track_df, "artist_followers_mean"))

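# Indicator whether audio features are present (helps model handle missingness).
# audio_feature_id is the FK from tracks and is set even when the left join found no match,
# so check whether the key actually exists in the audio table.
track_df["has_audio_features"] = (
    col_or_na(track_df, "audio_feature_id").isin(audio_small["audio_feature_id"]).astype("int8")
)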

print("Track-level dataset shape:", track_df.shape)
track_df.head(3)

Track-level dataset shape: (294618, 56)


Unnamed: 0,track_id,disc_number,duration,explicit,audio_feature_id,name,track_number,popularity,has_preview,is_long_track,is_tracknum_extreme,is_multidisc,is_disc_extreme,acousticness,analysis_url,danceability,duration_af,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,is_time_signature_rare,is_tempo_extreme,is_loudness_very_low,is_af_long,is_high_speech,is_instrumental,album_id,name_album,album_type,release_date,album_popularity,release_date_parsed,is_release_year_invalid,release_year,release_month,release_decade,artist_ids,n_artists,artist_popularity_mean,artist_popularity_max,artist_followers_mean,artist_followers_max,track_genres,name_len,name_words,log_duration,log_artist_followers_max,log_artist_followers_mean,has_audio_features
0,2DZN6ceJ7fMU2X6YWuIGHk,1,285053,False,2DZN6ceJ7fMU2X6YWuIGHk,Toccada del 3 Tono,14,0,0,0,0,0,0,0.621,https://api.spotify.com/v1/audio-analysis/2DZN...,0.147,285053.0,0.148,0.282,7,0.151,-21.444,1,0.0324,80.634003,3,0.0367,0.0,0.0,0.0,0.0,0.0,0.0,5v0bDDSl25qgrxOzxqoWXJ,Pedro Ruimonte en Bruselas (Música en la Corte...,album,1509062400000.0,0.0,2017-10-27,0.0,2017.0,10.0,2010.0,"[6xadlZzmcIMmgspceWCkt3, 0HWL7UfTuSRYVCrvTW5tj...",4,0.0,0,112.5,390,[musica antigua],18,4,12.560434,5.968708,4.731803,1
1,1dizvxctg9dHEyaYTFufVi,1,275893,True,1dizvxctg9dHEyaYTFufVi,Gz And Hustlas (feat. Nancy Fletcher),12,0,0,0,0,0,0,0.164,https://api.spotify.com/v1/audio-analysis/1diz...,0.652,275893.0,0.814,0.0,1,0.36,-4.901,1,0.31,91.888,4,0.788,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,NaT,,,,,"[7hJcb9fa4alzcOq3EaNPoG, 3E2vuvr0IQbReTbXw2MhX8]",2,0.0,0,3416346.5,6831895,"[g funk, gangster rap, hip hop, pop rap, rap, ...",37,6,12.527772,15.737113,15.044083,1
2,2g8HN35AnVGIk7B8yMucww,1,252746,True,2g8HN35AnVGIk7B8yMucww,Big Poppa - 2005 Remaster,13,0,0,0,0,0,0,0.43,https://api.spotify.com/v1/audio-analysis/2g8H...,0.78,252747.0,0.575,0.0,9,0.143,-7.247,0,0.273,84.491997,4,0.773,0.0,0.0,0.0,0.0,0.0,0.0,2HTbQ0RHwukKVXAlTmCZP2,Ready to Die (The Remaster),album,779414400000.0,0.0,1994-09-13,0.0,1994.0,9.0,1990.0,[5me0Irg2ANcsgc93uaYrpb],1,0.0,0,6258716.0,6258716,"[east coast hip hop, gangster rap, hardcore hi...",25,5,12.440144,15.649486,15.649486,1
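
The genre union behind the `track_genres` column can be traced on a toy example. The sketch below joins at the (track, artist, genre) level and collapses to a sorted set per track; it is an equivalent formulation for illustration rather than a copy of the cell above, and all IDs are made up.

```python
import pandas as pd

toy_rta = pd.DataFrame({
    "track_id": ["t1", "t1", "t2", "t3"],
    "artist_id": ["a1", "a2", "a2", "a3"],   # a3 has no genres at all
})
toy_rag = pd.DataFrame({
    "artist_id": ["a1", "a1", "a2", "a2"],
    "genre_id": ["rap", "g funk", "rap", "pop rap"],
})

pairs = toy_rta.merge(toy_rag, on="artist_id", how="left")
track_genres = (
    pairs.dropna(subset=["genre_id"])
         .groupby("track_id")["genre_id"]
         .apply(lambda s: sorted(set(s)))              # union across all artists of a track
         .reindex(toy_rta["track_id"].unique())        # bring back tracks without any genre
         .apply(lambda v: v if isinstance(v, list) else [])
)
print(track_genres.to_dict())
```

Sorting the set makes the list order reproducible across runs, which matters once the lists are written to disk and compared.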


## Album-Level Dataset (one row = one album)

**Goal:** We build an ML-ready table in which **each row represents one album**.
Because an album consists of many tracks and often has several artists, we mainly derive **aggregation features**.

### What exactly happens here?

1. **Album master data + release time features**
   - We start from `albums` (album metadata).
   - We parse `release_date_parsed` and derive from it:
     - `release_year`, `release_month`, `release_decade`

2. **Album size (track count)**
   - Via `r_albums_tracks` we count:
     - `n_tracks` = number of unique tracks per album
   - This is a strong structural feature (singles/EPs vs. albums).

3. **Album audio profile (aggregated track audio features)**
   - We aggregate the audio features across all tracks of an album:
     - e.g. `album_mean_energy`, `album_mean_danceability`, `album_mean_loudness`, `album_mean_tempo`
   - This yields an "audio signature" of the album.

4. **Album artist profile (if `r_albums_artists` is available)**
   - An album can have several artists.
   - We aggregate the artists per album:
     - `n_album_artists`
     - mean and maximum of popularity/followers

5. **Album genre profile (union of the album artists' genres)**
   - Genres typically come from artists.
   - We build `album_genres` as the union of all artist genres on the album.

6. **Feature engineering**
   - `log_n_tracks`: log transform to counter skew
   - `name_len`, `name_words`: simple text features from the album name

**Result:** `album_df` is an album feature matrix.
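
Step 3, the album audio profile, comes down to a join followed by a grouped mean. A toy sketch with made-up IDs and two audio columns:

```python
import pandas as pd

toy_rat = pd.DataFrame({
    "album_id": ["a1", "a1", "a2"],
    "track_id": ["t1", "t2", "t2"],   # t2 appears on both albums
})
toy_track_audio = pd.DataFrame({
    "track_id": ["t1", "t2"],
    "energy": [0.2, 0.8],
    "tempo": [90.0, 120.0],
})
audio_cols = ["energy", "tempo"]

album_audio = (
    toy_rat.merge(toy_track_audio, on="track_id", how="left")
           .groupby("album_id")[audio_cols]
           .mean()                      # NaN values are skipped per column
           .add_prefix("album_mean_")   # album_id sits in the index, so it is not prefixed
           .reset_index()
)
print(album_audio)
```

Applying the prefix while `album_id` is still the index avoids having to rename the key column back afterwards.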

In [119]:
# ------------------------------------------------------------
# 1) Start from album master data + parse dates
# ------------------------------------------------------------
# Why:
#   - Album-level tasks (e.g., album popularity regression) need a single row per album
#   - Release time features capture era effects and time bias

album_df = albums.copy()
album_df = album_df.rename(columns={"id": "album_id"})

album_df["release_date_parsed"] = pd.to_datetime(
    col_or_na(album_df, "release_date_parsed"), errors="coerce"
)

# Add derived time features: year / month / decade
album_df = add_release_time_features(album_df, "release_date_parsed")

# ------------------------------------------------------------
# 2) Album size feature: number of tracks per album
# ------------------------------------------------------------
# Why:
#   - Singles/EPs vs albums differ structurally (track count)
#   - Useful as a predictor and for data sanity checks
album_track_counts = (
    rat.groupby("album_id")["track_id"]
       .nunique()
       .reset_index()
       .rename(columns={"track_id": "n_tracks"})
)

# Both sides join on album_id (album_df's id column was renamed above)
album_df = album_df.merge(album_track_counts, on="album_id", how="left")

# ------------------------------------------------------------
# 3) Album audio signature: mean of track audio features
# ------------------------------------------------------------
# Why:
#   - Albums consist of multiple tracks; we aggregate to get a stable album-level profile
#   - Mean is a strong baseline aggregation (you could also add std/min/max later)
POLICY_AUDIO = [
    "acousticness", "danceability", "energy", "instrumentalness", "liveness",
    "speechiness", "valence", "loudness", "tempo"
]

# Keep only audio columns that exist (robust to schema differences)
audio_cols_present = [c for c in POLICY_AUDIO if c in track_df.columns]

# Join album-track relation to track audio features
rat_track_audio = rat.merge(track_df[["track_id"] + audio_cols_present], on="track_id", how="left")

# Aggregate per album (mean)
album_audio_agg = rat_track_audio.groupby("album_id")[audio_cols_present].mean().reset_index()

# Prefix columns so they are clearly album-aggregates
album_audio_agg = album_audio_agg.add_prefix("album_mean_").rename(columns={"album_mean_album_id": "album_id"})

# Merge audio aggregates back to album table
album_df = album_df.merge(album_audio_agg, on="album_id", how="left")

# ------------------------------------------------------------
# 4) Album -> artists aggregates (optional)
# ------------------------------------------------------------
# Why:
#   - Albums can have multiple artists; their popularity/followers often influence album success
#   - This block runs only if r_albums_artists exists in your export
if not raa.empty and "album_id" in raa.columns and "artist_id" in raa.columns:
    raa_art = raa.merge(artist_feat, on="artist_id", how="left")

    album_artist_agg = (
        raa_art.groupby("album_id")
              .agg(
                  n_album_artists=("artist_id", "nunique"),
                  album_artist_popularity_mean=("artist_popularity", "mean"),
                  album_artist_popularity_max=("artist_popularity", "max"),
                  album_artist_followers_mean=("artist_followers", "mean"),
                  album_artist_followers_max=("artist_followers", "max"),
              )
              .reset_index()
    )

    album_df = album_df.merge(album_artist_agg, on="album_id", how="left")

# ------------------------------------------------------------
# 5) Album genres union
# ------------------------------------------------------------
# Why:
#   - Spotify-like schemas often attach genres to artists
#   - We derive an album's genre profile as the union of all album artists' genres
if not raa.empty:
    raa_gen = raa.merge(artist_to_genres, on="artist_id", how="left")

    album_to_genres = (
        raa_gen.groupby("album_id")["artist_genres"]
              .apply(lambda rows: sorted(set([
                  g for lst in rows.dropna()
                  for g in (lst if isinstance(lst, list) else [])
              ])))
              .reset_index()
              .rename(columns={"artist_genres": "album_genres"})
    )

    album_df = album_df.merge(album_to_genres, on="album_id", how="left")
else:
    # Keep a consistent schema even if we can't compute genres
    album_df["album_genres"] = [[] for _ in range(len(album_df))]

# Ensure list type (important for CSV fallback)
album_df["album_genres"] = ensure_list_column(col_or_na(album_df, "album_genres"))

# ------------------------------------------------------------
# 6) Feature engineering (log transforms, name features)
# ------------------------------------------------------------
# log transform track count (often heavy-tailed: singles vs compilations)
album_df["log_n_tracks"] = log1p_numeric(col_or_na(album_df, "n_tracks"))

# Simple text features from album name
album_df["name_len"] = safe_len_series(col_or_na(album_df, "name"))
album_df["name_words"] = safe_word_count_series(col_or_na(album_df, "name"))

print("Album-level dataset shape:", album_df.shape)
album_df.head(3)

Album-level dataset shape: (129152, 29)


Unnamed: 0,album_id,name,album_type,release_date,popularity,release_date_parsed,is_release_year_invalid,release_year,release_month,release_decade,n_tracks,album_mean_acousticness,album_mean_danceability,album_mean_energy,album_mean_instrumentalness,album_mean_liveness,album_mean_speechiness,album_mean_valence,album_mean_loudness,album_mean_tempo,n_album_artists,album_artist_popularity_mean,album_artist_popularity_max,album_artist_followers_mean,album_artist_followers_max,album_genres,log_n_tracks,name_len,name_words
0,7zr66qWybr1mAMSUVVosKU,Reflexo,album,1464220800000,0,2016-05-26,0,2016,5,2010,1,0.519,0.726,0.491,7e-06,0.0965,0.126,0.287,-11.166,109.935997,1.0,0.0,0,147089.0,147089,[hip hop tuga],0.693147,7,1
1,7zrLd0zddHOwA9DGlsDr4h,Floating World,album,1410652800000,0,2014-09-14,0,2014,9,2010,1,0.000516,0.335,0.823,0.331,0.213,0.0437,0.0699,-7.041,90.175003,1.0,0.0,0,75.0,75,[crossover prog],0.693147,14,2
2,7zri1pX9eMh0IqwpxMxOwp,Arne Aano's Beste - Slepp Himlen I Sjela Di Inn,album,1236729600000,0,2009-03-11,0,2009,3,2000,1,0.888,0.537,0.303,0.0,0.135,0.0365,0.541,-9.413,137.932007,1.0,0.0,0,0.0,0,[],0.693147,47,10


## Artist-Level Dataset (one row = one artist)

**Goal:** We build an ML-ready table in which **each row represents one artist**.
This table is used mainly for **clustering / community detection** (unsupervised), but it can later also serve supervised tasks (e.g., artist popularity).

### What exactly happens here?

1. **Artist master data**
   - We start from `artists` and rename the ID column to `artist_id` so that joins stay consistent.

2. **Artist style profile from tracks (aggregation)**
   - We link artists to their tracks via `r_track_artist`.
   - We attach the track features (audio, plus popularity/explicit if present) and then aggregate per artist:
     - `n_tracks`: number of unique tracks
     - `track_pop_mean`: mean track popularity (if available)
     - `explicit_rate`: share of explicit tracks (if available)
     - `mean_<audio_feature>`: mean audio signature (e.g., `mean_energy`, `mean_danceability`, …)

   Result: each artist gets a stable numeric vector that describes their "sound".

3. **Genres per artist**
   - We merge the list of genres (`artist_genres`) per artist (from `r_artist_genre`).
   - This list can later be used, e.g., as multi-hot features.

4. **Feature engineering**
   - `log_followers`: log transform for heavy-tailed follower counts
   - `log_n_tracks`: log transform, since track counts are often strongly skewed

**Result:** `artist_df` contains per artist:
- master data (name, popularity, followers, …)
- aggregated track audio signature
- genre list
- log-transformed features that stabilize scale


In [120]:
# ------------------------------------------------------------
# 1) Start from artist master data
# ------------------------------------------------------------
# Why:
#   - We want a single vector per artist for clustering / similarity analysis
#   - Rename PK to artist_id for consistent joins across tables
artist_df = artists.rename(columns={"id": "artist_id"}).copy()

# ------------------------------------------------------------
# 2) Build artist "style profile" by aggregating over all their tracks
# ------------------------------------------------------------
# Why:
#   - Artists have many tracks (Many-to-Many: r_track_artist)
#   - We want stable numeric features per artist:
#       * number of tracks
#       * average audio signature (mean_energy, mean_danceability, ...)
#       * optionally: average track popularity and explicit rate
cols_for_artist_agg = ["track_id"] + audio_cols_present

if "popularity" in track_df.columns:
    cols_for_artist_agg += ["popularity"]

if "explicit" in track_df.columns:
    cols_for_artist_agg += ["explicit"]

rta_track_audio = rta.merge(track_df[cols_for_artist_agg], on="track_id", how="left")


# Helper: explicit rate per artist (share of explicit tracks; NaN if no usable values)
def explicit_rate_fn(x):
    xx = pd.to_numeric(x, errors="coerce")
    if xx.dropna().empty:
        return np.nan
    return float(np.nanmean(xx))

agg_dict = {
    "n_tracks":("track_id","nunique")
}

# Optional: average track popularity (proxy of how popular their tracks tend to be)
if "popularity" in rta_track_audio.columns:
    agg_dict["track_pop_mean"] = ("popularity", "mean")

# Optional: explicit rate (share of explicit tracks)
if "explicit" in rta_track_audio.columns:
    agg_dict["explicit_rate"] = ("explicit", explicit_rate_fn)

# Core: mean audio signature per artist
for c in audio_cols_present:
    agg_dict[f"mean_{c}"] = (c, "mean")

artist_audio_agg = (
    rta_track_audio.groupby("artist_id")
    .agg(**agg_dict)
    .reset_index()
)

# Merge aggregated features back into artist table
artist_df = artist_df.merge(artist_audio_agg, on="artist_id", how="left")

# ------------------------------------------------------------
# 3) Attach genres list per artist
# ------------------------------------------------------------
# Why:
#   - Genres are usually provided at artist-level
#   - We keep them as list for later multi-hot encoding (Top-K)
artist_df = artist_df.merge(artist_to_genres, on="artist_id", how="left")
artist_df["artist_genres"] = ensure_list_column(col_or_na(artist_df, "artist_genres"))

# ------------------------------------------------------------
# 4) Feature engineering (log transforms for heavy-tailed counts)
# ------------------------------------------------------------
# followers and track counts are typically very skewed -> log stabilizes scale
artist_df["log_followers"] = log1p_numeric(col_or_na(artist_df, "followers"))
artist_df["log_n_tracks"] = log1p_numeric(col_or_na(artist_df, "n_tracks"))

print("Artist-level dataset shape:", artist_df.shape)
artist_df.head(3)


Artist-level dataset shape: (139608, 21)


Unnamed: 0,artist_id,name,popularity,followers,is_followers_extreme,followers_log1p,n_tracks,track_pop_mean,explicit_rate,mean_acousticness,mean_danceability,mean_energy,mean_instrumentalness,mean_liveness,mean_speechiness,mean_valence,mean_loudness,mean_tempo,artist_genres,log_followers,log_n_tracks
0,7zzl8HQ2v9hVdLh0Ygkwgc,Megatherio,0,59,0,4.094345,3,0.0,1.0,0.000101,0.267333,0.99,0.013101,0.104633,0.0902,0.3119,-4.583,136.755666,[brazilian thrash metal],4.094345,1.386294
1,00045gNg7mLEf9UY9yhD0t,Kubus & BangBang,0,820,0,6.710523,11,0.0,1.0,0.121626,0.658273,0.651,0.000177,0.246955,0.3073,0.497918,-8.265273,122.092817,[dutch hip hop],6.710523,2.484907
2,000xagx3GkcunHTFdB4ly0,Moxa,0,156,0,5.056246,1,0.0,0.0,0.000151,0.441,0.959,0.334,0.229,0.0611,0.171,-4.694,138.009003,[indie emo],5.056246,0.693147


## Save modeling datasets

In [121]:
track_out = PATHS.modeling_dir / "track_dataset.parquet"
album_out = PATHS.modeling_dir / "album_dataset.parquet"
artist_out = PATHS.modeling_dir / "artist_dataset.parquet"

track_df.to_parquet(track_out, index=False)
album_df.to_parquet(album_out, index=False)
artist_df.to_parquet(artist_out, index=False)

print(" Saved modeling datasets:")
print(" -", track_out)
print(" -", album_out)
print(" -", artist_out)

 Saved modeling datasets:
 - ..\data\processed\modeling\track_dataset.parquet
 - ..\data\processed\modeling\album_dataset.parquet
 - ..\data\processed\modeling\artist_dataset.parquet



## Targets

In this section we construct all targets used in this project:

- (A) Track popularity regression target (continuous)
- (B) Album popularity regression target (continuous)
- (C) Hit prediction target (binary) using a year-relative threshold
- (D) Explicit/content target (binary)
- (E) Mood tags target (multi-label; weak supervision via audio feature quantiles)

Why separate targets from features?
- Prevent leakage: targets are derived ONLY from allowed columns.
- Reproducibility: the same label definitions are reused later (Notebook 4 scoring).


In [122]:
# (A) Track popularity regression
# Popularity is typically in [0,100]. Some rows may have missing popularity -> keep as NaN and mask later.
assert "popularity" in track_df.columns, "track_df must contain 'popularity' for track popularity target"
y_track_pop = pd.to_numeric(track_df["popularity"], errors="coerce").astype("float64")

# (B) Album popularity regression
# Similar to tracks, popularity is the numeric target, and NaN indicates missing label.
assert "popularity" in album_df.columns, "album_df must contain 'popularity' for album popularity target"
y_album_pop = pd.to_numeric(album_df["popularity"], errors="coerce").astype("float64")

# (C) Hit prediction (binary)
# Default definition:
# - A "hit" is defined within each release year using a percentile threshold.
# - This is fairer than a fixed popularity threshold across decades.
#
# Fallback:
# - If release_year is missing OR the year has fewer than min_tracks_per_year rows,
#   use a global percentile threshold (computed on non-zero popularity by default).

def build_hit_labels_robust(
    df: pd.DataFrame,
    hit_percentile: float = 0.90,        # "top 10%" (within year if possible)
    desired_rate: float = 0.10,          # safety fallback target positive rate
    min_tracks_per_year: int = 200,      # lower this if your sample per year is small
    use_nonzero: bool = True             # ignore popularity==0 when computing thresholds
) -> pd.Series:
    pop = pd.to_numeric(df["popularity"], errors="coerce").astype("float64")
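    # Build the year as a Series even if the column is absent
    # (df.get(..., np.nan) would yield a scalar, which breaks .notna()/.map() below)
    if "release_year" in df.columns:
        year = pd.to_numeric(df["release_year"], errors="coerce").round()
    else:
        year = pd.Series(np.nan, index=df.index, dtype="float64")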

    # ---------- global threshold (non-zero aware) ----------
    if use_nonzero:
        nz = pop[(pop > 0) & pop.notna()]
    else:
        nz = pop.dropna()

    if len(nz) > 0:
        global_thr = float(nz.quantile(hit_percentile))
    else:
        # if everything is 0/NaN, fall back to regular quantile
        global_thr = float(pop.dropna().quantile(hit_percentile)) if pop.notna().any() else 0.0

    # ---------- per-year thresholds (only for "good" years) ----------
    y = pd.Series(np.nan, index=df.index, dtype="float64")

    if year.notna().any():
        tmp = pd.DataFrame({"year": year, "pop": pop}).dropna(subset=["year", "pop"])

        # keep only years with enough rows
        counts = tmp["year"].value_counts()
        good_years = counts[counts >= min_tracks_per_year].index
        tmp_good = tmp[tmp["year"].isin(good_years)]

        if len(tmp_good) > 0:
            def year_thr_func(s: pd.Series) -> float:
                s = s.dropna()
                if use_nonzero:
                    s = s[s > 0]
                if len(s) == 0:
                    return np.nan
                return float(s.quantile(hit_percentile))

            year_thr = tmp_good.groupby("year")["pop"].apply(year_thr_func)
            thr = year.map(year_thr)  # NaN for missing/rare years

            # year rule where threshold exists (1.0/0.0; NaN where no per-year threshold)
            y = (pop >= thr).astype("float64").where(thr.notna())

    # ---------- fill missing with global rule ----------
    y = y.where(y.notna(), (pop >= global_thr).astype("float64"))

    # finalize float 0/1 -> int8 (y stays numeric throughout instead of becoming object dtype)
    y = y.fillna(0.0).astype("int8")

    # ---------- safety: if label became one-class, force top-K globally ----------
    if y.nunique() < 2:
        n = int(pop.notna().sum())
        k = max(1, int(desired_rate * n))

        # Take top-k by popularity (NaN ranked last via -1; on ties nlargest keeps the first rows)
        top_idx = pop.fillna(-1).nlargest(k).index
        y = pd.Series(0, index=df.index, dtype="int8")
        y.loc[top_idx] = 1

    return y

y_hit = build_hit_labels_robust(
    track_df,
    hit_percentile=HIT_PERCENTILE,
    desired_rate=0.10,
    min_tracks_per_year=200,   # if your sample per year is smaller, set 50
    use_nonzero=True
)

print("Hit label distribution:", y_hit.value_counts(dropna=False).to_dict())
print("Hit positive rate:", float(y_hit.mean()))


# (D) Explicit prediction (binary)
# explicit is already (0/1) in most Spotify dumps. Missing -> 0 (conservative).
if "explicit" in track_df.columns:
    y_explicit = pd.to_numeric(track_df["explicit"], errors="coerce").fillna(0).astype("int8")
else:
    y_explicit = pd.Series(0, index=track_df.index, dtype="int8")

# (E) Mood tags (multi-label)
# We create weak supervision labels using quantiles of audio features.
# Example: "high_energy" = 1 if energy is above 80th percentile.
#
# NOTE on evaluation:
# - Strict: compute thresholds on TRAIN ONLY to avoid slight leakage.
# - Demo/baseline: compute thresholds on FULL data (fast and stable).
mood_thresholds: Dict[tuple, float] = {}

for name, col, q, direction in MOOD_TAGS:
    if col in track_df.columns:
        vals = pd.to_numeric(track_df[col], errors="coerce").dropna()
        mood_thresholds[(name, col, q, direction)] = float(vals.quantile(q)) if len(vals) else np.nan
    else:
        mood_thresholds[(name, col, q, direction)] = np.nan

def build_mood_labels(df: pd.DataFrame) -> pd.DataFrame:
    """
    Build a multi-label target matrix Y_mood of shape (n_samples, n_labels).
    Each label is derived from a threshold on an audio feature.
    Missing audio feature values become label=0 (no evidence for tag).
    """
    out = pd.DataFrame(index=df.index)

    for name, col, q, direction in MOOD_TAGS:
        if col not in df.columns:
            out[name] = 0
            continue

        thr = mood_thresholds.get((name, col, q, direction), np.nan)
        x = pd.to_numeric(df[col], errors="coerce")

        if np.isnan(thr):
            out[name] = 0
            continue

        if direction == "gt":
            out[name] = (x >= thr).fillna(False).astype("int8")
        else:
            out[name] = (x <= thr).fillna(False).astype("int8")

    return out

Y_mood = build_mood_labels(track_df)

print("Targets prepared:")
print(" - y_track_pop:", y_track_pop.shape, "missing_rate:", float(y_track_pop.isna().mean()))
print(" - y_album_pop:", y_album_pop.shape, "missing_rate:", float(y_album_pop.isna().mean()))
print(" - y_hit dist:", y_hit.value_counts(dropna=False).to_dict())
print(" - y_explicit dist:", y_explicit.value_counts(dropna=False).to_dict())
print(" - Y_mood:", Y_mood.shape)

Hit label distribution: {0: 265157, 1: 29461}
Hit positive rate: 0.09999728461940546
Targets prepared:
 - y_track_pop: (294618,) missing_rate: 0.0
 - y_album_pop: (129152,) missing_rate: 0.0
 - y_hit dist: {0: 265157, 1: 29461}
 - y_explicit dist: {0: 214945, 1: 79673}
 - Y_mood: (294618, 7)
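
The note in the cell above distinguishes a strict variant (thresholds from the training rows only) from the demo variant (thresholds from the full data). A minimal sketch of the strict variant follows. It assumes a hypothetical `MOOD_TAGS_DEMO` list in the same `(name, column, quantile, direction)` layout as `MOOD_TAGS`, a random toy frame, and illustrative function names that are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical tag spec, same (name, column, quantile, direction) layout as MOOD_TAGS.
MOOD_TAGS_DEMO = [
    ("high_energy", "energy", 0.80, "gt"),
    ("low_valence", "valence", 0.20, "lt"),
]

def fit_mood_thresholds(train_df, tags):
    """Compute quantile thresholds from the training rows only."""
    thresholds = {}
    for name, col, q, _direction in tags:
        vals = pd.to_numeric(train_df[col], errors="coerce").dropna()
        thresholds[name] = float(vals.quantile(q)) if len(vals) else np.nan
    return thresholds

def apply_mood_thresholds(df, tags, thresholds):
    """Label any frame (train or test) with thresholds fitted elsewhere."""
    out = pd.DataFrame(index=df.index)
    for name, col, _q, direction in tags:
        thr = thresholds[name]
        if np.isnan(thr):
            out[name] = np.int8(0)  # no threshold available -> no evidence for the tag
            continue
        x = pd.to_numeric(df[col], errors="coerce")
        flag = (x >= thr) if direction == "gt" else (x <= thr)
        out[name] = flag.astype("int8")  # NaN compares as False -> label 0
    return out

# Toy data standing in for the audio feature columns of track_df.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"energy": rng.random(200), "valence": rng.random(200)})
train_df, test_df = train_test_split(toy, test_size=0.25, random_state=0)

thr = fit_mood_thresholds(train_df, MOOD_TAGS_DEMO)            # fitted on train only
Y_train = apply_mood_thresholds(train_df, MOOD_TAGS_DEMO, thr)
Y_test = apply_mood_thresholds(test_df, MOOD_TAGS_DEMO, thr)   # reuses train thresholds
```

With this split, the test labels never influence the label definition. The price is that the positive rate on the test rows is only approximately equal to the configured quantile.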




## Genre multi-hot (Top-K) for track/album/artist

Genres are stored as lists (e.g. `track_genres = [genre_id1, genre_id2, ...]`).
Most ML models need fixed-size numeric vectors, so we:

1. pick the Top-K most frequent genres in `track_df`
2. create a multi-hot encoding (0/1 columns) for those Top-K genres

The same Top-K vocabulary (taken from `track_df`) is reused for albums and artists, so the genre columns refer to the same genres in all three tables.

Why Top-K?
- The full genre space can be huge.
- Top-K keeps dimensionality reasonable and avoids an explosion of sparse columns.
- Rare genres are implicitly grouped into "other": a row whose genres all fall outside the Top-K is encoded as all zeros.
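
To make the encoding concrete, here is a small self-contained sketch on toy data. The helpers `top_k_genres` and `multihot` and the column naming are assumptions for illustration only; the notebook's own `top_k_list_counts` and `genres_to_multihot` are defined earlier and may differ in detail.

```python
from collections import Counter

import pandas as pd

def top_k_genres(genre_lists, k):
    """Return the k most frequent genres across a Series of lists (hypothetical helper)."""
    counts = Counter(g for genres in genre_lists for g in genres)
    return [g for g, _ in counts.most_common(k)]

def multihot(genre_lists, vocab, prefix):
    """One 0/1 column per vocabulary genre; genres outside the vocabulary are dropped."""
    data = {
        f"{prefix}genre_{g}": [int(g in genres) for genres in genre_lists]
        for g in vocab
    }
    return pd.DataFrame(data, index=genre_lists.index, dtype="int8")

# Toy genre lists: "rock" appears 3 times, "jazz" twice, "indie" once.
toy = pd.Series([["rock", "indie"], ["rock"], ["jazz"], [], ["rock", "jazz"]])

vocab = top_k_genres(toy, k=2)                   # "indie" does not make the cut
encoded = multihot(toy, vocab, prefix="track_")  # shape: (5 rows, 2 genre columns)
```

Row 0 loses its `indie` entry because that genre is outside the vocabulary, and the empty list in row 3 becomes an all-zero row, which is the implicit "other" bucket described above.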

In [123]:
top_genres = top_k_list_counts(track_df["track_genres"], top_k=TOP_K_GENRES) if "track_genres" in track_df.columns else []

track_genre_mh = (
    genres_to_multihot(track_df, "track_genres", top_genres, prefix="track_")
    if top_genres else pd.DataFrame(index=track_df.index)
)
album_genre_mh = (
    genres_to_multihot(album_df, "album_genres", top_genres, prefix="album_")
    if (top_genres and "album_genres" in album_df.columns) else pd.DataFrame(index=album_df.index)
)
artist_genre_mh = (
    genres_to_multihot(artist_df, "artist_genres", top_genres, prefix="artist_")
    if (top_genres and "artist_genres" in artist_df.columns) else pd.DataFrame(index=artist_df.index)
)

print("Genre multi-hot shapes:", track_genre_mh.shape, album_genre_mh.shape, artist_genre_mh.shape)


Genre multi-hot shapes: (294618, 50) (129152, 50) (139608, 50)


## Feature Selection & Leakage Guards

In this step we decide **which columns are allowed as model inputs** (features).
The goal is to build a **stable, reproducible feature schema** that can be reused during inference (Notebook 4).

### Track feature groups
The track-level feature set is composed of several feature families:

- **Track metadata**
  - Example: `duration`, `disc_number`, `track_number`, text-derived features like `name_len`, `name_words`
- **Release time features**
  - Example: `release_year`, `release_month`, `release_decade`
  - Reason: popularity and audio trends shift across eras
- **Artist aggregates**
  - Example: `n_artists`, `artist_followers_mean/max`, `artist_popularity_mean/max`, plus `log1p` variants
  - Reason: artist reach often correlates with track exposure
- **Audio features**
  - Example: `energy`, `valence`, `tempo`, `loudness`, `danceability`, `speechiness`, etc.
  - Reason: core predictors for mood/content modeling and popularity structure
- **Genre vectors**
  - Multi-hot encoded **Top-K genres** derived from artist genres aggregated to track-level

### Leakage guards (high importance)
Some features can act as **post-success proxies** and may cause unrealistic performance estimates.

- **`album_popularity`**
  - Often reflects the same popularity ecosystem as track popularity.
  - Using it to predict track popularity can leak information and inflate results.
  - **Default policy:** excluded unless explicitly allowed via `ALLOW_LEAKY_FEATURES`.

- **`artist_popularity_*`**
  - Artist popularity can be updated after hits and may correlate strongly with success.
  - Depending on the deployment scenario, it may be partially leaky.
  - **Default policy (recommended):** remove `artist_popularity_mean/max` when `ALLOW_LEAKY_FEATURES = False`.

These guards keep the model closer to a real-world setting where only **pre-available / non-proxy** features are available at prediction time.


In [124]:
# Audio feature columns (policy-driven)
track_audio_extra = [c for c in ["key", "mode", "time_signature"] if c in track_df.columns]
track_audio_main = [c for c in POLICY_AUDIO if c in track_df.columns]

# Duration base column (depends on your dataset naming)
duration_feature = None
if isinstance(dur_col, str) and dur_col.strip() and (dur_col in track_df.columns):
    duration_feature = dur_col


# Track numeric columns
TRACK_NUMERIC = [
    "disc_number", "track_number",
    *( [duration_feature] if duration_feature else [] ),
    "log_duration",
    "has_preview",
    "has_audio_features",
    "release_year", "release_month", "release_decade",
    "n_artists",
    "artist_popularity_mean", "artist_popularity_max",
    "artist_followers_mean", "artist_followers_max",
    "log_artist_followers_mean", "log_artist_followers_max",
    "name_len", "name_words",
] + track_audio_main + track_audio_extra

# Track categorical columns (keep small-cardinality only)
# album_group should be gone; keep album_type if present.
TRACK_CATEGORICAL = [c for c in ["album_type"] if c in track_df.columns]

# Remove missing columns safely
TRACK_NUMERIC = [
    c for c in TRACK_NUMERIC
    if (c is not None) and (not pd.isna(c)) and (c in track_df.columns)
]

TRACK_CATEGORICAL = [
    c for c in TRACK_CATEGORICAL
    if (c is not None) and (not pd.isna(c)) and (c in track_df.columns)
]

# Leakage guard:
# album_popularity is a very strong proxy; default OFF to avoid leakage.
if "album_popularity" in track_df.columns and ALLOW_LEAKY_FEATURES:
    TRACK_NUMERIC = TRACK_NUMERIC + ["album_popularity"]

# If leakage is OFF, also remove the artist popularity proxies.
if not ALLOW_LEAKY_FEATURES:
    TRACK_NUMERIC = [c for c in TRACK_NUMERIC if c not in {"artist_popularity_mean", "artist_popularity_max"}]

# Build X_track with base features + genre multi-hot
X_track_base = track_df[TRACK_NUMERIC + TRACK_CATEGORICAL].copy()
X_track = pd.concat([X_track_base.reset_index(drop=True), track_genre_mh.reset_index(drop=True)], axis=1)

# --- Create task-specific masks ---
# Regression requires non-null target
mask_track_pop = y_track_pop.notna()
X_track_pop = X_track.loc[mask_track_pop].reset_index(drop=True)
y_track_pop_clean = y_track_pop.loc[mask_track_pop].reset_index(drop=True)

# Hit requires popularity (already used to construct label)
mask_hit = track_df["popularity"].notna()
X_track_hit = X_track.loc[mask_hit].reset_index(drop=True)
y_hit_clean = y_hit.loc[mask_hit].reset_index(drop=True)

# Explicit uses all rows; missing explicit defaulted to 0
X_track_explicit = X_track.reset_index(drop=True)
y_explicit_clean = y_explicit.reset_index(drop=True)

# Mood requires audio features available (otherwise weak labels meaningless)
mask_mood = (track_df["has_audio_features"] == 1)
X_track_mood = X_track.loc[mask_mood].reset_index(drop=True)
Y_mood_clean = Y_mood.loc[mask_mood].reset_index(drop=True)

print("X_track shapes:")
print(" - pop:", X_track_pop.shape, y_track_pop_clean.shape)
print(" - hit:", X_track_hit.shape, y_hit_clean.shape)
print(" - explicit:", X_track_explicit.shape, y_explicit_clean.shape)
print(" - mood:", X_track_mood.shape, Y_mood_clean.shape)

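# ---------- Album features ----------
# Album numeric: time + counts + aggregated features (from earlier album building)
# Note: the "album_artist_" prefix also selects album_artist_popularity_mean/max;
# unlike the track features, these are not filtered by ALLOW_LEAKY_FEATURES here.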
ALBUM_NUMERIC = [
    "release_year", "release_month", "release_decade",
    "n_tracks", "log_n_tracks",
    "name_len", "name_words",
] + [c for c in album_df.columns if c.startswith("album_mean_") or c.startswith("album_artist_")]

ALBUM_NUMERIC = [c for c in ALBUM_NUMERIC if c in album_df.columns]

# album_group should be gone (100% missing); keep album_type if present.
ALBUM_CATEGORICAL = [c for c in ["album_type"] if c in album_df.columns]

X_album_base = album_df[ALBUM_NUMERIC + ALBUM_CATEGORICAL].copy()
X_album = pd.concat([X_album_base.reset_index(drop=True), album_genre_mh.reset_index(drop=True)], axis=1)

mask_album_pop = y_album_pop.notna()
X_album_pop = X_album.loc[mask_album_pop].reset_index(drop=True)
y_album_pop_clean = y_album_pop.loc[mask_album_pop].reset_index(drop=True)

print("X_album_pop:", X_album_pop.shape, y_album_pop_clean.shape)


X_track shapes:
 - pop: (294618, 79) (294618,)
 - hit: (294618, 79) (294618,)
 - explicit: (294618, 79) (294618,)
 - mood: (294616, 79) (294616, 7)
X_album_pop: (129152, 71) (129152,)


## Preprocessing Builder (sklearn ColumnTransformer)

We use a **ColumnTransformer** to apply different preprocessing steps to **numeric** and **categorical** feature groups.
This is a best-practice approach in scikit-learn because it keeps the entire transformation logic inside a single, reproducible pipeline.

### Numeric pipeline
For numeric columns we apply:

- **Median imputation**
  - Robust against outliers and skewed distributions (common in followers, popularity proxies, etc.)
- **Standard scaling (`StandardScaler`)**
  - We use `with_mean=False`, which scales by the standard deviation without centering
  - Centering is not supported on sparse input and would turn zeros into non-zeros, so skipping it keeps the numeric branch compatible with the sparse One-Hot output it is combined with

### Categorical pipeline
For categorical columns we apply:

- **Most-frequent imputation**
  - Ensures missing categories don’t break training
- **One-Hot encoding (`OneHotEncoder(handle_unknown="ignore")`)**
  - Converts categories into binary indicator columns
  - `handle_unknown="ignore"` prevents inference crashes when unseen categories appear in new data

### Why we build preprocessing as a pipeline
Using a pipeline is essential because it:

- **Prevents training/serving skew**
  - The same preprocessing logic is used during training and during inference (Notebook 4)
- **Improves reproducibility**
  - Models become portable artifacts (single saved `.joblib` pipeline)
- **Supports large-scale inference**
  - The output can be sparse (efficient memory and speed for hundreds of thousands to millions of rows)
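
The effect of `handle_unknown="ignore"` can be checked on a toy frame. The sketch below assumes plain `OneHotEncoder(handle_unknown="ignore")` in place of the notebook's `onehot_encoder_compat()` helper, and the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data with one missing value per column.
train = pd.DataFrame({
    "n_tracks": [1.0, 12.0, np.nan, 8.0],
    "album_type": pd.Series(["album", "single", "album", np.nan], dtype="object"),
})
# New data containing a category that was never seen during fit.
new = pd.DataFrame({
    "n_tracks": [5.0],
    "album_type": pd.Series(["compilation"], dtype="object"),
})

pre = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler(with_mean=False)),
        ]), ["n_tracks"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["album_type"]),
    ],
    remainder="drop",
)

def to_dense(m):
    """Return a dense ndarray regardless of whether the transformer output is sparse."""
    return m.toarray() if hasattr(m, "toarray") else np.asarray(m)

Xt_train = to_dense(pre.fit_transform(train))  # columns: n_tracks, album, single
Xt_new = to_dense(pre.transform(new))          # unseen category -> both indicators are 0
```

The unseen `compilation` value does not raise at transform time; it is encoded as zeros in every `album_type` indicator column, while the numeric column is scaled with the statistics learned during `fit`.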


In [125]:

def build_preprocessor(X: pd.DataFrame) -> Tuple[ColumnTransformer, List[str], List[str]]:
    # Identify numeric columns by dtype
    numeric_cols = [c for c in X.columns if pd.api.types.is_numeric_dtype(X[c])]
    # Remaining columns are treated as categorical
    categorical_cols = [c for c in X.columns if c not in numeric_cols]

    num_pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler(with_mean=False)),  # safe when combined with sparse matrices
    ])

    cat_pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", onehot_encoder_compat()),         # should do handle_unknown="ignore"
    ])

    pre = ColumnTransformer(
        transformers=[
            ("num", num_pipe, numeric_cols),
            ("cat", cat_pipe, categorical_cols),
        ],
        remainder="drop",
        sparse_threshold=0.3
    )

    return pre, numeric_cols, categorical_cols


## Train: Track Popularity Regression

This step trains a regression model to predict **track popularity** (`tracks.popularity`, typically 0–100).

### Why Ridge Regression?
We choose **Ridge Regression** as a strong baseline because:

- It is **stable and robust** on high-dimensional tabular data
- It works very well with **sparse feature matrices**, which we get after:
  - One-Hot encoding (categorical features)
  - Multi-hot encoding (Top-K genres)
- It handles **multicollinearity** (many correlated features) via L2 regularization
- It is computationally efficient and scales well to **~300k tracks**

### Training setup
- We use a standard **train/test split** (e.g., 80/20) with a fixed seed for reproducibility
- Preprocessing (imputation + scaling + one-hot) is integrated into the pipeline to avoid leakage and serving skew

### Evaluation metrics
We evaluate using common regression metrics:

- **MAE (Mean Absolute Error)**
  Interpretable average absolute deviation in popularity points
- **RMSE (Root Mean Squared Error)**
  Penalizes larger errors more strongly than MAE
- **R² (Coefficient of Determination)**
  Measures how much variance is explained by the model

These metrics are computed using the project’s `regression_report` helper.
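
The project's `regression_report` helper is defined earlier in the notebook and not shown in this section. As a rough illustration of the three metrics, a helper of this kind could look like the sketch below; the name `regression_report_sketch` and the returned dictionary keys are assumptions, not the project's actual implementation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report_sketch(y_true, y_pred):
    """Return MAE, RMSE and R² as plain floats (illustrative stand-in for regression_report)."""
    y_true = np.asarray(y_true, dtype="float64")
    y_pred = np.asarray(y_pred, dtype="float64")
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "rmse": float(np.sqrt(mse)),  # square root of the mean squared error
        "r2": float(r2_score(y_true, y_pred)),
    }

# Toy example: absolute errors are 2, 2, 3, 3 popularity points.
report = regression_report_sketch([10, 20, 30, 40], [12, 18, 33, 37])
```

In this toy case MAE is 2.5, RMSE is slightly higher (about 2.55) because the larger errors weigh more, and R² is 0.948.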


In [126]:
from sklearn.preprocessing import FunctionTransformer


def sklearn_sanitize_df(X):
    """Convert pandas-specific missing values and nullable dtypes into NumPy-friendly ones."""
    if not isinstance(X, pd.DataFrame):
        return X

    X = X.copy()

    # Convert NaT -> np.nan
    X = X.replace({pd.NaT: np.nan})

    for c in X.columns:
        dt = X[c].dtype

        # pandas string or categorical -> object + np.nan
        if pd.api.types.is_string_dtype(dt) or isinstance(dt, pd.CategoricalDtype):
            X[c] = X[c].astype("object")
            X[c] = X[c].where(pd.notna(X[c]), np.nan)

        # pandas nullable boolean -> float (0/1/nan)
        elif str(dt) == "boolean":
            X[c] = X[c].astype("float64")

        # pandas nullable integer (Int64, Int32...) -> float (so missing -> np.nan)
        elif str(dt).startswith("Int"):
            X[c] = X[c].astype("float64")

        # object columns might still contain pd.NA -> replace with np.nan
        elif X[c].dtype == "object":
            X[c] = X[c].where(pd.notna(X[c]), np.nan)

    return X

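# Note: the function is pickled by reference, so `sklearn_sanitize_df` must be defined
# (or importable) wherever a saved pipeline containing this step is loaded, e.g. Notebook 4.
sanitize_tf = FunctionTransformer(sklearn_sanitize_df, feature_names_out="one-to-one")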



Xtr, Xte, ytr, yte = train_test_split(
    X_track_pop, y_track_pop_clean,
    test_size=0.2, random_state=RANDOM_SEED
)


pre_track, num_cols_track, cat_cols_track = build_preprocessor(X_track_pop)

pipe_track_pop = Pipeline(steps=[
    ("sanitize", sanitize_tf),
    ("pre", pre_track),
    ("model", Ridge(alpha=2.0))
])

pipe_track_pop.fit(Xtr, ytr)
pred = pipe_track_pop.predict(Xte)

track_pop_metrics = regression_report(yte, pred)

dump(pipe_track_pop, PATHS.models_dir / "03_track_popularity_pipeline.joblib")

['..\\data\\models\\03_track_popularity_pipeline.joblib']

## Train: Album Popularity Regression

This step trains a regression model to predict **album popularity** (`albums.popularity`, typically 0–100) using the **album-level feature table** built earlier.

### Why this model and setup?
We use the **same approach as track popularity regression** because the data characteristics are similar:

- The feature space can be **high-dimensional** due to:
  - One-Hot encoded categorical fields (e.g., `album_type`)
  - Optional genre multi-hot vectors (Top-K)
- The design matrix is often **sparse**, so we prefer models that handle sparse input efficiently.

### Model choice: Ridge Regression
**Ridge Regression** is a strong and scalable baseline for album popularity because:

- It captures approximately linear relationships between the features and popularity
- It remains stable when many predictors are correlated (L2 regularization)
- It trains fast on large datasets and works with sparse matrices

### Training pipeline (best practice)
The pipeline is:

1. **Sanitize input** (convert pandas `pd.NA` to `np.nan`, avoid nullable dtypes)
2. **Preprocess features** using a `ColumnTransformer`
   - numeric: median imputation + scaling
   - categorical: most-frequent imputation + one-hot encoding
3. **Fit Ridge regression**
4. **Evaluate on a held-out test split**
5. **Persist the full pipeline** (`.joblib`) for reuse in Notebook 4

### Evaluation metrics
We report standard regression metrics:

- **MAE** — average absolute error in popularity points
- **RMSE** — penalizes large errors more strongly
- **R²** — explained variance of the target

These are computed via the project's `regression_report` helper.
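
Step 5 (persisting the full pipeline) can be sanity-checked with a save/load round trip. The sketch below uses a toy pipeline, synthetic data and a temporary, hypothetical file name; it is not the notebook's actual album pipeline.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the album feature table.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler(with_mean=False)),
    ("model", Ridge(alpha=2.0)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_pipeline.joblib"  # hypothetical file name
    dump(pipe, path)                          # preprocessing + model in one artifact
    restored = load(path)

# The restored pipeline should reproduce the original predictions.
same_predictions = bool(np.allclose(pipe.predict(X), restored.predict(X)))
```

One caveat for pipelines that contain a `FunctionTransformer` around a notebook-defined function: the function is stored by name, so the loading notebook has to define or import a function with the same name before calling `load`.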


In [127]:
Xtr, Xte, ytr, yte = train_test_split(
    X_album_pop, y_album_pop_clean,
    test_size=0.2, random_state=RANDOM_SEED
)

pre_album, num_cols_album, cat_cols_album = build_preprocessor(X_album_pop)

pipe_album_pop = Pipeline(steps=[
    ("sanitize", sanitize_tf),
    ("pre", pre_album),
    ("model", Ridge(alpha=2.0))
])

pipe_album_pop.fit(Xtr, ytr)
pred = pipe_album_pop.predict(Xte)

album_pop_metrics = regression_report(yte, pred)

dump(pipe_album_pop, PATHS.models_dir / "03_album_popularity_pipeline.joblib")



['..\\data\\models\\03_album_popularity_pipeline.joblib']

## Train: Hit Prediction (Binary Classification)

This step trains a binary classifier to predict whether a track is a **“hit”** (`y_hit ∈ {0,1}`).

### Target definition recap (robust + scalable)
A track is labeled as a hit using a **time-aware, zero-inflation-aware rule**:

1. **Preferred (time-aware):**
   If `release_year` is available and the year has enough samples (`min_tracks_per_year`), we define a hit as:
   - **Hit = 1** if track popularity is at or above the `HIT_PERCENTILE` quantile *within its release year*
     (e.g., a quantile of 0.90 marks roughly the top 10% of non-zero-popularity tracks from the same year)

2. **Zero-aware fallback:**
   Popularity is often **zero-inflated** (many tracks have `popularity = 0`).
   For rows whose year is missing or too sparse, we use a **global quantile computed on non-zero popularity**; otherwise a threshold of 0 could label every track as a hit.

3. **Safety fallback (guarantees two classes):**
   If the result would still be a single-class label (all 0 or all 1), we fall back to a deterministic **top-K rule** (the `desired_rate` share of tracks with the highest popularity) so that the dataset contains both classes for training.

This definition reduces bias across eras and is stable on sampled datasets.
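
A toy example shows why a year-relative threshold behaves differently from a fixed one. The data, the quantile of 0.75 and the fixed cutoff of 50 are illustrative assumptions; the notebook's `build_hit_labels_robust` additionally handles sparse years and the fallbacks described above.

```python
import pandas as pd

# Two release years with very different popularity levels.
toy = pd.DataFrame({
    "release_year": [1990] * 5 + [2020] * 5,
    "popularity":   [0, 5, 10, 15, 20, 0, 40, 60, 80, 100],
})

q = 0.75  # illustrative quantile

# Ignore zero popularity when computing thresholds (zeros become NaN and are skipped).
nonzero = toy["popularity"].where(toy["popularity"] > 0)
thr_by_year = nonzero.groupby(toy["release_year"]).quantile(q)

# Map each row to the threshold of its own release year.
row_thr = toy["release_year"].map(thr_by_year)

toy["hit_year_relative"] = (toy["popularity"] >= row_thr).astype("int8")
toy["hit_fixed_50"] = (toy["popularity"] >= 50).astype("int8")

hits_relative = toy.groupby("release_year")["hit_year_relative"].sum()
hits_fixed = toy.groupby("release_year")["hit_fixed_50"].sum()
```

The per-year thresholds come out as 16.25 for 1990 and 85.0 for 2020, so each year contributes one hit. With the fixed cutoff of 50, the 1990 tracks produce no hits at all while 2020 produces three, which is the era bias the year-relative rule is meant to reduce.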

### Model choice: SGDClassifier (logistic loss)
We use **`SGDClassifier(loss="log_loss")`**, which is effectively a **linear logistic regression model trained with stochastic gradient descent**.

Why this model (especially for 300k rows + sparse features)?

- Extremely fast and memory-efficient on **large sparse matrices**
  - One-Hot encoding and genre multi-hot vectors produce sparse inputs
- Scales well to hundreds of thousands (or millions) of rows
- Supports probability outputs via `predict_proba` when using `loss="log_loss"`

We also set:

- `class_weight="balanced"`
  to weight classes inversely to their frequency (hits are rarer than non-hits), which discourages the model from defaulting to the majority class.

### Training pipeline (best practice)
The pipeline follows a reproducible structure:

1. **Sanitize** pandas missing values (`pd.NA → np.nan`) for sklearn stability
2. **ColumnTransformer preprocessing**
   - numeric: median imputation + scaling (sparse-safe)
   - categorical: most-frequent imputation + one-hot encoding (ignore unknown categories)
3. **Fit SGD logistic classifier**
4. Evaluate on a **stratified** train/test split
5. Save the fitted pipeline for Notebook 4 batch scoring

### Evaluation metrics
We evaluate using metrics suited for imbalanced classification:

- **ROC-AUC** — ranking quality across thresholds
- **PR-AUC** — more informative when the positive class is rare
- **F1 score** — balances precision and recall at a chosen threshold
- **Precision/Recall** — interpretable trade-off for business decisions

Metrics are computed from predicted probabilities (`predict_proba`) and a chosen decision threshold (default 0.5, but you can tune it to optimize F1/precision/recall).
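
The `classification_report_binary` helper itself is defined earlier in the notebook. To illustrate how the threshold-free and threshold-dependent metrics relate, here is a hedged sketch; the function name `classification_report_binary_sketch` and its output keys are assumptions, not the project's actual helper.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

def classification_report_binary_sketch(y_true, proba, threshold=0.5):
    """Ranking metrics use the probabilities; F1/precision/recall use thresholded labels."""
    y_true = np.asarray(y_true).astype(int)
    proba = np.asarray(proba, dtype="float64")
    y_pred = (proba >= threshold).astype(int)
    return {
        "roc_auc": float(roc_auc_score(y_true, proba)),
        "pr_auc": float(average_precision_score(y_true, proba)),
        "f1": float(f1_score(y_true, y_pred, zero_division=0)),
        "precision": float(precision_score(y_true, y_pred, zero_division=0)),
        "recall": float(recall_score(y_true, y_pred, zero_division=0)),
        "threshold": float(threshold),
    }

# Toy scores: two positives (0.7 and 0.4) among four negatives.
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.1, 0.2, 0.6, 0.3, 0.7, 0.4]

default = classification_report_binary_sketch(y_true, proba, threshold=0.5)
lenient = classification_report_binary_sketch(y_true, proba, threshold=0.35)
```

Lowering the threshold from 0.5 to 0.35 raises recall from 0.5 to 1.0 and F1 from 0.5 to 0.8 in this toy case, while ROC-AUC (0.875) and PR-AUC stay the same because they do not depend on the threshold.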


In [134]:
from sklearn.linear_model import SGDClassifier

# Stratified split keeps the hit rate approximately equal in train and test
Xtr, Xte, ytr, yte = train_test_split(
    X_track_hit, y_hit_clean,
    test_size=0.2, random_state=RANDOM_SEED, stratify=y_hit_clean
)

pre_hit, _, _ = build_preprocessor(X_track_hit)

hit_model = SGDClassifier(
    loss="log_loss",          # logistic regression via SGD
    alpha=1e-4,               # regularization strength (tune later)
    max_iter=2000,
    tol=1e-3,
    class_weight="balanced",
    random_state=RANDOM_SEED
)


pipe_hit = Pipeline(steps=[
    ("sanitize", sanitize_tf),
    ("pre", pre_hit),
    ("model", hit_model)
])

pipe_hit.fit(Xtr, ytr)
proba = pipe_hit.predict_proba(Xte)[:, 1]

hit_metrics = classification_report_binary(yte, proba, threshold=0.5)


dump(pipe_hit, PATHS.models_dir / "03_hit_pipeline.joblib")

['..\\data\\models\\03_hit_pipeline.joblib']

## Train: Explicit / Content Prediction (Binary Classification)

This step trains a binary classifier to predict whether a track is **explicit** (`y_explicit ∈ {0,1}`).
In practice, this is a simple but important “content” prediction task.

### Target definition
- **Explicit = 1** if the track is marked as explicit in the metadata
- **Explicit = 0** otherwise

> Note: This target is directly taken from your dataset (no heuristic labeling required).

### Model choice: SGDClassifier (logistic loss) — fast on sparse data
We use **`SGDClassifier(loss="log_loss")`**, which is effectively a **linear logistic classifier trained with stochastic gradient descent**.

Why this model?

- Our feature matrix is **high-dimensional and sparse**
  - One-Hot encoding for categorical columns
  - Multi-hot encoding for genres
- `SGDClassifier` scales extremely well to **hundreds of thousands of rows**
- It often trains faster than `LogisticRegression(solver="saga")` on large sparse inputs, although the gap depends on the data and the convergence settings
- It outputs probabilities via `predict_proba`, enabling threshold tuning

We also use:
- `class_weight="balanced"` to handle class imbalance
- `early_stopping=True` to hold out `validation_fraction` of the training data and stop once the validation score has not improved by at least `tol` for `n_iter_no_change` consecutive epochs

### Training pipeline (best practice)
The pipeline mirrors the structure used for the hit model:

1. **Sanitize** pandas missing values (`pd.NA → np.nan`) for sklearn stability
2. **ColumnTransformer preprocessing**
   - numeric: median imputation + scaling
   - categorical: most-frequent imputation + one-hot encoding (`handle_unknown="ignore"`)
3. **Fit SGD logistic classifier**
4. Evaluate on a stratified train/test split
5. Save the fitted pipeline for Notebook 4 batch scoring

### Evaluation metrics
Since this is binary classification (and can be imbalanced), we report:

- **ROC-AUC**
- **PR-AUC**
- **F1 score**
- **Precision / Recall** at a chosen threshold (default 0.5, can be tuned)


In [138]:
explicit_model = SGDClassifier(
    loss="log_loss",          # logistic regression via SGD
    alpha=1e-4,               # regularization strength (tune later)
    max_iter=2000,
    tol=1e-3,
    class_weight="balanced",
    random_state=RANDOM_SEED,
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=5,
    shuffle=True
)

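# Split the explicit dataset itself; reusing Xtr/ytr from the hit cell would train on the hit target.
# Assumption: the explicit target is named `y_explicit_clean` (analogous to `y_hit_clean`);
# adjust to the name defined earlier in this notebook.
Xtr, Xte, ytr, yte = train_test_split(
    X_track_explicit, y_explicit_clean,
    test_size=0.2, random_state=RANDOM_SEED, stratify=y_explicit_clean
)

pre_exp, _, _ = build_preprocessor(X_track_explicit)

pipe_explicit = Pipeline(steps=[
    ("sanitize", sanitize_tf),
    ("pre", pre_exp),
    ("model", explicit_model)
])

pipe_explicit.fit(Xtr, ytr)
proba = pipe_explicit.predict_proba(Xte)[:, 1]
explicit_metrics = classification_report_binary(yte, proba, threshold=0.5)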

dump(pipe_explicit, PATHS.models_dir / "03_explicit_pipeline.joblib")


['..\\data\\models\\03_explicit_pipeline.joblib']

## Train: Mood Tags (Multi-label Classification)

This step trains a **multi-label classifier** that predicts multiple “mood tags” for each track.

### Why multi-label?
Mood is not mutually exclusive:
- A track can be **happy** *and* **danceable** at the same time.
- Therefore the target is a **set of labels per track**, not a single class.

In the notebook, mood tags are constructed from audio features using threshold rules (e.g., high `valence` → “happy”, high `energy` → “energetic”).
This produces a target matrix:

- `Y_mood` with shape `(n_tracks, n_labels)`
- each cell is `0/1`
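
The rule-based labeling can be sketched as below. This is an assumption-laden illustration: the rule table `MOOD_RULES_DEMO`, the quantile levels, and the helper `derive_mood_labels` are hypothetical, and the notebook's actual tag definitions may differ.

```python
import pandas as pd

# Hypothetical rules: tag -> (feature, comparison, quantile level).
MOOD_RULES_DEMO = {
    "happy": ("valence", "gt", 0.75),
    "energetic": ("energy", "gt", 0.75),
    "sad": ("valence", "lt", 0.25),
}


def derive_mood_labels(df, rules):
    """Build a 0/1 label matrix plus the numeric thresholds that produced it."""
    thresholds, labels = {}, {}
    for tag, (feature, op, q) in rules.items():
        cut = float(df[feature].quantile(q))
        thresholds[tag] = cut
        labels[tag] = (df[feature] > cut) if op == "gt" else (df[feature] < cut)
    return pd.DataFrame(labels).astype(int), thresholds


tracks_demo = pd.DataFrame({
    "valence": [0.1, 0.2, 0.4, 0.6, 0.8, 0.9],
    "energy": [0.9, 0.3, 0.5, 0.95, 0.2, 0.7],
})
Y_demo, thresholds_demo = derive_mood_labels(tracks_demo, MOOD_RULES_DEMO)
print(Y_demo)
```

Returning the numeric thresholds alongside the labels matters: scoring new data later should reuse the stored cut-offs instead of recomputing quantiles on a different dataset.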

### Model choice: One-vs-Rest (OvR) with a sparse-friendly linear base learner
We use **OneVsRestClassifier** because it is a standard, scalable approach for multi-label problems:

- It trains **one binary classifier per mood label**
  - e.g., a separate classifier for `happy`, another for `energetic`, etc.
- Each classifier predicts `P(label=1)` independently
- Works well with sparse, high-dimensional inputs (OneHot + multi-hot genres)

The base model is a **linear classifier**, here `SGDClassifier(loss="log_loss")` (Logistic Regression with a sparse-friendly solver would be an alternative), because:
- the feature matrix is sparse and wide
- linear models scale well to large datasets
- `predict_proba` enables threshold tuning per label (optional advanced step)

### Training pipeline (best practice)
The multi-label pipeline mirrors the binary pipelines:

1. **Sanitize** missing values (`pd.NA → np.nan`)
2. **ColumnTransformer preprocessing**
   - numeric: median imputation + scaling
   - categorical: most-frequent imputation + one-hot encoding
3. **Fit OneVsRestClassifier(base_model)** on the multi-label target matrix
4. Predict probabilities for each label and convert to binary predictions via a threshold (default `0.5`)
5. Save the fitted pipeline for Notebook 4 batch scoring

### Evaluation metrics
Multi-label evaluation is different from single-label classification. We report:

- **Micro F1**
  - aggregates contributions of all labels
  - dominated by frequent labels, so it reflects overall prediction quality
- **Macro F1**
  - computes F1 per label and averages them equally
  - highlights performance on rare/hard labels
- **Per-label F1**
  - diagnostic view to see which moods are easy vs difficult

All three F1 variants are computed from binary predictions, i.e. after thresholding the predicted label probabilities.
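
A tiny worked example of why micro and macro F1 can diverge. The two-label matrices below are made up for illustration: the frequent label is predicted perfectly, the rare label is always wrong.

```python
import numpy as np
from sklearn.metrics import f1_score

# Columns: [frequent_label, rare_label]
Y_true_demo = np.array([[1, 1], [1, 0], [1, 0], [0, 0]])
Y_pred_demo = np.array([[1, 0], [1, 1], [1, 0], [0, 0]])

micro_demo = f1_score(Y_true_demo, Y_pred_demo, average="micro")    # pools TP/FP/FN over labels
macro_demo = f1_score(Y_true_demo, Y_pred_demo, average="macro")    # unweighted mean of per-label F1
per_label_demo = f1_score(Y_true_demo, Y_pred_demo, average=None)   # one F1 per label

print(micro_demo, macro_demo, per_label_demo)
```

Pooled over both labels there are 3 true positives, 1 false positive, and 1 false negative, giving a micro F1 of 0.75. The macro F1 averages the per-label scores 1.0 and 0.0 to 0.5, which exposes the failing rare label.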


In [142]:
Xtr, Xte, Ytr, Yte = train_test_split(
    X_track_mood, Y_mood_clean,
    test_size=0.2, random_state=RANDOM_SEED
)

pre_mood, _, _ = build_preprocessor(X_track_mood)

base_sgd = SGDClassifier(
    loss="log_loss",          # logistic
    alpha=1e-4,               # regularization (tune later)
    max_iter=2000,
    tol=1e-3,
    class_weight="balanced",
    random_state=RANDOM_SEED,
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=5,
    shuffle=True
)

mood_model = OneVsRestClassifier(base_sgd, n_jobs=-1)

pipe_mood = Pipeline(steps=[
    ("sanitize", sanitize_tf),
    ("pre", pre_mood),
    ("model", mood_model),
])

pipe_mood.fit(Xtr, Ytr)

proba = pipe_mood.predict_proba(Xte)          # (n_samples, n_labels)
pred  = (proba >= 0.5).astype(int)            # threshold can be tuned per-label

mood_micro_f1 = float(f1_score(Yte, pred, average="micro"))
mood_macro_f1 = float(f1_score(Yte, pred, average="macro"))
per_label_f1  = {col: float(f1_score(Yte[col], pred[:, i])) for i, col in enumerate(Yte.columns)}

mood_metrics = {
    "micro_f1": mood_micro_f1,
    "macro_f1": mood_macro_f1,
    "per_label_f1": per_label_f1
}

dump(pipe_mood, PATHS.models_dir / "03_mood_pipeline.joblib")
mood_metrics


{'micro_f1': 0.9299329063738945,
 'macro_f1': 0.9313848087716066,
 'per_label_f1': {'energetic': 0.9760701496034003,
  'danceable': 0.8377549281372493,
  'acoustic': 0.9797505605381166,
  'instrumental': 0.9684679976307445,
  'happy': 0.9361661766100731,
  'sad': 0.861611300838249,
  'chill': 0.9598725480434134}}

## Artist Clustering (Unsupervised)

This step builds **artist clusters/segments** for exploration and potential downstream use (e.g., recommendations, catalog segmentation, marketing personas, or as features in other models).

> Note: This is not true graph-based “community detection” (like Louvain/Leiden on an artist collaboration graph).
> Instead, it is a **scalable proxy** using feature-based clustering.

### Goal
- Group artists into meaningful segments based on:
  - popularity / followers
  - catalog size and track statistics
  - (optionally) genre profile via multi-hot encoding

### Approach overview
We follow a standard unsupervised ML pipeline:

1. **Select artist-level features**
   - numeric artist attributes (e.g., followers, popularity, track aggregates)
   - optional genre multi-hot vectors (Top-K genres)

2. **Impute missing values**
   - use median imputation to keep clustering stable and avoid dropping artists

3. **Scale features**
   - KMeans is distance-based, so without scaling the features with the largest ranges (e.g., followers) would dominate
   - we use standardization (mean=0, std=1)
   - `with_mean=True` is fine here because we work with a dense numeric matrix at this stage

4. **Dimensionality reduction (PCA)**
   - reduces noise and feature collinearity
   - improves KMeans stability and speed on high-dimensional inputs
   - we keep a bounded number of components (e.g., up to 30)

5. **KMeans clustering**
   - assigns each artist to one of `K` clusters
   - produces a `cluster` label per artist which can be saved and reused later

6. **Optional visualization (t-SNE on a sample)**
   - t-SNE is expensive, so we run it only on a random subset
   - useful for inspecting whether clusters visually separate (diagnostics only)

### Outputs
- `artist_df["cluster"]`: the assigned cluster label per artist
- saved clustering artifacts (preprocessing + PCA + KMeans) for reuse in Notebook 4 and future analysis
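
How the saved artifacts could be reused on new artists can be sketched as follows. Everything here is illustrative: the random matrix stands in for artist features, and `assign_clusters` is a hypothetical helper. The bundle keys follow the names used in this notebook (`cluster_pre`, `pca`, `kmeans`).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng_demo = np.random.default_rng(0)
X_fit_demo = rng_demo.normal(size=(200, 6))   # stand-in for artist-level features
X_fit_demo[::10, 0] = np.nan                  # some missing values to impute

# Fit the same three stages described above: impute + scale, PCA, KMeans.
cluster_pre_demo = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
pca_demo = PCA(n_components=3, random_state=0)
kmeans_demo = KMeans(n_clusters=4, n_init=10, random_state=0)

X_pca_demo = pca_demo.fit_transform(cluster_pre_demo.fit_transform(X_fit_demo))
kmeans_demo.fit(X_pca_demo)

bundle_demo = {"cluster_pre": cluster_pre_demo, "pca": pca_demo, "kmeans": kmeans_demo}


def assign_clusters(bundle, X_new):
    """Apply the fitted stages in training order; nothing is re-fitted here."""
    Z = bundle["cluster_pre"].transform(X_new)
    Z = bundle["pca"].transform(Z)
    return bundle["kmeans"].predict(Z)


labels_new_demo = assign_clusters(bundle_demo, rng_demo.normal(size=(5, 6)))
print(labels_new_demo)
```

The new data must provide the same columns in the same order as at fit time, which is why the bundle in this notebook also stores `numeric_columns_used`.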


In [143]:
ARTIST_NUM = [
    "popularity", "followers", "log_followers",
    "n_tracks", "log_n_tracks",
    "track_pop_mean", "explicit_rate",
] + [c for c in artist_df.columns if c.startswith("mean_")]

ARTIST_NUM = [c for c in ARTIST_NUM if c in artist_df.columns]
X_artist_base = artist_df[ARTIST_NUM].copy()

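# Add genre multi-hot columns (numeric features).
# Both frames are joined by position, so artist_genre_mh must have the same row order as artist_df.
X_artist = pd.concat([X_artist_base.reset_index(drop=True), artist_genre_mh.reset_index(drop=True)], axis=1)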

# Keep only numeric columns for clustering (KMeans requires numeric)
num_cols = [c for c in X_artist.columns if pd.api.types.is_numeric_dtype(X_artist[c])]
X_num = X_artist[num_cols].copy()

cluster_pre = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler(with_mean=True)),
])

X_scaled = cluster_pre.fit_transform(X_num)

# PCA: reduce dims to make clustering easier and more stable
pca_components = min(30, X_scaled.shape[1])
pca = PCA(n_components=pca_components, random_state=RANDOM_SEED)
X_pca = pca.fit_transform(X_scaled)

# KMeans clusters
kmeans = kmeans_compat(n_clusters=K_CLUSTERS, random_state=RANDOM_SEED)
artist_clusters = kmeans.fit_predict(X_pca)

artist_df["cluster"] = artist_clusters

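SAMPLE_TSNE = min(TSNE_SAMPLE_MAX, len(artist_df))
# Seeded generator so the t-SNE subsample is reproducible across runs
rng = np.random.default_rng(RANDOM_SEED)
sample_idx = rng.choice(len(artist_df), size=SAMPLE_TSNE, replace=False)

# t-SNE is for visual diagnostics only; perplexity must be smaller than the sample size
tsne = TSNE(n_components=2, perplexity=30, learning_rate="auto", init="pca", random_state=RANDOM_SEED)
X_tsne = tsne.fit_transform(X_pca[sample_idx])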

artist_cluster_artifact = {
    "k": int(K_CLUSTERS),
    "pca_components": int(pca_components),
    "pca_explained_variance_ratio_sum": float(np.sum(pca.explained_variance_ratio_)),
    "tsne_sample_n": int(SAMPLE_TSNE),
}

# Save clustering artifacts for reuse (Notebook 4 or analysis)
cluster_bundle = {
    "cluster_pre": cluster_pre,
    "pca": pca,
    "kmeans": kmeans,
    "numeric_columns_used": num_cols,
    "k": int(K_CLUSTERS),
}
dump(cluster_bundle, PATHS.models_dir / "03_artist_clustering.joblib")
artist_df.to_parquet(PATHS.modeling_dir / "artist_dataset_with_clusters.parquet", index=False)


## Save `feature_config.json` (Critical for Notebook 4)

At this point we export a **`feature_config.json`** file.
This file acts as the **scoring contract** between Notebook 3 (training) and Notebook 4 (full-data batch scoring).

### Why this file is critical
When you score millions of rows later, you must apply **exactly the same rules** as during training.
`feature_config.json` makes the pipeline reproducible by storing the key decisions that would otherwise be “hidden” in notebook code.

### What we store in `feature_config.json`
The config contains the most important pieces of information needed for consistent inference:

- **Feature lists**
  - which numeric columns are used
  - which categorical columns are used
  - which multi-hot genre columns were created

- **Genre encoding contract**
  - the exact `top_genres` list used for Top-K multi-hot encoding (same order and same set → ensures consistent column layout)

- **Mood tag definitions**
  - the thresholds used to create each mood label (e.g., quantiles on `valence`, `energy`, etc.)
  - the list of mood tags and their logic (`gt` / `lt`)

- **Hit label definition**
  - parameters used to create the hit target (e.g., `HIT_PERCENTILE`, fallback rule)
  - ensures we can reproduce hit labeling consistently across datasets

- **Run metadata (optional but recommended)**
  - timestamp / sampling settings / dataset version identifiers
  - helps trace which config belongs to which model training run
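
How a later scoring step might consume the genre contract can be sketched as below. This is a hypothetical illustration: the helper `genre_multi_hot`, the `genre_` column prefix, and the three-genre config are assumptions, and the notebook's actual multi-hot column names are those stored under `genre_multi_hot_cols`.

```python
import json

import pandas as pd

# Stand-in for json.loads(Path(".../feature_config.json").read_text()).
config_demo = json.loads(json.dumps({"top_genres": ["pop", "rock", "jazz"]}))


def genre_multi_hot(genre_lists, top_genres, prefix="genre_"):
    """Encode each track's genre list against the stored Top-K list, in stored order."""
    rows = [[int(g in set(genres)) for g in top_genres] for genres in genre_lists]
    return pd.DataFrame(rows, columns=[f"{prefix}{g}" for g in top_genres])


mh_demo = genre_multi_hot(
    [["rock", "pop"], ["metal"], []],   # "metal" is outside the Top-K list and is ignored
    config_demo["top_genres"],
)
print(mh_demo)
```

Because the column set and order come from the config rather than from the data being scored, unseen genres cannot change the feature layout the fitted pipeline expects.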


In [144]:
feature_config = {
    "run_meta": RUN_META,
    "top_genres": top_genres,
    "mood_thresholds": {str(k): v for k, v in mood_thresholds.items()},  # keys must be JSON-serializable
    "mood_tags": MOOD_TAGS,
    "track_features": {
        "numeric": TRACK_NUMERIC,
        "categorical": TRACK_CATEGORICAL,
        "genre_multi_hot_cols": list(track_genre_mh.columns),
    },
    "album_features": {
        "numeric": ALBUM_NUMERIC,
        "categorical": ALBUM_CATEGORICAL,
        "genre_multi_hot_cols": list(album_genre_mh.columns),
    },
    "artist_features": {
        "numeric_used_for_clustering": num_cols,
        "genre_multi_hot_cols": list(artist_genre_mh.columns),
        "kmeans_k": int(K_CLUSTERS),
    },
    "targets": {
        "hit_percentile_within_year": float(HIT_PERCENTILE),
        "hit_fallback_popularity_threshold": int(HIT_FALLBACK_POP_THRESHOLD),
    }
}

(PATHS.models_dir / "feature_config.json").write_text(json.dumps(feature_config, indent=2), encoding="utf-8")
print("Saved feature_config.json")


Saved feature_config.json


## Write Reports (JSON)

In this step we persist the most important outputs of Notebook 3 as **machine-readable reports**.

### Why we write JSON reports
Metrics that exist only as printed notebook output are overwritten on the next run and are hard to compare programmatically.
A JSON report allows you to:

- compare experiments across runs (baseline vs tuned models)
- track progress over time (model improvements)
- integrate results into dashboards or CI pipelines
- make Notebook 4 and later steps auditable

### What we store
The report typically includes:

- **Model metrics**
  - track popularity regression (MAE / RMSE / R²)
  - album popularity regression (MAE / RMSE / R²)
  - hit prediction (ROC-AUC / PR-AUC / F1 / precision / recall)
  - explicit prediction (ROC-AUC / PR-AUC / F1 / precision / recall)
  - mood multi-label metrics (micro F1 / macro F1 / per-label F1)
  - artist clustering artifacts (e.g., `k`, PCA variance explained)

- **Dataset shapes**
  - shapes of the prepared datasets (`track_df`, `album_df`, `artist_df`)
  - shapes of the feature matrices used for training
  - helps detect accidental row drops or mismatched joins later

### Output location
The JSON report is written to `PATHS.reports_dir`, i.e. `../data/reports/03_target_and_features/metrics_report.json`.
This becomes the single source of truth for model performance for this run.
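
The notebook casts values with `float(...)` and `int(...)` by hand before writing JSON. A possible alternative, sketched here under the assumption that some metrics arrive as NumPy scalars or arrays, is a `default` hook for `json.dumps`. The helper `to_jsonable` and the demo report are hypothetical.

```python
import json

import numpy as np


def to_jsonable(obj):
    """Fallback for json.dumps: convert NumPy scalars and arrays to plain Python types."""
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Not JSON serializable: {type(obj).__name__}")


report_demo = {
    "f1": np.float32(0.5),
    "n_rows": np.int64(300000),
    "per_label": np.array([0.9, 0.8]),
}

text_demo = json.dumps(report_demo, default=to_jsonable, indent=2)
loaded_demo = json.loads(text_demo)
print(loaded_demo)
```

Raising `TypeError` for unknown types keeps the failure loud, so an unexpected object in the report cannot be written silently as a string.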


In [145]:
reports = {
    "track_popularity_regression": track_pop_metrics,
    "album_popularity_regression": album_pop_metrics,
    "hit_prediction": hit_metrics,
    "explicit_prediction": explicit_metrics,
    "mood_multilabel": mood_metrics,
    "artist_clustering": artist_cluster_artifact,
    "dataset_shapes": {
        "track_df": [int(track_df.shape[0]), int(track_df.shape[1])],
        "album_df": [int(album_df.shape[0]), int(album_df.shape[1])],
        "artist_df": [int(artist_df.shape[0]), int(artist_df.shape[1])],
        "X_track_pop": [int(X_track_pop.shape[0]), int(X_track_pop.shape[1])],
        "X_album_pop": [int(X_album_pop.shape[0]), int(X_album_pop.shape[1])],
    },
}

(PATHS.reports_dir / "metrics_report.json").write_text(json.dumps(reports, indent=2), encoding="utf-8")
print("Wrote metrics report:", PATHS.reports_dir / "metrics_report.json")


Wrote metrics report: ..\data\reports\03_target_and_features\metrics_report.json
