# 03 - Target & Features with scikit-learn

## Goal
This notebook builds **modeling-ready ML datasets** from the **clean layer** (Notebook 02) and trains several models **exclusively within the scikit-learn ecosystem**.
The end result is reproducible datasets, saved pipelines/models, and structured reports.

---

## Requirements / Tasks
This notebook covers the following ML use cases:

1. **Track popularity** (regression)
2. **Album popularity** (regression)
3. **Hit prediction** (binary classification)
4. **Explicit / content prediction** (binary classification)
5. **Mood tags** (multi-label classification)
   - Labels are derived from features (rule-based / derived labels)
6. **Artist clustering / community detection** (unsupervised learning)

---

## Input (clean layer from Notebook 02)
Preferred:
- `../data/processed/parquet/*.parquet`

Fallback:
- `../data/processed/clean_csv/*.csv`

---

## Output

### 1) Modeling datasets (Parquet)
- `../data/processed/modeling/track_dataset.parquet`
- `../data/processed/modeling/album_dataset.parquet`
- `../data/processed/modeling/artist_dataset.parquet`

### 2) Saved models & pipelines (joblib)
- `../data/models/03_track_popularity_pipeline.joblib`
- `../data/models/03_album_popularity_pipeline.joblib`
- `../data/models/03_hit_pipeline.joblib`
- `../data/models/03_explicit_pipeline.joblib`
- `../data/models/03_mood_pipeline.joblib`
- `../data/models/03_artist_clustering.joblib`

### 3) Configuration & reports
- `../data/models/feature_config.json`
- `../data/reports/03_target_and_features/*.json`

---

## Result
After the notebook has run, the following exist:
- modeling-ready Parquet datasets,
- trained and saved scikit-learn pipelines,
- reports/configs for traceable training and later batch inference.
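
To make the "saved pipeline" idea concrete, here is a minimal sketch of persisting and restoring a scikit-learn pipeline with joblib. The toy columns, the Ridge model, and the file name `demo_pipeline.joblib` are assumptions for illustration; they are not the pipelines this notebook actually trains.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy feature matrix (invented values): one numeric and one categorical column.
X = pd.DataFrame({
    "energy": [0.2, 0.8, np.nan, 0.5],
    "album_type": ["album", "single", "album", "single"],
})
y = np.array([10.0, 60.0, 20.0, 40.0])

# Preprocessing and model live in one object, so inference reuses the fitted steps.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["energy"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["album_type"]),
])
pipe = Pipeline([("prep", preprocess), ("model", Ridge(alpha=1.0))]).fit(X, y)

# Round-trip through joblib; a temporary directory stands in for ../data/models.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_pipeline.joblib"  # hypothetical file name
    dump(pipe, path)
    restored = load(path)

same_predictions = np.allclose(pipe.predict(X), restored.predict(X))
```

Saving the whole pipeline rather than only the estimator means batch inference can feed raw feature columns and get the same imputation, scaling, and encoding that were fitted during training.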


In [44]:
import ast
import json
import math
import time
import platform
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Any

import numpy as np
import pandas as pd

# sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
    roc_auc_score, average_precision_score, f1_score,
    classification_report, confusion_matrix
)

from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

from joblib import dump

## Global Config

In [45]:
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Leakage controls:
# - If True: allow "post-release / popularity-like" proxy features (often boosts scores but less realistic)
# - If False: drop strongest leakage/proxies (recommended for realistic evaluation)
ALLOW_LEAKY_FEATURES = False

# "Main album per track" selection strategy:
# - "earliest_release": choose album with earliest release_date_parsed
# - "deterministic_id": choose smallest album_id (stable fallback)
MAIN_ALBUM_STRATEGY = "earliest_release"

# Hit label definition
HIT_PERCENTILE = 0.90
HIT_FALLBACK_POP_THRESHOLD = 70

# Genre multi-hot size
TOP_K_GENRES = 50

# Mood labels quantile rules (weak-label demonstration)
MOOD_TAGS = [
    ("energetic", "energy", 0.75, "gt"),
    ("danceable", "danceability", 0.75, "gt"),
    ("acoustic", "acousticness", 0.75, "gt"),
    ("instrumental", "instrumentalness", 0.75, "gt"),
    ("happy", "valence", 0.75, "gt"),
    ("sad", "valence", 0.25, "lt"),
    ("chill", "energy", 0.25, "lt"),
]

# Clustering
K_CLUSTERS = 30
TSNE_SAMPLE_MAX = 4000

pd.set_option("display.max_columns", 250)
pd.set_option("display.width", 180)
pd.set_option("display.max_rows", 40)
try:
    pd.options.mode.copy_on_write = True
except Exception:
    pass
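
The hit and mood settings above are all quantile rules. The following sketch shows one way such rules could be turned into labels. The helper names `derive_hit_label` and `derive_mood_labels`, the `mood_` column prefix, and the condition under which the fallback threshold is used are assumptions, not the notebook's confirmed implementation.

```python
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "popularity": [0, 5, 10, 20, 30, 40, 50, 60, 80, 95],
    "energy": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "valence": [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
})

def derive_hit_label(pop, percentile=0.90, fallback=70):
    """Label rows at or above the popularity percentile as hits (1), else 0."""
    thr = pop.quantile(percentile)
    # Assumption: use the fixed fallback when the quantile is unusable
    # (NaN, or 0 because most popularity values are 0).
    if pd.isna(thr) or thr <= 0:
        thr = fallback
    return (pop >= thr).astype("int8")

def derive_mood_labels(df, rules):
    """One 0/1 column per (tag, feature, quantile, direction) rule."""
    out = pd.DataFrame(index=df.index)
    for tag, col, q, direction in rules:
        if col not in df.columns:
            continue  # skip rules whose feature is missing
        thr = df[col].quantile(q)
        mask = df[col] > thr if direction == "gt" else df[col] < thr
        out[f"mood_{tag}"] = mask.astype("int8")
    return out

rules = [
    ("energetic", "energy", 0.75, "gt"),
    ("danceable", "danceability", 0.75, "gt"),  # feature absent in the toy frame
    ("happy", "valence", 0.75, "gt"),
    ("sad", "valence", 0.25, "lt"),
    ("chill", "energy", 0.25, "lt"),
]
hit = derive_hit_label(toy["popularity"])
moods = derive_mood_labels(toy, rules)
```

Because the thresholds are computed from the data itself, they should be estimated on the training split only if the labels are later used for evaluation; otherwise the label definition leaks information from the test rows.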

## Paths

In [46]:
@dataclass(frozen=True)
class Paths:
    clean_parquet_dir: Path = Path("../data/processed/parquet")
    clean_csv_dir: Path = Path("../data/processed/clean_csv")

    modeling_dir: Path = Path("../data/processed/modeling")
    models_dir: Path = Path("../data/models")
    reports_dir: Path = Path("../data/reports/03_target_and_features")

PATHS = Paths()
for p in [PATHS.modeling_dir, PATHS.models_dir, PATHS.reports_dir]:
    p.mkdir(parents=True, exist_ok=True)

RUN_META = {
    "run_ts_unix": int(time.time()),
    "python": platform.python_version(),
    "platform": platform.platform(),
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "random_seed": RANDOM_SEED,
    "allow_leaky_features": ALLOW_LEAKY_FEATURES,
    "main_album_strategy": MAIN_ALBUM_STRATEGY,
    "paths": {k: str(v) for k, v in asdict(PATHS).items()},
}
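
Run metadata like this is only useful if it is written next to the reports. Below is a small sketch of a JSON writer that could be used for that purpose; the helper name `write_json_report` and the file name `run_meta.json` are hypothetical, and a temporary directory stands in for the reports folder.

```python
import json
import tempfile
from pathlib import Path

def write_json_report(obj, path):
    """Write a dict as pretty-printed JSON, creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # default=str converts non-JSON types such as Path objects to strings.
        json.dump(obj, f, indent=2, sort_keys=True, default=str)
    return path

# Toy metadata with invented values.
demo_meta = {"random_seed": 42, "paths": {"models_dir": Path("../data/models")}}

with tempfile.TemporaryDirectory() as tmp:
    out = write_json_report(demo_meta, Path(tmp) / "reports" / "run_meta.json")
    loaded = json.loads(out.read_text(encoding="utf-8"))
```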

## Data Loading

In [47]:
TABLES = [
    "tracks",
    "audio_features",
    "albums",
    "artists",
    "genres",
    "r_albums_tracks",
    "r_track_artist",
    "r_artist_genre",
    "r_albums_artists",
]

def load_table(name: str) -> pd.DataFrame:
    pq = PATHS.clean_parquet_dir / f"{name}.parquet"
    csv = PATHS.clean_csv_dir / f"{name}.csv"

    if pq.exists():
        return pd.read_parquet(pq)
    if csv.exists():
        return pd.read_csv(csv, low_memory=False)
    raise FileNotFoundError(f"Missing {name} in parquet/csv clean layer.")

data: Dict[str, pd.DataFrame] = {}
for t in TABLES:
    try:
        data[t] = load_table(t)
    except FileNotFoundError:
        # Tables missing from the clean layer are skipped here;
        # required ones are asserted in the integrity check below.
        continue

{k: v.shape for k, v in data.items()}

{'tracks': (294618, 13),
 'audio_features': (294594, 21),
 'albums': (129152, 8),
 'artists': (139608, 6),
 'genres': (5416, 1),
 'r_albums_tracks': (305933, 2),
 'r_track_artist': (391700, 2),
 'r_artist_genre': (169289, 2),
 'r_albums_artists': (142153, 2)}

## Quick integrity check

In [48]:
required = ["tracks", "audio_features", "albums", "artists", "r_albums_tracks", "r_track_artist", "r_artist_genre"]
missing = [t for t in required if t not in data]
assert not missing, f"Missing required tables in clean layer: {missing}"

tracks = data["tracks"].copy()
audio = data["audio_features"].copy()
albums = data["albums"].copy()
artists = data["artists"].copy()
rat = data["r_albums_tracks"].copy()
rta = data["r_track_artist"].copy()
rag = data["r_artist_genre"].copy()
genres = data.get("genres", pd.DataFrame(columns=["id"]))  # optional
raa = data.get("r_albums_artists", pd.DataFrame(columns=["album_id", "artist_id"])).copy()

# PK expectations (guarded)
assert "track_id" in tracks.columns, "tracks must contain track_id"
assert tracks["track_id"].is_unique

assert "id" in audio.columns and audio["id"].is_unique
assert "id" in albums.columns and albums["id"].is_unique
assert "id" in artists.columns and artists["id"].is_unique

if not genres.empty and "id" in genres.columns:
    assert genres["id"].is_unique

print("Clean layer looks consistent.")

Clean layer looks consistent.
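
The checks above cover primary keys only. The loaded shapes show slightly fewer `audio_features` rows than `tracks` rows, so a foreign-key coverage check can be a useful complement. The helper `fk_coverage` below is a hypothetical sketch on toy data, not part of the notebook.

```python
import pandas as pd

def fk_coverage(child, fk, parent, pk):
    """Count child rows with a null FK and rows whose FK has no match in parent."""
    keys = child[fk]
    non_null = keys.notna()
    orphans = non_null & ~keys.isin(parent[pk])
    return {
        "rows": int(len(child)),
        "null_fk": int((~non_null).sum()),
        "orphans": int(orphans.sum()),
    }

# Toy tables with invented IDs.
toy_tracks = pd.DataFrame({
    "track_id": ["t1", "t2", "t3", "t4"],
    "audio_feature_id": ["a1", "a2", None, "a9"],
})
toy_audio = pd.DataFrame({"id": ["a1", "a2", "a3"]})

report = fk_coverage(toy_tracks, "audio_feature_id", toy_audio, "id")
```

Here one track has no foreign key at all and one points to a missing audio row; both cases end up with `NaN` audio columns after a left join, but they indicate different upstream problems.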


## Helper utilities

In [49]:
def col_or_na(df: pd.DataFrame, col: str, dtype: Optional[str] = None) -> pd.Series:
    """
    Return df[col] if it exists; otherwise return an all-NA Series with the same index.
    Never returns None.
    """
    if df is None or not isinstance(df, pd.DataFrame):
        raise TypeError("col_or_na: df must be a pandas DataFrame")

    if col in df.columns:
        s = df[col]
        if dtype is not None:
            try:
                s = s.astype(dtype)
            except Exception:
                pass
        return s

    return pd.Series(pd.NA, index=df.index)

def safe_len_series(s: pd.Series) -> pd.Series:
    return s.astype("string").fillna("").str.len().astype("int32")

def safe_word_count_series(s: pd.Series) -> pd.Series:
    return s.astype("string").fillna("").str.split().str.len().astype("int32")

def add_release_time_features(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Adds release_year/month/decade from a datetime-like column."""
    df = df.copy()
    dt = pd.to_datetime(col_or_na(df, date_col), errors="coerce")
    df["release_year"] = dt.dt.year.astype("Int64")
    df["release_month"] = dt.dt.month.astype("Int64")
    df["release_decade"] = ((dt.dt.year // 10) * 10).astype("Int64")
    return df

def log1p_numeric(s: pd.Series) -> pd.Series:
    x = pd.to_numeric(s, errors="coerce")
    return np.log1p(x).astype("float64")

def ensure_list_column(s: pd.Series) -> pd.Series:
    """
    Ensure a column contains python lists.
    Accepts:
      - actual lists
      - tuples / numpy arrays (list cells read back from Parquet can arrive as arrays)
      - JSON strings
      - repr strings like "['a','b']"
      - NaN/None
    """
    def parse_one(v):
        if isinstance(v, list):
            return v
        if isinstance(v, (tuple, np.ndarray)):
            return list(v)
        if v is None or (isinstance(v, float) and np.isnan(v)):
            return []
        if isinstance(v, str):
            v = v.strip()
            if not v:
                return []
            # try JSON
            try:
                parsed = json.loads(v)
                if isinstance(parsed, list):
                    return parsed
            except Exception:
                pass
            # try python literal
            try:
                parsed = ast.literal_eval(v)
                if isinstance(parsed, list):
                    return parsed
            except Exception:
                pass
        return []
    return s.apply(parse_one)

def top_k_list_counts(list_series: pd.Series, top_k: int) -> List[str]:
    from collections import Counter
    c = Counter()
    for lst in list_series:
        if isinstance(lst, list):
            for x in lst:
                if pd.notna(x):
                    c[str(x)] += 1
    return [k for k, _ in c.most_common(top_k)]

def genres_to_multihot(df: pd.DataFrame, list_col: str, top_genres: List[str], prefix: str) -> pd.DataFrame:
    if not top_genres:
        return pd.DataFrame(index=df.index)
    m = np.zeros((len(df), len(top_genres)), dtype=np.int8)
    idx = {g: i for i, g in enumerate(top_genres)}
    lists = df[list_col]
    for r, lst in enumerate(lists):
        if isinstance(lst, list):
            for g in lst:
                j = idx.get(str(g))
                if j is not None:
                    m[r, j] = 1
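    # Reuse df.index so the result aligns with df on concat/join (also for non-default indexes)
    return pd.DataFrame(m, index=df.index, columns=[f"{prefix}genre_{g}" for g in top_genres])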

def onehot_encoder_compat() -> OneHotEncoder:
    """Handle sklearn versions where sparse_output may not exist."""
    try:
        return OneHotEncoder(handle_unknown="ignore", sparse_output=True)
    except TypeError:
        return OneHotEncoder(handle_unknown="ignore", sparse=True)

def kmeans_compat(n_clusters: int, random_state: int) -> KMeans:
    """Handle sklearn versions where n_init='auto' may not exist."""
    try:
        return KMeans(n_clusters=n_clusters, n_init="auto", random_state=random_state)
    except TypeError:
        return KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)

def regression_report(y_true, y_pred) -> Dict[str, float]:
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return {
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "RMSE": rmse,
        "R2": float(r2_score(y_true, y_pred)),
    }

def classification_report_binary(y_true, y_proba, threshold=0.5) -> Dict[str, Any]:
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    y_pred = (y_proba >= threshold).astype(int)
    has_both = len(np.unique(y_true)) > 1  # ranking metrics are undefined with a single class
    out = {
        "roc_auc": float(roc_auc_score(y_true, y_proba)) if has_both else None,
        "pr_auc": float(average_precision_score(y_true, y_proba)) if has_both else None,
        "f1": float(f1_score(y_true, y_pred)) if has_both else None,
        # labels=[0, 1] keeps the matrix 2x2 even if one class is absent
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]).tolist(),
    }
    return out
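
As a usage illustration for the genre helpers, the sketch below reproduces the top-K plus multi-hot idea on a toy frame. It re-implements the logic compactly so that it runs on its own; the genre names and the `genre_` prefix are invented for the example.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy frame with a non-default index to show why index alignment matters.
toy = pd.DataFrame(
    {"track_genres": [["rap", "hip hop"], ["rap", "hip hop"], [], ["jazz", "rap"]]},
    index=[10, 11, 12, 13],
)

# Step 1: pick the K most frequent genres across all rows.
counts = Counter(g for lst in toy["track_genres"] for g in lst)
top_genres = [g for g, _ in counts.most_common(2)]

# Step 2: fill a 0/1 matrix with one column per top genre.
col_index = {g: j for j, g in enumerate(top_genres)}
matrix = np.zeros((len(toy), len(top_genres)), dtype=np.int8)
for r, lst in enumerate(toy["track_genres"]):
    for g in lst:
        j = col_index.get(g)
        if j is not None:  # genres outside the top K are ignored
            matrix[r, j] = 1

multihot = pd.DataFrame(
    matrix, index=toy.index, columns=[f"genre_{g}" for g in top_genres]
)
features = pd.concat([toy, multihot], axis=1)
```

Passing `index=toy.index` keeps the multi-hot rows aligned with the source rows; with a default `RangeIndex` on one side and a custom index on the other, `pd.concat(axis=1)` would produce extra rows filled with `NaN`.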

## Track-Level Dataset (one row = one track)

**Goal:** We build a denormalized, ML-ready table in which **each row represents one track**.
To do so, we combine information from several tables (tracks, audio features, albums, artists, genres) and add **aggregated** and **engineered features**.

### What exactly happens here?

1. **Tracks + audio features (1:1 / left join)**
   - We attach the numeric audio features (e.g. energy, danceability, loudness, tempo) directly to the track.
   - If a track has no audio features, those fields remain `NaN` (left join).

2. **Track → album (many-to-many) and selection of a "main album"**
   - A track can appear on several albums (album, compilation, re-release).
   - For ML, however, we need **one unambiguous album context** per track.
   - We therefore deterministically pick exactly **one album per track** (e.g. the one with the earliest release date).

3. **Attach album metadata to the track**
   - We merge album information (e.g. album_type, release_date, album_popularity) at track level.
   - We then derive time features such as `release_year`, `release_month`, `release_decade`.

4. **Track → artists (many-to-many) + aggregation**
   - A track can have several artists (features, collaborations).
   - We store:
     - `artist_ids` as a list (for later analysis)
     - Aggregated artist statistics per track:
       - Number of artists (`n_artists`)
       - Mean/maximum of artist popularity and followers

5. **Track → genres via artist genres (many-to-many, union)**
   - On Spotify, genres are often attached to artists rather than directly to tracks.
   - We build:
     - `artist_id -> [genre_ids]`
     - `track_id -> union(artist_genres)` as `track_genres` (list)

6. **Feature engineering**
   - From text / metadata:
     - `has_preview`: whether a preview URL is present (0/1)
     - `name_len`, `name_words`: length and word count of the track name
   - Log transforms:
     - `log_duration`: reduces skew in duration
     - `log_artist_followers_*`: stabilizes heavy-tailed follower counts
   - Quality indicators:
     - `has_audio_features`: whether audio features are present (0/1)

**Result:** `track_df` is a track-level feature matrix.
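
The "main album" selection in step 2 is the least obvious part, so here is a toy sketch of the sort-then-deduplicate pattern for the earliest-release strategy. The IDs and dates are invented.

```python
import pandas as pd

# Toy relation: t1 and t2 each appear on two albums, t3 on one.
toy_rat = pd.DataFrame({
    "track_id": ["t1", "t1", "t2", "t2", "t3"],
    "album_id": ["b", "a", "d", "c", "e"],
})
toy_albums = pd.DataFrame({
    "album_id": ["a", "b", "c", "d", "e"],
    "release_date_parsed": pd.to_datetime(
        ["2005-01-01", "1994-09-13", None, "2010-06-01", "2017-10-27"]
    ),
})

linked = toy_rat.merge(toy_albums, on="album_id", how="left")

# Sort by track, then release date (missing dates sort last), then album_id
# as a tie-breaker; the first row per track is the main album.
main_album = (
    linked.sort_values(["track_id", "release_date_parsed", "album_id"])
          .drop_duplicates("track_id", keep="first")[["track_id", "album_id"]]
          .reset_index(drop=True)
)
picked = dict(zip(main_album["track_id"], main_album["album_id"]))
```

For `t2`, album `c` has no release date, so the dated album `d` wins even though `c` sorts first alphabetically. Including `album_id` in the sort keys makes the result reproducible when two albums share a release date.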

In [50]:
# ------------------------------------------------------------
# 1) Join: tracks -> audio_features (left join)
# ------------------------------------------------------------
# Why:
#   - Audio features are core ML predictors (danceability, energy, loudness, tempo, ...)
#   - Left join keeps all tracks even if audio features are missing for some rows.
assert "audio_feature_id" in tracks.columns, "tracks must contain audio_feature_id for join with audio_features.id"

# Rename audio PK 'id' to match tracks FK 'audio_feature_id'
audio_small = audio.rename(columns={"id": "audio_feature_id"})

# Merge track metadata + audio features into one wide table
track_df = tracks.merge(audio_small, on="audio_feature_id", how="left", suffixes=("", "_af"))

# ------------------------------------------------------------
# 2) Track -> Album (Many-to-Many) and choose ONE "main album"
# ------------------------------------------------------------
# Problem:
#   - A track can appear on multiple albums (releases, compilations, deluxe editions).
# ML requirement:
#   - We want a single album context per track to avoid duplicate rows / ambiguity.
# Strategy:
#   - Deterministic selection:
#       MAIN_ALBUM_STRATEGY == "earliest_release" -> pick earliest release_date
#       else -> pick smallest album_id (stable fallback)
albums_for_pick = albums.copy().rename(columns={"id": "album_id"})
albums_for_pick["release_date_parsed"] = pd.to_datetime(
    col_or_na(albums_for_pick, "release_date_parsed"), errors="coerce"
)

# Attach album release dates to the relationship table (album_id, track_id)
rat2 = rat.merge(
    albums_for_pick[["album_id", "release_date_parsed"]],
    on="album_id",
    how="left"
)

# Pick main album per track based on strategy
if MAIN_ALBUM_STRATEGY == "earliest_release":
    rat2 = rat2.sort_values(["track_id", "release_date_parsed", "album_id"], ascending=[True, True, True])
    main_album_per_track = rat2.drop_duplicates("track_id", keep="first")[["track_id", "album_id"]]
else:
    rat2 = rat2.sort_values(["track_id", "album_id"], ascending=[True, True])
    main_album_per_track = rat2.drop_duplicates("track_id", keep="first")[["track_id", "album_id"]]

# Merge the selected main album_id into track_df
track_df = track_df.merge(main_album_per_track, on="track_id", how="left")

# ------------------------------------------------------------
# 3) Merge album metadata onto track
# ------------------------------------------------------------
# Why:
#   - album_type / release_date provide useful context
#   - album_popularity is a strong proxy but can be leakage depending on your goal
#     (you can later drop it via ALLOW_LEAKY_FEATURES switch in feature selection)
albums_join = albums_for_pick.copy()

# Rename to avoid name clash with track popularity
rename_map = {}
if "popularity" in albums_join.columns:
    rename_map["popularity"] = "album_popularity"
albums_join = albums_join.rename(columns=rename_map)

track_df = track_df.merge(albums_join, on="album_id", how="left", suffixes=("", "_album"))

# ------------------------------------------------------------
# 4) Add release time features (year/month/decade)
# ------------------------------------------------------------
# Why:
#   - Popularity and audio trends are time-dependent
#   - Helps model capture temporal shift and era effects
track_df = add_release_time_features(track_df, "release_date_parsed")

# ------------------------------------------------------------
# 5) Track -> Artists list (Many-to-Many)
# ------------------------------------------------------------
# Why:
#   - A track can have multiple artists
#   - Keeping a list can be useful for later analysis/debugging
track_to_artists = (
    rta.groupby("track_id")["artist_id"]
       .apply(list)
       .reset_index()
       .rename(columns={"artist_id": "artist_ids"})
)
track_df = track_df.merge(track_to_artists, on="track_id", how="left")

# ------------------------------------------------------------
# 6) Aggregate artist statistics per track
# ------------------------------------------------------------
# Why:
#   - Artist popularity/followers often correlate with track reach
#   - For multi-artist tracks, we aggregate to stable numeric features
artist_feat = artists.rename(
    columns={"id": "artist_id", "popularity": "artist_popularity", "followers": "artist_followers"}
)
rta_art = rta.merge(artist_feat, on="artist_id", how="left")

artist_agg = (
    rta_art.groupby("track_id")
           .agg(
               n_artists=("artist_id", "nunique"),
               artist_popularity_mean=("artist_popularity", "mean"),
               artist_popularity_max=("artist_popularity", "max"),
               artist_followers_mean=("artist_followers", "mean"),
               artist_followers_max=("artist_followers", "max"),
           )
           .reset_index()
)
track_df = track_df.merge(artist_agg, on="track_id", how="left")

# ------------------------------------------------------------
# 7) Track -> Genres (via artist genres), union per track
# ------------------------------------------------------------
# Why:
#   - Genres are usually attached to artists
#   - We derive a track-level genre profile by taking the union across its artists
#
# Note:
#   - We store genre IDs because they are stable keys.
#   - Later you can convert to names or multi-hot encode Top-K genres.
rag2 = rag.copy()
if "genre_id" not in rag2.columns and "id" in rag2.columns:
    rag2 = rag2.rename(columns={"id": "genre_id"})

# Build artist -> [genre_id] list
artist_to_genres = (
    rag2.groupby("artist_id")["genre_id"]
        .apply(lambda x: sorted(set([g for g in x.dropna().tolist()])))
        .reset_index()
        .rename(columns={"genre_id": "artist_genres"})
)

# Join artist genres into track-artist mapping, then union genres per track
rta_gen = rta.merge(artist_to_genres, on="artist_id", how="left")
track_to_genres = (
    rta_gen.groupby("track_id")["artist_genres"]
           .apply(lambda rows: sorted(set([g for lst in rows.dropna()
                                          for g in (lst if isinstance(lst, list) else [])])))
           .reset_index()
           .rename(columns={"artist_genres": "track_genres"})
)
track_df = track_df.merge(track_to_genres, on="track_id", how="left")

# Ensure list type (important when loading from CSV where lists may become strings)
track_df["track_genres"] = ensure_list_column(col_or_na(track_df, "track_genres"))

# ------------------------------------------------------------
# 8) Feature Engineering (binary flags, text-derived, log transforms)
# ------------------------------------------------------------

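# Preview availability: a simple content/availability indicator.
# Derive it from preview_url when that column exists; otherwise keep a has_preview
# flag already provided by the clean layer instead of overwriting it with zeros.
if "preview_url" in track_df.columns:
    track_df["has_preview"] = track_df["preview_url"].notna().astype("int8")
elif "has_preview" not in track_df.columns:
    track_df["has_preview"] = pd.Series(0, index=track_df.index, dtype="int8")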

# Track name features: cheap but sometimes useful
track_df["name_len"] = safe_len_series(col_or_na(track_df, "name"))
track_df["name_words"] = safe_word_count_series(col_or_na(track_df, "name"))

# Duration robust handling: datasets often have duration_ms instead of duration
dur_col = "duration" if "duration" in track_df.columns else ("duration_ms" if "duration_ms" in track_df.columns else None)
track_df["log_duration"] = log1p_numeric(track_df[dur_col]) if dur_col else pd.Series(np.nan, index=track_df.index)

# Followers are heavy-tailed -> log helps stabilize scale
track_df["log_artist_followers_max"] = log1p_numeric(col_or_na(track_df, "artist_followers_max"))
track_df["log_artist_followers_mean"] = log1p_numeric(col_or_na(track_df, "artist_followers_mean"))

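# Indicator whether audio features are present (helps model handle missingness).
# Probe joined audio columns rather than the FK: the FK can be set even if the left join found no match.
af_probe = [c for c in ["danceability", "energy", "tempo"] if c in track_df.columns]
track_df["has_audio_features"] = (
    track_df[af_probe].notna().any(axis=1) if af_probe
    else col_or_na(track_df, "audio_feature_id").notna()
).astype("int8")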

print("Track-level dataset shape:", track_df.shape)
track_df.head(3)

Track-level dataset shape: (294618, 56)


Unnamed: 0,track_id,disc_number,duration,explicit,audio_feature_id,name,track_number,popularity,has_preview,is_long_track,is_tracknum_extreme,is_multidisc,is_disc_extreme,acousticness,analysis_url,danceability,duration_af,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,is_time_signature_rare,is_tempo_extreme,is_loudness_very_low,is_af_long,is_high_speech,is_instrumental,album_id,name_album,album_type,release_date,album_popularity,release_date_parsed,is_release_year_invalid,release_year,release_month,release_decade,artist_ids,n_artists,artist_popularity_mean,artist_popularity_max,artist_followers_mean,artist_followers_max,track_genres,name_len,name_words,log_duration,log_artist_followers_max,log_artist_followers_mean,has_audio_features
0,2DZN6ceJ7fMU2X6YWuIGHk,1,285053,False,2DZN6ceJ7fMU2X6YWuIGHk,Toccada del 3 Tono,14,0,0,0,0,0,0,0.621,https://api.spotify.com/v1/audio-analysis/2DZN...,0.147,285053.0,0.148,0.282,7,0.151,-21.444,1,0.0324,80.634003,3,0.0367,0.0,0.0,0.0,0.0,0.0,0.0,5v0bDDSl25qgrxOzxqoWXJ,Pedro Ruimonte en Bruselas (Música en la Corte...,album,1509062400000.0,0.0,2017-10-27,0.0,2017.0,10.0,2010.0,"[6xadlZzmcIMmgspceWCkt3, 0HWL7UfTuSRYVCrvTW5tj...",4,0.0,0,112.5,390,[musica antigua],18,4,12.560434,5.968708,4.731803,1
1,1dizvxctg9dHEyaYTFufVi,1,275893,True,1dizvxctg9dHEyaYTFufVi,Gz And Hustlas (feat. Nancy Fletcher),12,0,0,0,0,0,0,0.164,https://api.spotify.com/v1/audio-analysis/1diz...,0.652,275893.0,0.814,0.0,1,0.36,-4.901,1,0.31,91.888,4,0.788,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,NaT,,,,,"[7hJcb9fa4alzcOq3EaNPoG, 3E2vuvr0IQbReTbXw2MhX8]",2,0.0,0,3416346.5,6831895,"[g funk, gangster rap, hip hop, pop rap, rap, ...",37,6,12.527772,15.737113,15.044083,1
2,2g8HN35AnVGIk7B8yMucww,1,252746,True,2g8HN35AnVGIk7B8yMucww,Big Poppa - 2005 Remaster,13,0,0,0,0,0,0,0.43,https://api.spotify.com/v1/audio-analysis/2g8H...,0.78,252747.0,0.575,0.0,9,0.143,-7.247,0,0.273,84.491997,4,0.773,0.0,0.0,0.0,0.0,0.0,0.0,2HTbQ0RHwukKVXAlTmCZP2,Ready to Die (The Remaster),album,779414400000.0,0.0,1994-09-13,0.0,1994.0,9.0,1990.0,[5me0Irg2ANcsgc93uaYrpb],1,0.0,0,6258716.0,6258716,"[east coast hip hop, gangster rap, hardcore hi...",25,5,12.440144,15.649486,15.649486,1
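
The track table now contains popularity-like proxies such as `album_popularity`. A sketch of how the `ALLOW_LEAKY_FEATURES` switch could be applied during feature selection is shown below. Which columns count as leaky or as non-features is an assumption here, and `select_feature_columns` is a hypothetical helper.

```python
# Assumed lists for illustration; the real notebook may classify columns differently.
ASSUMED_LEAKY = ["album_popularity", "artist_popularity_mean", "artist_popularity_max"]
ASSUMED_NON_FEATURES = ["track_id", "popularity"]  # identifier and target

def select_feature_columns(columns, allow_leaky):
    """Drop identifiers/targets always, and leakage proxies unless explicitly allowed."""
    drop = set(ASSUMED_NON_FEATURES)
    if not allow_leaky:
        drop |= set(ASSUMED_LEAKY)
    return [c for c in columns if c not in drop]

cols = [
    "track_id", "popularity", "energy",
    "album_popularity", "artist_popularity_max", "log_duration",
]
strict = select_feature_columns(cols, allow_leaky=False)
lenient = select_feature_columns(cols, allow_leaky=True)
```

Keeping the leaky columns in the dataset and filtering them at selection time means the same Parquet file can serve both the realistic and the optimistic evaluation setting.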


## Album-Level Dataset (one row = one album)

**Goal:** We build an ML-ready table in which **each row represents one album**.
Because an album consists of many tracks and often has several artists, most of the features are **aggregations**.

### What exactly happens here?

1. **Album master data + release time features**
   - We start from `albums` (album metadata).
   - We parse `release_date_parsed` and derive:
     - `release_year`, `release_month`, `release_decade`

2. **Album size (track count)**
   - Using `r_albums_tracks` we count:
     - `n_tracks` = number of unique tracks per album
   - This is a strong structural feature (singles/EPs vs. albums).

3. **Album audio profile (aggregated track audio features)**
   - We aggregate audio features across all tracks of an album:
     - e.g. `album_mean_energy`, `album_mean_danceability`, `album_mean_loudness`, `album_mean_tempo`
   - This yields an "audio signature" of the album.

4. **Album artist profile (if `r_albums_artists` is available)**
   - An album can have several artists.
   - We aggregate artists per album:
     - `n_album_artists`
     - Mean and maximum of popularity/followers

5. **Album genre profile (union of the album artists' genres)**
   - Genres typically come from artists.
   - We build `album_genres` as the union of all artist genres on the album.

6. **Feature engineering**
   - `log_n_tracks`: log transform to reduce skew
   - `name_len`, `name_words`: simple text features from the album name

**Result:** `album_df` is an album-level feature matrix.
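
Steps 2 and 3 can be sketched on toy data as follows. The values are invented, and the sketch applies the prefix before `reset_index`, which is a small variation that avoids renaming the key column afterwards.

```python
import pandas as pd

# Toy relation and track features with invented values.
toy_rat = pd.DataFrame({"album_id": ["a", "a", "b"], "track_id": ["t1", "t2", "t3"]})
toy_tracks = pd.DataFrame({
    "track_id": ["t1", "t2", "t3"],
    "energy": [0.2, 0.6, 0.9],
    "tempo": [100.0, 120.0, 90.0],
})
audio_cols = ["energy", "tempo"]

joined = toy_rat.merge(toy_tracks, on="track_id", how="left")

# Album size: number of unique tracks per album.
n_tracks = (
    joined.groupby("album_id")["track_id"].nunique().rename("n_tracks").reset_index()
)

# Album audio signature: mean of each audio feature, prefixed for clarity.
album_audio = (
    joined.groupby("album_id")[audio_cols].mean()
          .add_prefix("album_mean_")
          .reset_index()
)

album_toy = n_tracks.merge(album_audio, on="album_id", how="left")
```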

In [51]:
# ------------------------------------------------------------
# 1) Start from album master data + parse dates
# ------------------------------------------------------------
# Why:
#   - Album-level tasks (e.g., album popularity regression) need a single row per album
#   - Release time features capture era effects and time bias

album_df = albums.copy()
album_df = album_df.rename(columns={"id": "album_id"})

album_df["release_date_parsed"] = pd.to_datetime(
    col_or_na(album_df, "release_date_parsed"), errors="coerce"
)

# Add derived time features: year / month / decade
album_df = add_release_time_features(album_df, "release_date_parsed")

# ------------------------------------------------------------
# 2) Album size feature: number of tracks per album
# ------------------------------------------------------------
# Why:
#   - Singles/EPs vs albums differ structurally (track count)
#   - Useful as a predictor and for data sanity checks
album_track_counts = (
    rat.groupby("album_id")["track_id"]
       .nunique()
       .reset_index()
       .rename(columns={"track_id": "n_tracks"})
)

# Both sides join on album_id (albums.id was renamed above)
album_df = album_df.merge(album_track_counts, on="album_id", how="left")

# ------------------------------------------------------------
# 3) Album audio signature: mean of track audio features
# ------------------------------------------------------------
# Why:
#   - Albums consist of multiple tracks; we aggregate to get a stable album-level profile
#   - Mean is a strong baseline aggregation (you could also add std/min/max later)
POLICY_AUDIO = [
    "acousticness", "danceability", "energy", "instrumentalness", "liveness",
    "speechiness", "valence", "loudness", "tempo"
]

# Keep only audio columns that exist (robust to schema differences)
audio_cols_present = [c for c in POLICY_AUDIO if c in track_df.columns]

# Join album-track relation to track audio features
rat_track_audio = rat.merge(track_df[["track_id"] + audio_cols_present], on="track_id", how="left")

# Aggregate per album (mean)
album_audio_agg = rat_track_audio.groupby("album_id")[audio_cols_present].mean().reset_index()

# Prefix columns so they are clearly album-aggregates
album_audio_agg = album_audio_agg.add_prefix("album_mean_").rename(columns={"album_mean_album_id": "album_id"})

# Merge audio aggregates back to album table
album_df = album_df.merge(album_audio_agg, on="album_id", how="left")

# ------------------------------------------------------------
# 4) Album -> artists aggregates (optional)
# ------------------------------------------------------------
# Why:
#   - Albums can have multiple artists; their popularity/followers often influence album success
#   - This block runs only if r_albums_artists exists in your export
if not raa.empty and "album_id" in raa.columns and "artist_id" in raa.columns:
    raa_art = raa.merge(artist_feat, on="artist_id", how="left")

    album_artist_agg = (
        raa_art.groupby("album_id")
              .agg(
                  n_album_artists=("artist_id", "nunique"),
                  album_artist_popularity_mean=("artist_popularity", "mean"),
                  album_artist_popularity_max=("artist_popularity", "max"),
                  album_artist_followers_mean=("artist_followers", "mean"),
                  album_artist_followers_max=("artist_followers", "max"),
              )
              .reset_index()
    )

    album_df = album_df.merge(album_artist_agg, on="album_id", how="left")

# ------------------------------------------------------------
# 5) Album genres union
# ------------------------------------------------------------
# Why:
#   - Spotify-like schemas often attach genres to artists
#   - We derive an album's genre profile as the union of all album artists' genres
if not raa.empty and {"album_id", "artist_id"} <= set(raa.columns):
    raa_gen = raa.merge(artist_to_genres, on="artist_id", how="left")

    album_to_genres = (
        raa_gen.groupby("album_id")["artist_genres"]
              .apply(lambda rows: sorted(set([
                  g for lst in rows.dropna()
                  for g in (lst if isinstance(lst, list) else [])
              ])))
              .reset_index()
              .rename(columns={"artist_genres": "album_genres"})
    )

    album_df = album_df.merge(album_to_genres, on="album_id", how="left")
else:
    # Keep a consistent schema even if we can't compute genres
    album_df["album_genres"] = [[] for _ in range(len(album_df))]

# Ensure list type (important for CSV fallback)
album_df["album_genres"] = ensure_list_column(col_or_na(album_df, "album_genres"))

# ------------------------------------------------------------
# 6) Feature engineering (log transforms, name features)
# ------------------------------------------------------------
# log transform track count (often heavy-tailed: singles vs compilations)
album_df["log_n_tracks"] = log1p_numeric(col_or_na(album_df, "n_tracks"))

# Simple text features from album name
album_df["name_len"] = safe_len_series(col_or_na(album_df, "name"))
album_df["name_words"] = safe_word_count_series(col_or_na(album_df, "name"))

print("Album-level dataset shape:", album_df.shape)
album_df.head(3)

Album-level dataset shape: (129152, 29)


Unnamed: 0,album_id,name,album_type,release_date,popularity,release_date_parsed,is_release_year_invalid,release_year,release_month,release_decade,n_tracks,album_mean_acousticness,album_mean_danceability,album_mean_energy,album_mean_instrumentalness,album_mean_liveness,album_mean_speechiness,album_mean_valence,album_mean_loudness,album_mean_tempo,n_album_artists,album_artist_popularity_mean,album_artist_popularity_max,album_artist_followers_mean,album_artist_followers_max,album_genres,log_n_tracks,name_len,name_words
0,7zr66qWybr1mAMSUVVosKU,Reflexo,album,1464220800000,0,2016-05-26,0,2016,5,2010,1,0.519,0.726,0.491,7e-06,0.0965,0.126,0.287,-11.166,109.935997,1.0,0.0,0,147089.0,147089,[hip hop tuga],0.693147,7,1
1,7zrLd0zddHOwA9DGlsDr4h,Floating World,album,1410652800000,0,2014-09-14,0,2014,9,2010,1,0.000516,0.335,0.823,0.331,0.213,0.0437,0.0699,-7.041,90.175003,1.0,0.0,0,75.0,75,[crossover prog],0.693147,14,2
2,7zri1pX9eMh0IqwpxMxOwp,Arne Aano's Beste - Slepp Himlen I Sjela Di Inn,album,1236729600000,0,2009-03-11,0,2009,3,2000,1,0.888,0.537,0.303,0.0,0.135,0.0365,0.541,-9.413,137.932007,1.0,0.0,0,0.0,0,[],0.693147,47,10
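
The `log_n_tracks` column above comes from the notebook's `log1p_numeric` helper, which is defined in an earlier cell and not shown in this section. As a rough illustration of what such a helper might do, here is a minimal sketch. The function name `log1p_numeric_sketch`, the coercion of bad values to NaN, and the clipping of negatives are all assumptions, not the notebook's actual implementation.

```python
import numpy as np
import pandas as pd


def log1p_numeric_sketch(values: pd.Series) -> pd.Series:
    """Hypothetical stand-in for the notebook's log1p_numeric helper."""
    # Assumption: non-numeric entries become NaN instead of raising
    numeric = pd.to_numeric(values, errors="coerce")
    # Assumption: negative counts are clipped to 0 (log1p needs values > -1)
    numeric = numeric.clip(lower=0)
    return np.log1p(numeric)


# Toy track counts: singles, a regular album, a large compilation, bad values
n_tracks = pd.Series([1, 1, 2, 12, 250, None, "n/a"])
log_n_tracks = log1p_numeric_sketch(n_tracks)

print(pd.DataFrame({"n_tracks": n_tracks, "log_n_tracks": log_n_tracks.round(6)}))
```

With `n_tracks = 1` the result is `log(2) ≈ 0.693147`, which matches the `log_n_tracks` values in the output above. The compilation with 250 tracks lands at roughly 5.5 rather than 250, so it no longer dominates distance-based models.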


## Artist-Level Dataset (one row = one artist)

**Goal:** We build an ML-ready table in which **each row represents one artist**.
This table is used mainly for **clustering / community detection** (unsupervised), but it can later also serve supervised tasks (e.g. artist popularity).

### What exactly happens here?

1. **Artist master data**
   - We start from `artists` and rename the ID column to `artist_id` so that joins are consistent.

2. **Artist style profile from tracks (aggregation)**
   - We link artists to their tracks via `r_track_artist`.
   - We attach the track features (audio, plus popularity/explicit where available) and then aggregate per artist:
     - `n_tracks`: number of unique tracks
     - `track_pop_mean`: mean track popularity (if available)
     - `explicit_rate`: share of explicit tracks (if available)
     - `mean_<audio_feature>`: mean audio signature (e.g. `mean_energy`, `mean_danceability`, …)

   Result: each artist gets a stable numeric vector that describes their "sound".

3. **Genres per artist**
   - We merge the list of genres (`artist_genres`) per artist (from `r_artist_genre`).
   - This list can later be used, for example, as multi-hot features.

4. **Feature engineering**
   - `log_followers`: log transform for heavy-tailed follower counts
   - `log_n_tracks`: log transform, because the number of tracks per artist is often heavily skewed

**Result:** `artist_df` contains, per artist:
- master data (name, popularity, followers, …)
- the aggregated track audio signature
- the genre list
- log-transformed features for scale stability
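
The genre list is kept as a Python list so that it can later be turned into multi-hot features. Here is a minimal sketch of a Top-K multi-hot encoding with scikit-learn's `MultiLabelBinarizer`. The toy data, `TOP_K = 2`, and the `genre__` column prefix are illustrative assumptions; the notebook's actual encoding step may differ.

```python
from collections import Counter

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for artist_df with a list-valued genre column
toy = pd.DataFrame({
    "artist_id": ["a1", "a2", "a3", "a4"],
    "artist_genres": [
        ["indie emo", "rock"],
        ["rock"],
        [],  # artist without genres -> all-zero row
        ["dutch hip hop", "rock", "indie emo"],
    ],
})

TOP_K = 2  # assumption: keep only the K most frequent genres

# Count genre frequency across all artists and keep the Top-K
genre_counts = Counter(g for genres in toy["artist_genres"] for g in genres)
top_genres = [g for g, _ in genre_counts.most_common(TOP_K)]

# Drop rare genres up front so the binarizer only sees known classes
filtered = toy["artist_genres"].apply(lambda gs: [g for g in gs if g in top_genres])

mlb = MultiLabelBinarizer(classes=top_genres)
multi_hot = mlb.fit_transform(filtered)

genre_features = pd.DataFrame(
    multi_hot,
    columns=[f"genre__{g}" for g in mlb.classes_],
    index=toy.index,
)
print(pd.concat([toy[["artist_id"]], genre_features], axis=1))
```

Restricting the encoding to the Top-K genres keeps the feature matrix narrow. Artists whose genres all fall outside the Top-K, or who have no genres at all, get an all-zero row.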


In [52]:
# ------------------------------------------------------------
# 1) Start from artist master data
# ------------------------------------------------------------
# Why:
#   - We want a single vector per artist for clustering / similarity analysis
#   - Rename PK to artist_id for consistent joins across tables
artist_df = artists.rename(columns={"id": "artist_id"}).copy()

# ------------------------------------------------------------
# 2) Build artist "style profile" by aggregating over all their tracks
# ------------------------------------------------------------
# Why:
#   - Artists have many tracks (Many-to-Many: r_track_artist)
#   - We want stable numeric features per artist:
#       * number of tracks
#       * average audio signature (mean_energy, mean_danceability, ...)
#       * optionally: average track popularity and explicit rate
cols_for_artist_agg = ["track_id"] + audio_cols_present

if "popularity" in track_df.columns:
    cols_for_artist_agg += ["popularity"]

if "explicit" in track_df.columns:
    cols_for_artist_agg += ["explicit"]

rta_track_audio = rta.merge(track_df[cols_for_artist_agg], on="track_id", how="left")


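# Helper: explicit rate per artist
# (share of explicit tracks; NaN if the artist has no usable values)
def explicit_rate_fn(x):
    xx = pd.to_numeric(x, errors="coerce")
    if xx.dropna().empty:
        return np.nan
    return float(np.nanmean(xx))

# Named aggregations: output column -> (input column, aggregation)
agg_dict = {
    "n_tracks": ("track_id", "nunique"),
}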

# Optional: average track popularity (proxy of how popular their tracks tend to be)
if "popularity" in rta_track_audio.columns:
    agg_dict["track_pop_mean"] = ("popularity", "mean")

# Optional: explicit rate (share of explicit tracks)
if "explicit" in rta_track_audio.columns:
    agg_dict["explicit_rate"] = ("explicit", explicit_rate_fn)

# Core: mean audio signature per artist
for c in audio_cols_present:
    agg_dict[f"mean_{c}"] = (c, "mean")

artist_audio_agg = (
    rta_track_audio.groupby("artist_id")
    .agg(**agg_dict)
    .reset_index()
)

# Merge aggregated features back into artist table
artist_df = artist_df.merge(artist_audio_agg, on="artist_id", how="left")

# ------------------------------------------------------------
# 3) Attach genres list per artist
# ------------------------------------------------------------
# Why:
#   - Genres are usually provided at artist-level
#   - We keep them as list for later multi-hot encoding (Top-K)
artist_df = artist_df.merge(artist_to_genres, on="artist_id", how="left")
artist_df["artist_genres"] = ensure_list_column(col_or_na(artist_df, "artist_genres"))

# ------------------------------------------------------------
# 4) Feature engineering (log transforms for heavy-tailed counts)
# ------------------------------------------------------------
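# followers and track counts are typically very skewed -> log1p compresses the scale
# NOTE: `artists` already carries a `followers_log1p` column that appears to hold the
#       same values as `log_followers` (compare the output below); use only one of
#       the two as a clustering feature to avoid double-weighting followers.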
artist_df["log_followers"] = log1p_numeric(col_or_na(artist_df, "followers"))
artist_df["log_n_tracks"] = log1p_numeric(col_or_na(artist_df, "n_tracks"))

print("Artist-level dataset shape:", artist_df.shape)
artist_df.head(3)


Artist-level dataset shape: (139608, 21)


Unnamed: 0,artist_id,name,popularity,followers,is_followers_extreme,followers_log1p,n_tracks,track_pop_mean,explicit_rate,mean_acousticness,mean_danceability,mean_energy,mean_instrumentalness,mean_liveness,mean_speechiness,mean_valence,mean_loudness,mean_tempo,artist_genres,log_followers,log_n_tracks
0,7zzl8HQ2v9hVdLh0Ygkwgc,Megatherio,0,59,0,4.094345,3,0.0,1.0,0.000101,0.267333,0.99,0.013101,0.104633,0.0902,0.3119,-4.583,136.755666,[brazilian thrash metal],4.094345,1.386294
1,00045gNg7mLEf9UY9yhD0t,Kubus & BangBang,0,820,0,6.710523,11,0.0,1.0,0.121626,0.658273,0.651,0.000177,0.246955,0.3073,0.497918,-8.265273,122.092817,[dutch hip hop],6.710523,2.484907
2,000xagx3GkcunHTFdB4ly0,Moxa,0,156,0,5.056246,1,0.0,0.0,0.000151,0.441,0.959,0.334,0.229,0.0611,0.171,-4.694,138.009003,[indie emo],5.056246,0.693147


## Save modeling datasets

In [53]:
track_out = PATHS.modeling_dir / "track_dataset.parquet"
album_out = PATHS.modeling_dir / "album_dataset.parquet"
artist_out = PATHS.modeling_dir / "artist_dataset.parquet"

track_df.to_parquet(track_out, index=False)
album_df.to_parquet(album_out, index=False)
artist_df.to_parquet(artist_out, index=False)

print(" Saved modeling datasets:")
print(" -", track_out)
print(" -", album_out)
print(" -", artist_out)

 Saved modeling datasets:
 - ..\data\processed\modeling\track_dataset.parquet
 - ..\data\processed\modeling\album_dataset.parquet
 - ..\data\processed\modeling\artist_dataset.parquet
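
Because the datasets are built from several merges over many-to-many relations, it can be worth checking the "one row = one entity" assumption before relying on the saved files. Below is a minimal sketch of such a check. The helper name `assert_one_row_per_key` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd


def assert_one_row_per_key(df: pd.DataFrame, key: str) -> None:
    """Raise ValueError if `key` is missing, contains nulls, or is duplicated."""
    if key not in df.columns:
        raise ValueError(f"Key column '{key}' is missing")
    if df[key].isna().any():
        raise ValueError(f"Key column '{key}' contains null values")
    dupes = df.loc[df[key].duplicated(keep=False), key]
    if not dupes.empty:
        examples = dupes.unique()[:5].tolist()
        raise ValueError(f"Key column '{key}' has duplicates, e.g. {examples}")


# Toy stand-in for artist_df: passes silently
toy_artist_df = pd.DataFrame({"artist_id": ["a1", "a2", "a3"], "n_tracks": [3, 11, 1]})
assert_one_row_per_key(toy_artist_df, "artist_id")

# A merge that fans out rows would be caught here
toy_duplicated = pd.DataFrame({"artist_id": ["a1", "a1", "a2"], "n_tracks": [3, 3, 11]})
try:
    assert_one_row_per_key(toy_duplicated, "artist_id")
except ValueError as err:
    print("Check failed:", err)
```

In the notebook, the same helper could be called as `assert_one_row_per_key(artist_df, "artist_id")` just before `to_parquet`, and analogously for the album dataset with `album_id`. The track dataset's key appears to be `track_id`, judging by the merges above.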
