# Building Modeling Tables for ML Pipelines

## Goal
This notebook builds the tables and structures needed for ML pipelines in the Spotify data analysis.

By the end of the notebook we have:
- prepared the data for ML models (one table each at track, album, and artist level)
- saved the tables in a format suited to modeling (Parquet)

## Imports and Setup

In [1]:
import numpy as np
import pandas as pd
from typing import Dict

from utils.data.parsing import parse_datetime_from_candidates, col_or_na, ensure_list_column
from utils.features.time_features import add_release_time_features, pick_release_cols_fast
from utils.features.numeric_transform import log1p_numeric
from utils.features.text_features import safe_len_series, safe_word_count_series

from utils.config.settings import (
    RANDOM_SEED,
    ALLOW_LEAKY_FEATURES,
    MAIN_ALBUM_STRATEGY,
)

from utils.core.paths import (
    load_sample_name,
    make_paths,
    ensure_dirs,
    build_run_meta,
)

SAMPLE_NAME = load_sample_name()
PATHS = make_paths(SAMPLE_NAME)
ensure_dirs(PATHS)

RUN_META = build_run_meta(
    PATHS,
    random_seed=RANDOM_SEED,
    allow_leaky_features=ALLOW_LEAKY_FEATURES,
    main_album_strategy=MAIN_ALBUM_STRATEGY,
)



## Load and Prepare Data

In [2]:
TABLES = [
    "tracks",
    "audio_features",
    "albums",
    "artists",
    "genres",
    "r_albums_tracks",
    "r_track_artist",
    "r_artist_genre",
    "r_albums_artists",
]


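def load_table(name: str) -> pd.DataFrame:
    """Load one cleaned table from the parquet layer."""
    pq = PATHS.clean_parquet_dir / f"{name}.parquet"

    if not pq.exists():
        # Only the parquet layer is checked here; there is no CSV fallback.
        raise FileNotFoundError(f"Missing {name}.parquet in the clean parquet layer: {pq}")

    return pd.read_parquet(pq)


# Load every table that exists. Missing tables are skipped here and
# surface as a KeyError in the next cell when they are accessed.
data: Dict[str, pd.DataFrame] = {}
for t in TABLES:
    if (PATHS.clean_parquet_dir / f"{t}.parquet").exists():
        data[t] = load_table(t)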

{k: v.shape for k, v in data.items()}

{'tracks': (300000, 13),
 'audio_features': (299954, 21),
 'albums': (195938, 8),
 'artists': (187440, 6),
 'genres': (5455, 1),
 'r_albums_tracks': (340898, 2),
 'r_track_artist': (407296, 2),
 'r_artist_genre': (194023, 2),
 'r_albums_artists': (218032, 2)}

## Extract Tables into Variables

In [3]:
tracks2 = data["tracks"].copy()
audio2 = data["audio_features"].copy()
rat2src = data["r_albums_tracks"].copy()
albums_src = data["albums"].copy()
rta = data["r_track_artist"].copy()
rag = data["r_artist_genre"].copy()
raa = data["r_albums_artists"].copy()
artists = data["artists"].copy()

## Track-Level Dataset (1 row = 1 track)

**Goal:** Build a denormalized, ML-ready table (`track_df`) in which **each row represents one track**.

**Steps (in brief):**
1. **Tracks + audio features**
   Left join of the numeric audio features (missing values stay `NaN`).

2. **Track → album (M:N → 1)**
   Deterministic selection of one *main album* per track (e.g. earliest release).

3. **Album metadata & time features**
   Merge of album info plus derivation of `release_year`, `release_month`, `release_decade`.

4. **Track → artists (M:N → aggregation)**
   Storage of the `artist_ids` plus aggregation of artist statistics
   (e.g. number of artists, mean/max of popularity and followers).

5. **Genres via artists**
   Union of all artist genres per track → `track_genres` (list).

6. **Feature engineering**
   - Text & meta: `has_preview`, `name_len`, `name_words`
   - Log transforms: duration and follower counts
   - Quality flag: `has_audio_features`

**Output:** `track_df`, a consistent feature table at track level.
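
Step 2 is the least obvious part of this pipeline, so here is a minimal sketch of the "earliest release" selection on toy data. The frame `rel` and all of its values are made up for illustration. Only the column names follow the notebook.

```python
import pandas as pd

# Toy album-track relation with parsed release dates (illustrative values only).
rel = pd.DataFrame({
    "track_id": ["t1", "t1", "t1", "t2", "t2"],
    "album_id": ["a3", "a1", "a2", "a5", "a4"],
    "album_release_date_parsed": pd.to_datetime(
        ["2010-05-01", None, "2001-03-15", None, None]
    ),
})

# Sort by track, then release date (NaT is placed last by default), then
# album_id as a deterministic tie-breaker; keep the first row per track.
main_album = (
    rel.sort_values(["track_id", "album_release_date_parsed", "album_id"])
       .drop_duplicates("track_id", keep="first")[["track_id", "album_id"]]
       .reset_index(drop=True)
)

print(main_album)
```

In this toy case `t1` gets `a2` (the earliest dated album) and `t2`, which has no dated album at all, falls back to the smallest `album_id`. The tie-breaker keeps the result reproducible across runs.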


In [4]:



# Ensure required id columns are strings everywhere (safe for UUIDs too)
for df, col in [(tracks2, "track_id"), (tracks2, "audio_feature_id")]:
    if col in df.columns:
        df[col] = df[col].astype(str)

if "id" in audio2.columns:
    audio2["id"] = audio2["id"].astype(str)

for col in ["album_id", "track_id"]:
    if col in rat2src.columns:
        rat2src[col] = rat2src[col].astype(str)

# album id name normalization
albums_for_pick = albums_src.copy()
if "album_id" not in albums_for_pick.columns and "id" in albums_for_pick.columns:
    albums_for_pick = albums_for_pick.rename(columns={"id": "album_id"})

albums_for_pick["album_id"] = albums_for_pick["album_id"].astype(str)

# ------------------------------------------------------------
# 1) Join: tracks -> audio_features (LEFT JOIN)
# ------------------------------------------------------------
assert "audio_feature_id" in tracks2.columns, (
    "tracks must contain audio_feature_id to join with audio_features.id"
)

audio_small = audio2.rename(columns={"id": "audio_feature_id"}).copy()
audio_small["audio_feature_id"] = audio_small["audio_feature_id"].astype(str)

track_df = tracks2.merge(
    audio_small,
    on="audio_feature_id",
    how="left",
    suffixes=("", "_af")
)

# Parse TRACK release date (fallback; tracks often don't have it)
track_df = parse_datetime_from_candidates(
    track_df,
    candidates=["release_date", "track_release_date", "track_release_date_parsed"],
    out_col="track_release_date_parsed"
)

# ------------------------------------------------------------
# 2) Track -> Album (Many-to-Many) + choose ONE "Main Album"
# ------------------------------------------------------------
# Robust album release date parse:
# IMPORTANT: handle BOTH release_date and release_date_parsed (your schemas vary)
albums_for_pick = parse_datetime_from_candidates(
    albums_for_pick,
    candidates=["release_date_parsed", "release_date", "album_release_date_parsed", "album_release_date"],
    out_col="album_release_date_parsed"
)

# rat: (album_id, track_id) + album release date
rat2 = rat2src.merge(
    albums_for_pick[["album_id", "album_release_date_parsed"]],
    on="album_id",
    how="left"
)

# Choose main album per track
if MAIN_ALBUM_STRATEGY == "earliest_release":
    # NaT sorts last -> earliest valid date wins; album_id breaks ties
    sort_cols = ["track_id", "album_release_date_parsed", "album_id"]
else:
    # Fallback: smallest album_id per track (deterministic, but arbitrary)
    sort_cols = ["track_id", "album_id"]

rat2 = rat2.sort_values(sort_cols)
main_album_per_track = rat2.drop_duplicates("track_id", keep="first")[["track_id", "album_id"]]

track_df["track_id"] = track_df["track_id"].astype(str)
track_df = track_df.merge(main_album_per_track, on="track_id", how="left")

# ------------------------------------------------------------
# 3) Merge Album metadata onto track_df
# ------------------------------------------------------------
albums_join = albums_for_pick.copy()

# avoid collision with track popularity
if "popularity" in albums_join.columns:
    albums_join = albums_join.rename(columns={"popularity": "album_popularity"})

# keep raw release date string for debugging (optional)
if "release_date" in albums_join.columns:
    albums_join = albums_join.rename(columns={"release_date": "album_release_date_raw"})

track_df["album_id"] = track_df["album_id"].astype("string")
albums_join["album_id"] = albums_join["album_id"].astype("string")

track_df = track_df.merge(
    albums_join,
    on="album_id",
    how="left",
    suffixes=("", "_album")
)

# ------------------------------------------------------------
# 3.1) Master release date (Album > Track fallback)
# ------------------------------------------------------------
# Prefer album date, fallback to track date
track_df["release_date_parsed"] = col_or_na(track_df, "album_release_date_parsed").combine_first(
    col_or_na(track_df, "track_release_date_parsed")
)

# Build time features from master parsed date
track_df = add_release_time_features(track_df, "release_date_parsed")

# ------------------------------------------------------------
# 4) Artist aggregations per track
# ------------------------------------------------------------
artist_feat = artists.rename(
    columns={
        "id": "artist_id",
        "popularity": "artist_popularity",
        "followers": "artist_followers"
    }
)

artist_feat["artist_id"] = artist_feat["artist_id"].astype(str)
rta["artist_id"] = rta["artist_id"].astype(str)
rta["track_id"] = rta["track_id"].astype(str)

rta_art = rta.merge(artist_feat, on="artist_id", how="left")

artist_agg = (
    rta_art.groupby("track_id")
    .agg(
        artist_ids=("artist_id", lambda x: sorted(set(x.dropna().tolist()))),
        n_artists=("artist_id", "nunique"),
        artist_popularity_mean=("artist_popularity", "mean"),
        artist_popularity_max=("artist_popularity", "max"),
        artist_followers_mean=("artist_followers", "mean"),
        artist_followers_max=("artist_followers", "max"),
    )
    .reset_index()
)

track_df = track_df.merge(artist_agg, on="track_id", how="left")

# ------------------------------------------------------------
# 5) Track -> Genres via Artist genres
# ------------------------------------------------------------
rag2 = rag.copy()
if "genre_id" not in rag2.columns and "id" in rag2.columns:
    rag2 = rag2.rename(columns={"id": "genre_id"})

if "genre_id" in rag2.columns:
    rag2["genre_id"] = rag2["genre_id"].astype(str)

# ensure artist_id dtype matches rta
if "artist_id" in rag2.columns:
    rag2["artist_id"] = rag2["artist_id"].astype(str)

artist_to_genres = (
    rag2.groupby("artist_id")["genre_id"]
    .apply(lambda x: sorted(set(x.dropna().tolist())))
    .reset_index()
    .rename(columns={"genre_id": "artist_genres"})
)

rta_gen = rta.merge(artist_to_genres, on="artist_id", how="left")

track_to_genres = (
    rta_gen.groupby("track_id")["artist_genres"]
    .apply(lambda rows: sorted(set([
        g for lst in rows.dropna()
        for g in (lst if isinstance(lst, list) else [])
    ])))
    .reset_index()
    .rename(columns={"artist_genres": "track_genres"})
)

track_df = track_df.merge(track_to_genres, on="track_id", how="left")
track_df["track_genres"] = ensure_list_column(col_or_na(track_df, "track_genres"))

# ------------------------------------------------------------
# 6) Feature Engineering
# ------------------------------------------------------------
track_df["has_preview"] = col_or_na(track_df, "preview_url").notna().astype("int8")

track_df["name_len"] = safe_len_series(col_or_na(track_df, "name"))
track_df["name_words"] = safe_word_count_series(col_or_na(track_df, "name"))

dur_col = "duration" if "duration" in track_df.columns else (
    "duration_ms" if "duration_ms" in track_df.columns else None)
track_df["log_duration"] = log1p_numeric(track_df[dur_col]) if dur_col else pd.Series(np.nan, index=track_df.index)

track_df["log_artist_followers_max"] = log1p_numeric(col_or_na(track_df, "artist_followers_max"))
track_df["log_artist_followers_mean"] = log1p_numeric(col_or_na(track_df, "artist_followers_mean"))

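# Flag tracks whose audio-features join actually matched.
# A plain notna() on audio_feature_id is not enough: the column was cast with
# astype(str) above, which (depending on the pandas version) turns missing ids
# into the string "nan", and an id can exist without a row in audio_features.
track_df["has_audio_features"] = (
    col_or_na(track_df, "audio_feature_id")
    .isin(audio_small["audio_feature_id"])
    .astype("int8")
)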

# ------------------------------------------------------------
# Sanity check: shape and first rows
# ------------------------------------------------------------
print("Track-level dataset shape:", track_df.shape)

track_df.head(3)


Track-level dataset shape: (300000, 58)


| | track_id | disc_number | duration | explicit | audio_feature_id | name | track_number | popularity | has_preview | is_long_track | ... | artist_popularity_max | artist_followers_mean | artist_followers_max | track_genres | name_len | name_words | log_duration | log_artist_followers_max | log_artist_followers_mean | has_audio_features |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0jBh6p4phjdP46bN3RUW0X | 1 | 254426 | False | 0jBh6p4phjdP46bN3RUW0X | I vespri siciliani (Sung in German): Act II: D... | 14 | 0 | 0 | 0 | ... | 63 | 31112.066667 | 463518 | [classical, classical bass, classical soprano,... | 95 | 16 | 12.446769 | 13.046603 | 10.345383 | 1 |
| 1 | 0JJDSzvy912NVhxpQMHRKd | 1 | 213000 | False | 0JJDSzvy912NVhxpQMHRKd | I Love to Dance (But I Hate This Song) | 1 | 4 | 0 | 0 | ... | 0 | 187.0 | 187 | [uk pop punk] | 38 | 9 | 12.269052 | 5.236442 | 5.236442 | 1 |
| 2 | 0jEprLfYeA5OewUMfrcVI7 | 1 | 168018 | False | 0jEprLfYeA5OewUMfrcVI7 | Sultan V. Murad İçin Şarkı-i Duaiye | 11 | 20 | 0 | 0 | ... | 27 | 2350.0 | 2350 | [oriental classical, turkish classical] | 35 | 6 | 12.031832 | 7.762596 | 7.762596 | 1 |
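
The cell above prints only the shape and the first rows. A few extra checks can confirm that the left joins kept the table at one row per track. The sketch below uses toy frames, and `check_one_row_per_key` is a hypothetical helper, not part of the project's `utils` package.

```python
import pandas as pd


def check_one_row_per_key(df: pd.DataFrame, key: str, expected_rows: int) -> dict:
    """Hypothetical helper: summarize row count and key uniqueness after joins."""
    return {
        "rows_match": len(df) == expected_rows,
        "key_is_unique": bool(df[key].is_unique),
        "n_missing_key": int(df[key].isna().sum()),
    }


# Toy data: a left join against a lookup with duplicated keys inflates the row count.
tracks = pd.DataFrame({"track_id": ["t1", "t2", "t3"]})
lookup_dup = pd.DataFrame({"track_id": ["t1", "t1", "t2"], "album_id": ["a1", "a2", "a3"]})
lookup_ok = lookup_dup.drop_duplicates("track_id", keep="first")

bad = tracks.merge(lookup_dup, on="track_id", how="left")
good = tracks.merge(lookup_ok, on="track_id", how="left")

report_bad = check_one_row_per_key(bad, "track_id", expected_rows=len(tracks))
report_good = check_one_row_per_key(good, "track_id", expected_rows=len(tracks))
print(report_bad)
print(report_good)

# Share of rows without a parsed release date (NaT) as a quick data-quality signal.
dates = pd.Series(pd.to_datetime(["2005-06-28", None, "2016-04-13", None]))
nat_share = dates.isna().mean()
print("NaT share:", nat_share)
```

Applied to the notebook, the same idea would compare `len(track_df)` with `len(tracks2)`. As a stricter alternative, `DataFrame.merge` accepts `validate="many_to_one"`, which raises if the right-hand side has duplicate keys.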


## Album-Level Dataset (1 row = 1 album)

**Goal:** Build an ML-ready table (`album_df`) in which **each row represents one album**.
The focus is on **aggregation features** over tracks and artists.

**Steps (in brief):**
1. **Album metadata & time features**
   Parsing of `release_date` → `release_year`, `release_month`, `release_decade`.

2. **Album size**
   Number of tracks per album (`n_tracks`) via the album–track relation.

3. **Album audio profile**
   Aggregation of the track audio features
   (e.g. means of energy, danceability, loudness, tempo).

4. **Album artist profile**
   Aggregation of the participating artists
   (number of artists, mean/max of popularity and followers).

5. **Album genres**
   Union of all artist genres per album → `album_genres`.

6. **Feature engineering**
   Log transform (`log_n_tracks`) plus text features (`name_len`, `name_words`).

**Output:** `album_df`, a feature matrix at album level.
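
Steps 2 and 3 boil down to a join followed by a group-wise aggregation. The following sketch shows the pattern on toy data. All values are invented, and `album_toy` is an illustrative name. It applies `add_prefix` before `reset_index`, which is one way to keep the key column unprefixed.

```python
import pandas as pd

# Toy relation and track features (illustrative values only).
rel = pd.DataFrame({
    "album_id": ["a1", "a1", "a2"],
    "track_id": ["t1", "t2", "t3"],
})
track_feats = pd.DataFrame({
    "track_id": ["t1", "t2", "t3"],
    "energy": [0.2, 0.6, 0.9],
    "tempo": [100.0, 120.0, 90.0],
})
audio_cols = ["energy", "tempo"]

# Attach track features to each (album, track) pair, then average per album.
album_audio = (
    rel.merge(track_feats, on="track_id", how="left")
       .groupby("album_id")[audio_cols]
       .mean()
       .add_prefix("album_mean_")  # only feature columns exist at this point
       .reset_index()              # album_id becomes a regular column afterwards
)

# Track count per album from the same relation.
n_tracks = (
    rel.groupby("album_id")["track_id"].nunique().rename("n_tracks").reset_index()
)

album_toy = album_audio.merge(n_tracks, on="album_id", how="left")
print(album_toy)
```

The notebook cell below prefixes after `reset_index` and renames `album_mean_album_id` back to `album_id`. Both orders lead to the same columns.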


In [5]:
album_df = albums_src.copy()

# Consistent key name: "album_id" is the primary key everywhere.
album_df = album_df.rename(columns={"id": "album_id"})

# Parse the release date robustly:
# - col_or_na: avoids an error if the column is missing
# - errors="coerce": invalid values become NaT instead of raising
album_df["release_date_parsed"] = pd.to_datetime(
    col_or_na(album_df, "release_date_parsed"),
    errors="coerce"
)

# Derive additional time features (year/month/decade)
album_df = add_release_time_features(album_df, "release_date_parsed")

# ------------------------------------------------------------
# 2) Album size: number of tracks per album
# ------------------------------------------------------------
# Why?
# - Singles, EPs, and albums differ strongly in structure.
# - The track count is often a useful predictor and doubles as a sanity check.
# - We use rat2 (the album_id <-> track_id relation table).

album_track_counts = (
    rat2.groupby("album_id")["track_id"]
    .nunique()  # how many distinct tracks per album?
    .reset_index()
    .rename(columns={"track_id": "n_tracks"})
)

# Merge the track count into album_df (LEFT JOIN: the album stays even if no tracks are mapped)
album_df = album_df.merge(album_track_counts, on="album_id", how="left")

# ------------------------------------------------------------
# 3) Album "audio signature": means of the track audio features
# ------------------------------------------------------------
# Why?
# - An album is a collection of tracks -> we need a stable album profile.
# - The mean is a reasonable baseline aggregator (std/min/max could be added later).
# - We deliberately use typical Spotify-style audio features.

POLICY_AUDIO = [
    "acousticness", "danceability", "energy", "instrumentalness", "liveness",
    "speechiness", "valence", "loudness", "tempo"
]

# Robustness:
# - Not every schema contains all audio features.
# - We only take the columns that actually exist in track_df.
audio_cols_present = [c for c in POLICY_AUDIO if c in track_df.columns]

# Join the album->track relation with the track audio features:
# rat2 provides track_id + album_id, track_df provides the audio columns per track_id.
rat_track_audio = rat2.merge(
    track_df[["track_id"] + audio_cols_present],
    on="track_id",
    how="left"
)

# Aggregate per album (mean over all tracks in the album)
album_audio_agg = (
    rat_track_audio
    .groupby("album_id")[audio_cols_present]
    .mean()
    .reset_index()
)

# Prefix the column names to make the level explicit:
# "album_mean_energy" = album-level mean (not track-level).
# Note: add_prefix also affects album_id -> hence the rename back afterwards.
album_audio_agg = (
    album_audio_agg
    .add_prefix("album_mean_")
    .rename(columns={"album_mean_album_id": "album_id"})
)

# Merge the aggregated audio features back into album_df
album_df = album_df.merge(album_audio_agg, on="album_id", how="left")

# ------------------------------------------------------------
# 4) Album -> artist aggregations (optional, if available)
# ------------------------------------------------------------
# Why?
# - Albums can have several artists.
# - Popularity/followers of these artists often influence album success.
# - This block only runs if the relation table raa is non-empty and has both key columns,
#   which keeps the code compatible with different exports.

if not raa.empty and "album_id" in raa.columns and "artist_id" in raa.columns:
    # Attach the artist features (artist_feat) to the album-artist relations.
    raa_art = raa.merge(artist_feat, on="artist_id", how="left")

    # Aggregation per album:
    # - n_album_artists: how many artists does the album have?
    # - mean/max: average level + "top artist" signal
    album_artist_agg = (
        raa_art.groupby("album_id")
        .agg(
            n_album_artists=("artist_id", "nunique"),
            album_artist_popularity_mean=("artist_popularity", "mean"),
            album_artist_popularity_max=("artist_popularity", "max"),
            album_artist_followers_mean=("artist_followers", "mean"),
            album_artist_followers_max=("artist_followers", "max"),
        )
        .reset_index()
    )

    album_df = album_df.merge(album_artist_agg, on="album_id", how="left")

# ------------------------------------------------------------
# 5) Album genres: union over all album artists
# ------------------------------------------------------------
# Why?
# - In many Spotify schemas genres are attached to artists, not directly to albums.
# - We define album genres as the union of all genres of the album's artists.
# - The result is a list of genre_id values (stable keys).

if not raa.empty:
    # artist_to_genres contains: artist_id -> [genre_id, ...]
    raa_gen = raa.merge(artist_to_genres, on="artist_id", how="left")

    # Per album: flatten all artist genre lists, take the union (set), sort (stable order)
    album_to_genres = (
        raa_gen.groupby("album_id")["artist_genres"]
        .apply(lambda rows: sorted(set([
            g for lst in rows.dropna()
            for g in (lst if isinstance(lst, list) else [])
        ])))
        .reset_index()
        .rename(columns={"artist_genres": "album_genres"})
    )

    album_df = album_df.merge(album_to_genres, on="album_id", how="left")

else:
    # If no raa data exists:
    # - We still provide a consistent "album_genres" column
    #   so that downstream code does not break.
    album_df["album_genres"] = [[] for _ in range(len(album_df))]

# When saving/loading via CSV, lists sometimes turn into strings.
# ensure_list_column turns them back into real Python lists.
album_df["album_genres"] = ensure_list_column(col_or_na(album_df, "album_genres"))

# ------------------------------------------------------------
# 6) Feature engineering (log transforms + name features)
# ------------------------------------------------------------
# (A) Log transform of the track count:
# - n_tracks can be heavy-tailed (singles = 1..2 vs. compilations = 50+)
# - log1p compresses the scale and reduces the influence of outliers.
album_df["log_n_tracks"] = log1p_numeric(col_or_na(album_df, "n_tracks"))

# (B) Text features from the album name:
# - simple, cheap features that are sometimes helpful
album_df["name_len"] = safe_len_series(col_or_na(album_df, "name"))
album_df["name_words"] = safe_word_count_series(col_or_na(album_df, "name"))

# ------------------------------------------------------------
# Debug / sanity check
# ------------------------------------------------------------
print("Album-level dataset shape:", album_df.shape)
album_df.head(3)


Album-level dataset shape: (195938, 29)


| | album_id | name | album_type | release_date | popularity | release_date_parsed | is_release_year_invalid | release_year | release_month | release_decade | ... | album_mean_tempo | n_album_artists | album_artist_popularity_mean | album_artist_popularity_max | album_artist_followers_mean | album_artist_followers_max | album_genres | log_n_tracks | name_len | name_words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7zzibEGo1mQ1jXP0sy9MpY | Trophy | album | 1119916800000 | 10 | 2005-06-28 | 0 | 2005 | 6 | 2000 | ... | 107.091003 | 1.0 | 20.0 | 20 | 9904.0 | 9904 | [gaian doom, post-metal] | 0.693147 | 6 | 1 |
| 1 | 000EzOAjrELtNitY1ENo4S | De Ja Vu (Lips & Akiko Kiyama Remixes) | album | 1284940800000 | 0 | 2010-09-20 | 0 | 2010 | 9 | 2010 | ... | 125.000999 | 2.0 | 0.0 | 0 | 0.0 | 0 | [classic house, deep funk house, deep house, d... | 0.693147 | 38 | 8 |
| 2 | 7zt6XxPOo65XwZgUVlaQIB | Big History | album | 1460505600000 | 0 | 2016-04-13 | 0 | 2016 | 4 | 2010 | ... | 104.986 | 1.0 | 0.0 | 0 | 0.0 | 0 | [] | 0.693147 | 11 | 2 |
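
The `album_genres` column in the output above is the union of the genres of all artists on an album. The notebook builds it from per-artist lists. An equivalent formulation joins at the long level first and collects distinct genres afterwards. The sketch below shows that variant on made-up data, so it is an alternative illustration rather than the cell's exact code.

```python
import pandas as pd

# Toy relations (illustrative values only).
album_artist = pd.DataFrame({
    "album_id": ["a1", "a1", "a2", "a3"],
    "artist_id": ["x", "y", "y", "z"],
})
artist_genre = pd.DataFrame({
    "artist_id": ["x", "x", "y"],
    "genre_id": ["post-metal", "doom", "doom"],
})

# One row per (album, artist, genre); artists without genres yield NaN.
long = album_artist.merge(artist_genre, on="artist_id", how="left")

# Collect the distinct genres per album as a sorted list (empty if none exist).
album_genres = (
    long.groupby("album_id")["genre_id"]
        .apply(lambda s: sorted(set(s.dropna())))
        .rename("album_genres")
        .reset_index()
)
print(album_genres)
```

Album `a3` has an artist without any genre and therefore ends up with an empty list, which matches the `[]` entry seen in the real output.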


## Artist-Level Dataset (1 row = 1 artist)

**Goal:** An ML-ready table (`artist_df`) at artist level, primarily for **clustering / community detection**, optionally also for supervised tasks.

**Steps (in brief):**
1. **Artist master data**
   Start from `artists`, with a consistent `artist_id` for joins.

2. **Artist style profile (aggregation over tracks)**
   Link artist ↔ tracks and aggregate:
   - Number of tracks (`n_tracks`)
   - Mean track popularity, explicit rate (if available)
   - Means of the audio features (the artist "sound" vector)

3. **Artist genres**
   Merge of the genre list per artist (`artist_genres`).

4. **Feature engineering**
   Log transforms for followers and track count.

**Output:** `artist_df`, a numeric artist profile plus genre information.
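
Step 2 relies on pandas named aggregation with a specification that is assembled at runtime. A minimal sketch on toy data follows. The frame, its values, and the helper name `explicit_rate` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy track-artist relation with track features already joined (illustrative values).
rel = pd.DataFrame({
    "artist_id": ["x", "x", "y"],
    "track_id": ["t1", "t2", "t3"],
    "explicit": [1.0, 0.0, np.nan],
    "energy": [0.5, 0.7, 0.1],
})


def explicit_rate(values: pd.Series) -> float:
    """Share of explicit tracks; NaN if no usable value exists."""
    numeric = pd.to_numeric(values, errors="coerce")
    return float(numeric.mean()) if numeric.notna().any() else np.nan


# Build the aggregation spec depending on which columns are present.
agg_spec = {"n_tracks": ("track_id", "nunique")}
if "explicit" in rel.columns:
    agg_spec["explicit_rate"] = ("explicit", explicit_rate)
for col in ["energy"]:
    agg_spec[f"mean_{col}"] = (col, "mean")

# One row per artist.
artist_profile = rel.groupby("artist_id").agg(**agg_spec).reset_index()
print(artist_profile)
```

Artist `y` has no usable `explicit` value, so its rate stays `NaN` instead of being reported as 0. That keeps "unknown" distinguishable from "never explicit".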


In [6]:
# ------------------------------------------------------------
# 1) Start: load artist master data + name the primary key consistently
# ------------------------------------------------------------
# Why?
# - For clustering / similarity we need one feature vector per artist.
# - Consistent key names ("artist_id") keep joins across several tables stable.
artist_df = artists.rename(columns={"id": "artist_id"}).copy()

# ------------------------------------------------------------
# 2) Artist "style profile": aggregation over all tracks of the artist
# ------------------------------------------------------------
# Why?
# - The rta relation is many-to-many: an artist has many tracks, a track can have several artists.
# - For artist-level analyses we need stable numeric features:
#   * n_tracks: number of distinct tracks
#   * mean audio signature: averages of energy, danceability, ...
#   * optional: mean track popularity (how popular are their tracks on average?)
#   * optional: explicit_rate (share of explicit tracks)

# We decide dynamically which track columns to join:
# - track_id is required (for counting)
# - audio_cols_present: only audio features that actually exist
cols_for_artist_agg = ["track_id"] + audio_cols_present

# Optional: include track popularity if it exists
if "popularity" in track_df.columns:
    cols_for_artist_agg += ["popularity"]

# Optional: include explicit if it exists
if "explicit" in track_df.columns:
    cols_for_artist_agg += ["explicit"]

# Join the track features into the track-artist relation table:
# the result has the track features available per (artist_id, track_id).
rta_track_audio = rta.merge(
    track_df[cols_for_artist_agg],
    on="track_id",
    how="left"
)


# ------------------------------------------------------------
# Helper: compute the explicit rate robustly
# ------------------------------------------------------------
# Why a separate function?
# - explicit can come as bool, 0/1, or a numeric string ("0"/"1");
#   other strings such as "True" are coerced to NaN.
# - We convert to numeric and take the mean (equals the share of 1s).
# - If everything is missing -> NaN (so it is clear later that no information is available).
def explicit_rate_fn(x):
    xx = pd.to_numeric(x, errors="coerce")
    if xx.dropna().empty:
        return np.nan
    return float(np.nanmean(xx))


# ------------------------------------------------------------
# Build the aggregation spec (flexible, depending on available columns)
# ------------------------------------------------------------
# Base feature: number of tracks per artist
agg_dict = {
    "n_tracks": ("track_id", "nunique")
}

# Optional: mean track popularity
# Note: this can serve as a proxy feature but may leak the target, depending on the task.
if "popularity" in rta_track_audio.columns:
    agg_dict["track_pop_mean"] = ("popularity", "mean")

# Optional: share of explicit tracks
if "explicit" in rta_track_audio.columns:
    agg_dict["explicit_rate"] = ("explicit", explicit_rate_fn)

# Core: audio profile (mean per feature)
for c in audio_cols_present:
    agg_dict[f"mean_{c}"] = (c, "mean")

# Run the aggregation: 1 row per artist
artist_audio_agg = (
    rta_track_audio.groupby("artist_id")
    .agg(**agg_dict)
    .reset_index()
)

# Merge the aggregated features back into artist_df
artist_df = artist_df.merge(artist_audio_agg, on="artist_id", how="left")

# ------------------------------------------------------------
# 3) Attach genres per artist (as a list)
# ------------------------------------------------------------
# Why?
# - In Spotify schemas genres typically live at artist level.
# - We keep them as a list for later multi-hot encodings (top-K).
artist_df = artist_df.merge(artist_to_genres, on="artist_id", how="left")

# Make sure the values are real Python lists (important after a CSV import)
artist_df["artist_genres"] = ensure_list_column(col_or_na(artist_df, "artist_genres"))

# ------------------------------------------------------------
# 4) Feature engineering: log transforms for heavy-tailed counts
# ------------------------------------------------------------
# Why?
# - followers is extremely skewed (a few superstars, many small artists).
# - n_tracks can also vary strongly.
# - log1p compresses the scale and reduces the influence of outliers.
# Note: the clean artists table already contains followers_log1p; in the
# output below both columns hold identical values.
artist_df["log_followers"] = log1p_numeric(col_or_na(artist_df, "followers"))
artist_df["log_n_tracks"] = log1p_numeric(col_or_na(artist_df, "n_tracks"))

# ------------------------------------------------------------
# Debug / sanity check
# ------------------------------------------------------------
print("Artist-level dataset shape:", artist_df.shape)
artist_df.head(3)


Artist-level dataset shape: (187440, 21)


| | artist_id | name | popularity | followers | is_followers_extreme | followers_log1p | n_tracks | track_pop_mean | explicit_rate | mean_acousticness | ... | mean_energy | mean_instrumentalness | mean_liveness | mean_speechiness | mean_valence | mean_loudness | mean_tempo | artist_genres | log_followers | log_n_tracks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7zzsdcNemyhcNk2wpNsXZt | Sinéad Lohan | 31 | 3377 | 0 | 8.125039 | 1 | 3.0 | 0.0 | 0.885 | ... | 0.23 | 2e-06 | 0.0995 | 0.0305 | 0.385 | -17.577 | 118.542999 | [irish singer-songwriter, lilith] | 8.125039 | 0.693147 |
| 1 | 00045gNg7mLEf9UY9yhD0t | Kubus & BangBang | 13 | 820 | 0 | 6.710523 | 3 | 9.333333 | 1.0 | 0.099367 | ... | 0.619333 | 0.000497 | 0.252667 | 0.443 | 0.269667 | -8.482333 | 151.447665 | [dutch hip hop] | 6.710523 | 1.386294 |
| 2 | 000Dq0VqTZpxOP6jQMscVL | Thug Brothers | 14 | 4890 | 0 | 8.495152 | 1 | 0.0 | 1.0 | 0.131 | ... | 0.631 | 0.0 | 0.644 | 0.306 | 0.571 | -4.186 | 155.955994 | [baton rouge rap, deep southern trap] | 8.495152 | 0.693147 |
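
The code comments mention that the genre lists are kept for later top-K multi-hot encodings. That step is not part of this notebook. As a rough sketch of what it could look like downstream, the example below encodes the K most frequent genres as 0/1 columns. `TOP_K`, the `genre__` column prefix, and the toy frame are assumptions made for illustration.

```python
from collections import Counter

import pandas as pd

# Toy artist table with list-valued genres (illustrative values only).
artist_toy = pd.DataFrame({
    "artist_id": ["x", "y", "z"],
    "artist_genres": [["rap", "trap"], ["rap"], []],
})

TOP_K = 2  # assumption: illustrative value, not taken from the project settings

# Count genre frequencies across all artists.
counts = Counter(g for genres in artist_toy["artist_genres"] for g in genres)

# Pick the K most frequent genres; ties are broken alphabetically for determinism.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
top_genres = [genre for genre, _ in ranked[:TOP_K]]

# One 0/1 indicator column per selected genre.
for genre in top_genres:
    artist_toy[f"genre__{genre}"] = (
        artist_toy["artist_genres"]
        .apply(lambda lst, genre=genre: int(genre in lst))
        .astype("int8")
    )

print(artist_toy)
```

Restricting the encoding to the top K genres keeps the feature matrix narrow. Artists whose genres all fall outside the top K simply get zeros in every indicator column.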


## Save Modeling Tables

In [7]:
track_out = PATHS.modeling_dir / "track_dataset.parquet"
album_out = PATHS.modeling_dir / "album_dataset.parquet"
artist_out = PATHS.modeling_dir / "artist_dataset.parquet"

track_df.to_parquet(track_out, index=False)
album_df.to_parquet(album_out, index=False)
artist_df.to_parquet(artist_out, index=False)

print(" Saved modeling datasets:")
print(" -", track_out)
print(" -", album_out)
print(" -", artist_out)

 Saved modeling datasets:
 - C:\GitHub\uni-project-metrics-and-data\data\processed\modeling\slice_001\track_dataset.parquet
 - C:\GitHub\uni-project-metrics-and-data\data\processed\modeling\slice_001\album_dataset.parquet
 - C:\GitHub\uni-project-metrics-and-data\data\processed\modeling\slice_001\artist_dataset.parquet
