# Exploratory Data Analysis (EDA): Spotify SQLite Dataset
## Goal and Context
This notebook documents an **exploratory data analysis (EDA)** on a reproducible data snapshot ("slice") of a large Spotify SQLite dataset.

The focus is on:
- **Data understanding** (schema, contents, key variables)
- **Data quality** (missingness, integrity, outliers/rule violations)
- **Exploratory analyses** (distributions, correlations, segmentations, time trends)

## Guiding Research Questions
1. How are key variables distributed (e.g. popularity, audio features, followers, release years)?
2. Which features are associated with popularity (linearly/monotonically)?
3. Which data quality problems exist (missingness, duplicates, broken foreign keys, rule violations), and how severe are they?
4. Are there temporal trends in audio features, and how robust are they given the data coverage?

## Why Sampling?
The full dataset is very large. We therefore work with a **deterministic slice** (ROWID-mod buckets) in order to:
- keep results reproducible,
- control runtime and memory,
- compare stability across several slices later on.

## Outputs
All plots and tables are written to `reports/schema_overview/<sample_name>/...` (the directory printed as `PATHS.schema_reports_dir` in the configuration cell).


In [1]:
# --- Basic setup ---
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.table import table as mpl_table
import sqlite3
import datetime
import json
import importlib
import numpy as np
import re


from pandas.plotting import scatter_matrix
from utils.data.sqlite_sampler import SQLiteSampleExporter
from utils.data.sqlite_schema import build_db_summary
from utils.core.paths import make_paths, ensure_dirs
from utils.data.sqlite_schema import get_table_info, get_rowcount
from utils.data.parsing import parse_release_date_universal
from utils.data.checker import DataIntegrityChecker



import utils.eda.plots as plots
import utils.data.analyzer as analyzers


importlib.reload(plots)
importlib.reload(analyzers)

from utils.data.analyzer import CategoryAnalyzer, InfluenceAnalyzer
from utils.eda.plots import (
    NumericProfileParams,
    run_profile_reports,
    corr_with_popularity,
    small_scatter,
    top_corr_pairs,
)

# Matplotlib: do not set styles or fixed colors
plt.rcParams.update({"figure.figsize": (8, 5), "axes.grid": True})

# Pandas: show full cell contents in DataFrame displays (no column-width truncation)
pd.set_option("display.max_colwidth", None)


## 1) Reproducibility & Active Slice (Source of Truth)

This section defines the active subset of the data ("slice"). This configuration is the **single source of truth** for all subsequent notebooks.

**Sampling principle**
- Deterministic slicing via `(rowid % row_mod) IN {bucket_ids}`
- Each slice combines several buckets to reach a sufficient number of rows.
- The metadata is stored in `current_sample.json`, so that other notebooks load the same slice automatically.

**Limitation**
ROWID-based sampling is reproducible, but not necessarily random in the statistical sense. Potential biases are addressed in the concluding section.
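
As an illustration of the slicing rule, the following is a minimal sketch on a toy in-memory table (not the Spotify database). The helper names `bucket_ids` and `select_slice` are hypothetical; the values `200` and `7` mirror the `row_mod` and `buckets_per_slice` settings used in this notebook.

```python
import sqlite3

ROW_MOD = 200          # total number of buckets
BUCKETS_PER_SLICE = 7  # buckets combined per slice


def bucket_ids(row_k_start: int, n: int = BUCKETS_PER_SLICE, row_mod: int = ROW_MOD) -> list[int]:
    """Consecutive bucket IDs for one slice, wrapping around at row_mod."""
    return [(row_k_start + i) % row_mod for i in range(n)]


def select_slice(con: sqlite3.Connection, table: str, buckets: list[int], row_mod: int = ROW_MOD) -> list[int]:
    """Return the rowids of `table` whose (rowid % row_mod) falls into `buckets`."""
    placeholders = ",".join("?" for _ in buckets)
    sql = f"SELECT rowid FROM {table} WHERE (rowid % ?) IN ({placeholders}) ORDER BY rowid"
    return [row[0] for row in con.execute(sql, [row_mod, *buckets])]


# Toy table with rowids 1..1000 (hypothetical demo data)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (x INTEGER)")
con.executemany("INSERT INTO demo (x) VALUES (?)", [(i,) for i in range(1000)])

slice_000 = select_slice(con, "demo", bucket_ids(0))  # buckets 0..6
slice_001 = select_slice(con, "demo", bucket_ids(7))  # buckets 7..13
overlap = set(slice_000) & set(slice_001)
print(len(slice_000), len(slice_001), overlap)
```

Slices built from disjoint bucket ranges never share rows, which is what makes comparisons across slices meaningful. With 7 of 200 buckets, a slice covers roughly 3.5% of the rowids before any further filter is applied.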


In [2]:
# ================================================================
# GLOBAL CONFIGURATION (Notebook 01)
# ------------------------------------------------
# Purpose:
# - This notebook is the "source of truth" for the active data slice.
# - It defines which slice is used and writes that information to a JSON file
#   (current_sample.json). All other notebooks can then load the same slice
#   automatically, without any manual changes there.
# ================================================================

# ------------------------------------------------
# 1) Slice selection (change ONLY here)
# ------------------------------------------------
# SAMPLE_NAME: name of the current subsample ("slice")
# ROW_K_START: start bucket for the ROWID slicing method
SAMPLE_NAME = "slice_000"  # slice_000 | slice_001 | slice_002
ROW_K_START = 0            # slice_000 -> 0 | slice_001 -> 7 | slice_002 -> 14

# ------------------------------------------------
# 2) Sampling configuration (documented for reproducibility)
# ------------------------------------------------
# We use deterministic slicing via (rowid % row_mod) and select several buckets per slice.
# Advantage: reproducible, fast (runs inside the DB), well suited to testing model stability across slices.
SAMPLING_CONFIG = {
    "mode": "ROWID_MOD",
    "method": "ROWID multi-bucket slicing",
    "row_mod": 200,                 # total number of partitions/buckets
    "buckets_per_slice": 7,         # number of buckets combined per slice
    "row_k_start": ROW_K_START,     # start bucket index for this slice
    "target_tracks": 300_000,       # target size of the track sample
    "require_audio_features": True, # optional filter: only tracks with audio features
}

# ------------------------------------------------
# 3) Build paths + create directories
# ------------------------------------------------
# make_paths(...) builds all relevant project paths for this slice (export, reports, models, ...)
# ensure_dirs(...) creates the directories (idempotent: safe to run repeatedly)
PATHS = make_paths(SAMPLE_NAME)
ensure_dirs(PATHS)

# Additional directory for schema reports (e.g. table overview of the SQLite DB)
PATHS.schema_reports_dir.mkdir(parents=True, exist_ok=True)

# ------------------------------------------------
# 4) Store the "active slice" as metadata (current_sample.json)
# ------------------------------------------------
# This file is read by the other notebooks (load_sample_name()),
# so that all notebooks work consistently on the same slice.
PATHS.meta_path.write_text(
    json.dumps(
        {
            "SAMPLE_NAME": SAMPLE_NAME,
            "CREATED_AT": datetime.datetime.now(datetime.UTC).isoformat(),  # creation timestamp (UTC)
            "SAMPLING_CONFIG": SAMPLING_CONFIG,                             # sampling parameters
        },
        indent=2,
    ),
    encoding="utf-8",
)

# ------------------------------------------------
# 5) Short output as a sanity check
# ------------------------------------------------
# row_ks are the concrete bucket IDs selected by this slice:
# (rowid % row_mod) IN row_ks
row_ks = [
    (ROW_K_START + i) % SAMPLING_CONFIG["row_mod"]
    for i in range(SAMPLING_CONFIG["buckets_per_slice"])
]

print("Aktive Slice:", SAMPLE_NAME)
print("Buckets (rowid % row_mod):", row_ks)
print("Export-Verzeichnis:", PATHS.export_dir)
print("Raw-DB:", PATHS.raw_spotify_db_path)
print("Schema-Reports:", PATHS.schema_reports_dir)

Aktive Slice: slice_000
Buckets (rowid % row_mod): [0, 1, 2, 3, 4, 5, 6]
Export-Verzeichnis: C:\GitHub\uni-project-metrics-and-data\data\interim\converted_sqlite_samples\slice_000
Raw-DB: C:\GitHub\uni-project-metrics-and-data\data\raw\spotify.sqlite
Schema-Reports: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000
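
For orientation, reading the metadata file back in another notebook could look roughly like the sketch below. The real `load_sample_name()` helper is not shown in this notebook, so the signature used here (taking the path as an argument) is an assumption, and a temporary file stands in for `PATHS.meta_path`.

```python
import json
import tempfile
from pathlib import Path


def load_sample_name(meta_path: Path) -> str:
    """Read the active slice name from a current_sample.json file (hypothetical signature)."""
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return meta["SAMPLE_NAME"]


# Round trip against a temporary file (stand-in for PATHS.meta_path)
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "current_sample.json"
    meta_path.write_text(json.dumps({"SAMPLE_NAME": "slice_000"}), encoding="utf-8")
    active = load_sample_name(meta_path)

print(active)
```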


## 2) Data Model (Conceptual Overview)

### Core entities
- `tracks`: track metadata (including popularity, duration, flags)
- `audio_features`: numeric audio features per track (e.g. `energy`, `tempo`)
- `artists`: artist metadata (e.g. `followers`, `popularity`)
- `albums`: album metadata (e.g. `album_type`, `release_date`)

### Relationship tables
- `r_track_artist`: track ↔ artist (n:m)
- `r_albums_tracks`: track ↔ album
- `r_artist_genre`: artist ↔ genre

**Note (genres):** Genres are analyzed as `genre_id` (IDs) in the remainder of the notebook. A stable mapping to human-readable genre names is not assumed here.


## 3) Data Source & Schema (SQLite)

The goal of this section is to understand the data structure before analyzing the contents:
- Which tables exist?
- How large are the tables (row count)?
- Which columns/data types are defined?
- Which columns are primary keys?


In [3]:
with sqlite3.connect(str(PATHS.raw_spotify_db_path)) as con:
    summary, details = build_db_summary(con)

display(summary[["table", "rowcount", "n_columns", "columns_preview"]])


Unnamed: 0,table,rowcount,n_columns,columns_preview
7,r_track_artist,11840402,2,"track_id (), artist_id ()"
5,r_albums_tracks,9900173,2,"album_id (), track_id ()"
8,tracks,8741672,10,"id (), disc_number (), duration (), explicit (), audio_feature_id (), name (), preview_url (), track_number () … (+2 more)"
2,audio_features,8740043,15,"id (), acousticness (), analysis_url (), danceability (), duration (), energy (), instrumentalness (), key () … (+7 more)"
0,albums,4820754,6,"id (), name (), album_group (), album_type (), release_date (), popularity ()"
1,artists,1066031,4,"name (), id (), popularity (), followers ()"
4,r_albums_artists,921486,2,"album_id (), artist_id ()"
6,r_artist_genre,487386,2,"genre_id (), artist_id ()"
3,genres,5489,1,id ()
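
The internals of `build_db_summary` are not shown here. As a rough sketch of how such an overview can be collected with the standard `sqlite3` module, the example below queries `sqlite_master` and `PRAGMA table_info` on a toy in-memory database. The function name `table_overview` and the demo tables are hypothetical. The empty parentheses in `columns_preview` above suggest that the columns carry no declared type, which the sketch reproduces for the `genres` demo table.

```python
import sqlite3


def table_overview(con: sqlite3.Connection) -> list[dict]:
    """List user tables with row count and (name, declared type, pk flag) per column."""
    names = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    overview = []
    for name in names:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = [(c[1], c[2], c[5]) for c in con.execute(f'PRAGMA table_info("{name}")')]
        rowcount = con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        overview.append({"table": name, "rowcount": rowcount, "columns": cols})
    return overview


# Toy schema (hypothetical): one table without declared types, one with types and a key
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE genres (id)")
con.execute("CREATE TABLE artists (id TEXT PRIMARY KEY, followers INTEGER)")
con.execute("INSERT INTO genres VALUES ('hip hop')")

overview = table_overview(con)
for entry in overview:
    print(entry)
```

When no column is flagged as a primary key, uniqueness of the `id` columns cannot be taken from the schema and has to be checked on the data itself.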


### 3.1 Table Contents: Sample Preview

**Goal**
We inspect example records and data types to spot format issues (e.g. date strings), ID structures, and obvious inconsistencies early.

**Approach**
For selected tables, the schema information and the first rows are loaded.


In [4]:
tables = ["artists", "tracks", "audio_features", "albums", "genres"]


with sqlite3.connect(str(PATHS.raw_spotify_db_path)) as con:
    for t in tables:
        print(f"\n{'='*70}\nTabelle: {t.upper()}\n{'='*70}")

        info = get_table_info(con, t)
        rowcount = get_rowcount(con, t)

        if rowcount is not None:
            print(f"Zeilen (COUNT): {int(rowcount):,}")
        else:
            print("Zeilen (COUNT): n/a")


        print(f"Anzahl Spalten: {len(info)}")
        display(info[["cid", "name", "type", "notnull", "dflt_value", "pk"]])


        try:
            df_preview = pd.read_sql(f"SELECT * FROM {t} LIMIT 5;", con)
            display(df_preview)
        except Exception as e:
            print(f"Fehler beim Lesen von {t}: {e}")



Tabelle: ARTISTS
Zeilen (COUNT): 1,066,031
Anzahl Spalten: 4


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,name,,0,,0
1,1,id,,0,,0
2,2,popularity,,0,,0
3,3,followers,,0,,0


Unnamed: 0,name,id,popularity,followers
0,Xzibit,4tujQJicOnuZRLiBFdp3Ou,69,1193665
1,Erick Sermon,2VX0o9LDIVmKIgpnwdJpOJ,54,142007
2,J. Ro,3iBOsmwGzRKyR0vs2I61xP,45,158
3,Tash,22qf8cJRzBjIWb2Jc4JeOr,48,3421
4,Craig Mack,4akj4uteQQrrGxhX9Rjuyf,55,161966



Tabelle: TRACKS
Zeilen (COUNT): 8,741,672
Anzahl Spalten: 10


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,,0,,0
1,1,disc_number,,0,,0
2,2,duration,,0,,0
3,3,explicit,,0,,0
4,4,audio_feature_id,,0,,0
5,5,name,,0,,0
6,6,preview_url,,0,,0
7,7,track_number,,0,,0
8,8,popularity,,0,,0
9,9,is_playable,,0,,0


Unnamed: 0,id,disc_number,duration,explicit,audio_feature_id,name,preview_url,track_number,popularity,is_playable
0,1dizvxctg9dHEyaYTFufVi,1,275893,1,1dizvxctg9dHEyaYTFufVi,Gz And Hustlas (feat. Nancy Fletcher),,12,0,
1,2g8HN35AnVGIk7B8yMucww,1,252746,1,2g8HN35AnVGIk7B8yMucww,Big Poppa - 2005 Remaster,https://p.scdn.co/mp3-preview/770e023eb0318270ecc5caa018d758e5e0844de9?cid=cde021ca5d3e42a8bd440f1004a562dc,13,77,
2,49pnyECzcMGCKAqxfTB4JZ,3,315080,0,49pnyECzcMGCKAqxfTB4JZ,"You Were Born - Early Version Of ""One Of The Three"" / Outtake",,6,8,1.0
3,4E5IFAXCob6QqZaJMTw5YN,1,240800,1,4E5IFAXCob6QqZaJMTw5YN,Poppin' Them Thangs,https://p.scdn.co/mp3-preview/f3b556ced9657f8987d2c981014205244daf4540?cid=cde021ca5d3e42a8bd440f1004a562dc,2,70,
4,1gSt2UlC7mtRtJIc5zqKWn,2,203666,0,1gSt2UlC7mtRtJIc5zqKWn,"It's Hard To Say ""I Do"", When I Don't",,2,50,



Tabelle: AUDIO_FEATURES
Zeilen (COUNT): 8,740,043
Anzahl Spalten: 15


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,,0,,0
1,1,acousticness,,0,,0
2,2,analysis_url,,0,,0
3,3,danceability,,0,,0
4,4,duration,,0,,0
5,5,energy,,0,,0
6,6,instrumentalness,,0,,0
7,7,key,,0,,0
8,8,liveness,,0,,0
9,9,loudness,,0,,0


Unnamed: 0,id,acousticness,analysis_url,danceability,duration,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,2jKoVlU7VAmExKJ1Jh3w9P,0.18,https://api.spotify.com/v1/audio-analysis/2jKoVlU7VAmExKJ1Jh3w9P,0.893,219160,0.514,0.0,11,0.0596,-5.08,1,0.283,95.848,4,0.787
1,4JYUDRtPZuVNi7FAnbHyux,0.272,https://api.spotify.com/v1/audio-analysis/4JYUDRtPZuVNi7FAnbHyux,0.52,302013,0.847,0.0,9,0.325,-5.3,1,0.427,177.371002,4,0.799
2,6YjKAkDYmlasMqYw73iB0w,0.0783,https://api.spotify.com/v1/audio-analysis/6YjKAkDYmlasMqYw73iB0w,0.918,288200,0.586,0.0,1,0.145,-2.89,1,0.133,95.516998,4,0.779
3,2YlvHjDb4Tyxl4A1IcDhAe,0.584,https://api.spotify.com/v1/audio-analysis/2YlvHjDb4Tyxl4A1IcDhAe,0.877,243013,0.681,0.0,1,0.119,-6.277,0,0.259,94.834999,4,0.839
4,3UOuBNEin5peSRqdzvlnWM,0.17,https://api.spotify.com/v1/audio-analysis/3UOuBNEin5peSRqdzvlnWM,0.814,270667,0.781,0.000518,11,0.052,-3.33,1,0.233,93.445,4,0.536



Tabelle: ALBUMS
Zeilen (COUNT): 4,820,754
Anzahl Spalten: 6


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,,0,,0
1,1,name,,0,,0
2,2,album_group,,0,,0
3,3,album_type,,0,,0
4,4,release_date,,0,,0
5,5,popularity,,0,,0


Unnamed: 0,id,name,album_group,album_type,release_date,popularity
0,2jKoVlU7VAmExKJ1Jh3w9P,"Alkaholik (feat. Erik Sermon, J Ro & Tash)",,album,954633600000,0
1,4JYUDRtPZuVNi7FAnbHyux,"Flava in Ya Ear Remix (feat. Notorious B.I.G., L.L. Cool J, Busta Rhymes, Rampage)",,single,757382400000,0
2,6YjKAkDYmlasMqYw73iB0w,Bitch Please II,,album,959040000000,0
3,2YlvHjDb4Tyxl4A1IcDhAe,Just Dippin',,compilation,1104537600000,0
4,3UOuBNEin5peSRqdzvlnWM,Still D.R.E.,,album,942710400000,0



Tabelle: GENRES
Zeilen (COUNT): 5,489
Anzahl Spalten: 1


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,,0,,0


Unnamed: 0,id
0,detroit hip hop
1,g funk
2,gangster rap
3,hardcore hip hop
4,hip hop


## 4) Sampling Export: CSV Snapshot

**Goal**
To work efficiently with pandas, we export a consistent, reproducible sample as CSV files.

**Approach**
- Export of the core entities (`tracks`, `audio_features`, `artists`, `albums`) and the relevant relationship tables.
- Sampling happens on the SQL side (fast, reproducible).
- Optional filter: only tracks with audio features present.

**Expected result**
An export report and CSV files in the sample-specific directory.


In [5]:
exporter = SQLiteSampleExporter(
    db_path=PATHS.raw_spotify_db_path,
    export_dir=PATHS.export_dir,
    sample_name=PATHS.sample_name,
    require_audio_features=SAMPLING_CONFIG["require_audio_features"],
)

report = exporter.export_rowid_slice(
    target_tracks=SAMPLING_CONFIG["target_tracks"],
    row_mod=SAMPLING_CONFIG["row_mod"],
    row_k_start=SAMPLING_CONFIG["row_k_start"],
    buckets_per_slice=SAMPLING_CONFIG["buckets_per_slice"],
)


report

SamplerReport(sample_name='slice_000', mode='ROWID_MOD', target_tracks=300000, selected_tracks=300000, row_buckets=[0, 1, 2, 3, 4, 5, 6], explicit_count=20193, hit_like_count=252, avg_popularity=6.150243333333333, export_files={'tracks': 'tracks.csv', 'r_albums_tracks': 'r_albums_tracks.csv', 'r_track_artist': 'r_track_artist.csv', 'albums': 'albums.csv', 'artists': 'artists.csv', 'audio_features': 'audio_features.csv', 'r_artist_genre': 'r_artist_genre.csv', 'genres': 'genres.csv', 'r_albums_artists': 'r_albums_artists.csv'}, elapsed_sec=56.16260290145874)

## 5) Load the Dataset (CSV) + Baseline Checks

We load the exported CSVs into DataFrames. From here on, the analyses are mostly pandas-based.

**Important:** In this notebook, the data is loaded from `PATHS.raw_dir` (exported snapshot). Sections that use `PATHS.interim_samples_dir` are to be read explicitly as working on an *interim/processed source*.


In [6]:
print("Using sample:", SAMPLE_NAME)

# Load the main tables
tracks = pd.read_csv(PATHS.raw_dir / "tracks.csv")
audio = pd.read_csv(PATHS.raw_dir / "audio_features.csv")
artists = pd.read_csv(PATHS.raw_dir / "artists.csv")
albums = pd.read_csv(PATHS.raw_dir / "albums.csv")

# -------- Overview: number of numeric columns per table --------
for name, df in {"tracks": tracks, "audio_features": audio, "artists": artists, "albums": albums}.items():
    num_cols = df.select_dtypes(include=["number"]).columns.tolist()
    print(
        f"{name:<15} -> numerische Spalten: {len(num_cols)} | {', '.join(num_cols[:8])}{'...' if len(num_cols) > 8 else ''}"
    )


Using sample: slice_000
tracks          -> numerische Spalten: 6 | disc_number, duration, explicit, track_number, popularity, is_playable
audio_features  -> numerische Spalten: 13 | acousticness, danceability, duration, energy, instrumentalness, key, liveness, loudness...
artists         -> numerische Spalten: 2 | popularity, followers
albums          -> numerische Spalten: 3 | album_group, release_date, popularity


### 5.1 Release-Date Parsing & Granularity (Audit)

Many Spotify datasets contain `release_date` at different granularities (`YYYY`, `YYYY-MM`, `YYYY-MM-DD`). In this snapshot, the `albums` preview in section 3.1 shows `release_date` as epoch milliseconds (e.g. `954633600000`). Consistent parsing is essential for time-based analyses.

**Goal**
- Parse the `release_date` column into a uniform datetime column
- Make parsing success and granularity transparent


In [7]:
# Parse to datetime (uniform)
albums["release_dt"] = parse_release_date_universal(albums["release_date"])

# Granularity audit: distribution of string lengths (as a proxy for YYYY vs YYYY-MM vs YYYY-MM-DD)
if "release_date" in albums.columns:
    lengths = albums["release_date"].astype(str).str.len().value_counts(dropna=False).sort_index()
    audit = lengths.rename_axis("release_date_str_len").reset_index(name="count")

    out_dir = PATHS.schema_reports_dir / "time_based_analysis"
    out_dir.mkdir(parents=True, exist_ok=True)
    audit_path = out_dir / "release_date_granularity_audit.csv"
    audit.to_csv(audit_path, index=False, encoding="utf-8-sig")
    print("saved:", audit_path)

parse_rate = albums["release_dt"].notna().mean() * 100
print(f"Release-Date Parsing Success: {parse_rate:.2f}%")


saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\time_based_analysis\release_date_granularity_audit.csv
Release-Date Parsing Success: 99.88%
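
The implementation of `parse_release_date_universal` is not shown in this notebook. The sketch below illustrates one way such a parser could handle both epoch milliseconds and the three string granularities. The function name `parse_release_date_sketch` and the rule "five or more digits means epoch milliseconds" are assumptions made for this example only.

```python
import pandas as pd


def parse_release_date_sketch(values: pd.Series) -> pd.Series:
    """Parse epoch-millisecond numbers and YYYY / YYYY-MM / YYYY-MM-DD strings."""
    txt = values.astype(str).str.strip()
    out = pd.Series(pd.NaT, index=values.index, dtype="datetime64[ns]")

    # Assumption: 5+ digits (optionally with a trailing ".0" from float columns) are epoch ms
    is_epoch = txt.str.fullmatch(r"-?\d{5,}(?:\.0+)?", na=False)
    millis = pd.to_numeric(txt[is_epoch], errors="coerce")
    out.loc[is_epoch] = pd.to_datetime(millis, unit="ms", errors="coerce")

    # String granularities: missing month/day default to the first month/day
    formats = [
        (r"\d{4}", "%Y"),
        (r"\d{4}-\d{2}", "%Y-%m"),
        (r"\d{4}-\d{2}-\d{2}", "%Y-%m-%d"),
    ]
    for pattern, fmt in formats:
        mask = txt.str.fullmatch(pattern, na=False)
        out.loc[mask] = pd.to_datetime(txt[mask], format=fmt, errors="coerce")

    return out  # anything that matched no rule stays NaT


raw = pd.Series([954633600000, "1999", "2001-05", "2010-12-31", "n/a", None])
parsed = parse_release_date_sketch(raw)
print(parsed)
print(f"parse rate: {parsed.notna().mean() * 100:.2f}%")
```

Note that a string-length audit only separates `YYYY`, `YYYY-MM`, and `YYYY-MM-DD` when the column really holds such strings; epoch-millisecond values all end up in the same length group.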




## 6) Univariate EDA: Descriptive Statistics & Distributions

Goal: describe the key numeric variables in terms of location, spread, skewness, and extreme values.

**Outputs:** `describe` HTML per table, histogram reports.


In [8]:
# --- Load the current slice (CSV) ---
tracks = pd.read_csv(PATHS.raw_dir / "tracks.csv")
audio = pd.read_csv(PATHS.raw_dir / "audio_features.csv")
artists = pd.read_csv(PATHS.raw_dir / "artists.csv")
albums = pd.read_csv(PATHS.raw_dir / "albums.csv")


# -------- Overview: number of numeric columns per table --------
for name, df in {"tracks": tracks, "audio_features": audio, "artists": artists, "albums": albums}.items():
    num_cols = df.select_dtypes(include=["number"]).columns.tolist()
    print(f"{name:<15} -> numerische Spalten: {len(num_cols)} | {', '.join(num_cols[:8])}{'...' if len(num_cols)>8 else ''}")


tracks          -> numerische Spalten: 6 | disc_number, duration, explicit, track_number, popularity, is_playable
audio_features  -> numerische Spalten: 13 | acousticness, danceability, duration, energy, instrumentalness, key, liveness, loudness...
artists         -> numerische Spalten: 2 | popularity, followers
albums          -> numerische Spalten: 3 | album_group, release_date, popularity



### Quick overview of the tables with describe()

In [9]:
tables = {"tracks": tracks, "audio_features": audio, "artists": artists, "albums": albums}

out_dir = PATHS.schema_reports_dir / "descriptions"
out_dir.mkdir(parents=True, exist_ok=True)

for name, df in tables.items():
    num_cols = df.select_dtypes(include=["number"]).columns
    if len(num_cols) == 0:
        print(f"skipped {name}: no numeric columns")
        continue

    desc = (
        df[num_cols]
        .describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95])
        .T.round(2)
    )

    out_file = out_dir / f"{name}_describe.html"

    # Safety: if an old mistake created a DIRECTORY where the file should be, fail clearly
    if out_file.exists() and out_file.is_dir():
        raise RuntimeError(f"{out_file} is a directory. Delete it and rerun.")

    desc.to_html(out_file)  # pathlib Path is fine
    print("saved:", out_file)




saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\descriptions\tracks_describe.html
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\descriptions\audio_features_describe.html
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\descriptions\artists_describe.html
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\descriptions\albums_describe.html


### Histograms for selected numeric columns

In [10]:
HIST_DIR = PATHS.schema_reports_dir / "histograms"

# release_year is used for the album plot
albums["release_dt"] = parse_release_date_universal(albums["release_date"])
albums["release_year"] = albums["release_dt"].dt.year

track_cols = [
    c
    for c in ["popularity", "duration", "track_number", "disc_number", "explicit", "release_year", "release_age_years"]
    if c in tracks.columns
]
audio_cols = [
    c
    for c in [
        "danceability",
        "energy",
        "loudness",
        "valence",
        "tempo",
        "acousticness",
        "instrumentalness",
        "speechiness",
        "time_signature",
    ]
    if c in audio.columns
]
artist_cols = [c for c in ["popularity", "followers"] if c in artists.columns]
album_cols = [c for c in ["popularity", "release_year"] if c in albums.columns]

params = NumericProfileParams(bins=40, chunk_size=6, log_mode="auto")

run_profile_reports(
    out_dir=HIST_DIR,
    sample_name=PATHS.sample_name,
    params=params,
    reports=[
        ("tracks", tracks, track_cols, "Tracks – numerische Verteilungen"),
        ("audio_features", audio, audio_cols, "Audio Features – numerische Verteilungen"),
        ("artists", artists, artist_cols, "Artists – numerische Verteilungen"),
        ("albums", albums, album_cols, "Albums – numerische Verteilungen"),
    ],
)




saved: tracks_profiles_p1.png
saved: audio_features_profiles_p1.png
saved: audio_features_profiles_p2.png
saved: artists_profiles_p1.png
saved: albums_profiles_p1.png


## 7) Data Quality

This block bundles the systematic quality checks:
1. Outliers/rule violations (domain checks)
2. Missingness/completeness
3. Uniqueness/duplicates/foreign-key integrity

Note: The order is deliberate, so that badly broken values become visible first, before missingness and integrity are evaluated in detail.


### 7.1 Outliers & Rule Violations (Domain Checks)

In this step, we check the numeric columns of the tables (`tracks`, `audio_features`, `artists`, `albums`) for conspicuous or potentially erroneous values.

**Method (per table and per numeric column):**
- Missing rate and zero rate
- Rule-based plausibility checks (domain rules)
- Quantile-based outliers
- IQR outliers (optional)

**Outputs:** CSV reports and PNG overviews in `outliers/`.
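
To make the three kinds of checks concrete, here is a minimal sketch for a single column. It is not the implementation of `run_outlier_suite` (which is not shown here). The function name `outlier_flags` is hypothetical, the thresholds mirror the `OutlierParams` values used in the next cell, and the range 0 to 100 for `popularity` serves as an example domain rule.

```python
import pandas as pd


def outlier_flags(s: pd.Series, q_low=0.005, q_high=0.995, iqr_k=1.5, valid_range=None) -> pd.DataFrame:
    """Flag quantile outliers, IQR outliers, and (optionally) domain-rule violations."""
    s = pd.to_numeric(s, errors="coerce")
    lo, hi = s.quantile(q_low), s.quantile(q_high)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1

    flags = pd.DataFrame({
        "q_outlier": (s < lo) | (s > hi),
        "iqr_outlier": (s < q1 - iqr_k * iqr) | (s > q3 + iqr_k * iqr),
    })
    if valid_range is not None:
        vmin, vmax = valid_range
        # Missing values are counted by the missing rate, not as rule violations
        flags["invalid"] = s.notna() & ~s.between(vmin, vmax)
    return flags


popularity = pd.Series([0, 5, 10, 12, 15, 20, 150, None])  # toy values, 150 violates the rule
flags = outlier_flags(popularity, valid_range=(0, 100))
print((flags.mean() * 100).round(2))  # share of flagged rows per check, in percent
```

Quantile flags mark a fixed share of the tails by construction, so they are best read as "extreme relative to this column". Only the domain rule marks values that are actually implausible.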


In [11]:
from utils.eda.outliers import OutlierParams, run_outlier_suite

OUTLIER_DIR = PATHS.schema_reports_dir / "outliers"

tables = {
    "tracks": tracks,
    "audio_features": audio,
    "artists": artists,
    "albums": albums,
}

params = OutlierParams(q_low=0.005, q_high=0.995, use_iqr=True, iqr_k=1.5)

run_outlier_suite(tables=tables, out_dir=OUTLIER_DIR, params=params)

saved: tracks_robust_outlier_report.csv | tracks_robust_outlier_report_top15.png | tracks_invalid_percent_top12.png | tracks_q_outliers_percent_top12.png
saved: audio_features_robust_outlier_report.csv | audio_features_robust_outlier_report_top15.png | audio_features_invalid_percent_top12.png | audio_features_q_outliers_percent_top12.png
saved: artists_robust_outlier_report.csv | artists_robust_outlier_report_top15.png | artists_invalid_percent_top12.png | artists_q_outliers_percent_top12.png
saved: albums_robust_outlier_report.csv | albums_robust_outlier_report_top15.png | albums_invalid_percent_top12.png | albums_q_outliers_percent_top12.png


### 7.2 Missing Values & Completeness

**Goal**
Quantify data completeness per table/column (missing rate + pattern heatmap). The results are saved as reports.


In [12]:
tables = {}
for name, fname in [
    ("tracks", "tracks.csv"),
    ("audio_features", "audio_features.csv"),
    ("artists", "artists.csv"),
    ("albums", "albums.csv"),
]:
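    p = PATHS.interim_samples_dir / fname
    if p.exists():
        tables[name] = pd.read_csv(p)
    else:
        # Report missing inputs so an empty result is not mistaken for "no problems"
        print(f"skipped {name}: {p} not found")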


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    ms = (df.isna().mean() * 100).sort_values(ascending=False).round(2)
    return ms.to_frame("missing_pct")


outMissing = PATHS.schema_reports_dir / "missing"
outMissing.mkdir(parents=True, exist_ok=True)

for name, df in tables.items():
    print(f"\n=== {name.upper()} ===")
    ms = missing_summary(df)
    display(ms.head(20))
    ms.to_csv(outMissing / f"missing_{name}.csv", encoding="utf-8")
    # Heatmap (only meaningful for wide tables)
    plt.figure(figsize=(8, 4))
    sns.heatmap(df.sample(min(len(df), 1000), random_state=42).isna(), cbar=False)
    plt.title(f"Missing-Heatmap (Sample) – {name}")
    plt.tight_layout()
    plt.savefig(outMissing / f"missing_heatmap_{name}.png", dpi=150)
    plt.close()



### 7.3 Uniqueness, Duplicates & Foreign-Key Integrity

**Goal**
Check technical integrity:
- Are primary keys unique?
- Are there duplicates?
- Do foreign keys point to existing IDs?


In [13]:
checker = DataIntegrityChecker(
    data_dir=PATHS.interim_samples_dir,
    schema_reports_dir=PATHS.schema_reports_dir,
)
report, path = checker.execute()


Unnamed: 0,table,check,status,n_bad,n_total,pct_bad
0,albums,duplicates_all_cols,skip,,,
1,albums,unique(id),skip,,,
2,artists,duplicates_all_cols,skip,,,
3,artists,unique(id),skip,,,
4,audio_features,duplicates_all_cols,skip,,,
5,audio_features,unique(id),skip,,,
6,genres,duplicates_all_cols,skip,,,
7,genres,unique(id),skip,,,
8,r_albums_tracks:albums,fk(album_id->id),skip,,,
9,r_albums_tracks:tracks,fk(track_id->track_id),skip,,,


✓ Integrity report saved to: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\integrity\integrity_report.csv
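
In the report above every check has the status `skip`, so it contains no counts. For reference, the three questions can be answered directly in pandas along the lines of the sketch below. This is not the logic of `DataIntegrityChecker` (not shown here); the function name `integrity_checks` and the toy frames are hypothetical.

```python
import pandas as pd


def integrity_checks(parent: pd.DataFrame, child: pd.DataFrame, pk: str, fk: str) -> dict:
    """Count duplicate rows, duplicate/missing keys, and foreign keys without a parent row."""
    orphans = ~child[fk].isin(parent[pk])
    return {
        "duplicates_all_cols": int(parent.duplicated().sum()),
        "duplicate_pk": int(parent[pk].duplicated().sum()),
        "null_pk": int(parent[pk].isna().sum()),
        "fk_orphans": int(orphans.sum()),
        "fk_total": int(len(child)),
    }


# Toy data: album "a2" appears twice, and the relation references a non-existent "a9"
demo_albums = pd.DataFrame({"id": ["a1", "a2", "a2"], "name": ["X", "Y", "Y"]})
demo_r_albums_tracks = pd.DataFrame({"album_id": ["a1", "a2", "a9"], "track_id": ["t1", "t2", "t3"]})

result = integrity_checks(demo_albums, demo_r_albums_tracks, pk="id", fk="album_id")
print(result)
```

On a sampled slice, orphaned foreign keys can also arise from the sampling itself (the referenced row may simply lie outside the slice), so such counts need to be interpreted with that in mind.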


## 8) Comparative Distributions (Sanity Checks)

A brief comparison of key target/proxy variables to make scales and coarse differences visible.


In [14]:
POPULARITY_DIR = PATHS.schema_reports_dir / "popularity_distributions"
POPULARITY_DIR.mkdir(parents=True, exist_ok=True)
fig, ax = plt.subplots(figsize=(7, 4))
sns.kdeplot(tracks["popularity"], label="Tracks", fill=True)
sns.kdeplot(artists["popularity"], label="Artists", fill=True)
plt.title("Popularität – Vergleich: Tracks vs. Artists")
plt.xlabel("Popularity")
plt.legend()
plt.savefig(POPULARITY_DIR / "popularity_tracks_vs_artists.png", dpi=200, bbox_inches="tight")
plt.close(fig)


## 9) Temporal Analyses (Coverage + Trends)

Goals of this section:
- **Coverage:** How many releases exist per year (data coverage)?
- **Trends:** Changes in selected audio features over time.


In [15]:
TRACKS_RELEASE_DIR = PATHS.schema_reports_dir / "time_based_analysis"
TRACKS_RELEASE_DIR.mkdir(parents=True, exist_ok=True)

# Bugfix/robustness: release_date is often a string -> use release_dt consistently
if "release_dt" in albums.columns:
    albums["year"] = albums["release_dt"].dt.year
    year_counts = albums["year"].value_counts().sort_index()

    fig, ax = plt.subplots(figsize=(8, 4))
    sns.lineplot(x=year_counts.index, y=year_counts.values, ax=ax)
    ax.set_title("Anzahl veröffentlichter Alben pro Jahr")
    ax.set_xlabel("Jahr")
    ax.set_ylabel("Anzahl Alben")

    out = TRACKS_RELEASE_DIR / "albums_per_year.png"
    fig.savefig(out, dpi=200, bbox_inches="tight")
    plt.close(fig)


### 9.1 Temporal Trends (Audio Features)

We compute yearly means of selected audio features. The cell below builds the trend table (yearly means, track counts per year, and a 3-year rolling mean) that the trend plots are based on.


In [16]:
trend_features = ["tempo", "energy", "valence", "loudness"]


# 1) Merge: tracks + audio_features
# Note: this rebinds the notebook-level names `tracks` and `audio` (and `albums` in step 3),
# so later cells see the renamed/reduced frames.
tracks = pd.read_csv(PATHS.raw_dir / "tracks.csv").rename(columns={"id": "track_id"})
audio = pd.read_csv(PATHS.raw_dir / "audio_features.csv").rename(columns={"id": "audio_feature_id"})

df = tracks.merge(audio, on="audio_feature_id", how="left")

# 2) Merge: track -> album (via the relationship table, if present)
# A track that appears on several albums yields one row per album after this merge.
rat_path = PATHS.raw_dir / "r_albums_tracks.csv"
if rat_path.exists():
    r_at = pd.read_csv(rat_path)
    df = df.merge(r_at[["track_id", "album_id"]], on="track_id", how="left")
elif "album_id" not in df.columns:
    # Fallback: tracks may already contain album_id; otherwise add an empty column
    df["album_id"] = pd.NA

# 3) Merge: add the album release_date
albums = pd.read_csv(PATHS.raw_dir / "albums.csv")[["id", "release_date"]].rename(columns={"id": "album_id"})
df = df.merge(albums, on="album_id", how="left")

# 4) Derive the year from release_date (millisecond timestamp)
ts = pd.to_numeric(df["release_date"], errors="coerce")
df["release_dt"] = pd.to_datetime(ts, unit="ms", errors="coerce")
df["year"] = df["release_dt"].dt.year

# 5) Trends: yearly means + number of tracks per year
trends = (
    df.dropna(subset=["year"])
    .groupby("year")[trend_features]
    .mean()
    .reset_index()
)

counts = (
    df.dropna(subset=["year"])
    .groupby("year")["track_id"]
    .count()
    .reset_index(name="count")
)

trends = trends.merge(counts, on="year", how="left").sort_values("year")

# 6) Rolling mean (window=3) for smoother trend lines
for f in trend_features:
    trends[f"trend3_{f}"] = trends[f].rolling(3, min_periods=1, center=True).mean()
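
Since the robustness of trends depends on how many tracks back each yearly mean, one possible refinement is to hide means from thinly covered years before smoothing. The sketch below shows this on toy data. The function name `yearly_trends`, the `min_count` threshold, and the de-duplication on `track_id` (which keeps the first row of a track that appears on several albums) are assumptions for illustration, not part of the cell above.

```python
import pandas as pd


def yearly_trends(df: pd.DataFrame, features: list[str], min_count: int = 30, window: int = 3) -> pd.DataFrame:
    """Yearly means with track counts; means of years below min_count are masked."""
    d = df.dropna(subset=["year"]).drop_duplicates(subset="track_id")
    grouped = d.groupby("year")
    trends = grouped[features].mean()
    trends["count"] = grouped["track_id"].count()

    # Mask yearly means that rest on too few tracks
    thin = trends["count"] < min_count
    trends.loc[thin, features] = float("nan")

    # Centered rolling mean; masked years are skipped, so the smoothed line bridges them
    for f in features:
        trends[f"trend{window}_{f}"] = trends[f].rolling(window, min_periods=1, center=True).mean()
    return trends.reset_index()


demo = pd.DataFrame({
    "track_id": ["t1", "t1", "t2", "t3", "t4", "t5", "t6", "t7"],  # t1 is duplicated
    "year": [2000, 2000, 2000, 2000, 2001, 2002, 2002, 2002],
    "energy": [0.2, 0.2, 0.4, 0.6, 0.9, 0.5, 0.7, 0.9],
})
out = yearly_trends(demo, ["energy"], min_count=2)
print(out)
```

Keeping the `count` column next to the means makes it possible to show coverage alongside every trend line.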



## 10) Bivariate/Multivariate EDA: Correlations & Relationships

Goal: first evidence on associations between numeric variables and popularity.

**Note:** Correlation is descriptive (not causal). Effect sizes matter more than significance alone.


In [17]:
HEAT_DIR = PATHS.schema_reports_dir  / "correlation_heatmaps"
HEAT_DIR.mkdir(parents=True, exist_ok=True)

tables = {
    "Tracks": tracks,
    "Audio Features": audio,
    "Artists": artists,
    "Albums": albums
}

def safe_name(s: str) -> str:
    s = s.lower().strip()
    s = re.sub(r"\s+", "_", s)
    s = re.sub(r"[^a-z0-9_]+", "", s)
    return s

for name, df in tables.items():
    num_cols = df.select_dtypes(include=["number"]).columns
    if len(num_cols) < 2:
        print("skip (not enough numeric cols):", name)
        continue

    corr = df[num_cols].corr(numeric_only=True).round(2)

    fig, ax = plt.subplots(figsize=(6, 4))
    sns.heatmap(
        corr,
        annot=False,
        cmap="coolwarm",
        center=0,
        cbar_kws={"shrink": 0.7},
        square=True,
        ax=ax
    )
    ax.set_title(f"Korrelationsmatrix – {name}", fontsize=12, fontweight="bold")
    fig.tight_layout()

    out_path = HEAT_DIR / f"corr_heatmap_{safe_name(name)}.png"
    fig.savefig(out_path, dpi=200, bbox_inches="tight")
    plt.close(fig)

    print("saved:", out_path)


saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\correlation_heatmaps\corr_heatmap_tracks.png
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\correlation_heatmaps\corr_heatmap_audio_features.png
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\correlation_heatmaps\corr_heatmap_artists.png
skip (not enough numeric cols): Albums


### 10.1 Strongest Relationships with Popularity (Ranking)

We compute and rank Pearson and Spearman correlations with popularity, provided the column is present.
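
The difference between the two coefficients can be seen on a small synthetic example. The sketch below is not the implementation of `corr_with_popularity` (not shown here). The function name `pearson_spearman` and the toy relation between followers and popularity are hypothetical. Spearman is computed as the Pearson correlation of the ranks, which is its definition.

```python
import numpy as np
import pandas as pd


def pearson_spearman(x: pd.Series, y: pd.Series) -> tuple[float, float]:
    """Pearson r (linear) and Spearman rho (monotone) on pairwise complete observations."""
    pair = pd.concat([x, y], axis=1).dropna()
    a, b = pair.iloc[:, 0], pair.iloc[:, 1]
    pearson = a.corr(b)                 # linear association
    spearman = a.rank().corr(b.rank())  # Pearson on ranks = Spearman
    return float(pearson), float(spearman)


# Toy data: popularity grows with the logarithm of followers (monotone, not linear)
followers = pd.Series([10, 100, 1_000, 10_000, 100_000, 1_000_000], dtype=float)
popularity = pd.Series(np.log10(followers) * 10)

r, rho = pearson_spearman(followers, popularity)
print(round(r, 3), round(rho, 3))
```

For heavily skewed variables such as follower counts, a clearly higher Spearman than Pearson value points to a monotone but non-linear relationship, which a log transform would typically straighten out.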


In [18]:
CORR_DIR = PATHS.schema_reports_dir / "correlations"
CORR_DIR.mkdir(parents=True, exist_ok=True)

# build full table across all tables
all_corrs = []
for name, df in tables.items():
    c = corr_with_popularity(df, name)
    if not c.empty:
        all_corrs.append(c)

if not all_corrs:
    print("No correlations computed (missing popularity or too few numeric features).")
else:
    pop_corrs = pd.concat(all_corrs, ignore_index=True)

    # save full CSV
    csv_path = CORR_DIR / "corr_with_popularity_all_tables.csv"
    pop_corrs.to_csv(csv_path, index=False, encoding="utf-8-sig")
    print("saved:", csv_path)

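    # Rank globally by |Spearman r|: concat only stacks the per-table results,
    # so head() alone would not return the strongest correlations overall
    pop_corrs = pop_corrs.sort_values(
        "spearman_r", key=lambda s: s.abs(), ascending=False, ignore_index=True
    )

    # Top 15 as PNG table
    top = pop_corrs.head(15).copy()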
    for c in ["pearson_r", "spearman_r"]:
        top[c] = top[c].round(3)
    for c in ["pearson_p", "spearman_p"]:
        top[c] = top[c].map(lambda v: f"{v:.2e}")

    fig, ax = plt.subplots(figsize=(12, 0.55 * len(top) + 1.8))
    ax.axis("off")
    tbl = mpl_table(ax, cellText=top.values, colLabels=top.columns, cellLoc="center", loc="center")
    tbl.auto_set_font_size(False)
    tbl.set_fontsize(8.5)
    tbl.scale(1, 1.2)

    png_path = CORR_DIR / "corr_with_popularity_top15_table.png"
    fig.savefig(png_path, dpi=200, bbox_inches="tight")
    plt.close(fig)
    print("saved:", png_path)

    # TOP 12 barplot (Spearman r)
    plot_df = pop_corrs.head(12).copy()
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(plot_df["feature"], plot_df["spearman_r"])
    ax.set_title("Top 12: Spearman correlation with popularity")
    ax.set_ylabel("spearman_r")
    ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()

    bar_path = CORR_DIR / "corr_with_popularity_top12_spearman.png"
    fig.savefig(bar_path, dpi=200, bbox_inches="tight")
    plt.close(fig)
    print("saved:", bar_path)



saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\correlations\corr_with_popularity_all_tables.csv
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\correlations\corr_with_popularity_top15_table.png
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\correlations\corr_with_popularity_top12_spearman.png


### 10.2 Popularity by Quantiles (Nonlinear Check)

Correlations summarize a relationship globally. Segmenting by quantiles often reveals robust, nonlinear differences (e.g. top 10% vs. the rest).

**Output:** One boxplot per feature, grouped by popularity quantile.
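
The following sketch shows why the labels in the next cell are built dynamically. The popularity values are invented and deliberately zero-heavy, so several quartile edges coincide and `pd.qcut` with `duplicates="drop"` returns fewer than four bins.

```python
import pandas as pd

# invented, zero-inflated popularity values
pop = pd.Series([0] * 6 + [10, 20, 30, 40], name="popularity")

# identical quantile edges are merged instead of raising a ValueError
bins = pd.qcut(pop, q=4, duplicates="drop")

# derive the labels from the number of bins that actually survived
n_bins = len(bins.cat.categories)
labels = [f"Q{i + 1}" for i in range(n_bins)]
pop_q = bins.cat.rename_categories(labels)

counts = pop_q.value_counts().sort_index()
print(n_bins)
print(counts)
```

Passing four fixed labels here would fail, because only two bins remain after the duplicate edges are dropped. The bins are also no longer equally sized, which is worth keeping in mind when reading the boxplots.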


In [19]:
QUANTILE_DIR = PATHS.schema_reports_dir / "popularity_quantiles"
QUANTILE_DIR.mkdir(parents=True, exist_ok=True)

if "popularity" in tracks.columns:
    pop = pd.to_numeric(tracks["popularity"], errors="coerce").clip(0, 100)

    # keep only rows with a valid popularity value
    dfq = tracks.loc[pop.notna()].copy()
    pop_clean = pop.loc[pop.notna()]

    if len(dfq) > 0:
        # qcut can fail when many values are tied -> duplicates="drop" merges identical bin edges
        # build labels dynamically in case fewer than 4 bins remain
        q = 4
        bins = pd.qcut(pop_clean, q=q, duplicates="drop")

        n_bins = bins.cat.categories.size
        labels = [f"Q{i+1}" for i in range(n_bins)]
        dfq["popularity_q"] = bins.cat.rename_categories(labels)

        # features (truly numeric columns only)
        feat_candidates = [c for c in ["duration", "track_number", "disc_number"] if c in dfq.columns]

        # optional: encode explicit as 0/1 (covers "true"/"false" as well as 0/1 storage)
        if "explicit" in dfq.columns:
            dfq["explicit_num"] = dfq["explicit"].astype(str).str.lower().map({"true": 1, "false": 0, "1": 1, "0": 0})
            if dfq["explicit_num"].notna().sum() > 0:
                feat_candidates.append("explicit_num")

        for feat in feat_candidates:
            tmp = dfq[["popularity_q", feat]].copy()
            tmp[feat] = pd.to_numeric(tmp[feat], errors="coerce")
            tmp = tmp.dropna()

            if tmp.empty or tmp[feat].nunique() < 2:
                continue

            fig, ax = plt.subplots(figsize=(7.5, 4.2))
            sns.boxplot(data=tmp, x="popularity_q", y=feat, ax=ax)
            ax.set_title(f"{feat} by popularity quantile")
            ax.set_xlabel("Popularity (Quantile)")
            ax.set_ylabel(feat)
            fig.tight_layout()

            out = QUANTILE_DIR / f"box_{feat}_by_popularity_quantile.png"
            fig.savefig(out, dpi=200, bbox_inches="tight")
            plt.close(fig)


### 10.3 Visualizing Key Relationships (Scatterplots)

We plot bivariate relationships so that trends, clusters, and outliers can be seen directly.
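
The helper `top_corr_pairs` called in the next cell is not shown in this notebook. One plausible way to select the strongest pairs is sketched below with a hypothetical function `strongest_pairs` and invented demo values; it is not necessarily how the project helper works.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df: pd.DataFrame, k: int = 5) -> list:
    """Return the k column pairs with the largest |Spearman rho| (hypothetical helper)."""
    num = df.select_dtypes(include="number").dropna()
    # Spearman's rho equals Pearson's r computed on ranks
    corr = num.rank().corr()
    # keep each unordered pair once: strict upper triangle, diagonal excluded
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return [(a, b, float(pairs.loc[(a, b)])) for a, b in order[:k]]


# invented demo values
demo_audio = pd.DataFrame({
    "energy": [0.1, 0.4, 0.5, 0.8, 0.9],
    "loudness": [-20.0, -12.0, -10.0, -6.0, -4.0],
    "acousticness": [0.9, 0.7, 0.2, 0.3, 0.1],
})
top = strongest_pairs(demo_audio, k=2)
print(top)
```

Masking the lower triangle avoids plotting every pair twice and drops the trivial self-correlations on the diagonal.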


In [20]:
SCATTER_DIR = PATHS.schema_reports_dir / "scatter_small"
SCATTER_DIR.mkdir(parents=True, exist_ok=True)

# ----------------------------
# (A) Audio: top-k Spearman pairs -> scatterplots
# ----------------------------
audio_pairs = top_corr_pairs(audio, k=12, method="spearman")
for x, y, c in audio_pairs:
    small_scatter(audio, x, y, f"Audio: {x} vs {y} (ρ={c:.2f})", out_dir=SCATTER_DIR)


# ----------------------------
# (B) Tracks: targeted relationships with popularity
# ----------------------------
track_targets = [
    ("duration", "popularity", "Tracks: Duration vs Popularity"),
    ("track_number", "popularity", "Tracks: Track number vs Popularity"),
    ("disc_number", "popularity", "Tracks: Disc number vs Popularity"),
]


for x, y, title in track_targets:
    if x in tracks.columns and y in tracks.columns:
        small_scatter(tracks, x, y, title, out_dir=SCATTER_DIR)


# ----------------------------
# (C) Top-k correlations within Tracks as well (numeric vs. numeric only)
# -> useful if you want more than just the popularity relationships
# ----------------------------
track_pairs = top_corr_pairs(tracks, k=8, method="spearman")
for x, y, c in track_pairs:
    small_scatter(tracks, x, y, f"Tracks: {x} vs {y} (ρ={c:.2f})", out_dir=SCATTER_DIR)


print(f"saved: scatter_small -> {SCATTER_DIR.name}")

saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\scatter_small\audio:_energy_vs_loudness_(ρ=080).png
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\scatter_small\audio:_acousticness_vs_energy_(ρ=-074).png
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\scatter_small\audio:_acousticness_vs_loudness_(ρ=-060).png
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\scatter_small\audio:_danceability_vs_valence_(ρ=054).png
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\scatter_small\audio:_instrumentalness_vs_loudness_(ρ=-033).png
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\scatter_small\audio:_energy_vs_valence_(ρ=032).png
saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\scatter_small\audio:_loudness_vs_valence_(ρ=031).png
saved: C:\Git

### 10.4 Scatter Matrix (Compact Multivariate Overview)


In [21]:
SCATTER_DIR = PATHS.schema_reports_dir / "scatter"
SCATTER_DIR.mkdir(parents=True, exist_ok=True)


selected_cols = [c for c in ["danceability", "energy", "valence", "loudness", "tempo"] if c in audio.columns]


if len(selected_cols) >= 3:
    df_plot = audio[selected_cols].apply(pd.to_numeric, errors="coerce").dropna()


    n = min(2000, len(df_plot))
    df_plot = df_plot.sample(n=n, random_state=42)


    axes = scatter_matrix(
        df_plot,
        figsize=(10, 8),
        alpha=0.25,
        diagonal="kde",
        marker=".",
    )


    # improve readability: rotate and right-align the axis labels
    for ax in axes[-1, :]:
        ax.xaxis.label.set_rotation(45)
        ax.xaxis.label.set_ha("right")
    for ax in axes[:, 0]:
        ax.yaxis.label.set_rotation(0)
        ax.yaxis.label.set_ha("right")


    plt.suptitle("Scatter matrix of selected audio features", y=1.02, fontsize=12)
    plt.tight_layout()


    out_path = SCATTER_DIR / "audio_features_scatter_matrix.png"
    plt.savefig(out_path, dpi=200, bbox_inches="tight")
    plt.close()


    print(f"saved: {out_path.name}")


saved: audio_features_scatter_matrix.png


### 10.5 Artists: Followers vs. Popularity

We examine the relationship between reach (`followers`) and popularity. Because follower counts are strongly right-skewed, we plot `log1p(followers)` instead of the raw values.
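
A short sketch of the effect of this transform, using invented follower counts that span several orders of magnitude. `log1p` is used rather than `log` because it is defined for artists with zero followers.

```python
import numpy as np
import pandas as pd

# invented follower counts, not taken from the dataset
followers = pd.Series([0, 3, 12, 150, 2_000, 45_000, 1_200_000, 80_000_000])

# log1p(x) = log(1 + x), so zero followers map to 0 instead of -inf
log_followers = np.log1p(followers)

raw_skew = followers.skew()
log_skew = log_followers.skew()

print(f"skew raw: {raw_skew:.2f}, skew log1p: {log_skew:.2f}")
```

On the raw scale a single very large artist dominates the axis. After the transform the values are spread far more evenly, which makes the scatterplot readable.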


In [22]:
ARTISTS_DIR = PATHS.schema_reports_dir / "artists_eda"
ARTISTS_DIR.mkdir(parents=True, exist_ok=True)


df = artists[["followers", "popularity"]].copy()
df["followers"] = pd.to_numeric(df["followers"], errors="coerce")
df["popularity"] = pd.to_numeric(df["popularity"], errors="coerce").clip(0, 100)
df = df.dropna()

x = np.log1p(df["followers"].clip(lower=0))
y = df["popularity"]


fig, ax = plt.subplots(figsize=(7.5, 4.5))
ax.scatter(x, y, alpha=0.25, s=8)
ax.set_title("Artists: log1p(followers) vs popularity")
ax.set_xlabel("log1p(followers)")
ax.set_ylabel("popularity (0–100)")
ax.grid(alpha=0.2)
fig.tight_layout()


out = ARTISTS_DIR / "artists_scatter_followers_log_vs_popularity.png"
fig.savefig(out, dpi=200, bbox_inches="tight")
plt.close(fig)
print(f"saved: {out.name}")


saved: artists_scatter_followers_log_vs_popularity.png


## 11) Segmentations & Cross-Table Analyses

These analyses compare patterns between categories (album type, genre IDs) and across tables.


### 11.1 Category Analysis: Audio Features by Album Type and Genre ID

Analysis of audio features (e.g. danceability, energy, valence, loudness, tempo) by:
- **Album type** (album/single/compilation)
- **Genre** as `genre_id` (top N)
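
The internals of `CategoryAnalyzer` are not shown here. The grouping it reports on can be sketched as follows, assuming for illustration a single flat table that already carries `album_type`, `genre_id`, and the audio features. In the real data these columns live in separate tables and would have to be joined first, and all values below are invented.

```python
import pandas as pd

# invented, already-joined demo table
demo = pd.DataFrame({
    "album_type": ["album", "album", "album", "single", "single", "compilation"],
    "genre_id": [1, 1, 2, 1, 3, 2],
    "energy": [0.4, 0.6, 0.5, 0.8, 0.9, 0.3],
    "tempo": [100.0, 120.0, 110.0, 128.0, 132.0, 90.0],
})
features = ["energy", "tempo"]

# per album type: group size and median of each feature
by_album_type = demo.groupby("album_type")[features].agg(["count", "median"])

# per genre: restrict to the top-N most frequent genre IDs first
top_n = 2
top_genres = demo["genre_id"].value_counts().head(top_n).index
by_genre = demo[demo["genre_id"].isin(top_genres)].groupby("genre_id")[features].median()

print(by_album_type)
print(by_genre)
```

Reporting the group size next to the median helps to spot categories, such as the single compilation row here, that are too small to interpret.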


In [23]:
analyzer = CategoryAnalyzer(PATHS.raw_dir, PATHS.schema_reports_dir)
results = analyzer.execute()


INFO - CategoryAnalyzer - Running CategoryAnalyzer
INFO - CategoryAnalyzer - Album-type analysis with features: tempo, energy, valence, loudness, danceability
INFO - CategoryAnalyzer - Saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\categories\stats_album_type.csv
INFO - CategoryAnalyzer - Genre analysis (top_n=15)
INFO - CategoryAnalyzer - Saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\categories\stats_genre_top15.csv
INFO - CategoryAnalyzer - Outputs in: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\categories


### 11.2 Cross-Table Relationship: Artist Influence

**Goal:** We examine whether an artist's reach (followers) is related to the average success of their songs (mean track popularity).
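
The internals of `InfluenceAnalyzer` are not shown here. The sketch below illustrates one way to build such a cross-table metric. The table and column names follow the column mapping printed in the log output of the next cell, while the demo rows and the variable names `demo_tracks`, `demo_artists`, and `demo_links` are invented.

```python
import numpy as np
import pandas as pd

# invented demo rows; column names follow the logged column mapping
demo_tracks = pd.DataFrame({
    "track_id": ["t1", "t2", "t3", "t4", "t5"],
    "popularity": [10, 30, 50, 70, 90],
})
demo_artists = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "followers": [100, 10_000, 1_000_000],
})
demo_links = pd.DataFrame({  # stands in for r_track_artist
    "track_id": ["t1", "t2", "t3", "t4", "t5"],
    "artist_id": ["a1", "a1", "a2", "a3", "a3"],
})

# mean track popularity per artist via the link table
per_artist = (
    demo_links.merge(demo_tracks, on="track_id", how="inner")
    .groupby("artist_id", as_index=False)
    .agg(avg_track_popularity=("popularity", "mean"), n_tracks=("track_id", "nunique"))
)

# attach follower counts and correlate on the log scale
influence = demo_artists.rename(columns={"id": "artist_id"}).merge(per_artist, on="artist_id", how="inner")
influence["log_followers"] = np.log1p(influence["followers"])

pearson_log = influence["log_followers"].corr(influence["avg_track_popularity"])
# Spearman as Pearson on ranks
spearman = influence["log_followers"].rank().corr(influence["avg_track_popularity"].rank())

print(influence)
print(f"pearson_log={pearson_log:.3f} spearman={spearman:.3f}")
```

Aggregating to one row per artist before correlating keeps prolific artists from being counted once per track.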


In [24]:
analyzer = InfluenceAnalyzer(PATHS.raw_dir, PATHS.schema_reports_dir)
results = analyzer.execute()


INFO - InfluenceAnalyzer - Running InfluenceAnalyzer
INFO - InfluenceAnalyzer - Column mapping: {'tracks': {'track_id': 'track_id', 'popularity': 'track_popularity'}, 'artists': {'id': 'artist_id', 'followers': 'followers', 'popularity': 'artist_popularity'}, 'r_track_artist': {'track_id': 'track_id', 'artist_id': 'artist_id'}}
INFO - InfluenceAnalyzer - Artists in analysis: 187090
INFO - InfluenceAnalyzer - Correlation pearson_log=0.286 spearman=0.270
INFO - InfluenceAnalyzer - Saved: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\influence\artist_influence.csv
INFO - InfluenceAnalyzer - Outputs in: C:\GitHub\uni-project-metrics-and-data\data\reports\schema_overview\slice_000\influence
