# 00_Data_Cleaning.ipynb

**Project context**  
Prepare two CSVs for the exploration & modelling notebooks.

---

## Inputs
| raw file | rows × cols | note |
|----------|-------------|------|
| `echonest.csv` | ~13 k × ~250 | multi-index header, many nulls |

## Outputs
| clean file | rows × cols | note |
|------------|-------------|------|
| `echonest_audio_features.csv` | 13 129 × **8** | neat global features |
| `echonest_audio_temporal.csv` | 13 129 × **224** | `beat_0` … `beat_223` timbre PCA |

## Road-map
1. **Load** raw CSV & flatten header  
2. **Clean** – drop sparse columns & rows with any NaNs  
3. **Split** → rename columns → cast temporal to *float32*  
4. **Save** neat CSVs (index=`track_id`)


### 1 · Load raw EchoNest


In [27]:
# 1 · Load raw CSV & flatten header --------------------------------------
import os

import numpy as np
import pandas as pd

BASE = "/Users/angel/emotion_audio_gan/data/fma/fma_metadata"
RAW  = os.path.join(BASE, "echonest.csv")

df_raw = pd.read_csv(
    RAW,
    skiprows=[0],          # drop EchoNest license row
    header=[0,1],          # two-level header
    index_col=0            # track_id
)

# flatten multi-index → "group_field"
df_raw.columns = ['_'.join(col).strip() for col in df_raw.columns.values]
df_raw.index.name = "track_id"

print("raw:", df_raw.shape)
# Interpretation – ~13 k rows, ~250 cols



raw: (13129, 249)
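
To make the header handling above concrete, here is a minimal sketch that runs the same `read_csv` arguments on a tiny in-memory file. The layout (one extra first row, two header rows, then a `track_id` index-name row) and all values are assumptions made for illustration, not a copy of `echonest.csv`.

```python
import io

import pandas as pd

# Toy stand-in for the raw file (made-up values, assumed layout):
# row 0 is skipped, rows 1-2 form the two-level header, row 3 names the index.
toy_csv = io.StringIO(
    "echonest,echonest,echonest,echonest\n"
    ",audio_features,audio_features,temporal_features\n"
    ",valence,energy,000\n"
    "track_id,,,\n"
    "2,0.58,0.92,0.88\n"
    "3,0.27,0.82,0.53\n"
)

toy = pd.read_csv(toy_csv, skiprows=[0], header=[0, 1], index_col=0)

# Each column label is a (group, field) tuple; join the two parts with "_".
toy.columns = ["_".join(col).strip() for col in toy.columns.values]
toy.index.name = "track_id"

print(toy.columns.tolist())
```

The flattened labels take the form `group_field`, which is what the `audio_features_` and `temporal_features` prefixes in step 3 rely on.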


### 2 · Clean: drop heavy-null columns, NaN rows & duplicates
*Threshold > 3000 nulls (≈ 23 %) removes mostly social-metadata noise.*

In [28]:
# 2 · Clean: drop sparse cols then rows with any NaN ----------------------
THRESH = 3000   # >3000 nulls ≈ 23 % missing → unreliable

sparse_cols = [c for c in df_raw.columns if df_raw[c].isna().sum() > THRESH]
df = df_raw.drop(columns=sparse_cols)

df = df.dropna()              # remove any row that still has NaN
df = df.drop_duplicates()

print("after clean:", df.shape, "| dropped cols:", len(sparse_cols))
# Interpretation – 10 sparse metadata columns removed; no rows were dropped (13 129 remain); data now fully dense


after clean: (13129, 239) | dropped cols: 10
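
The cut-off of 3000 nulls is tied to the 13 129-row table (3000 / 13 129 ≈ 0.2285). A fraction-based variant of the same filter might look like the sketch below. The toy frame and the name `MAX_NULL_FRAC` are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows: "sparse" is 40 % missing, "dense" has no gaps.
toy = pd.DataFrame({
    "dense":  np.arange(10, dtype=float),
    "sparse": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan, 7.0, np.nan, 9.0, 10.0],
})

# Hypothetical constant: the notebook's count threshold expressed as a fraction.
MAX_NULL_FRAC = 3000 / 13129

null_frac = toy.isna().mean()                       # share of NaNs per column
sparse = null_frac[null_frac > MAX_NULL_FRAC].index.tolist()

# Same order as the cell above: drop sparse columns, then NaN rows, then duplicates.
kept = toy.drop(columns=sparse).dropna().drop_duplicates()

print(sparse, kept.shape)
```

Expressing the threshold as a fraction keeps the filter meaningful if the row count changes, for example on a subset of the data.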


### 3 · Split → rename → cast


In [30]:
# 3 · Split into GLOBAL 8 & TEMPORAL 224 ---------------------------------
# 3a  Rename global cols to plain Spotify names
global_map = {
    "audio_features_valence":        "valence",
    "audio_features_energy":         "energy",
    "audio_features_acousticness":   "acousticness",
    "audio_features_danceability":   "danceability",
    "audio_features_instrumentalness":"instrumentalness",
    "audio_features_liveness":       "liveness",
    "audio_features_speechiness":    "speechiness",
    "audio_features_tempo":          "tempo"
}
df = df.rename(columns=global_map)

global_cols   = list(global_map.values())
temporal_cols = sorted([c for c in df.columns if c.startswith("temporal_features")])

# 3b  Rename temporal → beat_0 … beat_223
beat_names = {old: f"beat_{i}" for i,old in enumerate(temporal_cols)}
df = df.rename(columns=beat_names)

df_global   = df[global_cols].copy()
df_temporal = df[list(beat_names.values())].astype("float32")  # halve memory
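
One caveat on the `sorted(...)` call above: it orders strings lexicographically, so `beat_i` follows the numeric order only if the raw suffixes are zero-padded (assumed here, for example `000` … `223`). The sketch below uses hypothetical unpadded names to show the failure mode and one possible safeguard that sorts on the parsed suffix.

```python
import re

# Hypothetical unpadded names, used only to demonstrate the ordering issue.
cols = [f"temporal_features_{i}" for i in (0, 1, 2, 10, 11)]

# Plain string sort: "..._10" and "..._11" land before "..._2".
lexicographic = sorted(cols)

# Sort on the trailing integer instead, independent of padding.
numeric = sorted(cols, key=lambda c: int(re.search(r"(\d+)$", c).group(1)))

print(lexicographic)
print(numeric)
```

With zero-padded suffixes both orderings coincide, so the numeric key is purely defensive.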


### 4 · Save neat CSVs (index = track_id)


In [None]:
# 4 · Save CSVs ------------------------------------------------------
OUT_DIR = "/Users/angel/emotion_audio_gan/data/fma"
os.makedirs(OUT_DIR, exist_ok=True)

df_global.to_csv(os.path.join(OUT_DIR, "echonest_audio_features.csv"),
                 index_label="track_id")

df_temporal.to_csv(os.path.join(OUT_DIR, "echonest_audio_temporal.csv"),
                   index_label="track_id")

print("✓ Saved:", "echonest_audio_features.csv", "and",
      "echonest_audio_temporal.csv", "to", OUT_DIR)
# Interpretation – both CSVs share track_id index → easy downstream loading

✓ Saved: echonest_audio_features.csv and echonest_audio_temporal.csv to /Users/angel/emotion_audio_gan/data/fma
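
For downstream notebooks, a loading step could look like the following sketch. It writes two toy files to a temporary directory (made-up values, hypothetical file names) so that it runs on its own. Note that CSV does not store dtypes, so the *float32* cast has to be reapplied after reading.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the two saved tables (made-up values).
idx = pd.Index([2, 3, 5], name="track_id")
g = pd.DataFrame({"valence": [0.58, 0.27, 0.62],
                  "energy":  [0.92, 0.82, 0.56]}, index=idx)
t = pd.DataFrame({"beat_0": [0.88, 0.53, 0.55],
                  "beat_1": [0.66, 0.45, 0.72]}, index=idx).astype("float32")

with tempfile.TemporaryDirectory() as tmp:
    g_path = os.path.join(tmp, "features.csv")   # hypothetical file names
    t_path = os.path.join(tmp, "temporal.csv")
    g.to_csv(g_path, index_label="track_id")
    t.to_csv(t_path, index_label="track_id")

    g2 = pd.read_csv(g_path, index_col="track_id")
    t_default = pd.read_csv(t_path, index_col="track_id")   # floats come back as float64
    t2 = t_default.astype("float32")                        # reapply the cast

# Both tables share the track_id index, so a plain join aligns them.
joined = g2.join(t2)
print(joined.shape, t_default.dtypes.unique(), t2.dtypes.unique())
```

If preserving dtypes across notebooks matters, a binary format such as Parquet would be an alternative, though that is outside what this notebook does.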
