# Preprocessing & Data Exploration  
**Notebook:** HDF daily loader & dataset structure checks

**Project Path:** `/Users/laval/spotify`

This notebook provides a first exploration of the **daily HDF data dumps** used in this project.

## Dataset Structure (per day)

Each daily folder contains ≈ **40 HDF files**, which materialize into **4 core dataframes**:

1. **Playlist–Track info**  
   - Playlist ID  
   - Track ID  
   - Track name  
   - Artist ID  
   - Album ID  
   - **Track popularity that day**
   - File name: `{date}_playlist_track_info_{i}.hdf`

2. **Playlist → Track list**  
   Complete track list for every playlist observed that day.
   - File name: `{date}_playlist_ids_with_track_ids_{i}.hdf`

3. **Playlist metadata**  
   - Playlist ID  
   - Owner (Spotify / Filtr/Sony / Digster/Universal / Topsify/Warner / …)  
   - Description  
   - **Follower count per day**  
   - Track count 
   - File name: `{date}_playlists_with_features_{i}.hdf`

4. **Track IDs observed that day**  
   ~1M tracks/day, collected across playlists.
   - File name: `{date}_track_ids_{i}.hdf`
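
The naming scheme above can be sketched as a small path builder. This is only an illustration: `DATASETS` and `expected_files` are hypothetical names, the HDF keys are the ones passed to `pd.read_hdf` in the loader cell of this notebook, and 10 parts per dataframe is assumed.

```python
import os

# File-name stem -> HDF key, as used by the loader cell in this notebook.
DATASETS = {
    "playlist_track_info": "/playlist_track_info",
    "playlist_ids_with_track_ids": "/playlist_ids_with_track_ids",
    "playlists_with_features": "/playlists",
    "track_ids": "/gathered_track_ids",
}


def expected_files(folder_path, date, n_parts=10):
    """Return {stem: [path, ...]} for every part expected on one day.

    Hypothetical helper: it only builds paths and does not check
    that the files exist on disk.
    """
    return {
        stem: [
            os.path.join(folder_path, f"{date}_{stem}_{i}.hdf")
            for i in range(1, n_parts + 1)
        ]
        for stem in DATASETS
    }


files = expected_files("/Users/laval/spotify/data/2019-01-09", "2019-01-09")
print(sum(len(paths) for paths in files.values()))  # 4 stems x 10 parts = 40
```

Note that two of the HDF keys differ from their file-name stems (`/playlists` and `/gathered_track_ids`), so keeping the mapping in one place avoids mismatches.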

## HDF loader 

Adapted from @zeijena's script. The loader covers the following steps:

- Listing the HDF files for the selected dates
- Loading each of the 4 expected dataframes
- Printing shapes & sample rows
- Validating that the schema is consistent across days
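
The last step, schema validation across days, could look roughly like the sketch below. It assumes the column names of each day's dataframe have been collected into a dict (for a DataFrame, `list(df.columns)`); `schema_differences` is a hypothetical helper, and the second date and the column lists in the usage lines are illustrative only.

```python
def schema_differences(columns_by_day):
    """Compare each day's column list with the earliest day's.

    `columns_by_day` maps a date string to a list of column names.
    Returns {date: {"missing": [...], "extra": [...]}} for the days
    whose columns differ from the reference (earliest) day.
    """
    days = sorted(columns_by_day)
    if not days:
        return {}
    reference = set(columns_by_day[days[0]])
    differences = {}
    for day in days[1:]:
        current = set(columns_by_day[day])
        missing = sorted(reference - current)
        extra = sorted(current - reference)
        if missing or extra:
            differences[day] = {"missing": missing, "extra": extra}
    return differences


# Illustrative input, not real data from the dumps.
columns_by_day = {
    "2019-01-09": ["playlist.id", "track.id", "track.popularity"],
    "2019-01-10": ["playlist.id", "track.id"],
}
print(schema_differences(columns_by_day))
```

An empty result means every day has the same set of columns as the first one; column order and dtypes are not compared in this sketch.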

In [1]:
import pandas as pd
import os
from datetime import datetime

In [3]:
# --- PATHS ---
main_folder_path = '/Users/laval/spotify/data'  # daily folders

# --- PARAMS ---
N_PARTS = 10  # set to 10 on full daily data 

# List date folders
all_folders = sorted(
    [folder for folder in os.listdir(main_folder_path)
     if os.path.isdir(os.path.join(main_folder_path, folder))]
)

# Keep only the first day for this exploration
all_folders = all_folders[:1]

for idx, folder_name in enumerate(all_folders):
    print("=" * 80)
    print(f"Folder: {folder_name}")
    folder_path = os.path.join(main_folder_path, folder_name)

    if not os.path.exists(folder_path):
        print("Folder does not exist, skip")
        continue

    # parse date with folder_name format 'YYYY-MM-DD'
    try:
        date_observed = datetime.strptime(folder_name, '%Y-%m-%d').date()
        print(f"   Parsed date: {date_observed}")
    except ValueError:
        date_observed = None
        print("Could not parse folder name as date")

    # =========================================
    # 1) playlist_track_info
    # =========================================
    dfs = []
    for i in range(1, N_PARTS + 1):
        print(f"[playlist_track_info] part {i}")
        file_name = f"{folder_name}_playlist_track_info_{i}.hdf"
        file_path = os.path.join(folder_path, file_name)

        if not os.path.exists(file_path):
            print(f"   - file {file_name} not found, skip")
            continue

        try:
            df = pd.read_hdf(file_path, key='/playlist_track_info')
            df['track.popularity'] = pd.to_numeric(df['track.popularity'], errors='coerce')
            dfs.append(df)
        except Exception as e:
            print(f"Error reading {file_name}: {e}")
            break  # stop reading further parts; parts already read are kept

    big_df_playlist_track_info = pd.concat(dfs, ignore_index=True) if dfs else None
    if big_df_playlist_track_info is not None:
        print("1) playlist_track_info loaded:", big_df_playlist_track_info.shape)

    # =========================================
    # 2) playlist_ids_with_track_ids
    # =========================================
    dfs = []
    for i in range(1, N_PARTS + 1):
        print(f"[playlist_ids_with_track_ids] part {i}")
        file_name = f"{folder_name}_playlist_ids_with_track_ids_{i}.hdf"
        file_path = os.path.join(folder_path, file_name)

        if not os.path.exists(file_path):
            print(f"   - file {file_name} not found, skip")
            continue

        try:
            df = pd.read_hdf(file_path, key='/playlist_ids_with_track_ids')
            dfs.append(df)
        except Exception as e:
            print(f"Error reading {file_name}: {e}")
            break

    big_df_playlist_ids_with_track_ids = pd.concat(dfs, ignore_index=True) if dfs else None
    if big_df_playlist_ids_with_track_ids is not None:
        print("2) playlist_ids_with_track_ids loaded:",
              big_df_playlist_ids_with_track_ids.shape)

    # =========================================
    # 3) playlists_with_features
    # =========================================
    dfs = []
    for i in range(1, N_PARTS + 1):
        print(f"[playlists_with_features] part {i}")
        file_name = f"{folder_name}_playlists_with_features_{i}.hdf"
        file_path = os.path.join(folder_path, file_name)

        if not os.path.exists(file_path):
            print(f"   - file {file_name} not found, skip")
            continue

        try:
            df = pd.read_hdf(file_path, key='/playlists')
            dfs.append(df)
        except Exception as e:
            print(f"Error reading {file_name}: {e}")
            break

    big_df_playlists = pd.concat(dfs, ignore_index=True) if dfs else None
    if big_df_playlists is not None:
        print("3) playlists_with_features loaded:", big_df_playlists.shape)

    # =========================================
    # 4) track_ids
    # =========================================
    dfs = []
    for i in range(1, N_PARTS + 1):
        print(f"[track_ids] part {i}")
        file_name = f"{folder_name}_track_ids_{i}.hdf"
        file_path = os.path.join(folder_path, file_name)

        if not os.path.exists(file_path):
            print(f"   - file {file_name} not found, skip")
            continue

        try:
            df = pd.read_hdf(file_path, key='/gathered_track_ids')
            dfs.append(df)
        except Exception as e:
            print(f"Error reading {file_name}: {e}")
            break

    big_df_track_ids = pd.concat(dfs, ignore_index=True) if dfs else None
    if big_df_track_ids is not None:
        print("4) track_ids loaded:", big_df_track_ids.shape)



Folder: 2019-01-09
   Parsed date: 2019-01-09
[playlist_track_info] part 1
Error reading 2019-01-09_playlist_track_info_1.hdf: Missing optional dependency 'pytables'.  Use pip or conda to install pytables.
[playlist_ids_with_track_ids] part 1
Error reading 2019-01-09_playlist_ids_with_track_ids_1.hdf: Missing optional dependency 'pytables'.  Use pip or conda to install pytables.
[playlists_with_features] part 1
Error reading 2019-01-09_playlists_with_features_1.hdf: Missing optional dependency 'pytables'.  Use pip or conda to install pytables.
[track_ids] part 1
Error reading 2019-01-09_track_ids_1.hdf: Missing optional dependency 'pytables'.  Use pip or conda to install pytables.
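
The errors above indicate that pandas could not import PyTables, which `pd.read_hdf` depends on. A minimal pre-flight check, assuming a standard Python environment, is sketched below; `has_pytables` is a hypothetical helper name. The package is imported as `tables`, and it is published as `tables` on PyPI and as `pytables` on conda.

```python
import importlib.util


def has_pytables():
    """Return True if the PyTables package (import name `tables`) is importable."""
    return importlib.util.find_spec("tables") is not None


if not has_pytables():
    print("PyTables is missing: install it with `pip install tables` "
          "or `conda install pytables`, then restart the kernel.")
```

Running this before the loader cell makes the missing dependency visible once, instead of once per HDF file.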


## Inspecting daily HDF data (structural overview & quick sanity checks)

A series of simple diagnostics to confirm that:
- Each of the 4 expected daily dataframes was loaded correctly
- The schema and column names match expectations
- The number of rows per part/day is reasonable
- Key columns such as `track.id`, `playlist.id`, and `track.popularity` are present
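
The key-column check in the list above can be sketched as follows. `missing_columns` and `REQUIRED` are hypothetical names; the required columns are the ones named in the list, and the input in the usage line is illustrative. With a loaded DataFrame, `df.columns` would be passed as the first argument.

```python
def missing_columns(columns, required):
    """Return the required column names that are absent from `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Key columns named in this notebook.
REQUIRED = ["track.id", "playlist.id", "track.popularity"]

# Illustrative column list, not taken from a real dump.
print(missing_columns(["playlist.id", "track.name"], REQUIRED))
# -> ['track.id', 'track.popularity']
```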

In [4]:
from IPython.display import display  # explicit import; display() is only implicit inside IPython

# Dict of loaded daily dataframes (the loader cell must have run first)
loaded_dfs = {
    "playlist_track_info": big_df_playlist_track_info,
    "playlist_ids_with_track_ids": big_df_playlist_ids_with_track_ids,
    "playlists": big_df_playlists,
    "track_ids": big_df_track_ids,
}

print("="*100)
print("DAILY DATA STRUCTURE OVERVIEW")
print("="*100)

for name, df in loaded_dfs.items():
    print("\n" + "-"*100)
    print(f"*** {name.upper()}")
    
    if df is None:
        print("Not loaded missing or failed HDF files")
        continue

    # Basic metadata
    print(f"   Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print("   Memory usage: {:.2f} MB".format(df.memory_usage(deep=True).sum() / 1e6))

    # Columns
    print("   Columns:", list(df.columns)[:15], 
          ("..." if len(df.columns) > 15 else ""))

    # Non-null summary
    print("\n   Non-null counts:")
    display(df.notnull().sum().sort_values(ascending=False).head(10))

    # Head
    print("\n   Preview:")
    display(df.head())

    # Simple identifier check
    id_candidates = [c for c in df.columns if "id" in c.lower()]
    if id_candidates:
        col = id_candidates[0]
        print(f"   Unique values in '{col}': {df[col].nunique():,}")
    else:
        print("   (No obvious ID column found.)")


DAILY DATA STRUCTURE OVERVIEW

----------------------------------------------------------------------------------------------------
*** PLAYLIST_TRACK_INFO
Not loaded missing or failed HDF files

----------------------------------------------------------------------------------------------------
*** PLAYLIST_IDS_WITH_TRACK_IDS
Not loaded missing or failed HDF files

----------------------------------------------------------------------------------------------------
*** PLAYLISTS
Not loaded missing or failed HDF files

----------------------------------------------------------------------------------------------------
*** TRACK_IDS
Not loaded missing or failed HDF files
