Skip to content

Filter Apple Music catalog to jazz-relevant subset at index-build time #174

@dprodger

Description

@dprodger

Background

Issue #162 stood up the Apple Music catalog refresh as a managed service. The full catalog at this point is ~84GB of parquet, ~31GB of indexed DuckDB, with 60M albums and 196M songs. Building the song-album index requires >8GB of RAM and a 16GB Render worker — a real cost just to make song-by-album lookups fast.

The vast majority of that catalog is not jazz. Apple's Feed API doesn't support genre-filtered downloads (you have to grab the whole thing), but we can filter at index-build time, before the data ever materializes into DuckDB tables.

Goal

Shrink the indexed catalog to roughly the jazz subset, so:

  • Index builds fit comfortably in 2-4GB worker memory
  • Refresh runs in minutes instead of hours
  • We don't need to provision a 16GB instance just for catalog rebuild

Disk savings are minimal (we still download the full 84GB parquet each refresh because Apple gives us the whole feed), but memory + time savings are large.

Approach

Two complementary filters in backend/integrations/apple_music/catalog_index.py build_index() CREATE TABLE clauses:

  1. Genre tag filter. Apple's parquet has a genres column. Add WHERE list_has_any(genres, ARRAY[...]) with a curated list of jazz-relevant tags (Jazz, Bebop, Vocal Jazz, Big Band, Hard Bop, Cool Jazz, Fusion, etc.). Jazz is probably 1-3% of the catalog, so 60M albums → ~1M, 196M songs → ~3-5M.

  2. Known-performer artist ID filter. We already have a performers table in Postgres with names and MusicBrainz IDs. Backfill an apple_artist_id column (one-time lookup), then include OR primaryArtists[1].id IN (SELECT apple_artist_id FROM performers). Catches jazz performers whose albums aren't tagged with a jazz genre (soundtracks, jams, mistagged albums).

Both controlled by env vars so the filter is configurable per deploy:

  • APPLE_CATALOG_FILTER_GENRES=Jazz,Bebop,Vocal Jazz,... (comma-separated)
  • APPLE_CATALOG_FILTER_TO_KNOWN_ARTISTS=true|false

Discovery step (chicken-and-egg)

We can't know what genre tags exist without inspecting the catalog. Use the read-only REPL added in commit 3bd9563:

cd /opt/render/project/src/backend
python scripts/onetime_scripts/apple_catalog_repl.py

Then:

-- Inventory jazz-adjacent tags
SELECT g, COUNT(*) FROM (
  SELECT unnest(genres) AS g FROM albums
) WHERE g ILIKE '%jazz%' OR g ILIKE '%bebop%' OR g ILIKE '%swing%' OR g ILIKE '%fusion%'
GROUP BY g ORDER BY 2 DESC LIMIT 50;

-- Cross-check: any well-known jazz albums without a jazz genre tag?
SELECT a.name, a.artist_name, a.genres FROM albums a
WHERE LOWER(a.artist_name) IN ('miles davis','john coltrane','bill evans','thelonious monk','charlie parker')
ORDER BY artist_name LIMIT 50;

That gives us the jazz tag taxonomy + a sense of how often we'd miss something with genre-only filtering.

Implementation sketch

  1. Confirm parquet schema has genres column (verify via REPL)
  2. Compile curated jazz-tag list from REPL exploration
  3. Add WHERE clauses to the CREATE TABLE albums and CREATE TABLE songs paths in build_index()
  4. Add the two env vars + sensible defaults (probably both off by default to preserve current behavior; opt in by setting them on the worker)
  5. Add an apple_artist_id column + backfill to the performers table for filter Searching for artist or song returns errors during type ahead #2
  6. One refresh cycle on the small worker plan to validate

Out of scope

  • Changing the download step (we still grab 84GB each refresh)
  • Filtering at query time only (defeats the index-build savings; we want indexes to be small too)
  • Auto-discovering jazz tags (manual curation is fine, taxonomy is small)

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions