Filter Apple Music catalog to jazz-relevant subset at index-build time

## Background

Issue #162 stood up the Apple Music catalog refresh as a managed service. The full catalog at this point is ~84GB of parquet, ~31GB of indexed DuckDB, with 60M albums and 196M songs. Building the song-album index requires >8GB of RAM and a 16GB Render worker — a real cost just to make song-by-album lookups fast.

The vast majority of that catalog is not jazz. Apple's Feed API doesn't support genre-filtered downloads (you have to grab the whole thing), but we *can* filter at index-build time, before the data ever materializes into DuckDB tables.

## Goal

Shrink the indexed catalog to roughly the jazz subset, so:

- Index builds fit comfortably in 2-4GB worker memory
- Refresh runs in minutes instead of hours
- We don't need to provision a 16GB instance just for catalog rebuild

Disk savings are minimal (we still download the full 84GB parquet each refresh because Apple gives us the whole feed), but memory + time savings are large.

## Approach

Two complementary filters in `backend/integrations/apple_music/catalog_index.py` `build_index()` `CREATE TABLE` clauses:

1. **Genre tag filter.** Apple's parquet has a `genres` column. Add `WHERE list_has_any(genres, ARRAY[...])` with a curated list of jazz-relevant tags (Jazz, Bebop, Vocal Jazz, Big Band, Hard Bop, Cool Jazz, Fusion, etc.). Jazz is probably 1-3% of the catalog, so 60M albums → ~1M, 196M songs → ~3-5M.

2. **Known-performer artist ID filter.** We already have a `performers` table in Postgres with names and MusicBrainz IDs. Backfill an `apple_artist_id` column (one-time lookup), then include `OR primaryArtists[1].id IN (SELECT apple_artist_id FROM performers)`. Catches jazz performers whose albums aren't tagged with a jazz genre (soundtracks, jams, mistagged albums).

Both controlled by env vars so the filter is configurable per deploy:

- `APPLE_CATALOG_FILTER_GENRES=Jazz,Bebop,Vocal Jazz,...` (comma-separated)
- `APPLE_CATALOG_FILTER_TO_KNOWN_ARTISTS=true|false`

## Discovery step (chicken-and-egg)

We can't know what genre tags exist without inspecting the catalog. Use the read-only REPL added in commit 3bd9563:

```bash
cd /opt/render/project/src/backend
python scripts/onetime_scripts/apple_catalog_repl.py
```

Then:

```sql
-- Inventory jazz-adjacent tags
SELECT g, COUNT(*) FROM (
  SELECT unnest(genres) AS g FROM albums
) WHERE g ILIKE '%jazz%' OR g ILIKE '%bebop%' OR g ILIKE '%swing%' OR g ILIKE '%fusion%'
GROUP BY g ORDER BY 2 DESC LIMIT 50;

-- Cross-check: any well-known jazz albums without a jazz genre tag?
SELECT a.name, a.artist_name, a.genres FROM albums a
WHERE LOWER(a.artist_name) IN ('miles davis','john coltrane','bill evans','thelonious monk','charlie parker')
ORDER BY artist_name LIMIT 50;
```

That gives us the jazz tag taxonomy + a sense of how often we'd miss something with genre-only filtering.

## Implementation sketch

1. Confirm parquet schema has `genres` column (verify via REPL)
2. Compile curated jazz-tag list from REPL exploration
3. Add `WHERE` clauses to the `CREATE TABLE albums` and `CREATE TABLE songs` paths in `build_index()`
4. Add the two env vars + sensible defaults (probably both off by default to preserve current behavior; opt in by setting them on the worker)
5. Add an `apple_artist_id` column + backfill to the `performers` table for filter #2
6. One refresh cycle on the small worker plan to validate

## Out of scope

- Changing the download step (we still grab 84GB each refresh)
- Filtering at query time only (defeats the index-build savings; we want indexes to be small too)
- Auto-discovering jazz tags (manual curation is fine, taxonomy is small)

## References

- Original work: #162
- Catalog rebuild handler: `backend/research_worker/handlers/apple_catalog.py`
- Index builder: `backend/integrations/apple_music/catalog_index.py`
- Read-only REPL for exploration: `backend/scripts/onetime_scripts/apple_catalog_repl.py` (commit 3bd9563)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filter Apple Music catalog to jazz-relevant subset at index-build time #174

Background

Goal

Approach

Discovery step (chicken-and-egg)

Implementation sketch

Out of scope

References

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Filter Apple Music catalog to jazz-relevant subset at index-build time #174

Description

Background

Goal

Approach

Discovery step (chicken-and-egg)

Implementation sketch

Out of scope

References

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions