feat(jobs): library-identity-consumer for LML bulk-resolve (#802) by jakebromberg · Pull Request #807 · WXYC/Backend-Service

jakebromberg · 2026-05-11T20:49:45Z

Summary

Implements #802 under the post-#800 cross-cache-identity pivot: Backend is now a thin writer; LML is the sole composer of cross-cache identity. The new jobs/library-identity-consumer/ package consumes LML's POST /api/v1/identity/bulk-resolve-libraries and UPSERTs the verdicts into library_identity + library_identity_source atomically per library_id. The predecessor jobs/library-identity-backfill/ (which composed identity locally from five legacy sources) is deleted in the same change.

Closes #802.

Prerequisites (merged)

Decision record: #800 (architecture pivot 2026-05-09)
API contract: WXYC/wxyc-shared#104 (api.yaml v1.2.0)
LML endpoint: WXYC/library-metadata-lookup#272 / PR #273

Acceptance criteria (BS#802)

SELECTs libraries needing identity refresh: library.canonical_entity_id IS NOT NULL OR library.id IN (SELECT library_id FROM library_identity WHERE last_verified_at < NOW() - interval '7 days').
POSTs to LML /api/v1/identity/bulk-resolve-libraries in batches of up to 500 (LML's cap is 1000; the smaller batch size leaves headroom).
UPSERTs each BulkResolveResult into library_identity + library_identity_source atomically per library_id inside db.transaction().
Emits Sentry-traced metrics: rows_resolved / rows_unresolved / rows_skipped / lml_total_latency_ms / lml_total_calls (as attributes on a top-level library-identity-consumer.run span; per-batch LML POST wrapped in lml.bulk_resolve_libraries http.client span with LML's cache_stats projected as lml.cache.*).
DRY_RUN emits a locked-schema JSON object on stdout. In DRY_RUN the loop still calls LML so resolve / unresolved / error counts are honest predictions; only DB writes are suppressed.
Old jobs/library-identity-backfill/ and Dockerfile.library-identity-backfill deleted in the same PR (no tombstone scripts).
Idempotent rerun via UPSERT. On a batch-level LML error the orchestrator counts every input as rows_skipped { lml_error } and continues; the next run re-picks failed rows via the SELECT predicate (retry is free).
Wired into Manual Build & Deploy. The deploy-base.yml matrix discovers targets dynamically via jobs/$TARGET_APP/package.json's job-type field, so "job-type": "one-shot" in package.json is sufficient — no workflow edit needed.
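
The batching and batch-level failure accounting described in the criteria above can be sketched roughly as follows. This is a minimal illustration, not the job's actual code: `Totals`, `toBatches`, and `countBatchFailure` are hypothetical names, and only the batch size of 500 and the `lml_error` bucket come from this PR.

```typescript
// Hypothetical shapes for illustration; the job's real types are not shown here.
type Totals = {
  rows_resolved: number;
  rows_unresolved: number;
  rows_skipped: Record<string, number>;
};

// 500 per the acceptance criteria, below LML's cap of 1000.
const BATCH_SIZE = 500;

// Split ids into consecutive batches of at most `size` items.
function toBatches<T>(items: readonly T[], size: number = BATCH_SIZE): T[][] {
  if (size <= 0 || Math.floor(size) !== size) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// On a batch-level LML error, count every input in the batch as skipped and
// let the caller continue with the next batch.
function countBatchFailure(totals: Totals, batch: readonly unknown[]): void {
  const previous = totals.rows_skipped["lml_error"] ?? 0;
  totals.rows_skipped["lml_error"] = previous + batch.length;
}
```

Because failed rows are only counted and never written, the next run's SELECT predicate should pick them up again without any retry bookkeeping in the job itself.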

Column-name correction

BS#802's body wrote last_refreshed_at, but the column on library_identity is last_verified_at. The SELECT predicate, the writer, and the DRY_RUN report all use the actual column name. Surfacing here for the reviewer.

Out-of-scope follow-ups

library_identity artist-ID column gap. LML's ReconciledIdentity carries discogs_artist_id, musicbrainz_artist_id, and bandcamp_id for the artist, but library_identity has no main-row destinations for those columns today. The values flow through library_identity_source.external_id (text) via provenance rows, so no data is dropped, but the main row is a partial denormalised view until a follow-up migration adds artist-id columns. The migration is deliberately out of scope for this PR and should be tracked as a separate ticket.
Compilation handling. Per BS#802's note that compilation track-level identity is BS#801's scope, kind: 'compilation' results are counted as rows_skipped { compilation } and not written. The writer's surface area for compilations (library_track_identity_source) is BS#801's PR.
Integration test. The acceptance criteria mention an integration test covering happy path + retry + unresolved row. The repo's integration tests run sequentially against a real Docker PG and would additionally require an LML fixture server. The orchestrator's contract is fully exercised by unit-level tests with a mocked bulkResolve; lml-fetch is covered by a fetch-stubbed unit test suite. An end-to-end test against a fixture LML would be a useful future addition.
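
The per-kind handling above (compilations counted and skipped, unresolved rows counted, single-artist rows written) might look like the sketch below. The result shape is deliberately simplified, `dispatchResults` is a hypothetical name, and the `write` callback stands in for the transactional UPSERT; it is synchronous here only to keep the example small.

```typescript
// Hypothetical, simplified result shape; LML's real payload carries more fields.
type BulkResolveResult = {
  kind: "single_artist" | "unresolved" | "compilation";
  library_id: number;
};

type Counts = {
  rows_resolved: number;
  rows_unresolved: number;
  rows_skipped: { compilation: number };
};

// Route each result by `kind` and tally the outcome.
function dispatchResults(
  results: readonly BulkResolveResult[],
  write: (result: BulkResolveResult) => void,
  dryRun: boolean,
): Counts {
  const counts: Counts = {
    rows_resolved: 0,
    rows_unresolved: 0,
    rows_skipped: { compilation: 0 },
  };
  for (const result of results) {
    switch (result.kind) {
      case "single_artist":
        if (!dryRun) write(result); // DRY_RUN suppresses only the DB write
        counts.rows_resolved += 1;
        break;
      case "unresolved":
        counts.rows_unresolved += 1; // counted, never written
        break;
      case "compilation":
        counts.rows_skipped.compilation += 1; // deferred to BS#801
        break;
    }
  }
  return counts;
}
```

Keeping the counts identical between a real run and DRY_RUN is what makes the dry-run report a usable prediction of the real run.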

Test plan

npm run typecheck — passes
npm run lint — 0 errors (warnings only, all pre-existing)
npm run format:check — passes
npm run build — all workspaces including the new @wxyc/library-identity-consumer build cleanly
npm run test:unit — 141 suites, 1778 tests pass (49 new tests in tests/unit/jobs/library-identity-consumer/ across select / writer / orchestrate / lml-fetch)
CI green on this PR

Implements BS#802 under the post-#800 cross-cache-identity pivot: Backend is now a thin writer; LML is the sole composer of cross-cache identity. The new `jobs/library-identity-consumer/` package consumes LML's `POST /api/v1/identity/bulk-resolve-libraries` endpoint (api.yaml v1.2.0, wxyc-shared#104; deployed via LML#272 / PR #273) and UPSERTs the verdicts into `library_identity` + `library_identity_source` atomically per library_id.

The SELECT predicate picks libraries needing identity refresh: `library.canonical_entity_id IS NOT NULL OR library.id IN (SELECT library_id FROM library_identity WHERE last_verified_at < NOW() - interval '7 days')`. Note: BS#802's body wrote `last_refreshed_at`, but the actual column on `library_identity` is `last_verified_at`; the code uses the real column name.

Batches up to 500 inputs per LML call (LML caps at 1000). For each `BulkResolveResult`: `single_artist` → write per-source rows + main row in `db.transaction()`; `unresolved` → counted, no write; `compilation` → counted as `rows_skipped { compilation }` and deferred to BS#801 (per-track identity writes for V/A rows). A batch-level LML error counts every input as `rows_skipped { lml_error }` and continues; the next run re-picks failed rows via the SELECT predicate, so retry is free.

Sentry-traced metrics (`rows_resolved`, `rows_unresolved`, `rows_skipped`, `lml_total_calls`, `lml_total_latency_ms`) land as attributes on a top-level `library-identity-consumer.run` span. The per-batch LML POST is wrapped in `lml.bulk_resolve_libraries` (`http.client`); LML's `cache_stats` projects onto the same span as `lml.cache.*` attributes (LML#229 pattern).

DRY_RUN still calls LML so the resolve/unresolved/error counts are honest predictions; only DB writes are suppressed, and a locked-schema JSON object is emitted on stdout.

Deletes the predecessor `jobs/library-identity-backfill/` and `Dockerfile.library-identity-backfill` in the same change (BS#802 acceptance). The deploy-base.yml matrix discovers targets dynamically via `jobs/$TARGET_APP/package.json`'s `job-type` field, so registering the new one-shot target requires no workflow edit. CLAUDE.md and docs/env-vars.md are updated to reflect the rename and the new env vars (`LIBRARY_METADATA_URL`, `LML_API_KEY`, `STALE_THRESHOLD_DAYS`).

Known scope cuts called out for follow-up: `library_identity` has no main-row destinations for `discogs_artist_id`, `musicbrainz_artist_id`, or `bandcamp_id` from LML's `ReconciledIdentity` payload. Those values flow through `library_identity_source.external_id` (text) via provenance rows so no data is dropped, but the main row is a partial denormalised view until a follow-up migration adds artist-id columns (deliberately out of scope here). Compilation handling is BS#801's scope.

Test coverage (34 tests across select / writer / orchestrate): SELECT predicate honors the post-#800 disjunction and env var validation; writer is transactional with per-source-then-main ordering and null-confidence skipping; orchestrator dispatches by kind, counts LML and writer errors without aborting the run, paginates by id-cursor, and emits the locked DRY_RUN JSON schema.
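
For readers unfamiliar with id-cursor pagination as mentioned in the test coverage, a minimal sketch follows. It assumes positive integer ids and a hypothetical `fetchPage` callback standing in for a query of the form `... WHERE id > cursor ORDER BY id LIMIT n`; the job's actual query and helper names are not shown in this PR.

```typescript
type Row = { id: number };

// Hypothetical stand-in for the SELECT: returns up to `limit` rows with
// id > cursor, ordered by id ascending.
type FetchPage = (cursor: number, limit: number) => Row[];

// Walk every page, handing each one to `onPage`, and return the number of
// rows scanned. The cursor is the last id seen, so no OFFSET is needed.
function forEachPage(
  fetchPage: FetchPage,
  limit: number,
  onPage: (rows: Row[]) => void,
): number {
  let cursor = 0; // assumes ids are positive integers
  let scanned = 0;
  while (true) {
    const page = fetchPage(cursor, limit);
    const last = page[page.length - 1];
    if (last === undefined) break; // empty page: nothing left to read
    onPage(page);
    scanned += page.length;
    cursor = last.id;
    if (page.length < limit) break; // short page: this was the final one
  }
  return scanned;
}
```

A cursor on the primary key keeps each page query cheap and stable even if rows are written between pages, which OFFSET-based paging does not guarantee.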

Adds 15 unit tests for the lml-fetch module covering URL construction (trailing /api/v1 and trailing-slash stripping, idempotent against the legacy convention), the conditional Authorization: Bearer header, the request body shape (POSTs `{ inputs }` verbatim), Sentry span wrapping (name=lml.bulk_resolve_libraries, op=http.client; batch size attribute; cache_stats projection onto the span as lml.cache.*), the defensive array-narrowing guard on cache_stats, the swallow-on-setAttributes-throw safety net, and the three error paths (non-2xx, AbortError → timed-out message, generic network error rethrown). Each behavior has at least one realistic failure mode (LML_API_KEY set but header not sent, URL double-slashed, Sentry span crashing on malformed cache_stats payload) that would silently fail in production without this coverage.
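
The URL and header behaviors those tests cover can be illustrated with a small sketch. It assumes the "legacy convention" means a base URL that already ends in `/api/v1`; the function names here are hypothetical and the real lml-fetch module may be structured differently.

```typescript
const BULK_RESOLVE_PATH = "/api/v1/identity/bulk-resolve-libraries";

// Normalise the configured base URL: drop trailing slashes, then drop a
// trailing /api/v1 (assumed legacy convention) so the path is never doubled.
function buildBulkResolveUrl(baseUrl: string): string {
  const trimmed = baseUrl.replace(/\/+$/, "").replace(/\/api\/v1$/, "");
  return trimmed + BULK_RESOLVE_PATH;
}

// Send Authorization only when an API key is configured.
function buildHeaders(apiKey: string | undefined): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (apiKey) {
    headers["Authorization"] = `Bearer ${apiKey}`;
  }
  return headers;
}

// The request body wraps the inputs verbatim under an `inputs` key.
function buildBody(inputs: readonly unknown[]): string {
  return JSON.stringify({ inputs });
}
```

With this normalisation, `https://lml.example`, `https://lml.example/`, and `https://lml.example/api/v1/` (hypothetical hosts) all resolve to the same endpoint URL.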

…structured LML errors

Four follow-ups from in-session review of #807.

Medium: `lml-types.ts:54` typed `discogs_artist_id` as `string | null`; api.yaml v1.2.0 says `integer`. Corrected to `number | null` and updated the writer-test fixture (`'D-1'` → `12345`) so the type assertion is honest. Harmless today because `projectMainRow` drops the field, but a follow-up migration that adds main-row destinations for the artist-level IDs would otherwise double-coerce silently.

Medium: `Totals.rows_skipped.null_confidence_provenance` mixed units. The other `rows_skipped.*` buckets count library_ids, but this counter counted source rows within a successful library_id write. As a result the library_id-level invariant `scanned == resolved + unresolved + sum(rows_skipped.values())` was false. Lifted the source-row counter out into a sibling field `Totals.source_rows_skipped_null_confidence` (and `DryRunReport.source_rows_skipped_null_confidence`) so the library_id-level sum stays clean. A new regression test pins the invariant.

Medium: added a defensive cardinality check on the LML response. api.yaml v1.2.0 preserves order but doesn't contractually guarantee 1:1 cardinality. Before this PR an under-cardinality response silently under-reported `scanned`. The orchestrator now logs `lml_cardinality_mismatch`, captures to Sentry, and counts the missing inputs under a new `rows_skipped { lml_cardinality_mismatch }` bucket. New test covers the under-cardinality path.

Low: restructured LML failures as a typed `LmlFetchError` carrying `status: number | null` and `retryable: boolean`. 5xx → retryable, 4xx → not, timeout/network → retryable, status=null. The orchestrator's behavior is unchanged today (every failure still goes through `rows_skipped { lml_error }`) but a future retry-with-backoff layer can pivot on the structured fields rather than parsing the message string. Existing error-path tests rewritten to assert the `LmlFetchError` shape; added a non-retryable 4xx case.

Pre-flight clean: typecheck + lint + format:check pass; 1781 unit tests pass (4 new in the consumer suite covering counter-unit cleanliness, under-cardinality, non-retryable 4xx).
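
A rough sketch of the structured error and the library_id-level invariant from this commit is shown below. Only `LmlFetchError`, its `status` and `retryable` fields, the `Totals` field names, and the bucket names come from the commit message; the helper functions are hypothetical, and treating every non-5xx status as non-retryable is an assumption beyond the stated 4xx rule.

```typescript
class LmlFetchError extends Error {
  readonly status: number | null;
  readonly retryable: boolean;
  constructor(message: string, status: number | null, retryable: boolean) {
    super(message);
    this.name = "LmlFetchError";
    this.status = status;
    this.retryable = retryable;
  }
}

// 5xx is retryable; 4xx is not (other non-2xx statuses are treated as
// non-retryable here, which is an assumption).
function fromHttpStatus(status: number): LmlFetchError {
  return new LmlFetchError(`LML responded with HTTP ${status}`, status, status >= 500);
}

// Timeouts and network failures carry no HTTP status and are retryable.
function fromNetworkFailure(reason: string): LmlFetchError {
  return new LmlFetchError(reason, null, true);
}

type Totals = {
  scanned: number;
  rows_resolved: number;
  rows_unresolved: number;
  rows_skipped: Record<string, number>; // every bucket counts library_ids
  source_rows_skipped_null_confidence: number; // counts source rows, kept separate
};

// Count inputs that LML did not return a result for.
function recordCardinalityMismatch(totals: Totals, inputs: number, results: number): void {
  const missing = Math.max(0, inputs - results);
  if (missing === 0) return;
  const previous = totals.rows_skipped["lml_cardinality_mismatch"] ?? 0;
  totals.rows_skipped["lml_cardinality_mismatch"] = previous + missing;
}

// scanned == resolved + unresolved + sum(rows_skipped.values())
function holdsLibraryIdInvariant(totals: Totals): boolean {
  const skipped = Object.keys(totals.rows_skipped).reduce(
    (sum, key) => sum + (totals.rows_skipped[key] ?? 0),
    0,
  );
  return totals.scanned === totals.rows_resolved + totals.rows_unresolved + skipped;
}
```

Keeping the source-row counter outside `rows_skipped` is what lets the invariant be checked with a plain sum over the buckets.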

jakebromberg added 3 commits May 11, 2026 13:41

jakebromberg merged commit c235b88 into main May 11, 2026
5 checks passed

This was referenced May 20, 2026

[Followup] Generalize compilation_track_artist → library_track (cross-cache-identity §3.2 step 2) #801

Open

Extend library-identity-consumer SELECT predicate to cover NULL canonical_entity_id rows (~34K out-of-scope today) #974

Open

jakebromberg merged 3 commits into main from issue/802-bulk-resolve-consumer
