
feat(jobs): library-identity-consumer for LML bulk-resolve (#802) #807

Merged
jakebromberg merged 3 commits into main from issue/802-bulk-resolve-consumer
May 11, 2026


@jakebromberg
Member

Summary

Implements #802 under the post-#800 cross-cache-identity pivot: Backend is now a thin writer; LML is the sole composer of cross-cache identity. The new jobs/library-identity-consumer/ package consumes LML's POST /api/v1/identity/bulk-resolve-libraries and UPSERTs the verdicts into library_identity + library_identity_source atomically per library_id. The predecessor jobs/library-identity-backfill/ (which composed identity locally from five legacy sources) is deleted in the same change.

Closes #802.

Prerequisites (merged)

Acceptance criteria (BS#802)

  • SELECTs libraries needing identity refresh: library.canonical_entity_id IS NOT NULL OR library.id IN (SELECT library_id FROM library_identity WHERE last_verified_at < NOW() - interval '7 days').
  • POSTs to LML /api/v1/identity/bulk-resolve-libraries in batches of up to 500 (LML cap is 1000; headroom).
  • UPSERTs each BulkResolveResult into library_identity + library_identity_source atomically per library_id inside db.transaction().
  • Emits Sentry-traced metrics: rows_resolved / rows_unresolved / rows_skipped / lml_total_latency_ms / lml_total_calls (as attributes on a top-level library-identity-consumer.run span; per-batch LML POST wrapped in lml.bulk_resolve_libraries http.client span with LML's cache_stats projected as lml.cache.*).
  • DRY_RUN locked JSON output. In DRY_RUN the loop still calls LML so resolve / unresolved / error counts are honest predictions; only DB writes are suppressed.
  • Old jobs/library-identity-backfill/ and Dockerfile.library-identity-backfill deleted in the same PR (no tombstone scripts).
  • Idempotent rerun via UPSERT. On a batch-level LML error the orchestrator counts every input as rows_skipped { lml_error } and continues; the next run re-picks failed rows via the SELECT predicate (retry is free).
  • Wired into Manual Build & Deploy. The deploy-base.yml matrix discovers targets dynamically via jobs/$TARGET_APP/package.json's job-type field, so "job-type": "one-shot" in package.json is sufficient — no workflow edit needed.
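
The batching criterion above can be sketched as a small chunking helper. This is illustrative only: `chunk` and `BATCH_SIZE` are hypothetical names, and the job itself may page through the database by id-cursor rather than chunking an in-memory array.

```typescript
// From the PR: LML caps a request at 1000 inputs; the job sends at most 500 for headroom.
const BATCH_SIZE = 500;

// Hypothetical helper: split `items` into consecutive batches of at most `size` elements.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError(`batch size must be a positive integer, got ${size}`);
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    batches.push(items.slice(start, start + size));
  }
  return batches;
}

// Usage: 1,201 library ids become batches of 500, 500, and 201.
const ids = Array.from({ length: 1201 }, (_, i) => i + 1);
const batches = chunk(ids, BATCH_SIZE);
console.log(batches.map((b) => b.length)); // [ 500, 500, 201 ]
```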

Column-name correction

BS#802's body wrote last_refreshed_at, but the column on library_identity is last_verified_at. The SELECT predicate, the writer, and the DRY_RUN report all use the actual column name. Surfacing here for the reviewer.

Out-of-scope follow-ups

  1. library_identity artist-ID column gap. LML's ReconciledIdentity carries discogs_artist_id, musicbrainz_artist_id, and bandcamp_id for the artist, but library_identity has no main-row destinations for those columns today. The values flow through library_identity_source.external_id (text) via provenance rows, so no data is dropped — but the main row is a partial denormalised view until a follow-up migration adds artist-id columns. The migration is deliberately out of scope for this PR; should be tracked as a separate ticket.
  2. Compilation handling. Per BS#802's note that compilation track-level identity is BS#801's scope, kind: 'compilation' results are counted as rows_skipped { compilation } and not written. The writer's surface area for compilations (library_track_identity_source) is BS#801's PR.
  3. Integration test. The acceptance criteria mention an integration test covering happy path + retry + unresolved row. The repo's integration tests run sequentially against a real Docker PG and would additionally require an LML fixture server. The orchestrator's contract is fully exercised by unit-level tests with a mocked bulkResolve; lml-fetch is covered by a fetch-stubbed unit test suite. Flagging that an end-to-end test against a fixture LML would be a useful future add.

Test plan

  • npm run typecheck — passes
  • npm run lint — 0 errors (warnings only, all pre-existing)
  • npm run format:check — passes
  • npm run build — all workspaces including the new @wxyc/library-identity-consumer build cleanly
  • npm run test:unit — 141 suites, 1778 tests pass (49 new tests in tests/unit/jobs/library-identity-consumer/ across select / writer / orchestrate / lml-fetch)
  • CI green on this PR

Implements BS#802 under the post-#800 cross-cache-identity pivot: Backend is now a thin writer; LML is the sole composer of cross-cache identity. The new `jobs/library-identity-consumer/` package consumes LML's `POST /api/v1/identity/bulk-resolve-libraries` endpoint (api.yaml v1.2.0, wxyc-shared#104; deployed via LML#272 / PR #273) and UPSERTs the verdicts into `library_identity` + `library_identity_source` atomically per library_id.

The SELECT predicate picks libraries needing identity refresh: `library.canonical_entity_id IS NOT NULL OR library.id IN (SELECT library_id FROM library_identity WHERE last_verified_at < NOW() - interval '7 days')`. Note: BS#802's body wrote `last_refreshed_at`, but the actual column on `library_identity` is `last_verified_at`; the code uses the real column name.
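
A minimal sketch of how the predicate above might be assembled as a parameterised query with id-cursor pagination. The function names, the default of 7 days when `STALE_THRESHOLD_DAYS` is unset, and the use of PostgreSQL's `make_interval` to parameterise the interval are assumptions for illustration, not the job's actual code.

```typescript
// Hypothetical: validate STALE_THRESHOLD_DAYS; the default of 7 mirrors the predicate in the PR.
function parseStaleThresholdDays(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return 7;
  const days = Number(raw);
  if (!Number.isInteger(days) || days <= 0) {
    throw new Error(`STALE_THRESHOLD_DAYS must be a positive integer, got "${raw}"`);
  }
  return days;
}

// Hypothetical: build the SELECT with $n placeholders, paging by id-cursor.
// $1 = last id seen, $2 = page size, $3 = staleness threshold in days.
function buildSelect(
  afterId: number,
  limit: number,
  staleDays: number,
): { text: string; values: number[] } {
  const text = [
    "SELECT library.id FROM library",
    "WHERE library.id > $1",
    "  AND (library.canonical_entity_id IS NOT NULL",
    "       OR library.id IN (",
    "         SELECT library_id FROM library_identity",
    "         WHERE last_verified_at < NOW() - make_interval(days => $3)))",
    "ORDER BY library.id",
    "LIMIT $2",
  ].join("\n");
  return { text, values: [afterId, limit, staleDays] };
}
```

Ordering by `library.id` and filtering on `library.id > $1` keeps each page stable even if rows are refreshed between pages, which is the usual reason to prefer a cursor over `OFFSET`.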

Batches up to 500 inputs per LML call (LML caps at 1000). For each `BulkResolveResult`: `single_artist` → write per-source rows + main row in `db.transaction()`; `unresolved` → counted, no write; `compilation` → counted as `rows_skipped { compilation }` and deferred to BS#801 (per-track identity writes for V/A rows). A batch-level LML error counts every input as `rows_skipped { lml_error }` and continues — the next run re-picks failed rows via the SELECT predicate, so retry is free.
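
The dispatch-by-kind rule above, sketched as a pure tallying function. The `kind` values and counter names come from this PR; the `library_id` field on each result, the `DispatchTotals` type, and both function names are assumptions. For brevity the sketch counts a `single_artist` result as resolved at dispatch time, whereas a real run would presumably count it only after the transactional write succeeds.

```typescript
// Assumed result shape: only `kind` is taken from the PR text.
type BulkResolveResult =
  | { kind: "single_artist"; library_id: number }
  | { kind: "unresolved"; library_id: number }
  | { kind: "compilation"; library_id: number };

interface DispatchTotals {
  rows_resolved: number;
  rows_unresolved: number;
  rows_skipped: { compilation: number; lml_error: number };
}

// Tally one batch of results and return the library_ids that need a DB write.
function dispatchBatch(
  results: readonly BulkResolveResult[],
  totals: DispatchTotals,
): number[] {
  const toWrite: number[] = [];
  for (const result of results) {
    switch (result.kind) {
      case "single_artist":
        totals.rows_resolved += 1;
        toWrite.push(result.library_id);
        break;
      case "unresolved":
        totals.rows_unresolved += 1; // counted, no write
        break;
      case "compilation":
        totals.rows_skipped.compilation += 1; // deferred to BS#801
        break;
    }
  }
  return toWrite;
}

// On a batch-level LML error every input in the batch is counted as skipped.
function skipBatch(batchSize: number, totals: DispatchTotals): void {
  totals.rows_skipped.lml_error += batchSize;
}
```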

Sentry-traced metrics (`rows_resolved`, `rows_unresolved`, `rows_skipped`, `lml_total_calls`, `lml_total_latency_ms`) land as attributes on a top-level `library-identity-consumer.run` span. The per-batch LML POST is wrapped in `lml.bulk_resolve_libraries` (`http.client`); LML's `cache_stats` projects onto the same span as `lml.cache.*` attributes (LML#229 pattern). DRY_RUN still calls LML so the resolve/unresolved/error counts are honest predictions; only DB writes are suppressed, and a locked-schema JSON object is emitted on stdout.
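
The `cache_stats` projection described above could look roughly like the following. This is a sketch under stated assumptions: the shape of `cache_stats` is not given in this PR, so the function accepts `unknown`, rejects arrays and non-objects, and copies only primitive values; `projectCacheStats` and the `hits` / `misses` keys in the usage lines are hypothetical.

```typescript
type SpanAttributeValue = string | number | boolean;

// Hypothetical: turn LML's cache_stats object into flat `lml.cache.*` span attributes.
// Anything that is not a plain object (null, arrays, primitives) yields no attributes.
function projectCacheStats(cacheStats: unknown): Record<string, SpanAttributeValue> {
  const attrs: Record<string, SpanAttributeValue> = {};
  if (
    cacheStats === null ||
    typeof cacheStats !== "object" ||
    Array.isArray(cacheStats)
  ) {
    return attrs;
  }
  for (const [key, value] of Object.entries(cacheStats as Record<string, unknown>)) {
    // Nested objects and nulls are dropped; only primitive values are copied.
    if (
      typeof value === "string" ||
      typeof value === "number" ||
      typeof value === "boolean"
    ) {
      attrs[`lml.cache.${key}`] = value;
    }
  }
  return attrs;
}

// Usage with made-up keys: the result is ready to pass to a span's attribute setter.
console.log(projectCacheStats({ hits: 3, misses: 1 }));
console.log(projectCacheStats([1, 2, 3])); // {}
```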

Deletes the predecessor `jobs/library-identity-backfill/` and `Dockerfile.library-identity-backfill` in the same change (BS#802 acceptance). The deploy-base.yml matrix discovers targets dynamically via `jobs/$TARGET_APP/package.json`'s `job-type` field, so registering the new one-shot target requires no workflow edit. CLAUDE.md and docs/env-vars.md are updated to reflect the rename and the new env vars (`LIBRARY_METADATA_URL`, `LML_API_KEY`, `STALE_THRESHOLD_DAYS`).

Known scope cuts called out for follow-up: `library_identity` has no main-row destinations for `discogs_artist_id`, `musicbrainz_artist_id`, or `bandcamp_id` from LML's `ReconciledIdentity` payload — those values flow through `library_identity_source.external_id` (text) via provenance rows so no data is dropped, but the main row is a partial denormalised view until a follow-up migration adds artist-id columns (deliberately out of scope here). Compilation handling is BS#801's scope.

Test coverage (34 tests across select / writer / orchestrate): SELECT predicate honors the post-#800 disjunction and env var validation; writer is transactional with per-source-then-main ordering and null-confidence skipping; orchestrator dispatches by kind, counts LML and writer errors without aborting the run, paginates by id-cursor, and emits the locked DRY_RUN JSON schema.
Adds 15 unit tests for the lml-fetch module covering URL construction (trailing /api/v1 and trailing-slash stripping, idempotent against the legacy convention), the conditional Authorization: Bearer header, the request body shape (POSTs `{ inputs }` verbatim), Sentry span wrapping (name=lml.bulk_resolve_libraries, op=http.client; batch size attribute; cache_stats projection onto the span as lml.cache.*), the defensive array-narrowing guard on cache_stats, the swallow-on-setAttributes-throw safety net, and the three error paths (non-2xx, AbortError → timed-out message, generic network error rethrown).

Each behavior has at least one realistic failure mode (LML_API_KEY set but header not sent, URL double-slashed, Sentry span crashing on malformed cache_stats payload) that would silently fail in production without this coverage.
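
For illustration, the URL and header behaviors covered by these tests might reduce to two small helpers. The helper names and the `lml.example.invalid` host are hypothetical; only the endpoint path, the trailing `/api/v1` and trailing-slash stripping, and the conditional `Authorization: Bearer` header come from this PR.

```typescript
const BULK_RESOLVE_PATH = "/api/v1/identity/bulk-resolve-libraries";

// Hypothetical: accept a base URL with or without a trailing slash or a
// trailing /api/v1 (the legacy convention) and always produce one clean URL.
function buildBulkResolveUrl(baseUrl: string): string {
  const trimmed = baseUrl.replace(/\/+$/, "").replace(/\/api\/v1$/, "");
  return `${trimmed}${BULK_RESOLVE_PATH}`;
}

// Hypothetical: send the bearer token only when an API key is configured.
function buildHeaders(apiKey: string | undefined): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (apiKey) {
    headers["Authorization"] = `Bearer ${apiKey}`;
  }
  return headers;
}

// Usage: all four spellings of the base URL resolve to the same endpoint.
for (const base of [
  "https://lml.example.invalid",
  "https://lml.example.invalid/",
  "https://lml.example.invalid/api/v1",
  "https://lml.example.invalid/api/v1/",
]) {
  console.log(buildBulkResolveUrl(base));
}
```
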
…structured LML errors

Four follow-ups from in-session review of #807.

Medium — `lml-types.ts:54` typed `discogs_artist_id` as `string | null`; api.yaml v1.2.0 says `integer`. Corrected to `number | null` and updated the writer-test fixture (`'D-1'` → `12345`) so the type assertion is honest. Harmless today because `projectMainRow` drops the field, but a follow-up migration that adds main-row destinations for the artist-level IDs would otherwise double-coerce silently.

Medium — `Totals.rows_skipped.null_confidence_provenance` mixed units: the other `rows_skipped.*` buckets count library_ids, but this counter counted source rows within a successful library_id write. As a result the library_id-level invariant `scanned == resolved + unresolved + sum(rows_skipped.values())` was false. Lifted the source-row counter out into a sibling field `Totals.source_rows_skipped_null_confidence` (and `DryRunReport.source_rows_skipped_null_confidence`) so the library_id-level sum stays clean. A new regression test pins the invariant.
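
The library_id-level invariant can be expressed as a one-line check. The `Totals` shape below is an assumption reconstructed from the counter names in this PR, and `invariantHolds` is a hypothetical name.

```typescript
// Assumed shape: every rows_skipped.* bucket counts library_ids; the source-row
// counter lives in a sibling field so it does not pollute the library_id sum.
interface Totals {
  scanned: number;
  rows_resolved: number;
  rows_unresolved: number;
  rows_skipped: Record<string, number>;
  source_rows_skipped_null_confidence: number;
}

// scanned == resolved + unresolved + sum(rows_skipped.values())
function invariantHolds(t: Totals): boolean {
  const skipped = Object.values(t.rows_skipped).reduce((sum, n) => sum + n, 0);
  return t.scanned === t.rows_resolved + t.rows_unresolved + skipped;
}

// Usage: source-row skips are reported but do not affect the library_id sum.
const example: Totals = {
  scanned: 10,
  rows_resolved: 6,
  rows_unresolved: 1,
  rows_skipped: { compilation: 2, lml_error: 1 },
  source_rows_skipped_null_confidence: 4,
};
console.log(invariantHolds(example)); // true
```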

Medium — added a defensive cardinality check on the LML response. api.yaml v1.2.0 preserves order but doesn't contractually guarantee 1:1 cardinality. Before this PR an under-cardinality response silently under-reported `scanned`. The orchestrator now logs `lml_cardinality_mismatch`, captures to Sentry, and counts the missing inputs under a new `rows_skipped { lml_cardinality_mismatch }` bucket. New test covers the under-cardinality path.
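
One way to detect an under-cardinality response is to diff the input ids against the ids LML returned. This sketch assumes each result echoes its `library_id`, which this PR does not confirm; the actual check may simply compare array lengths. `findMissingLibraryIds` is a hypothetical name.

```typescript
interface HasLibraryId {
  library_id: number;
}

// Return the input library_ids for which the response contains no result.
function findMissingLibraryIds(
  inputs: readonly HasLibraryId[],
  results: readonly HasLibraryId[],
): number[] {
  const returned = new Set(results.map((r) => r.library_id));
  return inputs
    .filter((input) => !returned.has(input.library_id))
    .map((input) => input.library_id);
}

// Usage: three inputs, two results; id 2 would be counted under
// rows_skipped { lml_cardinality_mismatch }.
const missing = findMissingLibraryIds(
  [{ library_id: 1 }, { library_id: 2 }, { library_id: 3 }],
  [{ library_id: 1 }, { library_id: 3 }],
);
console.log(missing); // [ 2 ]
```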

Low — restructured LML failures as a typed `LmlFetchError` carrying `status: number | null` and `retryable: boolean`. 5xx → retryable, 4xx → not, timeout/network → retryable, status=null. The orchestrator's behavior is unchanged today (every failure still goes through `rows_skipped { lml_error }`) but a future retry-with-backoff layer can pivot on the structured fields rather than parsing the message string. Existing error-path tests rewritten to assert the `LmlFetchError` shape; added a non-retryable 4xx case.
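
A minimal sketch of what such a typed error could look like, following the status-to-retryable mapping described above. The field names `status` and `retryable` come from this PR; the constructor signature, the two factory helpers, and the message wording are assumptions.

```typescript
class LmlFetchError extends Error {
  readonly status: number | null;
  readonly retryable: boolean;

  constructor(message: string, status: number | null, retryable: boolean) {
    super(message);
    this.name = "LmlFetchError";
    this.status = status;
    this.retryable = retryable;
    // Keeps `instanceof LmlFetchError` working when compiled to ES5 targets.
    Object.setPrototypeOf(this, LmlFetchError.prototype);
  }
}

// Hypothetical factory for non-2xx responses: 5xx is retryable, anything else is not.
function lmlErrorFromStatus(status: number): LmlFetchError {
  return new LmlFetchError(`LML responded with HTTP ${status}`, status, status >= 500);
}

// Hypothetical factory for timeouts and network failures: no status, always retryable.
function lmlErrorFromNetwork(reason: string): LmlFetchError {
  return new LmlFetchError(`LML request failed: ${reason}`, null, true);
}

// Usage: a future retry layer can branch on the fields, not the message text.
const err = lmlErrorFromStatus(503);
console.log(err.status, err.retryable); // 503 true
```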

Pre-flight clean: typecheck + lint + format:check pass; 1781 unit tests pass (3 new in the consumer suite covering counter-unit cleanliness, under-cardinality, non-retryable 4xx).

Development

Successfully merging this pull request may close these issues.

[§4 step 2 post-pivot] Consume LML POST /api/v1/identity/bulk-resolve-libraries
