feat(jobs): library-identity-consumer for LML bulk-resolve (#802)#807
Merged
Conversation
Implements BS#802 under the post-#800 cross-cache-identity pivot: Backend is now a thin writer; LML is the sole composer of cross-cache identity. The new `jobs/library-identity-consumer/` package consumes LML's `POST /api/v1/identity/bulk-resolve-libraries` endpoint (api.yaml v1.2.0, wxyc-shared#104; deployed via LML#272 / PR #273) and UPSERTs the verdicts into `library_identity` + `library_identity_source` atomically per library_id. The SELECT predicate picks libraries needing identity refresh: `library.canonical_entity_id IS NOT NULL OR library.id IN (SELECT library_id FROM library_identity WHERE last_verified_at < NOW() - interval '7 days')`. Note: BS#802's body wrote `last_refreshed_at`, but the actual column on `library_identity` is `last_verified_at`; the code uses the real column name. Batches up to 500 inputs per LML call (LML caps at 1000). For each `BulkResolveResult`: `single_artist` → write per-source rows + main row in `db.transaction()`; `unresolved` → counted, no write; `compilation` → counted as `rows_skipped { compilation }` and deferred to BS#801 (per-track identity writes for V/A rows). A batch-level LML error counts every input as `rows_skipped { lml_error }` and continues — the next run re-picks failed rows via the SELECT predicate, so retry is free. Sentry-traced metrics (`rows_resolved`, `rows_unresolved`, `rows_skipped`, `lml_total_calls`, `lml_total_latency_ms`) land as attributes on a top-level `library-identity-consumer.run` span. The per-batch LML POST is wrapped in `lml.bulk_resolve_libraries` (`http.client`); LML's `cache_stats` projects onto the same span as `lml.cache.*` attributes (LML#229 pattern). DRY_RUN still calls LML so the resolve/unresolved/error counts are honest predictions; only DB writes are suppressed, and a locked-schema JSON object is emitted on stdout. Deletes the predecessor `jobs/library-identity-backfill/` and `Dockerfile.library-identity-backfill` in the same change (BS#802 acceptance). The deploy-base.yml matrix discovers targets dynamically via `jobs/$TARGET_APP/package.json`'s `job-type` field, so registering the new one-shot target requires no workflow edit. CLAUDE.md and docs/env-vars.md are updated to reflect the rename and the new env vars (`LIBRARY_METADATA_URL`, `LML_API_KEY`, `STALE_THRESHOLD_DAYS`). Known scope cuts called out for follow-up: `library_identity` has no main-row destinations for `discogs_artist_id`, `musicbrainz_artist_id`, or `bandcamp_id` from LML's `ReconciledIdentity` payload — those values flow through `library_identity_source.external_id` (text) via provenance rows so no data is dropped, but the main row is a partial denormalised view until a follow-up migration adds artist-id columns (deliberately out of scope here). Compilation handling is BS#801's scope. Test coverage (34 tests across select / writer / orchestrate): SELECT predicate honors the post-#800 disjunction and env var validation; writer is transactional with per-source-then-main ordering and null-confidence skipping; orchestrator dispatches by kind, counts LML and writer errors without aborting the run, paginates by id-cursor, and emits the locked DRY_RUN JSON schema.
Adds 15 unit tests for the lml-fetch module covering URL construction (trailing /api/v1 and trailing-slash stripping, idempotent against the legacy convention), the conditional Authorization: Bearer header, the request body shape (POSTs `{ inputs }` verbatim), Sentry span wrapping (name=lml.bulk_resolve_libraries, op=http.client; batch size attribute; cache_stats projection onto the span as lml.cache.*), the defensive array-narrowing guard on cache_stats, the swallow-on-setAttributes-throw safety net, and the three error paths (non-2xx, AbortError → timed-out message, generic network error rethrown).
Each behavior has at least one realistic failure mode (LML_API_KEY set but header not sent, URL double-slashed, Sentry span crashing on malformed cache_stats payload) that would silently fail in production without this coverage.
…structured LML errors Four follow-ups from in-session review of #807. Medium — `lml-types.ts:54` typed `discogs_artist_id` as `string | null`; api.yaml v1.2.0 says `integer`. Corrected to `number | null` and updated the writer-test fixture (`'D-1'` → `12345`) so the type assertion is honest. Harmless today because `projectMainRow` drops the field, but a follow-up migration that adds main-row destinations for the artist-level IDs would otherwise double-coerce silently. Medium — `Totals.rows_skipped.null_confidence_provenance` mixed units: the other `rows_skipped.*` buckets count library_ids, but this counter counted source rows within a successful library_id write. As a result the library_id-level invariant `scanned == resolved + unresolved + sum(rows_skipped.values())` was false. Lifted the source-row counter out into a sibling field `Totals.source_rows_skipped_null_confidence` (and `DryRunReport.source_rows_skipped_null_confidence`) so the library_id-level sum stays clean. A new regression test pins the invariant. Medium — added a defensive cardinality check on the LML response. api.yaml v1.2.0 preserves order but doesn't contractually guarantee 1:1 cardinality. Before this PR an under-cardinality response silently under-reported `scanned`. The orchestrator now logs `lml_cardinality_mismatch`, captures to Sentry, and counts the missing inputs under a new `rows_skipped { lml_cardinality_mismatch }` bucket. New test covers the under-cardinality path. Low — restructured LML failures as a typed `LmlFetchError` carrying `status: number | null` and `retryable: boolean`. 5xx → retryable, 4xx → not, timeout/network → retryable, status=null. The orchestrator's behavior is unchanged today (every failure still goes through `rows_skipped { lml_error }`) but a future retry-with-backoff layer can pivot on the structured fields rather than parsing the message string. Existing error-path tests rewritten to assert the `LmlFetchError` shape; added a non-retryable 4xx case. Pre-flight clean: typecheck + lint + format:check pass; 1781 unit tests pass (4 new in the consumer suite covering counter-unit cleanliness, under-cardinality, non-retryable 4xx).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements #802 under the post-#800 cross-cache-identity pivot: Backend is now a thin writer; LML is the sole composer of cross-cache identity. The new
jobs/library-identity-consumer/package consumes LML'sPOST /api/v1/identity/bulk-resolve-librariesand UPSERTs the verdicts intolibrary_identity+library_identity_sourceatomically per library_id. The predecessorjobs/library-identity-backfill/(which composed identity locally from five legacy sources) is deleted in the same change.Closes #802.
Prerequisites (merged)
Acceptance criteria (BS#802)
library.canonical_entity_id IS NOT NULL OR library.id IN (SELECT library_id FROM library_identity WHERE last_verified_at < NOW() - interval '7 days')./api/v1/identity/bulk-resolve-librariesin batches of up to 500 (LML cap is 1000; headroom).BulkResolveResultintolibrary_identity+library_identity_sourceatomically per library_id insidedb.transaction().rows_resolved/rows_unresolved/rows_skipped/lml_total_latency_ms/lml_total_calls(as attributes on a top-levellibrary-identity-consumer.runspan; per-batch LML POST wrapped inlml.bulk_resolve_librarieshttp.client span with LML'scache_statsprojected aslml.cache.*).jobs/library-identity-backfill/andDockerfile.library-identity-backfilldeleted in the same PR (no tombstone scripts).rows_skipped { lml_error }and continues; the next run re-picks failed rows via the SELECT predicate (retry is free).jobs/$TARGET_APP/package.json'sjob-typefield, so"job-type": "one-shot"in package.json is sufficient — no workflow edit needed.Column-name correction
BS#802's body wrote
last_refreshed_at, but the column onlibrary_identityislast_verified_at. The SELECT predicate, the writer, and the DRY_RUN report all use the actual column name. Surfacing here for the reviewer.Out-of-scope follow-ups
library_identityartist-ID column gap. LML'sReconciledIdentitycarriesdiscogs_artist_id,musicbrainz_artist_id, andbandcamp_idfor the artist, butlibrary_identityhas no main-row destinations for those columns today. The values flow throughlibrary_identity_source.external_id(text) via provenance rows, so no data is dropped — but the main row is a partial denormalised view until a follow-up migration adds artist-id columns. The migration is deliberately out of scope for this PR; should be tracked as a separate ticket.kind: 'compilation'results are counted asrows_skipped { compilation }and not written. The writer's surface area for compilations (library_track_identity_source) is BS#801's PR.bulkResolve; lml-fetch is covered by a fetch-stubbed unit test suite. Flagging that an end-to-end test against a fixture LML would be a useful future add.Test plan
npm run typecheck— passesnpm run lint— 0 errors (warnings only, all pre-existing)npm run format:check— passesnpm run build— all workspaces including the new@wxyc/library-identity-consumerbuild cleanlynpm run test:unit— 141 suites, 1778 tests pass (49 new tests intests/unit/jobs/library-identity-consumer/across select / writer / orchestrate / lml-fetch)