Skip to content

fix(album-level-backfill): pin LML result.index invariant (closes BS#1088)#1303

Merged
jakebromberg merged 1 commit into
mainfrom
bs-1088-result-index-defense
Jun 3, 2026
Merged

fix(album-level-backfill): pin LML result.index invariant (closes BS#1088)#1303
jakebromberg merged 1 commit into
mainfrom
bs-1088-result-index-defense

Conversation

@jakebromberg
Copy link
Copy Markdown
Member

Closes #1088. Slot 3 of close-out tracker BS#1279.

Problem

album-level-backfill's runBatch does resolved[result.index] with no per-row check that the index LML returned actually matches the position BS sent. LML's bulk handler honors the input-order contract today (asyncio.gather over _run_one(index=i)), but a future partial-failure isolation change — a try/except skip mid-batch, or a re-ordering for streaming — would cause BS to silently UPSERT album_metadata against the wrong album_id. Same failure class as BS#1051 in a different code path; the existing if (!album) continue guard handles index-past-end, not index-mismatch.

Change

Iterate by position with an O(1) assert. On result.index !== i:

  • count to a new unexpected_index bucket on BatchResult + BackfillSummary
  • log a structured warn (step: unexpected_result_index, expected_index, got_index, album_id)
  • emit a Sentry breadcrumb (category: album-level-backfill, message: unexpected_result_index, data: { expected, got, album_id })
  • skip the write and continue

Position lookup is O(1); the ticket flags find() (O(N^2) over the batch) as the wrong shape. The existing batch_done log fields are unchanged — unexpected_index is additive so the BS#1078 Phase 3 runbook's jq watchdog stays valid.

Acceptance signal

unexpected_index = 0 on a 24-h prod run confirms LML's current contract; any non-zero value flags a silent LML behavior change before it corrupts album_metadata.

Tests

Parameterized unit test over LML response shapes: in-order, out-of-order swap, missing-index (LML returns N-1 results). Plus a Sentry-breadcrumb shape assertion and a sibling-isolation test (one bad result, one good result -> one UPSERT, unexpected_index=1). 47 tests in the file, 2681 across the unit suite, all green locally.

…1088)

Regression-pin against a future LML refactor that drops the input-order `result.index === i` contract for `/api/v1/lookup/bulk`. Today LML's handler honors that via `asyncio.gather` over `_run_one(index=i)`, but a partial-failure isolation change (mid-batch `try/except` skip) would cause BS to silently UPSERT `album_metadata` against the wrong album_id — same failure class as BS#1051 in a different code path. Iterate by position with an O(1) assert (`response.results[i].index === i`), not `find()`; on mismatch, log + breadcrumb (`category: album-level-backfill`, `message: unexpected_result_index`) + count to a new `unexpected_index` bucket on `BatchResult` and `BackfillSummary` + continue. Per-row defensive only — the BS#1078 Phase 3 runbook's `jq` watchdog stays valid because the existing `batch_done` field set is unchanged; `unexpected_index` is additive. Acceptance signal (24-h prod baseline = 0) lives in the new aggregate field.
@jakebromberg jakebromberg merged commit 12ec537 into main Jun 3, 2026
5 checks passed
jakebromberg added a commit that referenced this pull request Jun 3, 2026
…ge + cap breadcrumbs (closes BS#1316)

Follow-up to PR #1303 / BS#1088. The per-row index assertion + counter were correct, but the signal evaporated without an alert: breadcrumbs only attach to a subsequently-captured event, and the job captured none on healthy-shaped runs. A non-zero unexpected_index would also blow Sentry's 100-entry breadcrumb FIFO under a full LML contract break, evicting legitimate upstream context from the trail attached to the next real error.

Three changes:
- Accumulate first_mismatch_index + first_mismatch_got + mismatch_count inside the loop and emit ONE breadcrumb per batch after, capped regardless of how many rows mismatch.
- Fire Sentry.captureMessage('album-level-backfill.unexpected_index', ...) per non-zero batch with a stable fingerprint so the Sentry Issues view surfaces it and cause-aggregation persists across deploys.
- Extend the runbook's jq aggregator to include .unexpected_index and add a note pointing the operator at the new Sentry issue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

album-level-backfill: trust result.index from LML without defensive subset check

1 participant