Skip to content

Migration 0053 wedge: split DDL from backfill, harden migration runner #511

@jakebromberg

Description

@jakebromberg

Plan: stop the migration-0053 wedge and harden against recurrence

Context (incident 2026-04-25/26)

A deploy of flowsheet-etl:v0.1.105 on 2026-04-25 ran migration 0053_flowsheet-dj-name-column.sql. That migration mixes DDL (ALTER TABLE ADD COLUMN dj_name) and bulk DML (UPDATE every track row from a 3-table join) inside one transaction. The DDL acquires AccessExclusiveLock. The UPDATE rewrites every row of a 2.6M-row table — runtime ~hours. The deploy step's SSH wrapper timed out long before the UPDATE finished; the container was killed but the PostgreSQL backend continued running orphaned, holding the AccessExclusiveLock. Every subsequent flowsheet-etl-cron (every 30 min) and the deploy migrate job piled fresh queries behind the lock. By the time we noticed, pg_stat_activity showed 48+ blocked queries, the oldest 13 hours old. The backend container was marked unhealthy because its healthcheck queries against flowsheet were also blocked.

Root cause is three compounding code-level problems: (1) migration mixes DDL and bulk DML in one txn; (2) flowsheet-etl/job.ts:85-95 resolveDjNames() runs the same unbounded UPDATE on every cron tick; (3) no statement_timeout / lock_timeout on the migration runner so orphaned backends can hold locks indefinitely.

Stop-phase already applied: cron entry disabled (later overwritten by an in-flight deploy), original migration backend cancelled, full transaction rolled back (so dj_name column does not currently exist, and migrations 0053/0054/0055 are unapplied per drizzle.__drizzle_migrations), and a fresh db-migrate container that an in-flight Auto Build & Deploy had started was stopped before it could re-create the wedge.

Goal

Ship the planned dj_name denormalization (already coded against in backend:v0.1.173+ per commit fe52818) without wedging production, and prevent the same DDL-plus-heavy-DML failure mode from recurring on any future migration or ETL pass.

Non-goals / out of scope

  • Migration 0054 (drops + re-adds search_doc as a STORED generated column referencing dj_name) has the same AccessExclusiveLock-for-hours risk per its own production note. This plan blocks 0054 from running but does not redesign it; that's a separate follow-up plan.
  • Migration 0055 (add-reconciled-identity-to-artists) is gated behind 0054 in the journal, so it ships when 0054 ships — also out of scope.
  • Auditing every prior migration for similar patterns: documented as P3 below, not blocking.
  • backend container's unhealthy state — should auto-resolve once locks are clear and stays clear; if not, separate diagnosis.

Plan

Phase 0 — restore deploy safety (URGENT, single PR)

The reason this is urgent: any push to main triggers Auto Build & Deploy, which runs the migrate job, which sees 0053 unapplied and tries to apply the original SQL — re-creating the wedge. Until Phase 0 lands, no other code can be merged to main safely.

P0.1 — Split DDL out of migration 0053. Rewrite shared/database/src/migrations/0053_flowsheet-dj-name-column.sql so its only SQL is ALTER TABLE "wxyc_schema"."flowsheet" ADD COLUMN "dj_name" text;. The header comment becomes a pointer to the backfill job's path and the documented run procedure. The hash of the file changes; that's expected — the migration is unapplied in production, so drizzle's hash check will simply attempt the new (fast) statement on next migrate.

P0.2 — Make flowsheet-etl resolveDjNames scoped. In jobs/flowsheet-etl/job.ts, change resolveDjNames to accept the just-imported entry IDs and only update those: WHERE f.id = ANY($1) AND f.dj_name IS NULL. The bulk-load path (runBulkLoad, line 286) already has every imported ID in scope. The incremental path (line 423) iterates entriesToInsert/entriesImported — collect the IDs there. With this change, resolveDjNames becomes O(batch_size), not O(table_size). Legacy rows are not touched by the ETL anymore — they are the backfill job's responsibility (Phase 1).

P0.3 — Block migration 0054 until backfill is verified. Prepend a DO $$ ... END $$ guard to shared/database/src/migrations/0054_flowsheet-search-doc-with-dj-name.sql that raises if any track row has dj_name IS NULL: IF (SELECT count(*) FROM "wxyc_schema"."flowsheet" WHERE entry_type = 'track' AND dj_name IS NULL) > 0 THEN RAISE EXCEPTION 'flowsheet.dj_name backfill incomplete; run jobs/backfills/flowsheet-dj-name-backfill before applying 0054'; END IF;. The hash changes; same reasoning as P0.1 — the migration is unapplied. After backfill is at 100%, a small follow-up PR strips the guard.

P0.4 — Tests. Update tests/unit/database/schema.flowsheet-dj-name.test.ts to assert the new (DDL-only) shape of 0054 and the precondition guard in 0054. Update tests/unit/jobs/flowsheet-etl/job.djName.test.ts to assert the new contract of resolveDjNames(entryIds: number[]) and that legacy rows are not touched.

PR-1 deliverable: P0.1 + P0.2 + P0.3 + P0.4. Size estimate ~400 lines incl. tests. Branch name fix/migration-0053-split-and-scope-etl. Must merge before any other PR to main.

Phase 1 — write the backfill, then run it

P1.1 — New job jobs/backfills/flowsheet-dj-name-backfill/. Mirror the structure of jobs/flowsheet-etl/: job.ts, package.json (no cron-schedule field — this is one-shot), tsconfig.json, tsup.config.ts. The job:

  • Iterates batches of 5000 rows: WITH batch AS (SELECT id FROM wxyc_schema.flowsheet WHERE dj_name IS NULL AND entry_type = 'track' ORDER BY id LIMIT 5000 FOR UPDATE SKIP LOCKED) UPDATE wxyc_schema.flowsheet f SET dj_name = COALESCE(u.dj_name, s.legacy_dj_name, u.name) FROM wxyc_schema.shows s LEFT JOIN auth_user u ON u.id = s.primary_dj_id WHERE f.show_id = s.id AND f.id IN (SELECT id FROM batch);. Each batch is its own transaction.
  • Logs progress: [backfill] batch N: M rows, total processed K, remaining R, ETA ~T.
  • Exits cleanly when a batch returns 0 rows.
  • Idempotent and restartable — re-running picks up where it left off via WHERE dj_name IS NULL.

P1.2 — Dockerfile + deploy plumbing. Add Dockerfile.flowsheet-dj-name-backfill mirroring Dockerfile.flowsheet-etl. The deploy step at deploy-base.yml:417-421 currently does yq -r '.["cron-schedule"]' jobs/$TARGET_APP/package.json and exits non-zero when the field is missing — that is, today every jobs/<x> is assumed to be a cron job. We need to teach the workflow about one-shot jobs. Path of least resistance: introduce a "job-type" field in package.json that is "cron" by default and "one-shot" for backfills; gate the cron-install block on job-type == cron. Backfill package.json declares "job-type": "one-shot" and omits cron-schedule. The build/push/deploy flow still pushes the image to ECR; only the cron-install step is skipped. The image is invoked manually via gh workflow run "Manual Build & Deploy" -f target=flowsheet-dj-name-backfill -f version=latest followed by an SSH docker run.

File checklist for the backfill job (mirrored from jobs/flowsheet-etl/):

  • jobs/backfills/flowsheet-dj-name-backfill/job.ts
  • jobs/backfills/flowsheet-dj-name-backfill/package.json (with "job-type": "one-shot", no cron-schedule)
  • jobs/backfills/flowsheet-dj-name-backfill/tsconfig.json
  • jobs/backfills/flowsheet-dj-name-backfill/tsup.config.ts
  • Dockerfile.flowsheet-dj-name-backfill at repo root
  • Tests at tests/unit/jobs/backfills/flowsheet-dj-name-backfill/job.test.ts
  • Verify against jobs/rotation-etl/ and jobs/library-etl/ that all four config files match those shapes; copy any pattern they share that flowsheet-etl alone doesn't have.

P1.3 — Tests. Unit tests for the batch query shape, idempotency (re-run on already-backfilled rows is a no-op), and that the legacy auth_user/shows fallbacks resolve correctly. Integration test if there's a fast path in the existing harness.

P1.4 — Manual execution procedure (documented in the PR description, not in code):

  • Schedule for a known low-traffic window (target 4-5am ET on a weekday).
  • Build + push: gh workflow run "Manual Build & Deploy" -f target=flowsheet-dj-name-backfill -f version=latest.
  • SSH to EC2, log into ECR, pull the image.
  • Run with stdout captured: docker run --rm --env-file .env <image> 2>&1 | tee /tmp/backfill-$(date +%s).log.
  • Estimated runtime: 2.6M rows ÷ 5000/batch = ~520 batches; if each batch is 2-5s that's 17-45 min. Watch for outliers in the log.
  • Verify: SELECT count(*) FROM wxyc_schema.flowsheet WHERE entry_type = 'track' AND dj_name IS NULL; returns 0.
  • Drop the 0054 guard in a tiny follow-up PR; next deploy unblocks 0054 + 0055.

PR-2 deliverable: P1.1 + P1.2 + P1.3. Size estimate ~300 lines incl. tests. Branch name fix/flowsheet-dj-name-backfill-job. Lands after PR-1 deploy is verified.

Phase 2 — defense in depth

These are independent of Phase 0/1 and could ship in parallel with PR-2. They protect against any future migration or ETL with the same shape.

P2.1 — statement_timeout and lock_timeout on migrations. In dev_env/init-db.mjs, before runMigrations(), run SET statement_timeout = '5min'; SET lock_timeout = '30s'; on the postgres-js client. Migrations that take longer indicate a design problem and should fail fast with a clear error rather than wedge production. (For migrations that are legitimately long, e.g. a future planned 0054, add a SET LOCAL statement_timeout = '2hr'; at the top of the SQL file — opt-in, scoped to that one migration.)

P2.2 — Orphan-migration pre-flight. In dev_env/init-db.mjs, before running migrations, query: SELECT pid, age(now(), query_start) AS age, substring(query, 1, 80) AS q FROM pg_locks JOIN pg_stat_activity USING (pid) WHERE locktype = 'relation' AND mode = 'AccessExclusiveLock' AND relation::regclass::text LIKE 'wxyc_schema.%' AND pid <> pg_backend_pid();. If any rows: log them and exit non-zero with migration aborted: another backend is holding AccessExclusiveLock on wxyc_schema.* (pid=...). Investigate before retrying.. Prevents stacking new migration attempts on top of an orphan.

P2.3 — Bound the migrate step's container lifetime. In .github/workflows/deploy-base.yml line 378, replace docker run --rm --env-file .env -e SKIP_SEED=true $IMAGE_URI with timeout --signal=TERM 8m docker run --rm --env-file .env -e SKIP_SEED=true $IMAGE_URI. If the container exceeds 8 minutes, timeout sends SIGTERM, the postgres-js client closes its connection, the backend rolls back promptly. (8m is comfortably above current normal migrate runtime of <30s and below the orphan-creating timeout window.)

P2.4 — Cron lockout for ETL containers. In .github/workflows/deploy-base.yml line 484, change the cron command template from docker rm -f <name> ... ; docker run --name <name> ... to docker ps -q -f "name=^<name>$" | grep -q . || docker run --rm --name <name> --env-file .env <image>. If a previous instance is still running, the new tick is a no-op. The -rm is added so successful runs clean up; the -f (force-kill) is removed so a hung previous run does not get clobbered before we can investigate. This applies to all jobs (flowsheet-etl, library-etl, rotation-etl).

Two implementation notes for PR-3:

  1. Verify the regex-anchored filter syntax (-f "name=^<name>$") on the deployed Docker version on EC2 before merging — Docker filters with ^/$ anchors are documented but version-dependent. If unsupported, fall back to checking the exit status of docker ps -q --filter name=<name> plus a post-check that the matched container's name is exactly <name> (not a substring like <name>-foo).
  2. Trade-off to document in the PR description: with --rm and no force-kill, a hung container holds its name slot until manual remediation (docker kill <name> && docker rm <name>). This is the desired behavior — visible failure beats silent recovery — but operators need to know the manual remediation command. The runbook entry goes in Backend-Service/CLAUDE.md alongside P3.1.

PR-3 deliverable: P2.1 + P2.2 + P2.3 + P2.4. Size estimate ~200 lines incl. tests. Branch name fix/migration-runner-and-cron-hardening. Lands after PR-1; orderable independently of PR-2.

Phase 3 — documentation

P3.1 — Add to Backend-Service/CLAUDE.md a "Migrations" subsection: migrations are DDL-only; bulk backfills (>10k rows) live in jobs/backfills/<name>/ and are run as one-shot deploys; document the SET LOCAL statement_timeout opt-out for legitimately-long migrations.

P3.2 — Mirror to the wiki/ repo if there's a place for cross-repo conventions; otherwise just Backend-Service/CLAUDE.md.

P3.3 — Audit prior migrations: I already noted 0024_flowsheet_entry_type.sql and 0043_add-has-completed-onboarding.sql have the DDL+DML pattern. Both are already applied; not a current risk. Document them in CLAUDE.md as historical exceptions and the pattern we no longer follow.

PR-4 deliverable: P3.1 + P3.2 + P3.3. Size estimate ~80 lines. Lands any time after PR-1.

Phase 4 — re-enable the cron and unblock 0054 (manual / orchestration only)

After PR-1 + PR-2 are merged, the backfill has been run, and dj_name IS NULL returns 0 for tracks:

  1. Tiny PR-5: drop the precondition guard from migration 0054. Merge → next deploy applies 0054 + 0055.
  2. Verify search service still works: backend:v0.1.173+ is already coded to use flowsheet.dj_name, so once 0054 ships, search reads the new path.
  3. Re-add flowsheet-etl-cron to the crontab via Manual Build & Deploy of flowsheet-etl (the deploy step appends the cron entry per deploy-base.yml:484).
  4. Watch pg_stat_activity and [flowsheet-etl] Resolved dj_name for flowsheet entries. log lines for the first few cron ticks to confirm steady-state behavior.

Sequencing summary

Step What Depends on Production effect
PR-1 DDL-only 0053 + scoped ETL resolveDjNames + 0054 guard + tests nothing Future migrate runs in <1s; 0054 fail-fasts on missing backfill.
PR-3 Migration runner + cron hardening PR-1 (so initial migrate is safe) Future safety only.
PR-2 Backfill job PR-1 deployed None until manually invoked.
Manual Run backfill PR-2 deployed ~2.6M rows over ~17-45 min in low-traffic window.
PR-4 Documentation nothing None.
PR-5 Drop 0054 guard backfill at 100% Unblocks 0054 + 0055 on next deploy.
Manual Re-add flowsheet-etl-cron PR-1 image deployed Cron resumes at *N=*30min.

PR-1 is the critical-path fire stop. Everything else can move at normal pace.

Risk assessment

PR-1: very low. ALTER TABLE ADD COLUMN <nullable text> is metadata-only; instant on any size table. The ETL change has existing test coverage in tests/unit/jobs/flowsheet-etl/job.djName.test.ts to extend. The 0054 precondition guard is the trickiest piece — needs a test asserting it raises when expected and is silent when 0 NULL rows.

PR-2 + manual run: medium. The backfill is long-running but bounded. Each batch commits separately, so worst-case interruption loses one batch's work and is restartable. Watch WAL volume during the run; with 5000-row batches, WAL is bounded per commit. If RDS WAL retention is tight, may want to reduce batch size or pace the loop. Run during low-traffic window (4-5am ET) to minimize concurrent contention.

PR-3: low. Mostly tooling. statement_timeout = 5min is conservative — verify no legitimate migration in the journal currently exceeds 5min by checking the recent applied-migrations log on EC2 first. Cron lockout could mask a runaway container indefinitely; mitigation is a separate watchdog (out of scope), or a manual remediation noted in the docs.

PR-4 / docs: zero risk.

Manual backfill: as above.

Manual re-enable cron: low. The new image's resolveDjNames is bounded; if it misbehaves, kill the container and PR-1's hardening will prevent the previous failure mode.

Rollback strategy

  • PR-1: revert is safe. ALTER TABLE ADD COLUMN is reversible via ALTER TABLE ... DROP COLUMN dj_name in a follow-up migration. ETL change is a code revert. The 0054 guard revert is a one-line edit.
  • Backfill: idempotent under WHERE dj_name IS NULL. To "undo" you'd UPDATE flowsheet SET dj_name = NULL WHERE entry_type = 'track' — destructive and not currently planned for; would need explicit auth.
  • PR-3: tooling reverts only. Re-applying takes one PR.

Open questions for the reviewer

  1. Is option (a) for blocking 0054 (in-file precondition guard) the right gate, or should we revert the 0054 commit on a feature branch and re-land it post-backfill? Trade-off: in-file guard keeps the journal stable but every fresh dev DB seeded with prod-shape data needs the backfill before 0054 can apply; revert keeps fresh deploys clean but bumps migration numbers and touches _journal.json twice (which incident fix: Drizzle migration journal out of sync — migration 0043 silently skipped #400 cited as a footgun).
  2. Should jobs/backfills/ become a permanent convention, or are bulk backfills rare enough that one-off scripts in scripts/ are fine? My read is the former because we want them to be deployable as ECR images with the same env-var contract as ETL jobs.
  3. For statement_timeout = '5min' (P2.1) — is there an existing migration that legitimately needs longer? If so I should grep the journal for runtime evidence before merging. (My quick grep showed 0024 and 0043 as DDL+DML candidates, but both are already applied.)
  4. Cron lockout (P2.4) — should we also add a max-age killswitch, e.g. if previous instance is older than 2 hours, kill it and start fresh? Without one, a wedged container blocks all future runs until manual remediation. Without that we ship the docs path (operator notices via Sentry/log monitoring); with one, we automate it but risk killing legitimate long runs. Lean toward documenting and not auto-killing.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions