Migration 0053 wedge: split DDL from backfill, harden migration runner

# Plan: stop the migration-0053 wedge and harden against recurrence

## Context (incident 2026-04-25/26)

A deploy of `flowsheet-etl:v0.1.105` on 2026-04-25 ran migration `0053_flowsheet-dj-name-column.sql`. That migration mixes DDL (`ALTER TABLE ADD COLUMN dj_name`) and bulk DML (`UPDATE every track row from a 3-table join`) inside one transaction. The DDL acquires `AccessExclusiveLock`. The UPDATE rewrites every row of a 2.6M-row table — runtime ~hours. The deploy step's SSH wrapper timed out long before the UPDATE finished; the container was killed but the PostgreSQL backend continued running orphaned, holding the AccessExclusiveLock. Every subsequent `flowsheet-etl-cron` (every 30 min) and the deploy `migrate` job piled fresh queries behind the lock. By the time we noticed, `pg_stat_activity` showed 48+ blocked queries, the oldest 13 hours old. The `backend` container was marked unhealthy because its healthcheck queries against `flowsheet` were also blocked.

Root cause is three compounding code-level problems: (1) migration mixes DDL and bulk DML in one txn; (2) `flowsheet-etl/job.ts:85-95 resolveDjNames()` runs the same unbounded UPDATE on every cron tick; (3) no `statement_timeout` / `lock_timeout` on the migration runner so orphaned backends can hold locks indefinitely.

Stop-phase already applied: cron entry disabled (later overwritten by an in-flight deploy), original migration backend cancelled, full transaction rolled back (so `dj_name` column does not currently exist, and migrations 0053/0054/0055 are unapplied per `drizzle.__drizzle_migrations`), and a fresh `db-migrate` container that an in-flight Auto Build & Deploy had started was stopped before it could re-create the wedge.

## Goal

Ship the planned `dj_name` denormalization (already coded against in `backend:v0.1.173+` per commit `fe52818`) without wedging production, and prevent the same DDL-plus-heavy-DML failure mode from recurring on any future migration or ETL pass.

## Non-goals / out of scope

- Migration 0054 (drops + re-adds `search_doc` as a STORED generated column referencing `dj_name`) has the *same* AccessExclusiveLock-for-hours risk per its own production note. This plan blocks 0054 from running but does not redesign it; that's a separate follow-up plan.
- Migration 0055 (`add-reconciled-identity-to-artists`) is gated behind 0054 in the journal, so it ships when 0054 ships — also out of scope.
- Auditing every prior migration for similar patterns: documented as P3 below, not blocking.
- `backend` container's unhealthy state — should auto-resolve once locks are clear and stays clear; if not, separate diagnosis.

## Plan

### Phase 0 — restore deploy safety (URGENT, single PR)

The reason this is urgent: any push to `main` triggers `Auto Build & Deploy`, which runs the `migrate` job, which sees 0053 unapplied and tries to apply the original SQL — re-creating the wedge. Until Phase 0 lands, no other code can be merged to `main` safely.

**P0.1 — Split DDL out of migration 0053.** Rewrite `shared/database/src/migrations/0053_flowsheet-dj-name-column.sql` so its only SQL is `ALTER TABLE "wxyc_schema"."flowsheet" ADD COLUMN "dj_name" text;`. The header comment becomes a pointer to the backfill job's path and the documented run procedure. The hash of the file changes; that's expected — the migration is unapplied in production, so drizzle's hash check will simply attempt the new (fast) statement on next migrate.

**P0.2 — Make `flowsheet-etl` `resolveDjNames` scoped.** In `jobs/flowsheet-etl/job.ts`, change `resolveDjNames` to accept the just-imported entry IDs and only update those: `WHERE f.id = ANY($1) AND f.dj_name IS NULL`. The bulk-load path (`runBulkLoad`, line 286) already has every imported ID in scope. The incremental path (line 423) iterates `entriesToInsert`/`entriesImported` — collect the IDs there. With this change, `resolveDjNames` becomes O(batch_size), not O(table_size). Legacy rows are *not* touched by the ETL anymore — they are the backfill job's responsibility (Phase 1).

**P0.3 — Block migration 0054 until backfill is verified.** Prepend a `DO $$ ... END $$` guard to `shared/database/src/migrations/0054_flowsheet-search-doc-with-dj-name.sql` that raises if any track row has `dj_name IS NULL`: `IF (SELECT count(*) FROM "wxyc_schema"."flowsheet" WHERE entry_type = 'track' AND dj_name IS NULL) > 0 THEN RAISE EXCEPTION 'flowsheet.dj_name backfill incomplete; run jobs/backfills/flowsheet-dj-name-backfill before applying 0054'; END IF;`. The hash changes; same reasoning as P0.1 — the migration is unapplied. After backfill is at 100%, a small follow-up PR strips the guard.

**P0.4 — Tests.** Update `tests/unit/database/schema.flowsheet-dj-name.test.ts` to assert the new (DDL-only) shape of 0054 and the precondition guard in 0054. Update `tests/unit/jobs/flowsheet-etl/job.djName.test.ts` to assert the new contract of `resolveDjNames(entryIds: number[])` and that legacy rows are not touched.

PR-1 deliverable: P0.1 + P0.2 + P0.3 + P0.4. Size estimate ~400 lines incl. tests. Branch name `fix/migration-0053-split-and-scope-etl`. Must merge before any other PR to `main`.

### Phase 1 — write the backfill, then run it

**P1.1 — New job `jobs/backfills/flowsheet-dj-name-backfill/`.** Mirror the structure of `jobs/flowsheet-etl/`: `job.ts`, `package.json` (no `cron-schedule` field — this is one-shot), `tsconfig.json`, `tsup.config.ts`. The job:
- Iterates batches of 5000 rows: `WITH batch AS (SELECT id FROM wxyc_schema.flowsheet WHERE dj_name IS NULL AND entry_type = 'track' ORDER BY id LIMIT 5000 FOR UPDATE SKIP LOCKED) UPDATE wxyc_schema.flowsheet f SET dj_name = COALESCE(u.dj_name, s.legacy_dj_name, u.name) FROM wxyc_schema.shows s LEFT JOIN auth_user u ON u.id = s.primary_dj_id WHERE f.show_id = s.id AND f.id IN (SELECT id FROM batch);`. Each batch is its own transaction.
- Logs progress: `[backfill] batch N: M rows, total processed K, remaining R, ETA ~T`.
- Exits cleanly when a batch returns 0 rows.
- Idempotent and restartable — re-running picks up where it left off via `WHERE dj_name IS NULL`.

**P1.2 — Dockerfile + deploy plumbing.** Add `Dockerfile.flowsheet-dj-name-backfill` mirroring `Dockerfile.flowsheet-etl`. The deploy step at `deploy-base.yml:417-421` currently does `yq -r '.["cron-schedule"]' jobs/$TARGET_APP/package.json` and exits non-zero when the field is missing — that is, today every `jobs/<x>` is assumed to be a cron job. We need to teach the workflow about one-shot jobs. Path of least resistance: introduce a `"job-type"` field in `package.json` that is `"cron"` by default and `"one-shot"` for backfills; gate the cron-install block on `job-type == cron`. Backfill `package.json` declares `"job-type": "one-shot"` and omits `cron-schedule`. The build/push/deploy flow still pushes the image to ECR; only the cron-install step is skipped. The image is invoked manually via `gh workflow run "Manual Build & Deploy" -f target=flowsheet-dj-name-backfill -f version=latest` followed by an SSH `docker run`.

**File checklist for the backfill job (mirrored from `jobs/flowsheet-etl/`):**
- `jobs/backfills/flowsheet-dj-name-backfill/job.ts`
- `jobs/backfills/flowsheet-dj-name-backfill/package.json` (with `"job-type": "one-shot"`, no `cron-schedule`)
- `jobs/backfills/flowsheet-dj-name-backfill/tsconfig.json`
- `jobs/backfills/flowsheet-dj-name-backfill/tsup.config.ts`
- `Dockerfile.flowsheet-dj-name-backfill` at repo root
- Tests at `tests/unit/jobs/backfills/flowsheet-dj-name-backfill/job.test.ts`
- Verify against `jobs/rotation-etl/` and `jobs/library-etl/` that all four config files match those shapes; copy any pattern they share that flowsheet-etl alone doesn't have.

**P1.3 — Tests.** Unit tests for the batch query shape, idempotency (re-run on already-backfilled rows is a no-op), and that the legacy `auth_user`/`shows` fallbacks resolve correctly. Integration test if there's a fast path in the existing harness.

**P1.4 — Manual execution procedure** (documented in the PR description, not in code):
- Schedule for a known low-traffic window (target 4-5am ET on a weekday).
- Build + push: `gh workflow run "Manual Build & Deploy" -f target=flowsheet-dj-name-backfill -f version=latest`.
- SSH to EC2, log into ECR, pull the image.
- Run with stdout captured: `docker run --rm --env-file .env <image> 2>&1 | tee /tmp/backfill-$(date +%s).log`.
- Estimated runtime: 2.6M rows ÷ 5000/batch = ~520 batches; if each batch is 2-5s that's 17-45 min. Watch for outliers in the log.
- Verify: `SELECT count(*) FROM wxyc_schema.flowsheet WHERE entry_type = 'track' AND dj_name IS NULL;` returns 0.
- Drop the 0054 guard in a tiny follow-up PR; next deploy unblocks 0054 + 0055.

PR-2 deliverable: P1.1 + P1.2 + P1.3. Size estimate ~300 lines incl. tests. Branch name `fix/flowsheet-dj-name-backfill-job`. Lands after PR-1 deploy is verified.

### Phase 2 — defense in depth

These are independent of Phase 0/1 and could ship in parallel with PR-2. They protect against any *future* migration or ETL with the same shape.

**P2.1 — `statement_timeout` and `lock_timeout` on migrations.** In `dev_env/init-db.mjs`, before `runMigrations()`, run `SET statement_timeout = '5min'; SET lock_timeout = '30s';` on the postgres-js client. Migrations that take longer indicate a design problem and should fail fast with a clear error rather than wedge production. (For migrations that are legitimately long, e.g. a future planned 0054, add a `SET LOCAL statement_timeout = '2hr';` at the top of the SQL file — opt-in, scoped to that one migration.)

**P2.2 — Orphan-migration pre-flight.** In `dev_env/init-db.mjs`, before running migrations, query: `SELECT pid, age(now(), query_start) AS age, substring(query, 1, 80) AS q FROM pg_locks JOIN pg_stat_activity USING (pid) WHERE locktype = 'relation' AND mode = 'AccessExclusiveLock' AND relation::regclass::text LIKE 'wxyc_schema.%' AND pid <> pg_backend_pid();`. If any rows: log them and exit non-zero with `migration aborted: another backend is holding AccessExclusiveLock on wxyc_schema.* (pid=...). Investigate before retrying.`. Prevents stacking new migration attempts on top of an orphan.

**P2.3 — Bound the migrate step's container lifetime.** In `.github/workflows/deploy-base.yml` line 378, replace `docker run --rm --env-file .env -e SKIP_SEED=true $IMAGE_URI` with `timeout --signal=TERM 8m docker run --rm --env-file .env -e SKIP_SEED=true $IMAGE_URI`. If the container exceeds 8 minutes, `timeout` sends SIGTERM, the postgres-js client closes its connection, the backend rolls back promptly. (8m is comfortably above current normal migrate runtime of <30s and below the orphan-creating timeout window.)

**P2.4 — Cron lockout for ETL containers.** In `.github/workflows/deploy-base.yml` line 484, change the cron command template from `docker rm -f <name> ... ; docker run --name <name> ...` to `docker ps -q -f "name=^<name>$" | grep -q . || docker run --rm --name <name> --env-file .env <image>`. If a previous instance is still running, the new tick is a no-op. The `-rm` is added so successful runs clean up; the `-f` (force-kill) is removed so a hung previous run does not get clobbered before we can investigate. This applies to all jobs (`flowsheet-etl`, `library-etl`, `rotation-etl`).

Two implementation notes for PR-3:
1. Verify the regex-anchored filter syntax (`-f "name=^<name>$"`) on the deployed Docker version on EC2 before merging — Docker filters with `^`/`$` anchors are documented but version-dependent. If unsupported, fall back to checking the exit status of `docker ps -q --filter name=<name>` plus a post-check that the matched container's name is exactly `<name>` (not a substring like `<name>-foo`).
2. Trade-off to document in the PR description: with `--rm` and no force-kill, a hung container holds its name slot until manual remediation (`docker kill <name> && docker rm <name>`). This is the desired behavior — visible failure beats silent recovery — but operators need to know the manual remediation command. The runbook entry goes in `Backend-Service/CLAUDE.md` alongside P3.1.

PR-3 deliverable: P2.1 + P2.2 + P2.3 + P2.4. Size estimate ~200 lines incl. tests. Branch name `fix/migration-runner-and-cron-hardening`. Lands after PR-1; orderable independently of PR-2.

### Phase 3 — documentation

**P3.1** — Add to `Backend-Service/CLAUDE.md` a "Migrations" subsection: migrations are DDL-only; bulk backfills (>10k rows) live in `jobs/backfills/<name>/` and are run as one-shot deploys; document the `SET LOCAL statement_timeout` opt-out for legitimately-long migrations.

**P3.2** — Mirror to the `wiki/` repo if there's a place for cross-repo conventions; otherwise just `Backend-Service/CLAUDE.md`.

**P3.3** — Audit prior migrations: I already noted `0024_flowsheet_entry_type.sql` and `0043_add-has-completed-onboarding.sql` have the DDL+DML pattern. Both are already applied; not a current risk. Document them in `CLAUDE.md` as historical exceptions and the pattern we no longer follow.

PR-4 deliverable: P3.1 + P3.2 + P3.3. Size estimate ~80 lines. Lands any time after PR-1.

### Phase 4 — re-enable the cron and unblock 0054 (manual / orchestration only)

After PR-1 + PR-2 are merged, the backfill has been run, and `dj_name IS NULL` returns 0 for tracks:

1. Tiny PR-5: drop the precondition guard from migration 0054. Merge → next deploy applies 0054 + 0055.
2. Verify search service still works: `backend:v0.1.173+` is already coded to use `flowsheet.dj_name`, so once 0054 ships, search reads the new path.
3. Re-add `flowsheet-etl-cron` to the crontab via `Manual Build & Deploy` of `flowsheet-etl` (the deploy step appends the cron entry per `deploy-base.yml:484`).
4. Watch `pg_stat_activity` and `[flowsheet-etl] Resolved dj_name for flowsheet entries.` log lines for the first few cron ticks to confirm steady-state behavior.

## Sequencing summary

| Step | What | Depends on | Production effect |
|---|---|---|---|
| PR-1 | DDL-only 0053 + scoped ETL `resolveDjNames` + 0054 guard + tests | nothing | Future migrate runs in <1s; 0054 fail-fasts on missing backfill. |
| PR-3 | Migration runner + cron hardening | PR-1 (so initial migrate is safe) | Future safety only. |
| PR-2 | Backfill job | PR-1 deployed | None until manually invoked. |
| Manual | Run backfill | PR-2 deployed | ~2.6M rows over ~17-45 min in low-traffic window. |
| PR-4 | Documentation | nothing | None. |
| PR-5 | Drop 0054 guard | backfill at 100% | Unblocks 0054 + 0055 on next deploy. |
| Manual | Re-add `flowsheet-etl-cron` | PR-1 image deployed | Cron resumes at *N=*30min. |

PR-1 is the critical-path fire stop. Everything else can move at normal pace.

## Risk assessment

**PR-1**: very low. `ALTER TABLE ADD COLUMN <nullable text>` is metadata-only; instant on any size table. The ETL change has existing test coverage in `tests/unit/jobs/flowsheet-etl/job.djName.test.ts` to extend. The 0054 precondition guard is the trickiest piece — needs a test asserting it raises when expected and is silent when 0 NULL rows.

**PR-2 + manual run**: medium. The backfill is long-running but bounded. Each batch commits separately, so worst-case interruption loses one batch's work and is restartable. Watch WAL volume during the run; with 5000-row batches, WAL is bounded per commit. If RDS WAL retention is tight, may want to reduce batch size or pace the loop. Run during low-traffic window (4-5am ET) to minimize concurrent contention.

**PR-3**: low. Mostly tooling. `statement_timeout = 5min` is conservative — verify no legitimate migration in the journal currently exceeds 5min by checking the recent applied-migrations log on EC2 first. Cron lockout could mask a runaway container indefinitely; mitigation is a separate watchdog (out of scope), or a manual remediation noted in the docs.

**PR-4 / docs**: zero risk.

**Manual backfill**: as above.

**Manual re-enable cron**: low. The new image's `resolveDjNames` is bounded; if it misbehaves, kill the container and PR-1's hardening will prevent the previous failure mode.

## Rollback strategy

- PR-1: revert is safe. ALTER TABLE ADD COLUMN is reversible via `ALTER TABLE ... DROP COLUMN dj_name` in a follow-up migration. ETL change is a code revert. The 0054 guard revert is a one-line edit.
- Backfill: idempotent under `WHERE dj_name IS NULL`. To "undo" you'd `UPDATE flowsheet SET dj_name = NULL WHERE entry_type = 'track'` — destructive and not currently planned for; would need explicit auth.
- PR-3: tooling reverts only. Re-applying takes one PR.

## Open questions for the reviewer

1. Is option (a) for blocking 0054 (in-file precondition guard) the right gate, or should we revert the 0054 commit on a feature branch and re-land it post-backfill? Trade-off: in-file guard keeps the journal stable but every fresh dev DB seeded with prod-shape data needs the backfill before 0054 can apply; revert keeps fresh deploys clean but bumps migration numbers and touches `_journal.json` twice (which incident #400 cited as a footgun).
2. Should `jobs/backfills/` become a permanent convention, or are bulk backfills rare enough that one-off scripts in `scripts/` are fine? My read is the former because we want them to be deployable as ECR images with the same env-var contract as ETL jobs.
3. For `statement_timeout = '5min'` (P2.1) — is there an existing migration that legitimately needs longer? If so I should grep the journal for runtime evidence before merging. (My quick grep showed 0024 and 0043 as DDL+DML candidates, but both are already applied.)
4. Cron lockout (P2.4) — should we also add a max-age killswitch, e.g. `if previous instance is older than 2 hours, kill it and start fresh`? Without one, a wedged container blocks all future runs until manual remediation. Without that we ship the docs path (operator notices via Sentry/log monitoring); with one, we automate it but risk killing legitimate long runs. Lean toward documenting and not auto-killing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migration 0053 wedge: split DDL from backfill, harden migration runner #511

Plan: stop the migration-0053 wedge and harden against recurrence

Context (incident 2026-04-25/26)

Goal

Non-goals / out of scope

Plan

Phase 0 — restore deploy safety (URGENT, single PR)

Phase 1 — write the backfill, then run it

Phase 2 — defense in depth

Phase 3 — documentation

Phase 4 — re-enable the cron and unblock 0054 (manual / orchestration only)

Sequencing summary

Risk assessment

Rollback strategy

Open questions for the reviewer

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Step	What	Depends on	Production effect
PR-1	DDL-only 0053 + scoped ETL `resolveDjNames` + 0054 guard + tests	nothing	Future migrate runs in <1s; 0054 fail-fasts on missing backfill.
PR-3	Migration runner + cron hardening	PR-1 (so initial migrate is safe)	Future safety only.
PR-2	Backfill job	PR-1 deployed	None until manually invoked.
Manual	Run backfill	PR-2 deployed	~2.6M rows over ~17-45 min in low-traffic window.
PR-4	Documentation	nothing	None.
PR-5	Drop 0054 guard	backfill at 100%	Unblocks 0054 + 0055 on next deploy.
Manual	Re-add `flowsheet-etl-cron`	PR-1 image deployed	Cron resumes at N=30min.

Migration 0053 wedge: split DDL from backfill, harden migration runner #511

Description

Plan: stop the migration-0053 wedge and harden against recurrence

Context (incident 2026-04-25/26)

Goal

Non-goals / out of scope

Plan

Phase 0 — restore deploy safety (URGENT, single PR)

Phase 1 — write the backfill, then run it

Phase 2 — defense in depth

Phase 3 — documentation

Phase 4 — re-enable the cron and unblock 0054 (manual / orchestration only)

Sequencing summary

Risk assessment

Rollback strategy

Open questions for the reviewer

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions