You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Plan: stop the migration-0053 wedge and harden against recurrence
Context (incident 2026-04-25/26)
A deploy of flowsheet-etl:v0.1.105 on 2026-04-25 ran migration 0053_flowsheet-dj-name-column.sql. That migration mixes DDL (ALTER TABLE ADD COLUMN dj_name) and bulk DML (UPDATE every track row from a 3-table join) inside one transaction. The DDL acquires AccessExclusiveLock. The UPDATE rewrites every row of a 2.6M-row table — runtime ~hours. The deploy step's SSH wrapper timed out long before the UPDATE finished; the container was killed but the PostgreSQL backend continued running orphaned, holding the AccessExclusiveLock. Every subsequent flowsheet-etl-cron (every 30 min) and the deploy migrate job piled fresh queries behind the lock. By the time we noticed, pg_stat_activity showed 48+ blocked queries, the oldest 13 hours old. The backend container was marked unhealthy because its healthcheck queries against flowsheet were also blocked.
Root cause is three compounding code-level problems: (1) migration mixes DDL and bulk DML in one txn; (2) flowsheet-etl/job.ts:85-95 resolveDjNames() runs the same unbounded UPDATE on every cron tick; (3) no statement_timeout / lock_timeout on the migration runner so orphaned backends can hold locks indefinitely.
Stop-phase already applied: cron entry disabled (later overwritten by an in-flight deploy), original migration backend cancelled, full transaction rolled back (so dj_name column does not currently exist, and migrations 0053/0054/0055 are unapplied per drizzle.__drizzle_migrations), and a fresh db-migrate container that an in-flight Auto Build & Deploy had started was stopped before it could re-create the wedge.
Goal
Ship the planned dj_name denormalization (already coded against in backend:v0.1.173+ per commit fe52818) without wedging production, and prevent the same DDL-plus-heavy-DML failure mode from recurring on any future migration or ETL pass.
Non-goals / out of scope
Migration 0054 (drops + re-adds search_doc as a STORED generated column referencing dj_name) has the same AccessExclusiveLock-for-hours risk per its own production note. This plan blocks 0054 from running but does not redesign it; that's a separate follow-up plan.
Migration 0055 (add-reconciled-identity-to-artists) is gated behind 0054 in the journal, so it ships when 0054 ships — also out of scope.
Auditing every prior migration for similar patterns: documented as P3 below, not blocking.
backend container's unhealthy state — should auto-resolve once locks are clear and stays clear; if not, separate diagnosis.
Plan
Phase 0 — restore deploy safety (URGENT, single PR)
The reason this is urgent: any push to main triggers Auto Build & Deploy, which runs the migrate job, which sees 0053 unapplied and tries to apply the original SQL — re-creating the wedge. Until Phase 0 lands, no other code can be merged to main safely.
P0.1 — Split DDL out of migration 0053. Rewrite shared/database/src/migrations/0053_flowsheet-dj-name-column.sql so its only SQL is ALTER TABLE "wxyc_schema"."flowsheet" ADD COLUMN "dj_name" text;. The header comment becomes a pointer to the backfill job's path and the documented run procedure. The hash of the file changes; that's expected — the migration is unapplied in production, so drizzle's hash check will simply attempt the new (fast) statement on next migrate.
P0.2 — Make flowsheet-etlresolveDjNames scoped. In jobs/flowsheet-etl/job.ts, change resolveDjNames to accept the just-imported entry IDs and only update those: WHERE f.id = ANY($1) AND f.dj_name IS NULL. The bulk-load path (runBulkLoad, line 286) already has every imported ID in scope. The incremental path (line 423) iterates entriesToInsert/entriesImported — collect the IDs there. With this change, resolveDjNames becomes O(batch_size), not O(table_size). Legacy rows are not touched by the ETL anymore — they are the backfill job's responsibility (Phase 1).
P0.3 — Block migration 0054 until backfill is verified. Prepend a DO $$ ... END $$ guard to shared/database/src/migrations/0054_flowsheet-search-doc-with-dj-name.sql that raises if any track row has dj_name IS NULL: IF (SELECT count(*) FROM "wxyc_schema"."flowsheet" WHERE entry_type = 'track' AND dj_name IS NULL) > 0 THEN RAISE EXCEPTION 'flowsheet.dj_name backfill incomplete; run jobs/backfills/flowsheet-dj-name-backfill before applying 0054'; END IF;. The hash changes; same reasoning as P0.1 — the migration is unapplied. After backfill is at 100%, a small follow-up PR strips the guard.
P0.4 — Tests. Update tests/unit/database/schema.flowsheet-dj-name.test.ts to assert the new (DDL-only) shape of 0054 and the precondition guard in 0054. Update tests/unit/jobs/flowsheet-etl/job.djName.test.ts to assert the new contract of resolveDjNames(entryIds: number[]) and that legacy rows are not touched.
PR-1 deliverable: P0.1 + P0.2 + P0.3 + P0.4. Size estimate ~400 lines incl. tests. Branch name fix/migration-0053-split-and-scope-etl. Must merge before any other PR to main.
Phase 1 — write the backfill, then run it
P1.1 — New job jobs/backfills/flowsheet-dj-name-backfill/. Mirror the structure of jobs/flowsheet-etl/: job.ts, package.json (no cron-schedule field — this is one-shot), tsconfig.json, tsup.config.ts. The job:
Iterates batches of 5000 rows: WITH batch AS (SELECT id FROM wxyc_schema.flowsheet WHERE dj_name IS NULL AND entry_type = 'track' ORDER BY id LIMIT 5000 FOR UPDATE SKIP LOCKED) UPDATE wxyc_schema.flowsheet f SET dj_name = COALESCE(u.dj_name, s.legacy_dj_name, u.name) FROM wxyc_schema.shows s LEFT JOIN auth_user u ON u.id = s.primary_dj_id WHERE f.show_id = s.id AND f.id IN (SELECT id FROM batch);. Each batch is its own transaction.
Logs progress: [backfill] batch N: M rows, total processed K, remaining R, ETA ~T.
Exits cleanly when a batch returns 0 rows.
Idempotent and restartable — re-running picks up where it left off via WHERE dj_name IS NULL.
P1.2 — Dockerfile + deploy plumbing. Add Dockerfile.flowsheet-dj-name-backfill mirroring Dockerfile.flowsheet-etl. The deploy step at deploy-base.yml:417-421 currently does yq -r '.["cron-schedule"]' jobs/$TARGET_APP/package.json and exits non-zero when the field is missing — that is, today every jobs/<x> is assumed to be a cron job. We need to teach the workflow about one-shot jobs. Path of least resistance: introduce a "job-type" field in package.json that is "cron" by default and "one-shot" for backfills; gate the cron-install block on job-type == cron. Backfill package.json declares "job-type": "one-shot" and omits cron-schedule. The build/push/deploy flow still pushes the image to ECR; only the cron-install step is skipped. The image is invoked manually via gh workflow run "Manual Build & Deploy" -f target=flowsheet-dj-name-backfill -f version=latest followed by an SSH docker run.
File checklist for the backfill job (mirrored from jobs/flowsheet-etl/):
jobs/backfills/flowsheet-dj-name-backfill/job.ts
jobs/backfills/flowsheet-dj-name-backfill/package.json (with "job-type": "one-shot", no cron-schedule)
Dockerfile.flowsheet-dj-name-backfill at repo root
Tests at tests/unit/jobs/backfills/flowsheet-dj-name-backfill/job.test.ts
Verify against jobs/rotation-etl/ and jobs/library-etl/ that all four config files match those shapes; copy any pattern they share that flowsheet-etl alone doesn't have.
P1.3 — Tests. Unit tests for the batch query shape, idempotency (re-run on already-backfilled rows is a no-op), and that the legacy auth_user/shows fallbacks resolve correctly. Integration test if there's a fast path in the existing harness.
P1.4 — Manual execution procedure (documented in the PR description, not in code):
Schedule for a known low-traffic window (target 4-5am ET on a weekday).
Run with stdout captured: docker run --rm --env-file .env <image> 2>&1 | tee /tmp/backfill-$(date +%s).log.
Estimated runtime: 2.6M rows ÷ 5000/batch = ~520 batches; if each batch is 2-5s that's 17-45 min. Watch for outliers in the log.
Verify: SELECT count(*) FROM wxyc_schema.flowsheet WHERE entry_type = 'track' AND dj_name IS NULL; returns 0.
Drop the 0054 guard in a tiny follow-up PR; next deploy unblocks 0054 + 0055.
PR-2 deliverable: P1.1 + P1.2 + P1.3. Size estimate ~300 lines incl. tests. Branch name fix/flowsheet-dj-name-backfill-job. Lands after PR-1 deploy is verified.
Phase 2 — defense in depth
These are independent of Phase 0/1 and could ship in parallel with PR-2. They protect against any future migration or ETL with the same shape.
P2.1 — statement_timeout and lock_timeout on migrations. In dev_env/init-db.mjs, before runMigrations(), run SET statement_timeout = '5min'; SET lock_timeout = '30s'; on the postgres-js client. Migrations that take longer indicate a design problem and should fail fast with a clear error rather than wedge production. (For migrations that are legitimately long, e.g. a future planned 0054, add a SET LOCAL statement_timeout = '2hr'; at the top of the SQL file — opt-in, scoped to that one migration.)
P2.2 — Orphan-migration pre-flight. In dev_env/init-db.mjs, before running migrations, query: SELECT pid, age(now(), query_start) AS age, substring(query, 1, 80) AS q FROM pg_locks JOIN pg_stat_activity USING (pid) WHERE locktype = 'relation' AND mode = 'AccessExclusiveLock' AND relation::regclass::text LIKE 'wxyc_schema.%' AND pid <> pg_backend_pid();. If any rows: log them and exit non-zero with migration aborted: another backend is holding AccessExclusiveLock on wxyc_schema.* (pid=...). Investigate before retrying.. Prevents stacking new migration attempts on top of an orphan.
P2.3 — Bound the migrate step's container lifetime. In .github/workflows/deploy-base.yml line 378, replace docker run --rm --env-file .env -e SKIP_SEED=true $IMAGE_URI with timeout --signal=TERM 8m docker run --rm --env-file .env -e SKIP_SEED=true $IMAGE_URI. If the container exceeds 8 minutes, timeout sends SIGTERM, the postgres-js client closes its connection, the backend rolls back promptly. (8m is comfortably above current normal migrate runtime of <30s and below the orphan-creating timeout window.)
P2.4 — Cron lockout for ETL containers. In .github/workflows/deploy-base.yml line 484, change the cron command template from docker rm -f <name> ... ; docker run --name <name> ... to docker ps -q -f "name=^<name>$" | grep -q . || docker run --rm --name <name> --env-file .env <image>. If a previous instance is still running, the new tick is a no-op. The -rm is added so successful runs clean up; the -f (force-kill) is removed so a hung previous run does not get clobbered before we can investigate. This applies to all jobs (flowsheet-etl, library-etl, rotation-etl).
Two implementation notes for PR-3:
Verify the regex-anchored filter syntax (-f "name=^<name>$") on the deployed Docker version on EC2 before merging — Docker filters with ^/$ anchors are documented but version-dependent. If unsupported, fall back to checking the exit status of docker ps -q --filter name=<name> plus a post-check that the matched container's name is exactly <name> (not a substring like <name>-foo).
Trade-off to document in the PR description: with --rm and no force-kill, a hung container holds its name slot until manual remediation (docker kill <name> && docker rm <name>). This is the desired behavior — visible failure beats silent recovery — but operators need to know the manual remediation command. The runbook entry goes in Backend-Service/CLAUDE.md alongside P3.1.
PR-3 deliverable: P2.1 + P2.2 + P2.3 + P2.4. Size estimate ~200 lines incl. tests. Branch name fix/migration-runner-and-cron-hardening. Lands after PR-1; orderable independently of PR-2.
Phase 3 — documentation
P3.1 — Add to Backend-Service/CLAUDE.md a "Migrations" subsection: migrations are DDL-only; bulk backfills (>10k rows) live in jobs/backfills/<name>/ and are run as one-shot deploys; document the SET LOCAL statement_timeout opt-out for legitimately-long migrations.
P3.2 — Mirror to the wiki/ repo if there's a place for cross-repo conventions; otherwise just Backend-Service/CLAUDE.md.
P3.3 — Audit prior migrations: I already noted 0024_flowsheet_entry_type.sql and 0043_add-has-completed-onboarding.sql have the DDL+DML pattern. Both are already applied; not a current risk. Document them in CLAUDE.md as historical exceptions and the pattern we no longer follow.
PR-4 deliverable: P3.1 + P3.2 + P3.3. Size estimate ~80 lines. Lands any time after PR-1.
Phase 4 — re-enable the cron and unblock 0054 (manual / orchestration only)
After PR-1 + PR-2 are merged, the backfill has been run, and dj_name IS NULL returns 0 for tracks:
Tiny PR-5: drop the precondition guard from migration 0054. Merge → next deploy applies 0054 + 0055.
Verify search service still works: backend:v0.1.173+ is already coded to use flowsheet.dj_name, so once 0054 ships, search reads the new path.
Re-add flowsheet-etl-cron to the crontab via Manual Build & Deploy of flowsheet-etl (the deploy step appends the cron entry per deploy-base.yml:484).
Watch pg_stat_activity and [flowsheet-etl] Resolved dj_name for flowsheet entries. log lines for the first few cron ticks to confirm steady-state behavior.
Future migrate runs in <1s; 0054 fail-fasts on missing backfill.
PR-3
Migration runner + cron hardening
PR-1 (so initial migrate is safe)
Future safety only.
PR-2
Backfill job
PR-1 deployed
None until manually invoked.
Manual
Run backfill
PR-2 deployed
~2.6M rows over ~17-45 min in low-traffic window.
PR-4
Documentation
nothing
None.
PR-5
Drop 0054 guard
backfill at 100%
Unblocks 0054 + 0055 on next deploy.
Manual
Re-add flowsheet-etl-cron
PR-1 image deployed
Cron resumes at *N=*30min.
PR-1 is the critical-path fire stop. Everything else can move at normal pace.
Risk assessment
PR-1: very low. ALTER TABLE ADD COLUMN <nullable text> is metadata-only; instant on any size table. The ETL change has existing test coverage in tests/unit/jobs/flowsheet-etl/job.djName.test.ts to extend. The 0054 precondition guard is the trickiest piece — needs a test asserting it raises when expected and is silent when 0 NULL rows.
PR-2 + manual run: medium. The backfill is long-running but bounded. Each batch commits separately, so worst-case interruption loses one batch's work and is restartable. Watch WAL volume during the run; with 5000-row batches, WAL is bounded per commit. If RDS WAL retention is tight, may want to reduce batch size or pace the loop. Run during low-traffic window (4-5am ET) to minimize concurrent contention.
PR-3: low. Mostly tooling. statement_timeout = 5min is conservative — verify no legitimate migration in the journal currently exceeds 5min by checking the recent applied-migrations log on EC2 first. Cron lockout could mask a runaway container indefinitely; mitigation is a separate watchdog (out of scope), or a manual remediation noted in the docs.
PR-4 / docs: zero risk.
Manual backfill: as above.
Manual re-enable cron: low. The new image's resolveDjNames is bounded; if it misbehaves, kill the container and PR-1's hardening will prevent the previous failure mode.
Rollback strategy
PR-1: revert is safe. ALTER TABLE ADD COLUMN is reversible via ALTER TABLE ... DROP COLUMN dj_name in a follow-up migration. ETL change is a code revert. The 0054 guard revert is a one-line edit.
Backfill: idempotent under WHERE dj_name IS NULL. To "undo" you'd UPDATE flowsheet SET dj_name = NULL WHERE entry_type = 'track' — destructive and not currently planned for; would need explicit auth.
PR-3: tooling reverts only. Re-applying takes one PR.
Open questions for the reviewer
Is option (a) for blocking 0054 (in-file precondition guard) the right gate, or should we revert the 0054 commit on a feature branch and re-land it post-backfill? Trade-off: in-file guard keeps the journal stable but every fresh dev DB seeded with prod-shape data needs the backfill before 0054 can apply; revert keeps fresh deploys clean but bumps migration numbers and touches _journal.json twice (which incident fix: Drizzle migration journal out of sync — migration 0043 silently skipped #400 cited as a footgun).
Should jobs/backfills/ become a permanent convention, or are bulk backfills rare enough that one-off scripts in scripts/ are fine? My read is the former because we want them to be deployable as ECR images with the same env-var contract as ETL jobs.
For statement_timeout = '5min' (P2.1) — is there an existing migration that legitimately needs longer? If so I should grep the journal for runtime evidence before merging. (My quick grep showed 0024 and 0043 as DDL+DML candidates, but both are already applied.)
Cron lockout (P2.4) — should we also add a max-age killswitch, e.g. if previous instance is older than 2 hours, kill it and start fresh? Without one, a wedged container blocks all future runs until manual remediation. Without that we ship the docs path (operator notices via Sentry/log monitoring); with one, we automate it but risk killing legitimate long runs. Lean toward documenting and not auto-killing.
Plan: stop the migration-0053 wedge and harden against recurrence
Context (incident 2026-04-25/26)
A deploy of
flowsheet-etl:v0.1.105on 2026-04-25 ran migration0053_flowsheet-dj-name-column.sql. That migration mixes DDL (ALTER TABLE ADD COLUMN dj_name) and bulk DML (UPDATE every track row from a 3-table join) inside one transaction. The DDL acquiresAccessExclusiveLock. The UPDATE rewrites every row of a 2.6M-row table — runtime ~hours. The deploy step's SSH wrapper timed out long before the UPDATE finished; the container was killed but the PostgreSQL backend continued running orphaned, holding the AccessExclusiveLock. Every subsequentflowsheet-etl-cron(every 30 min) and the deploymigratejob piled fresh queries behind the lock. By the time we noticed,pg_stat_activityshowed 48+ blocked queries, the oldest 13 hours old. Thebackendcontainer was marked unhealthy because its healthcheck queries againstflowsheetwere also blocked.Root cause is three compounding code-level problems: (1) migration mixes DDL and bulk DML in one txn; (2)
flowsheet-etl/job.ts:85-95 resolveDjNames()runs the same unbounded UPDATE on every cron tick; (3) nostatement_timeout/lock_timeouton the migration runner so orphaned backends can hold locks indefinitely.Stop-phase already applied: cron entry disabled (later overwritten by an in-flight deploy), original migration backend cancelled, full transaction rolled back (so
dj_namecolumn does not currently exist, and migrations 0053/0054/0055 are unapplied perdrizzle.__drizzle_migrations), and a freshdb-migratecontainer that an in-flight Auto Build & Deploy had started was stopped before it could re-create the wedge.Goal
Ship the planned
dj_namedenormalization (already coded against inbackend:v0.1.173+per commitfe52818) without wedging production, and prevent the same DDL-plus-heavy-DML failure mode from recurring on any future migration or ETL pass.Non-goals / out of scope
search_docas a STORED generated column referencingdj_name) has the same AccessExclusiveLock-for-hours risk per its own production note. This plan blocks 0054 from running but does not redesign it; that's a separate follow-up plan.add-reconciled-identity-to-artists) is gated behind 0054 in the journal, so it ships when 0054 ships — also out of scope.backendcontainer's unhealthy state — should auto-resolve once locks are clear and stays clear; if not, separate diagnosis.Plan
Phase 0 — restore deploy safety (URGENT, single PR)
The reason this is urgent: any push to
maintriggersAuto Build & Deploy, which runs themigratejob, which sees 0053 unapplied and tries to apply the original SQL — re-creating the wedge. Until Phase 0 lands, no other code can be merged tomainsafely.P0.1 — Split DDL out of migration 0053. Rewrite
shared/database/src/migrations/0053_flowsheet-dj-name-column.sqlso its only SQL isALTER TABLE "wxyc_schema"."flowsheet" ADD COLUMN "dj_name" text;. The header comment becomes a pointer to the backfill job's path and the documented run procedure. The hash of the file changes; that's expected — the migration is unapplied in production, so drizzle's hash check will simply attempt the new (fast) statement on next migrate.P0.2 — Make
flowsheet-etlresolveDjNamesscoped. Injobs/flowsheet-etl/job.ts, changeresolveDjNamesto accept the just-imported entry IDs and only update those:WHERE f.id = ANY($1) AND f.dj_name IS NULL. The bulk-load path (runBulkLoad, line 286) already has every imported ID in scope. The incremental path (line 423) iteratesentriesToInsert/entriesImported— collect the IDs there. With this change,resolveDjNamesbecomes O(batch_size), not O(table_size). Legacy rows are not touched by the ETL anymore — they are the backfill job's responsibility (Phase 1).P0.3 — Block migration 0054 until backfill is verified. Prepend a
DO $$ ... END $$guard toshared/database/src/migrations/0054_flowsheet-search-doc-with-dj-name.sqlthat raises if any track row hasdj_name IS NULL:IF (SELECT count(*) FROM "wxyc_schema"."flowsheet" WHERE entry_type = 'track' AND dj_name IS NULL) > 0 THEN RAISE EXCEPTION 'flowsheet.dj_name backfill incomplete; run jobs/backfills/flowsheet-dj-name-backfill before applying 0054'; END IF;. The hash changes; same reasoning as P0.1 — the migration is unapplied. After backfill is at 100%, a small follow-up PR strips the guard.P0.4 — Tests. Update
tests/unit/database/schema.flowsheet-dj-name.test.tsto assert the new (DDL-only) shape of 0054 and the precondition guard in 0054. Updatetests/unit/jobs/flowsheet-etl/job.djName.test.tsto assert the new contract ofresolveDjNames(entryIds: number[])and that legacy rows are not touched.PR-1 deliverable: P0.1 + P0.2 + P0.3 + P0.4. Size estimate ~400 lines incl. tests. Branch name
fix/migration-0053-split-and-scope-etl. Must merge before any other PR tomain.Phase 1 — write the backfill, then run it
P1.1 — New job
jobs/backfills/flowsheet-dj-name-backfill/. Mirror the structure ofjobs/flowsheet-etl/:job.ts,package.json(nocron-schedulefield — this is one-shot),tsconfig.json,tsup.config.ts. The job:WITH batch AS (SELECT id FROM wxyc_schema.flowsheet WHERE dj_name IS NULL AND entry_type = 'track' ORDER BY id LIMIT 5000 FOR UPDATE SKIP LOCKED) UPDATE wxyc_schema.flowsheet f SET dj_name = COALESCE(u.dj_name, s.legacy_dj_name, u.name) FROM wxyc_schema.shows s LEFT JOIN auth_user u ON u.id = s.primary_dj_id WHERE f.show_id = s.id AND f.id IN (SELECT id FROM batch);. Each batch is its own transaction.[backfill] batch N: M rows, total processed K, remaining R, ETA ~T.WHERE dj_name IS NULL.P1.2 — Dockerfile + deploy plumbing. Add
Dockerfile.flowsheet-dj-name-backfillmirroringDockerfile.flowsheet-etl. The deploy step atdeploy-base.yml:417-421currently doesyq -r '.["cron-schedule"]' jobs/$TARGET_APP/package.jsonand exits non-zero when the field is missing — that is, today everyjobs/<x>is assumed to be a cron job. We need to teach the workflow about one-shot jobs. Path of least resistance: introduce a"job-type"field inpackage.jsonthat is"cron"by default and"one-shot"for backfills; gate the cron-install block onjob-type == cron. Backfillpackage.jsondeclares"job-type": "one-shot"and omitscron-schedule. The build/push/deploy flow still pushes the image to ECR; only the cron-install step is skipped. The image is invoked manually viagh workflow run "Manual Build & Deploy" -f target=flowsheet-dj-name-backfill -f version=latestfollowed by an SSHdocker run.File checklist for the backfill job (mirrored from
jobs/flowsheet-etl/):jobs/backfills/flowsheet-dj-name-backfill/job.tsjobs/backfills/flowsheet-dj-name-backfill/package.json(with"job-type": "one-shot", nocron-schedule)jobs/backfills/flowsheet-dj-name-backfill/tsconfig.jsonjobs/backfills/flowsheet-dj-name-backfill/tsup.config.tsDockerfile.flowsheet-dj-name-backfillat repo roottests/unit/jobs/backfills/flowsheet-dj-name-backfill/job.test.tsjobs/rotation-etl/andjobs/library-etl/that all four config files match those shapes; copy any pattern they share that flowsheet-etl alone doesn't have.P1.3 — Tests. Unit tests for the batch query shape, idempotency (re-run on already-backfilled rows is a no-op), and that the legacy
auth_user/showsfallbacks resolve correctly. Integration test if there's a fast path in the existing harness.P1.4 — Manual execution procedure (documented in the PR description, not in code):
gh workflow run "Manual Build & Deploy" -f target=flowsheet-dj-name-backfill -f version=latest.docker run --rm --env-file .env <image> 2>&1 | tee /tmp/backfill-$(date +%s).log.SELECT count(*) FROM wxyc_schema.flowsheet WHERE entry_type = 'track' AND dj_name IS NULL;returns 0.PR-2 deliverable: P1.1 + P1.2 + P1.3. Size estimate ~300 lines incl. tests. Branch name
fix/flowsheet-dj-name-backfill-job. Lands after PR-1 deploy is verified.Phase 2 — defense in depth
These are independent of Phase 0/1 and could ship in parallel with PR-2. They protect against any future migration or ETL with the same shape.
P2.1 —
statement_timeoutandlock_timeouton migrations. Indev_env/init-db.mjs, beforerunMigrations(), runSET statement_timeout = '5min'; SET lock_timeout = '30s';on the postgres-js client. Migrations that take longer indicate a design problem and should fail fast with a clear error rather than wedge production. (For migrations that are legitimately long, e.g. a future planned 0054, add aSET LOCAL statement_timeout = '2hr';at the top of the SQL file — opt-in, scoped to that one migration.)P2.2 — Orphan-migration pre-flight. In
dev_env/init-db.mjs, before running migrations, query:SELECT pid, age(now(), query_start) AS age, substring(query, 1, 80) AS q FROM pg_locks JOIN pg_stat_activity USING (pid) WHERE locktype = 'relation' AND mode = 'AccessExclusiveLock' AND relation::regclass::text LIKE 'wxyc_schema.%' AND pid <> pg_backend_pid();. If any rows: log them and exit non-zero withmigration aborted: another backend is holding AccessExclusiveLock on wxyc_schema.* (pid=...). Investigate before retrying.. Prevents stacking new migration attempts on top of an orphan.P2.3 — Bound the migrate step's container lifetime. In
.github/workflows/deploy-base.ymlline 378, replacedocker run --rm --env-file .env -e SKIP_SEED=true $IMAGE_URIwithtimeout --signal=TERM 8m docker run --rm --env-file .env -e SKIP_SEED=true $IMAGE_URI. If the container exceeds 8 minutes,timeoutsends SIGTERM, the postgres-js client closes its connection, the backend rolls back promptly. (8m is comfortably above current normal migrate runtime of <30s and below the orphan-creating timeout window.)P2.4 — Cron lockout for ETL containers. In
.github/workflows/deploy-base.ymlline 484, change the cron command template fromdocker rm -f <name> ... ; docker run --name <name> ...todocker ps -q -f "name=^<name>$" | grep -q . || docker run --rm --name <name> --env-file .env <image>. If a previous instance is still running, the new tick is a no-op. The-rmis added so successful runs clean up; the-f(force-kill) is removed so a hung previous run does not get clobbered before we can investigate. This applies to all jobs (flowsheet-etl,library-etl,rotation-etl).Two implementation notes for PR-3:
-f "name=^<name>$") on the deployed Docker version on EC2 before merging — Docker filters with^/$anchors are documented but version-dependent. If unsupported, fall back to checking the exit status ofdocker ps -q --filter name=<name>plus a post-check that the matched container's name is exactly<name>(not a substring like<name>-foo).--rmand no force-kill, a hung container holds its name slot until manual remediation (docker kill <name> && docker rm <name>). This is the desired behavior — visible failure beats silent recovery — but operators need to know the manual remediation command. The runbook entry goes inBackend-Service/CLAUDE.mdalongside P3.1.PR-3 deliverable: P2.1 + P2.2 + P2.3 + P2.4. Size estimate ~200 lines incl. tests. Branch name
fix/migration-runner-and-cron-hardening. Lands after PR-1; orderable independently of PR-2.Phase 3 — documentation
P3.1 — Add to
Backend-Service/CLAUDE.mda "Migrations" subsection: migrations are DDL-only; bulk backfills (>10k rows) live injobs/backfills/<name>/and are run as one-shot deploys; document theSET LOCAL statement_timeoutopt-out for legitimately-long migrations.P3.2 — Mirror to the
wiki/repo if there's a place for cross-repo conventions; otherwise justBackend-Service/CLAUDE.md.P3.3 — Audit prior migrations: I already noted
0024_flowsheet_entry_type.sqland0043_add-has-completed-onboarding.sqlhave the DDL+DML pattern. Both are already applied; not a current risk. Document them inCLAUDE.mdas historical exceptions and the pattern we no longer follow.PR-4 deliverable: P3.1 + P3.2 + P3.3. Size estimate ~80 lines. Lands any time after PR-1.
Phase 4 — re-enable the cron and unblock 0054 (manual / orchestration only)
After PR-1 + PR-2 are merged, the backfill has been run, and
dj_name IS NULLreturns 0 for tracks:backend:v0.1.173+is already coded to useflowsheet.dj_name, so once 0054 ships, search reads the new path.flowsheet-etl-cronto the crontab viaManual Build & Deployofflowsheet-etl(the deploy step appends the cron entry perdeploy-base.yml:484).pg_stat_activityand[flowsheet-etl] Resolved dj_name for flowsheet entries.log lines for the first few cron ticks to confirm steady-state behavior.Sequencing summary
resolveDjNames+ 0054 guard + testsflowsheet-etl-cronPR-1 is the critical-path fire stop. Everything else can move at normal pace.
Risk assessment
PR-1: very low.
ALTER TABLE ADD COLUMN <nullable text>is metadata-only; instant on any size table. The ETL change has existing test coverage intests/unit/jobs/flowsheet-etl/job.djName.test.tsto extend. The 0054 precondition guard is the trickiest piece — needs a test asserting it raises when expected and is silent when 0 NULL rows.PR-2 + manual run: medium. The backfill is long-running but bounded. Each batch commits separately, so worst-case interruption loses one batch's work and is restartable. Watch WAL volume during the run; with 5000-row batches, WAL is bounded per commit. If RDS WAL retention is tight, may want to reduce batch size or pace the loop. Run during low-traffic window (4-5am ET) to minimize concurrent contention.
PR-3: low. Mostly tooling.
statement_timeout = 5minis conservative — verify no legitimate migration in the journal currently exceeds 5min by checking the recent applied-migrations log on EC2 first. Cron lockout could mask a runaway container indefinitely; mitigation is a separate watchdog (out of scope), or a manual remediation noted in the docs.PR-4 / docs: zero risk.
Manual backfill: as above.
Manual re-enable cron: low. The new image's
resolveDjNamesis bounded; if it misbehaves, kill the container and PR-1's hardening will prevent the previous failure mode.Rollback strategy
ALTER TABLE ... DROP COLUMN dj_namein a follow-up migration. ETL change is a code revert. The 0054 guard revert is a one-line edit.WHERE dj_name IS NULL. To "undo" you'dUPDATE flowsheet SET dj_name = NULL WHERE entry_type = 'track'— destructive and not currently planned for; would need explicit auth.Open questions for the reviewer
_journal.jsontwice (which incident fix: Drizzle migration journal out of sync — migration 0043 silently skipped #400 cited as a footgun).jobs/backfills/become a permanent convention, or are bulk backfills rare enough that one-off scripts inscripts/are fine? My read is the former because we want them to be deployable as ECR images with the same env-var contract as ETL jobs.statement_timeout = '5min'(P2.1) — is there an existing migration that legitimately needs longer? If so I should grep the journal for runtime evidence before merging. (My quick grep showed 0024 and 0043 as DDL+DML candidates, but both are already applied.)if previous instance is older than 2 hours, kill it and start fresh? Without one, a wedged container blocks all future runs until manual remediation. Without that we ship the docs path (operator notices via Sentry/log monitoring); with one, we automate it but risk killing legitimate long runs. Lean toward documenting and not auto-killing.