v0.30.1 feat: operational hardening — make upgrades just work on Supabase#750
Merged
Routes Postgres queries by query type:
- read() goes to the Supabase pooler (port 6543, fast)
- ddl() and bulk() go to direct (port 5432, 30min stmt timeout, mwm 256MB)
Auto-detects Supabase via hostname pooler.supabase.com or port 6543.
Override with GBRAIN_DIRECT_DATABASE_URL. Kill-switch via GBRAIN_DISABLE_DIRECT_POOL=1 falls back to single-pool legacy path.
Foundation modules (Lane A scope):
- src/core/connection-manager.ts: read/ddl/bulk/healthCheck, parent-CM inheritance (T5/X1), cached Promise<Sql> lazy init (A1), kill-switch inheritance (A2), Supabase URL auto-derivation
- src/core/url-redact.ts: redactPgUrl + redactDeep (F3)
- src/core/retry-matcher.ts: typed predicates for stmt-timeout / lock / conn errors (C4)
- src/core/connection-audit.ts: ~/.gbrain/audit/connection-events JSONL with ISO-week rotation; doctor tail-reads last 5 errors (F8)
- scripts/check-pg-url-redaction.sh: CI grep guard against unredacted postgresql:// URL leaks (F3)
Engine integration:
- PostgresEngine.connect: instantiates instance-owned ConnectionManager, inherits from parentConnectionManager when set (worker engines, sync, cycle), shares pool with module-singleton path
- PostgresEngine.disconnect: tears down direct pool first
- PostgresEngine.initSchema: routes DDL through connectionManager.ddl() when dual-pool active (X1 part 1; lock semantics replacement is Lane B)
- cli.ts:connectEngine(opts): probeOnly skips initSchema entirely (X1 part 2; get_health and upgrade --status will use this)
Tests added (51 new cases):
- test/url-redact.test.ts: 11 cases
- test/retry-matcher.test.ts: 13 cases
- test/connection-manager.test.ts: 27 cases (URL detection, derive, kill-switch, parent inheritance, dual-pool routing modes)
Foundation for Lanes B-E. Sequential lane work continues.
Plan: ~/.claude/plans/system-instruction-you-are-working-stateless-wadler.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
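For readers skimming the routing rules in this commit, here is a minimal sketch of the decision they describe. It is not the ConnectionManager code: `looksLikeSupabasePooler`, `choosePool`, and the pool names `primary` / `direct` are hypothetical, and treating a set GBRAIN_DIRECT_DATABASE_URL as enabling dual-pool mode is an assumption. Only the hostname/port heuristics and the two environment variable names come from the commit message.

```typescript
// Hypothetical sketch of the routing decision; not the real ConnectionManager API.
type QueryKind = "read" | "ddl" | "bulk";
type PoolName = "primary" | "direct";

// Supabase pooler heuristic from the commit message:
// hostname under pooler.supabase.com, or port 6543.
function looksLikeSupabasePooler(databaseUrl: string): boolean {
  const u = new URL(databaseUrl);
  return u.hostname.endsWith("pooler.supabase.com") || u.port === "6543";
}

function choosePool(
  kind: QueryKind,
  databaseUrl: string,
  env: Record<string, string | undefined>,
): PoolName {
  // Kill-switch: everything stays on the single legacy pool.
  if (env.GBRAIN_DISABLE_DIRECT_POOL === "1") return "primary";
  // Assumption: an explicit direct URL also turns dual-pool mode on.
  const dualPool =
    Boolean(env.GBRAIN_DIRECT_DATABASE_URL) || looksLikeSupabasePooler(databaseUrl);
  if (!dualPool) return "primary";
  // Reads stay on the pooler; DDL and bulk work go to the direct connection.
  return kind === "read" ? "primary" : "direct";
}
```

Passing `env` explicitly (rather than reading `process.env` inside the function) keeps the sketch testable without mutating global state.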
…force flags
Adds Migration interface fields:
- idempotent: boolean (default true; explicit false blocks verify-hook
re-runs on destructive migrations)
- verify: optional post-condition probe; runs after migration claims success
Migration retry wrapper (Cherry D3 / Finding F2):
- 3 attempts with 5s/15s/45s backoff (env GBRAIN_MIGRATE_BACKOFF_MS=0
for tests)
- Retries only on statement_timeout (57014) or connection-reset patterns
- Pre-attempt: logs idle-in-transaction blockers via getIdleBlockers
- On exhaustion: throws MigrationRetryExhausted with named PID + suggested
pg_terminate_backend() recovery command
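The retry wrapper described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `withMigrationRetry`, `isRetryable`, and the injected `sleep` are hypothetical names, the error class shape is invented, and how the three delays map onto three attempts is not stated in the commit message (here a delay is only taken when another attempt remains). The SQLSTATE 57014 check, the attempt count, and the 5s/15s/45s schedule are from the text.

```typescript
// Hypothetical sketch of a bounded retry with a fixed backoff schedule.
const BACKOFF_MS = [5000, 15000, 45000];
const MAX_ATTEMPTS = 3;

// The PR names MigrationRetryExhausted; this shape is assumed.
class MigrationRetryExhausted extends Error {
  name = "MigrationRetryExhausted";
}

// Retry only on statement_timeout (SQLSTATE 57014) or connection-reset patterns.
function isRetryable(err: unknown): boolean {
  const e = err as { code?: string; message?: string } | null;
  if (!e) return false;
  return e.code === "57014" || /ECONNRESET|connection reset/i.test(e.message ?? "");
}

async function withMigrationRetry<T>(
  run: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      return await run();
    } catch (err) {
      if (!isRetryable(err)) throw err; // non-retryable errors surface immediately
      lastErr = err;
      // Back off only when another attempt will follow.
      if (attempt < MAX_ATTEMPTS - 1) await sleep(BACKOFF_MS[attempt]);
    }
  }
  const detail = lastErr instanceof Error ? lastErr.message : String(lastErr);
  throw new MigrationRetryExhausted(`migration failed after ${MAX_ATTEMPTS} attempts: ${detail}`);
}
```

Injecting `sleep` plays the same role in this sketch that GBRAIN_MIGRATE_BACKOFF_MS=0 plays for the real tests: the schedule can be observed without waiting.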
Verify-hook self-healing (Cherry D6 / Codex X3):
- On verify=false + idempotent=true → re-runs migration once silently
- On verify=false + idempotent=false → throws MigrationDriftError
- --skip-verify CLI flag bypasses for operator override
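The three verify-hook outcomes above reduce to a small decision table. A minimal sketch, assuming a hypothetical `decideAfterVerify` helper and outcome labels that do not appear in the PR; only the idempotent default (true), the single re-run, the drift error, and the --skip-verify bypass are taken from the text.

```typescript
// Hypothetical decision helper for the verify hook; names are illustrative.
type VerifyOutcome = "ok" | "rerun_once" | "drift_error" | "skipped";

function decideAfterVerify(opts: {
  verified: boolean;     // result of the migration's verify() probe
  idempotent?: boolean;  // Migration.idempotent, defaults to true
  skipVerify?: boolean;  // operator passed --skip-verify
}): VerifyOutcome {
  if (opts.skipVerify) return "skipped";
  if (opts.verified) return "ok";
  const idempotent = opts.idempotent ?? true;
  // Safe to re-run idempotent migrations; destructive ones must stop
  // (the PR throws MigrationDriftError in that case).
  return idempotent ? "rerun_once" : "drift_error";
}
```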
withRefreshingLock helper (Cherry T4 / Codex A4 / X1 part 3):
- setInterval refresh every TTL/6 ms during long-running work
- SELECT 1 backend-alive heartbeat per refresh tick
- Heartbeat hang past 30s → log + clear interval; lock TTL auto-expires
- LockUnavailableError when acquire fails (caller decides retry)
- buildTenantLockId(scope) appends current_database() suffix for
multi-tenant safety (Cherry D4)
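A rough sketch of the refresh loop behind withRefreshingLock, under stated assumptions: the `LockOps` interface and the function signature are hypothetical, and the SELECT 1 heartbeat with its 30s hang detection is left out for brevity. What it keeps from the description is the TTL/6 refresh interval, the LockUnavailableError on a failed acquire, and the rule that a failed refresh just stops refreshing so the TTL can expire on its own.

```typescript
// Hypothetical lock operations; a real implementation would hit the database.
interface LockOps {
  acquire(): Promise<boolean>;
  refresh(): Promise<void>;
  release(): Promise<void>;
}

// The PR names LockUnavailableError; this shape is assumed.
class LockUnavailableError extends Error {
  name = "LockUnavailableError";
}

async function withRefreshingLockSketch<T>(
  ops: LockOps,
  ttlMs: number,
  work: () => Promise<T>,
): Promise<T> {
  if (!(await ops.acquire())) {
    throw new LockUnavailableError("lock is held elsewhere"); // caller decides whether to retry
  }
  // Refresh well inside the TTL so a single slow tick cannot lose the lock.
  const timer = setInterval(() => {
    // On refresh failure, stop refreshing and let the TTL expire naturally.
    ops.refresh().catch(() => clearInterval(timer));
  }, Math.max(1, Math.floor(ttlMs / 6)));
  try {
    return await work();
  } finally {
    clearInterval(timer);
    await ops.release();
  }
}
```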
Namespaced --force flags (Codex T5):
- --force-orchestrator: write 'retry' markers for ALL wedged orchestrators
- --force-schema: re-runs runMigrations against current config.version
- --force / --force-all: both
- --force-retry vX.Y.Z: existing single-version reset (preserved)
- --skip-verify: bypass verify-hook drift detection on a single run
Test additions:
- test/migrate-extensions.test.ts: 14 cases (idempotent default,
error envelopes, MIGRATIONS contract)
- test/db-lock-refresh.test.ts: 10 cases (LockUnavailableError,
buildTenantLockId multi-tenant, opts shape)
- test/migrate.test.ts: updated 2 existing cases (PR #356 retry shape +
function-name anchor) for v0.30.1 retry-wrapper semantics
156 unit tests passing across the v0.30.1 surface so far.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First-class generic backfill runner (Fix 3). Generalizes the
keyset+checkpoint+adaptive-batch pattern from
src/core/backfill-effective-date.ts so future backfills (embedding_voyage
in v0.30.2, etc.) reuse one tested runner.
NEW src/core/backfill-base.ts:
- runBackfill() with keyset pagination, config-table checkpoint, adaptive
batch halving on stmt timeout, conn-drop reconnect, max-errors bail
- ensureBackfillIndex() verifies/creates partial index CONCURRENTLY (P2/X4)
- clearBackfillCheckpoint() for --fresh path
- T3 fix: writes go through engine.withReservedConnection so BEGIN /
SET LOCAL / UPDATE / COMMIT execute on the SAME backend (otherwise
SET LOCAL evaporates between pooled executeRaw calls)
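The keyset + checkpoint + adaptive-batch pattern listed above can be illustrated with an in-memory sketch. Everything here is hypothetical (`runBackfillSketch`, `StatementTimeoutSketch`, the callback shapes); the real runBackfill() is async, persists its checkpoint in the config table, and pins writes to one backend via engine.withReservedConnection, none of which is modelled. The sketch only shows the control flow: page by `id > lastId`, halve the batch on a statement timeout, and advance the checkpoint only after a batch succeeds.

```typescript
// Hypothetical in-memory model of a keyset-paginated backfill loop.
interface Row { id: number }
interface BackfillResult { examined: number; lastId: number; finalBatchSize: number }

class StatementTimeoutSketch extends Error {
  code = "57014"; // SQLSTATE for statement_timeout / query_canceled
}

function runBackfillSketch(
  // Stands in for: SELECT ... WHERE id > $1 ORDER BY id LIMIT $2
  fetchAfter: (lastId: number, limit: number) => Row[],
  // Stands in for the per-batch UPDATE; may throw a statement timeout.
  processBatch: (rows: Row[]) => void,
  opts: { batchSize: number; checkpoint?: number; maxErrors?: number },
): BackfillResult {
  let lastId = opts.checkpoint ?? 0; // resume from a saved checkpoint if present
  let batchSize = opts.batchSize;
  const maxErrors = opts.maxErrors ?? 5;
  let errors = 0;
  let examined = 0;
  for (;;) {
    const rows = fetchAfter(lastId, batchSize);
    if (rows.length === 0) break;
    try {
      processBatch(rows);
    } catch (err) {
      if ((err as { code?: string }).code !== "57014") throw err;
      errors++;
      if (errors > maxErrors || batchSize === 1) throw err; // bail: cannot shrink further
      batchSize = Math.max(1, Math.floor(batchSize / 2));
      continue; // retry from the same keyset position with a smaller batch
    }
    examined += rows.length;
    lastId = rows[rows.length - 1].id; // checkpoint moves only after success
  }
  return { examined, lastId, finalBatchSize: batchSize };
}
```

Because the checkpoint is the last successfully processed key, a timeout never skips rows: the next fetch starts from the same position.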
NEW src/core/backfill-registry.ts:
- effective_date: implemented (wraps existing computeEffectiveDate)
- emotional_weight: implemented (wraps computeEmotionalWeight + stamps
new emotional_weight_recomputed_at column)
- embedding_voyage: declared-only in v0.30.1 (multi-column embedding
schema lands in v0.30.2)
NEW src/commands/backfill.ts:
- gbrain backfill <kind> [--batch-size N] [--concurrency N] [--resume]
[--fresh] [--dry-run] [--keep-index] [--max-errors N]
- gbrain backfill list — shows registered backfills + status
- X5 admission control: clampConcurrency() forces --concurrency to
GBRAIN_DIRECT_POOL_SIZE - 1 ceiling (always reserves 1 conn for HNSW
+ heartbeat + doctor probes). Loud-warns when user requests above.
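The X5 admission-control rule is simple enough to show directly. A minimal sketch: the function name matches the one mentioned above, but this signature, the injected `warn` callback, and the floor of 1 are assumptions rather than the PR's actual implementation.

```typescript
// Hypothetical signature; only the "pool size minus one" ceiling is from the PR.
function clampConcurrency(
  requested: number,
  directPoolSize: number,
  warn: (msg: string) => void = (msg) => console.warn(msg),
): number {
  // Always leave one direct connection free for HNSW work, heartbeat, and doctor probes.
  const ceiling = Math.max(1, directPoolSize - 1);
  if (requested > ceiling) {
    warn(
      `--concurrency ${requested} exceeds ceiling ${ceiling} ` +
        `(GBRAIN_DIRECT_POOL_SIZE - 1); clamping to ${ceiling}`,
    );
    return ceiling;
  }
  return Math.max(1, requested);
}
```

With a direct pool of 3, a request for 8 workers is clamped to 2 and one warning is emitted.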
Schema migration v44 (X4 / Codex C8 fix):
- pages.emotional_weight_recomputed_at TIMESTAMPTZ
- emotional_weight = 0 is a VALID steady-state value per migration v40,
so the original P2 predicate ("WHERE emotional_weight = 0") would have
been a permanent large index over normal data. The corrected backlog
predicate is "emotional_weight_recomputed_at IS NULL"; the partial
index shrinks naturally as the cycle phase + this backfill stamp the
column over time.
- idempotent: true (ADD COLUMN ... NULL is metadata-only)
CLI integration:
- src/cli.ts: registers `backfill` subcommand
- reindex-frontmatter stays as thin alias for v0.30.1 back-compat;
canonical entrypoint is now `gbrain backfill effective_date`
Test additions:
- test/backfill-base.test.ts: 11 cases (keyset, checkpoint, dry-run,
resume/fresh, maxRows cap, withReservedConnection routing, error
paths, clearCheckpoint, ensureBackfillIndex)
- test/backfill-concurrency-clamp.test.ts: 6 cases (X5 admission control)
173 unit tests passing across Lanes A+B+C of v0.30.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends src/core/vector-index.ts with the v0.30.1 lifecycle layer.
The original chunkEmbeddingIndexSql / applyChunkEmbeddingIndexPolicy
contract is preserved unchanged.
New surfaces:
- checkActiveBuild(engine, indexName): probes pg_stat_activity for an
active CREATE INDEX or REINDEX on the named index. Used as pre-op
guard so dropAndRebuild doesn't compete with a build already in
flight (Supabase auto-maintenance, parallel gbrain procs).
- dropZombieIndexes(engine, tableNames): startup sweep of
indisvalid=false rows on gbrain tables. Drops them with
DROP INDEX IF EXISTS, BUT skips any zombie that has an active build
still in pg_stat_activity (codex Fix-5 in-progress-build guard).
Wired into PostgresEngine.initSchema() — runs after migrations +
verifySchema, best-effort, never blocks engine.connect().
- dropAndRebuild(engine, spec, opts): A3 atomic-swap pattern:
1. checkActiveBuild → bail if another build is active (--force overrides)
2. CREATE INDEX CONCURRENTLY <name>_rebuild_<unix-ms> via
engine.withReservedConnection (CONCURRENTLY can't run in a txn)
3. Atomic swap inside engine.transaction:
DROP INDEX <old-name>
ALTER INDEX <temp-name> RENAME TO <old-name>
4. If step 2 fails (OOM, timeout, conn drop), the OLD index stays
intact and search keeps serving queries. This is the headline
A3 win — no production-degraded silent failure mode.
- monitorBuild(engine, indexName, onProgress, opts): poll
pg_stat_activity every 30s; emit elapsed_ms + size_bytes (via
pg_relation_size) + pid. Used by gbrain backfill embedding_voyage
when batch > 1000 triggers a rebuild.
- isSupabaseAutoMaintenance(active): predicate on application_name
(matches "supabase" / "postgres-meta"). Used by dropAndRebuild to
log + back off when Supabase auto-maintenance is doing the rebuild.
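The ordering that makes dropAndRebuild safe (build first, swap second) can be sketched against an injected engine. The `RebuildEngine` interface, `dropAndRebuildSketch`, and its return values are hypothetical; the real code goes through engine.withReservedConnection and engine.transaction. Identifier quoting is omitted here and would be needed in real SQL.

```typescript
// Hypothetical engine surface; the real one is PostgresEngine.
interface RebuildEngine {
  hasActiveBuild(indexName: string): Promise<boolean>; // pg_stat_activity probe
  runOutsideTxn(sql: string): Promise<void>;           // CONCURRENTLY cannot run in a txn
  runInTxn(statements: string[]): Promise<void>;       // atomic swap
}

async function dropAndRebuildSketch(
  engine: RebuildEngine,
  indexName: string,
  createSql: (name: string) => string,
  opts: { force?: boolean; now?: () => number } = {},
): Promise<"rebuilt" | "skipped_active_build"> {
  // Step 1: do not compete with a build that is already in flight.
  if (!opts.force && (await engine.hasActiveBuild(indexName))) {
    return "skipped_active_build";
  }
  // Step 2: build under a temp name. If this throws, nothing below runs
  // and the old index keeps serving queries.
  const tempName = `${indexName}_rebuild_${(opts.now ?? Date.now)()}`;
  await engine.runOutsideTxn(createSql(tempName));
  // Step 3: swap inside one transaction so readers never see a missing index.
  await engine.runInTxn([
    `DROP INDEX ${indexName}`,
    `ALTER INDEX ${tempName} RENAME TO ${indexName}`,
  ]);
  return "rebuilt";
}
```

The failure property falls out of the ordering alone: the old index is only dropped after the replacement exists and is valid.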
Engine integration:
- PostgresEngine.initSchema() calls dropZombieIndexes after verifySchema.
Surfaces zombie counts via console.log.
- Best-effort wrapped in try/catch: pg_stat_activity / pg_index access
can be restricted on managed Postgres tiers; gbrain shouldn't fail
engine.connect() over diagnostic queries.
Test additions (18 cases):
- test/vector-index-lifecycle.test.ts:
* chunkEmbeddingIndexSql contract (3 cases) — pre-existing behavior preserved
* applyChunkEmbeddingIndexPolicy contract (1 case)
* checkActiveBuild (4 cases, including PGLite no-op + best-effort failure)
* isSupabaseAutoMaintenance (3 cases)
* dropZombieIndexes (4 cases, including in-progress-build guard)
* dropAndRebuild atomic-swap (3 cases, including PGLite + active-build bail
+ temp-name format assertion)
191 unit tests passing across Lanes A+B+C+D of v0.30.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…health migrations
NEW src/core/upgrade-checkpoint.ts:
- Cherry D5: persists step-by-step progress through gbrain post-upgrade
so partial failures can be resumed via gbrain upgrade --resume.
Steps: pull → install → schema → features → backfills → verify.
- Codex X2: checkpoint binds to brain identity via sha256(database_url)
(userinfo stripped before hashing so cred rotations don't invalidate).
PGLite uses sha256(database_path). Cross-brain checkpoint application
is now refused with reason='brain_mismatch'.
- F4 fall-through: validateCheckpoint returns reason='no_checkpoint'
when none exists, enabling silent fall-through to a full upgrade.
- All-complete detection: stale checkpoints (every step done) return
reason='all_complete' so the next run clears + re-runs from scratch.
- markStepComplete + markStepFailed maintain the partial-state shape.
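To make the brain-identity binding and the validation reasons concrete, here is a small sketch. It is not the module's code: `computeBrainIdSketch`, `validateCheckpointSketch`, and the `Checkpoint` shape are hypothetical, and hashing only scheme, host, and path (dropping query parameters along with userinfo) is an assumption. The step order and the three reason strings are taken from the text above.

```typescript
import { createHash } from "node:crypto";

// Assumption: identity = sha256 over the URL with userinfo removed,
// so rotating credentials does not change the brain id.
function computeBrainIdSketch(databaseUrl: string): string {
  const u = new URL(databaseUrl);
  const withoutUserinfo = `${u.protocol}//${u.host}${u.pathname}`;
  return createHash("sha256").update(withoutUserinfo).digest("hex");
}

const STEPS = ["pull", "install", "schema", "features", "backfills", "verify"] as const;
type Step = (typeof STEPS)[number];

interface Checkpoint { brainId: string; completed: Step[] } // hypothetical shape

type Validation =
  | { ok: true; resumeAt: Step }
  | { ok: false; reason: "no_checkpoint" | "brain_mismatch" | "all_complete" };

function validateCheckpointSketch(cp: Checkpoint | null, currentBrainId: string): Validation {
  if (!cp) return { ok: false, reason: "no_checkpoint" }; // fall through to a full upgrade
  if (cp.brainId !== currentBrainId) return { ok: false, reason: "brain_mismatch" };
  for (const step of STEPS) {
    if (cp.completed.indexOf(step) === -1) return { ok: true, resumeAt: step };
  }
  return { ok: false, reason: "all_complete" }; // stale checkpoint: clear and re-run
}
```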
T2 preserved: upgrade.ts still re-execs `gbrain post-upgrade` so the NEW
binary's migration registry runs (the existing re-exec pattern is correct
per codex round 1's plan-breaking finding). The checkpoint module is the
substrate that Lane E's --resume / --status surfaces will plumb through
in v0.30.2.
D7 + C3 contract committed:
- BrainHealth.schema_version: '1' (literal type) — additive-only contract
pinned for MCP get_health consumers.
- BrainHealth.migrations: { schema, orchestrator } — explicit two-ledger
diagnostic surface (codex T5 namespacing). Both fields are OPTIONAL
in v0.30.1 — engines can populate them in v0.30.2 without a contract
bump. Backwards/forwards compat: clients default-handle missing fields.
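As an illustration of what "clients default-handle missing fields" could look like on the consumer side, here is a sketch. The interface below is a guess at the shape, not the committed type: the contents of `schema` and `orchestrator` are not specified in this PR, so they are typed as `unknown`, and `describeMigrations` is a hypothetical client helper.

```typescript
// Hypothetical mirror of the contract; ledger payload types are unknown here.
interface BrainHealthSketch {
  schema_version: "1"; // literal type: additive-only contract
  migrations?: { schema?: unknown; orchestrator?: unknown };
}

// A client that tolerates engines which do not populate the field yet.
function describeMigrations(health: BrainHealthSketch): string {
  const m = health.migrations;
  if (!m) return "migrations: not reported";
  const reported: string[] = [];
  if (m.schema !== undefined) reported.push("schema");
  if (m.orchestrator !== undefined) reported.push("orchestrator");
  return reported.length > 0
    ? `migrations: ${reported.join(", ")}`
    : "migrations: not reported";
}
```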
VERSION: 0.30.0 → 0.30.1
package.json: synced
Test additions (18 cases):
- test/upgrade-checkpoint.test.ts:
* computeBrainId: userinfo strip, DB-distinct hashes, stable hex (5 cases)
* write/load round-trip: roundtrip, missing file, malformed JSON,
clear (4 cases)
* validateCheckpoint: F4 no_checkpoint, X2 brain_mismatch, partial
→ resumeAt, all_complete, first-step pending (5 cases)
* markStepComplete/markStepFailed: append, idempotent, clear-failed,
failed-state shape (4 cases)
209 unit tests passing across all 5 lanes of v0.30.1 (Lanes A-E core
foundations). Plumbing into upgrade.ts CLI + doctor checks +
get_health() implementation is layered in via follow-up commits within
this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NEW test/e2e/v030_1-integration-pglite.test.ts (14 cases):
PGLite integration smoke proving Lane A-E surfaces work together.
Lane B: migration runner applies v44 (emotional_weight_recomputed_at)
cleanly; config.version reaches LATEST_VERSION
Lane C: backfill registry resolves all 3 entries; emotional_weight +
effective_date backfills on empty brain return examined=0
cleanly
Lane D: dropZombieIndexes / checkActiveBuild on PGLite are no-ops
Lane E: upgrade-checkpoint round-trips with brain_id; X2 mismatch
refused; F4 fall-through detected via reason='no_checkpoint';
full step progression to all_complete
Test isolation hygiene (scripts/check-test-isolation.sh):
- test/connection-manager.test.ts → connection-manager.serial.test.ts
- test/backfill-concurrency-clamp.test.ts → .serial.test.ts
- test/upgrade-checkpoint.test.ts → .serial.test.ts
All three files mutate process.env (kill-switch, GBRAIN_DIRECT_POOL_SIZE,
GBRAIN_HOME) which would race other tests in the parallel runner.
*.serial.test.ts quarantine ensures they run at --max-concurrency=1.
Serial quarantine was chosen over a withEnv() refactor to preserve the
existing well-formed test code.
E2E coverage status:
- v030_1-integration-pglite.test.ts (this commit): 14 cases, all green
- backfill-perf-pglite.test.ts: 1 case, green (no regression)
- cycle-recompute-emotional-weight-pglite.test.ts: green (no regression)
- multi-source-emotional-weight-pglite.test.ts: green (no regression)
- dream-synthesize-pglite.test.ts: 14 cases, green (no regression)
- anomalies-pglite.test.ts + salience-pglite.test.ts: 6 cases, green
Postgres-only E2Es (migration-flow, http-transport, hnsw-lifecycle,
connection-routing) require DATABASE_URL + a real Postgres+pgvector
container per the CLAUDE.md E2E lifecycle. They land as separate
DATABASE_URL-gated work — not regressed by v0.30.1 changes; their
preconditions just aren't met in the current run environment.
`bun run verify` (typecheck + 4 shell pre-checks + test-isolation lint)
passes cleanly.
Final v0.30.1 unit + integration test count: 4547 pass, 0 regressions.
Two pre-existing flaky failures (BrainRegistry serial test + warm-create
perf gate under shard contention) confirmed unrelated to this branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Operational hardening — gbrain upgrade just works on Supabase. Twelve releases in a weekend taught us where the cracks are. v0.30.1 fixes the substrate: DDL stops timing out on the pooler, migrations stop wedging, HNSW rebuilds stop nuking your search, backfills stop being bespoke scripts.
40 substantive scope decisions across 4 review rounds (CEO + plan-eng + 2x codex outside-voice) before any code shipped. Every codex round-1 plan-breaking finding (re-exec model, executeRaw not pinning, lock TTL/refresh, two-ledger reconciliation) and round-2 engine-topology gap was addressed BEFORE implementation. Six bisect-friendly commits + one CHANGELOG bump.
Connection routing
- ConnectionManager auto-detects Supabase via hostname pooler.supabase.com or port 6543
- read() → pooler (10 conns), ddl() / bulk() → direct (port 5432, 30-min stmt timeout, capped at 3, mwm 256MB)
- GBRAIN_DIRECT_DATABASE_URL override; GBRAIN_DISABLE_DIRECT_POOL=1 kill-switch
- Cached Promise<Sql> lazy init prevents concurrent first-call race (codex A1)
- CI grep guard against unredacted postgresql:// URL leaks (scripts/check-pg-url-redaction.sh) (F3)
Migration runner
- getIdleBlockers() logged before each retry; named-PID error UX with pg_terminate_backend(<pid>) recovery command (F2)
- Migration.idempotent field (default true) + Migration.verify post-condition probe (cherry D6)
- MigrationDriftError blocks re-run on non-idempotent migrations; --skip-verify to force
- Tenant lock IDs carry a current_database() suffix (cherry D4)
- withRefreshingLock helper: setInterval refresh every TTL/6 ms with SELECT 1 heartbeat (cherry T4 + codex A4)
- Namespaced --force-schema / --force-orchestrator / --force flags (codex T5)
- retry-matcher.ts consolidates 3 scattered retry-eligibility predicates (C4)
Backfill primitive
- gbrain backfill <kind> first-class command with keyset+checkpoint+adaptive-batch+pinned-backend
- effective_date (impl), emotional_weight (impl), embedding_voyage (declared, schema in v0.30.2)
- BackfillSpec.requiredIndex auto-creates partial index CONCURRENTLY on first run (codex P2/X4)
- --concurrency admission control: clamps to GBRAIN_DIRECT_POOL_SIZE - 1 with loud warning (codex X5)
- Writes go through engine.withReservedConnection so SET LOCAL semantics hold across BEGIN/UPDATE/COMMIT
- Schema migration v44: pages.emotional_weight_recomputed_at TIMESTAMPTZ (codex C8 corrected predicate)
HNSW lifecycle
- dropAndRebuild atomic-swap: build new with temp name, ALTER...RENAME swap, drop old. If rebuild fails, OLD index intact (codex A3; closes the silent-production-degraded failure mode)
- dropZombieIndexes startup sweep guards against in-progress builds via pg_stat_activity
- monitorBuild polls every 30s during long rebuilds with size + worker count
- isSupabaseAutoMaintenance predicate so we back off when Supabase is rebuilding for us
- Wired into PostgresEngine.initSchema() post-verification
Upgrade pipeline foundation
- src/core/upgrade-checkpoint.ts with brain-identity binding via sha256(database_url) (codex X2)
- validateCheckpoint returns no_checkpoint / brain_mismatch / all_complete / valid+resumeAt
- gbrain post-upgrade re-exec barrier intact (codex round 1 #1 finding)
- BrainHealth.schema_version: '1' + optional migrations: {schema, orchestrator} shape committed for D7 (engine impls land in v0.30.2)
X1 engine-topology integration (codex round 2)
- PostgresEngine.initSchema() routes its DDL through connectionManager.ddl() instead of the read pool
- connectEngine({ probeOnly: true }) skips initSchema entirely so get_health and gbrain upgrade --status can never start migrations
Test Coverage
Pre-Landing Review
Plan-eng-review surfaced 11 findings (A1-A4, C1-C4, T1, P1-P2) — all adopted into the plan and implemented. /codex round 2 found 8 NEW findings (X1=C1+C2+C3 combined, X2, X4, X5, X6, X3 documented limitation) — all adopted. Pre-landing review on the diff: structural correctness verified via the unit + integration test suites. PostgresEngine.initSchema() routing through ddl() is the substrate the plan's thesis depends on.
Adversarial Review
Two rounds of codex outside-voice during planning caught 16 findings (4 plan-breaking, 8 spec-tightening, 4 minor). All addressed before implementation began. The implementation is what survived adversarial review of the plan.
Plan Completion
40/40 scope items implemented across 6 bisect-friendly commits. See the per-lane commits for the trail. Honest-scope notes in the CHANGELOG document what's deferred to v0.30.2 (CLI plumbing for gbrain upgrade --resume/--status, three new doctor checks, engine getHealth() populating the migrations field, multi-column embedding schema migration).
TODOS
The TODOS.md item "Migration introspection in get_health" is partially addressed: type contract committed (schema_version: '1', optional migrations: {schema, orchestrator} field). Engine implementations populate the field in v0.30.2 alongside the multi-column embedding schema migration. Leaving the TODO open with a note.
Test plan
- bun run verify: typecheck + 4 shell pre-checks + test-isolation lint all pass
🤖 Generated with Claude Code