fix(#49): shared daemon registry + cross-pod consistency (5 bugs) + hardened rollout#58
Merged
Merged
Conversation
…nces (issue #49) Postgres-backed registry of online daemons, advertised by owning pod's URL. Read paths query the table; command/turn paths forward to the owning pod via /api/commander/_internal/forward authenticated by a shared cluster secret. Single-pod (SQLite or cluster-config unset) keeps the in-memory registry unchanged. Scope (this fix): registry + command forwarding. turnStateStore and sessionListCache remain per-pod as follow-up issues. CI/CD: chart gains cluster.* block + fail-fast when replicaCount>1 without secret; observer-deploy.yml bumps smoke replicas to 2 and generates the cluster secret; release pulls OBSERVER_CLUSTER_SECRET from repo secrets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Key changes vs v1:
Security (B1):
- /forward moves off the public mux onto a separate :8091 listener
exposed by a non-Ingress'd Service.
- Auth becomes HMAC(timestamp || body) with a 60s replay window, not
a static bearer header.
- Public Ingress/HTTPRoute add explicit deny rule for /api/commander/_internal/.
Back-pressure (B2):
- Forwarding receiver uses buffer 256 (not 16) on the drain channel.
- Drop counter + synthetic 'truncated' event envelope on overflow.
- 1 MiB cap enforced on each length-prefixed envelope both directions.
Sweep safety (B4):
- Heartbeat is UPSERT (not UPDATE) → self-healing.
- Sweep deletes only rows >5min old (not 45s). 45s is the
online-for-reads threshold, not the deletion threshold.
- Postgres hiccup makes daemons briefly invisible, never deleted.
Misconfig footgun (B3, M4):
- validateConfig is fail-closed: partial cluster.* config → fatal at
startup, no silent fallback to single-pod mode.
- Init container asserts OBSERVER_CLUSTER_SECRET env non-empty
(catches existingSecret users who forgot the key).
- Chart fail-fast triggers when replicaCount>1 without cluster.enabled.
Cancellation (M1):
- Spec'd: caller cancel → close request body → receiver ctx cancel →
removePending → daemon slot freed. Has a test.
Session cache (M3):
- invalidateDaemonSessions moves into routeFrame on the owning pod;
http.go calls are kept as belt-and-suspenders.
Test plan (M6):
- Mirrors existing OBSERVER_POSTGRES_TEST_DSN env-skip pattern, not
dockertest (which is not in the repo).
Hub.reg compat (M7):
- Field type stays *localRegistry; new sharedReg is a separate field.
- 30+ test-site references to hub.reg.{add,daemons} preserved verbatim.
Helm chart (M8, M9):
- Dev values.yaml flips replicaCount 2 → 1 so chart's new fail-fast
doesn't break the default render.
- New init container catches existingSecret users missing the key.
- New internal Service (port 8091) without Ingress.
Rolling update (M5):
- maxUnavailable=0, maxSurge=100% to collapse the mixed-version window.
- Rollback path documented.
CI/CD (M10):
- Line numbers re-verified against current master.
- ::add-mask:: applied to the generated secret.
- Smoke probe extended to per-pod IP wget.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes 9 BLOCKERs + 14 MAJORs surfaced by codex review of v2:
BLOCKERs:
- B#1: ch.turn's local-only reg.lookup guard (http.go:226) replaced with
hub.lookupDaemon(ctx, owner, daemonID) that consults shared registry.
- B#2: /tree, /sessions, FanOutSessions now use h.listDaemons(ctx, o)
instead of h.reg.daemons(o); see all pods.
- B#3: Cancellation propagation no longer claims pendingEntry.ch closes;
factored sendCommandStreamToLocal returns a closeable wrapper channel
identical to today's pattern; ctx cancel exits via wrapper close.
- B#4: Heartbeat UPSERT has explicit ownership guard
(WHERE owning_instance_url = EXCLUDED.owning_instance_url) so it
cannot steal ownership back after fast reconnect. hbCancel + <-hbDone
ordering ensures heartbeat exits before deferred DELETE.
- B#5: Internal server lifecycle moved into cmd/observer-server/main.go;
observerweb.NewWithResolverOptions returns (publicHandler, internalHandler);
errgroup-coordinated Shutdown.
- B#6: internal_listen_addr default applies ONLY when cluster is enabled;
a single-pod postgres deployment with cluster.* empty passes validation.
- B#7: Helm fail-fast moved to templates/_validate.yaml (always rendered,
not gated by secret.create); existingSecret users now hit the validation.
- B#8: CI smoke pod-IP probe resolved in the GitHub runner (kubectl
available) and rendered into the busybox Job manifest as a static cmd
list; busybox no longer needs kubectl/RBAC.
- B#9: Rolling-update strategy claim explicitly de-claimed; chart sets
maxUnavailable:0,maxSurge:100% but mixed-version window is documented
honestly with operator drain procedure.
MAJORs:
- M#10: sessionListCache disabled in shared mode (per-pod cache + cross-pod
invalidation cost dwarfs single-pod hit-rate benefit).
- M#11: turnStateStore extracted to interface; new pgTurnStore implements
it; routeFrame on owning pod is single writer; turns.begin() row-level
lock provides cross-pod dedup; commander_turns table added.
- M#12: lookupDaemon takes ctx.
- M#13: Forward wire cap raised to 3 MiB to cover MaxFilePreviewBytes (2 MiB)
+ JSON overhead; read_file now safely forwarded.
- M#14: Forward client maps {error:{code,message}} back to *DaemonError so
writeSendCmdError continues to produce correct HTTP statuses.
- M#15: HMAC auth gains X-Observer-Cluster-Nonce + Postgres-backed
commander_forward_nonces table (replay-proof within 60s window).
- M#16: Secret rotation supported via cluster.prev_secret_env; receiver
validates against both Secret and PrevSecret.
- M#17: HTTPRoute hardening uses concrete supported syntax (more-specific
rule with no backendRefs returns 503 per Gateway API spec); nginx uses
a separate ingress backend pointing at a non-existent Service.
- M#18: initContainers blocks merged into one conditional emission so
cluster init container coexists with Postgres-wait init.
- M#19: Internal Service is headless (clusterIP:None,
publishNotReadyAddresses:true) so forwarding-by-pod-IP is consistent
with what DNS would resolve. Forwarding still dials IPs directly.
- M#20: MountAll signature picked: (publicMux, internalMux, resolver,
agentserverURL, store, cluster); the two existing callers updated.
- M#21: Hub.Close() closes forward client idle conns; called from
observer server shutdown chain.
- M#22: newPublicHTTPServer/newInternalHTTPServer split with WriteTimeout:0
for streaming. (Pre-existing bug: 60s WriteTimeout on the public
listener would have killed 10-min SSE turns; fix folded in.)
- M#23: Metrics downgraded to structured WARN logs with rate limiting
(repo has no metrics infra); future follow-up to add an exporter.
MINORs/NITs:
- m#24: Invalidation line refs corrected (http.go:132 is MethodGet check,
not invalidation; disconnect invalidation is hub.go:132).
- m#25: Hub.reg "field type unchanged" rephrased to "field name + method
surface compatible" (type renames *registry → *localRegistry).
- m#26: observerweb.Options described as "includes AgentserverURL,
AuthStore among other fields".
- m#27: Test plan uses go-sqlmock (against *sql.DB) not pgxmock (which
isn't in go.mod).
- m#28: go.work claim removed (only multi-agent/go.mod exists).
- m#29: Daemon-ID collision phrased honestly: probability negligible,
recovery requires explicit WS close on ownership loss (out of scope).
Diff: +1180 / -650 vs v2.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Security: - B#1: nonce insert MOVED after HMAC verify + body read (was first step). Fail-closed on PG nonce-table unavailable. Length-check + format-check headers before body read. Constant-time HMAC via hmac.Equal on fixed [32]byte arrays. - B#2: secret rotation made 3-phase (acceptance, introduce, retire) + on 403 sender retries ONCE with PrevSecret to handle mid-rollout drift. - B#3: internal mount path is now /api/commander/_internal/forward (was /forward), matches the public-Ingress deny prefix. - M#8: audit log added on every forward.received and forward.sent (no secret material; goes to stderr). - M#9: secret length >= 32 enforced in validate.yaml AND init container AND validateConfig (three layers). - M#10: HMAC compare uses hmac.Equal + fixed [32]byte arrays (no length side-channel from variable hex input). - M#15: new templates/networkpolicy.yaml restricts internal port 8091 to observer pods only. Helm/rendering: - B#4: cluster: config moves from observer.yaml (in Secret, gated by secret.create) to observer.nonsecret.yaml (in ConfigMap, always rendered). Production existingSecret deployments now actually enable cluster mode. Config loader merges nonsecret/observer.nonsecret.yaml on top of main observer.yaml. - B#5: _validate.yaml renamed to validate.yaml (no underscore) to ensure Helm processes top-level actions. Identity: - B#6: registry PK changed from (user, workspace, daemon_id) to (user, workspace, short_id). daemon_id was per-connection random; reconnect created new rows rather than UPSERT. DaemonInfo.DaemonID now exposes short_id (stable across reconnects; UI URLs survive). commander_turns also re-keyed on short_id. Cluster mode REQUIRES daemon to register with non-empty short_id; refuses WS otherwise. Wire/limits: - B#7: forward cap raised 3 MiB -> 4 MiB; observer-side wsReadLimit raised 1 MiB -> 4 MiB (fixes pre-existing latent bug where 2 MiB file_read content already broke single-pod). Worst-case JSON expansion math documented. Concurrency: - M#11: turnStateBackend interface methods all gain context.Context; Postgres impl sets lock_timeout=500ms / statement_timeout=5s per txn. updateFromEnvelope and cleanupOrphans added to the interface. - M#12: shared-mode WS admission gates on successful sharedReg.upsert BEFORE local registry add. PG failure refuses the WS (avoids split brain). - M#13: missing interface methods added. - M#14: no FK from commander_turns to commander_daemons. Rationale explained inline in schema. Operations: - B#7 (cont): drain endpoint /api/commander/_internal/drain on internal mux. preStop lifecycle hook posts to it on pod termination so daemons reconnect to new pods immediately, shortening mixed-version window from ~60s to ~5s. - M#16: rollout sequence rewritten honestly. Mixed-version window exists; preStop hook shortens it; blue/green eliminates it (out of scope, documented as follow-up). - m#17: validate.yaml also catches replicaCount>1 with sqlite. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- B#1: connection_id propagated through localRegistry.removeIf + sharedReg.remove + heartbeatUpsert WHERE clause. Heartbeat-loss handler force-closes the WS to evict displaced connections. - B#2: revert wsReadLimit raise; cap forward body at 1.5 MiB, envelope at 1 MiB. Daemon-side files.go enforces JSON-encoded size <= 768 KiB before sending, killing the 12 MiB C0-control-bytes worst case. - B#3: /api/commander/_internal/drain now requires HMAC unless source is loopback (preStop hook bypass). - B#4: NetworkPolicy adds explicit allow rule for public 8090 port from anywhere; restricts 8091 to observer peers only. Single-rule version would have killed public traffic. - M#5: path coherence — added a route table; replaced remaining /forward references with /api/commander/_internal/forward; replaced 3 MiB with 1 MiB / 1.5 MiB. - M#6: turn_state SQL example updated to use (user, workspace, short_id, session) PK and RETURNING (xmax=0) AS inserted. - M#7: preStop switched from exec(wget) to httpGet (Kubernetes-native; no image-dep). Drain handler accepts GET too. - M#8: threat model section explicitly documents cluster-secret = full-cluster compromise for commander surface. Mitigations + rotation playbook documented. Per-tenant capability tokens listed as follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- B#1: unified connectionID with existing dc.id (no new field); documented honest 5-15s race window between sibling claim and losing pod's heartbeat-driven WS close. - M#2: preStop switched from httpGet (runs from kubelet, not in container) to exec /usr/local/bin/observer-server --drain-local. New subcommand to be added. - M#3: daemon-binary rollout coordination — driver-agent + slave-agent both import internal/commander; release.yml workflow handles those; deploy/README.md notes coordination requirement; capability gate 'file_preview_encoded_cap'. - M#4: cap reference sweep — Approach §2 + component map + receiver step 4 + streaming response all aligned to 1 MiB envelope / 1.5 MiB body; wsReadLimit explicitly UNCHANGED. - M#5: turnKey field daemonID → shortID rename; 10 caller sites in http.go identified. - M#6: files.go function name corrected to Handler.ReadFile; test claim corrected to assert TooLarge for pathological 2 MiB input. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: race window eliminated via pre-send cached ownership check (5s cache, 500ms PG timeout, fail-fast as ErrDaemonGone instead of hanging until TurnTimeout). - M#2: newDaemonID is now 128-bit (was 64-bit) and propagates crypto/rand errors; WS admission refuses on entropy starvation. - M#3: capability gate 'file_preview_encoded_cap' now ENFORCED in shared mode (was log-only): observer returns 400 to UI for read_file on old daemons. - M#4: --drain-local validates internal_listen_addr binds to loopback-coverage address (:8091, 0.0.0.0:8091, 127.0.0.1:8091); validateConfig rejects pod-IP-specific binds; preStop is best-effort on failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: race-window cache TTL tightened 5s → 1s. ≤1s residual window honestly stated. - M#2: capability-gate uses new ErrCodeDaemonUpgradeRequired → HTTP 426 (was ErrCodeBackendUnavailable → 502). - M#3: --drain-local exits 1 on config-read errors (was exit 0 silently); only tolerates connection errors after config validated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: preStop exec now passes --config /etc/observer/observer.yaml (matched main container). - M#2: positive ownership cache ELIMINATED; per-send fresh PG check (sub-ms typical) plus sticky negative cache via atomic.Bool. Residual race window = zero. Cost: +1 PG SELECT per command (manageable). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…localRegistry, turn_state interface)
…ent 4839308595
…cross-pod findings Adds three additional cross-pod-consistency layers: - (5) identity.cache TTL skew → default FreshTTL=30s in shared mode + optional PG LISTEN/NOTIFY revocation channel. - (6) authstore.NewInMemoryStore selected when store.driver != postgres in multi-pod → validateConfig fatal. - (7) Hub.cmdSeq base-36 collisions across pods → pod-hash prefix. Finding A (per-pod turn_state) and Finding B (per-pod sessionListCache) from the same comment were already covered in v3/v5 via commander_turns table and disable-cache-in-shared-mode. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: identity error name corrected: publishes on ErrInvalid + ErrRevoked (existing repo names; not ErrForbidden). - M#2: NOTIFY publishes on EVERY deny regardless of local cache state, so cross-pod (pod-A sees deny without local cache, pod-B has stale cache hit) actually propagates. - M#3: revocation_channel wired into AgentserverIdentityConfig (yaml field, KnownFields-compatible). NewCache becomes NewCache(d, cfg, opts...CacheOption) for back-compat with existing callers. - M#4: replicaCount validation moved to chart layer ONLY (binary cannot see replicaCount); binary keeps cluster.enabled + store.driver != postgres rule. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: split LISTEN (*pgx.Conn, single-conn, blocks) from publish (*sql.DB pool); pgx.Conn is not goroutine-safe and WaitForNotification monopolizes the conn. - M#2: chart actually renders revocation_channel (new values key + secret.yaml emission) and fresh_ttl override flows via values-production.example.yaml (template default stays 180s for back-compat). - M#3: ErrInvalid amplification mitigated — publish only if hash present in local cache OR rate-limited 100/s/pod. ErrRevoked always published. - M#4: validation template path corrected throughout (templates/validate.yaml, no underscore; per-codex round-3 fix); component map updated. - M#5: cmdID single-pod output stays byte-exact (no prefix, no dash); only shared mode adds <podHash>-<seq>. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…om issue #49 final audit The audit comment confirmed 5 bugs total: #49 + A + B + D + E. Spec v12 covered #49 + A + B + D. v13 adds Finding E: - Layer 8: telemetryAllower interface; in-memory variant (single-pod, unchanged) and *pgTelemetryLimiter (shared mode, atomic UPSERT against commander_telemetry_buckets). - New commander_telemetry_buckets table (PG 14+ INSERT...ON CONFLICT DO UPDATE with LEAST + EXTRACT(EPOCH) for refill in a single statement). - PG unavailable → 503 (fail-closed, NOT fail-open to broken per-pod). - Hot-key contention bounded by lock_timeout=100ms. - Sweeper extends to prune buckets idle > 1 hour. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- B#1: telemetry UPSERT refill computed from b.tokens (existing row), not EXCLUDED.tokens (would reset hot bucket near full every request). - B#2: rate key split into composite PK (workspace_id, agent_id, telemetry_key_id) — PG text cannot contain NUL bytes. - M#3: allow() returns (bool, error) so handler distinguishes 429 (denied) from 503 (PG unavailable / 55P03 lock_timeout / context cancel). - M#4: PG telemetry table migration gates on (commander_enabled OR (telemetry_enabled AND cluster.enabled)); selection rule for PG limiter requires cluster.enabled (NOT just store.driver=postgres), so single-pod-with-PG keeps in-memory limiter. - M#5: revocation component map updated to match v12 fixes (listener/publisher split; ErrInvalid gated + rate-limited). - M#6: fresh_ttl 30s default applied AFTER YAML decode via yamlPathExists check (was pre-seeded to 180s before decode, never fired). - M#7: fresh_ttl + revocation_channel emit into observer.nonsecret.yaml (always-rendered ConfigMap) so existingSecret deployments actually receive them; loader merge added in v3 carries them into Config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: component-map entry for telemetryAllower now shows correct (bool, error) signature; handler 429/503 mapping documented inline. - M#2: migration + selection unified to single predicate (telemetry.enabled && cluster.enabled && store.driver=postgres); OBSERVER_MULTIPOD env-var path dropped. - M#3: lock_timeout=100ms specified via SET LOCAL inside explicit BeginTx/Commit; sql.DB pool can't safely target individual queries with session-level settings. - M#4: configmap snippet at line ~1048 now actually shows fresh_ttl + revocation_channel emission lines (was promised but missing from snippet). - M#5: AgentserverIdentityConfig.FreshTTL + RevocationChannel become pointer-nullable so post-merge defaulting can detect explicit overrides in EITHER YAML file (secret-mounted observer.yaml OR configmap-mounted observer.nonsecret.yaml). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: chart default freshTTL flipped to '' so configmap emits nothing → binary pointer-nullable post-merge default fires correctly (30s shared / 180s single-pod). - M#2: values-production.example.yaml explicitly sets freshTTL=30s AND revocationChannel=postgres so existingSecret production deployments actually opt in. - M#3: identity schema component-map row updated to FreshTTL *durationConfig + RevocationChannel *string with post-merge defaulting; configmap snippet uses 'emit only if non-empty' for both fields. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: templates/secret.yaml now conditionally renders fresh_ttl + revocation_channel (was hard-coded 180s default which masked binary's pointer-nullable cluster default). - M#2: values-production.example.yaml snippet expanded to actually show config.identity.agentserver.freshTTL + revocationChannel + revocationChannelEnabled, not just replicaCount + cluster.enabled. - M#3: separate revocationChannelEnabled boolean lets operators explicitly opt OUT of revocation channel in shared mode (was pointer-nullable couldn't represent explicit-empty via Helm YAML). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: revocationChannel is an enum (auto|enabled|disabled) with default 'auto' so Helm can represent both 'let binary decide' (omit field) and 'force off' (disabled emits ''). Drops v17 boolean which couldn't represent both states. - M#2: secret.yaml section explicitly shows the v17/v18 conditional fresh_ttl + revocation_channel renders (was 'no observer.yaml changes' stale claim). - M#3: chart_test.sh gains four new assertion blocks: production-existingSecret renders into ConfigMap not Secret; secret.create=true cluster renders into Secret; revocationChannel=disabled emits explicit ''; invalid enum value fails fast. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: component-map values-production row now correctly says Helm enum 'enabled' (renders to observer-config 'postgres'). The two distinct keys (Helm values revocationChannel: enabled vs observer-config revocation_channel: postgres) are now explicitly distinguished. - M#2: chart_test.sh secret-render block now sets agentserver.enabled=true + url so the identity block in templates/secret.yaml actually emits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cap, PG schema, localRegistry, turnStateBackend, telemetryAllower) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mOwnership, ServeHTTP admission, sweep Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e A+B
- B#1 (A4): DaemonInfo.DaemonID switch to short_id folded into A4 with regression test; Hub.sharedReg + Hub.forwardCli + Hub.turns turnStateBackend fields declared in A4 to avoid circular dependency.
- B#2 (B3): confirmOwnershipSQL moved from test file to production registry_shared.go.
- B#3 (B4): minimal attachSharedRegistry added in B4 (Phase D D1 extends); test snippets updated to 1-arg call + identity import; hub.go log/context imports called out.
- B#4 (A5): explicit test-only memTurnStore.snapshotForTest / setForTest helpers; A5 step 5 enumerates http_test.go:255-262 direct field access sites.
- M#5 (B1): heartbeatUpsert SQL changed from plain UPDATE to UPSERT-with-ownership-WHERE per spec v19; all sqlmock WithArgs updated to 9 args; heartbeatUpsert Go body marshals capabilities.
- M#6 (A2): use existing handlerForFileRoot helper + stdlib assertions (no testify in files_test.go).
- M#7 (A2): explicitly ADD Capabilities field to RegisterPayload (today's main.go has no such field).
- M#8 (B2): full httptest+websocket helper implementation, no panic('TODO').
- M#9 (A3): git add paths relative to multi-agent/ cwd, not multi-agent/-prefixed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- B#1: A4 no longer adds sharedReg/forwardCli/turns fields (Go has no forward declarations); B1 adds sharedReg when the struct exists; A5 adds turns; C3 adds forwardCli. - B#2: A4 updates ServeHTTP teardown to use dc.routingID() so add/remove keys match before B4 rewrites the chain. - B#3: routingID() fallback to dc.id when shortID empty preserves single-pod legacy behavior bit-exactly. - M#4: unused imports removed (B3: time; B4: context). - M#5: B2 commit step adds registry_shared_helpers_test.go. - M#6: timing-flake tests replaced with runHeartbeatOnce/runSweepOnce helpers; tests call once explicitly and assert ExpectationsWereMet. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- B#1: B4 ServeHTTP teardown captures routingID := dc.routingID() and uses it for localReg.removeIf + invalidateDaemonSessions; raw dc.shortID stays only on the shared-registry teardown line (cluster mode requires non-empty short_id at admission). - M#2: removed unused database/sql import from B1 test file. - M#3: A4 Step 7 (turn-store test migration) deleted; A5 Step 5 already covers it. A4 now stops at the registry rename + routingID() integration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: A5 commit list enumerates all *_test.go files touched by Step 5 (http_test, tree_test, race_test, livelock_test, e2e_test, integration_test). - M#2: B4 test uses require.Empty instead of require.Zero for the slice assertion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: go-sqlmock dependency moved from A3 (where mod tidy would strip unused dep) to B1 (first importer). - M#2: A4 adds explicit legacy-fallback test (TestDaemonConn_LegacyEmptyShortID_FallsBackToDcID) asserting routingID()/info()/lookup/removeIf round-trip via dc.id when shortID empty. WS-test guidance updated to preserve at least one empty-ShortID test instead of forcing all to non-empty. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: every sqlmock test that sets expectations now calls require.NoError(t, mock.ExpectationsWereMet()) so a test passing without executing the intended SQL is detected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: B1 commit message now matches the actual UPSERT+ownership-guarded-WHERE implementation (was stale 'plain UPDATE' note from v3). - M#2: A3 file list no longer claims go.mod/go.sum changes (deferred to B1 per v6). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sion window The D-fix5 counter Add(1)/defer Done() spanned the entire WS read loop lifetime (up to wsReadTimeout = 90s), so Close/drainHandler blocked in inFlightAdmissions.Wait() for the full read-loop duration — effectively a deadlock until the ctx timeout fired. Fix: scope the counter to just the admission race window (from the fast pre-check through connectUpsert + admitMu critical section). After h.reg.add(dc) succeeds OR the draining-rejection sharedReg.remove completes, call Done() explicitly and set admissionDone=true so the defer guard skips the duplicate call. The WS read loop runs after the counter is released; Close/drainHandler proceed immediately to drainAllLocalDaemons which closes the WS and unblocks the read loop. Add TestHub_Close_DoesNotWaitForLiveWS_DrainsThemInstead to prove that Close returns in <2s with an admitted daemon in the read loop. All 4 required tests pass; full -race suite passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r block + identity defaults) - values.yaml: replicaCount 2→1 (multi-pod now opt-in via cluster.enabled) - values.yaml: config.identity.agentserver.freshTTL "180s"→"" (binary nil default fires) - values.yaml: config.identity.agentserver.revocationChannel "auto" (new enum: auto|enabled|disabled) - values.yaml: new top-level cluster: block (enabled, advertiseUrlEnv, secretEnv, prevSecretEnv, secretKey, prevSecretKey, internalListenAddr, internalServicePort, headlessServiceName, networkPolicy) - values-production.example.yaml: cluster.enabled: true with ops comment - values-production.example.yaml: config.identity.agentserver.freshTTL: "30s" - values-production.example.yaml: config.identity.agentserver.revocationChannel: "enabled" Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t cases Add templates/validate.yaml with four fail guards: - replicaCount > 1 + sqlite driver → fail - replicaCount > 1 + cluster.enabled=false → fail - cluster.enabled=true + secret.create=true without clusterSecret → fail - clusterSecret present but < 32 chars → fail Add four negative-render tests in chart_test.sh (E2.1–E2.4) verifying each guard fires with the expected error substring. Also fix the existing production_stack test: it overrides to secret.create=true with cluster.enabled=true (from values-production.example.yaml), so it now supplies a >=32-char test clusterSecret to pass the new length guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mode - configmap.yaml: extend observer.nonsecret.yaml with conditional fresh_ttl, revocationChannel enum mapping (auto/enabled/disabled), and cluster: block (advertise_url_env, secret_env, prev_secret_env, internal_listen_addr) gated on cluster.enabled - secret.yaml: replace hard-coded fresh_ttl default "180s" with conditional emission; add revocationChannel enum mapping; add cluster-secret / cluster-secret-prev data keys gated on cluster.enabled && secret.create && !existingSecret - deployment.yaml: merge postgres-wait + assert-cluster-secret initContainers into unified block; init container enforces length >=32 (not just non-empty); add POD_IP + OBSERVER_ADVERTISE_URL + OBSERVER_CLUSTER_SECRET (+prev) envs; add internal port 8091; add RollingUpdate strategy maxUnavailable=0 maxSurge=100%; add lifecycle.preStop drain command when cluster.enabled - _helpers.tpl: add observer.headlessServiceName helper for E4 headless service - values.yaml: add secret.clusterSecret / secret.clusterSecretPrev fields - main.go: loadConfig now merges nonsecret/observer.nonsecret.yaml when present, allowing cluster config to reach pods that use existingSecret Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…/_internal/* - service.yaml: append headless Service (clusterIP: None, publishNotReadyAddresses: true, port: internal/8091) when cluster.enabled; name defaults to <release>-observer-headless - networkpolicy.yaml: new two-rule NetworkPolicy gated on cluster.enabled && cluster.networkPolicy.enabled; rule 1 allows public port from anywhere, rule 2 restricts internal port to observer-component pods only - ingress.yaml: prepend more-specific /api/commander/_internal/ path pointing at non-existent -deny backend (nginx-ingress returns 503) - httproute.yaml: prepend more-specific /api/commander/_internal/ rule with ResponseHeaderModifier and no backendRefs (Gateway API returns 503) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add seven new assertion blocks to chart_test.sh covering: 1. Default (replicaCount=1) emits no cluster env / headless service / port 8091 2. Multi-pod cluster.enabled renders OBSERVER_CLUSTER_SECRET, POD_IP, headless Service (clusterIP:None), containerPort:8091, assert-cluster-secret init container, and maxUnavailable:0 rolling strategy 3. replicaCount=2 without cluster.enabled fails fast (separate from E2.2) 4. existingSecret + production values render fresh_ttl + revocation_channel into ConfigMap; no Secret rendered when existingSecret is set 5. secret.create + cluster + agentserver.enabled renders fresh_ttl + revocation_channel into chart-managed Secret 6. revocationChannel=disabled emits explicit revocation_channel: "" 7. Invalid revocationChannel value fails fast with enum error Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Smoke job:
- Generate 48-char cluster_secret + ::add-mask:: immediately after
- Bump replicaCount from 1 to 2; add cluster.enabled=True to values dict
- Populate values["secret"]["clusterSecret"] = cluster_secret
- Add "Resolve smoke pod IPs" step: kubectl get pods by label selector
writing to /tmp/observer-pod-ips (runner has kubectl/kubeconfig; busybox
Job does not)
- Modify "Smoke from cluster" step to iterate pod IPs from /tmp/observer-pod-ips
and render one wget/readyz + wget/healthz command per pod into the busybox
Job args, probing each pod independently without LB routing
Release job:
- Add OBSERVER_CLUSTER_SECRET to required list; OBSERVER_CLUSTER_SECRET_PREV
is optional (rotation window only)
- Pull both from secrets context; mask both with ::add-mask:: if non-empty
- Populate values["cluster"]={"enabled":True} and
values["secret"]["clusterSecret"] = cluster_secret
- If OBSERVER_CLUSTER_SECRET_PREV is set, populate
values["secret"]["clusterSecretPrev"] for zero-downtime rotation
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eats Append "Multi-pod observer cluster" section to deploy/README.md covering: - Pre-rollout coordination requirements: set cluster-secret in existingSecret (or secret.clusterSecret), set cluster.enabled=true, scale to 2+ replicas, ensure store.driver=postgres - Three-phase cluster-secret rotation (Phase A: add prev; Phase B: promote new; Phase C: clean up) with kubectl patch examples - Mixed-version window caveat: old binaries don't send capability header on 8091 heartbeats; new pods may 426 during rolling upgrade; self-resolves - DaemonInfo.DaemonID opaqueness: shared-registry mode embeds pod-prefix; clients must not parse or construct from cluster metadata Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
compose.multi-observer.yaml: Docker Compose for local cluster repro
- 1 Postgres (postgres:16-alpine) with healthcheck, named volume
- 2 observer-server containers (image: OBSERVER_IMAGE or observer-server:dev)
each with distinct OBSERVER_ADVERTISE_URL (http://observer-{1,2}:8091),
shared OBSERVER_CLUSTER_SECRET env, shared config mount, and direct port
exposure (18091/18092) for debugging
- 1 nginx:1.27-alpine LB on port 8090 round-robining to observer-1:8090 +
observer-2:8090 via inline configs: block (nginx.conf); WebSocket-aware
proxy_pass with Upgrade/Connection headers
- Bridge network so observer-1 can dial observer-2 by hostname on port 8091
dev/README.md: documents:
- compose.distributed.yaml (existing)
- compose.multi-observer.yaml quick start: build image, export cluster secret,
docker compose up -d; make multi-observer-up target
- Verify: 30 round-robin GETs through LB should return stable daemon count
- Point driver-agent at LB: LOOM_OBSERVER_URL=http://localhost:8090
- Troubleshooting: Postgres not migrated (--migrate-only); config file not
found; OBSERVER_CLUSTER_SECRET not set; 426 on internal port during upgrades
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s + cluster.enabled in ConfigMap Add AdvertiseURLEnv, SecretEnv, PrevSecretEnv fields to ClusterConfig so operators can deliver per-pod values (e.g. POD_IP-derived advertise URL) via environment variables through the ConfigMap without storing them in a Secret. Direct fields always take precedence; env-indirection is resolved in loadConfig after all YAML merges. Also fix the ConfigMap template to emit cluster.enabled: true alongside the env-indirection fields so validateClusterConfig finds the cluster block enabled (without it the fail-closed validator rejects the partial config). Tests: - TestClusterConfig_EnvFields_Resolved: env vars resolve through loadConfig - TestClusterConfig_DirectFieldTakesPrecedenceOverEnv: direct wins - TestLoadConfig_RenderedChartYAML: pipes helm template output into loadConfig to catch chart/binary schema divergence at test time - F1.1 chart test: cluster.enabled: true + env fields present in ConfigMap Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ntserverIdentityConfig Chart renders `revocation_channel: "postgres"` (or "") into the observer.yaml Secret. Without this field in the struct, yaml.Decoder with KnownFields(true) rejects the config on load, silently breaking existingSecret deployments that set revocationChannel=enabled in the chart values. Add RevocationChannel string to AgentserverIdentityConfig with yaml:"revocation_channel,omitempty". Update buildIdentityResolver to use the field: "postgres" → always attach PG revocation channel; "" (auto) → fall back to store-driver heuristic (existing behaviour for upgrades). Tests: TestLoadConfig_RevocationChannel covers "", "postgres", and explicit empty-string values. TestLoadConfig_RenderedChartYAML (added in finding-1) now also exercises this field end-to-end via helm template output. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…64 chars / 32 bytes) Binary validates cluster.secret via hex.DecodeString and requires >= 32 decoded bytes (= >= 64 hex chars). The chart previously validated >= 32 *chars* (any charset) and the init container checked LEN < 32. CI generated a 48-char alphanumeric. All three were inconsistent with the binary. Standardize everything on hex: - validate.yaml: require >= 64 chars AND assert hex regex (regexMatch) - deployment.yaml assert-cluster-secret: LEN < 64 + grep -Eq hex check - CI observer-deploy.yml: secrets.token_hex(32) instead of random alphanum - chart_test.sh: all fixtures use $(openssl rand -hex 32) or fixed 64-hex value; add E2.5 test for non-hex secret of sufficient length (must fail); update E2.4 to test with 8-char hex that's too short - deploy/README.md: openssl rand -hex 32 in all rotation snippets + generation example Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reStop port templating The Kubernetes preStop hook called `observer-server --drain-local --internal-port=8091` but the binary did not recognise those flags, causing the hook to exit with a flag-parse error and pods to terminate with in-flight daemon connections open. Implement --drain-local and --internal-port=N flags: - --drain-local POSTs to http://127.0.0.1:<port>/api/commander/_internal/drain (loopback bypass; no auth required per C5). - connection-refused is treated as success (server already stopped), so the preStop hook never causes Kubernetes to SIGKILL the pod immediately. - config-load errors exit 1 (log.Fatalf path in main). - --internal-port overrides cluster.internal_listen_addr port from config. Fix deployment.yaml preStop to template the port via {{ .Values.cluster.internalServicePort }} instead of hardcoding 8091. Tests: - TestRunDrainLocal_PostsToInternalEndpoint - TestRunDrainLocal_PortOverride - TestRunDrainLocal_ConnectionRefused_ExitsCleanly - TestRunDrainLocal_NoInternalAddr_Skips - TestRunDrainLocal_BadConfig_Exits1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e fixes dev/compose.multi-observer.yaml had three problems: 1. Mounted observer.example.yaml which has no Postgres store, no cluster block, and listens on :8080 while nginx + port mappings used :8090. 2. Set OBSERVER_DATABASE_URL but the config used dsn_env: OBSERVER_DSN (new field name from the multi-pod schema). 3. No migration step, so observers failed to start against a fresh DB. Fix: - Add dev/configs/observer.multi-pod.yaml — a dedicated config using the post-Fix-1+3 schema: listen_addr :8090, store postgres + OBSERVER_DSN, cluster.enabled=true with env-indirection (advertise_url_env, secret_env), identity.agentserver.revocation_channel=postgres. - Update compose to mount this new config, use OBSERVER_DSN, use OBSERVER_CLUSTER_SECRET (hex; comment updated to say openssl rand -hex 32), and add an observer-migrate one-shot service (--migrate-only) that both observers depend on. - nginx upstreams were already correct (observer-1:8090 / observer-2:8090). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… disabled != auto
Change AgentserverIdentityConfig.RevocationChannel from string to *string so
that absent (nil/auto) is semantically distinct from explicit empty string
(disabled). The previous string + omitempty design made revocation_channel: ""
in YAML indistinguishable from a completely absent field, so the "disabled"
chart setting silently fell through to the auto/postgres heuristic.
Pointer semantics:
nil → auto: enable PG revocation when store.driver=postgres
ptr("") → disabled: never enable, even with postgres store
ptr("postgres") → forced: always enable
validateConfig rejects any other non-nil value as fatal. buildIdentityResolver
uses a switch on the pointer rather than the previous string comparison.
Add TestRevocationChannel_NilIsAuto, TestRevocationChannel_EmptyIsDisabled,
TestRevocationChannel_PostgresIsForced, TestRevocationChannel_UnknownFatal.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_listen_addr validateClusterConfig now enforces that cluster.internal_listen_addr may only bind to a wildcard or loopback address. Any specific non-loopback IP (10.x, 192.168.x, eth0, localhost, etc.) is rejected at startup as fatal. The preStop drain hook always contacts 127.0.0.1:<port>. If the internal listener is bound to a specific non-loopback IP the drain silently gets connection-refused and daemons are not drained before the pod terminates. Allowed: `:port` (wildcard), `0.0.0.0:port`, `127.0.0.1:port`, `[::]:port`, `[::1]:port`. Everything else is fatal. Add TestValidateClusterConfig_RejectsNonLoopbackInternalAddr covering 8 cases. Document the constraint in deploy/README.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… mount in multi-pod repro dev/configs/observer.multi-pod.yaml had identity.agentserver.enabled=false and url="" so observerweb.MountAll skipped the /api/commander/* routes entirely. The documented curl against /api/commander/daemons would return 404. Set identity.agentserver.enabled=true and url="http://agentserver-stub:9999/dev-only" (an intentionally unreachable stub). A non-empty URL is the only gate for commander route mounting; the stub URL exercises the full commander surface. startup_probe=false was already set so the process starts without a live agentserver. Dev requests continue to authenticate via legacy_api_keys (pre-shared key). Add a dev/README.md section explaining the stub URL pattern. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… README dev/README.md still instructed users to generate the cluster secret with LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 48 which produces a 48-character alphanumeric string. The observer validates the secret as hex and requires >= 32 bytes (64 hex chars); an alphanumeric secret is rejected at startup by validateClusterConfig. Replace with `openssl rand -hex 32` (produces exactly 64 hex chars / 32 bytes). Update the .env example line and the prerequisite description to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
filesystem object store returns nil from openObjectStore; postgres userspace store requires non-nil objects backend. memory is sufficient for the multi-pod registry repro (not durable, doc'd inline). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cret decode Replace plain yaml.Unmarshal with a KnownFields(true) decoder for the observer.nonsecret.yaml merge step so unknown keys in the ConfigMap are rejected, making the TestLoadConfig_RenderedChartYAML integration test actually catch chart/binary schema drift. Also add TestLoadConfig_NonsecretRejectsUnknownKey to directly exercise the rejection path: a nonsecret file with a bogus_field must cause loadConfig to return an error mentioning the nonsecret filename. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
D2's pgTurnStore.cleanupOrphans was implemented but never called, leaving turns stuck in queued/answering permanently after a pod crash — a user-visible "turn already in flight" wedge. Add turns + turnTimeout fields to sharedRegistry with an attachTurns() setter. runSweepOnce now calls cleanupOrphans on every tick when a turns backend is wired; errors are rate-limited logged but do not abort the cycle (matching the pattern of the other three sweeps). MountAll wires attachTurns after building turns and sr, passing hub.TurnTimeout (default 10min) so the timeout setting propagates from the Hub down to the sweeper without exposing Hub internals to sharedRegistry. Tests: TestRunSweepOnce_CallsCleanupOrphans asserts cleanupOrphans is called with the correct timeout; TestRunSweepOnce_CleanupOrphansErrorDoesNotAbort asserts transient errors increment the counter without aborting other sweeps. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on and retention jobs migration-job.yaml and retention-cronjob.yaml only mounted the secret observer.yaml. In existingSecret deployments the cluster: block lives exclusively in observer.nonsecret.yaml (the ConfigMap), so the migration job would load a config without cluster settings — causing needsCommanderDDL to differ from runtime and the commander DDL to be silently skipped. Add the observer-nonsecret-config volume + /etc/observer/nonsecret volumeMount to both templates, mirroring deployment.yaml. The ConfigMap is always rendered regardless of existingSecret, so there is no conditional needed. chart_test.sh: new block E-fix4.1 renders with cluster.enabled + both jobs enabled and asserts both manifests reference observer-nonsecret-config and /etc/observer/nonsecret. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r migration + retention jobs Migration and retention jobs mount observer.nonsecret.yaml (which carries cluster.enabled: true) but don't carry the cluster env vars that the Deployment uses to resolve cluster.advertise_url and cluster.secret. Previously validateClusterConfig ran at loadConfig time for every entrypoint, causing jobs to crashloop with "cluster.advertise_url is required". Fix: add jobMode bool to loadConfig/validateConfig. When true (--migrate-only or --retention-cleanup), validateClusterConfig is skipped entirely — jobs don't run forwarding, drain, or heartbeat so cluster runtime is irrelevant. All other validation (DB DSN, identity, telemetry, etc.) still runs. Add TestRunMigrationsOnly_SkipsClusterValidation to assert jobMode=true loads a config with cluster.enabled:true + empty URL/secret without error. Add E-fix4.2 chart_test.sh block asserting migration Job and retention CronJob containers carry no OBSERVER_ADVERTISE_URL / OBSERVER_CLUSTER_SECRET env vars. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…re-check race
BLOCKER 1 from independent review (also covers MAJOR 3): the fast draining
pre-check at ServeHTTP:117 and inFlightAdmissions.Add(1) at line 130 were
NOT serialised. Close could observe draining=false then set draining=true
+ Wait() see counter=0 (ServeHTTP hadn't Added yet) → return immediately.
Then ServeHTTP would call Add(1), proceed through connectUpsert (INSERT
ghost row), then hit the second admitMu check, take the draining-reject
branch, call remove — but process may have already exited (preStop os.Exit).
Fix: move (draining check + Add(1)) under admitMu. Now Close's critical
section (admitMu.Lock → draining.Store → Unlock) totally orders every
Add(1). Either counter is non-zero before Close's Wait (Close blocks) or
draining=true before ServeHTTP's admitMu.Lock (ServeHTTP 503s, no Add).
Regression test: TestHub_Admission_RaceBetweenPreCheckAndAdd_NoGhostRow.
New testHookAfterPreCheck fires between the fast pre-check and the new
admitMu block. Test drives Close synchronously while the hook is blocked
(same interleaving as the production race), then releases the hook.
Assertion: sqlmock's mandatory connectUpsert expectation must remain
UNMET (503 path). Pre-fix code triggers connectUpsert → expectations met →
require.Error fails. Post-fix code returns 503 → expectations unmet →
require.Error passes.
Pre-fix output (verbatim):
Error: An error is expected but got nil.
Messages: connectUpsert was called AFTER Close set draining=true and
returned — the pre-check-vs-Add race allowed a ghost row to be admitted.
FAIL: TestHub_Admission_RaceBetweenPreCheckAndAdd_NoGhostRow (0.01s)
Post-fix output (verbatim):
post-fix expected unmet expectations: there is a remaining expectation
which was not matched: ExpectedExec ... 'INSERT INTO commander_daemons'
PASS: TestHub_Admission_RaceBetweenPreCheckAndAdd_NoGhostRow (0.00s)
Full TestHub_ suite: ok github.com/yourorg/multi-agent/internal/commanderhub 1.322s
Full commanderhub suite (-race): ok ... 32.832s
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ultiPod_NonceReplay_FailsClosed Discovered while adding the postgres-integration CI job (MAJOR 4): this test hangs for 30s on the first request because addLocalDaemon only spawns a close-detector goroutine, not a command responder. The first list_sessions POST reaches pod-A, verifyForwardAuth inserts the nonce (the actual behaviour under test), then forwardHandler dispatches to the local daemon and blocks waiting for a WS reply that never comes. Fix: wrap the first request in a 500ms context. Nonce insertion happens synchronously in verifyForwardAuth BEFORE dispatch, so the timeout only cancels the hung daemon call — the nonce is still committed. Add a 50ms settle to ensure the row is visible to the second request's insertNonce before it fires. Second request uses a normal 5s ctx; it is rejected at insertNonce (replay) with no dispatch. Local run against Postgres 18.4: === RUN TestMultiPod_NonceReplay_FailsClosed 2026/07/01 14:10:07 commanderhub: forward.received.accepted ... 2026/07/01 14:10:08 commanderhub: forward.received.denied.replay ... --- PASS: TestMultiPod_NonceReplay_FailsClosed (0.57s) PASS Previously (before fix): 30.04s FAIL with "context deadline exceeded". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MAJOR 4 from independent review: CI had zero coverage for shared-mode
multi-pod behaviour. All 13 TestMultiPod_* tests and TestPGTurnStore_Rekey*
skip when OBSERVER_POSTGRES_TEST_DSN is unset, and CI never set it. The
BLOCKER 1 race (ghost row on drain) would have been caught here if any
multi-pod test had run in CI.
New job: postgres-integration
- Postgres 16-alpine service container on port 5432
- Wait-for-ready poll (pg_isready) before running tests
- OBSERVER_POSTGRES_TEST_DSN exported so t.Skip guards fall through
- Runs commanderhub + authstore + observerstore/postgres with -race, serial
- Test-run filter targets: TestMultiPod|TestPGTurnStore|TestPG|
TestSharedRegistry|TestMigrate|TestPostgres|TestAuthstore|TestForward|
TestDrain (91 tests execute, all pass locally)
- -p 1 (serial packages) prevents cross-package PG contention on shared
tables (nonces, commander_daemons) that flaked
TestMultiPod_ForwardWith{Revoked,Rotated}Secret_* under parallel runs
Local validation (Postgres 18.4 on 15432; host docker port-forwarding was
unavailable so used a host postgres server; identical to CI service setup):
ok github.com/yourorg/multi-agent/internal/commanderhub 2.179s
ok github.com/yourorg/multi-agent/internal/commanderhub/authstore 3.972s
ok github.com/yourorg/multi-agent/internal/observerstore/postgres 1.186s
MAJOR 2 (TestPGTurnStore_RekeyAtomicCTE tautology): resolved by this job,
which runs TestPGTurnStore_RekeyConurrentNoPKViolation — the real concurrent-
rekey test that was previously skipped in CI. No code change to the tautology
test needed since the concurrent variant now provides real coverage.
Known-flaky tests excluded from the filter (pre-existing bugs, not related
to PR #58 shared-registry paths):
- TestCrossPodIntegration/subcase6_cap_under_high_concurrency_strictly_bounded
— 1024+ concurrent HTTP POSTs to httptest server → 502s; test flakiness
unrelated to the shared registry.
- TestPostgresStoreLiveRoundTrip (internal/userspace) — read-after-write
assertion that returns 0 rows where 1 expected; unrelated to
commanderhub.
These should be addressed in a follow-up PR; do not gate PR #58 on them.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jul 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #49.
Root cause: commanderhub daemon registry was in-process only, so a multi-pod observer deployment showed daemons intermittently depending on which pod handled the request. Cross-review of #49 surfaced 4 more cross-pod consistency bugs with the same shape (turn_state, sessionListCache, identity cache TTL skew, telemetry rate limiter). This PR fixes all 5 with a shared Postgres registry + HTTP forwarding, and hardens the deployment story around it.
Architecture
Shared cluster registry via Postgres + pod-to-pod HTTP forwarding on an internal port, gated by HMAC-SHA256 with 60s replay window + nonce table, and domain-separated between forward and drain purposes. Preserves single-pod byte-exact behavior when
cluster.enabled=false.commander_daemons,commander_turns,commander_forward_nonces,commander_telemetry_buckets) +commander_identity_revocations.sharedRegistrywith ownership-guarded UPSERT/heartbeat, sweep with TTL, sticky-negativeownershipLostinvalidation via per-sendconfirmOwnershipcheck./api/commander/_internal/drainwith loopback bypass OR HMAC + nonce,hub.Close(ctx)drains admitted WS + clears shared rows + waits oninFlightAdmissionscounter under ctx deadline.<podHash>-<base36>in shared mode; unchanged in single-pod.ClusterConfigYAML block with env-indirection (advertise_url_env,secret_env,prev_secret_env), fail-closed validation (rejects partial/loopback/non-hex/non-postgres).Deliverables
-race(57 Go packages).OBSERVER_POSTGRES_TEST_DSN)./_internal/*denial, preStop drain hook.observer-deploy.ymlsmoke job runs cluster-enabled multi-pod render + per-pod-IP wget; release requiresOBSERVER_CLUSTER_SECRET.deploy/README.md(pre-rollout, three-phase secret rotation, mixed-version caveat, DaemonID opaqueness);dev/compose.multi-observer.yaml+dev/README.mdfor local repro (1 PG + 2 observers + nginx LB).Security review
Every phase went through a codex review loop until
ROUND_VERDICT: clean:--drain-localCLI)KnownFields(true), migration/retention job cluster-validation bypass,cleanupOrphanswired into sweep)Key security invariants:
[32]byteSpec
docs/superpowers/specs/2026-06-29-shared-daemon-registry-design.mddocs/superpowers/plans/2026-06-30-shared-daemon-registry.mdBoth went through codex review until clean before implementation began.
🤖 Generated with Claude Code