Skip to content

fix(#49): shared daemon registry + cross-pod consistency (5 bugs) + hardened rollout#58

Merged
yzs15 merged 125 commits into
masterfrom
worktree-fix-issue-49-daemon-registry
Jul 1, 2026
Merged

fix(#49): shared daemon registry + cross-pod consistency (5 bugs) + hardened rollout#58
yzs15 merged 125 commits into
masterfrom
worktree-fix-issue-49-daemon-registry

Conversation

@yzs15

@yzs15 yzs15 commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Fixes #49.

Root cause: commanderhub daemon registry was in-process only, so a multi-pod observer deployment showed daemons intermittently depending on which pod handled the request. Cross-review of #49 surfaced 4 more cross-pod consistency bugs with the same shape (turn_state, sessionListCache, identity cache TTL skew, telemetry rate limiter). This PR fixes all 5 with a shared Postgres registry + HTTP forwarding, and hardens the deployment story around it.

Architecture

Shared cluster registry via Postgres + pod-to-pod HTTP forwarding on an internal port, gated by HMAC-SHA256 with 60s replay window + nonce table, and domain-separated between forward and drain purposes. Preserves single-pod byte-exact behavior when cluster.enabled=false.

  • Shared state: 4 new tables (commander_daemons, commander_turns, commander_forward_nonces, commander_telemetry_buckets) + commander_identity_revocations.
  • Registry: sharedRegistry with ownership-guarded UPSERT/heartbeat, sweep with TTL, sticky-negative ownershipLost invalidation via per-send confirmOwnership check.
  • Forwarding: length-prefixed JSON envelope codec (1 MiB cap), HMAC signing with three-phase secret rotation support (current+prev), fail-closed on Postgres unavailable, self-URL loop refusal.
  • Drain: /api/commander/_internal/drain with loopback bypass OR HMAC + nonce, hub.Close(ctx) drains admitted WS + clears shared rows + waits on inFlightAdmissions counter under ctx deadline.
  • cmdID: <podHash>-<base36> in shared mode; unchanged in single-pod.
  • Cluster config: ClusterConfig YAML block with env-indirection (advertise_url_env, secret_env, prev_secret_env), fail-closed validation (rejects partial/loopback/non-hex/non-postgres).

Deliverables

  • 28 tasks across phases A-E, 122 commits.
  • Full test suite green under -race (57 Go packages).
  • 13 env-gated multi-pod integration tests (skip cleanly without OBSERVER_POSTGRES_TEST_DSN).
  • Helm chart with fail-fast validation, headless Service, NetworkPolicy (two-rule shape: public port open, internal port observer-only), ingress + HTTPRoute /_internal/* denial, preStop drain hook.
  • CI: observer-deploy.yml smoke job runs cluster-enabled multi-pod render + per-pod-IP wget; release requires OBSERVER_CLUSTER_SECRET.
  • Docs: deploy/README.md (pre-rollout, three-phase secret rotation, mixed-version caveat, DaemonID opaqueness); dev/compose.multi-observer.yaml + dev/README.md for local repro (1 PG + 2 observers + nginx LB).

Security review

Every phase went through a codex review loop until ROUND_VERDICT: clean:

  • Phase A: clean after 5 rounds
  • Phase B: clean after 4 rounds
  • Phase C: clean after 3 rounds
  • Phase D: clean after 7 rounds (multiple race fixes: admission window vs Close, in-flight upsert cleanup wait, counter scope)
  • Phase E: clean after 4 rounds (chart↔binary schema contract; hex secret standardization; --drain-local CLI)
  • Final whole-branch review: clean after 3 rounds (nonsecret YAML KnownFields(true), migration/retention job cluster-validation bypass, cleanupOrphans wired into sweep)

Key security invariants:

  • Constant-time HMAC compare on fixed [32]byte
  • Domain-separated HMAC (forward vs drain via purpose prefix in payload)
  • Never log raw nonces (8-char prefix only)
  • Never log secrets (config-loaded values redacted from all error paths)
  • Fail-closed on Postgres unavailable
  • LRU-bounded, cache-gated ErrInvalid publish (DoS defense)
  • Retry-on-403-only forward retry (not on 5xx)
  • Non-hex or short cluster secret rejected at chart-render, init-container, AND runtime

Spec

  • Design spec: docs/superpowers/specs/2026-06-29-shared-daemon-registry-design.md
  • Implementation plan: docs/superpowers/plans/2026-06-30-shared-daemon-registry.md

Both went through codex review until clean before implementation began.

🤖 Generated with Claude Code

claude added 30 commits June 29, 2026 23:51
…nces (issue #49)

Postgres-backed registry of online daemons, advertised by owning pod's URL.
Read paths query the table; command/turn paths forward to the owning pod via
/api/commander/_internal/forward authenticated by a shared cluster secret.
Single-pod (SQLite or cluster-config unset) keeps the in-memory registry
unchanged.

Scope (this fix): registry + command forwarding. turnStateStore and
sessionListCache remain per-pod as follow-up issues.

CI/CD: chart gains cluster.* block + fail-fast when replicaCount>1 without
secret; observer-deploy.yml bumps smoke replicas to 2 and generates the
cluster secret; release pulls OBSERVER_CLUSTER_SECRET from repo secrets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Key changes vs v1:

Security (B1):
- /forward moves off the public mux onto a separate :8091 listener
  exposed by a non-Ingress'd Service.
- Auth becomes HMAC(timestamp || body) with a 60s replay window, not
  a static bearer header.
- Public Ingress/HTTPRoute add explicit deny rule for /api/commander/_internal/.

Back-pressure (B2):
- Forwarding receiver uses buffer 256 (not 16) on the drain channel.
- Drop counter + synthetic 'truncated' event envelope on overflow.
- 1 MiB cap enforced on each length-prefixed envelope both directions.

Sweep safety (B4):
- Heartbeat is UPSERT (not UPDATE) → self-healing.
- Sweep deletes only rows >5min old (not 45s). 45s is the
  online-for-reads threshold, not the deletion threshold.
- Postgres hiccup makes daemons briefly invisible, never deleted.

Misconfig footgun (B3, M4):
- validateConfig is fail-closed: partial cluster.* config → fatal at
  startup, no silent fallback to single-pod mode.
- Init container asserts OBSERVER_CLUSTER_SECRET env non-empty
  (catches existingSecret users who forgot the key).
- Chart fail-fast triggers when replicaCount>1 without cluster.enabled.

Cancellation (M1):
- Spec'd: caller cancel → close request body → receiver ctx cancel →
  removePending → daemon slot freed. Has a test.

Session cache (M3):
- invalidateDaemonSessions moves into routeFrame on the owning pod;
  http.go calls are kept as belt-and-suspenders.

Test plan (M6):
- Mirrors existing OBSERVER_POSTGRES_TEST_DSN env-skip pattern, not
  dockertest (which is not in the repo).

Hub.reg compat (M7):
- Field type stays *localRegistry; new sharedReg is a separate field.
- 30+ test-site references to hub.reg.{add,daemons} preserved verbatim.

Helm chart (M8, M9):
- Dev values.yaml flips replicaCount 2 → 1 so chart's new fail-fast
  doesn't break the default render.
- New init container catches existingSecret users missing the key.
- New internal Service (port 8091) without Ingress.

Rolling update (M5):
- maxUnavailable=0, maxSurge=100% to collapse the mixed-version window.
- Rollback path documented.

CI/CD (M10):
- Line numbers re-verified against current master.
- ::add-mask:: applied to the generated secret.
- Smoke probe extended to per-pod IP wget.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes 9 BLOCKERs + 14 MAJORs surfaced by codex review of v2:

BLOCKERs:
- B#1: ch.turn's local-only reg.lookup guard (http.go:226) replaced with
  hub.lookupDaemon(ctx, owner, daemonID) that consults shared registry.
- B#2: /tree, /sessions, FanOutSessions now use h.listDaemons(ctx, o)
  instead of h.reg.daemons(o); see all pods.
- B#3: Cancellation propagation no longer claims pendingEntry.ch closes;
  factored sendCommandStreamToLocal returns a closeable wrapper channel
  identical to today's pattern; ctx cancel exits via wrapper close.
- B#4: Heartbeat UPSERT has explicit ownership guard
  (WHERE owning_instance_url = EXCLUDED.owning_instance_url) so it
  cannot steal ownership back after fast reconnect. hbCancel + <-hbDone
  ordering ensures heartbeat exits before deferred DELETE.
- B#5: Internal server lifecycle moved into cmd/observer-server/main.go;
  observerweb.NewWithResolverOptions returns (publicHandler, internalHandler);
  errgroup-coordinated Shutdown.
- B#6: internal_listen_addr default applies ONLY when cluster is enabled;
  a single-pod postgres deployment with cluster.* empty passes validation.
- B#7: Helm fail-fast moved to templates/_validate.yaml (always rendered,
  not gated by secret.create); existingSecret users now hit the validation.
- B#8: CI smoke pod-IP probe resolved in the GitHub runner (kubectl
  available) and rendered into the busybox Job manifest as a static cmd
  list; busybox no longer needs kubectl/RBAC.
- B#9: Rolling-update strategy claim explicitly de-claimed; chart sets
  maxUnavailable:0,maxSurge:100% but mixed-version window is documented
  honestly with operator drain procedure.

MAJORs:
- M#10: sessionListCache disabled in shared mode (per-pod cache + cross-pod
  invalidation cost dwarfs single-pod hit-rate benefit).
- M#11: turnStateStore extracted to interface; new pgTurnStore implements
  it; routeFrame on owning pod is single writer; turns.begin() row-level
  lock provides cross-pod dedup; commander_turns table added.
- M#12: lookupDaemon takes ctx.
- M#13: Forward wire cap raised to 3 MiB to cover MaxFilePreviewBytes (2 MiB)
  + JSON overhead; read_file now safely forwarded.
- M#14: Forward client maps {error:{code,message}} back to *DaemonError so
  writeSendCmdError continues to produce correct HTTP statuses.
- M#15: HMAC auth gains X-Observer-Cluster-Nonce + Postgres-backed
  commander_forward_nonces table (replay-proof within 60s window).
- M#16: Secret rotation supported via cluster.prev_secret_env; receiver
  validates against both Secret and PrevSecret.
- M#17: HTTPRoute hardening uses concrete supported syntax (more-specific
  rule with no backendRefs returns 503 per Gateway API spec); nginx uses
  a separate ingress backend pointing at a non-existent Service.
- M#18: initContainers blocks merged into one conditional emission so
  cluster init container coexists with Postgres-wait init.
- M#19: Internal Service is headless (clusterIP:None,
  publishNotReadyAddresses:true) so forwarding-by-pod-IP is consistent
  with what DNS would resolve. Forwarding still dials IPs directly.
- M#20: MountAll signature picked: (publicMux, internalMux, resolver,
  agentserverURL, store, cluster); the two existing callers updated.
- M#21: Hub.Close() closes forward client idle conns; called from
  observer server shutdown chain.
- M#22: newPublicHTTPServer/newInternalHTTPServer split with WriteTimeout:0
  for streaming. (Pre-existing bug: 60s WriteTimeout on the public
  listener would have killed 10-min SSE turns; fix folded in.)
- M#23: Metrics downgraded to structured WARN logs with rate limiting
  (repo has no metrics infra); future follow-up to add an exporter.

MINORs/NITs:
- m#24: Invalidation line refs corrected (http.go:132 is MethodGet check,
  not invalidation; disconnect invalidation is hub.go:132).
- m#25: Hub.reg "field type unchanged" rephrased to "field name + method
  surface compatible" (type renames *registry → *localRegistry).
- m#26: observerweb.Options described as "includes AgentserverURL,
  AuthStore among other fields".
- m#27: Test plan uses go-sqlmock (against *sql.DB) not pgxmock (which
  isn't in go.mod).
- m#28: go.work claim removed (only multi-agent/go.mod exists).
- m#29: Daemon-ID collision phrased honestly: probability negligible,
  recovery requires explicit WS close on ownership loss (out of scope).

Diff: +1180 / -650 vs v2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Security:
- B#1: nonce insert MOVED after HMAC verify + body read (was first step).
  Fail-closed on PG nonce-table unavailable. Length-check + format-check
  headers before body read. Constant-time HMAC via hmac.Equal on fixed
  [32]byte arrays.
- B#2: secret rotation made 3-phase (acceptance, introduce, retire) + on
  403 sender retries ONCE with PrevSecret to handle mid-rollout drift.
- B#3: internal mount path is now /api/commander/_internal/forward (was
  /forward), matches the public-Ingress deny prefix.
- M#8: audit log added on every forward.received and forward.sent (no
  secret material; goes to stderr).
- M#9: secret length >= 32 enforced in validate.yaml AND init container
  AND validateConfig (three layers).
- M#10: HMAC compare uses hmac.Equal + fixed [32]byte arrays (no length
  side-channel from variable hex input).
- M#15: new templates/networkpolicy.yaml restricts internal port 8091 to
  observer pods only.

Helm/rendering:
- B#4: cluster: config moves from observer.yaml (in Secret, gated by
  secret.create) to observer.nonsecret.yaml (in ConfigMap, always
  rendered). Production existingSecret deployments now actually enable
  cluster mode. Config loader merges nonsecret/observer.nonsecret.yaml on
  top of main observer.yaml.
- B#5: _validate.yaml renamed to validate.yaml (no underscore) to ensure
  Helm processes top-level actions.

Identity:
- B#6: registry PK changed from (user, workspace, daemon_id) to
  (user, workspace, short_id). daemon_id was per-connection random;
  reconnect created new rows rather than UPSERT. DaemonInfo.DaemonID now
  exposes short_id (stable across reconnects; UI URLs survive).
  commander_turns also re-keyed on short_id. Cluster mode REQUIRES daemon
  to register with non-empty short_id; refuses WS otherwise.

Wire/limits:
- B#7: forward cap raised 3 MiB -> 4 MiB; observer-side wsReadLimit
  raised 1 MiB -> 4 MiB (fixes pre-existing latent bug where 2 MiB
  file_read content already broke single-pod). Worst-case JSON expansion
  math documented.

Concurrency:
- M#11: turnStateBackend interface methods all gain context.Context;
  Postgres impl sets lock_timeout=500ms / statement_timeout=5s per txn.
  updateFromEnvelope and cleanupOrphans added to the interface.
- M#12: shared-mode WS admission gates on successful sharedReg.upsert
  BEFORE local registry add. PG failure refuses the WS (avoids split
  brain).
- M#13: missing interface methods added.
- M#14: no FK from commander_turns to commander_daemons. Rationale
  explained inline in schema.

Operations:
- B#7 (cont): drain endpoint /api/commander/_internal/drain on internal
  mux. preStop lifecycle hook posts to it on pod termination so daemons
  reconnect to new pods immediately, shortening mixed-version window
  from ~60s to ~5s.
- M#16: rollout sequence rewritten honestly. Mixed-version window
  exists; preStop hook shortens it; blue/green eliminates it (out of
  scope, documented as follow-up).
- m#17: validate.yaml also catches replicaCount>1 with sqlite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- B#1: connection_id propagated through localRegistry.removeIf + sharedReg.remove + heartbeatUpsert WHERE clause. Heartbeat-loss handler force-closes the WS to evict displaced connections.
- B#2: revert wsReadLimit raise; cap forward body at 1.5 MiB, envelope at 1 MiB. Daemon-side files.go enforces JSON-encoded size <= 768 KiB before sending, killing the 12 MiB C0-control-bytes worst case.
- B#3: /api/commander/_internal/drain now requires HMAC unless source is loopback (preStop hook bypass).
- B#4: NetworkPolicy adds explicit allow rule for public 8090 port from anywhere; restricts 8091 to observer peers only. Single-rule version would have killed public traffic.
- M#5: path coherence — added a route table; replaced remaining /forward references with /api/commander/_internal/forward; replaced 3 MiB with 1 MiB / 1.5 MiB.
- M#6: turn_state SQL example updated to use (user, workspace, short_id, session) PK and RETURNING (xmax=0) AS inserted.
- M#7: preStop switched from exec(wget) to httpGet (Kubernetes-native; no image-dep). Drain handler accepts GET too.
- M#8: threat model section explicitly documents cluster-secret = full-cluster compromise for commander surface. Mitigations + rotation playbook documented. Per-tenant capability tokens listed as follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- B#1: unified connectionID with existing dc.id (no new field); documented honest 5-15s race window between sibling claim and losing pod's heartbeat-driven WS close.
- M#2: preStop switched from httpGet (runs from kubelet, not in container) to exec /usr/local/bin/observer-server --drain-local. New subcommand to be added.
- M#3: daemon-binary rollout coordination — driver-agent + slave-agent both import internal/commander; release.yml workflow handles those; deploy/README.md notes coordination requirement; capability gate 'file_preview_encoded_cap'.
- M#4: cap reference sweep — Approach §2 + component map + receiver step 4 + streaming response all aligned to 1 MiB envelope / 1.5 MiB body; wsReadLimit explicitly UNCHANGED.
- M#5: turnKey field daemonID → shortID rename; 10 caller sites in http.go identified.
- M#6: files.go function name corrected to Handler.ReadFile; test claim corrected to assert TooLarge for pathological 2 MiB input.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: race window eliminated via pre-send cached ownership check (5s cache, 500ms PG timeout, fail-fast as ErrDaemonGone instead of hanging until TurnTimeout).
- M#2: newDaemonID is now 128-bit (was 64-bit) and propagates crypto/rand errors; WS admission refuses on entropy starvation.
- M#3: capability gate 'file_preview_encoded_cap' now ENFORCED in shared mode (was log-only): observer returns 400 to UI for read_file on old daemons.
- M#4: --drain-local validates internal_listen_addr binds to loopback-coverage address (:8091, 0.0.0.0:8091, 127.0.0.1:8091); validateConfig rejects pod-IP-specific binds; preStop is best-effort on failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: race-window cache TTL tightened 5s → 1s. ≤1s residual window honestly stated.
- M#2: capability-gate uses new ErrCodeDaemonUpgradeRequired → HTTP 426 (was ErrCodeBackendUnavailable → 502).
- M#3: --drain-local exits 1 on config-read errors (was exit 0 silently); only tolerates connection errors after config validated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: preStop exec now passes --config /etc/observer/observer.yaml (matched main container).
- M#2: positive ownership cache ELIMINATED; per-send fresh PG check (sub-ms typical) plus sticky negative cache via atomic.Bool. Residual race window = zero. Cost: +1 PG SELECT per command (manageable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cross-pod findings

Adds three additional cross-pod-consistency layers:
- (5) identity.cache TTL skew → default FreshTTL=30s in shared mode + optional PG LISTEN/NOTIFY revocation channel.
- (6) authstore.NewInMemoryStore selected when store.driver != postgres in multi-pod → validateConfig fatal.
- (7) Hub.cmdSeq base-36 collisions across pods → pod-hash prefix.

Finding A (per-pod turn_state) and Finding B (per-pod sessionListCache)
from the same comment were already covered in v3/v5 via commander_turns
table and disable-cache-in-shared-mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: identity error name corrected: publishes on ErrInvalid + ErrRevoked (existing repo names; not ErrForbidden).
- M#2: NOTIFY publishes on EVERY deny regardless of local cache state, so cross-pod (pod-A sees deny without local cache, pod-B has stale cache hit) actually propagates.
- M#3: revocation_channel wired into AgentserverIdentityConfig (yaml field, KnownFields-compatible). NewCache becomes NewCache(d, cfg, opts...CacheOption) for back-compat with existing callers.
- M#4: replicaCount validation moved to chart layer ONLY (binary cannot see replicaCount); binary keeps cluster.enabled + store.driver != postgres rule.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: split LISTEN (*pgx.Conn, single-conn, blocks) from publish (*sql.DB pool); pgx.Conn is not goroutine-safe and WaitForNotification monopolizes the conn.
- M#2: chart actually renders revocation_channel (new values key + secret.yaml emission) and fresh_ttl override flows via values-production.example.yaml (template default stays 180s for back-compat).
- M#3: ErrInvalid amplification mitigated — publish only if hash present in local cache OR rate-limited 100/s/pod. ErrRevoked always published.
- M#4: validation template path corrected throughout (templates/validate.yaml, no underscore; per-codex round-3 fix); component map updated.
- M#5: cmdID single-pod output stays byte-exact (no prefix, no dash); only shared mode adds <podHash>-<seq>.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…om issue #49 final audit

The audit comment confirmed 5 bugs total: #49 + A + B + D + E. Spec v12
covered #49 + A + B + D. v13 adds Finding E:

- Layer 8: telemetryAllower interface; in-memory variant (single-pod,
  unchanged) and *pgTelemetryLimiter (shared mode, atomic UPSERT against
  commander_telemetry_buckets).
- New commander_telemetry_buckets table (PG 14+ INSERT...ON CONFLICT DO
  UPDATE with LEAST + EXTRACT(EPOCH) for refill in a single statement).
- PG unavailable → 503 (fail-closed, NOT fail-open to broken per-pod).
- Hot-key contention bounded by lock_timeout=100ms.
- Sweeper extends to prune buckets idle > 1 hour.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- B#1: telemetry UPSERT refill computed from b.tokens (existing row), not EXCLUDED.tokens (would reset hot bucket near full every request).
- B#2: rate key split into composite PK (workspace_id, agent_id, telemetry_key_id) — PG text cannot contain NUL bytes.
- M#3: allow() returns (bool, error) so handler distinguishes 429 (denied) from 503 (PG unavailable / 55P03 lock_timeout / context cancel).
- M#4: PG telemetry table migration gates on (commander_enabled OR (telemetry_enabled AND cluster.enabled)); selection rule for PG limiter requires cluster.enabled (NOT just store.driver=postgres), so single-pod-with-PG keeps in-memory limiter.
- M#5: revocation component map updated to match v12 fixes (listener/publisher split; ErrInvalid gated + rate-limited).
- M#6: fresh_ttl 30s default applied AFTER YAML decode via yamlPathExists check (was pre-seeded to 180s before decode, never fired).
- M#7: fresh_ttl + revocation_channel emit into observer.nonsecret.yaml (always-rendered ConfigMap) so existingSecret deployments actually receive them; loader merge added in v3 carries them into Config.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: component-map entry for telemetryAllower now shows correct (bool, error) signature; handler 429/503 mapping documented inline.
- M#2: migration + selection unified to single predicate (telemetry.enabled && cluster.enabled && store.driver=postgres); OBSERVER_MULTIPOD env-var path dropped.
- M#3: lock_timeout=100ms specified via SET LOCAL inside explicit BeginTx/Commit; sql.DB pool can't safely target individual queries with session-level settings.
- M#4: configmap snippet at line ~1048 now actually shows fresh_ttl + revocation_channel emission lines (was promised but missing from snippet).
- M#5: AgentserverIdentityConfig.FreshTTL + RevocationChannel become pointer-nullable so post-merge defaulting can detect explicit overrides in EITHER YAML file (secret-mounted observer.yaml OR configmap-mounted observer.nonsecret.yaml).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: chart default freshTTL flipped to '' so configmap emits nothing → binary pointer-nullable post-merge default fires correctly (30s shared / 180s single-pod).
- M#2: values-production.example.yaml explicitly sets freshTTL=30s AND revocationChannel=postgres so existingSecret production deployments actually opt in.
- M#3: identity schema component-map row updated to FreshTTL *durationConfig + RevocationChannel *string with post-merge defaulting; configmap snippet uses 'emit only if non-empty' for both fields.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: templates/secret.yaml now conditionally renders fresh_ttl + revocation_channel (was hard-coded 180s default which masked binary's pointer-nullable cluster default).
- M#2: values-production.example.yaml snippet expanded to actually show config.identity.agentserver.freshTTL + revocationChannel + revocationChannelEnabled, not just replicaCount + cluster.enabled.
- M#3: separate revocationChannelEnabled boolean lets operators explicitly opt OUT of revocation channel in shared mode (was pointer-nullable couldn't represent explicit-empty via Helm YAML).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: revocationChannel is an enum (auto|enabled|disabled) with default 'auto' so Helm can represent both 'let binary decide' (omit field) and 'force off' (disabled emits ''). Drops v17 boolean which couldn't represent both states.
- M#2: secret.yaml section explicitly shows the v17/v18 conditional fresh_ttl + revocation_channel renders (was 'no observer.yaml changes' stale claim).
- M#3: chart_test.sh gains four new assertion blocks: production-existingSecret renders into ConfigMap not Secret; secret.create=true cluster renders into Secret; revocationChannel=disabled emits explicit ''; invalid enum value fails fast.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: component-map values-production row now correctly says Helm enum 'enabled' (renders to observer-config 'postgres'). The two distinct keys (Helm values revocationChannel: enabled vs observer-config revocation_channel: postgres) are now explicitly distinguished.
- M#2: chart_test.sh secret-render block now sets agentserver.enabled=true + url so the identity block in templates/secret.yaml actually emits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cap, PG schema, localRegistry, turnStateBackend, telemetryAllower)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mOwnership, ServeHTTP admission, sweep

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e A+B

- B#1 (A4): DaemonInfo.DaemonID switch to short_id folded into A4 with regression test; Hub.sharedReg + Hub.forwardCli + Hub.turns turnStateBackend fields declared in A4 to avoid circular dependency.
- B#2 (B3): confirmOwnershipSQL moved from test file to production registry_shared.go.
- B#3 (B4): minimal attachSharedRegistry added in B4 (Phase D D1 extends); test snippets updated to 1-arg call + identity import; hub.go log/context imports called out.
- B#4 (A5): explicit test-only memTurnStore.snapshotForTest / setForTest helpers; A5 step 5 enumerates http_test.go:255-262 direct field access sites.
- M#5 (B1): heartbeatUpsert SQL changed from plain UPDATE to UPSERT-with-ownership-WHERE per spec v19; all sqlmock WithArgs updated to 9 args; heartbeatUpsert Go body marshals capabilities.
- M#6 (A2): use existing handlerForFileRoot helper + stdlib assertions (no testify in files_test.go).
- M#7 (A2): explicitly ADD Capabilities field to RegisterPayload (today's main.go has no such field).
- M#8 (B2): full httptest+websocket helper implementation, no panic('TODO').
- M#9 (A3): git add paths relative to multi-agent/ cwd, not multi-agent/-prefixed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- B#1: A4 no longer adds sharedReg/forwardCli/turns fields (Go has no forward declarations); B1 adds sharedReg when the struct exists; A5 adds turns; C3 adds forwardCli.
- B#2: A4 updates ServeHTTP teardown to use dc.routingID() so add/remove keys match before B4 rewrites the chain.
- B#3: routingID() fallback to dc.id when shortID empty preserves single-pod legacy behavior bit-exactly.
- M#4: unused imports removed (B3: time; B4: context).
- M#5: B2 commit step adds registry_shared_helpers_test.go.
- M#6: timing-flake tests replaced with runHeartbeatOnce/runSweepOnce helpers; tests call once explicitly and assert ExpectationsWereMet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- B#1: B4 ServeHTTP teardown captures routingID := dc.routingID() and uses it for localReg.removeIf + invalidateDaemonSessions; raw dc.shortID stays only on the shared-registry teardown line (cluster mode requires non-empty short_id at admission).
- M#2: removed unused database/sql import from B1 test file.
- M#3: A4 Step 7 (turn-store test migration) deleted; A5 Step 5 already covers it. A4 now stops at the registry rename + routingID() integration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: A5 commit list enumerates all *_test.go files touched by Step 5 (http_test, tree_test, race_test, livelock_test, e2e_test, integration_test).
- M#2: B4 test uses require.Empty instead of require.Zero for the slice assertion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: go-sqlmock dependency moved from A3 (where mod tidy would strip unused dep) to B1 (first importer).
- M#2: A4 adds explicit legacy-fallback test (TestDaemonConn_LegacyEmptyShortID_FallsBackToDcID) asserting routingID()/info()/lookup/removeIf round-trip via dc.id when shortID empty. WS-test guidance updated to preserve at least one empty-ShortID test instead of forcing all to non-empty.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: every sqlmock test that sets expectations now calls require.NoError(t, mock.ExpectationsWereMet()) so a test passing without executing the intended SQL is detected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- M#1: B1 commit message now matches the actual UPSERT+ownership-guarded-WHERE implementation (was stale 'plain UPDATE' note from v3).
- M#2: A3 file list no longer claims go.mod/go.sum changes (deferred to B1 per v6).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
claude added 26 commits June 30, 2026 21:16
…sion window

The D-fix5 counter Add(1)/defer Done() spanned the entire WS read loop
lifetime (up to wsReadTimeout = 90s), so Close/drainHandler blocked in
inFlightAdmissions.Wait() for the full read-loop duration — effectively
a deadlock until the ctx timeout fired.

Fix: scope the counter to just the admission race window (from the fast
pre-check through connectUpsert + admitMu critical section). After
h.reg.add(dc) succeeds OR the draining-rejection sharedReg.remove
completes, call Done() explicitly and set admissionDone=true so the
defer guard skips the duplicate call. The WS read loop runs after the
counter is released; Close/drainHandler proceed immediately to
drainAllLocalDaemons which closes the WS and unblocks the read loop.

Add TestHub_Close_DoesNotWaitForLiveWS_DrainsThemInstead to prove that
Close returns in <2s with an admitted daemon in the read loop. All 4
required tests pass; full -race suite passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r block + identity defaults)

- values.yaml: replicaCount 2→1 (multi-pod now opt-in via cluster.enabled)
- values.yaml: config.identity.agentserver.freshTTL "180s"→"" (binary nil default fires)
- values.yaml: config.identity.agentserver.revocationChannel "auto" (new enum: auto|enabled|disabled)
- values.yaml: new top-level cluster: block (enabled, advertiseUrlEnv, secretEnv, prevSecretEnv,
  secretKey, prevSecretKey, internalListenAddr, internalServicePort, headlessServiceName, networkPolicy)
- values-production.example.yaml: cluster.enabled: true with ops comment
- values-production.example.yaml: config.identity.agentserver.freshTTL: "30s"
- values-production.example.yaml: config.identity.agentserver.revocationChannel: "enabled"

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t cases

Add templates/validate.yaml with four fail guards:
- replicaCount > 1 + sqlite driver → fail
- replicaCount > 1 + cluster.enabled=false → fail
- cluster.enabled=true + secret.create=true without clusterSecret → fail
- clusterSecret present but < 32 chars → fail

Add four negative-render tests in chart_test.sh (E2.1–E2.4) verifying
each guard fires with the expected error substring. Also fix the existing
production_stack test: it overrides to secret.create=true with
cluster.enabled=true (from values-production.example.yaml), so it now
supplies a >=32-char test clusterSecret to pass the new length guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mode

- configmap.yaml: extend observer.nonsecret.yaml with conditional fresh_ttl,
  revocationChannel enum mapping (auto/enabled/disabled), and cluster: block
  (advertise_url_env, secret_env, prev_secret_env, internal_listen_addr) gated on cluster.enabled
- secret.yaml: replace hard-coded fresh_ttl default "180s" with conditional
  emission; add revocationChannel enum mapping; add cluster-secret / cluster-secret-prev
  data keys gated on cluster.enabled && secret.create && !existingSecret
- deployment.yaml: merge postgres-wait + assert-cluster-secret initContainers
  into unified block; init container enforces length >=32 (not just non-empty);
  add POD_IP + OBSERVER_ADVERTISE_URL + OBSERVER_CLUSTER_SECRET (+prev) envs;
  add internal port 8091; add RollingUpdate strategy maxUnavailable=0 maxSurge=100%;
  add lifecycle.preStop drain command when cluster.enabled
- _helpers.tpl: add observer.headlessServiceName helper for E4 headless service
- values.yaml: add secret.clusterSecret / secret.clusterSecretPrev fields
- main.go: loadConfig now merges nonsecret/observer.nonsecret.yaml when present,
  allowing cluster config to reach pods that use existingSecret

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…/_internal/*

- service.yaml: append headless Service (clusterIP: None, publishNotReadyAddresses: true,
  port: internal/8091) when cluster.enabled; name defaults to <release>-observer-headless
- networkpolicy.yaml: new two-rule NetworkPolicy gated on cluster.enabled &&
  cluster.networkPolicy.enabled; rule 1 allows public port from anywhere, rule 2
  restricts internal port to observer-component pods only
- ingress.yaml: prepend more-specific /api/commander/_internal/ path pointing at
  non-existent -deny backend (nginx-ingress returns 503)
- httproute.yaml: prepend more-specific /api/commander/_internal/ rule with
  ResponseHeaderModifier and no backendRefs (Gateway API returns 503)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add seven new assertion blocks to chart_test.sh covering:
1. Default (replicaCount=1) emits no cluster env / headless service / port 8091
2. Multi-pod cluster.enabled renders OBSERVER_CLUSTER_SECRET, POD_IP, headless
   Service (clusterIP:None), containerPort:8091, assert-cluster-secret init
   container, and maxUnavailable:0 rolling strategy
3. replicaCount=2 without cluster.enabled fails fast (separate from E2.2)
4. existingSecret + production values render fresh_ttl + revocation_channel
   into ConfigMap; no Secret rendered when existingSecret is set
5. secret.create + cluster + agentserver.enabled renders fresh_ttl +
   revocation_channel into chart-managed Secret
6. revocationChannel=disabled emits explicit revocation_channel: ""
7. Invalid revocationChannel value fails fast with enum error

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Smoke job:
- Generate 48-char cluster_secret + ::add-mask:: immediately after
- Bump replicaCount from 1 to 2; add cluster.enabled=True to values dict
- Populate values["secret"]["clusterSecret"] = cluster_secret
- Add "Resolve smoke pod IPs" step: kubectl get pods by label selector
  writing to /tmp/observer-pod-ips (runner has kubectl/kubeconfig; busybox
  Job does not)
- Modify "Smoke from cluster" step to iterate pod IPs from /tmp/observer-pod-ips
  and render one wget/readyz + wget/healthz command per pod into the busybox
  Job args, probing each pod independently without LB routing

Release job:
- Add OBSERVER_CLUSTER_SECRET to required list; OBSERVER_CLUSTER_SECRET_PREV
  is optional (rotation window only)
- Pull both from secrets context; mask both with ::add-mask:: if non-empty
- Populate values["cluster"]={"enabled":True} and
  values["secret"]["clusterSecret"] = cluster_secret
- If OBSERVER_CLUSTER_SECRET_PREV is set, populate
  values["secret"]["clusterSecretPrev"] for zero-downtime rotation

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eats

Append "Multi-pod observer cluster" section to deploy/README.md covering:
- Pre-rollout coordination requirements: set cluster-secret in existingSecret
  (or secret.clusterSecret), set cluster.enabled=true, scale to 2+ replicas,
  ensure store.driver=postgres
- Three-phase cluster-secret rotation (Phase A: add prev; Phase B: promote
  new; Phase C: clean up) with kubectl patch examples
- Mixed-version window caveat: old binaries don't send capability header on
  8091 heartbeats; new pods may 426 during rolling upgrade; self-resolves
- DaemonInfo.DaemonID opaqueness: shared-registry mode embeds pod-prefix;
  clients must not parse or construct from cluster metadata

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
compose.multi-observer.yaml: Docker Compose for local cluster repro
- 1 Postgres (postgres:16-alpine) with healthcheck, named volume
- 2 observer-server containers (image: OBSERVER_IMAGE or observer-server:dev)
  each with distinct OBSERVER_ADVERTISE_URL (http://observer-{1,2}:8091),
  shared OBSERVER_CLUSTER_SECRET env, shared config mount, and direct port
  exposure (18091/18092) for debugging
- 1 nginx:1.27-alpine LB on port 8090 round-robining to observer-1:8090 +
  observer-2:8090 via inline configs: block (nginx.conf); WebSocket-aware
  proxy_pass with Upgrade/Connection headers
- Bridge network so observer-1 can dial observer-2 by hostname on port 8091

dev/README.md: documents:
- compose.distributed.yaml (existing)
- compose.multi-observer.yaml quick start: build image, export cluster secret,
  docker compose up -d; make multi-observer-up target
- Verify: 30 round-robin GETs through LB should return stable daemon count
- Point driver-agent at LB: LOOM_OBSERVER_URL=http://localhost:8090
- Troubleshooting: Postgres not migrated (--migrate-only); config file not
  found; OBSERVER_CLUSTER_SECRET not set; 426 on internal port during upgrades

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s + cluster.enabled in ConfigMap

Add AdvertiseURLEnv, SecretEnv, PrevSecretEnv fields to ClusterConfig so
operators can deliver per-pod values (e.g. POD_IP-derived advertise URL)
via environment variables through the ConfigMap without storing them in a
Secret. Direct fields always take precedence; env-indirection is resolved
in loadConfig after all YAML merges.

Also fix the ConfigMap template to emit cluster.enabled: true alongside
the env-indirection fields so validateClusterConfig finds the cluster block
enabled (without it the fail-closed validator rejects the partial config).

Tests:
- TestClusterConfig_EnvFields_Resolved: env vars resolve through loadConfig
- TestClusterConfig_DirectFieldTakesPrecedenceOverEnv: direct wins
- TestLoadConfig_RenderedChartYAML: pipes helm template output into loadConfig
  to catch chart/binary schema divergence at test time
- F1.1 chart test: cluster.enabled: true + env fields present in ConfigMap

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ntserverIdentityConfig

Chart renders `revocation_channel: "postgres"` (or "") into the observer.yaml
Secret. Without this field in the struct, yaml.Decoder with KnownFields(true)
rejects the config on load, silently breaking existingSecret deployments that
set revocationChannel=enabled in the chart values.

Add RevocationChannel string to AgentserverIdentityConfig with
yaml:"revocation_channel,omitempty". Update buildIdentityResolver to use
the field: "postgres" → always attach PG revocation channel; "" (auto) →
fall back to store-driver heuristic (existing behaviour for upgrades).

Tests: TestLoadConfig_RevocationChannel covers "", "postgres", and explicit
empty-string values. TestLoadConfig_RenderedChartYAML (added in finding-1)
now also exercises this field end-to-end via helm template output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…64 chars / 32 bytes)

Binary validates cluster.secret via hex.DecodeString and requires >= 32
decoded bytes (= >= 64 hex chars). The chart previously validated >= 32 *chars*
(any charset) and the init container checked LEN < 32. CI generated a 48-char
alphanumeric. All three were inconsistent with the binary.

Standardize everything on hex:
- validate.yaml: require >= 64 chars AND assert hex regex (regexMatch)
- deployment.yaml assert-cluster-secret: LEN < 64 + grep -Eq hex check
- CI observer-deploy.yml: secrets.token_hex(32) instead of random alphanum
- chart_test.sh: all fixtures use $(openssl rand -hex 32) or fixed 64-hex value;
  add E2.5 test for non-hex secret of sufficient length (must fail);
  update E2.4 to test with 8-char hex that's too short
- deploy/README.md: openssl rand -hex 32 in all rotation snippets + generation example

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reStop port templating

The Kubernetes preStop hook called `observer-server --drain-local --internal-port=8091`
but the binary did not recognise those flags, causing the hook to exit with a
flag-parse error and pods to terminate with in-flight daemon connections open.

Implement --drain-local and --internal-port=N flags:
- --drain-local POSTs to http://127.0.0.1:<port>/api/commander/_internal/drain
  (loopback bypass; no auth required per C5).
- connection-refused is treated as success (server already stopped), so the
  preStop hook never causes Kubernetes to SIGKILL the pod immediately.
- config-load errors exit 1 (log.Fatalf path in main).
- --internal-port overrides cluster.internal_listen_addr port from config.

Fix deployment.yaml preStop to template the port via
{{ .Values.cluster.internalServicePort }} instead of hardcoding 8091.

Tests:
- TestRunDrainLocal_PostsToInternalEndpoint
- TestRunDrainLocal_PortOverride
- TestRunDrainLocal_ConnectionRefused_ExitsCleanly
- TestRunDrainLocal_NoInternalAddr_Skips
- TestRunDrainLocal_BadConfig_Exits1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e fixes

dev/compose.multi-observer.yaml had three problems:
1. Mounted observer.example.yaml which has no Postgres store, no cluster block,
   and listens on :8080 while nginx + port mappings used :8090.
2. Set OBSERVER_DATABASE_URL but the config used dsn_env: OBSERVER_DSN
   (new field name from the multi-pod schema).
3. No migration step, so observers failed to start against a fresh DB.

Fix:
- Add dev/configs/observer.multi-pod.yaml — a dedicated config using the
  post-Fix-1+3 schema: listen_addr :8090, store postgres + OBSERVER_DSN,
  cluster.enabled=true with env-indirection (advertise_url_env, secret_env),
  identity.agentserver.revocation_channel=postgres.
- Update compose to mount this new config, use OBSERVER_DSN, use OBSERVER_CLUSTER_SECRET
  (hex; comment updated to say openssl rand -hex 32), and add an observer-migrate
  one-shot service (--migrate-only) that both observers depend on.
- nginx upstreams were already correct (observer-1:8090 / observer-2:8090).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… disabled != auto

Change AgentserverIdentityConfig.RevocationChannel from string to *string so
that absent (nil/auto) is semantically distinct from explicit empty string
(disabled). The previous string + omitempty design made revocation_channel: ""
in YAML indistinguishable from a completely absent field, so the "disabled"
chart setting silently fell through to the auto/postgres heuristic.

Pointer semantics:
  nil        → auto: enable PG revocation when store.driver=postgres
  ptr("")    → disabled: never enable, even with postgres store
  ptr("postgres") → forced: always enable

validateConfig rejects any other non-nil value as fatal. buildIdentityResolver
uses a switch on the pointer rather than the previous string comparison.

Add TestRevocationChannel_NilIsAuto, TestRevocationChannel_EmptyIsDisabled,
TestRevocationChannel_PostgresIsForced, TestRevocationChannel_UnknownFatal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_listen_addr

validateClusterConfig now enforces that cluster.internal_listen_addr may only
bind to a wildcard or loopback address. Any specific non-loopback IP (10.x,
192.168.x, eth0, localhost, etc.) is rejected at startup as fatal.

The preStop drain hook always contacts 127.0.0.1:<port>. If the internal
listener is bound to a specific non-loopback IP the drain silently gets
connection-refused and daemons are not drained before the pod terminates.

Allowed: `:port` (wildcard), `0.0.0.0:port`, `127.0.0.1:port`, `[::]:port`,
`[::1]:port`. Everything else is fatal.

Add TestValidateClusterConfig_RejectsNonLoopbackInternalAddr covering 8 cases.
Document the constraint in deploy/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… mount in multi-pod repro

dev/configs/observer.multi-pod.yaml had identity.agentserver.enabled=false and
url="" so observerweb.MountAll skipped the /api/commander/* routes entirely.
The documented curl against /api/commander/daemons would return 404.

Set identity.agentserver.enabled=true and url="http://agentserver-stub:9999/dev-only"
(an intentionally unreachable stub). A non-empty URL is the only gate for
commander route mounting; the stub URL exercises the full commander surface.
startup_probe=false was already set so the process starts without a live
agentserver.

Dev requests continue to authenticate via legacy_api_keys (pre-shared key).

Add a dev/README.md section explaining the stub URL pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… README

dev/README.md still instructed users to generate the cluster secret with
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 48
which produces a 48-character alphanumeric string. The observer validates
the secret as hex and requires >= 32 bytes (64 hex chars); an alphanumeric
secret is rejected at startup by validateClusterConfig.

Replace with `openssl rand -hex 32` (produces exactly 64 hex chars / 32 bytes).
Update the .env example line and the prerequisite description to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
filesystem object store returns nil from openObjectStore; postgres
userspace store requires non-nil objects backend. memory is sufficient
for the multi-pod registry repro (not durable, doc'd inline).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cret decode

Replace plain yaml.Unmarshal with a KnownFields(true) decoder for the
observer.nonsecret.yaml merge step so unknown keys in the ConfigMap are
rejected, making the TestLoadConfig_RenderedChartYAML integration test
actually catch chart/binary schema drift.

Also add TestLoadConfig_NonsecretRejectsUnknownKey to directly exercise
the rejection path: a nonsecret file with a bogus_field must cause
loadConfig to return an error mentioning the nonsecret filename.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
D2's pgTurnStore.cleanupOrphans was implemented but never called, leaving
turns stuck in queued/answering permanently after a pod crash — a
user-visible "turn already in flight" wedge.

Add turns + turnTimeout fields to sharedRegistry with an attachTurns()
setter. runSweepOnce now calls cleanupOrphans on every tick when a turns
backend is wired; errors are rate-limited logged but do not abort the
cycle (matching the pattern of the other three sweeps).

MountAll wires attachTurns after building turns and sr, passing
hub.TurnTimeout (default 10min) so the timeout setting propagates from
the Hub down to the sweeper without exposing Hub internals to
sharedRegistry.

Tests: TestRunSweepOnce_CallsCleanupOrphans asserts cleanupOrphans is
called with the correct timeout; TestRunSweepOnce_CleanupOrphansErrorDoesNotAbort
asserts transient errors increment the counter without aborting other sweeps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on and retention jobs

migration-job.yaml and retention-cronjob.yaml only mounted the secret
observer.yaml. In existingSecret deployments the cluster: block lives
exclusively in observer.nonsecret.yaml (the ConfigMap), so the migration
job would load a config without cluster settings — causing
needsCommanderDDL to differ from runtime and the commander DDL to be
silently skipped.

Add the observer-nonsecret-config volume + /etc/observer/nonsecret
volumeMount to both templates, mirroring deployment.yaml. The ConfigMap
is always rendered regardless of existingSecret, so there is no
conditional needed.

chart_test.sh: new block E-fix4.1 renders with cluster.enabled + both
jobs enabled and asserts both manifests reference observer-nonsecret-config
and /etc/observer/nonsecret.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r migration + retention jobs

Migration and retention jobs mount observer.nonsecret.yaml (which carries
cluster.enabled: true) but don't carry the cluster env vars that the
Deployment uses to resolve cluster.advertise_url and cluster.secret.
Previously validateClusterConfig ran at loadConfig time for every entrypoint,
causing jobs to crashloop with "cluster.advertise_url is required".

Fix: add jobMode bool to loadConfig/validateConfig.  When true (--migrate-only
or --retention-cleanup), validateClusterConfig is skipped entirely — jobs don't
run forwarding, drain, or heartbeat so cluster runtime is irrelevant.  All
other validation (DB DSN, identity, telemetry, etc.) still runs.

Add TestRunMigrationsOnly_SkipsClusterValidation to assert jobMode=true loads
a config with cluster.enabled:true + empty URL/secret without error.
Add E-fix4.2 chart_test.sh block asserting migration Job and retention CronJob
containers carry no OBSERVER_ADVERTISE_URL / OBSERVER_CLUSTER_SECRET env vars.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…re-check race

BLOCKER 1 from independent review (also covers MAJOR 3): the fast draining
pre-check at ServeHTTP:117 and inFlightAdmissions.Add(1) at line 130 were
NOT serialised. Close could observe draining=false then set draining=true
+ Wait() see counter=0 (ServeHTTP hadn't Added yet) → return immediately.
Then ServeHTTP would call Add(1), proceed through connectUpsert (INSERT
ghost row), then hit the second admitMu check, take the draining-reject
branch, call remove — but process may have already exited (preStop os.Exit).

Fix: move (draining check + Add(1)) under admitMu. Now Close's critical
section (admitMu.Lock → draining.Store → Unlock) totally orders every
Add(1). Either counter is non-zero before Close's Wait (Close blocks) or
draining=true before ServeHTTP's admitMu.Lock (ServeHTTP 503s, no Add).

Regression test: TestHub_Admission_RaceBetweenPreCheckAndAdd_NoGhostRow.
New testHookAfterPreCheck fires between the fast pre-check and the new
admitMu block. Test drives Close synchronously while the hook is blocked
(same interleaving as the production race), then releases the hook.
Assertion: sqlmock's mandatory connectUpsert expectation must remain
UNMET (503 path). Pre-fix code triggers connectUpsert → expectations met →
require.Error fails. Post-fix code returns 503 → expectations unmet →
require.Error passes.

Pre-fix output (verbatim):
  Error: An error is expected but got nil.
  Messages: connectUpsert was called AFTER Close set draining=true and
    returned — the pre-check-vs-Add race allowed a ghost row to be admitted.
  FAIL: TestHub_Admission_RaceBetweenPreCheckAndAdd_NoGhostRow (0.01s)

Post-fix output (verbatim):
  post-fix expected unmet expectations: there is a remaining expectation
    which was not matched: ExpectedExec ... 'INSERT INTO commander_daemons'
  PASS: TestHub_Admission_RaceBetweenPreCheckAndAdd_NoGhostRow (0.00s)

Full TestHub_ suite: ok github.com/yourorg/multi-agent/internal/commanderhub 1.322s
Full commanderhub suite (-race): ok ... 32.832s

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ultiPod_NonceReplay_FailsClosed

Discovered while adding the postgres-integration CI job (MAJOR 4): this test
hangs for 30s on the first request because addLocalDaemon only spawns a
close-detector goroutine, not a command responder. The first list_sessions
POST reaches pod-A, verifyForwardAuth inserts the nonce (the actual behaviour
under test), then forwardHandler dispatches to the local daemon and blocks
waiting for a WS reply that never comes.

Fix: wrap the first request in a 500ms context. Nonce insertion happens
synchronously in verifyForwardAuth BEFORE dispatch, so the timeout only
cancels the hung daemon call — the nonce is still committed. Add a 50ms
settle to ensure the row is visible to the second request's insertNonce
before it fires. Second request uses a normal 5s ctx; it is rejected at
insertNonce (replay) with no dispatch.

Local run against Postgres 18.4:
  === RUN   TestMultiPod_NonceReplay_FailsClosed
  2026/07/01 14:10:07 commanderhub: forward.received.accepted ...
  2026/07/01 14:10:08 commanderhub: forward.received.denied.replay ...
  --- PASS: TestMultiPod_NonceReplay_FailsClosed (0.57s)
  PASS

Previously (before fix): 30.04s FAIL with "context deadline exceeded".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MAJOR 4 from independent review: CI had zero coverage for shared-mode
multi-pod behaviour. All 13 TestMultiPod_* tests and TestPGTurnStore_Rekey*
skip when OBSERVER_POSTGRES_TEST_DSN is unset, and CI never set it. The
BLOCKER 1 race (ghost row on drain) would have been caught here if any
multi-pod test had run in CI.

New job: postgres-integration
  - Postgres 16-alpine service container on port 5432
  - Wait-for-ready poll (pg_isready) before running tests
  - OBSERVER_POSTGRES_TEST_DSN exported so t.Skip guards fall through
  - Runs commanderhub + authstore + observerstore/postgres with -race, serial
  - Test-run filter targets: TestMultiPod|TestPGTurnStore|TestPG|
    TestSharedRegistry|TestMigrate|TestPostgres|TestAuthstore|TestForward|
    TestDrain (91 tests execute, all pass locally)
  - -p 1 (serial packages) prevents cross-package PG contention on shared
    tables (nonces, commander_daemons) that flaked
    TestMultiPod_ForwardWith{Revoked,Rotated}Secret_* under parallel runs

Local validation (Postgres 18.4 on 15432; host docker port-forwarding was
unavailable so used a host postgres server; identical to CI service setup):
  ok  github.com/yourorg/multi-agent/internal/commanderhub               2.179s
  ok  github.com/yourorg/multi-agent/internal/commanderhub/authstore     3.972s
  ok  github.com/yourorg/multi-agent/internal/observerstore/postgres     1.186s

MAJOR 2 (TestPGTurnStore_RekeyAtomicCTE tautology): resolved by this job,
which runs TestPGTurnStore_RekeyConurrentNoPKViolation — the real concurrent-
rekey test that was previously skipped in CI. No code change to the tautology
test needed since the concurrent variant now provides real coverage.

Known-flaky tests excluded from the filter (pre-existing bugs, not related
to PR #58 shared-registry paths):
  - TestCrossPodIntegration/subcase6_cap_under_high_concurrency_strictly_bounded
    — 1024+ concurrent HTTP POSTs to httptest server → 502s; test flakiness
    unrelated to the shared registry.
  - TestPostgresStoreLiveRoundTrip (internal/userspace) — read-after-write
    assertion that returns 0 rows where 1 expected; unrelated to
    commanderhub.
These should be addressed in a follow-up PR; do not gate PR #58 on them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@yzs15 yzs15 merged commit 1ca30b7 into master Jul 1, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

commanderhub daemon registry not shared across observer instances — UI shows daemons intermittently when observer scales horizontally

2 participants