Context
`docs/design/SESSION_HUB_DO.md` proposes a Cloudflare Durable Objects (DO) "session hub": one DO instance per session, multi-client WebSocket fanout, SQLite-backed state, hibernation-friendly. The design is good. The runtime choice (Cloudflare) raises open questions for our buyer:
- Data residency: session payloads live in Cloudflare's footprint. Tenant-data-residency questions during enterprise procurement become "what's in CF and what's in your project."
- Operational coupling: DO ties session state to Cloudflare's compute lifecycle, which we would have to operate separately from our existing GKE/Postgres/NATS/Temporal stack.
- Integration friction: the rest of the platform (Temporal workflows, NATS bus, Postgres ledger, audit pipeline) lives in our GKE project. Session state on CF means cross-cloud egress for every audit/governance handoff.
This issue scopes the GCP-native equivalent so we have an explicit alternative to compare DO against, rather than choosing DO by path dependence. We should be able to answer "DO vs GCP-native" with numbers, not vibes.
DO concept → GCP primitive mapping

| DO concept | GCP-native equivalent | Notes |
|---|---|---|
| One DO instance per session, sticky by ID | GKE StatefulSet + Envoy Gateway HTTPRoute with `consistent_hash` policy hashing `x-maestro-session-id` header to a pod | Each pod handles a shard of sessions. Pod identity comes from `statefulset.kubernetes.io/pod-name`; Envoy Ringhash supports per-endpoint `hash_key` so we can pin shards to ordinals if needed |
| Single-threaded ordering per actor | In-pod actor map: `Map<sessionId, ActorWorker>` where each ActorWorker is a single async loop draining a per-session channel | TS or Rust (control-plane-rs already has the bones). Same single-writer guarantee, no cross-actor contention |
| Built-in DO storage (SQLite-backed) | NATS JetStream KV bucket for hot per-session state (immediately consistent, history-aware) + Cloud SQL Postgres for durable event log | NATS KV gives us DO-like per-key consistency without adding Spanner. Postgres holds the long-term `sessions`/`participants`/`events` tables we already designed |
| `acceptWebSocket` + `serializeAttachment` (hibernation) | Idle eviction: flush in-memory attachment metadata to NATS KV with TTL; when client reconnects, hash routes them back to (potentially different) pod which rehydrates from KV | No free hibernation on GCP. Mitigations: multiplex N sessions per pod (default 1 pod ≈ 500–2000 sessions), HPA scales pod count by active session count not CPU, idle-evict timer flushes state and frees memory |
| Worker that fronts the DO | Cloud Run (auth/validation router) or an extra Envoy filter | Cloud Run can do auth + forwarding; downstream WS upgrade still hits GKE pods. WS must not terminate on Cloud Run (60-min hard cap, best-effort affinity). |
| Cross-DO fanout (web + Slack + IDE all attached) | NATS subject per session: `session.{sessionId}.events`. Each pod publishes; clients connected to any pod consume via JetStream durable consumer | Already deployed (`pkg/natsbus`). Solves the case where Maestro replicas split clients of one session across pods. |
| Runner → hub event ingestion | HTTP `POST /sessions/:id/events` to the routed pod (still works) or runners publish directly to the NATS subject | The proto/contract from `SESSION_HUB_DO.md` survives untouched on the wire. |
| Long-running durable runner work | Temporal workflow, workflow ID = session ID | Already deployed (k8s/temporal/). Use `SignalWorkflow` from the hub for cancel/queue/steer. |

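
The "single-threaded ordering per actor" row can be sketched as follows. This is a minimal illustration in Python with `asyncio`, not the proposed TS or Rust implementation; `SessionActor`, `Hub`, and `demo` are hypothetical names, and the in-memory `log` stands in for whatever the real worker would publish or persist.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SessionActor:
    """Hypothetical single-writer actor: one loop drains one queue."""
    session_id: str
    inbox: asyncio.Queue = field(default_factory=asyncio.Queue)
    seq: int = 0
    log: list = field(default_factory=list)

    async def run(self) -> None:
        while True:
            payload = await self.inbox.get()
            if payload is None:  # sentinel: stop draining
                break
            # Only this loop touches seq/log, so no lock is needed.
            self.seq += 1
            self.log.append((self.seq, payload))


class Hub:
    """Hypothetical in-pod actor map: session ID -> actor + its task."""

    def __init__(self) -> None:
        self.actors = {}
        self.tasks = {}

    def submit(self, session_id: str, payload) -> None:
        actor = self.actors.get(session_id)
        if actor is None:
            actor = SessionActor(session_id)
            self.actors[session_id] = actor
            self.tasks[session_id] = asyncio.create_task(actor.run())
        actor.inbox.put_nowait(payload)

    async def shutdown(self) -> None:
        for actor in self.actors.values():
            actor.inbox.put_nowait(None)
        await asyncio.gather(*self.tasks.values())


async def demo():
    hub = Hub()
    for i in range(3):
        hub.submit("s1", f"a{i}")
        hub.submit("s2", f"b{i}")
    await hub.shutdown()
    return hub.actors["s1"].log, hub.actors["s2"].log
```

Calling `asyncio.run(demo())` returns one log per session, each with its own sequence starting at 1. Interleaved submissions to different sessions do not affect each other's ordering, which is the property the acceptance criteria rely on for per-session sequence numbers.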
Why not Cloud Run for the hub itself
- WebSocket timeout caps at 60 minutes (hard).
- Session affinity is best-effort cookie-based — new requests can hit different instances.
- Per-instance state on Cloud Run can't be relied on; the multi-client fanout requires real cross-instance sync (which puts us back at NATS anyway, but with worse routing).
Conclusion: hub runs on GKE. Cloud Run is fine for the auth/validation front layer if we want serverless there.
Why not just deploy Rivet on GKE
- Real option. Rivet is the same primitive Sourcegraph picked, and they self-host. docs.rivet.dev/connect/kubernetes covers GKE deploy.
- Trade-off: a new operational dependency. We already have GKE + Envoy + NATS + Postgres + Temporal. Rivet replaces some of those for one workload type. The integration cost is paying for two ways to do durable execution (Rivet for sessions, Temporal for runs) instead of one.
- Recommendation: keep Rivet on the table as a v2 simplification once we know the workload, but do v1 with primitives we already operate.
Trade-off matrix vs DO

| Dimension | Durable Objects | GCP-native (this design) |
|---|---|---|
| Idle session cost | Effectively zero (hibernation) | Pod-time (mitigated by N-sessions-per-pod and HPA-on-active-count) |
| Steady-state cost (high session count) | Linear with session-active-time | Pod count + Memorystore + NATS; can be cheaper at scale |
| Time to first byte (cold start) | ~50ms (CF edge) | Pod is warm; rehydrate from KV ~5–20ms |
| Data residency | CF footprint | Our GCP project, tenant region |
| Audit pipeline integration | Cross-cloud egress to platform | Same VPC/project |
| Operational surface | Zero ops | Envoy Gateway config, StatefulSet ops, NATS KV bucket ops |
| Vendor lock-in | High | Low (k8s + Envoy + NATS + PG are portable) |
| Hibernation primitive | Native (`acceptWebSocket`/`serializeAttachment`) | Build it: idle-evict + KV flush + on-reattach rehydrate |
| Multi-client fanout | Native (DO is the rendezvous) | NATS subject per session |
| WebSocket draining on config push | Stable | Open Envoy Gateway bug envoyproxy/gateway#8889: equivalent xDS updates can drain active WS connections. Needs mitigation (config rate-limiting + reconnect handling on clients) |

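
The "build it" hibernation row (idle-evict + KV flush + on-reattach rehydrate) could look roughly like the sketch below. Everything here is an assumption made for illustration: `FakeKV` is an in-memory stand-in for a KV bucket with a bucket-wide TTL and is not the NATS client API, `SessionCache` is a hypothetical per-pod cache, and `{"last_seq": 0}` is a placeholder default state.

```python
import json
import time


class FakeKV:
    """In-memory stand-in for a KV bucket with a bucket-wide TTL (hypothetical)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}

    def put(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self.clock():
            self._data.pop(key, None)  # expired or missing
            return None
        return entry[0]


class SessionCache:
    """Hypothetical per-pod cache of live session state."""

    def __init__(self, kv, idle_seconds, clock=time.monotonic):
        self.kv = kv
        self.idle_seconds = idle_seconds
        self.clock = clock
        self.live = {}  # session_id -> (state, last_seen)

    def touch(self, session_id, state=None):
        """Return current state, rehydrating from KV if this pod has none."""
        if session_id in self.live:
            current = self.live[session_id][0]
        else:
            raw = self.kv.get(session_id)
            current = json.loads(raw) if raw is not None else {"last_seq": 0}
        if state is not None:
            current = state
        self.live[session_id] = (current, self.clock())
        return current

    def evict_idle(self):
        """Flush idle sessions to KV and free their memory; return evicted IDs."""
        now = self.clock()
        evicted = []
        for sid, (state, last_seen) in list(self.live.items()):
            if now - last_seen >= self.idle_seconds:
                self.kv.put(sid, json.dumps(state).encode())
                del self.live[sid]
                evicted.append(sid)
        return evicted
```

Because the flushed state lives in the shared bucket rather than in the pod, a reconnect that hashes to a different pod can call `touch` there and get the same state back, as long as the TTL has not expired.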
Concrete deploy/ deltas (what would land in evalops/deploy)
- `k8s/production/maestro/maestro-statefulset.yaml` — convert maestro-deployment to StatefulSet, ordinal-named pods, headless service for stable DNS
- `k8s/production/maestro/envoy-gateway-route.yaml` — Gateway API HTTPRoute with `BackendTrafficPolicy` ringhash on `x-maestro-session-id` header
- `k8s/gcp-runtime/memorystore-maestro.yaml` (optional) — hot state tier if NATS KV proves too slow
- NATS KV bucket provisioned via existing `ensemble-nats.yaml` config (no new system)
- `docs/runbooks/maestro.md` — add session-hub failure modes (pod loss, Envoy config push, NATS KV unavailable)
- `tests/preflight/test_maestro_consistent_hash.py` — preflight that 1000 mock session IDs distribute evenly across pods and survive a rolling restart with <2% reconnects landing on a different pod
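
The consistent-hash preflight in the last bullet could start from a sketch like this. It is a generic ring hash written for illustration, assuming MD5-derived points and 200 virtual nodes per pod; it is not Envoy's ring construction or hash function, and `HashRing`, `assign`, and the `maestro-N` pod names are hypothetical. A real preflight would drive traffic through the gateway and read pod metrics instead.

```python
import bisect
import hashlib
from collections import Counter


def _point(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    """Generic ring hash with virtual nodes (a stand-in, not Envoy's)."""

    def __init__(self, pods, vnodes=200):
        self._ring = sorted(
            (_point(f"{pod}#{i}"), pod) for pod in pods for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def pod_for(self, session_id: str) -> str:
        # First virtual node clockwise from the session's position.
        idx = bisect.bisect_right(self._positions, _point(session_id))
        return self._ring[idx % len(self._ring)][1]


def assign(pods, session_ids):
    """Return {session_id: pod} for the given pod set."""
    ring = HashRing(pods)
    return {sid: ring.pod_for(sid) for sid in session_ids}


pods = [f"maestro-{i}" for i in range(4)]
sessions = [f"session-{i}" for i in range(1000)]

before = assign(pods, sessions)
per_pod = Counter(before.values())

# A rolling restart keeps StatefulSet pod names, so the ring is rebuilt
# from the same inputs and every session maps to the same pod.
after_restart = assign(pods, sessions)

# Losing one pod only moves the sessions that pod owned.
after_loss = assign(pods[:-1], sessions)
moved = [sid for sid in sessions if before[sid] != after_loss[sid]]
```

The two properties worth asserting are that `after_restart == before` (stable ordinal names give a stable mapping) and that every session in `moved` was previously on the removed pod. Evenness is statistical, so a preflight should check `per_pod` against a tolerance band, not exact equality.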
Acceptance criteria
- 1000 concurrent sessions distribute roughly evenly across N pods (verified by metrics)
- After a rolling restart of the StatefulSet, >98% of WS connections recover via client reconnect; connections that lost in-memory state rehydrate it from NATS KV in <500ms
- Web + IDE + TUI all attached to one session see ordered events with the same per-session sequence numbers
- Hub failure modes are documented in the runbook, with one preflight per failure
- Cost model documented: $/active-session/hr at 100 / 1000 / 10k concurrent sessions, comparable to a DO cost estimate at the same scale
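
One possible shape for the cost model in the last criterion is sketched below. The formula is an assumption, not something from the design doc: pod count is the session count divided by sessions-per-pod, rounded up, and NATS, Postgres, and gateway overhead are lumped into one fixed hourly figure. All function and parameter names are hypothetical, and any prices passed in are placeholders, not quotes.

```python
import math


def cost_per_active_session_hour(concurrent_sessions, sessions_per_pod,
                                 pod_cost_per_hour, fixed_cost_per_hour):
    """$/active-session/hr under a simple pods-plus-fixed-overhead model."""
    if concurrent_sessions <= 0 or sessions_per_pod <= 0:
        raise ValueError("session counts must be positive")
    pods = math.ceil(concurrent_sessions / sessions_per_pod)
    total_per_hour = pods * pod_cost_per_hour + fixed_cost_per_hour
    return total_per_hour / concurrent_sessions


def cost_table(sessions_per_pod, pod_cost_per_hour, fixed_cost_per_hour):
    """Evaluate the model at the three scales named in the criteria."""
    return {
        n: cost_per_active_session_hour(
            n, sessions_per_pod, pod_cost_per_hour, fixed_cost_per_hour
        )
        for n in (100, 1000, 10_000)
    }
```

For example, `cost_table(sessions_per_pod=1000, pod_cost_per_hour=0.10, fixed_cost_per_hour=0.40)` uses placeholder prices. It shows the fixed overhead dominating at 100 sessions and amortizing at 10k, which is the comparison to set against a DO estimate at the same scale.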
Decision needed
Before any code:
- Are we committing to GKE-native for v1, or do we want to spike Rivet-on-GKE as a parallel option and benchmark?
- Where does session state authoritatively live — NATS JetStream KV (operationally lighter, requires JetStream up-time SLO) or Cloud SQL Postgres (slower hot-path, simpler single-source-of-truth)?
- Do we keep `SESSION_HUB_DO.md` as the design and add a sibling `SESSION_HUB_GKE.md`, or do we replace?