
GCP-native equivalent of SESSION_HUB_DO (Envoy ring-hash + StatefulSet + NATS JetStream) #299

@haasonsaas

Context

docs/design/SESSION_HUB_DO.md proposes a Cloudflare Durable Objects "session hub" — one DO instance per session, multi-client WebSocket fanout, SQLite-backed state, hibernation-friendly. The design is good. The runtime choice (Cloudflare) has open questions for our buyer:

  • Data residency: session payloads live in Cloudflare's footprint. Tenant-data-residency questions during enterprise procurement become "what's in CF and what's in your project."
  • Operational coupling: DO ties session state to Cloudflare's compute lifecycle, separately from our existing GKE/Postgres/NATS/Temporal stack.
  • Integration friction: the rest of the platform (Temporal workflows, NATS bus, Postgres ledger, audit pipeline) lives in our GKE project. Session state on CF means cross-cloud egress for every audit/governance handoff.

This issue scopes the GCP-native equivalent so we have an explicit alternative to compare DO against, rather than defaulting to DO out of path dependence. We should be able to answer "DO vs GCP-native" with numbers, not vibes.

DO concept → GCP primitive mapping

| DO concept | GCP-native equivalent | Notes |
| --- | --- | --- |
| One DO instance per session, sticky by ID | GKE StatefulSet + Envoy Gateway HTTPRoute with consistent_hash policy hashing x-maestro-session-id header to a pod | Each pod handles a shard of sessions. Pod identity comes from statefulset.kubernetes.io/pod-name; Envoy Ringhash supports per-endpoint hash_key so we can pin shards to ordinals if needed |
| Single-threaded ordering per actor | In-pod actor map: `Map<sessionId, ActorWorker>` where each ActorWorker is a single async loop draining a per-session channel | TS or Rust (control-plane-rs already has the bones). Same single-writer guarantee, no cross-actor contention |
| Built-in DO storage (SQLite-backed) | NATS JetStream KV bucket for hot per-session state (immediately consistent, history-aware) + Cloud SQL Postgres for durable event log | NATS KV gives us DO-like per-key consistency without adding Spanner. Postgres holds the long-term sessions/participants/events tables we already designed |
| acceptWebSocket + serializeAttachment (hibernation) | Idle eviction: flush in-memory attachment metadata to NATS KV with TTL; when client reconnects, hash routes them back to (potentially different) pod which rehydrates from KV | No free hibernation on GCP. Mitigations: multiplex N sessions per pod (default 1 pod ≈ 500–2000 sessions), HPA scales pod count by active session count not CPU, idle-evict timer flushes state and frees memory |
| Worker that fronts the DO | Cloud Run (auth/validation router) or an extra Envoy filter | Cloud Run can do auth + forwarding; downstream WS upgrade still hits GKE pods. WS must not terminate on Cloud Run (60-min hard cap, best-effort affinity). |
| Cross-DO fanout (web + Slack + IDE all attached) | NATS subject per session: session.{sessionId}.events. Each pod publishes; clients connected to any pod consume via JetStream durable consumer | Already deployed (`pkg/natsbus`). Solves the case where Maestro replicas split clients of one session across pods. |
| Runner → hub event ingestion | HTTP `POST /sessions/:id/events` to the routed pod (still works) or runners publish directly to the NATS subject | The proto/contract from `SESSION_HUB_DO.md` survives untouched on the wire. |
| Long-running durable runner work | Temporal workflow, workflow ID = session ID | Already deployed (k8s/temporal/). Use `SignalWorkflow` from the hub for cancel/queue/steer. |
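
To make the "single-threaded ordering per actor" row concrete, here is a minimal sketch of an in-pod actor map using Python's `asyncio`. It is only an illustration of the single-writer pattern, not the planned TS or Rust implementation: `Hub`, `submit`, `drain`, `close`, and `demo` are hypothetical names, and the in-memory `log` stands in for whatever the real worker would persist or publish.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ActorWorker:
    """One session's single-writer loop (illustrative, in-memory only)."""
    session_id: str
    seq: int = 0
    log: list = field(default_factory=list)
    inbox: asyncio.Queue = field(default_factory=asyncio.Queue)
    task: Optional[asyncio.Task] = None

    async def run(self) -> None:
        # Drain the per-session channel one event at a time, so events get
        # sequence numbers in exactly the order they were enqueued.
        while True:
            event = await self.inbox.get()
            self.seq += 1
            self.log.append((self.seq, event))
            self.inbox.task_done()


class Hub:
    """Hypothetical in-pod actor map: session ID -> ActorWorker."""

    def __init__(self) -> None:
        self.actors = {}

    def submit(self, session_id: str, event: str) -> None:
        # Lazily create the actor on first use; must be called from within
        # a running event loop because it spawns the worker task.
        actor = self.actors.get(session_id)
        if actor is None:
            actor = ActorWorker(session_id)
            actor.task = asyncio.create_task(actor.run())
            self.actors[session_id] = actor
        actor.inbox.put_nowait(event)

    async def drain(self) -> None:
        # Wait until every enqueued event has been processed.
        for actor in self.actors.values():
            await actor.inbox.join()

    async def close(self) -> None:
        # Cancel all worker loops and wait for them to finish.
        tasks = [a.task for a in self.actors.values()]
        for t in tasks:
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo() -> dict:
    hub = Hub()
    for i in range(3):
        hub.submit("s1", f"a{i}")
        hub.submit("s2", f"b{i}")
    await hub.drain()
    result = {sid: a.log for sid, a in hub.actors.items()}
    await hub.close()
    return result
```

Running `asyncio.run(demo())` interleaves submissions to two sessions, yet each session's log carries its own gap-free sequence starting at 1. That per-session numbering is the property the fanout acceptance criterion relies on.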

Why not Cloud Run for the hub itself

  • WebSocket timeout caps at 60 minutes (hard).
  • Session affinity is best-effort cookie-based — new requests can hit different instances.
  • Per-instance state on Cloud Run can't be relied on; the multi-client fanout requires real cross-instance sync (which puts us back at NATS anyway, but with worse routing).

Conclusion: hub runs on GKE. Cloud Run is fine for the auth/validation front layer if we want serverless there.

Why not just deploy Rivet on GKE

  • Real option. Rivet is the same primitive Sourcegraph picked, and they self-host. docs.rivet.dev/connect/kubernetes covers GKE deploy.
  • Trade-off: a new operational dependency. We already have GKE + Envoy + NATS + Postgres + Temporal. Rivet replaces some of those for one workload type. The integration cost is paying for two ways to do durable execution (Rivet for sessions, Temporal for runs) instead of one.
  • Recommendation: keep Rivet on the table as a v2 simplification once we know the workload, but do v1 with primitives we already operate.

Trade-off matrix vs DO

| Dimension | Durable Objects | GCP-native (this design) |
| --- | --- | --- |
| Idle session cost | Effectively zero (hibernation) | Pod-time (mitigated by N-sessions-per-pod and HPA-on-active-count) |
| Steady-state cost (high session count) | Linear with session-active-time | Pod count + Memorystore + NATS; can be cheaper at scale |
| Time to first byte (cold start) | ~50ms (CF edge) | Pod is warm; rehydrate from KV ~5–20ms |
| Data residency | CF footprint | Our GCP project, tenant region |
| Audit pipeline integration | Cross-cloud egress to platform | Same VPC/project |
| Operational surface | Zero ops | Envoy Gateway config, StatefulSet ops, NATS KV bucket ops |
| Vendor lock-in | High | Low (k8s + Envoy + NATS + PG are portable) |
| Hibernation primitive | Native (acceptWebSocket/serializeAttachment) | Build it: idle-evict + KV flush + on-reattach rehydrate |
| Multi-client fanout | Native (DO is the rendezvous) | NATS subject per session |
| WebSocket draining on config push | Stable | Open Envoy Gateway bug envoyproxy/gateway#8889: equivalent xDS updates can drain active WS connections. Needs mitigation (config rate-limiting + reconnect handling on clients) |
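
The "build it" hibernation path (idle-evict, KV flush, rehydrate on reattach) can be sketched roughly as follows. This is a sketch under stated assumptions: `InMemoryKV` is a stand-in written for this example, not the NATS client API, and `SessionCache`, `touch`, and `evict_idle` are hypothetical names. The injectable `clock` exists only so the behaviour can be exercised without sleeping.

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryKV:
    """Stand-in for a KV bucket with a per-entry TTL (not the real NATS client)."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._data: Dict[str, Tuple[float, bytes]] = {}

    def put(self, key: str, value: bytes) -> None:
        # Store the value together with its expiry time.
        self._data[key] = (self.clock() + self.ttl_s, value)

    def get(self, key: str) -> Optional[bytes]:
        entry = self._data.get(key)
        if entry is None or entry[0] <= self.clock():
            self._data.pop(key, None)  # missing or expired
            return None
        return entry[1]


class SessionCache:
    """Per-pod in-memory session state with idle eviction to the KV."""

    def __init__(self, kv: InMemoryKV, idle_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.kv, self.idle_s, self.clock = kv, idle_s, clock
        self.live: Dict[str, dict] = {}
        self.last_seen: Dict[str, float] = {}

    def touch(self, session_id: str) -> dict:
        """Return session state, rehydrating from the KV on a local miss."""
        state = self.live.get(session_id)
        if state is None:
            raw = self.kv.get(session_id)
            state = json.loads(raw) if raw is not None else {"seq": 0}
            self.live[session_id] = state
        self.last_seen[session_id] = self.clock()
        return state

    def evict_idle(self) -> list:
        """Flush and drop sessions idle for at least idle_s; return their IDs."""
        now = self.clock()
        idle = [s for s, t in self.last_seen.items() if now - t >= self.idle_s]
        for sid in idle:
            self.kv.put(sid, json.dumps(self.live.pop(sid)).encode())
            del self.last_seen[sid]
        return idle
```

Two `SessionCache` instances sharing one KV model the reconnect case: the first pod evicts an idle session, and a different pod calling `touch` with the same session ID picks the state up from the KV. If the TTL has already lapsed, the session starts fresh, which is the behaviour the runbook would need to call out.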

Concrete deploy/ deltas (what would land in evalops/deploy)

  1. `k8s/production/maestro/maestro-statefulset.yaml` — convert maestro-deployment to StatefulSet, ordinal-named pods, headless service for stable DNS
  2. `k8s/production/maestro/envoy-gateway-route.yaml` — Gateway API HTTPRoute with `BackendTrafficPolicy` ringhash on `x-maestro-session-id` header
  3. `k8s/gcp-runtime/memorystore-maestro.yaml` (optional) — hot state tier if NATS KV proves too slow
  4. NATS KV bucket provisioned via existing `ensemble-nats.yaml` config (no new system)
  5. `docs/runbooks/maestro.md` — add session-hub failure modes (pod loss, Envoy config push, NATS KV unavailable)
  6. `tests/preflight/test_maestro_consistent_hash.py` — preflight that 1000 mock session IDs distribute evenly across pods and survive a rolling restart with <2% reconnects landing on a different pod
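
As a rough idea of what the preflight in item 6 could check offline, the sketch below models ring-hash routing in plain Python. It is an assumption-laden model, not Envoy's actual ring-hash implementation: the `Ring` class, the SHA-256 hash, the 200 virtual nodes per pod, and the `maestro-N` pod names are all illustrative choices. A real preflight would measure routing through the gateway instead.

```python
import bisect
import hashlib
from collections import Counter


def _h(s: str) -> int:
    # 64-bit hash derived from SHA-256; any stable hash works for the model.
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")


class Ring:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, pods, vnodes: int = 200):
        self._points = sorted(
            (_h(f"{pod}#{i}"), pod) for pod in pods for i in range(vnodes)
        )
        self._keys = [k for k, _ in self._points]

    def route(self, session_id: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        i = bisect.bisect_right(self._keys, _h(session_id)) % len(self._keys)
        return self._points[i][1]


def distribution(ring: Ring, session_ids) -> Counter:
    """Count how many sessions land on each pod."""
    return Counter(ring.route(s) for s in session_ids)


def moved(before: Ring, after: Ring, session_ids) -> list:
    """Session IDs whose pod differs between two ring configurations."""
    return [s for s in session_ids if before.route(s) != after.route(s)]


PODS = [f"maestro-{n}" for n in range(4)]
SESSIONS = [f"session-{n}" for n in range(1000)]

full = Ring(PODS)
# Model the moment during a rolling restart when maestro-2 is down.
degraded = Ring([p for p in PODS if p != "maestro-2"])
# Once maestro-2 returns under the same name, the ring is identical again.
restored = Ring(PODS)
```

In this model, `moved(full, degraded, SESSIONS)` contains only sessions that were on `maestro-2`, and `moved(full, restored, SESSIONS)` is empty because a StatefulSet pod returns with the same identity. Whether the real gateway matches this depends on how Envoy keys ring entries, which is what the preflight should confirm.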

Acceptance criteria

  • 1000 concurrent sessions distribute roughly evenly across N pods (verified by metrics)
  • A rolling restart of the StatefulSet preserves >98% of WS connections via client reconnect; each dropped connection rehydrates its state from NATS KV in <500ms after reconnecting
  • Web + IDE + TUI all attached to one session see ordered events with the same per-session sequence numbers
  • Hub failure modes are documented in the runbook, with one preflight per failure
  • Cost model documented: $/active-session/hr at 100 / 1000 / 10k concurrent sessions, comparable to a DO cost estimate at the same scale
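
One possible shape for the GCP-native side of the cost model is sketched below. The formula is an assumption (pods scale with concurrent sessions, plus a fixed hourly cost for shared infrastructure such as NATS), and every number in `PLACEHOLDER_INPUTS` is a made-up placeholder rather than a real price or a measured density. Only the 1000 sessions per pod figure is picked from the 500–2000 range mentioned above.

```python
import math


def cost_per_active_session_hour(
    concurrent_sessions: int,
    sessions_per_pod: int,
    pod_cost_per_hour: float,
    fixed_cost_per_hour: float,
    min_pods: int = 1,
) -> float:
    """Hourly cost per active session under a simple pods-plus-fixed model."""
    pods = max(min_pods, math.ceil(concurrent_sessions / sessions_per_pod))
    total_per_hour = pods * pod_cost_per_hour + fixed_cost_per_hour
    return total_per_hour / concurrent_sessions


# Placeholder inputs only: replace with real pricing and measured density.
PLACEHOLDER_INPUTS = dict(
    sessions_per_pod=1000,
    pod_cost_per_hour=0.10,
    fixed_cost_per_hour=0.50,
)

for n in (100, 1000, 10_000):
    cost = cost_per_active_session_hour(n, **PLACEHOLDER_INPUTS)
    print(f"{n:>6} concurrent sessions: ${cost:.5f} per active session per hour")
```

The shape matters more than the placeholder values: fixed costs dominate at 100 sessions and amortize away by 10k, which is where the comparison against a DO estimate gets interesting.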

Decision needed

Before any code:

  1. Are we committing to GKE-native for v1, or do we want to spike Rivet-on-GKE as a parallel option and benchmark?
  2. Where does session state authoritatively live — NATS JetStream KV (operationally lighter, requires JetStream up-time SLO) or Cloud SQL Postgres (slower hot-path, simpler single-source-of-truth)?
  3. Do we keep `SESSION_HUB_DO.md` as the design and add a sibling `SESSION_HUB_GKE.md`, or do we replace?
