Implement Raft-based leader election to coordinate distributed load tests. No etcd dependency — Raft runs embedded in each node via the openraft crate.
Cluster mode is opt-in
Cluster mode is disabled by default. Without CLUSTER_ENABLED=true the binary runs exactly as it does today — single node, config from env vars / YAML, no Raft, no gRPC.
```
CLUSTER_ENABLED=false   # default — standalone mode, no cluster behaviour
CLUSTER_ENABLED=true    # opt-in to Raft cluster formation
```
Node Discovery
Local / Dev — HashiCorp Consul DNS
Nodes discover each other via Consul DNS — no hardcoded IP list needed locally.
Each node registers itself as a Consul service (loadtest-cluster) on startup and exposes an HTTP health endpoint that Consul polls. Consul DNS loadtest-cluster.service.consul then resolves to all nodes with a passing health check.
Health check states
The health endpoint (GET /health/cluster) returns the node's current Raft state. Consul tracks this and updates the node's service tags accordingly:
| Tag | Meaning | DNS visible |
|---|---|---|
| `forming` | Node started, waiting to reach quorum | yes — `forming.loadtest-cluster.service.consul` |
| `follower` | In cluster, running as a Raft follower | yes — `follower.loadtest-cluster.service.consul` |
| `leader` | Elected Raft leader / coordinator | yes — `leader.loadtest-cluster.service.consul` |
The untagged loadtest-cluster.service.consul resolves to all healthy nodes regardless of state, so a new node can query it to find peers to join.
Health check response (JSON):

```json
{
  "state": "leader",
  "node_id": "node-dev-1",
  "leader_id": "node-dev-1",
  "term": 3,
  "peers": 3,
  "cluster_ready": true
}
```

Consul service registration on each node startup:

```json
{
  "name": "loadtest-cluster",
  "port": 7000,
  "tags": ["forming"],
  "checks": [{
    "http": "http://localhost:8080/health/cluster",
    "interval": "5s",
    "timeout": "2s"
  }]
}
```

Tags are updated dynamically as Raft state changes: forming → follower → leader.
A node that loses leader status updates its tag back to follower automatically.
Env vars:

```
DISCOVERY_MODE=consul
CONSUL_ADDR=http://127.0.0.1:8500
CONSUL_SERVICE_NAME=loadtest-cluster   # default
```
GCP / Production — Static peer list auto-join
```
CLUSTER_NODES=10.1.0.5:7000,10.2.0.5:7000,10.3.0.5:7000
```
Each node reads the peer list, attempts a gRPC handshake with each peer, and joins the Raft cluster. The first node to achieve quorum becomes the initial leader.
Key behaviours
- Single coordinator (leader) per test run — no split-brain
- Automatic failover: if the leader dies, remaining nodes elect a new one (quorum = majority)
- After election, the leader retrieves test config from GCS or Consul KV (issue #76: [Phase 4] Cluster config retrieval from GCS bucket (GCP) and Consul KV (local)) and distributes it to all followers via gRPC (issue #46: [Phase 4] gRPC worker communication protocol)
- Followers stream metrics back to the leader for aggregation (issue #49: [Phase 4] Result aggregation pipeline for distributed tests)
- Node IDs are derived from hostname + region tag so they are stable across restarts
Implementation notes
- `openraft` crate for the Raft state machine
- gRPC (tonic) as the Raft transport layer (reuses infrastructure from issue #46: [Phase 4] gRPC worker communication protocol)
- Health endpoint served on `CLUSTER_HEALTH_ADDR` (default `0.0.0.0:8080`, path `/health/cluster`)
- Consul tags updated via the Consul agent API on every Raft state transition
- `CLUSTER_NODE_ID` env var (or auto-derived from hostname) for stable node identity
- `CLUSTER_BIND_ADDR` for the Raft/gRPC listen address
Full env var reference
```
CLUSTER_ENABLED=false                 # opt-in (default: false)
CLUSTER_BIND_ADDR=0.0.0.0:7000        # Raft + gRPC listen address
CLUSTER_HEALTH_ADDR=0.0.0.0:8080      # HTTP health check endpoint
CLUSTER_NODE_ID=node-us-central1      # stable node identity (or auto from hostname)
DISCOVERY_MODE=static                 # static | consul
CLUSTER_NODES=ip1:7000,ip2:7000       # peer list (static discovery)
CONSUL_ADDR=http://127.0.0.1:8500     # Consul address (consul discovery)
CONSUL_SERVICE_NAME=loadtest-cluster  # Consul service name (consul discovery)
```