Description
Search before asking
- I had searched in the issues and found no similar issues.
Motivation
Problem Statement
With the current kvrocks and controller, a network partition can cause split-brain. Because there is no lease mechanism, the old kvrocks master may continue accepting writes indefinitely.
Design Principles
The following principles are derived from the analysis, balancing implementation cost against benefit.
- kvrocks master gates write operations on a lease; no lease means writes are rejected (configurable). The controller probe must carry: current master ID, cluster version, and lease duration. When kvrocks receives a probe, and the master ID and version match, it extends the lease.
- Shorten the controller session TTL to detect network partition events more quickly.
- Allow manual designation of the active partition; the controller provides a one-command switch to move the kvrocks master to the specified partition, for use when a partition failure is already known.
- kvrocks master validates its master status via an API call on each write (configurable); the controller exposes an API that returns who the current master is.
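The lease principle above can be sketched as follows. This is a minimal illustration, not kvrocks code: the `Probe` fields, the `WriteLease` class, and the `enforce` switch are hypothetical names chosen to mirror the description (master ID, cluster version, lease duration, configurable rejection).

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Probe:
    """Hypothetical controller probe payload; field names are assumptions."""
    master_id: str
    cluster_version: int
    lease_seconds: float


class WriteLease:
    """Local write lease held by a node that believes it is the master."""

    def __init__(self, node_id: str, cluster_version: int,
                 enforce: bool = True,
                 clock: Callable[[], float] = time.monotonic):
        self.node_id = node_id
        self.cluster_version = cluster_version
        self.enforce = enforce              # the "configurable" switch
        self._clock = clock                 # injectable for testing
        self._expires_at = float("-inf")    # no lease until a valid probe arrives

    def on_probe(self, probe: Probe) -> bool:
        """Extend the lease only when master ID and version both match."""
        if probe.master_id != self.node_id:
            return False
        if probe.cluster_version != self.cluster_version:
            return False
        self._expires_at = self._clock() + probe.lease_seconds
        return True

    def allows_write(self) -> bool:
        """Reject writes once the lease has expired (when enforcement is on)."""
        if not self.enforce:
            return True
        return self._clock() < self._expires_at
```

A monotonic clock is used here so that wall-clock adjustments on the node cannot silently lengthen the lease.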
Rationale
The Core Problem: Old Master Keeps Writing During Network Partition
Consider the following sequence of events:
- T=0: Network partition. Controller loses visibility of M. M is unaware and continues accepting writes.
- T=3~12s: Probe fails 4 consecutive times. Only the failure counter increments; no action is taken.
- T=15s: 5th failure. Controller elects S as the new master, writes to etcd, pushes SETNODES to S. S becomes master.
- M never receives SETNODES. It still believes itself to be master and continues accepting writes. Two masters coexist; data diverges.
- Network recovers. Controller detects M has a stale version and pushes the new topology. M is demoted to slave and syncs from S. All writes made during the partition are lost.
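The timing in this sequence can be reproduced with a small calculation. The sketch assumes the first failed probe lands one full interval after the partition; the function names are illustrative only.

```python
from typing import List


def probe_failure_times(ping_interval: float, max_ping_count: int) -> List[float]:
    """Times (seconds after the partition) at which each probe fails."""
    return [ping_interval * i for i in range(1, max_ping_count + 1)]


def failover_time(ping_interval: float, max_ping_count: int) -> float:
    """The controller acts only on the last consecutive failure."""
    return probe_failure_times(ping_interval, max_ping_count)[-1]


def dual_master_seconds(ping_interval: float, max_ping_count: int,
                        network_recovers_at: float) -> float:
    """How long two masters coexist when the old master never stops writing."""
    return max(0.0, network_recovers_at - failover_time(ping_interval, max_ping_count))


failures = probe_failure_times(3, 5)   # [3, 6, 9, 12, 15]
elected_at = failover_time(3, 5)       # 15
```

With a 3s interval and 5 allowed failures, the first four failures fall at 3s to 12s and the election happens at 15s, matching the timeline above. Without a lease, the dual-master period is bounded only by when the network recovers.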
The faster the master detects the anomaly and stops writing, the sooner data divergence is prevented. The cost tradeoffs are:
- Option 1, per-write remote check: the master checks with the controller on every write. Most effective, but adds a network round trip to every write, increases etcd concurrency, and raises write latency.
- Option 2, per-write local check: the master checks a local lease cache and rejects writes if the cached lease has expired. The controller probe refreshes the cache. Better cost/benefit balance, but a brief split-brain window remains if a partition occurs while the cached lease is still valid and a new master is assigned.
- Options 1 and 2 are complementary. Use option 1 for high-consistency requirements, option 2 otherwise. This can be controlled by configuration.
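One way the configuration switch between the two options could look is sketched below. The mode names (`"remote"`, `"lease"`, `"off"`) and both callables are hypothetical; in particular, failing closed when the controller is unreachable is an assumption of this sketch, not something the proposal specifies.

```python
from typing import Callable


def may_accept_write(mode: str,
                     node_id: str,
                     lease_valid: Callable[[], bool],
                     fetch_current_master: Callable[[], str]) -> bool:
    """Decide whether a node that believes it is master may accept a write."""
    if mode == "remote":
        # Option 1: ask the controller who the master is on every write.
        try:
            return fetch_current_master() == node_id
        except ConnectionError:
            # Assumption: an unreachable controller means reject the write.
            return False
    if mode == "lease":
        # Option 2: consult only the locally cached lease.
        return lease_valid()
    if mode == "off":
        # Current behaviour: no gating at all.
        return True
    raise ValueError(f"unknown write-check mode: {mode!r}")
```

Passing the lease check and the controller lookup as callables keeps the decision logic independent of how either is implemented.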
Impact of Single-Node Controller Probe
Because the controller master is the sole probe source for the kvrocks master, two failure modes exist under network partition:
- The controller master can probe the kvrocks master successfully, but both are in the failed partition and cannot serve clients.
- The controller is in the failed partition and cannot reach the kvrocks master, but the kvrocks master is in the healthy partition — triggering an unnecessary failover.
Some industry approaches use probes from clients in multiple partitions simultaneously, combining results to decide whether to trigger a failover. This requires an additional quorum mechanism.
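For comparison, a multi-observer decision of this kind might reduce to a strict-majority vote, roughly as below. The observer names and the majority rule are assumptions for illustration; the proposal does not adopt this approach.

```python
from typing import Mapping


def should_failover(probe_ok_by_observer: Mapping[str, bool]) -> bool:
    """Fail over only when a strict majority of observers cannot reach the master."""
    total = len(probe_ok_by_observer)
    if total == 0:
        return False  # no evidence, no action
    failed = sum(1 for ok in probe_ok_by_observer.values() if not ok)
    return failed * 2 > total


# A single isolated observer is outvoted, so no unnecessary failover occurs.
isolated_observer = should_failover({"zone-a": False, "zone-b": True, "zone-c": True})
# Two of three observers failing is a strict majority.
real_outage = should_failover({"zone-a": False, "zone-b": False, "zone-c": True})
```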
Given current constraints, using etcd's own mechanism to keep the controller master in the reachable partition is a pragmatic tradeoff:
- etcd `minLeaseTTL` = 2s; the minimum achievable lease is 2 seconds. The controller `sessionTTL` = 6, hardcoded as a constant. This can be made configurable; halving it to 3s is a reasonable starting point.
- Controller-to-kvrocks failover window = `ping_interval × max_ping_count` = 3s × 5 = 15s. Consider reducing this, e.g. to 10s.
Therefore:
- For high-consistency requirements: kvrocks must actively validate its master status via network on every write.
- Otherwise: the maximum split-brain window can be reduced to 12s (currently 21s). A split-brain can still occur within that 12s window — this is an explicit tradeoff.
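The window figures can be checked with a short calculation. This assumes the worst-case window is the controller session TTL plus the failover window, which reproduces the current 21s figure; the 2s ping interval used to reach a 10s failover window is a hypothetical choice, not a value from the proposal.

```python
def max_split_brain_window(session_ttl_s: float,
                           ping_interval_s: float,
                           max_ping_count: int) -> float:
    """Worst case, assuming: session TTL + (ping_interval * max_ping_count)."""
    return session_ttl_s + ping_interval_s * max_ping_count


current = max_split_brain_window(6, 3, 5)         # 6 + 15 = 21
with_halved_ttl = max_split_brain_window(3, 2, 5)  # 3 + 10 = 13
with_minimum_ttl = max_split_brain_window(2, 2, 5) # 2 + 10 = 12
```

Under this reading, a 12s window corresponds to a session TTL at the etcd minimum of 2s together with a 10s failover window, while a 3s session TTL would give 13s.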
Manual Intervention During Partition Failure
Consider a concrete network partition example: in a multi-region deployment, an entire region goes offline. Providing a manual intervention mechanism could make the recovery process more controlled — for example, a command to switch to a specified region. Once triggered, the system prioritizes instances in the designated region during the next master election.
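A region preference during election could be sketched as below. The `Replica` fields and the tie-break on replication offset are assumptions made for illustration; the proposal only states that instances in the designated region are prioritized.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Replica:
    """Hypothetical view of a failover candidate."""
    node_id: str
    region: str
    replication_offset: int


def pick_new_master(replicas: Sequence[Replica],
                    preferred_region: Optional[str] = None) -> Optional[Replica]:
    """Prefer the designated region, then the most up-to-date replica."""
    if not replicas:
        return None
    return max(
        replicas,
        key=lambda r: (r.region == preferred_region, r.replication_offset),
    )


candidates = [
    Replica("s1", "region-a", 1200),
    Replica("s2", "region-b", 1500),
]
chosen = pick_new_master(candidates, preferred_region="region-a")  # picks s1
```

When no region is designated, the choice falls back to the replication offset alone, so the manual switch changes behaviour only while it is set.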
If this approach is considered viable, I can implement it incrementally — starting with the lease mechanism for kvrocks master write operations, made configurable.
Solution
- kvrocks master gates write operations on a lease; no lease means writes are rejected (configurable). The controller probe must carry: current master ID, cluster version, and lease duration. When kvrocks receives a probe, and the master ID and version match, it extends the lease.
- Shorten the controller session TTL to detect network partition events more quickly.
- Allow manual designation of the active partition; the controller provides a one-command switch to move the kvrocks master to the specified partition, for use when a partition failure is already known.
- kvrocks master validates its master status via an API call on each write (configurable); the controller exposes an API that returns who the current master is.
Are you willing to submit a PR?
- I'm willing to submit a PR!