Document that locks aren't really locks #11457
As a part of our Jepsen testing, we've demonstrated that etcd's locks aren't actually locks: like all "distributed locks", they cannot guarantee mutual exclusion when processes are allowed to run slow or fast, crash, have their messages delayed, or have unstable clocks. For example, this workload uses etcd's locks to protect read-modify-write updates to shared state.
This is directly adapted from the etcd 3.2 announcement, which demonstrates using `etcdctl lock` to guard concurrent increments of a counter stored in a file:
```
$ echo 0 >f
$ for i in `seq 1 100`; do
    ETCDCTL_API=3 etcdctl lock mylock -- bash -c 'expr 1 + $(cat f) > f' &
    pids="$pids $!"
  done
$ wait $pids
$ cat f
```
Instead of updating a file on disk, our workload adds unique integers to a set stored in etcd.
We use the same lock acquisition and release strategy as described before: we acquire the lock, perform a read-modify-write against the set, and release the lock once the update completes.
When we partition away leader nodes every 10 seconds or so, this workload loses updates: multiple clients can believe they hold the lock concurrently, and their read-modify-write cycles overwrite one another.
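The failure mode can be sketched without etcd at all. The following is an illustrative simulation (not etcd client code): client A reads while "holding" the lock, is then paused or partitioned long enough for its lease to expire, client B acquires the lock and writes, and A's delayed write clobbers B's update.

```python
# Simulated lost update under a pause/partition. The dict stands in for
# the shared set stored in etcd; the "lock" exists only in the comments,
# which is precisely the problem being demonstrated.

shared = {"set": {1, 2, 3}}

def read_modify_write(value):
    snapshot = set(shared["set"])   # read
    snapshot.add(value)             # modify
    shared["set"] = snapshot        # write (blind overwrite)

# Client A reads while believing it holds the lock...
a_snapshot = set(shared["set"])
a_snapshot.add(4)

# ...then stalls; its lease expires and client B acquires the lock.
read_modify_write(5)                # B's update: {1, 2, 3, 5}

# A resumes, unaware its lock was lost, and writes its stale snapshot.
shared["set"] = a_snapshot          # A's write: {1, 2, 3, 4}; 5 is lost

print(sorted(shared["set"]))        # prints [1, 2, 3, 4]
```

No amount of care inside the clients fixes this: by the time A's write lands, nothing in the write itself records which lock epoch it belongs to.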
This problem was exacerbated by #11456, but fundamentally cannot be fixed. Users cannot use etcd as a naive locking system: they must carefully couple a fencing token (e.g. the etcd lock key revision number) to any systems they interact with in order to preserve exclusion boundaries.
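A minimal sketch of what "coupling a fencing token" means, with the token check simulated as a plain integer comparison. In etcd the lock key's revision number could serve as the token; the `FencedStore` class and its method names here are hypothetical, not an etcd API.

```python
# The storage system (not the lock service) enforces exclusion by
# rejecting any write whose fencing token is not strictly newer than
# the highest token it has already observed.

class FencedStore:
    def __init__(self):
        self.data = None
        self.highest_token = -1

    def write(self, token, value):
        if token <= self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data = value

store = FencedStore()
store.write(token=7, value="from holder with revision 7")   # accepted
store.write(token=9, value="from holder with revision 9")   # accepted
try:
    # A delayed write from the old lock holder arrives late...
    store.write(token=7, value="delayed write from old holder")
except PermissionError as e:
    print(e)   # ...and is rejected: stale fencing token 7
```

The key design point is that the check happens in the downstream system: the token travels with every write, so a write from a deposed lock holder is rejected no matter how delayed it is.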
etcd could remove locks altogether, but I don't think that's strictly necessary: it's still useful to have something which is mostly a lock. For example, users could use locks to ensure that most of the time, one node, rather than dozens, is performing a specific computation. Instead, I'd like to suggest changing the documentation to make these risks, and the correct use of locks, explicit. In particular, I think these pages could be revised:
The initial design for MSDHA expected that etcd locks would remain held for the lifetime of the process run under the lock. It also expected the process to be stopped and the lease to be dropped at the same time if the etcd lock was somehow lost. See [here](etcd-io/etcd#11457 (comment)) for more information.

The new design instead relies on detecting that there is no current master, rather than using a lock to hold the master role. MSDHA will (in order):

* Wait MSDHA_TTL*2 seconds
* Acquire the etcd lock
* Determine if a master already exists by now, otherwise stop
* Promote the node
* Set the state to master in etcd
* Run a background job to wait MSDHA_MASTER_TTL seconds before terminating the container

The logic for restarting the etcd change detection was also re-written to prevent losing events between restarts.
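The steps above can be sketched as a short election routine. This is a hedged approximation of the described flow, not MSDHA's actual code: `FakeEtcd`, `become_master_if_needed`, and all method names are hypothetical stand-ins for the real etcd interactions.

```python
import time

class FakeEtcd:
    """Minimal in-memory stand-in for the etcd calls MSDHA would make."""
    def __init__(self):
        self.master = None
    def lock(self, name):
        return self            # usable as a context manager
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False
    def master_exists(self):
        return self.master is not None
    def set_master_state(self, node):
        self.master = node

def become_master_if_needed(node, ttl, etcd):
    time.sleep(ttl * 2)                 # wait MSDHA_TTL*2 seconds
    with etcd.lock("msdha-election"):   # acquire the etcd lock
        if etcd.master_exists():        # a master appeared meanwhile?
            return False                # ...then stop, do not promote
        # promote the node, then record the master state in etcd
        etcd.set_master_state(node)
    return True

etcd = FakeEtcd()
print(become_master_if_needed("node-a", 0, etcd))  # True: promoted
print(become_master_if_needed("node-b", 0, etcd))  # False: master exists
```

Note that this design uses the lock only to serialize the "is there already a master?" check; the master role itself is held by state in etcd plus TTL-based waits, which matches the issue's advice that etcd locks are safe only for best-effort coordination.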