You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In Jepsen testing, we found that updating shared state protected by a lock could result in lost updates or other anomalies. This was due in part to the fact that distributed locks are intrinsically unsafe, but was exacerbated by an issue where a server failed to check whether a lease was valid after blocking for a lock to be released, and simply told the client they held the lock instead.
It seems the root cause is the failure of maintaining the lease via keepalive (not sure why this happened though...) of process 0. We can know this since the explicit release of the lease failed after the acquire operation.
And there is a "bug" inside etcd server side. While waiting for the previous lock holder to release the lock, the client does not check the current lease status. So when the lock operation returns, the lease might be already expired. Then the etcd client will falsely think it still holds the lock.
We can mitigate the issue by checking the lease status before the return of lock operation. However, it wont completely solve the problem. If the response RPC takes long enough, the lease might still expire in between...
Basically, as we discussed, the lock holder have to checked the sequence number (in etcd the revision of the lock key) and the remaining lease timeout before doing anything...
More concretely, we should check if the owner key is still exist here before return
In Jepsen testing, we found that updating shared state protected by a lock could result in lost updates or other anomalies. This was due in part to the fact that distributed locks are intrinsically unsafe, but was exacerbated by an issue where a server failed to check whether a lease was valid after blocking for a lock to be released, and simply told the client they held the lock instead.
As @xiang90 wrote:
This was addressed by #11408; I'm creating this issue as a reference point for later readers.
The text was updated successfully, but these errors were encountered: