Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

garbage collect stale distributed locks #9982

Merged
merged 1 commit into from Jan 30, 2020
Merged

Conversation

aanm
Copy link
Member

@aanm aanm commented Jan 28, 2020

Under special conditions [1], cilium-agent could leak distributed locks
into the KVStore making all cilium-agents to deadlock. To avoid this
cilium-operator will garbage collect stale distributed locks. If a lock
is being held for more than 2x default value of cilium's lock lease TTL,
cilium-operator will delete that lock from the KVStore. On a first sight
this might not be reasonable but cilium-agents only use the distributed
locks to coordinated across each other which one should pick the
security identity for a new set of labels. Even if 2 cilium-agents pick
2 different security identities for the same set of labels, only the
first one will be able to publish that identity into the KVStore. The
2nd agent will fail to publish the security identity that it picked for
the set of labels:

[1]

  1. connect to a etcd member (https://node-A:2379) to create a lock with
    a 25 second lease (the default for cilium)
  2. drop all peer packets, not client packets, to and from that etcd
    member (https://node-A:2380).
  3. try releasing that lock immediately. The unlock will fail with the
    error level=error msg="Unable to unlock kvstore lock"
    error="etcdserver: request timed out".
  4. As soon the unlock fails, resume connectivity dropped in 2), this
    re-connectivity needs to happen before the lease expires, which means
    the Unlock on step 3) needs to fail in less than 25 seconds, this is
    what happen for my case.
  5. since the client resumed connectivity with the server in less than 25
    seconds, the lease didn't expire which means the server still thinks
    the client is still holding the lock but the client assumes the lock was
    unlocked.

Signed-off-by: André Martins andre@cilium.io

Fix #9855
Fix #9046


This change is Reviewable

@aanm aanm added kind/bug This is a bug in the Cilium logic. pending-review priority/insane 🚨 http://cultofthepartyparrot.com/parrots/shipitparrot.gif labels Jan 28, 2020
@aanm aanm requested a review from a team as a code owner January 28, 2020 15:11
@aanm aanm requested a review from a team January 28, 2020 15:11
@aanm aanm requested a review from a team as a code owner January 28, 2020 15:11
@maintainer-s-little-helper
Copy link

Release note label not set, please set the appropriate release note.

4 similar comments
@maintainer-s-little-helper
Copy link

Release note label not set, please set the appropriate release note.

@maintainer-s-little-helper
Copy link

Release note label not set, please set the appropriate release note.

@maintainer-s-little-helper
Copy link

Release note label not set, please set the appropriate release note.

@maintainer-s-little-helper
Copy link

Release note label not set, please set the appropriate release note.

@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.7.0 Jan 28, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.6.6 Jan 28, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.5.12 Jan 28, 2020
@aanm aanm added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jan 28, 2020
@aanm
Copy link
Member Author

aanm commented Jan 28, 2020

test-me-please

@aanm aanm force-pushed the pr/issue-9855-kvstore-lock branch from 03fce22 to 0305d87 Compare January 28, 2020 15:14
@aanm
Copy link
Member Author

aanm commented Jan 28, 2020

test-me-please

@aanm aanm added wip and removed pending-review labels Jan 28, 2020
@aanm aanm force-pushed the pr/issue-9855-kvstore-lock branch 2 times, most recently from a26a239 to 37935f8 Compare January 28, 2020 15:24
@aanm
Copy link
Member Author

aanm commented Jan 28, 2020

test-me-please

@aanm aanm removed the wip label Jan 28, 2020
Copy link
Member Author

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @joestringer

pkg/kvstore/allocator/allocator.go Outdated Show resolved Hide resolved
if err := kvstore.Delete(ctx, key); err != nil {
scopedLog.WithError(err).
Warning("Unable to unlocked distributed lock due client staleness." +
" Please check the connectivity between the KVStore and the client with that lease ID.")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately it's tricky... One needs to access each cilium status on each cilium instance to know which lease ID the cilium-agent is using. I thought about this while developing the PR and the easiest would be to add it into the cilium node structure that we have. What do you think? Also, be aware this needs to be backported to 1.5 and 1.6

pkg/kvstore/allocator/allocator_test.go Show resolved Hide resolved
pkg/kvstore/allocator/allocator_test.go Show resolved Hide resolved
pkg/kvstore/allocator/allocator_test.go Show resolved Hide resolved
pkg/kvstore/allocator/allocator_test.go Outdated Show resolved Hide resolved
pkg/kvstore/allocator/allocator_test.go Show resolved Hide resolved
@aanm
Copy link
Member Author

aanm commented Jan 29, 2020

test-me-please

@aanm aanm added this to the 1.7 milestone Jan 29, 2020
operator/kvstore_watchdog.go Outdated Show resolved Hide resolved
pkg/allocator/allocator.go Show resolved Hide resolved
pkg/kvstore/allocator/allocator.go Outdated Show resolved Hide resolved
operator/kvstore_watchdog.go Outdated Show resolved Hide resolved
pkg/kvstore/allocator/allocator.go Show resolved Hide resolved
Under special conditions [1], cilium-agent could leak distributed locks
into the KVStore making all cilium-agents to deadlock. To avoid this
cilium-operator will garbage collect stale distributed locks. If a lock
is being held for more than 2x default value of cilium's lock lease TTL,
cilium-operator will delete that lock from the KVStore. On a first sight
this might not be reasonable but cilium-agents only use the distributed
locks to coordinated across each other which one should pick the
security identity for a new set of labels. Even if 2 cilium-agents pick
2 different security identities for the same set of labels, only the
first one will be able to publish that identity into the KVStore. The
2nd agent will fail to publish the security identity that it picked for
the set of labels:

[1]
1) connect to a etcd member (https://node-A:2379) to create a lock with
   a 25 second lease (the default for cilium)
2) drop all peer packets, not client packets, to and from that etcd
   member (https://node-A:2380).
3) try releasing that lock immediately. The unlock will fail with the
   error level=error msg="Unable to unlock kvstore lock"
   error="etcdserver: request timed out".
4) As soon the unlock fails, resume connectivity dropped in 2), this
   re-connectivity needs to happen before the lease expires, which means
   the Unlock on step 3) needs to fail in less than 25 seconds, this is
   what happen for my case.
5) since the client resumed connectivity with the server in less than 25
   seconds, the lease didn't expire which means the server still thinks
   the client is still holding the lock but the client assumes the lock was
   unlocked.

Signed-off-by: André Martins <andre@cilium.io>
@aanm
Copy link
Member Author

aanm commented Jan 29, 2020

test-me-please

@aanm
Copy link
Member Author

aanm commented Jan 30, 2020

test-me-please

@aanm aanm merged commit 440539b into master Jan 30, 2020
1.7.0 automation moved this from In progress to Merged Jan 30, 2020
@aanm aanm deleted the pr/issue-9855-kvstore-lock branch January 30, 2020 12:36
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.5 in 1.5.12 Jan 30, 2020
@aanm aanm mentioned this pull request Jan 31, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.6 in 1.6.6 Jan 31, 2020
@aanm aanm mentioned this pull request Jan 31, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.5 to Backport done to v1.5 in 1.5.12 Feb 3, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.6 to Backport done to v1.6 in 1.6.6 Feb 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is a bug in the Cilium logic. priority/insane 🚨 http://cultofthepartyparrot.com/parrots/shipitparrot.gif release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.5.12
Backport done to v1.5
1.6.6
Backport done to v1.6
1.7.0
  
Merged
Development

Successfully merging this pull request may close these issues.

Stale etcd kvstore lock deadlock KVStore lock for identity gets stuck
5 participants