Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/endpoint: wait for security identity on restore #12307

Merged
merged 2 commits into from Jun 27, 2020

Conversation

aanm
Copy link
Member

@aanm aanm commented Jun 26, 2020

If the KVStore connectivity is not reliable during the endpoint restore
process Cilium can end up with an endpoint in a 'restoring' state in
case the ep's security identity resolution fails.
Adding a controller will make sure Cilium will retry to get an identity
for that endpoint until the endpoint is removed or the connectivity
with the allocator is successful.

Signed-off-by: André Martins andre@cilium.io

avoid having endpoints in 'restoring' state in case the connectivity with the KVStore is not reliable

@aanm aanm added release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.6 labels Jun 26, 2020
@aanm aanm requested a review from a team as a code owner June 26, 2020 17:30
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.6 Jun 26, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.1 Jun 26, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.6.10 Jun 26, 2020
@aanm
Copy link
Member Author

aanm commented Jun 26, 2020

test-me-please

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly fine to me, I inspected the callpath up the stack from here and it seems like this is run in a goroutine and completion is notified back through a channel to the main goroutine via restoreComplete but this does not block overall Cilium operation, only some subsequent cleanup steps.

For sanity it probably makes sense to add some documentation higher up in those paths to declare how long we expect that channel could block for so we don't end up introducing dependencies there in future that could cause the whole agent to block up completely when kvstore connectivity is completely blocked (or at least to make sure it integrates properly with the initial kvstore connect timeout of 15m).

My main feedback request here is I don't understand in the case of failure (specifically endpoint deletion) how these controllers are cleaned up.

pkg/endpoint/restore.go Outdated Show resolved Hide resolved
pkg/endpoint/restore.go Outdated Show resolved Hide resolved
pkg/endpoint/restore.go Show resolved Hide resolved
aanm added 2 commits June 26, 2020 19:56
If the KVStore connectivity is not reliable during the endpoint restore
process Cilium can end up with an endpoint in a 'restoring' state in
case the ep's security identity resolution fails.
Adding a controller will make sure Cilium will retry to get an identity
for that endpoint until the endpoint is removed or the connectivity
with the allocator is successful.

Signed-off-by: André Martins <andre@cilium.io>
If the KVStore connectivity is not reliable during the endpoint restore
process Cilium can end up with an endpoint in a 'restoring' state in
case the global security identities sync would fail or time out.
Adding a controller will make sure Cilium will wait until the global
security identities are synced or until the endpoint is removed before
restoring the endpoint.

Signed-off-by: André Martins <andre@cilium.io>
@aanm aanm force-pushed the pr/wait-for-identity-on-restore branch from 5334ec2 to 8a50dcc Compare June 26, 2020 17:57
@aanm aanm requested a review from joestringer June 26, 2020 17:57
@aanm
Copy link
Member Author

aanm commented Jun 26, 2020

test-me-please

@vadorovsky vadorovsky merged commit 54d2175 into cilium:master Jun 27, 2020
@aanm aanm deleted the pr/wait-for-identity-on-restore branch June 27, 2020 18:22
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.1 Jun 30, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.1 Jun 30, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.6 Jun 30, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.6 Jun 30, 2020
@joestringer joestringer moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.1 Jun 30, 2020
@joestringer joestringer added the priority/high This is considered vital to an upcoming release. label Jun 30, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from master in 1.6.10 Jun 30, 2020
@aanm
Copy link
Member Author

aanm commented Jun 30, 2020

Dropping the needs-backport/1.6 label due the lack of the 'context' field in the endpoint structure (introduced in Cilium >=1.7)

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.6 Jun 30, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.6.10 Jul 3, 2020
@aanm
Copy link
Member Author

aanm commented Jul 6, 2020

@tgraf #12307 (comment)

@aanm aanm added this to Needs backport from master in 1.6.11 Jul 6, 2020
@tgraf
Copy link
Member

tgraf commented Jul 8, 2020

@aanm We still need to find a way to fix this in 1.6

@joestringer joestringer removed this from Needs backport from master in 1.6.11 Aug 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/high This is considered vital to an upcoming release. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.6.10
Needs backport from master
1.7.6
Backport done to v1.7
1.8.1
Backport done to v1.8
Development

Successfully merging this pull request may close these issues.

None yet

5 participants