pkg/endpoint: wait for security identity on restore #12307
test-me-please
Looks mostly fine to me. I inspected the call path up the stack from here, and it seems this runs in a goroutine, with completion notified back to the main goroutine through a channel via restoreComplete.
That channel does not block overall Cilium operation, only some subsequent cleanup steps.
For sanity, it probably makes sense to add some documentation higher up in those paths stating how long we expect that channel could block. That way we don't introduce dependencies there in the future that could block the whole agent when kvstore connectivity is completely down (or at least we make sure it integrates properly with the initial kvstore connect timeout of 15m).
My main feedback request: I don't understand how these controllers are cleaned up in the failure case (specifically endpoint deletion).
If the KVStore connectivity is not reliable during the endpoint restore process Cilium can end up with an endpoint in a 'restoring' state in case the ep's security identity resolution fails. Adding a controller will make sure Cilium will retry to get an identity for that endpoint until the endpoint is removed or the connectivity with the allocator is successful. Signed-off-by: André Martins <andre@cilium.io>
If the KVStore connectivity is not reliable during the endpoint restore process Cilium can end up with an endpoint in a 'restoring' state in case the global security identities sync would fail or time out. Adding a controller will make sure Cilium will wait until the global security identities are synced or until the endpoint is removed before restoring the endpoint. Signed-off-by: André Martins <andre@cilium.io>
5334ec2 to 8a50dcc
test-me-please
Dropping the |
@aanm We still need to find a way to fix this in 1.6
If the KVStore connectivity is not reliable during the endpoint restore
process Cilium can end up with an endpoint in a 'restoring' state in
case the ep's security identity resolution fails.
Adding a controller will make sure Cilium will retry to get an identity
for that endpoint until the endpoint is removed or the connectivity
with the allocator is successful.
Signed-off-by: André Martins <andre@cilium.io>