[v1.9] daemon: Avoid blocking datapath on node discovery #14629
Merged
aanm
merged 1 commit into
cilium:v1.9
from
aanm:pr/forwardport/pr/christarazi/v1.8-fix-kvstore-datapath-blocking
Jan 19, 2021
Previously, the datapath relied on node discovery completing. With a kvstore configured, this meant that node registration would also need to complete first.

If the kvstore is deployed as pods intended to be managed by Cilium, then this creates a chicken and egg problem. The kvstore pod cannot come online because the datapath is blocked. The datapath is blocked because the agent cannot register the node into the kvstore. And finally, the node cannot be registered into kvstore because kvstore is not online.

Here's how the issue manifested itself:

```
$ kubectl -n cilium describe pods etcd-operator-59cf4cfb7c-288qx
Name:         etcd-operator-59cf4cfb7c-288qx
Namespace:    cilium
Priority:     0
Node:         gke-chris-form3-cluster-default-pool-3f30c3b5-zd0f/10.138.0.10
Start Time:   Fri, 15 Jan 2021 14:56:24 -0800
Labels:       io.cilium/app=etcd-operator
              pod-template-hash=59cf4cfb7c
...
Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               83s   default-scheduler  Successfully assigned cilium/etcd-operator-59cf4
  Warning  FailedCreatePodSandBox  22s   kubelet            Failed create pod sandbox: rpc error: code = Unk
5bf622367" network for pod "etcd-operator-59cf4cfb7c-288qx": networkPlugin cni failed to set up pod "etcd-op
ent client after 30.000000 seconds timeout: Get "http:///var/run/cilium/cilium.sock/v1/config": dial unix /v
Is the agent running?
  Normal   SandboxChanged          21s   kubelet            Pod sandbox changed, it will be killed and re-created.
```

To break the chicken and egg problem with the datapath, we need to split the "node discovery" (`(*NodeDiscovery).StartDiscovery()`) into two separate phases: (1) local node state is initialized and (2) node registration which includes syncing to the kvstore if that's configured.

The `(*NodeDiscovery).Registered` channel will close when (2) completes. The new channel `(*NodeDiscovery).LocalStateInitialized` will close when (1) completes. This is because the datapath only relies on (1) being complete and doesn't depend on (2).

This will unblock the datapath from the implicit dependency on the kvstore and allow the kvstore pods to come online (Cilium generates endpoints for them), while node registration into the kvstore continues on in the background.

Fixes: 43997f5 ("loader: Wait for node configuration to generate datapath")
Related: 7045103 ("daemon: Move KVStore initialization earlier")

Co-authored-by: André Martins <andre@cilium.io>
Co-authored-by: Joe Stringer <joe@cilium.io>
Co-authored-by: Paul Chaignon <paul@cilium.io>
Co-authored-by: Kornilios Kourtis <kornilios@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
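For readers unfamiliar with the pattern, the two-phase split described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual Cilium code: only the channel names `LocalStateInitialized` and `Registered` come from the commit message, while the `nodeDiscovery` type, `startDiscovery`, the `register` callback, and `isClosed` are hypothetical stand-ins.

```go
package main

import "fmt"

// nodeDiscovery is a hypothetical stand-in for Cilium's NodeDiscovery.
// Only the two channel names are taken from the commit message.
type nodeDiscovery struct {
	LocalStateInitialized chan struct{} // closed when phase (1) completes
	Registered            chan struct{} // closed when phase (2) completes
}

func newNodeDiscovery() *nodeDiscovery {
	return &nodeDiscovery{
		LocalStateInitialized: make(chan struct{}),
		Registered:            make(chan struct{}),
	}
}

// startDiscovery finishes phase (1) right away and runs phase (2) in the
// background. register is an assumed callback that stands in for syncing
// the node into the kvstore; it may block for a long time.
func (n *nodeDiscovery) startDiscovery(register func()) {
	// Phase (1): local node state is initialized (details elided).
	close(n.LocalStateInitialized)

	// Phase (2): node registration, which no longer holds up phase (1).
	go func() {
		register()
		close(n.Registered)
	}()
}

// isClosed reports whether ch has been closed, without blocking.
func isClosed(ch <-chan struct{}) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

func main() {
	// kvstoreOnline simulates the kvstore pods becoming reachable.
	kvstoreOnline := make(chan struct{})

	n := newNodeDiscovery()
	n.startDiscovery(func() { <-kvstoreOnline })

	// The datapath waits only on phase (1).
	<-n.LocalStateInitialized
	fmt.Println("datapath ready; node registered:", isClosed(n.Registered))

	// With the datapath up, the kvstore pods can start and registration
	// can finish in the background.
	close(kvstoreOnline)
	<-n.Registered
	fmt.Println("node registered")
}
```

In this sketch the datapath consumer proceeds while `Registered` is still open, which is the property that breaks the circular dependency: the kvstore only comes online after the datapath wait has already returned.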
test-backport-1.9
test-backport-1.9
test-missed-k8s
f481727
to
aab60fb
Compare
I've dropped the last commit from this PR, which was incrementing the timeout for the k8s jobs. CI had already passed, so there's no need to re-run it.
christarazi
approved these changes
Jan 18, 2021
👍
pchaigno
approved these changes
Jan 18, 2021
Labels
area/daemon
Impacts operation of the Cilium daemon.
kind/backports
This PR provides functionality previously merged into master.
kind/bug
This is a bug in the Cilium logic.
release-note/misc
This PR makes changes that have no direct user impact.
sig/kvstore
Impacts the KVStore package interactions.