
[v1.9] daemon: Avoid blocking datapath on node discovery #14629

Conversation

aanm
Member

@aanm aanm commented Jan 18, 2021

Previously, the datapath relied on node discovery completing. With a
kvstore configured, this meant that node registration would also need to
complete first.

If the kvstore is deployed as pods intended to be managed by Cilium,
then this creates a chicken-and-egg problem. The kvstore pod cannot come
online because the datapath is blocked. The datapath is blocked because
the agent cannot register the node into the kvstore. And finally, the
node cannot be registered into the kvstore because the kvstore is not online.

Here's how the issue manifested itself:

```
$ kubectl -n cilium describe pods etcd-operator-59cf4cfb7c-288qx
Name:           etcd-operator-59cf4cfb7c-288qx
Namespace:      cilium
Priority:       0
Node:           gke-chris-form3-cluster-default-pool-3f30c3b5-zd0f/10.138.0.10
Start Time:     Fri, 15 Jan 2021 14:56:24 -0800
Labels:         io.cilium/app=etcd-operator
                pod-template-hash=59cf4cfb7c
...
Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               83s   default-scheduler  Successfully assigned cilium/etcd-operator-59cf4
  Warning  FailedCreatePodSandBox  22s   kubelet            Failed create pod sandbox: rpc error: code = Unk
5bf622367" network for pod "etcd-operator-59cf4cfb7c-288qx": networkPlugin cni failed to set up pod "etcd-op
ent client after 30.000000 seconds timeout: Get "http:///var/run/cilium/cilium.sock/v1/config": dial unix /v
Is the agent running?
  Normal  SandboxChanged  21s  kubelet  Pod sandbox changed, it will be killed and re-created.
```

To break the chicken-and-egg problem with the datapath, we need to split
the "node discovery" (`(*NodeDiscovery).StartDiscovery()`) into two
separate phases: (1) the local node state is initialized, and (2) the
node is registered, which includes syncing to the kvstore if one is
configured. The `(*NodeDiscovery).Registered` channel will close when
(2) completes. The new channel `(*NodeDiscovery).LocalStateInitialized`
will close when (1) completes. The datapath only relies on (1) being
complete and does not depend on (2).
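
To make the split concrete, here is a minimal Go sketch, not the actual Cilium implementation: only the `Registered` and `LocalStateInitialized` channel names come from this change, while the constructor, the print statements, and the sleep standing in for the kvstore wait are illustrative.

```go
// Minimal sketch of the two-phase split; everything except the two channel
// names is a simplified stand-in for the real NodeDiscovery logic.
package main

import (
	"fmt"
	"time"
)

type NodeDiscovery struct {
	LocalStateInitialized chan struct{} // closed when (1) local node state is initialized
	Registered            chan struct{} // closed when (2) node registration completes
}

func NewNodeDiscovery() *NodeDiscovery {
	return &NodeDiscovery{
		LocalStateInitialized: make(chan struct{}),
		Registered:            make(chan struct{}),
	}
}

func (n *NodeDiscovery) StartDiscovery() {
	// Phase (1): initialize the local node state. This is the only part the
	// datapath needs to wait for.
	fmt.Println("local node state initialized")
	close(n.LocalStateInitialized)

	// Phase (2): register the node, including the kvstore sync, in the
	// background so callers are never blocked on the kvstore being reachable.
	go func() {
		time.Sleep(2 * time.Second) // stand-in for waiting until the kvstore is up
		fmt.Println("node registered in the kvstore")
		close(n.Registered)
	}()
}
```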

This will unblock the datapath from the implicit dependency on the
kvstore and allow the kvstore pods to come online (Cilium generates
endpoints for them), while node registration into the kvstore continues
in the background.
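
Continuing the illustrative sketch above, the datapath-facing caller would then wait only on `LocalStateInitialized`; the `main` function below is a hypothetical consumer, not code from this PR.

```go
// Hypothetical consumer of the sketch above: the datapath path waits only on
// LocalStateInitialized, while registration completes whenever the kvstore
// becomes reachable.
func main() {
	nd := NewNodeDiscovery()
	nd.StartDiscovery()

	// Datapath/loader side: unblocked as soon as phase (1) is done, so
	// endpoints (including the kvstore pods themselves) can be generated.
	<-nd.LocalStateInitialized
	fmt.Println("datapath configuration can proceed")

	// Registration finishes in the background; nothing on the datapath
	// critical path waits for this.
	<-nd.Registered
	fmt.Println("background registration done")
}
```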

Fixes: 43997f5 ("loader: Wait for node configuration to generate datapath")
Related: 7045103 ("daemon: Move KVStore initialization earlier")

Co-authored-by: André Martins <andre@cilium.io>
Co-authored-by: Joe Stringer <joe@cilium.io>
Co-authored-by: Paul Chaignon <paul@cilium.io>
Co-authored-by: Kornilios Kourtis <kornilios@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>

@aanm aanm added the kind/bug, area/daemon, kind/backports, kind/regression, release-note/misc, backport/1.9, and sig/kvstore labels Jan 18, 2021
@aanm aanm requested a review from a team as a code owner January 18, 2021 09:36
@aanm aanm changed the title daemon: Avoid blocking datapath on node discovery [v1.9] daemon: Avoid blocking datapath on node discovery Jan 18, 2021
@aanm
Member Author

aanm commented Jan 18, 2021

test-backport-1.9

@aanm
Member Author

aanm commented Jan 18, 2021

test-backport-1.9

@nebril
Member

nebril commented Jan 18, 2021

test-missed-k8s

@aanm aanm force-pushed the pr/forwardport/pr/christarazi/v1.8-fix-kvstore-datapath-blocking branch from f481727 to aab60fb Compare January 18, 2021 16:57
@aanm
Member Author

aanm commented Jan 18, 2021

I've dropped the last commit from this PR, which was incrementing the timeout for the k8s jobs. CI had already passed, so there's no need to re-run it.

Member

@christarazi christarazi left a comment

👍

@joestringer joestringer added the priority/release-blocker label and removed the kind/regression label Jan 18, 2021