Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNI: keep cni config on shutdown; taint instead, queue deletions #23486

Merged
merged 5 commits into from Mar 22, 2023

Conversation

squeed
Copy link
Contributor

@squeed squeed commented Jan 31, 2023

This change is made up of several commits, which build on each other. It fixes a wart with how CNI, Containerd, and Dockershim interact, and prevents some possible outage scenarios.

Ultimately, I'd like to make much of this less necessary with an improved CNI STATUS verb, but that will take some time to roll out across the ecosystem.

Asynchronous CNI delete

The CNI plugin now queues DEL for later submission, rather than failing. It tries to connect to the agent socket, and if that fails, it will write a file in /run/cilium with the pod's details. When the agent starts up, it will read that queue and process all pending deletes.

This fixes #22067, where an end-user found themselves with an unrecoverable cluster. Cilium got descheduled but their node was at its pod limit. They couldn't start any pods, and they couldn't delete any pods without Cilium running. Not good.

Taint the node when Cilium is shut down

On nodes where Cilium is scheduled, if it is not running, taint the node with NoSchedule. This will prevent new pods from being scheduled there. This has two goals:

  1. Minimize ignorable errors from pods failing to start because Cilium can't handle a CNI ADD
  2. Minimize pods being started with a non-Cilium network provider

Preserve CNI configuration

Lastly, we no longer delete the CNI configuration file when stopping the agent. This has several important effects:

  • the node no longer goes NotReady, so it won't be removed from cloud LB backends just because cilium is being upgraded. This prevents unneeded churn.
  • CNI DEL will still succeed, so pods can always be cleaned up
  • No chance of pods being started with a different CNI provider
The Cilium operator now taints nodes where Cilium is scheduled to run but is not running. 
This prevents pods from being scheduled on nodes without Cilium.
The CNI configuration file is no longer removed on agent shutdown. 
This means that pod deletion will always succeed; previously it would fail if Cilium was down for an upgrade.
This should help prevent nodes accidentally entering an unmanageable state.
It also means that nodes are not removed from cloud LoadBalancer backends during Cilium upgrades.

Fixes: #22067.

@squeed squeed requested review from a team as code owners January 31, 2023 12:48
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 31, 2023
@squeed squeed requested review from a team as code owners January 31, 2023 12:48
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jan 31, 2023
@squeed squeed added dont-merge/needs-release-note release-note/major This PR introduces major new functionality to Cilium. and removed kind/community-contribution This was a contribution made by a community member. labels Jan 31, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 31, 2023
@squeed
Copy link
Contributor Author

squeed commented Jan 31, 2023

/test

Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed:

Click to show.

Test Name

K8sDatapathConfig MonitorAggregation Checks that monitor aggregation restricts notifications

Failure Output

FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.24-kernel-5.4 so I can create one.

@squeed squeed added sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. sig/agent Cilium agent related. and removed dont-merge/needs-release-note labels Jan 31, 2023
@squeed
Copy link
Contributor Author

squeed commented Feb 1, 2023

Test failures are either known flakes, or unlikely to be caused by this:

  • l4lb XDP doesn't use CNI at all
  • TestGetIdentity/Duplicated_identity is a known flake
  • MonitorAggregation didn't get all ICMP events. Again, not something this could do.

So, rebasing and retrying to see if that helps.

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good to me, a few minor comments to address.

If I may share an opinion: I appreciate the PR description because it made reviewing the code much easier, however I'd hate to lose all that context because it isn't somewhere in the commit msg. What do you think of putting it somewhere in the commit(s)? The reason is because when bisecting, it's much easier to read all the context related to a change from the commit msg, rather than think "hmm, I wonder if there's more to this" and have to locate the PR behind it.

What I've seen people do (and what I try to do as well) is paste the "important" commit msgs in the PR description as well, so people sort of have a cover letter to read before diving in, as you did in your PR, while also preserving the history in the commit(s) themselves.

On another note, for the release note: I'd suggest stripping the extra newlines as it's not going to format quite as nicely in the end. Something like this would be easier to read:

This is sentence 1 of the release note.
This is sentence 2 of the same release note.
And so on...

pkg/lock/lockfile/lockfile_linux.go Show resolved Hide resolved
plugins/cilium-cni/main.go Show resolved Hide resolved
operator/watchers/node_taint_test.go Outdated Show resolved Hide resolved
Documentation/installation/taints.rst Outdated Show resolved Hide resolved
pkg/defaults/defaults.go Outdated Show resolved Hide resolved
pkg/lock/lockfile/lockfile_linux.go Outdated Show resolved Hide resolved
Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is not only interesting but alsoo fun to read, love the comment in code 👍.

LGTM ✔️

@squeed
Copy link
Contributor Author

squeed commented Mar 17, 2023

/test

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for extending this PR, but upon a second read, I noticed that the error logs are quite terse. I think we'd benefit from providing more context when there are warnings / errors, because we don't want the user to then think, "OK well I've got this error, but what should I do / what is Cilium going to do about it?".

Minor comments below.

daemon/cmd/deletion_queue.go Outdated Show resolved Hide resolved
daemon/cmd/deletion_queue.go Outdated Show resolved Hide resolved
daemon/cmd/deletion_queue.go Outdated Show resolved Hide resolved
daemon/cmd/deletion_queue.go Outdated Show resolved Hide resolved
daemon/cmd/deletion_queue.go Outdated Show resolved Hide resolved
daemon/cmd/deletion_queue.go Outdated Show resolved Hide resolved
@christarazi christarazi added area/daemon Impacts operation of the Cilium daemon. area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. labels Mar 18, 2023
This adds a simple wrapper around Flock(2). It support read / write
locks, as well as non-blocking operation.

It is currently Linux only, since only Linux has support for
per-file-descriptor locking.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
It is considered bad practice for CNI deletions to fail. This can get
nodes in a fairly broken state, especially if somehow Cilium gets
descheduled -- then we might not be able to get Cilium back on the node!

The CNI plugin now queues DEL for later submission, rather than failing.
It tries to connect to the agent socket, and if that fails, it will write
a file in /run/cilium with the pod's details. When the agent starts up, it will
read that queue and process all pending deletes.

This fixes cilium#22067, where an end-user found themselves with an unrecoverable
cluster. Cilium got descheduled but their node was at its pod limit. They
couldn't start any pods, and they couldn't delete any pods without Cilium
running. Not good.

The behavior is simple enough, but requires some subtlety around file
locking to make sure we don't miss a deletion.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
This means that pods will not be scheduled where cilium is down. This
allows us to stop removing the CNI configuration file whenever we
restart the agent.

On nodes where Cilium is scheduled, if it is not running, taint the node with
NoSchedule. This will prevent new pods from being scheduled there. This has
two goals:

- Minimize ignorable errors from pods failing to start because Cilium can't
  handle a CNI ADD
- Minimize pods being started with a non-Cilium network provider

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
This has cosmetic and usability benefits, since it mostly prevents pods
from being scheduled on nodes where they cannot actually be started. It
is not fool-proof, but it will prevent most ignorable errors about pods
failing to start. This commit enables the auto-taint feature.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Don't delete the CNI configuration file when stopping the agent. This has
several important effects:
- the node no longer goes NotReady, so it won't be removed from cloud LB
  backends just because cilium is being upgraded
- CNI DEL will still succeed, so pods can always be cleaned up
- No chance of pods being started with a different CNI provider

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@squeed
Copy link
Contributor Author

squeed commented Mar 21, 2023

@christarazi updated the logging based on your suggestions, thanks for the review.

@christarazi
Copy link
Member

christarazi commented Mar 21, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Expected

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

@squeed
Copy link
Contributor Author

squeed commented Mar 22, 2023

jenkins job failure is because of a missed tail call!? This PR has nothing to do with BPF...

time="2023-03-21T21:17:36Z" level=debug msg="running local command: kubectl exec -n kube-system cilium-dj8v9 -- cilium metrics list -o json | jq '.[] | select( .name == \"cilium_drop_count_total\" and .labels.reason == \"Missed tail call\" ).value'"
cmd: "kubectl exec -n kube-system cilium-dj8v9 -- cilium metrics list -o json | jq '.[] | select( .name == \"cilium_drop_count_total\" and .labels.reason == \"Missed tail call\" ).value'" exitCode: 0 duration: 168.910676ms stdout:
1

@squeed
Copy link
Contributor Author

squeed commented Mar 22, 2023

/mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next

👍 created #24514

@squeed
Copy link
Contributor Author

squeed commented Mar 22, 2023

/test-1.26-net-next

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 22, 2023
@borkmann borkmann merged commit 98b2967 into cilium:master Mar 22, 2023
42 checks passed
@squeed squeed changed the title CNI: no longer remove cni config on shutdown; taint instead. CNI: keep cni config on shutdown; taint instead, queue deletions Apr 28, 2023
@aojea
Copy link
Contributor

aojea commented Aug 3, 2023

it is sad we don't have a better defined processs for the Pod lifecycle in k8s so we don't have to implement these things :(

drew-viles added a commit to eschercloudai/unikorn that referenced this pull request Sep 5, 2023
* New unikorn release with cilium taints removed from `clusteropenstack` helm application as this is now provided via the cilium operator as of 1.14.0 - cilium/cilium#23486
* Updates a bunch of other packages with a nuew preview application bundle for both clusters and control plane
* Some minor doc corrections
drew-viles added a commit to eschercloudai/unikorn that referenced this pull request Sep 5, 2023
* New unikorn release with a note on deprecating and removing cilium taints from `clusteropenstack` helm application as this is now provided via the cilium operator as of 1.14.0 - cilium/cilium#23486
  We can't do this yet as it would break any apps pre cp-app-bundle 1.2.0
* Updates a bunch of other packages with a nuew preview application bundle for both clusters and control plane
* Some minor doc corrections
drew-viles added a commit to eschercloudai/unikorn that referenced this pull request Sep 5, 2023
* New unikorn release with a note on deprecating and removing cilium taints from `clusteropenstack` helm application as this is now provided via the cilium operator as of 1.14.0 - cilium/cilium#23486
  We can't do this yet as it would break any apps pre kubernetes-cluster-1.4.0
* Updates a bunch of other packages with a nuew preview application bundle for both clusters and control plane
* Some minor doc corrections
drew-viles added a commit to eschercloudai/unikorn that referenced this pull request Sep 5, 2023
* New unikorn release with a note on deprecating and removing cilium taints from `clusteropenstack` helm application as this is now provided via the cilium operator as of 1.14.0 - cilium/cilium#23486
  We can't do this yet as it would break any apps pre kubernetes-cluster-1.4.0
* Updates a bunch of other packages with a nuew preview application bundle for both clusters and control plane
* Some minor doc corrections
drew-viles added a commit to eschercloudai/unikorn that referenced this pull request Sep 6, 2023
* New unikorn release with a note on deprecating and removing cilium taints from `clusteropenstack` helm application as this is now provided via the cilium operator as of 1.14.0 - cilium/cilium#23486
  We can't do this yet as it would break any apps pre kubernetes-cluster-1.4.0
* Updates a bunch of other packages with a nuew preview application bundle for both clusters and control plane
* Some minor doc corrections
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. area/daemon Impacts operation of the Cilium daemon. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/major This PR introduces major new functionality to Cilium. sig/agent Cilium agent related. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CNI plugin should tolerate a down agent.
7 participants