Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

daemon: don't go ready until CNI configuration has been written #32168

Merged
merged 1 commit into from
Apr 25, 2024

Conversation

squeed
Copy link
Contributor

@squeed squeed commented Apr 24, 2024

If the daemon is configured to write a CNI configuration file, we should not go ready until that CNI configuration file has been written. This prevents a race condition where the controller removes the taint from a node too early, meaning pods may be created with a different CNI provider.

In #29405, Cilium was configured in chaining mode, but the "primary" CNI provider hadn't written its configuration yet. This caused the not-ready taint to be removed from the node too early, and pods were created in a bad state.

By hooking in the CNI cell's status in the daemon's Status type, we prevent the daemon's healthz endpoint from returning a successful response until the CNI cell has been successful.

Fixes: #29405

Fixes a bug where Cilium in chained mode removed the `agent-not-ready` taint too early if the primary network is slow in deploying.

@squeed squeed added kind/bug This is a bug in the Cilium logic. area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Apr 24, 2024
@squeed squeed requested review from a team as code owners April 24, 2024 15:05
If the daemon is configured to write a CNI configuration file, we should
not go ready until that CNI configuration file has been written. This
prevents a race condition where the controller removes the taint from a
node too early, meaning pods may be created with a different CNI
provider.

In cilium#29405, Cilium was configured in chaining mode, but the "primary" CNI
provider hadn't written its configuration yet. This caused the not-ready
taint to be removed from the node too early, and pods were created in a
bad state.

By hooking in the CNI cell's status in the daemon's Status type, we
prevent the daemon's healthz endpoint from returning a successful
response until the CNI cell has been successful.

Fixes: cilium#29405

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@squeed squeed added the needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch label Apr 24, 2024
@squeed squeed removed the request for review from a team April 24, 2024 15:09
@squeed
Copy link
Contributor Author

squeed commented Apr 24, 2024

(removed some api changes I realized I didnt' want)

Copy link
Member

@nebril nebril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@squeed
Copy link
Contributor Author

squeed commented Apr 24, 2024

/test

@learnitall learnitall removed their request for review April 24, 2024 19:37
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 25, 2024
@ti-mo ti-mo added this pull request to the merge queue Apr 25, 2024
Merged via the queue into cilium:main with commit 77b1e6c Apr 25, 2024
63 of 64 checks passed
@stensonb
Copy link

Awesome! Will this be backported to 1.14 too?

@squeed
Copy link
Contributor Author

squeed commented Apr 25, 2024

1.15 for sure, I’m not sure about 1.14 yet

@squeed squeed added the needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch label Apr 26, 2024
@squeed
Copy link
Contributor Author

squeed commented Apr 26, 2024

Looks like this applies mostly cleanly to v1.14 too; marked as such.

@gandro gandro mentioned this pull request Apr 29, 2024
18 tasks
@gandro gandro added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Apr 29, 2024
@gandro gandro mentioned this pull request Apr 30, 2024
13 tasks
@gandro gandro added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Apr 30, 2024
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cilium removes agent-not-ready taint too early
5 participants