Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

policy/k8s: Fix bug where policy synchronization event was lost #32028

Merged
merged 2 commits into from
Apr 24, 2024

Conversation

gandro
Copy link
Member

@gandro gandro commented Apr 17, 2024

This PR fixes an issue with the K8s policy ingestor where InitK8sSubsystem unblocked the daemon startup before all initial policies were processed by the ingestor.

This meant that we created unnecessary endpoint regenerations at startup, due to the initial endpoint regeneration being disrupted by the discovery of additional policies (and thus re-triggering endpoint regeneration).

The cause for the race was that we started the K8s policy watcher in the Hive lifecycle start hook, which is too late: The InitK8sSubsystem function called from newDaemon constructor is supposed to block on policy resources being synced. But because we registered those resources (via BlockWaitGroupToSyncResources) only after the newDaemon constructor already ran (we depended on a fully constructed Daemon object), InitK8sSubsystem would continue without the policy resources being synced and thus causing unnecessary endpoint regenerations.

This PR fixes the issue by starting the policy resource watcher earlier, namely directly in newDaemon. This unfortunately means that we have to side-step the Hive infrastructure, as we currently require a partially initialized Daemon object to implement the PolicyManager interface for the time being. Hopefully, in the future we can move the PolicyManager logic out of the Daemon struct and thereby breaking this cyclic dependency.

I have manually tested this, both by checking the logs and by disabling the CNP sync manually (which then blocks the agent before it restores endpoints). Unfortunately, I didn't find a good way to assert this via unit tests.

Fixes: #31865

@gandro gandro added sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. release-note/misc This PR makes changes that have no direct user impact. labels Apr 17, 2024
@gandro gandro requested review from a team as code owners April 17, 2024 12:26
@gandro gandro requested review from squeed and derailed April 17, 2024 12:26
@gandro gandro force-pushed the pr/gandro/k8s-policy-fix-sync-race branch from 0aecad3 to f86fa91 Compare April 17, 2024 13:10
@gandro
Copy link
Member Author

gandro commented Apr 17, 2024

/test

Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gandro Nice work!

pkg/policy/k8s/cell.go Show resolved Hide resolved
The policy watcher does not need to be exported. This commit also
prepares the code for a minor restructuring in a subsequent commit.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit fixes an issue with the K8s policy ingestor where
`InitK8sSubsystem` unblocked the daemon startup before all initial
policies were processed by the ingestor.

This meant that we created unnecessary endpoint regenerations at
startup, due to the initial endpoint regeneration being disrupted by the
discovery of additional policies (and thus re-triggering endpoint
regeneration).

The cause for the race was that we started the K8s policy watcher in the
Hive lifecycle start hook, which is too late: The `InitK8sSubsystem`
function called from newDaemon constructor is supposed to block on
policy resources being synced. But because we registered those resources
(via `BlockWaitGroupToSyncResources`) only after the `newDaemon`
constructor already ran (we depended on a fully constructed `Daemon`
object), `InitK8sSubsystem` would continue without the policy resources
being synced and thus causing unnecessary endpoint regenerations.

This commit fixes the issue by starting the policy resource watcher
earlier, namely directly in `newDaemon`. This unfortunately means that
we have to side-step the Hive infrastructure, as we currently require a
partially initialized `Daemon` object to implement the `PolicyManager`
interface for the time being. Hopefully, in the future we can move the
`PolicyManager` logic out of the `Daemon` struct and thereby breaking
this cyclic dependency.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/k8s-policy-fix-sync-race branch from f86fa91 to d1aa7ae Compare April 22, 2024 08:57
@gandro
Copy link
Member Author

gandro commented Apr 22, 2024

/test

@squeed squeed requested review from a team and dylandreimerink and removed request for a team April 22, 2024 09:17
@squeed
Copy link
Contributor

squeed commented Apr 22, 2024

Tagged sig/foundations here.

Edit: Sorry! somehow missed that @derailed was also foundations. Sorry :-)

Copy link
Member

@dylandreimerink dylandreimerink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like discussed out of band, this isn't a great permanent solution, but will approve for now and work on proper long term fix later

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 24, 2024
@gandro gandro added this pull request to the merge queue Apr 24, 2024
Merged via the queue into cilium:main with commit 3f9c288 Apr 24, 2024
62 checks passed
@gandro gandro deleted the pr/gandro/k8s-policy-fix-sync-race branch April 24, 2024 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: endpoint regeneration proceeds before policy synchronization has completed
4 participants