policy/k8s: Fix bug where policy synchronization event was lost #32028
Conversation
0aecad3 to f86fa91
/test
@gandro Nice work!
The policy watcher does not need to be exported. This commit also prepares the code for a minor restructuring in a subsequent commit.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit fixes an issue with the K8s policy ingestor where `InitK8sSubsystem` unblocked the daemon startup before all initial policies were processed by the ingestor. This meant that we created unnecessary endpoint regenerations at startup, because the initial endpoint regeneration was disrupted by the discovery of additional policies (which re-triggered endpoint regeneration).

The cause of the race was that we started the K8s policy watcher in the Hive lifecycle start hook, which is too late: the `InitK8sSubsystem` function called from the `newDaemon` constructor is supposed to block until the policy resources have been synced. But because we registered those resources (via `BlockWaitGroupToSyncResources`) only after the `newDaemon` constructor had already run (we depended on a fully constructed `Daemon` object), `InitK8sSubsystem` would continue without the policy resources being synced, causing unnecessary endpoint regenerations.

This commit fixes the issue by starting the policy resource watcher earlier, directly in `newDaemon`. Unfortunately, this means we have to side-step the Hive infrastructure, as we currently require a partially initialized `Daemon` object to implement the `PolicyManager` interface for the time being. Hopefully, we can eventually move the `PolicyManager` logic out of the `Daemon` struct and thereby break this cyclic dependency.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
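The ordering bug described above comes down to a general Go pitfall: `sync.WaitGroup.Wait` on a group whose counter is still zero returns immediately, so waiting before the watcher has registered itself provides no synchronization at all. A minimal sketch of both orderings (names and structure are illustrative, not Cilium's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// startDaemon simulates daemon startup. registerBeforeWait controls
// whether the policy watcher registers itself on the WaitGroup before
// the constructor calls Wait (the fixed ordering) or never gets the
// chance to (the buggy ordering). It reports whether the initial
// policies were synced before startup completed.
func startDaemon(registerBeforeWait bool) bool {
	var wg sync.WaitGroup
	synced := false

	watcher := func() {
		synced = true // simulate processing the initial policy list
		wg.Done()
	}

	if registerBeforeWait {
		wg.Add(1)
		go watcher()
	}

	// Wait() on a zero-count WaitGroup returns immediately, so with the
	// buggy ordering startup proceeds before any policy is synced.
	wg.Wait()
	return synced
}

func main() {
	fmt.Println("bug - synced before startup:", startDaemon(false))
	fmt.Println("fix - synced before startup:", startDaemon(true))
}
```

In the fixed ordering, `Wait` only returns after the watcher's `Done`, which is the happens-before guarantee the daemon constructor needs.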
f86fa91
to
d1aa7ae
Compare
/test
Tagged sig/foundations here. Edit: Sorry, somehow missed that @derailed was also foundations :-)
As discussed out of band, this isn't a great permanent solution, but I will approve it for now and work on a proper long-term fix later.
This PR fixes an issue with the K8s policy ingestor where `InitK8sSubsystem` unblocked the daemon startup before all initial policies were processed by the ingestor. This meant that we created unnecessary endpoint regenerations at startup, because the initial endpoint regeneration was disrupted by the discovery of additional policies (which re-triggered endpoint regeneration).

The cause of the race was that we started the K8s policy watcher in the Hive lifecycle start hook, which is too late: the `InitK8sSubsystem` function called from the `newDaemon` constructor is supposed to block until the policy resources have been synced. But because we registered those resources (via `BlockWaitGroupToSyncResources`) only after the `newDaemon` constructor had already run (we depended on a fully constructed `Daemon` object), `InitK8sSubsystem` would continue without the policy resources being synced, causing unnecessary endpoint regenerations.

This PR fixes the issue by starting the policy resource watcher earlier, directly in `newDaemon`. Unfortunately, this means we have to side-step the Hive infrastructure, as we currently require a partially initialized `Daemon` object to implement the `PolicyManager` interface for the time being. Hopefully, we can eventually move the `PolicyManager` logic out of the `Daemon` struct and thereby break this cyclic dependency.

I have manually tested this, both by checking the logs and by disabling the CNP sync manually (which then blocks the agent before it restores endpoints). Unfortunately, I didn't find a good way to assert this via unit tests.
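The "side-step" trade-off above can be sketched as follows: the watcher needs only a narrow interface, so the constructor can hand it a partially initialized daemon and run the initial ingestion synchronously before returning, instead of deferring it to a lifecycle hook. All names here (`PolicyManager`, `policyWatcher`, `newDaemon`) are hypothetical stand-ins, not Cilium's actual API:

```go
package main

import "fmt"

// PolicyManager is the narrow interface the watcher needs; in this
// sketch the Daemon itself implements it (illustrative names only).
type PolicyManager interface {
	PolicyAdd(name string)
}

type Daemon struct {
	policies []string
}

func (d *Daemon) PolicyAdd(name string) {
	d.policies = append(d.policies, name)
}

type policyWatcher struct {
	pm PolicyManager
}

// watchInitial simulates ingesting the initial policy list
// synchronously, before the constructor returns, rather than in a
// later lifecycle start hook.
func (w *policyWatcher) watchInitial() {
	for _, p := range []string{"allow-dns", "deny-all"} {
		w.pm.PolicyAdd(p)
	}
}

// newDaemon starts the watcher directly, side-stepping the lifecycle
// framework, because the watcher needs the (partially built) Daemon.
// By the time newDaemon returns, the initial policies are in place.
func newDaemon() *Daemon {
	d := &Daemon{}
	w := &policyWatcher{pm: d}
	w.watchInitial()
	return d
}

func main() {
	d := newDaemon()
	fmt.Println("policies at startup:", len(d.policies))
}
```

The cyclic dependency (`Daemon` needs the watcher running; the watcher needs `Daemon` as its `PolicyManager`) is what prevents wiring this through dependency injection cleanly; extracting the `PolicyManager` logic into its own component would break the cycle.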
Fixes: #31865