Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: controller watches global configMap in the namespace where it is running and not in managedNamespace - fixes #11463 #11799

Closed
wants to merge 1 commit into from

Conversation

toblich
Copy link

@toblich toblich commented Sep 11, 2023

Fixes #11463

Motivation

There was a bug by which the workflow controller did not pick up changes made to its configMap until its pod was deleted. This only manifestated when the controller is running in a particular namespace (--namespaced) but is managing workflows in a different namespace (--managed-namespace=<some_other_ns>)

Modifications

Fixed the namespace that the controller watches for changes made to its own configMap, so that it watches the same namespace that it initially reads. See the analysis in #11463 for more details.

Verification

I did manual verification by:

  1. Deploying into a cluster using both flags (--namespaced --managed-namespace=<other_ns>)
  2. Making changes to the config map (e.g. to the workflowDefaults)
  3. Reading the controller logs
  4. Starting a workflow to validate whether the changed made to the configmap had been picked up or not

I also ran all tests, but given that this bug is related to configmaps and namespaces in a k8s cluster, and I don't see a clear way of simulating that in workflow/controller/controller_test.go, I haven't added any new tests, but I'm obviously open to do it if there's any example of how to mock the namespaces and configmaps to actually test the watcher (which, as far as I can tell, relies on k8s control plane actually being there and is thus complicated to test without a real k8s cluster).

… running and not in managedNamespace

Signed-off-by: toblich <toblich@users.noreply.github.com>
Copy link
Member

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix (and all the details here and in the issue too)! As commented in the issue, this may also potentially fix some related issues as well, if this were their root cause.

This is a one-line fix, so I'm ok approving as is with manual verification.

I did verify the codepaths myself that this is logically correct:

  1. config controller creation uses namespace and not managedNamespace
  2. UpdateConfig calls Get on the config controller, which uses namespace.
  3. Both of those explain why it still works on initialization, but fails to watch correctly

So WatchFunc here is the odd one out. And ofc, as stated in PR and issue, the workflow-controller-configmap should be in the same namespace as the Workflow Controller.

Copy link
Member

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I do see a potential problem that I suspected in the issue...

On line 422 below in this same function, wfc.notifySemaphoreConfigUpdate is called for every ConfigMap in the namespace. Semaphore configs are indeed in the managedNamespace.

This may require two separate watchers if there isn't a separate watcher on semaphores already

@agilgur5 agilgur5 self-requested a review September 11, 2023 23:10
@agilgur5
Copy link
Member

So there is a separate watcher. The configMapInformer is created on the managedNamespace. But that seems to be used almost exclusively for parameters from ConfigMaps (and GC cache??), if I'm understanding the code correctly.

That informer should probably be extended to semaphores then if I'm understanding correctly.
(Vaguely related, but the retryWatcher should probably be converted to an informer too per #11657 (comment)).

Would definitely be good to get someone more familiar with the ConfigMap codepaths. @sarabala1979 wrote some of this code originally in #4421, which is why I requested his review.

@juliev0
Copy link
Contributor

juliev0 commented Sep 29, 2023

thanks @toblich for the fix! I just reviewed the PR referenced above , which looks like it could get merged soon and takes care of the issue you found as well, so in that case this whole method would go away.

@agilgur5
Copy link
Member

@juliev0 not quite, that PR doesn't actually resolve this issue at all. seems like the author there did not interpret the bug here correctly. See my comment there: #11855 (comment)

@juliev0
Copy link
Contributor

juliev0 commented Sep 30, 2023

@juliev0 not quite, that PR doesn't actually resolve this issue at all. seems like the author there did not interpret the bug here correctly. See my comment there: #11855 (comment)

yes, you're totally right. Not sure how I missed that :|

Copy link
Contributor

@juliev0 juliev0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, holding off on approval until maybe we can get some feedback on Anton's questions

@agilgur5
Copy link
Member

agilgur5 commented Oct 2, 2023

After reviewing #11855 in-depth, I am pretty sure that wfc.notifySemaphoreConfigUpdate is indeed problematic here and needs moving (it also needs moving in #11855's current state due to the FilterFunc).

As far as I understand, that function is responsible for requeueing if a semaphore's ConfigMap value increases in size (say you had a limit of 5 and now allow 10, for example).
To retain that functionality, that function call must continue to be called on the managed namespace, otherwise this would fix one bug and create a new regression.

@juliev0
Copy link
Contributor

juliev0 commented Oct 2, 2023

@agilgur5

for parameters from ConfigMaps

Yeah, I see what you're saying. runConfigMapWatcher() has been handling both the WorkflowControllerConfigmap and these ConfigMaps. It would make sense that the latter live on the managed namespace, same as the Workflows that they're used by.

@juliev0
Copy link
Contributor

juliev0 commented Oct 2, 2023

created on the managedNamespace

Are you thinking that instead of using a RetryWatcher for the managed namespace, we should take advantage of the informer's UpdateFunc since otherwise we're creating an unnecessary Watch?

@agilgur5
Copy link
Member

agilgur5 commented Oct 2, 2023

we should take advantage of the informer's UpdateFunc since otherwise we're creating an unnecessary Watch?

ostensibly, it would be a good idea to re-use an existing Informer such as the configMapInformer. that specific line/event handler wouldn't fit though, as it's filtered on executor plugins (and only runs if there are executor plugins defined). A new event handler outside of that conditional can be added to work around that though.

I was also concerned in #11855 (comment) that the configMapInformer's label selector may not apply for semaphore configmaps. If that is accurate, we would either need a third ConfigMap Informer (the two in #11855 plus another one) or to modify that one to handle both scenarios (e.g. move the label selector into the FilterFunc -- but that could change the indices, so that usage would still need to be further checked).

I can tackle that if our two contributors here aren't able to pick it up.

@juliev0
Copy link
Contributor

juliev0 commented Oct 2, 2023

we should take advantage of the informer's UpdateFunc since otherwise we're creating an unnecessary Watch?

ostensibly, it would be a good idea to re-use an existing Informer such as the configMapInformer. that specific line/event handler wouldn't fit though, as it's filtered on executor plugins (and only runs if there are executor plugins defined). A new event handler outside of that conditional can be added to work around that though.

I was also concerned in #11855 (comment) that the configMapInformer's label selector may not apply for semaphore configmaps. If that is accurate, we would either need a third ConfigMap Informer (the two in #11855 plus another one) or to modify that one to handle both scenarios (e.g. move the label selector into the FilterFunc -- but that could change the indices, so that usage would still need to be further checked).

I can tackle that if our two contributors here aren't able to pick it up.

Nice deep diving into all of this. Yes, you seem to have become an expert on this.

@toblich
Copy link
Author

toblich commented Nov 13, 2023

Closing this PR in favor of #11855, which is doing a much more comprehensive fix and also taking care of #11463

@toblich toblich closed this Nov 13, 2023
@toblich toblich deleted the pr branch November 13, 2023 14:09
@agilgur5 agilgur5 added the solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) label Feb 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate)
Projects
None yet
3 participants