Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(appset): Fix infinite loop caused by not ignoring statuses updates #19676

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

joliveirinha
Copy link

@joliveirinha joliveirinha commented Aug 24, 2024

This commit fixes a bug where the appset controller updating the status of the resource itself, was causing it be notified for a new reconcile loop.
This happened, because, altough the individual resource (app) statuses were not changing, the order of the list was, as such, resulting in an updated appset resource.

The fix, adds a GenerationPredicate to the Watch on the appSet to guarantee that if no actual change to the AppSet change was done, then it can be ignored.

Fixes [#19675]

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

@joliveirinha joliveirinha requested a review from a team as a code owner August 24, 2024 16:01
Copy link

bunnyshell bot commented Aug 24, 2024

🔴 Preview Environment stopped on Bunnyshell

See: Environment Details | Pipeline Logs

Available commands (reply to this comment):

  • 🔵 /bns:start to start the environment
  • 🚀 /bns:deploy to redeploy the environment
  • /bns:delete to remove the environment

Copy link

bunnyshell bot commented Aug 24, 2024

✅ Preview Environment created on Bunnyshell but will not be auto-deployed

See: Environment Details

Available commands (reply to this comment):

  • 🚀 /bns:deploy to deploy the environment

This commit fixes a bug where the appset controller updating the status
of the resource itself, was causing it be notified for a new reconcile
loop.
This happened, because, altough the individual resource (app) statuses
were not changing, the order of the list was, as such, resulting in an
updated appset resource.

The fix, adds a GenerationPredicate to the Watch on the appSet to
guarantee that if no actual change to the AppSet change was done, then
it can be ignored.

Signed-off-by: João Oliveirinha <joliveirinha@gmail.com>
@joliveirinha joliveirinha force-pushed the joliveirinha/fix-reconcile-loop-appset branch from a1c0196 to 70c5e4b Compare August 24, 2024 16:07
Copy link

codecov bot commented Aug 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (master@e612199). Learn more about missing BASE report.
Report is 54 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #19676   +/-   ##
=========================================
  Coverage          ?   55.91%           
=========================================
  Files             ?      316           
  Lines             ?    43794           
  Branches          ?        0           
=========================================
  Hits              ?    24487           
  Misses            ?    16762           
  Partials          ?     2545           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@reggie-k
Copy link
Member

reggie-k commented Aug 25, 2024

Thank you for the detailed description and the PR, @joliveirinha.
Could you add tests for this case?

Copy link
Member

@gdsoumya gdsoumya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change looks good, a test would ensure we don't regress

Copy link
Contributor

@todaywasawesome todaywasawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests and it looks 🔥

@joliveirinha
Copy link
Author

Sorry for the late response. Only had time to look at this now.

It doesn't seem straight forward to implement a regression test for this.
AFAIK, the tests that exist use fake client and there doesnt seem possible to test the reconcile loop with it.

It seems the only possibility is to use envtest instead.

I maybe missing something since this is my first time looking at this.

@jetersen
Copy link
Contributor

jetersen commented Sep 3, 2024

Another option is to ensure order of the status.resources which I think is the real cause of this issue. See #19757

@joliveirinha
Copy link
Author

joliveirinha commented Sep 3, 2024

Another option is to ensure order of the status.resources which I think is the real cause of this issue. See #19757

Hmm I have mixed feelings about that.

  1. altought what you are saying is indeed true in the exact scenario I reported, it feels that "fixing" it that way is not the ideal fix, mainly because:
  2. we shouldn't care at all for the reason and contents that changed on the resource status from the perspective of being notified by the controller-runtime due to those changes. Even if the status changed, it should be in the implementation to decide what is best requeue time for the next iteration. Most probably, a couple of seconds.

For 2) the current fix here proposed, gives all the flexibility for the implementation to decide what is the best requeue time. If we go via the direction you are suggesting, then a status update that changes the actual status content, will always trigger right away the notification from the runtime, which means you have less flexibility.

You may argue, that it will only happen once because in the next iteration the status is probably the same, still....

To summarize, what I suggest, since the fix is small and relatively obvious, is for us to merge this PR and eventually when the project has integrations tests with the envtest, then it can cover this predicate and the rest that already exist to validate the runtime scenarios.

Also, because this PR most probably also fix other related issues related to the progressive sync.

@jetersen
Copy link
Contributor

jetersen commented Sep 4, 2024

But if you are using SCM Provider and someone creates a new repo, then it should reconcile if a new application is added?

@joliveirinha
Copy link
Author

But if you are using SCM Provider and someone creates a new repo, then it should reconcile if a new application is added?

Yes. The fix doesn't change that. normal requeueing based on the configured requeue time will still happen.

Copy link
Member

@agaudreault agaudreault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many cases where the behavior of the appset controller may depend on status updates adding back the object to the queue. One that comes to my mind is the progressive delivery. There will be less update events, so less calls to shouldRequeueApplicationSet. What are the implication of that? hard to know.

These kinds of scenario are usually coverer by the E2E test suite.

Copy link
Collaborator

@alexmt alexmt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @agaudreault . Argo CD application controller logic relies on reconciliation after changing the status. It is not ideal, but I'm afraid the AppicationSet controller is the same. We might open a can of worms by disabling reconciliation on status change.

I would implement this suggestion instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants