Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

summarize: Consider obj status condition in result #703

Merged
merged 1 commit into from
May 3, 2022

Conversation

darkowlzz
Copy link
Contributor

@darkowlzz darkowlzz commented Apr 29, 2022

SummarizeAndPatch() should also consider the object's status conditions
when computing and returning the runtime results to avoid any
inconsistency in the runtime result and status condition of the object.
When an object's Ready conditions are False, the reconciler should retry
unless it's in stalled state.

Detailed description

Summarize and patch takes the abstracted result and error, and computes the runtime result from it. Separately, it takes the object with status conditions and calculates the status summary and patches the object status.
This can result in a situation where the (sub)reconcilers return a successful result without any error, resulting in a success runtime result. But a negative status condition on the object set by one of the sub-reconcilers would result in the Ready status condition to be False. As a result of this, an object would have Ready=False status and the reconciler wouldn't be attempting any retry to reconcile it. We expect this behavior to only happen when the status condition Stalled=True. Unless Ready=True, the reconciler should always retry.

To make the model consistent, the summarize and patch process can take into consideration the status condition summary for Ready condition.

Runtime result behavior with respect to Ready condition:

  • If Ready=True, but the abstracted result is ResultRequeue and nil error, the runtime result should have Requeue: true.

  • If Ready=True, but the abstracted result is ResultSuccess and nil error, the runtime result should have RequeueAfter: <interval>.

  • If Ready=True, but the reconciler error is not nil, a runtime error is returned, resulting in runtime requeuing the object for reconciliation. Since an error in itself can't be used to set the Ready=False with reason and message, we can't change the status condition to reflect this failure. The retry will happen in the background without anything visible in the object status.

  • [New behavior] If Ready=False, but the abstracted result is Success and reconciler error is nil, the reconciler error can be changed to reflect the reason for Ready=False by setting the error to be the Ready condition message. This will result in the object to have Ready=False and the reconciler performing a retry.

@darkowlzz darkowlzz added the enhancement New feature or request label Apr 29, 2022
@darkowlzz darkowlzz force-pushed the target-condition-in-result branch 2 times, most recently from 80a0261 to f7dd60a Compare April 29, 2022 16:59
afterFunc: func(t *WithT, obj conditions.Setter, patchOpts *patch.HelperOptions) {
t.Expect(patchOpts.IncludeStatusObservedGeneration).To(BeFalse())
},
},
Copy link
Contributor Author

@darkowlzz darkowlzz Apr 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate test case.

SummarizeAndPatch() should also consider the object's status conditions
when computing and returning the runtime results to avoid any
inconsistency in the runtime result and status condition of the object.
When an object's Ready condition is False, the reconciler should retry
unless it's in stalled condition.

Signed-off-by: Sunny <darkowlzz@protonmail.com>
Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @darkowlzz 🏅

Copy link
Member

@pjbgf pjbgf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good stuff @darkowlzz

@stefanprodan stefanprodan merged commit 4e30a4f into main May 3, 2022
@stefanprodan stefanprodan deleted the target-condition-in-result branch May 3, 2022 10:26
@pjbgf pjbgf added this to the GA milestone May 4, 2022
starlingx-github pushed a commit to starlingx/ansible-playbooks that referenced this pull request Dec 7, 2022
There are several FluxCD issues which were attempted to be fixed by
implementing recovery logic into app framework. A new issue was
observed that may be hit more often because of the recovery logic.

Flux won't attempt to finish updating a HelmRelease resource to
Ready=True but only display a message containing this signature:
'''the object has been modified...'''. The resource must be correctly
updated by Flux.
Aside from this the recovery logic was put in place for HelmChart
resources that are in Ready=False state, but this should have been
handled by Flux.

I believe the 2 issues described above map to [1] and [2].
To pull [1] and [2] updated FluxCD release to latest available [3].
Release is composed of manifests and container image used.
Update containers used: helm-controller from v0.15.0 to v0.27.0,
source-controller from v0.20.1 to v0.32.1.
Update flux crds, rbac, deployment.
Changelog claims we get k8s 1.25 support part of the upversion.

Disclaimer for tests:
1) Recovery logic concerned with flipping spec.suspend was removed.
Introduced by ongoing [4]
2) optimization by flipping spec.suspend was removed
Introduced by ongoing[4]
3) cert-manager, nginx-ingress-controller, platform-integ-apps had the
reconciliation interval decreased to 1m to allow Flux to manage the
resources by itself in a reasonable time interval.
There will be future commits per app updating reconciliation interval.

Tests on AIO-SX:
PASS: bootstrap
PASS: unlocked enabled available
PASS: apps applied
PASS: tested on k8s 1.24.4 and 1.21.8, inherit that versions between
      should be OK
PASS: delete pods and see they are recreated and no errors
PASS: inspect flux pod logs for errors
PASS: re-test known trigger for 1996747 and 1995748

[1]: fluxcd/source-controller#703
[2]: fluxcd/source-controller#202
[3]: https://github.com/fluxcd/flux2/releases/tag/v0.37.0
[4]: https://review.opendev.org/c/starlingx/config/+/866862
Related-Bug: 1995748
Related-Bug: 1996747
Closes-Bug: 1999032
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: Id295599a2946f48081e7d27e2ab8e06063c3c88d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants