Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove stale Ready=False conditions value to show more accurate status #884

Merged
merged 1 commit into from
Feb 5, 2024

Conversation

darkowlzz
Copy link
Contributor

@darkowlzz darkowlzz commented Feb 2, 2024

Related to the issue described in fluxcd/flux2#4524 and observations from #855. And another instance of the issue discussed on slack that revealed that stale Ready=False value results in such scenarios (described in detail below).

When the reconciliation begins, while fulfilling the prerequisites, Ready=False condition for various reasons are added on the object. On failure, this reason is persisted on the object. On a subsequent reconciliation, when the failure is recovered, the Ready=False condition is not updates until the atomic release reconciliation reaches a conclusion. During this period if the atomic reconciliation enters a retry loop due to constant drift detection and correction, the stale Ready=False condition with incorrect reason persists on the object. The Ready=False message is also copied to Reconciling=True condition, refer

if remediation := req.Object.GetActiveRemediation(); remediation == nil || !remediation.RetriesExhausted(req.Object) {
conditions.MarkReconciling(req.Object, meta.ProgressingWithRetryReason, conditions.GetMessage(req.Object, meta.ReadyCondition))
return ErrMustRequeue
}
, resulting in an incorrect depiction of what's actually happening.
For example, if previously the HelmRelease failed with dependency not ready error, on a subsequent reconciliation, even after going past the dependency check and returning from atomic release reconciliation due to drift detection and correction loop scenario, the Ready=False condition continues to show the stale dependency not ready error.
An example of the status conditions during this situation as shared by a user is:

Status:
  Conditions:
    Last Transition Time:  2024-02-02T13:35:45Z
    Message:               dependency 'platform-flux-system/cilium' is not ready
    Observed Generation:   8
    Reason:                ProgressingWithRetry
    Status:                True
    Type:                  Reconciling
    Last Transition Time:  2024-02-01T15:31:45Z
    Message:               dependency 'platform-flux-system/cilium' is not ready
    Observed Generation:   6
    Reason:                DependencyNotReady
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-01-24T22:22:25Z
    Message:               Helm upgrade succeeded for release platform-external-secrets/external-secrets.v2 with chart external-secrets@0.9.11
    Observed Generation:   3
    Reason:                UpgradeSucceeded
    Status:                True
    Type:                  Released

In order to show more accurate status, the Ready=False conditions added while fulfilling prerequisites can be removed once those checks have succeeded, updating Ready=False to Ready=Unknown with "reconciliation in progress" message. If the atomic reconciliation gets stuck in the drift detection and correction loop with this, the Ready and Reconciling conditons would show "reconciliation in progress". This should be a better indicator of what's going on. The events and logs can be checked to determine accurately what's causing the reconciliation to be progressing for ever.

An example of the new conditions showing progress after recovering from failure and entering drift detection-correction loop:

status:
  conditions:
  - lastTransitionTime: "2024-02-04T18:19:10Z"
    message: reconciliation in progress
    observedGeneration: 3
    reason: ProgressingWithRetry
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2024-02-04T18:19:04Z"
    message: reconciliation in progress
    observedGeneration: 3
    reason: Progressing
    status: Unknown
    type: Ready
  - lastTransitionTime: "2024-02-04T18:15:53Z"
    message: Helm install succeeded for release default/metallb.v1 with chart metallb@0.13.12
    observedGeneration: 1
    reason: InstallSucceeded
    status: "True"
    type: Released

(NOT IMPLEMENTED IN THIS PR - See #885 ) Another way to improve the scenario would be to update the Ready condition when drift correction runs. But this would be very specific to the drift scenario and may not work for other situations that could also result in a similar scenario. Updating the Ready condition anywhere after entering the atomic release reconciliation should change the Ready condition to report accurate state.

Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great and gets HC closer to how KC behaves. I have added a new helper function to runtime that can be used here, see fluxcd/pkg#730

@stefanprodan stefanprodan added the area/ux In pursuit of a delightful user experience label Feb 3, 2024
@darkowlzz darkowlzz force-pushed the update-stale-ready-condition branch 2 times, most recently from 94a724d to 2958334 Compare February 4, 2024 21:17
@darkowlzz
Copy link
Contributor Author

darkowlzz commented Feb 4, 2024

Updated to use the helper from fluxcd/pkg#730. Need to update go.mod with a new pkg/runtime release.
Also added a test to make sure that the prerequisite failure conditions are updated using a release in-sync scenario that doesn't mutate the Ready condition, allowing to check what was the value of Ready condition before atomic release reconciliation.
UPDATE: Thinking about the in-sync scenario used for the test, it reveals another potential conditions related issue where atomic release reconciliation would not add a Ready=True condition. If a release was installed successfully, in an update or in another reconciliation one of the prerequisites causes an intermittent failure, resulting in Ready=False condition. When the failure gets resolved in a retry, the reconciliation would enter atomic release with Ready=False condition and seeing everything in-sync, the condition would remain as it was. With this PR, Ready=False will be updated to Ready=Unknown, which still doesn't solve the issue. Status will show that the release is not ready but in reality, it is ready. This is theoretical based on my current understanding, I'll have to manually test and verify. But that can be addressed in a separate PR later.

This change is in itself an improvement to avoid certain confusing status. #885 can be added separately later for more enhancement in the drift detection-correction loop case specifically.

@darkowlzz darkowlzz marked this pull request as ready for review February 4, 2024 21:33
When the reconciliation begins, while fulfilling the prerequisites,
Ready=False condition for various reasons are added on the object. On
failure, this reason is persisted on the object. On a subsequent
reconciliation, when the failure is recovered, the Ready=False condition
is not updates until the atomic reconciliation reaches a conclusion.
During this period if the atomic reconciliation enters a retry loop due
to constant drift detection and correction, the stale Ready=False
condition with incorrect reason persists on the object. The Ready=False
message is also copied to Reconciling=True condition, resulting in an
incorrect depiction of what's actually happening.
For example, if previously the HelmRelease failed with dependency not
ready error, on a subsequent reconciliation, even after going past the
dependency check and returning from atomic reconciliation due to drift
detection and correction loop scenario, the Ready=False condition
continues to show the stale dependency not ready error.

In order to show more accurate status, the Ready=False conditions added
while fulfilling prerequisites can be removed once those checks have
succeeded, updating Ready=False to Ready=Unknown with "reconciliation in
progress" message. If the atomic reconciliation gets stuck in the drift
detection and correction loop with this, the Ready and Reconciling
conditons would show "reconciliation in progress". This should be a
better indicator of what's going on. The events and logs can be checked
to determine accurately what's causing the reconciliation to be
progressing for ever.

Signed-off-by: Sunny <github@darkowlzz.space>
Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @darkowlzz for this major improvement to UX 🥇

@darkowlzz darkowlzz merged commit 0bd797a into main Feb 5, 2024
6 checks passed
@darkowlzz darkowlzz deleted the update-stale-ready-condition branch February 5, 2024 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ux In pursuit of a delightful user experience
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants