Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poll more frequently when waiting for composed resources to become ready #5427

Merged
merged 3 commits into from
Feb 29, 2024

Conversation

negz
Copy link
Member

@negz negz commented Feb 24, 2024

Description of your changes

Fixes #5424

Also, jitter the poll interval +/- 10%.

Realtime compositions will eliminate the need for this, but it should help XRs become ready sooner (at the expense of more function runs, API server load, etc) until realtime compositions are GA.

Without this change the Requeue: true return is exponentially backed off from 1 to 60 seconds. Meaning that in the likely case composed resources aren't ready within the first ~60 seconds (1 + 2 + 4 + 8 + 16 + 32s) we'll poll for readiness every 60 seconds until they are.

With this change the Requeue: true will instead be backed off from 1 to 30 seconds. Meaning that resources that aren't ready in the first ~30 seconds (1 + 2 + 4 + 8 + 16s) will be polled every 30 seconds until they are. This will also apply to XRs stuck in a persistent error state, e.g. because they can't update their status.

This PR also adds 10% jitter to the XR poll interval.

I have:

Need help with this checklist? See the cheat sheet.

Also, jitter the poll interval +/- 10%.

Signed-off-by: Nic Cope <nicc@rk0n.org>
This will allow us to detect ready XRs slightly faster.

Previously backoff was from 1 to 60 seconds. This means an XR that
wasn't ready in the first 63 seconds would be polled every 60 seconds
until it became ready.

Now backoff is from 1 to 30 seconds. This means an XR that isn't ready
in the first 31 seconds will be polled every 30 seconds until it becomes
ready.

Note that this change affects XRs that are persistently returning
errors, not just unready XRs. The XR reconciler only returns errors
when it can't get the XR or can't update the status of the XR.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz negz requested a review from a team as a code owner February 24, 2024 03:11
@negz negz requested a review from phisco February 24, 2024 03:11
@@ -421,7 +442,7 @@ func NewReconciler(mgr manager.Manager, of resource.CompositeKind, opts ...Recon
log: logging.NewNopLogger(),
record: event.NewNopRecorder(),

pollInterval: defaultPollInterval,
pollInterval: func(_ context.Context, _ *composite.Unstructured) time.Duration { return defaultPollInterval },
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this default won't actually be used. In practice we plumb down --poll-interval from the CLI using WithPollInterval.

@turkenh
Copy link
Member

turkenh commented Feb 26, 2024

Meaning that in the likely case composed resources aren't ready within the first ~60 seconds (1 + 2 + 4 + 8 + 16 + 32s) we'll poll for readiness every 60 seconds until they are.

Thanks for the PR @negz!

My only concern is, we are now increasing TTR for composites that are becoming ready within the first 30 secs or so. For example, if all composed resources were ready within 5secs, the composite would be ready in 7 secs before but 30 secs now. I am wondering if it would be possible to still start with 1 + 2 + 4 ... but capped at 30 secs and also if it does worth doing that?

@negz
Copy link
Member Author

negz commented Feb 26, 2024

I am wondering if it would be possible to still start with 1 + 2 + 4 ... but capped at 30 secs and also if it does worth doing that?

I think that would require either:

  • Updating the controller-runtime rate limiter so that everything backed off to no more than 30 seconds. (Pretty easy)
  • Adding custom rate limiting to cap backoff differently for unready XRs. (Pretty complicated, I think)

Maybe just capping backoff at 30 seconds (rather than 60) for everything in core would be okay? That would affect:

  • Any reconciler that returns an error
  • Any reconciler that returns Requeue: true

Everything would still be subject to the global token bucket rate limiter also.

@turkenh
Copy link
Member

turkenh commented Feb 27, 2024

Maybe just capping backoff at 30 seconds (rather than 60) for everything in core would be okay? That would affect:

This sounds reasonable to me.
AFAIK, this is not configurable, may be introducing a flag (defaulting to 30s) could be a good idea to control this behavior if needed (and be able to revert back to previous value just in case).

@negz
Copy link
Member Author

negz commented Feb 28, 2024

@turkenh It turns out that I didn't need to update every controller, so I've just capped the XR reconciler at 30s backoff. See latest commit for details.

I'm leaning toward not adding a flag to configure this, but I could be convinced if you feel strongly.

@negz negz force-pushed the poll-harder branch 2 times, most recently from c84a378 to be615b9 Compare February 28, 2024 01:59
Copy link
Member

@turkenh turkenh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning toward not adding a flag to configure this, but I could be convinced if you feel strongly.

I don't have a strong opinion either.

Looking good. Three non-blocking questions:

  • Do you think if it makes sense to bump the default value for --max-reconciler-rate in this PR as you mentioned here, given that now we will reconcile "more", that could be considered as relevant.
  • Should we also bump the default resource req/limits to accommodate the change here, with a similar reasoning as above? I believe we can at least relax the cpu limit.
  • Should we backport this or would it be a stretch ?

@negz
Copy link
Member Author

negz commented Feb 29, 2024

@turkenh I can open another PR for the first two.

I think no, too much of a behavior change to back port.

@negz negz merged commit 4cfa757 into crossplane:master Feb 29, 2024
18 checks passed
@negz negz deleted the poll-harder branch February 29, 2024 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Poll with shorter intervals until the composites are ready to reduce TTR
2 participants