Poll more frequently when waiting for composed resources to become ready #5427
Also, jitter the poll interval +/- 10%. Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
This will allow us to detect ready XRs slightly faster.

Previously backoff was from 1 to 60 seconds. This means an XR that wasn't ready in the first 63 seconds would be polled every 60 seconds until it became ready.

Now backoff is from 1 to 30 seconds. This means an XR that isn't ready in the first 31 seconds will be polled every 30 seconds until it becomes ready.

Note that this change affects XRs that are persistently returning errors, not just unready XRs. The XR reconciler only returns errors when it can't get the XR or can't update the status of the XR.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@@ -421,7 +442,7 @@ func NewReconciler(mgr manager.Manager, of resource.CompositeKind, opts ...Recon
 		log:    logging.NewNopLogger(),
 		record: event.NewNopRecorder(),

-		pollInterval: defaultPollInterval,
+		pollInterval: func(_ context.Context, _ *composite.Unstructured) time.Duration { return defaultPollInterval },
Note this default won't actually be used. In practice we plumb down --poll-interval from the CLI using WithPollInterval.
Thanks for the PR @negz! My only concern is that we are now increasing TTR for composites that become ready within the first 30 secs or so. For example, if all composed resources were ready within 5 secs, the composite would previously have been ready in 7 secs, but now takes 30 secs. I am wondering whether it would be possible to still start with 1 + 2 + 4 ... but cap at 30 secs, and also whether it is worth doing that?
I think that would require either:
Maybe just capping backoff at 30 seconds (rather than 60) for everything in core would be okay? That would affect:
Everything would also still be subject to the global token bucket rate limiter.
This sounds reasonable to me.
@turkenh It turns out that I didn't need to update every controller, so I've just capped the XR reconciler at 30s backoff. See latest commit for details. I'm leaning toward not adding a flag to configure this, but I could be convinced if you feel strongly.
c84a378 to be615b9
I'm leaning toward not adding a flag to configure this, but I could be convinced if you feel strongly.
I don't have a strong opinion either.
Looking good. Three non-blocking questions:
- Do you think it makes sense to bump the default value for --max-reconciler-rate in this PR, as you mentioned here? Given that we will now reconcile "more", it could be considered relevant.
- Should we also bump the default resource requests/limits to accommodate the change here, with similar reasoning as above? I believe we can at least relax the CPU limit.
- Should we backport this, or would that be a stretch?
@turkenh I can open another PR for the first two. As for backporting, I think not: it's too much of a behavior change to backport.
Description of your changes
Fixes #5424
Also, jitter the poll interval +/- 10%.
Realtime compositions will eliminate the need for this, but it should help XRs become ready sooner (at the expense of more function runs, API server load, etc) until realtime compositions are GA.
Without this change the Requeue: true return is exponentially backed off from 1 to 60 seconds. This means that in the likely case composed resources aren't ready within the first ~60 seconds (1 + 2 + 4 + 8 + 16 + 32s) we'll poll for readiness every 60 seconds until they are.

With this change the Requeue: true return will instead be backed off from 1 to 30 seconds. This means that resources that aren't ready in the first ~30 seconds (1 + 2 + 4 + 8 + 16s) will be polled every 30 seconds until they are. This will also apply to XRs stuck in a persistent error state, e.g. because they can't update their status.

This PR also adds 10% jitter to the XR poll interval.
I have:
- Run make reviewable to ensure this PR is ready for review.
- Added or updated e2e tests.
- Linked a PR or a docs tracking issue to document this change.
- Added backport release-x.y labels to auto-backport this PR.

Need help with this checklist? See the cheat sheet.