Make use of rate limiting in managed reconciler #243

Merged (3 commits, Feb 19, 2021)

Conversation

@hasheddan (Member) commented Feb 11, 2021

Description of your changes

Updates the managed reconciler to return Requeue: true instead of
RequeueAfter when encountering non-status-update errors. This allows the
reconciler to make use of the controller rate limiter rather than use a
constant value for requeuing after errors.

This also renames longWait to pollInterval to more accurately reflect
its behavior, and adds default rate limiters to be used at both the
provider and controller level. The provider rate limiter is a
configurable token bucket, and the controller limiter is a max-of
limiter that combines the provider limiter with a per-item exponential
backoff.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>

xref #40

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

Unit tests.
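The controller-level "max of" limiter described in this PR's description might look roughly like the following sketch. The workqueue helpers are real client-go APIs, but the function name and the 1s/60s backoff bounds are illustrative assumptions rather than code taken from this PR:

```go
package ratelimiter

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// NewDefaultManagedRateLimiter sketches the "max of" controller limiter:
// each item backs off exponentially on failure, but is never requeued
// sooner than the shared provider-level token bucket allows. The 1s/60s
// backoff bounds are assumptions for illustration.
func NewDefaultManagedRateLimiter(provider workqueue.RateLimiter) workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 60*time.Second),
		provider,
	)
}
```

Because NewMaxOfRateLimiter returns the longest delay among its children, the per-item backoff can never undercut the provider-wide rate cap.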

@hasheddan requested a review from negz February 11, 2021 19:49
```go
func (r *Reconciler) maybeBackoff(ctx context.Context, m resource.Managed, wait time.Duration, err error) (reconcile.Result, error) {
	updateErr := errors.Wrap(r.client.Status().Update(ctx, m), errUpdateManagedStatus)
	if r.useBackoff && err != nil {
		return reconcile.Result{}, err
	}
	return reconcile.Result{RequeueAfter: wait}, updateErr
}
```

@hasheddan (Member Author) commented on the `return reconcile.Result{}, err` line:

We could concatenate the update error in the case where we pass an error, backoff is enabled, and the status update fails.
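A hypothetical variant that does that concatenation (not part of this PR; errors.Wrap here is github.com/pkg/errors, as in the snippet above):

```go
// maybeBackoffCombined is a hypothetical variant that surfaces both the
// reconcile error and a failed status update, instead of dropping
// updateErr when backoff is enabled.
func (r *Reconciler) maybeBackoffCombined(ctx context.Context, m resource.Managed, wait time.Duration, err error) (reconcile.Result, error) {
	updateErr := errors.Wrap(r.client.Status().Update(ctx, m), errUpdateManagedStatus)
	if r.useBackoff && err != nil {
		if updateErr != nil {
			// Keep err as the cause and append the status update
			// failure to its message.
			return reconcile.Result{}, errors.Wrap(err, updateErr.Error())
		}
		return reconcile.Result{}, err
	}
	return reconcile.Result{RequeueAfter: wait}, updateErr
}
```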

Member commented:

I'm a little averse to wrappers that produce return values like this, but I can't think of a better approach if we want to allow folks to pick between backoff or fixed RequeueAfter values.

Do you think we could just straight-up switch to using backoff? i.e. If/when folks bump to this version of runtime they'd be using backoff? Providers could cut over at their own leisure.

@hasheddan (Member Author) commented:

I would feel pretty good about just switching over if we include a default implementation (and potentially some recommended variations), because we are essentially using an extremely naive rate limiting function right now. We would just need to call this out explicitly in the release notes, or else consumers would default to backing off to ~16 minutes with the controller-runtime implementation. The long wait would remain 1 minute though, so it is likely not too big of a problem.

Member commented:

> If/when folks bump to this version of runtime they'd be using backoff? Providers could cut over at their own leisure.

That sounds good to me.

> I would feel pretty good about just switching over if we include a default implementation (and potentially some recommended variations), because we are essentially using an extremely naive rate limiting function right now.

I agree that a simple default rate limiter would go a long way here. I believe the big three providers would just opt for the default, except maybe provider-azure, since its requests take longer than others.

```diff
@@ -42,8 +42,7 @@ const (
 	reconcileGracePeriod = 30 * time.Second
 	reconcileTimeout     = 1 * time.Minute

-	defaultManagedShortWait = 30 * time.Second
-	defaultManagedLongWait  = 1 * time.Minute
+	defaultManagedLongWait = 1 * time.Minute
```
Member commented:

Nit: Maybe long doesn't make sense in context anymore, since we don't have a short wait? I tend to think of this as the "speculative" or "happy path" poll interval these days.

@hasheddan (Member Author) commented:

@negz I was considering that as well, but didn't know if we wanted to break the WithLongWait() option on update. I think most providers don't actually use that option (i.e. they default to the 1 minute), and it might also be good for the break to be exposed, to give a stronger indication that something has changed behind the scenes.

Member commented:

We'd be making a fairly equivalent breaking change in removing WithShortWait(), right? We could also have WithLongWait become an alias for WithPollInterval, or whatever we call this now.
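A minimal sketch of such an alias, assuming the ReconcilerOption type and the WithPollInterval option discussed in this PR:

```go
// WithLongWait is a hypothetical deprecated alias that preserves the old
// option name by delegating to WithPollInterval, so existing providers
// keep compiling.
//
// Deprecated: Use WithPollInterval instead.
func WithLongWait(after time.Duration) ReconcilerOption {
	return WithPollInterval(after)
}
```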

```diff
@@ -545,10 +545,10 @@ func (r *Reconciler) Reconcile(_ context.Context, req reconcile.Request) (reconcile.Result, error) {
 		// If this is the first time we encounter this issue we'll be requeued
 		// implicitly when we update our status with the new error condition.
 		// If not, we want to try again after a short wait.
-		log.Debug("Cannot initialize managed resource", "error", err, "requeue-after", time.Now().Add(r.shortWait))
+		log.Debug("Cannot initialize managed resource", "error", err)
```
Member commented:

We don't need to bring this into scope for this PR, but it would be great if we could eventually make our backoff state inspectable such that we could keep something like the requeue-after context. I'd feel more comfortable with longer backoffs (e.g. multiple minutes) if we could provide users with context as to how long our backoff was and why.

@hasheddan (Member Author) commented:

Yeah, unfortunately we don't have access to the rate limiting interface within the reconciler (it is set at the controller level), and the controller-runtime default doesn't allow any sort of "peek" method that doesn't also increment the failure count used for backoff. This was part of my motivation for having a "layer 2" rate limiter, because we could:

  1. Set a default without folks having to explicitly pass it in to the controller.
  2. Decouple the external API backoff from the k8s API server one.
  3. Do things like log what the backoff value is currently at (see the sketch below).
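Point 3 could be served by a wrapper like the sketch below. The workqueue.RateLimiter interface is real client-go; the wrapper type and its use of crossplane-runtime's logging.Logger are hypothetical. Because it logs at the moment the delay is actually requested, it sidesteps the "peeking increments the failure count" problem:

```go
// loggingRateLimiter is a hypothetical "layer 2" wrapper that records the
// delay its inner limiter returns, so the current backoff can be
// surfaced to users.
type loggingRateLimiter struct {
	inner workqueue.RateLimiter
	log   logging.Logger
}

// When logs the inner limiter's delay for item before returning it.
func (l *loggingRateLimiter) When(item interface{}) time.Duration {
	d := l.inner.When(item)
	l.log.Debug("Backing off", "item", item, "duration", d.String())
	return d
}

// Forget and NumRequeues delegate to the inner limiter unchanged.
func (l *loggingRateLimiter) Forget(item interface{})          { l.inner.Forget(item) }
func (l *loggingRateLimiter) NumRequeues(item interface{}) int { return l.inner.NumRequeues(item) }
```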

Member commented:

I'd prefer to tackle this (someday) at the controller-runtime level if possible. It would be nice if, for example, the controller.Manager had a method we could call to ask what the soonest time we'd be requeued at was.

```diff
 		record.Event(managed, event.Warning(reasonCannotInitialize, err))
 		managed.SetConditions(xpv1.ReconcileError(err))
-		return reconcile.Result{RequeueAfter: r.shortWait}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
+		return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
```
Member commented:

Would we be able to just return reconcile.Result, err in these situations if we deprecated the Synced status per #198? Just curious - we don't need to bring that into scope with this PR.

@hasheddan (Member Author) commented:

I believe we still need to call the status update in some scenarios (such as #242). One thing we could do is concatenate the err from the method that was called with the status update error. I played around with this a bit initially though, and it felt pretty sloppy.

Member commented:

Ah, that's true. We also need to assume that the Observe method (and possibly also Create and Update) could be mutating the managed resource's status.

That said, perhaps it's acceptable for us to only persist these status updates in the happy path? i.e. In the return statements that follow any place we'd currently log "Successfully Xed the external resource". When we hit an error case we'd return the error without persisting the managed resource's status. When we didn't hit an error we'd call r.client.Status().Update() as we do today, and return that error if there was one.

Without having thought through it too much I feel like this is probably acceptable? Managed resource viewers should be able to tell that we're trying to create or delete a managed resource because we'll be emitting warning events like CannotCreateExternalResource. Any status fields that would be set by the ExternalClient methods would presumably just get set once the error cleared and we were able to proceed.

@muvaf (Member) commented:

> perhaps it's acceptable for us to only persist these status updates in the happy path? i.e. In the return statements that follow any place we'd currently log "Successfully Xed the external resource". When we hit an error case we'd return the error without persisting the managed resource's status.

Would that result in errors not appearing in conditions, and hence being shown only as events?

@hasheddan (Member Author) replied:

@muvaf yep, I don't think we can do this currently.

@hasheddan marked this pull request as ready for review February 15, 2021 20:48
@hasheddan changed the title from "Allow managed reconciler to make use of rate limiting if enabled" to "Make use of rate limiting in managed reconciler" Feb 15, 2021
@hasheddan (Member Author) commented:

Example usage: crossplane-contrib/provider-aws#544

@muvaf (Member) left a review:

LGTM! Thanks for tackling this long-running issue, @hasheddan! Solid solution.

```go
// after a specified duration when it is not actively waiting for an external
// operation, but wishes to check whether an existing external resource needs to
// be synced to its Crossplane Managed resource.
func WithpollInterval(after time.Duration) ReconcilerOption {
```

Member suggested a change:

```diff
-func WithpollInterval(after time.Duration) ReconcilerOption {
+func WithPollInterval(after time.Duration) ReconcilerOption {
```
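Hypothetical usage of the renamed option when building a managed reconciler (the managed kind and the 30s interval are illustrative; NewReconciler and ManagedKind are the surrounding crossplane-runtime API as I understand it):

```go
r := managed.NewReconciler(mgr,
	resource.ManagedKind(v1alpha1.MyResourceGroupVersionKind), // hypothetical kind
	managed.WithPollInterval(30*time.Second),                  // happy-path poll interval
)
```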

Commit:

Updates the managed reconciler to return Requeue: true instead of
RequeueAfter when encountering non-status-update errors. This allows the
reconciler to make use of the controller rate limiter rather than use a
constant value for requeuing after errors.

This also renames longWait to pollInterval to more accurately reflect
its behavior.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>

Commit:

Adds default rate limiters to be used at both the provider and
controller level. The provider rate limiter is a configurable token
bucket and the controller limiter is a max-of limiter that combines the
provider limiter with a per-item exponential backoff.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
@hasheddan (Member Author) left a comment:

@negz I adjusted the managed reconciler comments as you pointed out. Would love for you to get a final look at this before merge.

```go
// registered with a controller manager. The bucket size is a linear function of
// the requeues per second.
func NewDefaultProviderRateLimiter(rps int) *workqueue.BucketRateLimiter {
	return &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(rps), rps*10)}
}
```
@hasheddan (Member Author) commented:

Using 10 as a linear constant here to determine bucket size is likely not suitable for every provider. However, providers may use their own BucketRateLimiter with a different value. Using the number of controllers under management is not an awful proxy, but it really depends on the actual runtime number of objects and user preference. I feel pretty good about having this as the default and letting providers adjust / expose configuration as needed.

Note: if a user wanted a strict cap of requeues per second they would likely need to set bucket size = 1 here.
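For example, a provider wanting that strict cap could pass its own limiter with bucket size 1 (a sketch using golang.org/x/time/rate and client-go's workqueue; the 1 rps value is illustrative):

```go
// A strict cap of one requeue per second: limit 1, bucket size 1, so no
// bursts are permitted.
strict := &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(1), 1)}
```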

Resolved review threads: pkg/ratelimiter/default.go, pkg/reconciler/managed/reconciler.go
Commit:

Updates comments in the managed reconciler to indicate that requeues are
not tied to the short wait but instead are explicit and trigger the
configured backoff strategy.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>