Eliminate usages of retry.RetryOnConflict #4246
Conversation
Force-pushed from 7bc4099 to b07e623
Rebased and fixed timezone-sensitive unit tests.
/assign
/assign
Well done!
/lgtm
Just some naming nits and general questions for my understanding.
By the way the issue description was a great summary and very helpful!
return common.NewAlreadyScheduledError(fmt.Sprintf("shoot is already scheduled to seed %s", *shoot.Spec.SeedName))
}

// run with optimistic locking to prevent rescheduling onto different seed
This might be mainly educational for me, but bear with me please :).
What I understand:
We want to make sure we always patch the current version of the Shoot object, hence the optimistic locking.
What this controller does here is determine and set the seed name for the Shoot.
// run with optimistic locking to prevent rescheduling onto different seed
This implies that something else would try to set the SeedName in parallel.
My follow-up question would then be: where is that other part, and why are there two logical parts competing for scheduling the seed?
Or am I on the wrong path here?
Hmm, I was actually struggling a bit with what to do with this one.
While rechecking the details now, I realized that the reason for retrying the spec.seedName update has meanwhile gone away (gcm no longer sets the status to pending for unscheduled shoots). So I got rid of RetryOnConflict here, simplified the handling to use optimistic locking, and switched to the default rate limiter for the workqueue instead of a custom rate limiter with a baseDelay of 15s, to make retries faster.
This makes the retrySyncPeriod field in the SchedulerConfig obsolete, so I will remove it in another PR.
Still, we want to use optimistic locking here to prevent scheduling based on stale data, e.g. if the shoot was manually scheduled or requeued. We definitely want to make sure not to reschedule the shoot in any case.
I also played around with this a bit, and it seems like the chances of running into conflicts are basically zero. The default workqueue also works perfectly fine; I don't see any issues with switching to it like all the other controllers do.
On vacation next week, therefore /unassign
Seems like the reason for retrying the spec.seedName update has meanwhile gone away (gcm no longer sets the status to pending for unscheduled shoots). So, simplify the handling, use optimistic locking, and use the default rate limiter for the workqueue instead of a custom rate limiter with a baseDelay of 15s to make retries faster.
As @BeckerMax is OOO, can you check the latest changes @rfranzke, please? (I added some changes to the scheduler, see #4246 (comment))
/lgtm
* Replace retry.RetryOnConflict in BackupBucket test
* Replace retry.RetryOnConflict in integration tests
* Eliminate TryUpdateControllerInstallation*
* Eliminate TryUpdateControllerRegistration
* Eliminate TryUpdateSeed
* Eliminate TryUpdateShootHibernation
* Replace TryUpdateShoot in scheduler
* Replace TryUpdateShoot in maintenance controller
* Replace TryUpdateShoot in shoot controller
* Replace TryUpdateShoot in seed controller
* Replace TryUpdateShoot in Operation
* Eliminate TryUpdateShoot*
* Log hibernation job's actions
* Adapt naming and doc strings
* Eliminate retry.RetryOnConflict in scheduler

  Seems like the reason for retrying the spec.seedName update has meanwhile gone away (gcm no longer sets the status to pending for unscheduled shoots). So, simplify the handling, use optimistic locking, and use the default rate limiter for the workqueue instead of a custom rate limiter with a baseDelay of 15s to make retries faster.
How to categorize this PR?
/area scalability
/kind technical-debt
What this PR does / why we need it:
This PR cleans up a big piece of technical debt related to scalability in gardener. It eliminates many usages of retry.RetryOnConflict in its different forms.

Background/Rationale:
Generally speaking, controllers should not retry update/patch requests on conflicts but rather return the error and let the queue handle retries with exponential backoff. This is because a conflict error indicates that the controller might have operated on stale data and thus might have come to a wrong conclusion about which action to take. `RetryOnConflict` ignores this safety mechanism / concurrency protection rather than solving the problem.
OTOH, if a controller is updating, for example, the status of the controlled resource, it should rather patch the status field w/o optimistic locking, as the status field should be exclusively owned by this controller; concurrent updates to the resource are not of interest, so the controller should ignore them.
Another problem is that gardener was doing a lot of direct calls (via the GardenCore clientset) in the TryUpdate* functions. E.g. in the progress reporter for shoot operations or the shoot care controller, the shoot status was updated using TryUpdateShootStatus, which does a direct call first and then updates w/ optimistic locking (and optionally retries on conflict). By replacing this with (strategic) patches, gardener will save a lot of direct calls and conflicts.
With this background in mind, the PR accordingly

* replaces retry.RetryOnConflict with patch requests
* replaces TryUpdate{Shoot,Seed,ControllerRegistration,ControllerInstallation}* with patch requests where applicable, or with explicit update calls or patch calls w/ optimistic locking

Which issue(s) this PR fixes:
Part of #2822
Special notes for your reviewer:
There are other occurrences of "RetryOnConflict" semantics (in kutil.Try{Update,Patch}) that should also be replaced by better mechanisms. Those will be tackled in a different PR, as it requires some more thought and this PR is already quite large.
/squash
Release note: