Eliminate usages of retry.RetryOnConflict #4246

Merged
merged 15 commits into gardener:master on Jun 29, 2021

Conversation

timebertt
Copy link
Member

How to categorize this PR?

/area scalability
/kind technical-debt

What this PR does / why we need it:

This PR cleans up a large piece of technical debt related to scalability in gardener. It eliminates many usages of retry.RetryOnConflict in its different forms.

Background/Rationale:
Generally speaking, controllers should not retry update/patch requests on conflicts but rather return the error and let the workqueue handle retries with exponential backoff. A conflict error indicates that the controller might have operated on stale data and thus might have come to a wrong conclusion about which action to take. RetryOnConflict ignores this safety mechanism / concurrency protection rather than solving the underlying problem.
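
To illustrate the point, here is a minimal, hypothetical reconciler sketch (exampleReconciler and the annotation are made up for illustration and are not code from this PR), assuming a controller-runtime client:

package example

import (
	"context"

	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type exampleReconciler struct {
	client client.Client
}

func (r *exampleReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	shoot := &gardencorev1beta1.Shoot{}
	if err := r.client.Get(ctx, req.NamespacedName, shoot); err != nil {
		return reconcile.Result{}, client.IgnoreNotFound(err)
	}

	// Decide on an action based on the object that was just read (the concrete
	// mutation is irrelevant for this sketch).
	metav1.SetMetaDataAnnotation(&shoot.ObjectMeta, "example.gardener.cloud/decision", "made-on-observed-state")

	// No retry.RetryOnConflict wrapper: if the object changed concurrently, the
	// apiserver answers with a conflict, the error is returned, and the workqueue
	// requeues the key with exponential backoff. The next reconciliation then
	// operates on fresh data instead of a stale copy.
	if err := r.client.Update(ctx, shoot); err != nil {
		return reconcile.Result{}, err
	}
	return reconcile.Result{}, nil
}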

On the other hand, if a controller is updating, for example, the status of the controlled resource, it should rather patch the status field without optimistic locking: the status field should be exclusively owned by this controller, and concurrent updates to other parts of the resource are not of interest, so the controller can safely ignore them.

Another problem is that gardener was issuing a lot of direct calls (via the GardenCore clientset) in the TryUpdate* helpers. E.g. in the progress reporter for shoot operations and in the shoot care controller, the shoot status was updated using TryUpdateShootStatus, which first reads the object with a direct call and then updates with optimistic locking (and optionally retries on conflict).
By replacing this with (strategic) patches, gardener saves a lot of direct calls and conflicts.
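
As a rough sketch of the replacement pattern (patchShootProgress is a made-up helper, not one of the functions introduced by this PR, and a controller-runtime client is assumed): the status subresource is patched without a resourceVersion, so the request cannot conflict and no preceding direct GET is needed.

package example

import (
	"context"

	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchShootProgress is a hypothetical helper: it patches only the status
// subresource and deliberately omits optimistic locking, because the status is
// exclusively owned by the controller issuing the patch.
func patchShootProgress(ctx context.Context, c client.Client, shoot *gardencorev1beta1.Shoot, progress int32) error {
	patch := client.StrategicMergeFrom(shoot.DeepCopy())
	if shoot.Status.LastOperation != nil {
		shoot.Status.LastOperation.Progress = progress
	}
	// The generated patch contains only the changed fields and no
	// resourceVersion, so it cannot fail with a conflict and requires no prior
	// direct GET against the gardener-apiserver.
	return c.Status().Patch(ctx, shoot, patch)
}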

With this background in mind, the PR accordingly

  • replaces a few direct usages of retry.RetryOnConflict with patch requests
  • replaces usages of TryUpdate{Shoot,Seed,ControllerRegistration,ControllerInstallation}* with patch requests where applicable, or with explicit update calls or patch calls with optimistic locking

Which issue(s) this PR fixes:
Part of #2822

Special notes for your reviewer:

There are other occurrences of "RetryOnConflict" semantics (in kutil.Try{Update,Patch}) that should also be replaced by better mechanisms. Those will be tackled in a different PR, as they require some more thought and this PR is already quite large.

/squash

Release note:

Optimize gardenlet's shoot controller to issue fewer calls to gardener-apiserver for the highly frequent status updates during reconciliations and normal care operations.

@timebertt timebertt requested review from a team as code owners June 23, 2021 12:07
@gardener-robot gardener-robot added area/scalability Scalability related kind/technical-debt Something that is only solved on the surface, but requires more (re)work to be done properly merge/squash size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 23, 2021
@timebertt
Member Author

Rebased and fixed timezone-sensitive unit tests.

@rfranzke
Member

/assign

@BeckerMax
Contributor

/assign

rfranzke previously approved these changes Jun 24, 2021
Member

@rfranzke rfranzke left a comment


Well done!
/lgtm

Contributor

@BeckerMax BeckerMax left a comment


Just some naming nits and general questions of understanding.

By the way, the issue description was a great summary and very helpful!

test/framework/k8s_utils.go (outdated review thread, resolved)
return common.NewAlreadyScheduledError(fmt.Sprintf("shoot is already scheduled to seed %s", *shoot.Spec.SeedName))
}

// run with optimistic locking to prevent rescheduling onto different seed
Contributor


This might be mainly educational for me but bear with me please :).
What I understand:
We want to make sure we always patch the current version of the shoot object, hence the optimistic locking.
What this controller does here is determine and set the Seed name for the Shoot.

// run with optimistic locking to prevent rescheduling onto different seed

This implies that something else might also try to set the SeedName in parallel.
My follow-up question would then be: where is that other part, and why are there two logical parts competing to schedule the Seed?

Or am I on the wrong path here?

Member Author


Hmm, I was actually struggling a bit with what to do with this one.
While rechecking the details now, I realized that the reason for retrying the spec.seedName update is gone in the meantime (gcm no longer sets the status to pending for unscheduled shoots). So I got rid of RetryOnConflict here, simplified the handling to use optimistic locking, and switched to the default rate limiter for the workqueue instead of the custom rate limiter with a baseDelay of 15s, to make retries faster.

This makes the retrySyncPeriod field in the SchedulerConfig obsolete, so I will remove it in another PR.

Still, we want to use optimistic locking here to prevent scheduling based on stale data, e.g. if the shoot was manually scheduled or requeued in the meantime; we definitely want to make sure not to reschedule the shoot in any case.
I also played around with this a bit, and the chances of running into conflicts are basically zero. The default workqueue also works perfectly fine, so I don't see any issue with switching to it like all the other controllers do.
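
For illustration, a minimal sketch of this idea (bindShootToSeed is a made-up name, not the actual scheduler code, and a controller-runtime client is assumed): the seedName is written via a patch with optimistic locking, so the request fails with a conflict if the shoot changed since it was read.

package example

import (
	"context"

	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// bindShootToSeed is a hypothetical helper: it sets spec.seedName with a merge
// patch that carries the resourceVersion of the shoot the scheduling decision
// was based on (optimistic locking).
func bindShootToSeed(ctx context.Context, c client.Client, shoot *gardencorev1beta1.Shoot, seedName string) error {
	patch := client.MergeFromWithOptions(shoot.DeepCopy(), client.MergeFromWithOptimisticLock{})
	shoot.Spec.SeedName = &seedName
	// Because the patch includes metadata.resourceVersion, a concurrent change
	// (e.g. manual scheduling) makes the request fail with a conflict instead of
	// silently rescheduling the shoot; the workqueue then retries with its
	// default exponential backoff.
	return c.Patch(ctx, shoot, patch)
}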

pkg/operation/operation.go (outdated review threads, resolved)
@BeckerMax
Contributor

On vacation next week, therefore /unassign

Seems like the reason for retrying the spec.seedName update is gone meanwhile
(gcm no longer sets status to pending for unscheduled shoots).
So, simplify the handling, use optimistic locking and use the default
rate limiter for the workqueue instead of custom rate limiter with baseDelay
of 15s to make retries faster.
@timebertt
Member Author

As @BeckerMax is OOO, can you check the latest changes @rfranzke, please? (I added some changes to the scheduler, see #4246 (comment))

Member

@rfranzke rfranzke left a comment


/lgtm

@rfranzke rfranzke merged commit 54930d4 into gardener:master Jun 29, 2021
@timebertt timebertt deleted the cleanup/retryonconflict branch June 29, 2021 09:08
krgostev pushed a commit to krgostev/gardener that referenced this pull request Apr 21, 2022
* Replace retry.RetryOnConflict in BackupBucket test

* Replace retry.RetryOnConflict in integration tests

* Eliminate TryUpdateControllerInstallation*

* Eliminate TryUpdateControllerRegistration

* Eliminate TryUpdateSeed

* Eliminate TryUpdateShootHibernation

* Replace TryUpdateShoot in scheduler

* Replace TryUpdateShoot in maintenance controller

* Replace TryUpdateShoot in shoot controller

* Replace TryUpdateShoot in seed controller

* Replace TryUpdateShoot in Operation

* Eliminate TryUpdateShoot*

* Log hibernation job's actions

* Adapt naming and doc strings

* Eliminate retry.RetryOnConflict in scheduler

Seems like the reason for retrying the spec.seedName update is gone meanwhile
(gcm no longer sets status to pending for unscheduled shoots).
So, simplify the handling, use optimistic locking and use the default
rate limiter for the workqueue instead of custom rate limiter with baseDelay
of 15s to make retries faster.
krgostev pushed a commit to krgostev/gardener that referenced this pull request Jul 5, 2022