Eliminate usages of retry.RetryOnConflict #4246

Merged
merged 15 commits into gardener:master on Jun 29, 2021

Conversation

timebertt
Copy link
Member

How to categorize this PR?

/area scalability
/kind technical-debt

What this PR does / why we need it:

This PR cleans up a large piece of technical debt related to scalability in gardener. It eliminates many usages of retry.RetryOnConflict in its different forms.

Background/Rationale:
Generally speaking, controllers should not retry update/patch requests on conflicts but rather return the error and let the workqueue handle retries with exponential backoff. A conflict error indicates that the controller might have operated on stale data and thus might have come to a wrong conclusion about which action to take. RetryOnConflict ignores this safety mechanism / concurrency protection rather than solving the underlying problem.
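
To illustrate the point, here is a minimal, hypothetical reconciler sketch (exampleReconciler and the annotation are made up for illustration and are not code from this PR), assuming a controller-runtime client:

package example

import (
	"context"

	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type exampleReconciler struct {
	client client.Client
}

func (r *exampleReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	shoot := &gardencorev1beta1.Shoot{}
	if err := r.client.Get(ctx, req.NamespacedName, shoot); err != nil {
		return reconcile.Result{}, client.IgnoreNotFound(err)
	}

	// Decide on an action based on the object that was just read (the concrete
	// mutation is irrelevant for this sketch).
	metav1.SetMetaDataAnnotation(&shoot.ObjectMeta, "example.gardener.cloud/decision", "made-on-observed-state")

	// No retry.RetryOnConflict wrapper: if the object changed concurrently, the
	// apiserver answers with a conflict, the error is returned, and the workqueue
	// requeues the key with exponential backoff. The next reconciliation then
	// operates on fresh data instead of a stale copy.
	if err := r.client.Update(ctx, shoot); err != nil {
		return reconcile.Result{}, err
	}
	return reconcile.Result{}, nil
}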

On the other hand, if a controller is updating, for example, the status of the controlled resource, it should rather patch the status field without optimistic locking: the status field should be exclusively owned by this controller, and concurrent updates to other parts of the resource are not of interest, so the controller can safely ignore them.

Another problem is that gardener was issuing a lot of direct calls (via the GardenCore clientset) in the TryUpdate* helpers. E.g. in the progress reporter for shoot operations and in the shoot care controller, the shoot status was updated using TryUpdateShootStatus, which first reads the object with a direct call and then updates with optimistic locking (and optionally retries on conflict).
By replacing this with (strategic) patches, gardener saves a lot of direct calls and conflicts.
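
As a rough sketch of the replacement pattern (patchShootProgress is a made-up helper, not one of the functions introduced by this PR, and a controller-runtime client is assumed): the status subresource is patched without a resourceVersion, so the request cannot conflict and no preceding direct GET is needed.

package example

import (
	"context"

	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchShootProgress is a hypothetical helper: it patches only the status
// subresource and deliberately omits optimistic locking, because the status is
// exclusively owned by the controller issuing the patch.
func patchShootProgress(ctx context.Context, c client.Client, shoot *gardencorev1beta1.Shoot, progress int32) error {
	patch := client.StrategicMergeFrom(shoot.DeepCopy())
	if shoot.Status.LastOperation != nil {
		shoot.Status.LastOperation.Progress = progress
	}
	// The generated patch contains only the changed fields and no
	// resourceVersion, so it cannot fail with a conflict and requires no prior
	// direct GET against the gardener-apiserver.
	return c.Status().Patch(ctx, shoot, patch)
}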

With this background in mind, the PR accordingly

  • replaces a few direct usages of retry.RetryOnConflict with patch requests
  • replaces usages of TryUpdate{Shoot,Seed,ControllerRegistration,ControllerInstallation}* with patch requests where applicable, or with explicit update calls or patch calls with optimistic locking

Which issue(s) this PR fixes:
Part of #2822

Special notes for your reviewer:

There are other occurrences of "RetryOnConflict" semantics (in kutil.Try{Update,Patch}) that should also be replaced by better mechanisms. Those will be tackled in a different PR, as they require some more thought and this PR is already quite large.

/squash

Release note:

Optimize gardenlet's shoot controller to issue fewer calls to gardener-apiserver for the highly frequent status updates during reconciliations and normal care operations.

@timebertt timebertt requested review from a team as code owners June 23, 2021 12:07
@gardener-robot gardener-robot added area/scalability Scalability related kind/technical-debt Something that is only solved on the surface, but requires more (re)work to be done properly merge/squash size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 23, 2021
@timebertt
Member Author

Rebased and fixed timezone-sensitive unit tests.

@rfranzke
Member

/assign

@BeckerMax
Contributor

/assign

rfranzke previously approved these changes Jun 24, 2021
Member

@rfranzke rfranzke left a comment


Well done!
/lgtm

Contributor

@BeckerMax BeckerMax left a comment


Just some naming nits and general questions of understanding.

By the way, the issue description was a great summary and very helpful!

test/framework/k8s_utils.go (outdated review thread, resolved)
return common.NewAlreadyScheduledError(fmt.Sprintf("shoot is already scheduled to seed %s", *shoot.Spec.SeedName))
}

// run with optimistic locking to prevent rescheduling onto different seed
Contributor


This might be mainly educational for me but bear with me please :).
What I understand:
We want to make sure we always patch the current version of the shoot object, hence the optimistic locking.
What this controller does here is determine and set the Seed name for the Shoot.

// run with optimistic locking to prevent rescheduling onto different seed

This implies that something else might also try to set the SeedName in parallel.
My follow-up question would then be: where is that other part, and why are there two logical parts competing to schedule the Seed?

Or am I on the wrong path here?

Member Author


Hmm, I was actually struggling a bit with what to do with this one.
While rechecking the details now, I realized that the reason for retrying the spec.seedName update is gone in the meantime (gcm no longer sets the status to pending for unscheduled shoots). So I got rid of RetryOnConflict here, simplified the handling to use optimistic locking, and switched to the default rate limiter for the workqueue instead of the custom rate limiter with a baseDelay of 15s, to make retries faster.

This makes the retrySyncPeriod field in the SchedulerConfig obsolete, so I will remove it in another PR.

Still, we want to use optimistic locking here to prevent scheduling based on stale data, e.g. if the shoot was manually scheduled or requeued in the meantime; we definitely want to make sure not to reschedule the shoot in any case.
I also played around with this a bit, and the chances of running into conflicts are basically zero. The default workqueue also works perfectly fine, so I don't see any issue with switching to it like all the other controllers do.
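
For illustration, a minimal sketch of this idea (bindShootToSeed is a made-up name, not the actual scheduler code, and a controller-runtime client is assumed): the seedName is written via a patch with optimistic locking, so the request fails with a conflict if the shoot changed since it was read.

package example

import (
	"context"

	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// bindShootToSeed is a hypothetical helper: it sets spec.seedName with a merge
// patch that carries the resourceVersion of the shoot the scheduling decision
// was based on (optimistic locking).
func bindShootToSeed(ctx context.Context, c client.Client, shoot *gardencorev1beta1.Shoot, seedName string) error {
	patch := client.MergeFromWithOptions(shoot.DeepCopy(), client.MergeFromWithOptimisticLock{})
	shoot.Spec.SeedName = &seedName
	// Because the patch includes metadata.resourceVersion, a concurrent change
	// (e.g. manual scheduling) makes the request fail with a conflict instead of
	// silently rescheduling the shoot; the workqueue then retries with its
	// default exponential backoff.
	return c.Patch(ctx, shoot, patch)
}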

pkg/operation/operation.go (outdated review threads, resolved)
@BeckerMax
Contributor

On vacation next week, therefore /unassign

Seems like the reason for retrying the spec.seedName update is gone meanwhile
(gcm no longer sets status to pending for unscheduled shoots).
So, simplify the handling, use optimistic locking and use the default
rate limiter for the workqueue instead of custom rate limiter with baseDelay
of 15s to make retries faster.
@timebertt
Member Author

As @BeckerMax is OOO, can you check the latest changes @rfranzke, please? (I added some changes to the scheduler, see #4246 (comment))

Member

@rfranzke rfranzke left a comment


/lgtm

@rfranzke rfranzke merged commit 54930d4 into gardener:master Jun 29, 2021
@timebertt timebertt deleted the cleanup/retryonconflict branch June 29, 2021 09:08
krgostev pushed a commit to krgostev/gardener that referenced this pull request Apr 21, 2022
* Replace retry.RetryOnConflict in BackupBucket test

* Replace retry.RetryOnConflict in integration tests

* Eliminate TryUpdateControllerInstallation*

* Eliminate TryUpdateControllerRegistration

* Eliminate TryUpdateSeed

* Eliminate TryUpdateShootHibernation

* Replace TryUpdateShoot in scheduler

* Replace TryUpdateShoot in maintenance controller

* Replace TryUpdateShoot in shoot controller

* Replace TryUpdateShoot in seed controller

* Replace TryUpdateShoot in Operation

* Eliminate TryUpdateShoot*

* Log hibernation job's actions

* Adapt naming and doc strings

* Eliminate retry.RetryOnConflict in scheduler

Seems like the reason for retrying the spec.seedName update is gone meanwhile
(gcm no longer sets status to pending for unscheduled shoots).
So, simplify the handling, use optimistic locking and use the default
rate limiter for the workqueue instead of custom rate limiter with baseDelay
of 15s to make retries faster.
krgostev pushed a commit to krgostev/gardener that referenced this pull request Jul 5, 2022