Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.1: kvserver: add timeout for lease acquisitions #81815

Merged

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented May 25, 2022

Backport 2/2 commits from #81136.

/cc @cockroachdb/release

Release justification: prevents cluster outage on stalled leaseholders.


kvserver: deflake TestLeasePreferencesDuringOutage

This patch deflakes TestLeasePreferencesDuringOutage by allowing nodes
to acquire a lease even when they are not the Raft leader.

Release note: None

kvserver: add timeout for lease acquisitions

This patch adds a timeout for lease acquisitions. It is set to
twice the Raft election timeout (6 seconds total), since it may need to
hold a Raft election and repropose the lease acquisition command, each
of which may take up to one election timeout.

Without this timeout, it's possible for a lease acquisition to stall
indefinitely (e.g. in the case of a stalled disk). This prevents a
NotLeaseHolderError from being returned to the client DistSender,
which in turn prevents it from trying other replicas that could acquire
the lease instead. This can cause a lease to remain invalid forever.

Release note (bug fix): Fixed a bug where an unresponsive node (e.g.
with a stalled disk) could prevent other nodes from acquiring its
leases, effectively stalling these ranges until the node was shut down
or recovered.

Touches #81100.

Release note (bug fix): Fixed a bug where an unresponsive node (e.g.
with a stalled disk) could prevent other nodes from acquiring its
leases, effectively stalling these ranges until the node was shut down
or recovered.

@erikgrinaker erikgrinaker requested a review from tbg May 25, 2022 11:40
@erikgrinaker erikgrinaker requested a review from a team as a code owner May 25, 2022 11:40
@erikgrinaker erikgrinaker self-assigned this May 25, 2022
@blathers-crl
Copy link

blathers-crl bot commented May 25, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Copy link
Member

This change is Reviewable

This patch deflakes `TestLeasePreferencesDuringOutage` by allowing nodes
to acquire a lease even when they are not the Raft leader.

Release note: None
@erikgrinaker erikgrinaker force-pushed the backport22.1-81136 branch 2 times, most recently from 2988ca1 to 30c2563 Compare June 8, 2022 18:15
This patch adds a timeout for lease acquisitions. It is set to
twice the Raft election timeout (6 seconds total), since it may need to
hold a Raft election and repropose the lease acquisition command, each
of which may take up to one election timeout.

Without this timeout, it's possible for a lease acquisition to stall
indefinitely (e.g. in the case of a stalled disk). This prevents a
`NotLeaseHolderError` from being returned to the client DistSender,
which in turn prevents it from trying other replicas that could acquire
the lease instead. This can cause a lease to remain invalid forever.

Release note (bug fix): Fixed a bug where an unresponsive node (e.g.
with a stalled disk) could prevent other nodes from acquiring its
leases, effectively stalling these ranges until the node was shut down
or recovered.
@erikgrinaker erikgrinaker merged commit e8e18d4 into cockroachdb:release-22.1 Jun 8, 2022
@erikgrinaker erikgrinaker deleted the backport22.1-81136 branch June 9, 2022 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants