Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-21.2: kvserver: always return NLHE on lease acquisition timeouts #85429

Merged

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Aug 1, 2022

Backport 1/1 commits from #84865.

/cc @cockroachdb/release

Release justification: improved error handling during lease acquisition.


In ab74b97 we added internal timeouts for lease acquisitions. These
were wrapped in RunWithTimeout(), as mandated for context timeouts.
However, this would mask the returned NotLeaseHolderError as a
TimeoutError, preventing the DistSender from retrying it and instead
propagating it out to the client. Additionally, context cancellation
errors from the actual RPC call were never wrapped as a
NotLeaseHolderError in the first place.

This ended up only happening in a very specific scenario where the outer
timeout added to the client context did not trigger, but the inner
timeout for the coalesced request context did trigger while the lease
request was in flight. Accidentally, the outer RunWithTimeout() call
did not return the roachpb.Error from the closure but instead passed
it via a captured variable, bypassing the error wrapping.

This patch replaces the RunWithTimeout() calls with regular
context.WithTimeout() calls to avoid the error wrapping, and returns a
NotLeaseHolderError from requestLease() if the RPC request fails and
the context was cancelled (presumably causing the error). Another option
would be to extract an NLHE from the error chain, but this would require
correct propagation of the structured error chain across RPC boundaries,
so out of an abundance of caution and with an eye towards backports, we
instead choose to return a bare NotLeaseHolderError.

The empty lease in the returned error prevents the DistSender from
updating its caches on context cancellation.

Release note (bug fix): Fixed a bug where clients could sometimes
receive errors due to lease acquisition timeouts of the form
operation "storage.pendingLeaseRequest: requesting lease" timed out after 6s.

@erikgrinaker erikgrinaker requested a review from a team as a code owner August 1, 2022 21:10
@erikgrinaker erikgrinaker self-assigned this Aug 1, 2022
@blathers-crl
Copy link

blathers-crl bot commented Aug 1, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Copy link
Member

This change is Reviewable

In ab74b97 we added internal timeouts for lease acquisitions. These
were wrapped in `RunWithTimeout()`, as mandated for context timeouts.
However, this would mask the returned `NotLeaseHolderError` as a
`TimeoutError`, preventing the DistSender from retrying it and instead
propagating it out to the client. Additionally, context cancellation
errors from the actual RPC call were never wrapped as a
`NotLeaseHolderError` in the first place.

This ended up only happening in a very specific scenario where the outer
timeout added to the client context did not trigger, but the inner
timeout for the coalesced request context did trigger while the lease
request was in flight. Accidentally, the outer `RunWithTimeout()` call
did not return the `roachpb.Error` from the closure but instead passed
it via a captured variable, bypassing the error wrapping.

This patch replaces the `RunWithTimeout()` calls with regular
`context.WithTimeout()` calls to avoid the error wrapping. Another
option would be to extract an NLHE from the error chain, but this would
require correct propagation of the structured error chain across RPC
boundaries, so with an eye towards backports we instead choose to return
a bare `NotLeaseHolderError`.

The patch also removes the inner lease request timeout, since it has
questionable benefits and the outer timeout is sufficient to prevent
leases getting stuck for the overall range.

Release note (bug fix): Fixed a bug where clients could sometimes
receive errors due to lease acquisition timeouts of the form
`operation "storage.pendingLeaseRequest: requesting lease" timed out after 6s`.
@erikgrinaker erikgrinaker merged commit c2a7c3b into cockroachdb:release-21.2 Aug 8, 2022
@erikgrinaker erikgrinaker deleted the backport21.2-84865 branch August 23, 2022 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants