release-23.1: kv: prevent lease interval regression during expiration-to-epoch promotion #123560

Open
wants to merge 4 commits into
base: release-23.1

nvanbenschoten
Member

Backport 3/3 commits from #123442.

/cc @cockroachdb/release


Fixes #121480.
Fixes #122016.

This commit resolves a bug in the expiration-based to epoch-based lease promotion transition, where the lease's effective expiration could be allowed to regress. To prevent this, we detect when such cases are about to occur and synchronously heartbeat the leaseholder's liveness record. This works because the liveness record interval and the expiration-based lease interval are the same, so a synchronous heartbeat ensures that the liveness record has a later expiration than the prior lease by the time the lease promotion goes into effect.
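
To make the mechanism above concrete, here is a minimal sketch of the detect-then-heartbeat step. It is not the code in this PR: `Timestamp`, `Liveness`, `ExpirationLease`, `Heartbeater`, and `fixedIntervalHeartbeater` are hypothetical, simplified stand-ins, and timestamps are plain integers rather than HLC timestamps.

```go
package main

import "errors"

// Timestamp is a simplified stand-in for an HLC timestamp (assumption).
type Timestamp int64

// Liveness is a hypothetical, simplified node liveness record.
type Liveness struct {
	Epoch      int64
	Expiration Timestamp
}

// ExpirationLease is a hypothetical, simplified expiration-based lease.
type ExpirationLease struct {
	Expiration Timestamp
}

// Heartbeater synchronously extends a liveness record and returns the
// updated record (hypothetical interface).
type Heartbeater interface {
	Heartbeat(cur Liveness) (Liveness, error)
}

// promotionWouldRegress reports whether an epoch-based lease backed by l
// would expire earlier than the expiration-based lease it replaces.
func promotionWouldRegress(prev ExpirationLease, l Liveness) bool {
	return l.Expiration < prev.Expiration
}

// livenessForPromotion returns a liveness record that is safe to promote
// onto, heartbeating synchronously only when a regression is detected.
func livenessForPromotion(prev ExpirationLease, l Liveness, hb Heartbeater) (Liveness, error) {
	if !promotionWouldRegress(prev, l) {
		return l, nil
	}
	updated, err := hb.Heartbeat(l)
	if err != nil {
		return Liveness{}, err
	}
	if promotionWouldRegress(prev, updated) {
		return Liveness{}, errors.New("liveness expiration still behind previous lease after heartbeat")
	}
	return updated, nil
}

// fixedIntervalHeartbeater models a heartbeat that sets the expiration to
// now + interval, where interval matches the expiration-based lease interval.
type fixedIntervalHeartbeater struct {
	now      Timestamp
	interval Timestamp
}

func (h fixedIntervalHeartbeater) Heartbeat(cur Liveness) (Liveness, error) {
	cur.Expiration = h.now + h.interval
	return cur, nil
}
```

In this model, a lease acquired at time 100 with interval 6 expires at 106. If the liveness record is stale (expiration 102) when promotion is attempted at time 103, the heartbeat moves the liveness expiration to 109, which is past 106 because both use the same interval.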

The code structure here leaves a lot to be desired, but since we're going to be cleaning up and/or removing a lot of this code soon anyway, I'm prioritizing backportability. This is therefore more targeted and less general than it could be.

The resolution here also leaves something to be desired. A nicer fix would be to introduce a minimum_lease_expiration field on epoch-based leases so that we can locally ensure that the expiration does not regress. This is what we plan to do for leader leases in the upcoming release. We don't make this change because it would require a version gate to avoid replica divergence, so it would not be backportable.
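
For illustration only, the minimum_lease_expiration idea could look roughly like the sketch below. This is not implemented in this PR; `EpochLease`, `MinExpiration`, `promote`, and `effectiveExpiration` are hypothetical names, and timestamps are plain integers.

```go
package main

// Timestamp is a simplified stand-in for an HLC timestamp (assumption).
type Timestamp int64

// EpochLease is a hypothetical epoch-based lease that carries a floor on
// its expiration, modeled on the minimum_lease_expiration idea.
type EpochLease struct {
	Epoch         int64
	MinExpiration Timestamp
}

// promote builds an epoch-based lease whose floor is the expiration of the
// expiration-based lease it replaces.
func promote(prevExpiration Timestamp, epoch int64) EpochLease {
	return EpochLease{Epoch: epoch, MinExpiration: prevExpiration}
}

// effectiveExpiration is the later of the liveness expiration and the
// floor, so it can be computed locally and never falls below the prior lease.
func effectiveExpiration(l EpochLease, livenessExpiration Timestamp) Timestamp {
	if l.MinExpiration > livenessExpiration {
		return l.MinExpiration
	}
	return livenessExpiration
}
```

With a floor like this, a stale liveness record would not pull the lease's effective expiration backwards, and no synchronous heartbeat would be needed. The cost is a new replicated field, which is why it needs a version gate.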

Release note (bug fix): Fixed a rare bug where a lease transfer could lead to a `side-transport update saw closed timestamp regression` panic. The bug could occur when a node was overloaded and failing to heartbeat its node liveness record.


Release justification: fixes rare, but serious panic.

@nvanbenschoten nvanbenschoten requested a review from a team as a code owner May 3, 2024 15:22

blathers-crl bot commented May 3, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious issues or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL and one additional TL. For more information as to how that review should be conducted, please consult the backport policy.

If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this backport.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label May 3, 2024
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@sumeerbhola sumeerbhola left a comment

Has an extra commit. I assume you had to pick up an old commit to make a clean backport.

tbg and others added 4 commits May 30, 2024 14:03
This can be triggered rapidly because each replica might call this as it tries
and fails to acquire a lease.

This commit adds a check that a replica does not perform a lease transfer if it
does not own the previous lease. This allows us to make a stronger assumption a
layer down.

Epic: None
Release note: None
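
As a rough sketch of the ownership check described in the commit message above (assuming simplified, hypothetical `StoreID` and `Lease` types rather than the real ones), a transfer is refused when the previous lease is held by someone else:

```go
package main

import "errors"

// StoreID and Lease are hypothetical, simplified stand-ins.
type StoreID int32

type Lease struct {
	Holder   StoreID
	Sequence int64
}

// errNotLeaseholder is a hypothetical error for this sketch.
var errNotLeaseholder = errors.New("refusing lease transfer: replica does not own the previous lease")

// checkCanTransfer rejects a lease transfer initiated by a replica that
// does not hold the previous lease, so lower layers can assume ownership.
func checkCanTransfer(self StoreID, prev Lease) error {
	if prev.Holder != self {
		return errNotLeaseholder
	}
	return nil
}
```
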
…otion

Fixes cockroachdb#121480.
Fixes cockroachdb#122016.

This commit resolves a bug in the expiration-based to epoch-based lease
promotion transition, where the lease's effective expiration could be
allowed to regress. To prevent this, we detect when such cases are about
to occur and synchronously heartbeat the leaseholder's liveness record.
This works because the liveness record interval and the expiration-based
lease interval are the same, so a synchronous heartbeat ensures that the
liveness record has a later expiration than the prior lease by the time
the lease promotion goes into effect.

The code structure here leaves a lot to be desired, but since we're
going to be cleaning up and/or removing a lot of this code soon anyway,
I'm prioritizing backportability. This is therefore more targeted and
less general than it could be.

The resolution here also leaves something to be desired. A nicer fix
would be to introduce a minimum_lease_expiration field on epoch-based
leases so that we can locally ensure that the expiration does not
regress. This is what we plan to do for leader leases in the upcoming
release. We don't make this change because it would require a version
gate to avoid replica divergence, so it would not be backportable.

Release note (bug fix): Fixed a rare bug where a lease transfer could
lead to a `side-transport update saw closed timestamp regression` panic.
The bug could occur when a node was overloaded and failing to heartbeat
its node liveness record.

This commit adds a check that `args.PrevLease` is equivalent to
`cArgs.EvalCtx.GetLease()` to RequestLease. This ensures that the
validation here is consistent with the validation that was performed
when the lease request was constructed.

Release note: None
Epic: None
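
The `PrevLease` consistency check in the last commit can be pictured with the sketch below. It assumes a hypothetical, simplified `Lease` struct and uses plain struct equality in place of the real equivalence check; `checkPrevLease` is an illustrative name, not a function from the codebase.

```go
package main

import "fmt"

// Lease is a hypothetical, simplified lease record.
type Lease struct {
	Sequence int64
	Holder   int32
	Epoch    int64
}

// checkPrevLease rejects a lease request whose recorded previous lease no
// longer matches the lease seen at evaluation time. Struct equality stands
// in for the real equivalence check (assumption).
func checkPrevLease(reqPrev, current Lease) error {
	if reqPrev != current {
		return fmt.Errorf("previous lease %+v does not match current lease %+v", reqPrev, current)
	}
	return nil
}
```

The point is that the request is validated against the same lease at construction time and at evaluation time, so a lease change in between causes a rejection instead of an inconsistent evaluation.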
Labels
backport Label PR's that are backports to older release branches

4 participants