Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.1: kvserver: propagate GCHint for partial deletions #110643

Merged
merged 12 commits into from Oct 30, 2023

Conversation

blathers-crl[bot]
Copy link

@blathers-crl blathers-crl bot commented Sep 14, 2023

Backport on behalf of @pavelkalinnikov:

Please see individual PRs for details.

/cc @cockroachdb/release


This PR backports the new behaviour of GCHint introduced in #110078 to 23.1, and makes it conditional on a default-off cluster setting. Since 23.2, it will always enabled by a version gate.

Since it is late to enable this behaviour in 23.1 (risk of incompatibility) or introduce a version gate to make it completely safe, hide it behind a default-off cluster setting. In 23.2, it will be enabled by default, and the cluster setting will be deprecated.

The safest moment to enable this cluster setting is when there is some confidence that the cluster binaries will not rollback to previous patch versions of 23.1. The risk exists only in mixed-version state in which some 23.1 binaries don't know the new GCHint fields, and some do.


Epic: none

Release note (bug fix): Fixed a bug that could occasionally cause schema change jobs (e.g. table/index drops) to appear stuck in state "waiting for MVCC GC" for much longer than expected. The fix only applies to future schema changes -- existing stuck jobs can be processed by manually force-enqueueing the relevant ranges in the MVCC GC queue under the DB Console's advanced debug page.

Release note (ops change): introduce a default-off cluster setting kv.gc.sticky_hint.enabled which helps expediting garbage collection after range deletions, such as when a SQL table or index is dropped.


Release justification: fixing bug affecting many users

The GCHint is currently used for 3 purposes. Document the status quo
before making changes.

Epic: none
Release note: none
This is a cosmetic commit. It replaces a lower-level GCHint "reset"
method with a method that takes an effective GC threshold.

Epic: none
Release note: none
Previously, GCHint was used for 3 purposes:
1. Instruct GC to run at specific times. This is used by SQL table/index
   drop jobs, to limit the amount of time they wait on data GC.
2. Hint GC that a range is fully covered by range tombstones, so that GC
   can use ClearRange for faster deletes. This is useful for bulk data
   deletions as well.
3. Hint GC that a range was fully covered by range tombstones at some
   recent point. GC prioritizes going through such ranges, and processes
   them together, because this makes Pebble compaction more efficient.

The use-case 1 was broken. If a range tombstone does not fully cover the
Range keyspace, GCHint is not written. Correspondingly, for small
table/index deletions (spanning less than a single range), or deletions
that don't perfectly match range bounds, there will be some ranges with
no hints. For these ranges, GC might be arbitrarily delayed, and the
schema change jobs would be stuck.

In addition, GCHint propagation during merges was lossy. It could drop
the hint if either LHS or RHS doesn't have a hint, or has some data.

This commit extends GCHint to fix the use-case 1. Two fields are added:
min and max deletion timestamp. These fields are populated even for
partial range deletions, and are propagated during splits and merges
unconditionally. The choice of min-max approach is explained in the
comments to the corresponding GCHint fields.

Release note (bug fix): Fixed a bug that could occasionally cause schema
change jobs (e.g. table/index drops) to appear stuck in state "waiting
for MVCC GC" for much longer than expected. The fix only applies to
future schema changes -- existing stuck jobs can be processed by
manually force-enqueueing the relevant ranges in the MVCC GC queue under
the DB Console's advanced debug page.

Epic: none
The test checks that all the GCHint invariants hold before/after the
ScheduleGCFor operation, and that it returns true iff it modified the
hint.

The test helped to fix one bug in the method implementation.

Epic: none
Release note: none
Epic: none
Release note: none
Epic: none
Release note: none
The test checks that all the GCHint invariants hold before/after the
Merge operation; that the Merge operation is commutative; and that Merge
returns true iff it modified the hint.

For now, the test only Merges an empty hint with a non-empty hint. Soon
adding tests for non-empty + non-empty cases.

The test helped to fix one bug in Merge commutativity.

Epic: none
Release note: none
@blathers-crl blathers-crl bot requested a review from a team as a code owner September 14, 2023 14:15
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.1-110078 branch 2 times, most recently from 2e99807 to 4fb239d Compare September 14, 2023 14:15
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Sep 14, 2023
@blathers-crl
Copy link
Author

blathers-crl bot commented Sep 14, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Sep 14, 2023
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker
Copy link
Contributor

Let's let this bake on master for a week or two, and wait until the next 23.1 release is cut to get more branch baking time.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Oct 4, 2023

Should we merge this now? The next release is in 3 weeks, which gives us a fair amount of baking time. Or we can wait for the next release to get more confidence.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Oct 4, 2023

In the spirit of the upcoming backport policy changes, I'd also like to get @nvanbenschoten's high-level take on this. Happy to jump on a call to catch you up.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Oct 5, 2023

I think @nvanbenschoten's instinct was right, I think there is a case where this will lead to assertion failures and node crashes due to diverging in-memory and on-disk state following snapshot application.

Below, vNew signifies the new patch version with support for this GCHint, while vOld signifies the old version.

  1. A vNew leaseholder n1 writes GCHint{GCTimestamp: foo}.
  2. A vOld follower n2 applies GCHint{} to its in-memory state.
  3. The vOld follower n2 becomes leaseholder.
  4. The vOld leaseholder n2 sends a snapshot to a vNew follower n3.
  5. n3 receives GCHint{} from SnapshotRequest.Header.State, but GCHint{GCTimestamp: foo} in the snapshot data.
  6. n3 applies the snapshot and crashes.

@pavelkalinnikov Can you verify this, and write up a release-blocker issue? We'll need a version gate for this in 23.2, and/or try to come up with a way to handle this safely in 23.1. One option might be to ignore this field if empty during the assertion comparison, similarly to DeprecatedUsingAppliedStateKey.

@pav-kv
Copy link
Collaborator

pav-kv commented Oct 5, 2023

Discussed with @erikgrinaker.

For this scenario to occur, the SnapshotRequest.Header.State from step 5 should feed into the in-memory Replica state in step 6. However, it doesn't seem to be the case, we're loading the in-memory Replica state from disk. It's still worth running this test to make sure.

Since it is late to enable this behaviour in 23.1 (risk of backwards
incompatibility), hide it behind a default-off cluster setting. In 23.2,
it will be enabled by default, and the cluster setting will be
deprecated.

The new GCHint behaviour is likely backwards compatible, but we are
hiding it behind a setting for extra safety.

The safest moment to enable this cluster setting is when there is some
confidence that the cluster binaries will not rollback to previous patch
versions of 23.1. The risk exists only in mixed-version state in which
some 23.1 binaries don't know the new GCHint fields, and some do.

Epic: none
Release note (ops change): introduce a default-off cluster setting
`kv.gc.sticky_hint.enabled` which helps expediting garbage collection
after range deletions, such as when a SQL table or index is dropped.
@pav-kv
Copy link
Collaborator

pav-kv commented Oct 26, 2023

@erikgrinaker @nvanbenschoten I've added one commit from #112948 which introduces a default-off cluster setting for this change. See the PR description for details.

@pav-kv
Copy link
Collaborator

pav-kv commented Oct 26, 2023

The tests in CI failed because they assume the setting is on (on master). Need to fix the tests so that they are aware whether the setting is on/off.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this across the line.

It's unfortunate that we have to gate a bugfix on a cluster setting, but I suppose it's prudent given the small risk of replica state divergence in mixed-version clusters. The setting also can't be used after-the-fact to resolve an already stuck job, as it only applies to new schema deletions.

pkg/kv/kvserver/batcheval/cmd_delete_range.go Outdated Show resolved Hide resolved
@dt
Copy link
Member

dt commented Oct 26, 2023

replica state divergence in mixed-version clusters

Have we run a (manual) test where a 23.1 build with this change persists some descs and then is downgraded to a 23.1 build without the change?

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Oct 26, 2023

Have we run a (manual) test where a 23.1 build with this change persists some descs and then is downgraded to a 23.1 build without the change?

Yes, we ran a bunch of manual tests, both with mixed-version binaries, and upgraded/downgraded binaries. We believe this should be safe even without the cluster setting, but have gated it on a default-off setting out of caution (and to conform with the new backport policy).

@pav-kv
Copy link
Collaborator

pav-kv commented Oct 27, 2023

@erikgrinaker

The setting also can't be used after-the-fact to resolve an already stuck job, as it only applies to new schema deletions.

When I tested this previously, I saw that the GCHint is overridden when the job is restarted. So the workaround should be: 1. enable the cluster setting, 2. restart the affected jobs (or the nodes). I tested this workflow to check.

Would be good to verify this interaction after the introduction of the cluster setting.

Epic: none
Release note: none
This commit fixed TestStoreMergeGCHint for 23.1 branch which has the
sticky GC hint setting off by default. This commit also adds a new test
to check that the cluster setting changes the GC hint writing behaviour
correctly.

Epic: none
Release note: none
@pav-kv pav-kv force-pushed the blathers/backport-release-23.1-110078 branch from 8512738 to 3bb7709 Compare October 27, 2023 11:34
@pav-kv
Copy link
Collaborator

pav-kv commented Oct 27, 2023

Would be good to verify this interaction after the introduction of the cluster setting.

Did an end-to-end test again (on 23.1 with this PR as is).

First, with the setting off I did the same steps to reach a stuck schema change job state. With a TTL of 10 min, I waited 20 min, and tried manually bumping the job into the mvccGC queue. The queue refused to process this range (low priority, shouldQueue=false).

Then I enabled the cluster setting. Waited another 20 min. The job did not magically got unstuck (because we did not override the GCHint yet). However, after I restarted the schema change job, in 15 min it completed (meaning it wrote a new GCHint, and then some time later GC did pick it up).

@pav-kv pav-kv merged commit 7952853 into release-23.1 Oct 30, 2023
6 checks passed
@pav-kv pav-kv deleted the blathers/backport-release-23.1-110078 branch October 30, 2023 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants