batcheval: add `BarrierRequest.WithLeaseAppliedIndex` #117967

erikgrinaker · 2024-01-19T10:38:00Z

Extracted from #117612.

batcheval: add BarrierRequest.WithLeaseAppliedIndex

This can be used to detect whether a replica has applied the barrier command yet.

kvnemesis: add support for Barrier operations

This only executes random Barrier requests, but does not verify that the barrier guarantees are actually satisfied (i.e. that all past and concurrent writes are applied before it returns). At least we get some execution coverage, and verify that it does not have negative interactions with other operations.

Epic: none
Release note: None

nvanbenschoten

Reviewed 9 of 9 files at r1, 9 of 9 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)

pkg/kv/kvnemesis/generator.go line 1126 at r2 (raw file):

		// them within an existing range. This may race with a concurrent split, in
		// which case the Barrier will fail, but that's ok -- most should still
		// succeed.

Consider mentioning here that the validator will ignore this failure.

pkg/kv/kvserver/batcheval/cmd_barrier_test.go line 248 at r1 (raw file):

		return errC
	}
	_ = barrierAsync

nit: can we delete this now that the function is used?

erikgrinaker · 2024-01-20T21:05:59Z

bors r+

craig · 2024-01-20T21:42:46Z

Build succeeded:

Bazel Essential CI (Cockroach)

117612: rangefeed: fix premature checkpoint due to intent resolution race r=erikgrinaker a=erikgrinaker

This PR depends on the following changes, which should be backported together:

* #117787
* #117859
* #117967
* #117968
* #117969

---

It was possible for rangefeeds to emit a premature checkpoint, before all writes below its timestamp had been emitted. This in turn would cause changefeeds to not emit these write events at all.

This could happen because `PushTxn` may return a false `ABORTED` status for a transaction that has in fact been committed, if the transaction record has been GCed (after resolving all intents). The timestamp cache does not retain sufficient information to disambiguate a committed transaction from an aborted one in this case, so it pessimistically assumes an abort (see `Replica.CanCreateTxnRecord` and `batcheval.SynthesizeTxnFromMeta`).

However, the rangefeed txn pusher trusted this `ABORTED` status, ignoring the pending txn intents and allowing the resolved timestamp to advance past them before emitting the committed intents. This can lead to the following scenario:

- A rangefeed is running on a lagging follower.
- A txn writes an intent, which is replicated to the follower.
- The closed timestamp advances past the intent.
- The txn commits and resolves the intent at the original write timestamp, then GCs its txn record. This is not yet applied on the follower.
- The rangefeed pushes the txn to advance its resolved timestamp.
- The txn is GCed, so the push returns ABORTED (it can't know whether the txn was committed or aborted after its record is GCed).
- The rangefeed disregards the "aborted" txn and advances the resolved timestamp, emitting a checkpoint.
- The follower applies the resolved intent and emits an event below the checkpoint, violating the checkpoint guarantee.
- The changefeed sees an event below its frontier and drops it, never emitting these events at all.

This patch fixes the bug by submitting a barrier command to the leaseholder which waits for all past and ongoing writes (including intent resolution) to complete and apply, and then waits for the local replica to apply the barrier as well. This ensures any committed intent resolution will be applied and emitted before the transaction is removed from resolved timestamp tracking.

Resolves #104309.
Epic: none
Release note (bug fix): fixed a bug where a changefeed could omit events in rare cases, logging the error "cdc ux violation: detected timestamp ... that is less or equal to the local frontier". This can happen if a rangefeed runs on a follower replica that lags significantly behind the leaseholder, a transaction commits and removes its transaction record before its intent resolution is applied on the follower, the follower's closed timestamp has advanced past the transaction commit timestamp, and the rangefeed attempts to push the transaction to a new timestamp (at least 10 seconds after the transaction began). This may cause the rangefeed to prematurely emit a checkpoint before emitting writes at lower timestamps, which in turn may cause the changefeed to drop these events entirely, never emitting them.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>

erikgrinaker requested a review from nvanbenschoten January 19, 2024 10:38

erikgrinaker self-assigned this Jan 19, 2024

erikgrinaker requested a review from a team as a code owner January 19, 2024 10:38

erikgrinaker requested a review from a team January 19, 2024 10:38

erikgrinaker mentioned this pull request Jan 19, 2024

rangefeed: fix premature checkpoint due to intent resolution race #117612

Merged

nvanbenschoten approved these changes Jan 19, 2024

erikgrinaker added 2 commits January 20, 2024 20:49

batcheval: add BarrierRequest.WithLeaseAppliedIndex

e494351

erikgrinaker force-pushed the barrier-lai branch from 252a0c2 to d4e4dac Compare January 20, 2024 20:50

craig bot merged commit 3c837c2 into cockroachdb:master Jan 20, 2024
8 of 9 checks passed

erikgrinaker mentioned this pull request Jan 22, 2024

kv/kvnemesis: TestKVNemesisSingleNode failed #118005

Closed

This was referenced Jan 29, 2024

release-23.2: batcheval: add BarrierRequest.WithLeaseAppliedIndex #118407

Merged

release-23.1: batcheval: add BarrierRequest.WithLeaseAppliedIndex #118408

Merged

release-22.2: batcheval: add BarrierRequest.WithLeaseAppliedIndex #118474

Merged

blathers-crl bot mentioned this pull request Feb 2, 2024

staging-v22.2.18: release-22.2: batcheval: add BarrierRequest.WithLeaseAppliedIndex #118632

Merged

erikgrinaker mentioned this pull request Feb 8, 2024

release-23.2.1-rc: rangefeed: fix premature checkpoint due to intent resolution race #118981

Merged

erikgrinaker mentioned this pull request Feb 15, 2024

release-23.1.15-rc: rangefeed: fix premature checkpoint due to intent resolution race #119270

Merged
