roachtest: add roachtest for snapshot ingest with excises #124591

Merged


aadityasondhi
Collaborator

This patch adds a roachtest for running snapshots with excises enabled. In this workload, when splits and excises are disabled, we see an inverted LSM and degraded p99 latencies.

The test asserts that the LSM stays healthy while doing the snapshot ingest, and p99 latencies don't spike over a threshold.

Informs #80607.

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@aadityasondhi aadityasondhi force-pushed the 20240502.snapshot-roachtest-excise branch 9 times, most recently from 357fde6 to 765053a Compare May 23, 2024 22:27
@aadityasondhi aadityasondhi marked this pull request as ready for review May 23, 2024 22:38
@aadityasondhi aadityasondhi requested a review from a team as a code owner May 23, 2024 22:38
@aadityasondhi aadityasondhi requested review from herkolategan, DarrylWong, sumeerbhola and a team and removed request for a team May 23, 2024 22:38
@aadityasondhi
Collaborator Author

Charts for this roachtest can be found in this internal thread.

@aadityasondhi aadityasondhi added backport-24.1.x Flags PRs that need to be backported to 24.1. and removed backport-24.1.x Flags PRs that need to be backported to 24.1. labels May 23, 2024
@aadityasondhi
Collaborator Author

I will hold the backport until this test bakes in master for a while.

Collaborator

@sumeerbhola sumeerbhola left a comment

:lgtm:

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @DarrylWong, and @herkolategan)


pkg/cmd/roachtest/tests/admission_control_snapshot_overload_excise.go line 57 at r1 (raw file):

				t.Fatalf("expected at least 4 nodes, found %d", c.Spec().NodeCount)
			}

Worth adding a comment like:

// COCKROACH_CONCURRENT_COMPACTIONS is set to 1 since we want to ensure that snapshot ingests don't result in LSM inversion even with a very low compaction rate. With Pebble's IngestAndExcise all the ingested sstables should ingest into L6.
// COCKROACH_CONCURRENT_SNAPSHOT* is increased so that the rate of snapshot application is high.
// COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD is reduced so that there is certainty that the restarted node will be caught up via snapshots, and not via raft log replay.
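As a rough illustration of the settings described in that comment, here is a minimal Go sketch that builds the environment list for a node. The helper name `nodeEnvSketch` is hypothetical and is not the test's actual code. The truncation threshold is left as a parameter because its value is not stated in this thread, and the assumption that it is expressed in bytes is mine. The `COCKROACH_CONCURRENT_SNAPSHOT*` variables are omitted because the thread only names them by wildcard.

```go
package main

import "fmt"

// nodeEnvSketch is a hypothetical helper (not from the PR) that returns
// environment entries in KEY=VALUE form for the node under test.
func nodeEnvSketch(truncationThresholdBytes int64) []string {
	return []string{
		// A single concurrent compaction keeps the compaction rate very low,
		// so the test can show that snapshot ingests alone do not invert the LSM.
		"COCKROACH_CONCURRENT_COMPACTIONS=1",
		// A reduced truncation threshold (assumed to be in bytes) makes it
		// likely that a restarted node catches up via snapshots rather than
		// raft log replay.
		fmt.Sprintf("COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD=%d", truncationThresholdBytes),
	}
}
```

Calling `nodeEnvSketch(512 << 10)` would yield two entries, the second carrying a 512 KiB threshold; that number is purely illustrative.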

pkg/cmd/roachtest/tests/admission_control_snapshot_overload_excise.go line 186 at r1 (raw file):

				// Assert on l0 sublevel count and p99 latencies.
				latencyMetric := divQuery("histogram_quantile(0.99, sum by(le) (rate(sql_service_latency_bucket[2m])))", 1<<20 /* 1ms */)
				const latencyThreshold = 100

So 100ms because of the divQuery. Worth adding a code comment.
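To spell out the arithmetic behind that remark, here is a small Go sketch. It assumes the raw `sql_service_latency` histogram is recorded in nanoseconds, so dividing by `1<<20` (1,048,576) yields approximately milliseconds. `divQuerySketch` and `thresholdMillis` are hypothetical names; the real `divQuery` helper in the test may build its expression differently.

```go
package main

import "fmt"

// divQuerySketch is a hypothetical stand-in for the test's divQuery helper:
// it wraps a PromQL expression and divides it by a constant.
func divQuerySketch(query string, div int) string {
	return fmt.Sprintf("(%s)/%d", query, div)
}

// thresholdMillis converts a threshold expressed in divided units back to
// exact milliseconds, assuming the undivided metric is in nanoseconds.
func thresholdMillis(threshold float64, div int) float64 {
	return threshold * float64(div) / 1e6
}
```

Under these assumptions, `thresholdMillis(100, 1<<20)` works out to roughly 104.86 ms, so `latencyThreshold = 100` is "about 100ms" rather than exactly 100ms, because `1<<20` is a power-of-two approximation of 1e6.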

@aadityasondhi aadityasondhi force-pushed the 20240502.snapshot-roachtest-excise branch from 765053a to 677bf83 Compare May 24, 2024 14:58
Collaborator Author

@aadityasondhi aadityasondhi left a comment

TFTR!

bors r=sumeerbhola

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DarrylWong, @herkolategan, and @sumeerbhola)


pkg/cmd/roachtest/tests/admission_control_snapshot_overload_excise.go line 57 at r1 (raw file):

Previously, sumeerbhola wrote…

Worth adding a comment like:

// COCKROACH_CONCURRENT_COMPACTIONS is set to 1 since we want to ensure that snapshot ingests don't result in LSM inversion even with a very low compaction rate. With Pebble's IngestAndExcise all the ingested sstables should ingest into L6.
// COCKROACH_CONCURRENT_SNAPSHOT* is increased so that the rate of snapshot application is high.
// COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD is reduced so that there is certainty that the restarted node will be caught up via snapshots, and not via raft log replay.

Done.


pkg/cmd/roachtest/tests/admission_control_snapshot_overload_excise.go line 186 at r1 (raw file):

Previously, sumeerbhola wrote…

So 100ms because of the divQuery. Worth adding a code comment.

Done.

@craig
Contributor

craig bot commented May 24, 2024

Build failed:

@aadityasondhi aadityasondhi force-pushed the 20240502.snapshot-roachtest-excise branch from 677bf83 to bb693c8 Compare May 24, 2024 15:53
@aadityasondhi
Collaborator Author

lint failure 🫤 will wait on branch passing before merging

@aadityasondhi
Collaborator Author

bors r=sumeerbhola

@craig craig bot merged commit 6ad7dee into cockroachdb:master May 24, 2024
21 of 22 checks passed
@aadityasondhi aadityasondhi deleted the 20240502.snapshot-roachtest-excise branch May 28, 2024 16:53

3 participants