bench/rttanalysis: shard TestBenchmarkExpectation to avoid timeouts #153721

spilchen · 2025-09-18T15:08:30Z

The TestBenchmarkExpectation test has frequently timed out after 15 minutes. This appears to be caused by slow CI machines rather than by issues with the test logic itself.

To address this, the test is now split into four shards. Each shard is executed separately and receives the full 15-minute timeout budget. This should reduce the likelihood of timeout failures.

Fixes #148384

Release note: none
Epic: none

cockroach-teamcity · 2025-09-18T15:08:42Z

This change is

rafiss

i like this idea! just had minor comments

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @spilchen)

pkg/bench/rttanalysis/validate_benchmark_data_test.go line 14 at r2 (raw file):

func TestBenchmarkExpectationShard1(t *testing.T) {
	reg.RunExpectations(t, 1, 4)

super nit: let's make a const for shardCount=4

pkg/bench/rttanalysis/registry.go line 84 at r2 (raw file):

		// Distribute test groups across shards using hash-based assignment
		for groupName, testCases := range r.r {
			h := fnv32Hash(groupName)

i'm curious if there are enough names in the registry to actually get an even number of tests across the shards? would we be able to get away with using round-robin here? (i wasn't sure why we needed to have consistent shard assignment)

rickystewart

All of my comments are nits and could be ignored.

rickystewart · 2025-09-18T20:47:51Z

pkg/bench/rttanalysis/registry.go

+// assigned to the specific shard will be run, enabling parallel execution.
+func (r *Registry) RunExpectations(t *testing.T, shard, totalShards int) {
+	defer jobs.TestingSetIDsToIgnore(map[jobspb.JobID]struct{}{3001: {}, 3002: {}})()
 	skip.UnderStress(t)


Nit: skip.UnderDuress() is a tidier way to skip in many of these circumstances (race, deadlock, stress). You will have to keep UnderShort() though.

rickystewart · 2025-09-18T20:56:02Z

pkg/bench/rttanalysis/validate_benchmark_data_test.go

-	"github.com/cockroachdb/cockroach/pkg/jobs/jobspb"
-)
+// NOTE: If you change the number of shards, you must also update the
+// shard_count in BUILD.bazel to match.


Nit, very optional: you can check the environment variable TEST_TOTAL_SHARDS. If it's set, it's the total number of shards (4, in this case). You can add an init-time assert to ensure that it's the value you expect it to be. Something like this (note, I have not tested this or made sure it compiles):

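var _ = func() int {
	totalShardsStr := os.Getenv("TEST_TOTAL_SHARDS")
	var totalShards int
	if totalShardsStr != "" {
		var err error
		// strconv.Atoi returns (int, error), so the error must be handled.
		if totalShards, err = strconv.Atoi(totalShardsStr); err != nil {
			panic(err)
		}
	}
	// An unset variable leaves totalShards at 0, which skips the check.
	if totalShards != 0 && totalShards != EXPECTED_TOTAL_SHARDS {
		panic("update shard_count in pkg/bench/rttanalysis/BUILD.bazel")
	}
	return 0
}()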
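A minimal sketch of the same check, with the parsing pulled into a function that returns an error so it can be exercised without panicking. The names checkShardCount and expectedTotalShards are hypothetical and do not come from the PR; only the TEST_TOTAL_SHARDS variable is taken from the comment above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// expectedTotalShards is a hypothetical constant standing in for the
// shard count that must match shard_count in BUILD.bazel.
const expectedTotalShards = 4

// checkShardCount validates the raw value of TEST_TOTAL_SHARDS.
// An empty value means the test is not running sharded, so it passes.
func checkShardCount(raw string, expected int) error {
	if raw == "" {
		return nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return fmt.Errorf("TEST_TOTAL_SHARDS=%q is not an integer: %w", raw, err)
	}
	if n != expected {
		return fmt.Errorf("TEST_TOTAL_SHARDS=%d, expected %d; update shard_count in BUILD.bazel", n, expected)
	}
	return nil
}

func main() {
	// Fail fast at startup if the build configuration and the code disagree.
	if err := checkShardCount(os.Getenv("TEST_TOTAL_SHARDS"), expectedTotalShards); err != nil {
		panic(err)
	}
	fmt.Println("shard count OK")
}
```

Keeping the comparison in a plain function means the mismatch case can be covered by an ordinary unit test, while the init-time caller decides whether to panic.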

rickystewart · 2025-09-18T20:57:31Z

pkg/ccl/benchccl/rttanalysisccl/multi_region_bench_test.go

 }))

-func TestBenchmarkExpectation(t *testing.T) { reg.RunExpectations(t) }
+func TestBenchmarkExpectation(t *testing.T) { reg.RunExpectations(t, 1, 1) }


Nit, optional: if you keep RunExpectations() and have it delegate to a new function RunExpectationsSharded(t, 1, 1), then you don't have to change this call. Up to you.
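The delegation suggested here could look roughly like the sketch below. It is simplified so that it runs on its own: the Registry type is a stand-in that only holds group names, the *testing.T parameter is omitted, and the methods return the groups they would run instead of running them. The index-based shard selection is also an assumption, not the PR's code.

```go
package main

import "fmt"

// Registry is a stand-in for the real rttanalysis registry; in this
// sketch it only holds the names of the test groups.
type Registry struct {
	groups []string
}

// RunExpectationsSharded returns the groups that belong to the given
// shard (1-based) out of totalShards.
func (r *Registry) RunExpectationsSharded(shard, totalShards int) []string {
	var ran []string
	for i, g := range r.groups {
		if i%totalShards == shard-1 {
			ran = append(ran, g)
		}
	}
	return ran
}

// RunExpectations keeps the original signature and delegates to the
// sharded variant with a single shard, so existing callers need no change.
func (r *Registry) RunExpectations() []string {
	return r.RunExpectationsSharded(1, 1)
}

func main() {
	reg := &Registry{groups: []string{"a", "b", "c", "d", "e"}}
	fmt.Println(reg.RunExpectations())            // all five groups
	fmt.Println(reg.RunExpectationsSharded(2, 4)) // only the second group
}
```

With one shard out of one, every group is selected, which is why the unsharded caller in rttanalysisccl would behave as before.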

spilchen

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)

pkg/bench/rttanalysis/registry.go line 84 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i'm curious if there are enough names in the registry to actually get an even number of tests across the shards? would we be able to get away with using round-robin here? (i wasn't sure why we needed to have consistent shard assignment)

There are 31 names in the registry. With the hash assignment, each shard ran: 6, 10, 4, 10. Not too bad, but the 3rd shard was underrepresented. It seems like the names in the registry are fairly stable, so I don't know if there is a need for consistent hashing. I changed to the round-robin approach you mentioned.
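The two strategies discussed in this thread can be compared with a small sketch. Everything here is illustrative: the group names are made up, FNV-1a is assumed because the thread does not show which variant fnv32Hash uses, and sorting the names before dealing them out is one way to get a stable order from a Go map, not necessarily what the PR does.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// sampleNames generates n placeholder group names (hypothetical data).
func sampleNames(n int) []string {
	names := make([]string, n)
	for i := range names {
		names[i] = fmt.Sprintf("Group%02d", i)
	}
	return names
}

// assignByHash maps each name to a 1-based shard using a 32-bit FNV-1a hash.
func assignByHash(names []string, totalShards int) map[string]int {
	out := make(map[string]int, len(names))
	for _, name := range names {
		h := fnv.New32a()
		h.Write([]byte(name)) // Write on a hash.Hash never returns an error
		out[name] = int(h.Sum32()%uint32(totalShards)) + 1
	}
	return out
}

// assignRoundRobin sorts the names and deals them out to shards in turn,
// so shard sizes differ by at most one.
func assignRoundRobin(names []string, totalShards int) map[string]int {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	out := make(map[string]int, len(sorted))
	for i, name := range sorted {
		out[name] = i%totalShards + 1
	}
	return out
}

// shardSizes counts the names per shard; index 0 holds shard 1.
func shardSizes(assignment map[string]int, totalShards int) []int {
	sizes := make([]int, totalShards)
	for _, shard := range assignment {
		sizes[shard-1]++
	}
	return sizes
}

func main() {
	names := sampleNames(31)
	fmt.Println("hash:       ", shardSizes(assignByHash(names, 4), 4))
	fmt.Println("round-robin:", shardSizes(assignRoundRobin(names, 4), 4))
}
```

For 31 names and 4 shards, round-robin always yields shard sizes of 8, 8, 8 and 7, whereas the hash split depends on the names. The trade-off is that adding or removing a name can shift other names between shards under round-robin, which matters little if the registry is as stable as described.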

pkg/bench/rttanalysis/validate_benchmark_data_test.go line 14 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

super nit: let's make a const for shardCount=4

Done.

spilchen · 2025-09-19T14:32:27Z

TFTRs!

bors r+

craig · 2025-09-19T16:20:53Z

Build succeeded:

rafiss · 2025-09-22T20:14:03Z

blathers backport 25.3

backporting to resolve #152227

blathers-crl · 2025-09-22T20:14:08Z

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.

Issue #148384: branch-release-25.3.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

spilchen self-assigned this Sep 18, 2025

spilchen force-pushed the gh-148384/250918/1010/rttanalysis-timeout/pr-ready branch from 92036cc to eed425d Compare September 18, 2025 17:40

spilchen marked this pull request as ready for review September 18, 2025 18:32

spilchen requested a review from a team as a code owner September 18, 2025 18:32

rafiss approved these changes Sep 18, 2025

rickystewart approved these changes Sep 18, 2025

spilchen force-pushed the gh-148384/250918/1010/rttanalysis-timeout/pr-ready branch from eed425d to 9375cc5 Compare September 19, 2025 12:35

spilchen commented Sep 19, 2025

spilchen force-pushed the gh-148384/250918/1010/rttanalysis-timeout/pr-ready branch from 9375cc5 to 9fecc53 Compare September 19, 2025 13:17

craig bot merged commit 5feef23 into cockroachdb:master Sep 19, 2025
23 checks passed

celeste-cockroachdb bot added the target-release-25.4.0 label Sep 19, 2025

blathers-crl bot mentioned this pull request Sep 22, 2025

bench/rttanalysis: TestBenchmarkExpectation failed [timeout] #148384

Closed

spilchen mentioned this pull request Sep 22, 2025

release-25.3: bench/rttanalysis: shard TestBenchmarkExpectation to avoid timeouts #153899

Merged

celeste-cockroachdb bot added v25.4.0-prerelease and removed target-release-25.4.0 labels Oct 1, 2025
