Skip to content

Conversation

spilchen
Copy link
Contributor

The TestBenchmarkExpectation benchmark has been frequently timing out after 15 minutes. This appears to be caused by slow CI machines rather than issues with the test logic itself.

To address this, the test is now split into four shards. Each shard is executed separately and receives the full 15-minute timeout budget. This should reduce the likelihood of timeout test failures.

Fixes #148384

Release note: none
Epic: none

@spilchen spilchen self-assigned this Sep 18, 2025
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@spilchen spilchen force-pushed the gh-148384/250918/1010/rttanalysis-timeout/pr-ready branch from 92036cc to eed425d Compare September 18, 2025 17:40
@spilchen spilchen marked this pull request as ready for review September 18, 2025 18:32
@spilchen spilchen requested a review from a team as a code owner September 18, 2025 18:32
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like this idea! just had minor comments

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @spilchen)


pkg/bench/rttanalysis/validate_benchmark_data_test.go line 14 at r2 (raw file):

func TestBenchmarkExpectationShard1(t *testing.T) {
	reg.RunExpectations(t, 1, 4)

super nit: let's make a const for shardCount=4


pkg/bench/rttanalysis/registry.go line 84 at r2 (raw file):

		// Distribute test groups across shards using hash-based assignment
		for groupName, testCases := range r.r {
			h := fnv32Hash(groupName)

i'm curious if there are enough names in the registry to actually get an even number of tests across the shards? would we be able to get away with using round-robin here? (i wasn't sure why we needed to have consistent shard assignment)

Copy link
Collaborator

@rickystewart rickystewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of my comments are nits and could be ignored.

// assigned to the specific shard will be run, enabling parallel execution.
func (r *Registry) RunExpectations(t *testing.T, shard, totalShards int) {
defer jobs.TestingSetIDsToIgnore(map[jobspb.JobID]struct{}{3001: {}, 3002: {}})()
skip.UnderStress(t)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: skip.UnderDuress() is a tidier way to skip in many of these circumstances (race, deadlock, stress). You will have to keep UnderShort() though.

"github.com/cockroachdb/cockroach/pkg/jobs/jobspb"
)
// NOTE: If you change the number of shards, you must also update the
// shard_count in BUILD.bazel to match.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, very optional: you can check the environment variable TEST_TOTAL_SHARDS. If it's set, it's the total number of shards (4, in this case). You can add an init-time assert to ensure that it's the value you expect it to be. Something like this (note, I have not tested this or made sure it compiles):

var _: int = func() int {
    totalShardsStr := os.GetEnv("TEST_TOTAL_SHARDS")
    var totalShards int
    if totalShardsStr != "" {
        totalShards = strconv.Atoi(totalShardsStr)
    }
    if totalShards != 0 {
        if totalShards != EXPECTED_TOTAL_SHARDS {
             panic("update shard_count in pkg/bench/rttanalysis/BUILD.bazel")
        }
    }
    return 0
}()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

}))

func TestBenchmarkExpectation(t *testing.T) { reg.RunExpectations(t) }
func TestBenchmarkExpectation(t *testing.T) { reg.RunExpectations(t, 1, 1) }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, optional: if you keep RunExpectations() and have it delegate to a new function RunExpectationsSharded(t, 1, 1), then you don't have to change this call. Up to you.

@spilchen spilchen force-pushed the gh-148384/250918/1010/rttanalysis-timeout/pr-ready branch from eed425d to 9375cc5 Compare September 19, 2025 12:35
Copy link
Contributor Author

@spilchen spilchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/bench/rttanalysis/registry.go line 84 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i'm curious if there are enough names in the registry to actually get an even number of tests across the shards? would we be able to get away with using round-robin here? (i wasn't sure why we needed to have consistent shard assignment)

There are 31 names in the registry. With the hash assignment, each shard ran: 6, 10, 4, 10. Not too bad, but the 3rd shard was under represented. It seems like the names in the registry are fairly stable, so I don't know if there is a need for consistent hashing. I changed to the round robin approach you mentioned.


pkg/bench/rttanalysis/validate_benchmark_data_test.go line 14 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

super nit: let's make a const for shardCount=4

Done.

"github.com/cockroachdb/cockroach/pkg/jobs/jobspb"
)
// NOTE: If you change the number of shards, you must also update the
// shard_count in BUILD.bazel to match.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

The TestBenchmarkExpectation benchmark has been frequently timing out
after 15 minutes.  This appears to be caused by slow CI machines rather
than issues with the test logic itself.

To address this, the test is now split into four shards. Each shard is
executed separately and receives the full 15-minute timeout budget. This
should reduce the likelihood of timeout test failures.

Fixes cockroachdb#148384

Release note: none
Epic: none
@spilchen spilchen force-pushed the gh-148384/250918/1010/rttanalysis-timeout/pr-ready branch from 9375cc5 to 9fecc53 Compare September 19, 2025 13:17
@spilchen
Copy link
Contributor Author

TFTRs!

bors r+

@craig craig bot merged commit 5feef23 into cockroachdb:master Sep 19, 2025
23 checks passed
@craig
Copy link
Contributor

craig bot commented Sep 19, 2025

@rafiss
Copy link
Collaborator

rafiss commented Sep 22, 2025

blathers backport 25.3

backporting to resolve #152227

Copy link

blathers-crl bot commented Sep 22, 2025

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #148384: branch-release-25.3.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bench/rttanalysis: TestBenchmarkExpectation failed [timeout]
4 participants