bench/rttanalysis: shard TestBenchmarkExpectation to avoid timeouts #153721
92036cc
to
eed425d
Compare
i like this idea! just had minor comments
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @spilchen)
pkg/bench/rttanalysis/validate_benchmark_data_test.go
line 14 at r2 (raw file):
func TestBenchmarkExpectationShard1(t *testing.T) { reg.RunExpectations(t, 1, 4)
super nit: let's make a const for shardCount=4
pkg/bench/rttanalysis/registry.go
line 84 at r2 (raw file):
// Distribute test groups across shards using hash-based assignment
for groupName, testCases := range r.r {
	h := fnv32Hash(groupName)
i'm curious if there are enough names in the registry to actually get an even number of tests across the shards? would we be able to get away with using round-robin here? (i wasn't sure why we needed to have consistent shard assignment)
All of my comments are nits and could be ignored.
pkg/bench/rttanalysis/registry.go
Outdated
// assigned to the specific shard will be run, enabling parallel execution.
func (r *Registry) RunExpectations(t *testing.T, shard, totalShards int) {
	defer jobs.TestingSetIDsToIgnore(map[jobspb.JobID]struct{}{3001: {}, 3002: {}})()
	skip.UnderStress(t)
Nit: skip.UnderDuress() is a tidier way to skip in many of these circumstances (race, deadlock, stress). You will have to keep UnderShort() though.
	"github.com/cockroachdb/cockroach/pkg/jobs/jobspb"
)
// NOTE: If you change the number of shards, you must also update the
// shard_count in BUILD.bazel to match.
Nit, very optional: you can check the environment variable TEST_TOTAL_SHARDS. If it's set, it's the total number of shards (4, in this case). You can add an init-time assert to ensure that it's the value you expect it to be. Something like this (note, I have not tested this or made sure it compiles):
// Requires the "os" and "strconv" imports.
var _ = func() int {
	// Bazel sets TEST_TOTAL_SHARDS only when the test target is sharded.
	totalShardsStr := os.Getenv("TEST_TOTAL_SHARDS")
	if totalShardsStr == "" {
		return 0
	}
	totalShards, err := strconv.Atoi(totalShardsStr)
	if err != nil {
		panic(err)
	}
	if totalShards != EXPECTED_TOTAL_SHARDS {
		panic("update shard_count in pkg/bench/rttanalysis/BUILD.bazel")
	}
	return 0
}()
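A variant of the check above that may be easier to unit-test: it takes the raw environment value as a parameter and returns an error instead of panicking at init time. This is only a sketch; the names checkTotalShards and expectedTotalShards are hypothetical and do not come from the PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// expectedTotalShards is a hypothetical constant; it would have to match the
// shard_count set in BUILD.bazel.
const expectedTotalShards = 4

// checkTotalShards validates the raw value of TEST_TOTAL_SHARDS. An empty
// value means the test is not running sharded, so there is nothing to check.
func checkTotalShards(raw string) error {
	if raw == "" {
		return nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return fmt.Errorf("parsing TEST_TOTAL_SHARDS=%q: %w", raw, err)
	}
	if n != expectedTotalShards {
		return fmt.Errorf(
			"TEST_TOTAL_SHARDS=%d, expected %d; update shard_count in BUILD.bazel",
			n, expectedTotalShards)
	}
	return nil
}
```

A caller could pass os.Getenv("TEST_TOTAL_SHARDS") from TestMain or an init function and fail the run on a non-nil error.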
Good idea
}))

-func TestBenchmarkExpectation(t *testing.T) { reg.RunExpectations(t) }
+func TestBenchmarkExpectation(t *testing.T) { reg.RunExpectations(t, 1, 1) }
Nit, optional: if you keep RunExpectations() and have it delegate to a new function RunExpectationsSharded(t, 1, 1), then you don't have to change this call. Up to you.
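A rough sketch of that delegation, assuming 1-indexed shards as in the RunExpectations(t, 1, 4) call quoted earlier. The Registry type here is a simplified, hypothetical stand-in (the groups and ran fields and the runGroup helper are invented for illustration), not the actual rttanalysis registry.

```go
package main

import "testing"

// Registry is a hypothetical stand-in for the real registry. It keeps an
// ordered list of group names and records which ones were run.
type Registry struct {
	groups []string
	ran    []string // stands in for the real per-group test work
}

// runGroup is a placeholder for running one group's expectations.
func (r *Registry) runGroup(t *testing.T, name string) {
	r.ran = append(r.ran, name)
}

// RunExpectations keeps its original signature, so existing callers compile
// unchanged. It runs every group by delegating with a single shard.
func (r *Registry) RunExpectations(t *testing.T) {
	r.RunExpectationsSharded(t, 1, 1)
}

// RunExpectationsSharded runs only the groups assigned to the given
// 1-indexed shard out of totalShards.
func (r *Registry) RunExpectationsSharded(t *testing.T, shard, totalShards int) {
	for i, name := range r.groups {
		if i%totalShards == shard-1 {
			r.runGroup(t, name)
		}
	}
}
```

With this shape, only the new per-shard test functions need to know about shard numbers.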
eed425d
to
9375cc5
Compare
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss)
pkg/bench/rttanalysis/registry.go
line 84 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i'm curious if there are enough names in the registry to actually get an even number of tests across the shards? would we be able to get away with using round-robin here? (i wasn't sure why we needed to have consistent shard assignment)
There are 31 names in the registry. With the hash assignment, each shard ran: 6, 10, 4, 10. Not too bad, but the 3rd shard was underrepresented. It seems like the names in the registry are fairly stable, so I don't know if there is a need for consistent hashing. I changed to the round-robin approach you mentioned.
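To make the trade-off concrete, here is a small sketch of both assignment strategies. It assumes FNV-1a from hash/fnv for the hash variant (the PR's fnv32Hash helper is not shown, so the exact hash is a guess), and the function names are hypothetical.

```go
package main

import (
	"hash/fnv"
	"sort"
)

// hashShard assigns a name to a 0-indexed shard by hashing it with FNV-1a.
// The result is stable across runs but not necessarily balanced.
func hashShard(name string, totalShards int) int {
	h := fnv.New32a()
	h.Write([]byte(name)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(totalShards))
}

// hashAssign maps every name to its hash-based shard.
func hashAssign(names []string, totalShards int) map[string]int {
	out := make(map[string]int, len(names))
	for _, n := range names {
		out[n] = hashShard(n, totalShards)
	}
	return out
}

// roundRobinAssign sorts the names and deals them out to shards in turn, so
// shard sizes differ by at most one.
func roundRobinAssign(names []string, totalShards int) map[string]int {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	out := make(map[string]int, len(sorted))
	for i, n := range sorted {
		out[n] = i % totalShards
	}
	return out
}

// shardCounts tallies how many names were assigned to each shard.
func shardCounts(assign map[string]int, totalShards int) []int {
	counts := make([]int, totalShards)
	for _, shard := range assign {
		counts[shard]++
	}
	return counts
}
```

For 31 names and 4 shards, round-robin always yields 8, 8, 8, 7. The cost is that adding or removing a name can shift later names to a different shard, which matters only if a stable assignment is needed.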
pkg/bench/rttanalysis/validate_benchmark_data_test.go
line 14 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
super nit: let's make a const for shardCount=4
Done.
The TestBenchmarkExpectation benchmark has been frequently timing out after 15 minutes. This appears to be caused by slow CI machines rather than issues with the test logic itself.
To address this, the test is now split into four shards. Each shard is executed separately and receives the full 15-minute timeout budget. This should reduce the likelihood of timeout test failures.
Fixes cockroachdb#148384
Release note: none
Epic: none
9375cc5
to
9fecc53
Compare
TFTRs!
bors r+
blathers backport 25.3
backporting to resolve #152227
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #148384: branch-release-25.3. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.