fix(integration): log fuzz seed so failures can be reproduced#7552
Open
sandy2008 wants to merge 1 commit into
Open
fix(integration): log fuzz seed so failures can be reproduced#7552sandy2008 wants to merge 1 commit into
sandy2008 wants to merge 1 commit into
Conversation
Adds a newFuzzRand(t *testing.T) *rand.Rand helper that logs the random seed every fuzz test draws (via t.Logf), and replaces 13 ad-hoc rand.New(rand.NewSource(time.Now().Unix())) call sites across integration/query_fuzz_test.go (11 sites) and integration/parquet_querier_test.go (2 sites) with calls to it. A FUZZ_SEED environment variable overrides the default time-based seed so a failing CI run can be replayed locally byte-for-byte. This is a pure observability change. It does not fix the underlying TestProtobufCodecFuzz flake tracked in cortexproject#7548 -- it makes the next failure classifiable. The current code silently consumes the seed via rand.New(rand.NewSource(now.Unix())) and never surfaces it in the test log, so the failing query/series combination cannot be regenerated and every failure has to be diagnosed from the comparison output alone. That is insufficient to distinguish real Cortex/Prometheus divergence from known-class non-determinism from infra noise. Logging the seed (and accepting an override) is the minimum surface needed to make the next failure reproducible. Invalid FUZZ_SEED values are logged rather than silently falling back, so a typo in CI configuration surfaces immediately instead of masquerading as a successful override. The helper is applied to all 13 fuzz call sites, not just TestProtobufCodecFuzz, because the underlying observability gap is identical for every fuzz test in the integration suite and applying the helper everywhere is strictly less code than leaving twelve other tests with the same gap. This is the final PR in a four-PR series filed against issues cortexproject#7545-cortexproject#7548, complementing PR cortexproject#7544 (which fixes cortexproject#7543). Fixes cortexproject#7548 Signed-off-by: Sandy Chen <ychen@monoidtech.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Adds a small
newFuzzRand(t *testing.T) *rand.Randhelper atintegration/query_fuzz_test.go:1986that logs the random seed every fuzz test draws (viat.Logf), and replaces the 13 ad-hocrand.New(rand.NewSource(time.Now().Unix()))/rand.New(rand.NewSource(now.Unix()))call sites acrossintegration/query_fuzz_test.go(11 sites) andintegration/parquet_querier_test.go(2 sites) with calls to it. AFUZZ_SEEDenvironment variable overrides the defaulttime.Now().Unix()so a failing CI run can be replayed locally byte-for-byte.This is a pure observability change. It does not fix the underlying flake tracked in #7548 —
TestProtobufCodecFuzzwill continue to fail at whatever rate it currently does. What it changes is that the next failure log will include the exact seed used, so the run can be reproduced deterministically and classified (real Cortex/Prometheus divergence vs. another non-deterministic-set message vs. a transient harness issue).Why
Issue #7548 documents
TestProtobufCodecFuzzfailures in CI. Triage is currently blocked because the seed is silently consumed byrand.New(rand.NewSource(now.Unix()))and never surfaced in the test log, so the failing query/series combination cannot be re-generated. Every failure has to be diagnosed from the comparison output alone, which is insufficient to distinguish "real bug" from "known-class non-determinism" from "infra noise". Logging the seed (and accepting an override) is the minimum surface needed to make the next failure classifiable.The same problem applies to every fuzz test in
integration/query_fuzz_test.goandintegration/parquet_querier_test.go— see "Scope expansion" below.How
Two pieces:
New helper at
integration/query_fuzz_test.go:1986:Invalid
FUZZ_SEEDvalues are logged (with the parse error) rather than silently falling back, so a typo in CI configuration surfaces immediately instead of masquerading as a successful override.13 call-site replacements — every
rand.New(rand.NewSource(time.Now().Unix()))andrand.New(rand.NewSource(now.Unix()))in:integration/query_fuzz_test.go:TestNativeHistogramFuzz,TestExperimentalPromQLFuncsWithPrometheus,TestDisableChunkTrimmingFuzz,TestExpandedPostingsCacheFuzz,TestVerticalShardingFuzz,TestProtobufCodecFuzz,TestStoreGatewayLazyExpandedPostingsSeriesFuzz,TestStoreGatewayLazyExpandedPostingsSeriesFuzzWithPrometheus,TestBackwardCompatibilityQueryFuzz,TestPrometheusCompatibilityQueryFuzz,TestRW1vsRW2QueryFuzz(11 sites).integration/parquet_querier_test.go:TestParquetFuzz,TestParquetProjectionPushdownFuzz(2 sites). The unused"math/rand"import is also dropped from this file since the helper is in the other test file (same package).No production code is touched.
Scope expansion
This PR was filed against #7548 (
TestProtobufCodecFuzzspecifically) but the helper is applied to all 13 fuzz call sites, not justTestProtobufCodecFuzz. This is intentional: the underlying observability gap — seed is computed fromtime.Now().Unix()and never logged — is identical for every fuzz test in the integration suite, and applying the helper everywhere is strictly less code than leaving twelve other tests with the same gap. The risk is small (test-only change, no production code, no new dependencies, the helper is ~15 lines), and the benefit when the next different fuzz test flakes is exactly the same as the benefit forTestProtobufCodecFuzztoday. A reviewer who prefers a narrower change can squash the scope to the singleTestProtobufCodecFuzzcall site without functional impact, but the wider change is the recommended shape.Reproducing a failed seed
When a CI run fails, copy the seed from the
integration fuzz random seed: <N>log line and replay locally:Substitute the actual seed and the actual test name. The same
FUZZ_SEEDworks for every fuzz test in the suite because they all draw fromnewFuzzRand(t)now.Which issue(s) this PR fixes
Fixes #7548
This is the final PR in a four-PR series filed against issues #7545–#7548, complementing PR #7544 (which fixes #7543). The full series:
TestParquetFuzztopk/bottomk tie-collapse flake (fixes Flaky test: TestParquetFuzz — topk/bottomk tie non-determinism not fully filtered #7543)TestExpandedPostingsCacheFuzzinverted loop +sres1, sres1typo (fixes Flaky test: TestExpandedPostingsCacheFuzz — inverted isValidQuery filter lets backward-incompat queries (quantile, atan2, …) through #7545)sameErrorClasscanonicalizer for non-deterministic bracket-list error messages (fixes Flaky test: TestPrometheusCompatibilityQueryFuzz / TestExperimentalPromQLFuncsWithPrometheus — error-string comparator sensitive to non-deterministic series-list ordering #7546)hasOrVectorFallbackpredicate againstor vector(...)divergences inTestVerticalShardingFuzz(fixes Flaky test: TestVerticalShardingFuzz — sharded vs unsharded "or vector(…)" semantic divergence #7547)newFuzzRandhelper to log + override fuzz seeds (fixes Flaky test: TestProtobufCodecFuzz — unconfirmed root cause (tracking issue) #7548)Together the four PRs address the four open
integration_query_fuzzflake issues with the minimum-surface fix that meaningfully advances each one. #7548 is intentionally an observability fix rather than a behaviour fix because the failing-case data needed to classify the underlying flake does not currently exist; this PR collects it.Checklist
CHANGELOG.mdupdated — not applicable; test-only change with no user-facing behaviour change.FUZZ_SEEDenv var is documented in the helper's godoc comment.TestNewFuzzRandwith three subtests can be added in a follow-up.Test plan
Local (no Docker harness required):
gofmt -l integration/query_fuzz_test.go integration/parquet_querier_test.go— cleangoimports -local github.com/cortexproject/cortex -l integration/query_fuzz_test.go integration/parquet_querier_test.go— cleango vet -tags "netgo slicelabels integration integration_query_fuzz" ./integration/...— cleango build -tags "netgo slicelabels integration integration_query_fuzz" ./integration/...— cleanValidation that seed logging actually appears (requires Docker):
make ./cmd/cortex/.uptodatego test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=1 ./integration/... -run "^TestProtobufCodecFuzz$"— confirm theintegration fuzz random seed: <N> (override with FUZZ_SEED env var)line appears in the log.FUZZ_SEED=42 go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=1 ./integration/... -run "^TestProtobufCodecFuzz$"— confirm the override path logsoverridden to 42 via FUZZ_SEEDand the run is deterministic (two invocations with the sameFUZZ_SEEDexercise the same query set).FUZZ_SEED=notanint go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=1 ./integration/... -run "^TestProtobufCodecFuzz$"— confirm the invalid-input path logsignoring invalid FUZZ_SEED="notanint"and falls back to the time-based seed without skipping the test.