fix(integration): log fuzz seed so failures can be reproduced by sandy2008 · Pull Request #7552 · cortexproject/cortex

sandy2008 · 2026-05-22T10:16:25Z

What this PR does

Adds a small newFuzzRand(t *testing.T) *rand.Rand helper at integration/query_fuzz_test.go:1986 that logs the random seed every fuzz test draws (via t.Logf), and replaces the 13 ad-hoc rand.New(rand.NewSource(time.Now().Unix())) / rand.New(rand.NewSource(now.Unix())) call sites across integration/query_fuzz_test.go (11 sites) and integration/parquet_querier_test.go (2 sites) with calls to it. A FUZZ_SEED environment variable overrides the default time.Now().Unix() so a failing CI run can be replayed locally byte-for-byte.

This is a pure observability change. It does not fix the underlying flake tracked in #7548 — TestProtobufCodecFuzz will continue to fail at whatever rate it currently does. What it changes is that the next failure log will include the exact seed used, so the run can be reproduced deterministically and classified (real Cortex/Prometheus divergence vs. another non-deterministic-set message vs. a transient harness issue).

Why

Issue #7548 documents TestProtobufCodecFuzz failures in CI. Triage is currently blocked because the seed is silently consumed by rand.New(rand.NewSource(now.Unix())) and never surfaced in the test log, so the failing query/series combination cannot be re-generated. Every failure has to be diagnosed from the comparison output alone, which is insufficient to distinguish "real bug" from "known-class non-determinism" from "infra noise". Logging the seed (and accepting an override) is the minimum surface needed to make the next failure classifiable.

The same problem applies to every fuzz test in integration/query_fuzz_test.go and integration/parquet_querier_test.go — see "Scope expansion" below.

How

Two pieces:

New helper at integration/query_fuzz_test.go:1986:

// newFuzzRand returns a *rand.Rand whose seed is logged via t.Logf so failing
// fuzz cases can be reproduced. By default the seed is time.Now().Unix();
// setting FUZZ_SEED to a base-10 int64 overrides the default and pins the
// run to a specific seed.
func newFuzzRand(t *testing.T) *rand.Rand {
    seed := time.Now().Unix()
    if v := os.Getenv("FUZZ_SEED"); v != "" {
        if parsed, err := strconv.ParseInt(v, 10, 64); err == nil {
            t.Logf("integration fuzz random seed: overridden to %d via FUZZ_SEED", parsed)
            seed = parsed
        } else {
            t.Logf("integration fuzz random seed: ignoring invalid FUZZ_SEED=%q: %v", v, err)
        }
    }
    t.Logf("integration fuzz random seed: %d (override with FUZZ_SEED env var)", seed)
    return rand.New(rand.NewSource(seed))
}

Invalid FUZZ_SEED values are logged (with the parse error) before the helper falls back to the time-based seed, so a typo in CI configuration surfaces immediately instead of masquerading as a successful override.

13 call-site replacements — every rand.New(rand.NewSource(time.Now().Unix())) and rand.New(rand.NewSource(now.Unix())) in:
- integration/query_fuzz_test.go: TestNativeHistogramFuzz, TestExperimentalPromQLFuncsWithPrometheus, TestDisableChunkTrimmingFuzz, TestExpandedPostingsCacheFuzz, TestVerticalShardingFuzz, TestProtobufCodecFuzz, TestStoreGatewayLazyExpandedPostingsSeriesFuzz, TestStoreGatewayLazyExpandedPostingsSeriesFuzzWithPrometheus, TestBackwardCompatibilityQueryFuzz, TestPrometheusCompatibilityQueryFuzz, TestRW1vsRW2QueryFuzz (11 sites).
- integration/parquet_querier_test.go: TestParquetFuzz, TestParquetProjectionPushdownFuzz (2 sites). The now-unused "math/rand" import is also dropped from this file, since the helper lives in the other test file (same package).
No production code is touched.

Scope expansion

This PR was filed against #7548 (TestProtobufCodecFuzz specifically), but the helper is applied to all 13 fuzz call sites, not just TestProtobufCodecFuzz. This is intentional: the underlying observability gap (the seed is computed from time.Now().Unix() and never logged) is identical for every fuzz test in the integration suite, and applying the helper everywhere is strictly less code than leaving twelve other tests with the same gap. The risk is small (test-only change, no production code, no new dependencies, the helper is ~15 lines), and the benefit when a different fuzz test next flakes is exactly the same as the benefit for TestProtobufCodecFuzz today. A reviewer who prefers a narrower change can reduce the scope to the single TestProtobufCodecFuzz call site without functional impact, but the wider change is the recommended shape.

Reproducing a failed seed

When a CI run fails, copy the seed from the integration fuzz random seed: <N> log line and replay locally:

FUZZ_SEED=12345 go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=1 ./integration/... -run "^TestProtobufCodecFuzz$"

Substitute the actual seed and the actual test name. The same FUZZ_SEED works for every fuzz test in the suite because they all draw from newFuzzRand(t) now.
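The replay relies on math/rand producing the same sequence for the same seed. A minimal standalone sketch of that property follows; the draw and equal functions are hypothetical illustrations and not part of the PR, and 12345 is a placeholder seed. It only shows that the generator is deterministic. Reproducing the exact query set additionally assumes the test draws from the generator in the same order and runs on the same Go version.

```go
package main

import (
	"fmt"
	"math/rand"
)

// draw returns the first n values produced by a generator seeded with seed,
// constructed the same way newFuzzRand constructs its *rand.Rand.
func draw(seed int64, n int) []int64 {
	rnd := rand.New(rand.NewSource(seed))
	out := make([]int64, n)
	for i := range out {
		out[i] = rnd.Int63()
	}
	return out
}

// equal reports whether two draws match element for element.
func equal(a, b []int64) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	const seed = 12345 // placeholder; substitute the seed from the CI log
	fmt.Println("same seed, same draws:", equal(draw(seed, 8), draw(seed, 8)))
}
```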

Which issue(s) this PR fixes

Fixes #7548

This is the final PR in a four-PR series filed against issues #7545–#7548, complementing PR #7544 (which fixes #7543). The full series:

PR #7544: TestParquetFuzz topk/bottomk tie-collapse flake (fixes #7543)
PR 1 of 4: TestExpandedPostingsCacheFuzz inverted loop + sres1, sres1 typo (fixes #7545)
PR 2 of 4: sameErrorClass canonicalizer for non-deterministic bracket-list error messages (fixes #7546)
PR 3 of 4: hasOrVectorFallback predicate against or vector(...) divergences in TestVerticalShardingFuzz (fixes #7547)
PR 4 of 4 (this one): newFuzzRand helper to log + override fuzz seeds (fixes #7548)

Together the four PRs address the four open integration_query_fuzz flake issues with the minimum-surface fix that meaningfully advances each one. #7548 is intentionally an observability fix rather than a behaviour fix because the failing-case data needed to classify the underlying flake does not currently exist; this PR collects it.

Checklist

CHANGELOG.md updated — not applicable; test-only change with no user-facing behaviour change.
Documentation updated — not applicable; no flags or config changed. The FUZZ_SEED env var is documented in the helper's godoc comment.
Tests: no new test added. The helper is exercised by every fuzz test in the suite, and its branches (default, valid override, invalid override) are simple enough that a unit test would mostly re-state the implementation. If reviewers prefer one, a TestNewFuzzRand with three subtests can be added in a follow-up.
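
As an illustration of the three branches mentioned above (default, valid override, invalid override), here is a minimal sketch. It assumes the seed selection were factored into a hypothetical pure function, pickSeed, which does not exist in the PR; the real helper reads FUZZ_SEED via os.Getenv and logs through t.Logf. It is written as a standalone program so it runs without the integration harness; in the repository the same table could drive t.Run subtests.

```go
package main

import (
	"fmt"
	"strconv"
)

// pickSeed is a hypothetical, side-effect-free restatement of the seed
// selection in newFuzzRand. raw is the FUZZ_SEED value ("" when unset) and
// fallback stands in for time.Now().Unix(). It is not part of the PR.
func pickSeed(raw string, fallback int64) (seed int64, overridden bool, err error) {
	if raw == "" {
		// Default branch: no override requested.
		return fallback, false, nil
	}
	parsed, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		// Invalid override: keep the fallback and return the parse error
		// so the caller can log it instead of failing silently.
		return fallback, false, err
	}
	// Valid override: the parsed value pins the run.
	return parsed, true, nil
}

func main() {
	const fallback = int64(1700000000) // placeholder for time.Now().Unix()
	cases := []struct{ name, raw string }{
		{"default", ""},
		{"valid override", "42"},
		{"invalid override", "notanint"},
	}
	for _, c := range cases {
		seed, overridden, err := pickSeed(c.raw, fallback)
		fmt.Printf("%s: seed=%d overridden=%t err=%v\n", c.name, seed, overridden, err)
	}
}
```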

Test plan

Local (no Docker harness required):

gofmt -l integration/query_fuzz_test.go integration/parquet_querier_test.go — clean
goimports -local github.com/cortexproject/cortex -l integration/query_fuzz_test.go integration/parquet_querier_test.go — clean
go vet -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean
go build -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean

Validation that seed logging actually appears (requires Docker):

make ./cmd/cortex/.uptodate
go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=1 ./integration/... -run "^TestProtobufCodecFuzz$" — confirm the integration fuzz random seed: <N> (override with FUZZ_SEED env var) line appears in the log.
FUZZ_SEED=42 go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=1 ./integration/... -run "^TestProtobufCodecFuzz$" — confirm the override path logs overridden to 42 via FUZZ_SEED and the run is deterministic (two invocations with the same FUZZ_SEED exercise the same query set).
FUZZ_SEED=notanint go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=1 ./integration/... -run "^TestProtobufCodecFuzz$" — confirm the invalid-input path logs ignoring invalid FUZZ_SEED="notanint" and falls back to the time-based seed without skipping the test.

Adds a newFuzzRand(t *testing.T) *rand.Rand helper that logs the random seed every fuzz test draws (via t.Logf), and replaces 13 ad-hoc rand.New(rand.NewSource(time.Now().Unix())) call sites across integration/query_fuzz_test.go (11 sites) and integration/parquet_querier_test.go (2 sites) with calls to it. A FUZZ_SEED environment variable overrides the default time-based seed so a failing CI run can be replayed locally byte-for-byte.

This is a pure observability change. It does not fix the underlying TestProtobufCodecFuzz flake tracked in cortexproject#7548 -- it makes the next failure classifiable. The current code silently consumes the seed via rand.New(rand.NewSource(now.Unix())) and never surfaces it in the test log, so the failing query/series combination cannot be regenerated and every failure has to be diagnosed from the comparison output alone. That is insufficient to distinguish real Cortex/Prometheus divergence from known-class non-determinism from infra noise. Logging the seed (and accepting an override) is the minimum surface needed to make the next failure reproducible.

Invalid FUZZ_SEED values are logged rather than silently falling back, so a typo in CI configuration surfaces immediately instead of masquerading as a successful override.

The helper is applied to all 13 fuzz call sites, not just TestProtobufCodecFuzz, because the underlying observability gap is identical for every fuzz test in the integration suite and applying the helper everywhere is strictly less code than leaving twelve other tests with the same gap.

This is the final PR in a four-PR series filed against issues cortexproject#7545-cortexproject#7548, complementing PR cortexproject#7544 (which fixes cortexproject#7543).

Fixes cortexproject#7548

Signed-off-by: Sandy Chen <ychen@monoidtech.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

pull-request-size Bot added the size/M label May 22, 2026

dosubot Bot added the type/observability and type/tests labels May 22, 2026

sandy2008 wants to merge 1 commit into cortexproject:master from sandy2008:fix-protobuf-fuzz-observability
