fix(integration): log fuzz seed so failures can be reproduced #7552

Open
sandy2008 wants to merge 1 commit into cortexproject:master from sandy2008:fix-protobuf-fuzz-observability

@sandy2008
Contributor

What this PR does

Adds a small newFuzzRand(t *testing.T) *rand.Rand helper at integration/query_fuzz_test.go:1986 that logs the random seed every fuzz test draws (via t.Logf), and replaces the 13 ad-hoc rand.New(rand.NewSource(time.Now().Unix())) / rand.New(rand.NewSource(now.Unix())) call sites across integration/query_fuzz_test.go (11 sites) and integration/parquet_querier_test.go (2 sites) with calls to it. A FUZZ_SEED environment variable overrides the default time.Now().Unix() so a failing CI run can be replayed locally byte-for-byte.

This is a pure observability change. It does not fix the underlying flake tracked in #7548: TestProtobufCodecFuzz will continue to fail at whatever rate it currently does. What it changes is that the next failure log will include the exact seed used, so the run can be reproduced deterministically and classified (real Cortex/Prometheus divergence vs. another non-deterministic-set message vs. a transient harness issue).

Why

Issue #7548 documents TestProtobufCodecFuzz failures in CI. Triage is currently blocked because the seed is silently consumed by rand.New(rand.NewSource(now.Unix())) and never surfaced in the test log, so the failing query/series combination cannot be re-generated. Every failure has to be diagnosed from the comparison output alone, which is insufficient to distinguish "real bug" from "known-class non-determinism" from "infra noise". Logging the seed (and accepting an override) is the minimum surface needed to make the next failure classifiable.

The same problem applies to every fuzz test in integration/query_fuzz_test.go and integration/parquet_querier_test.go — see "Scope expansion" below.
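The reproducibility argument rests on math/rand being deterministic for a fixed seed. A minimal sketch of that property follows; queriesForSeed and the query pool are hypothetical stand-ins for a fuzz test's generator, not code from this PR:

```go
package main

import (
	"fmt"
	"math/rand"
)

// queriesForSeed is a hypothetical stand-in for a fuzz test's query
// generator: every random choice comes from one *rand.Rand built from seed.
func queriesForSeed(seed int64, pool []string, n int) []string {
	rnd := rand.New(rand.NewSource(seed))
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, pool[rnd.Intn(len(pool))])
	}
	return out
}

func main() {
	// Illustrative pool; the real tests generate queries differently.
	pool := []string{"sum(a)", "rate(b[5m])", "max by (job) (c)", "a / b"}
	const seed = 12345 // value copied from a failing run's log

	first := queriesForSeed(seed, pool, 5)
	replay := queriesForSeed(seed, pool, 5)

	fmt.Println(first)
	fmt.Println(replay) // same sequence as the line above
}
```

This only holds when every random choice in the test flows through the seeded *rand.Rand and the replay uses the same code and Go toolchain; other sources of non-determinism (timing, map iteration order) are not pinned by the seed.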

How

Two pieces:

  1. New helper at integration/query_fuzz_test.go:1986:

    // newFuzzRand returns a *rand.Rand whose seed is logged via t.Logf so failing
    // fuzz cases can be reproduced. By default the seed is time.Now().Unix();
    // setting FUZZ_SEED to a base-10 int64 overrides the default and pins the
    // run to a specific seed.
    func newFuzzRand(t *testing.T) *rand.Rand {
        seed := time.Now().Unix()
        if v := os.Getenv("FUZZ_SEED"); v != "" {
            if parsed, err := strconv.ParseInt(v, 10, 64); err == nil {
                t.Logf("integration fuzz random seed: overridden to %d via FUZZ_SEED", parsed)
                seed = parsed
            } else {
                t.Logf("integration fuzz random seed: ignoring invalid FUZZ_SEED=%q: %v", v, err)
            }
        }
        t.Logf("integration fuzz random seed: %d (override with FUZZ_SEED env var)", seed)
        return rand.New(rand.NewSource(seed))
    }

    Invalid FUZZ_SEED values are logged with the parse error, and the run then falls back to the time-based seed. A typo in CI configuration therefore shows up in the test log (printed when the test fails or when running with -v) instead of masquerading as a successful override.

  2. 13 call-site replacements — every rand.New(rand.NewSource(time.Now().Unix())) and rand.New(rand.NewSource(now.Unix())) in:

    • integration/query_fuzz_test.go: TestNativeHistogramFuzz, TestExperimentalPromQLFuncsWithPrometheus, TestDisableChunkTrimmingFuzz, TestExpandedPostingsCacheFuzz, TestVerticalShardingFuzz, TestProtobufCodecFuzz, TestStoreGatewayLazyExpandedPostingsSeriesFuzz, TestStoreGatewayLazyExpandedPostingsSeriesFuzzWithPrometheus, TestBackwardCompatibilityQueryFuzz, TestPrometheusCompatibilityQueryFuzz, TestRW1vsRW2QueryFuzz (11 sites).
    • integration/parquet_querier_test.go: TestParquetFuzz, TestParquetProjectionPushdownFuzz (2 sites). The unused "math/rand" import is also dropped from this file since the helper is in the other test file (same package).

    No production code is touched.

Scope expansion

This PR was filed against #7548 (TestProtobufCodecFuzz specifically), but the helper is applied to all 13 fuzz call sites, not just TestProtobufCodecFuzz. This is intentional: the underlying observability gap (the seed is computed from time.Now().Unix() and never logged) is identical for every fuzz test in the integration suite, and applying the helper everywhere is less code than leaving twelve other tests with the same gap. The risk is small (test-only change, no production code, no new dependencies, the helper is ~15 lines), and the benefit when a different fuzz test flakes next is exactly the same as the benefit for TestProtobufCodecFuzz today. A reviewer who prefers a narrower change can reduce the scope to the single TestProtobufCodecFuzz call site without functional impact, but the wider change is the recommended shape.

Reproducing a failed seed

When a CI run fails, copy the seed from the integration fuzz random seed: <N> log line and replay locally:

FUZZ_SEED=12345 go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=1 ./integration/... -run "^TestProtobufCodecFuzz$"

Substitute the actual seed and the actual test name. The same FUZZ_SEED works for every fuzz test in the suite because they all draw from newFuzzRand(t) now.
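Copying the seed by hand works; for scripted triage, the seed can also be pulled out of a saved log. The sketch below assumes the log line format produced by the helper's final t.Logf call; extractSeed and the sample log text are hypothetical and not part of this PR:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// seedLine matches the final seed line logged by the helper, for example
// "integration fuzz random seed: 1716400000 (override with FUZZ_SEED env var)".
// It does not match the "overridden to ..." or "ignoring invalid ..." lines,
// because those have non-digit text right after the colon.
var seedLine = regexp.MustCompile(`integration fuzz random seed: (-?\d+)`)

// extractSeed is a hypothetical helper that returns the first seed in a log.
func extractSeed(log string) (int64, bool) {
	m := seedLine.FindStringSubmatch(log)
	if m == nil {
		return 0, false
	}
	seed, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return 0, false
	}
	return seed, true
}

func main() {
	// Sample log text made up for illustration.
	log := "integration fuzz random seed: 1716400000 (override with FUZZ_SEED env var)\n"
	if seed, ok := extractSeed(log); ok {
		fmt.Printf("FUZZ_SEED=%d\n", seed) // prints "FUZZ_SEED=1716400000"
	}
}
```

If one log contains several fuzz tests, each test logs its own seed line, so a real script would need to scope the match to the failing test's output.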

Which issue(s) this PR fixes

Fixes #7548

This is the final PR in a four-PR series filed against issues #7545-#7548, complementing PR #7544 (which fixes #7543). The full series:

Together the four PRs address the four open integration_query_fuzz flake issues with the minimum-surface fix that meaningfully advances each one. #7548 is intentionally an observability fix rather than a behaviour fix because the failing-case data needed to classify the underlying flake does not currently exist; this PR collects it.

Checklist

  • CHANGELOG.md updated — not applicable; test-only change with no user-facing behaviour change.
  • Documentation updated — not applicable; no flags or config changed. The FUZZ_SEED env var is documented in the helper's godoc comment.
  • Tests: no new test added. The helper is exercised by every fuzz test in the suite, and its branches (default, valid override, invalid override) are simple enough that a unit test would mostly re-state the implementation. If reviewers prefer one, a TestNewFuzzRand with three subtests can be added in a follow-up.
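For reviewers weighing that follow-up, the three branches can be illustrated without a *testing.T by factoring the seed selection into a pure function. This is a sketch under that assumption; resolveSeed is a hypothetical name and the PR itself keeps the logic inline in newFuzzRand:

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveSeed is a hypothetical pure version of the helper's seed selection.
// env is the raw FUZZ_SEED value and fallback is the time-based default.
// It returns the seed to use, plus a non-nil error when env was set but
// could not be parsed as a base-10 int64 (the fallback is used in that case).
func resolveSeed(env string, fallback int64) (int64, error) {
	if env == "" {
		return fallback, nil
	}
	parsed, err := strconv.ParseInt(env, 10, 64)
	if err != nil {
		return fallback, fmt.Errorf("ignoring invalid FUZZ_SEED=%q: %w", env, err)
	}
	return parsed, nil
}

func main() {
	// The three branches: default, valid override, invalid override.
	for _, env := range []string{"", "42", "notanint"} {
		seed, err := resolveSeed(env, 1000)
		fmt.Printf("env=%q seed=%d err=%v\n", env, seed, err)
	}
}
```

A TestNewFuzzRand built on this shape would have one subtest per branch, asserting the returned seed and whether an error was reported.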

Test plan

Local (no Docker harness required):

  • gofmt -l integration/query_fuzz_test.go integration/parquet_querier_test.go — clean
  • goimports -local github.com/cortexproject/cortex -l integration/query_fuzz_test.go integration/parquet_querier_test.go — clean
  • go vet -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean
  • go build -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean

Validation that seed logging actually appears (requires Docker):

  • make ./cmd/cortex/.uptodate
  • go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=1 ./integration/... -run "^TestProtobufCodecFuzz$" — confirm the integration fuzz random seed: <N> (override with FUZZ_SEED env var) line appears in the log.
  • FUZZ_SEED=42 go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=1 ./integration/... -run "^TestProtobufCodecFuzz$" — confirm the override path logs overridden to 42 via FUZZ_SEED and the run is deterministic (two invocations with the same FUZZ_SEED exercise the same query set).
  • FUZZ_SEED=notanint go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=1 ./integration/... -run "^TestProtobufCodecFuzz$" — confirm the invalid-input path logs ignoring invalid FUZZ_SEED="notanint" and falls back to the time-based seed without skipping the test.

Adds a newFuzzRand(t *testing.T) *rand.Rand helper that logs the random
seed every fuzz test draws (via t.Logf), and replaces 13 ad-hoc
rand.New(rand.NewSource(time.Now().Unix())) call sites across
integration/query_fuzz_test.go (11 sites) and
integration/parquet_querier_test.go (2 sites) with calls to it. A
FUZZ_SEED environment variable overrides the default time-based seed so
a failing CI run can be replayed locally byte-for-byte.

This is a pure observability change. It does not fix the underlying
TestProtobufCodecFuzz flake tracked in cortexproject#7548 -- it makes the next
failure classifiable. The current code silently consumes the seed via
rand.New(rand.NewSource(now.Unix())) and never surfaces it in the test
log, so the failing query/series combination cannot be regenerated and
every failure has to be diagnosed from the comparison output alone.
That is insufficient to distinguish real Cortex/Prometheus divergence
from known-class non-determinism from infra noise. Logging the seed
(and accepting an override) is the minimum surface needed to make the
next failure reproducible.

Invalid FUZZ_SEED values are logged rather than silently falling back,
so a typo in CI configuration surfaces immediately instead of
masquerading as a successful override.

The helper is applied to all 13 fuzz call sites, not just
TestProtobufCodecFuzz, because the underlying observability gap is
identical for every fuzz test in the integration suite and applying
the helper everywhere is strictly less code than leaving twelve other
tests with the same gap.

This is the final PR in a four-PR series filed against issues
cortexproject#7545-cortexproject#7548, complementing PR cortexproject#7544 (which fixes cortexproject#7543).

Fixes cortexproject#7548

Signed-off-by: Sandy Chen <ychen@monoidtech.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
dosubot (bot) added the type/observability and type/tests labels on May 22, 2026