Summary
TestParquetFuzz's promqlsmith opts do not include WithEnabledAggrs(enabledAggrs), so the random query generator still emits topk / bottomk queries against this test. Combined with the highly tie-prone data values produced by e2e.CreateBlock (float64(i+j) with i ∈ [0,19], j ∈ [0,59] → values 0–78 with massive overlap), and the inherent non-determinism of topk/bottomk tie-breaking between Cortex's parquet path and standalone Prometheus, the result is a recurring "1 test cases failed" flake at the "# of samples mismatch" line.
The existing sampleNumComparer was meant to be the relaxation for this case, but it only compares total sample count across all output series. When two engines pick different tied series at different timestamps, the chosen winners have different time-window coverage downstream of topk, so total counts still diverge — and the assertion fires.
This is not a Cortex / parquet correctness bug — it's a test-side issue. The majority of fuzz tests in query_fuzz_test.go (9 of 12) already pass WithEnabledAggrs(enabledAggrs) to suppress these specific aggregators; TestParquetFuzz is one of three that omits it (the other two being TestStoreGatewayLazyExpandedPostingsSeriesFuzz and TestStoreGatewayLazyExpandedPostingsSeriesFuzzWithPrometheus, which haven't surfaced as flakes in the recent 18-day CI window but share the same theoretical hole).
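The overlap in generated values mentioned in the summary can be checked with a small sketch. It assumes only the float64(i+j) scheme and the index ranges quoted above; valueHistogram and its parameters are hypothetical names and are not taken from e2e.CreateBlock.

```go
package main

import "fmt"

// valueHistogram counts how often each sample value occurs when series i
// at step j carries the value float64(i+j). This mirrors the scheme the
// issue attributes to e2e.CreateBlock; the function itself is illustrative.
func valueHistogram(numSeries, numSteps int) map[float64]int {
	hist := make(map[float64]int)
	for i := 0; i < numSeries; i++ {
		for j := 0; j < numSteps; j++ {
			hist[float64(i+j)]++
		}
	}
	return hist
}

func main() {
	hist := valueHistogram(20, 60)
	total := 0
	for _, n := range hist {
		total += n
	}
	// Many samples share few distinct values.
	fmt.Printf("samples=%d distinct=%d\n", total, len(hist))
}
```

With 20 series and 60 steps this gives 1200 samples spread over only 79 distinct values (0 through 78), so value-preserving expressions have many equal inputs to work with.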
Most recent occurrence
ubuntu-24.04-arm, arm64, build tag integration_query_fuzz
Failure excerpt
res1 and res2 selected different tied series (different status_code values) because the inner sub-expression (a or b) % a produces value 0 for every surviving sample → topk(1, …) has no canonical winner.
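A minimal sketch of why the modulo collapses every surviving sample to the same value: each sample that survives (a or b) % a is matched against itself, so the result is x % x. The sketch uses Go's math.Mod as a stand-in for PromQL's % on floats (an assumption), and allSelfModuloZero is a hypothetical helper. The optional eps term models adding a small per-series offset to the generated values.

```go
package main

import (
	"fmt"
	"math"
)

// allSelfModuloZero reports whether v % v is exactly 0 for every generated
// sample value v = float64(i+j) + eps*float64(i).
func allSelfModuloZero(numSeries, numSteps int, eps float64) bool {
	for i := 0; i < numSeries; i++ {
		for j := 0; j < numSteps; j++ {
			v := float64(i+j) + eps*float64(i)
			if v == 0 {
				continue // 0 % 0 is NaN rather than 0; skipped in this sketch
			}
			if math.Mod(v, v) != 0 {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(allSelfModuloZero(20, 60, 0))    // plain float64(i+j)
	fmt.Println(allSelfModuloZero(20, 60, 1e-9)) // with a per-series epsilon
}
```

Both calls report true: perturbing the inputs does not change the fact that a value modulo itself is 0, so the candidates handed to topk stay tied.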
Empirical flake rate (last 18 days, ~245 CI runs across master + PRs)
| Test | PR-CI occurrences | Master-CI occurrences |
| --- | --- | --- |
| TestParquetFuzz (sample-count mismatch) | 5 | 0 |
| integration_query_fuzz job (any test) | 15 | 0 |
TestParquetFuzz failure rate: ~2% of PR CI runs, 0% on master.
Arch split for TestParquetFuzz failures: 4/5 arm64, 1/5 amd64. arm64 runners are 10–35% slower per fuzz test, which likely widens race / iteration-order divergence windows; parquet-go also has amd64-only SIMD (vendor/github.com/parquet-go/parquet-go/*_amd64.s), leaving arm64 on the pure-Go fallback. These are hypotheses for the skew; neither was independently proven in this investigation.
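To put the ~2% rate in perspective, the sketch below estimates how likely a run of clean iterations is even if nothing has been fixed. It assumes independent runs that each fail with probability p, which is a simplification; probAllPass is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// probAllPass returns the probability that n independent runs all pass
// when each run fails with probability p.
func probAllPass(p float64, n int) float64 {
	return math.Pow(1-p, float64(n))
}

func main() {
	for _, n := range []int{50, 100, 200} {
		fmt.Printf("n=%d: P(no failure by chance) = %.3f\n", n, probAllPass(0.02, n))
	}
}
```

Under this assumption, 50 clean runs would still occur by chance roughly a third of the time with the flake unfixed, whereas 200 clean runs would occur by chance less than 2% of the time.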
Root cause
integration/parquet_querier_test.go:172-175 passes only WithEnabledFunctions(enabledFunctions) to promqlsmith.New(…); it does not pass WithEnabledAggrs(enabledAggrs). Therefore the generator uses the promqlsmith default aggregator set (vendor/github.com/cortexproject/promqlsmith/opts.go:22-35), which includes TOPK, BOTTOMK, COUNT_VALUES, STDDEV, STDVAR, QUANTILE.
isValidQuery(skipBackwardIncompat=true) (called via skipStdAggregations=true by this test, see integration/query_fuzz_test.go:1983-2020) filters out generated query strings containing stddev, stdvar, quantile, limitk, limit_ratio, but not topk, bottomk, or count_values.
topk/bottomk ties are resolved by the upstream Prometheus engine via strict > (see vendor/github.com/prometheus/prometheus/promql/engine.go, topkHeap insertion), so the first-encountered tied series wins. Whether Cortex's parquet path and standalone Prometheus encounter the series in the same order depends on storage iteration; in practice they differ.
sampleNumComparer (integration/query_fuzz_test.go:897-925) compares only total Matrix sample count. Different tied winners → different downstream time coverage → different total counts.
Sibling test TestParquetProjectionPushdownFuzz is skipped with t.Skip("Disabled due to flakiness"); it uses hardcoded queries (not promqlsmith), so its skip has a separate cause that this issue does not address.
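The interaction between strict > tie-breaking and iteration order can be sketched as follows. This is a simplified k=1 loop, not the Prometheus topkHeap code; the series type, top1, countWithLabel, and constSeries are hypothetical, and countWithLabel stands in for any downstream step that keeps only samples from particular series.

```go
package main

import "fmt"

// series is a hypothetical stand-in for one time series.
type series struct {
	label string
	vals  map[int64]float64 // timestamp -> value
}

// top1 mimics topk(1, ...) with a strict > comparison: at each timestamp
// the first series encountered wins unless a later one is strictly greater.
// It returns the winning label per timestamp.
func top1(in []series, timestamps []int64) map[int64]string {
	winners := make(map[int64]string)
	for _, ts := range timestamps {
		found := false
		var best float64
		for _, s := range in {
			v, ok := s.vals[ts]
			if !ok {
				continue
			}
			if !found || v > best {
				found, best = true, v
				winners[ts] = s.label
			}
		}
	}
	return winners
}

// countWithLabel models a downstream step that keeps only one series.
func countWithLabel(winners map[int64]string, label string) int {
	n := 0
	for _, l := range winners {
		if l == label {
			n++
		}
	}
	return n
}

// constSeries builds a series whose value is 0 on [from, to], as after x % x.
func constSeries(label string, from, to int64) series {
	s := series{label: label, vals: map[int64]float64{}}
	for ts := from; ts <= to; ts++ {
		s.vals[ts] = 0
	}
	return s
}

func main() {
	a := constSeries("status_code=200", 0, 4)
	b := constSeries("status_code=500", 2, 6)
	ts := []int64{0, 1, 2, 3, 4, 5, 6}

	ab := top1([]series{a, b}, ts) // one engine sees a first
	ba := top1([]series{b, a}, ts) // the other sees b first

	fmt.Println(len(ab), len(ba))
	fmt.Println(countWithLabel(ab, a.label), countWithLabel(ba, a.label))
}
```

In this sketch both orders yield one winner per timestamp (7 each), but the first series wins 5 timestamps in one order and 2 in the other, so anything downstream that depends on which series won sees a different number of samples.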
Proposed fix
Primary (~15 lines, matches sibling-test precedent):
Pass promqlsmith.WithEnabledAggrs(enabledAggrs) to the opts in integration/parquet_querier_test.go:172-175.
This matches the pattern used by TestNativeHistogramFuzz, TestExperimentalPromQLFuncsWithPrometheus, TestDisableChunkTrimmingFuzz, TestExpandedPostingsCacheFuzz, TestVerticalShardingFuzz, TestProtobufCodecFuzz, TestBackwardCompatibilityQueryFuzz, TestPrometheusCompatibilityQueryFuzz, and TestRW1vsRW2QueryFuzz. enabledAggrs is already defined at integration/query_fuzz_test.go:44-46 as {SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE}.
Trade-off accepted
Loses random fuzz coverage of topk/bottomk (and count_values, stddev, stdvar) against the parquet path. Same trade-off already accepted by most other fuzz tests in the suite.
If topk/bottomk coverage of parquet is desired in the future, write a deterministic dedicated test rather than expanding sampleNumComparer's relaxation; the comparator cannot normalize away time-window coverage drift induced by tie-break choice without effectively re-implementing topk semantics.
Why simpler/alternative fixes don't work
"Inject per-series epsilon in data generation" (e.g. change e2e/util.go:384 from float64(i+j) to float64(i+j) + 1e-9*float64(i)) — does not fix this case, because the inner expression ({a} or {b}) % {a} produces exact 0 for every surviving sample regardless of input values. Modulo always produces ties.
"Strengthen sampleNumComparer to count per-series buckets" — still does not normalize for time-window divergence between different chosen ties.
"Force sortSeries=true in parquet Select" — production behavior change to satisfy a test; off-table.
Not addressed by this issue (separate flakes)
The same integration_query_fuzz job hits other flaky tests with different root causes; this issue should not try to subsume them. Each likely needs a separate report/fix:
| Test | Distinct root cause (rough) |
| --- | --- |
| TestExpandedPostingsCacheFuzz | Data-freshness race: res1 = NaN, res2 = values (one Cortex instance has not yet ingested the iteration's new push). |
| TestPrometheusCompatibilityQueryFuzz / TestExperimentalPromQLFuncsWithPrometheus | Error-string comparator: same error type with [A,B] vs [B,A] list order from non-deterministic map iteration in the error message. |
| TestVerticalShardingFuzz | Semantic divergence: … or vector(…) fallback fires in the unsharded engine but not all shards of the sharded engine. |
| TestProtobufCodecFuzz | Unknown / not enough samples to classify. |
(arm64 skew across the job is hypothesized to come from ~10–35% slower runners + amd64-only SIMD in parquet-go, which would widen race windows for all of the above. Not independently proven.)
Acceptance criteria
After the fix, observe no TestParquetFuzz "# of samples mismatch" failures across a representative local-run sample (e.g. ≥200 iterations on each of arm64 and amd64; given the ~2% per-CI-run rate, fewer iterations are statistically inconclusive).
No reduction in coverage of parquet-specific paths (the test still exercises parquet via SUM/MIN/MAX/AVG/GROUP/COUNT/QUANTILE aggregators and the full function set).
TestParquetProjectionPushdownFuzz remains skipped (separate issue).
Filed after a 3-round multi-agent investigation; full notes available on request.