[SPARK-56392][SQL] Make Sample.seed Optional to distinguish user-specified vs random seeds by rahulketch · Pull Request #55261 · apache/spark

rahulketch · 2026-04-08T13:09:15Z

What changes were proposed in this pull request?

Change Sample.seed from Long to Option[Long] so that Spark can distinguish between user-specified seeds and system-generated random seeds.

Sample.seed type changed from Long to Option[Long]. Some(seed) means the user explicitly specified a seed (via SQL REPEATABLE clause or the programmatic sample(fraction, seed) API). None means no seed was specified.
Added a Sample companion object with a backwards-compatible apply(... seed: Long ...) overload that wraps the seed in Some, so all existing callers continue to compile unchanged.
The SQL parser now produces Some(seed) when a REPEATABLE (seed) clause is present, and None otherwise (instead of eagerly generating a random seed).
SampleExec resolves None to a random seed lazily via a new resolvedSeed field.
SparkConnectPlanner passes Some(seed) when the proto message has a seed, None otherwise.
Dataset.sample(fraction) and Dataset.sample(withReplacement, fraction) (the no-seed overloads) now pass None directly instead of generating a random seed upfront.
V2ScanRelationPushDown resolves None to a random seed when pushing down to DSV2 connectors, preserving existing behavior.

Why are the changes needed?

Previously, the parser always generated a random seed when no REPEATABLE clause was present, making it impossible for downstream components to know whether the seed was explicitly requested by the user. This distinction is important for correctness — for example, TABLESAMPLE (x PERCENT) REPEATABLE (seed) relies on deterministic row ordering within partitions, which may require disabling optimizations like out-of-order file processing. Without Option[Long], there is no way to know at the physical plan level whether ordering guarantees are needed.

Does this PR introduce any user-facing change?

No. The Sample companion object provides a backwards-compatible apply that accepts Long, so all existing code continues to work unchanged. The runtime behavior for both seeded and unseeded samples is preserved.

How was this patch tested?

Updated PlanParserSuite to verify that:

TABLESAMPLE without REPEATABLE produces seed = None
TABLESAMPLE ... REPEATABLE (seed) produces seed = Some(seed)

Added new test cases for the REPEATABLE clause with both percent and bucket sampling.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

…ified vs random seeds Change `Sample.seed` from `Long` to `Option[Long]` so that Spark can distinguish between user-specified seeds (SQL `REPEATABLE` clause or programmatic API) and system-generated random seeds. Previously, the parser always generated a random seed when no `REPEATABLE` clause was present, making it impossible for downstream components to know whether the seed was explicitly requested by the user. This distinction is important for optimizations that depend on whether sampling must be deterministic. Changes: - `Sample.seed` type changed from `Long` to `Option[Long]` - Added `Sample` companion object with backwards-compatible `apply(... seed: Long ...)` overload - Parser produces `Some(seed)` for `REPEATABLE`, `None` otherwise - `SampleExec` resolves `None` to a random seed lazily via `resolvedSeed` - `SparkConnectPlanner` passes `Some(seed)` when proto has seed, `None` otherwise - `Dataset.sample(fraction)` and `sample(withReplacement, fraction)` pass `None` - Updated tests to verify `None` vs `Some(seed)` behavior Co-authored-by: Isaac

cloud-fan · 2026-04-08T17:02:58Z

the timeout issue is unrelated, thanks, merging to master!

szehon-ho

late lgtm

cloud-fan closed this in 0eac893 Apr 8, 2026

szehon-ho reviewed Apr 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-56392][SQL] Make Sample.seed Optional to distinguish user-specified vs random seeds#55261

[SPARK-56392][SQL] Make Sample.seed Optional to distinguish user-specified vs random seeds#55261
rahulketch wants to merge 1 commit intoapache:masterfrom
rahulketch:sample-seed-optional

rahulketch commented Apr 8, 2026 •

edited

Loading

Uh oh!

cloud-fan commented Apr 8, 2026

Uh oh!

szehon-ho left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rahulketch commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

cloud-fan commented Apr 8, 2026

Uh oh!

szehon-ho left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rahulketch commented Apr 8, 2026 •

edited

Loading