Skip to content

[SPARK-56392][SQL] Make Sample.seed Optional to distinguish user-specified vs random seeds#55261

Closed
rahulketch wants to merge 1 commit intoapache:masterfrom
rahulketch:sample-seed-optional
Closed

[SPARK-56392][SQL] Make Sample.seed Optional to distinguish user-specified vs random seeds#55261
rahulketch wants to merge 1 commit intoapache:masterfrom
rahulketch:sample-seed-optional

Conversation

@rahulketch
Copy link
Copy Markdown
Contributor

@rahulketch rahulketch commented Apr 8, 2026

What changes were proposed in this pull request?

Change Sample.seed from Long to Option[Long] so that Spark can distinguish between user-specified seeds and system-generated random seeds.

  • Sample.seed type changed from Long to Option[Long]. Some(seed) means the user explicitly specified a seed (via SQL REPEATABLE clause or the programmatic sample(fraction, seed) API). None means no seed was specified.
  • Added a Sample companion object with a backwards-compatible apply(... seed: Long ...) overload that wraps the seed in Some, so all existing callers continue to compile unchanged.
  • The SQL parser now produces Some(seed) when a REPEATABLE (seed) clause is present, and None otherwise (instead of eagerly generating a random seed).
  • SampleExec resolves None to a random seed lazily via a new resolvedSeed field.
  • SparkConnectPlanner passes Some(seed) when the proto message has a seed, None otherwise.
  • Dataset.sample(fraction) and Dataset.sample(withReplacement, fraction) (the no-seed overloads) now pass None directly instead of generating a random seed upfront.
  • V2ScanRelationPushDown resolves None to a random seed when pushing down to DSV2 connectors, preserving existing behavior.

Why are the changes needed?

Previously, the parser always generated a random seed when no REPEATABLE clause was present, making it impossible for downstream components to know whether the seed was explicitly requested by the user. This distinction is important for correctness — for example, TABLESAMPLE (x PERCENT) REPEATABLE (seed) relies on deterministic row ordering within partitions, which may require disabling optimizations like out-of-order file processing. Without Option[Long], there is no way to know at the physical plan level whether ordering guarantees are needed.

Does this PR introduce any user-facing change?

No. The Sample companion object provides a backwards-compatible apply that accepts Long, so all existing code continues to work unchanged. The runtime behavior for both seeded and unseeded samples is preserved.

How was this patch tested?

Updated PlanParserSuite to verify that:

  • TABLESAMPLE without REPEATABLE produces seed = None
  • TABLESAMPLE ... REPEATABLE (seed) produces seed = Some(seed)

Added new test cases for the REPEATABLE clause with both percent and bucket sampling.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

…ified vs random seeds

Change `Sample.seed` from `Long` to `Option[Long]` so that Spark can distinguish between user-specified seeds (SQL `REPEATABLE` clause or programmatic API) and system-generated random seeds.

Previously, the parser always generated a random seed when no `REPEATABLE` clause was present, making it impossible for downstream components to know whether the seed was explicitly requested by the user. This distinction is important for optimizations that depend on whether sampling must be deterministic.

Changes:
- `Sample.seed` type changed from `Long` to `Option[Long]`
- Added `Sample` companion object with backwards-compatible `apply(... seed: Long ...)` overload
- Parser produces `Some(seed)` for `REPEATABLE`, `None` otherwise
- `SampleExec` resolves `None` to a random seed lazily via `resolvedSeed`
- `SparkConnectPlanner` passes `Some(seed)` when proto has seed, `None` otherwise
- `Dataset.sample(fraction)` and `sample(withReplacement, fraction)` pass `None`
- Updated tests to verify `None` vs `Some(seed)` behavior

Co-authored-by: Isaac
@cloud-fan
Copy link
Copy Markdown
Contributor

the timeout issue is unrelated, thanks, merging to master!

@cloud-fan cloud-fan closed this in 0eac893 Apr 8, 2026
Copy link
Copy Markdown
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

late lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants