[SPARK-56392][SQL] Make Sample.seed Optional to distinguish user-specified vs random seeds #55261
Closed
rahulketch wants to merge 1 commit into apache:master from
…ified vs random seeds

Change `Sample.seed` from `Long` to `Option[Long]` so that Spark can distinguish between user-specified seeds (SQL `REPEATABLE` clause or programmatic API) and system-generated random seeds.

Previously, the parser always generated a random seed when no `REPEATABLE` clause was present, making it impossible for downstream components to know whether the seed was explicitly requested by the user. This distinction is important for optimizations that depend on whether sampling must be deterministic.

Changes:
- `Sample.seed` type changed from `Long` to `Option[Long]`
- Added `Sample` companion object with backwards-compatible `apply(... seed: Long ...)` overload
- Parser produces `Some(seed)` for `REPEATABLE`, `None` otherwise
- `SampleExec` resolves `None` to a random seed lazily via `resolvedSeed`
- `SparkConnectPlanner` passes `Some(seed)` when proto has seed, `None` otherwise
- `Dataset.sample(fraction)` and `sample(withReplacement, fraction)` pass `None`
- Updated tests to verify `None` vs `Some(seed)` behavior

Co-authored-by: Isaac
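The shape of this change can be sketched in a few lines of Scala. This is a simplified model under stated assumptions, not the Spark source: the `Sample` and `SampleExec` classes below are stand-ins that omit the child plan and everything Catalyst-specific, the wrapper object `SampleSeedSketch` is hypothetical, and using a `lazy val` for `resolvedSeed` is one plausible reading of "resolves `None` to a random seed lazily".

```scala
import scala.util.Random

// Hypothetical wrapper so the sketch compiles as a single file.
object SampleSeedSketch {

  // Stand-in for the logical Sample node. Some(seed) means the user asked
  // for that seed; None means no seed was specified.
  final case class Sample(
      lowerBound: Double,
      upperBound: Double,
      withReplacement: Boolean,
      seed: Option[Long])

  object Sample {
    // Backwards-compatible overload: callers that pass a plain Long keep
    // compiling, and the value is recorded as a user-specified seed.
    def apply(
        lowerBound: Double,
        upperBound: Double,
        withReplacement: Boolean,
        seed: Long): Sample =
      new Sample(lowerBound, upperBound, withReplacement, Some(seed))
  }

  // Stand-in for the physical operator. The random seed is drawn only when
  // first needed, and only if the user did not supply one (assumption: lazy val).
  final case class SampleExec(sample: Sample) {
    lazy val resolvedSeed: Long = sample.seed.getOrElse(Random.nextLong())
  }
}
```

With these definitions, `Sample(0.0, 0.1, false, 42L).seed` is `Some(42L)`, while `Sample(0.0, 0.1, false, None)` carries no seed until `SampleExec` resolves one. Keeping the `Option` on the logical node, rather than resolving it in the parser, is what lets later planning stages tell the two cases apart.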
the timeout issue is unrelated, thanks, merging to master!
What changes were proposed in this pull request?
Change `Sample.seed` from `Long` to `Option[Long]` so that Spark can distinguish between user-specified seeds and system-generated random seeds.

- `Sample.seed` type changed from `Long` to `Option[Long]`. `Some(seed)` means the user explicitly specified a seed (via SQL `REPEATABLE` clause or the programmatic `sample(fraction, seed)` API). `None` means no seed was specified.
- Added a `Sample` companion object with a backwards-compatible `apply(... seed: Long ...)` overload that wraps the seed in `Some`, so all existing callers continue to compile unchanged.
- The parser produces `Some(seed)` when a `REPEATABLE (seed)` clause is present, and `None` otherwise (instead of eagerly generating a random seed).
- `SampleExec` resolves `None` to a random seed lazily via a new `resolvedSeed` field.
- `SparkConnectPlanner` passes `Some(seed)` when the proto message has a seed, `None` otherwise.
- `Dataset.sample(fraction)` and `Dataset.sample(withReplacement, fraction)` (the no-seed overloads) now pass `None` directly instead of generating a random seed upfront.
- `V2ScanRelationPushDown` resolves `None` to a random seed when pushing down to DSV2 connectors, preserving existing behavior.

Why are the changes needed?
Previously, the parser always generated a random seed when no `REPEATABLE` clause was present, making it impossible for downstream components to know whether the seed was explicitly requested by the user. This distinction is important for correctness. For example, `TABLESAMPLE (x PERCENT) REPEATABLE (seed)` relies on deterministic row ordering within partitions, which may require disabling optimizations like out-of-order file processing. Without `Option[Long]`, there is no way to know at the physical plan level whether ordering guarantees are needed.

Does this PR introduce any user-facing change?
No. The `Sample` companion object provides a backwards-compatible `apply` that accepts `Long`, so all existing code continues to work unchanged. The runtime behavior for both seeded and unseeded samples is preserved.

How was this patch tested?
Updated `PlanParserSuite` to verify that:

- `TABLESAMPLE` without `REPEATABLE` produces `seed = None`
- `TABLESAMPLE ... REPEATABLE (seed)` produces `seed = Some(seed)`

Added new test cases for the `REPEATABLE` clause with both percent and bucket sampling.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.6)
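
The parser behavior that the `PlanParserSuite` changes verify can be illustrated with a toy stand-in. The sketch below is not Spark's parser: `TableSampleSeedSketch` and `parseSeed` are hypothetical names, and a regular expression replaces the real grammar. It only shows the contract that a missing `REPEATABLE` clause yields `None` and a present one yields `Some(seed)`, for both percent and bucket sampling.

```scala
// Hypothetical toy model of the seed-extraction contract; the real parser
// works on a parse tree, not on a regular expression.
object TableSampleSeedSketch {

  // Matches a REPEATABLE (<integer>) clause, case-insensitively.
  private val Repeatable = """(?i)REPEATABLE\s*\(\s*(-?\d+)\s*\)""".r

  // Returns Some(seed) when the clause is present, None otherwise.
  // No random seed is generated here; that is left to later stages.
  def parseSeed(tableSampleClause: String): Option[Long] =
    Repeatable.findFirstMatchIn(tableSampleClause).map(_.group(1).toLong)
}
```

For example, `parseSeed("TABLESAMPLE (10 PERCENT)")` returns `None`, while `parseSeed("TABLESAMPLE (BUCKET 4 OUT OF 10) REPEATABLE (7)")` returns `Some(7L)`.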