[SPARK-55978][SQL][FOLLOWUP] Don't block V2 join pushdown when pushed Sample has fraction=1 without replacement#56031
[SPARK-55978][SQL][FOLLOWUP] Don't block V2 join pushdown when pushed Sample has fraction=1 without replacement#56031cloud-fan wants to merge 3 commits into
Conversation
… Sample has fraction=1 ### What changes were proposed in this pull request? Followup to apache#54972. Narrow the `V2ScanRelationPushDown` join-pushdown guard so it only blocks pushdown when at least one side has a pushed `Sample` with fraction < 1. At fraction = 1 the sample is a no-op on the result set, so dropping it inside the merged scan builder is safe. ### Why are the changes needed? The guard added in SPARK-55978 exists because the merged scan builder for `SupportsPushDownJoin` cannot carry a pushed sample and would silently discard it. The hazard is *silent result change*. At fraction = 1, no rows are excluded, so dropping the sample changes nothing observable. The current guard is therefore stricter than its rationale requires, and unnecessarily skips join pushdown for queries that land at `TABLESAMPLE SYSTEM (100 PERCENT)` (parameterized queries, query generators, environment-tuned fractions). ### Does this PR introduce _any_ user-facing change? No behavior change for queries with fraction < 1. For queries where the pushed sample has fraction = 1, join pushdown now proceeds — same result set, faster plan. ### How was this patch tested? - Existing `"join pushdown is skipped when a side has a pushed sample"` test moved from `100 PERCENT` to `50 PERCENT` so it keeps exercising the (still-active) fraction < 1 branch of the guard. - New `"100% SYSTEM sample does not block join pushdown"` test asserts the new fraction = 1 short-circuit, locking in the contract. ### Was this patch authored or co-authored using generative AI tooling? No.
| // fraction < 1, because the merged scan builder would silently | ||
| // discard it and change the result. At fraction = 1 the sample is | ||
| // a no-op on the result set, so dropping it is safe. | ||
| // TODO(SPARK-56504): Extend SupportsPushDownJoin to accept pushed |
There was a problem hiding this comment.
looks like this is @stanyao 's todo JIRA, can we make this pr target that and remove the TODO?
szehon-ho
left a comment
There was a problem hiding this comment.
Thanks for the follow-up — narrowing the guard to fraction < 1 matches the original SPARK-55978 rationale well (avoid silently discarding a sample that actually filters rows).
Assumption (full table scan): This PR relies on the assumption that a pushed sample with upperBound - lowerBound >= 1.0 is a no-op on the result row set — i.e. equivalent to reading the full table without sampling — so it is safe when join pushdown drops/discards the sample (merged scan builder cannot carry it; JDBC join subqueries omit TABLESAMPLE, etc.). That is reasonable for SQL TABLESAMPLE at 100% (withReplacement = false, lowerBound = 0) and for the motivating SYSTEM (100 PERCENT) case, but it is still an assumption: execution becomes "scan without sample," not "scan with 100% sample," and correctness depends on connectors/dialects treating 100% as including all rows/blocks (especially SYSTEM, which is implementation-defined).
Tests: Moving the blocking case to 50 PERCENT and adding the 100 PERCENT join-pushed test look good. Optional follow-ups: TABLESAMPLE (100 PERCENT) (Bernoulli), and asserting row results match the no-sample join.
Left an inline note on withReplacement.
| // TODO(SPARK-56504): Extend SupportsPushDownJoin to accept pushed | ||
| // samples so sources supporting both can handle the composition. | ||
| leftHolder.pushedSample.isEmpty && rightHolder.pushedSample.isEmpty && | ||
| leftHolder.pushedSample.forall(s => s.upperBound - s.lowerBound >= 1.0) && |
There was a problem hiding this comment.
upperBound - lowerBound >= 1.0 does not check withReplacement. With withReplacement = true, SampleExec uses PoissonSampler: even at rate 1.0, each input row can be emitted 0, 1, 2, … times, so discarding the pushed sample during join pushdown is not the same as a full scan. SQL TABLESAMPLE always sets withReplacement = false, but DataFrame.sample(withReplacement = true, fraction = 1.0) can be pushed to DSv2.
Suggest tightening the guard, e.g.:
!s.withReplacement && s.upperBound - s.lowerBound >= 1.0(optionally also s.lowerBound <= RandomSampler.roundingEpsilon for clarity).
A sample with withReplacement=true at fraction 1.0 uses PoissonSampler and is not a no-op (each row can emit 0, 1, 2, ... times). Tighten the guard so only !withReplacement, fraction >= 1.0 samples short-circuit the join-pushdown block.
…hdown Locks in the new !s.withReplacement guard via the DataFrame.sample API (SQL TABLESAMPLE can't set withReplacement=true).
| // Do not push down join if either side has a pushed sample with | ||
| // fraction < 1, because the merged scan builder would silently | ||
| // discard it and change the result. At fraction = 1 without | ||
| // replacement the sample is a no-op on the result set, so dropping |
There was a problem hiding this comment.
could you update the PR description to include the phrase "without replacement"? otherwise the words "At fraction = 1 the sample is a no-op on the result set" sounds incorrect
|
thanks for review, merging to master/4.x/4.2! |
… Sample has fraction=1 without replacement ### What changes were proposed in this pull request? Followup to #54972. Narrow the `V2ScanRelationPushDown` join-pushdown guard so it only blocks pushdown when at least one side has a pushed `Sample` that can actually change the result set. A pushed sample is treated as a no-op when **both** `withReplacement = false` **and** `upperBound - lowerBound >= 1.0` (fraction = 1). In that case dropping the sample inside the merged scan builder is safe. ### Why are the changes needed? The guard added in SPARK-55978 exists because the merged scan builder for `SupportsPushDownJoin` cannot carry a pushed sample and would silently discard it. The hazard is *silent result change*. For a `Sample` without replacement at fraction = 1, no rows are excluded, so dropping the sample changes nothing observable. The current guard is therefore stricter than its rationale requires, and unnecessarily skips join pushdown for queries that land at `TABLESAMPLE SYSTEM (100 PERCENT)` (parameterized queries, query generators, environment-tuned fractions). With replacement is a different story: `SampleExec` uses `PoissonSampler`, and even at rate 1.0 each input row can be emitted 0, 1, 2, … times, so it is **not** a no-op. SQL `TABLESAMPLE` always sets `withReplacement = false`, but `DataFrame.sample(withReplacement = true, fraction = 1.0)` can be pushed to DSv2, so the guard must keep blocking that case. ### Does this PR introduce _any_ user-facing change? No behavior change for queries with fraction < 1, or for `withReplacement = true` at any fraction. For queries where the pushed sample has `withReplacement = false` and fraction = 1, join pushdown now proceeds — same result set, faster plan. ### How was this patch tested? - Existing `"join pushdown is skipped when a side has a pushed sample"` test moved from `100 PERCENT` to `50 PERCENT` so it keeps exercising the (still-active) fraction < 1 branch of the guard. - New `"100% SYSTEM sample does not block join pushdown"` test asserts the new fraction = 1, without-replacement short-circuit. - New `"with-replacement sample blocks join pushdown even at fraction 1"` test uses the DataFrame API to assert that Poisson sampling at fraction 1 keeps blocking join pushdown. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #56031 from cloud-fan/SPARK-55978-followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e1babc6) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… Sample has fraction=1 without replacement ### What changes were proposed in this pull request? Followup to #54972. Narrow the `V2ScanRelationPushDown` join-pushdown guard so it only blocks pushdown when at least one side has a pushed `Sample` that can actually change the result set. A pushed sample is treated as a no-op when **both** `withReplacement = false` **and** `upperBound - lowerBound >= 1.0` (fraction = 1). In that case dropping the sample inside the merged scan builder is safe. ### Why are the changes needed? The guard added in SPARK-55978 exists because the merged scan builder for `SupportsPushDownJoin` cannot carry a pushed sample and would silently discard it. The hazard is *silent result change*. For a `Sample` without replacement at fraction = 1, no rows are excluded, so dropping the sample changes nothing observable. The current guard is therefore stricter than its rationale requires, and unnecessarily skips join pushdown for queries that land at `TABLESAMPLE SYSTEM (100 PERCENT)` (parameterized queries, query generators, environment-tuned fractions). With replacement is a different story: `SampleExec` uses `PoissonSampler`, and even at rate 1.0 each input row can be emitted 0, 1, 2, … times, so it is **not** a no-op. SQL `TABLESAMPLE` always sets `withReplacement = false`, but `DataFrame.sample(withReplacement = true, fraction = 1.0)` can be pushed to DSv2, so the guard must keep blocking that case. ### Does this PR introduce _any_ user-facing change? No behavior change for queries with fraction < 1, or for `withReplacement = true` at any fraction. For queries where the pushed sample has `withReplacement = false` and fraction = 1, join pushdown now proceeds — same result set, faster plan. ### How was this patch tested? - Existing `"join pushdown is skipped when a side has a pushed sample"` test moved from `100 PERCENT` to `50 PERCENT` so it keeps exercising the (still-active) fraction < 1 branch of the guard. - New `"100% SYSTEM sample does not block join pushdown"` test asserts the new fraction = 1, without-replacement short-circuit. - New `"with-replacement sample blocks join pushdown even at fraction 1"` test uses the DataFrame API to assert that Poisson sampling at fraction 1 keeps blocking join pushdown. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #56031 from cloud-fan/SPARK-55978-followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e1babc6) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Followup to #54972.
Narrow the
V2ScanRelationPushDownjoin-pushdown guard so it only blocks pushdown when at least one side has a pushedSamplethat can actually change the result set. A pushed sample is treated as a no-op when bothwithReplacement = falseandupperBound - lowerBound >= 1.0(fraction = 1). In that case dropping the sample inside the merged scan builder is safe.Why are the changes needed?
The guard added in SPARK-55978 exists because the merged scan builder for
SupportsPushDownJoincannot carry a pushed sample and would silently discard it. The hazard is silent result change. For aSamplewithout replacement at fraction = 1, no rows are excluded, so dropping the sample changes nothing observable. The current guard is therefore stricter than its rationale requires, and unnecessarily skips join pushdown for queries that land atTABLESAMPLE SYSTEM (100 PERCENT)(parameterized queries, query generators, environment-tuned fractions).With replacement is a different story:
SampleExecusesPoissonSampler, and even at rate 1.0 each input row can be emitted 0, 1, 2, … times, so it is not a no-op. SQLTABLESAMPLEalways setswithReplacement = false, butDataFrame.sample(withReplacement = true, fraction = 1.0)can be pushed to DSv2, so the guard must keep blocking that case.Does this PR introduce any user-facing change?
No behavior change for queries with fraction < 1, or for
withReplacement = trueat any fraction. For queries where the pushed sample haswithReplacement = falseand fraction = 1, join pushdown now proceeds — same result set, faster plan.How was this patch tested?
"join pushdown is skipped when a side has a pushed sample"test moved from100 PERCENTto50 PERCENTso it keeps exercising the (still-active) fraction < 1 branch of the guard."100% SYSTEM sample does not block join pushdown"test asserts the new fraction = 1, without-replacement short-circuit."with-replacement sample blocks join pushdown even at fraction 1"test uses the DataFrame API to assert that Poisson sampling at fraction 1 keeps blocking join pushdown.Was this patch authored or co-authored using generative AI tooling?
No.