[SPARK-41220][SQL] Range partitioner sample supports column pruning #38756
What changes were proposed in this pull request?
Make RangePartitioning take a new parameter, planForSample, so that it can be passed to ShuffleExchangeExec.
Add a new rule, OptimizeSampleForRangePartitioning, that infers planForSample during the preparations phase when doing so is beneficial.
Why are the changes needed?
When doing a global sort or a repartition by range, we first sample the data to get the range bounds, then use the range partitioner to do the shuffle exchange.
The issue is that the plan used for sampling is coupled with the original query. The sample only needs the columns in the sort order, but the original query plan outputs all data columns. So we can apply column pruning to the sample plan and fetch only the ordering columns.
A common example is:
OPTIMIZE table ZORDER BY columns
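To make the idea concrete, here is a toy sketch in plain Python. It is not Spark code and not the implementation in this PR: the Table class, range_bounds, partition_id, and the column names are all hypothetical names invented for illustration. It shows only that the range bounds depend on the ordering column, so the sampling scan can skip every other column.

```python
import bisect
import random


class Table:
    """Toy columnar table (hypothetical) that counts the cell values each scan reads."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values
        self.cells_read = 0

    def scan(self, names):
        """Yield one tuple per row, reading only the requested columns."""
        selected = [self.columns[n] for n in names]
        for values in zip(*selected):
            self.cells_read += len(values)
            yield values


def range_bounds(table, order_cols, num_partitions, fraction=0.5, seed=0):
    """Sample only the ordering columns and derive up to num_partitions - 1 bounds."""
    rng = random.Random(seed)
    sample = sorted(k for k in table.scan(order_cols) if rng.random() < fraction)
    if not sample:
        return []
    step = len(sample) / num_partitions
    return [sample[min(int(step * i), len(sample) - 1)]
            for i in range(1, num_partitions)]


def partition_id(bounds, key):
    """Return the index of the range partition that key falls into."""
    return bisect.bisect_left(bounds, key)


# Usage: three columns exist, but the sampling pass touches only "id".
table = Table({
    "id": list(range(100)),
    "payload_a": ["x"] * 100,
    "payload_b": ["y"] * 100,
})
bounds = range_bounds(table, ["id"], num_partitions=4)
print(bounds, table.cells_read)  # cells_read is 100; a full-width scan would read 300
```

In this sketch the pruned sampling scan reads one value per row instead of three. The real saving in Spark depends on the data source and on how wide the original plan is.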
Does this PR introduce any user-facing change?
No. It only improves performance.
How was this patch tested?
Added tests.
Performance test:
After column pruning, the sample stage dropped from 7s to 0.4s, and the whole query was 1.4X faster.