[SPARK-41220][SQL] Range partitioner sample supports column pruning #38756

ulysses-you · 2022-11-22T10:28:26Z

What changes were proposed in this pull request?

Make RangePartitioning take a new parameter, planForSample, so that we can pass it to ShuffleExchangeExec.

Add a new rule, OptimizeSampleForRangePartitioning, which infers planForSample during the preparations phase when doing so is beneficial.
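To make the idea of inferring a separate sample plan concrete, here is a minimal sketch on a toy plan model. Everything in it is hypothetical: the `Plan`, `Scan`, and `Project` types are not Spark's classes, and the benefit check (pruning must actually drop columns) is an assumption for illustration, not necessarily the condition used by the rule in this PR.

```scala
// Toy plan model; these types are hypothetical and are not Spark's classes.
sealed trait Plan { def output: Seq[String] }
final case class Scan(table: String, output: Seq[String]) extends Plan
final case class Project(columns: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = columns
}

object SamplePlanSketch {
  // Push the projection down to the scan so that only the needed columns are read.
  // Assumes every needed column is present in the plan's output.
  def prune(plan: Plan, needed: Seq[String]): Plan = plan match {
    case Scan(table, out)  => Scan(table, out.filter(needed.contains))
    case Project(_, child) => Project(needed, prune(child, needed))
  }

  // Assumed benefit check: build a separate sample plan only if pruning drops columns.
  def planForSample(plan: Plan, orderingColumns: Seq[String]): Option[Plan] =
    if (plan.output.forall(orderingColumns.contains)) None
    else Some(prune(plan, orderingColumns))
}
```

With a four-column scan and ordering columns `c0, c1`, `planForSample` returns a scan of only those two columns; when the plan already outputs nothing but ordering columns, it returns `None` and the original plan would be sampled as before.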

Why are the changes needed?

When doing a global sort or a repartition by range, we first sample the data to get the range bounds, then use the range partitioner to do the shuffle exchange.

The issue is that the plan used for sampling is coupled with the original query. The sample plan needs only the sort-order columns, but the original query plan contains all data columns. So we can apply column pruning to the sample plan to fetch only the ordering columns.

A common example is: OPTIMIZE table ZORDER BY columns
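The reasoning above can be illustrated with a small, Spark-free sketch: the range bounds depend only on the ordering key, so sampling a key-only projection yields the same bounds as sampling full rows. The `Row` type and all function names here are hypothetical, and the deterministic every-Nth-row sampling is a simplification swapped in for the sampling Spark's range partitioner actually performs.

```scala
// Hypothetical row type: one ordering column plus a wide payload the sampler never needs.
final case class Row(key: Int, payload: Vector[String])

object RangeBoundsSketch {
  // Pick up to numPartitions - 1 evenly spaced split points from the sorted sampled keys.
  def rangeBounds(sampledKeys: Seq[Int], numPartitions: Int): Vector[Int] = {
    val sorted = sampledKeys.sorted.toVector
    if (sorted.isEmpty || numPartitions <= 1) Vector.empty
    else (1 until numPartitions)
      .map(i => sorted(i * sorted.length / numPartitions))
      .toVector
      .distinct
  }

  // Index of the partition a key falls into, given ascending bounds.
  def partitionOf(key: Int, bounds: Vector[Int]): Int = {
    val idx = bounds.indexWhere(key <= _)
    if (idx < 0) bounds.length else idx
  }

  // Coupled variant: walks whole rows (payload included) and extracts the key afterwards.
  def sampleFullRows(rows: Seq[Row], step: Int): Seq[Int] =
    rows.zipWithIndex.collect { case (r, i) if i % step == 0 => r.key }

  // Pruned variant: projects to the key first, so the sampler only ever sees keys.
  def samplePrunedKeys(rows: Seq[Row], step: Int): Seq[Int] =
    rows.map(_.key).zipWithIndex.collect { case (k, i) if i % step == 0 => k }
}
```

Both sampling variants return the same keys, and therefore the same bounds. In a columnar source such as Parquet, the pruned variant corresponds to reading only the ordering columns during the sample stage.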

Does this PR introduce any user-facing change?

No, it only improves performance.

How was this patch tested?

Added tests.

Performance test:

val rows = 1000000
val columns = (0 until(200)).map(i => s"uuid() c$i")
spark.range(rows).selectExpr(columns: _*).repartition(30).write.format("parquet").saveAsTable("t")
spark.sql("SELECT * FROM t ORDER BY c0, c1, c2").write.format("noop").mode("overwrite").save()

After column pruning, the sample stage drops from 7s to 0.4s, and the whole query is about 1.4x faster.

ulysses-you · 2022-11-22T10:33:00Z

How about this idea? cc @yaooqinn @wangyum @HyukjinKwon @cloud-fan

cloud-fan · 2022-11-28T14:21:14Z

I'd rather have a nested query execution for sample, instead of changing the query plan tree.

ulysses-you · 2022-12-09T10:02:18Z

@cloud-fan how about this update? I now create a new query execution before the shuffle, so that we have an optimized plan to sample from.

github-actions · 2023-03-23T00:20:07Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the SQL label Nov 22, 2022

Range partitioner sample supports column pruning

0dec4ac

ulysses-you force-pushed the range-partition branch from a7d89d7 to 0dec4ac Compare December 9, 2022 09:59

fix ut

b86b113

github-actions bot added the Stale label Mar 23, 2023

github-actions bot closed this Mar 24, 2023
