[SPARK-41220][SQL] Range partitioner sample supports column pruning #38756

Closed
ulysses-you wants to merge 2 commits

@ulysses-you
Contributor

ulysses-you commented Nov 22, 2022

What changes were proposed in this pull request?

Make RangePartitioning take a new parameter, planForSample, so that we can pass it to ShuffleExchangeExec.

Add a new rule, OptimizeSampleForRangePartitioning, that infers planForSample during the preparations phase when doing so is beneficial.

Why are the changes needed?

When doing a global sort or a repartition by range, we first sample the data to get the range bounds, then use the range partitioner to do the shuffle exchange.
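To illustrate the sample-then-partition flow, here is a minimal sketch in plain Scala. It is a simplified model, not Spark's RangePartitioner: the object and method names (RangeBoundsSketch, rangeBounds, partitionOf) are made up for this example, keys are plain Int values, and the sample is assumed to be already collected.

```scala
// Simplified illustration only; names and logic here are assumptions,
// not the actual Spark implementation.
object RangeBoundsSketch {
  // Pick up to (numPartitions - 1) evenly spaced upper bounds
  // from a sample of the sort keys.
  def rangeBounds(sample: Seq[Int], numPartitions: Int): Vector[Int] = {
    val sorted = sample.sorted.toVector
    if (sorted.isEmpty || numPartitions <= 1) Vector.empty
    else (1 until numPartitions).map { i =>
      sorted(math.min(sorted.length - 1, i * sorted.length / numPartitions))
    }.distinct.toVector
  }

  // A key goes to the first partition whose upper bound is >= the key;
  // keys above every bound go to the last partition.
  def partitionOf(key: Int, bounds: Vector[Int]): Int = {
    val idx = bounds.indexWhere(key <= _)
    if (idx == -1) bounds.length else idx
  }
}
```

Note that only the sort keys take part in computing the bounds, which is why the sample does not need the other columns of the query.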

The issue is that the sample plan is coupled with the original query. The sample only needs the sort-order columns, but the original query plan contains all data columns. So we can apply column pruning to the sample plan and fetch only the ordering columns.

A common example is: OPTIMIZE table ZORDER BY columns
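The column pruning idea described above can be sketched with a toy plan model. This is an assumption-laden illustration, not the code in this PR: Plan, Scan, Project, and prune are hypothetical stand-ins for Spark's plan classes, and planForSample here is a plain function rather than the new RangePartitioning parameter.

```scala
// Hypothetical, minimal plan model; these are not Spark's SparkPlan classes.
object SamplePruningSketch {
  sealed trait Plan { def output: Seq[String] }
  final case class Scan(table: String, output: Seq[String]) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan {
    def output: Seq[String] = columns
  }

  // Push the required column set down to the scan so that
  // only those columns are read.
  def prune(plan: Plan, required: Seq[String]): Plan = plan match {
    case Scan(table, out) => Scan(table, out.filter(required.contains))
    case Project(cols, child) =>
      val kept = cols.filter(required.contains)
      Project(kept, prune(child, kept))
  }

  // The sample only needs the sort-order columns of the query.
  def planForSample(query: Plan, orderingColumns: Seq[String]): Plan =
    prune(query, orderingColumns)
}
```

With a 200-column scan and an ordering on c0, c1, c2, the pruned sample plan in this model reads 3 columns instead of 200, while the original query plan is left unchanged.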

Does this PR introduce any user-facing change?

No, it only improves performance.

How was this patch tested?

Added tests.

Performance test:

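// Build a table with 1,000,000 rows and 200 columns of random UUID strings.
val rows = 1000000
val columns = (0 until 200).map(i => s"uuid() c$i")
spark.range(rows).selectExpr(columns: _*).repartition(30).write.format("parquet").saveAsTable("t")
// Global sort on three columns; the noop sink discards the output.
spark.sql("SELECT * FROM t ORDER BY c0, c1, c2").write.format("noop").mode("overwrite").save()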

After column pruning, the sample stage drops from 7s to 0.4s, and the whole query is 1.4x faster.

github-actions bot added the SQL label Nov 22, 2022
@ulysses-you
Contributor Author

How about this idea? cc @yaooqinn @wangyum @HyukjinKwon @cloud-fan

@cloud-fan
Contributor

I'd rather have a nested query execution for sample, instead of changing the query plan tree.

@ulysses-you
Contributor Author

@cloud-fan how about this update? I now create a new query execution before the shuffle, so that we have an optimized plan to sample.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label Mar 23, 2023
github-actions bot closed this Mar 24, 2023