[SPARK-35203][SQL] Improve Repartition statistics estimation #32309

wangyum · 2021-04-23T07:22:40Z

What changes were proposed in this pull request?

This PR improves statistics estimation for Repartition and RepartitionByExpression by using the child's statistics.

Why are the changes needed?

The current implementation loses column statistics. For example:

CREATE TABLE t1 USING parquet AS SELECT id % 10 AS key FROM range(100);
ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS;
set spark.sql.cbo.enabled=true;
EXPLAIN COST SELECT key FROM (SELECT key FROM t1 DISTRIBUTE BY key) t GROUP BY key;

Before this PR:

== Optimized Logical Plan ==
Aggregate [key#2950L], [key#2950L], Statistics(sizeInBytes=1600.0 B)
+- RepartitionByExpression [key#2950L], Statistics(sizeInBytes=1600.0 B, rowCount=100)
   +- Relation default.t1[key#2950L] parquet, Statistics(sizeInBytes=1600.0 B, rowCount=100)

After this PR:

== Optimized Logical Plan ==
Aggregate [key#2950L], [key#2950L], Statistics(sizeInBytes=160.0 B, rowCount=10)
+- RepartitionByExpression [key#2950L], Statistics(sizeInBytes=1600.0 B, rowCount=100)
   +- Relation default.t1[key#2950L] parquet, Statistics(sizeInBytes=1600.0 B, rowCount=100)
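The numbers in the two plans can be reproduced with a little arithmetic. The following is a minimal sketch, not Spark's actual estimation code: it assumes a per-row size of 8 bytes of generic row overhead plus 8 bytes for the single bigint column, and it assumes the aggregate's row count is the product of the grouping keys' distinct counts, capped by the child's row count. The object and method names are hypothetical. Under those assumptions, once the distinct count of `key` reaches the Aggregate node, the estimate drops from 100 rows to 10.

```scala
// Toy model of the estimate shown in the plans above; these are not Spark's classes.
object AggregateEstimateSketch {
  // Assumption: 8 B generic row overhead + 8 B for one bigint (LongType) column.
  val rowOverheadBytes: Long = 8L
  val longColumnBytes: Long = 8L
  val sizePerRow: Long = rowOverheadBytes + longColumnBytes

  /** Rows produced by GROUP BY: product of the keys' distinct counts,
    * capped by the number of rows in the child. */
  def aggregateRowCount(childRows: BigInt, distinctCounts: Seq[BigInt]): BigInt =
    distinctCounts.product.min(childRows)

  def main(args: Array[String]): Unit = {
    val childRows = BigInt(100)        // rowCount of t1
    val keyDistinct = Seq(BigInt(10))  // id % 10 yields 10 distinct keys

    // Child size: 100 rows * 16 B, matching sizeInBytes=1600.0 B in the plan.
    println(s"child sizeInBytes=${childRows * sizePerRow}")

    // Aggregate with column statistics available: 10 rows * 16 B.
    val rows = aggregateRowCount(childRows, keyDistinct)
    println(s"aggregate rowCount=$rows, sizeInBytes=${rows * sizePerRow}")
  }
}
```

Without the distinct count, the sketch has nothing to multiply, which corresponds to the "before" plan, where the Aggregate node simply reports the child's 1600 B and no row count.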

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

SparkQA · 2021-04-23T08:19:57Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42380/

SparkQA · 2021-04-23T08:19:58Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42380/

SparkQA · 2021-04-23T11:25:42Z

Test build #137850 has finished for PR 32309 at commit a010dea.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-23T13:59:26Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42392/

SparkQA · 2021-04-23T13:59:27Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42392/

SparkQA · 2021-04-23T17:38:13Z

Test build #137863 has finished for PR 32309 at commit 9d4c349.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-06-12T15:33:48Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44265/

SparkQA · 2021-06-12T16:08:28Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44265/

SparkQA · 2021-06-12T18:56:32Z

Test build #139740 has finished for PR 32309 at commit 5553429.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-06-15T21:28:21Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44356/

maropu · 2021-06-16T00:55:14Z

...cala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala

@@ -81,9 +81,9 @@ object BasicStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
    ProjectEstimation.estimate(p).getOrElse(fallback(p))
  }

-  override def visitRepartition(p: Repartition): Statistics = default(p)
+  override def visitRepartition(p: Repartition): Statistics = fallback(p)


Q: Do we need to fall back to the size-based visitor instead of just calling p.child.stats?

For easier maintenance: when BasicStatsPlanVisitor and SizeInBytesOnlyStatsPlanVisitor share the same implementation, the former calls fallback. For example, visitLocalLimit:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala

Line 76 in 5553429

override def visitLocalLimit(p: LocalLimit): Statistics = fallback(p)

Ah, I see. It's trivial and sgtm.
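The maintenance argument in this thread can be illustrated with a toy model. This is a sketch under stated assumptions, not the Spark source: `Stats`, `Plan`, `SizeOnlyVisitor` and `BasicVisitor` are simplified stand-ins, and the size-only visitor is assumed to reuse the child's statistics for a repartition. The point it shows is that the logic lives in one place and the CBO-oriented visitor only delegates to it.

```scala
// Toy model of the two-visitor layout discussed above; not Spark's code.
object FallbackSketch {
  final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

  sealed trait Plan
  final case class Relation(stats: Stats) extends Plan
  final case class Repartition(child: Plan) extends Plan

  trait PlanVisitor {
    // Dispatch on the node type, as a plan visitor would.
    def visit(p: Plan): Stats = p match {
      case r: Relation    => visitRelation(r)
      case r: Repartition => visitRepartition(r)
    }
    def visitRelation(r: Relation): Stats
    def visitRepartition(r: Repartition): Stats
  }

  // Size-only visitor: holds the single shared implementation.
  object SizeOnlyVisitor extends PlanVisitor {
    def visitRelation(r: Relation): Stats = r.stats
    // Assumption: repartitioning redistributes rows without changing them,
    // so the child's statistics are reused unchanged.
    def visitRepartition(r: Repartition): Stats = visit(r.child)
  }

  // CBO-oriented visitor: delegates when it has nothing more precise to add.
  object BasicVisitor extends PlanVisitor {
    private def fallback(p: Plan): Stats = SizeOnlyVisitor.visit(p)
    def visitRelation(r: Relation): Stats = fallback(r)
    def visitRepartition(r: Repartition): Stats = fallback(r)
  }

  def main(args: Array[String]): Unit = {
    val t1 = Relation(Stats(BigInt(1600), Some(BigInt(100))))
    // Both visitors agree because only one of them contains the logic.
    println(BasicVisitor.visit(Repartition(t1)))
    println(SizeOnlyVisitor.visit(Repartition(t1)))
  }
}
```

If the repartition rule ever changes, only `SizeOnlyVisitor.visitRepartition` needs editing in this model, which is the duplication-avoidance benefit the reply describes.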

maropu · 2021-06-16T00:55:25Z

Looks fine otherwise.

maropu · 2021-06-16T01:20:48Z

Thank you, @wangyum . Merged to master.

github-actions bot added the SQL label Apr 23, 2021

wangyum requested review from cloud-fan, maropu and wzhfy April 25, 2021 06:02

wangyum added 3 commits June 12, 2021 21:38

Improve Repartition statistics estimation

62c7447

fix

18e8e93

Improve test

5553429

cloud-fan approved these changes Jun 15, 2021

maropu reviewed Jun 16, 2021

maropu approved these changes Jun 16, 2021

maropu closed this in b08cf6e Jun 16, 2021

wangyum deleted the SPARK-35203 branch June 16, 2021 05:03
