Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-35203][SQL] Improve Repartition statistics estimation #32309

Closed
wants to merge 3 commits into from
Closed

[SPARK-35203][SQL] Improve Repartition statistics estimation #32309

wants to merge 3 commits into from

Conversation

wangyum
Copy link
Member

@wangyum wangyum commented Apr 23, 2021

What changes were proposed in this pull request?

This PR improves Repartition and RepartitionByExpr statistics estimation using child statistics.

Why are the changes needed?

The current implementation will missing column stat. For example:

CREATE TABLE t1 USING parquet AS SELECT id % 10 AS key FROM range(100);
ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS;
set spark.sql.cbo.enabled=true;
EXPLAIN COST SELECT key FROM (SELECT key FROM t1 DISTRIBUTE BY key) t GROUP BY key;

Before this PR:

== Optimized Logical Plan ==
Aggregate [key#2950L], [key#2950L], Statistics(sizeInBytes=1600.0 B)
+- RepartitionByExpression [key#2950L], Statistics(sizeInBytes=1600.0 B, rowCount=100)
   +- Relation default.t1[key#2950L] parquet, Statistics(sizeInBytes=1600.0 B, rowCount=100)

After this PR:

== Optimized Logical Plan ==
Aggregate [key#2950L], [key#2950L], Statistics(sizeInBytes=160.0 B, rowCount=10)
+- RepartitionByExpression [key#2950L], Statistics(sizeInBytes=1600.0 B, rowCount=100)
   +- Relation default.t1[key#2950L] parquet, Statistics(sizeInBytes=1600.0 B, rowCount=100)

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

@github-actions github-actions bot added the SQL label Apr 23, 2021
@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42380/

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42380/

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Test build #137850 has finished for PR 32309 at commit a010dea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42392/

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42392/

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Test build #137863 has finished for PR 32309 at commit 9d4c349.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44265/

@SparkQA
Copy link

SparkQA commented Jun 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44265/

@SparkQA
Copy link

SparkQA commented Jun 12, 2021

Test build #139740 has finished for PR 32309 at commit 5553429.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 15, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44356/

@@ -81,9 +81,9 @@ object BasicStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
ProjectEstimation.estimate(p).getOrElse(fallback(p))
}

override def visitRepartition(p: Repartition): Statistics = default(p)
override def visitRepartition(p: Repartition): Statistics = fallback(p)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: we need to fall back into the size-based one instead of just calling p.child.stats?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better maintenance, if BasicStatsPlanVisitor and SizeInBytesOnlyStatsPlanVisitor have the same implementation, then fallback. For example visitLocalLimit:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. It's trivial and sgtm.

@maropu
Copy link
Member

maropu commented Jun 16, 2021

Looks fine otherwise.

@maropu maropu closed this in b08cf6e Jun 16, 2021
@maropu
Copy link
Member

maropu commented Jun 16, 2021

Thank you, @wangyum . Merged to master.

@wangyum wangyum deleted the SPARK-35203 branch June 16, 2021 05:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants