[SPARK-35567][SQL] Fix: Explain cost is not showing statistics for all the nodes #32704

Closed
shahidki31 wants to merge 3 commits into apache:master from shahidki31:shahid/fixshowstats

@shahidki31
Contributor

@shahidki31 shahidki31 commented May 30, 2021

What changes were proposed in this pull request?

The EXPLAIN COST command in Spark currently doesn't show statistics for all plan nodes. It misses some nodes in almost all of the TPC-DS queries.
In this PR, we collect all the plan nodes, including those in subqueries, and compute the statistics for each node if they don't already exist in the stats cache.
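
As a rough illustration of the idea (a toy model, not Spark's code), the sketch below walks a plan tree through both its children and its subqueries, nested ones included, and fills a statistics cache only for nodes that are missing from it. The `Node` class, the `collect_with_subqueries` and `fill_stats_cache` helpers, and the use of a row count keyed by node name are all hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Toy plan node; a hypothetical stand-in, not Spark's LogicalPlan."""
    name: str
    rows: int
    children: List["Node"] = field(default_factory=list)
    # Plans referenced from expressions rather than from the child list.
    subqueries: List["Node"] = field(default_factory=list)


def collect_with_subqueries(plan: Node) -> List[Node]:
    """Return every node reachable through children or subqueries."""
    found: List[Node] = []
    stack = [plan]
    while stack:
        node = stack.pop()
        found.append(node)
        # Subqueries are pushed too, so subqueries of subqueries are visited.
        stack.extend(node.subqueries)
        stack.extend(node.children)
    return found


def fill_stats_cache(plan: Node, cache: Dict[str, int]) -> None:
    """Compute a statistic for each node only if the cache lacks one."""
    for node in collect_with_subqueries(plan):
        if node.name not in cache:
            # Stand-in for a real size or row-count estimate.
            cache[node.name] = node.rows


# A plan with a subquery that itself contains a nested subquery.
inner = Node("Aggregate", 1, children=[Node("Scan t2", 50)])
outer_sub = Node("Filter", 10, children=[Node("Scan t1", 100)], subqueries=[inner])
plan = Node("Project", 5, children=[Node("Scan t0", 200)], subqueries=[outer_sub])

cache: Dict[str, int] = {"Scan t0": 200}  # only one node has cached stats so far
fill_stats_cache(plan, cache)
print(sorted(cache))
```

Node names are unique in this example, which is why they can serve as cache keys; a real plan would need a different key. After the call, every node in the main plan and in both subquery levels has an entry, which is the property the explain output relies on.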

Why are the changes needed?

Before fix:
Query 1: the Project node doesn't have statistics.
image

Query 15: the Aggregate node doesn't have statistics.
image

After fix:
Query 1:
image
Query 15:
image

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual testing

@github-actions github-actions bot added the SQL label May 30, 2021
@shahidki31
Contributor Author

cc @HyukjinKwon @maropu @cloud-fan @srowen Kindly review

@srowen
Member

srowen commented May 30, 2021

I don't know enough to review this, sorry

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43607/

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43607/

@SparkQA

SparkQA commented May 31, 2021

Test build #139086 has finished for PR 32704 at commit f1eced2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Does this fix nested subqueries?

@shahidki31
Contributor Author

@cloud-fan Yes, collectWithSubqueries will include nested subqueries as well.
image

@SparkQA

SparkQA commented May 31, 2021

Test build #139103 has finished for PR 32704 at commit b78a9c9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shahidki31
Contributor Author

Retest this please

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43623/

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43627/

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43623/

@SparkQA

SparkQA commented May 31, 2021

Test build #139107 has finished for PR 32704 at commit b78a9c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cloud-fan commented May 31, 2021

thanks, merging to master!

@cloud-fan cloud-fan closed this in cd2ef9c May 31, 2021
@shahidki31
Contributor Author

Thanks a lot @cloud-fan

@shahidki31 shahidki31 deleted the shahid/fixshowstats branch May 31, 2021 17:28