-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27395][SQL] Improve EXPLAIN command #24759
Conversation
This looks easier to read but doesn't the new formatting loose some info? From 1st example, New:
Also, |
@ekoifman Thanks for your comment. So currently,
Yeah. Its definitely an option. Lets get some more feedback and we can decide on it. |
Test build #106024 has finished for PR 24759 at commit
|
Test build #106029 has finished for PR 24759 at commit
|
Test build #106030 has finished for PR 24759 at commit
|
Test build #106034 has finished for PR 24759 at commit
|
cc @cloud-fan |
9a17839
to
01c6655
Compare
Test build #107722 has finished for PR 24759 at commit
|
retest this please |
Test build #107723 has finished for PR 24759 at commit
|
Test build #107726 has finished for PR 24759 at commit
|
hi @dilipbiswal, this activity looks nice to me and I like this. I'm currently not sure that this current approach is the best for spark, so I think studying EXPLAIN output in other DBMS-like systems is beneficial for this? Also, any reference that this current approach is based on? Anyway, I checked EXPLAIN output in postgresql, mysql, and presto by using the TPCDS q1; |
@maropu Thanks for your input. The current approach is based on DB2. I had some preliminary discussion with @gatorsmile before starting on this work. What are your concerns with current approach ? |
oh I see, since I've never used db2, I didn't notice that (this output is based on the db2 one). Thanks!
The postgresql format does so (only basic plan contents are printed in explain). |
@maropu Thank you !! Yeah, we could add more stuff to the header portion of the plan. I had discussed a bit with Sean on this and there are plans to make statistics info available during physical planning. So the plan was to have these stats info printed in the top section of the plan along side each operator (Like row count, num bytes etc). Once we do that, i don't think there would be enough space to print other info without cluttering the output.. but we could always discuss about it :-). Another thing to note is that, we will still be supporting the old plan format. So users would have flexibility to choose between the two ways. |
oh, yeah, plan stats also are candidates for that ;) But, as you said, the space is too limited.
Is this expected to be enabled by default? How do we switch the format? It seems some systems accpet options in an explain command as follows;
|
@maropu Yeah.. my proposal would be to introduce a new format for explain. Currently its controlled by a config. Lets see what wenchen and sean think on this.. |
01c6655
to
42730b3
Compare
cc @cloud-fan If you can please take a look at this one when you have time.. |
Test build #108207 has finished for PR 24759 at commit
|
42730b3
to
cb4bb2f
Compare
Test build #108302 has finished for PR 24759 at commit
|
retest this please |
Test build #108312 has finished for PR 24759 at commit
|
This is a big improvement! I have several high-level comments
|
/** ONE line description of this node with more information */ | ||
def verboseString(maxFields: Int): String | ||
|
||
/** ONE line description of this node with some suffix information */ | ||
def verboseStringWithSuffix(maxFields: Int): String = verboseString(maxFields) | ||
|
||
def verboseString( | ||
planToOperatorID: mutable.LinkedHashMap[TreeNode[_], Int], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is specific to query plan, we can put it in QueryPlan
instead of TreeNode
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Sounds reasonable. Its been a while :-) Let me try to look at the code later today.
c89f725
to
5f4e8b3
Compare
Test build #109655 has finished for PR 24759 at commit
|
Test build #109725 has finished for PR 24759 at commit
|
thanks, merging to master! |
Thanks a LOT @cloud-fan @gatorsmile @maryannxue @maropu @ekoifman |
@dilipbiswal The Leaf Scan Node is one of the nodes that contain the most important info for perf investigation. Could you create another ticket and submit a follow-up PR on it? The new node is too simple compared with the old one. New:
Old:
|
@gatorsmile Sure.. I will address it. |
Any plan to support this format in |
maybe we can add an overload: |
ok, I'll try to make a pr later in that approach. Thanks! |
### What changes were proposed in this pull request? This pr intends to add `ExplainMode` for explaining `Dataset/DataFrame` with a given format mode (`ExplainMode`). `ExplainMode` has four types along with the SQL EXPLAIN command: `Simple`, `Extended`, `Codegen`, `Cost`, and `Formatted`. For example, this pr enables users to explain DataFrame/Dataset with the `FORMATTED` format implemented in #24759; ``` scala> spark.range(10).groupBy("id").count().explain(ExplainMode.Formatted) == Physical Plan == * HashAggregate (3) +- * HashAggregate (2) +- * Range (1) (1) Range [codegen id : 1] Output: [id#0L] (2) HashAggregate [codegen id : 1] Input: [id#0L] (3) HashAggregate [codegen id : 1] Input: [id#0L, count#8L] ``` This comes from [the cloud-fan suggestion.](#24759 (comment)) ### Why are the changes needed? To follow the SQL EXPLAIN command. ### Does this PR introduce any user-facing change? No, this is just for a new API in Dataset. ### How was this patch tested? Add tests in `ExplainSuite`. Closes #26829 from maropu/DatasetExplain. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
This PR aims at improving the way physical plans are explained in spark.
Currently, the explain output for physical plan may look very cluttered and each operator's
string representation can be very wide and wraps around in the display making it little
hard to follow. This especially happens when explaining a query 1) Operating on wide tables
2) Has complex expressions etc.
This PR attempts to split the output into two sections. In the header section, we display
the basic operator tree with a number associated with each operator. In this section, we strictly
control what we output for each operator. In the footer section, each operator is verbosely
displayed. Based on the feedback from Maryann, the uncorrelated subqueries (SubqueryExecs) are not included in the main plan. They are printed separately after the main plan and can be
correlated by the originating expression id from its parent plan.
To illustrate, here is a simple plan displayed in old vs new way.
Example query1 :
Old :
New :
Example Query2 (subquery):
Old:
New:
Note:
Currently this PR provides a basic infrastructure
for explain enhancement. The details about individual operators will be implemented
in follow-up prs
How was this patch tested?
Added a new test
explain.sql
that tests basic scenarios. Need to add more tests.