-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27439][SQL] Explainging Dataset should show correct resolved plans #24464
Conversation
Test build #104923 has finished for PR 24464 at commit
|
*/ | ||
case class ExplainCommand( | ||
logicalPlan: LogicalPlan, | ||
extended: Boolean = false, | ||
codegen: Boolean = false, | ||
cost: Boolean = false) | ||
cost: Boolean = false, | ||
optQueryExecution: Option[QueryExecution] = None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a better way to keep using analyzed plan of dataset and showing correct pre-analyzed plan, other than passing in the query execution of the dataset.
@cloud-fan @dongjoon-hyun @HyukjinKwon Do you think this fix work? Please take a look. Thanks. |
@@ -498,7 +498,8 @@ class Dataset[T] private[sql]( | |||
* @since 1.6.0 | |||
*/ | |||
def explain(extended: Boolean): Unit = { | |||
val explain = ExplainCommand(queryExecution.logical, extended = extended) | |||
val explain = ExplainCommand(queryExecution.logical, extended = extended, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about passing the QueryExecution
as first parameter? IIUC, there is only another place where ExplainCommand
is created, so it is not going to be a big change too and it is going to be cleaner IMO...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExplainCommand
is also created in SparkSqlParser
where there is no QueryExecution
. So it has both LogicalPlan
and QueryExecution
parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my point is: can we create a QueryExecution
in SparkSqlParser
and pass that to ExplainCommand
?
The only problem I can see in doing this is that we need to bind the newly generated QueryExecution
to a SparkSession
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a change. Please see if it is clearer for you.
@@ -494,11 +494,15 @@ class Dataset[T] private[sql]( | |||
/** | |||
* Prints the plans (logical and physical) to the console for debugging purposes. | |||
* | |||
* Note that temporary views are already resolved when creating `Dataset`. So if | |||
* temporary views are changed after that, the output of this command shows the plans | |||
* before such changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is clearer to make a note in this doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether this is the right place for this comment. I mean, I think that putting it here, it is not clear whether the output of the explain
shows the plan with "old" views, but the dataset is executed with the "new" ones or both use the "old", as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main access of explain command for Dataset. If not this place, any other place you will suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is: this note and this behavior is not specific to explain
, but it is common to all operations on a dataset. Putting the comment here can be confusing as it may imply that it is not true for other operations on dataset, which is not true. I'd rather put it in the comment of the Dataset class.
* @param extended whether to do extended explain or not | ||
* @param codegen whether to output generated code from whole-stage codegen or not | ||
* @param cost whether to show cost information for operators. | ||
*/ | ||
case class ExplainCommand( | ||
logicalPlan: LogicalPlan, | ||
queryExecution: QueryExecution, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 ExplainCommand
now accepts only QueryExecution
.
/** | ||
* This is mainly used for tests. | ||
*/ | ||
def apply( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for test only. Because some test suites passes logicalPlan
and extended
, and we can't overload apply
with default parameters. So to create this for test only purpose.
@@ -494,11 +494,15 @@ class Dataset[T] private[sql]( | |||
/** | |||
* Prints the plans (logical and physical) to the console for debugging purposes. | |||
* | |||
* Note that temporary views are already resolved when creating `Dataset`. So if | |||
* temporary views are changed after that, the output of this command shows the plans | |||
* before such changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether this is the right place for this comment. I mean, I think that putting it here, it is not clear whether the output of the explain
shows the plan with "old" views, but the dataset is executed with the "new" ones or both use the "old", as it is.
extended: Boolean, | ||
codegen: Boolean, | ||
cost: Boolean): ExplainCommand = { | ||
val sparkSession = SparkSession.getActiveSession |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SparkSession.active
?
Test build #105054 has finished for PR 24464 at commit
|
@mgaido91 Your comments were addressed. Thanks for review. |
@cloud-fan @dongjoon-hyun Can you help check this? I think it should be in better shape now. |
Test build #105076 has finished for PR 24464 at commit
|
Test build #105077 has finished for PR 24464 at commit
|
@@ -498,7 +498,7 @@ class Dataset[T] private[sql]( | |||
* @since 1.6.0 | |||
*/ | |||
def explain(extended: Boolean): Unit = { | |||
val explain = ExplainCommand(queryExecution.logical, extended = extended) | |||
val explain = ExplainCommand(queryExecution, extended = extended) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fix the current bug, I agree with @viirya . I believe this is inevitable.
Could you give us some guide for this, @cloud-fan and @gatorsmile ?
Retest this please. |
Test build #105105 has finished for PR 24464 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM apart a couple of style-related comments, thanks for this fix @viirya
* run through the analyzer and optimizer when this command is actually run. | ||
*/ | ||
def apply( | ||
logicalPlan: LogicalPlan, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: one more indent here and in the lines below?
* This is mainly used for tests. | ||
*/ | ||
def apply( | ||
logicalPlan: LogicalPlan, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ditto
Test build #105137 has finished for PR 24464 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the analyze doesn't finish within Could we revert this commit? |
@viirya can we just fix this by doing the following: def explain(extended: Boolean): Unit = {
val outputString =
if (extended) {
queryExecution.toString
} else {
queryExecution.simpleString
}
// scalastyle:off println
println(outputString)
// scalastyle:on println
} That seems a lot simpler, and it does not trigger full analysis and optimization from the parser. |
Thank you, @gatorsmile and @hvanhovell . scala> sql("explain select 1").explain(true)
== Parsed Logical Plan ==
ExplainCommand == Parsed Logical Plan ==
'Project [unresolvedalias(1, None)]
+- OneRowRelation
== Analyzed Logical Plan ==
1: int
Project [1 AS 1#2]
+- OneRowRelation
== Optimized Logical Plan ==
Project [1 AS 1#2]
+- OneRowRelation
== Physical Plan ==
*(1) Project [1 AS 1#2]
+- *(1) Scan OneRowRelation[]
, false, false, false
== Analyzed Logical Plan ==
plan: string
ExplainCommand == Parsed Logical Plan ==
'Project [unresolvedalias(1, None)]
+- OneRowRelation
== Analyzed Logical Plan ==
1: int
Project [1 AS 1#2]
+- OneRowRelation
== Optimized Logical Plan ==
Project [1 AS 1#2]
+- OneRowRelation
== Physical Plan ==
*(1) Project [1 AS 1#2]
+- *(1) Scan OneRowRelation[]
, false, false, false
== Optimized Logical Plan ==
ExplainCommand == Parsed Logical Plan ==
'Project [unresolvedalias(1, None)]
+- OneRowRelation
== Analyzed Logical Plan ==
1: int
Project [1 AS 1#2]
+- OneRowRelation
== Optimized Logical Plan ==
Project [1 AS 1#2]
+- OneRowRelation
== Physical Plan ==
*(1) Project [1 AS 1#2]
+- *(1) Scan OneRowRelation[]
, false, false, false
== Physical Plan ==
Execute ExplainCommand
+- ExplainCommand == Parsed Logical Plan ==
'Project [unresolvedalias(1, None)]
+- OneRowRelation
== Analyzed Logical Plan ==
1: int
Project [1 AS 1#2]
+- OneRowRelation
== Optimized Logical Plan ==
Project [1 AS 1#2]
+- OneRowRelation
== Physical Plan ==
*(1) Project [1 AS 1#2]
+- *(1) Scan OneRowRelation[]
, false, false, false Sorry about this mess. I'll revert this. @viirya . Please proceed with @hvanhovell 's advice~ |
This is reverted via 039db87 and I reopened SPARK-27439 . |
oh sorry for that and thanks @gatorsmile, @hvanhovell and @dongjoon-hyun |
What changes were proposed in this pull request?
Because a review is resolved during analysis when we create a dataset, the content of the view is determined when the dataset is created, not when it is evaluated. Now the explain result of a dataset is not correctly consistent with the collected result of it, because we use pre-analyzed logical plan of the dataset in explain command. The explain command will analyzed the logical plan passed in. So if a view is changed after the dataset was created, the plans shown by explain command aren't the same with the plan of the dataset.
Before:
After:
To fix it, this passes query execution of Dataset when explaining it. The query execution contains pre-analyzed plan which is consistent with Dataset's result.
How was this patch tested?
Manually test and unit test.