[SPARK-27439][SQL] Explaining Dataset should show correct resolved plans #24464

viirya · 2019-04-26T03:48:38Z

What changes were proposed in this pull request?

Because a view is resolved during analysis when a Dataset is created, the content of the view is fixed when the Dataset is created, not when it is evaluated. Currently the explain output of a Dataset can be inconsistent with its collected result, because the explain command is given the Dataset's pre-analysis logical plan and analyzes that plan again. So if a view is changed after the Dataset was created, the plans shown by the explain command differ from the plan the Dataset actually runs.

scala> spark.range(10).createOrReplaceTempView("test")
scala> spark.range(5).createOrReplaceTempView("test2")
scala> spark.sql("select * from test").createOrReplaceTempView("tmp001")
scala> val df = spark.sql("select * from tmp001")
scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001")
scala> df.show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
scala> df.explain(true)

Before:

== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `tmp001`

== Analyzed Logical Plan ==
id: bigint
Project [id#2L]
+- SubqueryAlias `tmp001`
   +- Project [id#2L]
      +- SubqueryAlias `test2`
         +- Range (0, 5, step=1, splits=Some(12))

== Optimized Logical Plan ==
Range (0, 5, step=1, splits=Some(12))

== Physical Plan ==
*(1) Range (0, 5, step=1, splits=12)

After:

== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `tmp001`

== Analyzed Logical Plan ==
id: bigint
Project [id#0L]
+- SubqueryAlias `tmp001`
   +- Project [id#0L]
      +- SubqueryAlias `test`
         +- Range (0, 10, step=1, splits=Some(12))

== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(12))

== Physical Plan ==
*(1) Range (0, 10, step=1, splits=12)

To fix it, this PR passes the Dataset's QueryExecution to the explain command. The QueryExecution holds the plan that was analyzed when the Dataset was created, which is consistent with the Dataset's result.
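
The mismatch can be reproduced without Spark. The following is a minimal toy model, not Spark's implementation: `ToyCatalog`, `ToyDataset`, `explainBefore`, and `explainAfter` are hypothetical names, and a plan is reduced to a plain string. It only sketches why re-resolving the view name at explain time can disagree with the plan captured when the Dataset was created.

```scala
import scala.collection.mutable

// Toy stand-in for the session catalog: view name -> resolved plan description.
// All names here are hypothetical; this is not Spark code.
object ToyCatalog {
  private val views = mutable.Map.empty[String, String]

  def createOrReplaceTempView(name: String, plan: String): Unit =
    views.update(name, plan)

  def resolve(name: String): String = views(name)
}

// A toy Dataset resolves its view eagerly, at construction time.
final class ToyDataset(val viewName: String) {
  // Captured once; later catalog changes do not affect it.
  val analyzedPlan: String = ToyCatalog.resolve(viewName)

  // Old behaviour: explain re-resolves the name against the current catalog.
  def explainBefore: String = ToyCatalog.resolve(viewName)

  // Fixed behaviour: explain reuses the plan the Dataset will actually run.
  def explainAfter: String = analyzedPlan
}

object ToyDemo {
  // Returns (plan the Dataset runs, old explain output, fixed explain output).
  def run(): (String, String, String) = {
    ToyCatalog.createOrReplaceTempView("tmp001", "Range (0, 10)")
    val df = new ToyDataset("tmp001")
    // Redefine the view after the Dataset exists.
    ToyCatalog.createOrReplaceTempView("tmp001", "Range (0, 5)")
    (df.analyzedPlan, df.explainBefore, df.explainAfter)
  }
}
```

In this sketch `explainBefore` reports the redefined view while the Dataset still runs the original one; `explainAfter` agrees with what is executed.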

How was this patch tested?

Manual tests and unit tests.

viirya · 2019-04-26T03:49:43Z

cc @dongjoon-hyun @HyukjinKwon @cloud-fan

SparkQA · 2019-04-26T06:47:52Z

Test build #104923 has finished for PR 24464 at commit dcbcc7f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-04-26T10:05:22Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala

 */
 case class ExplainCommand(
    logicalPlan: LogicalPlan,
    extended: Boolean = false,
    codegen: Boolean = false,
-    cost: Boolean = false)
+    cost: Boolean = false,
+    optQueryExecution: Option[QueryExecution] = None)


I don't see a better way to keep using the Dataset's analyzed plan while still showing the correct pre-analysis plan, other than passing in the Dataset's QueryExecution.

viirya · 2019-04-29T16:19:18Z

@cloud-fan @dongjoon-hyun @HyukjinKwon Do you think this fix works? Please take a look. Thanks.

mgaido91 · 2019-05-01T10:06:12Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

@@ -498,7 +498,8 @@ class Dataset[T] private[sql](
   * @since 1.6.0
   */
  def explain(extended: Boolean): Unit = {
-    val explain = ExplainCommand(queryExecution.logical, extended = extended)
+    val explain = ExplainCommand(queryExecution.logical, extended = extended,


What about passing the QueryExecution as the first parameter? IIUC, there is only one other place where ExplainCommand is created, so it would not be a big change, and it would be cleaner IMO...

ExplainCommand is also created in SparkSqlParser where there is no QueryExecution. So it has both LogicalPlan and QueryExecution parameters.

Yes, my point is: can we create a QueryExecution in SparkSqlParser and pass that to ExplainCommand?

The only problem I can see in doing this is that we need to bind the newly generated QueryExecution to a SparkSession.

Made a change. Please see if it is clearer for you.

viirya · 2019-05-01T11:33:33Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

@@ -494,11 +494,15 @@ class Dataset[T] private[sql](
  /**
   * Prints the plans (logical and physical) to the console for debugging purposes.
   *
+   * Note that temporary views are already resolved when creating `Dataset`. So if
+   * temporary views are changed after that, the output of this command shows the plans
+   * before such changes.


I think it is clearer to make a note in this doc.

I am not sure this is the right place for this comment. Putting it here leaves it unclear whether explain shows the plan with the "old" views while the Dataset is executed with the "new" ones, or whether both use the "old" ones, as is actually the case.

This is the main entry point to the explain command for a Dataset. If not here, where would you suggest?

My point is: this note and this behavior are not specific to explain; they apply to all operations on a Dataset. Putting the comment here can be confusing, since it may suggest that it does not hold for other Dataset operations, which is not the case. I'd rather put it in the doc comment of the Dataset class.

viirya · 2019-05-01T11:34:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala

 * @param extended whether to do extended explain or not
 * @param codegen whether to output generated code from whole-stage codegen or not
 * @param cost whether to show cost information for operators.
 */
 case class ExplainCommand(
-    logicalPlan: LogicalPlan,
+    queryExecution: QueryExecution,


@mgaido91 ExplainCommand now accepts only QueryExecution.

viirya · 2019-05-01T11:35:20Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala

+  /**
+   * This is mainly used for tests.
+   */
+  def apply(


This is for tests only. Some test suites pass logicalPlan and extended, and we can't overload apply with default parameters, so this overload was created purely for testing.
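
The restriction mentioned here comes from the Scala compiler: as far as I know, at most one overloaded alternative of a method may define default arguments. The sketch below uses hypothetical stand-in types (`ToyPlan`, `ToyExecution`, `ToyExplain`), not the real `LogicalPlan`, `QueryExecution`, or `ExplainCommand`. It shows a case class whose generated `apply` carries the defaults, plus an explicit companion `apply` that therefore has to spell out every parameter.

```scala
// Hypothetical stand-ins for LogicalPlan and QueryExecution.
final case class ToyPlan(description: String)
final case class ToyExecution(logical: ToyPlan)

// The compiler-generated companion `apply` inherits these default arguments.
final case class ToyExplain(
    execution: ToyExecution,
    extended: Boolean = false,
    codegen: Boolean = false,
    cost: Boolean = false)

object ToyExplain {
  // Convenience overload that accepts a plan instead of an execution.
  // It declares no defaults of its own: scalac rejects a second `apply`
  // alternative that also defines default arguments.
  def apply(
      plan: ToyPlan,
      extended: Boolean,
      codegen: Boolean,
      cost: Boolean): ToyExplain =
    ToyExplain(ToyExecution(plan), extended, codegen, cost)
}
```

Callers that start from a plan must pass all four arguments, e.g. `ToyExplain(ToyPlan("p"), true, false, false)`, while callers that already hold an execution can rely on the defaults.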

mgaido91 · 2019-05-01T12:15:34Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

@@ -494,11 +494,15 @@ class Dataset[T] private[sql](
  /**
   * Prints the plans (logical and physical) to the console for debugging purposes.
   *
+   * Note that temporary views are already resolved when creating `Dataset`. So if
+   * temporary views are changed after that, the output of this command shows the plans
+   * before such changes.


I am not sure this is the right place for this comment. Putting it here leaves it unclear whether explain shows the plan with the "old" views while the Dataset is executed with the "new" ones, or whether both use the "old" ones, as is actually the case.

mgaido91 · 2019-05-01T12:16:31Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala

+    extended: Boolean,
+    codegen: Boolean,
+    cost: Boolean): ExplainCommand = {
+    val sparkSession = SparkSession.getActiveSession


SparkSession.active?

SparkQA · 2019-05-01T14:29:43Z

Test build #105054 has finished for PR 24464 at commit 2cb7d26.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-05-02T10:10:26Z

@mgaido91 Your comments were addressed. Thanks for review.

viirya · 2019-05-02T10:21:00Z

@cloud-fan @dongjoon-hyun Can you help check this? I think it should be in better shape now.

SparkQA · 2019-05-02T12:56:10Z

Test build #105076 has finished for PR 24464 at commit 18064d0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-02T13:06:23Z

Test build #105077 has finished for PR 24464 at commit bfd6eea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-05-03T03:09:19Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

@@ -498,7 +498,7 @@ class Dataset[T] private[sql](
   * @since 1.6.0
   */
  def explain(extended: Boolean): Unit = {
-    val explain = ExplainCommand(queryExecution.logical, extended = extended)
+    val explain = ExplainCommand(queryExecution, extended = extended)


To fix the current bug, I agree with @viirya . I believe this is inevitable.
Could you give us some guide for this, @cloud-fan and @gatorsmile ?

dongjoon-hyun · 2019-05-03T15:39:40Z

Retest this please.

SparkQA · 2019-05-03T18:37:24Z

Test build #105105 has finished for PR 24464 at commit bfd6eea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91

LGTM apart from a couple of style-related comments. Thanks for this fix, @viirya.

mgaido91 · 2019-05-04T08:05:46Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala

+   * run through the analyzer and optimizer when this command is actually run.
+   */
+  def apply(
+    logicalPlan: LogicalPlan,


nit: one more indent here and in the lines below?

mgaido91 · 2019-05-04T08:05:54Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala

+   * This is mainly used for tests.
+   */
+  def apply(
+    logicalPlan: LogicalPlan,


SparkQA · 2019-05-06T04:45:17Z

Test build #105137 has finished for PR 24464 at commit 62959cd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

Thank you, @viirya and @mgaido91 .
In order to fix the original problem, I'll merge this second trial patch.

gatorsmile · 2019-05-20T20:35:07Z

If analysis doesn't finish within maxIterations, the RuleExecutor tries to print the whole plan, which ends up printing QueryExecution.optimizedPlan and QueryExecution.executedPlan. That triggers assertNotAnalysisRule and throws a RuntimeException: "This method should not be called in the analyzer".

Could we revert this commit?

hvanhovell · 2019-05-20T21:57:17Z

@viirya can we just fix this by doing the following:

def explain(extended: Boolean): Unit = {
  val outputString =
    if (extended) {
      queryExecution.toString
    } else {
      queryExecution.simpleString
    }
  // scalastyle:off println
  println(outputString)
  // scalastyle:on println
}

That seems a lot simpler, and it does not trigger full analysis and optimization from the parser.

dongjoon-hyun · 2019-05-20T22:05:34Z

Thank you, @gatorsmile and @hvanhovell .
@gatorsmile also gave a regression example about calling explain on an EXPLAIN statement. Even for a simple example like the following, the output is a huge mess.

scala> sql("explain select 1").explain(true)
== Parsed Logical Plan ==
ExplainCommand == Parsed Logical Plan ==
'Project [unresolvedalias(1, None)]
+- OneRowRelation

== Analyzed Logical Plan ==
1: int
Project [1 AS 1#2]
+- OneRowRelation

== Optimized Logical Plan ==
Project [1 AS 1#2]
+- OneRowRelation

== Physical Plan ==
*(1) Project [1 AS 1#2]
+- *(1) Scan OneRowRelation[]
, false, false, false

== Analyzed Logical Plan ==
plan: string
ExplainCommand == Parsed Logical Plan ==
'Project [unresolvedalias(1, None)]
+- OneRowRelation

== Analyzed Logical Plan ==
1: int
Project [1 AS 1#2]
+- OneRowRelation

== Optimized Logical Plan ==
Project [1 AS 1#2]
+- OneRowRelation

== Physical Plan ==
*(1) Project [1 AS 1#2]
+- *(1) Scan OneRowRelation[]
, false, false, false

== Optimized Logical Plan ==
ExplainCommand == Parsed Logical Plan ==
'Project [unresolvedalias(1, None)]
+- OneRowRelation

== Analyzed Logical Plan ==
1: int
Project [1 AS 1#2]
+- OneRowRelation

== Optimized Logical Plan ==
Project [1 AS 1#2]
+- OneRowRelation

== Physical Plan ==
*(1) Project [1 AS 1#2]
+- *(1) Scan OneRowRelation[]
, false, false, false

== Physical Plan ==
Execute ExplainCommand
   +- ExplainCommand == Parsed Logical Plan ==
'Project [unresolvedalias(1, None)]
+- OneRowRelation

== Analyzed Logical Plan ==
1: int
Project [1 AS 1#2]
+- OneRowRelation

== Optimized Logical Plan ==
Project [1 AS 1#2]
+- OneRowRelation

== Physical Plan ==
*(1) Project [1 AS 1#2]
+- *(1) Scan OneRowRelation[]
, false, false, false

Sorry about this mess. I'll revert this. @viirya . Please proceed with @hvanhovell 's advice~

dongjoon-hyun · 2019-05-20T22:25:00Z

This is reverted via 039db87 and I reopened SPARK-27439 .

viirya · 2019-05-21T01:56:53Z

oh sorry for that and thanks @gatorsmile, @hvanhovell and @dongjoon-hyun
I will follow up with @hvanhovell's advice.

Use existing queryexecution if any, in explain command.

dcbcc7f

Pass query execution to explain command.

2cb7d26

viirya changed the title ~~[SPARK-27439][SQL] Use existing queryexecution in explaining Dataset~~ [SPARK-27439][SQL] Explainging Dataset should show correct resolved plans May 2, 2019

Address comments.

bfd6eea

viirya force-pushed the SPARK-27439-2 branch from 18064d0 to bfd6eea Compare May 2, 2019 10:09

dongjoon-hyun approved these changes May 3, 2019

Add indent.

62959cd

dongjoon-hyun reviewed May 6, 2019

dongjoon-hyun closed this in 4b725e5 May 6, 2019

viirya mentioned this pull request May 21, 2019

[SPARK-27439][SQL] Explainging Dataset should show correct resolved plans #24654

Closed

viirya deleted the SPARK-27439-2 branch December 27, 2023 18:22
