
[SPARK-29092][SQL] Report additional information about DataSourceScanExec in EXPLAIN FORMATTED #26042

Closed

Conversation

dilipbiswal
Copy link
Contributor

What changes were proposed in this pull request?

Currently we report only the output attributes of a scan in EXPLAIN FORMATTED.
This PR implements verboseStringWithOperatorId in DataSourceScanExec to report additional information about a scan, such as pushed-down filters, partition filters, and the location.
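As a rough illustration of the idea (a simplified sketch, not the exact merged code), the extra lines come from formatting the scan's metadata map:

// Simplified sketch: render a scan's metadata map (pushed filters, location,
// read schema, ...) as the extra lines shown under the operator.
def formatScanMetadata(metadata: Map[String, String]): String =
  metadata.toSeq.sorted
    .map { case (key, value) => s"$key: $value" }
    .mkString("\n")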

SQL

EXPLAIN FORMATTED
  SELECT key, max(val) 
  FROM   explain_temp1 
  WHERE  key > 0 
  GROUP  BY key 
  ORDER  BY key

Before

== Physical Plan ==
* Sort (9)
+- Exchange (8)
   +- * HashAggregate (7)
      +- Exchange (6)
         +- * HashAggregate (5)
            +- * Project (4)
               +- * Filter (3)
                  +- * ColumnarToRow (2)
                     +- Scan parquet default.explain_temp1 (1)


(1) Scan parquet default.explain_temp1 
Output: [key#x, val#x]

....
....
....

After


== Physical Plan ==
* Sort (9)
+- Exchange (8)
   +- * HashAggregate (7)
      +- Exchange (6)
         +- * HashAggregate (5)
            +- * Project (4)
               +- * Filter (3)
                  +- * ColumnarToRow (2)
                     +- Scan parquet default.explain_temp1 (1)


(1) Scan parquet default.explain_temp1 
Output: [key#x, val#x]
Batched: true
DataFilters: [isnotnull(key#x), (key#x > 0)]
Format: Parquet
Location: InMemoryFileIndex[file:/tmp/apache/spark/spark-warehouse/explain_temp1]
PushedFilters: [IsNotNull(key), GreaterThan(key,0)]
ReadSchema: struct<key:int,val:int>

...
...
...

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

dilipbiswal changed the title from [SPARK-29092] Report additional information about DataSourceScanExec in EXPLAIN FORMATTED to [SPARK-29092][SQL] Report additional information about DataSourceScanExec in EXPLAIN FORMATTED on Oct 7, 2019
@SparkQA
Copy link

SparkQA commented Oct 7, 2019

Test build #111842 has finished for PR 26042 at commit 303997c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 8, 2019

Test build #111858 has finished for PR 26042 at commit 6178789.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

DataFilters: [isnotnull(key#x), (key#x > 0)]
Format: Parquet
Location [not included in comparison]sql/core/spark-warehouse/[not included in comparison]
PushedFilters: [IsNotNull(key), GreaterThan(key,0)]
Contributor

what's the difference between data filters and pushed filters?

Contributor Author

@cloud-fan Actually, I don't know for sure. Looking at the output, could it be that one is Catalyst's view of the filter and the other is the data source's view of the filter, i.e. after we translate it? I am guessing here :-)

Contributor Author

@cloud-fan Checked the code. Our guess was right. FYI:

private val pushedDownFilters = dataFilters.flatMap(DataSourceStrategy.translateFilter)
logInfo(s"Pushed Filters: ${pushedDownFilters.mkString(",")}")
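For illustration, the Catalyst predicates shown in DataFilters correspond to the data-source Filters shown in PushedFilters roughly like this (hand-written sketch, not code from this PR):

import org.apache.spark.sql.sources.{Filter, GreaterThan, IsNotNull}

// Catalyst view (DataFilters)  ->  data source view (PushedFilters)
// isnotnull(key#x)             ->  IsNotNull("key")
// (key#x > 0)                  ->  GreaterThan("key", 0)
val pushedFilters: Seq[Filter] = Seq(IsNotNull("key"), GreaterThan("key", 0))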

Contributor

Then I think we only need to mention pushedDownFilters and DPP filters. The other filters are ignored and have no effect.

@SparkQA

SparkQA commented Oct 8, 2019

Test build #111877 has finished for PR 26042 at commit c025ead.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 8, 2019

Test build #111885 has finished for PR 26042 at commit c025ead.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 8, 2019

Test build #111896 has started for PR 26042 at commit d641924.

dilipbiswal force-pushed the verbose_string_datasrc_scanexec branch from d641924 to 56602e1 on October 8, 2019 17:57
@SparkQA

SparkQA commented Oct 8, 2019

Test build #111913 has finished for PR 26042 at commit 56602e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

cc @cloud-fan

@@ -65,6 +65,23 @@ trait DataSourceScanExec extends LeafExecNode {
s"$nodeNamePrefix$nodeName${truncatedString(output, "[", ",", "]", maxFields)}$metadataStr")
}

override def verboseStringWithOperatorId(): String = {
val metadataStr = metadata.toSeq.sorted.map {
Contributor

nit: we can do filter first, then map
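Something along these lines, if I read the suggestion right (sketch only; hiddenKeys is a hypothetical name):

// Filter first, then map: drop the entries we don't want to print,
// then format only the remaining metadata entries.
val hiddenKeys = Set("Format")  // hypothetical set of keys to omit
val metadataStr = metadata.toSeq.sorted
  .filterNot { case (key, _) => hiddenKeys.contains(key) }
  .map { case (key, value) => s"$key: $value" }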

@@ -58,6 +58,11 @@ struct<plan:string>

(1) Scan parquet default.explain_temp1
Output: [key#x, val#x]
Batched: true
Format: Parquet
Contributor

This is already in the node name: Scan parquet default.explain_temp1.

Contributor Author

@cloud-fan I can filter it, but I wanted to double-check first. The one we print in the node name is relation.toString and the one printed here is relation.format.toString. Are they always going to be the same?

Contributor

relation.toString always contains the format name, AFAIK.

Contributor Author

@cloud-fan OK, sounds good, Wenchen. I will drop this then.

@@ -100,7 +100,8 @@ class ThriftServerQueryTestSuite extends SQLQueryTestSuite {
"subquery/in-subquery/in-group-by.sql",
"subquery/in-subquery/simple-in.sql",
"subquery/in-subquery/in-order-by.sql",
"subquery/in-subquery/in-set-operations.sql"
"subquery/in-subquery/in-set-operations.sql",
"explain.sql"
Contributor

why?

Contributor Author

@cloud-fan Let me enable it and see.

Format: Parquet
Location [not included in comparison]sql/core/spark-warehouse/[not included in comparison]
PushedFilters: [IsNotNull(key), GreaterThan(key,10)]
ReadSchema: struct<key:int,val:int>
Contributor

Can we have a test case to show the DPP filter?

Contributor Author

@cloud-fan I have enhanced the test in ExplainSuite.

dilipbiswal force-pushed the verbose_string_datasrc_scanexec branch from 56602e1 to 7fc4d5e on October 11, 2019 17:37
@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 12, 2019

Test build #111977 has finished for PR 26042 at commit 7fc4d5e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2019

Test build #111986 has finished for PR 26042 at commit d07e2b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2019

Test build #112087 has finished for PR 26042 at commit 8beed08.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@@ -110,6 +114,10 @@ struct<plan:string>

(1) Scan parquet default.explain_temp1
Output: [key#x, val#x]
Batched: true
Location [not included in comparison]
PushedFilters: [IsNotNull(key), GreaterThan(key,0)]
Contributor

do we have a test case to display the DPP filter?

Contributor Author

dilipbiswal commented Oct 15, 2019

@cloud-fan I have enhanced the test in ExplainSuite to test the DPP filter. What do you think? Perhaps we need some data in order to trigger it? I am using the test @gatorsmile mentioned in the JIRA in ExplainSuite.

Contributor

I'm asking because I don't see where we extract the DPP filter in this PR. You can see from FileSourceScanExec#dynamicallySelectedPartitions that we can get the DPP filters via partitionFilters.filter(isDynamicPruningFilter).
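For reference, a sketch of the extraction that comment points to (simplified; the exact predicate used in FileSourceScanExec may differ):

import org.apache.spark.sql.catalyst.expressions.{DynamicPruningSubquery, Expression}

// A partition filter is a DPP filter if it contains a dynamic-pruning subquery.
def isDynamicPruningFilter(e: Expression): Boolean =
  e.find(_.isInstanceOf[DynamicPruningSubquery]).isDefined

// The DPP filters would then be: partitionFilters.filter(isDynamicPruningFilter)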

"Subquery:1 Hosting operator id = 1 Hosting Expression = k#xL IN subquery#x"
val expected_pattern2 =
"PartitionFilters: \\[isnotnull\\(k#xL\\), dynamicpruningexpression\\(k#xL " +
"IN subquery#x\\)\\]"
Contributor Author

@cloud-fan Is this not a DPP filter?

Contributor

Ah, I see, we already print all the partition filters. LGTM then.

@@ -436,6 +436,9 @@ class SQLQueryTestSuite extends QueryTest with SharedSparkSession {
.replaceAll(
s"Location.*/sql/core/spark-warehouse/$clsName/",
s"Location ${notIncludedMsg}sql/core/spark-warehouse/")
.replaceAll(
s"Location.*\\.\\.\\.",
Contributor

cloud-fan commented Oct 15, 2019

shall we merge it with the previous case and always normalize the entire location string?

Contributor Author

@cloud-fan Actually, I had thought about it, but for some tests I wasn't sure if we should change the output. For example:
sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql

The test is:
DESC EXTENDED t PARTITION (ds='2017-08-01', hr=10)

Perhaps displaying the location is the intention. What do you think?

Contributor Author

If we decide to do it, I feel we should do it in another PR, so that it can be reverted easily if required.

Contributor

Oh wait, why doesn't the previous case work for the new changes?

Contributor Author

@cloud-fan Because we truncate the location to 100 chars for explain.
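(For context, the truncation being discussed behaves roughly like this; an illustrative sketch, not the actual Spark helper:)

// Metadata values in the explain output are cut to ~100 characters, so a long
// location list ends in "..." instead of showing the full path list.
def truncateForExplain(value: String, maxLen: Int = 100): String =
  if (value.length <= maxLen) value else value.take(maxLen) + "..."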

Contributor

How about we don't truncate the location in the new explain?

Contributor

This string replacement was initially introduced in #16373, for SHOW TABLE, which doesn't truncate the location property.

Contributor Author

dilipbiswal commented Oct 17, 2019

How about we don't truncate the location in the new explain?

Running test("explain formatted - check presence of subquery in case of DPP") in ExplainSuite:

Wenchen, I am afraid the location can be arbitrarily large, like the following. Are we sure we want to show everything? It may make the plan unreadable again.

PrunedInMemoryFileIndex[file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=193,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=546,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=0,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=27,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=572,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=333,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=451,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=56,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=55,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=628,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=937,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=609,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=37,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=7,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=908,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=34,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=621,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=596,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=108,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=42,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=990,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=294,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=418,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=749,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=606,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-
....

Contributor

Ah, this is very different from SHOW TABLE.

We need to think about how to display a list of locations in the explain output. I think it's more useful to display only the first location, with the number of remaining locations if there are any.

Contributor Author

@cloud-fan OK, makes sense.
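Something like the following is what the suggestion would amount to (hypothetical helper, not the code that was merged):

// Show the first root path and summarize how many more there are, instead of
// dumping every path or hard-truncating the string mid-path.
def formatLocations(paths: Seq[String]): String =
  if (paths.isEmpty) "[]"
  else if (paths.size == 1) s"[${paths.head}]"
  else s"[${paths.head}, ... ${paths.size - 1} more]"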

@SparkQA

SparkQA commented Oct 15, 2019

Test build #112093 has finished for PR 26042 at commit 8beed08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2019

Test build #112205 has finished for PR 26042 at commit 192960d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Oct 17, 2019

Test build #112212 has finished for PR 26042 at commit 192960d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2019

Test build #112223 has finished for PR 26042 at commit 490ee3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan closed this in ec5d698 on Oct 18, 2019
@cloud-fan
Contributor

thanks, merging to master!

@dilipbiswal
Contributor Author

@cloud-fan Thank you very much.
