
[SPARK-29092][SQL] Report additional information about DataSourceScanExec in EXPLAIN FORMATTED #26042

Closed

Conversation

dilipbiswal
Copy link
Contributor

What changes were proposed in this pull request?

Currently we report only the output attributes of a scan in EXPLAIN FORMATTED.
This PR implements verboseStringWithOperatorId in DataSourceScanExec to report additional information about a scan, such as pushed-down filters, partition filters, and the location.
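As a rough illustration of the idea (a simplified sketch, not the exact merged code), the extra lines come from formatting the scan's metadata map:

// Simplified sketch: render a scan's metadata map (pushed filters, location,
// read schema, ...) as the extra lines shown under the operator.
def formatScanMetadata(metadata: Map[String, String]): String =
  metadata.toSeq.sorted
    .map { case (key, value) => s"$key: $value" }
    .mkString("\n")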

SQL

EXPLAIN FORMATTED
  SELECT key, max(val) 
  FROM   explain_temp1 
  WHERE  key > 0 
  GROUP  BY key 
  ORDER  BY key

Before

== Physical Plan ==
* Sort (9)
+- Exchange (8)
   +- * HashAggregate (7)
      +- Exchange (6)
         +- * HashAggregate (5)
            +- * Project (4)
               +- * Filter (3)
                  +- * ColumnarToRow (2)
                     +- Scan parquet default.explain_temp1 (1)


(1) Scan parquet default.explain_temp1 
Output: [key#x, val#x]

....
....
....

After


== Physical Plan ==
* Sort (9)
+- Exchange (8)
   +- * HashAggregate (7)
      +- Exchange (6)
         +- * HashAggregate (5)
            +- * Project (4)
               +- * Filter (3)
                  +- * ColumnarToRow (2)
                     +- Scan parquet default.explain_temp1 (1)


(1) Scan parquet default.explain_temp1 
Output: [key#x, val#x]
Batched: true
DataFilters: [isnotnull(key#x), (key#x > 0)]
Format: Parquet
Location: InMemoryFileIndex[file:/tmp/apache/spark/spark-warehouse/explain_temp1]
PushedFilters: [IsNotNull(key), GreaterThan(key,0)]
ReadSchema: struct<key:int,val:int>

...
...
...

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

dilipbiswal changed the title from [SPARK-29092] Report additional information about DataSourceScanExec in EXPLAIN FORMATTED to [SPARK-29092][SQL] Report additional information about DataSourceScanExec in EXPLAIN FORMATTED on Oct 7, 2019
@SparkQA
Copy link

SparkQA commented Oct 7, 2019

Test build #111842 has finished for PR 26042 at commit 303997c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 8, 2019

Test build #111858 has finished for PR 26042 at commit 6178789.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

DataFilters: [isnotnull(key#x), (key#x > 0)]
Format: Parquet
Location [not included in comparison]sql/core/spark-warehouse/[not included in comparison]
PushedFilters: [IsNotNull(key), GreaterThan(key,0)]
Contributor

what's the difference between data filters and pushed filters?

Contributor Author

@cloud-fan Actually, I don't know for sure. Looking at the output, could it be that one is Catalyst's view of the filter and the other is the data source's view of the filter, i.e. after we translate it? I am guessing here :-)

Contributor Author

@cloud-fan Checked the code. Our guess was right. FYI:

private val pushedDownFilters = dataFilters.flatMap(DataSourceStrategy.translateFilter)
logInfo(s"Pushed Filters: ${pushedDownFilters.mkString(",")}")
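For illustration, the Catalyst predicates shown in DataFilters correspond to the data-source Filters shown in PushedFilters roughly like this (hand-written sketch, not code from this PR):

import org.apache.spark.sql.sources.{Filter, GreaterThan, IsNotNull}

// Catalyst view (DataFilters)  ->  data source view (PushedFilters)
// isnotnull(key#x)             ->  IsNotNull("key")
// (key#x > 0)                  ->  GreaterThan("key", 0)
val pushedFilters: Seq[Filter] = Seq(IsNotNull("key"), GreaterThan("key", 0))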

Contributor

Then I think we only need to mention pushedDownFilters and DPP filters. The other filters are ignored and have no effect.

@SparkQA

SparkQA commented Oct 8, 2019

Test build #111877 has finished for PR 26042 at commit c025ead.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 8, 2019

Test build #111885 has finished for PR 26042 at commit c025ead.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 8, 2019

Test build #111896 has started for PR 26042 at commit d641924.

dilipbiswal force-pushed the verbose_string_datasrc_scanexec branch from d641924 to 56602e1 on October 8, 2019 17:57
@SparkQA

SparkQA commented Oct 8, 2019

Test build #111913 has finished for PR 26042 at commit 56602e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

cc @cloud-fan

@@ -65,6 +65,23 @@ trait DataSourceScanExec extends LeafExecNode {
s"$nodeNamePrefix$nodeName${truncatedString(output, "[", ",", "]", maxFields)}$metadataStr")
}

override def verboseStringWithOperatorId(): String = {
val metadataStr = metadata.toSeq.sorted.map {
Contributor

nit: we can do filter first, then map
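Something along these lines, if I read the suggestion right (sketch only; hiddenKeys is a hypothetical name):

// Filter first, then map: drop the entries we don't want to print,
// then format only the remaining metadata entries.
val hiddenKeys = Set("Format")  // hypothetical set of keys to omit
val metadataStr = metadata.toSeq.sorted
  .filterNot { case (key, _) => hiddenKeys.contains(key) }
  .map { case (key, value) => s"$key: $value" }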

@@ -58,6 +58,11 @@ struct<plan:string>

(1) Scan parquet default.explain_temp1
Output: [key#x, val#x]
Batched: true
Format: Parquet
Contributor

This is already in the node name: Scan parquet default.explain_temp1.

Contributor Author

@cloud-fan I can filter it, but I wanted to double-check first. The one we print in the node name is relation.toString and the one printed here is relation.format.toString. Are they always going to be the same?

Contributor

relation.toString always contains the format name, AFAIK.

Contributor Author

@cloud-fan OK, sounds good, Wenchen. I will drop this then.

@@ -100,7 +100,8 @@ class ThriftServerQueryTestSuite extends SQLQueryTestSuite {
"subquery/in-subquery/in-group-by.sql",
"subquery/in-subquery/simple-in.sql",
"subquery/in-subquery/in-order-by.sql",
"subquery/in-subquery/in-set-operations.sql"
"subquery/in-subquery/in-set-operations.sql",
"explain.sql"
Contributor

why?

Contributor Author

@cloud-fan Let me enable it and see.

Format: Parquet
Location [not included in comparison]sql/core/spark-warehouse/[not included in comparison]
PushedFilters: [IsNotNull(key), GreaterThan(key,10)]
ReadSchema: struct<key:int,val:int>
Contributor

Can we have a test case to show the DPP filter?

Contributor Author

@cloud-fan I have enhanced the test in ExplainSuite.

dilipbiswal force-pushed the verbose_string_datasrc_scanexec branch from 56602e1 to 7fc4d5e on October 11, 2019 17:37
@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 12, 2019

Test build #111977 has finished for PR 26042 at commit 7fc4d5e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2019

Test build #111986 has finished for PR 26042 at commit d07e2b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2019

Test build #112087 has finished for PR 26042 at commit 8beed08.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@@ -110,6 +114,10 @@ struct<plan:string>

(1) Scan parquet default.explain_temp1
Output: [key#x, val#x]
Batched: true
Location [not included in comparison]
PushedFilters: [IsNotNull(key), GreaterThan(key,0)]
Contributor

do we have a test case to display the DPP filter?

Contributor Author

dilipbiswal commented Oct 15, 2019

@cloud-fan I have enhanced the test in ExplainSuite to test the DPP filter. What do you think? Perhaps we need some data in order to trigger it? I am using the test @gatorsmile mentioned in the JIRA in ExplainSuite.

Contributor

I'm asking because I don't see where we extract the DPP filter in this PR. You can see from FileSourceScanExec#dynamicallySelectedPartitions that we can get the DPP filters via partitionFilters.filter(isDynamicPruningFilter).
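For reference, a sketch of the extraction that comment points to (simplified; the exact predicate used in FileSourceScanExec may differ):

import org.apache.spark.sql.catalyst.expressions.{DynamicPruningSubquery, Expression}

// A partition filter is a DPP filter if it contains a dynamic-pruning subquery.
def isDynamicPruningFilter(e: Expression): Boolean =
  e.find(_.isInstanceOf[DynamicPruningSubquery]).isDefined

// The DPP filters would then be: partitionFilters.filter(isDynamicPruningFilter)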

"Subquery:1 Hosting operator id = 1 Hosting Expression = k#xL IN subquery#x"
val expected_pattern2 =
"PartitionFilters: \\[isnotnull\\(k#xL\\), dynamicpruningexpression\\(k#xL " +
"IN subquery#x\\)\\]"
Contributor Author

@cloud-fan Is this not a DPP filter?

Contributor

Ah, I see, we already print all the partition filters. LGTM then.

@@ -436,6 +436,9 @@ class SQLQueryTestSuite extends QueryTest with SharedSparkSession {
.replaceAll(
s"Location.*/sql/core/spark-warehouse/$clsName/",
s"Location ${notIncludedMsg}sql/core/spark-warehouse/")
.replaceAll(
s"Location.*\\.\\.\\.",
Contributor

cloud-fan commented Oct 15, 2019

shall we merge it with the previous case and always normalize the entire location string?

Contributor Author

@cloud-fan Actually, I had thought about it, but for some tests I wasn't sure if we should change the output. For example:
sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql

The test is:
DESC EXTENDED t PARTITION (ds='2017-08-01', hr=10)

Perhaps displaying the location is the intention. What do you think?

Contributor Author

If we decide to do it, I feel we should do it in another PR, so that it can be reverted easily if required.

Contributor

Oh wait, why doesn't the previous case work for the new changes?

Contributor Author

@cloud-fan Because we truncate the location to 100 chars for explain.
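(For context, the truncation being discussed behaves roughly like this; an illustrative sketch, not the actual Spark helper:)

// Metadata values in the explain output are cut to ~100 characters, so a long
// location list ends in "..." instead of showing the full path list.
def truncateForExplain(value: String, maxLen: Int = 100): String =
  if (value.length <= maxLen) value else value.take(maxLen) + "..."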

Contributor

How about we don't truncate the location in the new explain?

Contributor

This string replacement was initially introduced in #16373, for SHOW TABLE, which doesn't truncate the location property.

Contributor Author

dilipbiswal commented Oct 17, 2019

How about we don't truncate the location in the new explain?

Running test("explain formatted - check presence of subquery in case of DPP") in ExplainSuite:

Wenchen, I am afraid the location can be arbitrarily large, like the following. Are we sure we want to show everything? It may make the plan unreadable again.

PrunedInMemoryFileIndex[file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=193,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=546,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=0,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=27,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=572,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=333,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=451,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=56,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=55,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=628,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=937,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=609,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=37,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=7,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=908,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=34,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=621,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=596,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=108,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=42,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=990,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=294,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=418,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=749,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=606,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-
....

Contributor

Ah, this is very different from SHOW TABLE.

We need to think about how to display a list of locations in the explain output. I think it's more useful to display only the first location, with the number of remaining locations if there are any.

Contributor Author

@cloud-fan OK, makes sense.
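Something like the following is what the suggestion would amount to (hypothetical helper, not the code that was merged):

// Show the first root path and summarize how many more there are, instead of
// dumping every path or hard-truncating the string mid-path.
def formatLocations(paths: Seq[String]): String =
  if (paths.isEmpty) "[]"
  else if (paths.size == 1) s"[${paths.head}]"
  else s"[${paths.head}, ... ${paths.size - 1} more]"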

@SparkQA

SparkQA commented Oct 15, 2019

Test build #112093 has finished for PR 26042 at commit 8beed08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2019

Test build #112205 has finished for PR 26042 at commit 192960d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Oct 17, 2019

Test build #112212 has finished for PR 26042 at commit 192960d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2019

Test build #112223 has finished for PR 26042 at commit 490ee3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan closed this in ec5d698 on Oct 18, 2019
@cloud-fan
Contributor

thanks, merging to master!

@dilipbiswal
Contributor Author

@cloud-fan Thank you very much.
