[SPARK-29092][SQL] Report additional information about DataSourceScanExec in EXPLAIN FORMATTED #26042
Conversation
Test build #111842 has finished for PR 26042 at commit
Test build #111858 has finished for PR 26042 at commit
DataFilters: [isnotnull(key#x), (key#x > 0)]
Format: Parquet
Location [not included in comparison]sql/core/spark-warehouse/[not included in comparison]
PushedFilters: [IsNotNull(key), GreaterThan(key,0)]
what's the difference between data filters and pushed filters?
@cloud-fan Actually I don't know for sure. Looking at the output, could it be that one is Catalyst's view of the filter and the other is the data source's view, i.e. after we translate it? I am guessing here :-)
@cloud-fan Checked the code. Our guess was right, fyi:
private val pushedDownFilters = dataFilters.flatMap(DataSourceStrategy.translateFilter)
logInfo(s"Pushed Filters: ${pushedDownFilters.mkString(",")}")
Then I think we only need to mention pushedDownFilters and DPP filters. The other filters are ignored/have no effect.
@cloud-fan OK.
Test build #111877 has finished for PR 26042 at commit
retest this please
Test build #111885 has finished for PR 26042 at commit
Test build #111896 has started for PR 26042 at commit
Force-pushed from d641924 to 56602e1
Test build #111913 has finished for PR 26042 at commit
cc @cloud-fan
@@ -65,6 +65,23 @@ trait DataSourceScanExec extends LeafExecNode {
     s"$nodeNamePrefix$nodeName${truncatedString(output, "[", ",", "]", maxFields)}$metadataStr")
  }

  override def verboseStringWithOperatorId(): String = {
    val metadataStr = metadata.toSeq.sorted.map {
nit: we can do filter first, then map.
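A minimal sketch of what that reordering could look like (the redundantKeys set is hypothetical, motivated by the discussion below about dropping the Format entry; metadata is the scan's Map[String, String]):

// Hypothetical: keys that duplicate information already in the node name
val redundantKeys = Set("Format")

// Filter first, then map: drop unwanted metadata entries before
// rendering them, instead of rendering everything and discarding later
val metadataStr = metadata.toSeq
  .filterNot { case (key, _) => redundantKeys.contains(key) }
  .sorted
  .map { case (key, value) => s"$key: $value" }
  .mkString("\n")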
@@ -58,6 +58,11 @@ struct<plan:string>

(1) Scan parquet default.explain_temp1
Output: [key#x, val#x]
Batched: true
Format: Parquet
This is already in the node name: Scan parquet default.explain_temp1
@cloud-fan I can filter it out, but wanted to double-check first. The one we print in the node name is relation.toString and the one printed here is relation.format.toString. Are they always going to be the same?
relation.toString always contains the format name AFAIK.
@cloud-fan OK, sounds good Wenchen. I will drop this then.
@@ -100,7 +100,8 @@ class ThriftServerQueryTestSuite extends SQLQueryTestSuite {
   "subquery/in-subquery/in-group-by.sql",
   "subquery/in-subquery/simple-in.sql",
   "subquery/in-subquery/in-order-by.sql",
-  "subquery/in-subquery/in-set-operations.sql"
+  "subquery/in-subquery/in-set-operations.sql",
+  "explain.sql"
why?
@cloud-fan Let me enable it and see.
Format: Parquet
Location [not included in comparison]sql/core/spark-warehouse/[not included in comparison]
PushedFilters: [IsNotNull(key), GreaterThan(key,10)]
ReadSchema: struct<key:int,val:int>
Can we have a test case to show the DPP filter?
@cloud-fan I have enhanced the test in ExplainSuite.
Force-pushed from 56602e1 to 7fc4d5e
retest this please
Test build #111977 has finished for PR 26042 at commit
Test build #111986 has finished for PR 26042 at commit
Test build #112087 has finished for PR 26042 at commit
retest this please
@@ -110,6 +114,10 @@ struct<plan:string>

(1) Scan parquet default.explain_temp1
Output: [key#x, val#x]
Batched: true
Location [not included in comparison]
PushedFilters: [IsNotNull(key), GreaterThan(key,0)]
do we have a test case to display the DPP filter?
@cloud-fan I have enhanced the test in ExplainSuite to test the DPP filter. What do you think? Perhaps we need some data in order to trigger it? I am using the test @gatorsmile mentioned in the JIRA in ExplainSuite.
I'm asking because I don't see where we extract the DPP filter in this PR. You can see from FileSourceScanExec#dynamicallySelectedPartitions that we can get the DPP filters via partitionFilters.filter(isDynamicPruningFilter).
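For context, a minimal Scala sketch of that split, modeled on FileSourceScanExec (the exact signature and imports here are assumptions, not the verbatim source):

import org.apache.spark.sql.catalyst.expressions.{Expression, PlanExpression}

// Modeled after FileSourceScanExec: a partition filter participates in
// dynamic partition pruning if it contains a plan (subquery) expression
def isDynamicPruningFilter(e: Expression): Boolean =
  e.find(_.isInstanceOf[PlanExpression[_]]).isDefined

// Split the scan's partitionFilters into DPP and static filters
def splitPartitionFilters(partitionFilters: Seq[Expression])
    : (Seq[Expression], Seq[Expression]) =
  partitionFilters.partition(isDynamicPruningFilter)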
"Subquery:1 Hosting operator id = 1 Hosting Expression = k#xL IN subquery#x" | ||
val expected_pattern2 = | ||
"PartitionFilters: \\[isnotnull\\(k#xL\\), dynamicpruningexpression\\(k#xL " + | ||
"IN subquery#x\\)\\]" |
@cloud-fan Is this not a DPP filter?
Ah I see, we already print all the partition filters. LGTM then.
@@ -436,6 +436,9 @@ class SQLQueryTestSuite extends QueryTest with SharedSparkSession {
   .replaceAll(
     s"Location.*/sql/core/spark-warehouse/$clsName/",
     s"Location ${notIncludedMsg}sql/core/spark-warehouse/")
   .replaceAll(
     s"Location.*\\.\\.\\.",
shall we merge it with the previous case and always normalize the entire location string?
@cloud-fan Actually I had thought about it, but for some tests I wasn't sure if we should change the output. For example, in sql/core/src/test/resources/sql-tests/results/describe-part-after-analyze.sql the test is:
DESC EXTENDED t PARTITION (ds='2017-08-01', hr=10)
Perhaps displaying the location is the intention there. What do you think?
If we decide to do it, I feel we should do it in another PR, so it can be reverted easily if required.
Oh wait, why doesn't the previous case work for the new changes?
@cloud-fan Because we truncate the location to 100 chars for explain.
How about we don't truncate the location in the new explain?
This string replace was initially introduced in #16373, for SHOW TABLE, which doesn't truncate the location property.
> How about we don't truncate the location in the new explain?

Wenchen, I am afraid the location can be arbitrarily large. Running test("explain formatted - check presence of subquery in case of DPP") in ExplainSuite produces the following. Are we sure we want to show everything? It may make the plan unreadable again.
PrunedInMemoryFileIndex[file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=193,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=546,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=0,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=27,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=572,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=333,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=451,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=56,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=55,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=628,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=937,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=609,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=37,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=7,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=908,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=34,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=621,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=596,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=108,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=42,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=990,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=294,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=418,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=749,
file:/Users/dilipbiswal/mygit/apache/spark/sql/core/spark-warehouse/org.apache.spark.sql.ExplainSuite/df1/k=606,
...
ah, this is very different from SHOW TABLE.
We need to think about how to display a list of locations in the explain output. I think it's more useful to display only the first location, with the number of remaining locations if there are any.
@cloud-fan OK, makes sense.
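A minimal Scala sketch of the suggested rendering (the function name and signature are illustrative, not the code this PR merged):

// Show only the first root path plus a count of the rest, so scans over
// heavily partitioned tables keep the formatted plan readable
def formatLocations(rootPaths: Seq[String]): String = rootPaths match {
  case Seq()         => "[]"
  case Seq(single)   => s"[$single]"
  case first +: rest => s"[$first, ... ${rest.size} more]"
}

// e.g. formatLocations(Seq(".../df1/k=193", ".../df1/k=546", ".../df1/k=0"))
// => "[.../df1/k=193, ... 2 more]"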
Test build #112093 has finished for PR 26042 at commit
Test build #112205 has finished for PR 26042 at commit
retest this please
Test build #112212 has finished for PR 26042 at commit
Test build #112223 has finished for PR 26042 at commit
thanks, merging to master!
@cloud-fan Thank you very much.
What changes were proposed in this pull request?
Currently we report only the output attributes of a scan in EXPLAIN FORMATTED. This PR implements verboseStringWithOperatorId in DataSourceScanExec to report additional information about a scan, such as pushed-down filters, partition filters, and location.
Before
After
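The original before/after screenshots are not preserved in this transcript; the following plain-text sketch is reconstructed from the golden-file diffs quoted earlier in the conversation, so the exact fields and values are illustrative rather than the verbatim output:

-- Before: only the output attributes are reported
(1) Scan parquet default.explain_temp1
Output: [key#x, val#x]

-- After: additional scan details are reported
(1) Scan parquet default.explain_temp1
Output: [key#x, val#x]
Batched: true
DataFilters: [isnotnull(key#x), (key#x > 0)]
Location: [elided]
PushedFilters: [IsNotNull(key), GreaterThan(key,0)]
ReadSchema: struct<key:int,val:int>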
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?