[SPARK-28595][SQL] explain should not trigger partition listing #25328

Closed
cloud-fan wants to merge 4 commits

Conversation

@cloud-fan (Contributor)

What changes were proposed in this pull request?

Sometimes when you explain a query, it gets stuck for a while. What's worse, it gets stuck again if you explain the same query again.

This is caused by FileSourceScanExec:

  1. In its toString, it reports the number of partitions it reads, which requires querying the Hive metastore.
  2. In its outputOrdering, it needs to list all the files, which also requires querying the Hive metastore.

This PR fixes this by:

  1. Not reporting the number of partitions read in toString; we should report it via SQL metrics instead.
  2. Removing the outputOrdering. It is not very useful: we can only apply it if a) all the bucket columns are read, and b) there is only one file in each bucket. This condition is really hard to meet, and even when it is met, sorting an already-sorted file is fast, so avoiding the sort does not buy much. I think it's worth giving up this optimization so that explain doesn't get stuck. A reproduction sketch follows this list.
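A minimal reproduction sketch of the symptom, assuming a partitioned table created with the DataSource (USING) syntax; the table name and schema are hypothetical:

```scala
// Hypothetical partitioned table; on a Hive-backed catalog with many
// partitions, this is the setup that makes explain slow.
spark.sql("CREATE TABLE events (v INT, day STRING) USING parquet PARTITIONED BY (day)")

// Before this PR, building the plan's string representation fetched all
// partitions (and, via outputOrdering, the file listing) from the
// metastore, so even a plain explain could take a long time.
spark.table("events").explain()
```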

How was this patch tested?

Existing tests.

@cloud-fan (Contributor, Author)

cc @hvanhovell @maryannxue @viirya

@dongjoon-hyun (Member)

I agree that the condition was very hard to meet. BTW, IIRC, the main reason for that optimization was to get the same result as Hive for LIMIT queries which didn't have ORDER BY.

> I think it's worth giving up this optimization so that explain doesn't get stuck.

@SparkQA commented Aug 1, 2019

Test build #108525 has finished for PR 25328 at commit fb8793d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

I don't think a system needs to guarantee the output order of a SQL query without ORDER BY. But let me add a legacy config to keep this optimization, just in case. What do you think, @dongjoon-hyun? A sketch of such a config entry is below.
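A sketch of what the legacy config entry could look like in SQLConf. The key matches SQLConf.LEGACY_BUCKETED_TABLE_SCAN_OUTPUT_ORDERING used in the review snippets below; the doc string and default value are assumptions, not the PR's literal code:

```scala
// Assumed shape of the legacy config, following SQLConf's builder pattern.
val LEGACY_BUCKETED_TABLE_SCAN_OUTPUT_ORDERING =
  buildConf("spark.sql.legacy.bucketedTableScan.outputOrdering")
    .doc("When true, the bucketed table scan lists leaf files during " +
      "planning in order to expose the output ordering of sorted buckets. " +
      "This restores the previous behavior but can make explain slow.")
    .booleanConf
    .createWithDefault(false)
```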

@SparkQA commented Aug 2, 2019

Test build #108571 has finished for PR 25328 at commit fa763eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 6, 2019

Test build #108688 has finished for PR 25328 at commit 0652c22.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

retest this please

@SparkQA commented Aug 6, 2019

Test build #108699 has finished for PR 25328 at commit 0652c22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 6, 2019

Test build #108700 has finished for PR 25328 at commit 0652c22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  }

  protected override def afterAll(): Unit = {
    spark.sessionState.conf.unsetConf(SQLConf.LEGACY_BUCKETED_TABLE_SCAN_OUTPUT_ORDERING)
Contributor

Should we store and restore the old conf instead?

@cloud-fan (Contributor, Author)

In the tests, we assume that every test suite keeps the shared SparkSession clean after its tests run. So the old conf should be the default conf here, and we only need to call unsetConf to restore the default. A sketch of the store-and-restore alternative is below.
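For comparison, a minimal sketch of the store-and-restore pattern suggested above, using the public RuntimeConfig API; the config key comes from the snippet above, the rest is illustrative. Spark's withSQLConf test helper does essentially this:

```scala
// Remember whatever value (if any) was set before the test...
val previous = spark.conf.getOption("spark.sql.legacy.bucketedTableScan.outputOrdering")
spark.conf.set("spark.sql.legacy.bucketedTableScan.outputOrdering", "true")
try {
  // ... run the test body ...
} finally {
  // ...then put the old value back, or unset the key if there was none.
  previous match {
    case Some(v) => spark.conf.set("spark.sql.legacy.bucketedTableScan.outputOrdering", v)
    case None => spark.conf.unset("spark.sql.legacy.bucketedTableScan.outputOrdering")
  }
}
```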

@maryannxue (Contributor)

LGTM except one minor comment https://github.com/apache/spark/pull/25328/files#r311096949.

sql("CREATE TABLE t USING json PARTITIONED BY (j) AS SELECT 1 i, 2 j")
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
spark.table("t").explain()
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
@gatorsmile (Member) commented Aug 7, 2019

Could we add a test case that returns a non-zero count when spark.sql.legacy.bucketedTableScan.outputOrdering is set to true?
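A hedged sketch of the kind of test being requested. withSQLConf and HiveCatalogMetrics.reset() are existing Spark test utilities, but the bucketed table b and the exact assertions are illustrative; the test actually added in the PR may differ:

```scala
// Hypothetical bucketed, sorted table: with the legacy flag on, computing
// outputOrdering at planning time requires the file listing.
sql("""CREATE TABLE b USING json PARTITIONED BY (j)
       CLUSTERED BY (i) SORTED BY (i) INTO 4 BUCKETS
       AS SELECT 1 i, 2 j""")
withSQLConf("spark.sql.legacy.bucketedTableScan.outputOrdering" -> "true") {
  HiveCatalogMetrics.reset()
  spark.table("b").explain()
  // Listing the files forces the partition fetch, so the metric is non-zero.
  assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount > 0)
}
```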

@gatorsmile (Member)

LGTM except one comment.

@SparkQA commented Aug 7, 2019

Test build #108745 has finished for PR 25328 at commit 264a259.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 7, 2019

Test build #108747 has finished for PR 25328 at commit bf9b261.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

retest this please

@SparkQA commented Aug 7, 2019

Test build #108753 has finished for PR 25328 at commit bf9b261.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

Thanks for the review, merging to master!

cloud-fan closed this in 469423f on Aug 7, 2019