[SPARK-28595][SQL] explain should not trigger partition listing #25328
Conversation
I agree that it was very hard to meet the condition. BTW, IIRC, the main reason for that optimization was to get the same result with
Test build #108525 has finished for PR 25328 at commit
I don't think a system needs to guarantee the output order of a SQL query without ORDER BY. But let me add a legacy config to keep this optimization, just in case. What do you think @dongjoon-hyun?
Test build #108571 has finished for PR 25328 at commit
Test build #108688 has finished for PR 25328 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
retest this please
Test build #108699 has finished for PR 25328 at commit
Test build #108700 has finished for PR 25328 at commit
}

protected override def afterAll(): Unit = {
  spark.sessionState.conf.unsetConf(SQLConf.LEGACY_BUCKETED_TABLE_SCAN_OUTPUT_ORDERING)
Should we do a "store and recover the old conf" instead?
In the tests, we assume that every test suite keeps the shared SparkSession clean after its tests are run. So the old conf should be the default conf here, and we only need to call unsetConf to restore the default config.
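The cleanup pattern under discussion can be sketched in plain Scala, independent of Spark. The `MiniConf` class below is a hypothetical stand-in for `SQLConf`: unsetting a key falls back to the default, which is equivalent to "store and recover the old conf" only when the suite started from defaults — exactly the assumption made above.

```scala
// Hypothetical stand-in for a session conf with defaults (not Spark's SQLConf).
class MiniConf(defaults: Map[String, String]) {
  private val overrides = scala.collection.mutable.Map.empty[String, String]

  def set(key: String, value: String): Unit = { overrides(key) = value }

  // Unsetting removes the override, falling back to the default value.
  def unset(key: String): Unit = { overrides.remove(key) }

  def get(key: String): Option[String] = overrides.get(key).orElse(defaults.get(key))
}

val conf = new MiniConf(Map("legacy.bucketedScan.outputOrdering" -> "false"))

// beforeAll: the suite enables the legacy behavior.
conf.set("legacy.bucketedScan.outputOrdering", "true")
assert(conf.get("legacy.bucketedScan.outputOrdering").contains("true"))

// afterAll: unset is enough *if* the suite started from defaults;
// otherwise you would need to store the old value and restore it explicitly.
conf.unset("legacy.bucketedScan.outputOrdering")
assert(conf.get("legacy.bucketedScan.outputOrdering").contains("false"))
```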
LGTM except one minor comment https://github.com/apache/spark/pull/25328/files#r311096949.
sql("CREATE TABLE t USING json PARTITIONED BY (j) AS SELECT 1 i, 2 j")
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
spark.table("t").explain()
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
Could we add a test case that returns a non-zero count when spark.sql.legacy.bucketedTableScan.outputOrdering is set to true?
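The intent of the requested test can be sketched without Spark. The `CountingCatalog`, `listFiles`, and `outputOrdering` names below are hypothetical stand-ins, not Spark APIs: the point is that the legacy ordering path must list files (bumping the fetch counter), while the new default path touches nothing.

```scala
// Hypothetical catalog that counts expensive listing calls (not Spark's API).
class CountingCatalog {
  var fetches = 0
  def listFiles(): Seq[String] = { fetches += 1; Seq("part-00000") }
}

// Sketch of the legacy-config guard: only the legacy path inspects files,
// because it must prove each bucket contains exactly one file.
def outputOrdering(legacyEnabled: Boolean, catalog: CountingCatalog): Seq[String] =
  if (legacyEnabled) {
    val files = catalog.listFiles()
    if (files.size == 1) Seq("i ASC") else Nil
  } else Nil

val disabled = new CountingCatalog
outputOrdering(legacyEnabled = false, disabled)
assert(disabled.fetches == 0)   // default: no listing triggered

val enabled = new CountingCatalog
outputOrdering(legacyEnabled = true, enabled)
assert(enabled.fetches == 1)    // legacy config: listing happens
```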
LGTM except one comment.
Test build #108745 has finished for PR 25328 at commit
Test build #108747 has finished for PR 25328 at commit
retest this please
Test build #108753 has finished for PR 25328 at commit
Thanks for the review, merging to master!
What changes were proposed in this pull request?

Sometimes when you explain a query, it gets stuck for a while. What's worse, it gets stuck again if you explain the same query again.

This is caused by FileSourceScanExec:
1. In its toString, it reports the number of partitions it reads. This requires querying the Hive metastore.
2. In its outputOrdering, it needs to list all the files. This also requires querying the Hive metastore.

This PR fixes the problem by:
1. No longer reporting the number of partitions read in toString. We should report it via SQL metrics instead.
2. Giving up the outputOrdering optimization. It can only be applied if a) all the bucket columns are read and b) there is only one file in each bucket. This condition is really hard to meet, and even when it is met, sorting an already sorted file is pretty fast, so avoiding the sort is not that useful. I think it's worth giving up this optimization so that explain doesn't get stuck.

How was this patch tested?

Existing tests.
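The core idea of the fix — keep expensive metadata out of toString so that explain never touches the metastore — can be illustrated with a small self-contained sketch. The `Metastore` and `FileScan` classes below are hypothetical stand-ins, not Spark's actual FileSourceScanExec: partition listing sits behind a lazy val, and toString reports nothing that would force it.

```scala
// Hypothetical metastore that counts partition-listing calls.
class Metastore {
  var partitionFetches = 0
  def listPartitions(table: String): Seq[String] = {
    partitionFetches += 1
    Seq("j=2")
  }
}

class FileScan(table: String, metastore: Metastore) {
  // Listed lazily, only when execution actually needs the partitions.
  lazy val partitions: Seq[String] = metastore.listPartitions(table)

  // Before the fix, toString reported partitions.size and forced the listing;
  // here it reports nothing that requires the metastore, so explain stays cheap.
  override def toString: String = s"FileScan($table)"
}

val metastore = new Metastore
val scan = new FileScan("t", metastore)

scan.toString                           // simulate explain
assert(metastore.partitionFetches == 0) // no metastore round trip

scan.partitions                         // simulate actual execution
assert(metastore.partitionFetches == 1) // listed exactly once
```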