[SPARK-28595][SQL] explain should not trigger partition listing #25328

Closed
cloud-fan wants to merge 4 commits

Conversation

@cloud-fan (Contributor)

What changes were proposed in this pull request?

Sometimes when you explain a query, it gets stuck for a while. What's worse, it gets stuck again if you explain the same query again.

This is caused by FileSourceScanExec:

  1. In its toString, it reports the number of partitions it reads, which requires querying the Hive metastore.
  2. In its outputOrdering, it needs to list all the files, which also requires querying the Hive metastore.

This PR fixes this by:

  1. Not reporting the number of partitions read in toString; we should report it via SQL metrics instead.
  2. Removing the outputOrdering. It is not very useful: we can only apply it if a) all the bucket columns are read, and b) there is only one file in each bucket. This condition is really hard to meet, and even when it is met, sorting an already-sorted file is fast, so avoiding the sort does not buy much. I think it's worth giving up this optimization so that explain doesn't get stuck. A reproduction sketch follows this list.
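A minimal reproduction sketch of the symptom, assuming a partitioned table created with the DataSource (USING) syntax; the table name and schema are hypothetical:

```scala
// Hypothetical partitioned table; on a Hive-backed catalog with many
// partitions, this is the setup that makes explain slow.
spark.sql("CREATE TABLE events (v INT, day STRING) USING parquet PARTITIONED BY (day)")

// Before this PR, building the plan's string representation fetched all
// partitions (and, via outputOrdering, the file listing) from the
// metastore, so even a plain explain could take a long time.
spark.table("events").explain()
```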

How was this patch tested?

Existing tests.

@cloud-fan (Contributor, Author)

cc @hvanhovell @maryannxue @viirya

@dongjoon-hyun (Member)

I agree that the condition was very hard to meet. BTW, IIRC, the main reason for that optimization was to get the same result as Hive for LIMIT queries which didn't have ORDER BY.

> I think it's worth giving up this optimization so that explain doesn't get stuck.

@SparkQA commented Aug 1, 2019

Test build #108525 has finished for PR 25328 at commit fb8793d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

I don't think a system needs to guarantee the output order of a SQL query without ORDER BY. But let me add a legacy config to keep this optimization, just in case. What do you think, @dongjoon-hyun? A sketch of such a config entry is below.
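A sketch of what the legacy config entry could look like in SQLConf. The key matches SQLConf.LEGACY_BUCKETED_TABLE_SCAN_OUTPUT_ORDERING used in the review snippets below; the doc string and default value are assumptions, not the PR's literal code:

```scala
// Assumed shape of the legacy config, following SQLConf's builder pattern.
val LEGACY_BUCKETED_TABLE_SCAN_OUTPUT_ORDERING =
  buildConf("spark.sql.legacy.bucketedTableScan.outputOrdering")
    .doc("When true, the bucketed table scan lists leaf files during " +
      "planning in order to expose the output ordering of sorted buckets. " +
      "This restores the previous behavior but can make explain slow.")
    .booleanConf
    .createWithDefault(false)
```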

@SparkQA commented Aug 2, 2019

Test build #108571 has finished for PR 25328 at commit fa763eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 6, 2019

Test build #108688 has finished for PR 25328 at commit 0652c22.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

retest this please

@SparkQA commented Aug 6, 2019

Test build #108699 has finished for PR 25328 at commit 0652c22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 6, 2019

Test build #108700 has finished for PR 25328 at commit 0652c22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  }

  protected override def afterAll(): Unit = {
    spark.sessionState.conf.unsetConf(SQLConf.LEGACY_BUCKETED_TABLE_SCAN_OUTPUT_ORDERING)
Contributor

Should we store and restore the old conf instead?

@cloud-fan (Contributor, Author)

In the tests, we assume that every test suite keeps the shared SparkSession clean after its tests run. So the old conf should be the default conf here, and we only need to call unsetConf to restore the default. A sketch of the store-and-restore alternative is below.
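For comparison, a minimal sketch of the store-and-restore pattern suggested above, using the public RuntimeConfig API; the config key comes from the snippet above, the rest is illustrative. Spark's withSQLConf test helper does essentially this:

```scala
// Remember whatever value (if any) was set before the test...
val previous = spark.conf.getOption("spark.sql.legacy.bucketedTableScan.outputOrdering")
spark.conf.set("spark.sql.legacy.bucketedTableScan.outputOrdering", "true")
try {
  // ... run the test body ...
} finally {
  // ...then put the old value back, or unset the key if there was none.
  previous match {
    case Some(v) => spark.conf.set("spark.sql.legacy.bucketedTableScan.outputOrdering", v)
    case None => spark.conf.unset("spark.sql.legacy.bucketedTableScan.outputOrdering")
  }
}
```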

@maryannxue (Contributor)

LGTM except one minor comment https://github.com/apache/spark/pull/25328/files#r311096949.

sql("CREATE TABLE t USING json PARTITIONED BY (j) AS SELECT 1 i, 2 j")
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
spark.table("t").explain()
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
@gatorsmile (Member) commented Aug 7, 2019

Could we add a test case that returns a non-zero count when spark.sql.legacy.bucketedTableScan.outputOrdering is set to true?
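A hedged sketch of the kind of test being requested. withSQLConf and HiveCatalogMetrics.reset() are existing Spark test utilities, but the bucketed table b and the exact assertions are illustrative; the test actually added in the PR may differ:

```scala
// Hypothetical bucketed, sorted table: with the legacy flag on, computing
// outputOrdering at planning time requires the file listing.
sql("""CREATE TABLE b USING json PARTITIONED BY (j)
       CLUSTERED BY (i) SORTED BY (i) INTO 4 BUCKETS
       AS SELECT 1 i, 2 j""")
withSQLConf("spark.sql.legacy.bucketedTableScan.outputOrdering" -> "true") {
  HiveCatalogMetrics.reset()
  spark.table("b").explain()
  // Listing the files forces the partition fetch, so the metric is non-zero.
  assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount > 0)
}
```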

@gatorsmile (Member)

LGTM except one comment.

@SparkQA commented Aug 7, 2019

Test build #108745 has finished for PR 25328 at commit 264a259.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 7, 2019

Test build #108747 has finished for PR 25328 at commit bf9b261.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

retest this please

@SparkQA commented Aug 7, 2019

Test build #108753 has finished for PR 25328 at commit bf9b261.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

Thanks for the review, merging to master!

cloud-fan closed this in 469423f on Aug 7, 2019