Skip to content

[SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning#27232

Closed
fuwhu wants to merge 7 commits intoapache:masterfrom
fuwhu:SPARK-30525
Closed

[SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning#27232
fuwhu wants to merge 7 commits intoapache:masterfrom
fuwhu:SPARK-30525

Conversation

@fuwhu
Copy link
Contributor

@fuwhu fuwhu commented Jan 16, 2020

What changes were proposed in this pull request?

HiveTableScanExec does not prune partitions again after SessionCatalog.listPartitionsByFilter called.

Why are the changes needed?

In HiveTableScanExec, it will push down to hive metastore for partition pruning if spark.sql.hive.metastorePartitionPruning is true, and then it will prune the returned partitions again using partition filters, because some predicates, eg. "b like 'xyz'", are not supported in hive metastore. But now this problem is already fixed in HiveExternalCatalog.listPartitionsByFilter, the HiveExternalCatalog.listPartitionsByFilter can return exactly what we want now. So it is not necessary any more to double prune in HiveTableScanExec.

Does this PR introduce any user-facing change?

no

How was this patch tested?

Existing unit tests.

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116829 has finished for PR 27232 at commit 6abfdc9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116856 has finished for PR 27232 at commit ef469f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor

AngersZhuuuu commented Jan 17, 2020

emmmm, prune partition again is because when call listPartitionsByFilter kit can't convert all spark filter expression to hive metastore filter condition. and will get more partition then exactly wanted. then prune again will spark's method to keep smallest partitions.

If we can't promise all these case was fixed in listPartitionsByFilter, we may still need this. and it's cost is negligible

@fuwhu
Copy link
Contributor Author

fuwhu commented Jan 17, 2020

emmmm, prune partition again is because when call listPartitionsByFilter kit can't convert all spark filter expression to hive metastore filter condition. and will get more partition then exactly wanted. then prune again will spark's method to keep smallest partitions.

If we can't promise all these case was fixed in listPartitionsByFilter, we may still need this. and it's cost is negligible

HiveExternalCatalog.listPartitionsByFilter will call HiveClient.getPartitionsByFilter to push down to hive metastore for partition pruning, which may not convert all spark filters to hive filters.

But now it already call ExternalCatalogUtils.prunePartitionsByFilter to prune the results returned by HiveClient.getPartitionsByFilter again in HiveExternalCatalog.listPartitionsByFilter. So it is not necessary any more to prune again in HiveTableScanExec.

you can check the code : https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L1254

@fuwhu
Copy link
Contributor Author

fuwhu commented Jan 18, 2020

cc @cloud-fan

@fuwhu fuwhu closed this Jan 20, 2020
@fuwhu fuwhu reopened this Jan 20, 2020
@SparkQA
Copy link

SparkQA commented Jan 20, 2020

Test build #117093 has finished for PR 27232 at commit ef469f5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu
Copy link
Contributor Author

fuwhu commented Jan 21, 2020

retest this please.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the tests call partitions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is to verify the SQLConf HIVE_METASTORE_PARTITION_PRUNING, but the partitions method will return same result no matter HIVE_METASTORE_PARTITION_PRUNING is true or false.

what about just remove the test case "Verify SQLConf HIVE_METASTORE_PARTITION_PRUNING" in HiveTableScanSuite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to refine the test, please help review again. thanks a lot.

@SparkQA
Copy link

SparkQA commented Jan 21, 2020

Test build #117152 has finished for PR 27232 at commit ef469f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 21, 2020

Test build #117165 has finished for PR 27232 at commit d8eefe9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu fuwhu requested review from cloud-fan and wangyum and removed request for wangyum January 28, 2020 02:30
@SparkQA
Copy link

SparkQA commented Jan 28, 2020

Test build #117457 has finished for PR 27232 at commit 95d62c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 28, 2020

Test build #117462 has finished for PR 27232 at commit ff7d71a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 28, 2020

Test build #117480 has finished for PR 27232 at commit df9c4e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu fuwhu requested a review from srowen January 29, 2020 02:28
@srowen srowen removed their request for review January 29, 2020 02:32
@srowen
Copy link
Member

srowen commented Jan 29, 2020

I'm not a good reviewer for this

@fuwhu
Copy link
Contributor Author

fuwhu commented Jan 29, 2020

I'm not a good reviewer for this

ok, sorry for bothering.

@fuwhu fuwhu changed the title [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to hive metastore [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again Jan 30, 2020
@fuwhu fuwhu requested review from gatorsmile, gengliangwang and maropu and removed request for maropu January 30, 2020 04:08
@SparkQA
Copy link

SparkQA commented Jan 31, 2020

Test build #117664 has finished for PR 27232 at commit a862bd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu fuwhu changed the title [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again [SPARK-30525][SQL]HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning Feb 1, 2020
@SparkQA
Copy link

SparkQA commented Feb 1, 2020

Test build #117708 has finished for PR 27232 at commit 057a594.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fuwhu
Copy link
Contributor Author

fuwhu commented Feb 2, 2020

cc @cloud-fan

prunePartitions(hivePartitions)
}
} else {
if (sparkSession.sessionState.conf.metastorePartitionPruning) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be consistent, we should add && partitionPruningPred.nonEmpty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, added.


// exposed for tests
@transient lazy val rawPartitions = {
@transient lazy val rawPartitions: Seq[HivePartition] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we call rawPartitions, the relation.prunedPartitions must be empty. We can remove relation.prunedPartitions.getOrElse below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, removed.

@cloud-fan
Copy link
Contributor

LGTM except 2 minor comments

@cloud-fan
Copy link
Contributor

retest this please

2 similar comments
@fuwhu
Copy link
Contributor Author

fuwhu commented Feb 3, 2020

retest this please

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Feb 3, 2020

Test build #117772 has finished for PR 27232 at commit f0ad037.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 47659a0 Feb 3, 2020
@fuwhu fuwhu deleted the SPARK-30525 branch February 4, 2020 01:50
@fuwhu
Copy link
Contributor Author

fuwhu commented Feb 4, 2020

thank you all for help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants