[SPARK-30525][SQL] HiveTableScanExec does not need to prune partitions again after pushing down to SessionCatalog for partition pruning #27232
Conversation
Test build #116829 has finished for PR 27232 at commit
Test build #116856 has finished for PR 27232 at commit
emmmm, the reason we prune partitions again is that, when we call …, we can't promise all these cases were fixed in …
HiveExternalCatalog.listPartitionsByFilter calls HiveClient.getPartitionsByFilter to push partition pruning down to the Hive metastore, which may not be able to convert all Spark filters to Hive filters. But HiveExternalCatalog.listPartitionsByFilter now also calls ExternalCatalogUtils.prunePartitionsByFilter to prune the results returned by HiveClient.getPartitionsByFilter again. So it is no longer necessary to prune again in HiveTableScanExec. You can check the code: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L1254
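The two-step flow described above can be modeled with a minimal, self-contained sketch. The Partition and Filter types and both functions here are hypothetical stand-ins for illustration, not Spark's real API: the metastore call applies only the convertible filters (so it may return a superset), and the catalog then re-applies all filters client-side, making the result exact before any caller sees it.

```scala
// Minimal sketch, NOT Spark's actual API: all names here are stand-ins.
object CatalogPruneSketch {
  case class Partition(spec: Map[String, String])

  // A filter is "convertible" if the metastore can evaluate it; every
  // filter can still be evaluated client-side via `pred`.
  case class Filter(convertible: Boolean, pred: Partition => Boolean)

  // Stand-in for HiveClient.getPartitionsByFilter: only convertible
  // filters reach the metastore, so the result may be a superset.
  def metastoreGetByFilter(all: Seq[Partition], fs: Seq[Filter]): Seq[Partition] =
    all.filter(p => fs.filter(_.convertible).forall(_.pred(p)))

  // Stand-in for HiveExternalCatalog.listPartitionsByFilter: push down,
  // then prune the returned partitions with ALL filters (the role
  // ExternalCatalogUtils.prunePartitionsByFilter plays in Spark).
  def listPartitionsByFilter(all: Seq[Partition], fs: Seq[Filter]): Seq[Partition] =
    metastoreGetByFilter(all, fs).filter(p => fs.forall(_.pred(p)))

  def main(args: Array[String]): Unit = {
    val parts = Seq(
      Partition(Map("b" -> "xyz")),
      Partition(Map("b" -> "abc")))
    // A predicate the metastore cannot evaluate, like "b like 'xyz'".
    val filters = Seq(Filter(convertible = false, _.spec("b") == "xyz"))
    // The metastore alone returns both partitions...
    assert(metastoreGetByFilter(parts, filters).size == 2)
    // ...but the catalog's client-side pass already prunes exactly,
    // so a caller like HiveTableScanExec need not prune again.
    assert(listPartitionsByFilter(parts, filters).map(_.spec("b")) == Seq("xyz"))
  }
}
```

In this model, any extra pruning pass in the caller would be a no-op, which is exactly the argument for removing it.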
cc @cloud-fan
Test build #117093 has finished for PR 27232 at commit
retest this please.
can the tests call partitions?
The test is meant to verify the SQLConf HIVE_METASTORE_PARTITION_PRUNING, but the partitions method returns the same result whether HIVE_METASTORE_PARTITION_PRUNING is true or false.
What about just removing the test case "Verify SQLConf HIVE_METASTORE_PARTITION_PRUNING" in HiveTableScanSuite?
updated to refine the test, please help review again. thanks a lot.
Test build #117152 has finished for PR 27232 at commit
Test build #117165 has finished for PR 27232 at commit
Test build #117457 has finished for PR 27232 at commit
Test build #117462 has finished for PR 27232 at commit
Test build #117480 has finished for PR 27232 at commit
I'm not a good reviewer for this
ok, sorry for bothering you.
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala
Test build #117664 has finished for PR 27232 at commit
…ionsByFilter called in HiveTableScanExec.
Test build #117708 has finished for PR 27232 at commit
cc @cloud-fan
      prunePartitions(hivePartitions)
    }
  } else {
    if (sparkSession.sessionState.conf.metastorePartitionPruning) {
to be consistent, we should add && partitionPruningPred.nonEmpty
// exposed for tests
-@transient lazy val rawPartitions = {
+@transient lazy val rawPartitions: Seq[HivePartition] = {
when we call rawPartitions, the relation.prunedPartitions must be empty. We can remove relation.prunedPartitions.getOrElse below.
LGTM except 2 minor comments
retest this please
retest this please
retest this please
Test build #117772 has finished for PR 27232 at commit
thanks, merging to master!

thank you all for the help.
What changes were proposed in this pull request?
HiveTableScanExec no longer prunes partitions again after SessionCatalog.listPartitionsByFilter is called.
Why are the changes needed?
In HiveTableScanExec, partition pruning is pushed down to the Hive metastore when spark.sql.hive.metastorePartitionPruning is true, and the returned partitions are then pruned again with the partition filters, because some predicates, e.g. "b like 'xyz'", are not supported by the Hive metastore. But this problem is already fixed in HiveExternalCatalog.listPartitionsByFilter, which can now return exactly the partitions we want. So it is no longer necessary to double-prune in HiveTableScanExec.
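Under that assumption, the resulting simplification on the scan side can be sketched with self-contained stand-ins. The names rawPartitions and listPartitionsByFilter and the types here are illustrative only, not the real HiveTableScanExec code: once the catalog returns exact results, the scan keeps a single decision (whether to push filters down at all) and drops the second pruning pass.

```scala
// Illustrative stand-ins only; NOT the real HiveTableScanExec code.
object ScanSketch {
  case class Partition(spec: Map[String, String])
  type Pred = Partition => Boolean

  // Stand-in for SessionCatalog.listPartitionsByFilter, assumed (after
  // this change) to return exactly the matching partitions.
  def listPartitionsByFilter(all: Seq[Partition], preds: Seq[Pred]): Seq[Partition] =
    all.filter(p => preds.forall(_(p)))

  // Before the PR the pushed-down result was pruned a second time; now
  // the scan uses the catalog's result directly.  The preds.nonEmpty
  // guard mirrors the review suggestion to skip pushdown when there is
  // nothing to prune by.
  def rawPartitions(
      all: Seq[Partition],
      metastorePartitionPruning: Boolean,
      preds: Seq[Pred]): Seq[Partition] =
    if (metastorePartitionPruning && preds.nonEmpty) {
      listPartitionsByFilter(all, preds) // no second prunePartitions pass
    } else {
      all // list everything; pruning happens later, as before
    }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Partition(Map("p" -> "1")), Partition(Map("p" -> "2")))
    val preds: Seq[Pred] = Seq(_.spec("p") == "1")
    // Pushdown enabled with a predicate: the catalog result is final.
    assert(rawPartitions(parts, metastorePartitionPruning = true, preds).size == 1)
    // No predicates: all partitions are listed, nothing to push down.
    assert(rawPartitions(parts, metastorePartitionPruning = true, Nil).size == 2)
  }
}
```

The sketch makes the invariant explicit: correctness rests entirely on the catalog's result being exact, which is what the client-side prune inside HiveExternalCatalog.listPartitionsByFilter guarantees.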
Does this PR introduce any user-facing change?
no
How was this patch tested?
Existing unit tests.