[HUDI-5384] Adding optimization rule to appropriately push down filters into the HoodieFileIndex#7423
YuweiXiao left a comment
Left a few comments, looks good overall!
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java
    f.references.subsetOf(partitionSet)
  )
  val extraPartitionFilter =
    dataFilters.flatMap(exprUtils.extractPredicatesWithinOutputSet(_, partitionSet))
Why do we have another extract here after the filters.partition above?
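One possible reading of the second extraction step, sketched on a toy expression model rather than Spark's Catalyst classes: `partition` only catches filters that reference partition columns exclusively, while a mixed filter such as `(dt = A AND amount > 10) OR (dt = B AND amount < 5)` can still yield a weaker, partition-only predicate `dt = A OR dt = B`. The names `Expr`, `Pred`, `extractWithin`, and `split` are hypothetical, and the assumption is that `exprUtils.extractPredicatesWithinOutputSet` behaves like Spark's similarly named predicate helper.

```scala
// Toy expression model for illustration only; these are NOT Catalyst classes.
object PartitionPredicateSketch {
  sealed trait Expr { def references: Set[String] }
  final case class Pred(column: String, text: String) extends Expr {
    def references: Set[String] = Set(column)
  }
  final case class And(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }
  final case class Or(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }

  // Step 1: filters referencing only partition columns vs. everything else.
  def split(filters: Seq[Expr], partitionColumns: Set[String]): (Seq[Expr], Seq[Expr]) =
    filters.partition(_.references.subsetOf(partitionColumns))

  // Step 2: derive a weaker predicate that references only `columns`, if possible.
  def extractWithin(expr: Expr, columns: Set[String]): Option[Expr] = expr match {
    case And(l, r) =>
      // For a conjunction, either side alone is a valid (weaker) filter.
      (extractWithin(l, columns), extractWithin(r, columns)) match {
        case (Some(a), Some(b)) => Some(And(a, b))
        case (Some(a), None)    => Some(a)
        case (None, Some(b))    => Some(b)
        case (None, None)       => None
      }
    case Or(l, r) =>
      // For a disjunction, both sides must contribute, or nothing can be extracted.
      for {
        a <- extractWithin(l, columns)
        b <- extractWithin(r, columns)
      } yield Or(a, b)
    case other =>
      if (other.references.subsetOf(columns)) Some(other) else None
  }
}
```

Under these assumptions, the mixed filter stays among the data filters (it must still be evaluated row by row), but the extracted disjunction can additionally be used for partition pruning.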
  HoodieAnalysis.customPreCBORules.foreach { ruleBuilder =>
    extensions.injectPreCBORule(ruleBuilder(_))
  }
*/
Is this commented-out block left in intentionally?
Yes. Ideally this rule would be invoked in the pre-CBO slot, but unfortunately that slot isn't supported in earlier Spark versions.
// NOTE: We should only push-down the predicates [[HoodieFileIndex]], which we didn't
//       prune on before
if (partitionPruningFilters.nonEmpty && !fileIndex.prunedFor(partitionPruningFilters)) {
Do we also need to add the prunedFor check in the standard read path (i.e., catalog-based table read)?
This check is only needed here to guard against re-entry: without it, this rule would loop ad infinitum.
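To illustrate the re-entry concern, here is a minimal sketch of a fixed-point rule driver in plain Scala, not Spark's `RuleExecutor`. `ToyFileIndex`, `ToyPlan`, `guardedRule`, `unguardedRule`, and `runToFixedPoint` are hypothetical names. The sketch assumes the index compares by reference, so every re-pruning produces a plan that looks "changed" to the driver.

```scala
object ReentryGuardSketch {
  // Plain class on purpose: two indexes are equal only if they are the same instance.
  final class ToyFileIndex(val prunedAgainst: Option[Seq[String]]) {
    def prunedFor(filters: Seq[String]): Boolean = prunedAgainst.contains(filters)
    def prune(filters: Seq[String]): ToyFileIndex = new ToyFileIndex(Some(filters))
  }

  final case class ToyPlan(filters: Seq[String], index: ToyFileIndex)

  // Rewrites the plan only if the index was not already pruned for these filters.
  def guardedRule(plan: ToyPlan): ToyPlan =
    if (!plan.index.prunedFor(plan.filters)) plan.copy(index = plan.index.prune(plan.filters))
    else plan

  // Always rewrites, so the plan never stabilizes.
  def unguardedRule(plan: ToyPlan): ToyPlan =
    plan.copy(index = plan.index.prune(plan.filters))

  // Applies `rule` until the plan stops changing or `maxIterations` is reached.
  // Returns the final plan and the number of iterations that changed it.
  def runToFixedPoint(
      plan: ToyPlan,
      rule: ToyPlan => ToyPlan,
      maxIterations: Int): (ToyPlan, Int) = {
    var current = plan
    var rewrites = 0
    var changed = true
    var i = 0
    while (changed && i < maxIterations) {
      val next = rule(current)
      changed = next != current
      if (changed) rewrites += 1
      current = next
      i += 1
    }
    (current, rewrites)
  }
}
```

In this toy model the guarded rule rewrites the plan exactly once, while the unguarded one keeps rewriting until the iteration cap is hit. Tracking the pruned-against predicates as an `Option` is an assumption of the sketch; it lets an empty predicate list still count as "not yet pruned" on the first pass.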
// NOTE: We should only push-down the predicates [[HoodieFileIndex]], which we didn't
//       prune on before
if (partitionPruningFilters.nonEmpty && !fileIndex.prunedFor(partitionPruningFilters)) {
We need to remove this check for an empty list, so that the table is still listed when there are no predicates and the index reflects the proper table size.
 * it's actually accessed
 */
protected lazy val fileIndex: HoodieFileIndex =
  // TODO revert
Please fix this comment or address it.
…d from path using Spark DS
…eIndex` has an empty caches
…was pruned against
…ad of pre-CBO one, since Spark 2.x doesn't support CBO
…ection (for troubleshooting)
…led seq of pushed-down predicates to properly handle case when there are actually no partition predicates pushed down
…s created properly
…rs into the `HoodieFileIndex` (apache#7423)

### Change Logs

This is a follow-up for apache#6680

After transitioning of the `HoodieFileIndex` to do file-listing _lazily_ by default, following issue has been uncovered: due to

- Listing now being delayed (until actual execution of the `FileSourceScanExec` node)
- Spark not providing a generic `Rule` to push-down the predicates (it has `PruneFileSourcePartitions` but it's only applicable to `CatalogFileIndex`)

Statistics (based on the `FileIndex`) for Hudi's relations have been incorrectly estimated due to now these being delayed until the execution time when partition-predicates are pushed-down to `HoodieFileIndex`.

To work this around we're introducing a new `HoodiePruneFileSourcePartitions` rule that is

- Structurally borrowing from `PruneFileSourcePartitions`
- Pushes down predicates to `HoodieFileIndex` to perform partition-pruning in time, before subsequent CBO stage
- Addresses the issue of statistics for Hudi's relations being incorrectly estimated

For more details around the impact of `HoodiePruneFileSourcePartitions`, please check out corresponding `TestHoodiePruneFileSourcePartitions`
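A rough sketch of why the timing of predicate push-down matters for statistics, using a toy index in plain Scala rather than Hudi's or Spark's classes. `ToyIndex`, `withPartitionFilter`, and `wouldBroadcast` are hypothetical names, the partition sizes are made up, and the broadcast decision stands in for any size-based planning choice (10 MB is Spark's default `spark.sql.autoBroadcastJoinThreshold`).

```scala
object StatsEstimationSketch {
  // Toy index: maps a partition path to the total size of its files in bytes.
  final case class ToyIndex(
      partitionSizes: Map[String, Long],
      pushedDown: Option[String => Boolean] = None) {

    // Returns a copy of the index that only considers matching partitions.
    def withPartitionFilter(keep: String => Boolean): ToyIndex =
      copy(pushedDown = Some(keep))

    // Size estimate as seen by the planner at the time it asks.
    def sizeInBytes: Long = pushedDown match {
      case Some(keep) => partitionSizes.collect { case (p, bytes) if keep(p) => bytes }.sum
      case None       => partitionSizes.values.sum
    }
  }

  // Stand-in for a size-based planning decision such as choosing a broadcast join.
  def wouldBroadcast(index: ToyIndex, thresholdBytes: Long): Boolean =
    index.sizeInBytes <= thresholdBytes
}
```

If the partition predicate only reaches the index at execution time, the planner sees the unpruned size and may rule out the cheaper plan; pushing the predicate down before planning gives it the pruned estimate instead.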

Change Logs
This is a follow-up for #6680
After transitioning the `HoodieFileIndex` to do file-listing lazily by default, the following issue has been uncovered. Due to

- Listing now being delayed (until actual execution of the `FileSourceScanExec` node)
- Spark not providing a generic `Rule` to push down the predicates (it has `PruneFileSourcePartitions`, but it's only applicable to `CatalogFileIndex`)

statistics (based on the `FileIndex`) for Hudi's relations have been incorrectly estimated, because listing is now delayed until execution time, when partition predicates are pushed down to `HoodieFileIndex`.

To work around this, we're introducing a new `HoodiePruneFileSourcePartitions` rule that

- Structurally borrows from `PruneFileSourcePartitions`
- Pushes down predicates to `HoodieFileIndex` to perform partition pruning in time, before the subsequent CBO stage
- Addresses the issue of statistics for Hudi's relations being incorrectly estimated

For more details on the impact of `HoodiePruneFileSourcePartitions`, please check out the corresponding `TestHoodiePruneFileSourcePartitions`.
Impact
No impact
Risk level (write none, low medium or high below)
Medium
Documentation Update
N/A
Contributor's checklist