Skip to content

[HUDI-5384] Adding optimization rule to appropriately push down filters into the HoodieFileIndex#7423

Merged
alexeykudinkin merged 30 commits intoapache:masterfrom
onehouseinc:ak/fl-idx-lz-lst-fix
Jan 20, 2023
Merged

[HUDI-5384] Adding optimization rule to appropriately push down filters into the HoodieFileIndex#7423
alexeykudinkin merged 30 commits intoapache:masterfrom
onehouseinc:ak/fl-idx-lz-lst-fix

Conversation

@alexeykudinkin
Copy link
Contributor

@alexeykudinkin alexeykudinkin commented Dec 10, 2022

Change Logs

This is a follow-up for #6680

After transitioning of the HoodieFileIndex to do file-listing lazily by default, following issue has been uncovered: due to

  • Listing now being delayed (until actual execution of the FileSourceScanExec node)
  • Spark not providing a generic Rule to push-down the predicates (it has PruneFileSourcePartitions but it's only applicable to CatalogFileIndex)

Statistics (based on the FileIndex) for Hudi's relations have been incorrectly estimated due to now these being delayed until the execution time when partition-predicates are pushed-down to HoodieFileIndex.

To work this around we're introducing a new HoodiePruneFileSourcePartitions rule that is

  • Structurally borrowing from PruneFileSourcePartitions
  • Pushes down predicates to HoodieFileIndex to perform partition-pruning in time, before subsequent CBO stage
  • Addresses the issue of statistics for Hudi's relations being incorrectly estimated

For more details around the impact of HoodiePruneFileSourcePartitions, please check out corresponding TestHoodiePruneFileSourcePartitions

Impact

No impact

Risk level (write none, low medium or high below)

Medium

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@alexeykudinkin alexeykudinkin added priority:blocker Production down; release blocker engine:spark Spark integration area:sql SQL interfaces labels Dec 13, 2022
@alexeykudinkin alexeykudinkin changed the title [MINOR] Adding optimization rule to appropriately push down filters into the HoodieFileIndex [HUDI-5384] Adding optimization rule to appropriately push down filters into the HoodieFileIndex Dec 13, 2022
@alexeykudinkin alexeykudinkin requested a review from yihua December 14, 2022 23:54
Copy link
Contributor

@YuweiXiao YuweiXiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, looks good overall!

f.references.subsetOf(partitionSet)
)
val extraPartitionFilter =
dataFilters.flatMap(exprUtils.extractPredicatesWithinOutputSet(_, partitionSet))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have another extract here after the filters.parttion above?

HoodieAnalysis.customPreCBORules.foreach { ruleBuilder =>
extensions.injectPreCBORule(ruleBuilder(_))
}
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment left intended?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you address this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ideally this rule should be invoked in pre-CBO slot, but unfortunately it's not supported for earlier Spark versions


// NOTE: We should only push-down the predicates [[HoodieFileIndex]], which we didn't
// prune on before
if (partitionPruningFilters.nonEmpty && !fileIndex.prunedFor(partitionPruningFilters)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to add the prunedFor check in the standard read path (i.e., catalog-based table read)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is only needed here to gate against re-entry -- w/o it this rule will be looping ad infinitum

@alexeykudinkin alexeykudinkin force-pushed the ak/fl-idx-lz-lst-fix branch 4 times, most recently from 23d58d5 to 42a342a Compare December 21, 2022 07:39
@alexeykudinkin alexeykudinkin changed the title [HUDI-5384] Adding optimization rule to appropriately push down filters into the HoodieFileIndex [HUDI-5384][Stacked on 7528] Adding optimization rule to appropriately push down filters into the HoodieFileIndex Dec 21, 2022

// NOTE: We should only push-down the predicates [[HoodieFileIndex]], which we didn't
// prune on before
if (partitionPruningFilters.nonEmpty && !fileIndex.prunedFor(partitionPruningFilters)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to remove this check for empty list to make sure we list the table in case when there are no predicates to make sure index reflects proper table size

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. good job on chasing and fixing it. couple of minor comments to address. then, we are good to go

* it's actually accessed
*/
protected lazy val fileIndex: HoodieFileIndex =
// TODO revert
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix the comment or address it please.

HoodieAnalysis.customPreCBORules.foreach { ruleBuilder =>
extensions.injectPreCBORule(ruleBuilder(_))
}
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you address this.

@alexeykudinkin alexeykudinkin changed the title [HUDI-5384][Stacked on 7528] Adding optimization rule to appropriately push down filters into the HoodieFileIndex [HUDI-5384] Adding optimization rule to appropriately push down filters into the HoodieFileIndex Jan 19, 2023
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@alexeykudinkin
Copy link
Contributor Author

alexeykudinkin commented Jan 20, 2023

@alexeykudinkin alexeykudinkin merged commit b1552ef into apache:master Jan 20, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023
…rs into the `HoodieFileIndex` (apache#7423)

### Change Logs

This is a follow-up for apache#6680

After transitioning of the `HoodieFileIndex` to do file-listing _lazily_ by default, following issue has been uncovered: due to 

 - Listing now being delayed (until actual execution of the `FileSourceScanExec` node)
 - Spark not providing a generic `Rule` to push-down the predicates (it has `PruneFileSourcePartitions` but it's only applicable to `CatalogFileIndex`)

Statistics (based on the `FileIndex`) for Hudi's relations have been incorrectly estimated due to now these being delayed until the execution time when partition-predicates are pushed-down to `HoodieFileIndex`.

To work this around we're introducing a new `HoodiePruneFileSourcePartitions` rule that is

 - Structurally borrowing from `PruneFileSourcePartitions`
 - Pushes down predicates to `HoodieFileIndex` to perform partition-pruning in time, before subsequent CBO stage
 - Addresses the issue of statistics for Hudi's relations being incorrectly estimated

For more details around the impact of `HoodiePruneFileSourcePartitions`, please check out corresponding `TestHoodiePruneFileSourcePartitions`
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…rs into the `HoodieFileIndex` (apache#7423)

### Change Logs

This is a follow-up for apache#6680

After transitioning of the `HoodieFileIndex` to do file-listing _lazily_ by default, following issue has been uncovered: due to 

 - Listing now being delayed (until actual execution of the `FileSourceScanExec` node)
 - Spark not providing a generic `Rule` to push-down the predicates (it has `PruneFileSourcePartitions` but it's only applicable to `CatalogFileIndex`)

Statistics (based on the `FileIndex`) for Hudi's relations have been incorrectly estimated due to now these being delayed until the execution time when partition-predicates are pushed-down to `HoodieFileIndex`.

To work this around we're introducing a new `HoodiePruneFileSourcePartitions` rule that is

 - Structurally borrowing from `PruneFileSourcePartitions`
 - Pushes down predicates to `HoodieFileIndex` to perform partition-pruning in time, before subsequent CBO stage
 - Addresses the issue of statistics for Hudi's relations being incorrectly estimated

For more details around the impact of `HoodiePruneFileSourcePartitions`, please check out corresponding `TestHoodiePruneFileSourcePartitions`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:sql SQL interfaces engine:spark Spark integration priority:blocker Production down; release blocker

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

4 participants