[HUDI-5384] Adding optimization rule to appropriately push down filters into the `HoodieFileIndex` by alexeykudinkin · Pull Request #7423 · apache/hudi

alexeykudinkin · 2022-12-10T02:14:20Z

Change Logs

This is a follow-up for #6680

After `HoodieFileIndex` was switched to list files lazily by default, the following issue was uncovered. Because

- listing is now delayed until the `FileSourceScanExec` node actually executes, and
- Spark provides no generic `Rule` to push down predicates (it has `PruneFileSourcePartitions`, but that rule only applies to `CatalogFileIndex`),

statistics (based on the `FileIndex`) for Hudi's relations are estimated incorrectly: the estimation is now effectively deferred until execution time, when partition predicates are pushed down to `HoodieFileIndex`.

To work around this, we're introducing a new `HoodiePruneFileSourcePartitions` rule that

- structurally borrows from `PruneFileSourcePartitions`,
- pushes down predicates to `HoodieFileIndex` so that partition pruning happens in time, before the subsequent CBO stage, and
- addresses the issue of statistics for Hudi's relations being estimated incorrectly.

For more details on the impact of `HoodiePruneFileSourcePartitions`, please check out the corresponding `TestHoodiePruneFileSourcePartitions`.
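To make the problem statement concrete, here is a minimal sketch in plain Scala. It is a toy model, not Hudi's or Spark's API: `ToyFileIndex`, `DataFile`, `prune`, and the file sizes are all hypothetical, and the assumption that an unpruned index reports the whole table's size is mine. It only illustrates why a size estimate read before any predicate reaches a lazy index differs from one read after a pruning rule has run.

```scala
// Toy model only: every name below is hypothetical, not part of Hudi or Spark.
object LazyIndexSketch {
  final case class DataFile(partition: String, sizeBytes: Long)

  // A file index that remembers which files survived partition pruning, if any.
  final class ToyFileIndex(allFiles: Seq[DataFile]) {
    private var pruned: Option[Seq[DataFile]] = None

    // Stand-in for pushing partition predicates into the index.
    def prune(keep: String => Boolean): Unit =
      pruned = Some(allFiles.filter(f => keep(f.partition)))

    // Assumption: before pruning, the only available estimate is the whole table.
    def sizeInBytes: Long = pruned.getOrElse(allFiles).map(_.sizeBytes).sum
  }

  val files: Seq[DataFile] = Seq(
    DataFile("2022-12-01", 100L),
    DataFile("2022-12-02", 200L),
    DataFile("2022-12-03", 300L)
  )

  // Estimate read at planning time, before any predicate reached the index.
  def estimateWithoutRule: Long = new ToyFileIndex(files).sizeInBytes

  // Estimate read after a pruning rule pushed `partition = '2022-12-02'` down.
  def estimateWithRule: Long = {
    val index = new ToyFileIndex(files)
    index.prune(_ == "2022-12-02")
    index.sizeInBytes
  }
}
```

In this toy setup the planner would see the full-table size in the first case and only the matching partition's size in the second, which is the kind of gap a cost-based optimizer is sensitive to.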

Impact

No impact

Risk level (write none, low medium or high below)

Medium

Documentation Update

N/A

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

YuweiXiao

Left a few comments, looks good overall!

hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java

YuweiXiao · 2022-12-16T09:16:06Z

...park/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodiePruneFileSourcePartitions.scala

+      f.references.subsetOf(partitionSet)
+    )
+    val extraPartitionFilter =
+      dataFilters.flatMap(exprUtils.extractPredicatesWithinOutputSet(_, partitionSet))


Why do we have another extract here after the `filters.partition` above?
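For context, one plausible reading of the quoted snippet is sketched below. This is a toy expression model in plain Scala, not Spark's `Expression` API: `Expr`, `Pred`, `extractWithin`, and `split` are hypothetical names, and `extractWithin` only mimics what the name `extractPredicatesWithinOutputSet` suggests. The idea is that `partition` classifies whole filters, so a filter mixing partition and data columns lands in `dataFilters`, while the second pass can still recover a weaker, partition-only predicate implied by it.

```scala
// Toy expression model; all names are hypothetical stand-ins.
object PartitionFilterSketch {
  sealed trait Expr { def refs: Set[String] }
  final case class Pred(column: String, text: String) extends Expr {
    def refs: Set[String] = Set(column)
  }
  final case class And(left: Expr, right: Expr) extends Expr {
    def refs: Set[String] = left.refs ++ right.refs
  }
  final case class Or(left: Expr, right: Expr) extends Expr {
    def refs: Set[String] = left.refs ++ right.refs
  }

  // Returns a predicate over `cols` only that is implied by `e`, if one exists.
  def extractWithin(e: Expr, cols: Set[String]): Option[Expr] = e match {
    case And(l, r) =>
      (extractWithin(l, cols), extractWithin(r, cols)) match {
        case (Some(a), Some(b)) => Some(And(a, b))
        case (a, b)             => a.orElse(b) // either conjunct alone is still implied
      }
    case Or(l, r) =>
      // Both branches must contribute; otherwise nothing can be pruned safely.
      for (a <- extractWithin(l, cols); b <- extractWithin(r, cols)) yield Or(a, b)
    case other =>
      if (other.refs.subsetOf(cols)) Some(other) else None
  }

  // First pass: whole filters over partition columns. Second pass: extra
  // partition-only predicates recovered from the remaining (data) filters.
  def split(filters: Seq[Expr], cols: Set[String]): (Seq[Expr], Seq[Expr], Seq[Expr]) = {
    val (partitionFilters, dataFilters) = filters.partition(_.refs.subsetOf(cols))
    val extra = dataFilters.flatMap(extractWithin(_, cols))
    (partitionFilters, dataFilters, extra)
  }

  val filters: Seq[Expr] = Seq(
    Pred("dt", "dt >= '2022-12-01'"),
    Or(
      And(Pred("dt", "dt = '2022-12-02'"), Pred("amount", "amount > 10")),
      Pred("dt", "dt = '2022-12-03'")
    )
  )
}
```

With `cols = Set("dt")`, the second filter references `amount` and is therefore classified as a data filter, yet the extraction pass still derives `dt = '2022-12-02' OR dt = '2022-12-03'` from it, which can be used for partition pruning in addition to the first filter.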

YuweiXiao · 2022-12-16T09:33:11Z

...source/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieSparkSessionExtension.scala

+    HoodieAnalysis.customPreCBORules.foreach { ruleBuilder =>
+      extensions.injectPreCBORule(ruleBuilder(_))
+    }
+    */


Is this commented-out code left in intentionally?

Can you address this?

Yes, ideally this rule should be invoked in the pre-CBO slot, but unfortunately that slot isn't supported in earlier Spark versions.

YuweiXiao · 2022-12-16T09:35:42Z

...park/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodiePruneFileSourcePartitions.scala

+
+      // NOTE: We should only push-down the predicates [[HoodieFileIndex]], which we didn't
+      //       prune on before
+      if (partitionPruningFilters.nonEmpty && !fileIndex.prunedFor(partitionPruningFilters)) {


Do we also need to add the prunedFor check in the standard read path (i.e., catalog-based table read)?

This check is only needed here to guard against re-entry; without it, this rule would loop ad infinitum.
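The re-entry concern can be illustrated with a small, self-contained sketch. It is a toy model under stated assumptions, not the rule from this PR: `Plan`, `pruneRule`, `unguardedRule`, and `runToFixedPoint` are hypothetical, and the driver only imitates the general idea of re-applying a rule until the plan stops changing or an iteration budget is exhausted.

```scala
// Toy model of a rule applied to a fixed point; all names are hypothetical.
object ReentryGuardSketch {
  // A "plan" recording which filters the index was already pruned for,
  // and how many times the rule actually rewrote it.
  final case class Plan(filters: Seq[String], prunedFor: Option[Seq[String]], rewrites: Int)

  // Guarded rule: rewrites only if the index was not yet pruned for these filters.
  def pruneRule(plan: Plan): Plan =
    if (plan.prunedFor.contains(plan.filters)) plan // nothing new to push down
    else plan.copy(prunedFor = Some(plan.filters), rewrites = plan.rewrites + 1)

  // Same rule without the guard: every invocation produces a changed plan.
  def unguardedRule(plan: Plan): Plan =
    plan.copy(prunedFor = Some(plan.filters), rewrites = plan.rewrites + 1)

  // Re-applies `rule` until the plan stops changing or the budget runs out.
  // Returns the final plan and whether a fixed point was reached.
  def runToFixedPoint(rule: Plan => Plan, start: Plan, maxIterations: Int): (Plan, Boolean) = {
    var current = start
    var i = 0
    while (i < maxIterations) {
      val next = rule(current)
      if (next == current) return (current, true)
      current = next
      i += 1
    }
    (current, false)
  }
}
```

Starting from `Plan(Seq("dt = '2022-12-02'"), None, 0)`, the guarded rule rewrites the plan once and then leaves it alone, so the driver converges; the unguarded variant keeps rewriting until the budget is spent. Tracking the pruned-for state as an `Option` in this sketch also lets the rule fire exactly once when the filter list is empty.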

alexeykudinkin · 2023-01-17T22:07:07Z

...park/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodiePruneFileSourcePartitions.scala

+
+      // NOTE: We should only push-down the predicates [[HoodieFileIndex]], which we didn't
+      //       prune on before
+      if (partitionPruningFilters.nonEmpty && !fileIndex.prunedFor(partitionPruningFilters)) {


We need to remove the check for an empty list so that the table is still listed when there are no predicates; otherwise the index won't reflect the proper table size.

nsivabalan

LGTM. Good job chasing this down and fixing it. A couple of minor comments to address; then we're good to go.

nsivabalan · 2023-01-18T15:30:47Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala

   * it's actually accessed
   */
-  protected lazy val fileIndex: HoodieFileIndex =
+  // TODO revert


Please fix the comment or address it.

nsivabalan · 2023-01-18T15:33:29Z

...source/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieSparkSessionExtension.scala

+    HoodieAnalysis.customPreCBORules.foreach { ruleBuilder =>
+      extensions.injectPreCBORule(ruleBuilder(_))
+    }
+    */


Can you address this?

hudi-bot · 2023-01-20T08:38:09Z

CI report:

78a6da0 UNKNOWN
296dadf UNKNOWN
7c0c8a2 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

alexeykudinkin · 2023-01-20T16:28:35Z

CI is green:

https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=14464&view=results

alexeykudinkin force-pushed the ak/fl-idx-lz-lst-fix branch from bceaf7d to 4c28887 December 13, 2022 04:24

alexeykudinkin requested a review from codope December 13, 2022 06:08

alexeykudinkin force-pushed the ak/fl-idx-lz-lst-fix branch from ec53d90 to 5292742 December 13, 2022 17:37

alexeykudinkin requested a review from nsivabalan December 13, 2022 21:13

alexeykudinkin added labels priority:blocker (Production down; release blocker), engine:spark (Spark integration), area:sql (SQL interfaces) Dec 13, 2022

alexeykudinkin changed the title ~~[MINOR] Adding optimization rule to appropriately push down filters into the HoodieFileIndex~~ [HUDI-5384] Adding optimization rule to appropriately push down filters into the HoodieFileIndex Dec 13, 2022

alexeykudinkin requested a review from yihua December 14, 2022 23:54

alexeykudinkin force-pushed the ak/fl-idx-lz-lst-fix branch from 2905580 to 09b901a December 15, 2022 01:01

YuweiXiao approved these changes Dec 16, 2022

alexeykudinkin force-pushed the ak/fl-idx-lz-lst-fix branch 4 times, most recently from 23d58d5 to 42a342a December 21, 2022 07:39

alexeykudinkin changed the title ~~[HUDI-5384] Adding optimization rule to appropriately push down filters into the HoodieFileIndex~~ [HUDI-5384][Stacked on 7528] Adding optimization rule to appropriately push down filters into the HoodieFileIndex Dec 21, 2022

alexeykudinkin assigned nsivabalan Dec 22, 2022

alexeykudinkin commented Jan 17, 2023

nsivabalan reviewed Jan 18, 2023

Alexey Kudinkin added 11 commits January 19, 2023 09:52

Scaffolded HoodiePruneFileSourcePartitions

8833cdd

Wired HoodieFileSouircePartitions into HoodieSparkSessionExtension

ac47261

Missing license

cf76fec

Fixing isHoodieTable predicate to handle the case when table is read from path using Spark DS

47329f2

Apply HoodiePruneFileSourcePartitions only in cases when `HoodieFileIndex` has an empty caches

832bc07

Annotated FileIndexes as not thread-safe

b2829dd

Rebasing HoodieFileIndex to explicitly keep track of predicates it was pruned against

c561121

Tidying up

1250236

Carry-over HoodieFileIndex listing mode override from SQLConf

9fe1d59

Added tests

64f5588

Updated fixtures

aa3f062

Alexey Kudinkin added 16 commits January 19, 2023 09:52

Clear pruned partition-predicates cache

3ed17fe

Revisited test to match plans structurally instead of textually

460d89b

Backport utilities not available in Spark 2.x

691c3af

Inject HoodiePruneFileSourcePartitions as an Optimizer's rule instead of pre-CBO one, since Spark 2.x doesn't support CBO

2e58e52

Missing license

a97297a

Fixing test to be compatible w/ Spark 2.x

ce1dfd2

Fixing compilation

a590c6d

Fixing HoodiePruneFileSourcePartitions to handle recurrence appropriately

1487b81

Fixing compilation (after rebase)

b83c078

Use UnsafeProjection facade to support fallback to Interpreted projection (for troubleshooting)

04cb856

Fixing compilation

1a1b7ac

Disable NestedSchemaPruning temporarily

4ce2eb3

Revisit routine validating whether HoodieFileIndex had already handled seq of pushed-down predicates to properly handle case when there are actually no partition predicates pushed down

535e4b4

Tidying up

176e39c

Fixed HoodieClientTestHarness init seq to make sure Spark session is created properly

fd9e6a2

Make sure HoodieCleintTestHarness sets "spark.testing" appropriately

a6a9337

alexeykudinkin force-pushed the ak/fl-idx-lz-lst-fix branch from cc679d4 to 296dadf January 19, 2023 20:33

Alexey Kudinkin added 3 commits January 19, 2023 12:51

Added test with no partition predicates being pushed down

431e38d

Revisited handling of whether predicates were pushed-down to just be a simple flag

fd2a365

Fixed HoodieBaseRelation to properly propagate its size

7c0c8a2

alexeykudinkin force-pushed the ak/fl-idx-lz-lst-fix branch from 296dadf to 7c0c8a2 January 19, 2023 20:51

alexeykudinkin changed the title ~~[HUDI-5384][Stacked on 7528] Adding optimization rule to appropriately push down filters into the HoodieFileIndex~~ [HUDI-5384] Adding optimization rule to appropriately push down filters into the HoodieFileIndex Jan 19, 2023

nsivabalan approved these changes Jan 19, 2023

alexeykudinkin merged commit b1552ef into apache:master Jan 20, 2023

jonvex mentioned this pull request Jul 23, 2025

[HUDI-9621] Fix divergence between file index and incremental file index #13593

Merged

4 tasks
