[HUDI-8208] Fix partition stats bound when compacting or clustering by codope · Pull Request #12050 · apache/hudi

codope · 2024-10-03T16:23:59Z

Change Logs

The [min, max] range in column stats or partition stats can keep widening with updates or deletes, because we simply take the min of all mins and the max of all maxes while merging the stats. This can lead to a degenerate case where all partitions qualify for a predicate based on stats, even though very few partitions actually meet the predicate based on the data. That defeats the purpose of pruning/skipping using stats. To fix this problem, we need to bring the range to a tighter bound. To do so, this PR:

Adds a flag in the column stats metadata payload - isTightBound - to indicate whether the min/max range is a tight bound based on the latest snapshot. It is false by default and set to true during compaction or clustering.
Adds a config to disable calculating tight bounds. The calculation is enabled by default for compaction and clustering.
To calculate the tight bound, we look at the colstats partition for the uncompacted or unclustered files and then merge those colstats with the colstats of the compacted or clustered files. Most of the changes are in HoodieTableMetadataUtil.
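The widening described above can be illustrated with a minimal sketch. The class `ColumnRange` and its method `mergeLoose` are hypothetical names made up for this illustration and are not Hudi's actual payload classes; only the `isTightBound` flag name comes from this PR.

```java
// Hypothetical illustration only; not Hudi's actual stats payload.
final class ColumnRange {
  final long min;
  final long max;
  final boolean isTightBound;

  ColumnRange(long min, long max, boolean isTightBound) {
    this.min = min;
    this.max = max;
    this.isTightBound = isTightBound;
  }

  // Incremental merge: min of mins and max of maxes.
  // The result can only stay the same or widen, so it is never marked tight.
  static ColumnRange mergeLoose(ColumnRange a, ColumnRange b) {
    return new ColumnRange(Math.min(a.min, b.min), Math.max(a.max, b.max), false);
  }

  // True if a predicate "col == value" could match, judging by stats alone.
  boolean mayContain(long value) {
    return value >= min && value <= max;
  }
}
```

Suppose a base file has range [10, 20] and an update rewrites every record into [90, 100]. The loose merge yields [10, 100], so a predicate such as `col == 50` still selects the partition although no live record matches. After compaction, the stats of the compacted file alone would give [90, 100], which can be flagged as a tight bound.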

Impact

More effective partition pruning for non-partition key fields.

Risk level (write none, low medium or high below)

medium

Scans unmerged log records during compaction or clustering.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

lokeshj1703

Thanks for working on this @codope! I have one comment inline.

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

codope · 2024-10-07T16:56:39Z

hudi-hadoop-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java

    }
  }

-  public static HoodieDataBlock getDataBlock(HoodieLogBlockType dataBlockType, List<IndexedRecord> records,


This method and the ones below were moved to HoodieCommonTestHarness for better reusability.

...ce/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala

danny0405 · 2024-10-09T00:17:28Z

To calculate tight bound, we look at the colstats partition for the uncompacted or unclustered files and then merge the colstats with that of the compacted or clustered files.

Are you saying instead of using the native min_max range for columns in files generated from compaction and clustering, we recompute the column stats ranges from the source files? For example if we have f1 with range [v1, v2] and f2 with range [v3, v4], instead of using [v1, v4] as the compaction file range, we still use the composition of [v1, v2] and [v3, v4] ?

codope · 2024-10-09T06:05:25Z

To calculate tight bound, we look at the colstats partition for the uncompacted or unclustered files and then merge the colstats with that of the compacted or clustered files.

Are you saying instead of using the native min_max range for columns in files generated from compaction and clustering, we recompute the column stats ranges from the source files? For example if we have f1 with range [v1, v2] and f2 with range [v3, v4], instead of using [v1, v4] as the compaction file range, we still use the composition of [v1, v2] and [v3, v4] ?

For the files generated from compaction and clustering, we were already using the native min, max range. But, we ignored the files that were not compacted or clustered from the partition stats update. If, luckily, all the file slices in a partition were compacted or clustered, then the partition stats would have a tight bound even without this patch.
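As a rough sketch of the point above, the partition-level range has to combine the stats of the newly compacted or clustered files with the stats of the file slices that were not touched. The class `PartitionStatsSketch`, the method `tightBound`, and the map-based representation are assumptions made for illustration; the actual logic lives in HoodieTableMetadataUtil and reads from the colstats partition.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the code in HoodieTableMetadataUtil.
final class PartitionStatsSketch {

  // Returns {min, max} over the latest files of one partition.
  // existingFileStats: per-file {min, max} already recorded for the partition.
  // replacedFileIds:   files consumed by the compaction or clustering.
  // newFileStats:      per-file {min, max} of the files it produced.
  // Assumes at least one file remains after the replacement.
  static long[] tightBound(Map<String, long[]> existingFileStats,
                           Set<String> replacedFileIds,
                           Map<String, long[]> newFileStats) {
    Map<String, long[]> latest = new HashMap<>(existingFileStats);
    latest.keySet().removeAll(replacedFileIds); // drop stale ranges
    latest.putAll(newFileStats);                // add ranges of the new files

    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long[] range : latest.values()) {
      min = Math.min(min, range[0]);
      max = Math.max(max, range[1]);
    }
    return new long[] {min, max};
  }
}
```

If only the new files were considered, the untouched files would be missing from the partition range; if the old partition range were merged in as-is, the stale bounds of the replaced files would survive. Combining the new files with the untouched ones avoids both.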

nsivabalan · 2024-10-08T23:08:06Z

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

+      if (shouldScanColStatsForTightBound) {
+        tableMetadata = HoodieTableMetadata.create(engineContext, dataMetaClient.getStorage(), metadataConfig, dataMetaClient.getBasePath().toString());
+      } else {
+        tableMetadata = null;


Can we move the logic below into this if block?
I'm trying to avoid the null assignment to tableMetadata.

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

danny0405 · 2024-10-10T00:28:56Z

But, we ignored the files that were not compacted or clustered from the partition stats update.

OK, you are talking about partition stats specifically. I still feel the partition stats data structure should support idempotent updates for files: if a file is not involved in the compaction/clustering, there is no need to update the stats for it, since clustering and compaction do not change any partition path of the table.

nsivabalan · 2024-10-14T17:29:05Z

hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java

      .sinceVersion("1.0.0")
      .withDocumentation("Parallelism to use, when generating partition stats index.");

+  public static final ConfigProperty<Boolean> ENABLE_PARTITION_STATS_INDEX_TIGHT_BOUND = ConfigProperty


Do we even need to expose this config to users?
Should we keep it internal?

We can take this up separately; it is not blocking this patch for now.

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

hudi-bot · 2024-10-14T20:33:39Z

CI report:

4e9f362 UNKNOWN
481af6d Azure: SUCCESS

github-actions bot added the size:M PR with lines of changes in (100, 300] label Oct 3, 2024

codope force-pushed the hudi-8208-part-stats-comp-clust branch from 14035d7 to 5a63604 Compare October 4, 2024 12:46

github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Oct 4, 2024

codope force-pushed the hudi-8208-part-stats-comp-clust branch from a94bb1f to ed3ce0f Compare October 7, 2024 06:33

github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:L PR with lines of changes in (300, 1000] labels Oct 7, 2024

lokeshj1703 reviewed Oct 7, 2024

codope force-pushed the hudi-8208-part-stats-comp-clust branch from ed3ce0f to 8efbfcb Compare October 7, 2024 16:37

codope marked this pull request as ready for review October 7, 2024 16:39

github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Oct 7, 2024

codope commented Oct 7, 2024

nsivabalan reviewed Oct 8, 2024

nsivabalan reviewed Oct 9, 2024

codope added 7 commits October 14, 2024 10:07

[HUDI-8208] Fix partition stats bound when compacting or clustering

641364c

build partition stats for log files and fix tests

bcd59c2

fix checkstyle

8a15ddf

Add isTightBound flag in stats metadata payload

445333a

fix another flink test for stats schema

4aa6bb8

Add more tests

219962c

use merged file slices from fsview and reorder test

fde8646

codope force-pushed the hudi-8208-part-stats-comp-clust branch from 8efbfcb to fde8646 Compare October 14, 2024 07:29

test fixes

44b4795

nsivabalan reviewed Oct 14, 2024

Adding more java docs

481af6d

nsivabalan approved these changes Oct 14, 2024

codope merged commit b4e1e5e into apache:master Oct 14, 2024
