Skip to content

[HUDI-8208] Fix partition stats bound when compacting or clustering#12050

Merged
codope merged 9 commits intoapache:masterfrom
codope:hudi-8208-part-stats-comp-clust
Oct 14, 2024
Merged

[HUDI-8208] Fix partition stats bound when compacting or clustering#12050
codope merged 9 commits intoapache:masterfrom
codope:hudi-8208-part-stats-comp-clust

Conversation

@codope
Copy link
Member

@codope codope commented Oct 3, 2024

Change Logs

The [min, max] range in column stats or partition stats can keep widening with udpates or deletes, because we simply take min of all mins' and max of maxs' while merging the stats. This can lead to a degenerative case where all partitions qualify for a predicate based on stats, even though actually very few partitions meet the predicate based on actual data. It defeats the purpose of pruning/skipping using stats. To fix this problem, we need to bring the range to a tighter bound. In order to do so, this PR:

  1. Adds a flag in column stats metadata payload - isTightBound - to indicate whether min/max range is a tighter bound based on latest snapshot or not. It is false by default and set to true during compaction or clustering.
  2. Adds a config to disable calculating tight bounds. Enabled by default for compaction and clustering.
  3. To calculate tight bound, we look at the colstats partition for the uncompacted or unclustered files and then merge the colstats with that of the compacted or clustered files. Most of the changes are in HoodieTableMetadataUtil.

Impact

More effective partition pruning for non-partition key fields.

Risk level (write none, low medium or high below)

medium

Scans unmerged log records during compaction or clustering.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Oct 3, 2024
@codope codope force-pushed the hudi-8208-part-stats-comp-clust branch from 14035d7 to 5a63604 Compare October 4, 2024 12:46
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Oct 4, 2024
@codope codope force-pushed the hudi-8208-part-stats-comp-clust branch from a94bb1f to ed3ce0f Compare October 7, 2024 06:33
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:L PR with lines of changes in (300, 1000] labels Oct 7, 2024
Copy link
Collaborator

@lokeshj1703 lokeshj1703 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @codope! I have one comment inline.

@codope codope force-pushed the hudi-8208-part-stats-comp-clust branch from ed3ce0f to 8efbfcb Compare October 7, 2024 16:37
@codope codope marked this pull request as ready for review October 7, 2024 16:39
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Oct 7, 2024
}
}

public static HoodieDataBlock getDataBlock(HoodieLogBlockType dataBlockType, List<IndexedRecord> records,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method here and below are moved to HoodieCommonTestHarness for better reusability.

@danny0405
Copy link
Contributor

To calculate tight bound, we look at the colstats partition for the uncompacted or unclustered files and then merge the colstats with that of the compacted or clustered files.

Are you saying instead of using the native min_max range for columns in files generated from compaction and clustering, we recompute the column stats ranges from the source files? For example if we have f1 with range [v1, v2] and f2 with range [v3, v4], instead of using [v1, v4] as the compaction file range, we still use the composition of [v1, v2] and [v3, v4] ?

@codope
Copy link
Member Author

codope commented Oct 9, 2024

To calculate tight bound, we look at the colstats partition for the uncompacted or unclustered files and then merge the colstats with that of the compacted or clustered files.

Are you saying instead of using the native min_max range for columns in files generated from compaction and clustering, we recompute the column stats ranges from the source files? For example if we have f1 with range [v1, v2] and f2 with range [v3, v4], instead of using [v1, v4] as the compaction file range, we still use the composition of [v1, v2] and [v3, v4] ?

For the files generated from compaction and clustering, we were already using the native min, max range. But, we ignored the files that were not compacted or clustered from the partition stats update. If, luckily, all the file slices in a partition were compacted or clustered, then the partition stats would have a tight bound even without this patch.

if (shouldScanColStatsForTightBound) {
tableMetadata = HoodieTableMetadata.create(engineContext, dataMetaClient.getStorage(), metadataConfig, dataMetaClient.getBasePath().toString());
} else {
tableMetadata = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move the below logic within this if block only.
trying to avoid doing null assignment to tableMetadata.

@danny0405
Copy link
Contributor

But, we ignored the files that were not compacted or clustered from the partition stats update.

Ok, you are talking about partition stats specifically, but I still feel the partition stats data structure should support idempotent update for the files, if the file is not involved in the compaction/clustering, there is no need to update the stats for it, the clustering and compaction does not change any partition path of the table.

@codope codope force-pushed the hudi-8208-part-stats-comp-clust branch from 8efbfcb to fde8646 Compare October 14, 2024 07:29
.sinceVersion("1.0.0")
.withDocumentation("Parallelism to use, when generating partition stats index.");

public static final ConfigProperty<Boolean> ENABLE_PARTITION_STATS_INDEX_TIGHT_BOUND = ConfigProperty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we even need this config to be exposed to user?
should we keep it internal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can take this up separately. not blocking this patch for now

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope codope merged commit b4e1e5e into apache:master Oct 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants