feat: Add configurable cleaner policy for metadata table #17935

Merged
nsivabalan merged 6 commits into apache:master from suryaprasanna:surya-dev-04
Jan 28, 2026

Conversation

@suryaprasanna (Contributor) commented Jan 18, 2026

Describe the issue this Pull Request addresses

The metadata table currently inherits its cleaning policy from the data table, which is not always optimal for metadata table operations. This PR introduces a dedicated, configurable cleaning policy for the metadata table: it can either inherit the data table's cleaning policy or use KEEP_LATEST_FILE_VERSIONS, which is the recommended strategy for the MOR table type.

Summary and Changelog

This PR adds support for configuring the metadata table's cleaning policy independently of the data table. Users can now set hoodie.metadata.use.main.table.clean.policy to control whether the metadata table inherits the data table's cleaning policy.

Changes:

  • Added a new config, hoodie.metadata.use.main.table.clean.policy, in HoodieMetadataConfig with a default value of true.
  • Modified HoodieMetadataWriteUtils.createMetadataWriteConfig() to either use the KEEP_LATEST_FILE_VERSIONS strategy for cleaning or inherit the clean policy from the data table.
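
The selection described in the second bullet can be sketched as follows. This is a minimal, self-contained illustration and not the code in HoodieMetadataWriteUtils: the nested Policy enum is a stand-in for HoodieCleaningPolicy, and the class and method names are hypothetical.

```java
// Illustrative sketch only; names below are hypothetical, not Hudi APIs.
final class MetadataCleanPolicySelector {
  // Stand-in for Hudi's HoodieCleaningPolicy enum.
  enum Policy { KEEP_LATEST_COMMITS, KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_BY_HOURS }

  private MetadataCleanPolicySelector() {}

  // When the flag is true, the metadata table inherits the data table's policy;
  // otherwise it falls back to KEEP_LATEST_FILE_VERSIONS.
  static Policy select(boolean useMainTableCleanPolicy, Policy dataTablePolicy) {
    return useMainTableCleanPolicy ? dataTablePolicy : Policy.KEEP_LATEST_FILE_VERSIONS;
  }

  public static void main(String[] args) {
    System.out.println(select(true, Policy.KEEP_LATEST_COMMITS));
    System.out.println(select(false, Policy.KEEP_LATEST_COMMITS));
  }
}
```

With the flag left at its default of true, the data table's policy passes through unchanged, which is what keeps the change backward compatible.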

Impact

Users can now configure metadata table cleaning behavior independently. The metadata table can use KEEP_LATEST_FILE_VERSIONS regardless of the data table's cleaning strategy, which ensures efficient file management.

Risk Level

Low. This change is backward compatible: hoodie.metadata.use.main.table.clean.policy defaults to true, so the original logic is not overridden.

Documentation Update

New config:

  • hoodie.metadata.use.main.table.clean.policy (advanced): Determines whether the metadata table uses the data table's cleaner policy. Default: true.
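
As a rough illustration of opting out of the default, the key can be set like any other string property. The helper class below is hypothetical and only shows the key and its documented default; how the properties are handed to a Hudi writer is not covered here.

```java
import java.util.Properties;

// Illustrative sketch only; this helper is hypothetical, not part of Hudi.
final class MetadataCleanConfigExample {
  // Config key as named in this PR description.
  static final String KEY = "hoodie.metadata.use.main.table.clean.policy";

  private MetadataCleanConfigExample() {}

  // Reads the flag, falling back to the documented default of true.
  static boolean useMainTableCleanPolicy(Properties props) {
    return Boolean.parseBoolean(props.getProperty(KEY, "true"));
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(useMainTableCleanPolicy(props)); // key unset: default applies

    props.setProperty(KEY, "false"); // opt out of inheriting the data table policy
    System.out.println(useMainTableCleanPolicy(props));
  }
}
```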

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Jan 18, 2026
.withCleanerPolicy(metadataTableCleaningPolicy);

- if (HoodieCleaningPolicy.KEEP_LATEST_COMMITS.equals(dataTableCleaningPolicy)) {
+ if (HoodieCleaningPolicy.KEEP_LATEST_COMMITS.equals(metadataTableCleaningPolicy)) {
Contributor

I am not sure the 1.2x multiplier is needed anymore.
Previously, there was no way to configure the cleaner for the MDT, so we derived it from the data table's clean policy and added the 20% buffer.
But if we are adding explicit configurations, we don't need the buffer.
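
For readers unfamiliar with the buffer being discussed, it amounts to scaling the data table's retention by 1.2. The sketch below is only an arithmetic illustration: rounding up and the method name are assumptions, and the actual expression in HoodieMetadataWriteUtils may differ.

```java
// Illustrative sketch only; rounding behavior and names are assumptions.
final class RetentionBufferSketch {
  private RetentionBufferSketch() {}

  // Applies a 1.2x (20%) buffer to the data table's retained-commit count.
  // Integer arithmetic computes ceil(n * 6 / 5) without floating-point error.
  static int bufferedCommitsRetained(int dataTableCommitsRetained) {
    return (dataTableCommitsRetained * 6 + 4) / 5;
  }

  public static void main(String[] args) {
    // A data table retaining 10 commits gives the metadata table 12.
    System.out.println(bufferedCommitsRetained(10));
  }
}
```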

.defaultValue(KEEP_LATEST_FILE_VERSIONS.name())
.markAdvanced()
.sinceVersion("1.2.0")
.withDocumentation("This config determines the cleaner policy for metadata table.");
Contributor

Can we also add a boolean config property named hoodie.metadata.clean.follow.data.table.policy and enable it by default?

Out of the box, let's fall back to the old behavior with the 1.2x multiplier.

Advanced users will still have the choice to override this and use hoodie.metadata.clean.policy as needed.
I am worried that a data table could have a 7-day cleaner retention, and switching to the file-versions-based policy without deriving the value from the data table might have an impact.

Contributor

Oh, here we are only adding the policy override. The values required for these policies are still derived from the data table.
But aren't we assuming here that both the data table and the metadata table use the same clean policy?

What if users configure a different policy for the data table and the MDT?

Contributor

Maybe, if the policy does not align between the data table and the MDT, we could throw an exception.
Also, add a configuration for the multiplier and leave the default as 1.2x.
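
The suggestion above could look roughly like the following. Everything here is hypothetical: the class, the method, the choice of IllegalArgumentException, and the rounding are illustrative assumptions, and nothing in this thread indicates that the merged change includes such a check.

```java
// Illustrative sketch of the reviewer's suggestion; all names are hypothetical.
final class CleanPolicyAlignmentSketch {
  // Stand-in for Hudi's HoodieCleaningPolicy enum.
  enum Policy { KEEP_LATEST_COMMITS, KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_BY_HOURS }

  // Default buffer proposed in the review: 1.2x.
  static final double DEFAULT_MULTIPLIER = 1.2;

  private CleanPolicyAlignmentSketch() {}

  // Derives the metadata table's retained-commit count from the data table's value,
  // rejecting configurations where the two policies differ.
  static int deriveCommitsRetained(Policy dataPolicy, Policy mdtPolicy,
                                   int dataCommitsRetained, double multiplier) {
    if (dataPolicy != mdtPolicy) {
      throw new IllegalArgumentException(
          "Clean policy mismatch: data table uses " + dataPolicy + ", metadata table uses " + mdtPolicy);
    }
    if (multiplier < 1.0) {
      throw new IllegalArgumentException("Multiplier must be at least 1.0, got " + multiplier);
    }
    return (int) Math.ceil(dataCommitsRetained * multiplier);
  }

  public static void main(String[] args) {
    System.out.println(deriveCommitsRetained(
        Policy.KEEP_LATEST_COMMITS, Policy.KEEP_LATEST_COMMITS, 10, DEFAULT_MULTIPLIER));
  }
}
```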

Contributor Author

I feel adding more configs will unnecessarily complicate the code. Let us maintain just one config for the metadata cleaner policy; we can decide which one it should be.

return this;
}

public HoodieMetadataConfig.Builder withCleanerPolicy(HoodieCleaningPolicy policy) {
Contributor

We need to add tests.
Check out TestHoodieMetadataConfig.
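
A test along these lines might check both the default and the override. The sketch below uses a tiny hypothetical stand-in config class with a builder, rather than HoodieMetadataConfig or TestHoodieMetadataConfig, whose actual APIs and test framework are not shown in this thread; it models the boolean flag from the PR description, and the default of true is taken from there.

```java
// Illustrative sketch only; Config and Builder are stand-ins, not Hudi classes.
final class MetadataConfigDefaultsSketch {
  static final class Config {
    private final boolean useMainTableCleanPolicy;

    private Config(boolean useMainTableCleanPolicy) {
      this.useMainTableCleanPolicy = useMainTableCleanPolicy;
    }

    boolean useMainTableCleanPolicy() {
      return useMainTableCleanPolicy;
    }

    static Builder newBuilder() {
      return new Builder();
    }
  }

  static final class Builder {
    // Default of true, as stated in the PR description.
    private boolean useMainTableCleanPolicy = true;

    Builder withMainTableCleanPolicy(boolean value) {
      this.useMainTableCleanPolicy = value;
      return this;
    }

    Config build() {
      return new Config(useMainTableCleanPolicy);
    }
  }

  private MetadataConfigDefaultsSketch() {}

  // Minimal assertion helper so the sketch runs without a test framework.
  static void check(boolean condition, String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  public static void main(String[] args) {
    check(Config.newBuilder().build().useMainTableCleanPolicy(),
        "flag should default to true");
    check(!Config.newBuilder().withMainTableCleanPolicy(false).build().useMainTableCleanPolicy(),
        "builder override should be honored");
    System.out.println("all checks passed");
  }
}
```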

@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Jan 26, 2026

final long maxLogFileSizeBytes = writeConfig.getMetadataConfig().getMaxLogFileSize();
// Borrow the cleaner policy from the main table and adjust the cleaner policy based on the main table's cleaner policy
boolean useMainTableCleanPolicy = writeConfig.getMetadataConfig().useMainTableCleanPolicy();
Contributor

deriveFromDatatableCleanPolicy

.withDocumentation("Controls the criteria to log compacted files groups in metadata table.");

public static final ConfigProperty<Boolean> USE_MAIN_TABLE_CLEAN_POLICY = ConfigProperty
.key(METADATA_PREFIX + ".use.main.table.clean.policy")
Contributor

METADATA_PREFIX + ".derive.datatable.clean.policy"

.defaultValue(true)
.markAdvanced()
.sinceVersion("1.2.0")
.withDocumentation("This config determines whether the cleaner policy should use main table's cleaner policy.");
Contributor

fix the wording. main -> data

return getIntOrDefault(RECORD_PREPARATION_PARALLELISM);
}

public boolean useMainTableCleanPolicy() {
Contributor

Fix the naming once the config key and property are fixed.

return this;
}

public HoodieMetadataConfig.Builder withMainTableCleanPolicy(boolean useMainTableCleanPolicy) {
Contributor

same here

@nsivabalan
Contributor

LGTM. mostly minor comments

@nsivabalan nsivabalan merged commit 1ed2821 into apache:master Jan 28, 2026
72 checks passed
PavithranRick pushed a commit to PavithranRick/hudi that referenced this pull request Feb 2, 2026
alexr17 pushed a commit to alexr17/hudi that referenced this pull request Feb 5, 2026
prashantwason pushed a commit to prashantwason/incubator-hudi that referenced this pull request Feb 23, 2026