feat: Add configurable cleaner policy for metadata table #17935
nsivabalan merged 6 commits into apache:master
.withCleanerPolicy(metadataTableCleaningPolicy);

if (HoodieCleaningPolicy.KEEP_LATEST_COMMITS.equals(dataTableCleaningPolicy)) {
if (HoodieCleaningPolicy.KEEP_LATEST_COMMITS.equals(metadataTableCleaningPolicy)) {
I am not sure the 1.2x multiplier is needed anymore.
Previously, there was no way to configure the cleaner for the MDT, so we derived it from the data table's clean policy and added the 20% buffer.
But if we are adding explicit configurations, we don't need the buffer.
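For readers following the thread, the 20% buffer being discussed can be sketched roughly as below. This is only an illustration, not the actual Hudi code: the class name, the method name, and the round-up rule are all assumptions made for the example.

```java
// Hypothetical sketch of the "1.2x buffer" from the comment above.
// Not Hudi's implementation; the round-up behavior is an assumption.
final class MdtRetentionBufferSketch {
    // 20% buffer expressed as a percentage so the math stays in integers.
    static final int BUFFER_PERCENT = 120;

    /** Scales the data table's retention value by 1.2, rounding up. */
    static int withBuffer(int dataTableRetention) {
        return (dataTableRetention * BUFFER_PERCENT + 99) / 100;
    }

    public static void main(String[] args) {
        // e.g. a data table that retains 10 commits
        System.out.println(withBuffer(10));
        // e.g. a data table that retains 24 hours
        System.out.println(withBuffer(24));
    }
}
```

With an explicitly configured metadata table policy, a derived value like this would no longer be needed, which is the point the comment raises.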
.defaultValue(KEEP_LATEST_FILE_VERSIONS.name())
.markAdvanced()
.sinceVersion("1.2.0")
.withDocumentation("This config determines the cleaner policy for metadata table.");
Can we also add a boolean config property named hoodie.metadata.clean.follow.data.table.policy and enable it by default?
Out of the box, let's fall back to the old behavior with the 1.2x multiplier.
Advanced users will then have the choice to override this and use hoodie.metadata.clean.policy as needed.
I am worried that some data tables could have a 7-day cleaner retention, and switching to the file-versions-based policy without deriving the value from the data table might have an impact.
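A minimal sketch of the fallback proposed here, assuming a boolean flag that defaults to following the data table. The enum, class, and method names are hypothetical stand-ins for illustration and are not taken from the PR.

```java
// Hypothetical sketch of the proposed fallback; names are illustrative only.
final class MdtPolicyFallbackSketch {
    // Local stand-in for the cleaning policies mentioned in the discussion.
    enum Policy { KEEP_LATEST_COMMITS, KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_BY_HOURS }

    /**
     * When followDataTable is true (the proposed default), the metadata table
     * reuses the data table's policy; otherwise the explicit override applies.
     */
    static Policy resolve(boolean followDataTable, Policy dataTablePolicy, Policy explicitMdtPolicy) {
        return followDataTable ? dataTablePolicy : explicitMdtPolicy;
    }

    public static void main(String[] args) {
        // Default behavior: follow the data table.
        System.out.println(resolve(true, Policy.KEEP_LATEST_COMMITS, Policy.KEEP_LATEST_FILE_VERSIONS));
        // Advanced users: opt out and use the explicit metadata table policy.
        System.out.println(resolve(false, Policy.KEEP_LATEST_COMMITS, Policy.KEEP_LATEST_FILE_VERSIONS));
    }
}
```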
Oh, here we are only adding the policy override; the values required by these policies are still derived from the data table.
But aren't we assuming here that the data table and the metadata table will use the same clean policy?
What if users configure a different policy for the data table and the MDT?
Maybe, if the policy does not align between the data table and the MDT, we could throw an exception.
And also add a configuration for the multiplier, leaving the default as 1.2x.
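One way this suggestion could look, purely as a sketch: reject mismatched policies and take the multiplier as a parameter with 1.2 as its default. The class, method, and exception choice are assumptions for illustration; the PR does not confirm any of them.

```java
// Hypothetical sketch of the validation idea; not part of the PR.
final class MdtPolicyValidationSketch {
    // Default multiplier mentioned in the comment above.
    static final double DEFAULT_MULTIPLIER = 1.2;

    /**
     * Derives the metadata table retention from the data table's value,
     * failing fast if the two tables are configured with different policies.
     */
    static int deriveRetention(String dataTablePolicy, String mdtPolicy,
                               int dataTableRetention, double multiplier) {
        if (!dataTablePolicy.equals(mdtPolicy)) {
            throw new IllegalArgumentException(
                "Clean policy mismatch: data table uses " + dataTablePolicy
                    + " but metadata table uses " + mdtPolicy);
        }
        // Round up so the metadata table never retains less than the scaled value.
        return (int) Math.ceil(dataTableRetention * multiplier);
    }

    public static void main(String[] args) {
        System.out.println(deriveRetention(
            "KEEP_LATEST_COMMITS", "KEEP_LATEST_COMMITS", 10, DEFAULT_MULTIPLIER));
    }
}
```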
I feel adding more configs will unnecessarily complicate the code. Let us just maintain one config for the metadata cleaner policy; we can decide which one it should be.
return this;
}

public HoodieMetadataConfig.Builder withCleanerPolicy(HoodieCleaningPolicy policy) {
We need to add tests. Check out TestHoodieMetadataConfig.
7631ef2 to e427f02
final long maxLogFileSizeBytes = writeConfig.getMetadataConfig().getMaxLogFileSize();
// Borrow the cleaner policy from the main table and adjust the cleaner policy based on the main table's cleaner policy
boolean useMainTableCleanPolicy = writeConfig.getMetadataConfig().useMainTableCleanPolicy();
deriveFromDatatableCleanPolicy
.withDocumentation("Controls the criteria to log compacted files groups in metadata table.");

public static final ConfigProperty<Boolean> USE_MAIN_TABLE_CLEAN_POLICY = ConfigProperty
.key(METADATA_PREFIX + ".use.main.table.clean.policy")
METADATA_PREFIX + ".derive.datatable.clean.policy"
.defaultValue(true)
.markAdvanced()
.sinceVersion("1.2.0")
.withDocumentation("This config determines whether the cleaner policy should use main table's cleaner policy.");
Fix the wording: main -> data.
return getIntOrDefault(RECORD_PREPARATION_PARALLELISM);
}

public boolean useMainTableCleanPolicy() {
Fix the naming once the config key and property are fixed.
return this;
}

public HoodieMetadataConfig.Builder withMainTableCleanPolicy(boolean useMainTableCleanPolicy) {
LGTM. Mostly minor comments.
e427f02 to 24e82ff
Describe the issue this Pull Request addresses
The metadata table currently inherits its cleaning policy from the data table, which may not always be optimal for metadata table operations. This PR introduces a dedicated, configurable cleaning policy for the metadata table: it can either use the main table's cleaning policy or use KEEP_LATEST_FILE_VERSIONS, which is the recommended strategy for the MOR table type.
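As a rough illustration of the two options described above, the sketch below reads the flag from a plain java.util.Properties object and picks a policy name. The config key comes from this PR; the class, the helper methods, and the use of Properties and string policy names are assumptions made for the example and do not reflect how HoodieMetadataConfig actually resolves it.

```java
// Hypothetical sketch; only the config key is taken from this PR.
final class MdtFlagReadSketch {
    static final String KEY = "hoodie.metadata.use.main.table.clean.policy";

    /** Reads the flag, treating a missing value as true (the stated default). */
    static boolean useMainTableCleanPolicy(java.util.Properties props) {
        return Boolean.parseBoolean(props.getProperty(KEY, "true"));
    }

    /** Inherits the data table's policy, or falls back to KEEP_LATEST_FILE_VERSIONS. */
    static String metadataCleanPolicy(java.util.Properties props, String dataTablePolicy) {
        return useMainTableCleanPolicy(props) ? dataTablePolicy : "KEEP_LATEST_FILE_VERSIONS";
    }

    public static void main(String[] args) {
        java.util.Properties props = new java.util.Properties();
        // Default: the metadata table follows the data table.
        System.out.println(metadataCleanPolicy(props, "KEEP_LATEST_COMMITS"));
        // Opt out: the metadata table uses KEEP_LATEST_FILE_VERSIONS.
        props.setProperty(KEY, "false");
        System.out.println(metadataCleanPolicy(props, "KEEP_LATEST_COMMITS"));
    }
}
```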
Summary and Changelog
This PR adds support for configuring the metadata table's cleaning policy independently from the data table. Users can now set hoodie.metadata.clean.policy to control how the metadata table performs cleaning operations.
Changes:
hoodie.metadata.use.main.table.clean.policy in HoodieMetadataConfig, with default value true
HoodieMetadataWriteUtils.createMetadataWriteConfig() to use either the KEEP_LATEST_FILE_VERSIONS strategy for cleaning or inherit the clean policy from the data table.
Impact
Users can now independently configure metadata table cleaning behavior. This creates a way for the metadata table to use KEEP_LATEST_FILE_VERSIONS, which ensures efficient file management regardless of the data table's cleaning strategy.
Risk Level
low - This change is backward compatible. The hoodie.metadata.use.main.table.clean.policy config makes sure the original logic is not overridden.
Documentation Update
New config:
hoodie.metadata.use.main.table.clean.policy (advanced): Determines the cleaner policy for metadata table. Default: true.
Contributor's checklist