perf: Add dedicated batch size config for LSM timeline migration on u…#19052
hudi-agent
left a comment
Thanks for working on this. The PR adds a dedicated v7-to-v8 LSM timeline migration batch size and drops the per-batch compactAndClean call. The config wiring and Flink propagation look consistent, and deferring compaction to the next archival cycle is reasonable since write() keeps the manifest valid after each batch. One memory edge case is worth a look in the inline comment before a maintainer takes it forward. There are also a couple of minor durability nits, one magic-number default in a test assertion and one hardcoded default in a comment, both of which could silently mislead if the underlying default ever changes.
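To illustrate the magic-number nit, here is a minimal standalone sketch, not code from this PR. The class name, the DEFAULT_BATCH_SIZE constant, and the migrationBatchSize helper are hypothetical stand-ins for the real config plumbing; only the config key is taken from the PR. The point is that asserting against the shared default constant keeps the test meaningful if the default ever changes, whereas a literal 500 would not.

```java
import java.util.Properties;

// Hypothetical sketch: stands in for the real Hudi config classes.
class MigrationBatchConfigSketch {
    // Config key as named in the PR description.
    static final String KEY = "hoodie.migration.commits.archival.batch";
    // Assumed single source of truth for the default value.
    static final int DEFAULT_BATCH_SIZE = 500;

    // Reads the migration batch size, falling back to the shared default.
    static int migrationBatchSize(Properties props) {
        return Integer.parseInt(props.getProperty(KEY, String.valueOf(DEFAULT_BATCH_SIZE)));
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Brittle form: check(migrationBatchSize(props) == 500, ...)
        // Durable form: compare against the constant the config itself uses.
        check(migrationBatchSize(props) == DEFAULT_BATCH_SIZE, "default should apply");

        // An explicit override takes precedence over the default.
        props.setProperty(KEY, "50");
        check(migrationBatchSize(props) == 50, "override should apply");
        System.out.println("ok");
    }
}
```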
hudi-agent
left a comment
Thanks for the updates — prior feedback is addressed (config key renamed, engineContext dropped, in-loop batching test added, nits fixed). One clarifying note inline.
…pgrade
Describe the issue this Pull Request addresses
This PR ports and finalizes #18411.
When a Hudi table is upgraded from table version 7 to 8, the legacy archived timeline is migrated into the LSM timeline in SevenToEightUpgradeHandler.upgradeToLSMTimeline(). Previously this migration reused the regular archival batch size (hoodie.commits.archival.batch, default 10) and ran compactAndClean() after every batch. Each write() involves several remote-storage operations (exists check, Parquet write, manifest update), so for tables with hundreds of archived actions this produced excessive I/O and significantly inflated the one-time migration time.
This PR makes the migration batch size independently configurable with a larger default and removes the per-batch compaction during migration, addressing HUDI-18410.
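As a rough illustration of why the batch size matters here, the sketch below counts how many batches (and therefore how many write() calls) a migration would issue for a given number of archived actions. It is a standalone sketch, not the Hudi implementation: the class name, the helper methods, and the figure of 1200 archived actions are all hypothetical, and it assumes exactly one write() per batch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: models batching only, with no Hudi dependencies.
class MigrationBatchSketch {

    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    // Assuming one write per batch, the write count equals the batch count,
    // i.e. ceil(actionCount / batchSize).
    static int writeCalls(int actionCount, int batchSize) {
        List<Integer> actions = new ArrayList<>();
        for (int i = 0; i < actionCount; i++) {
            actions.add(i);
        }
        return partition(actions, batchSize).size();
    }

    public static void main(String[] args) {
        int archivedActions = 1200; // hypothetical table size
        System.out.println("batch size 10:  " + writeCalls(archivedActions, 10) + " writes");
        System.out.println("batch size 500: " + writeCalls(archivedActions, 500) + " writes");
    }
}
```

For 1200 archived actions this gives 120 writes at a batch size of 10 and 3 writes at a batch size of 500. Since each write carries several remote-storage operations, the total I/O shrinks roughly in proportion.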
Summary and Changelog
- Add hoodie.migration.commits.archival.batch in HoodieArchivalConfig (default 500, advanced), with a withMigrationCommitsArchivalBatchSize(int) builder method and a getMigrationCommitArchivalBatchSize() accessor on HoodieWriteConfig.
- SevenToEightUpgradeHandler.upgradeToLSMTimeline() now reads the new migration batch size instead of getCommitArchivalBatchSize(), so migration batching is decoupled from regular archival batching.
- Remove the lsmTimelineWriter.compactAndClean(engineContext) calls (both per-batch and final-batch) from the migration loop.
- TestSevenToEightUpgradeHandler: add tests for the config default/override and for migration behavior: batching follows the migration batch size and compactAndClean is never invoked (via mockStatic/mockConstruction).
- TestFlinkWriteClients: add a test asserting that a raw hoodie.migration.commits.archival.batch set on the Flink Configuration propagates through FlinkWriteClients.getHoodieClientConfig() to HoodieWriteConfig, independent of the regular archival batch size.
Impact
getCommitArchivalBatchSize().
Risk Level
Low. Changes are confined to the one-time v7→v8 upgrade path and a new, defaulted config; no public API changes. Verified with new unit tests in TestSevenToEightUpgradeHandler (batching count and absence of compactAndClean) and TestFlinkWriteClients (Flink config propagation); both classes pass (16 and 21 tests respectively).
Documentation Update
New advanced config hoodie.migration.commits.archival.batch (default 500) is self-documented via withDocumentation and will surface in the generated configuration reference. No separate docs page change required.
Contributor's checklist