Skip to content

perf: Add dedicated batch size config for LSM timeline migration on u…#19052

Merged
danny0405 merged 2 commits into
apache:masterfrom
cshuo:ISSUE-18410
Jun 24, 2026
Merged

perf: Add dedicated batch size config for LSM timeline migration on u…#19052
danny0405 merged 2 commits into
apache:masterfrom
cshuo:ISSUE-18410

Conversation

@cshuo

@cshuo cshuo commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

…pgrade

Describe the issue this Pull Request addresses

This pr ports and finalizes #18411.

When a Hudi table is upgraded from table version 7 to 8, the legacy archived timeline is migrated into the LSM timeline in SevenToEightUpgradeHandler.upgradeToLSMTimeline(). Previously this migration reused the regular archival batch size (hoodie.commits.archival.batch, default 10) and ran compactAndClean() after every batch. Each write() involves several remote-storage operations (exists check, parquet write, manifest update), so for tables with hundreds of archived actions this produced excessive I/O and significantly inflated the one-time migration time.

This PR makes the migration batch size independently configurable with a larger default and removes the per-batch compaction during migration, addressing HUDI-18410.

Summary and Changelog

  • Introduce config hoodie.migration.commits.archival.batch in HoodieArchivalConfig (default 500, advanced), with a withMigrationCommitsArchivalBatchSize(int) builder method and a getMigrationCommitArchivalBatchSize() accessor on HoodieWriteConfig.
  • SevenToEightUpgradeHandler.upgradeToLSMTimeline() now reads the new migration batch size instead of getCommitArchivalBatchSize(), so migration batching is decoupled from regular archival batching.
  • Drop the lsmTimelineWriter.compactAndClean(engineContext) calls (both per-batch and final-batch) from the migration loop.
  • TestSevenToEightUpgradeHandler: add tests for the config default/override and for migration behavior — batching follows the migration batch size and compactAndClean is never invoked (via mockStatic/mockConstruction).
  • TestFlinkWriteClients: add a test asserting a raw hoodie.migration.commits.archival.batch set on the Flink Configuration propagates through FlinkWriteClients.getHoodieClientConfig() to HoodieWriteConfig, independent of the regular archival batch size.

Impact

  • Functional impact: Faster v7→v8 LSM timeline migration for tables with large archived timelines, due to fewer/larger batches and no per-batch compaction. Default migration batch size changes from 10 (shared) to 500 (dedicated). No change to normal read/write paths or to regular archival behavior.
  • Maintainability: Migration batching is now explicit and separated from the general archival config, removing the previously overloaded reuse of getCommitArchivalBatchSize().
  • Extensibility: The dedicated config lets operators tune migration throughput per environment without affecting steady-state archival.

Risk Level

low — Changes are confined to the one-time v7→v8 upgrade path and a new, defaulted config; no public API changes. Verified with new unit tests in TestSevenToEightUpgradeHandler (batching count and absence of compactAndClean) and TestFlinkWriteClients (Flink config propagation); both classes pass (16 and 21 tests respectively).

Documentation Update

New advanced config hoodie.migration.commits.archival.batch (default 500) is self-documented via withDocumentation and will surface in the generated configuration reference. No separate docs page change required.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@cshuo cshuo requested a review from danny0405 June 23, 2026 11:50

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this. The PR adds a dedicated v7-to-v8 LSM timeline migration batch size and drops the per-batch compactAndClean call. The config wiring and Flink propagation look consistent, and deferring compaction to the next archival cycle is reasonable since write keeps the manifest valid after each batch. One memory edge case is worth a look in the inline comment before a maintainer takes it forward. A couple of minor durability nits — one magic-number default in a test assertion and one hardcoded default in a comment — both of which could silently mislead if the underlying default ever changes.

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Jun 24, 2026
@github-actions github-actions Bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Jun 24, 2026
@hudi-bot

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the updates — prior feedback is addressed (config key renamed, engineContext dropped, in-loop batching test added, nits fixed). One clarifying note inline.

@danny0405 danny0405 merged commit cf314f1 into apache:master Jun 24, 2026
70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants