
fix(spark): Add options for archive procedure #18437

Open
fhan688 wants to merge 14 commits into apache:master from fhan688:add-options-for-archive-procedure


@fhan688
Contributor

@fhan688 fhan688 commented Apr 1, 2026

Describe the issue this Pull Request addresses

The current archive_commits stored procedure supports only a fixed set of parameters (min_commits, max_commits, retain_commits, enable_metadata), which does not cover users' needs for custom Hudi configurations. Users need to pass additional Hudi configuration options (such as hoodie.keep.min.commits, hoodie.keep.max.commits, hoodie.commits.archival.batch, etc.) to control archiving behavior.

Summary and Changelog

Add support for the options parameter to the archive_commits stored procedure, allowing users to pass additional Hudi configuration options in the format key=value,key2=value2.

ArchiveExecutorUtils.java: Added a Map<String, String> conf parameter to pass to HoodieWriteConfig

ArchiveCommitsProcedure.scala: Added an optional parameter options (String type), parsed and converted to a Map

TestArchiveCommitsProcedure.scala: Added test cases to verify the options parameter functionality

Usage:

CALL archive_commits(table => 'table', retain_commits => 1,
  options => 'hoodie.keep.min.commits = 2, hoodie.keep.max.commits=3')

Impact

Adds an optional options parameter that makes the stored procedure more configurable without changing the default behavior of the existing API.

Risk Level

low - Only added optional parameters, without affecting existing functionality

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Apr 1, 2026
@danny0405 danny0405 self-assigned this Apr 2, 2026
Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The approach is clean and the backward-compatible design is appreciated. Two things worth a quick look: the withProps call position means user-supplied options can silently override the named parameters (likely intentional, but worth a comment or doc note), and extractOptions will throw if a token in the options string has no = sign, which is now directly user-facing.

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for addressing the feedback! The extractOptions method now validates input format with helpful error messages, and withProps(conf) has been moved ahead of the named-parameter builders so explicit procedure args take precedence. Both of my prior comments look resolved. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:S PR with lines of changes in (10, 100] labels May 19, 2026
Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the iteration on this! The extractOptions rewrite is much more robust now — null/empty handling, duplicate-key warning, and the dedicated unit tests in TestHoodieCLIUtils are great additions. The ArchiveCommitsProcedure cleanup also reads more clearly.

However, while tracing the precedence logic I noticed that CI is currently failing for the test "Test Call archive_commits Procedure with options" with "Expected 2, but got 6" (on Spark 3.3 and 3.4). I dug into the builder chain and believe this is a real semantic bug rather than just a test issue: the named-param defaults end up overriding the user's options for any archival/clean/metadata config key. I left one inline comment explaining the trace. The prior comment from @zhaoyudi-creator about the enable_metadata default looks already addressed (still true in the current source), and @fhan688's named-priority comment is partially addressed but interacts with this bug. Please take a look; after that, this should be ready for a Hudi committer or PMC member to take it from here.

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the iteration! The precedence design (named params > options > defaults) is now well-documented in both ArchiveCommitsProcedure and ArchiveExecutorUtils, the stripping logic via isArgDefined correctly preserves named-parameter precedence, and the new extractOptions parser handles null/empty/duplicate/whitespace edge cases cleanly with dedicated unit tests. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. One small naming nit below — otherwise the change reads cleanly.

cc @yihua
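
The precedence discussed in this thread (named params > options > defaults) can be illustrated with a small standalone sketch. It is not the PR's code: `ArchiveConfigResolver`, `resolve`, the placeholder default values, and the mapping of a named argument to `hoodie.keep.min.commits` are all assumptions made for illustration. The point it tries to show is that a named parameter should enter the merge only when the caller actually supplied it. If unset named parameters were merged in with their default values, they would overwrite the user's options, which matches the failure mode described in the earlier review.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not Hudi code: builds one effective config map
// with the precedence named params > options > defaults.
final class ArchiveConfigResolver {
    static final String MIN_COMMITS = "hoodie.keep.min.commits";
    static final String MAX_COMMITS = "hoodie.keep.max.commits";

    // explicitNamedArgs must contain ONLY the named arguments the caller passed,
    // already translated to config keys (an assumption of this sketch).
    static Map<String, String> resolve(Map<String, String> defaults,
                                       Map<String, String> options,
                                       Map<String, String> explicitNamedArgs) {
        Map<String, String> effective = new LinkedHashMap<>(defaults);
        effective.putAll(options);            // options override defaults
        effective.putAll(explicitNamedArgs);  // explicit named args override options
        return effective;
    }

    public static void main(String[] args) {
        // Placeholder defaults chosen for the example only.
        Map<String, String> defaults = Map.of(MIN_COMMITS, "20", MAX_COMMITS, "30");
        Map<String, String> options = Map.of(MIN_COMMITS, "2", MAX_COMMITS, "3");

        // Options-only call: no named argument was passed, so options beat defaults.
        System.out.println(resolve(defaults, options, Map.of()));

        // A named argument was passed explicitly, so it beats the same key in options.
        System.out.println(resolve(defaults, options, Map.of(MIN_COMMITS, "5")));
    }
}
```

Keeping the "was this argument supplied?" decision outside the merge keeps the merge itself a plain three-layer overlay, which is easy to test in isolation.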

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

CodeRabbit Walkthrough: This PR adds support for passing custom configuration options to the archive_commits procedure via a new options parameter. An enhanced option parser validates and normalizes comma-separated key=value strings, the archive executor is extended to accept and apply configuration properties, and the procedure integrates option parsing with config passing. Comprehensive tests validate parser behavior, parameter precedence, and error handling.

CodeRabbit: hudi-agent#35 (review)

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the iteration on this! The precedence wiring (named params > options > defaults) is well-documented now, the withProps(conf) placement is intentional and explained, and the isArgDefined-based merging in getArchiveConfigs correctly handles the options-only case that previously broke CI. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of small naming and formatting nits below.

cc @yihua

@codecov-commenter

Codecov Report

❌ Patch coverage is 87.50000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.92%. Comparing base (3e1d300) to head (94b0cab).
⚠️ Report is 178 commits behind head on master.

Files with missing lines | Patch % | Lines
...rc/main/scala/org/apache/hudi/HoodieCLIUtils.scala | 84.61% | 3 Missing and 1 partial ⚠️
...i/command/procedures/ArchiveCommitsProcedure.scala | 90.90% | 2 Missing and 2 partials ⚠️
...n/java/org/apache/hudi/cli/commands/SparkMain.java | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18437      +/-   ##
============================================
+ Coverage     68.54%   68.92%   +0.37%     
- Complexity    27975    29095    +1120     
============================================
  Files          2442     2509      +67     
  Lines        134530   139528    +4998     
  Branches      16251    17127     +876     
============================================
+ Hits          92218    96169    +3951     
- Misses        35038    35599     +561     
- Partials       7274     7760     +486     
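
As a sanity check on the table above, the project coverage figures can be recomputed from the Hits and Lines rows. The sketch below assumes coverage is simply hits divided by total lines, and that total lines equal hits plus misses plus partials. `CoverageCheck` and `coveragePercent` are names invented for this example. The results should agree with the reported 68.54% and 68.92% to within rounding.

```java
// Hypothetical helper for re-deriving the percentages in the coverage diff.
final class CoverageCheck {

    // Coverage as a percentage: covered lines (hits) over total tracked lines.
    static double coveragePercent(long hits, long lines) {
        if (lines <= 0) {
            throw new IllegalArgumentException("lines must be positive");
        }
        return 100.0 * hits / lines;
    }

    public static void main(String[] args) {
        // Values copied from the master and #18437 columns of the table.
        long baseHits = 92218, baseMisses = 35038, basePartials = 7274;
        long headHits = 96169, headMisses = 35599, headPartials = 7760;

        // Total lines are the sum of the three buckets.
        long baseLines = baseHits + baseMisses + basePartials;
        long headLines = headHits + headMisses + headPartials;

        System.out.printf("base: %d lines, %.4f%%%n", baseLines, coveragePercent(baseHits, baseLines));
        System.out.printf("head: %d lines, %.4f%%%n", headLines, coveragePercent(headHits, headLines));
    }
}
```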
Flag | Coverage Δ
common-and-other-modules | 44.40% <0.00%> (+0.03%) ⬆️
hadoop-mr-java-client | 44.87% <ø> (-0.12%) ⬇️
spark-client-hadoop-common | 48.24% <ø> (-0.16%) ⬇️
spark-java-tests | 49.35% <30.98%> (+0.55%) ⬆️
spark-scala-tests | 45.29% <80.28%> (-0.37%) ⬇️
utilities | 37.43% <0.00%> (-0.94%) ⬇️

Files with missing lines | Coverage Δ
...java/org/apache/hudi/cli/ArchiveExecutorUtils.java | 84.21% <100.00%> (+0.87%) ⬆️
...n/java/org/apache/hudi/cli/commands/SparkMain.java | 0.00% <0.00%> (ø)
...rc/main/scala/org/apache/hudi/HoodieCLIUtils.scala | 81.42% <84.61%> (+0.17%) ⬆️
...i/command/procedures/ArchiveCommitsProcedure.scala | 92.53% <90.90%> (-4.34%) ⬇️

... and 383 files with indirect coverage changes

@hudi-bot
Collaborator

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@fhan688
Contributor Author

fhan688 commented May 22, 2026

The doc and code have been refined. Please help review, thanks! @danny0405 @yihua

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for addressing the feedback! All three prior nits are resolved: the parameter rename to options is propagated to the withProps call site, the trailing newline is restored, and parseHoodieConfigs now uses a single foldLeft pass with the token parsing extracted into a helper. The new fold preserves the original "last occurrence wins" semantics and still warns on duplicates. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua
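
For readers who want a concrete picture of the parsing behaviour described across these reviews (null/empty handling, whitespace tolerance, an error on tokens without `=`, and "last occurrence wins" with a duplicate warning), here is a minimal single-pass sketch. It is written in Java rather than the PR's Scala, and it is not the actual `extractOptions` or `parseHoodieConfigs` implementation. `OptionsParser`, `parse`, the error message text, and the choice to skip empty tokens are assumptions of this example. Values that themselves contain commas are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for strings such as "key=value,key2=value2".
final class OptionsParser {

    static Map<String, String> parse(String options) {
        Map<String, String> result = new LinkedHashMap<>();
        // Null or blank input yields an empty map instead of an error.
        if (options == null || options.trim().isEmpty()) {
            return result;
        }
        // Single pass over the comma-separated tokens.
        for (String token : options.split(",")) {
            String trimmed = token.trim();
            if (trimmed.isEmpty()) {
                continue; // assumption: stray commas are ignored
            }
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                // Covers both a missing '=' and an empty key.
                throw new IllegalArgumentException(
                    "Invalid option '" + trimmed + "', expected key=value");
            }
            String key = trimmed.substring(0, eq).trim();
            String value = trimmed.substring(eq + 1).trim();
            // put() returns the previous value, so duplicates are detected
            // while the last occurrence still wins.
            String previous = result.put(key, value);
            if (previous != null) {
                System.err.println("WARN: duplicate option '" + key + "', using last value");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The options string from the usage example in the PR description.
        System.out.println(parse("hoodie.keep.min.commits = 2, hoodie.keep.max.commits=3"));
    }
}
```

Splitting on the first `=` only means a value may itself contain `=`, which seems a reasonable default for config values.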


Labels

size:L PR with lines of changes in (300, 1000]

7 participants