fix(spark): Add options for archive procedure#18437
Conversation
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for contributing! The approach is clean and the backward-compatible design is appreciated. Two things worth a quick look: the withProps call position means user-supplied options can silently override the named parameters (likely intentional, but worth a comment or doc note), and extractOptions will throw if a token in the options string has no = sign, which is now directly user-facing.
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for addressing the feedback! The extractOptions method now validates input format with helpful error messages, and withProps(conf) has been moved ahead of the named-parameter builders so explicit procedure args take precedence. Both of my prior comments look resolved. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the iteration on this! The extractOptions rewrite is much more robust now — null/empty handling, duplicate-key warning, and the dedicated unit tests in TestHoodieCLIUtils are great additions. The ArchiveCommitsProcedure cleanup also reads more clearly.
However, while tracing the precedence logic I noticed that CI is currently failing for Test Call archive_commits Procedure with options with Expected 2, but got 6 (on Spark 3.3 and 3.4). I dug into the builder chain and believe this is a real semantic bug rather than just a test issue — the named-param defaults end up overriding the user's options for any archival/clean/metadata config key. I left one inline comment explaining the trace; the prior comment from @zhaoyudi-creator about the enable_metadata default looks already addressed (still true in the current source), and @fhan688's named-priority comment is partially addressed but interacts with this bug. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here.
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the iteration! The precedence design (named params > options > defaults) is now well-documented in both ArchiveCommitsProcedure and ArchiveExecutorUtils, the stripping logic via isArgDefined correctly preserves named-parameter precedence, and the new extractOptions parser handles null/empty/duplicate/whitespace edge cases cleanly with dedicated unit tests. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. One small naming nit below — otherwise the change reads cleanly.
cc @yihua
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
CodeRabbit Walkthrough: This PR adds support for passing custom configuration options to the archive_commits procedure via a new options parameter. An enhanced option parser validates and normalizes comma-separated key=value strings, the archive executor is extended to accept and apply configuration properties, and the procedure integrates option parsing with config passing. Comprehensive tests validate parser behavior, parameter precedence, and error handling.
CodeRabbit: hudi-agent#35 (review)
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the iteration on this! The precedence wiring (named params > options > defaults) is well-documented now, the withProps(conf) placement is intentional and explained, and the isArgDefined-based merging in getArchiveConfigs correctly handles the options-only case that previously broke CI. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of small naming and formatting nits below.
cc @yihua
…ent review results.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18437 +/- ##
============================================
+ Coverage 68.54% 68.92% +0.37%
- Complexity 27975 29095 +1120
============================================
Files 2442 2509 +67
Lines 134530 139528 +4998
Branches 16251 17127 +876
============================================
+ Hits 92218 96169 +3951
- Misses 35038 35599 +561
- Partials 7274 7760 +486
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
the doc and code are refined, please help review, thanks! @danny0405 @yihua |
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for addressing the feedback! All three prior nits are resolved: the parameter rename to options is propagated to the withProps call site, the trailing newline is restored, and parseHoodieConfigs now uses a single foldLeft pass with the token parsing extracted into a helper. The new fold preserves the original "last occurrence wins" semantics and still warns on duplicates. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
Describe the issue this Pull Request addresses
The current
archive_commitsstored procedure only supports fixed parameters (min_commits, max_commits, retain_commits, enable_metadata), which cannot meet users' needs for custom Hudi configurations. Users need to pass additional Hudi configuration options (such ashoodie.keep.min.commits,hoodie.keep.max.commits,hoodie.commits.archival.batchetc.) to control archiving behavior.Summary and Changelog
Add support for the
optionsparameter to thearchive_commitsstored procedure, allowing users to pass additional Hudi configuration options in the format key=value,key2=value2.ArchiveExecutorUtils.java: Added a Map<String, String> conf parameter to pass to HoodieWriteConfig
ArchiveCommitsProcedure.scala: Added an optional parameter options (String type), parsed and converted to a Map
TestArchiveCommitsProcedure.scala: Added test cases to verify the options parameter functionality
Usage:
CALL archive_commits(table => 'table', retain_commits => 1,
options => 'hoodie.keep.min.commits = 2, hoodie.keep.max.commits=3')
Impact
Added an optional parameter options, without changing the default behavior of the existing API, enhancing the configurability of stored procedures
Risk Level
low - Only added optional parameters, without affecting existing functionality
Documentation Update
None
Contributor's checklist