feat(client): Add pre-write validator framework #18239

Open
nada-attia wants to merge 11 commits into apache:master from nada-attia:nada/oss/prewrite-validator

nada-attia (Contributor) commented Feb 23, 2026

Describe the issue this Pull Request addresses

This PR introduces a pluggable pre-write validation framework that allows custom validators to run before write operations begin.

For example, before allowing a write operation with a schema change, we may need to make a call to a schema service to verify if the schema update is permitted for the given dataset based on certain policies. This validation must happen before the write begins to prevent invalid schema changes from being committed.

Related issue: #18008

Currently, Hudi supports only pre-commit validators, which run after data has been written but before the commit. This PR adds an earlier validation hook at the pre-write stage, so failures can be detected before any write work begins.
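
To make the schema-policy use case above concrete, here is a minimal sketch of a validator that rejects a disallowed schema change before any write work starts. The `PreWriteValidator` shape shown here, the `SchemaPolicyService` type, and every method name are hypothetical stand-ins for illustration; the interface actually added by this PR may look different.

```java
import java.util.Set;

final class SchemaPolicySketch {

  /** Hypothetical stand-in for the validator interface; not the PR's actual signature. */
  interface PreWriteValidator {
    void validate(String datasetName, Set<String> currentFields, Set<String> incomingFields);
  }

  /** Hypothetical policy lookup; a real deployment might call a remote schema service here. */
  interface SchemaPolicyService {
    boolean isSchemaChangeAllowed(String datasetName);
  }

  /** Throws if the incoming field set differs from the current one and the policy forbids changes. */
  static final class SchemaChangeValidator implements PreWriteValidator {
    private final SchemaPolicyService policy;

    SchemaChangeValidator(SchemaPolicyService policy) {
      this.policy = policy;
    }

    @Override
    public void validate(String datasetName, Set<String> currentFields, Set<String> incomingFields) {
      boolean schemaChanged = !currentFields.equals(incomingFields);
      if (schemaChanged && !policy.isSchemaChangeAllowed(datasetName)) {
        throw new IllegalStateException("Schema change not permitted for dataset " + datasetName);
      }
    }
  }

  /** Returns true if the write may proceed, false if the validator rejected it. */
  static boolean wouldAllowWrite(PreWriteValidator validator, String datasetName,
                                 Set<String> currentFields, Set<String> incomingFields) {
    try {
      validator.validate(datasetName, currentFields, incomingFields);
      return true;
    } catch (IllegalStateException e) {
      return false;
    }
  }
}
```

Because the check runs before the write, a rejected schema change leaves no written files to clean up, which is the difference from a pre-commit validator that this section describes.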

Summary and Changelog

Summary: Users can now configure custom validators that run before write operations via the hoodie.prewrite.validators configuration property.

Changelog:

  • Added PreWriteValidator interface for implementing custom pre-write validators
  • Added PreWriteValidatorUtils utility class to load and run configured validators
  • Added HoodiePreWriteValidatorConfig configuration class with hoodie.prewrite.validators property
  • Modified BaseHoodieWriteClient.preWrite() to invoke configured validators
  • Added getPreWriteValidators() method to HoodieWriteConfig
  • Added unit tests for PreWriteValidatorUtils

Impact

  • New public API: PreWriteValidator interface that users can implement for custom validators
  • New configuration: hoodie.prewrite.validators - comma-separated list of validator class names
  • No breaking changes to existing functionality
  • No performance impact when no validators are configured
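
As an illustration of how a comma-separated list of class names could be turned into validator instances, here is a minimal sketch using plain reflection. It is not the PR's `PreWriteValidatorUtils`; the `ValidatorLoaderSketch` class, its `load` method, and the no-argument `PreWriteValidator` interface are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;

final class ValidatorLoaderSketch {

  /** Hypothetical stand-in for the validator interface. */
  interface PreWriteValidator {
    void validate();
  }

  /** Trivial validator used to demonstrate loading by class name. */
  public static final class NoOpValidator implements PreWriteValidator {
    public NoOpValidator() {
    }

    @Override
    public void validate() {
      // Accepts every write.
    }
  }

  /**
   * Splits the configured value on commas, trims whitespace, skips empty entries,
   * and instantiates each class through its no-argument constructor.
   */
  static List<PreWriteValidator> load(String classNames) {
    List<PreWriteValidator> validators = new ArrayList<>();
    if (classNames == null || classNames.trim().isEmpty()) {
      return validators; // nothing configured, so nothing to run
    }
    for (String raw : classNames.split(",")) {
      String name = raw.trim();
      if (name.isEmpty()) {
        continue;
      }
      try {
        Object instance = Class.forName(name).getDeclaredConstructor().newInstance();
        validators.add((PreWriteValidator) instance);
      } catch (ReflectiveOperationException | ClassCastException e) {
        throw new IllegalArgumentException("Cannot load pre-write validator: " + name, e);
      }
    }
    return validators;
  }
}
```

In this sketch an empty or missing value returns an empty list without any reflection, which mirrors the "no performance impact when no validators are configured" point above.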

Risk Level

low - This is an additive feature with no changes to existing write paths when no validators are configured. The feature is opt-in via configuration.

Documentation Update

The following documentation updates are needed:

  • Config description for hoodie.prewrite.validators is included in the code
  • Website documentation for the new pre-write validator feature should be added

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

closes #18245

github-actions bot added the size:L (PR with lines of changes in (300, 1000]) label on Feb 23, 2026
nada-attia force-pushed the nada/oss/prewrite-validator branch from de4c43d to d670d57 on February 23, 2026 15:57
nada-attia marked this pull request as draft on February 23, 2026 16:46
nada-attia marked this pull request as ready for review on February 24, 2026 14:43
nsivabalan (Contributor) commented:

Can you add a valid motivation or use case that necessitates such a pre-write validator to the PR description?

*
* @return true if validation passed, false if validation failed
*/
private static <T> boolean runValidator(PreWriteValidator validator,
Contributor

Can we make these async, similar to how we do it for the pre-commit validator?

private static CompletableFuture<Boolean> runValidatorAsync(SparkPreCommitValidator validator, HoodieWriteMetadata<?> writeMetadata,

Contributor Author

done
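
For readers unfamiliar with the pattern the reviewer refers to, here is a minimal sketch of running validators asynchronously with `CompletableFuture`. It is only loosely modeled on the `runValidatorAsync` signature quoted above; the `PreWriteValidator` interface shape and the `runAll` helper are hypothetical, and the PR's actual implementation may differ.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

final class AsyncValidationSketch {

  /** Hypothetical stand-in for the validator interface. */
  interface PreWriteValidator {
    void validate(String instantTime);
  }

  /** Runs one validator on the common pool; completes with false if it throws. */
  static CompletableFuture<Boolean> runValidatorAsync(PreWriteValidator validator, String instantTime) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        validator.validate(instantTime);
        return true;
      } catch (RuntimeException e) {
        return false;
      }
    });
  }

  /** Submits every validator first, then waits for results; true only if all passed. */
  static boolean runAll(List<PreWriteValidator> validators, String instantTime) {
    List<CompletableFuture<Boolean>> futures = validators.stream()
        .map(v -> runValidatorAsync(v, instantTime))
        .collect(Collectors.toList());
    // join() blocks on each future in order; allMatch stops waiting at the first failure.
    return futures.stream().map(CompletableFuture::join).allMatch(Boolean::booleanValue);
  }
}
```

Collecting the futures into a list before joining matters here: it ensures all validators are submitted before the caller starts waiting on any of them.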

* @param records HoodieData of records to be written, may be null
*/
public void preWrite(String instantTime, WriteOperationType writeOperationType,
HoodieTableMetaClient metaClient, HoodieData<HoodieRecord<T>> records) {
Contributor

If we may not have records for some of the operations, let's make it Option<HoodieData<HoodieRecord<T>>> recordsOpt.

Contributor Author

done
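
The suggestion above can be sketched as follows: the records parameter becomes an explicit optional value instead of a possibly-null reference, so validators must handle the "no records" case deliberately. This sketch uses `java.util.Optional` and `List<String>` as stand-ins for Hudi's own `Option` and `HoodieData<HoodieRecord<T>>` types, and the interface shape and `NonEmptyBatchValidator` class are hypothetical.

```java
import java.util.List;
import java.util.Optional;

final class OptionalRecordsSketch {

  /** Hypothetical stand-in for the validator interface, with records wrapped in an optional. */
  interface PreWriteValidator {
    void validate(String instantTime, Optional<List<String>> recordsOpt);
  }

  /** Rejects writes that supply a record batch with nothing in it. */
  static final class NonEmptyBatchValidator implements PreWriteValidator {
    @Override
    public void validate(String instantTime, Optional<List<String>> recordsOpt) {
      // Operations without incoming records (for example compaction or clustering) have nothing to check.
      if (!recordsOpt.isPresent()) {
        return;
      }
      if (recordsOpt.get().isEmpty()) {
        throw new IllegalArgumentException("Empty record batch for instant " + instantTime);
      }
    }
  }
}
```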

public class HoodiePreWriteValidatorConfig extends HoodieConfig {

public static final ConfigProperty<String> VALIDATOR_CLASS_NAMES = ConfigProperty
.key("hoodie.prewrite.validators")
Contributor

Hey @yihua: it looks like the pre-commit validator config is named hoodie.precommit.validators, which is why Nada named this one hoodie.prewrite.validators.
Can we leave it like this, or do you recommend going with hoodie.write.prewrite.validators?

Introduce a pluggable pre-write validation framework that allows custom
validators to run before write operations begin. This enables validation
of conditions like schema compatibility, permissions, and service
onboarding status before proceeding with writes.

Update the pre-write validator interface to accept an Iterable of
HoodieRecord so validators can inspect incoming data before write
operations. This enables data quality validations at the pre-write stage.

Updated HoodieJavaWriteClient to pass records to preWrite for all
write operations where records are available.
…records

This change updates the pre-write validator interface to accept records
wrapped in HoodieData, enabling validators to inspect incoming data
before the write operation proceeds.

Key changes:
- PreWriteValidator.validate() now accepts HoodieData<HoodieRecord<T>>
  instead of Iterable<HoodieRecord<T>>
- Each client wraps its records with the appropriate HoodieData
  implementation when calling preWrite():
  - HoodieJavaWriteClient uses HoodieListData.eager()
  - SparkRDDWriteClient uses HoodieJavaRDD.of()
- Validators that don't need records can simply ignore the parameter
  (it may be null for operations like compact/cluster)

Add support for pre-write validators in the Flink write client:
- Override preWrite() with records parameter to call validators
- Update upsertPreppedRecords, bulkInsertPreppedRecords, and
  deletePrepped to pass records wrapped in HoodieListData.eager()
- Add test for invalid validator class name handling
- Add test for whitespace handling in validator class names
- Test that validators receive records on insert/upsert operations
- Test that failing validators block write operations
- Test that multiple validators are invoked in sequence

Remove RecordCapturingValidator and CountingValidator test classes.
Use simpler FirstPassingValidator and SecondPassingValidator instead
to verify that multiple validators are invoked during write operations.
nada-attia force-pushed the nada/oss/prewrite-validator branch from c48fe47 to fea570e on March 3, 2026 16:46
github-actions bot added the size:XL (PR with lines of changes > 1000) label and removed the size:L (PR with lines of changes in (300, 1000]) label on Mar 3, 2026
hudi-bot (Collaborator) commented Mar 3, 2026

CI report:


@codecov-commenter

Codecov Report

❌ Patch coverage is 40.25974% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.23%. Comparing base (43d8ed8) to head (89c3bac).
⚠️ Report is 1 commits behind head on master.

Files with missing lines                                 Patch %   Lines
...ache/hudi/client/utils/PreWriteValidatorUtils.java    10.81%    32 Missing and 1 partial ⚠️
...che/hudi/config/HoodiePreWriteValidatorConfig.java    29.41%    12 Missing ⚠️
...pache/hudi/client/validator/PreWriteValidator.java    0.00%     1 Missing ⚠️

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18239      +/-   ##
============================================
- Coverage     57.25%   57.23%   -0.02%     
- Complexity    18601    18603       +2     
============================================
  Files          1948     1951       +3     
  Lines        106690   106751      +61     
  Branches      13196    13198       +2     
============================================
+ Hits          61084    61100      +16     
- Misses        39846    39887      +41     
- Partials       5760     5764       +4     

Flag                     Coverage Δ
hadoop-mr-java-client    45.23% <32.35%> (+0.01%) ⬆️
spark-java-tests         47.40% <32.46%> (-0.01%) ⬇️
spark-scala-tests        45.50% <29.87%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines                                 Coverage Δ
.../org/apache/hudi/client/BaseHoodieWriteClient.java    73.44% <100.00%> (+0.25%) ⬆️
...java/org/apache/hudi/config/HoodieWriteConfig.java    85.03% <100.00%> (+0.01%) ⬆️
.../org/apache/hudi/client/HoodieJavaWriteClient.java    78.16% <100.00%> (ø)
...va/org/apache/hudi/client/SparkRDDWriteClient.java    89.50% <100.00%> (ø)
...pache/hudi/client/validator/PreWriteValidator.java    0.00% <0.00%> (ø)
...che/hudi/config/HoodiePreWriteValidatorConfig.java    29.41% <29.41%> (ø)
...ache/hudi/client/utils/PreWriteValidatorUtils.java    10.81% <10.81%> (ø)

... and 15 files with indirect coverage changes

Labels

size:XL PR with lines of changes > 1000

Development

Successfully merging this pull request may close these issues.

Add pre-write validator framework to allow validation before write operations

4 participants