feat(client): Add pre-write validator framework #18239
nada-attia wants to merge 11 commits into apache:master
Can you add a valid motivation or use case that necessitates such a pre-write validator to the PR description?
   *
   * @return true if validation passed, false if validation failed
   */
  private static <T> boolean runValidator(PreWriteValidator validator,
Can we make these async, similar to how we do it for the pre-commit validators?
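One way to read this suggestion is sketched below: each validator is started on a background thread and the caller then waits for all results. This is a minimal illustration, not the PR's code. The `Validator` interface, `runValidatorAsync`, and `runValidators` are hypothetical stand-ins, and the real `PreWriteValidator` signature may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class AsyncPreWriteValidation {

  /** Hypothetical stand-in for the PR's PreWriteValidator; the real signature may differ. */
  interface Validator {
    void validate(String instantTime) throws Exception;
  }

  /** Runs one validator off the calling thread; yields true on success, false on failure. */
  static CompletableFuture<Boolean> runValidatorAsync(Validator validator, String instantTime) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        validator.validate(instantTime);
        return true;
      } catch (Exception e) {
        return false;
      }
    });
  }

  /** Starts every validator first, then waits for each result; true only if all passed. */
  static boolean runValidators(List<Validator> validators, String instantTime) {
    List<CompletableFuture<Boolean>> futures = new ArrayList<>();
    for (Validator v : validators) {
      futures.add(runValidatorAsync(v, instantTime));
    }
    boolean allPassed = true;
    for (CompletableFuture<Boolean> future : futures) {
      // Non-short-circuit AND so that every future is joined, even after a failure.
      allPassed &= future.join();
    }
    return allPassed;
  }
}
```

Running validators concurrently mainly helps when individual checks block on remote calls; for cheap local checks a sequential loop is likely simpler.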
   * @param records HoodieData of records to be written, may be null
   */
  public void preWrite(String instantTime, WriteOperationType writeOperationType,
      HoodieTableMetaClient metaClient, HoodieData<HoodieRecord<T>> records) {
If we may not have records for some of the operations, let's make it Option<HoodieData<HoodieRecord<T>>> recordsOpt.
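The effect of this change on a validator can be sketched as follows. To keep the example self-contained it uses `java.util.Optional` and a plain `List` in place of Hudi's `Option` and `HoodieData`; the `Validator` interface and `NonEmptyBatchValidator` are hypothetical names, not part of the PR.

```java
import java.util.List;
import java.util.Optional;

class OptionalRecordsSketch {

  /** Hypothetical validator shape: records are absent for operations such as compact/cluster. */
  interface Validator<R> {
    void validate(String instantTime, Optional<List<R>> recordsOpt);
  }

  /** Rejects an empty batch, and skips the check entirely when no records are supplied. */
  static class NonEmptyBatchValidator<R> implements Validator<R> {
    @Override
    public void validate(String instantTime, Optional<List<R>> recordsOpt) {
      recordsOpt.ifPresent(records -> {
        if (records.isEmpty()) {
          throw new IllegalArgumentException("Empty batch for instant " + instantTime);
        }
      });
    }
  }
}
```

Compared with a nullable parameter, the wrapper makes the "no records" case visible in the signature, so a validator cannot dereference the records without first handling their absence.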
public class HoodiePreWriteValidatorConfig extends HoodieConfig {

  public static final ConfigProperty<String> VALIDATOR_CLASS_NAMES = ConfigProperty
      .key("hoodie.prewrite.validators")
Hey @yihua: it looks like the pre-commit validator config is named hoodie.precommit.validators, and hence Nada has named this one hoodie.prewrite.validators. Can we leave it like this, or do you recommend going with hoodie.write.prewrite.validators?
Introduce a pluggable pre-write validation framework that allows custom validators to run before write operations begin. This enables validation of conditions like schema compatibility, permissions, and service onboarding status before proceeding with writes.
Update the pre-write validator interface to accept an Iterable of HoodieRecord so validators can inspect incoming data before write operations. This enables data quality validations at the pre-write stage. Updated HoodieJavaWriteClient to pass records to preWrite for all write operations where records are available.
…records
This change updates the pre-write validator interface to accept records wrapped in HoodieData, enabling validators to inspect incoming data before the write operation proceeds.
Key changes:
- PreWriteValidator.validate() now accepts HoodieData<HoodieRecord<T>> instead of Iterable<HoodieRecord<T>>
- Each client wraps its records with the appropriate HoodieData implementation when calling preWrite():
  - HoodieJavaWriteClient uses HoodieListData.eager()
  - SparkRDDWriteClient uses HoodieJavaRDD.of()
- Validators that don't need records can simply ignore the parameter (it may be null for operations like compact/cluster)
Add support for pre-write validators in the Flink write client: - Override preWrite() with records parameter to call validators - Update upsertPreppedRecords, bulkInsertPreppedRecords, and deletePrepped to pass records wrapped in HoodieListData.eager()
- Add test for invalid validator class name handling - Add test for whitespace handling in validator class names
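The behaviour these tests target (whitespace around class names is ignored, and an invalid class name is rejected) can be sketched roughly as below. This is an assumption-laden illustration using plain Java reflection; `ValidatorLoaderSketch`, `parseClassNames`, and `loadValidators` are hypothetical names, and the PR's PreWriteValidatorUtils may load classes differently.

```java
import java.util.ArrayList;
import java.util.List;

class ValidatorLoaderSketch {

  /** Splits a comma-separated config value, trimming whitespace and dropping empty entries. */
  static List<String> parseClassNames(String configValue) {
    List<String> names = new ArrayList<>();
    if (configValue == null) {
      return names;
    }
    for (String part : configValue.split(",")) {
      String name = part.trim();
      if (!name.isEmpty()) {
        names.add(name);
      }
    }
    return names;
  }

  /** Instantiates each class via its no-arg constructor; an unknown class name fails fast. */
  static List<Object> loadValidators(String configValue) {
    List<Object> validators = new ArrayList<>();
    for (String name : parseClassNames(configValue)) {
      try {
        validators.add(Class.forName(name).getDeclaredConstructor().newInstance());
      } catch (ReflectiveOperationException e) {
        throw new IllegalArgumentException("Cannot load pre-write validator: " + name, e);
      }
    }
    return validators;
  }
}
```

Failing fast on an unknown class name surfaces a misconfiguration at startup rather than silently skipping a validator the user expected to run.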
- Test that validators receive records on insert/upsert operations - Test that failing validators block write operations - Test that multiple validators are invoked in sequence
Remove RecordCapturingValidator and CountingValidator test classes. Use simpler FirstPassingValidator and SecondPassingValidator instead to verify that multiple validators are invoked during write operations.
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## master #18239 +/- ##
============================================
- Coverage 57.25% 57.23% -0.02%
- Complexity 18601 18603 +2
============================================
Files 1948 1951 +3
Lines 106690 106751 +61
Branches 13196 13198 +2
============================================
+ Hits 61084 61100 +16
- Misses 39846 39887 +41
- Partials 5760 5764 +4
Describe the issue this Pull Request addresses
This PR introduces a pluggable pre-write validation framework that allows custom validators to run before write operations begin.
For example, before allowing a write operation with a schema change, we may need to make a call to a schema service to verify if the schema update is permitted for the given dataset based on certain policies. This validation must happen before the write begins to prevent invalid schema changes from being committed.
Related issue: #18008
Currently, Hudi only supports pre-commit validators that run after data has been written but before commit. This PR adds an earlier validation hook at the pre-write stage, allowing failures to be detected before any write work begins.
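The schema-policy use case described above can be sketched as follows. Everything here is hypothetical and simplified: `PreWriteCheck` stands in for the PR's PreWriteValidator interface, `ApprovedSchemaCheck` replaces the call to an external schema service with an in-memory set, and `writeIfAllowed` only illustrates the ordering (check first, write only if the check passes).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SchemaPolicySketch {

  /** Hypothetical stand-in for the PR's PreWriteValidator interface. */
  interface PreWriteCheck {
    void validate(String tableName, String writerSchema);
  }

  /** Hypothetical policy check; a real one would call an external schema service. */
  static class ApprovedSchemaCheck implements PreWriteCheck {
    private final Set<String> approved;

    ApprovedSchemaCheck(String... approvedKeys) {
      this.approved = new HashSet<>(Arrays.asList(approvedKeys));
    }

    @Override
    public void validate(String tableName, String writerSchema) {
      if (!approved.contains(tableName + ":" + writerSchema)) {
        throw new IllegalStateException("Schema change not permitted for " + tableName);
      }
    }
  }

  /** Runs the check first; the write action runs only if the check does not throw. */
  static boolean writeIfAllowed(PreWriteCheck check, String table, String schema, Runnable write) {
    try {
      check.validate(table, schema);
    } catch (IllegalStateException e) {
      return false; // rejected before any write work started
    }
    write.run();
    return true;
  }
}
```

A pre-commit validator would only reject such a schema change after the data files had already been written, whereas the check here runs before the write action is invoked at all.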
Summary and Changelog
Summary: Users can now configure custom validators that run before write operations via the hoodie.prewrite.validators configuration property.
Changelog:
- PreWriteValidator interface for implementing custom pre-write validators
- PreWriteValidatorUtils utility class to load and run configured validators
- HoodiePreWriteValidatorConfig configuration class with hoodie.prewrite.validators property
- BaseHoodieWriteClient.preWrite() to invoke configured validators
- getPreWriteValidators() method to HoodieWriteConfig
- PreWriteValidatorUtils
Impact
- PreWriteValidator interface that users can implement for custom validators
- hoodie.prewrite.validators - comma-separated list of validator class names
Risk Level
low - This is an additive feature with no changes to existing write paths when no validators are configured. The feature is opt-in via configuration.
Documentation Update
The following documentation updates are needed:
- hoodie.prewrite.validators is included in the code
Contributor's checklist
closes #18245