
feat(common): add core pre-commit validation framework - Phase 1 #18068

Open
shangxinli wants to merge 4 commits into apache:master from shangxinli:precommit-validation-phase1

Conversation


@shangxinli shangxinli commented Jan 31, 2026

Describe the issue this Pull Request addresses

This PR implements Phase 1 of a pre-commit validation framework that enables validators to access commit metadata, timeline, and write statistics. This addresses the need for data quality validation before commits are finalized, particularly for detecting data loss in streaming ingestion scenarios.

Closes #18067

Summary and Changelog

Summary:
Adds an engine-agnostic pre-commit validation framework in hudi-common that provides validators with access to commit metadata, write statistics, and timeline information. This enables validators to perform sophisticated data quality checks, such as comparing streaming source offsets with actual record counts to detect data loss.

Changelog:

  • Added BasePreCommitValidator abstract class as foundation for all validators
  • Added ValidationContext interface to provide metadata access across engines
  • Added StreamingOffsetValidator base class for streaming offset validation with configurable tolerance and warn-only mode
  • Added CheckpointUtils utility for parsing and comparing multi-format streaming checkpoints (DeltaStreamer Kafka, Flink Kafka, Pulsar, Kinesis)
  • Added comprehensive unit tests: TestCheckpointUtils (14 test cases) and TestStreamingOffsetValidator (9 test cases)
  • All code is new, no existing code was copied

Configuration properties introduced:

  • hoodie.precommit.validators.streaming.offset.tolerance.percentage (default: 0.0)
  • hoodie.precommit.validators.warn.only (default: false)
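For illustration, a minimal sketch of supplying these properties from Java; the keys are the ones introduced by this PR, while the values and the wiring into a writer are placeholders until the engine integrations land in Phase 2/3:

import java.util.Properties;

// Wiring sketch only: validator registration is engine-specific and arrives in
// Phase 2/3; only the property keys below come from this PR.
public class ValidatorConfigSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Allow up to 1.5% deviation between the source offset advance and records written.
    props.setProperty("hoodie.precommit.validators.streaming.offset.tolerance.percentage", "1.5");
    // Log validation failures instead of failing the commit.
    props.setProperty("hoodie.precommit.validators.warn.only", "true");
  }
}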

This is Phase 1 of a 3-phase implementation:

  • Phase 1 (this PR): Core framework in hudi-common
  • Phase 2 (future): Flink-specific implementation
  • Phase 3 (future): Spark/DeltaStreamer implementation

Impact

Public API Changes:

  • New public classes in org.apache.hudi.client.validator package:
    • BasePreCommitValidator (abstract class)
    • ValidationContext (interface)
    • StreamingOffsetValidator (abstract class)
  • New utility class: org.apache.hudi.common.util.CheckpointUtils
  • New configuration properties for validator configuration

User-Facing Changes:
None in Phase 1. This provides the framework foundation; actual validator implementations will be added in Phase 2 (Flink) and Phase 3 (Spark/DeltaStreamer).

Performance Impact:
None. The framework is passive until validators are configured and implemented in future phases.

Risk Level

Risk Level: low

Justification:

  • Phase 1 adds only foundational framework code in hudi-common with no active usage
  • No existing code paths are modified
  • No integration with write paths yet (comes in Phase 2/3)
  • Comprehensive unit test coverage (23 test cases total)
  • All tests pass with 0 checkstyle violations

Verification:

  • Built successfully with Maven: mvn clean install -pl hudi-common -am -DskipTests
  • Checkstyle validation: 0 violations
  • Unit tests: All 23 tests pass
  • No breaking changes to existing functionality

Documentation Update

Documentation needed for future phases:

When Phase 2 (Flink) and Phase 3 (Spark/DeltaStreamer) are implemented, the following documentation will be added:

  • Configuration guide for the new validator properties
  • Examples of how to enable and configure validators
  • Explanation of tolerance percentage and warn-only mode

For Phase 1:
No user-facing documentation needed as the framework is not yet exposed to users. Code-level documentation is complete with comprehensive Javadocs in all classes.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable
  • Commits are signed and follow conventions
  • All existing tests pass (checkstyle: 0 violations, build: success)

Implement engine-agnostic pre-commit validation framework in hudi-common
that enables validators to access commit metadata, timeline, and write
statistics across all write engines (Spark, Flink, Java).

This is Phase 1 of a 3-phase implementation:
- Phase 1 (this commit): Core framework in hudi-common
- Phase 2: Flink-specific implementation
- Phase 3: Spark/DeltaStreamer implementation

Key components added:

1. BasePreCommitValidator
   - Abstract base class for all validators
   - Supports metadata-based validation
   - Engine-agnostic design

2. ValidationContext (interface)
   - Provides access to commit metadata, timeline, write stats
   - Engine-specific implementations provide concrete access
   - Abstracts engine details from validation logic

3. StreamingOffsetValidator
   - Base class for streaming offset validators
   - Compares source offset differences with record counts
   - Configurable tolerance and warn-only mode
   - Supports multiple checkpoint formats (Kafka, Flink, Pulsar, Kinesis)

4. CheckpointUtils
   - Multi-format checkpoint parsing utility
   - Supports DeltaStreamer Kafka format (Phase 1)
   - Extensible for Flink, Pulsar, Kinesis (future phases)
   - Offset difference calculation with edge case handling

5. Comprehensive unit tests
   - TestCheckpointUtils: 14 test cases
   - TestStreamingOffsetValidator: 9 test cases

Configuration:
- hoodie.precommit.validators.streaming.offset.tolerance.percentage (default: 0.0)
- hoodie.precommit.validators.warn.only (default: false)
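To make the offset-versus-record-count comparison concrete, here is a small, self-contained sketch of the check this framework enables. Every class and method name below is illustrative rather than the PR's API; the deviation formula mirrors the calculateDeviation snippet quoted in the review thread below.

import java.util.Map;

// Illustrative sketch, not the PR's classes: compares how far the streaming source
// offsets advanced against how many records were actually written.
public final class OffsetCheckSketch {

  /** Sum of per-partition offset advances between two parsed checkpoints. */
  static long offsetAdvance(Map<Integer, Long> previous, Map<Integer, Long> current) {
    long diff = 0;
    for (Map.Entry<Integer, Long> e : current.entrySet()) {
      diff += e.getValue() - previous.getOrDefault(e.getKey(), 0L);
    }
    return diff;
  }

  /** Percentage deviation between offsets consumed and records written. */
  static double deviation(long offsetDiff, long recordsWritten) {
    if (offsetDiff == 0 && recordsWritten == 0) {
      return 0.0;   // nothing consumed, nothing written
    }
    if (offsetDiff == 0 || recordsWritten == 0) {
      return 100.0; // one side is zero: complete mismatch
    }
    return 100.0 * Math.abs(offsetDiff - recordsWritten) / offsetDiff;
  }

  /** A commit passes validation if the deviation stays within the configured tolerance. */
  static boolean passes(long offsetDiff, long recordsWritten, double tolerancePercentage) {
    return deviation(offsetDiff, recordsWritten) <= tolerancePercentage;
  }
}

For example, an offset advance of 1,000 with 950 records written gives a 5.0% deviation, which fails at the default tolerance of 0.0 but passes with a tolerance of 5.0 or higher.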
@github-actions github-actions bot added the size:XL label (PR with lines of changes > 1000) Jan 31, 2026
@shangxinli shangxinli changed the title from Issue#18067 Phase 1: Core pre-commit validation framework to feat: add core pre-commit validation framework - Phase 1 (#18067) Jan 31, 2026
@shangxinli shangxinli changed the title from feat: add core pre-commit validation framework - Phase 1 (#18067) to feat(common): add core pre-commit validation framework - Phase 1 (#18067) Jan 31, 2026
@voonhous voonhous changed the title from feat(common): add core pre-commit validation framework - Phase 1 (#18067) to feat(common): add core pre-commit validation framework - Phase 1 Feb 1, 2026
Contributor

@yihua yihua left a comment

Thanks for contributing this! This PR adds a well-structured generalized pre-commit validation framework with test coverage for the checkpoint parsing logic. I left a few comments for clarification.

* Phase 3: Spark-specific implementations in hudi-client/hudi-spark-client
*/
public abstract class BasePreCommitValidator {

Contributor

There's already a validator framework under org.apache.hudi.client.validator in hudi-client/hudi-spark-client (e.g. SparkPreCommitValidator), with HoodiePreCommitValidatorConfig.VALIDATOR_CLASS_NAMES to specify the validator implementation classes to use. Have you considered how this new BasePreCommitValidator will integrate with the existing SparkPreCommitValidator and SparkValidatorUtils.runValidators() in Phase 3?

Contributor Author

Added Javadoc explaining the plan: in Phase 3, SparkPreCommitValidator will be refactored to extend BasePreCommitValidator, and SparkValidatorUtils.runValidators() will be updated to invoke validateWithMetadata() for validators extending this class. Existing VALIDATOR_CLASS_NAMES config will continue to work.

throw new IllegalArgumentException(
"Invalid checkpoint format. Expected: topic,partition:offset,... Got: " + checkpointStr);
}

Contributor

The splits.length < 1 check is unreachable; String.split() always returns at least one element. Did you mean splits.length < 2 to validate that partition data is present (consistent with parseDeltaStreamerKafkaCheckpoint)?

Contributor Author

Fixed to splits.length < 2. Also added a test case for topic-only input.
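A sketch of the corrected check, assuming the extractTopicName shape implied by the quoted snippet (the surrounding code in the PR may differ):

// Sketch: a lone topic with no partition:offset pairs is rejected, which matches
// parseDeltaStreamerKafkaCheckpoint's expectations for the same format.
private static String extractTopicName(String checkpointStr) {
  String[] splits = checkpointStr.split(",");
  if (splits.length < 2) {
    throw new IllegalArgumentException(
        "Invalid checkpoint format. Expected: topic,partition:offset,... Got: " + checkpointStr);
  }
  return splits[0];
}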

public interface ValidationContext {

/**
* Get the current commit instant time being validated.
Contributor

This is a fairly large interface with 11 methods. Some of these (like getTotalInsertRecordsWritten, getTotalUpdateRecordsWritten) can be derived from getWriteStats(). Have you considered keeping the interface minimal (metadata + timeline + stats access) and providing the computed methods as defaults for common logic across engines?

Contributor Author

Good call. Slimmed the interface down to 6 core abstract methods (getInstantTime, getCommitMetadata, getWriteStats, getActiveTimeline, getPreviousCommitInstant, getPreviousCommitMetadata). The 5 computed methods are now default implementations derived from the core methods.
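A rough sketch of the slimmed-down shape, assuming Hudi's existing HoodieCommitMetadata, HoodieWriteStat, HoodieActiveTimeline, HoodieInstant, and Option types; the return types and default-method bodies here are assumptions, not the PR's exact code:

import java.util.List;

import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.model.HoodieWriteStat;
import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.util.Option;

// Sketch of the six-core-plus-defaults shape described above; not the PR's exact code.
public interface ValidationContextSketch {

  // Core abstract accessors (the six named in the reply).
  String getInstantTime();

  HoodieCommitMetadata getCommitMetadata();

  List<HoodieWriteStat> getWriteStats();

  HoodieActiveTimeline getActiveTimeline();

  Option<HoodieInstant> getPreviousCommitInstant();

  Option<HoodieCommitMetadata> getPreviousCommitMetadata();

  // Computed helpers become defaults derived from getWriteStats() instead of
  // separate abstract methods (two of the five are shown here).
  default long getTotalInsertRecordsWritten() {
    return getWriteStats().stream().mapToLong(HoodieWriteStat::getNumInserts).sum();
  }

  default long getTotalUpdateRecordsWritten() {
    return getWriteStats().stream().mapToLong(HoodieWriteStat::getNumUpdateWrites).sum();
  }
}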

* Phase 2: Flink-specific implementations in hudi-flink-datasource
* Phase 3: Spark-specific implementations in hudi-client/hudi-spark-client
*/
public abstract class BasePreCommitValidator {
Contributor

Is this new validator abstraction intended to be public and user-facing? If so, could you mark the class with @PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING) and the methods with @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)?

Contributor Author

Done. Added @PublicAPIClass(maturity = EVOLVING) on BasePreCommitValidator and ValidationContext, and @PublicAPIMethod(maturity = EVOLVING) on public/protected methods.
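Roughly along these lines, assuming Hudi's @PublicAPIClass/@PublicAPIMethod/ApiMaturityLevel annotations; the method shown is a placeholder, not the PR's exact signature:

import org.apache.hudi.ApiMaturityLevel;
import org.apache.hudi.PublicAPIClass;
import org.apache.hudi.PublicAPIMethod;

// Sketch of the annotated surface; the validate method signature is a placeholder.
@PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
public abstract class BasePreCommitValidator {

  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
  public abstract void validateWithMetadata(ValidationContext context);
}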

Comment on lines 55 to 57
protected boolean supportsMetadataValidation() {
return false;
}
Contributor

What's the reason for having this? Is exposing validate or validateWithMetadata not good enough?

Contributor Author

Agreed, it's unnecessary indirection. Removed supportsMetadataValidation() entirely — validateWithMetadata is sufficient.

Comment on lines +57 to +58
protected static final String TOLERANCE_PERCENTAGE_KEY = "hoodie.precommit.validators.streaming.offset.tolerance.percentage";
protected static final String WARN_ONLY_MODE_KEY = "hoodie.precommit.validators.warn.only";
Contributor

Could you define the configs in HoodiePreCommitValidatorConfig so that they are surfaced in the configs documentation on the Hudi website during the release process?

Contributor Author

Done. Added STREAMING_OFFSET_TOLERANCE_PERCENTAGE and WARN_ONLY_MODE as ConfigProperty entries in HoodiePreCommitValidatorConfig so they get surfaced to the website docs.
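The entries would look roughly like this inside HoodiePreCommitValidatorConfig, using Hudi's ConfigProperty builder; the key names and defaults come from this PR, while the documentation strings and the exact builder chain are placeholders:

import org.apache.hudi.common.config.ConfigProperty;

// Sketch of what the additions to HoodiePreCommitValidatorConfig might look like.
public class HoodiePreCommitValidatorConfigSketch {

  public static final ConfigProperty<Double> STREAMING_OFFSET_TOLERANCE_PERCENTAGE = ConfigProperty
      .key("hoodie.precommit.validators.streaming.offset.tolerance.percentage")
      .defaultValue(0.0)
      .withDocumentation("Allowed percentage deviation between the streaming source offset "
          + "advance and the records written before offset validation fails.");

  public static final ConfigProperty<Boolean> WARN_ONLY_MODE = ConfigProperty
      .key("hoodie.precommit.validators.warn.only")
      .defaultValue(false)
      .withDocumentation("When true, pre-commit validation failures are logged as warnings "
          + "instead of failing the commit.");
}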

Comment on lines 178 to 179
LOG.info("Offset validation passed. Offset diff: {}, Records: {}, Deviation: {:.2f}% (within {}%)",
offsetDiff, recordsWritten, deviation, tolerancePercentage);
Contributor

{:.2f} does not work with SLF4J. Could you double check the format here?

Contributor Author

Good catch. Fixed — using String.format("%.2f", deviation) with SLF4J {} placeholder.
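For reference, the fixed pattern is roughly the following (the surrounding message text may differ from the PR):

// SLF4J placeholders do not support printf-style {:.2f}; pre-format the number and pass it to {}.
LOG.info("Offset validation passed. Offset diff: {}, Records: {}, Deviation: {}% (within {}%)",
    offsetDiff, recordsWritten, String.format("%.2f", deviation), tolerancePercentage);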

Comment on lines +195 to +206
private double calculateDeviation(long offsetDiff, long recordsWritten) {
// Handle edge cases
if (offsetDiff == 0 && recordsWritten == 0) {
return 0.0; // Both zero - perfect match (no data processed)
}
if (offsetDiff == 0 || recordsWritten == 0) {
return 100.0; // One is zero - complete mismatch
}

long difference = Math.abs(offsetDiff - recordsWritten);
return (100.0 * difference) / offsetDiff;
}
Contributor

Is this for the append-only case? For the upsert case with dedup and/or event-time ordering, the deviation could be legitimate.

Contributor Author

Good point. Added a note in the class Javadoc clarifying this is primarily for append-only ingestion. For upsert workloads with dedup or event-time ordering, users should configure a higher tolerance or use warn-only mode.

* @return Map of partition → offset
* @throws IllegalArgumentException if format is invalid
*/
private static Map<Integer, Long> parseDeltaStreamerKafkaCheckpoint(String checkpointStr) {
Contributor

There is already KafkaOffsetGen.CheckpointUtils#strToOffsets to parse the Kafka offsets. Should that be removed and consolidated so it reuses this one?

Contributor Author

Added a Javadoc note about the duplication. Cannot consolidate directly right now since hudi-common cannot depend on Kafka client types (TopicPartition) from hudi-utilities. This method returns Map<Integer, Long> to avoid that dependency. Noted for future refactoring.
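A sketch of what that constraint implies for the parsing, kept free of Kafka client types (illustrative; details in the PR may differ):

// Needs java.util.HashMap and java.util.Map. Parses
// "topic,partition:offset,partition:offset,..." into partition -> offset.
private static Map<Integer, Long> parseDeltaStreamerKafkaCheckpoint(String checkpointStr) {
  String[] splits = checkpointStr.split(",");
  if (splits.length < 2) {
    throw new IllegalArgumentException(
        "Invalid checkpoint format. Expected: topic,partition:offset,... Got: " + checkpointStr);
  }
  Map<Integer, Long> partitionOffsets = new HashMap<>();
  for (int i = 1; i < splits.length; i++) { // splits[0] is the topic name
    String[] partitionOffset = splits[i].split(":");
    partitionOffsets.put(Integer.parseInt(partitionOffset[0]), Long.parseLong(partitionOffset[1]));
  }
  return partitionOffsets;
}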

Xinli Shang added 3 commits February 16, 2026 16:54
- Add @PublicAPIClass/@PublicAPIMethod annotations to BasePreCommitValidator and ValidationContext
- Add Javadoc on integration plan with existing SparkPreCommitValidator
- Remove unnecessary supportsMetadataValidation() method
- Slim ValidationContext from 11 abstract methods to 6 core + 5 default computed methods
- Add ConfigProperty entries to HoodiePreCommitValidatorConfig for website doc surfacing
- Fix SLF4J format string (was using Python-style {:.2f})
- Add Javadoc clarifying append-only vs upsert/dedup deviation
- Fix unreachable splits.length < 1 check to < 2 in extractTopicName
- Add consolidation note referencing KafkaOffsetGen.CheckpointUtils#strToOffsets
@hudi-bot
Collaborator

CI report:

@hudi-bot supports the following bot commands:
  • @hudi-bot run azure: re-run the last Azure build


Labels

size:XL PR with lines of changes > 1000

Development

Successfully merging this pull request may close these issues.

Pluggable Pre-Commit Validation Framework with Streaming Offset Validators
