feat(common): add core pre-commit validation framework - Phase 1 #18068
shangxinli wants to merge 4 commits into apache:master from
Conversation
Implement engine-agnostic pre-commit validation framework in hudi-common that enables validators to access commit metadata, timeline, and write statistics across all write engines (Spark, Flink, Java).

This is Phase 1 of a 3-phase implementation:
- Phase 1 (this commit): Core framework in hudi-common
- Phase 2: Flink-specific implementation
- Phase 3: Spark/DeltaStreamer implementation

Key components added:
1. BasePreCommitValidator
   - Abstract base class for all validators
   - Supports metadata-based validation
   - Engine-agnostic design
2. ValidationContext (interface)
   - Provides access to commit metadata, timeline, write stats
   - Engine-specific implementations provide concrete access
   - Abstracts engine details from validation logic
3. StreamingOffsetValidator
   - Base class for streaming offset validators
   - Compares source offset differences with record counts
   - Configurable tolerance and warn-only mode
   - Supports multiple checkpoint formats (Kafka, Flink, Pulsar, Kinesis)
4. CheckpointUtils
   - Multi-format checkpoint parsing utility
   - Supports DeltaStreamer Kafka format (Phase 1)
   - Extensible for Flink, Pulsar, Kinesis (future phases)
   - Offset difference calculation with edge case handling
5. Comprehensive unit tests
   - TestCheckpointUtils: 14 test cases
   - TestStreamingOffsetValidator: 9 test cases

Configuration:
- hoodie.precommit.validators.streaming.offset.tolerance.percentage (default: 0.0)
- hoodie.precommit.validators.warn.only (default: false)
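To make the shape of the framework concrete, here is a minimal sketch of a custom validator built on these abstractions. The class and method names (BasePreCommitValidator, ValidationContext, validateWithMetadata, getWriteStats, getInstantTime) follow the PR description, but the constructor shape and exact signatures are assumptions, not the merged API.

```java
// Illustrative sketch only: not the final API surface of this PR.
import java.util.Properties;

public class NonEmptyCommitValidator extends BasePreCommitValidator {

  public NonEmptyCommitValidator(Properties props) {
    super(props); // assumed: validators receive their configuration at construction time
  }

  @Override
  public void validateWithMetadata(ValidationContext context) {
    // ValidationContext abstracts the engine: the same check runs on Spark, Flink, or Java writers.
    boolean hasWriteStats = !context.getWriteStats().isEmpty();
    long recordsWritten = context.getWriteStats().stream()
        .mapToLong(stat -> stat.getNumInserts() + stat.getNumUpdateWrites()) // HoodieWriteStat getters
        .sum();
    if (hasWriteStats && recordsWritten == 0) {
      // assumed failure mode: throwing aborts the commit when validation fails
      throw new IllegalStateException(
          "Write stats present but zero records written for instant " + context.getInstantTime());
    }
  }
}
```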
yihua
left a comment
Thanks for contributing this! This PR adds a well-structured generalized pre-commit validation framework with test coverage for the checkpoint parsing logic. I left a few comments for clarification.
```java
 * Phase 3: Spark-specific implementations in hudi-client/hudi-spark-client
 */
public abstract class BasePreCommitValidator {
```
There's already a validator framework under org.apache.hudi.client.validator in hudi-client/hudi-spark-client (e.g. SparkPreCommitValidator), with HoodiePreCommitValidatorConfig.VALIDATOR_CLASS_NAMES to specify the validator implementation classes to use. Have you considered how this new BasePreCommitValidator will integrate with the existing SparkPreCommitValidator and SparkValidatorUtils.runValidators() in Phase 3?
Added Javadoc explaining the plan: in Phase 3, SparkPreCommitValidator will be refactored to extend BasePreCommitValidator, and SparkValidatorUtils.runValidators() will be updated to invoke validateWithMetadata() for validators extending this class. Existing VALIDATOR_CLASS_NAMES config will continue to work.
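A rough illustration of what that Phase 3 bridge could look like; this is purely a sketch of the stated plan, the real SparkPreCommitValidator carries generic parameters and additional hooks that are omitted here, and buildSparkValidationContext is a hypothetical helper name.

```java
// Hypothetical Phase 3 shape: the existing Spark validator delegating to the new
// engine-agnostic base class while keeping the VALIDATOR_CLASS_NAMES entry point.
public abstract class SparkPreCommitValidator extends BasePreCommitValidator {

  // existing Spark-facing entry point, kept for backward compatibility
  public void validate(String instantTime) {
    // build an engine-specific ValidationContext from Spark write client state,
    // then route through the shared metadata-based validation path
    ValidationContext context = buildSparkValidationContext(instantTime); // hypothetical helper
    validateWithMetadata(context);
  }

  protected abstract ValidationContext buildSparkValidationContext(String instantTime);
}
```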
```java
  throw new IllegalArgumentException(
      "Invalid checkpoint format. Expected: topic,partition:offset,... Got: " + checkpointStr);
}
```
The splits.length < 1 check is unreachable; String.split() always returns at least one element. Did you mean splits.length < 2 to validate that partition data is present (consistent with parseDeltaStreamerKafkaCheckpoint)?
Fixed to splits.length < 2. Also added a test case for topic-only input.
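For reference, the corrected guard in extractTopicName looks roughly like this; the method name and error message come from the diff above, while the surrounding body is assumed.

```java
// Sketch of the corrected check; only the length guard changed from < 1 to < 2.
private static String extractTopicName(String checkpointStr) {
  String[] splits = checkpointStr.split(",");
  // splits.length is always >= 1, so the old "< 1" guard could never fire;
  // requiring >= 2 ensures at least one partition:offset entry follows the topic
  if (splits.length < 2) {
    throw new IllegalArgumentException(
        "Invalid checkpoint format. Expected: topic,partition:offset,... Got: " + checkpointStr);
  }
  return splits[0];
}
```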
```java
public interface ValidationContext {

  /**
   * Get the current commit instant time being validated.
```
This is a fairly large interface with 11 methods. Some of these (like getTotalInsertRecordsWritten, getTotalUpdateRecordsWritten) can be derived from getWriteStats(). Have you considered keeping the interface minimal (metadata + timeline + stats access) and providing the computed methods as defaults for common logic across engines?
Good call. Slimmed the interface down to 6 core abstract methods (getInstantTime, getCommitMetadata, getWriteStats, getActiveTimeline, getPreviousCommitInstant, getPreviousCommitMetadata). The 5 computed methods are now default implementations derived from the core methods.
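A sketch of the slimmed-down interface; the six core method names are the ones listed above, while the default-method bodies, return types, and the HoodieWriteStat getters they rely on are illustrative assumptions.

```java
// Sketch: six abstract accessors plus derived defaults, so engine-specific
// implementations only provide the core methods.
import java.util.List;
import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.model.HoodieWriteStat;
import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.util.Option;

public interface ValidationContext {
  String getInstantTime();
  HoodieCommitMetadata getCommitMetadata();
  List<HoodieWriteStat> getWriteStats();
  HoodieActiveTimeline getActiveTimeline();
  Option<HoodieInstant> getPreviousCommitInstant();
  Option<HoodieCommitMetadata> getPreviousCommitMetadata();

  // Computed helpers derived from the core accessors.
  default long getTotalInsertRecordsWritten() {
    return getWriteStats().stream().mapToLong(HoodieWriteStat::getNumInserts).sum();
  }

  default long getTotalUpdateRecordsWritten() {
    return getWriteStats().stream().mapToLong(HoodieWriteStat::getNumUpdateWrites).sum();
  }
}
```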
```java
 * Phase 2: Flink-specific implementations in hudi-flink-datasource
 * Phase 3: Spark-specific implementations in hudi-client/hudi-spark-client
 */
public abstract class BasePreCommitValidator {
```
Is this new validator abstraction intended to be public and user-facing? If so, could you mark the class with @PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING) and the methods with @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)?
Done. Added @PublicAPIClass(maturity = EVOLVING) on BasePreCommitValidator and ValidationContext, and @PublicAPIMethod(maturity = EVOLVING) on public/protected methods.
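Roughly what the annotated declaration ends up looking like; a sketch only, with the import package for the annotations assumed from hudi-common and an illustrative method signature.

```java
// Sketch of the annotated public API surface.
import org.apache.hudi.ApiMaturityLevel;
import org.apache.hudi.PublicAPIClass;
import org.apache.hudi.PublicAPIMethod;

@PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
public abstract class BasePreCommitValidator {

  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
  public abstract void validateWithMetadata(ValidationContext context);
}
```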
```java
protected boolean supportsMetadataValidation() {
  return false;
}
```
What's the reason for having this? Is exposing validate or validateWithMetadata not good enough?
Agreed, it's unnecessary indirection. Removed supportsMetadataValidation() entirely — validateWithMetadata is sufficient.
```java
protected static final String TOLERANCE_PERCENTAGE_KEY = "hoodie.precommit.validators.streaming.offset.tolerance.percentage";
protected static final String WARN_ONLY_MODE_KEY = "hoodie.precommit.validators.warn.only";
```
Could you define the configs in HoodiePreCommitValidatorConfig so that they are surfaced in the configs documentation on the Hudi website during the release process?
Done. Added STREAMING_OFFSET_TOLERANCE_PERCENTAGE and WARN_ONLY_MODE as ConfigProperty entries in HoodiePreCommitValidatorConfig so they get surfaced to the website docs.
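A sketch of the resulting entries in HoodiePreCommitValidatorConfig; the ConfigProperty builder pattern (key/defaultValue/withDocumentation) mirrors how other Hudi configs are defined, but the documentation strings shown are illustrative, not the merged wording.

```java
// Sketch of the new ConfigProperty entries for the two Phase 1 settings.
import org.apache.hudi.common.config.ConfigProperty;

public class HoodiePreCommitValidatorConfig {

  public static final ConfigProperty<Double> STREAMING_OFFSET_TOLERANCE_PERCENTAGE = ConfigProperty
      .key("hoodie.precommit.validators.streaming.offset.tolerance.percentage")
      .defaultValue(0.0)
      .withDocumentation("Allowed percentage deviation between the streaming source offset "
          + "difference and the number of records written before validation fails.");

  public static final ConfigProperty<Boolean> WARN_ONLY_MODE = ConfigProperty
      .key("hoodie.precommit.validators.warn.only")
      .defaultValue(false)
      .withDocumentation("When true, pre-commit validation failures are logged as warnings "
          + "instead of failing the commit.");
}
```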
```java
LOG.info("Offset validation passed. Offset diff: {}, Records: {}, Deviation: {:.2f}% (within {}%)",
    offsetDiff, recordsWritten, deviation, tolerancePercentage);
```
{:.2f} does not work with SLF4J. Could you double check the format here?
Good catch. Fixed — using String.format("%.2f", deviation) with SLF4J {} placeholder.
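The corrected logging call would look roughly like this, with the variable names taken from the diff above:

```java
// Corrected SLF4J call: pre-format the deviation with String.format and pass it
// through a plain {} placeholder, since SLF4J does not support {:.2f}.
LOG.info("Offset validation passed. Offset diff: {}, Records: {}, Deviation: {}% (within {}%)",
    offsetDiff, recordsWritten, String.format("%.2f", deviation), tolerancePercentage);
```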
```java
private double calculateDeviation(long offsetDiff, long recordsWritten) {
  // Handle edge cases
  if (offsetDiff == 0 && recordsWritten == 0) {
    return 0.0; // Both zero - perfect match (no data processed)
  }
  if (offsetDiff == 0 || recordsWritten == 0) {
    return 100.0; // One is zero - complete mismatch
  }

  long difference = Math.abs(offsetDiff - recordsWritten);
  return (100.0 * difference) / offsetDiff;
}
```
Is this for the append-only case? For upserts with dedup and/or event-time ordering, the deviation could be legitimate.
Good point. Added a note in the class Javadoc clarifying this is primarily for append-only ingestion. For upsert workloads with dedup or event-time ordering, users should configure a higher tolerance or use warn-only mode.
```java
 * @return Map of partition → offset
 * @throws IllegalArgumentException if format is invalid
 */
private static Map<Integer, Long> parseDeltaStreamerKafkaCheckpoint(String checkpointStr) {
```
There is already KafkaOffsetGen.CheckpointUtils#strToOffsets to parse the Kafka offsets. Should that be removed and consolidated to reuse this one?
Added a Javadoc note about the duplication. Cannot consolidate directly right now since hudi-common cannot depend on Kafka client types (TopicPartition) from hudi-utilities. This method returns Map<Integer, Long> to avoid that dependency. Noted for future refactoring.
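For context, a sketch of what the Kafka-client-free parsing looks like; the checkpoint format (topic,partition:offset,partition:offset,...) matches the DeltaStreamer convention referenced above, and the body shown is illustrative rather than the exact CheckpointUtils implementation.

```java
// Sketch: parse the DeltaStreamer Kafka checkpoint string without depending on
// Kafka client types (TopicPartition).
import java.util.HashMap;
import java.util.Map;

final class CheckpointParsingSketch {

  // Format: topic,partition:offset,partition:offset,...
  static Map<Integer, Long> parseDeltaStreamerKafkaCheckpoint(String checkpointStr) {
    String[] splits = checkpointStr.split(",");
    if (splits.length < 2) {
      throw new IllegalArgumentException(
          "Invalid checkpoint format. Expected: topic,partition:offset,... Got: " + checkpointStr);
    }
    Map<Integer, Long> partitionToOffset = new HashMap<>();
    for (int i = 1; i < splits.length; i++) { // index 0 holds the topic name
      String[] partitionOffset = splits[i].split(":");
      partitionToOffset.put(Integer.parseInt(partitionOffset[0]), Long.parseLong(partitionOffset[1]));
    }
    return partitionToOffset;
  }
}
```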
- Add @PublicAPIClass/@PublicAPIMethod annotations to BasePreCommitValidator and ValidationContext
- Add Javadoc on integration plan with existing SparkPreCommitValidator
- Remove unnecessary supportsMetadataValidation() method
- Slim ValidationContext from 11 abstract methods to 6 core + 5 default computed methods
- Add ConfigProperty entries to HoodiePreCommitValidatorConfig for website doc surfacing
- Fix SLF4J format string (was using Python-style {:.2f})
- Add Javadoc clarifying append-only vs upsert/dedup deviation
- Fix unreachable splits.length < 1 check to < 2 in extractTopicName
- Add consolidation note referencing KafkaOffsetGen.CheckpointUtils#strToOffsets
Describe the issue this Pull Request addresses
This PR implements Phase 1 of a pre-commit validation framework that enables validators to access commit metadata, timeline, and write statistics. This addresses the need for data quality validation before commits are finalized, particularly for detecting data loss in streaming ingestion scenarios.
Closes #18067
Summary and Changelog
Summary:
Adds an engine-agnostic pre-commit validation framework in hudi-common that provides validators with access to commit metadata, write statistics, and timeline information. This enables validators to perform sophisticated data quality checks, such as comparing streaming source offsets with actual record counts to detect data loss.
Changelog:
- BasePreCommitValidator abstract class as foundation for all validators
- ValidationContext interface to provide metadata access across engines
- StreamingOffsetValidator base class for streaming offset validation with configurable tolerance and warn-only mode
- CheckpointUtils utility for parsing and comparing multi-format streaming checkpoints (DeltaStreamer Kafka, Flink Kafka, Pulsar, Kinesis)
- TestCheckpointUtils (14 test cases) and TestStreamingOffsetValidator (9 test cases)

Configuration properties introduced (a usage sketch follows the phase list below):
- hoodie.precommit.validators.streaming.offset.tolerance.percentage (default: 0.0)
- hoodie.precommit.validators.warn.only (default: false)

This is Phase 1 of a 3-phase implementation:
- Phase 1 (this PR): Core framework in hudi-common
- Phase 2: Flink-specific implementation
- Phase 3: Spark/DeltaStreamer implementation
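For illustration, once engine-specific validators exist (Phase 2/3), enabling the two Phase 1 settings would amount to supplying them as writer properties; the values and wiring below are an assumption, not something this PR enables on its own.

```java
// Illustrative only: how the two Phase 1 properties could be supplied to a writer.
import java.util.Properties;

Properties writerProps = new Properties();
writerProps.setProperty("hoodie.precommit.validators.streaming.offset.tolerance.percentage", "5.0"); // allow 5% drift
writerProps.setProperty("hoodie.precommit.validators.warn.only", "true"); // log instead of failing the commit
```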
Impact
Public API Changes:
- org.apache.hudi.client.validator package: BasePreCommitValidator (abstract class), ValidationContext (interface), StreamingOffsetValidator (abstract class)
- org.apache.hudi.common.util.CheckpointUtils

User-Facing Changes:
None in Phase 1. This provides the framework foundation; actual validator implementations will be added in Phase 2 (Flink) and Phase 3 (Spark/DeltaStreamer).
Performance Impact:
None. The framework is passive until validators are configured and implemented in future phases.
Risk Level
Risk Level: low
Justification:
Verification:
- mvn clean install -pl hudi-common -am -DskipTests

Documentation Update
Documentation needed for future phases:
When Phase 2 (Flink) and Phase 3 (Spark/DeltaStreamer) are implemented, the following documentation will be added:
For Phase 1:
No user-facing documentation needed as the framework is not yet exposed to users. Code-level documentation is complete with comprehensive Javadocs in all classes.
Contributor's checklist