From 5f86fb6ab56333ae1d21cdbcee61e6d244d10ec6 Mon Sep 17 00:00:00 2001 From: Xinli Shang Date: Fri, 27 Mar 2026 11:37:39 -0700 Subject: [PATCH 01/12] feat: Add Spark streamer validators for phase 3 precommit validation Implements phase 3 of the precommit validation framework by adding: - SparkKafkaOffsetValidator: Validates Kafka offset consistency - SparkValidationContext: Provides Spark-specific validation context - SparkStreamerValidatorUtils: Utility functions for Spark streamer validation - Comprehensive test coverage for all validator components - Integration with StreamSync and HoodiePreCommitValidatorConfig Co-Authored-By: Claude Opus 4.6 --- bootstrap_register_only_issue.md | 170 +++++++++ .../HoodiePreCommitValidatorConfig.java | 8 +- .../hudi/utilities/streamer/StreamSync.java | 5 + .../validator/SparkKafkaOffsetValidator.java | 57 ++++ .../SparkStreamerValidatorUtils.java | 172 ++++++++++ .../validator/SparkValidationContext.java | 137 ++++++++ .../TestSparkKafkaOffsetValidator.java | 322 ++++++++++++++++++ .../TestSparkStreamerValidatorUtils.java | 237 +++++++++++++ .../validator/TestSparkValidationContext.java | 156 +++++++++ 9 files changed, 1262 insertions(+), 2 deletions(-) create mode 100644 bootstrap_register_only_issue.md create mode 100644 hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java create mode 100644 hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java create mode 100644 hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkValidationContext.java create mode 100644 hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkKafkaOffsetValidator.java create mode 100644 hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java create mode 100644 hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkValidationContext.java diff --git a/bootstrap_register_only_issue.md b/bootstrap_register_only_issue.md new file mode 100644 index 0000000000000..52914b1bd7ae9 --- /dev/null +++ b/bootstrap_register_only_issue.md @@ -0,0 +1,170 @@ +### Feature Description + +**What the feature achieves:** +Adds a `REGISTER_ONLY` bootstrap mode that allows Hudi to register existing partitions and their file listings without reading file contents or creating skeleton files. At query time, Hudi natively reads these partitions as plain Parquet, ensuring `SELECT * FROM table` returns complete results across all tiers — no view wrappers or query changes needed. This enables a three-tier bootstrap strategy for onboarding large Hive tables where historical data resides in cold storage (e.g., S3 Glacier, Azure Archive). + +**Why this feature is needed:** +Problem: Organizations migrating large Hive tables to Hudi often have a tiered storage layout: +- Recent data (e.g., last 30 days) in hot/standard storage — should be fully rewritten into Hudi +- Warm data (e.g., 30 days to 1 year) in standard storage — suitable for METADATA_ONLY bootstrap +- Cold data (e.g., older than 1 year) in archival/cold storage (S3 Glacier, etc.) — cannot be read without expensive retrieval + +Current gaps: +- Bootstrap requires every discovered partition to be either `FULL_RECORD` or `METADATA_ONLY` (enforced by `checkArgument` in `SparkBootstrapCommitActionExecutor.java:292-293`) +- Both modes require reading file contents: `FULL_RECORD` rewrites all data, `METADATA_ONLY` reads every record to extract record keys +- For cold storage, reading file contents triggers data retrieval (e.g., restore from Glacier), which is expensive, slow, and often impractical for terabytes of archival data +- If users bootstrap only recent partitions and skip cold ones entirely, Hudi queries that span into the cold date range silently return incomplete results — **silent data loss** +- The bootstrap epic ([#14665](https://github.com/apache/hudi/issues/14665)) describes "Onboard for new partitions alone" but there is no implementation that safely handles query completeness for skipped partitions + +Real scenario: +- A Hive table has 3 years of daily partitions (~1,095 partitions) +- Only the last year of data is in standard storage; older data is in S3 Glacier +- User wants to onboard to Hudi but cannot afford to restore 2+ years of Glacier data just to extract record keys +- With today's bootstrap, the user must either: (a) pay the Glacier retrieval cost for all cold data, or (b) skip old partitions and risk silent data loss on queries + +### User Experience + +**How users will use this feature:** + +Configuration: +```properties +# Use the date-based 3-tier selector +hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.DateBasedBootstrapModeSelector + +# Partitions newer than 30 days → FULL_RECORD (full rewrite into Hudi) +hoodie.bootstrap.mode.selector.days.full_record=30 + +# Partitions between 30 and 365 days → METADATA_ONLY (skeleton files, read warm storage for record keys) +hoodie.bootstrap.mode.selector.days.metadata_only=365 + +# Partitions older than 365 days → REGISTER_ONLY (no file content reading at all) +# (implicit: anything older than metadata_only threshold) + +# Partition date format (to parse partition paths like datestr=2024-01-15) +hoodie.bootstrap.mode.selector.partition.date.format=yyyy-MM-dd +hoodie.bootstrap.mode.selector.partition.date.field=datestr +``` + +Usage Example — Spark bootstrap: +```scala +spark.emptyDataFrame.write + .format("hudi") + .option("hoodie.bootstrap.base.path", "/data/hive_table") + .option("hoodie.table.name", "my_hudi_table") + .option("hoodie.datasource.write.operation", "bootstrap") + .option("hoodie.bootstrap.mode.selector", + "org.apache.hudi.client.bootstrap.selector.DateBasedBootstrapModeSelector") + .option("hoodie.bootstrap.mode.selector.days.full_record", "30") + .option("hoodie.bootstrap.mode.selector.days.metadata_only", "365") + .option("hoodie.bootstrap.mode.selector.partition.date.format", "yyyy-MM-dd") + .option("hoodie.bootstrap.mode.selector.partition.date.field", "datestr") + .mode(SaveMode.Overwrite) + .save("/data/my_hudi_table") +``` + +Query behavior — **no query changes needed**: +```sql +-- This returns ALL data: hot (FULL_RECORD) + warm (METADATA_ONLY) + cold (REGISTER_ONLY) +-- No views, no UNION ALL, no special syntax +SELECT * FROM my_hudi_table; + +-- Partition filtering works as expected +SELECT * FROM my_hudi_table WHERE datestr >= '2024-01-01'; + +-- Cold partition queries work, just read as plain Parquet (may have different performance) +SELECT * FROM my_hudi_table WHERE datestr = '2022-06-15'; +``` + +Performance characteristics by tier: + +| Tier | Bootstrap cost | Query performance | Hudi meta columns | +|------|---------------|-------------------|-------------------| +| FULL_RECORD (hot) | Full rewrite | Best — native Hudi file | All populated | +| METADATA_ONLY (warm) | Read for record keys | Moderate — skeleton stitching at read time | All populated | +| REGISTER_ONLY (cold) | File listing only (no content read) | Same as plain Parquet | Returned as null | + +Write guardrails: +- Upserts/deletes targeting REGISTER_ONLY partitions fail fast with a clear error message +- This is expected: without record keys, Hudi cannot index or merge records in these partitions +- If cold data is later restored to warm/hot storage, partitions can be "promoted" via re-bootstrap + +API Changes: + +New public APIs: +```java +// New enum value in BootstrapMode +public enum BootstrapMode { + FULL_RECORD, + METADATA_ONLY, + REGISTER_ONLY // NEW: register partition file listing without reading contents +} + +// New selector +public class DateBasedBootstrapModeSelector extends BootstrapModeSelector { + @Override + public Map> select( + List>> partitions); +} +``` + +### Hudi RFC Requirements + +**RFC PR link:** (if applicable) + +Why RFC is needed: +- Does this change public interfaces/APIs? **Yes** + - New `REGISTER_ONLY` value in `BootstrapMode` enum + - New `DateBasedBootstrapModeSelector` class + - Read path changes for bootstrapped tables to natively serve plain Parquet files without Hudi meta columns + - New table property to indicate presence of REGISTER_ONLY partitions + +- Does this change storage format? **Minor** + - Bootstrap commit metadata will include REGISTER_ONLY partition entries (file listings without skeleton files) + - No new file formats; REGISTER_ONLY partitions reference original source Parquet files as-is + - Backward compatible: tables without REGISTER_ONLY partitions are unaffected + +Justification: +- Extends the bootstrap mode enum (affects selectors, executor, read path) +- Read path changes require handling files without Hudi meta columns within a Hudi table +- Needs design review for query completeness and schema merging guarantees +- Affects how Hudi defines table boundaries (a Hudi table now includes "unmanaged" partitions) + +### Task Breakdown + +**Phase 1: Core Bootstrap Changes (write path)** +- Add `REGISTER_ONLY` to `BootstrapMode` enum (`BootstrapMode.java`) +- Create `DateBasedBootstrapModeSelector` with configurable date thresholds and partition date parsing +- Add config properties to `HoodieBootstrapConfig.java` for date-based tier boundaries +- Modify `SparkBootstrapCommitActionExecutor` to handle 3 modes: + - Relax `checkArgument` validation (line 292-293) to accept `REGISTER_ONLY` partitions + - For `REGISTER_ONLY`: record partition file listings in commit metadata without reading file contents or creating skeleton files + - Skip bootstrap index entries for `REGISTER_ONLY` partitions (`HFileBootstrapIndex`) +- Add table property `hoodie.bootstrap.has.register.only.partitions` +- Unit tests for selector and executor changes + +**Phase 2: Read Path — Native Query Completeness (critical)** +- Modify `HoodieBootstrapRelation` (Spark) to handle REGISTER_ONLY partitions: + - When a base file has no bootstrap index entry AND its schema has no Hudi meta columns → read as plain Parquet + - Return null for Hudi meta columns (`_hoodie_record_key`, `_hoodie_commit_time`, `_hoodie_partition_path`, `_hoodie_file_name`, `_hoodie_commit_seqno`) +- Handle schema merging: queries spanning multiple tiers must produce a unified schema where Hudi meta columns are nullable +- Ensure partition pruning works correctly for REGISTER_ONLY partitions +- Integration tests verifying: + - `SELECT *` across all 3 tiers returns complete results + - `SELECT * WHERE partition_col = ` returns correct data + - `SELECT * WHERE partition_col = ` performance is unaffected + - Hudi meta columns are null for REGISTER_ONLY rows, populated for others + +**Phase 3: Write Path Guardrails** +- Fail fast when upsert/delete targets a REGISTER_ONLY partition with a clear error message: + `"Cannot upsert/delete in REGISTER_ONLY bootstrap partition [datestr=2022-06-15]. Re-bootstrap with FULL_RECORD or METADATA_ONLY mode to enable writes."` +- Allow insert-overwrite to "promote" a REGISTER_ONLY partition to a regular Hudi partition (optional, future enhancement) + +**Phase 4: Tooling & Documentation** (optional, future) +- CLI command to list partitions by bootstrap mode +- CLI command to "promote" REGISTER_ONLY partitions to METADATA_ONLY or FULL_RECORD (when data is restored from cold storage) +- Documentation and migration guide updates + +### Related Issues +- [#14665](https://github.com/apache/hudi/issues/14665) — Efficient bootstrap and migration of existing non-Hudi dataset (parent epic) +- [#15974](https://github.com/apache/hudi/issues/15974) — Treat full bootstrap table as regular table +- [#15856](https://github.com/apache/hudi/issues/15856) — Precombine field is not required for metadata only bootstrap diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePreCommitValidatorConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePreCommitValidatorConfig.java index f85cc44120d4e..169494b7244ac 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePreCommitValidatorConfig.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePreCommitValidatorConfig.java @@ -43,7 +43,10 @@ public class HoodiePreCommitValidatorConfig extends HoodieConfig { .key("hoodie.precommit.validators") .defaultValue("") .markAdvanced() - .withDocumentation("Comma separated list of class names that can be invoked to validate commit"); + .withDocumentation("Comma separated list of class names that can be invoked to validate commit. " + + "Available streaming offset validators: " + + "org.apache.hudi.sink.validator.FlinkKafkaOffsetValidator (Flink Kafka), " + + "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator (Spark/HoodieStreamer Kafka)"); public static final String VALIDATOR_TABLE_VARIABLE = ""; public static final ConfigProperty EQUALITY_SQL_QUERIES = ConfigProperty @@ -71,7 +74,8 @@ public class HoodiePreCommitValidatorConfig extends HoodieConfig { .markAdvanced() .withDocumentation("Tolerance percentage for streaming offset validation " + "(used by org.apache.hudi.client.validator.StreamingOffsetValidator " - + "and org.apache.hudi.sink.validator.FlinkKafkaOffsetValidator). " + + "and org.apache.hudi.sink.validator.FlinkKafkaOffsetValidator " + + "and org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator). " + "The validator compares the offset difference (expected records from source) " + "with actual records written. If the deviation exceeds this percentage, " + "the commit is rejected or warned depending on the validation failure policy. " diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java index 600891c85dff6..ef5e89fe806b3 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java @@ -115,6 +115,7 @@ import org.apache.hudi.utilities.sources.InputBatch; import org.apache.hudi.utilities.sources.Source; import org.apache.hudi.utilities.streamer.HoodieStreamer.Config; +import org.apache.hudi.utilities.streamer.validator.SparkStreamerValidatorUtils; import org.apache.hudi.utilities.transform.Transformer; import com.codahale.metrics.Timer; @@ -874,6 +875,10 @@ private Pair, JavaRDD> writeToSinkAndDoMetaSync(Hood totalSuccessfulRecords); String commitActionType = CommitUtils.getCommitActionType(cfg.operation, HoodieTableType.valueOf(cfg.tableType)); + // Run pre-commit streaming offset validators (if configured) before commit + SparkStreamerValidatorUtils.runValidators(props, instantTime, writeStatusRDD, + checkpointCommitMetadata, metaClient); + boolean success = writeClient.commit(instantTime, writeStatusRDD, Option.of(checkpointCommitMetadata), commitActionType, partitionToReplacedFileIds, Option.empty(), Option.of(writeStatusValidator)); releaseResourcesInvoked = true; diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java new file mode 100644 index 0000000000000..b45c7759db81f --- /dev/null +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.utilities.streamer.validator; + +import org.apache.hudi.client.validator.StreamingOffsetValidator; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.table.checkpoint.StreamerCheckpointV1; +import org.apache.hudi.common.util.CheckpointUtils.CheckpointFormat; + +/** + * Spark/HoodieStreamer-specific Kafka offset validator. + * + *

Validates that the number of records written matches the Kafka offset difference + * between the current and previous HoodieStreamer checkpoints. Uses the Spark Kafka + * checkpoint format stored with key {@code deltastreamer.checkpoint.key} in extraMetadata.

+ * + *

Configuration: + *

    + *
  • {@code hoodie.precommit.validators}: Include + * {@code org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator}
  • + *
  • {@code hoodie.precommit.validators.streaming.offset.tolerance.percentage}: + * Acceptable deviation (default: 0.0 = strict)
  • + *
  • {@code hoodie.precommit.validators.failure.policy}: + * FAIL (default) or WARN_LOG
  • + *

+ * + *

This validator is primarily intended for append-only ingestion from Kafka via HoodieStreamer. + * For upsert workloads with deduplication, configure a higher tolerance or use WARN_LOG.

+ */ +public class SparkKafkaOffsetValidator extends StreamingOffsetValidator { + + /** + * Create a Spark Kafka offset validator. + * + * @param config Validator configuration + */ + public SparkKafkaOffsetValidator(TypedProperties config) { + super(config, StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, CheckpointFormat.SPARK_KAFKA); + } +} diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java new file mode 100644 index 0000000000000..42275f1187a35 --- /dev/null +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java @@ -0,0 +1,172 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.utilities.streamer.validator; + +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.client.validator.BasePreCommitValidator; +import org.apache.hudi.client.validator.ValidationContext; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.ReflectionUtils; +import org.apache.hudi.common.util.StringUtils; +import org.apache.hudi.config.HoodiePreCommitValidatorConfig; +import org.apache.hudi.exception.HoodieValidationException; + +import org.apache.spark.api.java.JavaRDD; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.Arrays; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +/** + * Utility for running pre-commit validators in the HoodieStreamer commit flow. + * + *

Instantiates and executes validators configured via + * {@code hoodie.precommit.validators}. Each validator must extend + * {@link BasePreCommitValidator} and have a constructor that accepts + * {@link TypedProperties}.

+ * + *

Called from {@code StreamSync.writeToSinkAndDoMetaSync()} before + * the commit is finalized.

+ */ +public class SparkStreamerValidatorUtils { + + private static final Logger LOG = LoggerFactory.getLogger(SparkStreamerValidatorUtils.class); + + /** + * Run all configured pre-commit validators. + * + * @param props Configuration properties containing validator class names + * @param instant Commit instant time + * @param writeStatusRDD Write statuses from Spark write operations + * @param checkpointCommitMetadata Extra metadata being committed (contains checkpoint info) + * @param metaClient Table meta client for timeline access and previous commit lookup + * @throws HoodieValidationException if any validator fails with FAIL policy + */ + public static void runValidators(TypedProperties props, + String instant, + JavaRDD writeStatusRDD, + Map checkpointCommitMetadata, + HoodieTableMetaClient metaClient) { + String validatorClassNames = props.getString( + HoodiePreCommitValidatorConfig.VALIDATOR_CLASS_NAMES.key(), + HoodiePreCommitValidatorConfig.VALIDATOR_CLASS_NAMES.defaultValue()); + + if (StringUtils.isNullOrEmpty(validatorClassNames)) { + return; + } + + // Collect write statuses and build context + List allWriteStatus = writeStatusRDD.collect(); + HoodieCommitMetadata currentMetadata = buildCommitMetadata(allWriteStatus, checkpointCommitMetadata); + List writeStats = allWriteStatus.stream() + .map(WriteStatus::getStat) + .collect(Collectors.toList()); + + // Load previous commit metadata from timeline + Option previousCommitMetadata = loadPreviousCommitMetadata(metaClient); + + ValidationContext context = new SparkValidationContext( + instant, + Option.of(currentMetadata), + Option.of(writeStats), + previousCommitMetadata, + metaClient); + + // Instantiate and run each validator + List classNames = Arrays.stream(validatorClassNames.split(",")) + .map(String::trim) + .filter(s -> !s.isEmpty()) + .collect(Collectors.toList()); + + for (String className : classNames) { + try { + BasePreCommitValidator validator = (BasePreCommitValidator) + ReflectionUtils.loadClass(className, new Class[] {TypedProperties.class}, props); + LOG.info("Running pre-commit validator: {} for instant: {}", className, instant); + validator.validateWithMetadata(context); + LOG.info("Pre-commit validator {} passed for instant: {}", className, instant); + } catch (HoodieValidationException e) { + LOG.error("Pre-commit validator {} failed for instant: {}", className, instant, e); + throw e; + } catch (Exception e) { + LOG.error("Failed to instantiate or run validator: {}", className, e); + throw new HoodieValidationException( + "Failed to run pre-commit validator: " + className, e); + } + } + } + + /** + * Build HoodieCommitMetadata from write statuses and extra metadata. + * This constructs the metadata object that would be committed, giving + * validators access to the same data. + */ + private static HoodieCommitMetadata buildCommitMetadata( + List writeStatuses, Map extraMetadata) { + HoodieCommitMetadata metadata = new HoodieCommitMetadata(); + + // Add write stats + for (WriteStatus status : writeStatuses) { + HoodieWriteStat stat = status.getStat(); + if (stat != null) { + metadata.addWriteStat(stat.getPartitionPath(), stat); + } + } + + // Add extra metadata (includes checkpoint info like deltastreamer.checkpoint.key) + if (extraMetadata != null) { + extraMetadata.forEach(metadata::addMetadata); + } + + return metadata; + } + + /** + * Load the previous completed commit metadata from the timeline. + */ + private static Option loadPreviousCommitMetadata(HoodieTableMetaClient metaClient) { + try { + HoodieTimeline completedTimeline = metaClient.reloadActiveTimeline() + .getWriteTimeline() + .filterCompletedInstants(); + Option lastInstant = completedTimeline.lastInstant(); + if (lastInstant.isPresent()) { + return Option.of(completedTimeline.readCommitMetadata(lastInstant.get())); + } + } catch (Exception e) { + LOG.warn("Failed to load previous commit metadata, skipping previous commit comparison", e); + } + return Option.empty(); + } + + private SparkStreamerValidatorUtils() { + // Utility class + } +} diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkValidationContext.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkValidationContext.java new file mode 100644 index 0000000000000..d8f845bd19789 --- /dev/null +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkValidationContext.java @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.utilities.streamer.validator; + +import org.apache.hudi.client.validator.ValidationContext; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.util.Option; + +import java.util.List; + +/** + * Spark/HoodieStreamer implementation of {@link ValidationContext}. + * + *

Constructed from data available in {@code StreamSync.writeToSinkAndDoMetaSync()} + * before the commit is finalized. Provides validators with access to commit metadata, + * write statistics, and previous commit information for streaming offset validation.

+ * + *

Unlike Flink's implementation, Spark can optionally provide active timeline access + * via {@link HoodieTableMetaClient} for richer validation patterns.

+ */ +public class SparkValidationContext implements ValidationContext { + + private final String instantTime; + private final Option commitMetadata; + private final Option> writeStats; + private final Option previousCommitMetadata; + private final HoodieTableMetaClient metaClient; + + /** + * Create a Spark validation context with full timeline access. + * + * @param instantTime Current commit instant time + * @param commitMetadata Current commit metadata (with extraMetadata including checkpoints) + * @param writeStats Write statistics from write operations + * @param previousCommitMetadata Metadata from the previous completed commit + * @param metaClient Table meta client for timeline access (may be null for testing) + */ + public SparkValidationContext(String instantTime, + Option commitMetadata, + Option> writeStats, + Option previousCommitMetadata, + HoodieTableMetaClient metaClient) { + this.instantTime = instantTime; + this.commitMetadata = commitMetadata; + this.writeStats = writeStats; + this.previousCommitMetadata = previousCommitMetadata; + this.metaClient = metaClient; + } + + /** + * Create a Spark validation context without timeline access (for testing). + * + * @param instantTime Current commit instant time + * @param commitMetadata Current commit metadata (with extraMetadata including checkpoints) + * @param writeStats Write statistics from write operations + * @param previousCommitMetadata Metadata from the previous completed commit + */ + public SparkValidationContext(String instantTime, + Option commitMetadata, + Option> writeStats, + Option previousCommitMetadata) { + this(instantTime, commitMetadata, writeStats, previousCommitMetadata, null); + } + + @Override + public String getInstantTime() { + return instantTime; + } + + @Override + public Option getCommitMetadata() { + return commitMetadata; + } + + @Override + public Option> getWriteStats() { + return writeStats; + } + + /** + * Get the active timeline. Available when metaClient is provided. + * + * @throws UnsupportedOperationException if metaClient was not provided + */ + @Override + public HoodieActiveTimeline getActiveTimeline() { + if (metaClient == null) { + throw new UnsupportedOperationException( + "Active timeline is not available without HoodieTableMetaClient."); + } + return metaClient.getActiveTimeline(); + } + + /** + * Not directly supported. Use {@link #isFirstCommit()} or + * {@link #getPreviousCommitMetadata()} instead. + * + * @throws UnsupportedOperationException always + */ + @Override + public Option getPreviousCommitInstant() { + throw new UnsupportedOperationException( + "getPreviousCommitInstant() is not available in HoodieStreamer pre-commit validation context. " + + "Use isFirstCommit() or getPreviousCommitMetadata() instead."); + } + + @Override + public boolean isFirstCommit() { + return !previousCommitMetadata.isPresent(); + } + + @Override + public Option getPreviousCommitMetadata() { + return previousCommitMetadata; + } +} diff --git a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkKafkaOffsetValidator.java b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkKafkaOffsetValidator.java new file mode 100644 index 0000000000000..d109aa3246f64 --- /dev/null +++ b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkKafkaOffsetValidator.java @@ -0,0 +1,322 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.utilities.streamer.validator; + +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.table.checkpoint.StreamerCheckpointV1; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.config.HoodiePreCommitValidatorConfig; +import org.apache.hudi.exception.HoodieValidationException; + +import org.junit.jupiter.api.Test; + +import java.util.Collections; +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; +import static org.junit.jupiter.api.Assertions.assertThrows; + +/** + * Tests for {@link SparkKafkaOffsetValidator}. + */ +public class TestSparkKafkaOffsetValidator { + + // ========== Helper methods ========== + + private static TypedProperties defaultConfig() { + TypedProperties props = new TypedProperties(); + props.setProperty(HoodiePreCommitValidatorConfig.STREAMING_OFFSET_TOLERANCE_PERCENTAGE.key(), "0.0"); + props.setProperty(HoodiePreCommitValidatorConfig.VALIDATION_FAILURE_POLICY.key(), "FAIL"); + return props; + } + + private static TypedProperties configWithTolerance(double tolerance) { + TypedProperties props = defaultConfig(); + props.setProperty(HoodiePreCommitValidatorConfig.STREAMING_OFFSET_TOLERANCE_PERCENTAGE.key(), + String.valueOf(tolerance)); + return props; + } + + private static TypedProperties configWithWarnPolicy() { + TypedProperties props = defaultConfig(); + props.setProperty(HoodiePreCommitValidatorConfig.VALIDATION_FAILURE_POLICY.key(), "WARN_LOG"); + return props; + } + + /** + * Build a Spark Kafka checkpoint string. + * Format: topic,partition:offset,partition:offset,... + */ + private static String buildSparkKafkaCheckpoint(String topic, int[] partitions, long[] offsets) { + StringBuilder sb = new StringBuilder(); + sb.append(topic); + for (int i = 0; i < partitions.length; i++) { + sb.append(",").append(partitions[i]).append(":").append(offsets[i]); + } + return sb.toString(); + } + + private static HoodieCommitMetadata buildMetadata(String checkpointValue) { + HoodieCommitMetadata metadata = new HoodieCommitMetadata(); + if (checkpointValue != null) { + metadata.addMetadata(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, checkpointValue); + } + return metadata; + } + + private static List buildWriteStats(long numInserts, long numUpdates) { + HoodieWriteStat stat = new HoodieWriteStat(); + stat.setNumInserts(numInserts); + stat.setNumUpdateWrites(numUpdates); + stat.setPartitionPath("partition1"); + return Collections.singletonList(stat); + } + + private static SparkValidationContext buildContext( + String instantTime, + HoodieCommitMetadata currentMetadata, + List writeStats, + HoodieCommitMetadata previousMetadata) { + return new SparkValidationContext( + instantTime, + Option.of(currentMetadata), + Option.of(writeStats), + previousMetadata != null ? Option.of(previousMetadata) : Option.empty()); + } + + // ========== Tests ========== + + @Test + public void testExactMatchPasses() { + // Previous: partition 0 at offset 100, partition 1 at offset 200 + // Current: partition 0 at offset 200, partition 1 at offset 300 + // Diff = (200-100) + (300-200) = 200. Records written = 200. + String prevCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0, 1}, new long[]{100, 200}); + String currCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0, 1}, new long[]{200, 300}); + + SparkValidationContext ctx = buildContext("20260320120000000", + buildMetadata(currCheckpoint), + buildWriteStats(200, 0), + buildMetadata(prevCheckpoint)); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(defaultConfig()); + assertDoesNotThrow(() -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testDataLossDetected() { + // Diff = 1000 but only 500 records written -> 50% deviation + String prevCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{0}); + String currCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{1000}); + + SparkValidationContext ctx = buildContext("20260320120000000", + buildMetadata(currCheckpoint), + buildWriteStats(500, 0), + buildMetadata(prevCheckpoint)); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(defaultConfig()); + assertThrows(HoodieValidationException.class, () -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testWithinTolerancePasses() { + // Diff = 1000, records = 950 -> 5% deviation, tolerance = 10% + String prevCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{0}); + String currCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{1000}); + + SparkValidationContext ctx = buildContext("20260320120000000", + buildMetadata(currCheckpoint), + buildWriteStats(950, 0), + buildMetadata(prevCheckpoint)); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(configWithTolerance(10.0)); + assertDoesNotThrow(() -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testWarnPolicyDoesNotThrow() { + // Data loss but WARN_LOG policy + String prevCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{0}); + String currCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{1000}); + + SparkValidationContext ctx = buildContext("20260320120000000", + buildMetadata(currCheckpoint), + buildWriteStats(0, 0), + buildMetadata(prevCheckpoint)); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(configWithWarnPolicy()); + assertDoesNotThrow(() -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testSkipsFirstCommit() { + String currCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{1000}); + + // No previous commit + SparkValidationContext ctx = new SparkValidationContext( + "20260320120000000", + Option.of(buildMetadata(currCheckpoint)), + Option.of(buildWriteStats(500, 0)), + Option.empty()); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(defaultConfig()); + assertDoesNotThrow(() -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testSkipsWhenNoCheckpointKey() { + // Current metadata has no checkpoint key + HoodieCommitMetadata currentMeta = new HoodieCommitMetadata(); + String prevCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{100}); + + SparkValidationContext ctx = buildContext("20260320120000000", + currentMeta, + buildWriteStats(500, 0), + buildMetadata(prevCheckpoint)); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(defaultConfig()); + assertDoesNotThrow(() -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testMultiPartitionValidation() { + // 4 partitions, each advancing by 250 = total diff 1000 + String prevCheckpoint = buildSparkKafkaCheckpoint("events", + new int[]{0, 1, 2, 3}, new long[]{0, 0, 0, 0}); + String currCheckpoint = buildSparkKafkaCheckpoint("events", + new int[]{0, 1, 2, 3}, new long[]{250, 250, 250, 250}); + + SparkValidationContext ctx = buildContext("20260320120000000", + buildMetadata(currCheckpoint), + buildWriteStats(800, 200), // 800 inserts + 200 updates = 1000 + buildMetadata(prevCheckpoint)); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(defaultConfig()); + assertDoesNotThrow(() -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testEmptyCommitSkipsValidation() { + // Both offsets same and no records written + String checkpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{100}); + + SparkValidationContext ctx = buildContext("20260320120000000", + buildMetadata(checkpoint), + buildWriteStats(0, 0), + buildMetadata(checkpoint)); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(defaultConfig()); + assertDoesNotThrow(() -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testPreviousCheckpointMissingSkipsValidation() { + // Previous metadata exists but has no checkpoint key + HoodieCommitMetadata prevMeta = new HoodieCommitMetadata(); + + String currCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{1000}); + + SparkValidationContext ctx = buildContext("20260320120000000", + buildMetadata(currCheckpoint), + buildWriteStats(500, 0), + prevMeta); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(defaultConfig()); + assertDoesNotThrow(() -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testOvercountingDetected() { + // More records written than offset diff + // Diff = 100, records = 200 -> |100-200|/100 = 100% deviation + String prevCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{0}); + String currCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{100}); + + SparkValidationContext ctx = buildContext("20260320120000000", + buildMetadata(currCheckpoint), + buildWriteStats(200, 0), + buildMetadata(prevCheckpoint)); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(defaultConfig()); + assertThrows(HoodieValidationException.class, () -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testExactToleranceBoundaryPasses() { + // Diff = 1000, records = 900 -> 10% deviation, tolerance = 10% + String prevCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{0}); + String currCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{1000}); + + SparkValidationContext ctx = buildContext("20260320120000000", + buildMetadata(currCheckpoint), + buildWriteStats(900, 0), + buildMetadata(prevCheckpoint)); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(configWithTolerance(10.0)); + assertDoesNotThrow(() -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testJustOverToleranceFails() { + // Diff = 1000, records = 899 -> 10.1% deviation, tolerance = 10% + String prevCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{0}); + String currCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{1000}); + + SparkValidationContext ctx = buildContext("20260320120000000", + buildMetadata(currCheckpoint), + buildWriteStats(899, 0), + buildMetadata(prevCheckpoint)); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(configWithTolerance(10.0)); + assertThrows(HoodieValidationException.class, () -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testOnlyInsertsNoUpdates() { + // Pure insert workload + String prevCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0, 1}, new long[]{0, 0}); + String currCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0, 1}, new long[]{500, 500}); + + SparkValidationContext ctx = buildContext("20260320120000000", + buildMetadata(currCheckpoint), + buildWriteStats(1000, 0), + buildMetadata(prevCheckpoint)); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(defaultConfig()); + assertDoesNotThrow(() -> validator.validateWithMetadata(ctx)); + } + + @Test + public void testUpdatesCountedInRecordTotal() { + // Diff = 1000. 600 inserts + 400 updates = 1000 total + String prevCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{0}); + String currCheckpoint = buildSparkKafkaCheckpoint("events", new int[]{0}, new long[]{1000}); + + SparkValidationContext ctx = buildContext("20260320120000000", + buildMetadata(currCheckpoint), + buildWriteStats(600, 400), + buildMetadata(prevCheckpoint)); + + SparkKafkaOffsetValidator validator = new SparkKafkaOffsetValidator(defaultConfig()); + assertDoesNotThrow(() -> validator.validateWithMetadata(ctx)); + } +} diff --git a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java new file mode 100644 index 0000000000000..1b20a1dcef268 --- /dev/null +++ b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java @@ -0,0 +1,237 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.utilities.streamer.validator; + +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.table.checkpoint.StreamerCheckpointV1; +import org.apache.hudi.common.testutils.HoodieTestTable; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.config.HoodiePreCommitValidatorConfig; +import org.apache.hudi.exception.HoodieValidationException; + +import org.apache.spark.SparkConf; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.junit.jupiter.api.AfterAll; +import org.junit.jupiter.api.BeforeAll; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; + +import java.io.IOException; +import java.nio.file.Path; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; +import static org.junit.jupiter.api.Assertions.assertThrows; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Tests for {@link SparkStreamerValidatorUtils}. + * + *

Uses a lightweight Spark context for JavaRDD creation. Tests validate the orchestration + * logic (class loading, config passing, error handling) using first-commit scenarios + * (no previous commit on timeline) to avoid needing a full HoodieTable setup.

+ */ +public class TestSparkStreamerValidatorUtils { + + private static JavaSparkContext jsc; + + @TempDir + Path tempDir; + + @BeforeAll + public static void setUp() { + SparkConf conf = new SparkConf() + .setAppName("TestSparkStreamerValidatorUtils") + .setMaster("local[2]") + .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") + .set("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"); + jsc = new JavaSparkContext(conf); + } + + @AfterAll + public static void tearDown() { + if (jsc != null) { + jsc.stop(); + jsc = null; + } + } + + private static TypedProperties propsWithValidator(String validatorClassName) { + TypedProperties props = new TypedProperties(); + props.setProperty(HoodiePreCommitValidatorConfig.VALIDATOR_CLASS_NAMES.key(), validatorClassName); + props.setProperty(HoodiePreCommitValidatorConfig.STREAMING_OFFSET_TOLERANCE_PERCENTAGE.key(), "0.0"); + props.setProperty(HoodiePreCommitValidatorConfig.VALIDATION_FAILURE_POLICY.key(), "FAIL"); + return props; + } + + private static WriteStatus buildWriteStatus(String partitionPath, long numInserts, long numUpdates) { + HoodieWriteStat stat = new HoodieWriteStat(); + stat.setPartitionPath(partitionPath); + stat.setNumInserts(numInserts); + stat.setNumUpdateWrites(numUpdates); + + WriteStatus ws = new WriteStatus(false, 0.0); + ws.setStat(stat); + return ws; + } + + private JavaRDD toRDD(List writeStatuses) { + return jsc.parallelize(writeStatuses); + } + + private org.apache.hudi.common.table.HoodieTableMetaClient createMetaClient() throws IOException { + return org.apache.hudi.common.testutils.HoodieTestUtils.init( + tempDir.toAbsolutePath().toString()); + } + + // ========== Tests ========== + + @Test + public void testNoValidatorsConfigured() throws IOException { + TypedProperties props = new TypedProperties(); + List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); + + assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", toRDD(writeStatuses), + new HashMap<>(), createMetaClient())); + } + + @Test + public void testEmptyValidatorString() throws IOException { + TypedProperties props = new TypedProperties(); + props.setProperty(HoodiePreCommitValidatorConfig.VALIDATOR_CLASS_NAMES.key(), ""); + List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); + + assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", toRDD(writeStatuses), + new HashMap<>(), createMetaClient())); + } + + @Test + public void testValidValidatorFirstCommitPasses() throws IOException { + TypedProperties props = propsWithValidator( + "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator"); + + List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); + Map extraMeta = new HashMap<>(); + extraMeta.put(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:100"); + + // First commit (no previous metadata on timeline) — validator should skip and pass + assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", toRDD(writeStatuses), extraMeta, createMetaClient())); + } + + @Test + public void testInvalidValidatorClassThrows() throws IOException { + TypedProperties props = propsWithValidator("com.nonexistent.FakeValidator"); + List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); + + assertThrows(HoodieValidationException.class, + () -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", toRDD(writeStatuses), new HashMap<>(), createMetaClient())); + } + + @Test + public void testMultipleValidators() throws IOException { + TypedProperties props = propsWithValidator( + "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator," + + "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator"); + + List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); + Map extraMeta = new HashMap<>(); + extraMeta.put(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:100"); + + assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", toRDD(writeStatuses), extraMeta, createMetaClient())); + } + + @Test + public void testValidatorWithWhitespaceInClassNames() throws IOException { + TypedProperties props = propsWithValidator( + " org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator , "); + + List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); + + assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", toRDD(writeStatuses), new HashMap<>(), createMetaClient())); + } + + @Test + public void testNullExtraMetadataHandled() throws IOException { + TypedProperties props = propsWithValidator( + "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator"); + + List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); + + assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", toRDD(writeStatuses), null, createMetaClient())); + } + + @Test + public void testMultipleWriteStatusesAggregated() throws IOException { + TypedProperties props = propsWithValidator( + "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator"); + + List writeStatuses = new ArrayList<>(); + writeStatuses.add(buildWriteStatus("p1", 60, 0)); + writeStatuses.add(buildWriteStatus("p2", 40, 0)); + + Map extraMeta = new HashMap<>(); + extraMeta.put(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:100"); + + assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", toRDD(writeStatuses), extraMeta, createMetaClient())); + } + + @Test + public void testEmptyWriteStatuses() throws IOException { + TypedProperties props = propsWithValidator( + "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator"); + + List writeStatuses = Collections.emptyList(); + Map extraMeta = new HashMap<>(); + extraMeta.put(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:100"); + + assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", toRDD(writeStatuses), extraMeta, createMetaClient())); + } + + @Test + public void testValidationExceptionPreservedAcrossValidators() throws IOException { + TypedProperties props = propsWithValidator( + "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator," + + "com.nonexistent.FakeValidator"); + + List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); + + HoodieValidationException ex = assertThrows(HoodieValidationException.class, + () -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", toRDD(writeStatuses), new HashMap<>(), createMetaClient())); + assertTrue(ex.getMessage().contains("FakeValidator")); + } +} diff --git a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkValidationContext.java b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkValidationContext.java new file mode 100644 index 0000000000000..7f94262e98c43 --- /dev/null +++ b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkValidationContext.java @@ -0,0 +1,156 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.utilities.streamer.validator; + +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.table.checkpoint.StreamerCheckpointV1; +import org.apache.hudi.common.util.Option; + +import org.junit.jupiter.api.Test; + +import java.util.Arrays; +import java.util.Collections; +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Tests for {@link SparkValidationContext}. + */ +public class TestSparkValidationContext { + + private static HoodieWriteStat buildStat(long inserts, long updates) { + HoodieWriteStat stat = new HoodieWriteStat(); + stat.setNumInserts(inserts); + stat.setNumUpdateWrites(updates); + stat.setPartitionPath("partition1"); + return stat; + } + + @Test + public void testBasicProperties() { + HoodieCommitMetadata metadata = new HoodieCommitMetadata(); + metadata.addMetadata("key1", "value1"); + List writeStats = Collections.singletonList(buildStat(100, 50)); + + SparkValidationContext ctx = new SparkValidationContext( + "20260320120000000", + Option.of(metadata), + Option.of(writeStats), + Option.empty()); + + assertEquals("20260320120000000", ctx.getInstantTime()); + assertTrue(ctx.getCommitMetadata().isPresent()); + assertTrue(ctx.getWriteStats().isPresent()); + assertEquals(1, ctx.getWriteStats().get().size()); + } + + @Test + public void testRecordCounting() { + List writeStats = Arrays.asList( + buildStat(100, 50), // partition1: 100 inserts, 50 updates + buildStat(200, 30)); // partition2: 200 inserts, 30 updates + + SparkValidationContext ctx = new SparkValidationContext( + "20260320120000000", + Option.of(new HoodieCommitMetadata()), + Option.of(writeStats), + Option.empty()); + + assertEquals(300, ctx.getTotalInsertRecordsWritten()); + assertEquals(80, ctx.getTotalUpdateRecordsWritten()); + assertEquals(380, ctx.getTotalRecordsWritten()); + } + + @Test + public void testFirstCommitDetection() { + // No previous commit metadata -> first commit + SparkValidationContext ctx = new SparkValidationContext( + "20260320120000000", + Option.of(new HoodieCommitMetadata()), + Option.of(Collections.emptyList()), + Option.empty()); + + assertTrue(ctx.isFirstCommit()); + } + + @Test + public void testNotFirstCommitWhenPreviousExists() { + HoodieCommitMetadata prevMeta = new HoodieCommitMetadata(); + + SparkValidationContext ctx = new SparkValidationContext( + "20260320120000000", + Option.of(new HoodieCommitMetadata()), + Option.of(Collections.emptyList()), + Option.of(prevMeta)); + + assertFalse(ctx.isFirstCommit()); + } + + @Test + public void testExtraMetadataAccess() { + HoodieCommitMetadata metadata = new HoodieCommitMetadata(); + metadata.addMetadata(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:1000"); + metadata.addMetadata("custom.key", "custom_value"); + + SparkValidationContext ctx = new SparkValidationContext( + "20260320120000000", + Option.of(metadata), + Option.of(Collections.emptyList()), + Option.empty()); + + assertEquals("events,0:1000", + ctx.getExtraMetadata(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1).get()); + assertEquals("custom_value", ctx.getExtraMetadata("custom.key").get()); + assertFalse(ctx.getExtraMetadata("nonexistent.key").isPresent()); + } + + @Test + public void testPreviousCommitMetadataAccess() { + HoodieCommitMetadata prevMeta = new HoodieCommitMetadata(); + prevMeta.addMetadata(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:500"); + + SparkValidationContext ctx = new SparkValidationContext( + "20260320120000000", + Option.of(new HoodieCommitMetadata()), + Option.of(Collections.emptyList()), + Option.of(prevMeta)); + + assertTrue(ctx.getPreviousCommitMetadata().isPresent()); + assertEquals("events,0:500", + ctx.getPreviousCommitMetadata().get().getMetadata(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1)); + } + + @Test + public void testEmptyWriteStats() { + SparkValidationContext ctx = new SparkValidationContext( + "20260320120000000", + Option.of(new HoodieCommitMetadata()), + Option.empty(), + Option.empty()); + + assertEquals(0, ctx.getTotalRecordsWritten()); + assertEquals(0, ctx.getTotalInsertRecordsWritten()); + assertEquals(0, ctx.getTotalUpdateRecordsWritten()); + } +} From dd19d66b849313c211b4daffaff000564ade35b4 Mon Sep 17 00:00:00 2001 From: Xinli Shang Date: Tue, 31 Mar 2026 07:50:36 -0700 Subject: [PATCH 02/12] fix: address code review and fix checkstyle violations - Remove unused imports (java.io.IOException, HoodieCommitMetadata, HoodieTestTable, Option) that caused checkstyle build failures - Remove accidentally committed bootstrap_register_only_issue.md - Cache writeStatusRDD before collect() to prevent second DAG evaluation and potential driver OOM - Add comment explaining why validator runs before writeClient.commit(): offset validation is a stronger guard than commitOnErrors and must prevent the commit when data loss is detected - Clarify buildCommitMetadata() produces a pre-commit preview object, not a fully-constructed commit record - Add Javadoc to SparkKafkaOffsetValidator and SparkStreamerValidatorUtils explaining incompatibility with SparkValidatorUtils (different interface and constructor signature) to prevent misconfiguration - Add two-commit integration tests (testSecondCommitMatchingOffsetsPasses, testSecondCommitDataLossDetected) using HoodieTestTable to exercise the real offset comparison path, not just the first-commit skip path --- bootstrap_register_only_issue.md | 170 ------------------ .../hudi/utilities/streamer/StreamSync.java | 6 +- .../validator/SparkKafkaOffsetValidator.java | 7 + .../SparkStreamerValidatorUtils.java | 22 ++- .../TestSparkStreamerValidatorUtils.java | 50 +++++- 5 files changed, 75 insertions(+), 180 deletions(-) delete mode 100644 bootstrap_register_only_issue.md diff --git a/bootstrap_register_only_issue.md b/bootstrap_register_only_issue.md deleted file mode 100644 index 52914b1bd7ae9..0000000000000 --- a/bootstrap_register_only_issue.md +++ /dev/null @@ -1,170 +0,0 @@ -### Feature Description - -**What the feature achieves:** -Adds a `REGISTER_ONLY` bootstrap mode that allows Hudi to register existing partitions and their file listings without reading file contents or creating skeleton files. At query time, Hudi natively reads these partitions as plain Parquet, ensuring `SELECT * FROM table` returns complete results across all tiers — no view wrappers or query changes needed. This enables a three-tier bootstrap strategy for onboarding large Hive tables where historical data resides in cold storage (e.g., S3 Glacier, Azure Archive). - -**Why this feature is needed:** -Problem: Organizations migrating large Hive tables to Hudi often have a tiered storage layout: -- Recent data (e.g., last 30 days) in hot/standard storage — should be fully rewritten into Hudi -- Warm data (e.g., 30 days to 1 year) in standard storage — suitable for METADATA_ONLY bootstrap -- Cold data (e.g., older than 1 year) in archival/cold storage (S3 Glacier, etc.) — cannot be read without expensive retrieval - -Current gaps: -- Bootstrap requires every discovered partition to be either `FULL_RECORD` or `METADATA_ONLY` (enforced by `checkArgument` in `SparkBootstrapCommitActionExecutor.java:292-293`) -- Both modes require reading file contents: `FULL_RECORD` rewrites all data, `METADATA_ONLY` reads every record to extract record keys -- For cold storage, reading file contents triggers data retrieval (e.g., restore from Glacier), which is expensive, slow, and often impractical for terabytes of archival data -- If users bootstrap only recent partitions and skip cold ones entirely, Hudi queries that span into the cold date range silently return incomplete results — **silent data loss** -- The bootstrap epic ([#14665](https://github.com/apache/hudi/issues/14665)) describes "Onboard for new partitions alone" but there is no implementation that safely handles query completeness for skipped partitions - -Real scenario: -- A Hive table has 3 years of daily partitions (~1,095 partitions) -- Only the last year of data is in standard storage; older data is in S3 Glacier -- User wants to onboard to Hudi but cannot afford to restore 2+ years of Glacier data just to extract record keys -- With today's bootstrap, the user must either: (a) pay the Glacier retrieval cost for all cold data, or (b) skip old partitions and risk silent data loss on queries - -### User Experience - -**How users will use this feature:** - -Configuration: -```properties -# Use the date-based 3-tier selector -hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.DateBasedBootstrapModeSelector - -# Partitions newer than 30 days → FULL_RECORD (full rewrite into Hudi) -hoodie.bootstrap.mode.selector.days.full_record=30 - -# Partitions between 30 and 365 days → METADATA_ONLY (skeleton files, read warm storage for record keys) -hoodie.bootstrap.mode.selector.days.metadata_only=365 - -# Partitions older than 365 days → REGISTER_ONLY (no file content reading at all) -# (implicit: anything older than metadata_only threshold) - -# Partition date format (to parse partition paths like datestr=2024-01-15) -hoodie.bootstrap.mode.selector.partition.date.format=yyyy-MM-dd -hoodie.bootstrap.mode.selector.partition.date.field=datestr -``` - -Usage Example — Spark bootstrap: -```scala -spark.emptyDataFrame.write - .format("hudi") - .option("hoodie.bootstrap.base.path", "/data/hive_table") - .option("hoodie.table.name", "my_hudi_table") - .option("hoodie.datasource.write.operation", "bootstrap") - .option("hoodie.bootstrap.mode.selector", - "org.apache.hudi.client.bootstrap.selector.DateBasedBootstrapModeSelector") - .option("hoodie.bootstrap.mode.selector.days.full_record", "30") - .option("hoodie.bootstrap.mode.selector.days.metadata_only", "365") - .option("hoodie.bootstrap.mode.selector.partition.date.format", "yyyy-MM-dd") - .option("hoodie.bootstrap.mode.selector.partition.date.field", "datestr") - .mode(SaveMode.Overwrite) - .save("/data/my_hudi_table") -``` - -Query behavior — **no query changes needed**: -```sql --- This returns ALL data: hot (FULL_RECORD) + warm (METADATA_ONLY) + cold (REGISTER_ONLY) --- No views, no UNION ALL, no special syntax -SELECT * FROM my_hudi_table; - --- Partition filtering works as expected -SELECT * FROM my_hudi_table WHERE datestr >= '2024-01-01'; - --- Cold partition queries work, just read as plain Parquet (may have different performance) -SELECT * FROM my_hudi_table WHERE datestr = '2022-06-15'; -``` - -Performance characteristics by tier: - -| Tier | Bootstrap cost | Query performance | Hudi meta columns | -|------|---------------|-------------------|-------------------| -| FULL_RECORD (hot) | Full rewrite | Best — native Hudi file | All populated | -| METADATA_ONLY (warm) | Read for record keys | Moderate — skeleton stitching at read time | All populated | -| REGISTER_ONLY (cold) | File listing only (no content read) | Same as plain Parquet | Returned as null | - -Write guardrails: -- Upserts/deletes targeting REGISTER_ONLY partitions fail fast with a clear error message -- This is expected: without record keys, Hudi cannot index or merge records in these partitions -- If cold data is later restored to warm/hot storage, partitions can be "promoted" via re-bootstrap - -API Changes: - -New public APIs: -```java -// New enum value in BootstrapMode -public enum BootstrapMode { - FULL_RECORD, - METADATA_ONLY, - REGISTER_ONLY // NEW: register partition file listing without reading contents -} - -// New selector -public class DateBasedBootstrapModeSelector extends BootstrapModeSelector { - @Override - public Map> select( - List>> partitions); -} -``` - -### Hudi RFC Requirements - -**RFC PR link:** (if applicable) - -Why RFC is needed: -- Does this change public interfaces/APIs? **Yes** - - New `REGISTER_ONLY` value in `BootstrapMode` enum - - New `DateBasedBootstrapModeSelector` class - - Read path changes for bootstrapped tables to natively serve plain Parquet files without Hudi meta columns - - New table property to indicate presence of REGISTER_ONLY partitions - -- Does this change storage format? **Minor** - - Bootstrap commit metadata will include REGISTER_ONLY partition entries (file listings without skeleton files) - - No new file formats; REGISTER_ONLY partitions reference original source Parquet files as-is - - Backward compatible: tables without REGISTER_ONLY partitions are unaffected - -Justification: -- Extends the bootstrap mode enum (affects selectors, executor, read path) -- Read path changes require handling files without Hudi meta columns within a Hudi table -- Needs design review for query completeness and schema merging guarantees -- Affects how Hudi defines table boundaries (a Hudi table now includes "unmanaged" partitions) - -### Task Breakdown - -**Phase 1: Core Bootstrap Changes (write path)** -- Add `REGISTER_ONLY` to `BootstrapMode` enum (`BootstrapMode.java`) -- Create `DateBasedBootstrapModeSelector` with configurable date thresholds and partition date parsing -- Add config properties to `HoodieBootstrapConfig.java` for date-based tier boundaries -- Modify `SparkBootstrapCommitActionExecutor` to handle 3 modes: - - Relax `checkArgument` validation (line 292-293) to accept `REGISTER_ONLY` partitions - - For `REGISTER_ONLY`: record partition file listings in commit metadata without reading file contents or creating skeleton files - - Skip bootstrap index entries for `REGISTER_ONLY` partitions (`HFileBootstrapIndex`) -- Add table property `hoodie.bootstrap.has.register.only.partitions` -- Unit tests for selector and executor changes - -**Phase 2: Read Path — Native Query Completeness (critical)** -- Modify `HoodieBootstrapRelation` (Spark) to handle REGISTER_ONLY partitions: - - When a base file has no bootstrap index entry AND its schema has no Hudi meta columns → read as plain Parquet - - Return null for Hudi meta columns (`_hoodie_record_key`, `_hoodie_commit_time`, `_hoodie_partition_path`, `_hoodie_file_name`, `_hoodie_commit_seqno`) -- Handle schema merging: queries spanning multiple tiers must produce a unified schema where Hudi meta columns are nullable -- Ensure partition pruning works correctly for REGISTER_ONLY partitions -- Integration tests verifying: - - `SELECT *` across all 3 tiers returns complete results - - `SELECT * WHERE partition_col = ` returns correct data - - `SELECT * WHERE partition_col = ` performance is unaffected - - Hudi meta columns are null for REGISTER_ONLY rows, populated for others - -**Phase 3: Write Path Guardrails** -- Fail fast when upsert/delete targets a REGISTER_ONLY partition with a clear error message: - `"Cannot upsert/delete in REGISTER_ONLY bootstrap partition [datestr=2022-06-15]. Re-bootstrap with FULL_RECORD or METADATA_ONLY mode to enable writes."` -- Allow insert-overwrite to "promote" a REGISTER_ONLY partition to a regular Hudi partition (optional, future enhancement) - -**Phase 4: Tooling & Documentation** (optional, future) -- CLI command to list partitions by bootstrap mode -- CLI command to "promote" REGISTER_ONLY partitions to METADATA_ONLY or FULL_RECORD (when data is restored from cold storage) -- Documentation and migration guide updates - -### Related Issues -- [#14665](https://github.com/apache/hudi/issues/14665) — Efficient bootstrap and migration of existing non-Hudi dataset (parent epic) -- [#15974](https://github.com/apache/hudi/issues/15974) — Treat full bootstrap table as regular table -- [#15856](https://github.com/apache/hudi/issues/15856) — Precombine field is not required for metadata only bootstrap diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java index ef5e89fe806b3..b20e19d57545b 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java @@ -875,7 +875,11 @@ private Pair, JavaRDD> writeToSinkAndDoMetaSync(Hood totalSuccessfulRecords); String commitActionType = CommitUtils.getCommitActionType(cfg.operation, HoodieTableType.valueOf(cfg.tableType)); - // Run pre-commit streaming offset validators (if configured) before commit + // Run pre-commit streaming offset validators (if configured) before commit. + // This is intentional: offset validation is a stronger guard than commitOnErrors — + // if offset deviation indicates potential data loss, the commit must be prevented + // regardless of the commitOnErrors policy. A HoodieValidationException here will + // propagate up and the commit will not be finalized in the timeline. SparkStreamerValidatorUtils.runValidators(props, instantTime, writeStatusRDD, checkpointCommitMetadata, metaClient); diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java index b45c7759db81f..a4aef893ab93c 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java @@ -43,6 +43,13 @@ * *

This validator is primarily intended for append-only ingestion from Kafka via HoodieStreamer. * For upsert workloads with deduplication, configure a higher tolerance or use WARN_LOG.

+ * + *

Important: This class extends {@link org.apache.hudi.client.validator.BasePreCommitValidator} + * and is invoked by {@link SparkStreamerValidatorUtils}, NOT by {@code SparkValidatorUtils} + * (which expects {@code SparkPreCommitValidator} with a different constructor signature). + * Listing this class in {@code hoodie.precommit.validators} while also using the standard + * Spark table write-path validators will cause an instantiation failure in {@code SparkValidatorUtils}. + * Use this validator exclusively with HoodieStreamer pipelines.

*/ public class SparkKafkaOffsetValidator extends StreamingOffsetValidator { diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java index 42275f1187a35..4459c0c85cf7c 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java @@ -38,7 +38,6 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import java.io.IOException; import java.util.Arrays; import java.util.List; import java.util.Map; @@ -54,6 +53,15 @@ * *

Called from {@code StreamSync.writeToSinkAndDoMetaSync()} before * the commit is finalized.

+ * + *

Note on validator compatibility: This utility uses a different instantiation + * mechanism than {@code SparkValidatorUtils} (used by the Spark table write path). + * {@code SparkValidatorUtils} expects validators implementing {@code SparkPreCommitValidator} + * with a {@code (HoodieSparkTable, HoodieEngineContext, HoodieWriteConfig)} constructor. + * Validators registered here (e.g. {@link SparkKafkaOffsetValidator}) extend + * {@link BasePreCommitValidator} with a {@code (TypedProperties)} constructor and + * are NOT compatible with {@code SparkValidatorUtils}. Do not mix them under the same + * {@code hoodie.precommit.validators} config if both paths are active.

*/ public class SparkStreamerValidatorUtils { @@ -82,7 +90,8 @@ public static void runValidators(TypedProperties props, return; } - // Collect write statuses and build context + // Cache the RDD to avoid recomputation when collecting write stats (prevents a second DAG evaluation) + writeStatusRDD.cache(); List allWriteStatus = writeStatusRDD.collect(); HoodieCommitMetadata currentMetadata = buildCommitMetadata(allWriteStatus, checkpointCommitMetadata); List writeStats = allWriteStatus.stream() @@ -124,9 +133,12 @@ public static void runValidators(TypedProperties props, } /** - * Build HoodieCommitMetadata from write statuses and extra metadata. - * This constructs the metadata object that would be committed, giving - * validators access to the same data. + * Build a pre-commit snapshot of {@link HoodieCommitMetadata} from write statuses and extra metadata. + * + *

This is intentionally a partial/preview object used only for validation — it contains + * write stats and checkpoint extra-metadata, but omits fields that are not available before the + * commit (e.g. schema, operation type). Validators should treat this as a read-only snapshot + * of what will be committed, not a fully-constructed commit record.

*/ private static HoodieCommitMetadata buildCommitMetadata( List writeStatuses, Map extraMetadata) { diff --git a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java index 1b20a1dcef268..43e021cf45074 100644 --- a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java +++ b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java @@ -23,6 +23,7 @@ import org.apache.hudi.common.config.TypedProperties; import org.apache.hudi.common.model.HoodieCommitMetadata; import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.table.HoodieTableMetaClient; import org.apache.hudi.common.table.checkpoint.StreamerCheckpointV1; import org.apache.hudi.common.testutils.HoodieTestTable; import org.apache.hudi.common.util.Option; @@ -52,9 +53,9 @@ /** * Tests for {@link SparkStreamerValidatorUtils}. * - *

Uses a lightweight Spark context for JavaRDD creation. Tests validate the orchestration - * logic (class loading, config passing, error handling) using first-commit scenarios - * (no previous commit on timeline) to avoid needing a full HoodieTable setup.

+ *

Uses a lightweight Spark context for JavaRDD creation. Tests cover orchestration logic + * (class loading, config passing, error handling) as well as end-to-end offset validation + * using a two-commit timeline to verify the real comparison path is exercised.

*/ public class TestSparkStreamerValidatorUtils { @@ -104,7 +105,7 @@ private JavaRDD toRDD(List writeStatuses) { return jsc.parallelize(writeStatuses); } - private org.apache.hudi.common.table.HoodieTableMetaClient createMetaClient() throws IOException { + private HoodieTableMetaClient createMetaClient() throws IOException { return org.apache.hudi.common.testutils.HoodieTestUtils.init( tempDir.toAbsolutePath().toString()); } @@ -234,4 +235,45 @@ public void testValidationExceptionPreservedAcrossValidators() throws IOExceptio props, "20260320120000000", toRDD(writeStatuses), new HashMap<>(), createMetaClient())); assertTrue(ex.getMessage().contains("FakeValidator")); } + + @Test + public void testSecondCommitMatchingOffsetsPasses() throws Exception { + TypedProperties props = propsWithValidator( + "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator"); + + // Create table with a previous committed instant: offset 0 -> 500 + HoodieTableMetaClient metaClient = createMetaClient(); + HoodieCommitMetadata prevMeta = new HoodieCommitMetadata(); + prevMeta.addMetadata(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:500"); + HoodieTestTable.of(metaClient).addCommit("20260320110000000", Option.of(prevMeta)); + + // Second commit: offset 500 -> 600, 100 records written — matches diff exactly + Map extraMeta = new HashMap<>(); + extraMeta.put(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:600"); + List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); + + assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", toRDD(writeStatuses), extraMeta, metaClient)); + } + + @Test + public void testSecondCommitDataLossDetected() throws Exception { + TypedProperties props = propsWithValidator( + "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator"); + + // Create table with a previous committed instant: offset 0 -> 1000 + HoodieTableMetaClient metaClient = createMetaClient(); + HoodieCommitMetadata prevMeta = new HoodieCommitMetadata(); + prevMeta.addMetadata(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:1000"); + HoodieTestTable.of(metaClient).addCommit("20260320110000000", Option.of(prevMeta)); + + // Second commit: offset 1000 -> 2000 (diff=1000) but only 500 records written — data loss + Map extraMeta = new HashMap<>(); + extraMeta.put(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:2000"); + List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 500, 0)); + + assertThrows(HoodieValidationException.class, + () -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", toRDD(writeStatuses), extraMeta, metaClient)); + } } From 129e76e8f0f24cd6fa034e8e34c93c446a36595e Mon Sep 17 00:00:00 2001 From: Xinli Shang Date: Tue, 31 Mar 2026 07:58:46 -0700 Subject: [PATCH 03/12] fix: skip non-SparkPreCommitValidator classes in SparkValidatorUtils SparkKafkaOffsetValidator (and similar streaming validators) extend BasePreCommitValidator with a (TypedProperties) constructor, not the (HoodieSparkTable, HoodieEngineContext, HoodieWriteConfig) constructor that SparkValidatorUtils expects. Listing such a validator in hoodie.precommit.validators previously caused a reflection error in the Spark table write path. Add a Class.isAssignableFrom check to filter out classes that don't implement SparkPreCommitValidator before attempting instantiation, with a clear warning pointing users to SparkStreamerValidatorUtils for streaming validators. --- .../hudi/client/utils/SparkValidatorUtils.java | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java index c7804631d3f65..804cb74415ff2 100644 --- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java +++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java @@ -84,6 +84,21 @@ public static void runValidators(HoodieWriteConfig config, Dataset beforeState = getRecordsFromCommittedFiles(sqlContext, partitionsModified, table, afterState.schema()); Stream validators = Arrays.stream(config.getPreCommitValidators().split(",")) + .map(String::trim) + .filter(validatorClass -> { + try { + Class clazz = Class.forName(validatorClass); + if (!SparkPreCommitValidator.class.isAssignableFrom(clazz)) { + LOG.warn("Skipping validator {} — it does not implement SparkPreCommitValidator. " + + "If this is a streaming offset validator (e.g. SparkKafkaOffsetValidator), " + + "it will be invoked by SparkStreamerValidatorUtils instead.", validatorClass); + return false; + } + return true; + } catch (ClassNotFoundException e) { + throw new HoodieValidationException("Cannot find validator class: " + validatorClass, e); + } + }) .map(validatorClass -> ((SparkPreCommitValidator) ReflectionUtils.loadClass(validatorClass, new Class[] {HoodieSparkTable.class, HoodieEngineContext.class, HoodieWriteConfig.class}, table, context, config))); From 5f810df66efba7853a5abf0fc2bae55660fecc6a Mon Sep 17 00:00:00 2001 From: Xinli Shang Date: Wed, 1 Apr 2026 09:17:59 -0700 Subject: [PATCH 04/12] ci: trigger CI re-run for flaky trino test From 07f8ab88c432082c77922477fb1d52253a60e597 Mon Sep 17 00:00:00 2001 From: Xinli Shang Date: Sat, 4 Apr 2026 08:24:13 -0700 Subject: [PATCH 05/12] fix: address reviewer comments on pre-commit streaming offset validator - Unpersist cached RDD in finally block to prevent executor memory leak - Let IOException propagate from loadPreviousCommitMetadata instead of silently swallowing it - Filter empty validator class names before Class.forName to handle trailing comma in config - Add write error count to validation message to distinguish write failures from silent data loss --- .../validator/StreamingOffsetValidator.java | 24 ++++-- .../TestStreamingOffsetValidator.java | 2 +- .../client/utils/SparkValidatorUtils.java | 1 + .../client/validator/ValidationContext.java | 14 ++++ .../SparkStreamerValidatorUtils.java | 83 ++++++++++--------- 5 files changed, 78 insertions(+), 46 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/validator/StreamingOffsetValidator.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/validator/StreamingOffsetValidator.java index ce577d84ca018..b9e5979ba9ac7 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/validator/StreamingOffsetValidator.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/validator/StreamingOffsetValidator.java @@ -139,6 +139,10 @@ public void validateWithMetadata(ValidationContext context) throws HoodieValidat long recordsWritten = context.getTotalInsertRecordsWritten() + context.getTotalUpdateRecordsWritten(); + // Track write errors so callers can distinguish write-failure deviation (write errors > 0) + // from silent data loss (write errors == 0) when the validator fires. + long writeErrors = context.getTotalWriteErrors(); + // For empty commits (e.g., no new data from source), both offsetDiff and recordsWritten // can be zero. This is a valid scenario — skip validation to avoid false positives. if (offsetDifference == 0 && recordsWritten == 0) { @@ -147,7 +151,7 @@ public void validateWithMetadata(ValidationContext context) throws HoodieValidat } // Validate offset vs record consistency - validateOffsetConsistency(offsetDifference, recordsWritten, + validateOffsetConsistency(offsetDifference, recordsWritten, writeErrors, currentCheckpoint, previousCheckpoint); } @@ -155,12 +159,13 @@ public void validateWithMetadata(ValidationContext context) throws HoodieValidat * Validate that offset difference matches record count within tolerance. * * @param offsetDiff Expected records based on offset difference - * @param recordsWritten Actual records written + * @param recordsWritten Actual records written (inserts + updates) + * @param writeErrors Records that failed to write (tracked in write status errors) * @param currentCheckpoint Current checkpoint string (for error messages) * @param previousCheckpoint Previous checkpoint string (for error messages) * @throws HoodieValidationException if validation fails and policy is FAIL */ - protected void validateOffsetConsistency(long offsetDiff, long recordsWritten, + protected void validateOffsetConsistency(long offsetDiff, long recordsWritten, long writeErrors, String currentCheckpoint, String previousCheckpoint) throws HoodieValidationException { @@ -169,10 +174,13 @@ protected void validateOffsetConsistency(long offsetDiff, long recordsWritten, if (deviation > tolerancePercentage) { String errorMsg = String.format( "Streaming offset validation failed. " - + "Offset difference: %d, Records written: %d, Deviation: %.2f%%, Tolerance: %.2f%%. " - + "This may indicate data loss or filtering. " + + "Offset difference: %d, Records written: %d, Write errors: %d, Deviation: %.2f%%, Tolerance: %.2f%%. " + + "%s" + "Previous checkpoint: %s, Current checkpoint: %s", - offsetDiff, recordsWritten, deviation, tolerancePercentage, + offsetDiff, recordsWritten, writeErrors, deviation, tolerancePercentage, + writeErrors > 0 + ? "Non-zero write errors suggest records failed to write rather than silent data loss. " + : "This may indicate data loss or filtering. ", previousCheckpoint, currentCheckpoint); if (failurePolicy == ValidationFailurePolicy.WARN_LOG) { @@ -181,8 +189,8 @@ protected void validateOffsetConsistency(long offsetDiff, long recordsWritten, throw new HoodieValidationException(errorMsg); } } else { - log.info("Offset validation passed. Offset diff: {}, Records: {}, Deviation: {}% (within {}%)", - offsetDiff, recordsWritten, String.format("%.2f", deviation), tolerancePercentage); + log.info("Offset validation passed. Offset diff: {}, Records: {}, Write errors: {}, Deviation: {}% (within {}%)", + offsetDiff, recordsWritten, writeErrors, String.format("%.2f", deviation), tolerancePercentage); } } diff --git a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/client/validator/TestStreamingOffsetValidator.java b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/client/validator/TestStreamingOffsetValidator.java index 818bc2087a3c1..bc402e5ee5a0e 100644 --- a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/client/validator/TestStreamingOffsetValidator.java +++ b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/client/validator/TestStreamingOffsetValidator.java @@ -62,7 +62,7 @@ public MockOffsetValidator(TypedProperties config) { // Expose protected method for testing public void testValidateOffsetConsistency(long offsetDiff, long recordsWritten, String current, String previous) { - validateOffsetConsistency(offsetDiff, recordsWritten, current, previous); + validateOffsetConsistency(offsetDiff, recordsWritten, 0L, current, previous); } // Expose validateWithMetadata for testing diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java index 804cb74415ff2..f40fb875b9276 100644 --- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java +++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java @@ -85,6 +85,7 @@ public static void runValidators(HoodieWriteConfig config, Stream validators = Arrays.stream(config.getPreCommitValidators().split(",")) .map(String::trim) + .filter(s -> !s.isEmpty()) .filter(validatorClass -> { try { Class clazz = Class.forName(validatorClass); diff --git a/hudi-common/src/main/java/org/apache/hudi/client/validator/ValidationContext.java b/hudi-common/src/main/java/org/apache/hudi/client/validator/ValidationContext.java index 30fdbb3ba3b4a..b85218e587e87 100644 --- a/hudi-common/src/main/java/org/apache/hudi/client/validator/ValidationContext.java +++ b/hudi-common/src/main/java/org/apache/hudi/client/validator/ValidationContext.java @@ -169,6 +169,20 @@ default long getTotalUpdateRecordsWritten() { .orElse(0L); } + /** + * Calculate total write errors in the current commit. + * Records that failed to write are tracked in {@link org.apache.hudi.common.model.HoodieWriteStat#getTotalWriteErrors()}. + * A non-zero error count alongside a deviation in offset validation indicates write failures + * rather than silent data loss — useful context for distinguishing the two failure modes. + * + * @return Total count of records that failed to write + */ + default long getTotalWriteErrors() { + return getWriteStats() + .map(stats -> stats.stream().mapToLong(HoodieWriteStat::getTotalWriteErrors).sum()) + .orElse(0L); + } + /** * Check if this is the first commit (no previous commits exist). * Derived from {@link #getPreviousCommitInstant()}. diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java index 4459c0c85cf7c..1d01b0c7de8c1 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java @@ -32,12 +32,14 @@ import org.apache.hudi.common.util.ReflectionUtils; import org.apache.hudi.common.util.StringUtils; import org.apache.hudi.config.HoodiePreCommitValidatorConfig; +import org.apache.hudi.exception.HoodieIOException; import org.apache.hudi.exception.HoodieValidationException; import org.apache.spark.api.java.JavaRDD; import org.slf4j.Logger; import org.slf4j.LoggerFactory; +import java.io.IOException; import java.util.Arrays; import java.util.List; import java.util.Map; @@ -90,45 +92,50 @@ public static void runValidators(TypedProperties props, return; } - // Cache the RDD to avoid recomputation when collecting write stats (prevents a second DAG evaluation) + // Cache the RDD to avoid recomputation when collecting write stats (prevents a second DAG evaluation). + // Always unpersist in finally to prevent executor memory leaks. writeStatusRDD.cache(); - List allWriteStatus = writeStatusRDD.collect(); - HoodieCommitMetadata currentMetadata = buildCommitMetadata(allWriteStatus, checkpointCommitMetadata); - List writeStats = allWriteStatus.stream() - .map(WriteStatus::getStat) - .collect(Collectors.toList()); - - // Load previous commit metadata from timeline - Option previousCommitMetadata = loadPreviousCommitMetadata(metaClient); - - ValidationContext context = new SparkValidationContext( - instant, - Option.of(currentMetadata), - Option.of(writeStats), - previousCommitMetadata, - metaClient); - - // Instantiate and run each validator - List classNames = Arrays.stream(validatorClassNames.split(",")) - .map(String::trim) - .filter(s -> !s.isEmpty()) - .collect(Collectors.toList()); - - for (String className : classNames) { - try { - BasePreCommitValidator validator = (BasePreCommitValidator) - ReflectionUtils.loadClass(className, new Class[] {TypedProperties.class}, props); - LOG.info("Running pre-commit validator: {} for instant: {}", className, instant); - validator.validateWithMetadata(context); - LOG.info("Pre-commit validator {} passed for instant: {}", className, instant); - } catch (HoodieValidationException e) { - LOG.error("Pre-commit validator {} failed for instant: {}", className, instant, e); - throw e; - } catch (Exception e) { - LOG.error("Failed to instantiate or run validator: {}", className, e); - throw new HoodieValidationException( - "Failed to run pre-commit validator: " + className, e); + try { + List allWriteStatus = writeStatusRDD.collect(); + HoodieCommitMetadata currentMetadata = buildCommitMetadata(allWriteStatus, checkpointCommitMetadata); + List writeStats = allWriteStatus.stream() + .map(WriteStatus::getStat) + .collect(Collectors.toList()); + + // Load previous commit metadata from timeline + Option previousCommitMetadata = loadPreviousCommitMetadata(metaClient); + + ValidationContext context = new SparkValidationContext( + instant, + Option.of(currentMetadata), + Option.of(writeStats), + previousCommitMetadata, + metaClient); + + // Instantiate and run each validator + List classNames = Arrays.stream(validatorClassNames.split(",")) + .map(String::trim) + .filter(s -> !s.isEmpty()) + .collect(Collectors.toList()); + + for (String className : classNames) { + try { + BasePreCommitValidator validator = (BasePreCommitValidator) + ReflectionUtils.loadClass(className, new Class[] {TypedProperties.class}, props); + LOG.info("Running pre-commit validator: {} for instant: {}", className, instant); + validator.validateWithMetadata(context); + LOG.info("Pre-commit validator {} passed for instant: {}", className, instant); + } catch (HoodieValidationException e) { + LOG.error("Pre-commit validator {} failed for instant: {}", className, instant, e); + throw e; + } catch (Exception e) { + LOG.error("Failed to instantiate or run validator: {}", className, e); + throw new HoodieValidationException( + "Failed to run pre-commit validator: " + className, e); + } } + } finally { + writeStatusRDD.unpersist(); } } @@ -172,6 +179,8 @@ private static Option loadPreviousCommitMetadata(HoodieTab if (lastInstant.isPresent()) { return Option.of(completedTimeline.readCommitMetadata(lastInstant.get())); } + } catch (IOException e) { + throw new HoodieIOException("Failed to load previous commit metadata", e); } catch (Exception e) { LOG.warn("Failed to load previous commit metadata, skipping previous commit comparison", e); } From 5798d9f69a972376afc91dd049afcc35ac086df6 Mon Sep 17 00:00:00 2001 From: Xinli Shang Date: Wed, 8 Apr 2026 09:51:18 -0700 Subject: [PATCH 06/12] fix: address reviewer follow-up comments on pre-commit streaming validator - Change runValidators to accept List instead of JavaRDD to fix RDD unpersist-before-commit bug; StreamSync now caches the RDD, collects to list for validators, passes RDD to commit, then unpersists - Remove generic catch(Exception) in loadPreviousCommitMetadata so non-IOException failures propagate instead of silently skipping validation - Implement getPreviousCommitInstant() in SparkValidationContext via timeline lookup instead of throwing UnsupportedOperationException - Add Objects::nonNull filter when building writeStats list - Add BasePreCommitValidator assignability guard in SparkStreamerValidatorUtils to warn and skip SparkPreCommitValidator classes (reverse-direction guard) - Eliminate double class loading in SparkValidatorUtils by combining filter+map into a single flatMap; remove unused ReflectionUtils import - Remove trivial constructor Javadoc from SparkKafkaOffsetValidator - Add HoodieTestUtils import in test; remove Spark context boilerplate now that runValidators accepts List directly --- .../client/utils/SparkValidatorUtils.java | 17 ++-- .../hudi/utilities/streamer/StreamSync.java | 17 ++-- .../validator/SparkKafkaOffsetValidator.java | 5 - .../SparkStreamerValidatorUtils.java | 99 ++++++++++--------- .../validator/SparkValidationContext.java | 16 +-- .../TestSparkStreamerValidatorUtils.java | 63 ++++-------- 6 files changed, 96 insertions(+), 121 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java index f40fb875b9276..1ee8ba0433d52 100644 --- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java +++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java @@ -31,7 +31,6 @@ import org.apache.hudi.common.table.TableSchemaResolver; import org.apache.hudi.common.table.view.HoodieTablePreCommitFileSystemView; import org.apache.hudi.common.util.Option; -import org.apache.hudi.common.util.ReflectionUtils; import org.apache.hudi.common.util.StringUtils; import org.apache.hudi.config.HoodieWriteConfig; import org.apache.hudi.exception.HoodieValidationException; @@ -86,23 +85,25 @@ public static void runValidators(HoodieWriteConfig config, Stream validators = Arrays.stream(config.getPreCommitValidators().split(",")) .map(String::trim) .filter(s -> !s.isEmpty()) - .filter(validatorClass -> { + .flatMap(validatorClass -> { try { Class clazz = Class.forName(validatorClass); if (!SparkPreCommitValidator.class.isAssignableFrom(clazz)) { LOG.warn("Skipping validator {} — it does not implement SparkPreCommitValidator. " + "If this is a streaming offset validator (e.g. SparkKafkaOffsetValidator), " + "it will be invoked by SparkStreamerValidatorUtils instead.", validatorClass); - return false; + return Stream.empty(); } - return true; + SparkPreCommitValidator validator = (SparkPreCommitValidator) + clazz.getDeclaredConstructor(HoodieSparkTable.class, HoodieEngineContext.class, HoodieWriteConfig.class) + .newInstance(table, context, config); + return Stream.of(validator); } catch (ClassNotFoundException e) { throw new HoodieValidationException("Cannot find validator class: " + validatorClass, e); + } catch (ReflectiveOperationException e) { + throw new HoodieValidationException("Failed to instantiate validator: " + validatorClass, e); } - }) - .map(validatorClass -> ((SparkPreCommitValidator) ReflectionUtils.loadClass(validatorClass, - new Class[] {HoodieSparkTable.class, HoodieEngineContext.class, HoodieWriteConfig.class}, - table, context, config))); + }); boolean allSuccess = validators.map(v -> runValidatorAsync(v, writeMetadata, beforeState, afterState, instantTime)).map(CompletableFuture::join) .reduce(true, Boolean::logicalAnd); diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java index b20e19d57545b..36f678e4ee000 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java @@ -875,16 +875,21 @@ private Pair, JavaRDD> writeToSinkAndDoMetaSync(Hood totalSuccessfulRecords); String commitActionType = CommitUtils.getCommitActionType(cfg.operation, HoodieTableType.valueOf(cfg.tableType)); - // Run pre-commit streaming offset validators (if configured) before commit. - // This is intentional: offset validation is a stronger guard than commitOnErrors — - // if offset deviation indicates potential data loss, the commit must be prevented - // regardless of the commitOnErrors policy. A HoodieValidationException here will - // propagate up and the commit will not be finalized in the timeline. - SparkStreamerValidatorUtils.runValidators(props, instantTime, writeStatusRDD, + // Cache the RDD so both validators (collect) and writeClient.commit() can use it + // without triggering a second DAG evaluation of the write operations. + writeStatusRDD.cache(); + List writeStatuses = writeStatusRDD.collect(); + + // Run pre-commit streaming offset validators (if configured). + // Placement before writeClient.commit() is intentional: offset validation is a stronger + // guard than commitOnErrors — if offset deviation indicates potential data loss, the commit + // must be prevented regardless of the commitOnErrors policy. + SparkStreamerValidatorUtils.runValidators(props, instantTime, writeStatuses, checkpointCommitMetadata, metaClient); boolean success = writeClient.commit(instantTime, writeStatusRDD, Option.of(checkpointCommitMetadata), commitActionType, partitionToReplacedFileIds, Option.empty(), Option.of(writeStatusValidator)); + writeStatusRDD.unpersist(); releaseResourcesInvoked = true; if (success) { LOG.info("Commit " + instantTime + " successful!"); diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java index a4aef893ab93c..8e5c8b33713af 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java @@ -53,11 +53,6 @@ */ public class SparkKafkaOffsetValidator extends StreamingOffsetValidator { - /** - * Create a Spark Kafka offset validator. - * - * @param config Validator configuration - */ public SparkKafkaOffsetValidator(TypedProperties config) { super(config, StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, CheckpointFormat.SPARK_KAFKA); } diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java index 1d01b0c7de8c1..474215ec0f3c1 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkStreamerValidatorUtils.java @@ -35,7 +35,6 @@ import org.apache.hudi.exception.HoodieIOException; import org.apache.hudi.exception.HoodieValidationException; -import org.apache.spark.api.java.JavaRDD; import org.slf4j.Logger; import org.slf4j.LoggerFactory; @@ -43,6 +42,7 @@ import java.util.Arrays; import java.util.List; import java.util.Map; +import java.util.Objects; import java.util.stream.Collectors; /** @@ -72,16 +72,21 @@ public class SparkStreamerValidatorUtils { /** * Run all configured pre-commit validators. * + *

The caller is responsible for caching and unpersisting the source RDD if needed. + * This method accepts pre-collected write statuses to avoid a second DAG evaluation — + * the caller should cache the RDD, collect to this list, call this method, then pass + * the same RDD to {@code writeClient.commit()}, and unpersist after commit completes.

+ * * @param props Configuration properties containing validator class names - * @param instant Commit instant time - * @param writeStatusRDD Write statuses from Spark write operations + * @param instantTime Commit instant time + * @param writeStatuses Pre-collected write statuses from Spark write operations * @param checkpointCommitMetadata Extra metadata being committed (contains checkpoint info) * @param metaClient Table meta client for timeline access and previous commit lookup * @throws HoodieValidationException if any validator fails with FAIL policy */ public static void runValidators(TypedProperties props, - String instant, - JavaRDD writeStatusRDD, + String instantTime, + List writeStatuses, Map checkpointCommitMetadata, HoodieTableMetaClient metaClient) { String validatorClassNames = props.getString( @@ -92,50 +97,48 @@ public static void runValidators(TypedProperties props, return; } - // Cache the RDD to avoid recomputation when collecting write stats (prevents a second DAG evaluation). - // Always unpersist in finally to prevent executor memory leaks. - writeStatusRDD.cache(); - try { - List allWriteStatus = writeStatusRDD.collect(); - HoodieCommitMetadata currentMetadata = buildCommitMetadata(allWriteStatus, checkpointCommitMetadata); - List writeStats = allWriteStatus.stream() - .map(WriteStatus::getStat) - .collect(Collectors.toList()); - - // Load previous commit metadata from timeline - Option previousCommitMetadata = loadPreviousCommitMetadata(metaClient); - - ValidationContext context = new SparkValidationContext( - instant, - Option.of(currentMetadata), - Option.of(writeStats), - previousCommitMetadata, - metaClient); - - // Instantiate and run each validator - List classNames = Arrays.stream(validatorClassNames.split(",")) - .map(String::trim) - .filter(s -> !s.isEmpty()) - .collect(Collectors.toList()); - - for (String className : classNames) { - try { - BasePreCommitValidator validator = (BasePreCommitValidator) - ReflectionUtils.loadClass(className, new Class[] {TypedProperties.class}, props); - LOG.info("Running pre-commit validator: {} for instant: {}", className, instant); - validator.validateWithMetadata(context); - LOG.info("Pre-commit validator {} passed for instant: {}", className, instant); - } catch (HoodieValidationException e) { - LOG.error("Pre-commit validator {} failed for instant: {}", className, instant, e); - throw e; - } catch (Exception e) { - LOG.error("Failed to instantiate or run validator: {}", className, e); - throw new HoodieValidationException( - "Failed to run pre-commit validator: " + className, e); + HoodieCommitMetadata currentMetadata = buildCommitMetadata(writeStatuses, checkpointCommitMetadata); + List writeStats = writeStatuses.stream() + .map(WriteStatus::getStat) + .filter(Objects::nonNull) + .collect(Collectors.toList()); + + Option previousCommitMetadata = loadPreviousCommitMetadata(metaClient); + + ValidationContext context = new SparkValidationContext( + instantTime, + Option.of(currentMetadata), + Option.of(writeStats), + previousCommitMetadata, + metaClient); + + List classNames = Arrays.stream(validatorClassNames.split(",")) + .map(String::trim) + .filter(s -> !s.isEmpty()) + .collect(Collectors.toList()); + + for (String className : classNames) { + try { + Class clazz = Class.forName(className); + if (!BasePreCommitValidator.class.isAssignableFrom(clazz)) { + LOG.warn("Skipping validator {} in HoodieStreamer path — it does not extend BasePreCommitValidator. " + + "If this is a SparkPreCommitValidator (e.g. SqlQueryEqualityPreCommitValidator), " + + "it must be invoked via SparkValidatorUtils in the standard Spark write path instead.", className); + continue; } + BasePreCommitValidator validator = (BasePreCommitValidator) + ReflectionUtils.loadClass(className, new Class[] {TypedProperties.class}, props); + LOG.info("Running pre-commit validator: {} for instant: {}", className, instantTime); + validator.validateWithMetadata(context); + LOG.info("Pre-commit validator {} passed for instant: {}", className, instantTime); + } catch (HoodieValidationException e) { + LOG.error("Pre-commit validator {} failed for instant: {}", className, instantTime, e); + throw e; + } catch (Exception e) { + LOG.error("Failed to instantiate or run validator: {}", className, e); + throw new HoodieValidationException( + "Failed to run pre-commit validator: " + className, e); } - } finally { - writeStatusRDD.unpersist(); } } @@ -181,8 +184,6 @@ private static Option loadPreviousCommitMetadata(HoodieTab } } catch (IOException e) { throw new HoodieIOException("Failed to load previous commit metadata", e); - } catch (Exception e) { - LOG.warn("Failed to load previous commit metadata, skipping previous commit comparison", e); } return Option.empty(); } diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkValidationContext.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkValidationContext.java index d8f845bd19789..1d46e13aeedbd 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkValidationContext.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkValidationContext.java @@ -113,16 +113,18 @@ public HoodieActiveTimeline getActiveTimeline() { } /** - * Not directly supported. Use {@link #isFirstCommit()} or - * {@link #getPreviousCommitMetadata()} instead. - * - * @throws UnsupportedOperationException always + * Get the previous completed commit instant by querying the timeline. + * Returns {@link Option#empty()} if this is the first commit or metaClient is unavailable. */ @Override public Option getPreviousCommitInstant() { - throw new UnsupportedOperationException( - "getPreviousCommitInstant() is not available in HoodieStreamer pre-commit validation context. " - + "Use isFirstCommit() or getPreviousCommitMetadata() instead."); + if (metaClient == null) { + return Option.empty(); + } + return metaClient.getActiveTimeline() + .getWriteTimeline() + .filterCompletedInstants() + .lastInstant(); } @Override diff --git a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java index 43e021cf45074..5b279a37af335 100644 --- a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java +++ b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java @@ -26,15 +26,11 @@ import org.apache.hudi.common.table.HoodieTableMetaClient; import org.apache.hudi.common.table.checkpoint.StreamerCheckpointV1; import org.apache.hudi.common.testutils.HoodieTestTable; +import org.apache.hudi.common.testutils.HoodieTestUtils; import org.apache.hudi.common.util.Option; import org.apache.hudi.config.HoodiePreCommitValidatorConfig; import org.apache.hudi.exception.HoodieValidationException; -import org.apache.spark.SparkConf; -import org.apache.spark.api.java.JavaRDD; -import org.apache.spark.api.java.JavaSparkContext; -import org.junit.jupiter.api.AfterAll; -import org.junit.jupiter.api.BeforeAll; import org.junit.jupiter.api.Test; import org.junit.jupiter.api.io.TempDir; @@ -53,35 +49,15 @@ /** * Tests for {@link SparkStreamerValidatorUtils}. * - *

Uses a lightweight Spark context for JavaRDD creation. Tests cover orchestration logic - * (class loading, config passing, error handling) as well as end-to-end offset validation - * using a two-commit timeline to verify the real comparison path is exercised.

+ *

Tests cover orchestration logic (class loading, config passing, error handling) + * as well as end-to-end offset validation using a two-commit timeline to verify + * the real comparison path is exercised.

*/ public class TestSparkStreamerValidatorUtils { - private static JavaSparkContext jsc; - @TempDir Path tempDir; - @BeforeAll - public static void setUp() { - SparkConf conf = new SparkConf() - .setAppName("TestSparkStreamerValidatorUtils") - .setMaster("local[2]") - .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") - .set("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"); - jsc = new JavaSparkContext(conf); - } - - @AfterAll - public static void tearDown() { - if (jsc != null) { - jsc.stop(); - jsc = null; - } - } - private static TypedProperties propsWithValidator(String validatorClassName) { TypedProperties props = new TypedProperties(); props.setProperty(HoodiePreCommitValidatorConfig.VALIDATOR_CLASS_NAMES.key(), validatorClassName); @@ -101,13 +77,8 @@ private static WriteStatus buildWriteStatus(String partitionPath, long numInsert return ws; } - private JavaRDD toRDD(List writeStatuses) { - return jsc.parallelize(writeStatuses); - } - private HoodieTableMetaClient createMetaClient() throws IOException { - return org.apache.hudi.common.testutils.HoodieTestUtils.init( - tempDir.toAbsolutePath().toString()); + return HoodieTestUtils.init(tempDir.toAbsolutePath().toString()); } // ========== Tests ========== @@ -118,7 +89,7 @@ public void testNoValidatorsConfigured() throws IOException { List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( - props, "20260320120000000", toRDD(writeStatuses), + props, "20260320120000000", writeStatuses, new HashMap<>(), createMetaClient())); } @@ -129,7 +100,7 @@ public void testEmptyValidatorString() throws IOException { List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( - props, "20260320120000000", toRDD(writeStatuses), + props, "20260320120000000", writeStatuses, new HashMap<>(), createMetaClient())); } @@ -144,7 +115,7 @@ public void testValidValidatorFirstCommitPasses() throws IOException { // First commit (no previous metadata on timeline) — validator should skip and pass assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( - props, "20260320120000000", toRDD(writeStatuses), extraMeta, createMetaClient())); + props, "20260320120000000", writeStatuses, extraMeta, createMetaClient())); } @Test @@ -154,7 +125,7 @@ public void testInvalidValidatorClassThrows() throws IOException { assertThrows(HoodieValidationException.class, () -> SparkStreamerValidatorUtils.runValidators( - props, "20260320120000000", toRDD(writeStatuses), new HashMap<>(), createMetaClient())); + props, "20260320120000000", writeStatuses, new HashMap<>(), createMetaClient())); } @Test @@ -168,7 +139,7 @@ public void testMultipleValidators() throws IOException { extraMeta.put(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:100"); assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( - props, "20260320120000000", toRDD(writeStatuses), extraMeta, createMetaClient())); + props, "20260320120000000", writeStatuses, extraMeta, createMetaClient())); } @Test @@ -179,7 +150,7 @@ public void testValidatorWithWhitespaceInClassNames() throws IOException { List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( - props, "20260320120000000", toRDD(writeStatuses), new HashMap<>(), createMetaClient())); + props, "20260320120000000", writeStatuses, new HashMap<>(), createMetaClient())); } @Test @@ -190,7 +161,7 @@ public void testNullExtraMetadataHandled() throws IOException { List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( - props, "20260320120000000", toRDD(writeStatuses), null, createMetaClient())); + props, "20260320120000000", writeStatuses, null, createMetaClient())); } @Test @@ -206,7 +177,7 @@ public void testMultipleWriteStatusesAggregated() throws IOException { extraMeta.put(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:100"); assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( - props, "20260320120000000", toRDD(writeStatuses), extraMeta, createMetaClient())); + props, "20260320120000000", writeStatuses, extraMeta, createMetaClient())); } @Test @@ -219,7 +190,7 @@ public void testEmptyWriteStatuses() throws IOException { extraMeta.put(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:100"); assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( - props, "20260320120000000", toRDD(writeStatuses), extraMeta, createMetaClient())); + props, "20260320120000000", writeStatuses, extraMeta, createMetaClient())); } @Test @@ -232,7 +203,7 @@ public void testValidationExceptionPreservedAcrossValidators() throws IOExceptio HoodieValidationException ex = assertThrows(HoodieValidationException.class, () -> SparkStreamerValidatorUtils.runValidators( - props, "20260320120000000", toRDD(writeStatuses), new HashMap<>(), createMetaClient())); + props, "20260320120000000", writeStatuses, new HashMap<>(), createMetaClient())); assertTrue(ex.getMessage().contains("FakeValidator")); } @@ -253,7 +224,7 @@ public void testSecondCommitMatchingOffsetsPasses() throws Exception { List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( - props, "20260320120000000", toRDD(writeStatuses), extraMeta, metaClient)); + props, "20260320120000000", writeStatuses, extraMeta, metaClient)); } @Test @@ -274,6 +245,6 @@ public void testSecondCommitDataLossDetected() throws Exception { assertThrows(HoodieValidationException.class, () -> SparkStreamerValidatorUtils.runValidators( - props, "20260320120000000", toRDD(writeStatuses), extraMeta, metaClient)); + props, "20260320120000000", writeStatuses, extraMeta, metaClient)); } } From 34e50d6b3999b8af80bb35a7184ba4af5d3fea65 Mon Sep 17 00:00:00 2001 From: Xinli Shang Date: Wed, 8 Apr 2026 10:15:40 -0700 Subject: [PATCH 07/12] fix: guard cache() call when writeStatusRDD is already persisted Calling cache() on an RDD that already has a storage level assigned throws SparkUnsupportedOperationException. The write path may cache the RDD internally before returning it. Track whether we own the cache and only call cache()/unpersist() when the RDD was not already persisted. --- .../apache/hudi/utilities/streamer/StreamSync.java | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java index 36f678e4ee000..b8930cdedd572 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java @@ -875,9 +875,12 @@ private Pair, JavaRDD> writeToSinkAndDoMetaSync(Hood totalSuccessfulRecords); String commitActionType = CommitUtils.getCommitActionType(cfg.operation, HoodieTableType.valueOf(cfg.tableType)); - // Cache the RDD so both validators (collect) and writeClient.commit() can use it - // without triggering a second DAG evaluation of the write operations. - writeStatusRDD.cache(); + // Cache the RDD if not already persisted, so both validators (collect) and + // writeClient.commit() share the same materialized result without re-evaluation. + boolean weOwnCache = writeStatusRDD.getStorageLevel().equals(org.apache.spark.storage.StorageLevel.NONE()); + if (weOwnCache) { + writeStatusRDD.cache(); + } List writeStatuses = writeStatusRDD.collect(); // Run pre-commit streaming offset validators (if configured). @@ -889,7 +892,9 @@ private Pair, JavaRDD> writeToSinkAndDoMetaSync(Hood boolean success = writeClient.commit(instantTime, writeStatusRDD, Option.of(checkpointCommitMetadata), commitActionType, partitionToReplacedFileIds, Option.empty(), Option.of(writeStatusValidator)); - writeStatusRDD.unpersist(); + if (weOwnCache) { + writeStatusRDD.unpersist(); + } releaseResourcesInvoked = true; if (success) { LOG.info("Commit " + instantTime + " successful!"); From 97b99fd84e40703c28dffe1ecb66ac97324da1aa Mon Sep 17 00:00:00 2001 From: Xinli Shang Date: Fri, 17 Apr 2026 16:49:49 -0700 Subject: [PATCH 08/12] Ensure writeStatusRDD is always unpersisted via try/finally --- .../apache/hudi/utilities/streamer/StreamSync.java | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java index b8930cdedd572..2965503c4de91 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java @@ -890,10 +890,14 @@ private Pair, JavaRDD> writeToSinkAndDoMetaSync(Hood SparkStreamerValidatorUtils.runValidators(props, instantTime, writeStatuses, checkpointCommitMetadata, metaClient); - boolean success = writeClient.commit(instantTime, writeStatusRDD, Option.of(checkpointCommitMetadata), commitActionType, partitionToReplacedFileIds, Option.empty(), - Option.of(writeStatusValidator)); - if (weOwnCache) { - writeStatusRDD.unpersist(); + boolean success; + try { + success = writeClient.commit(instantTime, writeStatusRDD, Option.of(checkpointCommitMetadata), commitActionType, partitionToReplacedFileIds, Option.empty(), + Option.of(writeStatusValidator)); + } finally { + if (weOwnCache) { + writeStatusRDD.unpersist(); + } } releaseResourcesInvoked = true; if (success) { From 2656b70f897ad3a89b91a03f6c3220fee09d0d07 Mon Sep 17 00:00:00 2001 From: Xinli Shang Date: Tue, 21 Apr 2026 11:31:44 -0700 Subject: [PATCH 09/12] fix: use proper import for StorageLevel instead of fully-qualified class reference MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Sonnet 4.6 --- .../java/org/apache/hudi/utilities/streamer/StreamSync.java | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java index 2965503c4de91..259209bfa824f 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java @@ -129,6 +129,7 @@ import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.StructType; +import org.apache.spark.storage.StorageLevel; import org.slf4j.Logger; import org.slf4j.LoggerFactory; @@ -877,7 +878,7 @@ private Pair, JavaRDD> writeToSinkAndDoMetaSync(Hood // Cache the RDD if not already persisted, so both validators (collect) and // writeClient.commit() share the same materialized result without re-evaluation. - boolean weOwnCache = writeStatusRDD.getStorageLevel().equals(org.apache.spark.storage.StorageLevel.NONE()); + boolean weOwnCache = writeStatusRDD.getStorageLevel().equals(StorageLevel.NONE()); if (weOwnCache) { writeStatusRDD.cache(); } From 156a79db6d6f39dff147827cb907e97eab5eca69 Mon Sep 17 00:00:00 2001 From: Xinli Shang Date: Sat, 9 May 2026 07:40:54 -0700 Subject: [PATCH 10/12] fix: address final reviewer feedback on pre-commit streaming validator - Move runValidators() inside the try/finally so writeStatusRDD.unpersist() always runs, including on validator exceptions (FAIL policy or HoodieIOException from loadPreviousCommitMetadata). - Use ReflectionUtils.loadClass in SparkValidatorUtils for instantiation, matching SparkStreamerValidatorUtils and the rest of the codebase. - Rename weOwnCache to shouldUnpersist to read in the direction of its actual use (gating unpersist in finally). --- .../client/utils/SparkValidatorUtils.java | 8 +++--- .../hudi/utilities/streamer/StreamSync.java | 26 ++++++++++--------- 2 files changed, 19 insertions(+), 15 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java index 1ee8ba0433d52..40ee7d8cefff9 100644 --- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java +++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/utils/SparkValidatorUtils.java @@ -31,6 +31,7 @@ import org.apache.hudi.common.table.TableSchemaResolver; import org.apache.hudi.common.table.view.HoodieTablePreCommitFileSystemView; import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.ReflectionUtils; import org.apache.hudi.common.util.StringUtils; import org.apache.hudi.config.HoodieWriteConfig; import org.apache.hudi.exception.HoodieValidationException; @@ -94,9 +95,10 @@ public static void runValidators(HoodieWriteConfig config, + "it will be invoked by SparkStreamerValidatorUtils instead.", validatorClass); return Stream.empty(); } - SparkPreCommitValidator validator = (SparkPreCommitValidator) - clazz.getDeclaredConstructor(HoodieSparkTable.class, HoodieEngineContext.class, HoodieWriteConfig.class) - .newInstance(table, context, config); + SparkPreCommitValidator validator = (SparkPreCommitValidator) ReflectionUtils.loadClass( + validatorClass, + new Class[] {HoodieSparkTable.class, HoodieEngineContext.class, HoodieWriteConfig.class}, + table, context, config); return Stream.of(validator); } catch (ClassNotFoundException e) { throw new HoodieValidationException("Cannot find validator class: " + validatorClass, e); diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java index 259209bfa824f..2001790f45437 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java @@ -878,25 +878,27 @@ private Pair, JavaRDD> writeToSinkAndDoMetaSync(Hood // Cache the RDD if not already persisted, so both validators (collect) and // writeClient.commit() share the same materialized result without re-evaluation. - boolean weOwnCache = writeStatusRDD.getStorageLevel().equals(StorageLevel.NONE()); - if (weOwnCache) { + // shouldUnpersist is true when we created the cache here (storage level was NONE), + // so the finally block knows to release it. + boolean shouldUnpersist = writeStatusRDD.getStorageLevel().equals(StorageLevel.NONE()); + if (shouldUnpersist) { writeStatusRDD.cache(); } - List writeStatuses = writeStatusRDD.collect(); - - // Run pre-commit streaming offset validators (if configured). - // Placement before writeClient.commit() is intentional: offset validation is a stronger - // guard than commitOnErrors — if offset deviation indicates potential data loss, the commit - // must be prevented regardless of the commitOnErrors policy. - SparkStreamerValidatorUtils.runValidators(props, instantTime, writeStatuses, - checkpointCommitMetadata, metaClient); - boolean success; try { + List writeStatuses = writeStatusRDD.collect(); + + // Run pre-commit streaming offset validators (if configured). + // Placement before writeClient.commit() is intentional: offset validation is a stronger + // guard than commitOnErrors — if offset deviation indicates potential data loss, the commit + // must be prevented regardless of the commitOnErrors policy. + SparkStreamerValidatorUtils.runValidators(props, instantTime, writeStatuses, + checkpointCommitMetadata, metaClient); + success = writeClient.commit(instantTime, writeStatusRDD, Option.of(checkpointCommitMetadata), commitActionType, partitionToReplacedFileIds, Option.empty(), Option.of(writeStatusValidator)); } finally { - if (weOwnCache) { + if (shouldUnpersist) { writeStatusRDD.unpersist(); } } From 5f4bca6c6040b4414cbada6d116177806ae9ce24 Mon Sep 17 00:00:00 2001 From: Xinli Shang Date: Sun, 10 May 2026 08:37:47 -0700 Subject: [PATCH 11/12] fix: only cache writeStatusRDD when pre-commit validators are configured Address danny0405's review comment on PR #18405: skip the .cache()/.unpersist() cycle when no pre-commit validators are configured, since without validators the RDD is consumed exactly once by writeClient.commit() and caching adds no value. Guards both the cache call and the validator collect+run on a single validatorsConfigured boolean derived from hoodie.precommit.validators. --- .../hudi/utilities/streamer/StreamSync.java | 33 +++++++++++-------- 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java index 2001790f45437..2b1406c7deab5 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java @@ -74,6 +74,7 @@ import org.apache.hudi.config.HoodieErrorTableConfig; import org.apache.hudi.config.HoodieIndexConfig; import org.apache.hudi.config.HoodiePayloadConfig; +import org.apache.hudi.config.HoodiePreCommitValidatorConfig; import org.apache.hudi.config.HoodieWriteConfig; import org.apache.hudi.config.metrics.HoodieMetricsConfig; import org.apache.hudi.data.HoodieJavaRDD; @@ -876,24 +877,30 @@ private Pair, JavaRDD> writeToSinkAndDoMetaSync(Hood totalSuccessfulRecords); String commitActionType = CommitUtils.getCommitActionType(cfg.operation, HoodieTableType.valueOf(cfg.tableType)); - // Cache the RDD if not already persisted, so both validators (collect) and - // writeClient.commit() share the same materialized result without re-evaluation. - // shouldUnpersist is true when we created the cache here (storage level was NONE), - // so the finally block knows to release it. - boolean shouldUnpersist = writeStatusRDD.getStorageLevel().equals(StorageLevel.NONE()); + // Cache the RDD only when pre-commit validators are configured. Validators collect the RDD + // before commit, so without caching the same DAG would re-evaluate inside writeClient.commit(). + // When no validators are configured, commit consumes the RDD once and caching adds no value. + // shouldUnpersist is true only when we created the cache here (validators present and storage + // level was NONE), so the finally block knows to release it. + boolean validatorsConfigured = !StringUtils.isNullOrEmpty(props.getString( + HoodiePreCommitValidatorConfig.VALIDATOR_CLASS_NAMES.key(), + HoodiePreCommitValidatorConfig.VALIDATOR_CLASS_NAMES.defaultValue())); + boolean shouldUnpersist = validatorsConfigured && writeStatusRDD.getStorageLevel().equals(StorageLevel.NONE()); if (shouldUnpersist) { writeStatusRDD.cache(); } boolean success; try { - List writeStatuses = writeStatusRDD.collect(); - - // Run pre-commit streaming offset validators (if configured). - // Placement before writeClient.commit() is intentional: offset validation is a stronger - // guard than commitOnErrors — if offset deviation indicates potential data loss, the commit - // must be prevented regardless of the commitOnErrors policy. - SparkStreamerValidatorUtils.runValidators(props, instantTime, writeStatuses, - checkpointCommitMetadata, metaClient); + if (validatorsConfigured) { + List writeStatuses = writeStatusRDD.collect(); + + // Run pre-commit streaming offset validators (if configured). + // Placement before writeClient.commit() is intentional: offset validation is a stronger + // guard than commitOnErrors — if offset deviation indicates potential data loss, the commit + // must be prevented regardless of the commitOnErrors policy. + SparkStreamerValidatorUtils.runValidators(props, instantTime, writeStatuses, + checkpointCommitMetadata, metaClient); + } success = writeClient.commit(instantTime, writeStatusRDD, Option.of(checkpointCommitMetadata), commitActionType, partitionToReplacedFileIds, Option.empty(), Option.of(writeStatusValidator)); From 17ece14ff0e0975a6dd9a0fab824d292b6101753 Mon Sep 17 00:00:00 2001 From: Xinli Shang Date: Sat, 16 May 2026 07:17:44 -0700 Subject: [PATCH 12/12] address codope review: V2-then-V1 checkpoint key resolution + V2 test coverage Comment 1 (SparkKafkaOffsetValidator hardcoded V1 key): - StreamingOffsetValidator base class now exposes a no-key constructor that auto-resolves the checkpoint via CheckpointUtils.getCheckpoint(metadata), which prefers V2 and falls back to V1. The explicit-key constructor stays for subclasses that read a custom non-streamer key (e.g. Flink's HOODIE_METADATA_KEY). - SparkKafkaOffsetValidator switches to the no-key constructor. Comment 2 (tests only cover V1 path): - testSecondCommitMatchingOffsetsPasses and testSecondCommitDataLossDetected are now parameterized over both V1 and V2 checkpoint keys. - Added testV2CheckpointKeyOnTableVersionEightFires on a HoodieTableVersion.EIGHT table with V2 keys, asserting the validator fires on data loss. --- .../validator/StreamingOffsetValidator.java | 69 +++++++++++++++++-- .../validator/SparkKafkaOffsetValidator.java | 9 +-- .../TestSparkStreamerValidatorUtils.java | 56 ++++++++++++--- 3 files changed, 115 insertions(+), 19 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/validator/StreamingOffsetValidator.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/validator/StreamingOffsetValidator.java index b9e5979ba9ac7..40e7a4f1635f9 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/validator/StreamingOffsetValidator.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/validator/StreamingOffsetValidator.java @@ -20,11 +20,13 @@ package org.apache.hudi.client.validator; import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieCommitMetadata; import org.apache.hudi.common.util.CheckpointUtils; import org.apache.hudi.common.util.CheckpointUtils.CheckpointFormat; import org.apache.hudi.common.util.Option; import org.apache.hudi.config.HoodiePreCommitValidatorConfig; import org.apache.hudi.config.HoodiePreCommitValidatorConfig.ValidationFailurePolicy; +import org.apache.hudi.exception.HoodieException; import org.apache.hudi.exception.HoodieValidationException; import lombok.extern.slf4j.Slf4j; @@ -50,7 +52,11 @@ * * Subclasses specify: * - Checkpoint format (SPARK_KAFKA, FLINK_KAFKA, etc.) - * - Checkpoint metadata key + * - Checkpoint metadata key (optional — when omitted, the validator auto-resolves the + * active streamer key from commit metadata using + * {@link org.apache.hudi.common.table.checkpoint.CheckpointUtils#getCheckpoint(HoodieCommitMetadata)}, + * which prefers V2 and falls back to V1. Subclasses that read a custom non-streamer key + * (e.g. Flink's HOODIE_METADATA_KEY) must pass it explicitly.) * - Source-specific parsing logic (if needed) * * Configuration: @@ -66,7 +72,26 @@ public abstract class StreamingOffsetValidator extends BasePreCommitValidator { protected final CheckpointFormat checkpointFormat; /** - * Create a streaming offset validator. + * Create a streaming offset validator that auto-resolves the checkpoint key from commit + * metadata using {@link org.apache.hudi.common.table.checkpoint.CheckpointUtils#getCheckpoint(HoodieCommitMetadata)}. + * + *

Use this constructor for streamer pipelines (V1 or V2 checkpoint keys). The validator + * will prefer V2 (table version 8+) and fall back to V1 transparently, so subclasses don't + * need to know which key the writer used.

+ * + * @param config Validator configuration + * @param checkpointFormat Format of the checkpoint string + */ + protected StreamingOffsetValidator(TypedProperties config, + CheckpointFormat checkpointFormat) { + this(config, null, checkpointFormat); + } + + /** + * Create a streaming offset validator with an explicit checkpoint metadata key. + * + *

Use this constructor when the writer stores its checkpoint under a custom key that + * is not the standard streamer V1/V2 key (e.g. Flink's HOODIE_METADATA_KEY).

* * @param config Validator configuration * @param checkpointKey Key to extract checkpoint from extraMetadata @@ -95,10 +120,12 @@ public void validateWithMetadata(ValidationContext context) throws HoodieValidat return; } - // Extract current checkpoint - Option currentCheckpointOpt = context.getExtraMetadata(checkpointKey); + // Extract current checkpoint — either from the explicit key (custom writers like Flink) or + // by auto-resolving from commit metadata (streamer pipelines, V2-then-V1 fallback). + Option currentCheckpointOpt = resolveCheckpoint(context.getCommitMetadata()); if (!currentCheckpointOpt.isPresent()) { - log.warn("Current checkpoint not found with key: {}. Skipping validation.", checkpointKey); + log.warn("Current checkpoint not found (key: {}). Skipping validation.", + checkpointKey == null ? "" : checkpointKey); return; } String currentCheckpoint = currentCheckpointOpt.get(); @@ -110,8 +137,7 @@ public void validateWithMetadata(ValidationContext context) throws HoodieValidat } // Extract previous checkpoint - Option previousCheckpointOpt = context.getPreviousCommitMetadata() - .flatMap(metadata -> Option.ofNullable(metadata.getMetadata(checkpointKey))); + Option previousCheckpointOpt = resolveCheckpoint(context.getPreviousCommitMetadata()); if (!previousCheckpointOpt.isPresent()) { log.info("Previous checkpoint not found. May be first streaming commit. Skipping validation."); @@ -218,4 +244,33 @@ private double calculateDeviation(long offsetDiff, long recordsWritten) { long difference = Math.abs(offsetDiff - recordsWritten); return (100.0 * difference) / offsetDiff; } + + /** + * Resolve the checkpoint string from commit metadata. + * + *

When the validator was constructed with an explicit {@code checkpointKey}, that key + * is read directly. Otherwise, {@link org.apache.hudi.common.table.checkpoint.CheckpointUtils#getCheckpoint(HoodieCommitMetadata)} + * is used to locate the active streamer checkpoint (V2 first, V1 fallback), so callers + * don't need to know which key the writer used.

+ * + * @param commitMetadataOpt Optional commit metadata containing extraMetadata + * @return Optional checkpoint string (empty if metadata is absent or no checkpoint key matches) + */ + private Option resolveCheckpoint(Option commitMetadataOpt) { + if (!commitMetadataOpt.isPresent()) { + return Option.empty(); + } + HoodieCommitMetadata metadata = commitMetadataOpt.get(); + if (checkpointKey != null) { + return Option.ofNullable(metadata.getMetadata(checkpointKey)); + } + try { + return Option.ofNullable( + org.apache.hudi.common.table.checkpoint.CheckpointUtils.getCheckpoint(metadata) + .getCheckpointKey()); + } catch (HoodieException e) { + // No V1 or V2 streamer checkpoint key present in extraMetadata. + return Option.empty(); + } + } } diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java index 8e5c8b33713af..589e09e96de89 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/validator/SparkKafkaOffsetValidator.java @@ -21,15 +21,16 @@ import org.apache.hudi.client.validator.StreamingOffsetValidator; import org.apache.hudi.common.config.TypedProperties; -import org.apache.hudi.common.table.checkpoint.StreamerCheckpointV1; import org.apache.hudi.common.util.CheckpointUtils.CheckpointFormat; /** * Spark/HoodieStreamer-specific Kafka offset validator. * *

Validates that the number of records written matches the Kafka offset difference - * between the current and previous HoodieStreamer checkpoints. Uses the Spark Kafka - * checkpoint format stored with key {@code deltastreamer.checkpoint.key} in extraMetadata.

+ * between the current and previous HoodieStreamer checkpoints. The active checkpoint key + * (V1 {@code deltastreamer.checkpoint.key} or V2 {@code streamer.checkpoint.key.v2}) is + * resolved at validation time via {@link org.apache.hudi.common.table.checkpoint.CheckpointUtils#getCheckpoint}, + * so this validator works against tables written with either checkpoint key version.

* *

Configuration: *

    @@ -54,6 +55,6 @@ public class SparkKafkaOffsetValidator extends StreamingOffsetValidator { public SparkKafkaOffsetValidator(TypedProperties config) { - super(config, StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, CheckpointFormat.SPARK_KAFKA); + super(config, CheckpointFormat.SPARK_KAFKA); } } diff --git a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java index 5b279a37af335..69d5d228dab60 100644 --- a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java +++ b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/validator/TestSparkStreamerValidatorUtils.java @@ -22,9 +22,12 @@ import org.apache.hudi.client.WriteStatus; import org.apache.hudi.common.config.TypedProperties; import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.model.HoodieTableType; import org.apache.hudi.common.model.HoodieWriteStat; import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.HoodieTableVersion; import org.apache.hudi.common.table.checkpoint.StreamerCheckpointV1; +import org.apache.hudi.common.table.checkpoint.StreamerCheckpointV2; import org.apache.hudi.common.testutils.HoodieTestTable; import org.apache.hudi.common.testutils.HoodieTestUtils; import org.apache.hudi.common.util.Option; @@ -33,6 +36,8 @@ import org.junit.jupiter.api.Test; import org.junit.jupiter.api.io.TempDir; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.ValueSource; import java.io.IOException; import java.nio.file.Path; @@ -81,6 +86,10 @@ private HoodieTableMetaClient createMetaClient() throws IOException { return HoodieTestUtils.init(tempDir.toAbsolutePath().toString()); } + private HoodieTableMetaClient createMetaClient(HoodieTableVersion version) throws IOException { + return HoodieTestUtils.init(tempDir.toAbsolutePath().toString(), HoodieTableType.COPY_ON_WRITE, version); + } + // ========== Tests ========== @Test @@ -207,44 +216,75 @@ public void testValidationExceptionPreservedAcrossValidators() throws IOExceptio assertTrue(ex.getMessage().contains("FakeValidator")); } - @Test - public void testSecondCommitMatchingOffsetsPasses() throws Exception { + @ParameterizedTest + @ValueSource(strings = { + StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, + StreamerCheckpointV2.STREAMER_CHECKPOINT_KEY_V2 + }) + public void testSecondCommitMatchingOffsetsPasses(String checkpointKey) throws Exception { TypedProperties props = propsWithValidator( "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator"); // Create table with a previous committed instant: offset 0 -> 500 HoodieTableMetaClient metaClient = createMetaClient(); HoodieCommitMetadata prevMeta = new HoodieCommitMetadata(); - prevMeta.addMetadata(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:500"); + prevMeta.addMetadata(checkpointKey, "events,0:500"); HoodieTestTable.of(metaClient).addCommit("20260320110000000", Option.of(prevMeta)); // Second commit: offset 500 -> 600, 100 records written — matches diff exactly Map extraMeta = new HashMap<>(); - extraMeta.put(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:600"); + extraMeta.put(checkpointKey, "events,0:600"); List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 100, 0)); assertDoesNotThrow(() -> SparkStreamerValidatorUtils.runValidators( props, "20260320120000000", writeStatuses, extraMeta, metaClient)); } - @Test - public void testSecondCommitDataLossDetected() throws Exception { + @ParameterizedTest + @ValueSource(strings = { + StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, + StreamerCheckpointV2.STREAMER_CHECKPOINT_KEY_V2 + }) + public void testSecondCommitDataLossDetected(String checkpointKey) throws Exception { TypedProperties props = propsWithValidator( "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator"); // Create table with a previous committed instant: offset 0 -> 1000 HoodieTableMetaClient metaClient = createMetaClient(); HoodieCommitMetadata prevMeta = new HoodieCommitMetadata(); - prevMeta.addMetadata(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:1000"); + prevMeta.addMetadata(checkpointKey, "events,0:1000"); HoodieTestTable.of(metaClient).addCommit("20260320110000000", Option.of(prevMeta)); // Second commit: offset 1000 -> 2000 (diff=1000) but only 500 records written — data loss Map extraMeta = new HashMap<>(); - extraMeta.put(StreamerCheckpointV1.STREAMER_CHECKPOINT_KEY_V1, "events,0:2000"); + extraMeta.put(checkpointKey, "events,0:2000"); List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 500, 0)); assertThrows(HoodieValidationException.class, () -> SparkStreamerValidatorUtils.runValidators( props, "20260320120000000", writeStatuses, extraMeta, metaClient)); } + + @Test + public void testV2CheckpointKeyOnTableVersionEightFires() throws Exception { + // Verifies the validator actually fires on a writeTableVersion=8 table that uses the + // V2 checkpoint key — i.e. the auto-resolution in StreamingOffsetValidator picks up V2 + // and runs the comparison instead of silently skipping. + TypedProperties props = propsWithValidator( + "org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator"); + + HoodieTableMetaClient metaClient = createMetaClient(HoodieTableVersion.EIGHT); + HoodieCommitMetadata prevMeta = new HoodieCommitMetadata(); + prevMeta.addMetadata(StreamerCheckpointV2.STREAMER_CHECKPOINT_KEY_V2, "events,0:1000"); + HoodieTestTable.of(metaClient).addCommit("20260320110000000", Option.of(prevMeta)); + + // Offset diff = 1000 but only 200 records written — must fail + Map extraMeta = new HashMap<>(); + extraMeta.put(StreamerCheckpointV2.STREAMER_CHECKPOINT_KEY_V2, "events,0:2000"); + List writeStatuses = Collections.singletonList(buildWriteStatus("p1", 200, 0)); + + assertThrows(HoodieValidationException.class, + () -> SparkStreamerValidatorUtils.runValidators( + props, "20260320120000000", writeStatuses, extraMeta, metaClient)); + } }