MINOR: Fix clustering row writer to avoid using timestamp based reads #18475
nsivabalan merged 1 commit into apache:branch-0.x
Conversation
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Style & Readability Review — Code looks clean overall with one minor readability suggestion.
    Assertions.assertTrue(recordsProcessedByClustering > 0,
        "Clustering should have processed some records, but got: " + recordsProcessedByClustering);

    // Verify that there were more total records than what was clustered
🤖 nit: the writeData() method signature declares List&lt;HoodieRecord&gt; as the return type, but the return value is never used at the call site (line 120), and the method just returns the input unchanged; consider changing the return type to void for clarity.
@lokeshj1703 it would be good to address this nit.
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
LGTM — clean fix that removes a redundant (and potentially harmful) TIMESTAMP_AS_OF parameter from the clustering row writer read path. The HoodieFileGroupReader operates on explicitly captured file slices from the clustering plan, and HoodieMergedLogRecordReader already handles inflight record filtering via allowInflightInstants=false, making the timestamp filter both unnecessary and a potential source of data loss (committed records between scheduling and execution would have been silently excluded). The new test is well-structured and correctly exercises both COW and MOR paths with the proper hoodie.datasource.write.row.writer.enable config key that controls the row writer branch in MultipleSparkJobExecutionStrategy.performClustering().
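The data-loss risk described above can be illustrated with a small, self-contained toy (this is not Hudi code; all names and timestamps here are hypothetical): filtering records by a scheduling-time timestamp drops anything committed between scheduling and execution, while selecting by the file slices captured in the clustering plan keeps every record in those files.

```java
import java.util.List;
import java.util.Set;

// Toy illustration (not Hudi code): why a TIMESTAMP_AS_OF-style filter on the
// clustering read path can silently drop records. All names are hypothetical.
public class ClusteringReadDemo {
    record Rec(String fileId, long commitTime) {}

    static final long SCHEDULE_TIME = 100;              // clustering plan scheduled here
    static final List<Rec> RECORDS = List.of(
        new Rec("f1", 90),    // committed before scheduling
        new Rec("f1", 110),   // committed between scheduling and execution
        new Rec("f2", 95));   // file not part of the clustering plan
    static final Set<String> PLAN_FILES = Set.of("f1"); // file slices captured in the plan

    // Timestamp-based read: excludes the record committed at t=110.
    static long countByTimestamp() {
        return RECORDS.stream()
            .filter(r -> PLAN_FILES.contains(r.fileId()) && r.commitTime() <= SCHEDULE_TIME)
            .count();
    }

    // Explicit file-slice read: every record in the planned file is clustered.
    static long countByPlanFiles() {
        return RECORDS.stream()
            .filter(r -> PLAN_FILES.contains(r.fileId()))
            .count();
    }

    public static void main(String[] args) {
        System.out.println(countByTimestamp() + " vs " + countByPlanFiles()); // 1 vs 2
    }
}
```

The explicit-file read counts both f1 records, whereas the timestamp filter sees only the one committed before scheduling, which is exactly the record the removed TIMESTAMP_AS_OF parameter would have lost.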
…#18476)

PR cherry-picks #18475.

This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads. The read code path for the clustering row writer already filters based on the explicit file paths set in params, so the TIMESTAMP_AS_OF option is removed from the query.

Co-authored-by: Lokesh Jain <ljain@Lokeshs-MacBook-Pro.local>
Describe the issue this Pull Request addresses
This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.
The read code path for the clustering row writer already filters based on the explicit file paths we set in params, so the TIMESTAMP_AS_OF option is removed from the query.
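For context, the row-writer clustering path that the new test exercises is gated by writer configs. A minimal sketch of the relevant keys (hoodie.datasource.write.row.writer.enable comes from this PR's discussion; the inline-clustering keys are standard Hudi configs, and the values are illustrative):

```properties
# Routes clustering through the Spark row-writer path exercised by the new test
hoodie.datasource.write.row.writer.enable=true
# Standard inline-clustering toggles (values illustrative)
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=2
```

With these set, clustering is scheduled and executed inline after the configured number of commits, taking the row-writer branch in MultipleSparkJobExecutionStrategy.performClustering().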
Summary and Changelog
This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.
Impact
NA
Risk Level
Low
Documentation Update
NA
Contributor's checklist