MINOR: Fix clustering row writer to avoid using timestamp based reads by lokeshj1703 · Pull Request #18476 · apache/hudi

lokeshj1703 · 2026-04-07T07:17:15Z

Describe the issue this Pull Request addresses

PR cherry-picks #18475

This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.

The read path for the clustering row writer already filters on the explicit file paths set in params, so this PR removes TIMESTAMP_AS_OF from the query.
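As a rough illustration of the change described above, the sketch below builds the read options without a time-travel entry and relies only on the explicit paths. It is not the actual Hudi code: the class and method names are hypothetical, and the option keys are written as plain strings on the assumption that they match the real keys (`hoodie.datasource.query.type`, `hoodie.datasource.read.paths`, and `as.of.instant` for TIMESTAMP_AS_OF).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Hudi.
public class ClusteringReadParams {
  // Assumed option keys, spelled out as plain strings.
  static final String QUERY_TYPE = "hoodie.datasource.query.type";
  static final String READ_PATHS = "hoodie.datasource.read.paths";
  static final String AS_OF_INSTANT = "as.of.instant";

  // Builds read options scoped to the given files, with no as-of timestamp.
  static Map<String, String> buildParams(List<String> plannedFilePaths) {
    Map<String, String> params = new HashMap<>();
    params.put(QUERY_TYPE, "snapshot");
    // The explicit path list is what limits the read to the clustering group.
    params.put(READ_PATHS, String.join(",", plannedFilePaths));
    // Deliberately no params.put(AS_OF_INSTANT, ...): the paths already scope the read.
    return params;
  }

  public static void main(String[] args) {
    Map<String, String> params =
        buildParams(Arrays.asList("/tmp/tbl/p1/f1.parquet", "/tmp/tbl/p1/f2.parquet"));
    System.out.println(params);
  }
}
```

The point of the sketch is only that the option map carries the path list and nothing time-based; how the relation interprets those options is up to Hudi's data source.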

Summary and Changelog

This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.

Impact

NA

Risk Level

low

Documentation Update

NA

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — One minor typing inconsistency in the new test file: the JavaRDD declaration on line 277 is missing the generic type parameter that's used elsewhere.

yihua · 2026-04-07T07:28:24Z

+  }
+
+  private List<HoodieRecord> writeData(SparkRDDWriteClient client, String instant, List<HoodieRecord> recordList, boolean doCommitExplicitly) {
+    JavaRDD records = jsc.parallelize(recordList, 2);


🤖 nit: could you add the generic type parameter here? Should be JavaRDD<HoodieRecord> records to match line 175 and avoid raw type warnings.

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

LGTM — clean fix that removes a timestamp constraint that was redundant (explicit hoodie.datasource.read.paths/glob.paths already scope the read to the clustering plan's files) while also being the root cause of failures when an inflight ingestion exists at a later instant. The test coverage is thorough and correctly parameterises both COW/MOR and row-writer on/off. One small question inline about MoR log file discovery.

yihua · 2026-04-07T07:36:47Z

@@ -427,7 +426,6 @@ private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,

    HashMap<String, String> params = new HashMap<>();


🤖 Could you confirm that createRelation with explicit hoodie.datasource.read.paths + glob.paths truly suppresses Hudi's file-group-based log file auto-discovery for MoR? My concern is that if the relation still scans for all log files belonging to a file group (rather than only the ones in paths), a concurrent completed commit that wrote new log files to the same file group between clustering scheduling and execution could get silently included — producing a clustered output that covers a wider time range than the plan intended.
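The concern above can be pictured with a toy model. This is not Hudi's reader and says nothing about how `createRelation` actually behaves: the class, the method names, and the simplified file names are all hypothetical. It only contrasts a read limited to the planned paths with one that picks up every file of the same file group.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Toy model for illustration only; names and file layout are made up.
public class LogFileScope {
  // Reads only the files explicitly listed in the clustering plan.
  static List<String> scopedToPlan(List<String> filesOnStorage, Set<String> plannedPaths) {
    return filesOnStorage.stream()
        .filter(plannedPaths::contains)
        .collect(Collectors.toList());
  }

  // Models auto-discovery: every file belonging to the file group is read.
  static List<String> discoveredByFileGroup(List<String> filesOnStorage, String fileGroupId) {
    return filesOnStorage.stream()
        .filter(f -> f.startsWith(fileGroupId + "_"))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // fg1_001.log.2 stands for a log file written after clustering was scheduled.
    List<String> storage = Arrays.asList("fg1_001.parquet", "fg1_001.log.1", "fg1_001.log.2");
    Set<String> plan = new HashSet<>(Arrays.asList("fg1_001.parquet", "fg1_001.log.1"));
    System.out.println(scopedToPlan(storage, plan));
    System.out.println(discoveredByFileGroup(storage, "fg1"));
  }
}
```

In this model the second method returns the extra log file, which is the situation the question asks the author to rule out for the real MoR read path.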

hudi-bot · 2026-04-07T08:47:01Z

CI report:

2375e39 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

nsivabalan · 2026-04-10T04:01:34Z

Lokesh Jain added 2 commits April 7, 2026 12:18

MINOR: Fix clustering row writer to avoid using timestamp based reads

1e9b3ae

Fix compilation

2375e39

yihua reviewed Apr 7, 2026

github-actions bot added the size:M PR with lines of changes in (100, 300] label Apr 7, 2026

lokeshj1703 changed the title ~~Minor 18~~ MINOR: Fix clustering row writer to avoid using timestamp based reads Apr 8, 2026

nsivabalan approved these changes Apr 8, 2026

nsivabalan merged commit 44fa4c1 into apache:release-0.14.2-prep Apr 10, 2026
28 of 30 checks passed

nsivabalan merged 2 commits into apache:release-0.14.2-prep from lokeshj1703:minor-18
