MINOR: Fix clustering row writer to avoid using timestamp based reads#18476

Merged
nsivabalan merged 2 commits into apache:release-0.14.2-prep from lokeshj1703:minor-18
Apr 10, 2026

Conversation

@lokeshj1703
Collaborator

Describe the issue this Pull Request addresses

PR cherry-picks #18475

This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.

The read code path for the clustering row writer already filters on the explicit file paths set in params, so this PR removes TIMESTAMP_AS_OF from the query.
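To make the change above concrete, here is a minimal sketch of read params scoped only by explicit file paths. It is a stand-in, not the actual Hudi code: the class and method names are hypothetical, the real method may set additional keys, and the "as.of.instant" string is assumed here to be the key behind TIMESTAMP_AS_OF.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the params built in readRecordsForGroupAsRow.
final class ClusteringReadParamsSketch {
    // Key named in the review comments on this PR.
    static final String READ_PATHS_KEY = "hoodie.datasource.read.paths";
    // ASSUMPTION: key string behind TIMESTAMP_AS_OF; verify against the Hudi version in use.
    static final String ASSUMED_TIMESTAMP_AS_OF_KEY = "as.of.instant";

    /**
     * Builds read params for one clustering group.
     * Pass asOfInstant = null to model the behaviour after this PR
     * (explicit paths only, no timestamp-based read).
     */
    static Map<String, String> buildReadParams(List<String> filePaths, String asOfInstant) {
        Map<String, String> params = new HashMap<>();
        // The explicit file list is what scopes the read to the clustering plan.
        params.put(READ_PATHS_KEY, String.join(",", filePaths));
        if (asOfInstant != null) {
            // Behaviour before this PR: an extra timestamp constraint on top of the paths.
            params.put(ASSUMED_TIMESTAMP_AS_OF_KEY, asOfInstant);
        }
        return params;
    }
}
```

Calling buildReadParams(paths, null) yields a map with only the paths key, which is the shape of the query after the fix as described here.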

Summary and Changelog

This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.

Impact

NA

Risk Level

low

Documentation Update

NA

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — One minor typing inconsistency in the new test file: the JavaRDD declaration on line 277 is missing the generic type parameter that's used elsewhere.

}

  private List<HoodieRecord> writeData(SparkRDDWriteClient client, String instant, List<HoodieRecord> recordList, boolean doCommitExplicitly) {
    JavaRDD records = jsc.parallelize(recordList, 2);
Contributor

🤖 nit: could you add the generic type parameter here? Should be JavaRDD<HoodieRecord> records to match line 175 and avoid raw type warnings.
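For context on why the raw type matters, here is a minimal sketch using java.util.List as a stand-in for JavaRDD (an assumption made so the example runs without Spark; the class and method names are hypothetical). A raw declaration lets a wrongly typed element through at compile time, and the failure only shows up when the element is read back.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in example: List plays the role of JavaRDD here.
final class RawTypeSketch {
    @SuppressWarnings({"rawtypes", "unchecked"})
    static boolean rawDeclarationFailsAtReadTime() {
        List raw = new ArrayList<String>(); // raw type: element type is not checked
        raw.add(42);                        // compiles (with an unchecked warning)
        List<String> typed = raw;           // unchecked conversion back to a typed view
        try {
            String first = typed.get(0);    // implicit cast to String fails at run time
            return first == null;           // not reached for this input
        } catch (ClassCastException e) {
            return true;                    // the type error surfaces only here
        }
    }
}
```

Declaring the variable with its type parameter, as suggested above, lets the compiler reject the bad add call instead.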

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

LGTM — clean fix that removes a timestamp constraint that was redundant (explicit hoodie.datasource.read.paths/glob.paths already scope the read to the clustering plan's files) while also being the root cause of failures when an inflight ingestion exists at a later instant. The test coverage is thorough and correctly parameterises both COW/MOR and row-writer on/off. One small question inline about MoR log file discovery.

@@ -427,7 +426,6 @@ private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,

HashMap<String, String> params = new HashMap<>();
Contributor

🤖 Could you confirm that createRelation with explicit hoodie.datasource.read.paths + glob.paths truly suppresses Hudi's file-group-based log file auto-discovery for MoR? My concern is that if the relation still scans for all log files belonging to a file group (rather than only the ones in paths), a concurrent completed commit that wrote new log files to the same file group between clustering scheduling and execution could get silently included — producing a clustered output that covers a wider time range than the plan intended.
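The two behaviours being contrasted in this question can be modelled as below. This is a toy sketch with hypothetical names and simplified file names; it does not call Hudi and does not answer which behaviour createRelation actually has. It only shows how the two would differ if a log file were added to the same file group after the plan was scheduled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model of the concern; not Hudi code.
final class LogFileScopeSketch {
    /** Reads only the files that the clustering plan listed explicitly. */
    static List<String> readByExplicitPaths(List<String> filesOnStorage, Set<String> plannedPaths) {
        List<String> selected = new ArrayList<>();
        for (String file : filesOnStorage) {
            if (plannedPaths.contains(file)) {
                selected.add(file);
            }
        }
        return selected;
    }

    /** Hypothetical alternative: picks up every file that belongs to the file group. */
    static List<String> readByFileGroupDiscovery(List<String> filesOnStorage, String fileGroupId) {
        List<String> selected = new ArrayList<>();
        for (String file : filesOnStorage) {
            // ASSUMPTION: simplified naming where a file starts with its file group id.
            if (file.startsWith(fileGroupId + ".")) {
                selected.add(file);
            }
        }
        return selected;
    }
}
```

With storage holding fg1.base, fg1.log.1 and a later fg1.log.2, and a plan listing only the first two, the explicit-path read returns two files while file-group discovery returns three.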

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Apr 7, 2026
@hudi-bot
Collaborator

hudi-bot commented Apr 7, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@lokeshj1703 lokeshj1703 changed the title from "Minor 18" to "MINOR: Fix clustering row writer to avoid using timestamp based reads" Apr 8, 2026
@nsivabalan
Contributor

image

@nsivabalan nsivabalan merged commit 44fa4c1 into apache:release-0.14.2-prep Apr 10, 2026
28 of 30 checks passed
Labels

size:M PR with lines of changes in (100, 300]

4 participants