MINOR: Fix clustering row writer to avoid using timestamp based reads#18476
MINOR: Fix clustering row writer to avoid using timestamp based reads#18476nsivabalan merged 2 commits intoapache:release-0.14.2-prepfrom
Conversation
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Style & Readability Review — One minor typing inconsistency in the new test file: the JavaRDD declaration on line 277 is missing the generic type parameter that's used elsewhere.
| } | ||
|
|
||
| private List<HoodieRecord> writeData(SparkRDDWriteClient client, String instant, List<HoodieRecord> recordList, boolean doCommitExplicitly) { | ||
| JavaRDD records = jsc.parallelize(recordList, 2); |
There was a problem hiding this comment.
🤖 nit: could you add the generic type parameter here? Should be JavaRDD<HoodieRecord> records to match line 175 and avoid raw type warnings.
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
LGTM — clean fix that removes a timestamp constraint that was redundant (explicit hoodie.datasource.read.paths/glob.paths already scope the read to the clustering plan's files) while also being the root cause of failures when an inflight ingestion exists at a later instant. The test coverage is thorough and correctly parameterises both COW/MOR and row-writer on/off. One small question inline about MoR log file discovery.
| @@ -427,7 +426,6 @@ private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc, | |||
|
|
|||
| HashMap<String, String> params = new HashMap<>(); | |||
There was a problem hiding this comment.
🤖 Could you confirm that createRelation with explicit hoodie.datasource.read.paths + glob.paths truly suppresses Hudi's file-group-based log file auto-discovery for MoR? My concern is that if the relation still scans for all log files belonging to a file group (rather than only the ones in paths), a concurrent completed commit that wrote new log files to the same file group between clustering scheduling and execution could get silently included — producing a clustered output that covers a wider time range than the plan intended.
44fa4c1
into
apache:release-0.14.2-prep

Describe the issue this Pull Request addresses
PR cherry-picks #18475
This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.
The read code path for clustering row writer already filters based on the explicit file paths we are setting in params. So, removing the TIMESTAMP_AS_OF in the query.
Summary and Changelog
This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.
Impact
NA
Risk Level
low
Documentation Update
NA
Contributor's checklist