MINOR: Fix clustering row writer to avoid using timestamp based reads #18475
nsivabalan merged 1 commit into apache:branch-0.x
Conversation
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Style & Readability Review — Code looks clean overall with one minor readability suggestion.
    Assertions.assertTrue(recordsProcessedByClustering > 0,
        "Clustering should have processed some records, but got: " + recordsProcessedByClustering);

    // Verify that there were more total records than what was clustered
🤖 nit: the writeData() method signature declares List&lt;HoodieRecord&gt; as the return type, but the return value is never used at the call site (line 120), and the method just returns the input unchanged; consider changing the return type to void for clarity.
@lokeshj1703 it would be good to address this nit.
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
LGTM — clean fix that removes a redundant (and potentially harmful) TIMESTAMP_AS_OF parameter from the clustering row writer read path. The HoodieFileGroupReader operates on explicitly captured file slices from the clustering plan, and HoodieMergedLogRecordReader already handles inflight record filtering via allowInflightInstants=false, making the timestamp filter both unnecessary and a potential source of data loss (committed records between scheduling and execution would have been silently excluded). The new test is well-structured and correctly exercises both COW and MOR paths with the proper hoodie.datasource.write.row.writer.enable config key that controls the row writer branch in MultipleSparkJobExecutionStrategy.performClustering().
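The data-loss risk described above can be illustrated with a small, self-contained toy (this is not Hudi code; all names and timestamps here are hypothetical): filtering records by a scheduling-time timestamp drops anything committed between scheduling and execution, while selecting by the file slices captured in the clustering plan keeps every record in those files.

```java
import java.util.List;
import java.util.Set;

// Toy illustration (not Hudi code): why a TIMESTAMP_AS_OF-style filter on the
// clustering read path can silently drop records. All names are hypothetical.
public class ClusteringReadDemo {
    record Rec(String fileId, long commitTime) {}

    static final long SCHEDULE_TIME = 100;              // clustering plan scheduled here
    static final List<Rec> RECORDS = List.of(
        new Rec("f1", 90),    // committed before scheduling
        new Rec("f1", 110),   // committed between scheduling and execution
        new Rec("f2", 95));   // file not part of the clustering plan
    static final Set<String> PLAN_FILES = Set.of("f1"); // file slices captured in the plan

    // Timestamp-based read: excludes the record committed at t=110.
    static long countByTimestamp() {
        return RECORDS.stream()
            .filter(r -> PLAN_FILES.contains(r.fileId()) && r.commitTime() <= SCHEDULE_TIME)
            .count();
    }

    // Explicit file-slice read: every record in the planned file is clustered.
    static long countByPlanFiles() {
        return RECORDS.stream()
            .filter(r -> PLAN_FILES.contains(r.fileId()))
            .count();
    }

    public static void main(String[] args) {
        System.out.println(countByTimestamp() + " vs " + countByPlanFiles()); // 1 vs 2
    }
}
```

The explicit-file read counts both f1 records, whereas the timestamp filter sees only the one committed before scheduling, which is exactly the record the removed TIMESTAMP_AS_OF parameter would have lost.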
…#18476)

PR cherry-picks #18475.

This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads. The read code path for the clustering row writer already filters based on the explicit file paths set in params, so the TIMESTAMP_AS_OF option is removed from the query.

Co-authored-by: Lokesh Jain <ljain@Lokeshs-MacBook-Pro.local>
Describe the issue this Pull Request addresses
This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.
The read code path for the clustering row writer already filters based on the explicit file paths we set in params, so the TIMESTAMP_AS_OF option is removed from the query.
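For context, the row-writer clustering path that the new test exercises is gated by writer configs. A minimal sketch of the relevant keys (hoodie.datasource.write.row.writer.enable comes from this PR's discussion; the inline-clustering keys are standard Hudi configs, and the values are illustrative):

```properties
# Routes clustering through the Spark row-writer path exercised by the new test
hoodie.datasource.write.row.writer.enable=true
# Standard inline-clustering toggles (values illustrative)
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=2
```

With these set, clustering is scheduled and executed inline after the configured number of commits, taking the row-writer branch in MultipleSparkJobExecutionStrategy.performClustering().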
Summary and Changelog
This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.
Impact
NA
Risk Level
Low
Documentation Update
NA
Contributor's checklist