
MINOR: Fix clustering row writer to avoid using timestamp based reads#18475

Merged
nsivabalan merged 1 commit into apache:branch-0.x from lokeshj1703:minor-17 on Apr 9, 2026

Conversation

@lokeshj1703
Collaborator

Describe the issue this Pull Request addresses

This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.

The read code path for the clustering row writer already filters based on the explicit file paths set in the read params, so the TIMESTAMP_AS_OF option is removed from the query.
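A minimal, self-contained sketch of the idea behind the fix (class and method names here are illustrative, not Hudi's actual APIs): once the clustering plan captures explicit file paths, selecting files by that path set already pins the read to the planned file slices, so an additional TIMESTAMP_AS_OF-style cutoff adds nothing.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical model: the clustering plan records the exact base files to
// rewrite, and the reader selects files by membership in that path set.
public class ExplicitPathFilterSketch {
  record BaseFile(String path, long commitTime) {}

  // Selection used by the row-writer read path: explicit paths from the plan.
  static List<BaseFile> selectByPaths(List<BaseFile> files, Set<String> planPaths) {
    return files.stream()
        .filter(f -> planPaths.contains(f.path()))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<BaseFile> files = List.of(
        new BaseFile("p1/f1", 100L),   // captured in the clustering plan
        new BaseFile("p1/f2", 100L),   // captured in the clustering plan
        new BaseFile("p2/f3", 150L));  // committed later, not in the plan
    Set<String> planPaths = Set.of("p1/f1", "p1/f2");

    List<BaseFile> selected = selectByPaths(files, planPaths);
    // Only the planned files are read; no timestamp predicate is needed.
    System.out.println(selected.size());  // 2
    System.out.println(selected.stream().allMatch(f -> planPaths.contains(f.path())));  // true
  }
}
```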

Summary and Changelog

This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.

Impact

NA

Risk Level

Low

Documentation Update

NA

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Contributor

@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — Code looks clean overall with one minor readability suggestion.

Assertions.assertTrue(recordsProcessedByClustering > 0,
"Clustering should have processed some records, but got: " + recordsProcessedByClustering);

// Verify that there were more total records than what was clustered
Contributor


🤖 nit: the writeData() method signature declares List<HoodieRecord> as the return type, but the return value is never used at the call site (line 120) and the method just returns the input unchanged—consider changing the return type to void for clarity.

Contributor


@lokeshj1703 it would be good to address this nit.

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Apr 7, 2026
Contributor

@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

LGTM — clean fix that removes a redundant (and potentially harmful) TIMESTAMP_AS_OF parameter from the clustering row writer read path. The HoodieFileGroupReader operates on explicitly captured file slices from the clustering plan, and HoodieMergedLogRecordReader already handles inflight record filtering via allowInflightInstants=false, making the timestamp filter both unnecessary and a potential source of data loss (committed records between scheduling and execution would have been silently excluded). The new test is well-structured and correctly exercises both COW and MOR paths with the proper hoodie.datasource.write.row.writer.enable config key that controls the row writer branch in MultipleSparkJobExecutionStrategy.performClustering().
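The data-loss risk described above can be sketched in a few lines (a hypothetical model, not Hudi's actual reader code): if the clustering read applied a timestamp cutoff taken at scheduling time, any record committed to the planned file slice between scheduling and execution would be silently dropped from the rewritten output.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model of reading a planned file slice with and without a
// scheduling-time timestamp cutoff.
public class TimestampCutoffLossSketch {
  record LogRecord(String key, long commitTime) {}

  // Timestamp-based read: keeps only records committed at or before asOf.
  static List<LogRecord> readWithCutoff(List<LogRecord> slice, long asOf) {
    return slice.stream()
        .filter(r -> r.commitTime() <= asOf)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    long scheduledAt = 100L;
    List<LogRecord> plannedSlice = List.of(
        new LogRecord("k1", 90L),
        new LogRecord("k2", 120L)); // committed after scheduling, before execution

    // With the cutoff, k2 is silently excluded from the clustered output.
    System.out.println(readWithCutoff(plannedSlice, scheduledAt).size());  // 1
    // Without the cutoff (the fix), the full planned slice is read.
    System.out.println(plannedSlice.size());  // 2
  }
}
```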

@hudi-bot
Collaborator

hudi-bot commented Apr 7, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Contributor

@yihua yihua left a comment


LGTM


@nsivabalan nsivabalan merged commit 3932271 into apache:branch-0.x Apr 9, 2026
48 checks passed
nsivabalan pushed a commit that referenced this pull request Apr 10, 2026
…#18476)

PR cherry-picks #18475

This PR adds comprehensive test coverage for clustering operations when there's a pending ingestion in a different partition, and fixes an issue with row writer clustering that was incorrectly using timestamp-based reads.

The read code path for the clustering row writer already filters based on the explicit file paths set in the read params, so the TIMESTAMP_AS_OF option is removed from the query.


---------

Co-authored-by: Lokesh Jain <ljain@Lokeshs-MacBook-Pro.local>

Labels

size:M PR with lines of changes in (100, 300]


4 participants