Skip to content

KAFKA-20452: Avoid creating unnecessary empty batches in LogCleaner below the High Watermark#22089

Merged
chia7712 merged 1 commit intoapache:trunkfrom
JiayaoS:fix/KAFKA-20452
May 8, 2026
Merged

KAFKA-20452: Avoid creating unnecessary empty batches in LogCleaner below the High Watermark#22089
chia7712 merged 1 commit intoapache:trunkfrom
JiayaoS:fix/KAFKA-20452

Conversation

@JiayaoS
Copy link
Copy Markdown
Contributor

@JiayaoS JiayaoS commented Apr 18, 2026

This patch replaces upperBoundOffset with highWatermark in
Cleaner.cleanSegments() / cleanInto(), update related Javadoc and
comments, and adds a test to verify that empty batches are not retained
when the highWatermark is beyond the cleaned range.

Reviewers: Jun Rao junrao@gmail.com, PoAn Yang payang@apache.org,
Chia-Ping Tsai chia7712@gmail.com

@github-actions github-actions Bot added triage PRs from the community core Kafka Broker storage Pull requests that target the storage module small Small PRs labels Apr 18, 2026
@JiayaoS
Copy link
Copy Markdown
Contributor Author

JiayaoS commented Apr 18, 2026

PR #17193 (KAFKA-17076) introduced logic to retain the last batch in each cleaning round even if empty, to preserve logEndOffset after compaction. As noted by @junrao, the current check uses upperBoundOffset, which may produce unnecessary empty batch.

We only need to retain an empty batch when its nextOffset equals the highWatermark. This PR replaces upperBoundOffset with highWatermark in Cleaner.java, update related comments and Javadoc, and adds testCleanSegmentsNotRetainingLastEmptyBatch to verify empty batch are not retained when the highWatermark is beyond the cleaned range.

testCleanSegmentsRetainingLastEmptyBatch — empty batch retained when nextOffset == highWatermark.
testCleanSegmentsNotRetainingLastEmptyBatch — no empty batch retained when high watermark is beyond the cleaned segments.

@JiayaoS JiayaoS force-pushed the fix/KAFKA-20452 branch 4 times, most recently from 9521654 to 8288640 Compare April 18, 2026 03:59
CleanedTransactionMetadata transactionMetadata,
long legacyDeleteHorizonMs,
long upperBoundOffsetOfCleaningRound) throws IOException {
long highWatermark) throws IOException {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get the hw directly within this method instead of passing it in from the caller?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use log.highWatermark() directly, I think the main impact would be on unit tests. If the HW is modified during cleaning, different batches may see different HW values, but I don't think this would cause any issues since the HW typically only moves forward.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, since hw is monotonic, we can simplify the method by fetching it internally rather than capturing the value too early.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adjusted how the highWatermark parameter is passed, updated the corresponding unit tests, and all tests pass locally.

@github-actions github-actions Bot removed the small Small PRs label Apr 18, 2026
@JiayaoS JiayaoS marked this pull request as ready for review April 18, 2026 09:47
@github-actions github-actions Bot removed the triage PRs from the community label Apr 19, 2026
Comment thread core/src/test/scala/unit/kafka/log/LogCleanerTest.scala Outdated
@chia7712 chia7712 requested a review from junrao April 20, 2026 06:40
Comment thread core/src/test/scala/unit/kafka/log/LogCleanerTest.scala Outdated
@JiayaoS
Copy link
Copy Markdown
Contributor Author

JiayaoS commented May 6, 2026

Hi @junrao , Could you please take a look when you have a moment? Thanks!

Copy link
Copy Markdown
Contributor

@junrao junrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JiayaoS : Thanks for the PR. LGTM

@chia7712 chia7712 merged commit 7480345 into apache:trunk May 8, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-approved core Kafka Broker storage Pull requests that target the storage module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants