
KAFKA-7385: Fix log cleaner behavior when empty batches are retained #5623

Merged
merged 6 commits into apache:trunk on Sep 9, 2018

Conversation

@dhruvilshah3 (Contributor) commented Sep 7, 2018

With idempotent producers, we may leave empty batches in the log during log compaction. When filtering the data, we keep track of state such as the maxOffset and maxTimestamp of the filtered data. This patch ensures we maintain this state correctly in the case where only empty batches are retained in MemoryRecords#filterTo. Without this patch, maxOffset was not initialized in this edge case, so we appended data to the log with maxOffset = -1L, causing the append to fail and the log cleaner to crash.
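The failure mode can be sketched with a minimal, self-contained example. All names here are hypothetical, not the actual Kafka code: the point is that maxOffset starts at -1 and is only advanced by retained records, so retaining only empty batches leaves it at -1.

```java
import java.util.List;

// Minimal sketch of the bug (hypothetical names, not Kafka's API).
public class FilterSketch {
    // Buggy aggregation: only offsets of retained records advance maxOffset.
    // If every retained batch is empty, the result stays -1L.
    static long maxOffsetBuggy(List<List<Long>> retainedRecordsPerBatch) {
        long maxOffset = -1L;
        for (List<Long> records : retainedRecordsPerBatch)
            for (long offset : records)
                maxOffset = Math.max(maxOffset, offset);
        return maxOffset;
    }

    // Fixed aggregation: a retained-but-empty batch still advances maxOffset
    // to its last offset, so the result is valid whenever batches are retained.
    static long maxOffsetFixed(List<List<Long>> retainedRecordsPerBatch,
                               List<Long> retainedBatchLastOffsets) {
        long maxOffset = maxOffsetBuggy(retainedRecordsPerBatch);
        for (long lastOffset : retainedBatchLastOffsets)
            maxOffset = Math.max(maxOffset, lastOffset);
        return maxOffset;
    }
}
```

With one retained empty batch whose last offset is 41, the buggy path reports -1 while the fixed path reports 41, which is what makes the subsequent append succeed.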

@@ -371,31 +388,24 @@ public int hashCode() {
public static class FilterResult {
public final ByteBuffer output;
public final int messagesRead;
public final int bytesRead;
public final int totalBytesRead;
Contributor:

Why was only this one renamed?


for (MutableRecordBatch batch : batches) {
bytesRead += batch.sizeInBytes();
long maxOffset = -1L;
int messagesRead = 0;
Contributor:

If there are no messages retained, then don't we lose this value?

@ijuma (Contributor) commented Sep 8, 2018

@dhruvilshah3 can you please add a description to the PR explaining the fix?

private static FilterResult filterTo(TopicPartition partition, Iterable<MutableRecordBatch> batches,
RecordFilter filter, ByteBuffer destinationBuffer, int maxRecordBatchSize,
BufferSupplier decompressionBufferSupplier) {
private static class FilteredBatchesMetadata {
Contributor:

Can you please explain why we added this new class and only have a subset of fields in it?

@dhruvilshah3 (Author):

This class tracks state for all filtered batches. Previously, every branch in filterTo was cluttered with logic that tracked per-batch state and aggregated it over all the batches seen so far. With this class, each branch only tracks per-batch state, and FilteredBatchesMetadata holds the aggregate. This gives cleaner separation, avoids code repetition, and makes it easier to reason about which fields have or have not been initialized in a particular branch.

Two fields have been left out: messagesRead and bytesRead. Both are aggregated over all the data we have seen so far, regardless of whether we end up retaining it.
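The aggregation described above could look roughly like this. This is a sketch with an assumed field set based on the discussion, not the actual FilteredBatchesMetadata class:

```java
// Rough sketch of an aggregate like FilteredBatchesMetadata (field names are
// assumptions drawn from the conversation, not the real Kafka class).
public class FilteredBatchesSketch {
    long maxOffset = -1L;
    long maxTimestamp = -1L;
    long shallowOffsetOfMaxTimestamp = -1L;
    int messagesRetained = 0;
    int bytesRetained = 0;

    // Each branch of filterTo reports only per-batch values; the aggregate
    // lives in one place. Note that a retained batch with zero retained
    // records still advances maxOffset, which is the edge case this patch fixes.
    void addRetainedBatch(long batchLastOffset, long batchMaxTimestamp,
                          int recordsRetained, int sizeInBytes) {
        if (batchLastOffset > maxOffset)
            maxOffset = batchLastOffset;
        if (batchMaxTimestamp > maxTimestamp) {
            maxTimestamp = batchMaxTimestamp;
            shallowOffsetOfMaxTimestamp = batchLastOffset;
        }
        messagesRetained += recordsRetained;
        bytesRetained += sizeInBytes;
    }
}
```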

Contributor:

Thanks for the explanation. Maybe worth documenting the intent (briefly).

Contributor:

I think it might be nicer to make the fields in FilterResult mutable and move the functionality there. It's a little annoying to have an additional internal class just for accumulating a subset of the filter stats.

@hachikuji (Contributor) left a comment:

Thanks for the patch, left a few small comments.

@@ -245,18 +263,17 @@ private static FilterResult filterTo(TopicPartition partition, Iterable<MutableR
batch.producerEpoch(), batch.baseSequence(), batch.baseOffset(), batch.lastOffset(),
batch.partitionLeaderEpoch(), batch.timestampType(), batch.maxTimestamp(),
batch.isTransactional(), batch.isControlBatch());
filteredBatchesMetadata.addRetainedBatchMetadata(batch, retainedRecords.size(), true);
Contributor:

Isn't retainedRecords.size() just 0 down this path?

ByteBufferOutputStream bufferOutputStream = new ByteBufferOutputStream(destinationBuffer);
int totalBytesRead = 0;
Contributor:

Since FilterResult is mutable, could we just increment the bytesRead and messagesRead fields directly?
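The suggestion amounts to something like the following sketch, with field names assumed from the diff rather than taken from the final code:

```java
// Sketch of a mutable FilterResult-style accumulator, per the reviewer's
// suggestion: bump the read-side counters in place instead of carrying
// local variables and copying them into the result at the end.
public class MutableFilterResultSketch {
    int messagesRead = 0;
    int bytesRead = 0;

    // Called once per input batch, whether or not the batch is retained,
    // since these counters cover everything read during filtering.
    void readBatch(int recordCount, int sizeInBytes) {
        messagesRead += recordCount;
        bytesRead += sizeInBytes;
    }
}
```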

@hachikuji (Contributor) left a comment:

LGTM. Thanks for the patch.

@hachikuji hachikuji merged commit 958cdca into apache:trunk Sep 9, 2018
@dhruvilshah3 dhruvilshah3 deleted the cleaner-fix branch September 9, 2018 01:04
hachikuji pushed a commit that referenced this pull request Sep 9, 2018
…ined (#5623)

With idempotent/transactional producers, we may leave empty batches in the log during log compaction. When filtering the data, we keep track of state like `maxOffset` and `maxTimestamp` of filtered data. This patch ensures we maintain this state correctly for the case when only empty batches are left in `MemoryRecords#filterTo`. Without this patch, we did not initialize `maxOffset` in this edge case which led us to append data to the log with `maxOffset` = -1L, causing the append to fail and log cleaner to crash.

Reviewers: Jason Gustafson <jason@confluent.io>
Pengxiaolong pushed a commit to Pengxiaolong/kafka that referenced this pull request Jun 14, 2019