KAFKA-16007 Merge batch records during ZK migration #15007
Conversation
Some supporting evidence. I modified ZkMigrationIntegrationTest to create 1000 partitions in three ways (10 topics, 100 topics, 1000 topics). The number of batches generated and the overall migration time were pretty linear (times in milliseconds). As we can see, the batch size doesn't really affect the time waited on each batch. The biggest gain here comes from making fewer batches overall. In KRaftMigrationDriver, we wait for each batch to be committed by the controller, which really limits throughput. The ~30ms we're seeing is probably the round-trip time in the Raft layer.
```java
        batch.forEach(apiMessageAndVersion ->
            log.trace(recordRedactor.toLoggableString(apiMessageAndVersion.message())));
    }
    CompletableFuture<?> future = zkRecordConsumer.acceptBatch(batch);
```
Does the consumer here have any expectation on atomicity of the records? I am trying to figure out how the batching applies at the raft layer. Would you expect the batches to be preserved in the log?
Since this consumer only combines batches, any semantics relying on batch boundaries should be okay. Anyway, batches are irrelevant during the migration since we're using transactions at the controller layer.
To answer your question more directly:

> Does the consumer here have any expectation on atomicity of the records?
No. The eventual consumer of these batches is QuorumController#MigrationRecordConsumer which simply sends them along to Raft as a non-atomic batch. It doesn't care about batch boundaries or alignment.
Related question, what happens if KRaft loses leadership in the middle of this consumer loop?
```java
    return String.format("%d records were generated in %d ms across %d batches. The record types were %s",
        totalRecords, durationMs(), totalBatches, recordTypeCounts);
    return String.format(
        "%d records were generated in %d ms across %d batches. The average batch size was %.2f " +
```
The "average batch size" might be a little ambiguous. Maybe we could say "record/batch" or something like that? Wondering if size in bytes is interesting also, but perhaps we can get that from the raft metrics.
Actually, after this patch we probably expect this value to be around 1000, so maybe it's not that useful to print out here.
Size is interesting, but yea we can infer that from the Raft metrics.
Thanks for the improvement.
metadata/src/main/java/org/apache/kafka/metadata/migration/BufferingBatchConsumer.java
```java
    }
    CompletableFuture<?> future = zkRecordConsumer.acceptBatch(batch);
    long batchStart = time.nanoseconds();
    FutureUtils.waitWithLogging(KRaftMigrationDriver.this.log, "",
```
In general, Kafka should avoid blocking on a CompletableFuture. This can be avoided by using `CompletableFuture::thenCompose`, or better yet `java.util.concurrent.Flow`, since the CompletableFuture doesn't return an interesting value.

I looked at `ZkMigrationClient`. If you wanted to use `Flow`, you would replace the use of `Consumer` with `Flow.Subscriber`, and `ZkMigrationClient` would become a `Flow.Publisher`.

Flow has support for pipelining and back-pressure. For example, you would make the initial `Subscription.request` 1000 and request more data as the `zkRecordConsumer` processes more batches.
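A minimal sketch of the Flow idea described above, using integer batches in place of the real record batches. The class names here (`FlowBackpressureSketch`, `BatchSubscriber`) are hypothetical, not part of the Kafka codebase; the point is the `Subscription.request` calls that express demand instead of blocking.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

// Hypothetical sketch: a Flow.Subscriber consuming batches with back-pressure,
// analogous to replacing a Consumer<List<ApiMessageAndVersion>> as suggested.
public class FlowBackpressureSketch {
    static class BatchSubscriber implements Flow.Subscriber<List<Integer>> {
        private Flow.Subscription subscription;
        final CountDownLatch done = new CountDownLatch(1);
        int received = 0;

        @Override public void onSubscribe(Flow.Subscription s) {
            this.subscription = s;
            s.request(1000);              // initial demand, as suggested above
        }
        @Override public void onNext(List<Integer> batch) {
            received += batch.size();     // "process" the batch
            subscription.request(1);      // ask for more as batches complete
        }
        @Override public void onError(Throwable t) { done.countDown(); }
        @Override public void onComplete() { done.countDown(); }
    }

    public static void main(String[] args) throws InterruptedException {
        BatchSubscriber sub = new BatchSubscriber();
        // SubmissionPublisher is the JDK's stock Flow.Publisher implementation.
        try (SubmissionPublisher<List<Integer>> pub = new SubmissionPublisher<>()) {
            pub.subscribe(sub);
            for (int i = 0; i < 10; i++) pub.submit(List.of(i, i + 1));
        } // close() signals onComplete after all submitted items are delivered
        sub.done.await();
        System.out.println("received=" + sub.received);
    }
}
```

Because demand is expressed through `request`, the publisher can keep reading from ZooKeeper while earlier batches are still in flight, instead of the current one-batch-at-a-time blocking wait.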
Thanks, the Flow API looks really cool. I'll check that out. It does look like it's Java 9+ only, but I'll keep it in mind for future stuff (I think we'll be bumping up to Java 11 for 4.0).
That's fair. I keep forgetting that we still need to support Java 8. Looking forward to 4.x.
This is an existing issue, but `Time.waitForFuture` doesn't look correct. It is comparing nano times. In the JVM you can't compare nano times directly because they can overflow; it is recommended to instead compare elapsed time: https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/System.html#nanoTime(). This will cause this code to block forever when it overflows.
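To illustrate the pitfall: `System.nanoTime()` values have an arbitrary origin and may wrap, so comparing an absolute deadline (`now >= deadline`) can fail, while comparing elapsed time (`now - start >= timeout`) stays correct because two's-complement subtraction tolerates the wrap. The class below is a standalone illustration, not the actual `Time.waitForFuture` code.

```java
// Compare elapsed time, not absolute nano timestamps (System.nanoTime javadoc).
public class NanoTimeComparison {
    // Wrong: breaks when `start + timeout` overflows past Long.MAX_VALUE.
    static boolean expiredWrong(long deadlineNs, long nowNs) {
        return nowNs >= deadlineNs;
    }

    // Right: overflow-tolerant elapsed-time comparison.
    static boolean expiredRight(long startNs, long timeoutNs, long nowNs) {
        return nowNs - startNs >= timeoutNs;
    }

    public static void main(String[] args) {
        long start = Long.MAX_VALUE - 5;   // simulate a clock near wrap-around
        long timeout = 3;
        long deadline = start + timeout;   // Long.MAX_VALUE - 2, no overflow yet
        long now = start + 16;             // wraps to a large negative value

        // 16 ns have elapsed, well past the 3 ns timeout:
        System.out.println("wrong=" + expiredWrong(deadline, now));       // false: waits forever
        System.out.println("right=" + expiredRight(start, timeout, now)); // true
    }
}
```

In the wrong version, `now` has wrapped negative while `deadline` is still near `Long.MAX_VALUE`, so the comparison never becomes true and the wait never times out, which is exactly the "block forever" failure mode described above.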
```java
        migrationBatchConsumer,
        brokersInMetadata::add
    );
    migrationBatchConsumer.close();
```
If `zkMigrationClient.readAllMetadata` throws, `migrationBatchConsumer.close` is not called. Is this okay because `zkRecordConsumer.abortMigration` is called in the `catch`?
> Related question, what happens if KRaft loses leadership in the middle of this consumer loop?
Thanks for taking a look @jsancio! I'll answer the related questions here. If an error occurs inside the consumer loop, or if KRaft loses leadership mid-loop, the in-flight transaction is left incomplete and gets aborted. The logic for detecting and aborting partial transactions is in ActivationRecordsGenerator.
Thanks. Should we fix the blocking issue in this PR?
```java
        delegateConsumer.accept(new ArrayList<>(bufferedBatch));
        bufferedBatch.clear();
```
How about this implementation?

```java
delegateConsumer.accept(bufferedBatch);
bufferedBatch = new ArrayList<>(minBatchSize);
```

Similar in `flush`. There seems to be some code duplication between these two methods.
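Putting the suggestions together, a minimal sketch of the buffering consumer under discussion might look like this. This is an illustrative reconstruction, not the actual `BufferingBatchConsumer` source: the buffer-swap replaces the copy-and-clear, and `accept` delegates to a single `flush` to avoid the duplication noted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch: accumulate incoming batches until a minimum size is
// reached, then hand the buffer to the delegate and swap in a fresh list.
// Batches are merged, never split, matching the patch's stated behavior.
public class BufferingBatchConsumerSketch<T> implements Consumer<List<T>> {
    private final Consumer<List<T>> delegate;
    private final int minBatchSize;
    private List<T> buffer;

    public BufferingBatchConsumerSketch(Consumer<List<T>> delegate, int minBatchSize) {
        this.delegate = delegate;
        this.minBatchSize = minBatchSize;
        this.buffer = new ArrayList<>(minBatchSize);
    }

    @Override
    public void accept(List<T> batch) {
        buffer.addAll(batch);              // merge the incoming batch whole
        if (buffer.size() >= minBatchSize) {
            flush();                       // one shared emit path, no duplication
        }
    }

    // Also called once at the end of the migration to emit any remainder.
    public void flush() {
        if (!buffer.isEmpty()) {
            delegate.accept(buffer);       // hand off ownership of the buffer...
            buffer = new ArrayList<>(minBatchSize); // ...and swap in a new one
        }
    }
}
```

For example, with `minBatchSize = 1000`, a thousand single-record batches from ZK become one KRaft batch, while a 5000-record batch passes through untouched.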
```java
        batches++;
        batchDurationsNs += durationNs;
```
Okay. This is measuring how much time the ZK migration spent in the controller, writing and committing the batches to KRaft, right?
Yup, that's correct.
I filed https://issues.apache.org/jira/browse/KAFKA-16020 for the nanos issue. Good catch!
LGTM, if the tests are green.
Test failures look unrelated.
To avoid creating lots of small KRaft batches during the ZK migration, this patch adds a mechanism to merge batches into sizes of at least 1000 records. This reduces the number of batches sent to Raft, which reduces the amount of time spent blocking. Since migrations use metadata transactions, the batch boundaries for migrated records are not significant. Even so, this implementation does not break up existing batches; it only combines them into a larger batch to meet the minimum size.

Reviewers: José Armando García Sancio <jsancio@apache.org>

Co-authored-by: David Arthur <mumrah@gmail.com>