
KAFKA-16007 Merge batch records during ZK migration #15007

Merged
merged 10 commits into apache:trunk on Dec 16, 2023

Conversation


@mumrah mumrah commented Dec 13, 2023

To avoid creating lots of small KRaft batches during the ZK migration, this patch adds a mechanism to merge batches until they contain at least 1000 records. This reduces the number of batches sent to Raft, which in turn reduces the amount of time spent blocking.

Since migrations use metadata transactions, the batch boundaries of migrated records are not significant. Even so, this implementation does not break up existing batches; it only combines them into larger batches to meet the minimum size.
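
For illustration, a minimal sketch of this batch-merging consumer, assuming a class along these lines; the identifiers follow the snippets quoted in the review below (delegateConsumer, bufferedBatch, minBatchSize), but this is a simplification, not the exact class added by the PR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Simplified sketch of the batch-merging idea; not the exact class added by this PR.
public class BufferingBatchConsumer<T> implements Consumer<List<T>> {
    private final Consumer<List<T>> delegateConsumer;
    private final List<T> bufferedBatch;
    private final int minBatchSize;

    public BufferingBatchConsumer(Consumer<List<T>> delegateConsumer, int minBatchSize) {
        this.delegateConsumer = delegateConsumer;
        this.bufferedBatch = new ArrayList<>(minBatchSize);
        this.minBatchSize = minBatchSize;
    }

    @Override
    public void accept(List<T> batch) {
        // Incoming batches are never split, only appended to the buffer.
        bufferedBatch.addAll(batch);
        if (bufferedBatch.size() >= minBatchSize) {
            delegateConsumer.accept(new ArrayList<>(bufferedBatch));
            bufferedBatch.clear();
        }
    }

    // Called at the end of the migration to emit any remaining buffered records.
    public void flush() {
        if (!bufferedBatch.isEmpty()) {
            delegateConsumer.accept(new ArrayList<>(bufferedBatch));
            bufferedBatch.clear();
        }
    }
}
```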


mumrah commented Dec 14, 2023

Some supporting evidence.

I modified ZkMigrationIntegrationTest to create 1000 partitions in three ways (10 topics, 100 topics, 1000 topics). The number of batches generated and the overall migration time scaled roughly linearly with the number of topics:

| Case | Records | Batches | Time Per Batch (ms) | Total Time (ms) |
| --- | --- | --- | --- | --- |
| 10 topics x 100 partitions (trunk) | 1020 | 12 | 32.33 | 536 |
| 100 topics x 10 partitions (trunk) | 1110 | 102 | 30.30 | 3317 |
| 1000 topics x 1 partition (trunk) | 2010 | 1002 | 30.69 | 32243 |
| 1000 topics x 1 partition (this PR) | 2010 | 3 | 30.69 | 415 |

As we can see, the batch size doesn't really affect the time waited on each batch. The biggest gain comes from making fewer batches overall: in KRaftMigrationDriver we wait for each batch to be committed by the controller, which limits throughput, so the total time is roughly the number of batches times the per-batch wait (for example, 1002 batches × ~30 ms ≈ 30 s, matching the trunk case with one partition per topic). The ~30 ms we're seeing per batch is probably the round-trip time in the Raft layer.

@mumrah mumrah changed the title KAFKA-16007 Re-batch records during the migration KAFKA-16007 Merge batch records during the migration Dec 14, 2023
@mumrah mumrah changed the title KAFKA-16007 Merge batch records during the migration KAFKA-16007 Merge batch records during ZK migration Dec 14, 2023
batch.forEach(apiMessageAndVersion ->
log.trace(recordRedactor.toLoggableString(apiMessageAndVersion.message())));
}
CompletableFuture<?> future = zkRecordConsumer.acceptBatch(batch);
Contributor

Does the consumer here have any expectation on atomicity of the records? I am trying to figure out how the batching applies at the Raft layer. Would you expect the batches to be preserved in the log?

Contributor Author

Since this consumer only combines batches, any semantics relying on batch boundaries should be okay. In any case, batch boundaries are irrelevant during the migration since we're using transactions at the controller layer.

Contributor Author

To answer your question more directly

Does the consumer here have any expectation on atomicity of the records?

No. The eventual consumer of these batches is QuorumController#MigrationRecordConsumer, which simply sends them along to Raft as non-atomic batches. It doesn't care about batch boundaries or alignment.

Member

Related question, what happens if KRaft loses leadership in the middle of this consumer loop?

return String.format("%d records were generated in %d ms across %d batches. The record types were %s",
totalRecords, durationMs(), totalBatches, recordTypeCounts);
return String.format(
"%d records were generated in %d ms across %d batches. The average batch size was %.2f " +
Contributor

The "average batch size" might be a little ambiguous. Maybe we could say "record/batch" or something like that? Wondering if size in bytes is interesting also, but perhaps we can get that from the raft metrics.

Contributor Author

Actually, after this patch we probably expect this value to be around 1000, so maybe it's not that useful to print out here.

Size is interesting, but yea we can infer that from the Raft metrics.

Member

@jsancio jsancio left a comment

Thanks for the improvement.

}
CompletableFuture<?> future = zkRecordConsumer.acceptBatch(batch);
long batchStart = time.nanoseconds();
FutureUtils.waitWithLogging(KRaftMigrationDriver.this.log, "",
Member

In general, Kafka should avoid blocking on a CompletableFuture. This can be avoided by using CompletableFuture::thenCompose, or better yet java.util.concurrent.Flow, since the CompletableFuture doesn't return an interesting value.

I looked at ZkMigrationClient. If you wanted to use Flow, you would replace the use of Consumer with Flow.Subscriber, and ZkMigrationClient would become a Flow.Publisher.

Flow has support for pipelining and back-pressure. For example, you would make the initial Subscription.request 1000 and request more data as the zkRecordConsumer processes more batches.
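
As a rough illustration of that suggestion, here is a sketch of a Flow.Subscriber that requests batches with back-pressure (Java 9+). The class name RecordBatchSubscriber and the acceptBatchAsync placeholder are hypothetical, not code from this PR:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Flow;

// Hypothetical sketch of the suggested Flow-based approach; not code from this PR.
// The publisher (e.g. the migration client) would emit record batches, and the
// subscriber applies back-pressure by only requesting more once a batch is accepted.
class RecordBatchSubscriber<T> implements Flow.Subscriber<List<T>> {
    private final int initialRequest;
    private Flow.Subscription subscription;

    RecordBatchSubscriber(int initialRequest) {
        this.initialRequest = initialRequest;
    }

    @Override
    public void onSubscribe(Flow.Subscription subscription) {
        this.subscription = subscription;
        subscription.request(initialRequest); // e.g. request 1000 batches up front
    }

    @Override
    public void onNext(List<T> batch) {
        // Hand the batch off asynchronously and request more once it has been
        // accepted, instead of blocking on a CompletableFuture.
        acceptBatchAsync(batch).thenRun(() -> subscription.request(1));
    }

    @Override
    public void onError(Throwable throwable) { /* abort the migration transaction */ }

    @Override
    public void onComplete() { /* complete the migration transaction */ }

    private CompletableFuture<Void> acceptBatchAsync(List<T> batch) {
        // Placeholder standing in for something like zkRecordConsumer.acceptBatch(batch).
        return CompletableFuture.completedFuture(null);
    }
}
```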

Contributor Author

Thanks, the Flow API looks really cool. I'll check it out. It does look like it's Java 9+ only, but I'll keep it in mind for future work (I think we'll be bumping up to Java 11 for 4.0).

Member

That's fair. I keep forgetting that we still need to support Java 8. Looking forward to 4.x.

Member

This is an existing issue, but Time.waitForFuture doesn't look correct. It compares absolute nano times, which you can't do in the JVM because the values can overflow; the recommendation is to compare elapsed time instead: https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/System.html#nanoTime()

This will cause the code to block forever when the value overflows.
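
A small sketch of the distinction (illustrative only; not Kafka's actual Time.waitForFuture code):

```java
public class NanoTimeCheck {
    public static void main(String[] args) {
        long timeoutNs = 1_000_000L; // 1 ms, arbitrary for the example
        long deadlineNs = System.nanoTime() + timeoutNs;

        // Fragile: comparing absolute nanoTime values breaks when the counter overflows.
        boolean expiredAbsolute = System.nanoTime() > deadlineNs;

        // Recommended by the System.nanoTime() javadoc: compare elapsed time via
        // subtraction, which behaves correctly across overflow.
        boolean expiredElapsed = System.nanoTime() - deadlineNs > 0;

        System.out.println(expiredAbsolute + " " + expiredElapsed);
    }
}
```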

migrationBatchConsumer,
brokersInMetadata::add
);
migrationBatchConsumer.close();
Member

If zkMigrationClient.readAllMetadata throws, migrationBatchConsumer.close is not called. Is this okay because zkRecordConsumer.abortMigration is called in the catch?



mumrah commented Dec 15, 2023

Thanks for taking a look @jsancio! I'll answer some related questions here. If an error occurs inside the readAllMetadata call in KRaftMigrationDriver, the catch block explicitly aborts the transaction. If the controller crashes during this call, or at any time before committing the EndTransactionRecord, then the next active controller will abort the partial transaction. The same applies when a controller doesn't crash but loses leadership during the migration (for example, from a timeout).

The logic for detecting and aborting partial transactions is in ActivationRecordsGenerator.

Member

@jsancio jsancio left a comment

Thanks. Should we fix the blocking issue in this PR?

Comment on lines 48 to 49
delegateConsumer.accept(new ArrayList<>(bufferedBatch));
bufferedBatch.clear();
Member

How about this implementation?

            delegateConsumer.accept(bufferedBatch);
            bufferedBatch = new ArrayList<>(minBatchSize);

Similarly in flush. There seems to be some code duplication between these two methods.
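
Continuing the earlier sketch, one way those two methods could share that logic, assuming bufferedBatch is a non-final field; the helper name is hypothetical and this is not necessarily the PR's final code:

```java
// Hypothetical refactor sketch; assumes bufferedBatch is a non-final field of the sketch above.
private void emitBufferedBatch() {
    if (!bufferedBatch.isEmpty()) {
        delegateConsumer.accept(bufferedBatch);
        // Hand off the current buffer and start a fresh one instead of copying and clearing.
        bufferedBatch = new ArrayList<>(minBatchSize);
    }
}

@Override
public void accept(List<T> batch) {
    bufferedBatch.addAll(batch);
    if (bufferedBatch.size() >= minBatchSize) {
        emitBufferedBatch();
    }
}

public void flush() {
    emitBufferedBatch();
}
```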


batches++;
batchDurationsNs += durationNs;
Member

@jsancio jsancio Dec 15, 2023

Okay. This is measuring how much time the ZK migration spent in the controller, writing and committing the batches to KRaft, right?

Contributor Author

Yup, that's correct.


mumrah commented Dec 15, 2023

I filed https://issues.apache.org/jira/browse/KAFKA-16020 for the nanos issue. Good catch!

@mumrah mumrah requested a review from jsancio December 15, 2023 22:10
Member

@jsancio jsancio left a comment

LGTM, if the tests are green.


mumrah commented Dec 16, 2023

Test failures look unrelated.

@mumrah mumrah merged commit 7f763d3 into apache:trunk Dec 16, 2023
1 check was pending
mumrah added a commit that referenced this pull request Dec 16, 2023
To avoid creating lots of small KRaft batches during the ZK migration, this patch adds a mechanism to merge batches into sizes of at least 1000. This has the effect of reducing the number of batches sent to Raft which reduces the amount of time spent blocking.

Since migrations use metadata transactions, the batch boundaries for migrated records are not significant. Even in light of that, this implementation does not break up existing batches. It will only combine them into a larger batch to meet the minimum size.

Reviewers: José Armando García Sancio <jsancio@apache.org>
mumrah added a commit that referenced this pull request Dec 19, 2023
gaurav-narula pushed a commit to gaurav-narula/kafka that referenced this pull request Jan 24, 2024
anurag-harness pushed a commit to anurag-harness/kafka that referenced this pull request Feb 9, 2024
anurag-harness added a commit to anurag-harness/kafka that referenced this pull request Feb 9, 2024
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024