
KAFKA-13630: reduce amount of time that producer network thread holds batch queue lock #11722

Merged: 4 commits merged into apache:trunk from jasonk000:kafka-13630-recordaccumulator-locks on Mar 9, 2022

Conversation

jasonk000
Contributor

Hold the deque lock for only as long as is required to collect and make a decision in the ready() and drain() loops. Once this is done, the remaining work can be done without the lock, so release it. This allows producers to continue appending.
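For context, a minimal sketch of this lock-narrowing pattern (simplified types and hypothetical names such as DrainSketch and Batch, not the actual RecordAccumulator code):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified sketch of the pattern: decide and dequeue under the lock,
// then finish the expensive work (e.g. close()) after releasing it so that
// producer threads blocked on append() can make progress.
final class DrainSketch {
    static final class Batch {
        void close() { /* expensive: finalizes the record batch */ }
    }

    private final Deque<Batch> deque = new ArrayDeque<>();

    List<Batch> drain(int maxBatches) {
        List<Batch> drained = new ArrayList<>();
        while (drained.size() < maxBatches) {
            Batch batch;
            synchronized (deque) {
                // hold the deque lock only long enough to decide and remove
                batch = deque.pollFirst();
            }
            if (batch == null)
                break;
            batch.close();      // expensive work runs outside the lock
            drained.add(batch);
        }
        return drained;
    }
}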

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@ijuma
Contributor

ijuma commented Jan 31, 2022

Thanks for the PR. Do you have some results to share regarding the impact this optimization has?

@jasonk000
Contributor Author

This application has a single producer thread with a high send() rate. The change reduces spinlock CPU cycles from 14.6% to 2.5% of the send() path, which is a 12.1% improvement in efficiency for the send() path, achieved by reducing the duration of contention events with the network thread.

Before, with *complete_monitor* highlighted: [flame graph image]

After, with *complete_monitor* highlighted: [flame graph image]

Comment on lines 615 to 616
batch.setProducerState(producerIdAndEpoch, transactionManager.sequenceNumber(batch.topicPartition), isTransactional);
transactionManager.incrementSequenceNumber(batch.topicPartition, batch.recordCount);
Contributor

I think the expectation is that these two calls would be atomic (i.e. it would be bad if one thread executed line 615, then another thread executed line 615 again and got the same sequence number, before the first thread got a chance to execute line 616).
Also, I think the expectation is that batches ordered one after another in the queue would get their sequence numbers in the same order (i.e. a batch that is later in the queue would get a higher sequence number).
Previously these expectations were protected by the queue lock, so "poll", "get sequence", "update sequence" would execute as an atomic block; with this change the operations could interleave.
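To make the interleaving concrete, here is a minimal hypothetical sketch (simplified names such as SequenceSketch and nextSequence, not the actual TransactionManager code) of the read-and-advance step that needs to stay under the deque lock:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the invariant described above: reading the next
// sequence and advancing it must be one atomic step, and must happen in deque
// order, so both steps stay under the same lock that guards removal from the deque.
final class SequenceSketch {
    private final Map<String, Integer> nextSequence = new HashMap<>();

    /** Returns the base sequence stamped on a batch of recordCount records. */
    int assignBaseSequence(Object dequeLock, String partition, int recordCount) {
        synchronized (dequeLock) {
            int base = nextSequence.getOrDefault(partition, 0); // "get sequence"
            nextSequence.put(partition, base + recordCount);    // "update sequence"
            return base; // a second thread interleaving between the two steps would reuse this value
        }
    }
}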

Contributor Author

OK, this makes sense. I agree it makes sense to push this back into the synchronized block; the more important part for reducing lock duration is to move the close() outside of the block. Would you agree with this approach?

Contributor Author

I've addressed this in 34008bf.

Below is a flame graph showing that getDeque() and close() consume the CPU in the drain() tree. I believe that with both of these performed outside of the lock, we are still in a good spot here.

[flame graph image]

Contributor

Moving close() outside of the locked scope LGTM.

Contributor

@ijuma ijuma Feb 4, 2022

Can you please test with Java 11 or newer? It looks like you tested with Java 8, which uses the slower crc32c method.

Contributor Author

Correct, this application is on JDK 8. I'll have to find a JDK 11 app to compare; it could take a while, as this app is not ready for it. Is a JDK 11 test a blocker?

Contributor

No, not a blocker.
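As background on the crc32c remark (an editorial aside, not part of the PR): JDK 9 and newer ship java.util.zip.CRC32C, whose update loop is typically intrinsified to hardware CRC32C instructions, while Java 8 lacks that class, which is why checksum cost profiles differently on the two JDKs. A minimal usage example (CrcExample is an illustrative name):

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

// Minimal example of the JDK 9+ CRC32C class referenced in the discussion;
// on recent CPUs the JIT replaces its update loop with hardware CRC32C instructions.
public class CrcExample {
    public static void main(String[] args) {
        byte[] payload = "record-batch-bytes".getBytes(StandardCharsets.UTF_8);
        CRC32C crc = new CRC32C();
        crc.update(payload, 0, payload.length);
        System.out.printf("crc32c=0x%08x%n", crc.getValue());
    }
}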

Contributor

@ijuma ijuma left a comment

Thanks for the PR, a couple of comments.

Contributor

@showuon showuon left a comment

@jasonk000, thanks for the PR. Left a comment.

Comment on lines +629 to +632
// the rest of the work by processing outside the lock
// close() is particularly expensive

batch.close();
Contributor

If we move the batch.close() out of the synchronized scope, is it possible that another thread might treat the batch as "alive" and perform other operations on it before we close it completely?

Contributor

It looks like that won't happen, since we only lock on the deque object, but I just want to confirm, to make sure it won't break anything.

Contributor Author

Since we lock on the deque, the batch is either "in" the deque, in which case append and similar operations work on it inside the synchronized block, or it has been removed from the deque and is no longer reachable by other threads. From my reading, Sender only collects batches via the drain and expiry paths, both of which acquire the lock before removing the element from the deque. This implies that removal of a batch is atomic: whether it is drained or expired out, it can only be one of the two. The Sender re-enqueue path also works only on batches that were previously drained. Nothing stands out to me as a possible area for leakage, but I'm happy to be corrected; more eyes is always better for this sort of thing.
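A simplified sketch of that ownership argument (illustrative types such as OwnershipSketch, not the actual Sender or RecordAccumulator code): removal under synchronized(deque) transfers exclusive ownership of the batch to the draining thread, so close() can safely run without the lock.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch: a batch is reachable by other threads only while it is
// in the deque, and every removal path (drain, expiry) holds the deque lock,
// so after removal the draining thread is the sole owner and may close() freely.
final class OwnershipSketch {
    static final class Batch {
        void close() { /* expensive, but safe: single owner at this point */ }
    }

    private final Deque<Batch> deque = new ArrayDeque<>();

    void append(Batch batch) {
        synchronized (deque) {      // producers only ever touch batches via the deque
            deque.addLast(batch);
        }
    }

    void drainOne() {
        Batch owned;
        synchronized (deque) {      // atomic removal: a batch is drained or expired, never both
            owned = deque.pollFirst();
        }
        if (owned != null) {
            owned.close();          // no other thread can reach this batch any more
        }
    }
}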

Contributor

Makes sense, and that's also what I've seen. Thanks for the confirmation!

Contributor

@showuon showuon left a comment

LGTM!

@jasonk000
Contributor Author

Hi @showuon, is there anything remaining here before we merge it? Thanks.

@showuon
Contributor

showuon commented Mar 4, 2022

@ijuma @artemlivshits , do you want to have another look at this PR? Thanks.

Contributor

@ijuma ijuma left a comment

LGTM, thanks!

@ijuma ijuma merged commit 2367c89 into apache:trunk Mar 9, 2022
@jasonk000 jasonk000 deleted the kafka-13630-recordaccumulator-locks branch March 9, 2022 17:15
ZIDAZ added a commit to linkedin/kafka that referenced this pull request Sep 12, 2022
ZIDAZ added a commit to linkedin/kafka that referenced this pull request Sep 14, 2022
…work thread holds batch queue lock (apache#11722) (#391)

TICKET = LIKAFKA-46436

LI_DESCRIPTION =
This cherry picks from apache#11722
Hold the deque lock for only as long as is required to collect and make a decision in
ready() and drain() loops. Once this is done, remaining work can be done without lock,
so release it. This allows producers to continue appending.

EXIT_CRITERIA = N/A