
KAFKA-16226 Reduce synchronization between producer threads #15323

Merged
merged 15 commits into apache:trunk from the kafka_15415_reduce_contention branch on Feb 15, 2024

Conversation

msn-tldr
Contributor

@msn-tldr msn-tldr commented Feb 6, 2024

As the JIRA explains, synchronization between the application thread and the background thread increased once the background thread started calling the synchronized method Metadata.currentLeader(), introduced in the original PR. This PR makes the following changes:

  1. Changes the background thread, i.e. RecordAccumulator's partitionReady() and drainBatchesForOneNode(), to stop using Metadata.currentLeader() and instead rely on MetadataCache, which is immutable, so access to it is unsynchronized.
  2. Repurposes MetadataCache as an immutable snapshot of Metadata. It is a wrapper around the public Cluster; MetadataCache's API/functionality should be extended for internal client usage instead of the public Cluster. For example, this PR adds MetadataCache.leaderEpochFor().
  3. Renames MetadataCache to MetadataSnapshot to make it explicit that it is immutable.

Note that neither Cluster nor MetadataCache is synchronized, which removes synchronization from the hot path for high partition counts.
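
To illustrate the shape of the change, here is a minimal, self-contained sketch (class and field names are hypothetical, not the actual Kafka classes): the metadata-owning side publishes an immutable snapshot of leader information, and the sender's hot path reads it without taking any lock.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified sketch -- not the real Kafka classes.
final class MetadataSnapshotSketch {
    private final Map<String, Integer> leaderEpochByPartition; // e.g. "topic-0" -> epoch

    MetadataSnapshotSketch(Map<String, Integer> leaderEpochByPartition) {
        // Copied once at construction; the snapshot is never mutated afterwards.
        this.leaderEpochByPartition = Map.copyOf(leaderEpochByPartition);
    }

    Optional<Integer> leaderEpochFor(String topicPartition) {
        // Plain unsynchronized read: safe because the map is immutable.
        return Optional.ofNullable(leaderEpochByPartition.get(topicPartition));
    }
}
```

The background thread captures one such snapshot per drain pass and uses only that snapshot for its readiness and drain decisions.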


Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@msn-tldr msn-tldr changed the title working change Reduce synchronization between producer threads Feb 6, 2024
@msn-tldr msn-tldr marked this pull request as ready for review February 6, 2024 15:43
@msn-tldr msn-tldr force-pushed the kafka_15415_reduce_contention branch from 1344252 to ca32a8f Compare February 6, 2024 15:44
@msn-tldr msn-tldr changed the title Reduce synchronization between producer threads KAFKA-16226 Reduce synchronization between producer threads Feb 6, 2024
* @return The delay for next check
*/
- private long partitionReady(Metadata metadata, long nowMs, String topic,
+ private long partitionReady(Cluster cluster, long nowMs, String topic,
Contributor
In some ways, this is a step backwards. We have been trying to reduce the reliance on Cluster internally because it is public. With a lot of internal usage, we end up making changes to the API which are only needed for the internal implementation (as we are doing in this PR). Have you considered alternatives? Perhaps we could expose something like Cluster, but with a reduced scope?

Contributor Author
@msn-tldr msn-tldr Feb 7, 2024

@hachikuji

Thanks for pointing it out. As it turns out, I don't need to extend the public API of Cluster in order to get the epoch, so internal usage no longer changes Cluster's API.

We have been trying to reduce the reliance on Cluster internally because it is public.

This could be achieved by creating a forwarding "internal" class ClusterView that uses Cluster by composition and offers the same API. Client code could then be refactored to use ClusterView, so future internal-only extensions would be made on ClusterView instead of on Cluster's public API.

But that would be a sizeable refactor, so how about keeping it separate from this PR? The intention of this PR is to fix the perf bug and cherry-pick it to other branches.
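
For illustration, a rough sketch of this forwarding idea (ClusterView here is hypothetical and not part of this PR): compose the public Cluster and forward only the methods internal client code needs, so internal-only additions never have to land on Cluster's public API.

```java
import java.util.List;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

// Hypothetical forwarding class: wraps the public Cluster by composition.
final class ClusterView {
    private final Cluster cluster;

    ClusterView(Cluster cluster) {
        this.cluster = cluster;
    }

    Node leaderFor(TopicPartition topicPartition) {
        return cluster.leaderFor(topicPartition); // plain forwarding
    }

    List<PartitionInfo> partitionsForTopic(String topic) {
        return cluster.partitionsForTopic(topic); // plain forwarding
    }

    // Internal-only extensions would be added here rather than on Cluster.
}
```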

Contributor

Makes sense. We'd probably have to do it the other way around though I guess? The client's dependence on Cluster cannot be easily changed, but we can move the internal implementation anywhere we want.

Contributor

I was looking into your idea a little bit. There might be a simple enough variation that wouldn't require significant changes. What do you think about this? https://github.com/apache/kafka/compare/trunk...hachikuji:kafka:internal-cluster-view?expand=1

Contributor Author

@hachikuji Thanks for the draft PR. I have introduced InternalCluster as a wrapper around the public Cluster and extended it with leaderEpochFor(), which is only for the client's internal usage.
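
A rough sketch of what such a wrapper could look like (simplified, with hypothetical field names; the actual class was later folded into MetadataSnapshot): the leader epochs are kept alongside the public Cluster rather than on it, so Cluster's public API stays untouched.

```java
import java.util.Map;
import java.util.Optional;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartition;

// Hypothetical sketch of an internal wrapper around the public Cluster.
final class InternalClusterSketch {
    private final Cluster cluster;
    private final Map<TopicPartition, Integer> leaderEpochs;

    InternalClusterSketch(Cluster cluster, Map<TopicPartition, Integer> leaderEpochs) {
        this.cluster = cluster;
        this.leaderEpochs = Map.copyOf(leaderEpochs); // immutable copy
    }

    Node leaderFor(TopicPartition tp) {
        return cluster.leaderFor(tp); // delegated to the public Cluster
    }

    Optional<Integer> leaderEpochFor(TopicPartition tp) {
        return Optional.ofNullable(leaderEpochs.get(tp)); // internal-only addition
    }
}
```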

@msn-tldr
Contributor Author

msn-tldr commented Feb 7, 2024

@hachikuji
There are unrelated test failures in the Jenkins run. Looking at the history of the failed tests, they have been failing since before this PR.

https://ge.apache.org/s/fr7yermmdioac/tests/overview?outcome=FAILED

new HashSet<>(Arrays.asList(node2)), 999999 /* maxSize */, time.milliseconds());
assertTrue(batches.get(node2.id()).isEmpty());
}

@Test
public void testDrainOnANodeWhenItCeasesToBeALeader() throws InterruptedException {
Contributor Author

This is no longer needed since drainBatchesForOneNode now uses InternalCluster instead of Metadata. Because Metadata is mutable, a node could be a partition leader and then leadership could move to another node mid-drain. That is not possible with InternalCluster since it is immutable.
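
A minimal, self-contained illustration of the point (plain Java with hypothetical names, not the real RecordAccumulator): a drain pass that captures one immutable leader map cannot observe a leadership change mid-pass, because concurrent metadata updates replace the reference instead of mutating the map it holds.

```java
import java.util.Map;

// Hypothetical sketch: immutable leader view captured once per drain pass.
final class DrainPassSketch {
    private volatile Map<String, Integer> leadersByPartition = Map.of(); // partition -> leader node id

    void onMetadataUpdate(Map<String, Integer> newLeaders) {
        leadersByPartition = Map.copyOf(newLeaders); // publish a new immutable map
    }

    void drainPass() {
        Map<String, Integer> snapshot = leadersByPartition; // captured once for the whole pass
        // Every readiness check and per-node drain in this pass uses `snapshot`,
        // so the leader seen when a batch is selected is the leader it is drained to.
        snapshot.forEach((partition, leaderId) -> { /* drain batches destined for leaderId */ });
    }
}
```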

@msn-tldr
Contributor Author

msn-tldr commented Feb 8, 2024

The latest Jenkins failure is due to compilation errors with Scala 2.12 introduced here:
#15327 (comment)

Update: the fix is here:
#15343

@@ -52,7 +52,7 @@ public class MetadataCache {
private final Map<TopicPartition, PartitionMetadata> metadataByPartition;
private final Map<String, Uuid> topicIds;
private final Map<Uuid, String> topicNames;
- private Cluster clusterInstance;
+ private InternalCluster clusterInstance;
Contributor
The javadoc for MetadataCache describes it as mutable, but as far as I can tell, we do not actually modify any of the collections. We always build new instances instead of updating an existing one. That makes me wonder if we can change the javadoc and use MetadataCache as the immutable snapshot of metadata state. Then we could drop InternalCluster in favor of MetadataCache. Would that work?

Contributor Author
@msn-tldr msn-tldr Feb 8, 2024

@hachikuji I had thought about MetadataCache. It has one accessor, partitionMetadata(), that returns mutable PartitionMetadata and does not make defensive copies; all the other accessors return immutable objects or make defensive copies.
Is it ok for partitionMetadata() to make defensive copies? That could increase memory usage.

Contributor

Discussed offline. It does not look like PartitionMetadata should be treated as mutable. It comes directly from the Metadata response and I can't think of a reason the client would update any of the replica sets directly. We should confirm though.

Contributor Author

We should confirm though.

I don't see it being used mutably in the code. Historically it was made mutable to support deletions/updates within the cache, but that code has since been removed. As far as I can see, the semantics are read-only, so I have treated MetadataCache as an immutable cache, made its internal data structures unmodifiable, and updated the javadoc.

All clients tests pass locally; hopefully the Jenkins signal is green too 🤞
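
The pattern described above is roughly the following (a simplified sketch with generic names, not the actual MetadataCache fields): copy the incoming collections once in the constructor and wrap them as unmodifiable, so the cache can be shared across threads as a read-only snapshot without defensive copies on every accessor call.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an effectively-immutable cache of key/value metadata.
final class ImmutableCacheSketch<K, V> {
    private final Map<K, V> byKey;

    ImmutableCacheSketch(Map<K, V> byKey) {
        // One copy at construction time; mutation attempts on the returned map throw.
        this.byKey = Collections.unmodifiableMap(new HashMap<>(byKey));
    }

    Map<K, V> asMap() {
        return byKey; // already unmodifiable, so no per-call defensive copy is needed
    }
}
```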

@msn-tldr msn-tldr force-pushed the kafka_15415_reduce_contention branch from f101313 to a75ab88 Compare February 9, 2024 17:48
/**
* Get the current metadata cache.
*/
public synchronized MetadataCache fetchCache() {
Contributor

Perhaps we could make cache volatile and avoid the synchronization?

Contributor Author

@hachikuji Interesting.
Metadata.update() requires mutual exclusion while updating the cache and the other internal data structures of Metadata, so it makes sense to keep the synchronization, what do you think?
Moreover, fetchCache is called once per Sender::sendProducerData, so it's not a bottleneck in the hot path.

Contributor

It makes sense to require exclusive access when building the cache, but here we're just accessing the built value. So I don't think the synchronization is necessary.
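
A small sketch of the trade-off discussed here (names are hypothetical): updates that rebuild the cache stay synchronized, so they remain mutually exclusive with the rest of the Metadata update, while reads of the already-built reference rely on a volatile field for safe publication instead of the lock.

```java
// Hypothetical sketch: synchronized writers, lock-free volatile reads.
final class CacheHolderSketch {
    private volatile Object cache = new Object(); // stands in for the built cache/snapshot

    synchronized void rebuild(Object newCache) {
        // Other metadata bookkeeping would happen here under the same lock.
        cache = newCache;
    }

    Object fetchCache() {
        return cache; // volatile read: no lock needed just to hand out the built value
    }
}
```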

clients/src/test/java/org/apache/kafka/test/TestUtils.java (review comment outdated, resolved)
Contributor
@hachikuji hachikuji left a comment

LGTM. One minor suggestion.

@msn-tldr
Contributor Author

I went through the test failures across all JDK/Scala combos; they are unrelated and have been failing before this PR as well:
https://ge.apache.org/s/wftzjb3q6slyc/tests/overview?outcome=FAILED
https://ge.apache.org/s/lyqs6eqs4mtny/tests/overview?outcome=FAILED
https://ge.apache.org/s/x6x27oapk6qsa/tests/overview?outcome=FAILED
https://ge.apache.org/s/sleegbh5pfyfo/tests/overview?outcome=FAILED

The test failures belong to kafka.server.LogDirFailureTest; they are being fixed here: https://issues.apache.org/jira/browse/KAFKA-16225

@hachikuji I believe this is good to be merged, what do you think?

@ijuma
Contributor

ijuma commented Feb 14, 2024

@hachikuji @msn-tldr #15376 looks related.

@hachikuji hachikuji merged commit ff90f78 into apache:trunk Feb 15, 2024
1 check failed
yyu1993 pushed a commit to yyu1993/kafka that referenced this pull request Feb 15, 2024
@msn-tldr
Contributor Author

@hachikuji thanks for merging.

@msn-tldr
Contributor Author

@ijuma thanks for flagging #15376.

@hachikuji It looks like this was going to add a test for the concurrent update of Metadata while fetching MetadataSnapshot/Cluster. That is useful, so I have created a follow-up PR: #15385
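
For context, a rough outline of the kind of check such a test might perform (plain Java with hypothetical names, not the actual follow-up PR's code): one thread keeps publishing new immutable snapshots while another keeps reading them, and every snapshot the reader observes must be complete because snapshots are replaced, never mutated.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a concurrent update/fetch check.
final class SnapshotRaceSketch {
    private volatile Map<String, Integer> snapshot = Map.of("t-0", 0); // partition -> leader epoch

    void run() throws InterruptedException {
        AtomicBoolean stop = new AtomicBoolean(false);
        Thread updater = new Thread(() -> {
            for (int epoch = 1; epoch <= 10_000; epoch++) {
                snapshot = Map.of("t-0", epoch); // replace, never mutate
            }
            stop.set(true);
        });
        Thread reader = new Thread(() -> {
            while (!stop.get()) {
                Map<String, Integer> s = snapshot; // each read sees a fully-built snapshot
                if (!s.containsKey("t-0")) {
                    throw new IllegalStateException("observed an inconsistent snapshot");
                }
            }
        });
        updater.start();
        reader.start();
        updater.join();
        reader.join();
    }
}
```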

cadonna added a commit that referenced this pull request Feb 21, 2024
Variable metadataMock was removed by #15323 after the CI build of #15320 was run and before #15320 was merged.

Reviewers: Luke Chen <showuon@gmail.com>, Lucas Brutschy <lbrutschy@confluent.io>
msn-tldr added a commit to msn-tldr/kafka that referenced this pull request Mar 7, 2024
msn-tldr added a commit to msn-tldr/kafka that referenced this pull request Mar 8, 2024
ijuma pushed a commit that referenced this pull request Mar 14, 2024
ijuma pushed a commit that referenced this pull request Mar 14, 2024
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
clolov pushed a commit to clolov/kafka that referenced this pull request Apr 5, 2024
Phuc-Hong-Tran pushed a commit to Phuc-Hong-Tran/kafka that referenced this pull request Jun 6, 2024
Phuc-Hong-Tran pushed a commit to Phuc-Hong-Tran/kafka that referenced this pull request Jun 6, 2024