[Issue 11496][C++] Allow partitioned producers to start lazily #11570
Conversation
Force-pushed from 8938ec2 to 7277147.
```diff
@@ -379,8 +418,8 @@ bool PartitionedProducerImpl::isConnected() const {
     Lock producersLock(producersMutex_);
     const auto producers = producers_;
     producersLock.unlock();
-    for (const auto& producer : producers_) {
+    for (const auto& producer : producers) {
         if (!producer->isConnected()) {
```
If the producer has been created but no messages have been sent, this will return true, even though there have been no connections.
Yes. The question is what people do with this method. Is it an indicator of a problem? Without lazy producers, false means we were once connected but now aren't and so can be considered a transient problem. But with lazy producers, false may mean we've not even tried, so false does not mean anything. So I decided to return true to signal absence of a problem.
Definitely interested to hear opinions on that.
Ya, I have no strong opinion here. "connected" is a weird concept in a distributed system like this.
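To make the semantics settled on in this thread concrete, here is an illustrative, self-contained sketch (the `PartitionState` enum and helper function are hypothetical, not the client's real types): a partition whose producer has never attempted to connect is not reported as a connectivity problem.

```cpp
#include <vector>

// Hypothetical states for a partition's internal producer under lazy start.
enum class PartitionState { NotStarted, Connected, Disconnected };

// Illustrative only: mirrors the decision above that "never tried" is treated
// as "no known problem", so only a producer that was connected and then lost
// its connection makes the aggregate report false.
bool allPartitionsHealthy(const std::vector<PartitionState>& partitions) {
    for (PartitionState state : partitions) {
        if (state == PartitionState::Disconnected) {
            return false;
        }
    }
    return true;
}
```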
Force-pushed from 7277147 to d761f7b.
Force-pushed from d761f7b to 0b9da9e.
@Vanlightly Thanks for providing doc-related info! Looking forward to your docs.
Could you rebase to master since C++/Python tests have been fixed?
Force-pushed from 0b9da9e to 367cc96.
I realised that I had not included the extra config in the …
@Anonymitaet Beyond the doxygen- and pdoc-generated docs, is it worth documenting this change further? If so, where do you think?
Force-pushed from 367cc96 to b90ff74.
@Vanlightly how about adding your changes here?
Avoids performing lookups and producer registration on all partitions of a partitioned topic when using SinglePartition routing without keyed messages
Force-pushed from b90ff74 to 0fa7215.
Lazy producers connect on demand on their first message. The send timeout timer must be started on the first message, as there is no guarantee that the connection will complete.
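A self-contained, illustrative sketch of that timer behaviour (the class and member names are invented for illustration, not the client's internals): the send-timeout deadline is armed relative to the first send rather than producer creation, so messages buffered while a lazy producer is still connecting get the full timeout window.

```cpp
#include <chrono>
#include <string>
#include <vector>

// Illustrative only: arms the send-timeout deadline on the first message.
class LazySendTimeoutSketch {
  public:
    explicit LazySendTimeoutSketch(std::chrono::milliseconds sendTimeout)
        : sendTimeout_(sendTimeout) {}

    void sendAsync(const std::string& payload) {
        if (!timerArmed_) {
            // Start the clock at the first send, not at producer creation,
            // because a lazy producer may only begin connecting now.
            deadline_ = std::chrono::steady_clock::now() + sendTimeout_;
            timerArmed_ = true;
        }
        pending_.push_back(payload);  // buffered until the connection completes
    }

    bool timedOut() const {
        return timerArmed_ && std::chrono::steady_clock::now() > deadline_;
    }

  private:
    std::chrono::milliseconds sendTimeout_;
    std::chrono::steady_clock::time_point deadline_{};
    bool timerArmed_ = false;
    std::vector<std::string> pending_;
};
```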
@Anonymitaet I have added a section about lazy producers to the documentation and also added examples of both synchronous and asynchronous producers and consumers.
Many thanks for your great contribution! I’ve left some comments, PTAL.
Ensure that the state is always checked in PartitionedProducer::sendAsync to avoid buffering of messages after closeAsync is called. Removed the sequential locking in ProducerImpl::handleCreateProducer that allowed the state to go back to Ready after closeAsync was called.
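An illustrative, self-contained sketch of that guard (the enum and struct names are invented for illustration): sendAsync checks the state before doing anything else, so no message can be buffered once closeAsync has moved the state away from Ready.

```cpp
#include <atomic>
#include <functional>

// Illustrative only: minimal state machine showing the "check state before
// buffering" rule described in the commit message above.
enum class SketchState { Ready, Closing, Closed };

struct SendStateGuardSketch {
    std::atomic<SketchState> state{SketchState::Ready};

    // callback(true) stands in for ResultOk, callback(false) for ResultAlreadyClosed.
    void sendAsync(const std::function<void(bool)>& callback) {
        if (state.load() != SketchState::Ready) {
            callback(false);  // fail fast instead of buffering after close
            return;
        }
        // ... route to the partition's producer, starting it lazily if needed ...
        callback(true);
    }

    void closeAsync() { state.store(SketchState::Closing); }
};
```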
Thank you for your comments about PIP-79. For authn/authz backward compatibility reasons, I think the internal producer should be connected to the broker before completing the creation of the partitioned producer (https://mail-archives.apache.org/mod_mbox/pulsar-dev/202102.mbox/%3CCAO2yDybn3sqPJV32YqvYndk%3D8mxNKodcGB4GE3QmUs8F9m8YUw%40mail.gmail.com%3E). However, it isn't critical because the change only takes effect when ProducerConfiguration.setLazyStartPartitionedProducers(true) is set. As mentioned here, partitioned producer stats will be incorrect if each partition has a different number of producers. I tried to fix this issue in PIP-79 (it will be fixed in #10534 for the Java client).
This can be worked around by making a producer connection to any one random partition. If we could make this the first partition selected by round-robin, even better.
In order to ensure that authz errors are returned during producer creation, start one producer eagerly when using lazy partitioned producers. When UseSinglePartition mode is used, the eagerly created producer will be the only producer created (when non-keyed messages are sent).
In the case that an internal producer is created lazily and it fails, fail any pending requests immediately.
@equanz @ivankelly I have changed it so that one producer is started eagerly, chosen by the routing policy, and any other internal producers are started lazily. This will ensure that authz errors occur during createProducer and we still get the desired behaviour of fewer lookups and connections when using single partition routing.
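A self-contained sketch of that compromise (the struct and callbacks are hypothetical, not the PR's identifiers): in lazy mode only the partition chosen by the routing policy is started during createProducer, surfacing authn/authz errors early, while the remaining partitions start on their first message.

```cpp
#include <functional>

// Illustrative only: eager start for the routed partition, lazy start elsewhere.
struct LazyPartitionedStartSketch {
    int numPartitions;
    std::function<int()> chooseFirstPartition;        // e.g. the SinglePartition router
    std::function<void(int)> startPartitionProducer;  // lookup + register with the broker

    void start(bool lazyStart) {
        if (lazyStart) {
            // Eagerly start exactly one producer so authz failures are reported
            // from createProducer rather than from the first send.
            startPartitionProducer(chooseFirstPartition());
            // All other partitions start lazily on their first message.
        } else {
            for (int p = 0; p < numPartitions; ++p) {
                startPartitionProducer(p);
            }
        }
    }
};
```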
@codelipenghui I added the …
Without this PR, the test introduced in #11955 would block forever. It's because, for a partitioned producer, calling … The root cause is when a partitioned producer calls …
Fixes #13849 Fixes #14848

Motivation

#11570 adds a `testSendAsyncCloseAsyncConcurrentlyWithLazyProducers` test for the case that `sendAsync` calls invoked after `closeAsync` is called in another thread must complete with `ResultAlreadyClosed`. It's flaky because the synchronization between the two threads is not strict. The test uses `sendStartLatch` to order `sendAsync` and `closeAsync`:

```
sendAsync 0,1,...,9 -> sendStartLatch is done -> closeAsync
```

However, it cannot guarantee that the remaining `sendAsync` calls happen after `closeAsync` is called. If they don't, all `sendAsync` calls will complete with `ResultOk`. On the other hand, the test is meaningless because it requires strict synchronization between the two threads, so there is no need to run `sendAsync` and `closeAsync` in two threads at all. The verification of this test is also wrong, see #13849 (comment). When `closeAsync` is called, the previous `sendAsync` calls might not have completed, so all `sendAsync` calls will complete with `ResultAlreadyClosed`, not only those called after `closeAsync`.

In addition, this PR also tries to fix the flaky `testReferenceCount`, whose assumptions are too strict.

Modifications

- Remove `testSendAsyncCloseAsyncConcurrentlyWithLazyProducers`
- Only check that the reference count is greater than 0 instead of equal to 1
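For clarity, a self-contained sketch of the race described above (the callbacks are placeholders, not the removed test's code): the latch only guarantees that closeAsync starts after the first ten sendAsync calls have been issued; nothing orders the remaining sendAsync calls relative to closeAsync, so their results are nondeterministic.

```cpp
#include <functional>
#include <future>
#include <thread>

// Illustrative only: demonstrates why the latch does not give the intended ordering.
void flawedOrderingSketch(const std::function<void(int)>& sendAsync,
                          const std::function<void()>& closeAsync) {
    std::promise<void> sendStartLatch;
    std::thread sender([&] {
        for (int i = 0; i < 20; ++i) {
            sendAsync(i);
            if (i == 9) {
                sendStartLatch.set_value();  // "sendAsync 0..9 have been issued"
            }
        }
    });
    sendStartLatch.get_future().wait();
    closeAsync();  // may interleave arbitrarily with sendAsync 10..19
    sender.join();
}
```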
Fixes #11496 and also matches part of PIP 79.
A C++ implementation that closely matches the proposed Java client changes for reducing partitioned producer connections and lookups: PR #10279
Motivation
Producers that send messages to partitioned topics start a producer per partition, even when using single partition routing. For topics that have both a large number of producers and a large number of partitions, this can put strain on the brokers. With, say, 1000 partitions and single partition routing with non-keyed messages, 999 topic owner lookups and producer registrations are performed that could be avoided.
PIP 79 also describes this; I wrote this implementation before realising that PIP 79 covers it. It can be reviewed and contrasted with the Java client implementation in #10279.
Modifications
Allows partitioned producers to start producers for individual partitions lazily. Starting a producer involves a topic owner
lookup to find out which broker is the owner of the partition, then registering the producer for that partition with the owner
broker. For topics with many partitions and when using SinglePartition routing without keyed messages, all of these
lookups and producer registrations are a waste except for the single chosen partition.
This change allows the user to control whether a producer on a partitioned topic uses this lazy start or not, via a new config
in ProducerConfiguration. When ProducerConfiguration.setLazyStartPartitionedProducers(true) is set, PartitionedProducerImpl::start() becomes a synchronous operation that only does housekeeping (no network operations).
The producer of any given partition is started (which includes a topic owner lookup and registration) upon sending the first message to that partition. While the producer starts, messages are buffered.
The sendTimeout timer is only activated once a producer has been fully started, which should give enough time for any buffered messages to be sent. For very short send timeouts, this setting could cause send timeouts during the start phase. The default of 30s should however not cause this issue.
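A minimal usage sketch of the new configuration with the C++ client (service URL and topic name are placeholders): with UseSinglePartition routing and non-keyed messages, only the routed partition incurs a lookup and producer registration.

```cpp
#include <pulsar/Client.h>

using namespace pulsar;

int main() {
    Client client("pulsar://localhost:6650");  // placeholder service URL

    ProducerConfiguration producerConf;
    producerConf.setLazyStartPartitionedProducers(true);  // the new config
    producerConf.setPartitionsRoutingMode(ProducerConfiguration::UseSinglePartition);

    Producer producer;
    // Per the discussion above, one partition producer is still started eagerly,
    // so authn/authz errors are reported here rather than on the first send.
    Result res = client.createProducer("persistent://public/default/partitioned-topic",
                                       producerConf, producer);
    if (res != ResultOk) {
        return 1;
    }

    // Other partition producers (if any are ever routed to) start lazily on their
    // first message; until then no lookups or registrations are performed for them.
    producer.send(MessageBuilder().setContent("hello").build());

    client.close();
    return 0;
}
```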
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes.

Documentation
For contributor
For this PR, do we need to update docs?
Yes, the new client config would need documenting. Can contribute that if this PR is accepted.
For committer
For this PR, do we need to update docs?
If yes,
- if you update docs in this PR, label this PR with the `doc` label.
- if you plan to update docs later, label this PR with the `doc-required` label.
- if you need help on updating docs, create a follow-up issue with the `doc-required` label.

If no, label this PR with the `no-need-doc` label and explain why.