
[Issue 11496][C++] Allow partitioned producers to start lazily #11570

Merged: 6 commits merged into apache:master on Aug 16, 2021

Conversation

Vanlightly
Contributor

Fixes #11496 and also matches part of PIP 79.

C++ implementation that closely matches the proposed Java client changes for reducing partitioned producer connections and lookups: PR 10279

Motivation

Producers that send messages to partitioned topics start a producer per partition, even when using single partition routing. For topics that combine a large number of producers with a large number of partitions, this can put strain on the brokers. With, say, 1000 partitions and single partition routing of non-keyed messages, 999 topic owner lookups and producer registrations are performed that could be avoided.

PIP 79 also describes this; I wrote this change before realising that PIP 79 covers it. This implementation can be reviewed and contrasted with the Java client implementation in #10279.

Modifications

Allows partitioned producers to start producers for individual partitions lazily. Starting a producer involves a topic owner lookup to find out which broker owns the partition, then registering the producer for that partition with the owner broker. For topics with many partitions, when using SinglePartition routing without keyed messages, all of these lookups and producer registrations are wasted except for the single chosen partition.

This change allows the user to control whether a producer on a partitioned topic uses this lazy start, via a new config in ProducerConfiguration. When ProducerConfiguration.setLazyStartPartitionedProducers(true) is set, PartitionedProducerImpl.start() becomes a synchronous operation that only does housekeeping (no network operations). The producer for any given partition is started (which includes a topic owner lookup and registration) upon sending the first message to that partition. While the producer starts, messages are buffered.

The sendTimeout timer is only activated once a producer has been fully started, which should give enough time for any buffered messages to be sent. For very short send timeouts, this setting could cause send timeouts during the start phase; the default of 30s, however, should not cause this issue.
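
For illustration, here is a minimal C++ sketch of enabling the new option together with single-partition routing (the service URL and topic name are placeholders):

```cpp
#include <pulsar/Client.h>

using namespace pulsar;

int main() {
    Client client("pulsar://localhost:6650");  // placeholder service URL

    ProducerConfiguration conf;
    // Route non-keyed messages to a single partition, so only one internal
    // producer is ever needed.
    conf.setPartitionsRoutingMode(ProducerConfiguration::UseSinglePartition);
    // New option added by this PR: defer per-partition lookups and
    // registrations until a message is first routed to that partition.
    conf.setLazyStartPartitionedProducers(true);

    Producer producer;
    Result res = client.createProducer("persistent://public/default/my-partitioned-topic", conf, producer);
    if (res != ResultOk) {
        return 1;
    }

    // With lazy start, the internal producer for a partition is started when a
    // message is first routed to it; messages are buffered while it starts.
    producer.send(MessageBuilder().setContent("hello").build());

    producer.close();
    client.close();
    return 0;
}
```

With the option left at its default (false), behaviour is unchanged and all partition producers are started during createProducer.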

Verifying this change

This change adds tests and can be verified as follows:

  • BasicEndToEndTest, testPartitionedProducerConsumer
  • BasicEndToEndTest, testSyncFlushBatchMessagesPartitionedTopicLazyProducers
  • BasicEndToEndTest, testFlushInLazyPartitionedProducer

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (yes) - client configuration
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

For contributor

For this PR, do we need to update docs?

Yes, the new client config needs documenting. I can contribute that documentation if this PR is accepted.

For committer

For this PR, do we need to update docs?

  • If yes,

    • if you update docs in this PR, label this PR with the doc label.

    • if you plan to update docs later, label this PR with the doc-required label.

    • if you need help on updating docs, create a follow-up issue with the doc-required label.

  • If no, label this PR with the no-need-doc label and explain why.

@@ -379,8 +418,8 @@ bool PartitionedProducerImpl::isConnected() const {
    Lock producersLock(producersMutex_);
    const auto producers = producers_;
    producersLock.unlock();
-   for (const auto& producer : producers_) {
+   for (const auto& producer : producers) {
        if (!producer->isConnected()) {
Contributor

If the producer has been created but no messages have been sent, this will return true, even though there have been no connections.

Contributor Author

Yes. The question is what people do with this method. Is it an indicator of a problem? Without lazy producers, false means we were once connected but now aren't and so can be considered a transient problem. But with lazy producers, false may mean we've not even tried, so false does not mean anything. So I decided to return true to signal absence of a problem.

Definitely interested to hear opinions on that.
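
To make those semantics concrete, here is a hypothetical illustration (not the merged implementation; the type and function names below are made up for the example): only producers that have actually been started can make the check report a problem, so an idle lazy partitioned producer reports true.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for an internal per-partition producer that has
// already been started (eagerly or lazily).
struct StartedProducer {
    bool connected = false;                        // true once the broker connection is up
    bool isConnected() const { return connected; }
};

// Report "connected" unless a producer that actually attempted a connection
// has lost it. An empty list (no partition used yet) reports true, i.e. the
// absence of a problem.
bool partitionedProducerIsConnected(const std::vector<std::shared_ptr<StartedProducer>>& started) {
    for (const auto& producer : started) {
        if (!producer->isConnected()) {
            return false;  // a started producer is disconnected: a (possibly transient) problem
        }
    }
    return true;
}
```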

Contributor

Ya, I have no strong opinion here. "connected" is a weird concept in a distributed system like this.

@Anonymitaet added the doc-required label on Aug 5, 2021
@Anonymitaet
Member

@Vanlightly Thanks for providing doc-related info! Looking forward to your docs.

@BewareMyPower
Contributor

Could you rebase onto master since the C++/Python tests have been fixed?

@Vanlightly
Contributor Author

Vanlightly commented Aug 9, 2021

I realised that I had not included the extra config in __init__.py, so I will add that to this PR.

@Anonymitaet Beyond the doxygen- and pdoc-generated docs, is it worth documenting this change further? If so, where do you think it should go?

@Anonymitaet
Member

@Vanlightly how about adding your changes here?

Avoids performing lookups and producer registration on all partitions of a partitioned topic when using SinglePartition routing without keyed messages.
Lazy producers connect on demand on their first message. The send timeout timer must be started on the first message as there is no guarantee the connection will complete.
@Vanlightly
Contributor Author

@Anonymitaet I have added a section about lazy producers to the documentation and also added examples of both synchronous and asynchronous producers and consumers.

Member

@Anonymitaet Anonymitaet left a comment

Many thanks for your great contribution! I’ve left some comments, PTAL.

@Anonymitaet added the doc label and removed the doc-required label on Aug 11, 2021
Ensure that the state is always checked in PartitionedProducer::sendAsync to avoid buffering messages after closeAsync is called.

Removed the sequential locking in ProducerImpl::handleCreateProducer that allowed the state to go back to Ready after closeAsync was called.
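
A simplified sketch of the kind of check these commits describe (the free function and state enum are hypothetical; only the pulsar types come from the client API): the partitioned producer's state is consulted before a message is routed or buffered, so sends issued after closeAsync fail fast with ResultAlreadyClosed.

```cpp
#include <pulsar/Client.h>

using namespace pulsar;

// Hypothetical state enum standing in for the partitioned producer's state.
enum class ProducerState { Ready, Closing, Closed };

// Simplified sketch, not the merged PartitionedProducerImpl::sendAsync code.
void sendAsyncSketch(ProducerState state, const Message& msg, const SendCallback& callback) {
    if (state != ProducerState::Ready) {
        // Fail fast instead of buffering the message on a lazily started
        // producer after closeAsync() has been called.
        callback(ResultAlreadyClosed, msg.getMessageId());
        return;
    }
    // ...otherwise route the message to its partition, lazily starting that
    // partition's internal producer if needed, and buffer while it starts...
}
```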
@equanz
Contributor

equanz commented Aug 12, 2021

Thank you for your comments about PIP-79.
I have a few comments; I don't think they block merging.

For authn/authz backward-compatibility reasons, I think the internal producer should be connected to the broker before the creation of the partitioned producer completes ( https://mail-archives.apache.org/mod_mbox/pulsar-dev/202102.mbox/%3CCAO2yDybn3sqPJV32YqvYndk%3D8mxNKodcGB4GE3QmUs8F9m8YUw%40mail.gmail.com%3E ). However, it isn't critical because the change only takes effect when ProducerConfiguration.setLazyStartPartitionedProducers(true) is set.

As mentioned here, partitioned producer stats will be incorrect if each partition has a different number of producers. I tried to fix this issue in PIP-79 (it will be fixed in #10534 for the Java client).

@ivankelly
Contributor

> For authn/authz backward compatibility reason, I think the internal producer should be connected to broker before completing the creation of partitioned producer( https://mail-archives.apache.org/mod_mbox/pulsar-dev/202102.mbox/%3CCAO2yDybn3sqPJV32YqvYndk%3D8mxNKodcGB4GE3QmUs8F9m8YUw%40mail.gmail.com%3E ). However, it isn't critical because the change affects when ProducerConfiguration.setLazyStartPartitionedProducers(true) is set.

This can be worked around by making the producer connection to any one random partition. If we could make this the first partition selected in round-robin, even better.

In order to ensure that authz errors are returned during producer creation, start one producer eagerly when using lazy partitioned producers. When UseSinglePartition mode is used, the eagerly created producer will be the only producer created (when non-keyed messages are sent). In the case that an internal producer is created lazily and it fails, fail any pending requests immediately.
@Vanlightly
Contributor Author

@equanz @ivankelly I have changed it so that one producer is started eagerly, chosen by the routing policy, and any other internal producers are started lazily. This ensures that authz errors occur during createProducer while we still get the desired behaviour of fewer lookups and connections when using single partition routing.
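
A hedged sketch of that behaviour (all names below are hypothetical; the real routing policy and producer start live inside the client): the router picks one partition up front, that partition's producer is started eagerly so authorization failures surface during createProducer, and every other partition is left for lazy start on first use.

```cpp
#include <functional>

// Hypothetical sketch, not the merged code. 'startProducerForPartition' stands
// in for the per-partition lookup + registration performed by the client.
void startPartitionedProducerSketch(int numPartitions,
                                    const std::function<int(int)>& chooseFirstPartition,
                                    const std::function<void(int)>& startProducerForPartition) {
    // Start exactly one partition eagerly: the one the routing policy would
    // pick for the first non-keyed message. With UseSinglePartition routing,
    // this is the only producer ever created for non-keyed sends.
    const int eagerPartition = chooseFirstPartition(numPartitions);
    startProducerForPartition(eagerPartition);
    // All remaining partitions are started lazily when a message is first
    // routed to them.
}
```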

@sijie added this to the 2.9.0 milestone on Aug 13, 2021
@codelipenghui merged commit 9577b84 into apache:master on Aug 16, 2021
@BewareMyPower added the component/c++, type/PIP, type/enhancement, type/feature, and release/2.8.2 labels on Sep 10, 2021
@BewareMyPower
Contributor

@codelipenghui I added the release/2.8.2 label first.

Without this PR, the test introduced in #11955 would block forever, because for a partitioned producer, calling send() after shutdown() or close() would block forever. I haven't found exactly why this PR fixes it, but it should fix it, and it has been tested in my local env.

The root cause is that when a partitioned producer's shutdown() is called during client shutdown, it only sets the state; the internal producers are not affected. We could also solve this problem by shutting down all internal producers when the partitioned producer's shutdown() is called.
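
A hedged sketch of that alternative (the names below are hypothetical): propagate shutdown from the partitioned producer to every internal producer instead of only flipping the parent's state, so later send() calls fail instead of blocking.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for an internal per-partition producer.
struct InternalProducer {
    bool closed = false;
    void shutdown() { closed = true; }  // would release the connection and fail pending sends
};

// Sketch: when the partitioned producer is shut down during client shutdown,
// also shut down every internal producer rather than only setting the
// partitioned producer's own state.
void shutdownPartitionedProducerSketch(std::vector<std::shared_ptr<InternalProducer>>& internalProducers) {
    for (auto& producer : internalProducers) {
        producer->shutdown();
    }
}
```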

codelipenghui pushed a commit that referenced this pull request Sep 10, 2021
(cherry picked from commit 9577b84)
@codelipenghui added the cherry-picked/branch-2.8 label on Sep 18, 2021
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this pull request Mar 18, 2022
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Apr 22, 2022
Fixes apache#13849
Fixes apache#14848

### Motivation

apache#11570 adds a `testSendAsyncCloseAsyncConcurrentlyWithLazyProducers` test for the case where `sendAsync` calls invoked after `closeAsync` has been called in another thread must complete with `ResultAlreadyClosed`.
It's flaky because the synchronization between the two threads is not strict. The test uses `sendStartLatch` to order `sendAsync` and `closeAsync`:

```
sendAsync 0,1,...,9 -> sendStartLatch is done -> closeAsync
```

However, it cannot guarantee that the remaining `sendAsync` calls happen after `closeAsync` is called. If they run before it, all `sendAsync` calls will complete with `ResultOk`.

On the other hand, this test is meaningless because it requires strict synchronization between the two threads, in which case there is no point running `sendAsync` and `closeAsync` in separate threads.

The verification in this test is also wrong, see apache#13849 (comment).
When `closeAsync` is called, the earlier `sendAsync` calls might not have completed yet, so all `sendAsync` calls will complete with `ResultAlreadyClosed`, not only those made after `closeAsync`.

In addition, this PR also tries to fix the flaky `testReferenceCount`, whose assumptions are too strict.

### Modifications

- Remove `testSendAsyncCloseAsyncConcurrentlyWithLazyProducers`
- Only check that the reference count is greater than 0 instead of equal to 1
BewareMyPower added a commit that referenced this pull request Apr 22, 2022
codelipenghui pushed a commit that referenced this pull request Apr 28, 2022
(cherry picked from commit eeea9ca)
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request May 9, 2022
(cherry picked from commit eeea9ca)
(cherry picked from commit 83b6833)
BewareMyPower added a commit that referenced this pull request May 24, 2022
(cherry picked from commit eeea9ca)
BewareMyPower added a commit that referenced this pull request Jul 27, 2022
(cherry picked from commit eeea9ca)
Labels
cherry-picked/branch-2.8, doc, release/2.8.2, type/enhancement, type/feature, type/PIP
Development

Successfully merging this pull request may close these issues.

Reduce producer lookups and connections in partitioned producers
7 participants