
Improvements in max-pending-bytes mechanism for broker #7406

Merged
merlimat merged 3 commits into apache:master from merlimat:max-pending-bytes on May 2, 2021

Conversation

merlimat
Contributor

@merlimat merlimat commented Jul 1, 2020

Motivation

There are a few issues with the current implementation of the broker-wide throttling based on max outstanding bytes that was added in #6178.

  1. The current implementation over-counts the outstanding bytes when there is more than one producer on a given connection: it cycles through the producers and adds the per-connection counter once for each producer.
  2. There are 1 atomic increment and 2 volatile reads for each request.
  3. There is a delay in detecting memory over-commit, because the background task only runs periodically.
  4. If there is a substantial number of producers, the task that runs every 100ms will consume significant CPU by constantly looping over all the producers (many of which could be idle).

The improvement proposed here is to use thread-local counters to avoid contention and CPU overhead (see the sketch after this comment).

  1. Use 1 counter per IO thread. Once the counter for that thread exceeds 1/N of the quota, throttle all the connections that are pinned to that thread.
  2. This also ensures that a single connection trying to publish too fast cannot throttle all other connections in the broker; rather, it will only affect the 1/N of connections pinned to the same thread.
  3. No need for atomic/volatile variables, just thread-local and local variables, since we're always operating on connections that belong to the same IO thread.
  4. Precise enforcement with no CPU overhead for idle producers.

Regarding the lower limit per IO thread, meaning that a single connection cannot use the entire memory space, I don't think it's an issue at all.

Once there is a "window" of several MB, there is no throughput gain in having a bigger window. Also, in most scenarios, the only time we'd be filling this window is when the downstream BK is either malfunctioning or under-provisioned for the load. In either case, it's not a performance concern.
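To make the approach concrete, here is a minimal sketch of the per-IO-thread accounting, assuming Netty's FastThreadLocal and Channel APIs. This is not the PR's actual code: the class, field, and method names are invented, and the resume threshold of half the per-thread limit is only an illustrative choice.

import io.netty.channel.Channel;
import io.netty.util.concurrent.FastThreadLocal;
import org.apache.commons.lang3.mutable.MutableLong;

import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Illustrative sketch of per-IO-thread publish accounting (not the PR's code).
public class PerThreadPublishLimiter {

    // One counter per IO thread: no atomics or volatiles needed, because a
    // connection's publish requests are always handled on its pinned thread.
    private static final FastThreadLocal<MutableLong> pendingBytes =
            new FastThreadLocal<MutableLong>() {
                @Override
                protected MutableLong initialValue() {
                    return new MutableLong(0);
                }
            };

    // The connections (channels) pinned to the current IO thread.
    private static final FastThreadLocal<Set<Channel>> channelsPerThread =
            new FastThreadLocal<Set<Channel>>() {
                @Override
                protected Set<Channel> initialValue() {
                    return Collections.newSetFromMap(new IdentityHashMap<>());
                }
            };

    private final long perThreadLimit; // e.g. maxPublishBufferBytes / numIoThreads

    public PerThreadPublishLimiter(long perThreadLimit) {
        this.perThreadLimit = perThreadLimit;
    }

    // Called on the IO thread when a connection is registered.
    public void register(Channel ch) {
        channelsPerThread.get().add(ch);
    }

    // Called on the IO thread for every publish request.
    public void onMessagePublishBuffered(long size) {
        MutableLong counter = pendingBytes.get();
        counter.add(size);
        if (counter.longValue() > perThreadLimit) {
            // Throttle only the connections pinned to this IO thread.
            channelsPerThread.get().forEach(ch -> ch.config().setAutoRead(false));
        }
    }

    // Called on the same IO thread when the write to BookKeeper completes.
    public void onMessagePublishDone(long size) {
        MutableLong counter = pendingBytes.get();
        counter.subtract(size);
        if (counter.longValue() <= perThreadLimit / 2) {   // illustrative resume threshold
            channelsPerThread.get().forEach(ch -> ch.config().setAutoRead(true));
        }
    }
}

With this shape, enforcement happens inline on the publish path, so there is no periodic scan over producers and no delay in detecting over-commit.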

@merlimat merlimat added the type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages label Jul 1, 2020
@merlimat merlimat added this to the 2.7.0 milestone Jul 1, 2020
@merlimat merlimat self-assigned this Jul 1, 2020
private static final FastThreadLocal<Set<ServerCnx>> cnxsPerThread = new FastThreadLocal<Set<ServerCnx>>() {
    @Override
    protected Set<ServerCnx> initialValue() throws Exception {
        return Collections.newSetFromMap(new IdentityHashMap<>());
    }
};
Contributor

What's the advantage of this over "new HashSet<>()"?

Contributor Author

Since ServerCnx doesn't override hashCode(), I thought it would be safer to just use the == operator instead of hashing. Honestly, I'm not 100% sure it would make a difference in practice.
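For context (not part of the PR): a set backed by an IdentityHashMap compares elements with ==, ignoring any equals()/hashCode() a class defines, whereas a HashSet honors them. A tiny standalone illustration with a hypothetical Key class:

import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

public class IdentitySetDemo {
    // Hypothetical class that defines value equality.
    static class Key {
        final int id;
        Key(int id) { this.id = id; }
        @Override public boolean equals(Object o) { return o instanceof Key && ((Key) o).id == id; }
        @Override public int hashCode() { return id; }
    }

    public static void main(String[] args) {
        Set<Key> hashSet = new HashSet<>();
        Set<Key> identitySet = Collections.newSetFromMap(new IdentityHashMap<>());

        hashSet.add(new Key(1));
        hashSet.add(new Key(1));      // equal by equals() -> deduplicated
        identitySet.add(new Key(1));
        identitySet.add(new Key(1));  // distinct instances -> both kept

        System.out.println(hashSet.size());     // prints 1
        System.out.println(identitySet.size()); // prints 2
    }
}

Since ServerCnx doesn't override equals()/hashCode(), a plain HashSet would fall back to identity semantics anyway, so the choice mostly documents intent.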

@@ -2307,4 +2253,16 @@ private boolean isSystemTopic(String topic) {
    public void setInterceptor(BrokerInterceptor interceptor) {
        this.interceptor = interceptor;
    }

    public void pausedConnections(int numberOfConnections) {
        pausedConnections.addAndGet(numberOfConnections);
Contributor

Is this exported via Prometheus? It would be better to have separate event counters for pause and resume, so that if pausing happens between Prometheus pulls we can still see it.

Contributor Author

Good point. I just exposed it here for validation in the tests, though it makes sense to expose the 2 counters.
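A hedged sketch of what the reviewer is suggesting: two monotonically increasing counters rather than a single gauge, so that a pause/resume cycle happening entirely between two Prometheus scrapes still shows up. The class and method names are illustrative, not the PR's code.

import java.util.concurrent.atomic.LongAdder;

public class ThrottlingStats {
    // Monotonic event counters, suitable for export as Prometheus counters.
    private final LongAdder pausedConnectionsTotal = new LongAdder();
    private final LongAdder resumedConnectionsTotal = new LongAdder();

    public void recordPaused(int numberOfConnections) {
        pausedConnectionsTotal.add(numberOfConnections);
    }

    public void recordResumed(int numberOfConnections) {
        resumedConnectionsTotal.add(numberOfConnections);
    }

    // The currently paused count can still be derived as the difference,
    // while the individual totals keep short pause/resume bursts visible
    // in the scraped time series.
    public long currentlyPaused() {
        return pausedConnectionsTotal.sum() - resumedConnectionsTotal.sum();
    }

    public long pausedTotal()  { return pausedConnectionsTotal.sum(); }
    public long resumedTotal() { return resumedConnectionsTotal.sum(); }
}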

@codelipenghui
Contributor

Regarding the lower limit per IO thread, meaning that a single connection cannot use the entire memory space, I don't think it's an issue at all.

There are some users who publish with only a few producers, in order to get a better batching effect and improve throughput. We cannot assume that a single connection cannot use the entire memory space. Our purpose is to control memory and make effective use of it. I think splitting the memory across IO threads may decrease memory utilization.

How about using a LongAdder to record the current pending bytes and removing the check tasks?

@ivankelly
Contributor

ivankelly commented Jul 1, 2020

@codelipenghui A LongAdder would still cause contention on every entry, since summing it has to visit all of its cells. To allow a single thread to use the full buffer when there's little traffic on the other threads, we should use a token bucket, as networking folks have been doing for decades.
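For reference, the token bucket being referred to is the classic algorithm sketched below; this is just a textbook sketch, not a proposal of concrete broker code.

// Textbook token bucket: tokens refill at a fixed rate up to a burst capacity,
// and each request spends tokens (e.g. one token per byte).
public class TokenBucket {
    private final long capacity;        // maximum burst, in tokens
    private final long refillPerSecond; // steady-state rate
    private long tokens;
    private long lastRefillNanos;

    public TokenBucket(long capacity, long refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Intended for single-threaded use (e.g. one bucket per IO thread);
    // not thread-safe as written.
    public boolean tryAcquire(long amount) {
        refill();
        if (tokens >= amount) {
            tokens -= amount;
            return true;
        }
        return false;
    }

    private void refill() {
        long now = System.nanoTime();
        long newTokens = (now - lastRefillNanos) * refillPerSecond / 1_000_000_000L;
        if (newTokens > 0) {
            tokens = Math.min(capacity, tokens + newTokens);
            lastRefillNanos = now;
        }
    }
}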

@ivankelly ivankelly closed this Jul 1, 2020
@ivankelly ivankelly reopened this Jul 1, 2020
@merlimat
Contributor Author

merlimat commented Jul 1, 2020

There are some users who publish with only a few producers, in order to get a better batching effect and improve throughput.

I have, of course, tested it, and as expected, once you have a few MB of runway there's no throughput advantage in allowing it to grow higher.

We cannot assume that a single connection cannot use the entire memory space.

We should assume that, because allowing a single connection to take over the broker's entire quota and stall all other connections is bad.

Our purpose is to control memory and make effective use of it.

Effectively and efficiently, in a scalable manner.

I think splitting the memory across IO threads may decrease memory utilization.

If there's no negative impact on throughput, that is a good thing. That memory can be used on the dispatch side.

How about using a LongAdder to record the current pending bytes and removing the check tasks?

The LongAdder would not allow you to remove the check tasks. The only difference between a LongAdder and an AtomicLong is that the LongAdder is optimized for writes (thread-local counters), while its read path is much slower and meant to be used infrequently.

That doesn't solve the multiple issues I've outlined above, in particular the need for the check task.

A single broker can very easily be serving hundreds of thousands of producers at any given point in time, some of them active and some of them idle.

Frequently looping over all the producers (just to get to the connections) burns a lot of CPU.
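To illustrate the point: with a LongAdder the increment is cheap, but there is no cheap way to read an up-to-date total on the hot path, so enforcement would still rely on a periodic check with its inherent delay. A hypothetical sketch, not broker code:

import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class LongAdderThrottleSketch {
    private final LongAdder pendingPublishBytes = new LongAdder();
    private final long limit;
    private volatile boolean throttled;

    public LongAdderThrottleSketch(long limit, ScheduledExecutorService scheduler) {
        this.limit = limit;
        // A background task is still required, because LongAdder.sum() walks
        // all internal cells and is only meant to be called infrequently.
        scheduler.scheduleAtFixedRate(this::check, 100, 100, TimeUnit.MILLISECONDS);
    }

    public void onPublish(long size)  { pendingPublishBytes.add(size); }   // fast path
    public void onComplete(long size) { pendingPublishBytes.add(-size); }  // fast path

    private void check() {
        // Enforcement lags by up to the check interval, and the task runs
        // whether or not any producer is active.
        throttled = pendingPublishBytes.sum() > limit;
        // ...pause or resume connections based on 'throttled'...
    }
}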

@codelipenghui
Contributor

Frequently looping over all the producers (just to get to the connections) burns a lot of CPU.

I agree with this point. I just want to say that the pending buffer is sensitive to write throughput, especially on machines with less memory. If we split it into multiple parts, then when there are fewer connections a channel may enable and disable auto-read frequently, and it is possible that some parts still have room to pend more messages while others are throttled.

The LongAdder would not allow you to remove the check tasks. The only difference between a LongAdder and an AtomicLong is that the LongAdder is optimized for writes (thread-local counters), while its read path is much slower and meant to be used infrequently.

The IO thread count is not that high; if we add up just a few numbers each time, would that really become a bottleneck?

@merlimat
Contributor Author

merlimat commented Jul 1, 2020

I just want to say that the pending buffer is sensitive to write throughput, especially on machines with less memory. If we split it into multiple parts, then when there are fewer connections a channel may enable and disable auto-read frequently.

How much is "less memory"? Once you have tens of MB per thread (or per connection, if you have a single connection), there will be no further improvement from letting it use more memory.

It's the same reason the OS network stack doesn't let you grow the TCP window indefinitely. There are OS limits in place because (1) too much doesn't help and (2) it starves other users.

The current default setting we have is -XX:MaxDirectMemorySize=4g.
That means that with 16 cores you'd get, by default, 32 IO threads.
With this, each thread gets 128MB of buffer space, which is far greater than any TCP window size you'd get from Linux.
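Spelling out that arithmetic (figures as stated above; the 2 x cores default for IO threads is implied by 16 cores yielding 32 threads, and the variable names are just for illustration):

long maxDirectMemoryBytes = 4L * 1024 * 1024 * 1024;      // -XX:MaxDirectMemorySize=4g
int cores = 16;
int ioThreads = 2 * cores;                                // 32 IO threads by default
long perThreadBudget = maxDirectMemoryBytes / ioThreads;  // 134217728 bytes = 128MB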

Also, 4GB for a VM with 16 cores is a very low memory-to-CPU ratio; such a VM would typically have 64GB or more.

Conversely, in a container environment memory is usually capped lower, but so is the CPU limit. If you limit the CPUs on the broker container, the default IO thread count will adjust accordingly, balancing the situation.

If we split it into multiple parts, then when there are fewer connections a channel may enable and disable auto-read frequently.

We're already doing that with the per-connection pendingSendRequest limit (default 1K) and there's no performance impact.

And it is possible that some parts still have room to pend more messages while others are throttled.

The IO thread count is not that high; if we add up just a few numbers each time, would that really become a bottleneck?

I'm not sure what you mean here.

@codelipenghui
Contributor

The IO thread count is not that high; if we add up just a few numbers each time, would that really become a bottleneck?
I'm not sure what you mean here.

Sorry, I didn't make it clear. I mean that if we have 32 IO threads, then for each message we sum 32 numbers; will this become a bottleneck for the broker? I just want to understand the disadvantages of using LongAdder.

@merlimat
Contributor Author

merlimat commented Jul 6, 2020

Sorry, I didn't make it clear. I mean that if we have 32 IO threads, then for each message we sum 32 numbers; will this become a bottleneck for the broker? I just want to understand the disadvantages of using LongAdder.

@codelipenghui With a LongAdder you'd still need to run a background thread to do the enforcement, because the increment side is low-contention but reading the "actual value" is on the slow path.

@jiazhai
Member

jiazhai commented Jul 14, 2020

@codelipenghui to help review this PR

Contributor

@rdhabalia rdhabalia left a comment

This is a frequently occurring issue for us, and I don't see any review/comments on #7499.
@merlimat if you have already tested this change then let's merge this approach.
👍

@rdhabalia
Contributor

/pulsarbot run-failure-checks

@codelipenghui
Contributor

@merlimat Could you please rebase onto the master branch, so that we can include this PR in 2.7.0?

@codelipenghui codelipenghui modified the milestones: 2.7.0, 2.8.0 Nov 17, 2020
Contributor

@315157973 315157973 left a comment

This can solve the current OOM problem

@315157973
Contributor

Assuming 32 IO threads and 8GB of direct memory, the buffer for each IO thread is 256MB. Bookie's write latency is about 10ms, so under stable conditions each thread can write 100 x 256MB, roughly 25GB per second, which is enough to saturate the bookies.

Contributor

@eolivelli eolivelli left a comment

+1

@codelipenghui
Contributor

@merlimat Could you please fix the conflicts so that we can advance this PR?

@hangc0276
Contributor

Great job, this can solve the current broker OOM. Would you please rebase onto the master branch and resolve the conflicts? @merlimat

Contributor

@ronfarkash ronfarkash left a comment

Could you finish this PR? It fixes a major issue in the project that should not be neglected. @merlimat

@eolivelli
Contributor

@merlimat do you want to rebase this onto current master?
It looks like a very good improvement.

@merlimat
Contributor Author

Yes, I’ll get this ready in the next few days. There are also some improvements I want to make to this PR.

@galrose
Contributor

galrose commented Apr 29, 2021

@merlimat that's great, thank you

@merlimat
Contributor Author

In a subsequent PR, I'll be adding a token-bucket based mechanism to handle uneven usage across multiple threads.

private final long maxMessagePublishBufferBytes;
private final long resumeProducerReadMessagePublishBufferBytes;
private volatile boolean reachMessagePublishBufferThreshold;
private final AtomicInteger pausedConnections = new AtomicInteger();
Contributor

The reading frequency here should be very low, since it's only used in unit tests or metrics; why not use a LongAdder?

Contributor Author

Good point, changed that.
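For reference, the LongAdder variant being asked for would look roughly like this; apart from pausedConnections, the method names are guesses and not necessarily what the PR ended up with.

// uses java.util.concurrent.atomic.LongAdder
private final LongAdder pausedConnections = new LongAdder();

public void pausedConnections(int numberOfConnections) {
    // add() only touches a per-thread cell, so there is no contended CAS loop.
    pausedConnections.add(numberOfConnections);
}

public void resumedConnections(int numberOfConnections) {
    pausedConnections.add(-numberOfConnections);
}

// Only read from tests and metrics, so the slower sum() is acceptable here.
public long getPausedConnections() {
    return pausedConnections.sum();
}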

@merlimat merlimat merged commit 2a522c8 into apache:master May 2, 2021
@merlimat merlimat deleted the max-pending-bytes branch May 2, 2021 17:44
BewareMyPower added a commit to BewareMyPower/kop that referenced this pull request May 11, 2021
…native#488)

This PR upgrades the pulsar dependency to 2.8.0-rc-202105092228, which has
two major API changes.

apache/pulsar#10391 changed the `LoadManager` API, so
`MetadataCache` is used instead of `ZookeeperCache` in this PR.

apache/pulsar#7406 changed the throttling
strategy. However, KoP currently differs from Pulsar in that the
produce call and its callback may run on different threads. KoP calls
`PersistentTopic#publishMessages` in a callback of
`KafkaTopicManager#getTopic` if the returned future is not completed
immediately; otherwise, it's called directly in the I/O thread. Therefore,
here we still use **channel based** publish bytes stats for
throttling, while apache/pulsar#7406 uses
**thread based** publish bytes stats.

The other refactors are:
1. Move the throttling related fields from `InternalServerCnx` to
   `KafkaRequestHandler`.
2. Use `BrokerService#getPausedConnections` to check whether the channel's
   auto-read is disabled, and adjust the tests accordingly.
jiazhai pushed a commit to streamnative/kop that referenced this pull request May 13, 2021
This PR upgrades the pulsar dependency to 2.8.0-rc-202105092228, which has two major API changes.

apache/pulsar#10391 changed the `LoadManager` API, so `MetadataCache` is used instead of `ZookeeperCache` in this PR.

apache/pulsar#7406 changed the throttling strategy. However, KoP currently differs from Pulsar in that the produce call and its callback may run on different threads. KoP calls `PersistentTopic#publishMessages` in a callback of `KafkaTopicManager#getTopic` if the returned future is not completed immediately; otherwise, it's called directly in the I/O thread. Therefore, here we still use **channel based** publish bytes stats for throttling, while apache/pulsar#7406 uses **thread based** publish bytes stats.

The other refactors are:
1. Move the throttling related fields from `InternalServerCnx` to `KafkaRequestHandler`.
2. Use `BrokerService#getPausedConnections` to check whether the channel's auto-read is disabled, and adjust the tests accordingly.



* Fix LoadManager interface
* Refactor publish throttling
* Remove ZookeeperCache usage
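A purely illustrative sketch of the channel-based accounting described in the commit message above (none of these names come from KoP): because the produce call and its completion callback may run on different threads, the counter has to be thread-safe, unlike the per-IO-thread counters used by apache/pulsar#7406.

import io.netty.channel.Channel;
import java.util.concurrent.atomic.AtomicLong;

public class ChannelPublishStats {
    private final AtomicLong pendingBytes = new AtomicLong();
    private final Channel channel;
    private final long maxPendingBytes;
    private final long resumeThreshold;

    public ChannelPublishStats(Channel channel, long maxPendingBytes) {
        this.channel = channel;
        this.maxPendingBytes = maxPendingBytes;
        this.resumeThreshold = maxPendingBytes / 2; // illustrative choice
    }

    // Called when a produce request is handed to the persistence layer.
    public void onProduce(long size) {
        if (pendingBytes.addAndGet(size) > maxPendingBytes) {
            channel.config().setAutoRead(false);
        }
    }

    // Called from the completion callback, possibly on a different thread.
    public void onProduceCompleted(long size) {
        if (pendingBytes.addAndGet(-size) <= resumeThreshold) {
            channel.config().setAutoRead(true);
        }
    }
}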