
Publish rate limit on broker to avoid OOM #5513

Closed
jiazhai opened this issue Oct 31, 2019 · 16 comments · Fixed by #5710
@jiazhai
Member

jiazhai commented Oct 31, 2019

Out-of-memory errors are seen frequently when clients produce messages too quickly and use up all the direct memory.
It happens like this:

  1. The client sends messages to the broker;
  2. The broker calls the BookKeeper client to write the messages to the bookies, and holds the direct memory until the writes succeed.

If the direct memory in step 2 is not released in time, the broker eventually runs out of direct memory (OOM).

It would be great to have a way to detect memory pressure and apply a publish rate limit when memory is running low; once memory becomes available again, the rate limit would be lifted.
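For illustration, a minimal sketch of the kind of memory-pressure check this proposal implies, assuming Netty's direct-memory accounting is available; the class name and the 80% threshold are made up for this example, and the result would be used to pause and later resume publishes:

```java
import io.netty.util.internal.PlatformDependent;

// Hypothetical helper, not part of Pulsar: reports whether Netty direct memory
// usage has crossed an assumed threshold.
public final class DirectMemoryPressure {

    private static final double THRESHOLD = 0.8; // assumed: throttle above 80% usage

    private DirectMemoryPressure() {}

    public static boolean underPressure() {
        long max = PlatformDependent.maxDirectMemory();
        long used = PlatformDependent.usedDirectMemory();
        if (max <= 0 || used < 0) {
            // Limit unknown or usage not tracked in this configuration; report no pressure.
            return false;
        }
        return (double) used / max >= THRESHOLD;
    }
}
```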

@jiazhai jiazhai added the type/feature The PR added a new feature or issue requested a new feature label Oct 31, 2019
@jiazhai jiazhai added this to the 2.5.0 milestone Oct 31, 2019
@merlimat
Contributor

There's already a per-connection limit on the maximum outstanding entries between the broker and the bookies. Do you have a precise scenario where this happens?

@rdhabalia
Contributor

@jiazhai have you started seeing the OOM recently in 2.5, or was it there before 2.5?
Are you proposing a publish rate limit similar to the one implemented in #3985?

@sijie
Member

sijie commented Nov 4, 2019

@rdhabalia

We have users doing load tests against the Pulsar broker. The behavior we have observed is as follows:

The disks are overwhelmed, the entries queue up on the broker side, and as the load test continues the broker goes OOM.

The tests were done on 2.4.2. They are upgrading the cluster to 2.5.0 and will run the tests again.

Regarding the feature you introduced in #3985, we have asked them to try it out. However, they can't really predict the traffic and set a good publish rate in advance. Hence we are looking for a better solution at the broker level. We are thinking of reusing the mechanism you introduced in #3985: the idea is to disable autoRead when the number of pending requests exceeds a threshold or the direct memory usage exceeds a threshold.
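A rough sketch of the disable/re-enable autoRead pattern described above (illustrative only, not the actual ServerCnx or #3985 code; the class name, method names, and threshold are assumptions, and every inbound message is treated as a publish for simplicity):

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

import java.util.concurrent.atomic.AtomicInteger;

public class PublishBackpressureHandler extends ChannelInboundHandlerAdapter {

    private static final int MAX_PENDING = 1000;   // assumed threshold
    private final AtomicInteger pendingPublishes = new AtomicInteger();

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        if (pendingPublishes.incrementAndGet() >= MAX_PENDING) {
            // Stop reading from this socket; TCP flow control then pushes back on the producer.
            ctx.channel().config().setAutoRead(false);
        }
        super.channelRead(ctx, msg);
    }

    // To be called when the corresponding bookie write completes (wiring omitted).
    public void onPublishCompleted(ChannelHandlerContext ctx) {
        if (pendingPublishes.decrementAndGet() < MAX_PENDING
                && !ctx.channel().config().isAutoRead()) {
            ctx.channel().config().setAutoRead(true);
        }
    }
}
```

The same toggle could equally be driven by a direct-memory check like the one sketched earlier.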

@jiazhai
Member Author

jiazhai commented Nov 4, 2019

Hi @rdhabalia @merlimat, thanks for your comments, and thanks @rdhabalia for the great feature in #3985.
Recently a user has been trying to use Pulsar together with Flink, producing a significant volume of messages, and they hit direct-memory OOM very often.
#3985 and the setting that Matteo mentioned may solve part of the issue, but the OOM in the broker may still occur.
I think we could leverage part of the change in #3985 and add another broker-scope config to do the publish rate limiting.

@rdhabalia
Contributor

the setting that Matteo mentioned may solve part of the issue, but the OOM in the broker may still occur.

Is it possible for you to reproduce the OOM with the master code-base? Just make sure the broker has the change from #3985; you don't have to enable publish-rate limiting. I know disabling auto-read was not always working with ByteToMessageDecoder for this reason, and I think that's taken care of in #3985.

So if you are still able to reproduce the OOM, can you please provide the steps to reproduce and the error log?

@sijie
Member

sijie commented Nov 20, 2019

@rdhabalia

We are able to reproduce the OOM problem with the master code-base (including your code change and without enabling publish-rate limiting).

The steps to reproduce:

  1. Set up a BookKeeper cluster with 10 nodes. Each node has one SSD disk for both the journal and the ledger directory.
  2. Set up a broker cluster with 10 nodes.
  3. Create a topic with 20 partitions.
  4. Launch a Flink job with a data-generator source (without throttling) and a Flink Pulsar sink (https://github.com/streamnative/pulsar-flink), with parallelism == 20. The producer setting in the Pulsar sink is the default producer setting.
  5. Run the Flink job for a while; you can observe the OOM very quickly.
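For reference, the load in step 4 is roughly equivalent to running one unthrottled producer loop per sink subtask (20 in this test). A minimal stand-alone sketch, with placeholder service URL, topic, and payload size, and the producer left on its defaults as in the test:

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

// Rough stand-in for one subtask of the unthrottled data-generator + Pulsar sink.
public class UnthrottledProducerLoop {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/load-test")
                .create();
        byte[] payload = new byte[1024];
        while (true) {
            // Fire-and-forget async sends with no pacing, like the generator without throttling.
            producer.sendAsync(payload);
        }
    }
}
```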

The heap dump of the broker shows a lot of pending entries, because the Flink job is producing more than the BookKeeper cluster can accept; all the entries accumulate at the broker and cause it to crash with OOM.

The expectation is that the broker should degrade gracefully when it reaches its capacity limit and apply backpressure to the clients. The broker shouldn't crash due to OOM.

If we don't provide the capability Jia proposed in #5710, Pulsar can't be used for high-volume ingestion workloads.

@rdhabalia
Contributor

@merlimat
There's already a per-connection limit on the maximum outstanding entries between the broker and the bookies. Do you have a precise scenario where this happens?

@sijie
create a topic with 20 partitions. with parallelism == 20. The producer setting in the pulsar sink is the default producer setting.

I gave it a try with a similar setup and somehow I couldn't reproduce it. I think the reason the broker doesn't go OOM is what @merlimat mentioned: the broker restricts the max pending publish requests per connection.
https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L130

Also, introducing a counter across all topics for throttling could become a bottleneck on the publish path for all topics, so this feature might not be recommended for most users.

So I would recommend depending on maxPendingRequestPerConnection rather than adding more complexity, and if that's not working then it's worth investigating why disabling the channel still causes OOM. I have also created #5742, which allows users to configure the max pending requests per connection if needed.

@sijie
Member

sijie commented Nov 25, 2019

@rdhabalia

I gave it a try with a similar setup and somehow I couldn't reproduce it.

I am not sure how you set it up. Are you running the Flink connector, or are you simulating it? Did you run with enough parallelism to stress-test the cluster?

OOM is a common occurrence during our tests.

The broker restricts the max pending publish requests per connection.

But the broker still faces OOM when the number of connections increases, no?

Also, introducing a counter across all topics for throttling could become a bottleneck on the publish path for all topics, so this feature might not be recommended for most users.

The feature we are adding is controlled by a flag. People need to pre-configure the rate that a broker can accept. That rate is typically aligned with your NIC configuration, e.g. 80% of your NIC bandwidth (for a 10 Gbit/s NIC, roughly 1 GB/s). We might make it smarter by automatically adjusting the rate based on NIC bandwidth and memory usage, but we will always make it easy to turn on/off with a flag.

What is the side effect of adding this rate limiter at the broker level?

depending on maxPendingRequestPerConnection rather than adding more complexity

We can't really depend on maxPendingRequestPerConnection, because when you have a lot of concurrent clients (connections), that setting doesn't work. Also, using the number of requests doesn't give an accurate estimate, because the size of each request varies between applications: for example, 1,000 pending 1 KB requests hold about 1 MB, while 1,000 pending 1 MB batched requests hold about 1 GB. A broker-level mechanism based on bytes rather than requests is much more robust than the other approaches.

You mentioned: "I know disabling auto-read was not always working with ByteToMessageDecoder for this reason, and I think that's taken care of in #3985." But the implementation Jia proposed uses your implementation from #3985. If the problem is taken care of by #3985, then there shouldn't be a problem in #5710; if it is not taken care of by #3985, then it is a problem for both the namespace-level rate limiter and the broker-level rate limiter, so I don't see why it would be a problem for adding a broker-level rate limiter but not for the namespace-level one.

@rdhabalia
Contributor

I am not sure how you set it up. Are you running the Flink connector?

No, I have a perf broker setup and am publishing messages with multiple perf-producer processes.

My only point is to figure out the root cause of the OOM and address it. If the broker is going OOM with 20 topics and 20-400 producers, then do we think auto-read is not working as expected, and that the broker keeps accepting messages after auto-read is disabled, holding them in memory and causing the OOM?

@sijie
Member

sijie commented Nov 25, 2019

My only point is to figure out the root cause of the OOM and address it.
do we think auto-read is not working as expected

Yes, we think auto-read is not working as expected, because maxPendingRequestPerConnection is not a useful mechanism: 1) it is not a broker-level mechanism; it only limits per connection, so as the number of connections to a broker grows, the effective limit changes; 2) it is not a bytes-based mechanism; it limits the number of requests, but request size changes when the batching behavior changes on the client side, so the number of bytes pending on the channel is non-deterministic; 3) it is almost impossible to pick a value for maxPendingRequestPerConnection under real production traffic (think about the peak traffic on China's shopping festival day).

Our requirement is dead simple: I have a broker, and no matter how many clients send requests and how the clients batch them, the broker should keep working normally and apply backpressure to the clients when the capacity the broker offers is exceeded. The ideal solution is to disable auto-read automatically when resource usage (memory, CPU, or network) exceeds a threshold. However, achieving a perfect solution like that takes time and is usually complicated. The closest solution we can provide is a broker-level rate limiter: 1) we can limit traffic based on an aggregate throughput (bytes/second); 2) the mechanism is already available from #3985; we just piggyback on it.
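A minimal sketch of what such a broker-level limiter could look like, assuming it piggybacks on the autoRead toggle; none of the names here come from the actual #5710 implementation, and the once-per-tick check is a simplification:

```java
import io.netty.channel.Channel;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Illustrative broker-wide byte-rate limiter: producer connections record published
// bytes with a cheap add(); a periodic tick aggregates the count and toggles autoRead
// on all registered connections when the per-interval budget is exceeded.
public class BrokerPublishRateLimiter {

    private final long maxBytesPerInterval;
    private final LongAdder bytesInInterval = new LongAdder();
    private final Set<Channel> producerChannels = ConcurrentHashMap.newKeySet();
    private volatile boolean throttled = false;

    public BrokerPublishRateLimiter(long maxBytesPerInterval, ScheduledExecutorService scheduler) {
        this.maxBytesPerInterval = maxBytesPerInterval;
        scheduler.scheduleAtFixedRate(this::onTick, 1, 1, TimeUnit.SECONDS);
    }

    public void register(Channel ch)   { producerChannels.add(ch); }
    public void unregister(Channel ch) { producerChannels.remove(ch); }

    // Hot path: called for every incoming publish with the message size.
    public void recordPublish(long numBytes) {
        bytesInInterval.add(numBytes);
    }

    // Slow path: runs once per tick, so enforcement lags by up to one interval.
    private void onTick() {
        boolean overBudget = bytesInInterval.sumThenReset() > maxBytesPerInterval;
        if (overBudget != throttled) {
            throttled = overBudget;
            boolean autoRead = !overBudget;
            producerChannels.forEach(ch -> ch.config().setAutoRead(autoRead));
        }
    }
}
```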

@sijie
Member

sijie commented Nov 25, 2019

@rdhabalia putting the OOM question aside, what are the concerns about adding a broker-level rate limiter as a feature? Looking at the features Pulsar already has, it usually offers both broker-level and namespace-level settings. As a feature, doesn't it make sense to also have a broker-level rate limiter in addition to the namespace-level one?

@rdhabalia
Contributor

putting the OOM question aside, what are the concerns about adding a broker-level rate limiter

I wanted to figure out the root cause of the OOM, as that is the issue we are mainly trying to target. I had verified the auto-read behavior while implementing #3985, so I was curious about the actual issue. Also, all threads are going to do the rate limiting using one counter, and that could be a bottleneck as well, which one would want to avoid.

@sijie
Member

sijie commented Nov 26, 2019

I wanted to figure out the root cause of the OOM, as that is the issue we are mainly trying to target

What else can we provide to help you figure out the root cause of the OOM?

I had verified the auto-read behavior while implementing #3985

We also verified that #3985 works as expected. As I explained, the auto-read mechanism is not the problem. The problem is that we need a broker-level metric/mechanism to decide when to disable auto-read, rather than a namespace-based or connection-based one. That's why we need a broker-level rate limiter.

also, all threads are going to do the rate limiting using one counter, and that could be a bottleneck as well, which one would want to avoid.

  1. This is a feature that users can turn on or off. If people don't need it, they don't have to turn on broker-level rate limiting.
  2. If the current counter is a bottleneck, we can look for a better counting approach, e.g. an improved counter implementation.

The question here is more about: do we need a broker-level mechanism to meet the requirements of running Pulsar under high load? If we do, then why not adopt the rate-limiter approach? If we don't, then what are the alternatives? (I have tried to explain above why the existing approaches don't meet our requirements.)

@sijie
Member

sijie commented Nov 26, 2019

all threads are going to do the rate limiting using one counter, and that could be a bottleneck as well, which one would want to avoid.

The current implementation of RateLimiter uses a LongAdder to count bytes and messages. A LongAdder maintains multiple cells for its counter; they are only aggregated when sum is called, and sum is called by the monitor, which runs periodically in a background thread. I am not sure how this would be a bottleneck.

LongAdder is also used for other metrics. Shouldn't they all be a concern as well?
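A tiny, self-contained illustration of the pattern described above, with made-up thread counts and sizes: many publisher threads call add() on the striped LongAdder, and only the periodic monitor aggregates the cells (here with sumThenReset):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class LongAdderCounterDemo {
    public static void main(String[] args) throws InterruptedException {
        LongAdder publishedBytes = new LongAdder();

        // Hot path: eight threads record "published" bytes; add() spreads updates across cells.
        ExecutorService publishers = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            publishers.submit(() -> {
                for (int n = 0; n < 1_000_000; n++) {
                    publishedBytes.add(1024);
                }
            });
        }

        // Monitor: only this periodic task aggregates the cells.
        ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
        monitor.scheduleAtFixedRate(
                () -> System.out.println("bytes this interval: " + publishedBytes.sumThenReset()),
                1, 1, TimeUnit.SECONDS);

        publishers.shutdown();
        publishers.awaitTermination(30, TimeUnit.SECONDS);
        monitor.shutdownNow();
    }
}
```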

@rdhabalia
Contributor

(I have tried to explain above why the existing approaches don't meet our requirements.)

Sure. If you feel this will be useful for your use case and for your users, then we can add it. 👍

@jiazhai
Member Author

jiazhai commented Nov 26, 2019

@rdhabalia @sijie thanks for your comments, and many thanks for @rdhabalia's PR #3985; it worked well in our tests.
With the steps @sijie mentioned, the issue can be reproduced easily on the user side, and users have complained a lot about this OOM. This should be a good way to solve it.

jiazhai added a commit that referenced this issue Nov 26, 2019
Fixes #5513 

### Motivation

Through #3985, users can set a publish rate for each topic, but the number of topics per broker is not limited. So there are cases where many topics are served by the same broker, and if each topic receives too many messages, the messages cannot be sent to BookKeeper in time; they are held in the broker's direct memory and eventually cause the broker to run out of direct memory.

### Modifications

- Add a broker-level publish rate limit based on #3985.
- Add unit tests.

### Verifying this change
Unit tests passed.