[Bug] RateLimiter lock contention when using the precise publish rate limiter #21442

Closed
Shawyeok opened this issue Oct 25, 2023 · 10 comments · Fixed by #21681
Assignees: lhotari
Labels: type/bug

Comments

@Shawyeok (Contributor) commented Oct 25, 2023

Search before asking

  • I searched in the issues and found nothing similar.

After enabling the preciseTopicPublishRateLimiterEnable configuration in broker.conf, if one topic on the broker exceeds its publish rate limit (for example, 1 MB/s), it causes lock contention, which increases the 99th percentile write latency of other topics that have not reached their rate limit.

The profile shows lock contention inside the method org.apache.pulsar.broker.service.PrecisPublishLimiter#tryAcquire.
pulsar-3.0.1.lock.profile.008.html.zip
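
For context, the shape of the problem can be illustrated with a simplified, hypothetical sketch (this is not the actual Pulsar source): publish calls for a rate-limited topic funnel through synchronized methods, and they run on broker IO threads that are shared with other topics, so time spent blocked on the monitor shows up as latency for unrelated topics.

```java
// Simplified, hypothetical sketch of a synchronized rate limiter (not the
// actual org.apache.pulsar.common.util.RateLimiter source). Every tryAcquire
// from a busy, throttled topic takes the same monitor, and the IO threads
// that wait here are shared with topics that never hit their limit.
public class SynchronizedRateLimiter {
    private final long permitsPerPeriod;
    private long acquiredPermits;

    public SynchronizedRateLimiter(long permitsPerPeriod) {
        this.permitsPerPeriod = permitsPerPeriod;
    }

    public synchronized boolean tryAcquire(long permits) {
        if (acquiredPermits + permits > permitsPerPeriod) {
            return false; // over the limit for this period
        }
        acquiredPermits += permits;
        return true;
    }

    // Invoked by a scheduled task at the start of each rate period.
    public synchronized void renewPermits() {
        acquiredPermits = 0;
    }
}
```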

Minimal reproduce step

Noticed in a fork of 2.8.1; reproduced in 3.0.1 on CentOS 7.9 (kernel 3.10.0-1160.99.1.el7.x86_64).

pulsar-admin namespaces create public/ratelimited
pulsar-admin namespaces set-publish-rate -b 1048576 public/ratelimited
pulsar-admin namespaces policies public/ratelimited
pulsar-perf produce persistent://public/default/no-rate-limit
pulsar-perf produce -r 1600 -ioThreads 4 -n 100 persistent://public/ratelimited/topic-1

What did you expect to see?

A topic exceeding its publish rate limit should not affect the write latency of other topics.

What did you see instead?

The write latency of other topics increased as well.

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@Shawyeok added the type/bug label Oct 25, 2023
@1625567290 commented:

Hello, I have also been learning Pulsar recently. Have you found a solution to this problem?

@Shawyeok (Contributor, Author) commented Nov 3, 2023

Sorry for the late reply.

The current publish rate limiting is based on setting the autoRead flag of the Netty channel to false. However, autoRead is just a hint in Netty, not a guarantee (if anyone is familiar with this, a detailed explanation would be great). As a result, topics that have already been rate-limited may call the isTopicPublishRateExceeded method multiple times before the next rate-limiting period arrives, which is unnecessary. So, an optimization I'm thinking of is to set a flag after a topic triggers rate limiting, to avoid having to check whether the topic has already been rate-limited through the isTopicPublishRateExceeded method each time.

Of course, we can also optimize the implementation of org.apache.pulsar.common.util.RateLimiter. Perhaps an implementation based on Compare-And-Swap (CAS) would perform better than synchronized. All of these ideas need to be validated through experimentation.
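
As a rough illustration of the CAS idea, here is a minimal sketch of a period-renewed token bucket built on AtomicLong. All names are hypothetical; this is not Pulsar's actual RateLimiter, just the general shape such an implementation could take:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical CAS-based token bucket; illustrative only.
public class CasTokenBucket implements AutoCloseable {
    private final long permitsPerPeriod;
    private final AtomicLong availablePermits;
    private final ScheduledExecutorService renewer =
            Executors.newSingleThreadScheduledExecutor();

    public CasTokenBucket(long permitsPerPeriod, long periodMillis) {
        this.permitsPerPeriod = permitsPerPeriod;
        this.availablePermits = new AtomicLong(permitsPerPeriod);
        // Refill the bucket once per rate period, mirroring the broker's
        // period-based renewal.
        renewer.scheduleAtFixedRate(() -> availablePermits.set(permitsPerPeriod),
                periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Never blocks: returns false once the period's permits are exhausted. */
    public boolean tryAcquire(long permits) {
        while (true) {
            long current = availablePermits.get();
            if (current < permits) {
                return false;
            }
            if (availablePermits.compareAndSet(current, current - permits)) {
                return true;
            }
            // Lost the CAS race to a concurrent caller; retry with a fresh value.
        }
    }

    @Override
    public void close() {
        renewer.shutdownNow();
    }
}
```

Under contention, callers retry a cheap compareAndSet instead of parking on a monitor, so a heavily throttled topic would not keep the shared IO threads blocked; whether this actually performs better still needs to be measured, as noted above.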

@Shawyeok (Contributor, Author) commented Nov 7, 2023

There is some related discussion on the mailing list:

@1625567290 commented:

> The current publish rate limiting is based on setting the autoRead flag of the Netty channel to false. [...]

Sorry, I am a beginner with Pulsar and there is something I don't understand; could you help me? The reporter said that the write performance of other topics was affected. If preciseTopicPublishRateLimiterEnable only rate-limits the current topic, why would it affect other topics? I can't find the answer in your reply.

@Shawyeok (Contributor, Author) commented:

> If preciseTopicPublishRateLimiterEnable only rate-limits the current topic, why would it affect other topics?

Inside Pulsar, multiple topics depend on the same threads for writing messages (such as Netty's EventLoop threads and the OrderedExecutor). If those threads hit a performance bottleneck for any reason, such as the lock contention observed in this case, then even topics that are not directly causing the contention will be affected, as long as they are bound to the same threads.
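
A rough, self-contained illustration of that effect (toy Java, not Pulsar code): when two unrelated pieces of work share a single thread, any blocking in the first delays the second.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy illustration (not Pulsar code): two "topics" whose writes share one IO thread.
public class SharedThreadDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService sharedIoThread = Executors.newSingleThreadExecutor();

        // "topic-1" (rate limited): its publish handler blocks for 50 ms,
        // standing in for time spent waiting on a contended rate-limiter monitor.
        sharedIoThread.submit(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // "no-rate-limit" topic: its write is queued behind the blocked task,
        // so its latency grows even though it never touched the rate limiter.
        long start = System.nanoTime();
        sharedIoThread.submit(() -> {
            long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println("unrelated write waited ~" + waitedMs + " ms");
        }).get();

        sharedIoThread.shutdown();
    }
}
```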

@lhotari (Member) commented Nov 21, 2023

> Of course, we can also optimize the implementation of org.apache.pulsar.common.util.RateLimiter. Perhaps an implementation based on Compare-And-Swap (CAS) would perform better than synchronized. All of these ideas need to be validated through experimentation.

@Shawyeok I'm working on a non-blocking implementation that uses CAS. Please check https://github.com/lhotari/async-tokenbucket for the PoC and performance test.

In addition, I'll be fixing the way autoRead toggling is handled. (WIP at https://github.com/lhotari/pulsar/blob/lh-rate-limiter-improvement/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ThrottleTracker.java )
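
For readers unfamiliar with the mechanism, here is a minimal, hypothetical sketch of throttling a connection by toggling autoRead (the class and method names are invented; this is not the ThrottleTracker implementation linked above):

```java
import io.netty.channel.ChannelHandlerContext;

// Illustrative sketch only; not Pulsar code. When the rate limiter reports the
// topic as throttled, stop issuing new reads on the connection; when the rate
// period renews, resume reading.
public final class AutoReadThrottle {
    private boolean throttled; // accessed only from the channel's event loop

    public void onPublish(ChannelHandlerContext ctx, boolean rateExceeded) {
        if (rateExceeded && !throttled) {
            throttled = true;
            ctx.channel().config().setAutoRead(false); // stop issuing new reads
        }
    }

    public void onRatePeriodRenewed(ChannelHandlerContext ctx) {
        if (throttled) {
            throttled = false;
            ctx.channel().config().setAutoRead(true);  // resume reading
            ctx.read();                                // kick off the next read
        }
    }
}
```

Because autoRead only stops future reads, data already buffered can still be processed, so a handler still has to expect a few more publishes after throttling kicks in.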

@lhotari self-assigned this Nov 21, 2023
@lhotari (Member) commented Nov 22, 2023

> Hello, I am also learning pulsar recently, have you found a solution to this problem?

@1625567290 Just curious to learn about your motivation to start learning Pulsar here. This might not be the easiest way to learn Pulsar. :)

@Shawyeok (Contributor, Author) commented:

> @Shawyeok I'm working on a non-blocking implementation that uses CAS. Please check https://github.com/lhotari/async-tokenbucket for the PoC and performance test. [...]

@lhotari Cool! I will study and verify your solution as soon as possible after my vacation, which may be a few days.

@lhotari (Member) commented Nov 22, 2023

> @lhotari Cool! I will study and verify your solution as soon as possible after my vacation, which may be a few days.

Sounds good. Please check the blog post https://codingthestreams.com/pulsar/2023/11/22/pulsar-slos-and-rate-limiting.html for the full context.

@lhotari (Member) commented Dec 11, 2023

@Shawyeok There's now "PIP-322: Pulsar Rate Limiting Refactoring" (#21680) to address this issue. The changes are in the draft PR #21681 . Please review! Thanks
