Add ThresholdShedder Strategy for loadbalancer and expose loadbalance metric to prometheus #6772

Merged
merged 3 commits into from
Apr 22, 2020

hangc0276
Contributor

Motivation

Currently the only overload shedder strategy is OverloadShedder, which collects each broker's max resource usage and compares it with a threshold (default value is 85%). When the max resource usage reaches the threshold, it triggers bundle unloading, which migrates some of the bundles to other brokers. This strategy has the following drawbacks:

  • Other overload shedder strategies cannot be configured.
  • The threshold value is hard to choose. The default threshold is 85%, but a broker's max resource usage rarely reaches 85%, which leads to unbalanced traffic between brokers. The read cache hit rate of a broker with heavy traffic will decrease.
  • When most brokers of a Pulsar cluster are restarted at the same time, all traffic in the cluster goes to the remaining brokers. The restarted brokers then receive no traffic for a long time, because the max resource usage of the remaining brokers does not reach the threshold.
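
The restart scenario in the last point can be made concrete with a small sketch. This is not Pulsar code: the function name `overloaded_brokers` and the usage numbers are made up for illustration, and only the 85% default comes from the description above.

```python
def overloaded_brokers(max_usage_by_broker, threshold=85.0):
    """Return the brokers whose max resource usage (in percent) reaches a fixed threshold."""
    return [broker for broker, usage in max_usage_by_broker.items() if usage >= threshold]


# Hypothetical cluster right after two of four brokers were restarted:
# the two surviving brokers carry all traffic, the restarted ones carry none.
usage = {"broker-1": 70.0, "broker-2": 72.0, "broker-3": 0.0, "broker-4": 0.0}

# No broker reaches 85%, so nothing is unloaded and the imbalance persists.
candidates = overloaded_brokers(usage)
```

With these numbers `candidates` is empty, so no bundle is moved to the idle brokers.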

Changes

  1. Support multiple overload shedder strategies, which only need to be configured in broker.conf
  2. Add a ThresholdShedder strategy. The main idea is as follows:
    • Calculate the average resource usage of the brokers and compare each broker's resource usage with that average. If it is greater than the average plus a threshold, the overload shedder is triggered:
      broker resource usage > average resource usage + threshold
    • Each kind of resource (i.e. bandwithIn, bandwithOut, CPU, Memory, Direct Memory) has a weight (default is 1.0) that is applied when calculating a broker's resource usage.
    • Record the historical average resource usage of the Pulsar broker cluster. The new average resource usage is calculated as follows:
      new_avg = old_avg * factor + (1-factor) * avg
      new_avg: the newest average resource usage
      old_avg: the old average resource usage, calculated in the last round
      factor: the decrease factor, default value is 0.9
      avg: the average resource usage of the brokers
  3. Expose load balance metrics to Prometheus
  4. Fix a bug in OverloadShedder, which specified the unloaded bundle in the overloaded broker itself.
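
The ThresholdShedder idea in item 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: all function names are hypothetical, taking the max of the weighted per-resource values as a broker's usage is an assumption, comparing against the smoothed average (rather than the raw one) is an assumption, and the threshold of 10 is only an example value. The formula and the defaults of 0.9 and 1.0 come from the description above.

```python
def weighted_usage(usage, weights=None):
    """Usage of one broker: max over resources of (usage * weight); weight defaults to 1.0.

    Assumption: the description only says each resource has a weight,
    not how the weighted values are combined.
    """
    weights = weights or {}
    return max(value * weights.get(name, 1.0) for name, value in usage.items())


def smoothed_average(old_avg, avg, factor=0.9):
    """new_avg = old_avg * factor + (1 - factor) * avg"""
    return old_avg * factor + (1 - factor) * avg


def brokers_to_shed(usage_by_broker, old_avg, threshold, factor=0.9, weights=None):
    """Return (overloaded brokers, new_avg) for one round of the check."""
    per_broker = {b: weighted_usage(u, weights) for b, u in usage_by_broker.items()}
    avg = sum(per_broker.values()) / len(per_broker)
    new_avg = smoothed_average(old_avg, avg, factor)
    # broker resource usage > average resource usage + threshold
    overloaded = [b for b, u in per_broker.items() if u > new_avg + threshold]
    return overloaded, new_avg


# Hypothetical cluster: two loaded brokers and two freshly restarted ones (percent usage).
cluster = {
    "broker-1": {"cpu": 70.0, "memory": 40.0},
    "broker-2": {"cpu": 72.0, "memory": 35.0},
    "broker-3": {"cpu": 5.0, "memory": 5.0},
    "broker-4": {"cpu": 5.0, "memory": 5.0},
}
overloaded, new_avg = brokers_to_shed(cluster, old_avg=38.0, threshold=10.0)
```

Here the average usage is about 38%, so with the example threshold of 10 both `broker-1` and `broker-2` exceed it and are selected, even though neither is anywhere near a fixed 85% limit.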

Please help review this implementation. If it looks OK, I will add test cases.

@skyrocknroll
Contributor

@hangc0276 Thanks for this PR. This looks exciting

@sijie sijie added the area/broker and type/enhancement labels Apr 21, 2020
@sijie sijie added this to the 2.6.0 milestone Apr 21, 2020
Member

@sijie sijie left a comment

@hangc0276 this is a great feature! Looking pretty great!

@sijie
Member

sijie commented Apr 21, 2020

/pulsarbot run-failure-checks

@codelipenghui codelipenghui merged commit b9bdfa1 into apache:master Apr 22, 2020
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
… metric to prometheus (apache#6772)

@alexanderursu99

@hangc0276 Hi, I'm having issues running the ThresholdShedder. Is there any special configuration that should be pointed out? I believe I have it enabled correctly, but I don't see any load being shed when I would expect it, just from looking at the resource metrics.

I can share more details if needed.

If I should open a new issue for this, let me know!
