Add ThresholdShedder Strategy for loadbalancer and expose loadbalance metric to prometheus #6772

hangc0276 · 2020-04-20T03:07:36Z

Motivation

The only overload shedder strategy is OverloadShedder, which collects each broker's max resource usage and compares it with a threshold (default 85%). When the max resource usage reaches the threshold, it triggers bundle unloading, which migrates some of the bundles to other brokers. This strategy has the following drawbacks:

It does not support configuring other overload shedder strategies.
It is hard to choose the threshold value. The default is 85%, but a broker's max resource usage rarely reaches 85%, which leads to unbalanced traffic between brokers. The read cache hit rate of the brokers carrying heavy traffic then decreases.
When you restart most of the brokers in a Pulsar cluster at the same time, all traffic in the cluster goes to the remaining brokers. The restarted brokers then receive no traffic for a long time, because the max resource usage of the remaining brokers does not reach the threshold.

Changes

Support multiple overload shedder strategies, selected through configuration in broker.conf.
Add the ThresholdShedder strategy. The main idea is as follows:
- Calculate the average resource usage of the brokers and compare each broker's resource usage with it. If a broker's usage is greater than the average plus the threshold, the overload shedder is triggered:
  broker resource usage > average resource usage + threshold
- Each kind of resource (i.e. bandwithIn, bandwithOut, CPU, Memory, Direct Memory) has a weight (default 1.0) that is applied when calculating a broker's resource usage.
- Record the historical average resource usage of the Pulsar broker cluster. The new average resource usage is calculated as follows:
  new_avg = old_avg * factor + (1 - factor) * avg
  new_avg: the newest average resource usage
  old_avg: the average resource usage calculated in the previous round
  factor: the decrease factor, default value is 0.9
  avg: the current average resource usage of the brokers
Expose load balance metrics to Prometheus.
Fix a bug in OverloadShedder in which the unloaded bundle was assigned to the overloaded broker itself.
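
The threshold check and the history average described above can be sketched in Java roughly as follows. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, usage values are assumed to be percentages, and taking the maximum of the weighted per-resource values as a broker's usage is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the ThresholdShedder idea; not the actual Pulsar class.
class ThresholdShedderSketch {

    // Broker usage as the largest weighted resource value (assumption of this sketch).
    // usage[i] is a percentage for one resource kind, weights[i] its weight.
    static double weightedUsage(double[] usage, double[] weights) {
        double max = 0.0;
        for (int i = 0; i < usage.length; i++) {
            max = Math.max(max, usage[i] * weights[i]);
        }
        return max;
    }

    // Plain arithmetic mean of the per-broker usage values.
    static double average(Map<String, Double> usageByBroker) {
        double sum = 0.0;
        for (double u : usageByBroker.values()) {
            sum += u;
        }
        return usageByBroker.isEmpty() ? 0.0 : sum / usageByBroker.size();
    }

    // new_avg = old_avg * factor + (1 - factor) * avg
    static double smooth(double oldAvg, double avg, double factor) {
        return oldAvg * factor + (1 - factor) * avg;
    }

    // Brokers whose usage is greater than the average plus the threshold.
    static List<String> overloaded(Map<String, Double> usageByBroker,
                                   double avgUsage, double threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> e : usageByBroker.entrySet()) {
            if (e.getValue() > avgUsage + threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> usage = new LinkedHashMap<>();
        usage.put("broker-a", 80.0);
        usage.put("broker-b", 20.0);
        usage.put("broker-c", 20.0);

        double avg = average(usage);               // current round average
        double newAvg = smooth(40.0, avg, 0.9);    // 40.0 stands in for the last round
        System.out.println(overloaded(usage, newAvg, 10.0)); // prints [broker-a]
    }
}
```

With a factor of 0.9 the history average moves slowly, so a single round with unusual load changes the comparison baseline only slightly.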

Please help review this implementation. If it looks OK, I will add test cases.

skyrocknroll · 2020-04-20T05:34:19Z

@hangc0276 Thanks for this PR. This looks exciting

sijie

@hangc0276 this is a great feature! Looking pretty great!

sijie · 2020-04-21T19:49:41Z

/pulsarbot run-failure-checks

alexanderursu99 · 2021-04-21T19:00:50Z

@hangc0276 Hi, I'm having issues running the ThresholdShedder. Is there any special configuration that should be pointed out? I believe I have it enabled correctly, but I don't see any load being shed when I would expect it, judging from the resource metrics.

I can share more details if needed.

If I should open a new issue for this, let me know!

hangc0276 added 3 commits April 19, 2020 19:54

add threshold shedder loadbalancer

8e181b7

add license

4ff66ef

update document

3f8cf06

sijie assigned hangc0276 Apr 21, 2020

sijie added the area/broker and type/enhancement labels Apr 21, 2020

sijie added this to the 2.6.0 milestone Apr 21, 2020

sijie requested review from codelipenghui, jiazhai, merlimat and rdhabalia April 21, 2020 19:47

sijie approved these changes Apr 21, 2020

codelipenghui approved these changes Apr 22, 2020

codelipenghui merged commit b9bdfa1 into apache:master Apr 22, 2020

sijie mentioned this pull request May 20, 2020

[discussion] Pulsar release 2.6.0 #5819

Closed

tongsucn mentioned this pull request Nov 9, 2020

Unloaded namespace bundles may not be assigned to suitable brokers. #8492

Closed

sijie mentioned this pull request Nov 9, 2020

ISSUE-8492: Unloaded namespace bundles may not be assigned to suitable brokers. streamnative/pulsar-archived#1660

Closed

michaeljmarshall mentioned this pull request Sep 24, 2021

[Docs][LoadBalancing] Add/Improve Javadocs for LoadSheddingStrategy impls #12180

Merged

hangc0276 mentioned this pull request Dec 15, 2021

PIP-122: Change loadBalancer default loadSheddingStrategy to ThresholdShedder #13340

Closed

sijie mentioned this pull request Dec 15, 2021

ISSUE-13340: PIP-122: Change loadBalancer default loadSheddingStrategy to ThresholdShedder streamnative/pulsar-archived#3437

Closed

Technoboy- mentioned this pull request Nov 24, 2022

Discuss about reverting #16937 "skip mis-configured resource usage(>100%) in load balancer" #18598

Closed

sijie mentioned this pull request Nov 25, 2022

ISSUE-18598: Discuss about reverting #16937 "skip mis-configured resource usage(>100%) in load balancer" streamnative/pulsar-archived#5175

Open
