Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature][broker]Provide new load balance placement strategy implementation based on the least resource usage with weight #16281

Merged
merged 7 commits into from Jul 12, 2022

Conversation

HQebupt
Copy link
Contributor

@HQebupt HQebupt commented Jun 29, 2022

email discussion thread: https://lists.apache.org/thread/36vyyhndr4og175k2bz3mfdf5ctd2xky

Motivation

See PIP-182 #16274

Modifications

See #16274

The main idea of the new strategy is to unify the requirement of load shedding strategy and bundle placement strategy, which consider the resource usage with weight, including historical observations.

How to calculate a score for a broker ?

  • use its historical load and short-term load data with weight, which can solve the case of load jitter in a broker.

How to select a broker for assignning bundle ?

  • select a broker based on which one has the least average resource usage with weight.
  • the random selection algorithm is better than the minScore among low load brokers, and use loadBalancerAverageResourceUsageDifferenceThresholdShedderPercentage to adjust the size of the randomization pool

Below are screenshots of the effect of the new strategy with less time and fewer load balancing times.

image

Verifying this change

  • Make sure that the change passes the CI checks.

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

Check the box below and label this PR (if you have committer privilege).

Need to update docs?

  • doc
    Doc added in admin client api.

…tation based on the least resource usage with weight
@github-actions github-actions bot added the doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. label Jun 29, 2022
@HQebupt
Copy link
Contributor Author

HQebupt commented Jun 30, 2022

/pulsarbot run-failure-checks

@HQebupt HQebupt force-pushed the leastResourceUsageWithWeight branch from ce402a8 to 6a63c86 Compare July 1, 2022 10:56
final double diffThreshold =
conf.getLoadBalancerAverageResourceUsageDifferenceThresholdShedderPercentage() / 100.0;
brokerAvgResourceUsageWithWeight.forEach((broker, avgResUsage) -> {
if (avgResUsage + diffThreshold <= avgUsage) {
Copy link
Contributor

@heesung-sn heesung-sn Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this algo is better than the minScore, but we still have the small randomization pool issue here.
example) b1:4, b2:20, b3:20

Here, still, b1 will be repeatedly selected until b1's load is updated to the leader, though all of these brokers should be candidates as their loads are low. (As we discussed, when many bundles are unloading in a short period, this might lead to b1 ending up overloaded, like b1:90%, b2:20, b3:20)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. The original intention of this algorithm is to select a larger randomization pool. It doesn't work well for small clusters. But there is a compromise improvement point, that is, adjusting the diffThreshold to a negative number, which can expand the randomization pool.
(eg: b1:4, b2:20, b3:20, set diffThreshold=-15, all of brokers can be candidates.)
This diffThreshold parameter takes effect dynamically, so it can be adjusted dynamically according to the status of cluster. Do you have a better suggestion? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain the negative value in the doc, In the config definition.

Another edge case is that with the negative threshold, it might be selecting overloaded brokers.
example) threshold =-15, {b1=25, b2=100, b3=100, b4=100, b5=100} => b2, b3, b4, b5 will be selected.

And another edge case is that with a low variance of the usage distribution, it might be selecting none of the brokers.
example) threshold =10, {b1=24, b2=25, b3=26} => none will be selected.

If we want to care more about the bigger randomization pool, one could just randomly select any brokers below LoadBalancerBrokerOverloadedThresholdPercentage, which could be just another strategy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truely. An algorithm that can be applied to the many cases is good, but cannot cover all cases.

Another edge case is that with the negative threshold, it might be selecting overloaded brokers.
example) threshold =-15, {b1=25, b2=100, b3=100, b4=100, b5=100} => b2, b3, b4, b5 will be selected.

The operator need to know the status of the cluster and configure the threshold appropriately.

  • negative threshold applies when a large number of brokers have less load.
  • positive threshold applies when a large number of brokers have heavy load.

And another edge case is that with a low variance of the usage distribution, it might be selecting none of the brokers.
example) threshold =10, {b1=24, b2=25, b3=26} => none will be selected.

Assign randomly if none is selected.
image

Copy link
Contributor

@heesung-sn heesung-sn Jul 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is good we have a fallback case that just randomly selects any broker, when empty.

Please update the doc to further explain the edge cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer just keep the algorithm simple, otherwise it is hard to debug and understand.

The diffThreshold might filter all candidates out and finally just randomly select the broker from all candidates, which might be high load.

The principle of selection is to let the broker with lower load be selected, such as selecting a broker which is under avg load.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we could explain more about the negative and positive threshold considerations in the config broker.conf.

Otherwise, LGTM.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a expert of this part.
So I won't say much about the algorithm.

A part from that I believe that the change is good.
we are adding a new pluggable part of Pulsar, so a little discussion on dev@pulsar.apache.org is deserved

thanks for contributing this enhancement

@HQebupt
Copy link
Contributor Author

HQebupt commented Jul 5, 2022

I am not a expert of this part. So I won't say much about the algorithm.

A part from that I believe that the change is good. we are adding a new pluggable part of Pulsar, so a little discussion on dev@pulsar.apache.org is deserved

thanks for contributing this enhancement

Thx, here is the email discussion thread: https://lists.apache.org/thread/36vyyhndr4og175k2bz3mfdf5ctd2xky. I added it into the PR description.

@HQebupt HQebupt force-pushed the leastResourceUsageWithWeight branch from 449acb4 to 9fe58a4 Compare July 5, 2022 11:40
@HQebupt
Copy link
Contributor Author

HQebupt commented Jul 6, 2022

/pulsarbot run-failure-checks

@HQebupt HQebupt requested a review from eolivelli July 6, 2022 02:22
@HQebupt
Copy link
Contributor Author

HQebupt commented Jul 6, 2022

/pulsarbot run-failure-checks

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HQebupt
Copy link
Contributor Author

HQebupt commented Jul 6, 2022

/pulsarbot run-failure-checks

@HQebupt
Copy link
Contributor Author

HQebupt commented Jul 7, 2022

@Demogorgon314 Could you please review the PR again ?

conf/broker.conf Outdated Show resolved Hide resolved
@HQebupt
Copy link
Contributor Author

HQebupt commented Jul 8, 2022

/pulsarbot run-failure-checks

Copy link
Contributor

@Jason918 Jason918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HQebupt
Copy link
Contributor Author

HQebupt commented Jul 8, 2022

/pulsarbot run-failure-checks

1 similar comment
@HQebupt
Copy link
Contributor Author

HQebupt commented Jul 8, 2022

/pulsarbot run-failure-checks

@codelipenghui codelipenghui added this to the 2.11.0 milestone Jul 11, 2022
@codelipenghui codelipenghui added type/feature The PR added a new feature or issue requested a new feature area/broker labels Jul 11, 2022
@heesung-sn
Copy link
Contributor

LGTM

@codelipenghui codelipenghui merged commit d5088e3 into apache:master Jul 12, 2022
wuxuanqicn pushed a commit to wuxuanqicn/pulsar that referenced this pull request Jul 14, 2022
…tation based on the least resource usage with weight (apache#16281)

email discussion thread: https://lists.apache.org/thread/36vyyhndr4og175k2bz3mfdf5ctd2xky

### Motivation
See PIP-182 apache#16274 

### Modifications
See apache#16274 
 
The main idea of the new strategy is to unify the requirement of load shedding strategy and bundle placement strategy, which consider the resource usage with weight, including historical observations.

How to calculate a score for a broker ?
- use its historical load and short-term load data with weight, which can solve the case of load jitter in a broker.

How to select a broker for assignning bundle ?
- select a broker based on which one has the least average resource usage with weight.
- the random selection algorithm is better than the `minScore` among low load brokers, and use `loadBalancerAverageResourceUsageDifferenceThresholdShedderPercentage ` to adjust the size of the `randomization pool`
dragonls pushed a commit to dragonls/pulsar that referenced this pull request Oct 21, 2022
…tation based on the least resource usage with weight (apache#16281)
aloyszhang pushed a commit to aloyszhang/pulsar that referenced this pull request Jan 9, 2023
…ge request !151)

Squash merge branch 'release-2.8.1.4-bundle-place' into 'release-2.8.1.4'
Fixes #<xyz>

### Motivation
--story=878626611 新增基于资源权重的放置策略apache#16281 (merge request !150)


TAPD: --story=878626611
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. type/feature The PR added a new feature or issue requested a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants