
LB: introduce randomization in locality LB scheduler initialization#32075

Merged
adisuissa merged 6 commits into envoyproxy:main from adisuissa:locality_sched_plumb_seed
Feb 6, 2024

Conversation

@adisuissa
Contributor

Commit Message: LB: introduce randomization in locality LB scheduler initialization
Additional Description:
Following up on the randomized picking introduced in #31592, this PR plumbs a seed into the locality-LB.
Prior to this PR, whenever a new EDS assignment arrived, the locality pick order was reset to a fixed starting point.
For example, with 2 localities, locality_A with weight 99 and locality_B with weight 1, an EDS update (one that doesn't change the weights) could restart the "pick list" from the beginning.
This may skew the traffic distribution, especially across a fleet of Envoys that receive periodic assignment updates while serving very low QPS.

The change ensures that the locality pick order starts from a randomized point.

Risk Level: Medium - a bug fix that may impact current traffic flows.
Testing: Added tests, and updated old ones.
Docs Changes: None.
Release Notes: Added.
Platform Specific Features: N/A.
Runtime guard: Added envoy.reloadable_features.edf_lb_locality_scheduler_init_fix
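To make the idea concrete, here is a minimal, self-contained sketch of the mechanism the PR describes. It is an assumed shape, not Envoy's actual `EdfScheduler` code: a weighted (EDF-style) picker whose constructor performs `seed % num_entries` warm-up picks, so instances built with different seeds start at different offsets in the pick order instead of all restarting from the same fixed position.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Illustrative sketch (assumed shape, not Envoy's actual EdfScheduler):
// without a seed, every rebuild of the weighted schedule starts from the
// same fixed position. Performing `seed % num_entries` warm-up picks at
// construction time desynchronizes instances with different seeds.
struct Entry {
  double weight;
  int id;
  double deadline = 0.0;  // virtual time of this entry's next selection
};

class SeededEdfSketch {
 public:
  SeededEdfSketch(std::vector<Entry> entries, uint64_t seed) {
    for (Entry& e : entries) {
      e.deadline = 1.0 / e.weight;
      heap_.push(e);
    }
    // Desynchronize: discard a seed-dependent number of initial picks.
    const uint64_t pre_picks = entries.empty() ? 0 : seed % entries.size();
    for (uint64_t i = 0; i < pre_picks; ++i) pick();
  }

  int pick() {
    Entry e = heap_.top();
    heap_.pop();
    e.deadline += 1.0 / e.weight;  // reschedule proportionally to weight
    heap_.push(e);
    return e.id;
  }

 private:
  struct LaterDeadline {
    bool operator()(const Entry& a, const Entry& b) const {
      return a.deadline > b.deadline;  // min-heap on deadline
    }
  };
  std::priority_queue<Entry, std::vector<Entry>, LaterDeadline> heap_;
};
```

Two such instances fed the same entries but different seeds begin their pick sequences at different points, which is exactly the desynchronization the fleet-wide skew fix relies on.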

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
@repokitteh-read-only

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).

Caused by: #32075 was opened by adisuissa.

@adisuissa
Contributor Author

Assigning load-balancing relevant maintainers.
This PR is mostly plumbing (unfortunately needed in many places of the code) and tests.
/assign @wbpcode @nezdolik @htuch

locality_entries.emplace_back(std::make_shared<LocalityEntry>(i, effective_weight));
}
}
// If not all effective weights were zero, create the scheduler.
Member

Do we still want to create a scheduler even if all weights are equal?

Contributor Author

AFAICT this was the behavior prior to this PR, and I haven't modified that.

This comment (and code) is looking at the case where all weights are zero. The end-result of this method (both prior to this change and after this change) is that if all weights are zero, the locality_scheduler will be null.
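The behavior described here can be illustrated with a small standalone sketch (assumed shape and hypothetical names, not Envoy's actual types): zero-weight localities are never added as entries, so when every effective weight is zero the entry list stays empty and the scheduler pointer remains null.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Minimal illustration (assumed shape, not Envoy's actual types):
// zero-weight localities are skipped, so if all weights are zero the
// entry list is empty and no scheduler is created.
struct SchedulerStub {
  std::size_t num_entries;
};

std::unique_ptr<SchedulerStub> buildScheduler(const std::vector<uint32_t>& weights) {
  std::vector<uint32_t> entries;
  for (uint32_t w : weights) {
    if (w > 0) entries.push_back(w);  // skip zero-weight localities
  }
  if (entries.empty()) return nullptr;  // all weights zero -> null scheduler
  return std::make_unique<SchedulerStub>(SchedulerStub{entries.size()});
}
```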

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
…b_seed

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
hosts_added, hosts_removed, absl::nullopt, absl::nullopt);
priority_set_.updateHosts(
0, HostSetImpl::partitionHosts(hosts_, hosts_per_locality_), {}, hosts_added, hosts_removed,
server_context_.api().randomGenerator().random(), absl::nullopt, absl::nullopt);
Member

I checked randomGenerator().random() briefly; it does not seem to rely on any properties that would increase the chance of duplicate seeds across a large Envoy fleet (e.g. by using the current time in the seed).

Contributor Author

I think it is quite challenging to sync the time across a fleet, especially if we are thinking in the context of EDS updates that are being sent to many Envoys (and each triggers the update at a slightly different time).
BTW I'm also basing this code on the comment that already exists in:

// Seed to allow us to desynchronize load balancers across a fleet. If we don't
// do this, multiple Envoys that receive an update at the same time (or even
// multiple load balancers on the same host) will send requests to
// backends in roughly lock step, causing significant imbalance and potential
// overload.
const uint64_t seed_;

locality_scheduler = std::make_unique<EdfScheduler<LocalityEntry>>(
EdfScheduler<LocalityEntry>::createWithPicks(
locality_entries,
[](const LocalityEntry& entry) { return entry.effective_weight_; }, seed));
Member

do we need to ensure that seed > 0?

Contributor Author

seed is defined as a uint64_t in the input of the current function, and as a uint32_t in the input of createWithPicks(), so both are unsigned and therefore non-negative. The implicit cast seems fine to me, as it just determines the number of initial picks, which would probably be bad anyway if it were more than 10^6.
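The narrowing behavior being discussed can be demonstrated with a tiny sketch (assumed semantics and hypothetical function name, not Envoy's exact signatures): a uint64_t seed passed into a uint32_t parameter is reduced modulo 2^32 by the implicit conversion, and the scheduler then bounds the warm-up count further, so a huge seed never translates into a huge number of initial picks.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the narrowing (assumed semantics, hypothetical name):
// the uint64_t -> uint32_t conversion happens at the call site and is
// well-defined for unsigned types (reduction modulo 2^32); a further
// `% num_entries` bounds the number of warm-up picks.
inline uint32_t effectivePicks(uint32_t seed, uint32_t num_entries) {
  return num_entries == 0 ? 0 : seed % num_entries;
}
```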

Member

I was curious whether a 0 value (which is a valid uint64_t) can lead to unpredictable results.

Contributor Author

This was validated in this test.

Member

@nezdolik nezdolik left a comment

if edf_lb_locality_scheduler_init_fix flag is now on by default, we need to test both code branches in the tests

…b_seed

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
@adisuissa
Contributor Author

if edf_lb_locality_scheduler_init_fix flag is now on by default, we need to test both code branches in the tests

In some cases we can do a parameterized test that works with both. I think the question is what is the added value of this, and in this case I'm not sure the benefit outweighs the cost.
I'm not opposing doing this, just looking for the reasons to increase test time.

@adisuissa
Contributor Author

Thanks for the detailed review and comments @nezdolik!

@nezdolik
Member

if edf_lb_locality_scheduler_init_fix flag is now on by default, we need to test both code branches in the tests

In some cases we can do a parameterized test that works with both. I think the question is what is the added value of this, and in this case I'm not sure the benefit outweighs the cost. I'm not opposing doing this, just looking for the reasons to increase test time.

Would not executing code from both if branches affect coverage?

nezdolik
nezdolik previously approved these changes Jan 30, 2024
@adisuissa
Contributor Author

if edf_lb_locality_scheduler_init_fix flag is now on by default, we need to test both code branches in the tests

In some cases we can do a parameterized test that works with both. I think the question is what is the added value of this, and in this case I'm not sure the benefit outweighs the cost. I'm not opposing doing this, just looking for the reasons to increase test time.

Would not executing code from both if branches affect coverage?

Yes. I added a specific test that sets the runtime-flag to false, to validate that it works as expected, and that will cover the old-path code.
If you think there are other tests that may benefit from that, let me know which.

@nezdolik
Member

if edf_lb_locality_scheduler_init_fix flag is now on by default, we need to test both code branches in the tests

In some cases we can do a parameterized test that works with both. I think the question is what is the added value of this, and in this case I'm not sure the benefit outweighs the cost. I'm not opposing doing this, just looking for the reasons to increase test time.

Would not executing code from both if branches affect coverage?

Yes. I added a specific test that sets the runtime-flag to false, to validate that it works as expected, and that will cover the old-path code. If you think there are other tests that may benefit from that, let me know which.

np, sorry I missed that test, there was a lot of affected test code :)

// If not all effective weights were zero, create the scheduler.
if (!locality_entries.empty()) {
locality_scheduler = std::make_unique<EdfScheduler<LocalityEntry>>(
EdfScheduler<LocalityEntry>::createWithPicks(
Member

@wbpcode wbpcode Feb 1, 2024

This is LGTM overall.

I have only one question. The createWithPicks will use the seed % 429496729 as the pre picks number.
And because the seed is a random value from random() method, so it's possible we may do 429496728 times pre-picks when we update the hosts.

Do we acutally evalute the possible performance impact at the worst case? If the users update the host set frequently, will this bring huge performance burden?

(PS: the host scheduler will use seed % hosts.size() as the pre picks number, I think this should be some similar things? like we can use seed % locality_entries.size() or std::max(8, seed % locality_entries.size()) here?

Contributor Author

This is LGTM overall.

I have only one question. createWithPicks will use seed % 429496729 as the number of pre-picks. And because the seed is a random value from the random() method, it's possible we may do 429496728 pre-picks when we update the hosts.

Have we actually evaluated the possible performance impact in the worst case? If users update the host set frequently, will this bring a huge performance burden?

While conceptually the number of pre-picks could be large, the code is bounded by the number of entries/weights (N; see here), and the operation has O(N*log(N)) complexity, which is the same as it is now (because the localities are added one after another to the priority queue).
A proof of the bounded number of iterations can be found in the doc linked from this comment.

(PS: the host scheduler will use seed % hosts.size() as the number of pre-picks; shouldn't this be similar? E.g. we could use seed % locality_entries.size() or std::max(8, seed % locality_entries.size()) here?)

Yes, the next PR will be to fix the wrong initialization there as well.
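The bounded-cost argument in this thread can be sketched as follows. This is a hypothetical reconstruction of the idea, not Envoy's actual createWithPicks code: simulating `picks` individual EDF selections would cost O(picks * log N), but because EDF hands entry i roughly picks * w_i / W of those selections (W = total weight), each entry's post-pick deadline can be derived in closed form from the virtual time T = picks / W, so the work depends only on the number of entries N.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical reconstruction of the closed-form initialization idea
// (not Envoy's actual code): derive each entry's deadline after
// `picks` selections directly, instead of simulating every pick.
struct InitEntry {
  double weight;
  double deadline;         // virtual time of the entry's next selection
  uint64_t implied_picks;  // selections the entry would have received
};

std::vector<InitEntry> initWithPicks(const std::vector<double>& weights,
                                     uint64_t picks) {
  double total = 0.0;
  for (double w : weights) total += w;
  const double t = static_cast<double>(picks) / total;  // virtual time after `picks`
  std::vector<InitEntry> out;
  for (double w : weights) {
    const uint64_t got = static_cast<uint64_t>(std::floor(t * w));
    // The (got + 1)-th selection of this entry happens at (got + 1) / w.
    out.push_back({w, static_cast<double>(got + 1) / w, got});
  }
  return out;
}
```

Because each per-entry count is floored, the total implied picks can undershoot the requested count by at most N, which is why the cost stays bounded regardless of how large the random seed is.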

Member

@wbpcode wbpcode Feb 1, 2024

I actually don't completely get it. I will read the code in more detail tomorrow. Will unblock the PR first.

Contributor Author

feel free to reach out directly and I can try to give a guided explanation.

Member

I got it. Sorry for the previous negligence.

I have to say, it's awesome 👍.

Member

@wbpcode wbpcode left a comment

LGTM overall. I marked it as request changes because there is an important question to be checked first.

wbpcode
wbpcode previously approved these changes Feb 1, 2024
htuch
htuch previously approved these changes Feb 2, 2024
…b_seed

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
@adisuissa adisuissa dismissed stale reviews from htuch, wbpcode, and nezdolik via 1e82534 February 5, 2024 13:56
@adisuissa
Contributor Author

/retest

Member

@wbpcode wbpcode left a comment

LGTM. Thanks.
