load_balancing: Add Load Aware Locality LB Policy #43784
jukie wants to merge 35 commits into envoyproxy:main
Conversation
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
This approach removes the OrcaWeightManager extraction (#43695) as a prerequisite; the policy reads utilization directly via a new lightweight per-host

How it works:
Design decisions:
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to
markdroth
left a comment
This approach looks really good from an xDS API perspective!
I'll let one of the Envoy maintainers review the implementation.
api/envoy/extensions/load_balancing_policies/load_aware_locality/v3/load_aware_locality.proto
/lgtm api
Signed-off-by: Isaac Wilson <isaac.wilson514@gmail.com>
/retest

/gemini review
Code Review
This pull request introduces a new load balancing policy, load_aware_locality, which distributes traffic between localities based on real-time utilization data from ORCA reports. The implementation is robust, featuring EWMA smoothing for utilization metrics, a variance threshold to prefer the local locality when load is balanced, and a configurable probe percentage to ensure telemetry freshness.

The code is well-structured, with clear separation between main-thread weight computation and per-worker load balancing logic. It also includes comprehensive unit and integration tests that cover a wide range of scenarios, including dynamic load shifts and various configuration parameters.

The changes are additive and well-contained within the new extension. Overall, this is a high-quality contribution that adds significant new functionality to Envoy's load balancing capabilities.
Note: Security Review did not run due to the size of the PR.
/retest

needs main merge; the unrelated Go problem is fixed there
```cpp
return absl::InvalidArgumentError(
    absl::StrCat("Unsupported endpoint picking policy for load_aware_locality: ",
                 endpoint_picking_policy_factory->name(),
                 ". Child policies must support locality-scoped worker instantiation."));
```
I'm not super familiar with Envoy's implementation, so I don't understand why we have this restriction, but I will note that it seems sub-optimal. In principle, the child policy and the parent policy should have exactly the same API, so delegation should be possible to any child policy. If that's not the case, then maybe we need some changes in Envoy's LB policy API to make that possible.
This new locality policy only works with child policies whose logic can be preserved on a locality-scoped set of hosts. This is similar to the limitations with combining weighted clusters with consistent hashing (#21675). When the host set is split before the child policy runs, policies like ring_hash and maglev lose their expected behavior because they're building their hash structures from an incomplete view of the host set.
If you'd prefer that choice be up to the user I can remove this restriction and document it as a limitation.
load_aware_locality only works with child policies whose logic can be preserved on a locality-scoped set of hosts. Policies that depend on the full cluster's host set (such as the hashing-based policies rejected here) don't fit that model.
I don't understand what that means, since I'm not that familiar with the Envoy LB policy API. But in principle, I would expect that the parent policy and child policy should both accept the list of endpoints in the same structure, since they are both implementing the same LB policy API. Given that, it's not clear to me what restrictions we would have here: I would expect that the parent policy would essentially just filter the list of endpoints it passes down to the child policy, and the child policy would just pick from among that filtered set of endpoints. What am I missing?
I would expect that the parent policy would essentially just filter the list of endpoints it passes down to the child policy
That's exactly what happens here, but the problem is policies like ring_hash or maglev, which depend on building a hash structure over the full cluster host set. Once the parent narrows that to a single locality, the child is no longer implementing the same policy.
To really support that, the parent would need to implement something like locality-aware hashing, so that a given hash key maps consistently to the same locality. That isn't necessarily impossible in the future, but it is a bit at odds with the core goal of this policy, which is to route based on load. I can add that on top if you feel strongly, but I personally feel that would be better suited as a separate locality-aware hashing LB policy.
Again, this is similar to other limitations that exist in Envoy (a good list is in envoyproxy/gateway#5307 (comment)) so documenting it as a known limitation but still accepting these child policies could be reasonable as well. I'm open to changing this.
If I'm understanding right, you're saying that the child's behavior won't be exactly the same underneath the locality-picking policy, because of things like the fact that ring_hash won't have the same set of hosts in its ring and will therefore not wind up picking the same endpoint for the same request hash?
If so, I think that's fine and is not something we should be disallowing. It's certainly something that people need to understand if they choose to configure those policies underneath a locality-picking policy, but I don't think we should go out of our way to disallow that.
I also think this hard-coded allow list is going to wind up being brittle, because no one will ever remember to update it when we add new LB policies over time.
Thanks for the feedback, updated.
After a host removal tears down a locality's child LB, the routing snapshot may still assign weight to that locality, causing chooseHost to return nullptr despite other localities having healthy hosts. Add a pickLocalityLb helper that falls back to any locality with a usable child LB. Also propagate empty-delta priority updates to child LBs.
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
```cpp
class OrcaUtilizationStore {
public:
  double get() const { return value_.load(std::memory_order_relaxed) / kScale; }

  // Set utilization with a monotonic timestamp (milliseconds since epoch).
  void set(double utilization, int64_t monotonic_time_ms) {
    // Reject non-finite values: std::clamp and the uint32 cast have undefined behavior for them.
    if (!std::isfinite(utilization)) {
      return;
    }
    // Clamp to [0, 1]. The fixed-point uint32 encoding cannot represent values outside this range:
    // negative values would wrap around on cast, and values above 1.0 would overflow kScale.
    // ORCA also defines utilization in [0, 1].
    utilization = std::clamp(utilization, 0.0, 1.0);
    value_.store(static_cast<uint32_t>(utilization * kScale), std::memory_order_relaxed);
    last_update_time_ms_.store(monotonic_time_ms, std::memory_order_release);
  }

  // Returns the monotonic timestamp (ms since epoch) of the last set() call,
  // or 0 if set() has never been called.
  int64_t lastUpdateTimeMs() const { return last_update_time_ms_.load(std::memory_order_acquire); }

private:
  static constexpr double kScale = 10000.0;
  std::atomic<uint32_t> value_{0};
  std::atomic<int64_t> last_update_time_ms_{0};
};
```
Is this necessary here? I think we may prefer the solution in client_side_weighted_round_robin? That is to say, the LB will implement a specific HostLbPolicyData implementation and compute the necessary utilization.
The HostLbPolicyData is designed to do this task.
```cpp
/**
 * @return true if this LB policy reads per-host ORCA load report data.
 * When true, the router will parse ORCA load reports from upstream
 * responses.
 */
virtual bool requiresOrcaLoadReports() const { return false; }
```
Maybe we can move requiresOrcaLoadReports to the LoadBalancerConfig, and add an absl::optional<double> onOrcaLoadReport(const Upstream::OrcaLoadReport& report); there to calculate the utilization based on the configuration.
Then, combined with the new OrcaUtilizationStore, we may get a better solution than the previous HostLbPolicyData.
That makes sense, it would allow the LoadBalancerConfig to control which ORCA fields get extracted instead of the current hardcoded logic. I'll take a look at this.
I initially took a stab at extracting HostLbPolicyData into a central class with #43695 but I'll re-think this approach.
Sounds good, thanks for the feedback!

@paul-r-gall @wbpcode starting on the multi-entry approach in #43995 if you could take a look please. Let me know if you'd rather see that in this PR.
```
/*/extensions/load_balancing_policies/client_side_weighted_round_robin @wbpcode @adisuissa @efimki
/*/extensions/load_balancing_policies/override_host @yanavlasov @tonya11en
/*/extensions/load_balancing_policies/wrr_locality @wbpcode @adisuissa @efimki
/*/extensions/load_balancing_policies/load_aware_locality @wbpcode @adisuissa @efimki @jukie
```
You could remove me from the owner list, because we may not have enough bandwidth to sponsor this new extension. 😞
Commit Message: load balancing: add load-aware locality-picking LB policy
Additional Description: New LB policy (envoy.load_balancing_policies.load_aware_locality) that uses ORCA utilization data to choose localities based on real-time headroom, as proposed in #43665.
Risk Level: low (new extension behind a new config proto, core changes are additive)
Testing: Unit tests for config validation, weight computation (EWMA, variance threshold, probe percentage, all-overloaded fallback, topology changes), per-locality host partitioning, and weighted random selection. Integration test with a multi-locality cluster verifying traffic shifts in response to ORCA utilization reports.
Docs Changes: proto comments serve as initial documentation.
Release Notes: added new extension envoy.load_balancing_policies.load_aware_locality.
Platform Specific Features: n/a
Fixes #43665
API Considerations:
- LoadAwareLocality in envoy/extensions/load_balancing_policies/load_aware_locality/v3/.
- endpoint_picking_policy field accepts any endpoint-picking child policy.
- metric_names_for_computing_utilization is declared but not yet honored (documented in proto comments as TODO).
- orcaUtilization() on the HostDescription interface.

AI was used during implementation and for writing tests, but I fully understand the changes here.
LoadAwareLocalityinenvoy/extensions/load_balancing_policies/load_aware_locality/v3/.endpoint_picking_policyfield accepts any endpoint-picking child policy.metric_names_for_computing_utilizationis declared but not yet honored (documented in proto comments as TODO).orcaUtilization()onHostDescriptioninterface.AI was used during implementation and for writing tests but I fully understand the changes here