load_balancing: Add Load Aware Locality LB Policy #43784

Open

jukie wants to merge 35 commits into envoyproxy:main from jukie:load-aware-locality-lb
Conversation

@jukie
Contributor

@jukie jukie commented Mar 5, 2026

Commit Message: load balancing: add load-aware locality-picking LB policy

Additional Description: New LB policy (envoy.load_balancing_policies.load_aware_locality) that uses ORCA utilization data to choose localities based on real-time headroom, as proposed in #43665.

Risk Level: low (new extension behind a new config proto, core changes are additive)

Testing: Unit tests for config validation, weight computation (EWMA, variance threshold, probe percentage, all overloaded fallback, topology changes), per-locality host partitioning, and weighted random selection. Integration test with a multi-locality cluster verifying traffic shifts in response to ORCA utilization reports.

Docs Changes: proto comments serve as initial documentation.

Release Notes: added new extension envoy.load_balancing_policies.load_aware_locality.

Platform Specific Features: n/a
Fixes #43665
API Considerations:

  • New proto LoadAwareLocality in envoy/extensions/load_balancing_policies/load_aware_locality/v3/.
    • endpoint_picking_policy field accepts any endpoint-picking child policy.
    • metric_names_for_computing_utilization is declared but not yet honored (documented in proto comments as TODO).
  • New virtual method orcaUtilization() on HostDescription interface.

AI was used during implementation and for writing tests, but I fully understand the changes here.

@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #43784 was opened by jukie.

see: more, trace.

Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
@jukie jukie force-pushed the load-aware-locality-lb branch from 796eb56 to 8509dc8 on March 5, 2026 05:51
jukie added 2 commits March 4, 2026 23:27
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
@jukie jukie force-pushed the load-aware-locality-lb branch from e82f25f to 83f0b03 on March 5, 2026 07:29
@jukie
Contributor Author

jukie commented Mar 5, 2026

This approach removes the OrcaWeightManager extraction (#43695) as a prerequisite: the policy reads utilization directly via a new lightweight per-host OrcaUtilizationStore rather than reusing CSWRR's weight management machinery.

How it works:

  • A main-thread timer periodically reads host->orcaUtilization() from healthy hosts, averages per locality, applies EWMA smoothing, and computes capacity-weighted headroom (healthy_hosts × (1 - utilization)).
  • The resulting routing weights are pushed to workers via TLS.
  • Workers select a locality by weighted random and delegate endpoint selection to per-locality child LB instances (e.g., round_robin, client_side_weighted_round_robin).
LoadAwareLocalityLoadBalancer (main thread, ThreadAwareLoadBalancer)
  |
  |-- weight_update_timer_ (plain Event::Timer)
  |     Fires periodically to recompute locality routing weights.
  |
  |-- on timer callback:
  |     computeLocalityRoutingWeights()
  |       Reads host->orcaUtilization() from each healthy host, averages
  |       per locality, applies EWMA smoothing, computes capacity weighted
  |       by healthy host count (healthy_hosts * headroom), checks variance
  |       threshold for local-zone preference, applies probe percentage,
  |       publishes immutable snapshot.
  |
  +-- WorkerLocalLbFactory (shared across all workers)
        |
        |-- child_thread_aware_lb_ (single shared child ThreadAwareLoadBalancer)
        |     Created and initialized on the main thread. Workers call
        |     factory()->create() from it to build per-locality worker LBs.
        |
        |-- tls_ (ThreadLocal::TypedSlot<ThreadLocalShim>)
        |     Main thread pushes RoutingWeightsSnapshot to workers via
        |     runOnAllThreads(). Workers read from their TLS slot with
        |     zero synchronization.
        |
        +-- create() --> WorkerLocalLb (one per worker thread)
              |
              |-- selectLocality() [weighted random by capacity]
              |     Weighted random across localities proportional to
              |     host-count-weighted headroom.
              |
              +-- per_locality_[] (one PerLocalityState per locality)
                    |-- PrioritySetImpl (hosts for this locality only)
                    +-- LoadBalancer (worker-local child instance, e.g., RoundRobin)
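
The weight computation above can be sketched in isolation. This is a hypothetical standalone sketch, not the actual Envoy code; `LocalityState`, `computeRoutingWeight`, and the `alpha` parameter are illustrative names:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Per-locality smoothing state (illustrative; not Envoy's real types).
struct LocalityState {
  std::vector<double> host_utilization; // latest ORCA utilization per healthy host
  double ewma = 0.0;                    // smoothed locality utilization
  bool initialized = false;
};

// Routing weight = healthy_hosts * (1 - smoothed_utilization), i.e. capacity
// weighted by healthy host count. `alpha` is the EWMA smoothing factor in (0, 1].
double computeRoutingWeight(LocalityState& s, double alpha) {
  double sum = 0.0;
  for (double u : s.host_utilization) {
    sum += u;
  }
  const double avg = s.host_utilization.empty() ? 0.0 : sum / s.host_utilization.size();
  // EWMA smoothing; the first observation seeds the average directly.
  s.ewma = s.initialized ? alpha * avg + (1.0 - alpha) * s.ewma : avg;
  s.initialized = true;
  const double headroom = std::max(1.0 - s.ewma, 0.0);
  return static_cast<double>(s.host_utilization.size()) * headroom;
}
```

For example, a locality with three healthy hosts each reporting 0.5 utilization gets weight 3 × 0.5 = 1.5; as smoothed utilization rises, the weight decays toward zero.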

Design decisions:

  • Locality selection and endpoint selection are fully decoupled — any endpoint-picking child policy works, including CSWRR (no lbPolicyData() conflicts since utilization flows through a separate OrcaUtilizationStore channel).
  • Local-zone preference: when local utilization is within a configurable variance threshold of the remote average, the policy routes 100% of traffic locally.
  • Probe percentage: ensures a minimum fraction of traffic reaches remote localities to keep ORCA data fresh even in all-local mode. If out-of-band reporting is added later, it could serve as an alternative that avoids cross-zone probe traffic entirely.
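
A rough sketch of how the variance threshold and probe percentage could interact at pick time (hypothetical and simplified; when the probe fires, this sketch falls through to the ordinary weighted pick rather than forcing a remote locality):

```cpp
#include <cassert>
#include <random>
#include <vector>

// Picks a locality index; index 0 is the local locality. Illustrative only.
size_t selectLocality(const std::vector<double>& weights, double local_utilization,
                      double remote_avg_utilization, double variance_threshold,
                      double probe_fraction, std::mt19937& rng) {
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  // Local-zone preference: if local load is within the variance threshold of
  // the remote average, route locally, except for a probe fraction of picks
  // that still consider remote localities to keep ORCA data fresh.
  if (local_utilization <= remote_avg_utilization + variance_threshold &&
      unit(rng) >= probe_fraction) {
    return 0;
  }
  // Weighted random across localities, proportional to headroom-based weight.
  double total = 0.0;
  for (double w : weights) {
    total += w;
  }
  double r = unit(rng) * total;
  for (size_t i = 0; i < weights.size(); ++i) {
    if (weights[i] > 0.0) {
      r -= weights[i];
      if (r <= 0.0) {
        return i;
      }
    }
  }
  return 0; // all weights zero: fall back to local
}
```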

jukie added 2 commits March 5, 2026 01:36
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
jukie added 2 commits March 5, 2026 08:33
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @wbpcode
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #43784 was ready_for_review by jukie.

see: more, trace.

Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Contributor

@markdroth markdroth left a comment


This approach looks really good from an xDS API perspective!

I'll let one of the Envoy maintainers review the implementation.

@markdroth markdroth self-assigned this Mar 5, 2026
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
@markdroth
Contributor

/lgtm api

@repokitteh-read-only repokitteh-read-only bot removed the api label Mar 5, 2026
jukie and others added 3 commits March 6, 2026 08:45
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: Isaac Wilson <isaac.wilson514@gmail.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
@jukie
Contributor Author

jukie commented Mar 6, 2026

/retest

@agrawroh
Member

agrawroh commented Mar 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new load balancing policy, load_aware_locality, which distributes traffic between localities based on real-time utilization data from ORCA reports. The implementation is robust, featuring EWMA smoothing for utilization metrics, a variance threshold to prefer the local locality when load is balanced, and a configurable probe percentage to ensure telemetry freshness. The code is well-structured, with clear separation between main-thread weight computation and per-worker load balancing logic. It also includes comprehensive unit and integration tests that cover a wide range of scenarios, including dynamic load shifts and various configuration parameters. The changes are additive and well-contained within the new extension. Overall, this is a high-quality contribution that adds significant new functionality to Envoy's load balancing capabilities.

Note: Security Review did not run due to the size of the PR.

@RyanTheOptimist RyanTheOptimist removed the "deps" label (Approval required for changes to Envoy's external dependencies) Mar 9, 2026
jukie added 3 commits March 10, 2026 22:48
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
…icies

Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
@jukie
Contributor Author

jukie commented Mar 12, 2026

/retest

@phlax
Member

phlax commented Mar 12, 2026

needs main merge - unrelated go problem is fixed there

jukie added 4 commits March 12, 2026 07:44
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
return absl::InvalidArgumentError(
    absl::StrCat("Unsupported endpoint picking policy for load_aware_locality: ",
                 endpoint_picking_policy_factory->name(),
                 ". Child policies must support locality-scoped worker instantiation."));
Contributor


I'm not super familiar with Envoy's implementation, so I don't understand why we have this restriction, but I will note that it seems sub-optimal. In principle, the child policy and the parent policy should have exactly the same API, so delegation should be possible to any child policy. If that's not the case, then maybe we need some changes in Envoy's LB policy API to make that possible.

Contributor Author

@jukie jukie Mar 12, 2026


This new locality policy only works with child policies whose logic can be preserved on a locality-scoped set of hosts. This is similar to the limitations with combining weighted clusters with consistent hashing (#21675). When the host set is split before the child policy runs, policies like ring_hash and maglev lose their expected behavior because they're building their hash structures from an incomplete view of the host set.

If you'd prefer that choice be up to the user I can remove this restriction and document it as a limitation.

Contributor


load_aware_locality only works with child policies whose logic can be preserved on a locality-scoped set of hosts. Policies that depend on the full cluster's host set (such as the hashing-based policies rejected here) don't fit that model

I don't understand what that means, since I'm not that familiar with the Envoy LB policy API. But in principle, I would expect that the parent policy and child policy should both accept the list of endpoints in the same structure, since they are both implementing the same LB policy API. Given that, it's not clear to me what restrictions we would have here: I would expect that the parent policy would essentially just filter the list of endpoints it passes down to the child policy, and the child policy would just pick from among that filtered set of endpoints. What am I missing?

Contributor Author

@jukie jukie Mar 12, 2026


I would expect that the parent policy would essentially just filter the list of endpoints it passes down to the child policy

That's exactly what happens here, but the problem is policies like ring_hash or maglev, which depend on building a hash structure over the full cluster host set. Once the parent narrows that to a single locality, the child is no longer implementing the same policy.
To really support that, the parent would need to implement something like locality-aware hashing so a given hash key maps consistently to the same locality. That isn't necessarily impossible in the future, but it is a bit at odds with the core goal of this policy, which is to route based on load. I can add that on top if you feel strongly, but I personally feel that would be better suited as a locality-aware hashing LB policy.
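
A toy illustration of this point (hypothetical code, deliberately simplified, not Envoy's ring_hash): the same request hash can land on a different host once the ring is built from a locality-filtered subset.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Deterministic toy hash so the example is reproducible.
size_t toyHash(const std::string& s) {
  size_t h = 0;
  for (char c : s) {
    h = h * 131 + static_cast<unsigned char>(c);
  }
  return h;
}

// Minimal consistent-hash pick: each host gets one ring position, and a
// request maps to the first position at or after its hash (wrapping around).
std::string pick(const std::vector<std::string>& hosts, size_t request_hash) {
  std::map<size_t, std::string> ring;
  for (const auto& h : hosts) {
    ring.emplace(toyHash(h), h);
  }
  auto it = ring.lower_bound(request_hash);
  return it == ring.end() ? ring.begin()->second : it->second;
}
```

A request whose hash maps to a zone-b host in the full ring necessarily maps elsewhere when the parent has filtered the ring down to zone-a hosts, so the child is no longer computing the same function.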

Contributor Author

@jukie jukie Mar 12, 2026


Again, this is similar to other limitations that exist in Envoy (a good list is in envoyproxy/gateway#5307 (comment)) so documenting it as a known limitation but still accepting these child policies could be reasonable as well. I'm open to changing this.

Contributor


If I'm understanding right, you're saying that the child's behavior won't be exactly the same underneath the locality-picking policy, because of things like the fact that ring_hash won't have the same set of hosts in its ring and will therefore not wind up picking the same endpoint for the same request hash?

If so, I think that's fine and is not something we should be disallowing. It's certainly something that people need to understand if they choose to configure those policies underneath a locality-picking policy, but I don't think we should go out of our way to disallow that.

I also think this hard-coded allow list is going to wind up being brittle, because no one will ever remember to update it when we add new LB policies over time.

Contributor Author


Thanks for the feedback, updated.

jukie added 8 commits March 12, 2026 09:58
After a host removal tears down a locality's child LB, the routing
snapshot may still assign weight to that locality, causing chooseHost
to return nullptr despite other localities having healthy hosts.

Add pickLocalityLb helper that falls back to any locality with a
usable child LB. Also propagate empty-delta priority updates to
child LBs.
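
The fallback described in this commit message can be sketched as follows (hypothetical and simplified; `PerLocalityState` and the helper signature are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Worker-local per-locality state (illustrative).
struct PerLocalityState {
  bool has_child_lb = false; // false once a host removal tore down the child LB
};

// Prefer the locality chosen by the weighted pick; if its child LB is gone,
// fall back to any locality that still has a usable child LB. Returns -1 if
// none exists (chooseHost would then return nullptr).
int pickLocalityLb(const std::vector<PerLocalityState>& localities, size_t picked) {
  if (picked < localities.size() && localities[picked].has_child_lb) {
    return static_cast<int>(picked);
  }
  for (size_t i = 0; i < localities.size(); ++i) {
    if (localities[i].has_child_lb) {
      return static_cast<int>(i);
    }
  }
  return -1;
}
```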

Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
… too much

Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
@jukie jukie requested a review from markdroth March 12, 2026 22:58
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Comment on lines +79 to +105
class OrcaUtilizationStore {
public:
  double get() const { return value_.load(std::memory_order_relaxed) / kScale; }

  // Set utilization with a monotonic timestamp (milliseconds since epoch).
  void set(double utilization, int64_t monotonic_time_ms) {
    // Reject non-finite values: std::clamp and the uint32 cast have undefined behavior for them.
    if (!std::isfinite(utilization)) {
      return;
    }
    // Clamp to [0, 1]. The fixed-point uint32 encoding cannot represent values outside this range:
    // negative values would wrap around on cast, and values above 1.0 would overflow kScale.
    // ORCA also defines utilization in [0, 1].
    utilization = std::clamp(utilization, 0.0, 1.0);
    value_.store(static_cast<uint32_t>(utilization * kScale), std::memory_order_relaxed);
    last_update_time_ms_.store(monotonic_time_ms, std::memory_order_release);
  }

  // Returns the monotonic timestamp (ms since epoch) of the last set() call,
  // or 0 if set() has never been called.
  int64_t lastUpdateTimeMs() const { return last_update_time_ms_.load(std::memory_order_acquire); }

private:
  static constexpr double kScale = 10000.0;
  std::atomic<uint32_t> value_{0};
  std::atomic<int64_t> last_update_time_ms_{0};
};
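
To see the fixed-point encoding in action, here is a trimmed standalone copy of the class with a usage sketch (behavior as documented in the snippet above; the values are arbitrary):

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstdint>

// Trimmed copy of the store above, for a self-contained demo.
class OrcaUtilizationStore {
public:
  double get() const { return value_.load(std::memory_order_relaxed) / kScale; }

  void set(double utilization, int64_t monotonic_time_ms) {
    if (!std::isfinite(utilization)) {
      return; // NaN/Inf rejected: neither value nor timestamp is updated.
    }
    utilization = std::clamp(utilization, 0.0, 1.0);
    value_.store(static_cast<uint32_t>(utilization * kScale), std::memory_order_relaxed);
    last_update_time_ms_.store(monotonic_time_ms, std::memory_order_release);
  }

  int64_t lastUpdateTimeMs() const { return last_update_time_ms_.load(std::memory_order_acquire); }

private:
  static constexpr double kScale = 10000.0;
  std::atomic<uint32_t> value_{0};
  std::atomic<int64_t> last_update_time_ms_{0};
};
```

Round-trips are accurate to 1/10000 of utilization, out-of-range values clamp to [0, 1], and a NaN leaves both the value and the timestamp untouched.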
Member


Is this necessary here? I think we may prefer the solution in client_side_weighted_round_robin? That is to say, the LB will implement the specific HostLbPolicyData implementation and will compute the necessary utilization?

The HostLbPolicyData is designed to do this task.

Comment on lines +402 to +407
  /**
   * @return true if this LB policy reads per-host ORCA load report data.
   *         When true, the router will parse ORCA load reports from upstream
   *         responses.
   */
  virtual bool requiresOrcaLoadReports() const { return false; }
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can move requiresOrcaLoadReports to the LoadBalancerConfig, and add another absl::optional<double> onOrcaLoadReport(const Upstream::OrcaLoadReport& report); there to calculate the utilization based on the configuration.

Then, combined with the new OrcaUtilizationStore, we might get a better solution than the previous HostLbPolicyData?

Contributor Author


That makes sense, it would allow the LoadBalancerConfig to control which ORCA fields get extracted instead of the current hardcoded logic. I'll take a look at this.

I initially took a stab at extracting HostLbPolicyData into a central class with #43695 but I'll re-think this approach.

@wbpcode
Member

wbpcode commented Mar 17, 2026

cc @jukie recently we have had lots of new LB policies. Orz. See #43588, where LbPolicyData is also used.

I think we could also enhance the current LbPolicyData mechanism to support multiple LbPolicyData entries on a single host. 🤔

@paul-r-gall
Contributor

@jukie I agree with @wbpcode's recommendation for enhancing LbPolicyData to have multiple on a single host.

@jukie
Contributor Author

jukie commented Mar 17, 2026

Sounds good, thanks for the feedback!

@jukie
Contributor Author

jukie commented Mar 17, 2026

@paul-r-gall @wbpcode starting on the multi-entry approach in #43995 if you could take a look please. Let me know if you'd rather see that in this PR.

/*/extensions/load_balancing_policies/client_side_weighted_round_robin @wbpcode @adisuissa @efimki
/*/extensions/load_balancing_policies/override_host @yanavlasov @tonya11en
/*/extensions/load_balancing_policies/wrr_locality @wbpcode @adisuissa @efimki
/*/extensions/load_balancing_policies/load_aware_locality @wbpcode @adisuissa @efimki @jukie
Member


You could remove me from the owner list, because we may not have enough bandwidth to sponsor this new extension. 😞

Development

Successfully merging this pull request may close these issues.

Proposal: ORCA-driven locality routing for zone-aware load balancing