xds: "Cluster not available" can cause backoff retries of [0,1s) for xDS requests using an Envoy cluster #27702

@abeyad

Description

When the Bootstrap config has xDS configured to use Envoy gRPC, establishing a new gRPC stream can fail with a "Cluster not available" error.

This happens when the xDS cluster is not yet initialized and a cluster that depends on it (e.g. for ADS or SDS using Envoy gRPC) gets initialized first. The dependent cluster then attempts to create a connection to the not-yet-initialized xDS cluster, which is not found in the thread-local storage for the main thread.

When the cluster is not found, the gRPC stream schedules a retry attempt for some jittered back-off delay of up to 1 second. Typically, the xDS cluster is initialized by then, so the subsequent retry attempt to establish the gRPC stream, which gets triggered when the timer expires, ends up being successful.
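The retry behavior described above can be sketched as a small simulation. This is only an illustration, not Envoy's implementation: the function name and parameters are hypothetical, and it assumes each failed attempt schedules a retry after a delay drawn uniformly from [0, max_delay_ms), ignoring any growth of the interval on repeated failures.

```python
import random


def time_to_establish_stream_ms(cluster_ready_at_ms, max_delay_ms, rng):
    """Simulate when the stream comes up if the cluster is ready at `cluster_ready_at_ms`.

    Every attempt made before the cluster is ready fails ("Cluster not
    available") and schedules a retry after a jittered delay in
    [0, max_delay_ms). Returns (time of the successful attempt, attempt count).
    """
    now_ms = 0.0
    attempts = 1  # the initial attempt happens at t = 0
    while now_ms < cluster_ready_at_ms:
        # Hypothetical jitter: uniform in [0, max_delay_ms).
        now_ms += rng.random() * max_delay_ms
        attempts += 1
    return now_ms, attempts


rng = random.Random(42)
# The xDS cluster becomes available 5 ms after the first attempt.
established_ms, attempts = time_to_establish_stream_ms(5, 1000, rng)
print(f"stream established after {established_ms:.0f} ms, {attempts} attempts")
```

Even when the cluster is ready a few milliseconds later, the stream is only established once the jittered timer fires, which is where the delay of up to 1 second comes from.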

This is not ideal, especially for Envoy Mobile, where we wouldn't want to wait for up to 1 second just for xDS initialization in the app.

This scenario can be triggered under two circumstances:

  1. The static xDS cluster appears after the cluster that depends on it in the Bootstrap config's static_resources repeated field ordering.
  2. We use ADS, where the gRPC service is using Envoy gRPC (i.e. an Envoy cluster in the static_resources field) and the cluster upon which ADS depends is not guaranteed to be fully warmed during initialization. The ClusterManagerImpl calls ads_mux_->start() after adding primary clusters, but depending on the cluster type, a cluster may not yet be warmed when the ClusterManagerInitHelper::addCluster() call finishes executing. One cluster type that can still be unwarmed when the ADS mux's start() is called, even if wait_for_warm_on_init is set, is STRICT_DNS, which has to make asynchronous calls to resolve the DNS entries for the destination hosts.

For scenario 1, there are a few options to solve the issue:

  1. Have the ClusterManagerImpl go through a first pass where it figures out the dependencies between clusters, then initializes them based on that ordering.
    • This is the most sensible in terms of API experience
    • But it adds quite a bit of complexity to already complex code and will likely slow down the time it takes to get all the clusters initialized, since we'll have to go through a first pass to figure out dependencies (and at that point, maybe it's not so much more efficient than just the backoff retry mechanism?)
  2. Have a way for clusters to declare their dependencies on each other in the Bootstrap config.
    • But this requires a new configuration knob in the Bootstrap and still requires the API user to do something to declare dependencies, which isn't much different than the third option which is...
  3. Require that clusters that depend on a static xDS cluster are ordered after the cluster they depend upon in the Bootstrap's static_resources field.
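To make options 1 and 3 concrete, here is a rough sketch in Python rather than Envoy's C++. The cluster names and both helper functions are hypothetical, and it assumes the dependencies between clusters are already known as a mapping from each cluster to the clusters it depends on. `init_order` corresponds to the dependency-ordering first pass of option 1; `misordered_dependencies` corresponds to a validation check that could back the ordering requirement of option 3.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def init_order(deps):
    """Option 1: return cluster names with every dependency before its dependents."""
    return list(TopologicalSorter(deps).static_order())


def misordered_dependencies(declared, deps):
    """Option 3: list (cluster, dependency) pairs where the dependency is declared later."""
    position = {name: i for i, name in enumerate(declared)}
    return [
        (cluster, dep)
        for cluster in declared
        for dep in sorted(deps.get(cluster, ()))
        if position[dep] > position[cluster]
    ]


# Hypothetical static_resources ordering: the dependent cluster comes first.
declared = ["sds_backed_cluster", "xds_cluster"]
deps = {"sds_backed_cluster": {"xds_cluster"}, "xds_cluster": set()}

print(init_order(deps))
print(misordered_dependencies(declared, deps))
```

The first call yields an order that initializes `xds_cluster` first; the second reports the pair that violates the declared-order requirement.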

For scenario 2, there are a couple of options to "solve" the issue:

  1. Move the ads_mux_->start() call to after the initialization is complete for the cluster that ADS is configured with. This will require a first pass to determine cluster dependencies, and adding the ADS mux initialization to the post-cluster-init callback if ADS depends on it.
  2. Just have a caveat for the time being that ADS configured with an Envoy cluster that requires asynchronous initialization will result in a backoff retry that could delay app initialization. The ADS use cases with Google's Traffic Director that we are thinking of will use Google gRPC with a target URI, which doesn't have the same retry backoff problem that Envoy gRPC does.
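The first option for scenario 2 amounts to registering the ADS mux start as a post-initialization callback on the cluster it depends on. A minimal sketch of that idea follows, in Python rather than Envoy's C++; the `Cluster` and `AdsMux` classes and their methods are hypothetical stand-ins, not Envoy's actual interfaces.

```python
class Cluster:
    """Hypothetical cluster that may finish initializing asynchronously."""

    def __init__(self, name):
        self.name = name
        self.initialized = False
        self._callbacks = []

    def on_initialized(self, callback):
        # Run immediately if already initialized, otherwise defer.
        if self.initialized:
            callback()
        else:
            self._callbacks.append(callback)

    def finish_initialization(self):
        # E.g. called once asynchronous DNS resolution completes.
        self.initialized = True
        for callback in self._callbacks:
            callback()
        self._callbacks.clear()


class AdsMux:
    """Hypothetical stand-in for the ADS mux."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.started = False
        self.would_back_off = False

    def start(self):
        # Starting before the cluster is ready is what triggers the backoff retry.
        if not self.cluster.initialized:
            self.would_back_off = True
        self.started = True


ads_cluster = Cluster("ads_cluster")
mux = AdsMux(ads_cluster)
ads_cluster.on_initialized(mux.start)  # deferred instead of calling start() right away
started_before_init = mux.started
ads_cluster.finish_initialization()
```

With this arrangement the mux only starts once the cluster is ready, so the first stream attempt never sees a missing cluster.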

Another option that applies to both scenarios is to tighten the backoff timer interval (e.g. to 500ms max instead of 1s).
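As a rough estimate of what tightening the interval buys, assume the first retry delay is uniform on [0, max) (an assumption for illustration; the helper below is hypothetical). The mean wait is then half the maximum, so halving the cap halves the expected delay:

```python
import random


def mean_first_retry_delay_ms(max_delay_ms, samples=20_000, seed=0):
    """Estimate the mean of a delay drawn uniformly from [0, max_delay_ms)."""
    rng = random.Random(seed)
    return sum(rng.random() * max_delay_ms for _ in range(samples)) / samples


mean_1s = mean_first_retry_delay_ms(1000)   # close to 500 ms
mean_500 = mean_first_retry_delay_ms(500)   # close to 250 ms
print(f"1s cap: ~{mean_1s:.0f} ms, 500ms cap: ~{mean_500:.0f} ms")
```

This reduces the stall but does not remove it, so it complements the ordering and deferred-start options rather than replacing them.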
