Description
When the Bootstrap config has xDS configured to use Envoy gRPC, we can get into a situation where establishing a new gRPC stream fails with a `Cluster not available` error.
This happens when the xDS cluster is not yet initialized and a cluster that depends on it (e.g. for ADS or SDS using Envoy gRPC) is initialized first: the dependent cluster attempts to open a connection to the not-yet-initialized xDS cluster, which is not yet present in the main thread's thread-local cluster storage.
When the cluster is not found, the gRPC stream schedules a retry attempt after a jittered backoff delay of up to 1 second. The xDS cluster is typically initialized by then, so the subsequent attempt to establish the gRPC stream, triggered when the timer expires, succeeds.
This is not ideal, especially for Envoy Mobile, where we wouldn't want to wait for up to 1 second just for xDS initialization in the app.
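For reference, the retry delay behaves like a jittered backoff capped at a maximum. The following standalone C++ sketch shows the general shape of such a delay computation; the constants and the function name are illustrative, not Envoy's actual implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

// Illustrative constants; Envoy's gRPC stream has its own defaults.
constexpr uint64_t kBaseDelayMs = 500;
constexpr uint64_t kMaxDelayMs = 1000;  // the ~1 second cap discussed above

// Returns a jittered retry delay in milliseconds: a uniformly random
// value in [0, min(base * 2^attempt, max)].
uint64_t nextRetryDelayMs(uint32_t attempt, std::mt19937_64& rng) {
  uint64_t ceiling = kBaseDelayMs;
  for (uint32_t i = 0; i < attempt && ceiling < kMaxDelayMs; ++i) {
    ceiling *= 2;
  }
  ceiling = std::min(ceiling, kMaxDelayMs);
  std::uniform_int_distribution<uint64_t> dist(0, ceiling);
  return dist(rng);
}
```

The jitter spreads retries out in time, but the worst case is still the full cap, which is what an app waiting on xDS initialization can end up paying.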
This scenario can be triggered under two circumstances:
1. The static xDS cluster appears after the cluster that depends on it in the ordering of the Bootstrap config's `static_resources` repeated field.
2. We use ADS, where the gRPC service uses Envoy gRPC (i.e. an Envoy cluster in the `static_resources` field) and the cluster upon which ADS depends is not guaranteed to be fully warmed during initialization. The `ClusterManagerImpl` calls `ads_mux_->start()` after adding the primary clusters, but depending on the cluster type, the cluster may not yet be warmed when the `ClusterManagerInitHelper::addCluster()` call finishes executing. Examples of cluster types that will trigger the ADS mux's `start()` before warming, even if `wait_for_warm_on_init` is set, include the STRICT_DNS cluster, which has to make asynchronous calls to resolve the DNS entries for its destination hosts.
For scenario 1, there are a few options to solve the issue:
- Have the ClusterManagerImpl go through a first pass where it figures out the dependencies between clusters, then initializes them based on that ordering.
- This is the most sensible in terms of API experience
- But it adds quite a bit of complexity to already-complex code and will likely increase the time it takes to get all the clusters initialized, since we'd need a first pass to figure out dependencies (and at that point, it may not be much more efficient than the existing backoff retry mechanism)
- Have a way for clusters to declare their dependencies on each other in the Bootstrap config.
- But this requires a new configuration knob in the Bootstrap and still requires the API user to declare dependencies explicitly, which isn't much different from the third option, which is...
- Require that clusters that depend on a static xDS cluster are ordered after the cluster they depend upon in the Bootstrap's `static_resources` field.
  - This would require updating the documentation and making that requirement well known to xDS users.
  - Update the tests that set the correct cluster ordering in the Bootstrap config. tests: Eliminate retry backoffs in the SDS integration tests #27679 is an example of this.
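The first option above (a dependency-aware first pass) is essentially a topological sort of the static clusters by their xDS-cluster dependency edges. A minimal standalone sketch, assuming a simplified `Cluster` record with a single optional dependency; the types and names here are stand-ins, not Envoy's `ClusterManagerImpl` internals:

```cpp
#include <functional>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for a Bootstrap static cluster: a name plus the name of the
// xDS cluster it depends on, if any (e.g. its SDS/EDS config source).
struct Cluster {
  std::string name;
  std::optional<std::string> depends_on;
};

// First pass: order clusters so every cluster appears after the cluster
// it depends on. Throws on a dependency cycle or a missing dependency.
std::vector<Cluster> orderByDependency(const std::vector<Cluster>& clusters) {
  std::map<std::string, const Cluster*> by_name;
  for (const auto& c : clusters) {
    by_name[c.name] = &c;
  }
  std::map<std::string, int> state;  // 0 = unvisited, 1 = visiting, 2 = done
  std::vector<Cluster> ordered;
  std::function<void(const Cluster&)> visit = [&](const Cluster& c) {
    if (state[c.name] == 2) {
      return;
    }
    if (state[c.name] == 1) {
      throw std::runtime_error("dependency cycle at " + c.name);
    }
    state[c.name] = 1;
    if (c.depends_on) {
      auto it = by_name.find(*c.depends_on);
      if (it == by_name.end()) {
        throw std::runtime_error("unknown dependency: " + *c.depends_on);
      }
      visit(*it->second);
    }
    state[c.name] = 2;
    ordered.push_back(c);
  };
  for (const auto& c : clusters) {
    visit(c);
  }
  return ordered;
}
```

Real clusters can reference several config sources (SDS, EDS, ALS), so the actual edge extraction would be more involved than a single `depends_on` field; the sort itself is the easy part.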
For scenario 2, there are a couple of options to "solve" the issue:
- Move the `ads_mux_->start()` call to after initialization completes for the cluster that ADS is configured with. This would require a first pass to determine cluster dependencies, and adding the ADS mux start to the post-cluster-init callback when ADS depends on such a cluster.
- For the time being, just document the caveat that ADS configured with an Envoy cluster requiring asynchronous initialization will result in a backoff retry that can delay app initialization. For the ADS use cases with Google's Traffic Director that we are thinking of, this depends on Google gRPC being given a target URI, which doesn't have the same retry backoff problem that Envoy gRPC does.
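The first option could be sketched as registering the ADS mux start as a callback that fires only once the cluster ADS depends on is warmed, instead of calling `start()` unconditionally after the primary clusters are added. A hypothetical sketch; `AdsMux`, `InitHelper`, and the callback API are stand-ins, not Envoy's real types:

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the ADS gRPC mux.
struct AdsMux {
  bool started = false;
  void start() { started = true; }
};

// Tracks per-cluster init completion and runs callbacks registered
// against a specific cluster name once that cluster is warmed.
class InitHelper {
public:
  void onClusterInitialized(const std::string& name, std::function<void()> cb) {
    callbacks_[name].push_back(std::move(cb));
  }

  // Called when a cluster (e.g. a STRICT_DNS cluster) finishes its
  // asynchronous warm-up, such as DNS resolution of its hosts.
  void markInitialized(const std::string& name) {
    for (auto& cb : callbacks_[name]) {
      cb();
    }
    callbacks_[name].clear();
  }

private:
  std::map<std::string, std::vector<std::function<void()>>> callbacks_;
};
```

Usage would look like `init_helper.onClusterInitialized(ads_cluster_name, [&] { ads_mux.start(); });` in place of the unconditional start, so the first ADS stream attempt only happens against a warmed cluster.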
Another option that applies to both scenarios is to tighten the backoff timer interval (e.g. to 500ms max instead of 1s).