sds: cluster not warming while certificates are being fetched; immediately marked active #11120

Closed
howardjohn opened this issue May 8, 2020 · 28 comments · Fixed by #13344 or #14357

@howardjohn
Contributor

When creating clusters that reference SDS certificates, the warming behavior does not seem correct. My expectation is that until the secret is sent, the cluster will be marked as "warming" (up to the initial_fetch_timeout) and will block the rest of initialization from occurring.

What I am actually seeing is that initialization is blocked, but nothing indicates the clusters are warming.

Using this config:
docker run -v $HOME/kube/local:/config -p 15000:15000 envoyproxy/envoy-dev -c /config/envoy-sds-lds.yaml --log-format-prefix-with-location 0 --reject-unknown-dynamic-fields

with envoy version: 49efb9841a58ebdc43a666f55c445911c8e4181c/1.15.0-dev/Clean/RELEASE/BoringSSL

and config files:

cds.yaml:

resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: outbound_cluster_tls
  connect_timeout: 5s
  max_requests_per_connection: 1
  load_assignment:
    cluster_name: xds-grpc
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 8080
  type: STATIC
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext
      common_tls_context:
        tls_certificate_sds_secret_configs:
          - name: "default"
            sds_config:
              initial_fetch_timeout: 20s
              api_config_source:
                api_type: GRPC
                grpc_services:
                  - envoy_grpc:
                      cluster_name: "sds-grpc"
                refresh_delay: 60s
        combined_validation_context:
          default_validation_context: {}
          validation_context_sds_secret_config:
            name: ROOTCA
            sds_config:
              initial_fetch_timeout: 20s
              api_config_source:
                api_type: GRPC
                grpc_services:
                - envoy_grpc:
                    cluster_name: sds-grpc    

envoy-sds-lds.yaml:

admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 15000
node:
  id: id
  cluster: sdstest
dynamic_resources:
  lds_config:
    api_config_source:
      api_type: GRPC
      grpc_services:
        envoy_grpc:
          cluster_name: lds
  cds_config:
    path: /config/cds.yaml
static_resources:
  clusters:    
  - name: sds-grpc
    type: STATIC
    http2_protocol_options: {}
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN
  - name: lds
    type: STATIC
    http2_protocol_options: {}
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN

Basically, what should happen here is that we get a dynamic CDS cluster with SDS config. This SDS config fails, as the SDS server is not set up. We have initial_fetch_timeout set, so for 20s everything should be warming.

What we see instead:

  • Stats are not showing warming:
cluster_manager.cds.init_fetch_timeout: 0
cluster_manager.cds.update_attempt: 1
cluster_manager.cds.update_failure: 0
cluster_manager.cds.update_rejected: 0
cluster_manager.cds.update_success: 1
cluster_manager.cds.update_time: 1588972075968
cluster_manager.cds.version: 17241709254077376921
cluster_manager.cluster_added: 3
cluster_manager.cluster_modified: 0
cluster_manager.cluster_removed: 0
cluster_manager.cluster_updated: 0
cluster_manager.cluster_updated_via_merge: 0
cluster_manager.update_merge_cancelled: 0
cluster_manager.update_out_of_merge_window: 0
cluster_manager.warming_clusters: 0

We also see init_fetch_timeout is 0; this does not change after 20s

  • LDS is not requested until 20s later, indicating the initial_fetch_timeout is respected. This can be seen in the logs:
    (note - for simple testing I don't have a real LDS server, but we can see it's not even attempted until 20s in)
[2020-05-08 21:07:55.967][1][info][upstream] cds: add 1 cluster(s), remove 2 cluster(s)
[2020-05-08 21:07:55.968][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:55.968][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:55.968][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:55.968][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:55.968][1][info][upstream] cds: add/update cluster 'outbound_cluster_tls'
[2020-05-08 21:07:55.968][1][info][main] starting main dispatch loop
[2020-05-08 21:07:56.703][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:56.703][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:56.938][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:56.938][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:57.135][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:57.135][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:57.682][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:57.682][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:58.671][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:58.671][1][warning][config] Unable to establish new stream
[2020-05-08 21:08:08.992][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:08:08.992][1][warning][config] Unable to establish new stream
[2020-05-08 21:08:15.967][1][info][upstream] cm init: all clusters initialized
[2020-05-08 21:08:15.967][1][info][main] all clusters initialized. initializing init manager
[2020-05-08 21:08:15.967][1][warning][config] StreamListeners gRPC config stream closed: 14, no healthy upstream
  • dynamic_active_clusters in the config dump shows the cluster from cds.yaml. I would expect it to be listed as "warming" instead (see the check below).
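
(For illustration, not part of the original report: a quick way to check this is to poll the admin interface, which the config above exposes on port 15000; adjust the port for your own setup.)

curl -s http://127.0.0.1:15000/stats | grep cluster_manager.warming_clusters
# expected to stay above 0 while the SDS secret is still outstanding

curl -s http://127.0.0.1:15000/config_dump | grep -E '"dynamic_(warming|active)_clusters"'
# shows whether the cluster is listed under dynamic_warming_clusters or dynamic_active_clusters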

The example above is meant to simplify things; I originally saw this with a normal deployment using an ADS gRPC server (Istio), not just files.

@JimmyCYJ
Member

/cc @JimmyCYJ

@dio
Member

dio commented May 12, 2020

@Shikugawa would you be able to help investigate this?

@mattklein123 added the "investigate" label May 12, 2020
@Shikugawa
Member

@dio Yes, I'll investigate this problem.

@howardjohn
Contributor Author

I suspect this causes other issues as well. We are seeing that if we do not include sds config in the XDS connection, we eventually have SDS permanently broken - clusters that reference sds secrets will be stuck warming forever.

We have 2 secrets throughout all config, the client cert and the root cert. If we add the cert to the xds cluster (enforcing SDS starts before XDS) then we never see the issue with the client cert. However, the root cert still sometimes gets stuck warming forever.

more info istio/istio#22443.

I am not sure if it's the same root cause, but it seems related.

@Shikugawa
Member

@howardjohn Got it. I think that the cluster_warming and initial_fetch_timeout stats are both broken: they are never counted. Your problem would be caused by requesting an inactive SDS cluster. So we have two problems related to this issue. Is that right?

@stale

stale bot commented Jun 13, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

The stale bot added the "stale" label Jun 13, 2020
@howardjohn
Contributor Author

Your problem would be caused by requesting an inactive SDS cluster.

It's not just the SDS cluster being inactive; the SDS cluster could be active but not return any secrets.

The stale bot removed the "stale" label Jun 15, 2020
@mandarjog
Contributor

@lambdai @JimmyCYJ can you look at this? We do not want envoy to declare ready before it is.

@lambdai
Contributor

lambdai commented Jul 8, 2020

Let me rephrase to see if we are on the same page.

  1. SDS: somehow the SDS response is not returned.
  2. While 1 is blocked for the first 20s due to the SDS fetch timeout, the cluster is warming but shows as active.
  3. After 20s, the SDS resource announces ready.
  4. Immediately after 3, the cluster announces ready.

My understanding is that "warming cluster shows ready" is the only error.
IMHO the cluster is active early because SDS set a 20s fetch timeout; that is exactly how the 20s fetch timeout is supposed to work.

@howardjohn
Contributor Author

We may be saying the same thing, I'm not sure. But I think it's more like:

  1. SDS: somehow the SDS response is not returned.
  2. While 1 is blocked for the first 20s due to the SDS fetch timeout, the cluster is warming but shows as active.
  3. After 20s, the SDS fetch times out due to the 20s fetch timeout.
  4. Immediately after 3, the cluster announces ready - in the sense that LDS init continues, but it was always shown as "ready" in the stats.

@lambdai
Contributor

lambdai commented Jul 8, 2020

@howardjohn I see, so the stat is the only liar. I think Envoy is declaring active early in the stats, not only for clusters but also for listeners.

However, the sequence is working as expected. The initial_fetch_timeout is supposed to unblock initialization by having the resource announce itself "ready" or "active".

@mandarjog I think the solution is to disable init_fetch_timeout in SDS if we cannot tolerate fake readiness.

@howardjohn
Contributor Author

@lambdai I don't think stats are the only issue. See istio/istio#22443. I have less of a clear reproducer, but basically we get into a state where Envoy never sends an SDS request for one of the SDS resources in the XDS response. It seems there are larger problems than just the stat - but maybe it's a completely different issue.

Why is the solution to disable it rather than fix the stat?

@lambdai
Contributor

lambdai commented Jul 9, 2020

basically we get into a state where Envoy never sends an SDS request for one of the SDS resources in the XDS response

Got it. I'm reading the issue to see if I can help.
If envoy does want to block initialization until SDS is fixed, the only solution is to disable the fetch timeout in SDS.
Then the readiness probe should chime in and kill the istio container, correct?

@howardjohn
Contributor Author

Oh I see. Yeah, actually we do not want the initial fetch timeout for these in particular. But that is an Istio detail, not an Envoy-side one.

One thing I do wonder - does the initial_fetch_timeout stuff work differently for bootstrap vs dynamic? Because what we found is that if we requested the default secret as part of the XDS cluster, everything is OK. If we do not, but the dynamic response requests default, it occasionally breaks and never sends the SDS request. I never figured out why that is - I assumed it may be some sequencing issue and using the secret for the XDS cluster linearizes it, but maybe it's bootstrap vs dynamic?
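
For reference, the workaround being discussed - requesting the default secret as part of the cluster that carries the xDS connection, so SDS is started before XDS - would look roughly like the sketch below, reusing the names from the configs in this issue. This is only an illustration of the idea, not Istio's actual bootstrap:

static_resources:
  clusters:
  - name: lds                       # static cluster carrying the xDS connection
    type: STATIC
    http2_protocol_options: {}
    connect_timeout: 5s
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext
        common_tls_context:
          tls_certificate_sds_secret_configs:
          - name: "default"         # the same secret the dynamic clusters reference
            sds_config:
              api_config_source:
                api_type: GRPC
                grpc_services:
                - envoy_grpc:
                    cluster_name: sds-grpc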

@Shikugawa
Member

Shikugawa commented Jul 13, 2020

Oh I see. Yeah actually we do not want initial fetch timeout for these in particular. But that is an Istio detail not envoy side.

@lambdai @mandarjog Sorry, I missed these discussions! For now, I think the common ground is to add a disable_init_fetch_timeout option on SDS, or a runtime flag. This won't break the default behavior of SDS or API compatibility.

@howardjohn
Contributor Author

Why do we need disable_init_fetch_timeout? I think setting it to 0s already disables it?

Besides, the fact that we want it disabled is an implementation detail of our usage of Envoy; this is still a general issue that will impact others.
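
For concreteness, the suggestion above would amount to something like this in the sds_config blocks of cds.yaml (a sketch only, assuming a 0s value really does disable the initial fetch timeout, as suggested here):

sds_config:
  initial_fetch_timeout: 0s   # assumed to disable the timeout entirely
  api_config_source:
    api_type: GRPC
    grpc_services:
    - envoy_grpc:
        cluster_name: sds-grpc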

@Shikugawa
Member

Shikugawa commented Jul 13, 2020

@howardjohn Got it. Maybe this is caused by the cluster manager's state management. In general, if we start creating an xDS subscription before all clusters are initialized, this problem will occur. So I think that to resolve this, we should fix the init_manager implementation; this is a relatively deep-rooted problem. We should have prevention logic so that active_clusters_ is not used in the ClusterManager implementation, shouldn't we? cc. @lambdai @mandarjog

@tmshort

tmshort commented Aug 4, 2020

Any update on this? This is an intermittent problem on our istio deployment that stalls pods.

@Shikugawa
Member

@howardjohn Hey, I'd like to confirm my understanding of this. I don't completely understand what the problem is. Our problem is:

The attached cluster becomes active immediately after its SDS subscription sends a DiscoveryRequest to the SDS cluster, rather than staying warming until its init_fetch_timeout, when the DiscoveryResponse from the SDS cluster didn't contain the CA. We expect the attached cluster to stay warming in this situation.

Is this what you meant?

@howardjohn
Contributor Author

Yep. The cluster should be warming until the secret is fetched, but it's active immediately.

@ramaraochavali
Contributor

ramaraochavali commented Aug 23, 2020

We also run into this problem intermittently. In our case we stream file-based certs via SDS using Istio. When we have many clusters and the SDS push is delayed for a cluster, requests to that cluster fail with the error "OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_UNKNOWN" immediately after initialization, since the cluster is incorrectly marked as active. The problem is more prominent when there are many clusters.

@tmshort

tmshort commented Sep 14, 2020

Any update on this issue?

@Shikugawa
Member

@tmshort I think this issue has two problems. As part of this, I crafted a PR which is in review: #12783

@incfly
Contributor

incfly commented Oct 23, 2020

/cc @incfly

lizan pushed a commit that referenced this issue Oct 27, 2020
…cret entity (#13344)

This PR depends heavily on #12783. Changed to keep warming if dynamically inserted clusters (when initialization hasn't finished) fail to extract the TLS certificate and certificate validation context. They shouldn't be indicated as ACTIVE clusters.
Risk Level: Mid
Testing: Unit
Docs Changes:
Release Notes: Added
Fixes #11120, future work: #13777

Signed-off-by: Shikugawa <rei@tetrate.io>
istio-testing pushed a commit to istio/envoy that referenced this issue Nov 2, 2020
* cluster manager: avoid immediate activation for dynamic inserted cluster when initialize (envoyproxy#12783)

Signed-off-by: Shikugawa <rei@tetrate.io>

* sds: keep warming when dynamic inserted cluster can't be extracted secret entity (envoyproxy#13344)

This PR depends heavily on envoyproxy#12783. Changed to keep warming if dynamically inserted clusters (when initialization hasn't finished) fail to extract the TLS certificate and certificate validation context. They shouldn't be indicated as ACTIVE clusters.
Risk Level: Mid
Testing: Unit
Docs Changes:
Release Notes: Added
Fixes envoyproxy#11120, future work: envoyproxy#13777

Signed-off-by: Shikugawa <rei@tetrate.io>

Co-authored-by: Rei Shimizu <rei@tetrate.io>
@lizan reopened this Nov 3, 2020
@github-actions

github-actions bot commented Dec 9, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

The github-actions bot added the "stale" label Dec 9, 2020
@Shikugawa
Member

@lizan Can we close this?

@lizan removed the "investigate" and "stale" labels Dec 10, 2020
@lizan self-assigned this and unassigned Shikugawa Dec 10, 2020
@lizan
Member

lizan commented Dec 10, 2020

@Shikugawa No, this is not fixed; I'll have a fix soon.

lizan added a commit that referenced this issue Dec 16, 2020
Fixes #11120, allows more than one init manager to watch the same SDS init target so clusters/listeners won't be marked active immediately.

Additional Description:
Risk Level: Medium
Testing: integration test
Docs Changes: N/A
Release Notes: Added.

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
@shashankram

Hey @howardjohn, I recall you mentioning that Envoy sometimes does not fetch SDS secrets dynamically for clusters whose transport socket references an SDS config source. I have seen this issue as well. Was this fixed or root-caused? It seems like this was addressed in Istio by providing the transport socket certs within the cluster config itself.
