sds: cluster not warming while certificates are being fetched; immediately marked active #11120

Closed
howardjohn opened this issue May 8, 2020 · 28 comments · Fixed by #13344 or #14357

@howardjohn
Contributor

When creating clusters that reference SDS certificates, the warming behavior does not seem correct. My expectation is that until the secret is sent, the cluster will be marked as "warming" (up to the initial_fetch_timeout) and will block the rest of initialization from occurring.

What I am actually seeing is that initialization is blocked, but nothing indicates the clusters are warming.

Using this config:
docker run -v $HOME/kube/local:/config -p 15000:15000 envoyproxy/envoy-dev -c /config/envoy-sds-lds.yaml --log-format-prefix-with-location 0 --reject-unknown-dynamic-fields

with envoy version: 49efb9841a58ebdc43a666f55c445911c8e4181c/1.15.0-dev/Clean/RELEASE/BoringSSL

and config files:

cds.yaml:

resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: outbound_cluster_tls
  connect_timeout: 5s
  max_requests_per_connection: 1
  load_assignment:
    cluster_name: xds-grpc
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 8080
  type: STATIC
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext
      common_tls_context:
        tls_certificate_sds_secret_configs:
          - name: "default"
            sds_config:
              initial_fetch_timeout: 20s
              api_config_source:
                api_type: GRPC
                grpc_services:
                  - envoy_grpc:
                      cluster_name: "sds-grpc"
                refresh_delay: 60s
        combined_validation_context:
          default_validation_context: {}
          validation_context_sds_secret_config:
            name: ROOTCA
            sds_config:
              initial_fetch_timeout: 20s
              api_config_source:
                api_type: GRPC
                grpc_services:
                - envoy_grpc:
                    cluster_name: sds-grpc    

envoy-sds-lds.yaml:

admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 15000
node:
  id: id
  cluster: sdstest
dynamic_resources:
  lds_config:
    api_config_source:
      api_type: GRPC
      grpc_services:
        envoy_grpc:
          cluster_name: lds
  cds_config:
    path: /config/cds.yaml
static_resources:
  clusters:    
  - name: sds-grpc
    type: STATIC
    http2_protocol_options: {}
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN
  - name: lds
    type: STATIC
    http2_protocol_options: {}
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN

Basically, what should happen here is that we get a dynamic CDS cluster with SDS config. This SDS config fails, as the SDS server is not set up. We have initial_fetch_timeout set, so for 20s everything should be warming.

What we see instead:

  • Stats are not showing warming:
cluster_manager.cds.init_fetch_timeout: 0
cluster_manager.cds.update_attempt: 1
cluster_manager.cds.update_failure: 0
cluster_manager.cds.update_rejected: 0
cluster_manager.cds.update_success: 1
cluster_manager.cds.update_time: 1588972075968
cluster_manager.cds.version: 17241709254077376921
cluster_manager.cluster_added: 3
cluster_manager.cluster_modified: 0
cluster_manager.cluster_removed: 0
cluster_manager.cluster_updated: 0
cluster_manager.cluster_updated_via_merge: 0
cluster_manager.update_merge_cancelled: 0
cluster_manager.update_out_of_merge_window: 0
cluster_manager.warming_clusters: 0

We also see init_fetch_timeout is 0; this does not change after 20s

  • LDS is not requested until 20s later, indicating the initial_fetch_timeout is respected. This can be seen in the logs:
    (note - for simple testing I don't have a real LDS server, but we can see it's not even attempted until 20s in)
[2020-05-08 21:07:55.967][1][info][upstream] cds: add 1 cluster(s), remove 2 cluster(s)
[2020-05-08 21:07:55.968][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:55.968][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:55.968][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:55.968][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:55.968][1][info][upstream] cds: add/update cluster 'outbound_cluster_tls'
[2020-05-08 21:07:55.968][1][info][main] starting main dispatch loop
[2020-05-08 21:07:56.703][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:56.703][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:56.938][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:56.938][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:57.135][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:57.135][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:57.682][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:57.682][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:58.671][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:58.671][1][warning][config] Unable to establish new stream
[2020-05-08 21:08:08.992][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:08:08.992][1][warning][config] Unable to establish new stream
[2020-05-08 21:08:15.967][1][info][upstream] cm init: all clusters initialized
[2020-05-08 21:08:15.967][1][info][main] all clusters initialized. initializing init manager
[2020-05-08 21:08:15.967][1][warning][config] StreamListeners gRPC config stream closed: 14, no healthy upstream
  • dynamic_active_clusters in the config dump shows the cluster from cds.yaml. I would expect it to be listed as "warming" instead (see the check below).
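
(For illustration, not part of the original report: a quick way to check this is to poll the admin interface, which the config above exposes on port 15000; adjust the port for your own setup.)

curl -s http://127.0.0.1:15000/stats | grep cluster_manager.warming_clusters
# expected to stay above 0 while the SDS secret is still outstanding

curl -s http://127.0.0.1:15000/config_dump | grep -E '"dynamic_(warming|active)_clusters"'
# shows whether the cluster is listed under dynamic_warming_clusters or dynamic_active_clusters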

The example above is meant to simplify things; I originally saw this with a normal deployment using an ADS gRPC server (Istio), not just files.

@JimmyCYJ
Member

/cc @JimmyCYJ

@dio
Member

dio commented May 12, 2020

@Shikugawa would you be able to help investigate this?

@mattklein123 added the "investigate" label May 12, 2020
@Shikugawa
Member

@dio Yes, I'll investigate this problem.

@howardjohn
Contributor Author

I suspect this causes other issues as well. We are seeing that if we do not include sds config in the XDS connection, we eventually have SDS permanently broken - clusters that reference sds secrets will be stuck warming forever.

We have 2 secrets throughout all config, the client cert and the root cert. If we add the cert to the xds cluster (enforcing SDS starts before XDS) then we never see the issue with the client cert. However, the root cert still sometimes gets stuck warming forever.

more info istio/istio#22443.

I am not sure if it's the same root cause, but it seems related.

@Shikugawa
Member

@howardjohn Got it. I think that the cluster_warming and initial_fetch_timeout stats are both broken: they are never counted. Your problem would be caused by requesting an inactive SDS cluster. So we have two problems related to this issue. Is that right?

@stale

stale bot commented Jun 13, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

The stale bot added the "stale" label Jun 13, 2020
@howardjohn
Contributor Author

Your problem would be caused by requesting an inactive SDS cluster.

It's not just the SDS cluster being inactive; the SDS cluster could be active but not return any secrets.

The stale bot removed the "stale" label Jun 15, 2020
@mandarjog
Contributor

@lambdai @JimmyCYJ can you look at this? We do not want envoy to declare ready before it is.

@lambdai
Contributor

lambdai commented Jul 8, 2020

Let me rephrase to see if we are on the same page.

  1. SDS: somehow the SDS response is not returned.
  2. While 1 is blocked for the first 20s due to the SDS fetch timeout, the cluster is warming but shows as active.
  3. After 20s, the SDS resource announces ready.
  4. Immediately after 3, the cluster announces ready.

My understanding is that "warming cluster shows ready" is the only error.
IMHO the cluster is active early because SDS set a 20s fetch timeout; that is exactly how the 20s fetch timeout is supposed to work.

@howardjohn
Contributor Author

We may be saying the same thing, I'm not sure. But I think it's more like:

  1. SDS: somehow the SDS response is not returned.
  2. While 1 is blocked for the first 20s due to the SDS fetch timeout, the cluster is warming but shows as active.
  3. After 20s, the SDS fetch times out due to the 20s fetch timeout.
  4. Immediately after 3, the cluster announces ready - in the sense that LDS init continues, but it was always shown as "ready" in the stats.

@lambdai
Contributor

lambdai commented Jul 8, 2020

@howardjohn I see, so the stat is the only liar. I think Envoy is declaring active early in the stats, not only for clusters but also for listeners.

However, the sequence is working as expected. The initial_fetch_timeout is supposed to unblock initialization by having the resource announce itself "ready" or "active".

@mandarjog I think the solution is to disable init_fetch_timeout in SDS if we cannot tolerate fake readiness.

@howardjohn
Contributor Author

@lambdai I don't think stats are the only issue. See istio/istio#22443. I have less of a clear reproducer, but basically we get into a state where Envoy never sends an SDS request for one of the SDS resources in the XDS response. It seems there are larger problems than just the stat - but maybe it's a completely different issue.

Why is the solution to disable it rather than fix the stat?

@lambdai
Contributor

lambdai commented Jul 9, 2020

basically we get into a state where Envoy never sends an SDS request for one of the SDS resources in the XDS response

Got it. I'm reading the issue to see if I can help.
If envoy does want to block initialization until SDS is fixed, the only solution is to disable the fetch timeout in SDS.
Then the readiness probe should chime in and kill the istio container, correct?

@howardjohn
Contributor Author

Oh I see. Yeah, actually we do not want the initial fetch timeout for these in particular. But that is an Istio detail, not an Envoy-side one.

One thing I do wonder - does the initial_fetch_timeout stuff work differently for bootstrap vs dynamic? Because what we found is that if we requested the default secret as part of the XDS cluster, everything is OK. If we do not, but the dynamic response requests default, it occasionally breaks and never sends the SDS request. I never figured out why that is - I assumed it may be some sequencing issue and using the secret for the XDS cluster linearizes it, but maybe it's bootstrap vs dynamic?
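
For reference, the workaround being discussed - requesting the default secret as part of the cluster that carries the xDS connection, so SDS is started before XDS - would look roughly like the sketch below, reusing the names from the configs in this issue. This is only an illustration of the idea, not Istio's actual bootstrap:

static_resources:
  clusters:
  - name: lds                       # static cluster carrying the xDS connection
    type: STATIC
    http2_protocol_options: {}
    connect_timeout: 5s
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext
        common_tls_context:
          tls_certificate_sds_secret_configs:
          - name: "default"         # the same secret the dynamic clusters reference
            sds_config:
              api_config_source:
                api_type: GRPC
                grpc_services:
                - envoy_grpc:
                    cluster_name: sds-grpc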

@Shikugawa
Member

Shikugawa commented Jul 13, 2020

Oh I see. Yeah actually we do not want initial fetch timeout for these in particular. But that is an Istio detail not envoy side.

@lambdai @mandarjog Sorry, I missed these discussions! For now, I think the common ground is to add a disable_init_fetch_timeout option on SDS, or a runtime flag. This won't break the default behavior of SDS or API compatibility.

@howardjohn
Contributor Author

Why do we need disable_init_fetch_timeout? I think setting it to 0s already disables it?

Besides, the fact that we want it disabled is an implementation detail of our usage of Envoy; this is still a general issue that will impact others.
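
For concreteness, the suggestion above would amount to something like this in the sds_config blocks of cds.yaml (a sketch only, assuming a 0s value really does disable the initial fetch timeout, as suggested here):

sds_config:
  initial_fetch_timeout: 0s   # assumed to disable the timeout entirely
  api_config_source:
    api_type: GRPC
    grpc_services:
    - envoy_grpc:
        cluster_name: sds-grpc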

@Shikugawa
Member

Shikugawa commented Jul 13, 2020

@howardjohn Got it. Maybe this is caused by the cluster manager's state management. In general, if we start creating an xDS subscription before all clusters are initialized, this problem will occur. So I think that to resolve this, we should fix the init_manager implementation; this is a relatively deep-rooted problem. We should have prevention logic so that active_clusters_ is not used in the ClusterManager implementation, shouldn't we? cc. @lambdai @mandarjog

@tmshort

tmshort commented Aug 4, 2020

Any update on this? This is an intermittent problem on our istio deployment that stalls pods.

@Shikugawa
Member

@howardjohn Hey, I'd like to confirm my understanding of this. I don't completely understand what the problem is. Our problem is:

The attached cluster becomes active immediately after its SDS subscription sends a DiscoveryRequest to the SDS cluster, rather than staying warming until its init_fetch_timeout, when the DiscoveryResponse from the SDS cluster didn't contain the CA. We expect the attached cluster to stay warming in this situation.

Is this what you meant?

@howardjohn
Contributor Author

Yep. The cluster should be warming until the secret is fetched, but it's active immediately.

@ramaraochavali
Contributor

ramaraochavali commented Aug 23, 2020

We also run into this problem intermittently. In our case we stream file-based certs via SDS using Istio. When we have many clusters and the SDS push is delayed for a cluster, requests to that cluster fail with the error "OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_UNKNOWN" immediately after initialization, since the cluster is incorrectly marked as active. The problem is more prominent when there are many clusters.

@tmshort

tmshort commented Sep 14, 2020

Any update on this issue?

@Shikugawa
Member

@tmshort I think this issue has two problems. As part of this, I crafted a PR which is in review: #12783

@incfly
Contributor

incfly commented Oct 23, 2020

/cc @incfly

lizan pushed a commit that referenced this issue Oct 27, 2020
…cret entity (#13344)

This PR depends heavily on #12783. Changed to keep warming if dynamically inserted clusters (when initialization hasn't finished) fail to extract the TLS certificate and certificate validation context. They shouldn't be indicated as ACTIVE clusters.
Risk Level: Mid
Testing: Unit
Docs Changes:
Release Notes: Added
Fixes #11120, future work: #13777

Signed-off-by: Shikugawa <rei@tetrate.io>
istio-testing pushed a commit to istio/envoy that referenced this issue Nov 2, 2020
* cluster manager: avoid immediate activation for dynamic inserted cluster when initialize (envoyproxy#12783)

Signed-off-by: Shikugawa <rei@tetrate.io>

* sds: keep warming when dynamic inserted cluster can't be extracted secret entity (envoyproxy#13344)

This PR depends heavily on envoyproxy#12783. Changed to keep warming if dynamically inserted clusters (when initialization hasn't finished) fail to extract the TLS certificate and certificate validation context. They shouldn't be indicated as ACTIVE clusters.
Risk Level: Mid
Testing: Unit
Docs Changes:
Release Notes: Added
Fixes envoyproxy#11120, future work: envoyproxy#13777

Signed-off-by: Shikugawa <rei@tetrate.io>

Co-authored-by: Rei Shimizu <rei@tetrate.io>
@lizan reopened this Nov 3, 2020
@github-actions

github-actions bot commented Dec 9, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

The github-actions bot added the "stale" label Dec 9, 2020
@Shikugawa
Member

@lizan Can we close this?

@lizan removed the "investigate" and "stale" labels Dec 10, 2020
@lizan self-assigned this and unassigned Shikugawa Dec 10, 2020
@lizan
Member

lizan commented Dec 10, 2020

@Shikugawa No, this is not fixed; I'll have a fix soon.

lizan added a commit that referenced this issue Dec 16, 2020
Fixes #11120, allows more than one init manager to watch the same SDS init target so clusters/listeners won't be marked active immediately.

Additional Description:
Risk Level: Medium
Testing: integration test
Docs Changes: N/A
Release Notes: Added.

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
@shashankram

Hey @howardjohn, I recall you mentioning that Envoy sometimes does not fetch SDS secrets dynamically for clusters whose transport socket references an SDS config source. I have seen this issue as well. Was this fixed or root-caused? It seems like this was addressed in Istio by providing the transport socket certs within the cluster config itself.
