
connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: CERTIFICATE_VERIFY_FAILED #17484

Closed
sirkubax opened this issue Jul 26, 2021 · 5 comments
Labels: question (Questions that are neither investigations, bugs, nor enhancements), stale (stalebot believes this issue/PR has not been touched recently)

Comments

sirkubax commented Jul 26, 2021

Service inactive (conn_reset) due to certificate issue

via

https://www.envoyproxy.io/docs/envoy/latest/configuration/security/secret#config-secret-discovery-service

If a listener server certificate needs to be fetched by SDS remotely, it will NOT be marked as active, its port will not be opened before the certificates are fetched. If Envoy fails to fetch the certificates due to connection failures, or bad response data, the listener will be marked as active, and the port will be open, but the connection to the port will be reset.
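For reference, a listener server certificate "fetched by SDS remotely" is configured roughly like the sketch below. This is only an illustration of the mechanism the docs describe; the secret name and the SDS cluster name are hypothetical, not taken from my deployment:

# Illustrative DownstreamTlsContext with an SDS-fetched server certificate (names are hypothetical).
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    common_tls_context:
      tls_certificate_sds_secret_configs:
        - name: listener_server_cert          # secret served by the SDS server (hypothetical name)
          sds_config:
            resource_api_version: V3
            api_config_source:
              api_type: GRPC
              transport_api_version: V3
              grpc_services:
                - envoy_grpc:
                    cluster_name: sds_server  # hypothetical cluster pointing at the SDS endpoint

Until the secret named here is delivered, the listener stays warming; if delivery fails, connections to the port are reset, which is exactly the behaviour the quote above describes.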

Logs:

[2021-07-25 21:21:40.104][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:187] cm init: initializing cds
[2021-07-25 21:21:40.104][1][warning][main] [source/server/server.cc:609] there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
[2021-07-25 21:21:40.107][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamClusters gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2021-07-25 21:21:40.549][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamClusters gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2021-07-25 21:21:40.661][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamClusters gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2021-07-25 21:21:40.852][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamClusters gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2021-07-25 21:21:42.847][1][info][upstream] [source/common/upstream/cds_api_impl.cc:71] cds: add 3 cluster(s), remove 2 cluster(s)
[2021-07-25 21:21:42.858][1][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster 'build_server_scm'
[2021-07-25 21:21:42.859][1][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster 'appservice-ns-node1-v1_appzip-node1'
[2021-07-25 21:21:42.860][1][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster 'appservice-ns-node1-v1_app-two-node1'

Description:
I'm running a Kubernetes cluster as a 'compute platform' for Azure Arc-enabled Kubernetes, so I can run microservices and functions on that endpoint.
I simulate 'on-premise' Kubernetes, meaning it is a VM (in Azure) where k3s is running.
All works fine (slow, but fine) until I reset the cluster (tried with microk8s too).
It looks like Envoy is failing to refresh the certificate and puts the endpoint into a reset state.

Please make this issue more visible
Based on my observation, even with debug logs, this message about the failure to refresh the certificate is printed only once, after the k3s cluster is restarted, and it happens 5-15 seconds after Envoy starts (so I restart k3s, Envoy starts, I think it is serving my website correctly, then a few seconds later it refreshes the cert and gives conn_reset). This is a problem because I did not see this OPENSSL_internal error on consecutive restarts, meaning that if you are not lucky enough to spot it the first time, you are done.
The next time you try to access the website, you get 'no matching filter chain found', which is misleading, because to my understanding there is actually no active endpoint.

[2021-07-25 18:58:27.504][15][debug][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:79] tls inspector: new connection accepted
[2021-07-25 18:58:27.508][15][debug][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:150] tls:onServerName(), requestedServerName: app-two-node1.test-arc-cluster-ldy6mzw.westeurope.k4apps.io
[2021-07-25 18:58:27.508][15][debug][conn_handler] [source/server/connection_handler_impl.cc:469] closing connection: no matching filter chain found

I spent almost a week finding this very line (well, there were other issues), so the fact that the certificate issue is hidden and the logs indicate a filter problem is something I would consider a bug :/
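A way to at least surface this class of failure, without having to catch the single warning line, is to poll the admin interface for TLS and xDS error counters. A rough sketch (the admin port 19000 comes from the config below; exact counter names may differ between Envoy versions):

# Poll Envoy's admin interface for upstream TLS errors and xDS update failures.
curl -s http://127.0.0.1:19000/stats | grep -E 'ssl\.connection_error|update_failure|update_rejected'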

Please suggest a solution.
I still have not found which certificate is failing; the ones I have tested with openssl verify are matching.

  • What endpoint is Envoy communicating with in order to refresh xDS certs?
  • Is it possible to disable certificate verification via the configuration (some insecure_skip_verify or similar)? See the sketch after this list.
  • Any chance that an older/newer version of Envoy works?
  • Can the endpoint that is in the 'reset' state be activated, e.g. via the runtime_modify admin endpoint?
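On the second question: as far as I can tell there is no single insecure_skip_verify switch in Envoy; whether the upstream peer is verified is governed by the validation_context inside the UpstreamTlsContext. If I read the v3 API correctly, trust_chain_verification: ACCEPT_UNTRUSTED relaxes chain verification; a minimal sketch (untested assumption on my side, and obviously insecure, for debugging only):

transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    common_tls_context:
      validation_context:
        trust_chain_verification: ACCEPT_UNTRUSTED   # skip chain verification; insecure, debug only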

Repro steps:
Environment created following these docs (installation script linked below):
https://docs.microsoft.com/en-us/azure/azure-arc/kubernetes/quickstart-connect-cluster?tabs=azure-cli
https://docs.microsoft.com/en-us/azure/azure-arc/kubernetes/custom-locations#create-custom-location
https://docs.microsoft.com/en-us/azure/app-service/manage-create-arc-environment#install-the-app-service-extension
https://docs.microsoft.com/en-us/azure/azure-functions/create-first-function-arc-cli?tabs=csharp%2Cbrowser#create-storage-account
Installation script: https://pastebin.com/igCc5KwR

Admin and Stats Output:
config dump before restart (still working)
https://pastebin.com/3rviGyAD
config dump after restart (NOT working)
https://pastebin.com/vRTwQ9P4

Clusters working
https://pastebin.com/Fcfscex4

Clusters not working
https://pastebin.com/1cudjUim

Config:

apiVersion: v1
data:
  external-envoy.yaml: |
    node:
      cluster: frontend
      id: external-<<<pod-name>>>
    stats_flush_interval: 5.0s
    static_resources:
      clusters:
      - name: envoy_controller_cluster
        connect_timeout: 1s
        load_assignment:
          cluster_name: envoy_controller_cluster
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: appservice-ext-node1-v1-k8se-app-controller
                    port_value: 9090
        lb_policy: ROUND_ROBIN
        typed_extension_protocol_options:
          envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
            "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
            explicit_http_config:
              http2_protocol_options:
                connection_keepalive:
                  interval: 30s
                  timeout: 5s
        upstream_connection_options:
          tcp_keepalive:
            keepalive_probes: 3
            keepalive_time: 30
            keepalive_interval: 5
        type: STRICT_DNS
        transport_socket:
          name: envoy.transport_sockets.tls
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
            common_tls_context:
              tls_certificates:
                - certificate_chain:
                    filename: /etc/envoy/certs/cert.pem
                  private_key:
                    filename: /etc/envoy/certs/key.pem
      - name: envoy_stats_cluster
        connect_timeout: 1s
        load_assignment:
          cluster_name: envoy_stats_cluster
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: appservice-ext-node1-v1-k8se-http-scaler
                    port_value: 4055
        lb_policy: ROUND_ROBIN
        typed_extension_protocol_options:
          envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
            "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
            explicit_http_config:
              http2_protocol_options:
                connection_keepalive:
                  interval: 30s
                  timeout: 5s
        upstream_connection_options:
          tcp_keepalive:
            keepalive_probes: 3
            keepalive_time: 30
            keepalive_interval: 5
        type: STRICT_DNS
        transport_socket:
          name: envoy.transport_sockets.tls
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
            common_tls_context:
              tls_certificates:
                - certificate_chain:
                    filename: /etc/envoy/certs/cert.pem
                  private_key:
                    filename: /etc/envoy/certs/key.pem

    dynamic_resources:
      cds_config:
        resource_api_version: V3
        api_config_source:
          api_type: GRPC
          transport_api_version: V3
          grpc_services:
            - envoy_grpc:
                cluster_name: envoy_controller_cluster
          set_node_on_first_message_only: false
      lds_config:
        resource_api_version: V3
        api_config_source:
          api_type: GRPC
          transport_api_version: V3
          grpc_services:
            - envoy_grpc:
                cluster_name: envoy_controller_cluster
          set_node_on_first_message_only: false

    admin:
      access_log_path: /dev/stderr
      address:
        socket_address:
          address: 127.0.0.1
          port_value: 19000
    stats_sinks:
      - name: envoy.stat_sinks.metrics_service
        typed_config:
          "@type": type.googleapis.com/envoy.config.metrics.v3.MetricsServiceConfig
          transport_api_version: V3
          grpc_service:
            envoy_grpc:
              cluster_name: envoy_stats_cluster
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: appservice-ext-node1-v1
    meta.helm.sh/release-namespace: appservice-ns-node1-v1
  creationTimestamp: "2021-07-25T20:59:04Z"
  labels:
    app.kubernetes.io/instance: appservice-ext-node1-v1
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: k8se
    app.kubernetes.io/version: 1.0.20
    com.microsoft.k8se/build: 1.0.20
    helm.sh/chart: k8se-1.0.20
    k8se.microsoft.com/component: envoy
  name: appservice-ext-node1-v1-k8se-envoy
  namespace: appservice-ns-node1-v1
  resourceVersion: "1382128"
  uid: e1baefc4-787f-4744-af80-a81323bab71b
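Note that the envoy_controller_cluster transport socket above presents a client certificate but configures no validation_context. If the intent were to verify the controller against a known CA explicitly, it could be pinned roughly like this (the CA bundle path is hypothetical; I have not checked where the chart would mount it):

transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    common_tls_context:
      tls_certificates:
        - certificate_chain:
            filename: /etc/envoy/certs/cert.pem
          private_key:
            filename: /etc/envoy/certs/key.pem
      validation_context:
        trusted_ca:
          filename: /etc/envoy/certs/ca.pem   # hypothetical path for the controller CA bundle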

When the endpoint is failing, the user gets conn_reset:

curl  -v https://appzip-node1.test-arc-cluster-z8btz74.westeurope.k4apps.io/
*   Trying 20.67.115.34:443...
* TCP_NODELAY set
* Connected to appzip-node1.test-arc-cluster-z8btz74.westeurope.k4apps.io (20.67.115.34) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: Connection reset by peer in connection to appzip-node1.test-arc-cluster-z8btz74.westeurope.k4apps.io:443 
* Closing connection 0
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to appzip-node1.test-arc-cluster-z8btz74.westeurope.k4apps.io:443 
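To confirm that the reset happens during the handshake itself, before any server certificate is presented (rather than being a certificate mismatch on the listener), a direct TLS probe can be used, e.g.:

# Probe the listener directly; in the failing state the handshake is reset before a certificate is returned.
openssl s_client -connect appzip-node1.test-arc-cluster-z8btz74.westeurope.k4apps.io:443 \
  -servername appzip-node1.test-arc-cluster-z8btz74.westeurope.k4apps.io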
sirkubax added the bug and triage (Issue requires triage) labels on Jul 26, 2021
sirkubax (Author) commented:

Found in appservice-ext-node1-v1-k8se-app-controller-85cb587976-nwr6h.

I think this is the core of the problem.
Still a question: how can I see that my endpoint is disabled in any list like http://127.0.0.1:19000/clusters?

{"level":"info","ts":1627301876.563232,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"error","ts":1627301876.6502664,"logger":"envoy","msg":"Error reading default cert","error":"140453565756544:error:0D06B08E:asn1 encoding routines:asn1_d2i_read_bio:not enough data:../crypto/asn1/a_d2i_fp.c:198:\n\nexit status 1","stacktrace":"main.main\n\t/__w/k4a
pps/k4apps/cmd/appcontroller/main.go:124\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}
{"level":"error","ts":1627301876.6528075,"logger":"envoy","msg":"error reading dapr cert","error":"secrets "dapr-trust-bundle" not found","stacktrace":"main.main\n\t/__w/k4apps/k4apps/cmd/appcontroller/main.go:124\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}
{"level":"info","ts":1627301876.7418113,"logger":"setup","msg":"starting manager"}
I0726 12:17:56.741905 1 leaderelection.go:243] attempting to acquire leader lease appservice-ns-node1-v1/appservice-ns-node1-v1-appservice-ns-node1-v1...
{"level":"info","ts":1627301876.7421775,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"error","ts":1627301877.841797,"logger":"envoy","msg":"error reading default cert","error":"140261821150336:error:0D06B08E:asn1 encoding routines:asn1_d2i_read_bio:not enough data:../crypto/asn1/a_d2i_fp.c:198:\n\nexit status 1","stacktrace":"github.com/microsoft/k
4apps/pkg/envoy.(*XDSManagementServer).updateTLSCert\n\t/__w/k4apps/k4apps/pkg/envoy/envoy.go:379\ngithub.com/microsoft/k4apps/pkg/envoy.(*XDSManagementServer).watchDefaultTLSCert.func3\n\t/__w/k4apps/k4apps/pkg/envoy/envoy.go:336\nk8s.io/client-go/tools/cache.ResourceEvent
HandlerFuncs.OnAdd\n\t/go/pkg/mod/k8s.io/client-go@v0.20.4/tools/cache/controller.go:231\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.20.4/tools/cache/shared_informer.go:777\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.f
unc1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v
0.20.5/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:90\nk8s.io/client-go/tools/cache.(*processorListener).run\n\t/go/pkg/mod/k8s.io/client-go@v0.20.4/tools/cache/shared_informer.go:771\nk
8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:73"}
{"level":"info","ts":1627301877.8429492,"logger":"envoy","msg":"Processed on startup","count":2}
{"level":"info","ts":1627301877.8429906,"logger":"envoy.stopwatch","msg":"measured: ","Initializing snapshot":1101}
{"level":"info","ts":1627301877.8430026,"logger":"envoy","msg":"starting xds and auth server on port 9090"}
I0726 12:18:13.706604 1 leaderelection.go:253] successfully acquired lease appservice-ns-node1-v1/appservice-ns-node1-v1-appservice-ns-node1-v1
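Regarding the /clusters question above: as far as I can tell the cluster list alone does not show this state; what the xDS server actually delivered (including filter chains) and which certificates Envoy has loaded are visible via the other admin endpoints (port 19000 per the ConfigMap):

curl -s http://127.0.0.1:19000/listeners      # listeners Envoy considers active
curl -s http://127.0.0.1:19000/config_dump    # full static + dynamic config, including filter chains
curl -s http://127.0.0.1:19000/certs          # certificates currently loaded by Envoy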

sirkubax (Author) commented:

So I'd declare the logging issue a bug - the logs should make it clear that the endpoint is active but not functional.
'no matching filter chain found' is a misleading statement - the filters are missing because they were not injected by the dynamic configuration (because the endpoint is active but not functional), not because the definition or anything else is missing.
It should log that there is a problem enabling the endpoint, not a problem with the filters.

mattklein123 added the question label and removed the bug and triage labels on Aug 6, 2021
github-actions bot commented Sep 6, 2021

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label on Sep 6, 2021
github-actions bot commented:

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

divramod commented Aug 4, 2022

@sirkubax have you come to a solution? I am facing the same issue.
