
connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: CERTIFICATE_VERIFY_FAILED #17484

Closed
sirkubax opened this issue Jul 26, 2021 · 5 comments
Labels: question (Questions that are neither investigations, bugs, nor enhancements), stale (stalebot believes this issue/PR has not been touched recently)

Comments

sirkubax commented Jul 26, 2021

Service inactive (conn_reset) due to certificate issue

via

https://www.envoyproxy.io/docs/envoy/latest/configuration/security/secret#config-secret-discovery-service

If a listener server certificate needs to be fetched by SDS remotely, it will NOT be marked as active, its port will not be opened before the certificates are fetched. If Envoy fails to fetch the certificates due to connection failures, or bad response data, the listener will be marked as active, and the port will be open, but the connection to the port will be reset.
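For reference, a listener server certificate "fetched by SDS remotely" is configured roughly like the sketch below. This is only an illustration of the mechanism the docs describe; the secret name and the SDS cluster name are hypothetical, not taken from my deployment:

# Illustrative DownstreamTlsContext with an SDS-fetched server certificate (names are hypothetical).
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    common_tls_context:
      tls_certificate_sds_secret_configs:
        - name: listener_server_cert          # secret served by the SDS server (hypothetical name)
          sds_config:
            resource_api_version: V3
            api_config_source:
              api_type: GRPC
              transport_api_version: V3
              grpc_services:
                - envoy_grpc:
                    cluster_name: sds_server  # hypothetical cluster pointing at the SDS endpoint

Until the secret named here is delivered, the listener stays warming; if delivery fails, connections to the port are reset, which is exactly the behaviour the quote above describes.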

Logs:

[2021-07-25 21:21:40.104][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:187] cm init: initializing cds
[2021-07-25 21:21:40.104][1][warning][main] [source/server/server.cc:609] there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
[2021-07-25 21:21:40.107][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamClusters gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2021-07-25 21:21:40.549][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamClusters gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2021-07-25 21:21:40.661][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamClusters gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2021-07-25 21:21:40.852][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamClusters gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2021-07-25 21:21:42.847][1][info][upstream] [source/common/upstream/cds_api_impl.cc:71] cds: add 3 cluster(s), remove 2 cluster(s)
[2021-07-25 21:21:42.858][1][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster 'build_server_scm'
[2021-07-25 21:21:42.859][1][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster 'appservice-ns-node1-v1_appzip-node1'
[2021-07-25 21:21:42.860][1][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster 'appservice-ns-node1-v1_app-two-node1'

Description:
I'm running a Kubernetes cluster as a 'compute platform' for Azure Arc-enabled Kubernetes, so I can run microservices and functions on that endpoint.
I simulate 'on-premise' Kubernetes, meaning it is a VM (in Azure) where k3s is running.
All works fine (slow, but fine) until I reset the cluster (tried with microk8s too).
It looks like Envoy is failing to refresh the certificate and puts the endpoint into a reset state.

Please make this issue more visible
Based on my observation, even with debug logs, this message about the failure to refresh the certificate is printed only once, after the k3s cluster is restarted, and it happens 5-15 seconds after Envoy starts (so I restart k3s, Envoy starts, I think it is serving my website correctly, then a few seconds later it refreshes the cert and gives conn_reset). This is a problem because I did not see this OPENSSL_internal error on consecutive restarts, meaning that if you are not lucky enough to spot it the first time, you are done.
The next time you try to access the website, you get 'no matching filter chain found', which is misleading, because to my understanding there is actually no active endpoint.

[2021-07-25 18:58:27.504][15][debug][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:79] tls inspector: new connection accepted
[2021-07-25 18:58:27.508][15][debug][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:150] tls:onServerName(), requestedServerName: app-two-node1.test-arc-cluster-ldy6mzw.westeurope.k4apps.io
[2021-07-25 18:58:27.508][15][debug][conn_handler] [source/server/connection_handler_impl.cc:469] closing connection: no matching filter chain found

I spent almost a week finding this very line (well, there were other issues), so the fact that the certificate issue is hidden and the logs indicate a filter problem is something I would consider a bug :/
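A way to at least surface this class of failure, without having to catch the single warning line, is to poll the admin interface for TLS and xDS error counters. A rough sketch (the admin port 19000 comes from the config below; exact counter names may differ between Envoy versions):

# Poll Envoy's admin interface for upstream TLS errors and xDS update failures.
curl -s http://127.0.0.1:19000/stats | grep -E 'ssl\.connection_error|update_failure|update_rejected'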

Please suggest a solution.
I still have not found which certificate is failing; the ones I have tested with openssl verify are matching.

  • What endpoint is Envoy communicating with in order to refresh xDS certs?
  • Is it possible to disable certificate verification via the configuration (some insecure_skip_verify or similar)? See the sketch after this list.
  • Any chance that an older/newer version of Envoy works?
  • Can the endpoint that is in the 'reset' state be activated, e.g. via the runtime_modify admin endpoint?
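On the second question: as far as I can tell there is no single insecure_skip_verify switch in Envoy; whether the upstream peer is verified is governed by the validation_context inside the UpstreamTlsContext. If I read the v3 API correctly, trust_chain_verification: ACCEPT_UNTRUSTED relaxes chain verification; a minimal sketch (untested assumption on my side, and obviously insecure, for debugging only):

transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    common_tls_context:
      validation_context:
        trust_chain_verification: ACCEPT_UNTRUSTED   # skip chain verification; insecure, debug only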

Repro steps:
Environment created following these docs (installation script linked below):
https://docs.microsoft.com/en-us/azure/azure-arc/kubernetes/quickstart-connect-cluster?tabs=azure-cli
https://docs.microsoft.com/en-us/azure/azure-arc/kubernetes/custom-locations#create-custom-location
https://docs.microsoft.com/en-us/azure/app-service/manage-create-arc-environment#install-the-app-service-extension
https://docs.microsoft.com/en-us/azure/azure-functions/create-first-function-arc-cli?tabs=csharp%2Cbrowser#create-storage-account
Installation script: https://pastebin.com/igCc5KwR

Admin and Stats Output:
config dump before restart (still working)
https://pastebin.com/3rviGyAD
config dump after restart (NOT working)
https://pastebin.com/vRTwQ9P4

Clusters working
https://pastebin.com/Fcfscex4

Clusters not working
https://pastebin.com/1cudjUim

Config:

apiVersion: v1
data:
  external-envoy.yaml: |
    node:
      cluster: frontend
      id: external-<<<pod-name>>>
    stats_flush_interval: 5.0s
    static_resources:
      clusters:
      - name: envoy_controller_cluster
        connect_timeout: 1s
        load_assignment:
          cluster_name: envoy_controller_cluster
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: appservice-ext-node1-v1-k8se-app-controller
                    port_value: 9090
        lb_policy: ROUND_ROBIN
        typed_extension_protocol_options:
          envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
            "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
            explicit_http_config:
              http2_protocol_options:
                connection_keepalive:
                  interval: 30s
                  timeout: 5s
        upstream_connection_options:
          tcp_keepalive:
            keepalive_probes: 3
            keepalive_time: 30
            keepalive_interval: 5
        type: STRICT_DNS
        transport_socket:
          name: envoy.transport_sockets.tls
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
            common_tls_context:
              tls_certificates:
                - certificate_chain:
                    filename: /etc/envoy/certs/cert.pem
                  private_key:
                    filename: /etc/envoy/certs/key.pem
      - name: envoy_stats_cluster
        connect_timeout: 1s
        load_assignment:
          cluster_name: envoy_stats_cluster
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: appservice-ext-node1-v1-k8se-http-scaler
                    port_value: 4055
        lb_policy: ROUND_ROBIN
        typed_extension_protocol_options:
          envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
            "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
            explicit_http_config:
              http2_protocol_options:
                connection_keepalive:
                  interval: 30s
                  timeout: 5s
        upstream_connection_options:
          tcp_keepalive:
            keepalive_probes: 3
            keepalive_time: 30
            keepalive_interval: 5
        type: STRICT_DNS
        transport_socket:
          name: envoy.transport_sockets.tls
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
            common_tls_context:
              tls_certificates:
                - certificate_chain:
                    filename: /etc/envoy/certs/cert.pem
                  private_key:
                    filename: /etc/envoy/certs/key.pem

    dynamic_resources:
      cds_config:
        resource_api_version: V3
        api_config_source:
          api_type: GRPC
          transport_api_version: V3
          grpc_services:
            - envoy_grpc:
                cluster_name: envoy_controller_cluster
          set_node_on_first_message_only: false
      lds_config:
        resource_api_version: V3
        api_config_source:
          api_type: GRPC
          transport_api_version: V3
          grpc_services:
            - envoy_grpc:
                cluster_name: envoy_controller_cluster
          set_node_on_first_message_only: false

    admin:
      access_log_path: /dev/stderr
      address:
        socket_address:
          address: 127.0.0.1
          port_value: 19000
    stats_sinks:
      - name: envoy.stat_sinks.metrics_service
        typed_config:
          "@type": type.googleapis.com/envoy.config.metrics.v3.MetricsServiceConfig
          transport_api_version: V3
          grpc_service:
            envoy_grpc:
              cluster_name: envoy_stats_cluster
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: appservice-ext-node1-v1
    meta.helm.sh/release-namespace: appservice-ns-node1-v1
  creationTimestamp: "2021-07-25T20:59:04Z"
  labels:
    app.kubernetes.io/instance: appservice-ext-node1-v1
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: k8se
    app.kubernetes.io/version: 1.0.20
    com.microsoft.k8se/build: 1.0.20
    helm.sh/chart: k8se-1.0.20
    k8se.microsoft.com/component: envoy
  name: appservice-ext-node1-v1-k8se-envoy
  namespace: appservice-ns-node1-v1
  resourceVersion: "1382128"
  uid: e1baefc4-787f-4744-af80-a81323bab71b
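Note that the envoy_controller_cluster transport socket above presents a client certificate but configures no validation_context. If the intent were to verify the controller against a known CA explicitly, it could be pinned roughly like this (the CA bundle path is hypothetical; I have not checked where the chart would mount it):

transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    common_tls_context:
      tls_certificates:
        - certificate_chain:
            filename: /etc/envoy/certs/cert.pem
          private_key:
            filename: /etc/envoy/certs/key.pem
      validation_context:
        trusted_ca:
          filename: /etc/envoy/certs/ca.pem   # hypothetical path for the controller CA bundle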

When the endpoint is failing, the user gets conn_reset:

curl  -v https://appzip-node1.test-arc-cluster-z8btz74.westeurope.k4apps.io/
*   Trying 20.67.115.34:443...
* TCP_NODELAY set
* Connected to appzip-node1.test-arc-cluster-z8btz74.westeurope.k4apps.io (20.67.115.34) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: Connection reset by peer in connection to appzip-node1.test-arc-cluster-z8btz74.westeurope.k4apps.io:443 
* Closing connection 0
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to appzip-node1.test-arc-cluster-z8btz74.westeurope.k4apps.io:443 
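To confirm that the reset happens during the handshake itself, before any server certificate is presented (rather than being a certificate mismatch on the listener), a direct TLS probe can be used, e.g.:

# Probe the listener directly; in the failing state the handshake is reset before a certificate is returned.
openssl s_client -connect appzip-node1.test-arc-cluster-z8btz74.westeurope.k4apps.io:443 \
  -servername appzip-node1.test-arc-cluster-z8btz74.westeurope.k4apps.io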
sirkubax added the bug and triage (Issue requires triage) labels on Jul 26, 2021
sirkubax (Author) commented:

Found in appservice-ext-node1-v1-k8se-app-controller-85cb587976-nwr6h.

I think this is the core of the problem.
Still a question: how can I see that my endpoint is disabled in any list like http://127.0.0.1:19000/clusters?

{"level":"info","ts":1627301876.563232,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"error","ts":1627301876.6502664,"logger":"envoy","msg":"Error reading default cert","error":"140453565756544:error:0D06B08E:asn1 encoding routines:asn1_d2i_read_bio:not enough data:../crypto/asn1/a_d2i_fp.c:198:\n\nexit status 1","stacktrace":"main.main\n\t/__w/k4a
pps/k4apps/cmd/appcontroller/main.go:124\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}
{"level":"error","ts":1627301876.6528075,"logger":"envoy","msg":"error reading dapr cert","error":"secrets "dapr-trust-bundle" not found","stacktrace":"main.main\n\t/__w/k4apps/k4apps/cmd/appcontroller/main.go:124\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}
{"level":"info","ts":1627301876.7418113,"logger":"setup","msg":"starting manager"}
I0726 12:17:56.741905 1 leaderelection.go:243] attempting to acquire leader lease appservice-ns-node1-v1/appservice-ns-node1-v1-appservice-ns-node1-v1...
{"level":"info","ts":1627301876.7421775,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"error","ts":1627301877.841797,"logger":"envoy","msg":"error reading default cert","error":"140261821150336:error:0D06B08E:asn1 encoding routines:asn1_d2i_read_bio:not enough data:../crypto/asn1/a_d2i_fp.c:198:\n\nexit status 1","stacktrace":"github.com/microsoft/k
4apps/pkg/envoy.(*XDSManagementServer).updateTLSCert\n\t/__w/k4apps/k4apps/pkg/envoy/envoy.go:379\ngithub.com/microsoft/k4apps/pkg/envoy.(*XDSManagementServer).watchDefaultTLSCert.func3\n\t/__w/k4apps/k4apps/pkg/envoy/envoy.go:336\nk8s.io/client-go/tools/cache.ResourceEvent
HandlerFuncs.OnAdd\n\t/go/pkg/mod/k8s.io/client-go@v0.20.4/tools/cache/controller.go:231\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.20.4/tools/cache/shared_informer.go:777\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.f
unc1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v
0.20.5/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:90\nk8s.io/client-go/tools/cache.(*processorListener).run\n\t/go/pkg/mod/k8s.io/client-go@v0.20.4/tools/cache/shared_informer.go:771\nk
8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:73"}
{"level":"info","ts":1627301877.8429492,"logger":"envoy","msg":"Processed on startup","count":2}
{"level":"info","ts":1627301877.8429906,"logger":"envoy.stopwatch","msg":"measured: ","Initializing snapshot":1101}
{"level":"info","ts":1627301877.8430026,"logger":"envoy","msg":"starting xds and auth server on port 9090"}
I0726 12:18:13.706604 1 leaderelection.go:253] successfully acquired lease appservice-ns-node1-v1/appservice-ns-node1-v1-appservice-ns-node1-v1
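Regarding the /clusters question above: as far as I can tell the cluster list alone does not show this state; what the xDS server actually delivered (including filter chains) and which certificates Envoy has loaded are visible via the other admin endpoints (port 19000 per the ConfigMap):

curl -s http://127.0.0.1:19000/listeners      # listeners Envoy considers active
curl -s http://127.0.0.1:19000/config_dump    # full static + dynamic config, including filter chains
curl -s http://127.0.0.1:19000/certs          # certificates currently loaded by Envoy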

sirkubax (Author) commented:

So I'd declare the logging issue a bug - the logs should make it clear that the endpoint is active but not functional.
'no matching filter chain found' is a misleading statement - the filters are missing because they were not injected by the dynamic configuration (because the endpoint is active but not functional), not because the definition or anything else is missing.
It should log that there is a problem enabling the endpoint, not a problem with the filters.

mattklein123 added the question label and removed the bug and triage labels on Aug 6, 2021
github-actions bot commented Sep 6, 2021

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label on Sep 6, 2021
github-actions bot commented:

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

divramod commented Aug 4, 2022

@sirkubax have you come to a solution? I am facing the same issue.
