Envoy endpoints not loading since Chrome release 124 #33850

Closed
cd-fernando opened this issue Apr 29, 2024 · 18 comments
Labels: area/tls, bug, investigate

Comments

@cd-fernando

cd-fernando commented Apr 29, 2024

Title: Envoy not responding since release of Chrome 124

Description:
As part of the release of Chrome version 124, Google seems to have widely enabled the post-quantum secure TLS key encapsulation mechanism Kyber768 for TLS 1.3.

Repro steps:
In Chrome flags chrome://flags/ check if "TLS 1.3 hybridized Kyber support" is enabled
From Chrome, connect to any TLS endpoint served by Envoy; it gets stuck on a spinning wheel until it fails.

Note:
I was running a very old version of Envoy (1.15), but I upgraded all the way to 1.30.1.

I would appreciate any suggestions on this issue; I am not well versed in Envoy.

cd-fernando added the bug and triage labels Apr 29, 2024
adisuissa added the area/tls and investigate labels and removed the triage label Apr 29, 2024
@adisuissa
Contributor

cc @ggreenway who may know more about the different TLS protocols that are supported.

@ggreenway
Contributor

I would find it shocking if chrome wouldn't be willing to negotiate one of the older TLS 1.3 ciphers in this case; I'd guess that most TLS endpoints on the internet don't yet support the post-quantum cipher suites. Regardless, I think if there's an issue here, it's a bug in chrome.

@cd-fernando
Author

cd-fernando commented Apr 30, 2024

I would find it shocking if chrome wouldn't be willing to negotiate one of the older TLS 1.3 ciphers in this case; I'd guess that most TLS endpoints on the internet don't yet support the post-quantum cipher suites. Regardless, I think if there's an issue here, it's a bug in chrome.

The problem has made some news websites like MSN. I'm not sure what the policy is on posting links, so I'll post an excerpt:

Despite months of testing, the problem seems to have risen from web servers failing to adequately implement TLS, rather than an issue with Chrome. The error results in the rejection of connections that use the Kyber768 quantum-resistant key agreement algorithm, including connections with Chrome’s hybrid key.
Clearly, this is not a simple fix that can be implemented by Chrome, but it requires a larger and more orchestrated effort to transform the Internet into one that can handle sophisticated quantum-safe cryptography.
For now, affected users are being advised to disable the TLS 1.3 hybridized Kyber support in Chrome. However, long-term post-quantum secure ciphers will be essential in TLS, and the ability to disable the feature will likely be removed in the future, highlighting the importance of addressing the issue’s root cause earlier on so that websites can be prepared for quantum-based attacks in the future.

@cd-fernando
Author

I might have found a helpful comment via DDG:

These errors are not caused by a bug in Google Chrome but instead caused by web servers failing to properly implement Transport Layer Security (TLS) and not being able to handle larger ClientHello messages for post-quantum cryptography.

I think I saw a setting that affects this size in Envoy
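
For a rough sense of scale on the "larger ClientHello" mentioned above, here is a minimal Python sketch of the arithmetic. It assumes the hybrid key share is an X25519 public key (32 bytes) concatenated with a Kyber768 encapsulation key (1184 bytes). The 1460-byte segment size and the 500 bytes for the rest of the ClientHello are illustrative assumptions, not values measured in this thread.

```python
# Key sizes in bytes, taken from the X25519 and Kyber768 parameter sets.
X25519_PUBLIC_KEY = 32
KYBER768_ENCAPSULATION_KEY = 1184

# Assumptions for illustration only: a typical TCP MSS on a 1500-byte
# Ethernet MTU, and a guess at the size of everything else in the ClientHello.
ASSUMED_TCP_MSS = 1460
ASSUMED_OTHER_CLIENT_HELLO_BYTES = 500


def hybrid_key_share_size() -> int:
    """Size of the client's hybrid key share: both public values concatenated."""
    return X25519_PUBLIC_KEY + KYBER768_ENCAPSULATION_KEY


def fits_in_one_segment(payload_size: int, mss: int = ASSUMED_TCP_MSS) -> bool:
    """True if a payload of this size fits in a single TCP segment."""
    return payload_size <= mss


share = hybrid_key_share_size()
classic_hello = X25519_PUBLIC_KEY + ASSUMED_OTHER_CLIENT_HELLO_BYTES
hybrid_hello = share + ASSUMED_OTHER_CLIENT_HELLO_BYTES

print(share)                               # 1216
print(fits_in_one_segment(classic_hello))  # True
print(fits_in_one_segment(hybrid_hello))   # False
```

Under these assumptions the hybrid ClientHello no longer fits in one segment, so a server has to be prepared to read it across more than one packet.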

@cd-fernando
Author

I can't find such an option. Any suggestions?

@ggreenway
Contributor

Can you capture a full tcpdump of the failed handshake and post it?

@cd-fernando
Author

@ggreenway it doesn't seem to accept pcap files. Any suggestions?

@cd-fernando
Author

kyber768.pcap.gz

Ignore that, here's the gzipped pcap

@ggreenway
Contributor

Huh, it looks like the tcp window is closed after only 1400 bytes. Can you post the full envoy configuration you used for this test?

@sschepens
Contributor

Huh, it looks like the tcp window is closed after only 1400 bytes. Can you post the full envoy configuration you used for this test?

1400 bytes seems like a typical MTU, maybe we're limiting handshake to a single packet?

@cd-fernando
Author

cd-fernando commented May 2, 2024

Here's a cut-down version of our config. I hope I didn't axe too much:

---
admin:
  # access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
### BEGIN http frontends ###
  - name: apis
    address:
      socket_address: { address: 0.0.0.0, port_value: 443 }
    listener_filters:
    - name: "envoy.filters.listener.tls_inspector"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector
    filter_chains:
    - filter_chain_match:
        server_names: ["*.testdomain.dev"]
        transport_protocol: "tls"
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
          common_tls_context:
            tls_params:
              tls_minimum_protocol_version: TLSv1_2
              cipher_suites: "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305"
            tls_certificates:
            - certificate_chain:
                filename: /etc/envoy/STAR.testdomain.dev.crt
              private_key:
                filename: /etc/envoy/STAR.testdomain.dev.key
      filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          upgrade_configs:
            - upgrade_type: connect
          codec_type: AUTO
          use_remote_address: true
          xff_num_trusted_hops: 0
          access_log:
            - name: envoy.access_loggers.file
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                path: "/dev/stdout"
          route_config:
            name: local_route
            virtual_hosts:
            - name: system_api
              domains: ["api.testdomain.dev"]
              routes:
              - match: { prefix: "/api/interact/" }
                route: { cluster: windfarm }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: windfarm
    connect_timeout: 0.25s
    type: STATIC
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    health_checks:
    - timeout: 1s
      interval: 1s
      unhealthy_threshold: 1
      healthy_threshold: 2
      http_health_check:
        path: /ping
    load_assignment:
      cluster_name: windfarm
      endpoints:
        - lb_endpoints:
           - endpoint:
              address:
                socket_address:
                  address: 127.0.0.1
                  port_value: 29091

@vparla

vparla commented May 2, 2024

https://tldr.fail/ describes the issue. It is likely that Envoy is not reading the entire Client Hello as it spans packets.
Python test scripts can be found here:
github.com/dadrian/tldr.fail/blob/main/tldr_fail_test.py
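
The failure mode described above (a server that does not keep reading when the ClientHello arrives in more than one piece) can be illustrated with a small Python sketch. This is not the tldr.fail script; `send_in_fragments` and `recv_exactly` are hypothetical helpers, and the demo uses a local `socket.socketpair()` with dummy bytes rather than a real TLS handshake.

```python
import socket
import time


def send_in_fragments(sock: socket.socket, payload: bytes,
                      first_chunk: int, delay_s: float = 0.0) -> int:
    """Send payload as up to two separate writes and return the write count."""
    head, tail = payload[:first_chunk], payload[first_chunk:]
    sock.sendall(head)
    if not tail:
        return 1
    time.sleep(delay_s)  # give the peer a chance to see only the first part
    sock.sendall(tail)
    return 2


def recv_exactly(sock: socket.socket, size: int) -> bytes:
    """Keep reading until size bytes have arrived or the peer closes."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            break
        buf += chunk
    return bytes(buf)


# Demo with dummy data standing in for a large ClientHello.
sender, receiver = socket.socketpair()
payload = bytes(range(256)) * 7  # 1792 bytes
writes = send_in_fragments(sender, payload, first_chunk=1400)
received = recv_exactly(receiver, len(payload))
sender.close()
receiver.close()
print(writes, received == payload)
```

A receiver that loops like `recv_exactly` copes with the split; one that assumes the first read contains the whole message does not. Against a real TCP endpoint the two writes may still be coalesced by the stack, so a delay or `TCP_NODELAY` is usually needed to force separate packets.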

@ggreenway
Contributor

@vparla thanks for that. I used the linked Python script to test against the latest Envoy and it worked correctly (Envoy sent back a ServerHello).

@cd-fernando can you try the script at https://github.com/dadrian/tldr.fail/blob/main/tldr_fail_test.py and see what it returns?

Can anyone else test this and report success or failure?

@ggreenway
Contributor

I got one more report of it working correctly with Chrome.

I think I understand what's going on: the TlsInspector filter doesn't read from the socket; it peeks. This means the entire ClientHello needs to fit into the configured socket read buffer size.

I saw in the tcpdump that the server had a fully filled up tcp window (of only about 1500 bytes), and I didn't realize until now that it was the TlsInspector, not the TLS transport socket, that was getting stuck.

I've never seen a socket read buffer configured that small. What OS are you using?

If you don't need to select a filter chain based on SNI, you can remove the TlsInspector from your config and that should fix this.
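
To illustrate the difference between peeking and reading described above, here is a minimal Python sketch using `MSG_PEEK` on a local socket pair. It is only an analogy for the behaviour, not Envoy's implementation (Envoy is written in C++), and `peek_then_read` is a hypothetical helper name.

```python
import socket


def peek_then_read(payload: bytes) -> tuple[bytes, bytes]:
    """Send payload over a socket pair, peek at it, then actually read it."""
    writer, reader = socket.socketpair()
    try:
        writer.sendall(payload)
        # MSG_PEEK returns the data but leaves it in the kernel receive buffer.
        peeked = reader.recv(len(payload), socket.MSG_PEEK)
        # A normal recv sees the same bytes again and consumes them.
        consumed = reader.recv(len(payload))
        return peeked, consumed
    finally:
        writer.close()
        reader.close()


peeked, consumed = peek_then_read(b"hello")
print(peeked == consumed)  # True
```

Because peeked bytes are never removed from the receive buffer, a peeking reader can only ever see as much of a message as that buffer can hold at once. If the message is larger than the buffer, the sender stalls on a full window and the rest never arrives.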

@cd-fernando
Author

Thanks everyone for looking into this.
@ggreenway you were right: our TCP maximum window size was too small. It was set to 4096, and that wasn't enough.

For a bit more background, in case you're curious: it's a value we had set solely to optimise HAProxy. Unfortunately, because our QA boxes are self-contained, that setting affected Envoy too. It had never been a problem until these quantum-resistant protocols were enabled.

Thank you very much again!

FYI that script doesn't seem to work on versions of Python older than 3.11.

@cd-fernando
Author

Thank you very much for your assistance, I'm closing this now.

@dadrian

dadrian commented May 13, 2024

Just so I can document this on tldr.fail: where is the socket read buffer configured in this context? Is that an Envoy, TLS Inspector, or kernel setting (or somewhere else)? I would have expected that to be internal to Envoy's implementation, but it didn't seem like this needed a code change?

@ggreenway
Contributor

@dadrian it's the kernel socket receive buffer. On Linux it's normally set with sysctl. This is a shortcoming in how Envoy implements this, but it's extremely uncommon to have such a small socket receive buffer, and changing Envoy to handle this condition is not simple, so until someone decides to put in the effort to fix it, I think this will remain as a known issue.
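
For anyone who wants to check their own hosts, here is a minimal Python sketch that inspects the receive buffer limits. The thread does not name the exact sysctl key, so `net.ipv4.tcp_rmem` (three values: minimum, default, maximum) is an assumption, as is the 16384-byte threshold. In the capture above the advertised window was only about 1500 bytes with a 4096 setting, so the threshold is deliberately generous rather than tied to the ClientHello size.

```python
from typing import NamedTuple

# Assumed location of the net.ipv4.tcp_rmem sysctl on Linux.
TCP_RMEM_PATH = "/proc/sys/net/ipv4/tcp_rmem"
# Arbitrary, generous threshold chosen for illustration only.
ASSUMED_MIN_SAFE_BYTES = 16384


class TcpRmem(NamedTuple):
    minimum: int
    default: int
    maximum: int


def parse_tcp_rmem(text: str) -> TcpRmem:
    """Parse the three whitespace-separated values of tcp_rmem."""
    parts = text.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 values, got {len(parts)}")
    return TcpRmem(*(int(part) for part in parts))


def is_suspiciously_small(rmem: TcpRmem,
                          threshold: int = ASSUMED_MIN_SAFE_BYTES) -> bool:
    """Flag a maximum receive buffer below the assumed threshold."""
    return rmem.maximum < threshold


def read_tcp_rmem(path: str = TCP_RMEM_PATH) -> TcpRmem:
    """Read the current limits from procfs (Linux only)."""
    with open(path, encoding="ascii") as handle:
        return parse_tcp_rmem(handle.read())


tuned = parse_tcp_rmem("4096 4096 4096")
print(is_suspiciously_small(tuned))  # True
```

On a Linux host, `is_suspiciously_small(read_tcp_rmem())` applies the same check to the live setting.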
