
gRPC 1.20.1 upgrade causes gRPC connections to fail #6897

Closed
TBoshoven opened this issue May 10, 2019 · 22 comments
Labels
question Questions that are neither investigations, bugs, nor enhancements
stale stalebot believes this issue/PR has not been touched recently

Comments

@TBoshoven

TBoshoven commented May 10, 2019

Title: gRPC 1.20.1 upgrade causes gRPC connections to fail

Description:
I use Envoy's gRPC-Web filter to integrate against a few gRPC applications built in Python. These applications use a secured connection.
Since they upgraded to gRPC v1.20.1 (from v1.19.x), Envoy has been unable to get proper responses. The version upgrade is the only change, and other applications have no problem communicating with the updated services.
The error I am getting is gRPC status 14 with this message:

upstream connect error or disconnect/reset before headers. reset reason: connection termination

I tried running this against the current stable version as well as yesterday's dev build and got the same results.

I have also been unable to reproduce this issue in isolation: I created a small gRPC application using Python and gRPC 1.20.1, and I could communicate with it over gRPC just fine. My little test service was unsecured but otherwise pretty similar, so I suspect a TLS issue.

The only other relevant thing I can think of is that all of these services are behind (TCP) load balancers.

Config:
My config is more or less the same as in the example. I fixed the indentation of the bottom part and (of course) updated addresses to match my services.

Logs:
These are the logs of a failing request (minus the request itself):

[2019-05-09 19:20:34.438][19][debug][client] [source/common/http/codec_client.cc:26] [C3] connecting
[2019-05-09 19:20:34.438][19][debug][connection] [source/common/network/connection_impl.cc:695] [C3] connecting to <THE IP ADDRESS>:443
[2019-05-09 19:20:34.438][19][debug][connection] [source/common/network/connection_impl.cc:704] [C3] connection in progress
[2019-05-09 19:20:34.438][19][debug][http2] [source/common/http/http2/codec_impl.cc:725] [C3] setting stream-level initial window size to 268435456
[2019-05-09 19:20:34.438][19][debug][http2] [source/common/http/http2/codec_impl.cc:747] [C3] updating connection-level initial window size to 268435456
[2019-05-09 19:20:34.438][19][debug][pool] [source/common/http/conn_pool_base.cc:20] queueing request due to no available connections
[2019-05-09 19:20:34.438][19][debug][http] [source/common/http/conn_manager_impl.cc:1062] [C0][S2159400524055061755] request end stream
[2019-05-09 19:20:34.454][19][debug][connection] [source/common/network/connection_impl.cc:543] [C3] connected
[2019-05-09 19:20:34.454][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.470][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.470][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.470][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.471][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.471][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.471][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.472][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.472][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.506][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:131] [C3] handshake complete
[2019-05-09 19:20:34.506][19][debug][client] [source/common/http/codec_client.cc:64] [C3] connected
[2019-05-09 19:20:34.506][19][debug][pool] [source/common/http/http2/conn_pool.cc:96] [C3] creating stream
[2019-05-09 19:20:34.506][19][debug][router] [source/common/router/router.cc:1211] [C0][S2159400524055061755] pool ready
[2019-05-09 19:20:34.507][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C3] 
[2019-05-09 19:20:34.507][19][debug][connection] [source/common/network/connection_impl.cc:511] [C3] remote close
[2019-05-09 19:20:34.507][19][debug][connection] [source/common/network/connection_impl.cc:192] [C3] closing socket: 0
[2019-05-09 19:20:34.507][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:244] [C3] SSL shutdown: rc=0
[2019-05-09 19:20:34.507][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C3] 
[2019-05-09 19:20:34.507][19][debug][client] [source/common/http/codec_client.cc:82] [C3] disconnect. resetting 1 pending requests
[2019-05-09 19:20:34.507][19][debug][client] [source/common/http/codec_client.cc:105] [C3] request reset
[2019-05-09 19:20:34.507][19][debug][pool] [source/common/http/http2/conn_pool.cc:237] [C3] destroying stream: 0 remaining
[2019-05-09 19:20:34.507][19][debug][router] [source/common/router/router.cc:672] [C0][S2159400524055061755] upstream reset: reset reason connection termination
[2019-05-09 19:20:34.507][19][debug][http] [source/common/http/conn_manager_impl.cc:1137] [C0][S2159400524055061755] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2019-05-09 19:20:34.507][19][debug][http] [source/common/http/conn_manager_impl.cc:1329] [C0][S2159400524055061755] encoding headers via codec (end_stream=true):
':status', '200'
'content-type', 'application/grpc-web-text+proto'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection termination'
'access-control-allow-origin', 'http://localhost:8081'
'access-control-expose-headers', 'grpc-status,grpc-message'
'date', 'Thu, 09 May 2019 19:20:34 GMT'
'server', 'envoy'

[2019-05-09 19:20:34.507][19][debug][pool] [source/common/http/http2/conn_pool.cc:150] [C3] client disconnected
[2019-05-09 19:20:34.507][19][debug][pool] [source/common/http/http2/conn_pool.cc:175] [C3] destroying primary client
[2019-05-09 19:20:39.091][1][debug][main] [source/server/server.cc:147] flushing stats

I tried a bunch of things on the Envoy config side, but nothing made any difference.
Could anyone point me at what might be causing the issue or what might help me get to the bottom of this?

@mattklein123 mattklein123 added the question Questions that are neither investigations, bugs, nor enhancements label May 10, 2019
@mattklein123
Member

@lizan any thoughts?

@gyang274

@lizan @mattklein123 I am experiencing a similar issue. I can provide a reproducible example if needed.

@dio
Member

dio commented May 18, 2019

@gyang274 a reproducible example will be very helpful!

@gyang274

gyang274 commented May 18, 2019

@dio I actually fixed the problem by using
hosts: [{ socket_address: { address: 0.0.0.0, port_value: 50051 }}]
instead of
hosts: [{ socket_address: { address: host.docker.internal, port_value: 50051 }}]
in the envoy.yaml clusters settings, and starting Docker with --network=host. This leads to successful communication, although I get a warning that this option is deprecated.

Btw, this is on Ubuntu 16.04 with gRPC 1.20.3 and google-protobuf 3.8.0-rc1.
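
For what it's worth, a rough sketch of that workaround as a docker run invocation (the image name and config path are my assumptions, not taken from this thread):

# With host networking (Linux only), the container shares the host's network
# stack, so a gRPC server listening on the host (e.g. port 50051) is reachable
# without host.docker.internal.
# /etc/envoy/envoy.yaml is assumed to be the image's default config path;
# adjust if yours differs.
docker run --rm --network=host \
  -v "$(pwd)/envoy.yaml:/etc/envoy/envoy.yaml" \
  envoyproxy/envoy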

@moderation
Contributor

moderation commented May 18, 2019

@gyang274 The non-deprecated way of specifying hosts is shown in the following example - https://github.com/envoyproxy/envoy/blob/master/examples/grpc-bridge/config/s2s-grpc-envoy.yaml#L35-L43

    load_assignment:
      cluster_name: local_service_grpc
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: host.docker.internal
                port_value: 50051

@TBoshoven
Author

My issue is definitely not the same one as @gyang274 's.
While I run Envoy in Docker, the service I am contacting runs separately, in a K8s cluster.
The host and port are correct (verified by curling the host at the IP in my logs at the configured port) and they are accessible from the container.
I updated my config to be similar to what @moderation described and I am still seeing the same issue.

As stated in my original report, the only change (quite reproducibly; a rollback fixed it, then another update broke things again) is the update of that service to the new gRPC version.

I also verified that this error looks different from what I get when connecting to a host that does not exist or to an HTTP/2 host that does not speak gRPC, in order to rule out basic issues.

We may be able to do some more experimentation soon (I'd personally like to see if this is a TLS issue), but it would be great to get some pointers on where to start.

@moderation
Contributor

@TBoshoven gRPC 1.21.0 has just been released - would be interesting to see if you still have the issue using this - https://github.com/grpc/grpc/releases/tag/v1.21.0

@TBoshoven
Author

I'll try to get it tested within a week.

@TBoshoven
Author

Unfortunately, after updating the service to gRPC 1.21.1, things still fail in the same way.
Tested using the latest Envoy dev Docker container.

@alertedsnake

I can add some more detail to what @TBoshoven said. I've spent much of the day testing combinations here, and with the same server code, if I swap gRPC 1.19.0 for 1.20.0, it breaks in the way he describes.

I did some packet capturing with Wireshark, and in the 1.20 case, the 19th packet is a [FIN, ACK] packet rather than application data, and then everything further is different. I'm not sure the captures are really of much use though.

I should note that this is only a problem with TLS enabled; without TLS, 1.20 works just fine.

@lizan
Member

lizan commented May 31, 2019

@alertedsnake @TBoshoven TLS enabled where? between client and envoy, or between envoy and upstream gRPC server, or both?
Can you post, or email me if you don't want to disclose, a full config with a trace (not debug) log?
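
(For anyone needing to produce one: a rough sketch of raising Envoy's log verbosity, assuming the standard CLI flags; the component names match the loggers visible in the logs above.)

# Full trace logging (very verbose):
envoy -c envoy.yaml --log-level trace

# Or raise only the components relevant here:
envoy -c envoy.yaml --log-level info \
  --component-log-level connection:trace,http2:trace,router:debug,pool:debug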

@alertedsnake

@lizan sorry, I should have mentioned that. TLS is enabled between Envoy and the upstream gRPC service. It uses authentication, so it only listens with TLS. If we have the gRPC service listen without TLS, everything works fine.

I'm using @TBoshoven's Envoy config, and I don't know much about Envoy, so I'm not sure how to provide the logs you're looking for.

@TBoshoven
Author

The logs I provided in my initial report are still valid.

Here's my cleaned-up config, with the things I poked at left commented out:

admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }

static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      filter_chains:
        - filters:
            - name: envoy.http_connection_manager
              config:
                codec_type: auto
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: local_service
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/prefix"
                          route:
                            cluster: the_service
                            max_grpc_timeout: 0s
                      cors:
                        allow_origin:
                          - "http://localhost:8081"
                        allow_methods: GET, PUT, DELETE, POST, OPTIONS
                        allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout,authorization
                        max_age: "1728000"
                        expose_headers: grpc-status,grpc-message
                http_filters:
                  - name: envoy.grpc_web
                  - name: envoy.cors
                  - name: envoy.router
  clusters:
# Stuff I have poked at
#    - name: the_service
#      connect_timeout: 5s
#      type: logical_dns
#      http2_protocol_options:
#        hpack_table_size: 0
#      lb_policy: round_robin
#      load_assignment:
#        cluster_name: local_service_grpc
#        endpoints:
#          - lb_endpoints:
#              - endpoint:
#                  address:
#                    socket_address:
#                      address: hostname
#                      port_value: 443
#      tls_context:
#        allow_renegotiation: true
#        sni: hostname
#        common_tls_context:
#          validation_context:
#            trusted_ca:
#              filename: /etc/ssl/cert.pem
    - name: the_service
      connect_timeout: 5s
      type: logical_dns
      http2_protocol_options: {}
      lb_policy: round_robin
      hosts:
        - socket_address:
            address: hostname
            port_value: 443
      tls_context: {}

The uncommented part of this worked with gRPC 1.19 and not 1.20.

@TBoshoven
Author

TBoshoven commented Jun 17, 2019

@lizan This is still an issue. Is there anything else we can provide to help identify and resolve the problem?

EDIT: Just read that you asked for a trace log. Let me see what I can do.

@TBoshoven
Author

This is very useful. I was not aware of the trace verbosity level.
Here's what I believe is the relevant part of the log. If needed, I can give you the full one as well.

envoy_1    | [2019-06-17 08:55:51.125][12][debug][client] [source/common/http/codec_client.cc:64] [C1] connected
envoy_1    | [2019-06-17 08:55:51.125][12][debug][pool] [source/common/http/http2/conn_pool.cc:96] [C1] creating stream
envoy_1    | [2019-06-17 08:55:51.125][12][debug][router] [source/common/router/router.cc:1400] [C0][S11007887549020777307] pool ready
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:558] [C1] send data: bytes=24
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 24 bytes, end_stream false
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:558] [C1] send data: bytes=21
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 21 bytes, end_stream false
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:511] [C1] sent frame type=4
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:558] [C1] send data: bytes=13
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 13 bytes, end_stream false
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:511] [C1] sent frame type=8
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:558] [C1] send data: bytes=332
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 332 bytes, end_stream false
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:511] [C1] sent frame type=1
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 22 bytes, end_stream false
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:511] [C1] sent frame type=0
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:541] [C1] write ready
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:234] [C1] ssl write returns: 412
envoy_1    | [2019-06-17 08:55:51.126][12][trace][connection] [source/common/network/connection_impl.cc:494] [C1] read ready
envoy_1    | [2019-06-17 08:55:51.126][12][trace][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:73] [C1] ssl read returns: 0
envoy_1    | [2019-06-17 08:55:51.126][12][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:201] [C1] 
envoy_1    | [2019-06-17 08:55:51.126][12][trace][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:148] [C1] ssl read 0 bytes into 0 slices
envoy_1    | [2019-06-17 08:55:51.126][12][debug][connection] [source/common/network/connection_impl.cc:518] [C1] remote close
envoy_1    | [2019-06-17 08:55:51.126][12][debug][connection] [source/common/network/connection_impl.cc:188] [C1] closing socket: 0
envoy_1    | [2019-06-17 08:55:51.126][12][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:270] [C1] SSL shutdown: rc=0
envoy_1    | [2019-06-17 08:55:51.126][12][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:201] [C1] 
envoy_1    | [2019-06-17 08:55:51.126][12][debug][client] [source/common/http/codec_client.cc:82] [C1] disconnect. resetting 1 pending requests
envoy_1    | [2019-06-17 08:55:51.126][12][debug][client] [source/common/http/codec_client.cc:105] [C1] request reset
envoy_1    | [2019-06-17 08:55:51.126][12][trace][main] [source/common/event/dispatcher_impl.cc:144] item added to deferred deletion list (size=1)
envoy_1    | [2019-06-17 08:55:51.126][12][debug][pool] [source/common/http/http2/conn_pool.cc:237] [C1] destroying stream: 0 remaining

The line that struck me as odd is [connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:234] [C1] ssl write returns: 412
According to this, 412 means "SSL protocol or certificate type is not supported." I'm not sure whether the number in the log entry refers to that error or to the number of bytes written, but it seemed like it might. (Actually, on multiple runs I get slight variations in this number, so I suspect it's just the byte count and not an error code.)

About the cert: curl accepts the certificate just fine (TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384 with a P-256 curve which is part of the default accepted values according to the docs).
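
Not checked here, but a quick way to see whether the upstream offers HTTP/2 via ALPN during the TLS handshake (using the same hostname placeholder as the config above):

# Ask for h2 via ALPN and see whether the server agrees.
openssl s_client -connect hostname:443 -alpn h2 </dev/null 2>/dev/null | grep -i alpn
# "ALPN protocol: h2" means the server advertises HTTP/2 over TLS;
# "No ALPN negotiated" would point at an ALPN mismatch.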

@TBoshoven
Author

I ended up switching to grpcwebproxy. It successfully connects to the backend services.

@stale

stale bot commented Jul 24, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 24, 2019
@alertedsnake

I get the point of the stale bot, but this is a real issue that has not been solved, and so it shouldn't be closed.

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jul 25, 2019
@dio dio added the help wanted Needs help! label Jul 25, 2019
@hollemawoolpert

I had a similar connect issue and alpn_protocols: h2 was the fix for me. See #7925.
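
For reference, a minimal sketch of where that setting would go in the v2-style cluster config used earlier in this thread; hostname and the CA path are placeholders carried over from the config above, not values confirmed for this issue:

  clusters:
    - name: the_service
      connect_timeout: 5s
      type: logical_dns
      http2_protocol_options: {}
      lb_policy: round_robin
      load_assignment:
        cluster_name: the_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: hostname
                      port_value: 443
      tls_context:
        sni: hostname
        common_tls_context:
          # Advertise h2 via ALPN so a TLS-enabled gRPC server that expects
          # HTTP/2 negotiation accepts the connection.
          alpn_protocols: ["h2"]
          validation_context:
            trusted_ca:
              filename: /etc/ssl/cert.pem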

davidmontoyago added a commit to davidmontoyago/go-grpc-api-lab that referenced this issue Dec 13, 2019
This fails due to a combination of different unrelated issues with
docker and envoy.

See:
docker/for-mac#2965
envoyproxy/envoy#6897
moby/moby#1143
@mattklein123 mattklein123 removed the help wanted Needs help! label Dec 20, 2019
@stale

stale bot commented Jan 20, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jan 20, 2020
@stale

stale bot commented Jan 27, 2020

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@stale stale bot closed this as completed Jan 27, 2020
@ramsrib

ramsrib commented May 1, 2020

I have a similar issue; alpn_protocols: h2 didn't help either.
