
gRPC 1.20.1 upgrade causes gRPC connections to fail #6897

Closed
TBoshoven opened this issue May 10, 2019 · 22 comments
Labels
question Questions that are neither investigations, bugs, nor enhancements
stale stalebot believes this issue/PR has not been touched recently

Comments

@TBoshoven

TBoshoven commented May 10, 2019

Title: gRPC 1.20.1 upgrade causes gRPC connections to fail

Description:
I use Envoy's gRPC-Web filter to integrate against a few gRPC applications built in Python. These applications use a secured connection.
Since they upgraded to gRPC v1.20.1 (from v1.19.x), Envoy has been unable to get proper responses. The version upgrade is the only change, and other applications have no problem communicating with the updated services.
The error I am getting is gRPC status 14 with this message:

upstream connect error or disconnect/reset before headers. reset reason: connection termination

I tried running this against the current stable version as well as yesterday's dev build and got the same results.

I have also been unable to reproduce this issue in isolation: I created a small gRPC application using Python and gRPC 1.20.1, and I could communicate with it over gRPC just fine. My little test service was unsecured but otherwise pretty similar, so I suspect a TLS issue.

The only other relevant thing I can think of is that all of these services are behind (TCP) load balancers.

Config:
My config is more or less the same as in the example. I fixed the indentation of the bottom part and (of course) updated addresses to match my services.

Logs:
These are the logs of a failing request (minus the request itself):

[2019-05-09 19:20:34.438][19][debug][client] [source/common/http/codec_client.cc:26] [C3] connecting
[2019-05-09 19:20:34.438][19][debug][connection] [source/common/network/connection_impl.cc:695] [C3] connecting to <THE IP ADDRESS>:443
[2019-05-09 19:20:34.438][19][debug][connection] [source/common/network/connection_impl.cc:704] [C3] connection in progress
[2019-05-09 19:20:34.438][19][debug][http2] [source/common/http/http2/codec_impl.cc:725] [C3] setting stream-level initial window size to 268435456
[2019-05-09 19:20:34.438][19][debug][http2] [source/common/http/http2/codec_impl.cc:747] [C3] updating connection-level initial window size to 268435456
[2019-05-09 19:20:34.438][19][debug][pool] [source/common/http/conn_pool_base.cc:20] queueing request due to no available connections
[2019-05-09 19:20:34.438][19][debug][http] [source/common/http/conn_manager_impl.cc:1062] [C0][S2159400524055061755] request end stream
[2019-05-09 19:20:34.454][19][debug][connection] [source/common/network/connection_impl.cc:543] [C3] connected
[2019-05-09 19:20:34.454][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.470][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.470][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.470][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.471][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.471][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.471][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.472][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.472][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.506][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:131] [C3] handshake complete
[2019-05-09 19:20:34.506][19][debug][client] [source/common/http/codec_client.cc:64] [C3] connected
[2019-05-09 19:20:34.506][19][debug][pool] [source/common/http/http2/conn_pool.cc:96] [C3] creating stream
[2019-05-09 19:20:34.506][19][debug][router] [source/common/router/router.cc:1211] [C0][S2159400524055061755] pool ready
[2019-05-09 19:20:34.507][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C3] 
[2019-05-09 19:20:34.507][19][debug][connection] [source/common/network/connection_impl.cc:511] [C3] remote close
[2019-05-09 19:20:34.507][19][debug][connection] [source/common/network/connection_impl.cc:192] [C3] closing socket: 0
[2019-05-09 19:20:34.507][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:244] [C3] SSL shutdown: rc=0
[2019-05-09 19:20:34.507][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C3] 
[2019-05-09 19:20:34.507][19][debug][client] [source/common/http/codec_client.cc:82] [C3] disconnect. resetting 1 pending requests
[2019-05-09 19:20:34.507][19][debug][client] [source/common/http/codec_client.cc:105] [C3] request reset
[2019-05-09 19:20:34.507][19][debug][pool] [source/common/http/http2/conn_pool.cc:237] [C3] destroying stream: 0 remaining
[2019-05-09 19:20:34.507][19][debug][router] [source/common/router/router.cc:672] [C0][S2159400524055061755] upstream reset: reset reason connection termination
[2019-05-09 19:20:34.507][19][debug][http] [source/common/http/conn_manager_impl.cc:1137] [C0][S2159400524055061755] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2019-05-09 19:20:34.507][19][debug][http] [source/common/http/conn_manager_impl.cc:1329] [C0][S2159400524055061755] encoding headers via codec (end_stream=true):
':status', '200'
'content-type', 'application/grpc-web-text+proto'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection termination'
'access-control-allow-origin', 'http://localhost:8081'
'access-control-expose-headers', 'grpc-status,grpc-message'
'date', 'Thu, 09 May 2019 19:20:34 GMT'
'server', 'envoy'

[2019-05-09 19:20:34.507][19][debug][pool] [source/common/http/http2/conn_pool.cc:150] [C3] client disconnected
[2019-05-09 19:20:34.507][19][debug][pool] [source/common/http/http2/conn_pool.cc:175] [C3] destroying primary client
[2019-05-09 19:20:39.091][1][debug][main] [source/server/server.cc:147] flushing stats

I tried a bunch of things on the Envoy config side, but nothing made any difference.
Could anyone point me at what might be causing the issue or what might help me get to the bottom of this?

@mattklein123 mattklein123 added the question Questions that are neither investigations, bugs, nor enhancements label May 10, 2019
@mattklein123
Member

@lizan any thoughts?

@gyang274

@lizan @mattklein123 I am experiencing a similar issue. I can provide a reproducible example if needed.

@dio
Member

dio commented May 18, 2019

@gyang274 a reproducible example will be very helpful!

@gyang274

gyang274 commented May 18, 2019

@dio I actually fixed the problem by using
hosts: [{ socket_address: { address: 0.0.0.0, port_value: 50051 }}]
instead of
hosts: [{ socket_address: { address: host.docker.internal, port_value: 50051 }}]
in the envoy.yaml clusters settings, and starting Docker with --network=host. This leads to successful communication, although I get a warning that this option is deprecated.

Btw, this is on Ubuntu 16.04 with gRPC 1.20.3 and google-protobuf 3.8.0-rc1.
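
For what it's worth, a rough sketch of that workaround as a docker run invocation (the image name and config path are my assumptions, not taken from this thread):

# With host networking (Linux only), the container shares the host's network
# stack, so a gRPC server listening on the host (e.g. port 50051) is reachable
# without host.docker.internal.
# /etc/envoy/envoy.yaml is assumed to be the image's default config path;
# adjust if yours differs.
docker run --rm --network=host \
  -v "$(pwd)/envoy.yaml:/etc/envoy/envoy.yaml" \
  envoyproxy/envoy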

@moderation
Contributor

moderation commented May 18, 2019

@gyang274 The non-deprecated way of specifying hosts is shown in the following example - https://github.com/envoyproxy/envoy/blob/master/examples/grpc-bridge/config/s2s-grpc-envoy.yaml#L35-L43

    load_assignment:
      cluster_name: local_service_grpc
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: host.docker.internal
                port_value: 50051

@TBoshoven
Author

My issue is definitely not the same one as @gyang274 's.
While I run Envoy in Docker, the service I am contacting runs separately, in a K8s cluster.
The host and port are correct (verified by curling the host at the IP in my logs at the configured port) and they are accessible from the container.
I updated my config to be similar to what @moderation described and I am still seeing the same issue.

As stated in my original report, the only change (quite reproducibly; a rollback fixed it, then another update broke things again) is the update of that service to the new gRPC version.

I also verified that this error looks different from what I get when connecting to a host that does not exist or to an HTTP/2 host that does not speak gRPC, in order to rule out basic issues.

We may be able to do some more experimentation soon (I'd personally like to see if this is a TLS issue), but it would be great to get some pointers on where to start.

@moderation
Contributor

@TBoshoven gRPC 1.21.0 has just been released - would be interesting to see if you still have the issue using this - https://github.com/grpc/grpc/releases/tag/v1.21.0

@TBoshoven
Author

I'll try to get it tested within a week.

@TBoshoven
Author

Unfortunately, after updating the service to gRPC 1.21.1, things still fail in the same way.
Tested using the latest Envoy dev Docker container.

@alertedsnake

I can add some more detail to what @TBoshoven said. I've spent much of the day testing combinations here, and with the same server code, if I swap gRPC 1.19.0 for 1.20.0, it breaks in the way he describes.

I did some packet capturing with Wireshark, and in the 1.20 case, the 19th packet is a [FIN, ACK] packet rather than application data, and then everything further is different. I'm not sure the captures are really of much use though.

I should note that this is only a problem with TLS enabled; without TLS, 1.20 works just fine.

@lizan
Member

lizan commented May 31, 2019

@alertedsnake @TBoshoven TLS enabled where? between client and envoy, or between envoy and upstream gRPC server, or both?
Can you post, or email me if you don't want to disclose, a full config with a trace (not debug) log?
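
(For anyone needing to produce one: a rough sketch of raising Envoy's log verbosity, assuming the standard CLI flags; the component names match the loggers visible in the logs above.)

# Full trace logging (very verbose):
envoy -c envoy.yaml --log-level trace

# Or raise only the components relevant here:
envoy -c envoy.yaml --log-level info \
  --component-log-level connection:trace,http2:trace,router:debug,pool:debug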

@alertedsnake

@lizan sorry, I should have mentioned that. TLS is enabled between Envoy and the upstream gRPC service. It uses authentication, so it only listens with TLS. If we have the gRPC service listen without TLS, everything works fine.

I'm using @TBoshoven's Envoy config, and I don't know much about Envoy, so I'm not sure how to provide the logs you're looking for.

@TBoshoven
Author

The logs I provided in my initial report are still valid.

Here's my cleaned-up config, with the things I poked at left commented out:

admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }

static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      filter_chains:
        - filters:
            - name: envoy.http_connection_manager
              config:
                codec_type: auto
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: local_service
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/prefix"
                          route:
                            cluster: the_service
                            max_grpc_timeout: 0s
                      cors:
                        allow_origin:
                          - "http://localhost:8081"
                        allow_methods: GET, PUT, DELETE, POST, OPTIONS
                        allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout,authorization
                        max_age: "1728000"
                        expose_headers: grpc-status,grpc-message
                http_filters:
                  - name: envoy.grpc_web
                  - name: envoy.cors
                  - name: envoy.router
  clusters:
# Stuff I have poked at
#    - name: the_service
#      connect_timeout: 5s
#      type: logical_dns
#      http2_protocol_options:
#        hpack_table_size: 0
#      lb_policy: round_robin
#      load_assignment:
#        cluster_name: local_service_grpc
#        endpoints:
#          - lb_endpoints:
#              - endpoint:
#                  address:
#                    socket_address:
#                      address: hostname
#                      port_value: 443
#      tls_context:
#        allow_renegotiation: true
#        sni: hostname
#        common_tls_context:
#          validation_context:
#            trusted_ca:
#              filename: /etc/ssl/cert.pem
    - name: the_service
      connect_timeout: 5s
      type: logical_dns
      http2_protocol_options: {}
      lb_policy: round_robin
      hosts:
        - socket_address:
            address: hostname
            port_value: 443
      tls_context: {}

The uncommented part of this worked with gRPC 1.19 and not 1.20.

@TBoshoven
Author

TBoshoven commented Jun 17, 2019

@lizan This is still an issue. Is there anything else we can provide to help identify and resolve the problem?

EDIT: Just read that you asked for a trace log. Let me see what I can do.

@TBoshoven
Author

This is very useful. I was not aware of the trace verbosity level.
Here's what I believe is the relevant part of the log. If needed, I can give you the full one as well.

envoy_1    | [2019-06-17 08:55:51.125][12][debug][client] [source/common/http/codec_client.cc:64] [C1] connected
envoy_1    | [2019-06-17 08:55:51.125][12][debug][pool] [source/common/http/http2/conn_pool.cc:96] [C1] creating stream
envoy_1    | [2019-06-17 08:55:51.125][12][debug][router] [source/common/router/router.cc:1400] [C0][S11007887549020777307] pool ready
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:558] [C1] send data: bytes=24
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 24 bytes, end_stream false
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:558] [C1] send data: bytes=21
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 21 bytes, end_stream false
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:511] [C1] sent frame type=4
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:558] [C1] send data: bytes=13
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 13 bytes, end_stream false
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:511] [C1] sent frame type=8
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:558] [C1] send data: bytes=332
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 332 bytes, end_stream false
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:511] [C1] sent frame type=1
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 22 bytes, end_stream false
envoy_1    | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:511] [C1] sent frame type=0
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:541] [C1] write ready
envoy_1    | [2019-06-17 08:55:51.125][12][trace][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:234] [C1] ssl write returns: 412
envoy_1    | [2019-06-17 08:55:51.126][12][trace][connection] [source/common/network/connection_impl.cc:494] [C1] read ready
envoy_1    | [2019-06-17 08:55:51.126][12][trace][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:73] [C1] ssl read returns: 0
envoy_1    | [2019-06-17 08:55:51.126][12][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:201] [C1] 
envoy_1    | [2019-06-17 08:55:51.126][12][trace][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:148] [C1] ssl read 0 bytes into 0 slices
envoy_1    | [2019-06-17 08:55:51.126][12][debug][connection] [source/common/network/connection_impl.cc:518] [C1] remote close
envoy_1    | [2019-06-17 08:55:51.126][12][debug][connection] [source/common/network/connection_impl.cc:188] [C1] closing socket: 0
envoy_1    | [2019-06-17 08:55:51.126][12][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:270] [C1] SSL shutdown: rc=0
envoy_1    | [2019-06-17 08:55:51.126][12][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:201] [C1] 
envoy_1    | [2019-06-17 08:55:51.126][12][debug][client] [source/common/http/codec_client.cc:82] [C1] disconnect. resetting 1 pending requests
envoy_1    | [2019-06-17 08:55:51.126][12][debug][client] [source/common/http/codec_client.cc:105] [C1] request reset
envoy_1    | [2019-06-17 08:55:51.126][12][trace][main] [source/common/event/dispatcher_impl.cc:144] item added to deferred deletion list (size=1)
envoy_1    | [2019-06-17 08:55:51.126][12][debug][pool] [source/common/http/http2/conn_pool.cc:237] [C1] destroying stream: 0 remaining

The line that struck me as odd is [connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:234] [C1] ssl write returns: 412
According to this, 412 means "SSL protocol or certificate type is not supported." I'm not sure whether the number in the log entry refers to that error or to the number of bytes written, but it seemed like it might. (Actually, on multiple runs I get slight variations in this number, so I suspect it's just the byte count and not an error code.)

About the cert: curl accepts the certificate just fine (TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384 with a P-256 curve which is part of the default accepted values according to the docs).
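
Not checked here, but a quick way to see whether the upstream offers HTTP/2 via ALPN during the TLS handshake (using the same hostname placeholder as the config above):

# Ask for h2 via ALPN and see whether the server agrees.
openssl s_client -connect hostname:443 -alpn h2 </dev/null 2>/dev/null | grep -i alpn
# "ALPN protocol: h2" means the server advertises HTTP/2 over TLS;
# "No ALPN negotiated" would point at an ALPN mismatch.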

@TBoshoven
Author

I ended up switching to grpcwebproxy. It successfully connects to the backend services.

@stale

stale bot commented Jul 24, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 24, 2019
@alertedsnake

I get the point of the stale bot, but this is a real issue that has not been solved, and so it shouldn't be closed.

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jul 25, 2019
@dio dio added the help wanted Needs help! label Jul 25, 2019
@hollemawoolpert

I had a similar connect issue and alpn_protocols: h2 was the fix for me. See #7925.
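
For reference, a minimal sketch of where that setting would go in the v2-style cluster config used earlier in this thread; hostname and the CA path are placeholders carried over from the config above, not values confirmed for this issue:

  clusters:
    - name: the_service
      connect_timeout: 5s
      type: logical_dns
      http2_protocol_options: {}
      lb_policy: round_robin
      load_assignment:
        cluster_name: the_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: hostname
                      port_value: 443
      tls_context:
        sni: hostname
        common_tls_context:
          # Advertise h2 via ALPN so a TLS-enabled gRPC server that expects
          # HTTP/2 negotiation accepts the connection.
          alpn_protocols: ["h2"]
          validation_context:
            trusted_ca:
              filename: /etc/ssl/cert.pem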

davidmontoyago added a commit to davidmontoyago/go-grpc-api-lab that referenced this issue Dec 13, 2019
This fails due to a combination of different unrelated issues with
docker and envoy.

See:
docker/for-mac#2965
envoyproxy/envoy#6897
moby/moby#1143
@mattklein123 mattklein123 removed the help wanted Needs help! label Dec 20, 2019
@stale

stale bot commented Jan 20, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jan 20, 2020
@stale

stale bot commented Jan 27, 2020

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@stale stale bot closed this as completed Jan 27, 2020
@ramsrib

ramsrib commented May 1, 2020

I have a similar issue; alpn_protocols: h2 didn't help either.
