
socket_options on listeners are not working on 1.17+ #18107

Closed
NersesAM opened this issue Sep 13, 2021 · 9 comments
Labels
area/listener · investigate (Potential bug that needs verification) · stale (stalebot believes this issue/PR has not been touched recently)

Comments

@NersesAM
Contributor

NersesAM commented Sep 13, 2021

Title: socket_options on listeners are not working on 1.17+

Description:

We have the following setup: gRPC client -> AWS NLB -> Envoy -> gRPC server. We also have listener socket options set to send TCP keep-alive packets every minute so that, for services with very infrequent request rates, connections are not closed by the NLB's hard 350 s idle timeout.

socket_options:
      - level: 1              # 1 = SOL_SOCKET on Linux
        name: 9               # 9 = SO_KEEPALIVE
        int_value: 1
        state: STATE_LISTENING
      - level: 6              # 6 = IPPROTO_TCP
        name: 4               # 4 = TCP_KEEPIDLE (idle seconds before the first probe)
        int_value: 60
        state: STATE_LISTENING
      - level: 6              # 6 = IPPROTO_TCP
        name: 5               # 5 = TCP_KEEPINTVL (seconds between probes)
        int_value: 60
        state: STATE_LISTENING
      - level: 6              # 6 = IPPROTO_TCP
        name: 6               # 6 = TCP_KEEPCNT (unanswered probes before the connection is dropped)
        int_value: 5
        state: STATE_LISTENING
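
For context, here is a minimal sketch of where the block above sits on an Envoy listener; the listener name, address, and port are placeholders, not taken from our real config:

static_resources:
  listeners:
  - name: grpc_listener                # placeholder name
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 8811               # placeholder port
    socket_options:                    # listener-level socket options
    - level: 1                         # SOL_SOCKET
      name: 9                          # SO_KEEPALIVE
      int_value: 1
      state: STATE_LISTENING
    - level: 6                         # IPPROTO_TCP
      name: 4                          # TCP_KEEPIDLE
      int_value: 60
      state: STATE_LISTENING
    # ...remaining TCP_KEEPINTVL / TCP_KEEPCNT entries as above
    # filter_chains (HTTP connection manager, routes, etc.) omitted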

Everything was working as expected up until v1.16.x. Once we upgraded to 1.17+, clients started receiving "Connection reset by peer" errors. Digging into it, I found that the socket_options above are no longer being respected: once a connection is idle past the NLB's 350 s limit, the NLB closes it, but the client does not find out until it sends another request later, which blows up with "Connection reset by peer".

Repro steps:

I've modified one of the grpc examples to show the problem. Check out https://github.com/NersesAM/envoy/tree/tcp-keep-alive-bug-demo/examples/grpc-bridge

  1. docker-compose -f docker-compose-protos.yaml up --remove-orphans
  2. docker-compose up --build
  3. Once the containers are started, grpc-client-proxy acts as the gRPC client for grpc-server-proxy, the Envoy instance that has the socket_options defined on its listener. With version 1.16.5 defined for grpc-server-proxy, the attached Wireshark (netshoot) container logs a keep-alive exchange every 20 seconds:
netshoot_1           |    15 20.038207300   172.20.0.5 → 172.20.0.3   TCP 66 [TCP Keep-Alive] 8811 → 58036 [ACK] Seq=182 Ack=396 Win=64768 Len=0 TSval=1155384558 TSecr=3361626233
netshoot_1           |    16 20.038250400   172.20.0.3 → 172.20.0.5   TCP 66 [TCP Keep-Alive ACK] 58036 → 8811 [ACK] Seq=396 Ack=183 Win=64128 Len=0 TSval=3361646300 TSecr=1155364491
  4. But after changing the Envoy version of grpc-server-proxy to anything 1.17.0 or above and rerunning step 2, the TCP keep-alive packets are no longer sent.
@NersesAM added the bug and triage (Issue requires triage) labels on Sep 13, 2021
@lambdai
Contributor

lambdai commented Sep 13, 2021

My dumb question: shouldn't the keep-alive socket options be applied to the accepted socket instead of the listen socket?

I don't know through what mechanism the keepalive worked before 1.17, but a keep-alive setting on the downstream connection is indeed something that was added as a listener option.

I also found this issue: #3634

@NersesAM
Contributor Author

NersesAM commented Sep 14, 2021

@lambdai Sorry, not sure what you mean. Are you saying this is not a bug and you don't know how it was working before 1.17?

The issue you linked was the original feature request for implementing this exact feature, which worked fine until it broke in 1.17.

@alyssawilk
Contributor

huh. @mattklein123 @snowp any ideas what might have caused a difference in TCP keep-alive or socket option processing between 1.16 and 1.17? Nothing comes to mind for me :-/

@alyssawilk removed the triage (Issue requires triage) label on Sep 20, 2021
@mattklein123
Member

Not sure. Can you potentially come up with a self-contained repro on top of the current main branch? I think debug/trace output should make it clear which socket options are being applied, but I'm not sure. We can debug further from there?

@ramaraochavali
Contributor

Not sure what changed, but STATE_LISTENING is not working while STATE_PREBIND is. Here is what we tried in Istio: istio/istio#28879 (comment)
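
(A minimal sketch of the workaround, not taken from the Istio thread: it keeps the option values from the original report and only switches the state field.)

socket_options:
  - level: 1                 # SOL_SOCKET
    name: 9                  # SO_KEEPALIVE
    int_value: 1
    state: STATE_PREBIND
  - level: 6                 # IPPROTO_TCP
    name: 4                  # TCP_KEEPIDLE
    int_value: 60
    state: STATE_PREBIND
  - level: 6                 # IPPROTO_TCP
    name: 5                  # TCP_KEEPINTVL
    int_value: 60
    state: STATE_PREBIND
  - level: 6                 # IPPROTO_TCP
    name: 6                  # TCP_KEEPCNT
    int_value: 5
    state: STATE_PREBIND

STATE_PREBIND applies the options before the listen socket is bound, rather than after listen() as with STATE_LISTENING.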

@mattklein123
Member

Sorry, I have no idea why it no longer works after calling listen() when it used to before. PREBIND or BOUND seems more correct to me. If someone wants to debug why/how this changed, please do.

@mattklein123 added the area/listener and investigate (Potential bug that needs verification) labels and removed the bug label on Sep 20, 2021
@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions bot added the stale (stalebot believes this issue/PR has not been touched recently) label on Oct 20, 2021
@NersesAM
Contributor Author

Thanks @mattklein123 and @ramaraochavali, changing to STATE_PREBIND helped.
Do you still want to keep this bug open to investigate further why STATE_LISTENING stopped working?

@mattklein123
Member

I don't think it's worth investigating this, so I'm just going to close. Thanks for the follow-up.
