
socket_options on listeners are not working on 1.17+ #18107

Closed
NersesAM opened this issue Sep 13, 2021 · 9 comments
Labels
area/listener · investigate (Potential bug that needs verification) · stale (stalebot believes this issue/PR has not been touched recently)

Comments

@NersesAM
Contributor

NersesAM commented Sep 13, 2021

Title: socket_options on listeners are not working on 1.17+

Description:

We have the following setup: gRPC client -> AWS NLB -> Envoy -> gRPC server. We also have listener socket options set to send TCP keep-alive packets every minute so that, for services with very infrequent request rates, connections are not closed by the NLB's hard 350 s idle timeout.

socket_options:
      - level: 1              # 1 = SOL_SOCKET on Linux
        name: 9               # 9 = SO_KEEPALIVE
        int_value: 1
        state: STATE_LISTENING
      - level: 6              # 6 = IPPROTO_TCP
        name: 4               # 4 = TCP_KEEPIDLE (idle seconds before the first probe)
        int_value: 60
        state: STATE_LISTENING
      - level: 6              # 6 = IPPROTO_TCP
        name: 5               # 5 = TCP_KEEPINTVL (seconds between probes)
        int_value: 60
        state: STATE_LISTENING
      - level: 6              # 6 = IPPROTO_TCP
        name: 6               # 6 = TCP_KEEPCNT (unanswered probes before the connection is dropped)
        int_value: 5
        state: STATE_LISTENING
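
For context, here is a minimal sketch of where the block above sits on an Envoy listener; the listener name, address, and port are placeholders, not taken from our real config:

static_resources:
  listeners:
  - name: grpc_listener                # placeholder name
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 8811               # placeholder port
    socket_options:                    # listener-level socket options
    - level: 1                         # SOL_SOCKET
      name: 9                          # SO_KEEPALIVE
      int_value: 1
      state: STATE_LISTENING
    - level: 6                         # IPPROTO_TCP
      name: 4                          # TCP_KEEPIDLE
      int_value: 60
      state: STATE_LISTENING
    # ...remaining TCP_KEEPINTVL / TCP_KEEPCNT entries as above
    # filter_chains (HTTP connection manager, routes, etc.) omitted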

Everything was working as expected up until v1.16.x. Once we upgraded to 1.17+, clients started receiving "Connection reset by peer" errors. Digging into it, I found that the socket_options above are no longer being respected: once a connection is idle past the NLB's 350 s limit, the NLB closes it, but the client does not find out until it sends another request later, which blows up with "Connection reset by peer".

Repro steps:

I've modified one of the grpc examples to show the problem. Check out https://github.com/NersesAM/envoy/tree/tcp-keep-alive-bug-demo/examples/grpc-bridge

  1. docker-compose -f docker-compose-protos.yaml up --remove-orphans
  2. docker-compose up --build
  3. Once the containers are started, grpc-client-proxy acts as the gRPC client for grpc-server-proxy, the Envoy instance that has the socket_options defined on its listener. With version 1.16.5 defined for grpc-server-proxy, the attached Wireshark (netshoot) container logs a keep-alive exchange every 20 seconds:
netshoot_1           |    15 20.038207300   172.20.0.5 → 172.20.0.3   TCP 66 [TCP Keep-Alive] 8811 → 58036 [ACK] Seq=182 Ack=396 Win=64768 Len=0 TSval=1155384558 TSecr=3361626233
netshoot_1           |    16 20.038250400   172.20.0.3 → 172.20.0.5   TCP 66 [TCP Keep-Alive ACK] 58036 → 8811 [ACK] Seq=396 Ack=183 Win=64128 Len=0 TSval=3361646300 TSecr=1155364491
  4. But after changing the Envoy version of grpc-server-proxy to anything 1.17.0 or above and rerunning step 2, the TCP keep-alive packets are no longer sent.
@NersesAM added the bug and triage (Issue requires triage) labels on Sep 13, 2021
@lambdai
Contributor

lambdai commented Sep 13, 2021

My dumb question: shouldn't the keep-alive socket options be applied to the accepted socket instead of the listen socket?

I don't know through what mechanism the keepalive worked before 1.17, but a keep-alive setting on the downstream connection is indeed something that was added as a listener option.

I also found this issue: #3634

@NersesAM
Contributor Author

NersesAM commented Sep 14, 2021

@lambdai Sorry, not sure what you mean. Are you saying this is not a bug and you don't know how it was working before 1.17?

The issue you linked was the original feature request for implementing this exact feature, which worked fine until it broke in 1.17.

@alyssawilk
Contributor

huh. @mattklein123 @snowp any ideas what might have caused a difference in TCP keep-alive or socket option processing between 1.16 and 1.17? Nothing comes to mind for me :-/

@alyssawilk removed the triage (Issue requires triage) label on Sep 20, 2021
@mattklein123
Member

Not sure. Can you potentially come up with a self-contained repro on top of the current main branch? I think debug/trace output should make it clear which socket options are being applied, but I'm not sure. We can debug further from there?

@ramaraochavali
Contributor

Not sure what changed, but STATE_LISTENING is not working while STATE_PREBIND is. Here is what we tried in Istio: istio/istio#28879 (comment)
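
(A minimal sketch of the workaround, not taken from the Istio thread: it keeps the option values from the original report and only switches the state field.)

socket_options:
  - level: 1                 # SOL_SOCKET
    name: 9                  # SO_KEEPALIVE
    int_value: 1
    state: STATE_PREBIND
  - level: 6                 # IPPROTO_TCP
    name: 4                  # TCP_KEEPIDLE
    int_value: 60
    state: STATE_PREBIND
  - level: 6                 # IPPROTO_TCP
    name: 5                  # TCP_KEEPINTVL
    int_value: 60
    state: STATE_PREBIND
  - level: 6                 # IPPROTO_TCP
    name: 6                  # TCP_KEEPCNT
    int_value: 5
    state: STATE_PREBIND

STATE_PREBIND applies the options before the listen socket is bound, rather than after listen() as with STATE_LISTENING.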

@mattklein123
Member

Sorry, I have no idea why it no longer works after calling listen() when it used to before. PREBIND or BOUND seems more correct to me. If someone wants to debug why/how this changed, please do.

@mattklein123 added the area/listener and investigate (Potential bug that needs verification) labels and removed the bug label on Sep 20, 2021
@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions bot added the stale (stalebot believes this issue/PR has not been touched recently) label on Oct 20, 2021
@NersesAM
Contributor Author

Thanks @mattklein123 and @ramaraochavali, changing to STATE_PREBIND helped.
Do you still want to keep this bug open to investigate further why STATE_LISTENING stopped working?

@mattklein123
Member

I don't think it's worth investigating this, so I'm just going to close. Thanks for the follow-up.
