
Idle connections over overlay network end up in a broken state after 15 minutes #31208

Closed
christopherobin opened this issue Feb 21, 2017 · 18 comments

Comments

@christopherobin

Description

In a swarm setup using overlay networks, idle connections between 2 services will end up in a broken state after 15 minutes.

The issue is related to the way the Docker overlay network routes packets: iptables first marks them, and IPVS then forwards them to the right host. However, the default expiration for established connections in IPVS is 900 seconds (ipvsadm -l --timeout), after which IPVS stops forwarding packets even though the TCP connection still exists. If this happens, any new packet on the connection is directed at the virtual IP for that service, which has no valid resolution, leaving the connection stuck in limbo while the kernel tries forever to resolve that virtual IP.
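
A minimal way to see this state on a node (a sketch only; the namespace file name under /var/run/docker/netns varies per host, the id below is just the one from the reproduction steps):

# default IPVS timeouts; the first value (900 s) is the established-connection timeout
nsenter --net=/var/run/docker/netns/2cc18e502f81 ipvsadm -l --timeout
# Timeout (tcp tcpfin udp): 900 120 300

# per-connection entries with their remaining expire time
nsenter --net=/var/run/docker/netns/2cc18e502f81 ipvsadm -lnc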

Steps to reproduce the issue:

  1. Start 2 services on the same network (on different hosts, though it should be reproducible even on a single host?)
  2. docker exec into both of them; in one, start nc in listen mode, and in the other, connect to that nc server using the service name DNS.
  3. Send a packet from the client to the server; everything is fine.
  4. Find your netns and locate your connection with nsenter --net=2cc18e502f81 ipvsadm -lnc
  5. Wait for the connection to expire and be removed from the list.
  6. Send another packet: nothing ever gets there, the connection doesn't time out, and tcpdump shows lots of ARP packets going out. (A consolidated command sketch follows this list.)
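
A consolidated command sketch of the steps above (a sketch only; service, network, and image names are illustrative, and the netns id differs per host):

# two dummy services on the same overlay network
docker network create -d overlay test-net
docker service create --name server --network test-net alpine sleep 86400
docker service create --name client --network test-net alpine sleep 86400

# on the node running the server task: listen with netcat
docker exec -it <server-container> nc -l -p 9898

# on the node running the client task: connect through the service VIP via DNS
docker exec -it <client-container> nc server 9898

# watch the IPVS connection table in the namespace that holds the rules
nsenter --net=/var/run/docker/netns/<netns-id> ipvsadm -lnc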

Describe the results you received:

The packet never reaches the target; the kernel is stuck doing ARP requests over and over.

Describe the results you expected:

Either have the connection time out properly, or find a way to restore the routing in IPVS.

Additional information you deem important (e.g. issue happens only occasionally):

This can currently be worked around by setting net.ipv4.tcp_keepalive_time to less than 900 seconds, so the TCP connection never sits idle long enough for the IPVS entry to expire, but I'm not sure that is a valid way to deal with this. At the very least this behavior should be documented.
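
For a quick, non-persistent test of the workaround on a node (a sketch; 600 is just a value comfortably below the 900 second IPVS timeout):

sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl net.ipv4.tcp_keepalive_time   # verify the new value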

Output of docker version:

Client:
 Version:      1.13.1
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:38:28 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.1
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:38:28 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 2
 Running: 2
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 1.13.1
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: true
Logging Driver: fluentd
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: l3e2evjei4cvcdgjqavtrztgo
 Is Manager: false
 Node Address: 172.24.0.100
 Manager Addresses:
  172.24.0.200:2377
  172.24.0.50:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.2.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.796 GiB
Name: worker-1
ID: DR4G:LZEQ:YSQ7:CYTR:FAXW:ZNVJ:E4AZ:BX5L:QYYG:ZDY5:SO7U:TFZW
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Labels:
 dawn.node.type=worker
 dawn.node.subtype=app
Experimental: false
Insecure Registries:
 172.24.0.50:5000
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

My current test setup is 5 vagrant boxes (2 managers + 3 workers), but it should happen in any environment.

@sanimej

sanimej commented Feb 21, 2017

@christopherobin With IPVS, or any other man-in-the-middle NAT/firewall, the TCP keep-alive timer has to be tuned when you have "silent" long-lived sessions. I will add a note about this in the documentation.

The connection would have been terminated if the TCP packet had been delivered to a different backend and resulted in a RST from that backend. But I guess what's happening here is that after the initial session expires, when IPVS gets a TCP packet that is not a SYN, it drops it instead of sending it to the backend. This makes sense because, for IPVS, it is a new TCP session and the SYN bit isn't set.

@GabKlein

@christopherobin How did you manage to work around this issue? I'm having the problem between my app, which creates a connection pool, and my db. After being idle for 15 minutes the app is not able to reconnect.
I tried adding a sysctl file (echo "net.ipv4.tcp_keepalive_time = 60" > /etc/sysctl.d/60-keepalive.conf) without success. My app still hangs after being idle for 15 minutes :/

@christopherobin
Author

@GabKlein My current setup uses the following:

net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 10

I took the values from https://access.redhat.com/solutions/23874 and tweaked them slightly for our setup. Haven't run into the issue since.

To check that it's working, you can use nsenter and ipvsadm to look at your connections and verify they are being kept alive properly (see this article for details on how to do that; a rough sketch follows).
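
Roughly, that check looks like this (a sketch; the namespace that actually holds the IPVS rules depends on your setup, so pick the one where ipvsadm -lnc shows your connection):

# list the network namespaces Docker created on this node
ls /var/run/docker/netns

# the expire column should keep resetting while keepalive probes flow
nsenter --net=/var/run/docker/netns/<netns-id> ipvsadm -lnc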

@GabKlein

Thank you @christopherobin, I'm going to give it a shot. Is adding these settings as a sysctl file the best way? Do you have to reboot nodes or restart services to apply them?

@christopherobin
Author

I'm using Ansible to provision my servers and it stores the variables in a file in /etc/sysctl.d.

If you are not rebooting, you can create the files and run sysctl --system to reload all configuration files; it will also tell you what was loaded in which order, so you can see if anything else might be overriding your config.
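
A minimal sketch of that, reusing the values above (the file name is just an example):

cat <<'EOF' > /etc/sysctl.d/60-keepalive.conf
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 10
EOF

# reload every sysctl configuration file and print the load order
sysctl --system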

@mavenugo
Contributor

Please check if this comment is applicable here.

@sixcounts

@mavenugo @christopherobin @GabKlein @sanimej Hey team, this Moby GitHub issue has no assignee; can someone give us an overview of where this is at?

Thank you

@bm-skutzke

@christopherobin Thanks a lot!

Tweaking net.ipv4.tcp_keepalive_time solved my issues with long-running curl queries to a REST API of a reporting service. This service consists of a Tomcat and a MySQL container running on different Docker Swarm nodes.
MySQL errors like "Aborted connection XXX to db ... (Got an error reading communication packets)" disappeared as well.

@BenoitNorrin

BenoitNorrin commented Nov 23, 2017

We are facing this issue too, and tweaking net.ipv4.tcp_keepalive_* didn't help either.

docker version:

Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:42:18 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:40:56 2017
 OS/Arch:      linux/amd64
 Experimental: false

docker info:

Containers: 23
 Running: 13
 Paused: 0
 Stopped: 10
Images: 15
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: c0yf55uumm2j1sr81l0bsuskn
 Is Manager: true
 ClusterID: wwv9hujonsqhlwpakwjfqacbt
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
  External CAs:
    cfssl: https://10.211.164.217:12381/api/v1/cfssl/sign
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.211.164.217
 Manager Addresses:
  10.211.164.217:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-98-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 15.67GiB
Name: vm-swm-overlay-1
ID: KFCC:IGCA:OSVX:62BR:7S6Z:LC3G:H6DT:IRPL:DHTG:DF7A:QSL4:2YUH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Here is a simple way to reproduce this issue with netcat.

server.yml

version: '3.2'
services:
  server:
    image: multicloud/netcat
    ports:
      - 9898
    command: -lp 9898
    deploy:
      mode: replicated
      replicas: 1
    stdin_open: true
    tty: true
    networks:
      - test-timeout
networks:
  test-timeout:
    external: true

client.yml

version: '3.2'
services:
  client:
    image: multicloud/netcat
    command: server 9898
    deploy:
      mode: replicated
      replicas: 1
    stdin_open: true
    tty: true
    networks:
      - test-timeout

networks:
  test-timeout:
    external: true

  • Launch the server container, then the client:
    docker stack deploy -c server.yml netcat
    docker stack deploy -c client.yml netcat

  • Attach a terminal to both containers:
    docker attach $(docker ps -q -f name="netcat_server")
    docker attach $(docker ps -q -f name="netcat_client")

  • Write something in one of the terminals; you will see the result in the other.

  • Wait at least 900 seconds.

  • Write something... boom! The connection is broken and the container will crash.

Tested with Ubuntu and CentOS.

I think this problem is related to the default service discovery mode of swarm (vip), because it does not occur in dnsrr mode.

@harry75369

harry75369 commented Dec 5, 2017

This problem is due to the kernel module IPVS. Look at this line: https://github.com/torvalds/linux/blob/master/net/netfilter/ipvs/ip_vs_proto_tcp.c#L366

I changed the IP_VS_TCP_S_ESTABLISHED timeout from 900 to a larger value, recompiled the module, and reloaded the ip_vs and ip_vs_rr kernel modules, and the problem is gone. (Maybe reloading just ip_vs is also fine; not tested.)
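
For anyone who would rather not recompile the module, ipvsadm can also change these timers at runtime with ipvsadm --set tcp tcpfin udp. A hedged sketch below: it has to run inside the namespace that holds the service's IPVS rules, and whether the values survive the namespace being recreated is not verified here.

# raise the established-connection timeout from 900 to 3600 seconds
nsenter --net=/var/run/docker/netns/<netns-id> ipvsadm --set 3600 120 300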

Compared with the following default kernel parameters, the IP_VS_TCP_S_ESTABLISHED value of IPVS is obviously too small!

net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300

On the other hand, tuning kernel parameters like net.ipv4.tcp_keepalive_time does not work for me. Even with the default values, I cannot capture TCP keepalive packets when they should be sent, and thus the connection is always eventually dropped/reset by IPVS. I think it is due to my application: even though the kernel supports TCP keepalive, the application has to enable it on its sockets. See http://www.tldp.org/HOWTO/TCP-Keepalive-HOWTO/programming.html
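
One way to check whether an application really enabled keepalive on its connection (a sketch; run it where the client socket lives, and the port filter is only an example):

# established sockets with SO_KEEPALIVE set show a timer:(keepalive,...) field
ss -tno state established '( dport = :9898 )'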

@vassilvk

vassilvk commented Jan 12, 2018

@christopherobin, @bm-skutzke, how were you able to set net.ipv4.tcp_keepalive_time in service-task containers running in Docker swarm mode?

This is a namespaced kernel parameter and it looks like tuning sysctl parameters is not yet possible in Docker swarm mode: #25209, #33649.

I tried to bake net.ipv4.tcp_keepalive_time = 600 into the image through a new /etc/sysctl.d/* file as well as by directly modifying /etc/sysctl.conf, but those changes didn't take. Running that image as a service in Docker swarm through docker stack deploy and then shelling into it and probing net.ipv4.tcp_keepalive_time gives back the default value of 7200.

I am asking because, based on your comments, it seems like both of you managed to pull that off in a Docker swarm mode setup somehow?

(Running Docker 17.12.0-ce, build c97c6d6 on Win10).

@christopherobin
Author

@vassilvk I have been running my own VMs and bare-metal servers, so I didn't run into your issue. I'm not entirely sure what the best way to do it is for Docker on Windows.

Baking the parameters into the Docker images themselves won't work (since the init in your container won't apply anything from those files and, like you said, they are namespaced), so you'll need to do it at the host level.

I'd recommend opening an issue on https://github.com/linuxkit/linuxkit to have it baked in the default image and maybe try to make your own image in the meantime.

It might also be possible to set it by abusing nsenter from a privileged container, something like nsenter -t 1 -a sysctl -w net.ipv4.tcp_keepalive_time=600 maybe? Not sure if it works, and the change will be lost every time Docker or your machine restarts.
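
A variation on that idea that avoids nsenter: a privileged container sharing the host's network namespace can write the value directly (a sketch; the alpine image is illustrative, this sets the host-level value only, and it is lost on reboot):

docker run --rm --privileged --net=host alpine \
  sysctl -w net.ipv4.tcp_keepalive_time=600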

@vassilvk

Thanks @christopherobin - makes sense.
I didn't realize that you were making those changes on the host.

@ju-la-berger

We recently ran into this issue using Docker CE 18.03.1 on CentOS 7.

Using Swarm overlay networking with endpoint mode virtual IP (vip) on the server side (i.e. a database Swarm service or another microservice Swarm service) causes TCP connections to break after being idle for 15 minutes. (This applies to our JDBC connection pool as well as the Netty HTTP client connection pool used for inter-service communication.)

Our workaround is to set the database service to endpoint mode dnsrr and to disable the Netty HTTP connection pooling. (The database will not be a Swarm service in production anyway.)
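
For reference, dnsrr can also be requested when creating a service from the CLI (a sketch; the service and network names are illustrative, and dnsrr cannot be combined with ingress-mode published ports):

docker service create --name db --network my-overlay --endpoint-mode dnsrr <image>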

My question is: is anyone working on this issue, or do you have any other recommendations regarding workarounds? (Other than switching to Kubernetes.)

Thanks in advance!

@vassilvk

vassilvk commented Jun 15, 2018

@ju-la-berger - I solved the issue on my end by using keep-alive for the application-level connection. This is protocol specific (I am using gRPC). If Netty HTTP supports keep-alive, maybe you can try that.

@fcrisciani
Contributor

Please refer to: #37466 (comment) and https://success.docker.com/article/ipvs-connection-timeout-issue

@thaJeztah
Member

Let me close this issue, with the comments above referring to solutions and how to configure this.

@thaJeztah
Member

thaJeztah commented Aug 23, 2018

WIP Pull request for setting sysctl for swarm services: #37701 / moby/swarmkit#2729

dyrnq added a commit to dyrnq/kubeadm-vagrant that referenced this issue Sep 27, 2021