
Containers and docker commands constantly losing network connectivity. Multiple docker restarts required daily. #1374

Closed
zwass opened this issue Mar 3, 2017 · 89 comments

Comments

@zwass

commented Mar 3, 2017

Expected behavior

Docker and related commands run with stability.

Actual behavior

I have to restart Docker at least 2-3 times a day: I lose network connectivity with containers, and docker/docker-compose commands begin to hang. Restarting Docker for Mac seems to be the only fix.

Information

  • Full output of the diagnostics from "Diagnose & Feedback" in the menu
failure: Optional("diagnostic: response is not valid JSON")

After another try the diagnostic "succeeded":

Docker for Mac: version: 1.13.1 (94675c5a7)
macOS: version 10.12.3 (build: 16D32)
logs: /tmp/B2462039-60A5-402A-AA03-4606F57E702D/20170302-164634.tar.gz
failure: docker ps failed: (Failure "docker ps: timeout after 10.00s")
[OK]     vmnetd
[OK]     dns
[OK]     driver.amd64-linux
[OK]     virtualization VT-X
[OK]     app
[OK]     moby
[OK]     system
[OK]     moby-syslog
[OK]     db
[OK]     env
[OK]     virtualization kern.hv_support
[OK]     slirp
[OK]     osxfs
[OK]     moby-console
[OK]     logs
[ERROR]  docker-cli
         docker ps failed
[OK]     menubar
[OK]     disk

Diagnostic ID: B2462039-60A5-402A-AA03-4606F57E702D

$ docker-compose --verbose stop
compose.config.config.find: Using configuration files: ./docker-compose.yml
docker.auth.find_config_file: Trying paths: ['/Users/zwass/.docker/config.json', '/Users/zwass/.dockercfg']
docker.auth.find_config_file: Found file at path: /Users/zwass/.docker/config.json
docker.auth.load_config: Found 'auths' section
docker.auth.parse_auth: Found entry (registry=u'https://index.docker.io/v1/', username=u'zwass')
compose.cli.command.get_client: docker-compose version 1.11.1, build 7c5d5e4
docker-py version: 2.0.2
CPython version: 2.7.12
OpenSSL version: OpenSSL 1.0.2j  26 Sep 2016
compose.cli.command.get_client: Docker base_url: http+docker://localunixsocket
compose.cli.command.get_client: Docker version: KernelVersion=4.9.8-moby, Arch=amd64, BuildTime=2017-02-08T08:47:51.966588829+00:00, ApiVersion=1.26, Version=1.13.1, MinAPIVersion=1.12, GitCommit=092cba3, Os=linux, Experimental=True, GoVersion=go1.7.5
compose.cli.verbose_proxy.proxy_callable: docker containers <- (all=False, filters={u'label': [u'com.docker.compose.project=osquery', u'com.docker.compose.oneoff=False']})
ERROR: compose.cli.errors.log_timeout_error: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
$ docker -D ps

(hangs indefinitely, I killed it after minutes)

Steps to reproduce the behavior

  1. Use docker for a couple hours, starting and stopping containers.
  2. Wait until docker-compose or other docker commands begin to fail.
@rogaha

commented Mar 3, 2017

@zwass it looks like your containers are using too many forwarded connections; we have a very conservative limit (900). Can you please try the following:

cd ~/Library/Containers/com.docker.docker/Data/database/
git reset --hard
echo 2000 > com.docker.driver.amd64-linux/slirp/max-connections
git add com.docker.driver.amd64-linux/slirp/max-connections
git commit -s -m 'increase max connections to 2000'

then restart Docker for Mac. Let us know if that helps!
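A quick way to check that the change took effect after the restart: the slirp process logs its connection limit via syslog (the `last_limit_line` helper below is illustrative, not part of Docker).

```shell
# Sketch: after restarting Docker for Mac, print the most recent
# "connection limit" line from the Docker syslog output.
# last_limit_line is an illustrative helper, not part of Docker.
last_limit_line() {
  grep -i 'connection limit' | tail -n 1
}

# Live usage on the macOS host:
#   syslog -k Sender Docker | last_limit_line
#   # e.g. "... Docker[1476] <Notice>: updating connection limit to 2000"
```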

@rogaha

commented Mar 3, 2017

@zwass I've raised an internal issue to expose this kind of configuration via the Docker app, so it will be much easier to configure in the future!

@zwass

Author

commented Mar 6, 2017

@rogaha Thank you for the suggestion. A coworker suggested turning off "experimental features", and so far I have not had issues since then. I'm going to continue monitoring and will try your suggestion after the next freeze. Will report back.

@zwass zwass closed this Mar 6, 2017

@zwass zwass reopened this Mar 6, 2017

@zwass

Author

commented Mar 9, 2017

I am currently experiencing this (or a related) issue; however, docker ps is working, as is the diagnostic tool:

Docker for Mac: version: 17.03.0-ce-mac2 (1d7d97bbb)
macOS: version 10.12.3 (build: 16D32)
logs: /tmp/B2462039-60A5-402A-AA03-4606F57E702D/20170308-172459.tar.gz
[OK]     vmnetd
[OK]     dns
[OK]     driver.amd64-linux
[OK]     virtualization VT-X
[OK]     app
[OK]     moby
[OK]     system
[OK]     moby-syslog
[OK]     db
[OK]     env
[OK]     virtualization kern.hv_support
[OK]     slirp
[OK]     osxfs
[OK]     moby-console
[OK]     logs
[OK]     docker-cli
[OK]     menubar
[OK]     disk

B2462039-60A5-402A-AA03-4606F57E702D

I will now try the above suggestion.

@zwass

Author

commented Mar 9, 2017

To be clear, the issue I am encountering is many EOF errors when attempting to connect to my MySQL database running in Docker.

@zwass

Author

commented Mar 9, 2017

Seeing network issues again after trying the above suggestion. This is the output of diagnose:

failure: Optional("diagnostic: response is not valid JSON")
@zwass

Author

commented Mar 14, 2017

Today I am encountering network issues between containers, but docker ps and docker-compose ps both return successfully.

Diagnose gives me the usual error:

failure: Optional("diagnostic: response is not valid JSON")

/Applications/Docker.app/Contents/Resources/bin/docker-diagnose --verbose --json immediately exits nonzero.

Running that command without the JSON flags gave varying output:

$ /Applications/Docker.app/Contents/Resources/bin/docker-diagnose
Error exec: plutil -extract appVersionHistory xml1 "/Users/zwass//Library/Group Containers/group.com.docker/Library/Preferences/group.com.docker.plist" -o -: exit 1
macOS: version 10.12.3 (build: 16D32)
Docker.app: version: 17.03.0-ce-mac2 (1d7d97bbb)
Local time: Tue Mar 14 14:51:22 PDT 2017
UTC:        Tue Mar 14 21:51:22 UTC 2017
Timestamp:  20170314-145122
Running diagnostic tests:
[OK]      docker-cli
[OK]      Moby booted
[OK]      driver.amd64-linux
[OK]      vmnetd
[OK]      osxfs
[OK]      db

... exited nonzero

$ /Applications/Docker.app/Contents/Resources/bin/docker-diagnose
Error exec: plutil -extract appVersionHistory xml1 "/Users/zwass//Library/Group Containers/group.com.docker/Library/Preferences/group.com.docker.plist" -o -: exit 1
macOS: version 10.12.3 (build: 16D32)
Docker.app: version: 17.03.0-ce-mac2 (1d7d97bbb)
Local time: Tue Mar 14 14:52:57 PDT 2017
UTC:        Tue Mar 14 21:52:57 UTC 2017
Timestamp:  20170314-145257
Running diagnostic tests:
[OK]      docker-cli
[OK]      Moby booted
[OK]      driver.amd64-linux
[OK]      vmnetd
[OK]      osxfs
[OK]      db
[OK]      slirp
[OK]      disk
[OK]      menubar
[OK]      environment
[OK]      Docker
[OK]      VT-x
[OK]      kern.hv_support
Error echo "00000003.0000f3a6" | nc -U /Users/zwass/Library/Containers/com.docker.docker/Data/@connect > /tmp/B2462039-60A5-402A-AA03-4606F57E702D/20170314-145257/diagnostics.tar: timeout after 30.00s
Docker logs are being collected into /tmp/B2462039-60A5-402A-AA03-4606F57E702D/20170314-145257.tar.gz
Would you like to upload log files? [Y/n]: y

...exited nonzero.

@zwass

Author

commented Mar 14, 2017

I notice that the docker logs are full of the following:

Mar 14 14:59:57 zwass Docker[1476] <Error>: exceeded maximum number of forwarded connections (2000)
Mar 14 14:59:57 zwass Docker[1476] <Error>: PPP.listen callback caught Hostnet.Host_uwt.Sockets.Too_many_connections
Mar 14 15:00:04 zwass Docker[1476] <Error>: Hostnet_udp.input: bind raised Hostnet.Host_uwt.Sockets.Too_many_connections

I would think that stopping my containers would allow these forwarded connections to be cleaned up and allow future connections to succeed. Is this not the case?

@djs55

Contributor

commented Mar 20, 2017

I had a look through the diagnostic logs for clues but didn't spot anything else, apart from the Too_many_connections errors you've reported.

It must be either a real connection leak somewhere or a bug in the connection tracking code.

As an experiment could you try disabling the connection tracking altogether? It can be done by:

$ cd ~/Library/Containers/com.docker.docker/Data/database/
$ git reset --hard
HEAD is now at 35c2bf3 Setting certificates
$ cat com.docker.driver.amd64-linux/sl
$ cat com.docker.driver.amd64-linux/slirp/max-connections 
2000
$ git rm com.docker.driver.amd64-linux/slirp/max-connections 
rm 'com.docker.driver.amd64-linux/slirp/max-connections'
$ git commit -s -m 'disable connection limit'
[state fa675ee] disable connection limit
 1 file changed, 1 deletion(-)
 delete mode 100644 com.docker.driver.amd64-linux/slirp/max-connections

If you run syslog -k Sender Docker you should see the log line:

Mar 20 16:20:10 Davids-MBP-2 Docker[88691] <Notice>: remove connection limit

If your tests still fail could you find the pid of the com.docker.slirp process and run:

sudo lsof -p <pid>

-- this should list the open connections. I'd be interested to know what kind of connections they are.
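To make that lsof output easier to read, one could summarise it by TCP state; a minimal sketch (the `count_states` helper is illustrative, not part of Docker):

```shell
# Summarise lsof output by TCP state: extract the trailing "(STATE)"
# token from each line, then count occurrences, most frequent first.
count_states() {
  grep -o '([A-Z_0-9]*)$' | tr -d '()' | sort | uniq -c | sort -rn
}

# Live usage, assuming com.docker.slirp is the process of interest:
#   sudo lsof -nP -p "$(pgrep -f com.docker.slirp)" | count_states
```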

Do you know what sort of network workload you're generating? I've not been able to reproduce this locally, so any clue about building a reproduction environment would be appreciated. I noticed some log lines mention UDP, which could just be a coincidence.

Thanks again for your report!

@zwass

Author

commented Mar 21, 2017

@djs55 Thanks for the followup!

I disabled connection tracking yesterday and was able to start and stop thousands of containers with minimal network errors, and none that required restarting Docker. Typically I would have had to restart Docker every few dozen.

Today I did hit the error again:

docker-compose up
Starting kolide_mailhog_1
kolide_mysql_test_1 is up-to-date
Starting kolide_redis_1
Starting kolide_mysql_1

ERROR: for redis  Cannot start service redis: driver failed programming external connectivity on endpoint kolide_redis_1 (df726e28626a20bb3c5ed8c6172b24aab067c05fa24596465504df65e0a84fb8): Error starting userland proxy: Bind for 0.0.0.0:6379: unexpected error Hostnet.Host_uwt.Sockets.Too_many_connections

ERROR: for mailhog  Cannot start service mailhog: driver failed programming external connectivity on endpoint kolide_mailhog_1 (cb0e28b20e7edd4a1d0d3d41647597fe1bb66e4e2ea13bb6fd8941ca9c09c2ba): Error starting userland proxy: Bind for 0.0.0.0:8025: unexpected error Hostnet.Host_uwt.Sockets.Too_many_connections

ERROR: for mysql  Cannot start service mysql: driver failed programming external connectivity on endpoint kolide_mysql_1 (d9890212eee4e7300ee5d14ddc276f26a39c037371035d56f546e243c38a7d0f): Error starting userland proxy: Bind for 0.0.0.0:3306: unexpected error Hostnet.Host_uwt.Sockets.Too_many_connections
ERROR: Encountered errors while bringing up the project.

Here's the lsof:

sudo lsof -p 2415
COMMAND    PID  USER   FD     TYPE             DEVICE  SIZE/OFF    NODE NAME
com.docke 2415 zwass  cwd      DIR                1,4       782 3881700 /Users/zwass/Library/Containers/com.docker.docker/Data
com.docke 2415 zwass  txt      REG                1,4  11533984 7688901 /Applications/Docker.app/Contents/MacOS/com.docker.slirp
com.docke 2415 zwass  txt      REG                1,4    694528 3990372 /usr/lib/dyld
com.docke 2415 zwass  txt      REG                1,4 656064512 6742777 /private/var/db/dyld/dyld_shared_cache_x86_64h
com.docke 2415 zwass    0     PIPE 0xa98601562254591b     16384         ->0xa986015622545a9b
com.docke 2415 zwass    1w     CHR                3,2       0t0     304 /dev/null
com.docke 2415 zwass    2w     CHR                3,2       0t0     304 /dev/null
com.docke 2415 zwass    3u    unix 0xa98601561ec34aeb       0t0         /Users/zwass/Library/Containers/com.docker.docker/Data/s50
com.docke 2415 zwass    4u    unix 0xa98601561ec364b3       0t0         /Users/zwass/Library/Containers/com.docker.docker/Data/s51
com.docke 2415 zwass    5u    unix 0xa986015622dadf33       0t0         /Users/zwass/Library/Containers/com.docker.docker/Data/s52
com.docke 2415 zwass    6u    unix 0xa986015622db04b3       0t0         /Users/zwass/Library/Containers/com.docker.docker/Data/s53
com.docke 2415 zwass    7     PIPE 0xa98601563690741b     16384         ->0xa9860156369089db
com.docke 2415 zwass    8     PIPE 0xa9860156369089db     16384         ->0xa98601563690741b
com.docke 2415 zwass    9     PIPE 0xa986015636907d1b     16384         ->0xa98601563690699b
com.docke 2415 zwass   10     PIPE 0xa98601563690699b     16384         ->0xa986015636907d1b
com.docke 2415 zwass   11u  KQUEUE                                      count=0, state=0xa
com.docke 2415 zwass   12     PIPE 0xa986015636906c9b     16384         ->0xa98601563690765b
com.docke 2415 zwass   13     PIPE 0xa98601563690765b     16384         ->0xa986015636906c9b
com.docke 2415 zwass   14     PIPE 0xa98601563690771b     16384         ->0xa98601563690885b
com.docke 2415 zwass   15     PIPE 0xa98601563690885b     16384         ->0xa98601563690771b
com.docke 2415 zwass   16u    unix 0xa98601562c1a525b       0t0         /Users/zwass/Library/Containers/com.docker.docker/Data/s50
com.docke 2415 zwass   17w     CHR                3,2       0t0     304 /dev/null
com.docke 2415 zwass   18u    unix 0xa98601562c1a3c7b       0t0         ->0xa98601562c1a5963
com.docke 2415 zwass   19r     CHR                3,2       0t0     304 /dev/null
com.docke 2415 zwass   20r     REG                1,4       256 8719072 /private/var/run/resolv.conf
com.docke 2415 zwass   21u    unix 0xa986015640ad4573       0t0         /Users/zwass/Library/Containers/com.docker.docker/Data/s51
com.docke 2415 zwass   22u    IPv4 0xa98601562b518beb       0t0     TCP 192.168.0.30:61361->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   23u    IPv4 0xa98601562bc20fcb       0t0     TCP 192.168.0.21:61654->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   24u    IPv4 0xa98601562bc1ebeb       0t0     TCP 192.168.0.21:61405->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   25u    IPv4 0xa986015635210beb       0t0     TCP 192.168.0.30:61364->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   26u    IPv4 0xa986015621b37fcb       0t0     TCP localhost:mysql->localhost:61571 (FIN_WAIT_2)
com.docke 2415 zwass   27u    IPv4 0xa9860156372866d3       0t0     TCP 192.168.0.30:61486->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   28u    IPv4 0xa986015643e6fbeb       0t0     TCP 192.168.0.21:61589->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   29u    IPv4 0xa9860156352102f3       0t0     TCP 192.168.0.21:54914->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   30u    IPv4 0xa98601563fb80fcb       0t0     TCP 192.168.0.30:61623->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   31u    IPv4 0xa98601564220e6d3       0t0     TCP 192.168.0.30:61624->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   32u    IPv4 0xa9860156407e1fcb       0t0     TCP 192.168.0.30:61625->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   33u    IPv4 0xa98601563241cbeb       0t0     TCP 192.168.0.30:61626->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   34u    IPv4 0xa986015636aa4beb       0t0     TCP 192.168.0.30:61627->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   35u    IPv4 0xa98601564162dddb       0t0     TCP 192.168.0.30:61628->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   36u    IPv4 0xa98601564162efcb       0t0     TCP 192.168.0.30:61629->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   37u    IPv4 0xa98601564202cddb       0t0     TCP 192.168.0.30:61630->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   38u    IPv4 0xa98601564162cbeb       0t0     TCP *:opsession-prxy (LISTEN)
com.docke 2415 zwass   39u    IPv6 0xa98601561d274f3b       0t0     TCP localhost:opsession-prxy (LISTEN)
com.docke 2415 zwass   40u    unix 0xa986015631b736a3       0t0         ->0xa986015631b744b3
com.docke 2415 zwass   41u    IPv4 0xa9860156261306d3       0t0     TCP 192.168.0.30:61631->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   42u    IPv4 0xa98601563eb03ddb       0t0     TCP 192.168.0.21:63181->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   43u    IPv4 0xa986015635211ddb       0t0     TCP 192.168.0.21:60831->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   44u    IPv4 0xa986015621b352f3       0t0     TCP 192.168.0.30:61365->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   45u    IPv4 0xa9860156407bb4e3       0t0     TCP 192.168.0.30:61366->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   46u    IPv4 0xa98601563fd146d3       0t0     TCP 192.168.0.21:51624->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   47u    IPv4 0xa98601563241efcb       0t0     TCP localhost:mysql->localhost:61968 (FIN_WAIT_2)
com.docke 2415 zwass   48u    unix 0xa986015629db0ed3       0t0         ->0xa986015629db03e3
com.docke 2415 zwass   49u    IPv4 0xa9860156407e16d3       0t0     TCP 192.168.0.21:51142->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   50u    IPv4 0xa98601563241c2f3       0t0     TCP 192.168.0.21:57539->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   51u    IPv4 0xa98601561bda72f3       0t0     TCP 192.168.0.21:56755->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   52u    IPv4 0xa986015643e7c8c3       0t0     TCP 192.168.0.21:60226->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   53u    IPv4 0xa986015625df4ddb       0t0     TCP 192.168.0.21:55987->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   54u    IPv4 0xa98601564202bbeb       0t0     TCP 192.168.0.21:60400->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   55u    IPv4 0xa986015625df56d3       0t0     TCP 192.168.0.21:56171->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   56u    IPv4 0xa98601564229bfcb       0t0     TCP 192.168.0.21:50509->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   57u    IPv4 0xa98601563fb7f4e3       0t0     TCP 192.168.0.21:60489->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   58u    IPv4 0xa98601562d5402f3       0t0     TCP 192.168.0.21:62497->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   59u    IPv4 0xa986015643e6f2f3       0t0     TCP 192.168.0.21:56510->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   60u    IPv4 0xa9860156226babeb       0t0     TCP 192.168.0.21:53654->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   61u    IPv4 0xa9860156237f08c3       0t0     TCP 192.168.0.21:57028->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   62u    IPv4 0xa98601563fb0a2f3       0t0     TCP 192.168.0.21:50406->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   63u    IPv4 0xa98601561bda7beb       0t0     TCP 192.168.0.21:53705->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   64u    IPv4 0xa98601563fb0cfcb       0t0     TCP 192.168.0.21:60054->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   65u    IPv4 0xa9860156372854e3       0t0     TCP 192.168.0.21:51527->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   66u    IPv4 0xa98601564041a8c3       0t0     TCP 192.168.0.21:64664->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   67u    IPv4 0xa98601562b5182f3       0t0     TCP 192.168.0.21:54415->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   68u    IPv4 0xa98601564158bfcb       0t0     TCP 192.168.0.21:58062->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   69u    IPv4 0xa986015636aa5ddb       0t0     TCP 192.168.0.21:53117->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   70u    IPv4 0xa98601561bdaa8c3       0t0     TCP 192.168.0.21:59118->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   71u    IPv4 0xa98601562612fddb       0t0     TCP 192.168.0.21:61576->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   72u    IPv4 0xa9860156237eeddb       0t0     TCP 192.168.0.21:50077->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   73u    IPv4 0xa9860156372878c3       0t0     TCP 192.168.0.30:61367->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   74u    IPv4 0xa98601564229b6d3       0t0     TCP 192.168.0.21:62117->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   75u    IPv4 0xa9860156239af6d3       0t0     TCP 192.168.0.21:50786->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   76u    IPv4 0xa98601564220f8c3       0t0     TCP 192.168.0.21:49711->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   77u    IPv4 0xa986015641e312f3       0t0     TCP 192.168.0.21:58095->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   78u    IPv4 0xa986015643e704e3       0t0     TCP 192.168.0.30:61368->192.168.0.21:http-alt (SYN_SENT)
com.docke 2415 zwass   79u    IPv4 0xa98601563fd14fcb       0t0     TCP 192.168.0.21:59641->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   80u    IPv4 0xa986015625df5fcb       0t0     TCP 192.168.0.21:61053->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   81u    IPv4 0xa98601563fb806d3       0t0     TCP 192.168.0.21:59383->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   82u    IPv4 0xa98601563fd122f3       0t0     TCP 192.168.0.21:55327->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   84u    IPv4 0xa9860156418e3ddb       0t0     TCP 192.168.0.21:53984->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   85u    IPv4 0xa9860156418e22f3       0t0     TCP 192.168.0.21:53466->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   86u    IPv4 0xa98601562d5438c3       0t0     TCP 192.168.0.21:60165->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   87u    IPv4 0xa986015643e728c3       0t0     TCP 192.168.0.21:56519->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   88u    IPv4 0xa9860156404184e3       0t0     TCP 192.168.0.21:61108->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   89u    IPv4 0xa9860156237ef6d3       0t0     TCP 192.168.0.21:65109->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   91u    IPv4 0xa9860156418e58c3       0t0     TCP 192.168.0.21:59975->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   92u    IPv4 0xa986015637285ddb       0t0     TCP 192.168.0.21:63045->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   93u    IPv4 0xa9860156425f4fcb       0t0     TCP 192.168.0.21:59133->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   94u    IPv4 0xa98601563241d4e3       0t0     TCP 192.168.0.21:56799->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   95u    IPv4 0xa98601562d541ddb       0t0     TCP 192.168.0.21:56586->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   96u    IPv4 0xa986015643e7addb       0t0     TCP 192.168.0.21:59996->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   98u    IPv4 0xa986015621b376d3       0t0     TCP 192.168.0.21:61738->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass   99u    IPv4 0xa98601563eb022f3       0t0     TCP 192.168.0.21:55206->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  100u    IPv4 0xa9860156404d32f3       0t0     TCP 192.168.0.21:56343->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  103u    IPv4 0xa986015640417beb       0t0     TCP 192.168.0.21:50498->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  105u    IPv4 0xa986015636aa66d3       0t0     TCP 192.168.0.21:56523->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  106u    IPv4 0xa9860156407e0ddb       0t0     TCP 192.168.0.21:56026->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  107u    IPv4 0xa98601564202c4e3       0t0     TCP 192.168.0.21:56524->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  109u    IPv4 0xa9860156372a2fcb       0t0     TCP 192.168.0.21:58335->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  111u    IPv4 0xa98601562ac91beb       0t0     TCP 192.168.0.21:59040->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  112u    IPv4 0xa98601564147d4e3       0t0     TCP 192.168.0.21:50694->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  113u    IPv4 0xa9860156239ae4e3       0t0     TCP 192.168.0.21:58840->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  115u    IPv4 0xa98601563eb058c3       0t0     TCP 192.168.0.21:59138->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  118u    IPv4 0xa9860156418e34e3       0t0     TCP 192.168.0.21:59309->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  123u    IPv4 0xa98601564147e6d3       0t0     TCP 192.168.0.21:59310->192.168.0.21:http-alt (CLOSE_WAIT)
com.docke 2415 zwass  167u    IPv4 0xa9860156404172f3       0t0     TCP 192.168.0.21:55024->192.168.0.21:http-alt (CLOSE_WAIT)

The networking workload should be mostly TCP connections from the containers to a server running on the host (not in a container). That server also connects to a MySQL DB, a mailserver, and Redis, all running in Docker containers. I am going to see if I can find a way to repro this with components I can disclose publicly and report back.

@zwass

Author

commented Mar 21, 2017

Ah, actually it looks like the connection limit was reinstated:

Mar 20 10:17:22 zwass Docker[95151] <Notice>: remove connection limit
Mar 20 11:23:30 zwass Docker[2415] <Notice>: updating connection limit to 900
Mar 21 12:52:13 zwass Docker[74120] <Notice>: updating connection limit to 900

I think I may have forgotten to commit after I deleted the file.

I'm going to leave the limit on while I attempt a good public repro, then I will try deleting the file again and committing.

@avsm

commented Mar 30, 2017

@zwass Docker for Mac versions from 17.03.1-ce-rc1-mac3 onwards have increased the connection limit to 2000. We're also investigating removing the limit entirely and raising the ulimit, but that is not yet implemented.

Since it appears that your original bug report is fixed (and there is a workaround to disable the connection tracking), I'll go ahead and close this issue. We have active internal bugs in place to improve the GUI to remove the need for manual git calls for these experimental bits of functionality, and also to continue to refine the connection tracking logic. If you do have more information you'd like to report, please feel free to re-open this bug or create a fresh one!

@ajkerr

commented Apr 19, 2017

I think that there is definitely a leak somewhere. I tried raising the connection limit from 900 to 2000, as recommended in #1132, but this only bought me a bit more time.

I now have no running or stopped containers, but when I run 'lsof -p' on the com.docker.slirp process, I see over 2000 connections, most of them in CLOSE_WAIT state:

com.docke 66605 akerr 2006u IPv4 0xf06b7c9a8a33276d 0t0 TCP 192.168.1.7:62908->s3-us-west-2-r-w.amazonaws.com:https (CLOSE_WAIT)

Our code does a lot of interaction with S3, and most of these connections are related to that. The question is, why aren't they cleaned up, especially if there are no containers running?

@linickx

commented Apr 23, 2017

+1

OSX: 10.11.6
Docker: 17.03.1-ce-mac5 (16048)

I'm testing prometheus at home, so the polling is generating a lot of network connections. The default 900 lasts about a day; upping it to 2000 buys more time, about 3 days.

  1. Is there a way to restart the docker services/processes via cron (as a bodge job)?
  2. What debug information is needed to find the possible leak?

EDIT: Note for clarity, restarting containers doesn't fix it.
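On the cron question: a hypothetical crontab entry for the bodge (assumes Docker for Mac is installed as /Applications/Docker.app; this restarts the app, it does nothing about the underlying leak).

```
# Hypothetical crontab entry: quit and relaunch Docker for Mac nightly
# at 03:00 as a stopgap. The sleep gives the app time to shut down.
0 3 * * * osascript -e 'quit app "Docker"' && sleep 15 && open -a Docker
```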

@zwass

Author

commented Apr 24, 2017

@avsm can you please reopen this issue? I continue to experience it, and it looks like others are seeing it as well.

@avsm avsm reopened this Apr 24, 2017

@avsm

commented Apr 24, 2017

Reopening this and ccing @djs55 so he sees it.

@vgoklani

commented May 9, 2017

I have the same issue too, please re-open. Moreover, why do we need a connection limit?

@ihomeautomate

commented May 21, 2017

Same issue here. Lots of CLOSE_WAIT connections until the limit is finally reached :-(.
(experimental features turned off)

@sulphur

commented Nov 7, 2017

@toblerpwn In my case haproxy seems to have an impact on how quickly the connection count grows. I believe it is because haproxy does health checks on a regular basis. It might be the way haproxy sends the packets that leaves them resting in FIN_WAIT_2 and CLOSE_WAIT states. So haproxy is the fastest way to reproduce this behaviour. However, as I said before, on 17.06 I never have more than 100 connections and my docker runs for days. This behaviour was introduced in 17.08-rc2 (I remember it was an rc2 at the beginning of August), and I've been running 17.06 since then (you need OSX 10.12 Sierra for that, however).

@copywrite

commented Nov 13, 2017

Followed
git rm com.docker.driver.amd64-linux/slirp/max-connections
and restarted the container, but still got
"Hostnet_udp.input: bind raised Too many open files"

I am now running a python script in the container to open 20K TCP connections; it stops when the total reaches about 9811.

@a10kiloham

commented Nov 16, 2017

Still happening to me as well, especially with linuxserver/transmission, which opens many connections.

@toblerpwn

commented Nov 17, 2017

I can confirm that version 17.06 works for me as well, fwiw. And 17.06 seems to work fine for me on High Sierra (macOS 10.13) so far, but others may not be so lucky.

@a10kiloham

commented Nov 18, 2017

@sulphur

commented Nov 25, 2017

Tried 17.11.0-ce-mac40 but no improvement. I'm going to have to find an alternative to haproxy, I guess.

@krm1312

commented Nov 25, 2017

Also had to roll back to 17.06.2-ce-mac27 (19092). We also have a haproxy container doing health checks of several other internal containers and external services. Networking dies within a couple of minutes in stable versions after 17.06.2-ce-mac27 (19092).

@sulphur

commented Jan 2, 2018

As a workaround you can disable the checks in haproxy. In my case it works. Just delete the check option from your backend config.
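For illustration, a hypothetical haproxy.cfg backend showing that change (the backend name, server name, and address are made up):

```
backend web
    # before: health-checked every interval, generating constant connections
    # server app1 172.17.0.2:8080 check inter 2s
    # after: the same server line with the "check" keyword removed
    server app1 172.17.0.2:8080
```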

@fpoppinga

commented Jan 12, 2018

For me it is not possible to change my HAProxy config or use an alternative to HAProxy. I would highly appreciate an attempt to fix this issue.

In our setup, a large number of containers are started simultaneously, and the time until a restart of docker is required is not one day but closer to one minute. This makes docker-for-mac in any version later than 17.06 unusable for us.

@dpyro

commented Feb 5, 2018

Same issue present on 17.12.0-ce-mac49 (21995).

@yuhr

commented Feb 16, 2018

In my case, HAProxy's default TCP check drives lsof -p $(pgrep vpnkit) | wc -l past the descriptor limit in a few minutes. Setting something like option httpchk HEAD / HTTP/1.0 seemed to get rid of the problem at first, but after a day the count exceeded the limit just as with the default check.
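
For reference, the httpchk variant described above would look roughly like this (hypothetical backend and server names). Per that comment, it only slows the descriptor growth rather than stopping it:

```haproxy
backend app_backend
    # switch from the default TCP connect check to an HTTP HEAD check
    option httpchk HEAD / HTTP/1.0
    server app1 172.17.0.2:8080 check inter 2s
```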

@shadowspawn

commented Mar 28, 2018

I upgraded to Version 18.03.0-ce-mac59 (23608) yesterday and have been running containers continuously since then. Connection count from lsof -p $(pgrep vpnkit) | wc -l seems stable at around 240 when previously it would have been climbing steadily until things broke and I restarted. Only one run so far and I am not certain the change is related to the update, but I am hopeful!
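
If you want to track the same number from a script, here is a small sketch that counts a process's open descriptors via /proc on Linux and falls back to lsof elsewhere (e.g. macOS). The vpnkit PID lookup itself is the same pgrep shown above:

```python
import os
import subprocess

def count_fds(pid: int) -> int:
    """Count open file descriptors for `pid` (/proc on Linux, lsof elsewhere)."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except (FileNotFoundError, PermissionError):
        # macOS has no /proc; fall back to lsof like the command above.
        out = subprocess.run(
            ["lsof", "-p", str(pid)], capture_output=True, text=True, check=False
        )
        return max(len(out.stdout.splitlines()) - 1, 0)  # drop the header line

# Demo against our own PID; for vpnkit, substitute e.g.
# int(subprocess.check_output(["pgrep", "vpnkit"]))
print(count_fds(os.getpid()))
```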

@djs55 djs55 self-assigned this Mar 28, 2018

@sulphur

commented Mar 30, 2018

Not for me :( The count jumped to 2000 rather fast ;( I'm on 18.03.0-ce-mac60.

@TylerLubeck

commented Apr 5, 2018

We think we've tracked down the specific trigger in our case: we have nginx running in a container, configured to re-resolve DNS records for upstream servers every 4 seconds (http://nginx.org/en/docs/http/ngx_http_core_module.html#resolver), with ~40 upstream servers configured. When we bump the valid parameter from 4s to 30s the problem seems to disappear. We haven't been running this for all that long, so we're not totally sure whether it's gone or just delayed, but this could be a good repro case if anyone wants to take a look.
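
A sketch of the relevant nginx directives (the hostnames are placeholders); the valid= parameter is the knob being changed from 4s to 30s:

```nginx
# Re-resolve upstream hostnames every 30s instead of every 4s; with ~40
# upstreams, a 4s interval produces a steady stream of DNS lookups.
# 127.0.0.11 is Docker's embedded DNS server.
resolver 127.0.0.11 valid=30s;

server {
    listen 80;
    location / {
        # Proxying via a variable makes nginx honour the resolver's TTL
        # instead of caching the address once at startup.
        set $backend http://app.internal:8080;
        proxy_pass $backend;
    }
}
```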

@erikvdwal

commented Apr 5, 2018

For what it's worth, I've resorted to installing an older edge version of Docker for Mac (17.12.0-ce-rc4-mac44, 2017-12-21 edge, to be precise), and it has been running without issues for about two weeks straight, whereas on earlier and the most recent versions connectivity issues would start to arise after about a day or two.

@vgoklani

commented Apr 6, 2018

Thanks @erikvdwal. There needs to be some sort of leaderboard for stability; the latest versions always crash after a couple of days...

@dpyro

commented Apr 6, 2018

I've also seen sporadic failures when immediately recreating a network after deleting it. This problem does not happen with a docker-machine provisioned VM, only with the native client.

@haizz

commented Jun 20, 2018

Having exactly the same issue with Docker for Mac Version 18.03.1-ce-mac65 (24312) and haproxy containers with "check" option enabled.

@djs55

Contributor

commented Jun 20, 2018

@toblerpwn @sulphur @krm1312 @fpoppinga @yuhr @haizz (and anyone else interested): There is a known issue in the network stack triggered by haproxy's TCP health checks (and possibly other things). There's a detailed description and a link to a development build with the fix here: #1132 (comment) This fix will be in the next official release. If you get a chance to try it, let me know how it goes.

@toblerpwn

commented Jul 24, 2018

@djs55 Thanks for the tag!

Edge release 18.06.0-ce-rc3-mac68 2018-07-19 did indeed fix the connection leak w/ Node.js & HAProxy for me. Awesome! :D

@shadowspawn

commented Jul 26, 2018

No problems so far after installing the recent Docker for Mac stable release, version 18.06.0-ce-mac70 (26399).

@djs55

Contributor

commented Jul 26, 2018

@JohnRGee thanks for the confirmation.

I believe this bug is fixed in both stable and edge channels now so I'll close the ticket. If there are further problems then please open a fresh ticket and include a set of diagnostics so I can take a look. (If it's related to this problem it might be worth adding a link to this issue too).

Thanks all for your patience!

@jplflyer

commented May 29, 2019

I do not believe this issue is resolved. According to Docker Desktop, I have Docker Engine 18.09.2, Compose 1.2.2, and Machine 0.16.1. Maybe one of these is old and updating Docker doesn't update them.

As of right now, my docker container is refusing connections.

$ lsof -i -n -P +c 0 | egrep docker | wc
1962 19618 271224

$ lsof -i -n -P +c 0 | egrep docker | egrep "CLOSE_WAIT|FIN_WAIT" | wc
1802 18020 254082

I have to restart Docker to get things to work again.

@AndrewRayCode

commented Jul 2, 2019

I have a single Postgres container and a separate Node.js container that connects to it. Randomly, the server stops being able to communicate with the database: it hangs, and restarting the server returns a timeout when connecting to the database. Stopping and starting all the (previously working) containers does nothing, and the server is never able to connect again. docker ps shows all the containers running with the correct ports. The logs for the Postgres database look fine and end with "database system is ready to accept connections", so I assume the issue is with Docker's networking. Only restarting Docker for Mac entirely makes things work again. This is a regular occurrence, a few times a week. I ran the diagnostic tool, but I don't know which part of its output would be helpful to post here for diagnosing the problem.
