Exposed ports become unresponsive after heavy load #426

Open
micdah opened this Issue Jan 24, 2017 · 73 comments


micdah commented Jan 24, 2017

Michael Friis directed me to submit an issue here (see issue 30400 for more)

I am experiencing an intermittent issue with Docker for Windows where suddenly all the exposed ports become unresponsive and no connection can be made to the containers. This happens when a lot of activity is put on the containers from the host machine: I am running 4 containers, and on the host machine 11 services as well as a handful of websites and APIs, which all interact with the containers.

How to reproduce

As requested by Michael Friis, I have made some sample code which seems to be able to reproduce the issue. You can see and clone the code here: github.com/micdah/DockerBomb. I have also made a YouTube video where I demonstrate the issue using my sample code: youtube.com/watch?v=v5k1D60h0zE

I have described how to use the program in the readme.md file in the GitHub repo. Note that it might take anywhere from a few minutes to considerably longer before the issue triggers; it is somewhat random - likely because it is tightly timing related.

The sample program creates the requested number of threads, each creating a single connection to the redis container and issuing as many commands as possible until the connection fails.
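
For illustration only, the load pattern is roughly the following - a minimal Python sketch using the redis-py client; the actual repro in the repo is a .NET application, and the key names below are made up for the example:

# Illustrative sketch only: a rough Python approximation of the load pattern,
# using the redis-py client. The actual repro (micdah/DockerBomb) is a .NET app,
# and the key names below are made up for the example.
import threading
import redis

def hammer(thread_id, host="localhost", port=6379):
    # One connection per thread, issuing commands as fast as possible
    # until the connection fails; no reconnect is attempted.
    conn = redis.Redis(host=host, port=port, socket_timeout=5)
    count = 0
    try:
        while True:
            conn.set("bomb:%d" % thread_id, count)
            conn.get("bomb:%d" % thread_id)
            count += 1
    except redis.exceptions.ConnectionError as exc:
        print("thread %d failed after %d commands: %s" % (thread_id, count, exc))

# e.g. 500 threads, the upper end of what the tool accepts
threads = [threading.Thread(target=hammer, args=(i,)) for i in range(500)]
for t in threads:
    t.start()
for t in threads:
    t.join()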

As demonstrated, once the issue has occurred the container becomes unresponsive on the exposed ports, although it is still running. Trying to restart the container results in an input/output error when it tries to bind to the host port. In my previous issue report (30400) I also included a netstat dump showing that the restart does not fail because the port is already reserved.

Expected behavior

I would expect the container to remain accessible via the exposed ports as long as it is running. If some resource pool (handles, connection pool, etc.) is exhausted, I would expect the container to become responsive again once the resources are freed (for example when the heavy load on the container stops).
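
As a side note, a quick way to check from the host whether an exposed port has recovered is a plain TCP connect probe. A minimal Python sketch, assuming the redis container publishes the default port 6379:

# Minimal probe: does the exposed port still accept TCP connections from the host?
import socket

def port_responds(host="localhost", port=6379, timeout=3):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print("connect to %s:%d failed: %s" % (host, port, exc))
        return False

print("port responds" if port_responds() else "port unresponsive")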

Information

Diagnostic ID
This diagnostic was uploaded just after the issue occurred, reproduced as described above.

30667474-C49F-4185-B957-3A7AE1F38393/2017-01-24_21-44-30

Output of docker version

Client:
 Version:      1.13.0
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Wed Jan 18 16:20:26 2017
 OS/Arch:      windows/amd64

Server:
 Version:      1.13.0
 API version:  1.25 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Wed Jan 18 16:20:26 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info

Containers: 5
 Running: 1
 Paused: 0
 Stopped: 4
Images: 6
Server Version: 1.13.0
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.4-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.837 GiB
Name: moby
ID: 5DLJ:7BM4:KTMA:L5UV:ACM5:HJQP:V2W3:ZQXJ:LUS5:XEVE:FJK2:KH5K
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 21
 Goroutines: 28
 System Time: 2017-01-24T20:49:08.8436128Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
micdah commented Jan 25, 2017

I have tested the sample code on my Mac, running the newest version of Docker, as well, and the issue doesn't seem to appear there. I am still experiencing intermittent connection failures across all threads at once - but there it recovers immediately and continues to work, no fuss.
Whereas on Windows, the port is dead and cannot be used again until Docker has been restarted.

Contributor

rn commented Jan 28, 2017

@micdah Thanks for the very detailed repro (probably the most detailed I've seen so far!).

Just for my record, I had to install the dotnet CLI tools from here to compile and run the stress test.

The issue seems to be with a component called VPNKit. We run some stress tests against it in CI, but your case seems to put a heavier load on it.

In the logs I see a lot of messages like this:

[21:42:21.456][VpnKit         ][Error  ] com.docker.slirp.exe: tcp:0.0.0.0:6379:tcp:172.19.0.2:6379 proxy failed with flow proxy a: write failed with Eof

and this:

[19:18:11.347][VpnKit         ][Error  ] com.docker.slirp.exe: Socket.Stream: caught A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

and this:

[15:05:33.976][VpnKit         ][Error  ] com.docker.slirp.exe: Hvsock.read: An established connection was aborted by the software in your host machine.

With Beta 39, which has an updated version of VPNKit, I can also cause the issue, albeit a slightly different one:

[13:49:44.377][VpnKit         ][Error  ] Process died
[13:49:49.381][VpnKit         ][Info   ] Starting C:\Program Files\Docker\Docker\Resources\com.docker.slirp.exe --ethernet hyperv-connect://acaef826-d7a2-41db-9685-1315c13f9a40 --port hyperv-connect://acaef826-d7a2-41db-9685-1315c13f9a40 --db \\.\pipe\dockerDataBase --debug --diagnostics \\.\pipe\dockerVpnKitDiagnostics
[13:49:49.383][VpnKit         ][Info   ] Started
[13:49:49.418][VpnKit         ][Info   ] com.docker.slirp.exe: Logging to stdout (stdout:true DEBUG:false)
[13:49:49.418][VpnKit         ][Info   ] com.docker.slirp.exe: Setting handler to ignore all SIGPIPE signals
[13:49:49.418][VpnKit         ][Info   ] com.docker.slirp.exe: vpnkit version %VERSION% with hostnet version  %HOSTNET_PINNED% uwt version 0.0.3 hvsock version 0.13.0 %HVSOCK_PINNED%
[13:49:49.418][VpnKit         ][Info   ] com.docker.slirp.exe: starting port forwarding server on port_control_url:hyperv-connect://acaef826-d7a2-41db-9685-1315c13f9a40 vsock_path:
[13:49:49.418][VpnKit         ][Info   ] com.docker.slirp.exe: connecting to acaef826-d7a2-41db-9685-1315c13f9a40:0B95756A-9985-48AD-9470-78E060895BE7
[13:49:49.421][VpnKit         ][Info   ] com.docker.slirp.exe: connecting to acaef826-d7a2-41db-9685-1315c13f9a40:30D48B34-7D27-4B0B-AAAF-BBBED334DD59
[13:49:49.421][VpnKit         ][Error  ] com.docker.slirp.exe: While watching /etc/resolv.conf: ENOENT
[13:49:49.421][VpnKit         ][Info   ] com.docker.slirp.exe: hosts file has bindings for 
[13:49:49.422][VpnKit         ][Info   ] com.docker.slirp.exe: hvsock connected successfully
[13:49:49.422][VpnKit         ][Info   ] com.docker.slirp.exe: hvsock connected successfully
[13:49:49.422][VpnKit         ][Info   ] com.docker.slirp.exe: attempting to reconnect to database
[13:49:49.422][DataKit        ][Info   ] com.docker.db.exe: accepted a new connection on \\.\pipe\dockerDataBase
[13:49:49.422][DataKit        ][Info   ] com.docker.db.exe: Using protocol TwoThousandU msize 8215
[13:49:49.426][VpnKit         ][Info   ] com.docker.slirp.exe: reconnected transport layer
[13:49:49.426][VpnKit         ][Info   ] com.docker.slirp.exe: remove connection limit
[13:49:49.432][VpnKit         ][Info   ] com.docker.slirp.exe: allowing binds to any IP addresses
[13:49:49.433][VpnKit         ][Info   ] com.docker.slirp.exe: updating resolvers to nameserver 8.8.8.8#53
[13:49:49.433][VpnKit         ][Info   ] order 0
[13:49:49.433][VpnKit         ][Info   ] nameserver 8.8.4.4#53
[13:49:49.433][VpnKit         ][Info   ] order 0
[13:49:49.433][VpnKit         ][Info   ] com.docker.slirp.exe: Add(3): DNS configuration changed to: nameserver 8.8.8.8#53
[13:49:49.433][VpnKit         ][Info   ] order 0
[13:49:49.433][VpnKit         ][Info   ] nameserver 8.8.4.4#53
[13:49:49.433][VpnKit         ][Info   ] order 0
[13:49:49.435][VpnKit         ][Info   ] com.docker.slirp.exe: Creating slirp server pcap_settings:disabled peer_ip:192.168.65.2 local_ip:192.168.65.1 domain_search: mtu:8000
[13:49:49.435][VpnKit         ][Info   ] com.docker.slirp.exe: PPP.negotiate: received ((magic VMN3T)(version 13)(commit"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"))
[13:49:49.436][VpnKit         ][Info   ] com.docker.slirp.exe: TCP/IP ready
[13:49:49.436][VpnKit         ][Info   ] com.docker.slirp.exe: stack connected
[13:49:49.436][VpnKit         ][Info   ] com.docker.slirp.exe: no introspection server requested. See the --introspection argument
[13:49:49.436][VpnKit         ][Info   ] com.docker.slirp.exe: starting diagnostics server on: \\.\pipe\dockerVpnKitDiagnostics

So, this actually crashes the process and the restart does not seem to work.

On a subsequent run (after restart) I got:

[13:58:45.123][VpnKit         ][Error  ] com.docker.slirp.exe: Hvsock.read: Lwt_stream.Closed
[13:58:48.312][VpnKit         ][Error  ] com.docker.slirp.exe: Hvsock.read: Lwt_stream.Closed
[13:58:48.312][VpnKit         ][Error  ] com.docker.slirp.exe: Socket.Stream: caught Lwt_stream.Closed
[13:59:30.685][VpnKit         ][Info   ] Thread 2950 killed on uncaught exception Assert_failure("src/core/lwt.ml", 497, 9)
[13:59:30.685][VpnKit         ][Info   ] Raised at file "uwt_preemptive.ml", line 384, characters 23-26
[13:59:30.685][VpnKit         ][Info   ] Called from file "lwt_unix/lwt_hvsock_main_thread.ml", line 49, characters 6-206
[13:59:30.685][VpnKit         ][Info   ] Called from file "thread.ml", line 39, characters 8-14

With beta38, I got a different crash:

Version: 1.13.0-beta38 (9805)
Channel: Beta
Sha1: 9c31a154d11ccf6a29f009610453eab4921bc6e8
[...]
[13:23:50.020][VpnKit         ][Error  ] com.docker.slirp.exe: Lwt.async failure "Assert_failure src/core/lwt.ml:497:9": Raised at file "src/core/lwt.ml", line 497, characters 9-21
[13:23:50.020][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 201, characters 8-15
[13:23:50.020][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 201, characters 8-15
[13:23:50.020][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 201, characters 8-15
[13:23:50.020][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 201, characters 8-15
[13:23:50.020][VpnKit         ][Info   ] Called from file "src/core/lwt.ml", line 305, characters 2-34
[13:23:50.020][VpnKit         ][Info   ] Called from file "uwt.ml", line 1669, characters 23-44
[13:23:50.020][VpnKit         ][Info   ] 

rn added the version/beta39 label Jan 28, 2017

micdah commented Jan 28, 2017

@rneugeba Ah yes, I forgot to mention that you needed the .NET Core tools - that is my bad. :-)
As this is such an intermittent issue, I figured a consistent way of reproducing it would be useful in diagnosing and identifying it.
Also, I was curious whether the issue was specific to one of the services we are developing (normally these don't run under Docker; it's just for development at the moment), or if I could reproduce it with the most basic of application code.

Very interesting findings regarding the probable component - curious that the logs show different parts failing in different versions.

Contributor

rn commented Jan 28, 2017

The VPNKit code has received significant updates since 1.13.0 stable, both in the DNS handling code and on the main data path (including some performance improvements), so I'm not surprised that it fails in different ways.

Your test code seems to put considerable stress on it... If I understand the code, it continuously opens up to 500 connections (depending on the number entered) to the container running redis and then continuously issues requests to the container.

As mentioned, we have some similar-style tests in our regression suite, but they are slightly different.

micdah commented Jan 28, 2017

Yes, that is correct. My first instinct was that it was related to the number of concurrent connections and/or their activity/throughput.

So my first attempt to reproduce the issue was to have a configurable number of continuously open connections with high activity.

In my example code, I make no attempt to reconnect if a connection fails - and the connections seem to be stable until all connectivity fails at once for all open connections.

My finding is that the number of connections doesn't seem to affect whether the issue can occur, but it does seem to affect how long it takes before the issue occurs. So my expectation would be that a very specific timing between two or more concurrent connections triggers the issue; thus, the more concurrent connections and the more concurrent activity there are, the more likely the issue is to occur.

But to me, of course, this is just one big black box - so I might be totally off track. 😄

atmorell commented Jan 30, 2017

I am seeing exactly the same behavior. VPNKit crashes under heavy load, and all containers stop responding. Restarting Docker makes everything great again ;)

dmesg on the docker VM in Hyper-V shows the following error:
docker run --rm --privileged debian:jessie dmesg
[ 90.463682] device veth719b2e2 entered promiscuous mode
[ 90.463788] IPv6: ADDRCONF(NETDEV_UP): veth719b2e2: link is not ready
[ 90.463788] docker0: port 3(veth719b2e2) entered blocking state
[ 90.463788] docker0: port 3(veth719b2e2) entered forwarding state
[ 90.464119] docker0: port 3(veth719b2e2) entered disabled state
[ 90.539736] IPVS: Creating netns size=2104 id=12
[ 90.539739] IPVS: ftp: loaded support on port[0] = 21
[ 90.630341] eth0: renamed from vethc8fadb9
[ 90.680467] IPv6: ADDRCONF(NETDEV_CHANGE): veth719b2e2: link becomes ready
[ 90.680498] docker0: port 3(veth719b2e2) entered blocking state
[ 90.680498] docker0: port 3(veth719b2e2) entered forwarding state
[ 167.381158] docker0: port 4(vethb7059d0) entered blocking state
[ 167.381160] docker0: port 4(vethb7059d0) entered disabled state
dgageot commented Jan 30, 2017

@djs55 Could you please take a look at the diagnostics?

fc0712 commented Feb 3, 2017

I'm experiencing the same issue.

Will include the log when I'm back at my PC.

Log: 0D785797-BD32-437E-9A2D-B634D46D26BE/2017-02-03_09-12-29

Check around 06:00-06:10.

Example:

Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.

English:

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

:06:29.373][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 497810748
[06:06:29.568][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 496860497
[06:06:29.632][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 499463544
[06:06:29.893][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 499447550
[06:06:29.995][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 496118189
[06:06:31.686][VpnKit ][Error ] com.docker.slirp.exe: Socket.Stream: caught Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.
[06:06:31.686][VpnKit ][Info ]
[06:06:31.686][VpnKit ][Info ]
[06:06:32.568][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 497074625
[06:06:32.632][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 499626264
[06:06:32.995][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 496319177
[06:06:33.039][VpnKit ][Error ] com.docker.slirp.exe: Socket.Stream: caught Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.
[06:06:33.039][VpnKit ][Info ]
[06:06:33.039][VpnKit ][Info ]
[06:06:33.172][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 498814677
[06:06:34.052][VpnKit ][Error ] com.docker.slirp.exe: Socket.Stream: caught Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.
[06:06:34.052][VpnKit ][Info ]
[06:06:34.052][VpnKit ][Info ]
[06:06:34.318][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 498858170
[06:06:34.355][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 498820699
[06:06:34.373][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 497927573
[06:06:34.568][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 497074625
[06:06:34.632][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 499626264
[06:06:34.996][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 496319177
[06:06:36.322][VpnKit ][Error ] com.docker.slirp.exe: Socket.Stream: caught Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.
[06:06:36.322][VpnKit ][Info ]
[06:06:36.322][VpnKit ][Info ]
[06:06:37.172][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 498827741
[06:06:37.427][VpnKit ][Error ] com.docker.slirp.exe: Socket.Stream: caught Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.
[06:06:37.427][VpnKit ][Info ]
[06:06:37.427][VpnKit ][Info ]
[06:06:38.319][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 498869850
[06:06:38.355][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 498833839
[06:06:38.375][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 497941981
[06:06:38.449][VpnKit ][Error ] com.docker.slirp.exe: Socket.Stream: caught Et forsøg på at oprette forbindelse mislykkedes, fordi den part, der havde oprettet forbindelse, ikke svarede korrekt efter en periode, eller en oprettet forbindelse blev afbrudt, fordi værten ikke svarede.
[06:06:38.449][VpnKit ][Info ]
[06:06:38.449][VpnKit ][Info ]
[06:06:47.319][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 499062078
[06:06:47.356][VpnKit ][Info ] Tcp.Segment: TCP retransmission on timer seq = 499021687

fc0712 commented Feb 3, 2017

Any status on this issue?

Contributor

rn commented Feb 3, 2017

We are still debugging the issue.

atmorell commented Feb 7, 2017

Is there any workaround for this crash?

micdah commented Feb 20, 2017

Not to be pushy, but any updates on this issue?

If there is anything I can do to help (such as testing beta/development versions), I'll be happy to oblige.

atmorell commented Feb 20, 2017

Almost thought it was solved by the latest update, but my containers became unresponsive after 1-2 hours of high load.

Contributor

rn commented Feb 20, 2017

Unfortunately, the main developer for the component in question, VPNKit, was busy with some other work, but he will take a look this week. @micdah your reproduction is very useful for debugging. Again, thanks for the effort in putting this together.

djs55 commented Feb 20, 2017

Sorry for the delay -- I'm planning to investigate this issue this week.

I notice in @rneugeba's logs quoted above an assertion failure in lwt/uwt (the libraries we use on top of libuv for asynchronous I/O on Windows and Mac). There is a newer version of uwt containing various bug fixes -- I'll see if that makes a difference first, since I was planning an upgrade anyway.

The Mac and Windows versions of the com.docker.slirp.exe process differ slightly in how they talk to the VM -- on the Mac we use a Unix domain socket while on Windows we use Hyper-V sockets. The Hyper-V socket code happens to stress the Uwt_preemptive module quite heavily, which I notice is mentioned in @rneugeba's quoted traceback. It may be possible to redesign that code to avoid that module if it turns out to be buggy.

I'll keep you informed of progress!

djs55 commented Feb 24, 2017

I've been running the repro case on beta build 10183, which was briefly released this week and then recalled (due to a serious bug discovered elsewhere). The test has been running for several hours with no problems. I'll leave it over the weekend to see what happens. The most significant change in this build is the libuv/uwt upgrade.

micdah commented Mar 1, 2017

How did it go with stress testing beta build 10183? Very excited to hear news on this issue. 😄

djs55 commented Mar 2, 2017

I left the test running for a few more days but then my machine rebooted to install patches. Could you try again with the latest version of docker? Note the version numbering scheme is now date based -- version 17.03.0-ce was released today: https://store.docker.com/editions/community/docker-ce-desktop-windows?tab=description

micdah commented Mar 2, 2017

I have just updated my docker installation and tried to run my repro again; sadly, it still seems to trigger.

This time, though, I was at first able to restart the docker container without error (previously I got an exception when it tried to bind to the same port when starting the container again), but restarting the container doesn't fix the connectivity - it remains unreachable after the crash.

Output from docker -v:

Docker version 17.03.0-ce, build 60ccb22

After restarting the container and running my program again to check whether the container became reachable (and verifying that it remained unreachable), I tried restarting the container once more (using docker-compose restart each time) and received a new error I haven't seen before:

Error response from daemon: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.

It seems like the upgrade of VPNKit has had an effect on the issue, but it doesn't seem to have fixed it.

Any docker command (other than asking for the version) issued after the above error message appeared results in the same message, even a simple docker ps.

I have uploaded a diagnostics dump just after reproducing the error, restarting the container once, re-running and confirming still unreachable, and restarting a second time and encountering the above new error:

30667474-C49F-4185-B957-3A7AE1F38393/2017-03-02_19-03-47
micdah commented Mar 2, 2017

Just tried to reproduce it once more, just to verify, and this time it happened again (both times it seemed to trigger more quickly than it did previously, but that might just be imagination or random luck).

This time, after re-running the program to verify the container is unreachable, and then trying to restart the container (docker-compose restart), I received the same error message as originally:

$ docker-compose restart
Restarting dockerbomb_redis_1 ... error

ERROR: for dockerbomb_redis_1  Cannot restart container 96579f13ce0f4322817342186728a646e3a33af3e7b6c91ead73a6becafb0742: driver failed programming external connectivity on endpoint dockerbomb_redis_1 (64d3c0dada115d7f483974802d85f3a8ff5851e4469b9c61db63c4cd0c4ae646): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:6379:tcp:172.19.0.2:6379: input/output error

Odd that it let me restart the container the first time around, but not this time. Might just be due to the randomness of the undefined state it ends up in when the bug triggers.

Here is a diagnostics dump from my second reproduction:

30667474-C49F-4185-B957-3A7AE1F38393/2017-03-02_19-13-05
Contributor

rn commented Mar 2, 2017

@micdah thanks for testing. Is there anything in the logs? Look for the lines prefixed with vpnkit.

I was able to reproduce it easily with earlier versions, but I'm currently travelling so I can't retest at the moment.

micdah commented Mar 2, 2017

@rneugeba Just re-ran the test again, so I could pinpoint the exact part of the log relevant to the point when the error occurs.

This is what was output, just when the error triggered:

[19:36:07.862][VpnKit         ][Error  ] Process died
[19:36:12.862][VpnKit         ][Info   ] Starting C:\Program Files\Docker\Docker\Resources\com.docker.slirp.exe --ethernet hyperv-connect://f08ee3f6-1270-41b8-8bd4-1874485c066a --port hyperv-connect://f08ee3f6-1270-41b8-8bd4-1874485c066a --db \\.\pipe\dockerDataBase --debug --diagnostics \\.\pipe\dockerVpnKitDiagnostics
[19:36:12.863][VpnKit         ][Info   ] Started
[19:36:12.883][VpnKit         ][Info   ] com.docker.slirp.exe: Logging to stdout (stdout:true DEBUG:false)
[19:36:12.883][VpnKit         ][Info   ] com.docker.slirp.exe: Setting handler to ignore all SIGPIPE signals
[19:36:12.883][VpnKit         ][Info   ] com.docker.slirp.exe: vpnkit version c41f7c8589352c95b14de636c895e8fbd72222e5 with hostnet version   uwt version 0.0.3 hvsock version 0.13.0 
[19:36:12.883][VpnKit         ][Info   ] com.docker.slirp.exe: starting port forwarding server on port_control_url:hyperv-connect://f08ee3f6-1270-41b8-8bd4-1874485c066a vsock_path:
[19:36:12.883][VpnKit         ][Info   ] com.docker.slirp.exe: connecting to f08ee3f6-1270-41b8-8bd4-1874485c066a:0B95756A-9985-48AD-9470-78E060895BE7
[19:36:12.885][VpnKit         ][Info   ] com.docker.slirp.exe: connecting to f08ee3f6-1270-41b8-8bd4-1874485c066a:30D48B34-7D27-4B0B-AAAF-BBBED334DD59
[19:36:12.885][VpnKit         ][Error  ] com.docker.slirp.exe: While watching /etc/resolv.conf: ENOENT
[19:36:12.885][VpnKit         ][Info   ] com.docker.slirp.exe: hosts file has bindings for client.openvpn.net ##URL-REMOVED-BY-MICDAH##
[19:36:12.886][VpnKit         ][Info   ] com.docker.slirp.exe: hvsock connected successfully
[19:36:12.886][VpnKit         ][Info   ] com.docker.slirp.exe: hvsock connected successfully
[19:36:12.886][VpnKit         ][Info   ] com.docker.slirp.exe: attempting to reconnect to database
[19:36:12.886][DataKit        ][Info   ] com.docker.db.exe: accepted a new connection on \\.\pipe\dockerDataBase
[19:36:12.886][DataKit        ][Info   ] com.docker.db.exe: Using protocol TwoThousandU msize 8215
[19:36:12.888][VpnKit         ][Info   ] com.docker.slirp.exe: reconnected transport layer
[19:36:12.888][VpnKit         ][Info   ] com.docker.slirp.exe: remove connection limit
[19:36:12.898][VpnKit         ][Info   ] com.docker.slirp.exe: allowing binds to any IP addresses
[19:36:12.900][VpnKit         ][Info   ] com.docker.slirp.exe: updating resolvers to nameserver 192.168.1.1#53
[19:36:12.900][VpnKit         ][Info   ] order 0
[19:36:12.900][VpnKit         ][Info   ] com.docker.slirp.exe: Add(3): DNS configuration changed to: nameserver 192.168.1.1#53
[19:36:12.900][VpnKit         ][Info   ] order 0
[19:36:12.902][VpnKit         ][Info   ] com.docker.slirp.exe: Creating slirp server peer_ip:192.168.65.2 local_ip:192.168.65.1 domain_search: mtu:8000
[19:36:12.902][VpnKit         ][Info   ] com.docker.slirp.exe: PPP.negotiate: received ((magic VMN3T)(version 13)(commit"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"))
[19:36:12.902][VpnKit         ][Info   ] com.docker.slirp.exe: TCP/IP ready
[19:36:12.902][VpnKit         ][Info   ] com.docker.slirp.exe: stack connected
[19:36:12.902][VpnKit         ][Info   ] com.docker.slirp.exe: no introspection server requested. See the --introspection argument
[19:36:12.902][VpnKit         ][Info   ] com.docker.slirp.exe: starting diagnostics server on: \\.\pipe\dockerVpnKitDiagnostics

There are no logs prior to this; I noted the last line of the output before starting my program, so this is the entirety of the log output when the error triggered.

Is there any way to increase the verbosity of the output? It seems odd that the vpnkit process would just die without any error or exception.

atmorell commented Mar 3, 2017

Crashing after a few minutes with the new build. Trying to start anything gives the following error:

Error response from daemon: driver failed programming external connectivity on endpoint sonarr (06921e1a5b924edb2347fde1b1e0ed18707e2508cb0052e090dbcaae930fa05b): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:8989:tcp:172.17.0.4:8989: input/output error
Error: failed to start containers: sonarr

micdah commented Mar 3, 2017

Yeah, today, while trying to demonstrate to a few colleagues a development environment I have been trying out on my machine for the past few months, Docker simply wouldn't start - it sat for a few minutes before showing an error message, and the icon turned red.

Sadly I was in a hurry, so I didn't have time to try to find anything in the logs - I just tried starting docker again and now it works (the machine has been restarted).

Of course, these things will happen, but it is the first time I have ever experienced that failure.

@bluevulpine

bluevulpine commented Mar 5, 2017

I've got a 'me, too!' on this.

I'm running Minecraft in a Docker container, so there's a lot of network activity out to the four people connected, and this is happening 2-3 times a night when we're all connected. When it happens everyone gets booted, so it's easy to know when to go check the logs. The Docker logs look the same as micdah's - VpnKit posts a "Process died" message, then proceeds through its restart. I can usually restore things to working condition by triggering a complete Docker restart.

This is occurring with Docker Community Edition, Version 17.03.0-ce-win1 (10296), Channel: stable, e5a07a1, running on Windows 10 version 1607 build 14393.693.

Diagnostic:

0F9BB5CA-F560-4D34-B4A4-881C5409B32C/2017-03-05_03-41-01

@Bazmcl

Bazmcl commented May 4, 2017

I seem to be suffering from the same issue - not opening a lot of ports, but a lot of traffic.

Uploaded diagnostic EB0E3EF6-6538-4B43-ADB9-8CFE688B9BBF/2017-05-04_09-26-08

@tparikka

tparikka commented May 9, 2017

@djs55 Has there been any new information unearthed since @micdah confirmed it was only the ports that stopped in his environment?

@bvitale

bvitale commented May 16, 2017

A diagnostic was uploaded with id: 0BFBE273-80BD-45FB-AD1A-36352590E93D/2017-05-16_13-44-49

[13:41:50.640][VpnKit         ][Warning] vpnkit.exe: ARP table has no entry for 172.18.0.2
[13:41:50.640][VpnKit         ][Error  ] vpnkit.exe: PPP.listen callback caught Ipv4.Make(Ethif)(Arpv4).Routing.No_route_to_destination_address(_)
[13:42:02.762][VpnKit         ][Error  ] vpnkit.exe: Hvsock.shutdown_write: got Eof
[13:42:02.762][VpnKit         ][Error  ] vpnkit.exe: tcp:0.0.0.0:5000:tcp:172.18.0.15:3000 proxy failed with flow proxy b: write failed with Eof
[13:42:02.970][VpnKit         ][Error  ] vpnkit.exe: Socket.Stream: caught An existing connection was forcibly closed by the remote host.

Windows 10 14393.1066

My scenario is sending a large number of requests to the same port.
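
To illustrate the kind of load that provokes it, here is a minimal sketch; http://localhost:5000/ is a placeholder standing in for the published port shown in the log above:

# Hammer the published host port with many sequential requests from PowerShell
1..10000 | ForEach-Object { Invoke-WebRequest -Uri "http://localhost:5000/" -UseBasicParsing | Out-Null }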

@tparikka

tparikka commented May 30, 2017

@djs55 or @jeanlaurent Has there been any progress on this issue, and any more tests/diagnostics we can provide that would assist at this time? Thank you for your help!

@mattjanssen

mattjanssen commented May 30, 2017

Same issue with Win 10 version 1703 build 15063.296 and Docker edge 17.05.0-ce-win11 (12053). No matter what random (unused) port combinations I used, I got the same error.

I fixed it by stopping Docker from the system tray and restarting it. After I fixed the problem I created diagnostic 950F6894-7F6D-4081-BDCE-7B35E19A391B/2017-05-30_16-55-11.

C:\Users\Matt>docker run -p "10392:13293" agaveapi/beanstalkd-console
docker: Error response from daemon: driver failed programming external connectivity on endpoint
hopeful_mcnulty (fa135ff9192e4bd4f103e5f6128863d174b426483463d32558f332440d5865a4):
Error starting userland proxy:
mkdir /port/tcp:0.0.0.0:10392:tcp:172.17.0.2:13293: input/output error.
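
Since restarting from the tray is what recovers it, here is a rough way to script that restart. This is only a sketch and rests on two assumptions: that Docker for Windows is installed in its default location, and that killing and relaunching the desktop app is equivalent to the tray restart:

# Kill the Docker for Windows desktop app (which hosts vpnkit) and start it again
Stop-Process -Name "Docker for Windows" -ErrorAction SilentlyContinue
Start-Process "C:\Program Files\Docker\Docker\Docker for Windows.exe"
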
@alarys

alarys commented Jun 6, 2017

Hi,
I'm experiencing similar problems to those above.
I submitted a diagnostic:
A diagnostic was uploaded with id: 7FF77AD4-6196-4C0C-BF18-962C00826605/2017-06-06_14-05-31

My Docker version is 17.03.1-ce, build c6d412e.

I tried to use the latest Edge release, but my Docker configurations throw errors when I try to create containers. It seems to be complaining about local drive mappings. Not sure what is going on there.

My containers are not responding, as above. As soon as a download begins and the data transfer ramps up to 1-2 Mbps, all my containers stop responding. Restarting Docker gets things working again.

I have mitigated the problem somewhat by throttling bandwidth. But even with bandwidth throttled to 500 kbps, the problem still resurfaces after a while. I can reliably reproduce it by not throttling the bandwidth and kicking off a download.

I'm really quite disappointed with how Docker on Windows handles large data throughput. And this seems to be an issue that others have experienced too.

@tparikka

tparikka commented Jun 14, 2017

@djs55 @jeanlaurent There hasn't been any input from the Docker team on this issue since April 21. I'm hopeful that if there hasn't been any progress that the community may be able to help by trying out builds or providing additional diagnostics. Thank you!

@mikesnare

mikesnare commented Aug 4, 2017

@djs55 @jeanlaurent It's now been close to 4 months with no developer response to what could be argued is a pretty serious -- crippling -- bug. Any updates? I'm using Docker for Windows to spin up a zookeeper and a couple kafka instances and it dies pretty consistently under load and then fails during restart with the same errors others are describing, forcing me to restart docker entirely.

@agentilela

agentilela commented Aug 11, 2017

Same issue here, uploaded diagnostic: 0CC3ABDF-040B-4BF0-9D39-B24CAE24F6ED/2017-08-10_19-47-42

Here's the interesting stuff

[19:37:29.451][VpnKit         ][Info   ] Tcp.PCB: ERROR: thread failure; terminating threads and closing connection
[19:37:29.452][VpnKit         ][Error  ] vpnkit.exe: Lwt.async failure (Invalid_argument Lwt.wakeup_result): Raised at file "format.ml", line 241, characters 41-52
[19:37:29.452][VpnKit         ][Info   ] Called from file "format.ml", line 482, characters 6-24
[19:37:29.452][VpnKit         ][Info   ]
[19:40:29.855][VpnKit         ][Info   ] Tcp.Segment: TCP retransmission on timer seq = 531271309
[19:40:31.855][VpnKit         ][Info   ] Tcp.Segment: TCP retransmission on timer seq = 531271309
[19:40:35.855][VpnKit         ][Info   ] Tcp.Segment: TCP retransmission on timer seq = 531271309
[19:40:43.855][VpnKit         ][Info   ] Tcp.Segment: TCP retransmission on timer seq = 531271309
[19:40:59.857][VpnKit         ][Info   ] Tcp.Segment: TCP retransmission on timer seq = 531271309
[19:41:07.180][VpnKit         ][Error  ] Process died

@jinh-dk

jinh-dk commented Aug 24, 2017

I have the same issue in Docker version 17.06.1-ce-win24 (13025), as well as the latest version.

When I execute docker-compose in a PowerShell console, I see
WindowsError: [Error 2] The system cannot find the file specified: u'***********************' Failed to execute script docker-compose

I have also seen this in the Docker daemon log:
[08:50:59.912][VpnKit ][Error ] vpnkit.exe: Hvsock.read: An established connection was aborted by the software in your host machine.

@smellinet

smellinet commented Sep 7, 2017

Hello,
I have the same issue with the latest version:
Version 17.06.2-ce-win27 (13194)
Channel: stable
428bd6c
After a heavy load, the container's networking is broken.
Stopping and starting the container doesn't solve the problem:

Error response from daemon: driver failed programming external connectivity on endpoint tapo (23bc1c5ec134f7b164eb6c35e810cd89e876d8c8da3b46db4d8685b642f8ac8d): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:5500:tcp:172.17.0.2:5500: input/output error

Diagnostic ID uploaded:
Diagnostics successfully uploaded (C64A9176-3C73-4FBC-B4FA-D4B0017B689C/2017-09-07_10-18-23).

@Naragato

Naragato commented Sep 12, 2017

I can't believe this still isn't a priority to fix. :(

@mittork

mittork commented Sep 14, 2017

@djs55, can you provide any ETA for this to be fixed? So far Docker for Windows is not usable in a productive way for us, and we have to think about workarounds (like using a separate standalone host and configuring the Docker client to connect to it).

But I have to ask: how can I trust software in production that is not able to handle a bit more load at the development stage? I know it is related to vpnkit, but anyway....

@tparikka

tparikka commented Oct 18, 2017

@djs55 @jeanlaurent Is there any more information available on this problem, an ETR, or even an updated priority?

@TheFamilyRoom

TheFamilyRoom commented Nov 20, 2017

We are experiencing the same behavior: heavy load on a single port and the Docker bridge falls over. The containers are still running but can't be accessed. It seems like this is not a priority for anyone to address, but it is holding us up. Proposed solutions:

run on mac/linux - we will try this next
run less load? - sorta defeats the point.

Has anyone else had success getting this to work on Win 10?

@micdah

micdah commented Nov 20, 2017

Yeah, I have more or less given up on running heavy loads on Docker for Windows. Interestingly, I don't seem to have the same issues after moving our services over to Kubernetes running via minikube on Windows.

Naturally this environment is just an extra stack on top of Docker, but it seems like minikube, at least, runs "better" on Windows (using Hyper-V, though it is also possible to use VirtualBox).
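
For anyone wanting to try the same workaround, the rough shape of it is below. This is only a sketch: "Primary Virtual Switch" is a placeholder for an external Hyper-V virtual switch you have created yourself, and "redis" stands in for whatever deployment and service you actually expose.

# Start a minikube VM on Hyper-V and expose a service to the host
minikube start --vm-driver hyperv --hyperv-virtual-switch "Primary Virtual Switch"
kubectl run redis --image=redis --port=6379
kubectl expose deployment redis --type=NodePort
minikube service redis --url   # prints the host-reachable URL for the NodePort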

@SC7639

SC7639 commented Nov 20, 2017

I'm still experiencing this issue now and again. It happened today and I had to restart Docker for Windows for a container to use the port again.

@tparikka

tparikka commented Jan 11, 2018

@djs55, @jeanlaurent, can you comment on whether or not this issue has been officially abandoned?

@djs55

djs55 commented Jan 11, 2018

@tparikka We've not abandoned the issue, but unfortunately other issues have been higher priority recently -- I apologise for the delay.

We're hoping to update the version of the Linux kernel we use to 4.14, which has a newer implementation of Hyper-V sockets which we use for exposing ports. We should be able to drop some of the workarounds for bugs in the previous version and hopefully this will make the whole system more reliable. As part of this update we'll do some general stress testing and attempt to reproduce this issue.

Thanks again for your patience.

@SC7639

SC7639 commented Jan 11, 2018

Thanks for the update

@Michal-Svoboda

Michal-Svoboda commented Feb 8, 2018

We suffer from the same issue in our project as well.
@djs55 - I would like to ask whether there is any schedule for when a new version of Docker using the newer implementation of Hyper-V sockets will be available?

And what is the current status of this issue?

Thanks a lot.

@vohtaski

vohtaski commented Mar 15, 2018

Same problem here. Running MariaDb in a docker container on Windows.
After several thousand requests, it dies with "dial tcp 127.0.0.1:33061: getsockopt: connection refused"
Would be amazing to have a fix or a workaround

@tparikka

tparikka commented May 24, 2018

@djs55 @jeanlaurent I wanted to check in on this since it's been about 4 months. Is there any update on this issue, and perhaps is there a separate Git issue that's been logged for the Linux kernel version update that you hope will improve stability under load so we can follow it?

@sw-carlin

sw-carlin commented Jul 6, 2018

This problem seems to have improved in the stable channel: I'm on Docker version 18.03.1-ce and am still able to run docker commands when the exposed ports of my containers aren't responsive; in the previous version that was not possible.

I am also able to recover from the situation by stopping some of the containers, which I guess frees up frozen sockets? I'm running 20 containers that compose a microservice ecosystem with lots of traffic moving between them, and I can trigger the situation by running any of my system integration tests. I will try running the tests from inside the container composition to see if that is a good workaround.
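
A minimal sketch of running the tests from inside the composition, so the traffic stays on the Docker network instead of going through the published host ports. The network, image, and service names here are hypothetical; docker-compose normally names the default network after the project folder, e.g. myproject_default:

# Find the compose network, then run the test image attached to it,
# pointing the tests at the service name instead of localhost:port
docker network ls
docker run --rm --network myproject_default -e API_URL=http://api:8080 mytests:latest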

@tparikka

tparikka commented Aug 16, 2018

@djs55, @jeanlaurent it has been over 7 months since the last update. Is there any further information on this issue?

@djs55

djs55 commented Aug 16, 2018

@tparikka sorry for the delay. There has been some progress: we've started updating the Hyper-V socket implementation used in several of the components to remove a complex (possibly buggy) workaround for bugs in old Windows builds (< 14393). Once this is done we'll update the Hyper-V socket GUIDs that we use and then we can bump the kernel version. These changes will be merged into the development branch gradually -- I'll let you know when there are interesting development builds you can test.

@tg73

tg73 commented Sep 20, 2018

I've also run into the same or possibly a related issue, in this case using Windows Containers hosted on Windows Server Core 1803. The image is based on jetbrains/teamcity-agent - so the container acts as a build agent for TeamCity. When running a build via the agent running within the container, at some arbitrary point, the container becomes unresponsive. With process isolation, RDP to the host OS also becomes unresponsive and the host eventually reboots. With hyperv isolation, the container becomes unresponsive and then stops, but the host OS stays up and responsive. Builds do sometimes complete, but more often than not they fail. TeamCity server reports a loss of connection to the build agent, and eventually the build is marked as failed.

Having invested quite a lot of time getting the image to have all the tools our builds need, it was disappointing (to say the least) that what seems to be a fundamental virtualization issue renders this approach unusable. In the end I've had to revert to individual Windows Server VMs per agent.

Unfortunately, I don't have further time to fully log this problem and try to produce a minimal test case - so my apologies for not logging a full issue report. I have attached my custom Dockerfile for interest. Just to note also that the lack of --cpus support with docker service create is also a big problem with this use case.

issue.zip
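
On the --cpus point: a possible near-equivalent when creating a swarm service is the --limit-cpu flag, which caps a service's CPU in fractional CPUs much as --cpus does for docker run. This is a sketch only - the service name and limits below are placeholders, and whether it covers the original use case is an open question:

# Cap the build agent service at 2 CPUs and 4 GB of memory
docker service create --name teamcity-agent --limit-cpu 2 --limit-memory 4GB jetbrains/teamcity-agent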
