
Reproducible: container network behavior breaks all host-to-container networking #3487

Closed
hughsw opened this issue Jan 25, 2019 · 11 comments


hughsw commented Jan 25, 2019

  • [x] I have tried with the latest version of my channel (Stable or Edge)

  • [x] I have uploaded Diagnostics

  • Diagnostics ID: CDF6C54F-DD10-459F-9401-2F769C811A8F/20190509120230

  • Version 2.0.0.2 ID: CDF6C54F-DD10-459F-9401-2F769C811A8F/20190125144328

Expected behavior

Host-to-container networking should just work.

Actual behavior

An Express/Sequelize process on the host attempts to send a BLOB larger than the configured max_allowed_packet to a MariaDB SQL server running in a container. The MariaDB server aborts the connection and complains ("Got a packet bigger than 'max_allowed_packet' bytes"). Thereafter, all networking from the host to any existing or newly started container fails with connection timeouts.

Information

The problem is reproducible. It seems to be triggered by the transfer of large BLOB data (15MB) from a NodeJS/Sequelize process running on the Mac to a MariaDB SQL server running in a container. The MariaDB server exposes its port, 3306, to the host, and the NodeJS/Sequelize process contacts it as localhost:3306. The MariaDB container mounts a host directory to use for the SQL database.
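A minimal sketch of the setup described above; the container name, password, and mounted data directory are illustrative, not taken from the linked repro repo:

```shell
# Sketch: MariaDB container publishing 3306 to the host, with a host
# directory mounted for the database files (names are hypothetical).
docker run --detach --name mariadb-test \
  -p 3306:3306 \
  -v "$PWD/dbdata":/var/lib/mysql \
  -e MYSQL_ROOT_PASSWORD=secret \
  mariadb:10.3
# The NodeJS/Sequelize process on the host then connects to localhost:3306.
```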

After the failure, no containers can be reached over the network from the host. That is, existing running containers can no longer be reached (connections time out), and newly started (docker run) containers cannot be reached either.

Note: the connections to the containers time out; they are not refused.

If I docker exec into running containers, I can access their networked ports via the container's localhost. So the issue seems to lie at the level of host-to-container networking rather than in the container's internal networking.
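The contrast above can be checked with commands along these lines, assuming a MariaDB container named "mariadb-test" publishing port 3306 with root password "secret" (both hypothetical):

```shell
# From inside the container: this keeps working after the failure,
# since it goes through the container's own localhost.
docker exec mariadb-test mysql -uroot -psecret -h 127.0.0.1 -e 'SELECT 1'

# From the host: this times out (rather than being refused) once the
# bug has been triggered. -z only tests that the connection opens.
nc -z -w 5 127.0.0.1 3306
```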

I have looked at the streaming logs while the failure occurs (/usr/bin/log stream ...). There is no docker message in these logs around the time of the failure event.

The only fix I've found so far is to restart the Docker app/engine.

I do not know how long the problem has been around because I have only just started the work to upload large BLOBs to the MariaDB server. Smaller BLOBs (1MB) do not cause the problem.

  • macOS Version: 10.13.6 (17G4015)

Diagnostic logs

Diagnose succeeded

Steps to reproduce the behavior

This problem was found during dev work for a complex setup with 8 containers running in a Stack.

There is now a repo with minimal code to reproducibly demonstrate this problem: https://github.com/hughsw/dockerbug


hughsw commented Jan 25, 2019

Similar to #3448 but reproducible.


hughsw commented Jan 27, 2019

I've created a repository with code to reproducibly break Docker engine on Docker for Mac.

https://github.com/hughsw/dockerbug

@hughsw hughsw changed the title Some data transfers break host-to-container networking Reproducible: data transfers break host-to-container networking Jan 27, 2019

hughsw commented Feb 4, 2019

I have determined that it is not the amount of data transferred that triggers the failure. It appears that the failure is triggered by whatever the DockerHub mariadb:10.3 image does when it handles a query whose data exceeds its configurable max-allowed-packet size.

This parameter defaults to 16777216 (16 MiB), and we first observed the Docker engine failure when we tried to insert BLOBs larger than this. By passing a smaller limit to the image, e.g. --max-allowed-packet=1048576 (1 MiB), we were able to trigger the Docker engine failure with a 1.5MB BLOB. And by setting a larger value (128 MiB), we were able to successfully insert BLOBs of more than 100MB.
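The two settings described above might look like this; container names and the password are illustrative, and the flag is the standard mysqld command-line form that the mariadb image forwards to the server:

```shell
# 1 MiB limit: a ~1.5 MB BLOB is enough to trigger the failure.
docker run --detach --name mariadb-small -p 3306:3306 \
  -e MYSQL_ROOT_PASSWORD=secret \
  mariadb:10.3 --max-allowed-packet=1048576

# 128 MiB limit: BLOBs of more than 100 MB insert successfully.
docker run --detach --name mariadb-large -p 3307:3306 \
  -e MYSQL_ROOT_PASSWORD=secret \
  mariadb:10.3 --max-allowed-packet=134217728
```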

That is, it appears that the network activity that occurs when MariaDB raises its exception about max-allowed-packet being exceeded is what breaks the Docker engine's host-to-container networking, not the amount of data itself.


hughsw commented Feb 4, 2019

Also, we have observed that both the mysql2 and mariadb JavaScript client libraries can trigger the exception in the MariaDB container that breaks the Docker engine's host-to-container networking. That is, Sequelize is not necessary to cause the problem.

@Allan-Clements

Hi @hughsw, I ran into something similar to what you're describing. Figured I'd share my experiences.

For what it's worth, I found that anything later than the August 2018 version breaks, so I've personally rolled back to that release after downloading it here: https://docs.docker.com/docker-for-mac/release-notes/#docker-community-edition-18061-ce-mac73-2018-08-29.

I also found this issue to be reproducible, but interestingly our scales of "size" seem to differ. I created a demo app here: https://github.com/Allan-Clements/docker-demo

It spins up a localstack container hosting a localhost version of S3 and then uploads files of random bytes in increasing sizes.

I encountered this bug because a test for a real application, which uploads files to a localstack container, kept getting stuck at the 2nd test case. The first case involved uploading a 7 KB file, whereas the 2nd case was 168 KB.

Running this demo app, I have observed it failing to upload files as small as 15 KB to 31 KB, depending on which post-August-2018 release of Docker for Mac I was trying out.

And as in your case, my Docker environment is stuck until I restart the Docker daemon. I would confirm the networking layer was broken by running the nginx example container and curling against it. It would hang like this after my demo app hit whatever file size broke that version:

docker run --detach -p 80:80 --name=webserver nginx && sleep 1 && curl -v -p 127.0.0.1:80 && docker stop webserver && docker rm webserver
e4c8675706f669bb3bf9e8329ac66163db40b482281b8461a29b75b61c1462b1
* Rebuilt URL to: 127.0.0.1:80/
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 80 (#0)
> GET / HTTP/1.1
> Host: 127.0.0.1
> User-Agent: curl/7.54.0
> Accept: */*
>

@docker-robott
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale comment.
Stale issues will be closed after an additional 30d of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle stale

@Allan-Clements

/remove-lifecycle stale


hughsw commented May 8, 2019

I've just confirmed that the bug is still present in Docker for Mac 2.0.0.3 / 18.09.2.

I've updated the documentation etc. for the reproducible code: https://github.com/hughsw/dockerbug

@hughsw hughsw changed the title Reproducible: data transfers break host-to-container networking Reproducible: image network behavior breaks all host-to-container networking May 9, 2019
@hughsw hughsw changed the title Reproducible: image network behavior breaks all host-to-container networking Reproducible: container network behavior breaks all host-to-container networking May 9, 2019
@djs55 djs55 self-assigned this Jul 8, 2019

djs55 commented Jul 8, 2019

Thanks for the report (and especially for the repro example!)

I can confirm the current stable is still broken (Version 2.0.0.4 (31365)). Inside the VM (using a command like docker run -it --privileged --pid=host justincormack/nsenter1) the log file /var/log/vpnkit-forwarder.log has a record of the failure:

2019-07-08T09:34:56Z vpnkit-forwarder 2019/07/08 09:34:56 Multiplexer main loop failed with EOF

I believe the bug is fixed in current edge (Version 2.0.5.0 (35318)). Inside the VM the log file has a different message:

2019-07-08T09:47:45Z vpnkit-forwarder 2019/07/08 09:47:45 Discarded 65536 bytes from 2147483693 Data length 65536

This log comes from here, added by this commit in moby/vpnkit#453. I believe what's happening is that the connection is being closed while data is in flight: previously this triggered a fatal error that broke all future port forwarding. It now generates a log message and safely continues.
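The forwarder log can be read without an interactive session using a command along these lines, a sketch built on the nsenter image mentioned above:

```shell
# Sketch: enter the Docker Desktop VM's namespaces and tail the
# vpnkit forwarder log to look for the messages quoted above.
docker run --rm -it --privileged --pid=host justincormack/nsenter1 \
  /bin/sh -c 'tail -n 50 /var/log/vpnkit-forwarder.log'
```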

So I think this will be fixed in the next stable release. Thanks again for the repro example!


@docker-robott
Collaborator

Closed issues are locked after 30 days of inactivity.
This helps our team focus on active issues.

If you have found a problem that seems similar to this, please open a new issue.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle locked

@docker docker locked and limited conversation to collaborators Jul 1, 2020
4 participants