
Reproducible: container network behavior breaks all host-to-container networking #3487

Closed
hughsw opened this issue Jan 25, 2019 · 11 comments


hughsw commented Jan 25, 2019

  • [x] I have tried with the latest version of my channel (Stable or Edge)

  • [x] I have uploaded Diagnostics

  • Diagnostics ID: CDF6C54F-DD10-459F-9401-2F769C811A8F/20190509120230

  • Version 2.0.0.2 ID: CDF6C54F-DD10-459F-9401-2F769C811A8F/20190125144328

Expected behavior

Host-to-container networking should just work.

Actual behavior

An Express/Sequelize process on the host attempts to send a BLOB larger than the configured max_allowed_packet to a MariaDB SQL server running in a container. The MariaDB server aborts the connection and complains ("Got a packet bigger than 'max_allowed_packet' bytes"). Thereafter, all networking from the host to any existing or newly started container fails with connection timeouts.

Information

The problem is reproducible. It seems to be triggered by the transfer of large BLOB data (15MB) from a NodeJS/Sequelize process running on the Mac to a MariaDB SQL server running in a container. The MariaDB server exposes its port, 3306, to the host, and the NodeJS/Sequelize process contacts it as localhost:3306. The MariaDB container mounts a host directory to use for the SQL database.
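A minimal sketch of the setup described above; the container name, password, and mounted data directory are illustrative, not taken from the linked repro repo:

```shell
# Sketch: MariaDB container publishing 3306 to the host, with a host
# directory mounted for the database files (names are hypothetical).
docker run --detach --name mariadb-test \
  -p 3306:3306 \
  -v "$PWD/dbdata":/var/lib/mysql \
  -e MYSQL_ROOT_PASSWORD=secret \
  mariadb:10.3
# The NodeJS/Sequelize process on the host then connects to localhost:3306.
```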

After the failure, no containers can be reached over the network from the host. That is, existing running containers can no longer be reached (connections time out), and newly started (docker run) containers cannot be reached either.

Note: the connections to the containers time out; they are not refused.

If I docker exec into running containers, I can access their networked ports via the container's localhost. So the issue seems to lie at the level of host-to-container networking rather than in the container's internal networking.
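The contrast above can be checked with commands along these lines, assuming a MariaDB container named "mariadb-test" publishing port 3306 with root password "secret" (both hypothetical):

```shell
# From inside the container: this keeps working after the failure,
# since it goes through the container's own localhost.
docker exec mariadb-test mysql -uroot -psecret -h 127.0.0.1 -e 'SELECT 1'

# From the host: this times out (rather than being refused) once the
# bug has been triggered. -z only tests that the connection opens.
nc -z -w 5 127.0.0.1 3306
```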

I have looked at the streaming logs while the failure occurs (/usr/bin/log stream ...). There is no docker message in these logs around the time of the failure event.

The only fix I've found so far is to restart the Docker app/engine.

I do not know how long the problem has been around because I have only just started the work to upload large BLOBs to the MariaDB server. Smaller BLOBs (1MB) do not cause the problem.

  • macOS Version: 10.13.6 (17G4015)

Diagnostic logs

Diagnose succeeded

Steps to reproduce the behavior

This problem was found during dev work for a complex setup with 8 containers running in a Stack.

There is now a repo with minimal code to reproducibly demonstrate this problem: https://github.com/hughsw/dockerbug


hughsw commented Jan 25, 2019

Similar to #3448 but reproducible.


hughsw commented Jan 27, 2019

I've created a repository with code to reproducibly break Docker engine on Docker for Mac.

https://github.com/hughsw/dockerbug

@hughsw hughsw changed the title Some data transfers break host-to-container networking Reproducible: data transfers break host-to-container networking Jan 27, 2019

hughsw commented Feb 4, 2019

I have determined that it is not the amount of data transferred that triggers the failure. It appears that the failure is triggered by whatever the DockerHub mariadb:10.3 image does when it handles a query whose data exceeds its configurable max-allowed-packet size.

This parameter defaults to 16777216 (16 MiB), and we first observed the Docker engine failure when we tried to insert BLOBs larger than this. By passing a smaller limit to the image, e.g. --max-allowed-packet=1048576 (1 MiB), we were able to trigger the Docker engine failure with a 1.5MB BLOB. And by setting a larger value (128 MiB), we were able to successfully insert BLOBs of more than 100MB.
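The two settings described above might look like this; container names and the password are illustrative, and the flag is the standard mysqld command-line form that the mariadb image forwards to the server:

```shell
# 1 MiB limit: a ~1.5 MB BLOB is enough to trigger the failure.
docker run --detach --name mariadb-small -p 3306:3306 \
  -e MYSQL_ROOT_PASSWORD=secret \
  mariadb:10.3 --max-allowed-packet=1048576

# 128 MiB limit: BLOBs of more than 100 MB insert successfully.
docker run --detach --name mariadb-large -p 3307:3306 \
  -e MYSQL_ROOT_PASSWORD=secret \
  mariadb:10.3 --max-allowed-packet=134217728
```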

That is, it appears that the network activity that occurs when MariaDB raises its exception about max-allowed-packet being exceeded is what breaks the Docker engine's host-to-container networking, not the amount of data itself.


hughsw commented Feb 4, 2019

Also, we have observed that both the mysql2 and mariadb JavaScript client libraries can trigger the exception in the MariaDB container that breaks the Docker engine's host-to-container networking. That is, Sequelize is not necessary to cause the problem.

@Allan-Clements

Hi @hughsw, I ran into something similar to what you're describing. Figured I'd share my experiences.

For what it's worth, I found that anything later than the August 2018 version breaks, so I've personally rolled back to that release after downloading it here: https://docs.docker.com/docker-for-mac/release-notes/#docker-community-edition-18061-ce-mac73-2018-08-29.

I also found this issue to be reproducible, but interestingly our scales of "size" seem to differ. I created a demo app here: https://github.com/Allan-Clements/docker-demo

It spins up a localstack container hosting a localhost version of S3 and then uploads files of random bytes in increasing sizes.

I encountered this bug because a test for a real application, which uploads files to a localstack container, kept getting stuck at the 2nd test case. The first case involved uploading a 7 KB file, whereas the 2nd case was 168 KB.

Running this demo app, I have observed it failing to upload files as small as 15 KB to 31 KB, depending on which post-August-2018 release of Docker for Mac I was trying out.

And as in your case, my Docker environment is stuck until I restart the Docker daemon. I would confirm the networking layer was broken by running the nginx example container and curling against it. It would hang like this after my demo app hit whatever file size broke that version:

docker run --detach -p 80:80 --name=webserver nginx && sleep 1 && curl -v -p 127.0.0.1:80 && docker stop webserver && docker rm webserver
e4c8675706f669bb3bf9e8329ac66163db40b482281b8461a29b75b61c1462b1
* Rebuilt URL to: 127.0.0.1:80/
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 80 (#0)
> GET / HTTP/1.1
> Host: 127.0.0.1
> User-Agent: curl/7.54.0
> Accept: */*
>

@docker-robott
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale comment.
Stale issues will be closed after an additional 30d of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle stale

@Allan-Clements

/remove-lifecycle stale


hughsw commented May 8, 2019

I've just confirmed that the bug is still present in Docker for Mac 2.0.0.3 / 18.09.2.

I've updated the documentation etc. for the reproducible code: https://github.com/hughsw/dockerbug

@hughsw hughsw changed the title Reproducible: data transfers break host-to-container networking Reproducible: image network behavior breaks all host-to-container networking May 9, 2019
@hughsw hughsw changed the title Reproducible: image network behavior breaks all host-to-container networking Reproducible: container network behavior breaks all host-to-container networking May 9, 2019
@djs55 djs55 self-assigned this Jul 8, 2019

djs55 commented Jul 8, 2019

Thanks for the report (and especially for the repro example!)

I can confirm the current stable is still broken (Version 2.0.0.4 (31365)). Inside the VM (using a command like docker run -it --privileged --pid=host justincormack/nsenter1) the log file /var/log/vpnkit-forwarder.log has a record of the failure:

2019-07-08T09:34:56Z vpnkit-forwarder 2019/07/08 09:34:56 Multiplexer main loop failed with EOF

I believe the bug is fixed in current edge (Version 2.0.5.0 (35318)). Inside the VM the log file has a different message:

2019-07-08T09:47:45Z vpnkit-forwarder 2019/07/08 09:47:45 Discarded 65536 bytes from 2147483693 Data length 65536

This log comes from here, added by this commit in moby/vpnkit#453. I believe what's happening is that the connection is being closed while data is in flight: previously this triggered a fatal error that broke all future port forwarding. It now generates a log message and safely continues.
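The forwarder log can be read without an interactive session using a command along these lines, a sketch built on the nsenter image mentioned above:

```shell
# Sketch: enter the Docker Desktop VM's namespaces and tail the
# vpnkit forwarder log to look for the messages quoted above.
docker run --rm -it --privileged --pid=host justincormack/nsenter1 \
  /bin/sh -c 'tail -n 50 /var/log/vpnkit-forwarder.log'
```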

So I think this will be fixed in the next stable release. Thanks again for the repro example!


@docker-robott
Collaborator

Closed issues are locked after 30 days of inactivity.
This helps our team focus on active issues.

If you have found a problem that seems similar to this, please open a new issue.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle locked

@docker docker locked and limited conversation to collaborators Jul 1, 2020
4 participants