Docker Desktop breaks localstack s3 transfer on M1 mac since 4.27.0 (Unknown network issue/limiting) #7207

Closed
mgrundie-r7 opened this issue Feb 29, 2024 · 11 comments

@mgrundie-r7

mgrundie-r7 commented Feb 29, 2024

Description

We noticed that as soon as we upgraded Docker Desktop to 4.27.0+, our Cucumber tests began to hang indefinitely while the AWS SDK for Ruby uploads test files to a LocalStack bucket. I was able to replicate the behaviour (described below) using the awslocal CLI. The hang occurs on a different test each run, and on a different iteration each time in the replication steps.

Started: when we upgraded docker to 4.27.0
(Upgrading to later or the latest version does not help.)

Workaround: Downgrade docker desktop to 4.26.1

Environment

- OS: MacOS Ventura 13.6.4
- LocalStack: 3.1.0
- Docker Desktop 4.27, 4.28

Affects colleagues with M1 MacBooks on macOS Ventura.
Does NOT affect colleagues with M1 MacBooks on macOS Sonoma.
Does NOT affect colleagues with Intel MacBooks.

(We are restricted from upgrading to Sonoma at this time)

Raised ticket with localstack but they are unable to replicate on an M3 Max mac: localstack/localstack#10340

Reproduce

docker run \
  --rm -it \
  -p 4566:4566 \
  -p 4510-4559:4510-4559 \
  -e DEBUG=1 localstack/localstack:3.1.0

Different shell session

# Create a 22 MB file
dd if=/dev/zero of=samplefile.dat  bs=1m count=22

# Create a bucket in localstack
awslocal s3api create-bucket --bucket testbucket

# Attempt to copy the file 50 times
for i in {1..50}; do time awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat --body ./samplefile.dat &> /dev/null; done
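
Because the stall lands on a different iteration each run, it can help to stop the loop automatically at the first slow upload. The following is a minimal sketch and not part of the original report: `find_stall` is a hypothetical helper name, and the 10-second threshold in the usage comment is an arbitrary assumption (healthy uploads in this report finish in about one second).

```shell
# find_stall MAX_ITER THRESHOLD_SECS CMD...
# Hypothetical helper: runs CMD up to MAX_ITER times and stops at the first
# run that takes longer than THRESHOLD_SECS (whole seconds, via date +%s).
find_stall() {
  local max="$1" threshold="$2"
  shift 2
  local i=1 start elapsed
  while [ "$i" -le "$max" ]; do
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt "$threshold" ]; then
      echo "stalled at iteration $i (${elapsed}s)"
      return 1
    fi
    i=$((i + 1))
  done
  echo "no stall in $max iterations"
}

# Example usage (threshold is an assumption, adjust as needed):
# find_stall 500 10 awslocal s3api put-object --bucket testbucket \
#   --key somekey/samplefile.dat --body ./samplefile.dat
```

Stopping at the first slow iteration leaves the stalled connection in place, which makes it easier to start a packet capture while the problem is still visible.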

Expected behavior

The file copies successfully or fails with an error

Actual: Something in docker is interrupting or killing the connection which makes localstack retry and hang

❯ for i in {1..50}; do
time awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat --body ./samplefile.dat &> /dev/null;
done
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.36s user 0.18s system 63% cpu 0.852 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.37s user 0.17s system 64% cpu 0.837 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.37s user 0.18s system 66% cpu 0.817 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.37s user 0.17s system 64% cpu 0.835 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.37s user 0.18s system 64% cpu 0.844 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.36s user 0.17s system 64% cpu 0.823 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.36s user 0.17s system 63% cpu 0.830 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.35s user 0.17s system 64% cpu 0.818 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.36s user 0.17s system 62% cpu 0.845 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.36s user 0.17s system 64% cpu 0.820 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.36s user 0.18s system 52% cpu 1.038 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.37s user 0.19s system 52% cpu 1.062 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.36s user 0.18s system 63% cpu 0.849 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.37s user 0.18s system 52% cpu 1.049 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.38s user 0.20s system 65% cpu 0.881 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.35s user 0.17s system 62% cpu 0.839 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.36s user 0.17s system 65% cpu 0.817 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.37s user 0.19s system 65% cpu 0.853 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.46s user 0.26s system 0% cpu 3:13.53 total
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.45s user 0.27s system 0% cpu 3:07.84 total

docker version

❯ docker version
Client:
 Cloud integration: v1.0.35+desktop.10
 Version:           25.0.1
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        29cf629
 Built:             Tue Jan 23 23:06:12 2024
 OS/Arch:           darwin/arm64
 Context:           desktop-linux

Server: Docker Desktop 4.27.0 (135262)
 Engine:
  Version:          25.0.1
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.6
  Git commit:       71fa3ab
  Built:            Tue Jan 23 23:09:35 2024
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.6.27
  GitCommit:        a1496014c916f9e62104b33d1bb5bd03b0858e59
 runc:
  Version:          1.1.11
  GitCommit:        v1.1.11-0-g4bccb38
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

❯ docker info
Client:
 Version:    25.0.1
 Context:    desktop-linux
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.12.1-desktop.4
    Path:     /Users/mgrundie/.docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.24.3-desktop.1
    Path:     /Users/mgrundie/.docker/cli-plugins/docker-compose
  debug: Get a shell into any image or container. (Docker Inc.)
    Version:  0.0.22
    Path:     /Users/mgrundie/.docker/cli-plugins/docker-debug
  dev: Docker Dev Environments (Docker Inc.)
    Version:  v0.1.0
    Path:     /Users/mgrundie/.docker/cli-plugins/docker-dev
  extension: Manages Docker extensions (Docker Inc.)
    Version:  v0.2.21
    Path:     /Users/mgrundie/.docker/cli-plugins/docker-extension
  feedback: Provide feedback, right in your terminal! (Docker Inc.)
    Version:  v1.0.4
    Path:     /Users/mgrundie/.docker/cli-plugins/docker-feedback
  init: Creates Docker-related starter files for your project (Docker Inc.)
    Version:  v1.0.0
    Path:     /Users/mgrundie/.docker/cli-plugins/docker-init
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc.)
    Version:  0.6.0
    Path:     /Users/mgrundie/.docker/cli-plugins/docker-sbom
  scout: Docker Scout (Docker Inc.)
    Version:  v1.3.0
    Path:     /Users/mgrundie/.docker/cli-plugins/docker-scout
WARNING: Plugin "/Users/mgrundie/.docker/cli-plugins/docker-scan" is not valid: failed to fetch metadata: fork/exec /Users/mgrundie/.docker/cli-plugins/docker-scan: no such file or directory

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 19
 Server Version: 25.0.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: a1496014c916f9e62104b33d1bb5bd03b0858e59
 runc version: v1.1.11-0-g4bccb38
 init version: de40ad0
 Security Options:
  seccomp
   Profile: unconfined
  cgroupns
 Kernel Version: 6.6.12-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: aarch64
 CPUs: 10
 Total Memory: 11.67GiB
 Name: docker-desktop
 ID: 4a465bdf-a712-4327-b3db-2e9e70b6805a
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5555
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: daemon is not using the default seccomp profile

Diagnostics ID

24EC8920-277D-4196-9055-11FCA2E65C2B/20240229142349

Additional Info

I also raised a LocalStack issue (localstack/localstack#10340), but I'm sure this is a Docker issue.

Happy to provide any requested logs that may shed more light.

@dgageot
Member

dgageot commented Mar 1, 2024

cc @djs55

@mgrundie-r7
Author

Since this also affects the latest Docker Desktop version (4.28), should the 4.28 version label also be added?

@djs55
Contributor

djs55 commented Mar 18, 2024

Thanks for the report. It's interesting that the bug only manifests on the combination of Ventura + recent Docker Desktop. Could you ask people using Ventura to disable the "Use kernel networking for UDP" option in Settings -> Resources -> Network? There are signs in the diagnostics that the second NIC (used for UDP as an optimisation) is failing to DHCP in a loop, so perhaps this is causing the interruption.

I tried to reproduce the issue on a recent Sonoma beta and did notice something odd: I see uploads stalling, and if I capture a packet trace with

docker run -it --net=host -v /Users/.../bug:/bug  djs55/tcpdump -n -i eth0 -s 0 -w /bug/debug.pcap

and then look at the trace with Wireshark, I see LocalStack's service on port 4566 close its receive window (i.e. ask not to receive any more data, presumably because its socket buffers are full) for a few seconds and then re-open it, causing stalls and low throughput:
[Screenshot: Wireshark trace]
I'm not sure what to make of this.

@mgrundie-r7
Author

mgrundie-r7 commented Mar 19, 2024

Thanks, I tried with "Use kernel networking for UDP" disabled, but the issue persists.

I obtained a pcap in the following way and also see TCP ZeroWindow, though it does not recover, and the ZeroWindowProbes appear to be duplicated. This ZeroWindowProbe + ZeroWindowProbeAck loop persists even after you kill the awslocal s3api put-object command.

sudo tcpdump -i any port 4566 -n -w 4566.pcap

[Screenshot: the capture in Wireshark]

When I run lsof on the ports that continue probing, I can see they are being used by Docker.

TCP DUMP

12:02:07.628530 IP6 ::1.51310 > ::1.4566: Flags [.], seq 0:1, ack 1, win 6370, options [nop,nop,TS val 3707105296 ecr 3030307405], length 1
12:02:07.628559 IP6 ::1.51310 > ::1.4566: Flags [.], seq 0:1, ack 1, win 6370, options [nop,nop,TS val 3707105296 ecr 3030307405], length 1
12:02:07.628602 IP6 ::1.4566 > ::1.51310: Flags [.], ack 0, win 0, options [nop,nop,TS val 3030312436 ecr 3707105296], length 0
12:02:07.628606 IP6 ::1.4566 > ::1.51310: Flags [.], ack 0, win 0, options [nop,nop,TS val 3030312436 ecr 3707105296], length 0
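
The pattern in the dump above (one-byte probes answered by segments advertising `win 0`) can be counted directly from tcpdump's text output. This is a rough sketch, assuming lines formatted like the ones above, where the third whitespace-separated field is the source address and port; `zero_window_senders` is a hypothetical helper name.

```shell
# zero_window_senders: reads tcpdump text output on stdin and prints, for each
# source address.port, how many segments it sent advertising "win 0".
# Assumes the line layout shown in the dump above (field 3 is the source).
zero_window_senders() {
  awk '/ win 0,/ { count[$3]++ } END { for (src in count) print src, count[src] }'
}

# Example usage against the capture file from this comment:
# tcpdump -n -r 4566.pcap | zero_window_senders
```

Fed the four lines above, this prints `::1.4566 2`: the side listening on port 4566 is the one advertising a closed window.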

lsof

❯ lsof -i tcp:51310
COMMAND     PID     USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
com.docke 85642 mgrundie  269u  IPv6 0xf9fbd45a05632263      0t0  TCP localhost:kwtc->localhost:51310 (ESTABLISHED)

❯ lsof -i tcp:51292
COMMAND     PID     USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
com.docke 85642 mgrundie  266u  IPv6 0xf9fbd45a05630a63      0t0  TCP localhost:kwtc->localhost:51292 (ESTABLISHED)

ps

❯ ps -p 85642 -o command
COMMAND
/Applications/Docker.app/Contents/MacOS/com.docker.backend
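
The lsof and ps steps above can be combined into one lookup. This is only a sketch: `port_owners` is a hypothetical helper name, and it relies on `lsof -t` printing bare PIDs and `ps -o command=` printing the command without a header.

```shell
# port_owners PORT: prints the command line of every process that has a TCP
# socket on PORT, using lsof -t (PIDs only) and ps -o command= (no header).
port_owners() {
  local pid
  for pid in $(lsof -t -i "tcp:$1"); do
    ps -p "$pid" -o command=
  done
}

# Example usage with one of the ports from the capture:
# port_owners 51310
```

On a loopback connection both endpoints are local, so this may list the client process as well as the one holding the other end of the socket.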

@djs55
Contributor

djs55 commented Apr 2, 2024

Sorry for the delay replying. I've managed to reproduce a similar problem (hopefully the same one) and I've got a prototype fix if you'd like to try it:

With this build I can leave a large transfer running and it doesn't stall for me.

@mgrundie-r7
Author

@djs55 I ran the loop for 500 iterations using your build on macOS Ventura 13.6.6 (22G630), and this is what happened.

At iteration 349:

...
//348
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.36s user 0.17s system 66% cpu 0.796 total
//349
awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat     0.44s user 0.22s system 0% cpu 3:07.31 total

In a new terminal I ran it manually to see the output. Command took ~3 mins:

❯ awslocal s3api put-object --bucket testbucket --key somekey/samplefile.dat --body ./samplefile.dat

Connection was closed before we received a valid response from endpoint URL: "http://localhost:4566/testbucket/somekey/samplefile.dat".

I started a packet capture when the trouble started and continued it after killing the loop and see the same ZeroWindow output in Wireshark as my last post.

I repeated the above steps a second time and trouble started at iteration 123.

I would say your build has definitely improved the situation, as it previously took fewer iterations before failure. I can also now run my Cucumber test suite without it hanging due to this problem.

For completeness, I did confirm that the 500-iteration loop runs successfully with Docker Desktop 4.26.1.

Tested the Apple Silicon build only, as I do not have access to an Intel Mac anymore.

@djs55
Contributor

djs55 commented Apr 3, 2024

Thanks for the interesting test results!

@djs55
Contributor

djs55 commented Apr 8, 2024

@mgrundie-r7 I've got another experimental build, if you'd like to try it. It has a fix for a bug where the "zero window" probe fails to work and the connection stalls. I suspect the previous experimental build encountered this scenario less often, which is why it was slightly improved but not completely fixed:

If you have time to try either of those, let me know how it goes.

@mgrundie-r7
Author

Using your latest build I reran the steps we discussed previously.

I ran 500 iterations of the for loop twice (1000 total) without issue. Seems like you've fixed it. :)

Thanks.

(Tested Apple Silicon build only)

@djs55
Contributor

djs55 commented Apr 8, 2024

Glad to hear it! (for the actual TCP fix most of the credit should go to the fine people over at google/gvisor)

Unfortunately the fix has missed the release deadline for 4.29. I recommend continuing to use the dev build until 4.30 (or perhaps a 4.29.1 update, if there is one).

Thanks for your help tracking this down.

@chaizeg

chaizeg commented May 6, 2024

Closing this issue because a fix has been released in Docker Desktop 4.30.0. See the release notes for more details.
