
Network in containers breaks under bigger network load #140

Closed
mjkonarski-b opened this issue Jan 24, 2022 · 47 comments
Labels
enhancement New feature or request

@mjkonarski-b

mjkonarski-b commented Jan 24, 2022

Network breaks in containers when they start multiple network connections at the same time.

I noticed this behaviour, for example, while downloading Python dependencies. When multiple packages are downloaded at the same time I start getting a Network is unreachable error. When I then log in to the underlying QEMU machine (limactl shell colima) I can see that it can't reach any network address; I cannot even ping 8.8.8.8. My host computer doesn't have any connection issues.

It gets better after a few moments of inactivity. Restarting the QEMU machine (colima stop && colima start) fixes the network, but the problem comes back when I increase the network load.

This is a problem that I can consistently reproduce. I created a minimal setup to demonstrate it: https://github.com/mjkonarski-b/colima-poc

I experience this problem on multiple MacBooks, so it doesn't seem to be related to any particular processor or macOS version:

  • MBP 2021 M1 Pro with 12.1 Monterey
  • MBP 2019 i7 with 12.1 Monterey
  • MBP 2019 i7 with 11.5.2 BigSur
$ colima version
colima version 0.3.2
git commit: 272db4732b90390232ed9bdba955877f46a50552

runtime: docker
arch: aarch64
client: v20.10.11
server: v20.10.11


$ limactl --version
limactl version 0.8.1


$ qemu-img --version
qemu-img version 6.2.0
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
@dmikusa

dmikusa commented Jan 25, 2022

+1, I'm seeing the same thing. Same colima, lima and qemu versions. macOS 12.1 Monterey.

  1. colima start
  2. docker pull any image that's a couple of hundred MB+
  3. It gets halfway through and stalls. It'll try to continue with limited success.

I tried downgrading to colima 0.3.1 and it didn't seem to help. I did a colima delete and colima start after downgrading, but I'm still seeing the issue.

@mjkonarski-b
Author

This seems to be related to #137 as well.

@mjkonarski-b
Author

I did more investigation, but I couldn't find the root cause. So far it seems that the problem lies in Lima or QEMU itself. I could reproduce it on machines running raw Lima images, without Colima. I found two issues in the Lima repo that seem to describe the very same problem:
lima-vm/lima#537
lima-vm/lima#561

@bolt-juri-gavshin

bolt-juri-gavshin commented Jan 27, 2022

I believe the problem is not inside Docker or the containers, maybe not even in (Co)lima/qemu...

Some additional information:
I have Colima version 0.3.2.
Everything worked on macOS 12.0.1, but broke after upgrading to 12.2.

Steps to reproduce:

colima delete
colima start --cpu 6 --memory 8 --disk 60
colima ssh -- sudo apk add apache2-utils
colima ssh -- ab -c 100 -n 1000 https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl

I receive lots of errors like SSL read failed (5) - closing connection and the summary is:

Time taken for tests:   0.916 seconds
Complete requests:      1000
Failed requests:        1818
   (Connect: 0, Receive: 0, Length: 909, Exceptions: 909)

If the last command is run on the host, everything is fine:

Time taken for tests:   10.788 seconds
Complete requests:      1000
Failed requests:        0

Before the macOS upgrade (12.0.1 -> 12.2), running ab inside the Colima VM gave the same result as the host (i.e. success).

P.S. The problem is the same in Docker, for example:

docker run --rm jordi/ab -c 100 -n 1000 https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl

@bolt-juri-gavshin

bolt-juri-gavshin commented Jan 31, 2022

Another note: Rancher Desktop 1.0.0 works without problems on the same machine when I run the docker command:

docker run --rm jordi/ab -c 100 -n 1000 https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl

@mrsladek

mrsladek commented Feb 2, 2022

Hi,
Same issue here on 5 different MBP machines.

When pulling multiple images at the same time with docker-compose, the network breaks and I get an unreachable error or an i/o timeout.

It would be great if the problem could be addressed soon.

$ colima version
colima version 0.3.2
git commit: 272db4732b90390232ed9bdba955877f46a50552

runtime: docker
arch: x86_64
client: v20.10.12
server: v20.10.11

$ limactl --version
limactl version 0.8.1

$ qemu-img --version
qemu-img version 6.2.0
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

@brunoselvacj

brunoselvacj commented Feb 2, 2022

This looks like a problem with Alpine itself. I applied the same kind of configuration that Colima does, using a yaml file directly with Lima, and the results are the same. It is not reproducible with Ubuntu using the yaml and installing Docker via provisioning.

$ colima version
colima version 0.3.2
git commit: 272db47

runtime: docker
arch: x86_64
client: v20.10.12
server: v20.10.11

$ limactl --version
limactl version 0.8.1

$ qemu-img --version
qemu-img version 6.2.0
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

@MMartyn

MMartyn commented Feb 2, 2022

I was having this issue and was able to work around it by adding the following to ~/.lima/_config/override.yaml

useHostResolver: false
dns:
- 8.8.8.8

@abiosoft
Owner

abiosoft commented Feb 2, 2022

I was having this issue and was able to work around it by adding the following to ~/.lima/_config/override.yaml

useHostResolver: false
dns:
- 8.8.8.8

If that's the case, user-configurable DNS can be added in the next version.

@mrsladek

mrsladek commented Feb 3, 2022

useHostResolver: false
dns:
- 8.8.8.8

not working on my setup :(

@abiosoft
Owner

abiosoft commented Feb 3, 2022

useHostResolver: false
dns:
- 8.8.8.8

not working on my setup :(

I'm working on a proper fix for configurable dns on each startup.
This actually takes effect on the initial VM creation, so nothing changes if you apply this to an existing VM.
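To spell that out, a minimal sketch of applying the workaround from scratch (the file path and keys are taken from the comments above; note that colima delete destroys the existing VM and its data):

```shell
# Write the DNS override from the workaround above. It is only read when the
# VM is created, so an existing VM must be recreated afterwards.
mkdir -p ~/.lima/_config
cat > ~/.lima/_config/override.yaml <<'EOF'
useHostResolver: false
dns:
- 8.8.8.8
EOF

# Recreate the VM so the override takes effect (destroys the current VM):
# colima delete && colima start
```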

@bolt-juri-gavshin

I'm working on a proper fix for configurable dns on each startup.

The problem doesn't seem to be only DNS related. As the author described, after the VM gets into a "bad" state, ping/connection by IP doesn't work either.

As I wrote before, Rancher Desktop (which also uses Alpine Lima images under the hood) works fine on the same machine. Maybe we can try to use the same images? I am ready to experiment with new versions; I just need some guidance. I have M1, M1 Pro and Intel Macs at my disposal to test it.

@keerati

keerati commented Feb 10, 2022

I was having this issue and was able to work around it by adding the following to ~/.lima/_config/override.yaml

useHostResolver: false
dns:
- 8.8.8.8

This works for me.

@dmikusa

dmikusa commented Feb 11, 2022

I was having this issue and was able to work around it by adding the following to ~/.lima/_config/override.yaml

useHostResolver: false
dns:
- 8.8.8.8

I got some time to try this with my previous docker pull test and it seems better, usable even, but I did see a couple of times where docker had to retry, which never happens when I run in other VMs.

@abiosoft
Owner

abiosoft commented Feb 13, 2022

Another note: Rancher Desktop 1.0.0 works without problems on the same machine, when I run docker command:

docker run --rm jordi/ab -c 100 -n 1000 https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl

I actually got a similar experience with Rancher Desktop. It seems to be something specific to Alpine as I could not reproduce with an Ubuntu image.

I am still troubleshooting and would prefer not to ditch Alpine.

@abiosoft
Owner

I actually got a similar experience with Rancher Desktop. It seems to be something specific to Alpine as I could not reproduce with an Ubuntu image.

I am still troubleshooting and would prefer not to ditch Alpine.

I take that back, it looks to be specific to Lima as I have reproduced it with multiple distros.

@abiosoft
Owner

I think it is more related to this lima-vm/lima#561.
It is specific to macOS and not reproducible on Linux, which makes me think it has something to do with macOS networking.

@deviantintegral

I'm seeing this as well; it comes up while running docker pull with a macOS host. The only workaround I have is to restart colima. Is there anything else? I agree this is likely related to the 12.2 upgrade, as I didn't (knowingly) make other upgrades when this started breaking.

@deviantintegral

Hm, well I had thought it was a problem only once or twice a day. However, I just tried starting up a PHP project and composer install hung. I killed it, SSH'ed into the VM, and networking is very broken:

colima:~$ ping google.com
PING google.com (142.251.33.174): 56 data bytes
64 bytes from 142.251.33.174: seq=0 ttl=42 time=2178818.518 ms
64 bytes from 142.251.33.174: seq=0 ttl=42 time=2179820.651 ms (DUP!)
64 bytes from 142.251.33.174: seq=0 ttl=42 time=2180821.875 ms (DUP!)
64 bytes from 142.251.33.174: seq=0 ttl=42 time=2181825.445 ms (DUP!)
64 bytes from 142.251.33.174: seq=0 ttl=42 time=2182829.395 ms (DUP!)
^C
--- google.com ping statistics ---
5 packets transmitted, 1 packets received, 4 duplicates, 80% packet loss
round-trip min/avg/max = 2178818.518/2180823.176/2182829.395 ms

@jandubois

@deviantintegral Networking may be broken, but ICMP doesn't really work over the slirp network, so ping doesn't work properly from inside the guest:

User Networking (SLIRP)
This is the default networking backend and generally is the easiest to use. It does not require root / Administrator privileges. It has the following limitations:

  • there is a lot of overhead so the performance is poor
  • in general, ICMP traffic does not work (so you cannot use ping within a guest)
  • on Linux hosts, ping does work from within the guest, but it needs initial setup by root (once per host) -- see the steps below
  • the guest is not directly accessible from the host or the external network

@elventear

elventear commented Mar 11, 2022

I dug deeper into this issue and have been able to work around it within Lima using PTP-based networking, as reported in lima-vm/lima#724. It would be nice to be able to make this all work seamlessly without manually managing the Colima template or the vde_vmnet process.

One half-solution is to add the following in ~/.lima/_config/override.yaml:

---
networks:
   - vnl: "/tmp/vde.ptp"
     switchPort: 65535

This injects the PTP network into the Colima image without changing the template, but it still requires manually starting the vde_vmnet process and deleting the default route going through the SLIRP network.

@abiosoft
Owner

This injects the PTP network into the Colima image without changing the template, but it still requires manually starting the vde_vmnet process and deleting the default route going through the SLIRP network.

If this provides the best results so far, Colima can be updated to handle this.

@abiosoft abiosoft added this to the v0.4.0 milestone Mar 12, 2022
@abiosoft abiosoft added the enhancement New feature or request label Mar 12, 2022
@abiosoft
Owner

I dug deeper into this issue and have been able to work around it within Lima using PTP-based networking, as reported in lima-vm/lima#724. It would be nice to be able to make this all work seamlessly without manually managing the Colima template or the vde_vmnet process.

With this workaround, I was able to get the desired result with this test #140 (comment).

I will keep an eye on the upstream issue. And in the meantime I will look at implementing this workaround in Colima.

@abiosoft
Owner

Just an FYI that I have made notable progress with this.

Going with PTP-based networking (thanks @elventear) reduced the required dependencies to just vde_vmnet, which turned out to be easy to bundle with Colima due to its small size.

In addition to fixing this issue (hopefully finally), all VMs also get IP addresses that are reachable from the host, which then fixes #189, #97, #71 and provides a workaround for #135.

@elventear

@abiosoft I have noticed that after running for a while, eth0 got added back as a default route, I assume due to some network or power event. I am thinking a solution is to stop eth0 from being considered for a default route. The right way to do this in Alpine seems to be:

echo 'NO_GATEWAY="eth0"' >> /etc/udhcpd.conf

Currently testing this.

@elventear

elventear commented Mar 18, 2022

@abiosoft the configuration setting didn't seem to be enough (I am not familiar with Alpine at all).

Looking deeper, it seems the configuration happens via udhcpc and this script: /usr/share/udhcpc/default.script. That script expects to find the configuration in /etc/udhcpc/udhcpc.conf, but the default Alpine image has it in /etc/udhcpc.conf 🤷.

@abiosoft
Owner

abiosoft commented Mar 18, 2022

@elventear what I do notice is that the default route gets reset on startup.

@elventear

elventear commented Mar 18, 2022

@abiosoft I think I got it. The issue seems to be that when the DHCP client refreshes the lease (eth0 is DHCP-configured), it reverts the default route. The right setting is:

echo 'NO_GATEWAY="eth0"' >> /etc/udhcpc/udhcpc.conf

Having this in my provision script seems to be more robust:

      mkdir -p /etc/udhcpc
      touch /etc/udhcpc/udhcpc.conf

      if ! grep -q 'NO_GATEWAY' /etc/udhcpc/udhcpc.conf > /dev/null; then
        echo 'NO_GATEWAY="eth0"' >> /etc/udhcpc/udhcpc.conf
      fi

      kill -s SIGUSR2 $(cat /var/run/udhcpc.eth0.pid) # force DHCP release
      kill -s SIGUSR1 $(cat /var/run/udhcpc.eth0.pid) # force DHCP reconfigure

No need to delete the default route explicitly, udhcpc will do it for you.

You can test that things are configured correctly from the shell by running sudo kill -s SIGUSR1 $(cat /var/run/udhcpc.eth0.pid) and then checking whether eth0 is back in the default routes. See the docs.

The most elegant solution, though, would be to have /etc/udhcpc/udhcpc.conf configured before the image configures the network; I don't know if that is a possibility for you.

@abiosoft
Owner

@elventear kindly install the current development version with brew install --HEAD colima and give it a try.

Thanks.

@jasoncodes

jasoncodes commented Mar 20, 2022

I just gave this a go with HEAD-5e2e413 and initially got the following output during colima start:

INFO[0000] preparing network ...                         context=vm
WARN[0015] error starting network: error at 'preparing network': stat /Users/jason/.colima/network/vmnet.ptp: no such file or directory  context=vm

~/.colima/network/vmnet.stderr contained the following:

sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
sudo: a password is required

After reviewing the generated ~/Library/LaunchAgents/com.abiosoft.colima.colima.plist file, I created /etc/sudoers.d/colima with the following:

%admin ALL=(ALL) NOPASSWD: /opt/colima/bin/colima-vmnet start colima

colima start now runs cleanly. lima0 is set up as 192.168.106.2 and is the default IPv4 route. Outbound TCP and ICMP are working well.

Edit: See #140 (comment). I had a custom /etc/sudoers.d/colima. Removing this file fixes things.


DNS is still using the user-mode network, which I have found to be unreliable under some DNS-heavy loads, even when all other traffic is routed via lima0. I'm using the following ~/.lima/_config/override.yaml to use lima0 for DNS:

useHostResolver: false
dns:
  - 192.168.106.1

With a couple more optional tweaks I can also get direct IP access to containers from the host:

sudo route -n add -net 172.17.0.0/16 192.168.106.2
colima ssh -- sudo iptables -A FORWARD -i lima0 -j ACCEPT

The following in /etc/docker/daemon.json (along with sudo rc-service docker restart) ensures Docker Compose networks use 172.17.0.0/16 too, avoiding having to add additional host routes for these Docker networks (shown as a complete minimal daemon.json; merge with any existing settings):

{
  "default-address-pools": [
    {
      "base": "172.17.0.0/16",
      "size": 24
    }
  ]
}

@abiosoft
Owner

@jasoncodes thanks for troubleshooting that. Are you on Intel or M1 mac?

@jasoncodes

I was testing on an Intel Mac running macOS 12.3. I’ll give it a go on an M1 soon.

@abiosoft
Owner

@jasoncodes with regards to your scenario, can you kindly answer the following.

  • Did you start colima via the terminal? If yes, were you prompted for sudo password to setup network?
  • The /etc/sudoers.d/colima file was meant to be generated as part of the network setup. Did you create it entirely, or did you modify an existing one?

Thanks.

@jasoncodes

Ah, I see what’s going on. I had manually created a /etc/sudoers.d/colima file some time ago (pre-0.3) to avoid password prompts for the docker.sock symlinking and didn’t remove it when upgrading to 0.3. I just manually removed this file and colima has created a new one with the correct colima-vmnet entry.

I wonder if it’d be worth outputting a warning if this file exists without an entry for colima-vmnet? Maybe not worth it. It should only be an issue for people in my scenario (upgraded from pre-0.3 without removing a custom sudoers file).

@abiosoft
Owner

I wonder if it’d be worth outputting a warning if this file exists without an entry for colima-vmnet? Maybe not worth it. It should only be an issue for people in my scenario (upgraded from pre-0.3 without removing a custom sudoers file).

Yeah, if the file does exist it should be checked for the entry, and if the entry is missing it should be appended.
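As a sketch, that check-and-append could look something like this (the entry text mirrors the one posted earlier in this thread; the helper name is illustrative, and writing to /etc/sudoers.d requires root):

```shell
# Append the colima-vmnet entry to a sudoers file only if it is missing.
ensure_vmnet_entry() {
  # $1: path to the sudoers file, e.g. /etc/sudoers.d/colima
  entry='%admin ALL=(ALL) NOPASSWD: /opt/colima/bin/colima-vmnet start colima'
  if [ -f "$1" ] && ! grep -q 'colima-vmnet' "$1"; then
    echo "$entry" >> "$1"
  fi
}

# Usage (as root): ensure_vmnet_entry /etc/sudoers.d/colima
```

Running it twice is harmless: the second invocation finds the entry and leaves the file unchanged.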

@elventear

elventear commented Mar 21, 2022

I have installed colima from the latest HEAD and so far it seems to do the right thing. I have not had time to test it thoroughly yet; I will do so over the week.

I have some concerns about the privilege setup; this is what I have noticed:

  • Binaries that are installed in /opt/colima/{bin,lib} do not have locked-down ownership (i.e. root:wheel). Since they will execute with superuser privileges, I find that to be a vulnerability.
  • Why does vde_vmnet need to be started via the colima-vmnet symlink instead of directly? I ask because, depending on how users install colima (e.g. downloading the binary directly from GitHub), the binary itself might have user ownership, and this symlink path is the entry point in the sudoers file. This could also be seen as a vulnerability.

While it is convenient to embed the tools with the application, I personally use MacPorts, where the distribution itself manages the dependencies for you and also sets things up in a more locked-down manner. I am wondering if it would be possible to provide an explicit way to manage the installation of the dependencies, so that colima doesn't install them but uses something provided externally.

@elventear

Also, I think IPv6 is broken in the container (I haven't dug into the root cause), but if you have IPv6 it can mess things up for you. In my workaround, I just disable IPv6 until it is understood why routing is not working properly:

      sysctl -w net.ipv6.conf.all.disable_ipv6=1
      sysctl -w net.ipv6.conf.default.disable_ipv6=1
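To make that workaround survive a reboot, a sysctl drop-in could be used (the path assumes the stock Alpine sysctl service reads /etc/sysctl.d; treat this as an untested assumption):

```
# /etc/sysctl.d/99-disable-ipv6.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
```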

@jasoncodes

I’ll give it a go on an M1 soon.

I gave this a go just now using arm64 Homebrew and it fails with the following:

dyld[12318]: Library not loaded: /opt/colima/lib/libvdeplug.3.dylib
  Referenced from: /opt/colima/bin/vde_vmnet
  Reason: tried: '/opt/colima/lib/libvdeplug.3.dylib' (mach-o file, but is an incompatible architecture (have 'arm64e', need 'x86_64')), '/usr/local/lib/libvdeplug.3.dylib' (no such file), '/usr/lib/libvdeplug.3.dylib' (no such file)

libvdeplug.3.dylib is arm64e but vde_vmnet is x86_64. I’ve confirmed vmnet_arm64.tar.gz in the repo is the same.

Looking at the Makefile for vde_vmnet, it seems like it checks ARCH (defaulting to the current arch). Running ARCH=arm64 make looks to correctly generate an arm64 binary when running on my x86_64 machine. I expect the reverse will also apply when building x86_64 binaries on an arm64 machine.
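A small sketch of picking the matching ARCH value for that Makefile (the helper function is illustrative; only ARCH=arm64 make comes from the comment above):

```shell
# Map `uname -m` output to the ARCH values the vde_vmnet Makefile expects.
arch_for() {
  case "$1" in
    arm64|aarch64) echo arm64 ;;
    x86_64|amd64)  echo x86_64 ;;
    *) echo "unsupported arch: $1" >&2; return 1 ;;
  esac
}

# Cross-building on an x86_64 host for Apple Silicon, per the comment above:
# ARCH=arm64 make
# file vde_vmnet   # should report an arm64 Mach-O binary
```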

@abiosoft
Owner

abiosoft commented Mar 22, 2022

Looking at the Makefile for vde_vmnet, it seems like it checks ARCH (defaulting to the current arch). Running ARCH=arm64 make looks to correctly generate an arm64 binary when running on my x86_64 machine. I expect the reverse will also apply when building x86_64 binaries on an arm64 machine.

Oh my! How did I miss this 🙈. Can you build from source, @jasoncodes? If so, I would appreciate your assistance with testing directly on a development branch before it gets into main.

@jasoncodes

Yes, I’m more than happy to test any development branches you may have. Looking forward to having a release with built-in support for VDE networking. Thanks for your great work. :)

Aside: Is there a documented uninstall process anywhere? Previously, a colima delete on all profiles (followed by a brew uninstall) would clean everything up. Now we also have /opt/colima, which is not automatically removed. Might be worth adding something to the README?

@abiosoft
Owner

Yes, I’m more than happy to test any development branches you may have. Looking forward to having a release with built-in support for VDE networking. Thanks for your great work. :)

This should be fixed now. You can give it another try on your M1 device.

Aside: Is there a documented uninstall process anywhere? Previously, a colima delete on all profiles (followed by a brew uninstall) would clean everything up. Now we also have /opt/colima, which is not automatically removed. Might be worth adding something to the README?

Yeah, I have considered this as well.

@jasoncodes

You can give it another try on your M1 device.

Looking good on M1 now. 👍

@abiosoft
Owner

I have some concerns about the privilege setup; this is what I have noticed:

  • Binaries that are installed in /opt/colima/{bin,lib} do not have locked-down ownership (i.e. root:wheel). Since they will execute with superuser privileges, I find that to be a vulnerability.

Thanks for taking note. These will be tightened and limited to a privileged user; 0744 at most.

  • Why does vde_vmnet need to be started via the colima-vmnet symlink instead of directly? I ask because, depending on how users install colima (e.g. downloading the binary directly from GitHub), the binary itself might have user ownership, and this symlink path is the entry point in the sudoers file. This could also be seen as a vulnerability.

It is primarily for convenience, since vmnet can only be started by a privileged user.
The colima-vmnet symlink can only do one thing: start vmnet.
The PTP network requires a vmnet process per VM, which mandates different arguments to start vde_vmnet. Because of that, a single entry for vde_vmnet in the sudoers file will not suffice, unless it covers the entire vde_vmnet command irrespective of the program args.
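To illustrate that trade-off (entries are hypothetical, using the paths mentioned in this thread): in sudoers, listing a command together with arguments permits only that exact invocation, while listing the bare command permits it with any arguments.

```
# Wrapper entry: permits exactly one invocation.
%admin ALL=(ALL) NOPASSWD: /opt/colima/bin/colima-vmnet start colima

# Bare vde_vmnet entry: would permit vde_vmnet with ANY arguments.
%admin ALL=(ALL) NOPASSWD: /opt/colima/bin/vde_vmnet
```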

I am very much open to better ideas around this.

While it is convenient to embed the tools with the application, I personally use MacPorts, where the distribution itself manages the dependencies for you and also sets things up in a more locked-down manner. I am wondering if it would be possible to provide an explicit way to manage the installation of the dependencies, so that colima doesn't install them but uses something provided externally.

Embedding in this case is a decent option, as it provides a consistent experience without adding notable size overhead.
I would prefer to stick to the convenience of a single binary and zero dependencies.

Just to clarify, I am not against external dependencies; Lima and QEMU are dependencies as well, after all. But I think vde_vmnet is better embedded.

@jasoncodes

May I enquire as to why vde_vmnet is being started via a launch agent? Lima itself seems to start VDE networking directly (also using sudo).

The advantage of this is that you can SSH into a machine (such as the M1 machine I am testing on :)) without having a graphical login. I just double-checked spinning up a VM directly with Lima using shared networking over SSH, and it works well. Colima, as expected, fails to start the launch agent.

@abiosoft
Owner

@jasoncodes launchd is used mainly to keep it running as a background process. I can borrow from the approach used by Lima, or find a way to tie it to the qemu process.

Thanks, your feedback has been helpful.

@abiosoft
Owner

This should be fixed now.

@deviantintegral

Agreed, I haven't had any trouble over the past few weeks running code from main. Nice work!
