Colima hangs a short while after starting #552

Open
1 of 5 tasks
blame-git opened this issue Dec 30, 2022 · 36 comments

Comments

@blame-git

Description

Shortly after starting a colima instance it seems to hang, and I’m unable to SSH into the session.

% colima -v start -t vz -c 8 -m 8 --mount /Volumes/Void:w --mount-type virtiofs --mount /Volumes/Stash:w --mount-type virtiofs --mount /Volumes/Vacuum:w --mount-type virtiofs

% colima list
PROFILE    STATUS     ARCH       CPUS    MEMORY    DISK     RUNTIME    ADDRESS
default    Running    aarch64    8       8GiB      60GiB    docker     

# Roughly 5~10 minutes later
% colima ssh
FATA[0006] exit status 255

From ~/.lima/colima/ha.stderr.log I see:


{"level":"error","msg":"write unixgram -\u003e: write: no buffer space available","time":"2022-12-30T11:24:42-08:00"}
{"level":"error","msg":"cannot receive packets from , disconnecting: cannot read size from socket: read unixgram -\u003e: use of closed network connection","time":"2022-12-30T11:24:42-08:00"}
{"level":"error","msg":"virtual network error: \"cannot read size from socket: read unixgram -\u003e: use of closed network connection\"","time":"2022-12-30T11:24:42-08:00"}
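
The hostagent log quoted above is a stream of JSON objects, and more than one object can end up on the same line. For anyone collecting evidence for this issue, here is a minimal Python sketch that pulls the error-level entries out of such a log. The helper names `iter_log_entries` and `error_messages` are made up for illustration; the field names (`level`, `msg`, `time`) are taken from the quoted lines, and the sample text in the demo is a shortened stand-in, not real output.

```python
import json


def iter_log_entries(text):
    """Yield every JSON object in text, even when several share one line."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        try:
            entry, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Unparsable fragment: give up on this line and try the next one.
            newline = text.find("\n", pos)
            if newline == -1:
                break
            pos = newline + 1
            continue
        if isinstance(entry, dict):
            yield entry


def error_messages(text):
    """Return (time, msg) pairs for entries whose level is "error"."""
    return [
        (entry.get("time"), entry.get("msg"))
        for entry in iter_log_entries(text)
        if entry.get("level") == "error"
    ]


if __name__ == "__main__":
    # Shortened, made-up sample in the same shape as the log lines above.
    sample = (
        '{"level":"error","msg":"no buffer space available","time":"t1"}\n'
        '{"level":"info","msg":"ok","time":"t2"}'
        '{"level":"error","msg":"virtual network error","time":"t3"}\n'
    )
    for when, message in error_messages(sample):
        print(when, message)
```

To run it against a real log, read the file mentioned in the report (`~/.lima/colima/ha.stderr.log`) and pass its contents to `error_messages`.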

Version

Colima Version:

colima version HEAD-88390f5
git commit: 88390f54bceb72e248044aa3b452b64c676d99d1

Lima Version: limactl version 0.14.2
Qemu Version: qemu-img version 7.2.0
macOS Version: 13.1 22C65

Operating System

  • macOS Intel <= 12 (Monterey)
  • macOS Intel >= 13 (Ventura)
  • macOS M1 <= 12 (Monterey)
  • macOS M1 >= 13 (Ventura)
  • Linux

Output of colima status

% colima status
FATA[0003] error retrieving current runtime: empty value

vm-type: vz
mount type: virtiofs

Reproduction Steps

  1. Start Colima
  2. 3 Docker containers start
  3. Wait about 5-10 minutes

Expected behaviour

No response

Additional context

No response

@abiosoft
Owner

abiosoft commented Dec 30, 2022

Thanks for reporting this. I have experienced the same with the VZ VM type and am still troubleshooting.

If you do not need the faster filesystem access, QEMU is more stable at the moment.

@balajiv113

@blame-git
If you could describe the kind of network activity happening around that time (and, if possible, the containers you were running), it would help us debug and fix this issue.

I have been trying to reproduce this but haven't found a way.

@abiosoft
Owner

abiosoft commented Jan 4, 2023

@balajiv113 I am not sure what fixed it precisely, but I suspect it's lima-vm/lima#1261.

I am still monitoring it, but it seems to have stabilised now and is no longer freezing.

@Sangeppato

I'm still experiencing this issue from time to time, but I haven't found a way to systematically reproduce it

@blame-git
Author

I’m still able to reproduce it, but it is less frequent now. What seems to trigger it for me is a container that downloads files from the internet at high speed and writes them out to volumes (mounted via virtiofs).

@blame-git
Author

Oops, didn’t mean to close.

@blame-git blame-git reopened this Jan 17, 2023
@balajiv113

@blame-git

What seems to trigger it for me is a container that downloads files from the internet at high speed and writes them out to volumes

Do these steps make colima hang? If so, can you share a command to reproduce them?

@mcelen35

Bumping, I'm having the exact same issue.

@developerrespig

Also having this issue and it drives me nuts :(

@balajiv113

If it's consistently reproducible, please do share the steps.

I'm having a hard time reproducing this. It happens, but not always.

@mcelen35

@balajiv113, there are no unusual steps, actually.

Just start colima, start a couple of containers, and it hangs after 5 minutes.

I have mongodb,nats,mysql,redis (from bitnami) containers only, within a custom network.

@balajiv113

I have mongodb,nats,mysql,redis (from bitnami) containers only, within a custom network.

If you could share the docker-compose for this setup it would be great.

@mcelen35

I have mongodb,nats,mysql,redis (from bitnami) containers only, within a custom network.

If you could share the docker-compose for this setup it would be great.

Here you go

version: '3'
services:
  nats:
    image: nats:latest
    container_name: nats
    networks:
      - placeholder
    expose:
      - 4222
    ports:
      - "4222:4222"
  mongodb:
    image: ghcr.io/zcube/bitnami-compat/mongodb:6.0.3-debian-11-r51
    container_name: mongodb
    environment:
      - MONGODB_USERNAME=placeholder
      - MONGODB_PASSWORD=placeholder
      - MONGODB_DATABASE=backend
      - MONGODB_ROOT_PASSWORD=root
    networks:
      - placeholder
    expose:
      - 27017
    ports:
      - "27017:27017"
    volumes:
      - 'mongodb:/bitnami'

  redis:
    image: ghcr.io/zcube/bitnami-compat/redis:7.0.7-debian-11-r51
    container_name: redis
    environment:
      - ALLOW_EMPTY_PASSWORD=yes
    networks:
      - placeholder
    expose:
      - 6379
    ports:
      - "6379:6379"
    volumes:
      - 'redis:/bitnami'
  mysql:
    image: ghcr.io/zcube/bitnami-compat/mysql:8.0.31-debian-11-r51
    container_name: mysql
    environment:
      - MYSQL_USER=placeholder
      - MYSQL_PASSWORD=placeholder
      - MYSQL_DATABASE=placeholder
      - MYSQL_AUTHENTICATION_PLUGIN=mysql_native_password
      - MYSQL_ROOT_PASSWORD=root
    networks:
      - placeholder
    expose:
      - 3306
    ports:
      - "3306:3306"
    volumes:
      - 'mysql:/bitnami'
networks:
  placeholder:
    external: true
volumes:
  mongodb:
    driver: local
    name: placeholder-mongodb
  mysql:
    driver: local
    name: placeholder-mysql
  redis:
    driver: local
    name: placeholder-redis

@balajiv113

Thanks for the compose file.
I've been running this on my Intel Mac for the past 30 minutes with no hangs so far :(

Will try the same with M1

@abiosoft
Owner

abiosoft commented Feb 14, 2023

@balajiv113 it only happens if VM type is vz.

@balajiv113

@abiosoft - Yes, trying with vz only

@codeinearts

codeinearts commented Mar 22, 2023

This is also happening to me with a mongodb setup. After working with the containers for a while, it hangs.

It mostly happens after I have been working with it for a while. The error is persistent after that.

Deleting the VM and starting from scratch solves the issue until it happens again some days later.

edit: I'm using qemu on a macbook pro intel setup

@davepoon

davepoon commented May 5, 2023

Having the same issue with my M1 MBP, I have to run colima stop -f and start again frequently :(

@balajiv113

balajiv113 commented May 5, 2023

The network stack for vz was updated in lima-vm/lima#1383 (targeted for v0.16).

If you're interested in trying it out, use the latest lima master with this new network stack.

@ryancurrah
Contributor

ryancurrah commented Jun 1, 2023

I get this issue when I switch from the docker runtime to containerd and then back to docker.

MacOS 13.3.1 (22E261)

❯ colima version
colima version 0.5.5
git commit: 6251dc2c2c5d8197c356f0e402ad028945f0e830
❯ limactl --version
limactl version 0.16.0
❯ colima start --verbose --runtime=docker
INFO[0000] starting colima
INFO[0000] runtime: docker
INFO[0000] creating and starting ...                     context=vm
> Terminal is not available, proceeding without opening an editor
> `vmType: vz` is experimental
> "Attempting to download the image" arch=aarch64 digest="sha512:84c93e8aaa09446618bf87daa993e260da69b50e95670aed5df6671b2cff9464810752cbf70f6ee5ddf9d3e1c91d98104b3c573cc024c5f0687ad3f4d2e93ebc" location="https://github.com/abiosoft/alpine-lima/releases/download/colima-v0.5.5/alpine-lima-clm-3.18.0-aarch64.iso"
> Using cache "/Users/rcurrah/Library/Caches/lima/download/by-url-sha256/bbac3cc01786365dbff7aa3e7c1dc2dcc8ee0aeacd6df51bce9840c8feeca75f/data"
> [hostagent] Starting VZ (hint: to watch the boot progress, see "/Users/rcurrah/.lima/colima/serial.log")
> [hostagent] new connection from  to
> [hostagent] Setting up Rosetta share
> SSH Local Port: 51110
> [hostagent] Waiting for the essential requirement 1 of 3: "ssh"
> [hostagent] [VZ] - vm state change: running
> [hostagent] Waiting for the essential requirement 1 of 3: "ssh"
> [hostagent] 2023/06/01 17:42:44 tcpproxy: for incoming conn 127.0.0.1:51116, error dialing "192.168.5.1:22": connect tcp 192.168.5.1:22: connection was refused
> [hostagent] Waiting for the essential requirement 1 of 3: "ssh"
> [hostagent] The essential requirement 1 of 3 is satisfied
> [hostagent] Waiting for the essential requirement 2 of 3: "user session is ready for ssh"
> [hostagent] The essential requirement 2 of 3 is satisfied
> [hostagent] Waiting for the essential requirement 3 of 3: "the guest agent to be running"
> [hostagent] The essential requirement 3 of 3 is satisfied
> [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished"
> [hostagent] Forwarding "/var/run/docker.sock" (guest) to "/Users/rcurrah/.colima/default/docker.sock" (host)
> [hostagent] Forwarding "/var/run/docker.sock" (guest) to "/Users/rcurrah/.colima/docker.sock" (host)
> [hostagent] The final requirement 1 of 1 is satisfied
> READY. Run `limactl shell colima` to open the shell.
> stat: can't stat '/proc/sys/fs/binfmt_misc/rosetta': No such file or directory
>   File: /proc/sys/fs/binfmt_misc/qemu-x86_64
>   Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
> Device: 32h/50d	Inode: 8709        Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2023-06-01 21:42:46.290000001 +0000
> Modify: 2023-06-01 21:42:46.290000001 +0000
> Change: 2023-06-01 21:42:46.290000001 +0000
INFO[0021] provisioning ...                              context=docker
> error retrieving current runtime: empty value
FATA[0021] error provisioning docker: exit status 1
❯ colima delete -v
are you sure you want to delete colima and all settings? [y/N] y
INFO[0001] deleting colima
WARN[0001] error retrieving runtimes: error retrieving current runtime: empty value
> Sending SIGKILL to the vz driver process 31489
> Sending SIGKILL to the host agent process 31489
> Removing *.pid *.sock under "/Users/rcurrah/.lima/colima"
> Removing "/Users/rcurrah/.lima/colima/ga.sock"
> Removing "/Users/rcurrah/.lima/colima/ha.pid"
> Removing "/Users/rcurrah/.lima/colima/ha.sock"
> Removing "/Users/rcurrah/.lima/colima/ssh.sock"
> remove /Users/rcurrah/.lima/colima/ssh.sock: no such file or directory
> Removing "/Users/rcurrah/.lima/colima/usernet__endpoint.sock"
> Removing "/Users/rcurrah/.lima/colima/usernet__fd.sock"
> Removing "/Users/rcurrah/.lima/colima/vz.pid"
> Deleted "colima" ("/Users/rcurrah/.lima/colima")
INFO[0001] done

@ryancurrah
Contributor

ryancurrah commented Jun 2, 2023

So the lima VM is started successfully, I can shell into it from limactl shell but not colima ssh.

❯ limactl list
NAME      STATUS     SSH                VMTYPE    ARCH       CPUS    MEMORY    DISK      DIR
colima    Running    127.0.0.1:49485    vz        aarch64    4       12GiB     200GiB    ~/.lima/colima
❯ limactl shell colima
colima:/Users/rcurrah$
❯ colima ssh --very-verbose
TRAC[0000] cmd ["limactl" "info"]
TRAC[0000] cmd ["limactl" "list" "colima" "--json"]
TRAC[0000] cmd ["limactl" "list" "colima" "--json"]
TRAC[0000] cmd ["limactl" "list" "colima" "--json"]
TRAC[0000] cmd ["limactl" "show-ssh" "--format" "args" "colima"]
TRAC[0000] cmd ["limactl" "list" "colima" "--json"]
TRAC[0000] cmd ["limactl" "shell" "colima" "--" "sh" "-c" "echo $COLIMA_LAYER_SSH_PORT"]
TRAC[0000] cmd int ["ssh" "-o" "IdentityFile=/Users/rcurrah/.lima/_config/user" "-o" "StrictHostKeyChecking=no" "-o" "UserKnownHostsFile=/dev/null" "-o" "NoHostAuthenticationForLocalhost=yes" "-o" "GSSAPIAuthentication=no" "-o" "PreferredAuthentications=publickey" "-o" "Compression=no" "-o" "BatchMode=yes" "-o" "IdentitiesOnly=yes" "-o" "Ciphers=^aes128-gcm@openssh.com,aes256-gcm@openssh.com" "-o" "User=rcurrah" "-o" "ControlMaster=auto" "-o" "ControlPath=/Users/rcurrah/.colima/default/ssh.sock" "-o" "ControlPersist=5m" "-o" "ForwardAgent=yes" "-o" "Hostname=127.0.0.1" "-o" "Port=49484" "-q" "-t" "127.0.0.1" "--" "cd /Users/rcurrah 2> /dev/null; \"$SHELL\" --login"]
FATA[0000] exit status 255

@ryancurrah
Contributor

ryancurrah commented Jun 2, 2023

Oh, I figured it out. To use containerd I had uninstalled the docker CLI and linked nerdctl to docker so all my tools could still use the docker command. I then switched back to the docker runtime without reinstalling the docker CLI.

The fix was to remove the link and install the docker CLI again (brew install docker). Hope this helps anyone else out.

@stefandunca

stefandunca commented Jun 11, 2023

I can reproduce this issue on macOS by running a Syncthing instance with the following docker-compose file. Colima hangs after a few minutes to a few hours. It might be correlated with the download speed. Last time it hung after downloading 19.5 GiB at ~20.4 MiB/s.

colima version: 0.5.4
MacOS: 13.4 (22F66)

version: "3"
services:
  syncthing:
    image: lscr.io/linuxserver/syncthing:latest
    container_name: syncthing
    hostname: syncthing
    environment:
      - PUID=456
      - PGID=45
      - TZ=Europe/Bucharest
    volumes:
      - ./syncthing/config:/config
      - /Users/tmp/mount/sync:/storage/sync
    ports:
      - 8384:8384
      - 22000:22000/tcp
      - 22000:22000/udp
      - 21027:21027/udp
    restart: unless-stopped

@Uplink03

Uplink03 commented Jun 19, 2023

Note: This comment is preliminary. I hope to narrow things down and provide a setup where I can reproduce the issue reliably. This looks like gremlin territory, so no promises.

Not sure if this is related to the thread above, but I'm getting some hangs similar to what's described in this thread.

The difference though is that I'm not using VZ. I'm using the default setup, with QEMU. It's an Intel Mac, running Ventura 13.4.

This worked fine until today, when I added an extra container to my setup.

The hangs appear to happen at the end of a "composer install" or an "npm install". The first time, the hang cured itself after a few minutes and the containers were accessible again. The second time, in the same session, it didn't recover.

  • Control+C does not do anything.
  • colima status hangs.
  • colima version says it's version 0.5.5, git commit 6251dc2, then hangs
  • all docker commands hang

I did a colima stop -f, but I wasn't kicked out of a container where I had a shell running - which was frozen. I had to kill -9 the docker-compose process that was running that shell.

Restarting colima seems to get things working again, so I don't know how long it will take to reproduce the issue again. Colima had been running for about a week without a restart, but through computer sleeps, if that's relevant.
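
Because `colima status` itself blocks when the VM is frozen, a probe with a timeout is one way to tell a hang apart from an ordinary failure. The following is a minimal Python sketch, not a tested monitoring tool: the `probe` helper and its return labels are hypothetical, the 10-second timeout is an arbitrary choice, and it assumes a frozen VM makes `colima status` block as described in this thread.

```python
import subprocess


def probe(cmd, timeout=10.0):
    """Run cmd and classify the result as 'ok', 'failed', 'hung' or 'missing'."""
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed if it runs past this
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    state = probe(["colima", "status"], timeout=10.0)
    print(f"colima status: {state}")
    if state == "hung":
        # Workaround reported in this thread, left as a manual step here.
        print("VM looks frozen; try `colima stop -f` and start again.")
```

Logging the time of the first "hung" result alongside the workload that was running could help narrow down what triggers the freeze.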

Some filesystem synchronisation issues seem to be at play. Npm complains about permissions on a file in the cache. This cache is on a Docker volume. Composer gave an error when I ran it just now on a local mount point. What these two have in common is that they both operate on lots of small files very quickly. If this is related, I guess the filesystem could get stuck, bringing everything down with it.
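
The "lots of small files very quickly" pattern is easy to imitate without npm or composer, which may help when trying to build a reproducer. Below is a minimal Python sketch under stated assumptions: `churn_small_files` and the `STRESS_DIR` environment variable are hypothetical names, the file count and size are arbitrary, and nothing here is confirmed to trigger the hang. The idea is to run it inside a container with the target directory on a bind mount or volume.

```python
import os
import tempfile
import time


def churn_small_files(root, count=10_000, size=512):
    """Write and read back `count` small files under root; return elapsed seconds."""
    payload = os.urandom(size)
    start = time.monotonic()
    for i in range(count):
        # Spread files over 100 subdirectories, roughly like a package cache.
        subdir = os.path.join(root, f"d{i % 100:02d}")
        os.makedirs(subdir, exist_ok=True)
        path = os.path.join(subdir, f"f{i}.bin")
        with open(path, "wb") as fh:
            fh.write(payload)
        with open(path, "rb") as fh:
            fh.read()
    return time.monotonic() - start


if __name__ == "__main__":
    # STRESS_DIR is a made-up variable; point it at the mounted directory.
    target = os.environ.get("STRESS_DIR") or tempfile.mkdtemp()
    elapsed = churn_small_files(target, count=1_000)
    print(f"wrote and read 1000 files under {target} in {elapsed:.2f}s")
```

If a run like this stalls on a mounted directory but completes on a directory inside the image, that would point at the mount rather than the workload.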

Context: My setup is basically based on this repository: https://gitlab.com/nucleware/docker-dev . Please excuse the lack of documentation. I made that to ease my multi-project PHP development setup, and it still had many pitfalls and annoyances. My volumes are mounted using the "local" driver, even though that repo defaults to nfs on a Mac.

Update 1: Restarting colima didn't actually make everything work again. I could run a shell in my containers, but I couldn't connect to my traefik container with my browser. I had to reboot my Mac to be able to connect. The filesystem problems are still there.

@rhino-harley

I am getting this frequently too. At first I thought it was due to problems with my corporate VPN interfering with lima, but maybe it isn't. Unfortunately I can't go back to using qemu easily since I need the x86_64 emulation for some containers that haven't been compiled for arm.

@lucaspar

I'm experiencing a similar freeze using qemu, pretty much just like @Uplink03 described: Ctrl+C doesn't do anything, status and version both hang, with the only thing left being a forced stop. The machine also does not seem to recover at all, remaining frozen for hours.

I'm running an Apache Spark workload, very IO-bound, so this might have to do with the filesystem after all. The issue happens every time.

@martin-watts

I'm seeing this on an M1 Mac, in a dev-environment container that's trying to do a yarn install - it's trying to read/write >10K files on a bind mount (standard JavaScript dependency nonsense) and just seems to hang at some point - the only option is force-killing the qemu vm. The colima logs don't give any hints as far as I can tell.

@Ptival

Ptival commented Jul 11, 2023

Experiencing a similar hang, on Intel Mac.

All I'm doing is starting an ubuntu:22.04 image, installing some build dependencies, and trying to build binutils.

i.e.

apt update
apt install bison build-essential flex git libgmp-dev libmpfr-dev texinfo
git clone git://sourceware.org/git/binutils-gdb.git
cd binutils-gdb
CC=gcc ./configure
make

Not sure it'll help, but it starts hanging at this point:

rm -f bfd-tmp.h
cp bfd-in3.h bfd-tmp.h
/bin/bash ./../move-if-change bfd-tmp.h bfd.h
rm -f bfd-tmp.h
touch stmp-bfd-h
  CC       archures.lo
  CC       targets.lo
  CC       dwarf2.lo
rm -f tofiles
f=""; \
for i in elf64-x86-64.lo elfxx-x86.lo elf-ifunc.lo elf-vxworks.lo elf64.lo elf.lo elflink.lo elf-attrs.lo elf-strtab.lo elf-eh-frame.lo elf-sframe.lo dwarf1.lo dwarf2.lo elf32-i386.lo elf32.lo pei-i386.lo peigen.lo cofflink.lo coffgen.lo pe-x86_64.lo pex64igen.lo pei-x86_64.lo elf64-gen.lo elf32-gen.lo plugin.lo cpu-i386.lo cpu-iamcu.lo  archive64.lo ; do \
  case " $f " in \
    *" $i "*) ;; \
    *) f="$f $i" ;; \
  esac ; \
done ; \
echo $f > tofiles
/bin/bash ./../move-if-change tofiles ofiles
touch stamp-ofiles
  CCLD     libbfd.la

EDIT: Actually, this may be quite important to the hang, but the build happens in a mounted directory. So the clone happens on the native file system.

@lucaspar

lucaspar commented Jul 11, 2023

@Ptival can we have a minimal example of a Dockerfile based on this that is guaranteed to "hang"?

@Ptival

Ptival commented Jul 11, 2023

@lucaspar Not sure whether you saw my edit before asking, but I believe the crux of the problem lies in the build happening in a mounted, native directory, rather than a directory of the image.

I just tried running the build in a directory inside the image, and it seems to go further.

I don't think you can mount a native directory as part of a Dockerfile, so I think the problem cannot be reproduced that way.

My steps are, more or less, the following:

# Natively (MacOS Intel)
colima start --cpu 4 --memory 8
mkdir -p /Users/me/somedir
docker run --name colima-hangs --detach --rm -v /Users/me/somedir:/somedir -it colima-hangs
docker exec -it colima-hangs /bin/bash

# Now inside the Docker container
cd /somedir
apt update
apt install bison build-essential flex git libgmp-dev libmpfr-dev texinfo
git clone git://sourceware.org/git/binutils-gdb.git
cd binutils-gdb
CC=gcc ./configure
make

@mdenna-synaptics

mdenna-synaptics commented Sep 25, 2023

Same issue for me, I start my build in a container and after a couple of minutes Colima hangs.
Ctrl-C doesn't work and I'm not able to run colima ssh or colima stop.

I have an Intel MacBook Pro with macOS 13.6, using qemu with sshfs.
Note that:

  1. the build takes place in a mounted volume
  2. I just updated to the latest Ventura and Colima 0.5.5
  3. Until yesterday I was using macOS Monterey with the same configuration (qemu+sshfs) and I had no issues at all
  4. When Colima hangs, qemu's CPU usage goes down to less than 5%

This happens all the time, please let me know if I can provide more info

@nrikos

nrikos commented Sep 26, 2023

I am having the same issue on macOS Ventura 13.6 (Intel).

I start two docker containers -> Mongo and Redis

I then start three different applications in debug mode and then after a few seconds Colima is unresponsive and I get the following error when I execute docker ps

error during connect: Get "http://%2FUsers%2Fnkokkoris%2F.colima%2Fdefault%2Fdocker.sock/v1.24/containers/json": EOF

After a few minutes docker becomes responsive but as long as I have the application opening connections to Redis this keeps happening.

If I only use the Mongo image there is no problem.

@mdenna-synaptics

Same issue for me, I start my build in a container and after a couple of minutes Colima hangs. Ctrl-c doesn't work and I'm not able to do colima ssh, nor colima stop

I have Intel MacBook Pro with macOS 13.6 , using qemu with sshfs. Note that:

FYI, I switched to vz with virtiofs. Since then I haven't seen the issue and the build is faster.

@DustinHolden

@mdenna-synaptics Do you mind sharing your colima start command?

@mdenna-synaptics

mdenna-synaptics commented Oct 11, 2023

@mdenna-synaptics Do you mind sharing your colima start command?

# Delete the current colima session and settings
colima delete

# Configure colima to use virtiofs
colima start --vm-type vz --mount-type virtiofs

@aukevanleeuwen

aukevanleeuwen commented Oct 26, 2023

So the lima VM is started successfully, I can shell into it from limactl shell but not colima ssh.

I'm a bit out of my comfort zone here :), but this issue is bugging me as well. Your comment (@ryancurrah) seems to apply to me as well. I have no idea how exactly that VM should behave to be honest, but I was just poking around. One of the things that's quite peculiar I think is the fact that in the process tree it has this hanging process:

 2695 root      0:00 sshd.pam: avanleeuwen [priv]
 2697 avanleeu  0:00 sshd.pam: avanleeuwen@pts/0,pts/1
 2744 root      0:00 supervise-daemon docker --start --retry TERM/60/KILL/10 --respawn-delay 2 --respawn-max 5 --respawn-period 1800 --stderr /var/log/docker.log --stdout /var/log/docker.log /usr/bin/dockerd --
 2746 root      0:00 /usr/bin/dockerd
 2771 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
 3009 root      0:00 sudo cat /etc/colima/colima.json
 3010 avanleeu  0:00 /bin/ash --login

I found the sudo cat /etc/colima/colima.json a bit strange and tried it myself in the container as well:

> cat /etc/colima/colima.json
{"kubernetes_config":"{\"Enabled\":false,\"Version\":\"v1.24.3+k3s1\",\"K3sArgs\":[\"--disable=traefik\"]}","runtime":"docker"}

However:

> sudo cat /etc/colima/colima.json
... hangs indefinitely ...

Basically everything that I try to do with sudo hangs. Doing a regular kill -term doesn't help. Tried some other signals as well, but need sudo for -9 I guess.

Not sure if any of this is the cause or the effect, but figured I'd write it down and see if it helps in figuring out this issue.
