Machine hung on starting, and stop didn't work #18662

Closed
deboer-tim opened this issue May 23, 2023 · 16 comments · Fixed by #19396
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.
  • machine
  • podman-desktop
  • remote: Problem is in podman-remote.

Comments

@deboer-tim

Issue Description

I had a podman machine hang indefinitely while starting. I don't know if it was random, but I was on a very low bandwidth internet connection at the time.

The stop command said it stopped the machine, but it didn't. As expected, start didn't do anything either. In the end I had to find and kill the process myself.

Steps to reproduce the issue

Unknown.

Describe the results you received

% podman machine list                     
NAME                  VM TYPE     CREATED      LAST UP             CPUS        MEMORY      DISK SIZE
podman-machine-demo*  qemu        4 days ago   Currently starting  4           12.58GB     107.4GB
podman-machine-alt    qemu        2 weeks ago  4 days ago          4           9.98GB      107.4GB
podman-45             qemu        2 weeks ago  11 days ago         4           11.22GB     107.4GB
% podman machine stop podman-machine-demo 
Machine "podman-machine-demo" stopped successfully
% podman machine list                    
NAME                  VM TYPE     CREATED      LAST UP             CPUS        MEMORY      DISK SIZE
podman-machine-demo*  qemu        4 days ago   Currently starting  4           12.58GB     107.4GB
podman-machine-alt    qemu        2 weeks ago  4 days ago          4           9.98GB      107.4GB
podman-45             qemu        2 weeks ago  11 days ago         4           11.22GB     107.4GB
% podman machine start podman-machine-alt
Error: cannot start VM podman-machine-alt. VM podman-machine-demo is currently running or starting: only one VM can be active at a time
% podman machine list                    
NAME                  VM TYPE     CREATED      LAST UP             CPUS        MEMORY      DISK SIZE
podman-machine-demo*  qemu        4 days ago   Currently starting  4           12.58GB     107.4GB
podman-machine-alt    qemu        2 weeks ago  4 days ago          4           9.98GB      107.4GB
podman-45             qemu        2 weeks ago  11 days ago         4           11.22GB     107.4GB

Describe the results you expected

It would be nice if 'starting' always led to the 'started' state, or failed after a timeout. Either way, the stop command should always stop the process (and never report that it stopped when it didn't), even if it has to kill it in the background.

podman info output

% podman info
host:
  arch: arm64
  buildahVersion: 1.30.0
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 99.25
    systemPercent: 0.45
    userPercent: 0.3
  cpus: 4
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: coreos
    version: "38"
  eventLogger: journald
  hostname: localhost.localdomain
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
    uidmap:
    - container_id: 0
      host_id: 501
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
  kernel: 6.2.15-300.fc38.aarch64
  linkmode: dynamic
  logDriver: journald
  memFree: 11486474240
  memTotal: 12208103424
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.8.4-1.fc38.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.4
      commit: 5a8fa99a5e41facba2eda4af12fa26313918805b
      rundir: /run/user/501/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/501/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-12.fc38.aarch64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 0
  swapTotal: 0
  uptime: 0h 7m 24.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /var/home/core/.config/containers/storage.conf
  containerStore:
    number: 7
    paused: 0
    running: 0
    stopped: 7
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/home/core/.local/share/containers/storage
  graphRootAllocated: 106769133568
  graphRootUsed: 7990566912
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 10
  runRoot: /run/user/501/containers
  transientStore: false
  volumePath: /var/home/core/.local/share/containers/storage/volumes
version:
  APIVersion: 4.5.0
  Built: 1681486872
  BuiltTime: Fri Apr 14 11:41:12 2023
  GitCommit: ""
  GoVersion: go1.20.2
  Os: linux
  OsArch: linux/arm64
  Version: 4.5.0

Podman in a container

No

Privileged Or Rootless

None

Upstream Latest Release

Yes

Additional environment details

Additional information

@deboer-tim deboer-tim added the kind/bug Categorizes issue or PR as related to a bug. label May 23, 2023
@github-actions github-actions bot added the remote Problem is in podman-remote label May 23, 2023
@Luap99 Luap99 added the machine label May 23, 2023
@Aenimus

Aenimus commented May 23, 2023

Hi @deboer-tim,

I forgot to check the existing issues when I created my issue:
#18666

I think we had/have the same issue, if I am not mistaken? But I'm on macOS rather than Linux.

@ashley-cui
Member

I've been seeing this recently as well; the only thing that fixes it is killing the qemu process. What qemu version are you using?

@deboer-tim
Author

Looks like I'm on 8.0.0.

@deboer-tim
Author

FWIW this happened again yesterday, only solution was killing qemu.

@vrothberg
Member

I think the state transitions of podman machine are not correct. In order to determine whether a machine is currently running or "starting", CheckExclusiveActiveVM() is used. This function returns the first VM that is either running or starting and matches that against the to-be-started VM, see:

if active {
    if vmName == activeName {
        return fmt.Errorf("cannot start VM %s: %w", vmName, machine.ErrVMAlreadyRunning)
    }
    return fmt.Errorf("cannot start VM %s. VM %s is currently running or starting: %w", vmName, activeName, machine.ErrMultipleActiveVM)
}

There are a couple of issues:

  • Since CheckExclusiveActiveVM() does not return all running/starting VMs but only the first one it finds, machine start may very well start an already running machine.
  • There is no way to recover from the "starting" state. It's being written early during start, see
    v.Starting = true
    if err := v.writeConfig(); err != nil {
        return fmt.Errorf("writing JSON file: %w", err)
    }
    and if the machine process gets killed, we're stuck in this state.
  • The single .Starting bool is not enough for such state transitions. We need locks for concurrent processes to block. If we run podman start and get the lock and the machine is still in .Starting == true, we know something went wrong.

I assume this problem applies to providers other than QEMU as well. @n1hility @ashley-cui WDYT?
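
To make the last bullet concrete, here is a minimal sketch of what a per-machine lock could look like (Unix-only flock; the lock directory, file name, and helper names are assumptions for illustration, not podman's actual API):

```go
// Sketch: serialize start/stop of one machine with an exclusive file lock.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

type machineLock struct{ f *os.File }

// acquireMachineLock takes an exclusive flock on a per-machine lock file so
// that concurrent starts/stops of the same machine serialize instead of
// racing on the JSON config.
func acquireMachineLock(lockDir, name string) (*machineLock, error) {
	path := filepath.Join(lockDir, name+".lock")
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		f.Close()
		return nil, fmt.Errorf("locking %s: %w", path, err)
	}
	return &machineLock{f: f}, nil
}

func (l *machineLock) release() {
	syscall.Flock(int(l.f.Fd()), syscall.LOCK_UN)
	l.f.Close()
}

func main() {
	lock, err := acquireMachineLock(os.TempDir(), "podman-machine-demo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer lock.release()
	// With the lock held, finding Starting == true in the machine's config
	// can only mean a previous start died mid-way, so the caller could reset
	// the state (or guide the user) instead of refusing to act.
	fmt.Println("lock held; safe to inspect and mutate machine state")
}
```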

@vrothberg
Member

NOTE: to reproduce a CTRL+C is not sufficient. The process must actually get killed, e.g. via killall -9 podman.

@vrothberg
Member

machine stop does work in my environment with the latest version of Podman though.

@ashley-cui
Member

ashley-cui commented Jul 26, 2023

Thanks for looking into it @vrothberg!

There are a couple of issues:

  • Since CheckExclusiveActiveVM() does not return all running/starting VMs but only the first one it finds, machine start may very well start an already running machine.

I believe the purpose of CheckExclusiveActiveVM() is to check if there is ANY vm running, as opposed to which vm is running. Podman machine (on Mac and Linux) only allows one machine to be running, so it is sufficient to return only the first starting/running VM found. If a podman machine start is issued on a different machine while another is running, it should fail. I'm not sure the bug is here.

  • There is no way to recover from the "starting" state. It's being written early during start, see
    and if the machine process gets killed, we're stuck in this state.

I believe a podman machine stop will fix incorrect state if the machine gets stuck in starting.

  • The single .Starting bool is not enough for such state transitions. We need locks for concurrent processes to block. If we run podman start and get the lock and the machine is still in .Starting == true, we know something went wrong.

This sounds good to me

I assume this problem applies to other providers than QEMU. @n1hility @ashley-cui WDYT?

The interesting thing is that Windows allows more than one machine to be running at a time. There were disagreements on whether this was a bug, and if this should be allowed on all platforms, but that's another discussion.

I think there were actually two different bugs here that looked like one. One was the stuck-in-starting bug, which was fixed in stop. The other was that the QEMU process becomes unresponsive to machine commands. This happens when the QMP socket cannot be pinged. Podman machine assumes that if the QMP socket is not up, the machine is down, so stop doesn't do anything; but that assumption doesn't hold when this unresponsiveness bug pops up.

// check if the qmp socket is there. if not, qemu instance is gone
if _, err := os.Stat(v.QMPMonitor.Address.GetPath()); errors.Is(err, fs.ErrNotExist) {
    // Right now it is NOT an error to stop a stopped machine
    logrus.Debugf("QMP monitor socket %v does not exist", v.QMPMonitor.Address)
    // Fix incorrect starting state in case of crash during start
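
For illustration only (this is not podman's actual check, and the path and timeout are made up): a stop path could additionally try to connect to the QMP socket rather than only stat it, since a wedged qemu can leave the socket file in place.

```go
// Sketch: probe the QMP socket instead of only checking that the file exists.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func qmpSocketResponsive(path string) bool {
	// Stat only proves the socket file exists; a hung qemu may leave it behind.
	if _, err := os.Stat(path); err != nil {
		return false
	}
	// Try to actually connect. A fuller check would also complete the QMP
	// greeting, since a wedged qemu might still accept the connection.
	conn, err := net.DialTimeout("unix", path, 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	fmt.Println(qmpSocketResponsive("/path/to/qmp.sock"))
}
```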

@vrothberg
Member

I believe a podman machine stop will fix incorrect state (https://github.com/containers/podman/pull/18328) if the machine gets stuck in starting.

I can confirm that (as mentioned in #18662 (comment)). stop must have changed in the meantime.

The interesting thing is that Windows allows more than one machine (#18415) to be running at a time. There were disagreements on whether this was a bug, and if this should be allowed on all platforms, but that's another discussion.

I think it's a legitimate expectation to have more than one running. CheckExclusiveActiveVM() should be able to check whether a given VM is running or not. With a locking mechanism the function wouldn't be needed anymore.

I am trying to find a solution to fix the start issue. IMO, all commands that alter a VM, such as start, stop, update, etc., should hold a write-lock. This will avoid races between the commands and also allow detecting whether a command has failed.

BUT: that would be a lot of work and there's no CI to check whether the changes cause regressions. We could start with QEMU, but that would increase divergence among the providers.

@rhatdan
Member

rhatdan commented Jul 26, 2023

I just want to point out that I want to get to the point that a user could have multiple machines running at the same time.

Imagine on a Mac M2 you have an ARM machine and an x86_64 machine running, and then you use the podman buildfarm command to build a multi-arch image.

@n1hility
Member

The interesting thing is that Windows allows more than one machine to be running at a time. There were disagreements on whether this was a bug, and if this should be allowed on all platforms, but that's another discussion.

I just posted a reply on the discussion with details (#18415 (comment)), the TL;DR is it was intentional to support multiple machines running in parallel.

@n1hility
Member

CheckExclusiveActiveVM() should be able to check whether a given VM is running or not. With a locking mechanism the function wouldn't be needed anymore.

I am trying to find a solution to fix the start issue. IMO, all commands that alter a VM, such as start, stop, update, etc., should hold a write-lock. This will avoid races between the commands and also allow detecting whether a command has failed.

Keep in mind that the write-lock won't fully cover all cases, since the podman machine command could crash or be killed mid-start (e.g. system shutdown), and the qemu or gvproxy process can always fail immediately after a state is written. So while I agree it makes sense to use file locking as a safeguard to serialize start/stop on the same machine name, all of the commands should be able to handle and recover from inaccurate state. For example, start() (if it hasn't already been changed to do so) should double-check that qemu is running at the expected pid even if the state file says Started.
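
A small sketch of that "double-check the pid" idea (the pid would come from the machine's state file; names are illustrative, not podman's code):

```go
// Sketch: probe whether a process is alive at the recorded pid using signal 0.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func pidAlive(pid int) bool {
	proc, err := os.FindProcess(pid) // on Unix this never fails
	if err != nil {
		return false
	}
	// Signal 0 performs existence/permission checks without delivering a
	// signal, so it is a cheap liveness probe for the qemu/gvproxy pid.
	return proc.Signal(syscall.Signal(0)) == nil
}

func main() {
	fmt.Println(pidAlive(os.Getpid())) // true: the current process exists
}
```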

BUT: that would be a lot of work and there's no CI to check whether the changes cause regressions. We could start with QEMU, but that would increase divergence among the providers.

IMO it wouldn't be too bad to add some machine-name-specific flock / acquire / wait guards amongst the providers. Let me know if you need a hand.

@n1hility
Member

@vrothberg BTW there is an unlikely race in the qemu / gvproxy dance that I have been aware of but haven't gotten around to fixing. I'll try to push that up as a PR today in case it helps (I doubt it, but just in case).

@vrothberg
Member

vrothberg commented Jul 26, 2023

Keep in mind that the write-lock won't fully cover all cases, since the podman machine command could crash or be killed mid-start (e.g. system shutdown), and the qemu or gvproxy process can always fail immediately after a state is written. So while I agree it makes sense to use file locking as a safeguard to serialize start/stop on the same machine name, all of the commands should be able to handle and recover from inaccurate state. For example, start() (if it hasn't already been changed to do so) should double-check that qemu is running at the expected pid even if the state file says Started.

Absolutely agree on that. As mentioned above, the locking is meant to serialize and to give us certain assumptions. Currently, two simultaneously running starts could interfere with each other. With locks, we can rely on the assumption/invariant that "starting" implies a previous start went wrong. Then we can, for instance, clean up leftovers of the previous failed start and restart.

@vrothberg
Member

@vrothberg BTW there is an unlikely race in the qemu / gvproxy dance that I have been aware of but haven't gotten around to fixing. I'll try to push that up as a PR today in case it helps (I doubt it, but just in case).

Thanks, @n1hility !

vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 27, 2023
Lock the VM on start.  If the machine is in the "starting" state we know
that a previous start has failed and guide the user into resolving the
issue.

[NO NEW TESTS NEEDED]

Fixes: containers#18662
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 27, 2023
Lock the VM on start.  If the machine is in the "starting" state we know
that a previous start has failed and guide the user into resolving the
issue.

Concurrent starts will busy wait and return the expected "already
running" error.

NOTE: this change is only looking at the start issue (containers#18662).  Other
commands such as stop and update should also lock and will be updated
in a future change.  I expect the underlying issue to apply to all
machine providers, not only QEMU.  It's desirable to aim for extending
the machine interface to also allow to `Lock()` and `Unlock()`.  After
acquiring the lock, the VM should automatically be reloaded/updated.

[NO NEW TESTS NEEDED]

Fixes: containers#18662
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@vrothberg
Member

Opened #19396

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Oct 26, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2023