Stopping LXD should signal instances early for shutdown #13460

simondeziel · 2024-05-08T17:52:09Z

I have a long running operation it seems (refreshing many images?):

$ lxc operation list; TZ=UTC date
+--------------------------------------+------+-----------------+---------+------------+----------------------+
|                  ID                  | TYPE |   DESCRIPTION   | STATUS  | CANCELABLE |       CREATED        |
+--------------------------------------+------+-----------------+---------+------------+----------------------+
| 87f6266a-1559-4e68-a04b-144aa48c3a5b | TASK | Updating images | RUNNING | NO         | 2024/05/08 16:58 UTC |
+--------------------------------------+------+-----------------+---------+------------+----------------------+
Wed May  8 17:29:01 UTC 2024

And a VM (v1) running.

If I try to stop LXD (sudo snap stop lxd) it seems that LXD will wait ~5 minutes for the operation to complete before asking the instance to stop. This grace period waiting for the operation(s) to finish is taking away a lot of time for the instance(s) to do an orderly shutdown. The snap stop logic is:

# Shutdown the daemons
## LXD
echo "=> Stopping LXD (with instance shutdown)"

echo host-shutdown > "${SNAP_COMMON}/state"
if [ -n "${PID}" ] && kill -0 "${PID}" 2>/dev/null; then
    if ! kill -30 "${PID}"; then
        echo "==> Failed to signal LXD to shutdown"
    fi

    DEAD=0
    for _ in $(seq 540); do
        if ! kill -0 "${PID}" 2>/dev/null; then
            DEAD=1
            echo "==> Stopped LXD"
            break
        fi
        sleep 1
    done

    if [ "${DEAD}" = "0" ]; then
        echo "==> Forcefully stopping LXD after 9 minutes wait"
        if kill -9 "${PID}" 2>/dev/null; then
            echo "==> Stopped LXD"
        else
            echo "==> Failed to stop LXD"
        fi
    fi
fi

So after sending signal 30, we have 540 seconds (9 minutes) before a hard kill (9) is sent. Waiting 5 minutes before asking the instances to cleanly stop means they have to do so inside of ~4 minutes. This is likely too little if there are many Windows VMs potentially applying Windows Updates during shutdown (=> CPU/IO congestion).

Here is the lxc monitor --pretty during the stop:

INFO   [2024-05-08T13:30:26-04:00] Received signal                               signal="power failure"
INFO   [2024-05-08T13:30:26-04:00] Starting shutdown sequence                    signal="power failure"
ERROR  [2024-05-08T13:30:26-04:00] Failed updating images                        err="context canceled"
INFO   [2024-05-08T13:30:26-04:00] Waiting for 1 operation(s) to finish 
INFO   [2024-05-08T13:31:26-04:00] Waiting for 1 operation(s) to finish         
INFO   [2024-05-08T13:32:26-04:00] Waiting for 1 operation(s) to finish         
INFO   [2024-05-08T13:33:26-04:00] Waiting for 1 operation(s) to finish         
INFO   [2024-05-08T13:34:26-04:00] Waiting for 1 operation(s) to finish         
INFO   [2024-05-08T13:35:26-04:00] Stopping daemon storage volumes              
DEBUG  [2024-05-08T13:35:26-04:00] UnmountCustomVolume started                   driver=zfs pool=default project=default volName=backups
DEBUG  [2024-05-08T13:35:26-04:00] UnmountCustomVolume finished                  driver=zfs pool=default project=default volName=backups
DEBUG  [2024-05-08T13:35:26-04:00] Unmounted ZFS dataset                         dev=default/custom/default_backups driver=zfs path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_backups pool=default volName=default_backups
DEBUG  [2024-05-08T13:35:26-04:00] UnmountCustomVolume started                   driver=zfs pool=default project=default volName=images
DEBUG  [2024-05-08T13:35:26-04:00] Failed to unmount                             attempt=0 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:26-04:00] Failed to unmount                             attempt=1 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:27-04:00] Failed to unmount                             attempt=2 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:27-04:00] Failed to unmount                             attempt=3 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:28-04:00] Failed to unmount                             attempt=4 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:28-04:00] Failed to unmount                             attempt=5 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:29-04:00] Failed to unmount                             attempt=6 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:29-04:00] Failed to unmount                             attempt=7 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:30-04:00] Failed to unmount                             attempt=8 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:30-04:00] Failed to unmount                             attempt=9 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:31-04:00] Failed to unmount                             attempt=10 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:31-04:00] Failed to unmount                             attempt=11 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:32-04:00] Failed to unmount                             attempt=12 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:32-04:00] Failed to unmount                             attempt=13 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:33-04:00] Failed to unmount                             attempt=14 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:33-04:00] Failed to unmount                             attempt=15 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:34-04:00] Failed to unmount                             attempt=16 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:34-04:00] Failed to unmount                             attempt=17 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:35-04:00] Failed to unmount                             attempt=18 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:35-04:00] Failed to unmount                             attempt=19 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
ERROR  [2024-05-08T13:35:36-04:00] Failed to unmount image and backup volumes    err="Failed to unmount images storage: Failed to unmount storage volume \"default/images\": Failed to unmount \"/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images\": device or resource busy"
DEBUG  [2024-05-08T13:35:36-04:00] UnmountCustomVolume finished                  driver=zfs pool=default project=default volName=images
DEBUG  [2024-05-08T13:35:36-04:00] Shutdown started                              instance=v1 instanceType=virtual-machine project=default timeout=30s
DEBUG  [2024-05-08T13:35:36-04:00] Instance operation lock created               action=stop instance=v1 project=default reusable=true
DEBUG  [2024-05-08T13:35:36-04:00] Shutdown request sent to instance             instance=v1 instanceType=virtual-machine project=default
DEBUG  [2024-05-08T13:35:37-04:00] Waiting for VM process to finish              instance=v1 instanceType=virtual-machine project=default
DEBUG  [2024-05-08T13:35:37-04:00] Instance stopped                              instance=v1 instanceType=virtual-machine project=default reason=guest-shutdown target=stop
DEBUG  [2024-05-08T13:35:37-04:00] onStop hook started                           instance=v1 instanceType=virtual-machine project=default target=stop
DEBUG  [2024-05-08T13:35:37-04:00] Instance operation lock inherited for stop    action=stop instance=v1 instanceType=virtual-machine project=default
DEBUG  [2024-05-08T13:35:37-04:00] QMP monitor stopped                           path=/var/snap/lxd/common/lxd/logs/v1/qemu.monitor
DEBUG  [2024-05-08T13:35:37-04:00] VM process finished                           instance=v1 instanceType=virtual-machine project=default
DEBUG  [2024-05-08T13:35:37-04:00] Stopping device                               device=root instance=v1 instanceType=virtual-machine project=default type=disk
DEBUG  [2024-05-08T13:35:37-04:00] Stopping device                               device=eth0 instance=v1 instanceType=virtual-machine project=default type=nic
DEBUG  [2024-05-08T13:35:37-04:00] UnmountInstance started                       driver=zfs instance=v1 pool=default project=default
DEBUG  [2024-05-08T13:35:37-04:00] Unmounted ZFS dataset                         dev=default/virtual-machines/v1 driver=zfs path=/var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/v1 pool=default volName=v1
DEBUG  [2024-05-08T13:35:37-04:00] UnmountInstance finished                      driver=zfs instance=v1 pool=default project=default
DEBUG  [2024-05-08T13:35:37-04:00] Deactivated ZFS volume                        dev=default/virtual-machines/v1.block driver=zfs pool=default volName=v1
DEBUG  [2024-05-08T13:35:38-04:00] onStop hook finished                          instance=v1 instanceType=virtual-machine project=default target=stop
DEBUG  [2024-05-08T13:35:38-04:00] Shutdown finished                             instance=v1 instanceType=virtual-machine project=default timeout=30s
DEBUG  [2024-05-08T13:35:38-04:00] Instance operation lock finished              action=stop err="<nil>" instance=v1 project=default reusable=true
INFO   [2024-05-08T13:35:38-04:00] Action: instance-shutdown, Source: /1.0/instances/v1 
INFO   [2024-05-08T13:35:38-04:00] Stopping networks                            
DEBUG  [2024-05-08T13:35:38-04:00] Stop                                          driver=bridge network=juju0 project=default
DEBUG  [2024-05-08T13:35:38-04:00] Deleting firewall                             driver=bridge network=juju0 project=default
DEBUG  [2024-05-08T13:35:38-04:00] Stop                                          driver=bridge network=lxdbr0 project=default
DEBUG  [2024-05-08T13:35:38-04:00] Deleting firewall                             driver=bridge network=lxdbr0 project=default
INFO   [2024-05-08T13:35:38-04:00] Stopping storage pools                       
DEBUG  [2024-05-08T13:35:38-04:00] Unmount started                               driver=zfs pool=default
ERROR  [2024-05-08T13:35:38-04:00] Unable to unmount storage pool                err="Failed to run: zpool export default: exit status 1 (cannot unmount '/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images': pool or dataset is busy)" pool=default
DEBUG  [2024-05-08T13:35:38-04:00] Unmount finished                              driver=zfs pool=default
DEBUG  [2024-05-08T13:35:38-04:00] Unmount started                               driver=btrfs pool=juju-btrfs
DEBUG  [2024-05-08T13:35:38-04:00] Unmount finished                              driver=btrfs pool=juju-btrfs
DEBUG  [2024-05-08T13:35:38-04:00] Unmount started                               driver=zfs pool=juju-zfs
DEBUG  [2024-05-08T13:35:39-04:00] Cancel ongoing or future gRPC connection attempts 
DEBUG  [2024-05-08T13:35:39-04:00] Unmount finished                              driver=zfs pool=juju-zfs
INFO   [2024-05-08T13:35:39-04:00] Closing the database                         
INFO   [2024-05-08T13:35:39-04:00] Stop database gateway                        
INFO   [2024-05-08T13:35:39-04:00] Unmounting temporary filesystems             
INFO   [2024-05-08T13:35:39-04:00] Closing socket                                socket=/var/snap/lxd/common/lxd/devlxd/sock type="devlxd socket"
INFO   [2024-05-08T13:35:39-04:00] Closing socket                                socket="host(2):24419" type="VM socket"
INFO   [2024-05-08T13:35:39-04:00] Done unmounting temporary filesystems        
INFO   [2024-05-08T13:35:39-04:00] Closing socket                                socket=/var/snap/lxd/common/lxd/unix.socket type="REST API Unix socket"
Error: websocket: close 1006 (abnormal closure): unexpected EOF

$ journalctl -fu snap.lxd.daemon
...
May 08 13:30:26 sdeziel-lemur systemd[1]: Stopping Service for snap application lxd.daemon...
May 08 13:30:26 sdeziel-lemur lxd.daemon[2568375]: => Stop reason is: host shutdown
May 08 13:30:26 sdeziel-lemur lxd.daemon[2568375]: => Stopping LXD (with instance shutdown)
May 08 13:30:26 sdeziel-lemur lxd.daemon[2564894]: time="2024-05-08T13:30:26-04:00" level=error msg="Failed updating images" err="context canceled"
May 08 13:35:36 sdeziel-lemur lxd.daemon[2564894]: time="2024-05-08T13:35:36-04:00" level=error msg="Failed to unmount image and backup volumes" err="Failed to unmount images storage: Failed to unmount storage volume \"default/images\": Failed to unmount \"/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images\": device or resource busy"
May 08 13:35:38 sdeziel-lemur lxd.daemon[2564894]: time="2024-05-08T13:35:38-04:00" level=error msg="Unable to unmount storage pool" err="Failed to run: zpool export default: exit status 1 (cannot unmount '/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images': pool or dataset is busy)" pool=default
May 08 13:35:39 sdeziel-lemur lxd.daemon[2564746]: => LXD exited cleanly
May 08 13:35:39 sdeziel-lemur lxd.daemon[2568375]: ==> Stopped LXD
May 08 13:35:39 sdeziel-lemur lxd.daemon[2568375]: => Stopping LXCFS
May 08 13:35:39 sdeziel-lemur lxd.daemon[2564882]: Running destructor lxcfs_exit
May 08 13:35:39 sdeziel-lemur lxd.daemon[2568375]: ==> Stopped LXCFS
May 08 13:35:39 sdeziel-lemur lxd.daemon[2568375]: => Cleaning up PID files
May 08 13:35:39 sdeziel-lemur lxd.daemon[2568375]: => Cleaning up namespaces
May 08 13:35:39 sdeziel-lemur lxd.daemon[2568375]: => All done
May 08 13:35:39 sdeziel-lemur systemd[1]: snap.lxd.daemon.service: Deactivated successfully.
May 08 13:35:39 sdeziel-lemur systemd[1]: Stopped Service for snap application lxd.daemon.

Note: the umount issue is probably unrelated, except maybe for causing the refresh operation to take ~forever.

Additional information:

$ snap list lxd
Name  Version      Rev    Tracking     Publisher   Notes
lxd   git-696b610  28614  latest/edge  canonical✓  -

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stopping LXD should signal instances early for shutdown #13460

Stopping LXD should signal instances early for shutdown #13460

simondeziel commented May 8, 2024

Stopping LXD should signal instances early for shutdown #13460

Stopping LXD should signal instances early for shutdown #13460

Comments

simondeziel commented May 8, 2024