Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stopping LXD should signal instances early for shutdown #13460

Open
simondeziel opened this issue May 8, 2024 · 0 comments
Open

Stopping LXD should signal instances early for shutdown #13460

simondeziel opened this issue May 8, 2024 · 0 comments

Comments

@simondeziel
Copy link
Member

I have a long running operation it seems (refreshing many images?):

$ lxc operation list; TZ=UTC date
+--------------------------------------+------+-----------------+---------+------------+----------------------+
|                  ID                  | TYPE |   DESCRIPTION   | STATUS  | CANCELABLE |       CREATED        |
+--------------------------------------+------+-----------------+---------+------------+----------------------+
| 87f6266a-1559-4e68-a04b-144aa48c3a5b | TASK | Updating images | RUNNING | NO         | 2024/05/08 16:58 UTC |
+--------------------------------------+------+-----------------+---------+------------+----------------------+
Wed May  8 17:29:01 UTC 2024

And a VM (v1) running.

If I try to stop LXD (sudo snap stop lxd) it seems that LXD will wait ~5 minutes for the operation to complete before asking the instance to stop. This grace period waiting for the operation(s) to finish is taking away a lot of time for the instance(s) to do an orderly shutdown. The snap stop logic is:

# Shutdown the daemons
## LXD
echo "=> Stopping LXD (with instance shutdown)"

echo host-shutdown > "${SNAP_COMMON}/state"
if [ -n "${PID}" ] && kill -0 "${PID}" 2>/dev/null; then
    if ! kill -30 "${PID}"; then
        echo "==> Failed to signal LXD to shutdown"
    fi

    DEAD=0
    for _ in $(seq 540); do
        if ! kill -0 "${PID}" 2>/dev/null; then
            DEAD=1
            echo "==> Stopped LXD"
            break
        fi
        sleep 1
    done

    if [ "${DEAD}" = "0" ]; then
        echo "==> Forcefully stopping LXD after 9 minutes wait"
        if kill -9 "${PID}" 2>/dev/null; then
            echo "==> Stopped LXD"
        else
            echo "==> Failed to stop LXD"
        fi
    fi
fi

So after sending signal 30, we have 540 seconds (9 minutes) before a hard kill (9) is sent. Waiting 5 minutes before asking the instances to cleanly stop means they have to do so inside of ~4 minutes. This is likely too little if there are many Windows VMs potentially applying Windows Updates during shutdown (=> CPU/IO congestion).

Here is the lxc monitor --pretty during the stop:

INFO   [2024-05-08T13:30:26-04:00] Received signal                               signal="power failure"
INFO   [2024-05-08T13:30:26-04:00] Starting shutdown sequence                    signal="power failure"
ERROR  [2024-05-08T13:30:26-04:00] Failed updating images                        err="context canceled"
INFO   [2024-05-08T13:30:26-04:00] Waiting for 1 operation(s) to finish 
INFO   [2024-05-08T13:31:26-04:00] Waiting for 1 operation(s) to finish         
INFO   [2024-05-08T13:32:26-04:00] Waiting for 1 operation(s) to finish         
INFO   [2024-05-08T13:33:26-04:00] Waiting for 1 operation(s) to finish         
INFO   [2024-05-08T13:34:26-04:00] Waiting for 1 operation(s) to finish         
INFO   [2024-05-08T13:35:26-04:00] Stopping daemon storage volumes              
DEBUG  [2024-05-08T13:35:26-04:00] UnmountCustomVolume started                   driver=zfs pool=default project=default volName=backups
DEBUG  [2024-05-08T13:35:26-04:00] UnmountCustomVolume finished                  driver=zfs pool=default project=default volName=backups
DEBUG  [2024-05-08T13:35:26-04:00] Unmounted ZFS dataset                         dev=default/custom/default_backups driver=zfs path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_backups pool=default volName=default_backups
DEBUG  [2024-05-08T13:35:26-04:00] UnmountCustomVolume started                   driver=zfs pool=default project=default volName=images
DEBUG  [2024-05-08T13:35:26-04:00] Failed to unmount                             attempt=0 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:26-04:00] Failed to unmount                             attempt=1 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:27-04:00] Failed to unmount                             attempt=2 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:27-04:00] Failed to unmount                             attempt=3 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:28-04:00] Failed to unmount                             attempt=4 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:28-04:00] Failed to unmount                             attempt=5 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:29-04:00] Failed to unmount                             attempt=6 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:29-04:00] Failed to unmount                             attempt=7 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:30-04:00] Failed to unmount                             attempt=8 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:30-04:00] Failed to unmount                             attempt=9 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:31-04:00] Failed to unmount                             attempt=10 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:31-04:00] Failed to unmount                             attempt=11 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:32-04:00] Failed to unmount                             attempt=12 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:32-04:00] Failed to unmount                             attempt=13 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:33-04:00] Failed to unmount                             attempt=14 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:33-04:00] Failed to unmount                             attempt=15 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:34-04:00] Failed to unmount                             attempt=16 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:34-04:00] Failed to unmount                             attempt=17 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:35-04:00] Failed to unmount                             attempt=18 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
DEBUG  [2024-05-08T13:35:35-04:00] Failed to unmount                             attempt=19 err="device or resource busy" path=/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images
ERROR  [2024-05-08T13:35:36-04:00] Failed to unmount image and backup volumes    err="Failed to unmount images storage: Failed to unmount storage volume \"default/images\": Failed to unmount \"/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images\": device or resource busy"
DEBUG  [2024-05-08T13:35:36-04:00] UnmountCustomVolume finished                  driver=zfs pool=default project=default volName=images
DEBUG  [2024-05-08T13:35:36-04:00] Shutdown started                              instance=v1 instanceType=virtual-machine project=default timeout=30s
DEBUG  [2024-05-08T13:35:36-04:00] Instance operation lock created               action=stop instance=v1 project=default reusable=true
DEBUG  [2024-05-08T13:35:36-04:00] Shutdown request sent to instance             instance=v1 instanceType=virtual-machine project=default
DEBUG  [2024-05-08T13:35:37-04:00] Waiting for VM process to finish              instance=v1 instanceType=virtual-machine project=default
DEBUG  [2024-05-08T13:35:37-04:00] Instance stopped                              instance=v1 instanceType=virtual-machine project=default reason=guest-shutdown target=stop
DEBUG  [2024-05-08T13:35:37-04:00] onStop hook started                           instance=v1 instanceType=virtual-machine project=default target=stop
DEBUG  [2024-05-08T13:35:37-04:00] Instance operation lock inherited for stop    action=stop instance=v1 instanceType=virtual-machine project=default
DEBUG  [2024-05-08T13:35:37-04:00] QMP monitor stopped                           path=/var/snap/lxd/common/lxd/logs/v1/qemu.monitor
DEBUG  [2024-05-08T13:35:37-04:00] VM process finished                           instance=v1 instanceType=virtual-machine project=default
DEBUG  [2024-05-08T13:35:37-04:00] Stopping device                               device=root instance=v1 instanceType=virtual-machine project=default type=disk
DEBUG  [2024-05-08T13:35:37-04:00] Stopping device                               device=eth0 instance=v1 instanceType=virtual-machine project=default type=nic
DEBUG  [2024-05-08T13:35:37-04:00] UnmountInstance started                       driver=zfs instance=v1 pool=default project=default
DEBUG  [2024-05-08T13:35:37-04:00] Unmounted ZFS dataset                         dev=default/virtual-machines/v1 driver=zfs path=/var/snap/lxd/common/lxd/storage-pools/default/virtual-machines/v1 pool=default volName=v1
DEBUG  [2024-05-08T13:35:37-04:00] UnmountInstance finished                      driver=zfs instance=v1 pool=default project=default
DEBUG  [2024-05-08T13:35:37-04:00] Deactivated ZFS volume                        dev=default/virtual-machines/v1.block driver=zfs pool=default volName=v1
DEBUG  [2024-05-08T13:35:38-04:00] onStop hook finished                          instance=v1 instanceType=virtual-machine project=default target=stop
DEBUG  [2024-05-08T13:35:38-04:00] Shutdown finished                             instance=v1 instanceType=virtual-machine project=default timeout=30s
DEBUG  [2024-05-08T13:35:38-04:00] Instance operation lock finished              action=stop err="<nil>" instance=v1 project=default reusable=true
INFO   [2024-05-08T13:35:38-04:00] Action: instance-shutdown, Source: /1.0/instances/v1 
INFO   [2024-05-08T13:35:38-04:00] Stopping networks                            
DEBUG  [2024-05-08T13:35:38-04:00] Stop                                          driver=bridge network=juju0 project=default
DEBUG  [2024-05-08T13:35:38-04:00] Deleting firewall                             driver=bridge network=juju0 project=default
DEBUG  [2024-05-08T13:35:38-04:00] Stop                                          driver=bridge network=lxdbr0 project=default
DEBUG  [2024-05-08T13:35:38-04:00] Deleting firewall                             driver=bridge network=lxdbr0 project=default
INFO   [2024-05-08T13:35:38-04:00] Stopping storage pools                       
DEBUG  [2024-05-08T13:35:38-04:00] Unmount started                               driver=zfs pool=default
ERROR  [2024-05-08T13:35:38-04:00] Unable to unmount storage pool                err="Failed to run: zpool export default: exit status 1 (cannot unmount '/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images': pool or dataset is busy)" pool=default
DEBUG  [2024-05-08T13:35:38-04:00] Unmount finished                              driver=zfs pool=default
DEBUG  [2024-05-08T13:35:38-04:00] Unmount started                               driver=btrfs pool=juju-btrfs
DEBUG  [2024-05-08T13:35:38-04:00] Unmount finished                              driver=btrfs pool=juju-btrfs
DEBUG  [2024-05-08T13:35:38-04:00] Unmount started                               driver=zfs pool=juju-zfs
DEBUG  [2024-05-08T13:35:39-04:00] Cancel ongoing or future gRPC connection attempts 
DEBUG  [2024-05-08T13:35:39-04:00] Unmount finished                              driver=zfs pool=juju-zfs
INFO   [2024-05-08T13:35:39-04:00] Closing the database                         
INFO   [2024-05-08T13:35:39-04:00] Stop database gateway                        
INFO   [2024-05-08T13:35:39-04:00] Unmounting temporary filesystems             
INFO   [2024-05-08T13:35:39-04:00] Closing socket                                socket=/var/snap/lxd/common/lxd/devlxd/sock type="devlxd socket"
INFO   [2024-05-08T13:35:39-04:00] Closing socket                                socket="host(2):24419" type="VM socket"
INFO   [2024-05-08T13:35:39-04:00] Done unmounting temporary filesystems        
INFO   [2024-05-08T13:35:39-04:00] Closing socket                                socket=/var/snap/lxd/common/lxd/unix.socket type="REST API Unix socket"
Error: websocket: close 1006 (abnormal closure): unexpected EOF
$ journalctl -fu snap.lxd.daemon
...
May 08 13:30:26 sdeziel-lemur systemd[1]: Stopping Service for snap application lxd.daemon...
May 08 13:30:26 sdeziel-lemur lxd.daemon[2568375]: => Stop reason is: host shutdown
May 08 13:30:26 sdeziel-lemur lxd.daemon[2568375]: => Stopping LXD (with instance shutdown)
May 08 13:30:26 sdeziel-lemur lxd.daemon[2564894]: time="2024-05-08T13:30:26-04:00" level=error msg="Failed updating images" err="context canceled"
May 08 13:35:36 sdeziel-lemur lxd.daemon[2564894]: time="2024-05-08T13:35:36-04:00" level=error msg="Failed to unmount image and backup volumes" err="Failed to unmount images storage: Failed to unmount storage volume \"default/images\": Failed to unmount \"/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images\": device or resource busy"
May 08 13:35:38 sdeziel-lemur lxd.daemon[2564894]: time="2024-05-08T13:35:38-04:00" level=error msg="Unable to unmount storage pool" err="Failed to run: zpool export default: exit status 1 (cannot unmount '/var/snap/lxd/common/lxd/storage-pools/default/custom/default_images': pool or dataset is busy)" pool=default
May 08 13:35:39 sdeziel-lemur lxd.daemon[2564746]: => LXD exited cleanly
May 08 13:35:39 sdeziel-lemur lxd.daemon[2568375]: ==> Stopped LXD
May 08 13:35:39 sdeziel-lemur lxd.daemon[2568375]: => Stopping LXCFS
May 08 13:35:39 sdeziel-lemur lxd.daemon[2564882]: Running destructor lxcfs_exit
May 08 13:35:39 sdeziel-lemur lxd.daemon[2568375]: ==> Stopped LXCFS
May 08 13:35:39 sdeziel-lemur lxd.daemon[2568375]: => Cleaning up PID files
May 08 13:35:39 sdeziel-lemur lxd.daemon[2568375]: => Cleaning up namespaces
May 08 13:35:39 sdeziel-lemur lxd.daemon[2568375]: => All done
May 08 13:35:39 sdeziel-lemur systemd[1]: snap.lxd.daemon.service: Deactivated successfully.
May 08 13:35:39 sdeziel-lemur systemd[1]: Stopped Service for snap application lxd.daemon.

Note: the umount issue is probably unrelated, except maybe for causing the refresh operation to take ~forever.

Additional information:

$ snap list lxd
Name  Version      Rev    Tracking     Publisher   Notes
lxd   git-696b610  28614  latest/edge  canonical✓  -
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant