Common flakiness caused by Failed to QGA guest-exec-status
#40
Hi, thanks for the report. That sounds fairly mysterious - I haven't seen that error before. Two things that would be useful to look at:
Retrying 30 times didn't do much, but it looks like it gave the system more time to spew extra information. I am not yet sure how to implement your second suggestion.
Based on this error I now wonder if the issue is actually in qemu (I don't think it's the kernel, since it fails sporadically on different kernels). I am also wondering if I am just starving it of memory or CPU resources. I still find it suspicious that it only began to fail for us on newer releases of vmtest, however. We have a separate CI workflow for testing on aarch64 that uses an older, forked version of vmtest, and that one never hits any issues. That might be a coincidence, though, and it could work because of aarch64 vs x86 rather than because of the older vmtest 🤷🏾‍♂️
Looking at:
It looks like your tests may be triggering some kind of kernel issue, or a qemu issue. There are some hints about RCU stall detection here: https://docs.kernel.org/RCU/stallwarn.html . Either way it doesn't look much like a vmtest issue.
I think
So it could be as simple as passing additional
Fwiw, I downgraded to vmtest 0.8.1 in my CI tests and everything passed without any issues (normally I see at least one kernel failure out of the 6 I test on). So while it is possible that the issue is in my tests, qemu, or the specific kernels I am testing on, it is something that wasn't being hit on older vmtest versions.
That's pretty odd. If you could bisect it down to the commit that broke it, that would help a lot.
Fwiw I merged the
Now that we can run `vmtest` within a given chroot, we could host the chroot for a foreign architecture and run a vmtest in that chroot using a kernel matching that architecture.

```
$ uname -a && cat /etc/lsb-release && RUST_LOG=debug cargo run -- -a aarch64 -k arm64/kbuild-output/arch/arm64/boot/Image.gz -r ../rootfs/ubuntu-lunar-arm64/ 'uname -a && cat /etc/lsb-release'
Linux surya 6.5.0-14-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 14:59:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=23.10
DISTRIB_CODENAME=mantic
DISTRIB_DESCRIPTION="Ubuntu 23.10"
    Finished dev [unoptimized + debuginfo] target(s) in 0.05s
     Running `target/debug/vmtest -a aarch64 -k arm64/kbuild-output/arch/arm64/boot/Image.gz -r ../rootfs/ubuntu-lunar-arm64/ 'uname -a && cat /etc/lsb-release'`
=> Image.gz
===> Booting
===> Setting up VM
===> Running command
Linux (none) 6.6.0-rc5-ga4a0c99f10ca-dirty #40 SMP Thu Oct 26 18:28:11 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=23.04
DISTRIB_CODENAME=lunar
DISTRIB_DESCRIPTION="Ubuntu 23.04"
```

Signed-off-by: Manu Bretelle <chantr4@gmail.com>
@nrxus do you have a repro you could share by any chance?
It is unclear which versions reproduce the issue. Your fork is on top of v0.9.0; are v0.8.2 and v0.8.3 also affected? I have seen this a few times during some of my tests. I believe it is likely due to resource exhaustion in the VM preventing the guest agent from responding in time. For instance, I could repro with:
other variants:
maybe ionice-ing would help?
Unfortunately I do not have a repro example I can share: I use it at my job, so I cannot share that code and it is not open source. We use vmtest in CI at work to run tests that exercise eBPF programs on different kernels. The issue being caused by running out of resources sounds plausible, although I haven't been able to find where that may be happening, as it fails only sometimes, on different kernels, and never when running the tests locally 😞
I looked into nice-ing qemu-ga, but as a side effect it would also run the commands launched by qemu-ga at high priority, so some fiddling needs to happen. Assuming coreutils are installed (which I am not sure we can assume in the case of image builds...), this is what I am thinking of: I will try to repro, but @nrxus, if you could also try this patch and confirm whether or not it helps, it would be appreciated.
When I tried that change with the example scripts I mentioned in #40 (comment), it did help a bit, though.
One way I have found to reproduce the issue is to saturate both the host (4 cores) and the VM (2 vCPUs) using the command: Saturating only one of the host/VM does not reproduce it. Now, if I run the host's stress command with a niceness of 19 (instead of the default 0), while the VM is sluggish, as seen when running something like:
the issue does not manifest. So basically, this seems to happen when both the host and the VM are overloaded. guest_exec_status returns the error after 5 seconds, which matches Line 80 in 3844d8b
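The failure mode can be demonstrated in isolation: with a read timeout set on a unix socket, a peer that stays silent for longer than the timeout turns the read into an error instead of a wait. This is a standalone sketch, not vmtest's actual code:

```rust
use std::io::Read;
use std::os::unix::net::UnixStream;
use std::time::Duration;

// The silent end of a socket pair stands in for a guest that is not
// being scheduled: the client's read hits the timeout and errors out.
fn read_with_timeout(timeout: Duration) -> std::io::ErrorKind {
    let (mut client, _silent_server) = UnixStream::pair().unwrap();
    client.set_read_timeout(Some(timeout)).unwrap();
    let mut buf = [0u8; 16];
    client.read(&mut buf).unwrap_err().kind()
}

fn main() {
    // On Linux the expired timeout surfaces as WouldBlock (EAGAIN).
    println!("{:?}", read_with_timeout(Duration::from_millis(100)));
}
```

In vmtest's case the timeout sits on the QGA stream, so a guest starved of CPU for longer than the read timeout makes guest-exec-status fail even though the command may still be running fine.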
With a combination of retries and a longer timeout, the VM would also show the same stack trace as in #40 (comment). Bumping the read timeout like crazy in the QgaWrapper is likely not a good approach, as it would affect every command run in the VM, including the ones used to set up the VM, for which we probably want to fail fast.
The same goes for the retry approach: we may not want to do a lot of retries for every command. On the other hand, prior to #27, the socket would block with no timeout. Something similar to https://gist.github.com/chantra/c09534363519d429018b095dcbb18cdd . In more detail: other commands are run in the VM later, so we would probably want to set the stream to no-timeout only during that call. @danobi any thoughts?
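A sketch of that idea (the helper name and signature are assumptions, not vmtest's actual API): clear the read timeout only around the wait for the guest command, then restore it so setup commands still fail fast.

```rust
use std::os::unix::net::UnixStream;
use std::time::Duration;

// Block indefinitely only while waiting on the guest command, then
// restore the previous timeout for subsequent (setup-style) reads.
fn with_blocking_read<T>(
    stream: &UnixStream,
    restore: Duration,
    wait_for_guest: impl FnOnce() -> T,
) -> std::io::Result<T> {
    stream.set_read_timeout(None)?; // no timeout during the command
    let result = wait_for_guest();
    stream.set_read_timeout(Some(restore))?; // fail fast again afterwards
    Ok(result)
}

fn main() -> std::io::Result<()> {
    let (stream, _peer) = UnixStream::pair()?;
    let out = with_blocking_read(&stream, Duration::from_secs(5), || 42)?;
    println!("result={}, timeout restored to {:?}", out, stream.read_timeout()?);
    Ok(())
}
```

One wrinkle a real implementation would need to handle: if `wait_for_guest` can fail or panic, the timeout should still be restored (e.g. via a guard type) rather than leaking the blocking state.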
Fixes #40

When both the host and the VM are saturated, the VM may not be scheduled for a while, causing the connection to qga to time out. Prior to #27 the unix socket would block indefinitely. This change brings back that behaviour while we run a command, and sets the timeout back to what it was before running the command. This way, we give the host/VM a chance to recover.

Tested by running:

```
stress --cpu 256 --io 256 --vm 4 --vm-bytes 1024M --timeout 1000s
```

in the host, and

```
stress --cpu 512 --io 512 --vm 4 --vm-bytes 1024M --timeout 100s &
while true; do date; sleep 2; done
```

in the guest. The output shows that the VM was struggling to run its loop and print the date until the stress tool was done:

```
===> Setting up VM
[    1.840931] 9pnet: Limiting 'msize' to 512000 as this is the maximum supported by transport virtio
===> Running command
Thu Jan 25 01:43:12 PM PST 2024
stress: info: [85] dispatching hogs: 512 cpu, 512 io, 4 vm, 0 hdd
[    5.493242] hrtimer: interrupt took 3387058 ns
[   89.331664] clocksource: timekeeping watchdog on CPU1: hpet wd-wd read-back delay of 31697570ns
[   89.465647] clocksource: wd-tsc-wd read-back delay of 2998850ns, clock-skew test skipped!
Thu Jan 25 01:45:00 PM PST 2024
Thu Jan 25 01:45:09 PM PST 2024
Thu Jan 25 01:45:12 PM PST 2024
stress: info: [85] successful run completed in 124s
Thu Jan 25 01:45:17 PM PST 2024
Thu Jan 25 01:45:19 PM PST 2024
Thu Jan 25 01:45:21 PM PST 2024
Thu Jan 25 01:45:23 PM PST 2024
Thu Jan 25 01:45:26 PM PST 2024
Thu Jan 25 01:45:28 PM PST 2024
```

Signed-off-by: Manu Bretelle <chantr4@gmail.com>
Proposed change in #59.
We use `vmtest` to run tests in CI to check kernel compatibility with some eBPF programs. About 2 weeks ago we noticed that our CI was failing with:

At first I thought that maybe a naive "just retry" in `QgaWrapper::guest_exec_status` would fix it (shown in this branch that I made). But the error still appeared when running the vmtest compiled from that branch.

I am more than happy to try to contribute a fix, but I am pretty much at a loss as to what is going on. I am assuming it is related to this recent PR: #27. That PR is from 6 weeks ago and we only noticed the error two weeks ago, but we hadn't been touching our eBPF programs much lately, so that CI hadn't been running and we wouldn't have noticed.
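For reference, the naive retry idea looks something like this (a hypothetical helper, not the actual QgaWrapper code):

```rust
// Retry a fallible operation up to `attempts` times, returning the
// first success or the last error. In the branch mentioned above, the
// operation would be the guest-exec-status query.
fn retry<T, E>(attempts: u32, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut last = op();
    for _ in 1..attempts {
        if last.is_ok() {
            break;
        }
        last = op();
    }
    last
}

fn main() {
    // Simulate an operation that fails twice before succeeding.
    let mut calls = 0;
    let out = retry(30, || {
        calls += 1;
        if calls < 3 { Err("qga timeout") } else { Ok(calls) }
    });
    println!("{:?}", out); // prints Ok(3)
}
```

As discussed later in the thread, retries alone don't fix anything when the guest is starved of CPU; they just stretch the window during which the guest must get scheduled.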
For context, this error occurs on both newer and older kernels on x86_64. (We test a few kernels starting from 4.14.)
I don't think we do anything special in our tests when starting `vmtest`, but I do see that we run `ip link set dev lo up` prior to running our tests. I am not sure if that's related.