
Migrate docker container from ubuntu 20.04 to 22.04 #5877

Merged
merged 1 commit on Dec 20, 2023

Conversation

rveerama1
Contributor

Migration has been attempted a few times already, in PRs #5072, #5449, #5456 and #5487, and was dropped each time for various reasons. One main reason was the failure of the test live_migration::live_migration_sequential::test_live_migration_ovs_dpdk (#5532).

After analyzing those PRs I would like to approach this issue step by step.

  1. Migrate to Ubuntu 22.04 with minimal changes, so that all tests pass and the CI is happy (a minimal sketch of this step follows this list).
  2. Upgrade SPDK to the latest or a recent stable version.
  3. Upgrade virtiofsd to the latest or a recent stable version.
  4. Upgrade any other dependencies to their latest or recent stable versions, depending on how the CI proceeds with the tests and on any other specific requirements.
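
For step 1, the core change is just a base-image bump plus a container rebuild. A minimal sketch, assuming the dev container's base image is declared in resources/Dockerfile and that dev_cli.sh exposes a build-container command (treat both as placeholders for the repo's actual layout):

    # Bump the base image and rebuild the development container
    sed -i 's/FROM ubuntu:20.04/FROM ubuntu:22.04/' resources/Dockerfile
    ./scripts/dev_cli.sh build-container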

@likebreath @rbradford @michael2012z Any comments or suggestions?

@rveerama1 rveerama1 requested a review from a team as a code owner October 25, 2023 11:10
@rveerama1 rveerama1 marked this pull request as draft October 25, 2023 11:10
@likebreath
Member

@rveerama1 Thank you for looking into this. I am glad to see that you are learning from previous attempts.

In general, I'd suggest you focus on the x86_64 worker first, and gradually move to the other workers, say the aarch64, AMD, and Windows workers. On the other hand, I believe most of the debugging work can be done offline, either on your own machine or on an equivalent Azure VM instance. I'd only use the CI pipeline to validate once a certain worker is working locally.

@likebreath
Member

Instead of using this PR, let's have the discussion and track the progress in issue #5878.

@rveerama1
Contributor Author

rveerama1 commented Oct 26, 2023

In general, I'd suggest you focus on the x86_64 worker first, and gradually move to the other workers, say the aarch64, AMD, and Windows workers.

Sure.

@rveerama1
Contributor Author

rveerama1 commented Oct 27, 2023

Some progress with the tests.

Just migrating Ubuntu from 20.04 to 22.04 while keeping virtiofsd at version 1.1.0 caused the tests below to fail.

[2023-10-25T11:35:44.202Z] cloud-hypervisor: 7.164748s: <vmm> INFO:vmm/src/lib.rs:1952 -- API request event: VmAddFs(FsConfig { tag: "myfs", socket: "/tmp/ch8QW0js/virtiofs.sock", num_queues: 1, queue_size: 1024, id: Some("myfs0"), pci_segment: 15 }, Sender { .. })
[2023-10-25T11:35:44.202Z] cloud-hypervisor: 7.164813s: <vmm> INFO:vmm/src/device_manager.rs:2614 -- Creating virtio-fs device: FsConfig { tag: "myfs", socket: "/tmp/ch8QW0js/virtiofs.sock", num_queues: 1, queue_size: 1024, id: Some("myfs0"), pci_segment: 15 }
[2023-10-25T11:35:44.202Z] cloud-hypervisor: 67.239434s: <vmm> ERROR:virtio-devices/src/vhost_user/vu_common_ctrl.rs:410 -- Failed connecting the backend after trying for 1 minute: VhostUserProtocol(SocketConnect(Os { code: 111, kind: ConnectionRefused, message: "Connection refused" }))
[2023-10-25T11:35:44.202Z] cloud-hypervisor: 67.239491s: <vmm> ERROR:vmm/src/lib.rs:1160 -- Error when adding new fs to the VM: DeviceManager(CreateVirtioFs(VhostUserConnect))

[2023-10-25T11:35:44.210Z] failures:
[2023-10-25T11:35:44.210Z]     common_parallel::test_vfio
[2023-10-25T11:35:44.210Z]     common_parallel::test_virtio_fs
[2023-10-25T11:35:44.210Z]     common_parallel::test_virtio_fs_hotplug
[2023-10-25T11:35:44.210Z]     common_parallel::test_virtio_fs_multi_segment
[2023-10-25T11:35:44.210Z]     common_parallel::test_virtio_fs_multi_segment_hotplug

In previous attempts virtiofsd was bumped to 1.4.0. In this PR I moved to the latest version, 1.8.0, and with that the tests below pass on my local machine.

[2023-10-27T14:31:46Z INFO  virtiofsd] Client connected, servicing requests
[2023-10-27T14:31:53Z INFO  virtiofsd] Client connected, servicing requests
[2023-10-27T14:32:19Z INFO  virtiofsd] Client disconnected, shutting down
test common_parallel::test_virtio_fs ... ok
test common_parallel::test_virtio_fs_hotplug has been running for over 60 seconds
test common_parallel::test_virtio_fs_multi_segment_hotplug has been running for over 60 seconds
[2023-10-27T14:32:31Z INFO  virtiofsd] Client disconnected, shutting down
[2023-10-27T14:32:38Z INFO  virtiofsd] Client disconnected, shutting down
[2023-10-27T14:32:41Z INFO  virtiofsd] Waiting for vhost-user socket connection...
[2023-10-27T14:32:48Z INFO  virtiofsd] Waiting for vhost-user socket connection...
[2023-10-27T14:33:01Z INFO  virtiofsd] Client connected, servicing requests
[2023-10-27T14:33:08Z INFO  virtiofsd] Client connected, servicing requests
[2023-10-27T14:33:12Z INFO  virtiofsd] Client disconnected, shutting down
test common_parallel::test_virtio_fs_hotplug ... ok
[2023-10-27T14:33:19Z INFO  virtiofsd] Client disconnected, shutting down
test common_parallel::test_virtio_fs_multi_segment_hotplug ... ok

[2023-10-27T14:35:03Z INFO  virtiofsd] Waiting for vhost-user socket connection...
[2023-10-27T14:35:04Z INFO  virtiofsd] Waiting for vhost-user socket connection...
[2023-10-27T14:35:14Z INFO  virtiofsd] Client connected, servicing requests
[2023-10-27T14:35:20Z INFO  virtiofsd] Client connected, servicing requests
[2023-10-27T14:35:56Z INFO  virtiofsd] Client disconnected, shutting down
test common_parallel::test_virtio_fs_multi_segment ... ok
test common_parallel::test_virtio_fs_multi_segment_hotplug has been running for over 60 seconds
[2023-10-27T14:36:05Z INFO  virtiofsd] Client disconnected, shutting down
[2023-10-27T14:36:15Z INFO  virtiofsd] Waiting for vhost-user socket connection...
[2023-10-27T14:36:35Z INFO  virtiofsd] Client connected, servicing requests
[2023-10-27T14:36:46Z INFO  virtiofsd] Client disconnected, shutting down
test common_parallel::test_virtio_fs_multi_segment_hotplug ... ok
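
For anyone reproducing this outside the CI, the vhost-user handshake can be exercised by hand. A minimal sketch with virtiofsd 1.8.0; the paths, image, and kernel names here are placeholders rather than values from the CI config:

    # Start the vhost-user backend first so the socket exists
    /usr/libexec/virtiofsd --socket-path=/tmp/virtiofs.sock \
        --shared-dir=/tmp/shared --cache=never &

    # Shared memory is required for virtio-fs; the --fs values mirror the FsConfig in the log above
    ./cloud-hypervisor --cpus boot=1 --memory size=1G,shared=on \
        --kernel vmlinux --cmdline "root=/dev/vda1 console=hvc0" \
        --disk path=focal-server-cloudimg-amd64.raw \
        --fs tag=myfs,socket=/tmp/virtiofs.sock,num_queues=1,queue_size=1024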

@michael2012z
Member

Regarding the build failure on AArch64, it was also seen in #5487.

There is an issue in the rust-lang community tracking the problem, rust-lang/rust#89626, but it seems it has not been resolved yet. Some workarounds were mentioned in the discussion.

@rveerama1
Contributor Author

Regarding the build failure on AArch64, it was also seen in #5487.

Yes, I noticed it.

There is an issue in the rust-lang community tracking the problem, rust-lang/rust#89626, but it seems it has not been resolved yet. Some workarounds were mentioned in the discussion.

Let's see if we can find a good workaround.

@michael2012z
Member

@rveerama1, for the AArch64 build error, can you try updating the RUSTFLAGS like this:

diff --git a/scripts/dev_cli.sh b/scripts/dev_cli.sh
index 73b31850..ee9cc48a 100755
--- a/scripts/dev_cli.sh
+++ b/scripts/dev_cli.sh
@@ -286,8 +286,7 @@ cmd_build() {
     rustflags="$RUSTFLAGS"
     target_cc=""
     if [ "$(uname -m)" = "aarch64" ] && [ "$libc" = "musl" ]; then
-        rustflags="$rustflags -C link-arg=-lgcc -C link_arg=-specs -C link_arg=/usr/lib/aarch64-linux-musl/musl-gcc.specs"
-        target_cc="musl-gcc"
+        rustflags="$rustflags -C link-args=-Wl,-Bstatic -C link-args=-lc"
     fi

     $DOCKER_RUNTIME run \
@@ -400,8 +399,7 @@ cmd_tests() {
     rustflags="$RUSTFLAGS"
     target_cc=""
     if [ "$(uname -m)" = "aarch64" ] && [ "$libc" = "musl" ]; then
-        rustflags="$rustflags -C link-arg=-lgcc -C link_arg=-specs -C link_arg=/usr/lib/aarch64-linux-musl/musl-gcc.specs"
-        target_cc="musl-gcc"
+        rustflags="$rustflags -C link-args=-Wl,-Bstatic -C link-args=-lc"
     fi

     if [[ "$unit" = true ]]; then

I can reproduce the error, and the change works.
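
For reference, the musl code path this diff touches can be exercised locally with something like the following (assuming dev_cli.sh's --libc option; run on an aarch64 host):

    # Build with the musl toolchain to hit the patched RUSTFLAGS branch
    ./scripts/dev_cli.sh build --release --libc musl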

@rveerama1
Contributor Author

@rveerama1, for the AArch64 build error, can you try updating the RUSTFLAGS like this:

Done, thank you.

@rveerama1
Contributor Author

Some updates:

1) The test_vfio failure:

failures:

---- common_parallel::test_vfio stdout ----

==== Start cloud-hypervisor command-line ====
"target/x86_64-unknown-linux-gnu/release/cloud-hypervisor" "--cpus" "boot=4" "--memory" "size=2G,hugepages=on,shared=on" "--kernel" "/root/workloads/vmlinux" "--disk" "path=/tmp/chZLE0Eu/osdisk.img" "path=/tmp/chZLE0Eu/cloudinit" "path=/root/workloads/vfio.img" "path=/root/workloads/blk.img,iommu=on" "--cmdline" "root=/dev/vda1 console=hvc0 rw systemd.journald.forward_to_console=1 kvm-intel.nested=1 vfio_iommu_type1.allow_unsafe_interrupts" "--net" "tap=vfio-tap0,mac=12:34:56:78:90:00" "tap=vfio-tap1,mac=de:ad:be:ef:12:00,iommu=on" "tap=vfio-tap2,mac=de:ad:be:ef:34:00,iommu=on" "tap=vfio-tap3,mac=de:ad:be:ef:56:00,iommu=on" "-v"
==== End cloud-hypervisor command-line ====

==== Start ssh command output (FAILED) ====
command="grep -c VFIOTAG /proc/cmdline"
auth="PasswordAuth {
    username: "cloud",
    password: "cloud123",
}"
ip="172.18.0.3"
output=""
error="Connection(Os { code: 113, kind: HostUnreachable, message: "No route to host" })"
==== End ssh command outout ====

thread 'common_parallel::test_vfio' panicked at 'called `Result::unwrap()` on an `Err` value: Connection(Os { code: 113, kind: HostUnreachable, message: "No route to host" })', tests/integration.rs:4318:22

test common_parallel::test_vfio ... FAILED: the test was unable to SSH into the second-level (L2) VM. The reason could be that the second VM didn't boot.

A guess from @likebreath: "I think the reason why test_vfio was failing is that the Cloud Hypervisor binary compiled on the new Ubuntu 22.04 container image links against different dynamic libraries from those provided by Ubuntu 20.04 (the focal guest image)."

I modified test_vfio to use the Jammy image (Ubuntu 22.04) and the tests pass. (A quick way to check the dynamic-library theory is sketched after the list below.)

2) @michael2012z provided a fix for the compilation error on the ARM workers; it works, and the build proceeded to the integration tests.

Now some tests are failing:

  1. ARM worker:
[2023-10-31T15:20:22.459Z] failures:
[2023-10-31T15:20:22.459Z] 
[2023-10-31T15:20:22.459Z] ---- common_parallel::test_vfio_user stdout ----
[2023-10-31T15:20:22.459Z] thread 'common_parallel::test_vfio_user' panicked at 'assertion failed: exec_host_command_status(\"/usr/local/bin/spdk-nvme/rpc.py nvmf_create_transport -t VFIOUSER\").success()', tests/integration.rs:6737:9
  2. live_migration::live_migration_sequential::test_live_migration_ovs_dpdk failed just like in previous attempts. I am looking into this now.
[2023-10-31T15:40:32.680Z] ==== End 'ovs_vm' stderr ====

[2023-10-31T15:40:32.680Z] thread 'live_migration::live_migration_sequential::test_live_migration_ovs_dpdk' panicked at 'Test failed: Error occurred during live-migration', tests/integration.rs:8777:9
  3. Windows worker:
[2023-10-31T15:29:31.914Z] failures:
[2023-10-31T15:29:31.914Z]     windows::test_windows_guest_disk_hotplug
[2023-10-31T15:29:31.914Z]     windows::test_windows_guest_disk_hotplug_multi
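
On the dynamic-library guess above: one quick check is to compare the highest glibc symbol version the 22.04-built binary requires against what the focal guest ships. A sketch (the binary path is the one from the test command line):

    # Highest glibc version the binary requires; if it is newer than
    # the focal guest's glibc, the binary cannot run inside that guest
    objdump -T target/x86_64-unknown-linux-gnu/release/cloud-hypervisor \
        | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -1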

@rveerama1
Contributor Author

No new changes, just rebased against main.

@likebreath
Member

@rveerama1 Branch indexing will run the CI from time to time even if the PR is not updated and is still a draft, which wastes resources. Let's keep the PR closed unless you want to run the CI. Please reopen as needed. Thank you.

@likebreath likebreath closed this Nov 9, 2023
@likebreath
Member

Of course, you will still have the CI log history as a reference for debugging offline: https://cloud-hypervisor-jenkins.westus.cloudapp.azure.com/blue/organizations/jenkins/cloud-hypervisor/activity/?branch=PR-5877

@rveerama1
Contributor Author

@rveerama1 Branch indexing will run the CI from time to time even if the PR is not updated and is still a draft, which wastes resources. Let's keep the PR closed unless you want to run the CI. Please reopen as needed. Thank you.

I noticed many changes related to the integration tests recently, and I wanted to check whether they would introduce new issues or fix something on all workers. Anyway, I am looking into the live_migration::live_migration_sequential::test_live_migration_ovs_dpdk issue; no major update on it so far. The PR can be closed for the time being.

@rveerama1
Contributor Author

I need some help investigating this further.

So far, the live_migration::live_migration_sequential::test_live_migration_ovs_dpdk test gets stuck when running on Ubuntu 22.04 after the steps below:

cloud-hypervisor: 19.432839s: <vmm> INFO:arch/src/x86_64/mod.rs:579 -- Running under nested virtualisation. Hypervisor string: KVMKVMKVM
cloud-hypervisor: 19.432909s: <vmm> INFO:arch/src/x86_64/mod.rs:585 -- Generating guest CPUID for with physical address size: 40

It was stuck at https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/vmm/src/lib.rs#L1721 in vm.start_dirty_log(), which never returned.
Further investigation shows the call chain:
https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/vmm/src/vm.rs#L2524
https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/vmm/src/device_manager.rs#L4634 for the device id (_net3), which is the vhost-user network device
https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/vhost_user/mod.rs#L455
https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/vhost_user/vu_common_ctrl.rs#L527
https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/vhost_user/vu_common_ctrl.rs#L533 update_log_base()
https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/vhost_user/vu_common_ctrl.rs#L447
and finally it gets stuck here:
https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/vhost_user/vu_common_ctrl.rs#L500

        self.vu
            .set_log_base(0, Some(log))
            .map_err(Error::VhostUserSetLogBase)?; 

It never returned from there.
set_log_base belongs to https://docs.rs/crate/vhost/0.8.1/source/src/vhost_user/master.rs.

I don't know exactly why it gets stuck there on Ubuntu 22.04.
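
One way to narrow this down (a debugging sketch, not something from the CI): set_log_base is a request/reply exchange over the backend's unix socket, so if the master is waiting on a reply that never comes, the VMM thread should show up blocked in recvmsg:

    # Attach to the running VMM during the migration and watch the socket traffic
    strace -f -p "$(pidof cloud-hypervisor)" -e trace=sendmsg,recvmsg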

Some sample logs from Ubuntu 20.04:

cloud-hypervisor: 85.599615s: <vmm> INFO:arch/src/x86_64/mod.rs:579 -- Running under nested virtualisation. Hypervisor string: KVMKVMKVM
cloud-hypervisor: 85.599689s: <vmm> INFO:arch/src/x86_64/mod.rs:585 -- Generating guest CPUID for with physical address size: 40
cloud-hypervisor: 86.828570s: <vmm> INFO:vmm/src/lib.rs:1743 -- Dirty memory migration 0 of 5
cloud-hypervisor: 86.829669s: <vmm> INFO:vmm/src/memory_manager.rs:2679 -- Dirty Memory Range Table:
cloud-hypervisor: 86.829698s: <vmm> INFO:vmm/src/memory_manager.rs:2681 -- GPA: 2003000 size: 4 (KiB)
cloud-hypervisor: 86.829736s: <vmm> INFO:vmm/src/memory_manager.rs:2681 -- GPA: 2007000 size: 4 (KiB)
cloud-hypervisor: 86.829782s: <vmm> INFO:vmm/src/memory_manager.rs:2681 -- GPA: 200f000 size: 8 (KiB)
cloud-hypervisor: 86.829848s: <vmm> INFO:vmm/src/memory_manager.rs:2681 -- GPA: 2014000 size: 8 (KiB)
cloud-hypervisor: 86.829902s: <vmm> INFO:vmm/src/memory_manager.rs:2681 -- GPA: 2035000 size: 4 (KiB)

It does proceed further with dirty memory migration.

Any help, suggestions, or insights, @likebreath @rbradford @sboeuf?

@rveerama1 rveerama1 changed the title Migrate docker container from ubuntu 20.04 to 22.04 [WIP] Migrate docker container from ubuntu 20.04 to 22.04 Nov 16, 2023
@rveerama1 rveerama1 reopened this Nov 16, 2023
@rveerama1
Contributor Author

Marking the PR as WIP doesn't trigger the CI.

@likebreath
Member

finally it gets stuck here: https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/vhost_user/vu_common_ctrl.rs#L500

        self.vu
            .set_log_base(0, Some(log))
            .map_err(Error::VhostUserSetLogBase)?; 

It never returned from there. set_log_base belongs to https://docs.rs/crate/vhost/0.8.1/source/src/vhost_user/master.rs.

@rveerama1 Good progress. It is now clear that the problem comes from migrating a vhost_user device (one using an ovs-dpdk backend) while sending the SET_LOG_BASE request to the backend. I'd suggest you locate precisely where the hang comes from in the vhost crate and open an issue on the vhost crate repository to get some input [1].

Note that we recently upgraded the vhost crate version, so please rebase before further debugging.

[1] https://github.com/rust-vmm/vhost

@rveerama1
Contributor Author

Note that we recently upgraded the vhost crate version, so please rebase before further debugging.

Ok, I will check and update.

@rveerama1
Contributor Author

rveerama1 commented Dec 13, 2023

Also hitting this error:

[2023-12-12T12:53:19.501Z] ERROR: Build data file '/root/workloads/spdk/build/libvfio-user/build-release/meson-private/build.dat' references functions or classes that don't exist. This probably means that it was generated with an old version of meson. Consider reconfiguring the directory with "meson setup --reconfigure".
[2023-12-12T12:53:19.501Z] make[1]: *** [Makefile:21: build] Error 1
[2023-12-12T12:53:19.501Z] make: *** [/root/workloads/spdk/mk/spdk.subdirs.mk:16: vfiouserbuild] Error 2
[2023-12-12T12:53:19.501Z] make: *** Waiting for unfinished jobs....
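
The usual way out of stale meson build data like this is a clean rebuild rather than reconfiguring in place. A sketch, using the paths from the log and SPDK's --with-vfio-user configure switch:

    # Rebuild SPDK from scratch so meson regenerates its build data
    cd /root/workloads/spdk
    make clean
    ./configure --with-vfio-user
    make -j"$(nproc)"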

@rveerama1 rveerama1 changed the title [WIP] Migrate docker container from ubuntu 20.04 to 22.04 Migrate docker container from ubuntu 20.04 to 22.04 Dec 13, 2023
@rveerama1
Contributor Author

The ARM worker is fixed as well.

@rveerama1
Contributor Author

On the Windows worker, two tests were failing:

[2023-11-30T13:29:03.480Z] failures:
[2023-11-30T13:29:03.480Z]     windows::test_windows_guest_disk_hotplug
[2023-11-30T13:29:03.480Z]     windows::test_windows_guest_disk_hotplug_multi

@likebreath @liuw, can you ask someone to take a look at the Windows worker side?

@weltling Would you please help take a look?

I disabled those two tests. Now the CI looks fine.

@rveerama1 rveerama1 marked this pull request as ready for review December 13, 2023 15:08
@likebreath (Member) left a comment

We will need to run the change more times to see how stable our tests are on the migrated docker container. Also, we will need to test the bare-metal workers.

Since we are cutting a new release tomorrow, let's do these tests after the release. I am marking this PR as DNM to avoid unnecessary CI runs before then.

@likebreath likebreath changed the title Migrate docker container from ubuntu 20.04 to 22.04 [DNM] Migrate docker container from ubuntu 20.04 to 22.04 Dec 13, 2023
@rveerama1
Contributor Author

It seems SPDK needs to be rebuilt every time?
It started throwing errors again on the ARM side:

[2023-12-14T09:52:16.961Z] cloud-hypervisor: 6.856043s: <vmm> INFO:vmm/src/lib.rs:1969 -- API request event: VmInfo(Sender { .. })
[2023-12-14T09:52:20.395Z] /usr/local/bin/spdk-nvme/nvmf_tgt: error while loading shared libraries: libjson-c.so.4: cannot open shared object file: No such file or directory

[2023-12-14T09:57:57.622Z] ---stdout---
[2023-12-14T09:57:57.622Z] Error while connecting to /var/tmp/spdk.sock
[2023-12-14T09:57:57.622Z] Is SPDK application running?
[2023-12-14T09:57:57.622Z] Error details: Invalid or non-existing address: '/var/tmp/spdk.sock'
[2023-12-14T09:57:57.622Z] 
[2023-12-14T09:57:57.622Z] ---stderr--- 
[2023-12-14T09:57:57.622Z] 
[2023-12-14T09:57:57.622Z] ==== End 'exec_host_command' failed ====
[2023-12-14T09:57:57.622Z] thread 'common_parallel::test_vfio_user' panicked at 'assertion failed: exec_host_command_status(\"/usr/local/bin/spdk-nvme/rpc.py nvmf_create_transport -t VFIOUSER\").success()', 

Previously the container was picking up the right shared library, libjson-c.so.5, and nvmf_tgt initialized fine, but now it is picking up the old libs again.
Something seems wrong with how the container works.
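
A quick way to see which json-c the copied binary expects versus what the container actually ships (the binary path is taken from the log above):

    # The binary's recorded dependency vs. the library present in the container
    ldd /usr/local/bin/spdk-nvme/nvmf_tgt | grep json-c
    ls /usr/lib/*/libjson-c.so.*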

@rveerama1
Contributor Author

So with clean SPDK builds, everything works properly.
If we build once and copy only the binaries, as the scripts do, then the test_vfio_user test becomes flaky.

@rveerama1 rveerama1 force-pushed the issue_5532 branch 2 times, most recently from 7ee76b4 to 0976355 Compare December 14, 2023 12:08
@likebreath likebreath changed the title [DNM] Migrate docker container from ubuntu 20.04 to 22.04 [WIP] Migrate docker container from ubuntu 20.04 to 22.04 Dec 14, 2023
@likebreath
Member

DNM does not stop the CI pipeline from building; only WIP or RFC does.

@likebreath likebreath changed the title [WIP] Migrate docker container from ubuntu 20.04 to 22.04 Migrate docker container from ubuntu 20.04 to 22.04 Dec 20, 2023
The following tests have been temporarily disabled:

1. Live upgrade/migration test with ovs-dpdk (cloud-hypervisor#5532);
2. Disk hotplug tests on windows guests (cloud-hypervisor#6037);

This patch has been tested with PR cloud-hypervisor#6048.

Signed-off-by: Ravi kumar Veeramally <ravikumar.veeramally@intel.com>
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
Tested-by: Bo Chen <chen.bo@intel.com>
@likebreath
Member

This patch has been tested with #6048. Details: #6048 (comment). I updated the container tag and added details about the tests being disabled in the commit message.

I think we are ready to land this PR. Thank you for the good work @rveerama1.

@likebreath likebreath enabled auto-merge (rebase) December 20, 2023 19:53
@likebreath likebreath merged commit 24f384d into cloud-hypervisor:main Dec 20, 2023
26 checks passed