Publish event for VM.STOP when out of band stop is detected #7878

mlsorensen · 2023-08-17T20:42:48Z

Description

This PR publishes a VM.STOP action event when the power state processor detects that a VM is gone from its hypervisor. Currently, only a power state event is published on the message bus. With this change, the event can be seen and processed when a VM is detected as stopped out of band.

Additionally, it was discovered that the existing missing-VM code is triggered when a VM is taking a while to start. For example, if we are waiting on the router VM to come up, the power state report may contain no VM where one is expected. The VM is assigned to the host but doesn't exist yet, which triggers the missing-VM code. A check was added to ignore the VM if it is still in the "Starting" state.
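
For illustration only, here is a minimal Java sketch of the decision described above. The class, enum, and method names are hypothetical stand-ins and are not taken from the CloudStack code; the sketch only shows that a VM missing from a host report is treated as stopped unless it is still in the Starting state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for the power state sync logic; not CloudStack code.
final class MissingVmSketch {
    enum State { STARTING, RUNNING }

    // Returns IDs of VMs assigned to the host that should get a VM.STOP event.
    static List<Long> vmsToReportStopped(Map<Long, State> assignedToHost, Set<Long> seenInReport) {
        List<Long> stopped = new ArrayList<>();
        for (Map.Entry<Long, State> vm : assignedToHost.entrySet()) {
            if (seenInReport.contains(vm.getKey())) {
                continue; // the hypervisor still reports this VM
            }
            if (vm.getValue() == State.STARTING) {
                continue; // assigned to the host but not created yet, so not "missing"
            }
            stopped.add(vm.getKey()); // gone from the hypervisor: out-of-band stop
        }
        return stopped;
    }

    public static void main(String[] args) {
        Map<Long, State> assigned = new TreeMap<>(); // sorted for stable output
        assigned.put(1L, State.RUNNING);
        assigned.put(2L, State.RUNNING);
        assigned.put(3L, State.STARTING);
        // The host report only contains VM 1.
        System.out.println(vmsToReportStopped(assigned, Set.of(1L))); // prints [2]
    }
}
```

VM 3 is skipped because it is still starting, so only VM 2 would produce an event in this sketch.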

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

Major
Minor

Bug Severity

Screenshots (if appropriate):

How Has This Been Tested?

Tested out of band stop by shutting down guest within the VM, confirmed new event triggered.

Tested startup of a VM where the power state is processed while waiting on the router to come up; confirmed that the "missing VM" detection no longer triggers events when the VM is in the Starting state.

Tested live migration, confirmed we processed power state reports for both source and destination hypervisor hosts and did not issue the new VM.STOP event.

weizhouapache

code lgtm

this PR has only impact on the events

weizhouapache · 2023-08-18T07:37:17Z

@blueorangutan package

blueorangutan · 2023-08-18T07:38:03Z

@weizhouapache a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2023-08-18T08:33:58Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6825

weizhouapache · 2023-08-18T08:36:59Z

@blueorangutan test

blueorangutan · 2023-08-18T08:38:05Z

@weizhouapache a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

DaanHoogland · 2023-08-18T10:41:40Z

@mlsorensen have you considered an out-of-band migration scenario? This might have an undesirable effect if the VM was started on another host, wouldn't it?

mlsorensen · 2023-08-18T15:09:10Z

@mlsorensen have you considered an out-of-band migration scenario? This might have an undesirable effect if the VM was started on another host, wouldn't it?

Good point. Whatever is happening in this code during out of band migration, this doesn't change that. I don't know if anyone is looking at the message bus that is currently already publishing, or if there are existing issues with how it handles the VM state, but maybe publishing to the event system will make it more obvious if there's an existing bug here.

I don't know that anyone using KVM, at least, is doing out-of-band migration, because the tools aren't great for that, and in many cases it is simply not possible because storage and network plugins need to run and make storage/network accessible on the host. Perhaps useful for someone using VMware, where there is a separate system to manage VMs independently. I don't know that a test via KVM is going to really exercise it in the same way the VMware integration does.

Worst case, we see an event for VMware users indicating the VM was removed from its host out of band, but we aren't making a change to how the VM state is handled and synced up for these kinds of migrations.

With the existing code what I see is during live migration, both source and destination hypervisors send a report that the VM is powered on, and both systems are updating the VM power state when they send pings. Thus the VM's power_host flip flops back and forth, but the host_id is already set to the new host. When the VM finally migrates, it doesn't complain that the VM is powered off because the host complaining the VM is gone is not matching the host_id. With out of band, we need to make sure we understand how the host_id will get updated and that this is compatible.
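
To make that observation concrete, here is a minimal Java sketch. It is a simplified, hypothetical model of the two fields mentioned above (`host_id` and `power_host`), not the actual CloudStack implementation: power-on reports only move the power host, and a "VM is gone" report is acted on only when it comes from the host the VM is assigned to.

```java
// Hypothetical model of the two fields discussed above; not CloudStack code.
final class PowerReportSketch {
    static final class Vm {
        long hostId;      // host CloudStack has assigned the VM to
        long powerHostId; // host that most recently reported the VM as powered on

        Vm(long hostId, long powerHostId) {
            this.hostId = hostId;
            this.powerHostId = powerHostId;
        }
    }

    // A power-on report only updates the power host, never the assignment.
    static void onPowerOnReport(Vm vm, long reportingHostId) {
        vm.powerHostId = reportingHostId;
    }

    // A "VM is gone" report is acted on only if it comes from the assigned host.
    static boolean missingReportCounts(Vm vm, long reportingHostId) {
        return vm.hostId == reportingHostId;
    }

    public static void main(String[] args) {
        Vm vm = new Vm(2L, 1L); // migrating from host 1 to host 2, host_id already 2
        onPowerOnReport(vm, 1L);
        onPowerOnReport(vm, 2L);
        onPowerOnReport(vm, 1L); // power host flip-flops while both hosts see the VM
        System.out.println("power host: " + vm.powerHostId);
        System.out.println("source missing report counts: " + missingReportCounts(vm, 1L));
    }
}
```

In this model the source host's final "missing" report is ignored because `hostId` already points at the destination.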

mlsorensen · 2023-08-18T15:44:24Z

Ok, it took some manual intervention to test with KVM (copying config drive iso, editing XML) but here is what I'm seeing:

With OOB live migration, instead of just the power_host flip-flopping, the host_id itself flip-flops. This could be an issue right now, because looking at the VM during this time just shows it as "Running", but we don't know which host it's supposed to land on. Something like a volume resize or snapshot issued during this time would probably fail. The fix to this is probably to properly detect the migrating scenario and change the VM state to Migrating, but outside of the scope of this PR. I question whether we'd be able to detect the OOB migration without any sort of race condition. Libvirt events may help close the gap somewhat. Really, it seems out-of-band moves are not supported (at least on KVM); CloudStack just does what it can to reconcile.

This change doesn't seem to have any effect on OOB handling. What I observe is this: given two hosts issuing power state reports (H1 Report ... t1 ... H2 Report ... t2 ... H1 Report) and a VM OOB migrating from H1 to H2, if the migration completes during t1, then the H2 report claims the VM properly, and when the last H1 report comes there is no change. If the migration completes during t2, H2 was the last one to claim the VM, so again there is no effect. The only case where there could be an issue is if the H2 agent is offline, or some other situation where we receive two H1 reports in a row before H2 can claim the VM. In this case it is probably fair to notify that a VM has stopped if someone has moved it out of band to a host CloudStack has no active control over.
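
The report orderings described above can be replayed with a small Java sketch. This is a simplified model under stated assumptions, not CloudStack code: a report that sees the VM claims it for the reporting host, and a report that does not see the VM triggers a stop only when it comes from the host that currently holds the claim. All names are hypothetical.

```java
import java.util.List;

// Hypothetical replay of power state reports during an OOB migration.
final class OobReportOrderSketch {
    static final class Report {
        final long hostId;
        final boolean seesVm;

        Report(long hostId, boolean seesVm) {
            this.hostId = hostId;
            this.seesVm = seesVm;
        }
    }

    // Replays reports in order and returns how many stop events would fire.
    static int countStopEvents(long initialHostId, List<Report> reports) {
        long hostId = initialHostId;
        boolean running = true;
        int stops = 0;
        for (Report r : reports) {
            if (r.seesVm) {
                hostId = r.hostId; // the reporting host claims the VM
                running = true;
            } else if (running && r.hostId == hostId) {
                stops++;           // the claiming host no longer sees the VM
                running = false;
            }
        }
        return stops;
    }

    public static void main(String[] args) {
        // H1 report, H2 report, H1 report: H2 claimed the VM before H1 lost it.
        List<Report> h2Claims = List.of(new Report(1, true), new Report(2, true), new Report(1, false));
        // Two H1 reports in a row, for example because the H2 agent is offline.
        List<Report> h2Silent = List.of(new Report(1, true), new Report(1, false));
        System.out.println(countStopEvents(1, h2Claims)); // prints 0
        System.out.println(countStopEvents(1, h2Silent)); // prints 1
    }
}
```

Only the second ordering produces a stop event, which matches the one problematic case called out above.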

DaanHoogland · 2023-08-21T07:50:04Z

Perhaps useful for someone using VMware, where there is a separate system to manage VMs independently. I don't know that a test via KVM is going to really exercise it in the same way the VMware integration does.

Worst case, we see an event for VMware users indicating the VM was removed from its host out of band, but we aren't making a change to how the VM state is handled and synced up for these kinds of migrations.

A scenario we have seen (which is purely a billing issue, but still) is that a VM was at a certain moment powered off, and hence absent in reports from both hosts, while migration was still going on. This led to billing issues, as those are a bit fragile: billing relies a bit too much on a proper order of events and on matching on/off, create/delete, etc. combinations of events. What I am missing in this PR is the start event (also for OOB).
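
A rough Java sketch of why an unmatched stop event matters for event-based billing. This is a hypothetical, naive calculator and not the CloudStack usage server; the `VM.START` string is assumed here as the counterpart of `VM.STOP`, and the timestamps are made up for the example.

```java
import java.util.List;

// Hypothetical, naive usage calculation that pairs start and stop events.
final class UsagePairingSketch {
    static final class Event {
        final String type;
        final long atSeconds;

        Event(String type, long atSeconds) {
            this.type = type;
            this.atSeconds = atSeconds;
        }
    }

    // Sums the time between each VM.START and the next VM.STOP.
    // Events must be in time order; an open interval is billed up to nowSeconds.
    static long billedSeconds(List<Event> events, long nowSeconds) {
        long billed = 0;
        Long runningSince = null;
        for (Event e : events) {
            if ("VM.START".equals(e.type) && runningSince == null) {
                runningSince = e.atSeconds;
            } else if ("VM.STOP".equals(e.type) && runningSince != null) {
                billed += e.atSeconds - runningSince;
                runningSince = null;
            }
        }
        if (runningSince != null) {
            billed += nowSeconds - runningSince;
        }
        return billed;
    }

    public static void main(String[] args) {
        // A stop is published at t=100 while the VM actually keeps running elsewhere.
        List<Event> stopOnly = List.of(new Event("VM.START", 0), new Event("VM.STOP", 100));
        // The same sequence, countered by a start event at t=110.
        List<Event> paired = List.of(new Event("VM.START", 0), new Event("VM.STOP", 100),
                new Event("VM.START", 110));
        System.out.println(billedSeconds(stopOnly, 1000)); // prints 100
        System.out.println(billedSeconds(paired, 1000));   // prints 990
    }
}
```

Without the countering start event, this kind of calculation stops billing a VM that is still running.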

We will have to test this and I think you are right this will probably only be an issue on vmware (not sure about xen).

@borisstoyanov (@vladimirpetrov) I think we need to give an extra eye to testing this.

The fix to this is probably to properly detect the migrating scenario and change the VM state to Migrating, but outside of the scope of this PR.

agreed

weizhouapache · 2023-08-21T08:21:54Z

smoke tests passed with message

22:26:44 msg: {"body": "<b>[SF] Trillian test result (tid-7476)</b>\nEnvironment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7\nTotal time taken: 41024 seconds\nMarvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7476-kvm-centos7.zip\nSmoke tests completed. 108 look OK, 0 have errors, 0 did not run\nOnly failed and skipped tests results shown below:\n\n\nTest | Result | Time (s) | Test File\n--- | --- | --- | ---\n"}

DaanHoogland · 2023-08-21T09:42:04Z

smoke tests passed with message

22:26:44 msg: {"body": "<b>[SF] Trillian test result (tid-7476)</b>\nEnvironment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7\nTotal time taken: 41024 seconds\nMarvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7476-kvm-centos7.zip\nSmoke tests completed. 108 look OK, 0 have errors, 0 did not run\nOnly failed and skipped tests results shown below:\n\n\nTest | Result | Time (s) | Test File\n--- | --- | --- | ---\n"}

api rate limit again?

weizhouapache · 2023-08-21T09:44:08Z

api rate limit again?
yes @DaanHoogland

rohityadavcloud · 2023-08-22T09:04:34Z

@blueorangutan package

blueorangutan · 2023-08-22T09:06:03Z

@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

rohityadavcloud

LGTM, haven't tested it, also don't know if this can cause side-effects for all hypervisors and cases, as we're skipping

blueorangutan · 2023-08-22T10:04:46Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6848

DaanHoogland · 2023-08-22T14:34:37Z

@blueorangutan test matrix

blueorangutan · 2023-08-22T14:36:03Z

@DaanHoogland a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

blueorangutan · 2023-08-23T01:49:43Z

[SF] Trillian test result (tid-7516)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 38862 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7516-xenserver-71.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File

blueorangutan · 2023-08-23T03:05:33Z

[SF] Trillian test result (tid-7518)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 43352 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7518-kvm-centos7.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File

blueorangutan · 2023-08-23T04:54:14Z

[SF] Trillian test result (tid-7517)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 49874 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7517-vmware-67u3.zip
Smoke tests completed. 107 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_02_upgrade_kubernetes_cluster	`Failure`	669.19	test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster	`Failure`	546.81	test_kubernetes_clusters.py

DaanHoogland · 2023-08-23T07:21:18Z

engine/orchestration/src/main/java/com/cloud/vm/VirtualMachinePowerStateSyncImpl.java

+                        ActionEventUtils.onActionEvent(User.UID_SYSTEM, Account.ACCOUNT_ID_SYSTEM,instance.getDomainId(),
+                                EventTypes.EVENT_VM_STOP, "Out of band VM power off", instance.getId(), ApiCommandResourceType.VirtualMachine.toString());


I think this event should be published in handlePowerOffReportWithNoPendingJobsOnVM and countered by an EVENT_VM_START in handlePowerOnReportWithNoPendingJobsOnVM, both in VirtualMachineManagerImpl. Alternatively, a proper place to handle the OOB start event in this class should be found. I think this is not a big deal for you, @mlsorensen, but for people using event/usage server based billing it would be. If an OOB migration happens, their billing for this VM could stop or continue depending on the order of events.
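
The pairing suggested above could look roughly like the following Java sketch. It is illustrative only: the class and method names are hypothetical simplifications of the two handlers named above, the event strings are assumed to mirror EVENT_VM_STOP and EVENT_VM_START, and events are collected in a list instead of being sent through ActionEventUtils.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical handlers that publish matching stop and start events.
final class OobEventPairSketch {
    enum State { RUNNING, STOPPED }

    final List<String> published = new ArrayList<>();
    State state;

    OobEventPairSketch(State initial) {
        this.state = initial;
    }

    // Power-off report with no pending job: publish a stop only on a real transition.
    void onPowerOffReport() {
        if (state == State.RUNNING) {
            state = State.STOPPED;
            published.add("VM.STOP");
        }
    }

    // Power-on report with no pending job: counter the stop with a start event.
    void onPowerOnReport() {
        if (state == State.STOPPED) {
            state = State.RUNNING;
            published.add("VM.START");
        }
    }

    public static void main(String[] args) {
        OobEventPairSketch vm = new OobEventPairSketch(State.RUNNING);
        vm.onPowerOffReport();
        vm.onPowerOffReport(); // repeated report, no second event
        vm.onPowerOnReport();
        System.out.println(vm.published); // prints [VM.STOP, VM.START]
    }
}
```

Publishing only on state transitions keeps the start and stop events balanced, which is what event-based billing depends on.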

mlsorensen · 2023-08-24

Understood, I'll look into that.

mlsorensen · 2023-08-24

This may get you what you're looking for, @DaanHoogland. In testing, this didn't trigger any event at all during out-of-band migration on KVM. However, it still responded to OOB stop, and even OOB start, publishing an event in both situations.

I still don't have the means to test how the VMware side reacts to this, or whether further changes would be necessary there.

DaanHoogland · 2023-08-25

@mlsorensen, this is very hard to reproduce as it depends on the order of power sync reports coming in from hosts. I'll give this a swing after 4.18.1, and think about simulating race conditions. Code looks good.

weizhouapache · 2023-08-24T08:25:23Z

moved to 4.18.2.0

Signed-off-by: Marcus Sorensen <mls@apple.com>

codecov · 2023-08-24T17:41:56Z

Codecov Report

Merging #7878 (2146194) into 4.18 (f7345e8) will increase coverage by 0.01%.
Report is 17 commits behind head on 4.18.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##               4.18    #7878      +/-   ##
============================================
+ Coverage     13.04%   13.06%   +0.01%     
- Complexity     9067     9088      +21     
============================================
  Files          2720     2720              
  Lines        257236   257395     +159     
  Branches      40103    40130      +27     
============================================
+ Hits          33552    33621      +69     
- Misses       219474   219552      +78     
- Partials       4210     4222      +12

Files Changed	Coverage Δ
...n/java/com/cloud/vm/VirtualMachineManagerImpl.java	`6.35% <0.00%> (-0.01%)`	⬇️

... and 16 files with indirect coverage changes

DaanHoogland

clgtm

rohityadavcloud · 2023-09-02T08:40:44Z

LGTM
@blueorangutan package

blueorangutan · 2023-09-02T08:42:03Z

@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2023-09-02T09:35:11Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6964

rohityadavcloud · 2023-09-06T05:31:09Z

@blueorangutan test

blueorangutan · 2023-09-06T05:32:03Z

@rohityadavcloud a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan · 2023-09-06T21:07:09Z

[SF] Trillian test result (tid-7632)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 54608 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7632-kvm-centos7.zip
Smoke tests completed. 105 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_1_create_iso_with_checksum_sha1_negative	`Error`	1007.09	test_iso.py
test_02_upgrade_kubernetes_cluster	`Failure`	536.57	test_kubernetes_clusters.py
test_05_create_template_with_no_checksum	`Error`	65.42	test_templates.py

DaanHoogland · 2023-09-18T09:02:35Z

@blueorangutan package

blueorangutan · 2023-09-18T09:04:03Z

@DaanHoogland a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2023-09-18T09:57:50Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7040

DaanHoogland · 2023-09-18T11:21:55Z

@blueorangutan test matrix

blueorangutan · 2023-09-18T11:24:04Z

@DaanHoogland a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

blueorangutan · 2023-09-18T23:40:23Z

[SF] Trillian test result (tid-7674)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42743 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7674-xenserver-71.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File

blueorangutan · 2023-09-18T23:47:40Z

[SF] Trillian test result (tid-7676)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 43120 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7676-kvm-centos7.zip
Smoke tests completed. 107 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_02_upgrade_kubernetes_cluster	`Failure`	589.13	test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster	`Failure`	650.62	test_kubernetes_clusters.py

blueorangutan · 2023-09-19T00:36:52Z

[SF] Trillian test result (tid-7675)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 46072 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7675-vmware-67u3.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File

DaanHoogland · 2023-09-19T07:43:17Z

@weizhouapache @mlsorensen @rohityadavcloud Do we merge or do we need more testing on this?

mlsorensen · 2023-09-22T14:56:11Z

@DaanHoogland from my end it's good to go.

* 4.18: Publish event for VM.STOP when out of band stop is detected (#7878)

boring-cyborg bot added component:compute component:orchestration labels Aug 17, 2023

weizhouapache added this to the 4.18.1.0 milestone Aug 17, 2023

weizhouapache approved these changes Aug 18, 2023

weizhouapache requested a review from DaanHoogland August 21, 2023 13:50

rohityadavcloud approved these changes Aug 22, 2023

weizhouapache self-requested a review August 23, 2023 07:16

DaanHoogland requested changes Aug 23, 2023

weizhouapache removed this from the 4.18.1.0 milestone Aug 24, 2023

weizhouapache added this to the 4.18.2.0 milestone Aug 24, 2023

Publish event for VM.STOP when out of band stop is detected

2146194

Signed-off-by: Marcus Sorensen <mls@apple.com>

mlsorensen force-pushed the 4.18-oob-vm-stop-publish branch from 5abfcca to 2146194 Compare August 24, 2023 16:39

mlsorensen requested a review from DaanHoogland August 24, 2023 16:40

DaanHoogland approved these changes Aug 25, 2023

DaanHoogland merged commit 3071ad6 into apache:4.18 Sep 25, 2023
24 of 27 checks passed

DaanHoogland added a commit that referenced this pull request Sep 25, 2023

Merge release branch 4.18 to main

f539c4b

* 4.18: Publish event for VM.STOP when out of band stop is detected (#7878)

