Publish event for VM.STOP when out of band stop is detected #7878

Merged
merged 1 commit into from Sep 25, 2023

Conversation

mlsorensen
Contributor

Description

This PR publishes an action event for VM.STOP when the power state processor detects that a VM is gone from the hypervisor. Currently, only a power state event is published on the message bus. Publishing the action event allows the stop to be seen and processed when a VM is detected to have been stopped out of band.

Additionally, it was discovered that the existing missing-VM code is triggered when a VM is taking a while to start. For example, if we are waiting on the router VM to come up, the power state report can contain no VM when one is expected: the VM is assigned to the host but doesn't exist yet, which triggers the missing-VM code. A check was added to ignore the VM if it is still in the "Starting" state.
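
The check described above can be sketched roughly as follows. This is only an illustration: `VmState`, `Vm` and `MissingVmHandler` are hypothetical stand-ins, not CloudStack classes, and the event is recorded as a plain string rather than published through the real event system.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not CloudStack classes.
enum VmState { STARTING, RUNNING, STOPPED }

final class Vm {
    final long id;
    VmState state;

    Vm(long id, VmState state) {
        this.id = id;
        this.state = state;
    }
}

final class MissingVmHandler {
    // Collects the events that would be published, so the behaviour is easy to inspect.
    final List<String> publishedEvents = new ArrayList<>();

    // Called for a VM that is assigned to a host but absent from that host's power report.
    void onVmMissingFromReport(Vm vm) {
        if (vm.state == VmState.STARTING) {
            // The VM may simply not exist on the hypervisor yet, so ignore it.
            return;
        }
        if (vm.state == VmState.RUNNING) {
            // Assumed handling: mark the VM stopped and record a VM.STOP event.
            vm.state = VmState.STOPPED;
            publishedEvents.add("VM.STOP:" + vm.id);
        }
    }
}

final class MissingVmDemo {
    public static void main(String[] args) {
        MissingVmHandler handler = new MissingVmHandler();
        handler.onVmMissingFromReport(new Vm(1, VmState.STARTING)); // ignored
        handler.onVmMissingFromReport(new Vm(2, VmState.RUNNING));  // records VM.STOP
        System.out.println(handler.publishedEvents);
    }
}
```

The point of the early return is that a VM still in "Starting" produces no event at all, while a VM that was running and then disappears produces exactly one.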

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

Screenshot 2023-08-17 at 10 37 38 AM

How Has This Been Tested?

Tested out of band stop by shutting down guest within the VM, confirmed new event triggered.

Tested startup of VM where power state is processed while waiting on router to come up, confirmed events no longer triggered detecting a "missing VM" when VM is in starting state.

Tested live migration, confirmed we processed power state reports for both source and destination hypervisor hosts and did not issue the new VM.STOP event.

Member

@weizhouapache weizhouapache left a comment

code lgtm

this PR has only impact on the events

@weizhouapache
Member

@blueorangutan package

@blueorangutan

@weizhouapache a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6825

@weizhouapache
Member

@blueorangutan test

@blueorangutan

@weizhouapache a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@DaanHoogland
Contributor

@mlsorensen have you considered an out-of-band migration scenario? This might have an undesirable effect if the VM was started on another host, wouldn't it?

@mlsorensen
Contributor Author

> @mlsorensen have you considered an out-of-band migration scenario? This might have an undesirable effect if the VM was started on another host, wouldn't it?

Good point. Whatever is happening in this code during out of band migration, this doesn't change that. I don't know if anyone is looking at the message bus that is currently already publishing, or if there are existing issues with how it handles the VM state, but maybe publishing to the event system will make it more obvious if there's an existing bug here.

I don't know that anyone using KVM at least is doing out of band migration because the tools aren't great for that, and in many cases simply not possible due to storage and network plugins needing to run and make storage/network accessible on the host. Perhaps useful for someone using VMware, where there is a separate system to manage VMs independently. I don't know that a test via KVM is going to really exercise it in the same way the VMware integration does.

Worst case, we see an event for VMware users indicating the VM was removed from its host out of band, but we aren't making a change to how the VM state is handled and synced up for these kinds of migrations.

With the existing code, what I see during live migration is that both source and destination hypervisors report the VM as powered on, and both update the VM power state when they send pings. Thus the VM's power_host flip-flops back and forth, but the host_id is already set to the new host. When the VM finally migrates, nothing complains that the VM is powered off, because the host reporting the VM as gone does not match the host_id. With out-of-band migration, we need to make sure we understand how the host_id will get updated and that this is compatible.
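
The behaviour observed above can be illustrated with a small sketch. The `VmRecord` type and its `hostId`/`powerHostId` fields are hypothetical, modelled on the `host_id` and `power_host` values mentioned in this comment; this is not the actual CloudStack code.

```java
// Hypothetical model of the fields discussed above; not CloudStack's real classes.
final class VmRecord {
    long hostId;       // host the VM is assigned to (host_id)
    long powerHostId;  // host that last reported the VM powered on (power_host)

    VmRecord(long hostId, long powerHostId) {
        this.hostId = hostId;
        this.powerHostId = powerHostId;
    }

    // A host reports the VM as powered on: only the power host is updated.
    void onPowerOnReport(long reportingHostId) {
        powerHostId = reportingHostId;
    }

    // A host no longer reports the VM: only treat that as a stop if the
    // reporting host is the one the VM is assigned to.
    boolean isStopDetected(long reportingHostId) {
        return reportingHostId == hostId;
    }
}

final class LiveMigrationDemo {
    public static void main(String[] args) {
        long source = 1L;
        long destination = 2L;
        // During a managed live migration, host_id already points at the destination.
        VmRecord vm = new VmRecord(destination, source);

        vm.onPowerOnReport(source);      // power_host flips to the source
        vm.onPowerOnReport(destination); // and back to the destination

        // Migration finishes and the source host stops reporting the VM.
        System.out.println("stop detected from source: " + vm.isStopDetected(source));
        System.out.println("stop detected from destination: " + vm.isStopDetected(destination));
    }
}
```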

@mlsorensen
Contributor Author

mlsorensen commented Aug 18, 2023

Ok, it took some manual intervention to test with KVM (copying config drive iso, editing XML) but here is what I'm seeing:

  1. With OOB live migration, instead of just the power_host flip-flopping, the host_id itself flip-flops. This could be an issue right now, as looking at the VM during this time just shows the VM as "Running" but we don't know which host it's supposed to land on. Something like a volume resize or snapshot issued during this time would probably fail. The fix for this is probably to properly detect the migrating scenario and change the VM state to Migrating, but that is outside the scope of this PR. I question whether we'd be able to detect the OOB migration without some sort of race condition. Libvirt events may help close the gap somewhat. Really, it seems out-of-band moves are not supported (at least on KVM); CloudStack just does what it can to reconcile.

  2. This change doesn't seem to have any effect on OOB handling. What I observe is that given two hosts issuing power state reports: H1 Report ... t1 ... H2 Report ... t2 ... H1 Report, and VM OOB migrating from H1 to H2, if the migration completes during t1, then H2 Report claims the VM properly and when the last H1 report comes there is no change. If the migration completes during t2, H2 was the last one to claim the VM, so again there is no effect. The only case where there could be an issue is if the H2 agent is offline, or some other situation where we receive two H1 reports in a row before H2 can claim the VM. In this case it is probably fair to notify that a VM has stopped if someone has out of band moved it to a host CloudStack has no active control over.
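
The report orderings in point 2 can be replayed with a toy simulation. It assumes, as observed in point 1, that a host reporting the VM as present claims it (updating `host_id`), and that a stop is only flagged when the host the VM is currently assigned to reports it as absent. `Report` and `OobMigrationSimulator` are hypothetical names used only for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// One power state report from a host: does the host currently see the VM?
final class Report {
    final String host;
    final boolean vmPresent;

    Report(String host, boolean vmPresent) {
        this.host = host;
        this.vmPresent = vmPresent;
    }
}

final class OobMigrationSimulator {
    // Replays the reports in order and returns the hosts for which a stop would be flagged.
    static List<String> stopsDetected(String initialHost, List<Report> reports) {
        String hostId = initialHost;
        List<String> stops = new ArrayList<>();
        for (Report r : reports) {
            if (r.vmPresent) {
                hostId = r.host;            // the reporting host claims the VM
            } else if (r.host.equals(hostId)) {
                stops.add(r.host);          // the assigned host no longer sees the VM
            }
        }
        return stops;
    }

    public static void main(String[] args) {
        // H2 claims the VM before H1 reports it gone (migration completes in t1 or t2).
        List<Report> h2Claims = Arrays.asList(
                new Report("H1", true), new Report("H2", true), new Report("H1", false));
        // H2 agent offline: two H1 reports in a row, the second without the VM.
        List<Report> h2Offline = Arrays.asList(
                new Report("H1", true), new Report("H1", false));

        System.out.println("H2 claims first: " + stopsDetected("H1", h2Claims));
        System.out.println("H2 offline:      " + stopsDetected("H1", h2Offline));
    }
}
```

In this simplified model, a migration completing during t1 and one completing during t2 produce the same sequence of claims, which matches the observation that neither case has any effect.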

@DaanHoogland
Contributor

> Perhaps useful for someone using VMware, where there is a separate system to manage VMs independently. I don't know that a test via KVM is going to really exercise it in the same way the VMware integration does.
>
> Worst case, we see an event for VMware users indicating the VM was removed from its host out of band, but we aren't making a change to how the VM state is handled and synced up for these kinds of migrations.

A scenario we have seen (which is purely a billing issue, but still) is that a VM was at a certain moment powered off, and hence absent from the reports of both hosts, while migration was still going on. This led to billing issues, as billing is a bit fragile: it relies a bit too much on a proper order of events and on matching on/off, create/delete, etc. combinations of events. What I am missing in this PR is the start event (also for OOB).
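
To make the billing concern concrete, here is a deliberately simplified model of event-based usage accounting. It is an assumption for illustration, not the CloudStack usage server: `UsageEvent` and `UsageCalculator` are hypothetical, and running time is billed only between a `VM.START` and the next `VM.STOP`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical usage event: a type and a timestamp in seconds.
final class UsageEvent {
    final String type;    // "VM.START" or "VM.STOP"
    final long timestamp;

    UsageEvent(String type, long timestamp) {
        this.type = type;
        this.timestamp = timestamp;
    }
}

final class UsageCalculator {
    // Sums the time between each VM.START and the following VM.STOP.
    // If the VM is still marked running at the end, it is billed up to 'now'.
    static long billedSeconds(List<UsageEvent> events, long now) {
        long total = 0;
        Long runningSince = null;
        for (UsageEvent e : events) {
            if ("VM.START".equals(e.type) && runningSince == null) {
                runningSince = e.timestamp;
            } else if ("VM.STOP".equals(e.type) && runningSince != null) {
                total += e.timestamp - runningSince;
                runningSince = null;
            }
        }
        if (runningSince != null) {
            total += now - runningSince;
        }
        return total;
    }

    public static void main(String[] args) {
        // A stop event with no matching start: billing ends at t=100
        // even if the VM actually keeps running elsewhere.
        List<UsageEvent> unpaired = Arrays.asList(
                new UsageEvent("VM.START", 0), new UsageEvent("VM.STOP", 100));
        // The same stop countered by a start event: billing continues.
        List<UsageEvent> paired = Arrays.asList(
                new UsageEvent("VM.START", 0), new UsageEvent("VM.STOP", 100),
                new UsageEvent("VM.START", 100));

        System.out.println("unpaired stop: " + billedSeconds(unpaired, 500));
        System.out.println("paired stop:   " + billedSeconds(paired, 500));
    }
}
```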

We will have to test this, and I think you are right that this will probably only be an issue on VMware (not sure about Xen).

@borisstoyanov (@vladimirpetrov) I think we need to give an extra eye to testing this.

> 1. The fix to this is probably to properly detect the migrating scenario and change the VM state to Migrating, but outside of the scope of this PR.

agreed

@weizhouapache
Member

smoke tests passed with message

22:26:44 msg: {"body": "<b>[SF] Trillian test result (tid-7476)</b>\nEnvironment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7\nTotal time taken: 41024 seconds\nMarvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7476-kvm-centos7.zip\nSmoke tests completed. 108 look OK, 0 have errors, 0 did not run\nOnly failed and skipped tests results shown below:\n\n\nTest | Result | Time (s) | Test File\n--- | --- | --- | ---\n"}

@DaanHoogland
Contributor

> smoke tests passed with message
>
> 22:26:44 msg: {"body": "<b>[SF] Trillian test result (tid-7476)</b>\nEnvironment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7\nTotal time taken: 41024 seconds\nMarvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7476-kvm-centos7.zip\nSmoke tests completed. 108 look OK, 0 have errors, 0 did not run\nOnly failed and skipped tests results shown below:\n\n\nTest | Result | Time (s) | Test File\n--- | --- | --- | ---\n"}

api rate limit again?

@weizhouapache
Member

> api rate limit again?

yes @DaanHoogland

@rohityadavcloud
Member

@blueorangutan package

@blueorangutan

@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Member

@rohityadavcloud rohityadavcloud left a comment

LGTM, haven't tested it, also don't know if this can cause side-effects for all hypervisors and cases, as we're skipping

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6848

@DaanHoogland
Contributor

@blueorangutan test matrix

@blueorangutan

@DaanHoogland a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-7516)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 38862 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7516-xenserver-71.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan

[SF] Trillian test result (tid-7518)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 43352 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7518-kvm-centos7.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan

[SF] Trillian test result (tid-7517)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 49874 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7517-vmware-67u3.zip
Smoke tests completed. 107 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File
--- | --- | --- | ---
test_02_upgrade_kubernetes_cluster | Failure | 669.19 | test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster | Failure | 546.81 | test_kubernetes_clusters.py

@weizhouapache weizhouapache self-requested a review August 23, 2023 07:16
Comment on lines 180 to 181
ActionEventUtils.onActionEvent(User.UID_SYSTEM, Account.ACCOUNT_ID_SYSTEM, instance.getDomainId(),
        EventTypes.EVENT_VM_STOP, "Out of band VM power off", instance.getId(), ApiCommandResourceType.VirtualMachine.toString());
Contributor

I think this event should be emitted in handlePowerOffReportWithNoPendingJobsOnVM and countered by an EVENT_VM_START in handlePowerOnReportWithNoPendingJobsOnVM, both in VirtualMachineManagerImpl. Alternatively, a proper place to handle the OOB start event in this class should be found. I think this is not a big deal for you, @mlsorensen, but for people using event/usage server based billing it would be. If an OOB migration happens, their billing for this VM would/could stop or continue based on the order of events.
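
A rough sketch of the pairing suggested here, assuming each handler only emits an event when the power report actually changes the VM's state. `PowerReportHandler` and its methods are hypothetical and only loosely mirror the method names above; this is not the `VirtualMachineManagerImpl` code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of paired stop/start events driven by power reports.
final class PowerReportHandler {
    enum PowerState { RUNNING, STOPPED }

    // Events that would be published, recorded as plain strings for inspection.
    final List<String> events = new ArrayList<>();
    private PowerState state;

    PowerReportHandler(PowerState initial) {
        this.state = initial;
    }

    // Power-off report with no pending jobs: emit VM.STOP only on a real transition.
    void handlePowerOffReport() {
        if (state == PowerState.RUNNING) {
            state = PowerState.STOPPED;
            events.add("VM.STOP");
        }
    }

    // Power-on report with no pending jobs: counter the earlier stop with VM.START.
    void handlePowerOnReport() {
        if (state == PowerState.STOPPED) {
            state = PowerState.RUNNING;
            events.add("VM.START");
        }
    }

    public static void main(String[] args) {
        PowerReportHandler handler = new PowerReportHandler(PowerState.RUNNING);
        handler.handlePowerOffReport(); // out-of-band stop
        handler.handlePowerOffReport(); // repeated report, no duplicate event
        handler.handlePowerOnReport();  // out-of-band start
        System.out.println(handler.events);
    }
}
```

Guarding on the state transition keeps stop and start events alternating, which is the property an event-based billing consumer would depend on.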

Contributor Author

Understood, I'll look into that.

Contributor Author

This may get us what you're looking for, @DaanHoogland. In testing, this didn't trigger any event at all during out-of-band migration on KVM. However, it still responded to OOB stop, and even OOB start, publishing an event in both situations.

I still don't have the means to test how the VMware side reacts to this, or if further changes would be necessary there.

Screenshot 2023-08-24 at 10 32 46 AM

Contributor

@mlsorensen , this is very hard to reproduce as it depends on the order of power sync reports coming in from hosts. I'll give this a swing after 4.18.1, and think about simulating race conditions. code looks good.

@weizhouapache
Member

moved to 4.18.2.0

@weizhouapache weizhouapache removed this from the 4.18.1.0 milestone Aug 24, 2023
@weizhouapache weizhouapache added this to the 4.18.2.0 milestone Aug 24, 2023
Signed-off-by: Marcus Sorensen <mls@apple.com>
@codecov

codecov bot commented Aug 24, 2023

Codecov Report

Merging #7878 (2146194) into 4.18 (f7345e8) will increase coverage by 0.01%.
Report is 17 commits behind head on 4.18.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##               4.18    #7878      +/-   ##
============================================
+ Coverage     13.04%   13.06%   +0.01%     
- Complexity     9067     9088      +21     
============================================
  Files          2720     2720              
  Lines        257236   257395     +159     
  Branches      40103    40130      +27     
============================================
+ Hits          33552    33621      +69     
- Misses       219474   219552      +78     
- Partials       4210     4222      +12     
Files Changed | Coverage Δ
--- | ---
...n/java/com/cloud/vm/VirtualMachineManagerImpl.java | 6.35% <0.00%> (-0.01%) ⬇️

... and 16 files with indirect coverage changes

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm

@rohityadavcloud
Member

LGTM
@blueorangutan package

@blueorangutan

@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6964

@rohityadavcloud
Member

@blueorangutan test

@blueorangutan

@rohityadavcloud a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-7632)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 54608 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7632-kvm-centos7.zip
Smoke tests completed. 105 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File
--- | --- | --- | ---
test_01_1_create_iso_with_checksum_sha1_negative | Error | 1007.09 | test_iso.py
test_02_upgrade_kubernetes_cluster | Failure | 536.57 | test_kubernetes_clusters.py
test_05_create_template_with_no_checksum | Error | 65.42 | test_templates.py

@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7040

@DaanHoogland
Contributor

@blueorangutan test matrix

@blueorangutan

@DaanHoogland a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-7674)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42743 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7674-xenserver-71.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan

[SF] Trillian test result (tid-7676)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 43120 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7676-kvm-centos7.zip
Smoke tests completed. 107 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File
--- | --- | --- | ---
test_02_upgrade_kubernetes_cluster | Failure | 589.13 | test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster | Failure | 650.62 | test_kubernetes_clusters.py

@blueorangutan

[SF] Trillian test result (tid-7675)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 46072 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7878-t7675-vmware-67u3.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@DaanHoogland
Contributor

@weizhouapache @mlsorensen @rohityadavcloud Do we merge or do we need more testing on this?

@mlsorensen
Contributor Author

@DaanHoogland from my end it's good to go.

@DaanHoogland DaanHoogland merged commit 3071ad6 into apache:4.18 Sep 25, 2023
24 of 27 checks passed
DaanHoogland added a commit that referenced this pull request Sep 25, 2023
* 4.18:
  Publish event for VM.STOP when out of band stop is detected (#7878)