New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Publish event for VM.STOP when out of band stop is detected #7878
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code lgtm
this PR has only impact on the events
@blueorangutan package |
@weizhouapache a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6825 |
@blueorangutan test |
@weizhouapache a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
@mlsorensen have you considered an out of bounds migration scenario? this might give an undesirable effect if the VM was started on another host, would it?. |
Good point. Whatever is happening in this code during out of band migration, this doesn't change that. I don't know if anyone is looking at the message bus that is currently already publishing, or if there are existing issues with how it handles the VM state, but maybe publishing to the event system will make it more obvious if there's an existing bug here. I don't know that anyone using KVM at least is doing out of band migration because the tools aren't great for that, and in many cases simply not possible due to storage and network plugins needing to run and make storage/network accessible on the host. Perhaps useful for someone using VMware, where there is a separate system to manage VMs independently. I don't know that a test via KVM is going to really exercise it in the same way the VMware integration does. Worst case, we see an event for VMware users indicating the VM was removed from its host out of band, but we aren't making a change to how the VM state is handled and synced up for these kinds of migrations. With the existing code what I see is during live migration, both source and destination hypervisors send a report that the VM is powered on, and both systems are updating the VM power state when they send pings. Thus the VM's |
Ok, it took some manual intervention to test with KVM (copying config drive iso, editing XML) but here is what I'm seeing:
|
A scenario we have seen (which is purely a billing issue but still) Is that a VM was at a certain moment powered off and hence absent in reports from both hosts, while migration was still going on. This led to billing issues as those are a bit fragile. it relies a bit too much on a proper order of events and matching on/off create/delete etc combinations of events. What I am missing in this PR is the start event (also for OOB). We will have to test this and I think you are right this will probably only be an issue on vmware (not sure about xen). @borisstoyanov (@vladimirpetrov) I think we need to give an extra eye to testing this.
agreed |
smoke tests passed with message
|
api rate limit again? |
|
@blueorangutan package |
@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, haven't tested it, also don't know if this can cause side-effects for all hypervisors and cases, as we're skipping
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6848 |
@blueorangutan test matrix |
@DaanHoogland a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests |
[SF] Trillian test result (tid-7516)
|
[SF] Trillian test result (tid-7518)
|
[SF] Trillian test result (tid-7517)
|
ActionEventUtils.onActionEvent(User.UID_SYSTEM, Account.ACCOUNT_ID_SYSTEM,instance.getDomainId(), | ||
EventTypes.EVENT_VM_STOP, "Out of band VM power off", instance.getId(), ApiCommandResourceType.VirtualMachine.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this event should be spewed at handlePowerOffReportWithNoPendingJobsOnVM
and countered by a EVENT_VM_START
in handlePowerOnReportWithNoPendingJobsOnVM
both in VirtualMachineManagerImpl
. Or alternatively a proper place to handle the OOB start event in this class should be found. I think this is not a big deal for you @mlsorensen , but for people using event/usgae server based billing it would be. If an OOB migration happens their billing for this VM would/could stop or continue based on the order of events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, I'll look into that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may get is what you're looking for @DaanHoogland - in testing this didn't trigger any event at all during out of band migration on KVM. However it still responded to OOB stop, and even OOB start, publishing an event in both situations.
I still don't have the means to test how the VMware side reacts to this, or if further changes would be necessary there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mlsorensen , this is very hard to reproduce as it depends on the order of power sync reports coming in from hosts. I'll give this a swing after 4.18.1, and think about simulating race conditions. code looks good.
moved to 4.18.2.0 |
Signed-off-by: Marcus Sorensen <mls@apple.com>
5abfcca
to
2146194
Compare
Codecov Report
@@ Coverage Diff @@
## 4.18 #7878 +/- ##
============================================
+ Coverage 13.04% 13.06% +0.01%
- Complexity 9067 9088 +21
============================================
Files 2720 2720
Lines 257236 257395 +159
Branches 40103 40130 +27
============================================
+ Hits 33552 33621 +69
- Misses 219474 219552 +78
- Partials 4210 4222 +12
... and 16 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clgtm
LGTM |
@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6964 |
@blueorangutan test |
@rohityadavcloud a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
[SF] Trillian test result (tid-7632)
|
@blueorangutan package |
@DaanHoogland a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7040 |
@blueorangutan test matrix |
@DaanHoogland a [SF] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests |
[SF] Trillian test result (tid-7674)
|
[SF] Trillian test result (tid-7676)
|
[SF] Trillian test result (tid-7675)
|
@weizhouapache @mlsorensen @rohityadavcloud Do we merge or do we need more testing on this? |
@DaanHoogland from my end it's good to go. |
* 4.18: Publish event for VM.STOP when out of band stop is detected (#7878)
Description
This PR publishes an action event for VM.STOP when the power state processor detects a VM is gone from hypervisor. Currently only a power state event is published on the message bus. This allows events to be seen and processed when VM is detected to be stopped out of band.
Additionally, it was discovered that the existing missing VM code is triggered when a VM is taking awhile to start. For example if we are waiting on the router VM to come up, the report can possibly see no VM when one is expected. The VM is assigned to the host, but doesn't exist yet, and triggers the missing VM code. A check was added to ignore the VM if it is still in "Starting" state.
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Tested out of band stop by shutting down guest within the VM, confirmed new event triggered.
Tested startup of VM where power state is processed while waiting on router to come up, confirmed events no longer triggered detecting a "missing VM" when VM is in starting state.
Tested live migration, confirmed we processed power state reports for both source and destination hypervisor hosts and did not issue the new VM.STOP event.