Prevent vm's from stopping while enabling maintenance mode #4636
Sometimes when a host is put into maintenance, the connection gets disconnected and, as a result, VMs are stopped. So check for an extra state before considering the host as down and stopping the VMs.
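To make the idea of an "extra state check" concrete, here is a minimal sketch of what such a guard could look like. It is not the diff from this PR: the class, enums, and method below are hypothetical and only loosely modeled on the notions of agent status and host resource state discussed in this thread.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: none of these names are taken from the CloudStack code base.
final class MaintenanceDisconnectSketch {

    // Simplified host resource states (assumption, for illustration only).
    enum ResourceState { ENABLED, PREPARE_FOR_MAINTENANCE, MAINTENANCE }

    // Simplified agent connection statuses (assumption, for illustration only).
    enum AgentStatus { UP, DISCONNECTED, DOWN }

    // States in which a lost connection should not yet be treated as "host down".
    private static final Set<ResourceState> ENTERING_MAINTENANCE =
            EnumSet.of(ResourceState.PREPARE_FOR_MAINTENANCE);

    // Returns true only if the agent is unreachable AND the host is not
    // in the middle of entering maintenance mode.
    static boolean shouldStopVms(AgentStatus status, ResourceState state) {
        if (status == AgentStatus.UP) {
            return false; // agent reachable: nothing to stop
        }
        // Extra check: a disconnect while preparing for maintenance
        // is not sufficient reason to stop the VMs.
        return !ENTERING_MAINTENANCE.contains(state);
    }
}
```

With this guard, a disconnect observed while the host is preparing for maintenance does not lead to stopping VMs, while a disconnect on an otherwise enabled host still does.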
@ravening shouldn't we be able to control some VM lifecycle operations (at least things like stop) if the host is in maintenance mode? Could this cause a regression in expected behaviour? What's the use case for this?
@ravening the only use case I can think of (tested before) is having a single active host, then attempting to put the host into maintenance. Can you elaborate on the use cases?
@blueorangutan package
@davidjumani a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1013
@ravening When a host is put into maintenance mode, all VMs are migrated from it.
@davidjumani if I remember correctly, this code comes into the picture only when you enable maintenance mode.
@ravening Yes, but since VMs will be migrated in maintenance mode, I don't see any benefit in this. That said, I'm a +0.
@davidjumani if you look at the description, it says "vm's are stopped", not migrated. We have seen some cases where VMs are stopped rather than migrated away.
@ravening I'm not denying the issue; it's just that when I try to reproduce it by killing the agent while the host is entering maintenance mode, the VMs still get migrated. Since I'm unable to test or reproduce the issue, it is a +0 from me. My apologies if the previous comment came out the wrong way.
@davidjumani no problem. But tell me one thing: if you kill the agent, how are the VMs getting migrated away? That shouldn't happen at all, right, as there is no communication between the management server and the agent? If the agent is dead, it won't even communicate with libvirt to initiate the migration. So I'm not sure how VMs are getting migrated after the agent is dead.
@ravening The host fails to go into maintenance mode if the agent is down, so I have to kill it after I send the command to the MS. I guess the migration is initiated during that time.
@davidjumani correct. But the migration of all VMs is not initiated at once; it happens one by one. So in the best case only one VM can get migrated, but the others won't.
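The one-by-one argument above can be illustrated with a toy model. This is only a sketch under the assumption that migrations are started strictly sequentially and that no further migration can start once the agent is lost; the class and method names are hypothetical and do not correspond to CloudStack code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of sequential migration; not CloudStack code.
final class SequentialMigrationSketch {

    // Returns the VMs whose migration was started before the agent was lost.
    // agentDiesAfter is the number of migrations that start before the agent dies.
    static List<String> migratedBeforeAgentLoss(List<String> vms, int agentDiesAfter) {
        List<String> migrated = new ArrayList<>();
        for (String vm : vms) {
            if (migrated.size() >= agentDiesAfter) {
                break; // agent unreachable: no further migration can be started
            }
            migrated.add(vm);
        }
        return migrated;
    }

    // Returns the VMs still on the host when the agent is lost.
    static List<String> leftOnHost(List<String> vms, int agentDiesAfter) {
        int done = migratedBeforeAgentLoss(vms, agentDiesAfter).size();
        return new ArrayList<>(vms.subList(done, vms.size()));
    }
}
```

For a host with three VMs where the agent is killed after the first migration has started, the model leaves two VMs on the host. Those are the VMs at risk of being stopped if the host is then treated as down.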
@blueorangutan test
@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
Trillian test result (tid-1805)
@blueorangutan package
@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1279
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1280
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1281
@blueorangutan test
@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
Trillian test result (tid-2108)
@blueorangutan package
@sureshanaparti a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.
@nvazquez What steps did you try to reproduce it? Maybe keep as many VMs as possible on a host, and try rebooting the host or disconnecting the host agent as soon as maintenance is enabled. @ravening can you confirm how to reproduce or verify this?
@sureshanaparti @nvazquez this issue mostly happens when you try to enable maintenance mode on the host, but it fails to enter maintenance mode and instead goes into the disconnected state. In that case VMs will be stopped, as they can't be reached through the agent. I'm OK with closing this PR if it can't be reproduced.
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1371
@ravening @sureshanaparti @nvazquez We encountered this two days ago while doing regular maintenance on KVM hosts (Ubuntu 18.04/16.04, ACS 4.15.1). We put a KVM host into maintenance and some VMs were migrated with a shutdown. This happened in both advanced and basic networking; around 10 VMs were affected.
2021-09-21 15:04:01,620 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-114:ctx-92ff343d job-537958/job-538524 ctx-23550d3b) (logid:ff628fbe) VM state transitted from :Running to Stopping with event: StopRequestedvm's original host id: 183 new host id: 184 host id before state transition: 184
2021-09-21 15:04:01,593 WARN [o.a.c.n.t.BasicNetworkTopology] (Work-Job-Executor-114:ctx-92ff343d job-537958/job-538524 ctx-23550d3b) (logid:ff628fbe) Unable to apply save userdata entry on disconnected router r-nnnn-VM
But the router is OK and live. I can privately send the full logs if you are interested.
@kricud
@weizhouapache We are in the process of moving to 4.15.2 and are 99% sure that it's not fixed there.
@kricud
@weizhouapache 4.15.2 has the same problems.
@kricud
@weizhouapache On both occasions when we encountered it again, there was no pattern.
Hi @kricud, any update on this issue and how to reproduce it? Thanks
@nvazquez no
@DaanHoogland @weizhouapache @rohityadavcloud this fix simply adds an extra check before disconnecting a host while waiting to enable maintenance mode, to prevent VMs from stopping. However, it is not clear how to reproduce the issue, which seems to occur only sometimes. What do you advise?
@blueorangutan package
@nvazquez a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 2946
@blueorangutan test
@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
Trillian test result (tid-3687)