
Prevent VMs from stopping while enabling maintenance mode #4636

Merged — 1 commit, Mar 25, 2022

Conversation

@ravening (Member) commented Feb 1, 2021

Description

Sometimes when a host is put into maintenance, the connection gets
disconnected and as a result VMs are stopped. So check for an extra state
before considering the host as down and stopping the VMs.
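The extra-state check described above can be sketched as follows. This is a hypothetical illustration, not the actual CloudStack code: the class and method names are invented, and the `ResourceState` values are a simplified subset (ErrorInPrepareForMaintenance is the state reported later in this thread).

```java
import java.util.EnumSet;

public class MaintenanceGuard {
    // Simplified, CloudStack-like host resource states (illustrative subset).
    enum ResourceState {
        Enabled, Disabled, PrepareForMaintenance, ErrorInPrepareForMaintenance, Maintenance
    }

    // Hypothetical version of the guard this PR describes: a host that lost
    // its agent connection while transitioning into maintenance should not be
    // declared Down, so its VMs are migrated rather than stopped.
    static boolean shouldStopVms(ResourceState state, boolean agentConnected) {
        EnumSet<ResourceState> enteringMaintenance = EnumSet.of(
                ResourceState.PrepareForMaintenance,
                ResourceState.ErrorInPrepareForMaintenance,
                ResourceState.Maintenance);
        return !agentConnected && !enteringMaintenance.contains(state);
    }

    public static void main(String[] args) {
        // Disconnected host with no maintenance in progress: presumed down.
        System.out.println(shouldStopVms(ResourceState.Enabled, false));               // true
        // Disconnected host entering maintenance: do not stop the VMs.
        System.out.println(shouldStopVms(ResourceState.PrepareForMaintenance, false)); // false
    }
}
```

The point of the sketch is only the extra condition: "agent disconnected" alone is no longer sufficient to treat the host as down.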

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Sometimes when a host is put into maintenance, the connection gets
disconnected and as a result VMs are stopped. So check for an extra state
before considering the host as down and stopping the VMs.
@shwstppr shwstppr added this to the 4.16.0.0 milestone Feb 8, 2021
@rohityadavcloud (Member)

@ravening shouldn't we want to be able to control some VM lifecycle operations (at least things like stop) if the host is in maintenance mode? Could that cause a regression in expected behaviour? What's the use case for this?

@nvazquez (Contributor)

@ravening the only use case I can think of (tested before) is having a single active host, then attempting to put the host into maintenance. Can you elaborate on the use cases?

@ravening (Member, Author)

@rhtyd @nvazquez we have seen some corner cases where, when maintenance mode was enabled on a host, the connection to the management server got disconnected; as a result it thought the host was down and stopped the VMs rather than migrating them away, which is supposed to be the default behavior.

@davidjumani (Contributor)

@blueorangutan package

@blueorangutan

@davidjumani a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1013

@davidjumani (Contributor)

@ravening When a host is put into maintenance mode, all VMs are migrated from it.
This change could also migrate VMs when the host is disconnected, even when not in maintenance mode.

@ravening (Member, Author)

> @ravening When a host is put into maintenance mode, all VMs are migrated from it.
>
> This change could also migrate VMs when the host is disconnected even when not in maintenance

@davidjumani if I remember correctly, this code comes into the picture only when you enable maintenance mode.

@davidjumani (Contributor)

@ravening Yes, however since VMs will be migrated in maintenance mode I don't see any benefit in this; I'm a +0 though.

@ravening (Member, Author)

> @ravening Yes, however since VMs will be migrated in maintenance mode I don't see any benefit in this, however I'm a +0

@davidjumani if you look at the description, it says "vm's are stopped", not migrated.

We have seen some cases where VMs are stopped rather than migrated away.

@davidjumani (Contributor)

@ravening I'm not denying the issue, just that when I try to reproduce it by killing the agent while the host is entering maintenance mode, the VMs still get migrated. Since I'm unable to test / reproduce the issue, it is a +0 from me. Apologies if the previous comment came across the wrong way.

@ravening (Member, Author)

> @ravening I'm not denying the issue, just that when I try to reproduce it by killing the agent while the host is entering maintenance mode, the VMs still get migrated. Since I'm unable to test / reproduce the issue, it is a +0 from me. My apology if the previous comment came out the wrong way

@davidjumani no problem... but tell me one thing: if you kill the agent, how are the VMs getting migrated away? That shouldn't happen at all, right, as there is no communication between the management server and the agent.

If the agent is dead, it won't even communicate with libvirt to initiate the migration, so I'm not sure how VMs are getting migrated after the agent is dead.

@davidjumani (Contributor)

@ravening The host fails to go into maintenance mode if the agent is down, so I have to kill it after I send the command to the MS. I guess the migration is initiated during that time.

@ravening (Member, Author)

> @ravening The host fails to go into maintenance mode if the agent is down, so I have to kill it after I send the command to the MS. Guess that during that time the migration is initiated

@davidjumani correct... but migration of all the VMs is not initiated at once; it happens one by one. So in the best case only one VM can get migrated but the others won't.
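The one-by-one behaviour described here can be sketched as follows. This is a hypothetical illustration, not CloudStack code: the class, method, and parameter names are invented, and `migrationsBeforeAgentDies` simulates the agent connection dropping mid-sequence.

```java
import java.util.ArrayList;
import java.util.List;

public class SequentialMigration {
    // Illustrative model: maintenance migrations are initiated sequentially,
    // so if the agent dies after the first migration completes, the VMs still
    // queued behind it never get away. Returns the VMs actually migrated.
    static List<String> migrateAll(List<String> vms, int migrationsBeforeAgentDies) {
        List<String> migrated = new ArrayList<>();
        for (String vm : vms) {
            if (migrated.size() >= migrationsBeforeAgentDies) {
                break; // agent connection lost: no further migrations possible
            }
            migrated.add(vm); // stands in for a real migration round-trip
        }
        return migrated;
    }

    public static void main(String[] args) {
        List<String> vms = List.of("vm-1", "vm-2", "vm-3");
        // Agent dies after one migration: only vm-1 escapes the host.
        System.out.println(migrateAll(vms, 1)); // [vm-1]
    }
}
```

This matches the scenario in the comment above: killing the agent just after the maintenance command leaves at most the first in-flight migration completed.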

@DaanHoogland (Contributor)

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-1805)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 71532 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4636-t1805-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_password_server.py
Intermittent failure detected: /marvin/tests/smoke/test_public_ip_range.py
Intermittent failure detected: /marvin/tests/smoke/test_reset_vm_on_reboot.py
Intermittent failure detected: /marvin/tests/smoke/test_resource_accounting.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dhcphosts.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dns.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dnsservice.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_iptables_default_policy.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
Intermittent failure detected: /marvin/tests/smoke/test_routers.py
Intermittent failure detected: /marvin/tests/smoke/test_secondary_storage.py
Intermittent failure detected: /marvin/tests/smoke/test_service_offerings.py
Intermittent failure detected: /marvin/tests/smoke/test_snapshots.py
Intermittent failure detected: /marvin/tests/smoke/test_ssvm.py
Intermittent failure detected: /marvin/tests/smoke/test_storage_policy.py
Intermittent failure detected: /marvin/tests/smoke/test_templates.py
Intermittent failure detected: /marvin/tests/smoke/test_usage.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_deployment_planner.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_snapshots.py
Intermittent failure detected: /marvin/tests/smoke/test_volumes.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_router_nics.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Intermittent failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 65 look OK, 24 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestVPCNics>:setup Error 0.00 test_vpc_router_nics.py
ContextSuite context=TestResetVmOnReboot>:setup Error 0.00 test_reset_vm_on_reboot.py
ContextSuite context=TestRAMCPUResourceAccounting>:setup Error 0.00 test_resource_accounting.py
ContextSuite context=TestRVPCSite2SiteVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVPCSite2SiteVPNMultipleOptions>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVpcRemoteAccessVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestVpcSite2SiteVpn>:setup Error 0.00 test_vpc_vpn.py
ContextSuite context=TestRouterDHCPHosts>:setup Error 0.00 test_router_dhcphosts.py
ContextSuite context=TestRouterDHCPOpts>:setup Error 0.00 test_router_dhcphosts.py
ContextSuite context=TestRouterDns>:setup Error 0.00 test_router_dns.py
ContextSuite context=TestRouterDnsService>:setup Error 0.00 test_router_dnsservice.py
ContextSuite context=TestRouterIpTablesPolicies>:setup Error 0.00 test_routers_iptables_default_policy.py
ContextSuite context=TestVPCIpTablesPolicies>:setup Error 0.00 test_routers_iptables_default_policy.py
ContextSuite context=TestIsolatedNetworks>:setup Error 0.00 test_routers_network_ops.py
ContextSuite context=TestRedundantIsolateNetworks>:setup Error 0.00 test_routers_network_ops.py
ContextSuite context=TestRouterServices>:setup Error 0.00 test_routers.py
test_01_sys_vm_start Failure 0.08 test_secondary_storage.py
ContextSuite context=TestCpuCapServiceOfferings>:setup Error 0.00 test_service_offerings.py
ContextSuite context=TestServiceOfferings>:setup Error 0.12 test_service_offerings.py
ContextSuite context=TestSnapshotRootDisk>:setup Error 0.00 test_snapshots.py
test_01_list_sec_storage_vm Failure 0.03 test_ssvm.py
test_02_list_cpvm_vm Failure 0.03 test_ssvm.py
test_03_ssvm_internals Failure 0.02 test_ssvm.py
test_04_cpvm_internals Failure 0.02 test_ssvm.py
test_05_stop_ssvm Failure 0.03 test_ssvm.py
test_06_stop_cpvm Failure 0.03 test_ssvm.py
test_07_reboot_ssvm Failure 0.03 test_ssvm.py
test_08_reboot_cpvm Failure 0.03 test_ssvm.py
test_09_reboot_ssvm_forced Failure 0.02 test_ssvm.py
test_10_reboot_cpvm_forced Failure 0.03 test_ssvm.py
test_11_destroy_ssvm Failure 0.03 test_ssvm.py
test_12_destroy_cpvm Failure 0.03 test_ssvm.py
test_02_cancel_host_maintenace_with_migration_jobs Error 1.39 test_host_maintenance.py
test_03_cancel_host_maintenace_with_migration_jobs_failure Error 1.52 test_host_maintenance.py
test_01_cancel_host_maintenance_ssh_enabled_agent_connected Failure 18.41 test_host_maintenance.py
test_03_cancel_host_maintenance_ssh_disabled_agent_connected Failure 18.40 test_host_maintenance.py
test_04_cancel_host_maintenance_ssh_disabled_agent_disconnected Failure 29.27 test_host_maintenance.py
ContextSuite context=TestHostMaintenanceAgents>:teardown Error 30.36 test_host_maintenance.py
ContextSuite context=TestVMWareStoragePolicies>:setup Error 0.00 test_storage_policy.py
test_02_create_template_with_checksum_sha1 Error 65.35 test_templates.py
test_03_create_template_with_checksum_sha256 Error 65.38 test_templates.py
test_04_create_template_with_checksum_md5 Error 65.36 test_templates.py
test_05_create_template_with_no_checksum Error 65.33 test_templates.py
test_02_deploy_vm_from_direct_download_template Error 1.18 test_templates.py
ContextSuite context=TestTemplates>:setup Error 17.96 test_templates.py
ContextSuite context=TestISOUsage>:setup Error 0.00 test_usage.py
ContextSuite context=TestLBRuleUsage>:setup Error 0.00 test_usage.py
ContextSuite context=TestNatRuleUsage>:setup Error 0.00 test_usage.py
ContextSuite context=TestPublicIPUsage>:setup Error 0.00 test_usage.py
ContextSuite context=TestSnapshotUsage>:setup Error 0.00 test_usage.py
ContextSuite context=TestVmUsage>:setup Error 0.00 test_usage.py
ContextSuite context=TestVolumeUsage>:setup Error 0.00 test_usage.py
ContextSuite context=TestVpnUsage>:setup Error 0.00 test_usage.py
test_01_deploy_vm_on_specific_host Error 2.27 test_vm_deployment_planner.py
test_02_deploy_vm_on_specific_cluster Error 1.20 test_vm_deployment_planner.py
test_03_deploy_vm_on_specific_pod Error 1.22 test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster Error 2.25 test_vm_deployment_planner.py
test_05_deploy_vm_on_cluster_override_pod Error 2.23 test_vm_deployment_planner.py
ContextSuite context=TestDeployVM>:setup Error 0.00 test_vm_life_cycle.py
test_01_secure_vm_migration Error 76.29 test_vm_life_cycle.py
test_02_unsecure_vm_migration Error 216.55 test_vm_life_cycle.py
test_03_secured_to_nonsecured_vm_migration Error 145.79 test_vm_life_cycle.py
test_04_nonsecured_to_secured_vm_migration Error 145.85 test_vm_life_cycle.py
ContextSuite context=TestVMLifeCycle>:setup Error 1.65 test_vm_life_cycle.py
ContextSuite context=TestVmSnapshot>:setup Error 1.62 test_vm_snapshots.py
ContextSuite context=TestCreateVolume>:setup Error 0.00 test_volumes.py
ContextSuite context=TestVolumes>:setup Error 0.00 test_volumes.py
ContextSuite context=TestVPCRedundancy>:setup Error 0.00 test_vpc_redundant.py
test_disable_oobm_ha_state_ineligible Error 1511.10 test_hostha_kvm.py

@nvazquez (Contributor)

@blueorangutan package

@blueorangutan

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1279

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1280

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1281

@DaanHoogland (Contributor)

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-2108)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 36781 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4636-t2108-kvm-centos7.zip
Smoke tests completed. 88 look OK, 1 have errors
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_nic Error 142.19 test_nic.py

@weizhouapache (Member)

@ravening @nvazquez @rhtyd can we move this to 4.16.1?

@sureshanaparti (Contributor)

@blueorangutan package

@blueorangutan

@sureshanaparti a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@sureshanaparti (Contributor)

> Been trying to reproduce the issue but couldn't, pinging @rhtyd @weizhouapache @sureshanaparti the changes makes sense however I could not reproduce the failure to test the fix, please advise if you figure out a way to reproduce it

@nvazquez What steps did you try to reproduce it? Maybe keep as many VMs as possible on a host, and try rebooting the host / disconnecting the host agent as soon as maintenance is enabled. @ravening can you confirm how to reproduce / verify this?

@ravening (Member, Author)

> Been trying to reproduce the issue but couldn't, pinging @rhtyd @weizhouapache @sureshanaparti the changes makes sense however I could not reproduce the failure to test the fix, please advise if you figure out a way to reproduce it
>
> @nvazquez What steps you tried to reproduce it? May be, keep as many VMs possible in a host, and try rebooting host / disconnecting host agent as soon as maintenance is enabled. @ravening can you confirm on how to reproduce / verify this?

@sureshanaparti @nvazquez this issue mostly happens when you try to enable maintenance mode on the host but it fails to enter maintenance mode and instead goes into the disconnected state. In that case VMs will be stopped as they can't be reached through the agent.

I'm OK to close this PR if it can't be reproduced.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1371

@weizhouapache weizhouapache modified the milestones: 4.16.0.0, 4.16.1.0 Sep 23, 2021
@kricud commented Sep 24, 2021

@ravening @sureshanaparti @nvazquez Encountered this two days ago while doing regular maintenance on KVM hosts (Ubuntu 18.04/16.04, ACS 4.15.1). We put a KVM host in maintenance and some VMs were shut down instead of being migrated. It happened in both advanced and basic networking; around 10 VMs were affected.

2021-09-21 15:04:01,620 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-114:ctx-92ff343d job-537958/job-538524 ctx-23550d3b) (logid:ff628fbe) VM state transitted from :Running to Stopping with event: StopRequestedvm's original host id: 183 new host id: 184 host id before state transition: 184
2021-09-21 15:04:01,793 DEBUG [c.c.a.m.ClusteredAgentAttache] (Work-Job-Executor-114:ctx-92ff343d job-537958/job-538524 ctx-23550d3b) (logid:ff628fbe) Seq 184-8823114619972222986: Forwarding Seq 184-8823114619972222986: { Cmd , MgmtId: 345051779916, via: 184(zzz1.zzz.ccc), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.StopCommand":{"isProxy":"false","checkBeforeCleanup":"false","forceStop":"true","volumesToDisconnect":[],"vmName":"i-hhh-kkkkk-VM","executeInSequence":"false","wait":"0","bypassHostMaintenance":"false"}}] } to 345051789152

2021-09-21 15:04:01,593 WARN [o.a.c.n.t.BasicNetworkTopology] (Work-Job-Executor-114:ctx-92ff343d job-537958/job-538524 ctx-23550d3b) (logid:ff628fbe) Unable to apply save userdata entry on disconnected router r-nnnn-VM

But the router is OK/alive.

I can privately send all the logs if you are interested.

@weizhouapache (Member)

> @ravening @sureshanaparti @nvazquez Encountered this two days ago. Doing regular maintenance on KVM hosts Ubuntu 18.04/16.04 ACS 4.15.1. Put KVM host in maintenance and some VM are migrate with shutdown. Happened in advanced and basic networking around 10~ VM where affected.

@kricud
I am not sure if it has been fixed in 4.15.2.0 (there are many bug fixes in 4.15.2.0). It would be better to upgrade to 4.15.2.0.

@kricud commented Sep 24, 2021

@weizhouapache We are in the process of moving to 4.15.2 and I'm 99% sure it's not fixed in it.
From the discussion above I understand that it can't be reproduced. My intention with this comment was to confirm that it is a real problem. When you put a host into maintenance you get ErrorInPrepareForMaintenance and the VMs go down.

@rohityadavcloud rohityadavcloud modified the milestones: 4.16.1.0, 4.17.0.0 Nov 25, 2021
@weizhouapache (Member)

> @weizhouapache We are in process of moving to 4.15.2 and 99% that it's not fixed in it. From discussion above I understand that it can't be repeated. With my comment intention was to confirm that it is a problem. When you put host in maintenance you get ErrorInPrepareForMainatance~ and VM's are going down.

@kricud
Have you upgraded to 4.15.2? Do you still have the issue you mentioned before?

@kricud commented Feb 16, 2022

@weizhouapache 4.15.2 has the same problems.

@weizhouapache (Member)

> @weizhouapache 4.15.2 has same problems.

@kricud
Do you have an idea how to reproduce the issue?

@kricud commented Feb 16, 2022

@weizhouapache On both occasions we encountered it there was no pattern.
Now we are doing it by hand per instance, and after pressing maintenance we hope that the routers will not shut down.

@nvazquez (Contributor)

Hi @kricud any update on this issue and how to reproduce it? Thanks

@kricud commented Mar 17, 2022

@nvazquez no

@nvazquez (Contributor)

@DaanHoogland @weizhouapache @rohityadavcloud this fix simply adds an extra check before disconnecting a host while waiting to enable maintenance mode, to prevent VMs from being stopped. However it is not clear how to reproduce the issue; it seems to occur only occasionally. What do you advise?

@nvazquez (Contributor)

Hi @ravening @kricud since we are not able to reproduce it, can you confirm whether this fix has solved the issue in your environment (in case you have applied it)? I think we could merge it in that case.

@nvazquez (Contributor)

@blueorangutan package

@blueorangutan

@nvazquez a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 2946

@nvazquez (Contributor)

@blueorangutan test

@blueorangutan

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3687)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 31729 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4636-t3687-kvm-centos7.zip
Smoke tests completed. 92 look OK, 0 have errors

@nvazquez nvazquez merged commit aa00ef9 into apache:main Mar 25, 2022
10 participants