Closed
Description
ISSUE TYPE
- Bug Report
COMPONENT NAME
HA
CLOUDSTACK VERSION
4.15.2.0 or higher (and maybe lower)
CONFIGURATION
Advanced zone with VMware hypervisor
OS / ENVIRONMENT
N/A
SUMMARY
STEPS TO REPRODUCE
- Deploy HA-enabled VMs on a particular host. Some VRs can also be placed on the same host
- Simulate host disconnection (I powered off the ESXi host)
- Observe that CloudStack marks the host as being in the Alert state
EXPECTED RESULTS
HA VMs and VRs are migrated to a different host, or at least some migration attempt is made
ACTUAL RESULTS
VMs continue to show as Running in CloudStack. The host continues to show in the Alert state
Logs:
[root@ref-trl-4587-v-M7-abhishek-kumar-mgmt1 ~]# grep "acba27c2" /var/log/cloudstack/management/management-server.log
2023-03-07 11:36:50,161 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Investigating why host 13 has disconnected with event PingTimeout
2023-03-07 11:36:50,161 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) checking if agent (13) is alive
2023-03-07 11:36:50,163 DEBUG [c.c.a.t.Request] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Seq 13-7700592412850127195: Sending { Cmd , MgmtId: 32989308257179, via: 13(10.0.32.160), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2023-03-07 11:36:50,163 DEBUG [c.c.a.t.Request] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Seq 13-7700592412850127195: Executing: { Cmd , MgmtId: 32989308257179, via: 13(10.0.32.160), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2023-03-07 11:36:50,163 INFO [c.c.h.v.r.VmwareResource] (DirectAgent-441:ctx-b54e839c 10.0.32.160, cmd: CheckHealthCommand) (logid:acba27c2) Executing resource CheckHealthCommand: {"wait":50,"bypassHostMaintenance":false}
2023-03-07 11:36:50,194 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgent-441:ctx-b54e839c) (logid:acba27c2) Seq 13-7700592412850127195: Response Received:
2023-03-07 11:36:50,194 DEBUG [c.c.a.t.Request] (DirectAgent-441:ctx-b54e839c) (logid:acba27c2) Seq 13-7700592412850127195: Processing: { Ans: , MgmtId: 32989308257179, via: 13(10.0.32.160), Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckHealthAnswer":{"result":"false","details":"resource is not alive","wait":"0","bypassHostMaintenance":"false"}}] }
2023-03-07 11:36:50,194 DEBUG [c.c.a.t.Request] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Seq 13-7700592412850127195: Received: { Ans: , MgmtId: 32989308257179, via: 13(10.0.32.160), Ver: v1, Flags: 10, { CheckHealthAnswer } }
2023-03-07 11:36:50,194 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Details from executing class com.cloud.agent.api.CheckHealthCommand: resource is not alive
2023-03-07 11:36:50,195 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) SimpleInvestigator unable to determine the state of the host. Moving on.
2023-03-07 11:36:50,195 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) XenServerInvestigator unable to determine the state of the host. Moving on.
2023-03-07 11:36:50,195 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) KVMInvestigator unable to determine the state of the host. Moving on.
2023-03-07 11:36:50,195 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) HypervInvestigator unable to determine the state of the host. Moving on.
2023-03-07 11:36:50,195 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) VMwareInvestigator was able to determine host 13 is in Disconnected
2023-03-07 11:36:50,195 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) The agent from host 13 state determined is Disconnected
2023-03-07 11:36:50,195 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Agent is disconnected but the host is still up: 13-10.0.32.160
2023-03-07 11:36:50,197 WARN [c.c.a.AlertManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) AlertType:: 7 | dataCenterId:: 1 | podId:: 1 | clusterId:: null | message:: Host disconnected, name: 10.0.32.160 (id:13), availability zone: ref-trl-4587-v-M7-abhishek-kumar, pod: Pod1
2023-03-07 11:36:50,200 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Host 13 is disconnecting with event AgentDisconnected
2023-03-07 11:36:50,201 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) The next status of agent 13is Alert, current status is Up
2023-03-07 11:36:50,201 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Deregistering link for 13 with state Alert
2023-03-07 11:36:50,201 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Remove Agent : 13
2023-03-07 11:36:50,201 DEBUG [c.c.a.m.DirectAgentAttache] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Processing disconnect 13(10.0.32.160)
2023-03-07 11:36:50,201 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.hypervisor.xenserver.discoverer.XcpServerDiscoverer
2023-03-07 11:36:50,201 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.hypervisor.hyperv.discoverer.HypervServerDiscoverer
2023-03-07 11:36:50,201 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.network.security.SecurityGroupListener
2023-03-07 11:36:50,201 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.deploy.DeploymentPlanningManagerImpl
2023-03-07 11:36:50,201 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.storage.secondary.SecondaryStorageListener
2023-03-07 11:36:50,201 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.hypervisor.vmware.manager.VmwareManagerImpl
2023-03-07 11:36:50,201 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.storage.listener.StoragePoolMonitor
2023-03-07 11:36:50,204 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: org.apache.cloudstack.engine.orchestration.NetworkOrchestrator
2023-03-07 11:36:50,205 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.vm.ClusteredVirtualMachineManagerImpl
2023-03-07 11:36:50,205 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.network.NetworkUsageManagerImpl$DirectNetworkStatsListener
2023-03-07 11:36:50,205 DEBUG [c.c.n.NetworkUsageManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Disconnected called on 13 with status Alert
2023-03-07 11:36:50,205 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.storage.LocalStoragePoolListener
2023-03-07 11:36:50,205 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.storage.upload.UploadListener
2023-03-07 11:36:50,208 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.network.SshKeysDistriMonitor
2023-03-07 11:36:50,210 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.network.router.VpcVirtualNetworkApplianceManagerImpl
2023-03-07 11:36:50,210 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.capacity.StorageCapacityListener
2023-03-07 11:36:50,210 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.capacity.ComputeCapacityListener
2023-03-07 11:36:50,210 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.agent.manager.AgentManagerImpl$BehindOnPingListener
2023-03-07 11:36:50,210 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.agent.manager.AgentManagerImpl$SetHostParamsListener
2023-03-07 11:36:50,210 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.consoleproxy.ConsoleProxyListener
2023-03-07 11:36:50,211 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.network.SshKeysDistriMonitor
2023-03-07 11:36:50,214 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.network.router.VirtualNetworkApplianceManagerImpl
2023-03-07 11:36:50,214 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Sending Disconnect to listener: com.cloud.storage.download.DownloadListener
2023-03-07 11:36:50,216 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 13, name = 10.0.32.160]
2023-03-07 11:36:50,222 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Notifying other nodes of to disconnect
2023-03-07 11:36:50,223 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) Notifying other nodes of to disconnect
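In the excerpt above, the generic investigators give up, VMwareInvestigator reports Disconnected, and the host then moves to Alert. To pull those verdicts out of a longer management-server.log, something like the following Python sketch could help. It is only a log filter, not CloudStack code: the function name `investigator_verdicts` is made up for illustration, and the regular expressions assume the two message formats visible in the lines above.

```python
import re

# Patterns assume only the two message formats seen in the log excerpt.
UNDETERMINED = re.compile(
    r"(\w+Investigator) unable to determine the state of the host"
)
DETERMINED = re.compile(
    r"(\w+Investigator) was able to determine host (\d+) is in (\w+)"
)


def investigator_verdicts(lines):
    """Return (investigator, host_id, state) tuples in log order.

    host_id and state are None when the investigator could not decide.
    """
    verdicts = []
    for line in lines:
        match = DETERMINED.search(line)
        if match:
            verdicts.append((match.group(1), int(match.group(2)), match.group(3)))
            continue
        match = UNDETERMINED.search(line)
        if match:
            verdicts.append((match.group(1), None, None))
    return verdicts


# Two lines copied from the excerpt above.
sample = [
    "2023-03-07 11:36:50,195 DEBUG [c.c.h.HighAvailabilityManagerImpl] "
    "(AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) KVMInvestigator unable "
    "to determine the state of the host. Moving on.",
    "2023-03-07 11:36:50,195 DEBUG [c.c.h.HighAvailabilityManagerImpl] "
    "(AgentTaskPool-7:ctx-f77c38a5) (logid:acba27c2) VMwareInvestigator was "
    "able to determine host 13 is in Disconnected",
]

for verdict in investigator_verdicts(sample):
    print(verdict)
```

Running it over the full grep output would show at a glance which investigator decided the host state and what it decided.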
After that, logs like these appear:
2023-03-07 11:42:47,227 DEBUG [c.c.a.m.AgentManagerImpl] (RouterStatusMonitor-1:ctx-5e559960) (logid:d8defb83) Can not send command com.cloud.agent.api.routing.SetMonitorServiceCommand due to Host 13 is not up
2023-03-07 11:42:47,228 ERROR [c.c.n.r.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:ctx-5e559960) (logid:d8defb83) Unable to update health checks data to router r-19-VM
...
...
2023-03-07 11:45:22,283 DEBUG [c.c.a.r.v.VirtualRoutingResource] (ClusteredAgentManager Timer:ctx-75540959) (logid:235791a1) The router.aggregation.command.each.timeout in seconds is set to 600
2023-03-07 11:45:22,284 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-13:ctx-05bd1038) (logid:ee70b59f) Simulating start for resource 10.0.32.160 id 13
2023-03-07 11:45:22,284 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-13:ctx-05bd1038) (logid:ee70b59f) Creating agent for host 13
2023-03-07 11:45:22,298 INFO [c.c.h.v.r.VmwareResource] (AgentTaskPool-13:ctx-05bd1038) (logid:ee70b59f) Host 10.0.32.160 is not in connected state
2023-03-07 11:45:22,298 INFO [c.c.r.ResourceManagerImpl] (AgentTaskPool-13:ctx-05bd1038) (logid:ee70b59f) Unable to fully initialize the agent because no StartupCommands are returned
2023-03-07 11:45:22,298 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-13:ctx-05bd1038) (logid:ee70b59f) Completed creating agent for host 13
...
2023-03-07 11:51:22,279 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-60a17032) (logid:ea25577d) Simulating start for resource 10.0.32.160 id 13
2023-03-07 11:51:22,280 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-60a17032) (logid:ea25577d) Creating agent for host 13
2023-03-07 11:51:22,298 INFO [c.c.h.v.r.VmwareResource] (AgentTaskPool-2:ctx-60a17032) (logid:ea25577d) Host 10.0.32.160 is not in connected state
2023-03-07 11:51:22,298 INFO [c.c.r.ResourceManagerImpl] (AgentTaskPool-2:ctx-60a17032) (logid:ea25577d) Unable to fully initialize the agent because no StartupCommands are returned
2023-03-07 11:51:22,298 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-60a17032) (logid:ea25577d) Completed creating agent for host 13
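The two "Simulating start" entries above are six minutes apart and neither is followed by any HA activity. To check how often these reconnect attempts recur in a full log, a small Python sketch along these lines could be used. The helper names `reconnect_attempts` and `gaps_in_seconds` are hypothetical, the pattern assumes the message format shown above, and milliseconds are deliberately ignored.

```python
import re
from datetime import datetime

# Timestamp at line start, then the "Simulating start" message as logged above.
START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} "
    r".*Simulating start for resource \S+ id (\d+)"
)


def reconnect_attempts(lines, host_id):
    """Return the timestamps (second precision) of reconnect attempts for host_id."""
    times = []
    for line in lines:
        match = START.match(line)
        if match and int(match.group(2)) == host_id:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def gaps_in_seconds(times):
    """Return the seconds elapsed between consecutive timestamps."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# The two attempts copied from the excerpt above.
sample = [
    "2023-03-07 11:45:22,284 DEBUG [c.c.a.m.AgentManagerImpl] "
    "(AgentTaskPool-13:ctx-05bd1038) (logid:ee70b59f) Simulating start for "
    "resource 10.0.32.160 id 13",
    "2023-03-07 11:51:22,279 DEBUG [c.c.a.m.AgentManagerImpl] "
    "(AgentTaskPool-2:ctx-60a17032) (logid:ea25577d) Simulating start for "
    "resource 10.0.32.160 id 13",
]

print(gaps_in_seconds(reconnect_attempts(sample, 13)))
```

For the two lines shown, the gap is 360 seconds; whether that spacing holds across the rest of the log is something the script would have to confirm on the real file.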