
KVMHAMonitor thread blocks indefinitely while NFS not available #2890

Closed

csquire opened this issue Oct 8, 2018 · 13 comments

csquire (Contributor) commented Oct 8, 2018

ISSUE TYPE
  • Bug Report
COMPONENT NAME
KVM Agent
CLOUDSTACK VERSION
4.11.2.0-41120rc2
CONFIGURATION
OS / ENVIRONMENT
SUMMARY

Also see comment thread on PR #2722

We installed an RC release that includes PR #2722 on a test system, expecting the host to be marked as Disconnected after using iptables to drop NFS requests, but instead the host gets marked as Down. My investigation shows that the line `storage = conn.storagePoolLookupByUUIDString(uuid);` blocks indefinitely. As a result, kvmheartbeat.sh is never executed, a host investigation is started, the host with blocked NFS is marked as Down, and finally all VMs on that host are rescheduled, resulting in duplicate VMs.

I pulled a thread dump and found the KVMHAMonitor thread will hang here until NFS is unblocked.

java.lang.Thread.State: RUNNABLE
      at com.sun.jna.Native.invokePointer(Native Method)
      at com.sun.jna.Function.invokePointer(Function.java:470)
      at com.sun.jna.Function.invoke(Function.java:404)
      at com.sun.jna.Function.invoke(Function.java:315)
      at com.sun.jna.Library$Handler.invoke(Library.java:212)
      at com.sun.proxy.$Proxy3.virStoragePoolLookupByUUIDString(Unknown Source)
      at org.libvirt.Connect.storagePoolLookupByUUIDString(Unknown Source)
      at com.cloud.hypervisor.kvm.resource.KVMHAMonitor$Monitor.runInContext(KVMHAMonitor.java:95)
      - locked <1afb3370> (a java.util.concurrent.ConcurrentHashMap)
      at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
      at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
      at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
      at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
      at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
      at java.lang.Thread.run(Thread.java:748)

 Locked ownable synchronizers:
      - None
STEPS TO REPRODUCE

EXPECTED RESULTS
The host still runs kvmheartbeat.sh and shows as `Disconnected`
ACTUAL RESULTS
The host heartbeat hangs and the host gets marked as `Down` via host investigation
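
For illustration, here is a minimal sketch of how the blocking lookup could be bounded with a timeout so the heartbeat path cannot hang forever. This is not the change that was actually merged; `TimedPoolLookup` is a hypothetical helper around the existing `org.libvirt.Connect` API:

```java
// Hypothetical sketch only, not the actual CloudStack fix.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.libvirt.Connect;
import org.libvirt.StoragePool;

class TimedPoolLookup {
    private static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor();

    /**
     * Looks up a storage pool but gives up after timeoutSeconds, so the caller can
     * still run kvmheartbeat.sh (or treat the pool as unreachable) instead of
     * blocking indefinitely inside the native libvirt call.
     */
    static StoragePool lookupWithTimeout(Connect conn, String uuid, long timeoutSeconds)
            throws Exception {
        Future<StoragePool> future =
                EXECUTOR.submit(() -> conn.storagePoolLookupByUUIDString(uuid));
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Cancelling does not interrupt the native JNA call; the worker thread may
            // stay blocked until NFS recovers, but the monitor loop can move on.
            future.cancel(true);
            throw new Exception("Lookup of storage pool " + uuid + " timed out after "
                    + timeoutSeconds + "s; treating the pool as unreachable", e);
        }
    }
}
```
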
csquire (Contributor, author) commented Oct 9, 2018

After performing more tests, virtlockd does help prevent duplicate VMs. With Host HA disabled, the blocked host goes to a Down state, but the VMs continue to run on the blocked host and new copies don't start on other hosts because of the lock on the root disk.

But if Host HA is enabled, the blocked host gets rebooted because of the bug above. The VMs on the host get rescheduled on that same host when it comes back up, but the host still should not have been rebooted just because NFS was blocked.

somejfn commented Oct 17, 2018

This morning I confirmed the behavior on 4.9 is different from 4.11. When there's a long-lasting (say 15 minutes) NFS hang, the agent stays Up, and when NFS operations resume everyone's happy. Note we did disable the automatic reboot in the heartbeat script for that to work. This saved us from massive reboots and VM outages previously, when a network maintenance cut all KVM hosts off from NFS for 22 minutes.

somejfn commented Oct 23, 2018

Confirmed we see similar behavior on 4.11.2rc3, and the agent went into the Down state. Agent logs:

2018-10-23 13:14:40,391 INFO [kvm.resource.LibvirtConnection] (agentRequest-Handler-4:null) (logid:f8cd7cf7) No existing libvirtd connection found. Opening a new one
2018-10-23 13:14:40,392 WARN [kvm.resource.LibvirtConnection] (agentRequest-Handler-4:null) (logid:f8cd7cf7) Can not find a connection for Instance i-4-24-VM. Assuming the default connection.
2018-10-23 13:14:40,399 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:f8cd7cf7) Trying to fetch storage pool 4e49054a-463f-306f-9678-b0d9b02af9a1 from libvirt
2018-10-23 13:14:51,496 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:3a0df8e5) Trying to fetch storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c from libvirt
2018-10-23 13:14:51,498 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:3a0df8e5) Asking libvirt to refresh storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c
2018-10-23 13:15:25,027 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:581a1d95) Trying to fetch storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c from libvirt
2018-10-23 13:15:25,029 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:581a1d95) Asking libvirt to refresh storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c
2018-10-23 13:15:25,590 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:581a1d95) Trying to fetch storage pool 3e810986-e702-36ea-a87b-fd48064ecb12 from libvirt
2018-10-23 13:15:25,592 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:581a1d95) Asking libvirt to refresh storage pool 3e810986-e702-36ea-a87b-fd48064ecb12

2018-10-23 13:21:28,804 WARN [kvm.resource.KVMHAChecker] (Script-3:null) (logid:) Interrupting script.
2018-10-23 13:21:28,806 WARN [kvm.resource.KVMHAChecker] (pool-15160-thread-1:null) (logid:c3d5dcaf) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.73.96.232 -p /vol/t500_0_fls3_pool36_root -m /mnt/d05f1c9d-9454-3707-a6c4-781398af198d -h 10.73.96.212 -r -t 60 . Output is:
2018-10-23 13:21:32,826 WARN [kvm.resource.KVMHAChecker] (Script-7:null) (logid:) Interrupting script.
2018-10-23 13:21:32,827 WARN [kvm.resource.KVMHAChecker] (pool-15161-thread-1:null) (logid:c3d5dcaf) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.73.96.232 -p /vol/t500_0_fls3_pool36_root -m /mnt/d05f1c9d-9454-3707-a6c4-781398af198d -h 10.73.96.212 -r -t 60 . Output is:
2018-10-23 13:21:36,846 WARN [kvm.resource.KVMHAChecker] (Script-4:null) (logid:) Interrupting script.
2018-10-23 13:21:36,847 WARN [kvm.resource.KVMHAChecker] (pool-15162-thread-1:null) (logid:4a3cb34f) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.73.96.232 -p /vol/t500_0_fls3_pool36_root -m /mnt/d05f1c9d-9454-3707-a6c4-781398af198d -h 10.73.96.212 -r -t 60 . Output is:
2018-10-23 13:24:44,205 INFO [cloud.agent.Agent] (Agent-Handler-1:null) (logid:5a5a7500) Lost connection to host: 10.73.96.19. Attempting reconnection while we still have 5 commands in progress.
2018-10-23 13:24:44,206 INFO [utils.nio.NioClient] (Agent-Handler-1:null) (logid:5a5a7500) NioClient connection closed
2018-10-23 13:24:44,206 INFO [cloud.agent.Agent] (Agent-Handler-1:null) (logid:5a5a7500) Reconnecting to host:10.73.96.19
2018-10-23 13:24:44,207 INFO [utils.nio.NioClient] (Agent-Handler-1:null) (logid:5a5a7500) Connecting to 10.73.96.19:8250
2018-10-23 13:24:44,207 INFO [utils.nio.Link] (Agent-Handler-1:null) (logid:5a5a7500) Conf file found: /etc/cloudstack/agent/agent.properties

Note that sometimes the agent does successfully go into the Disconnected state, but the Host HA framework might still fire after the kvm.ha.degraded.max.period timer, which is not expected. In any case, we want to avoid massive KVM host resets via IPMI for storage-related problems, because that is more damaging than waiting for primary storage to come back.

csquire (Contributor, author) commented Oct 25, 2018

When you block NFS on a host, eventually all the agentRequest-Handler and UgentTask threads gradually hang as well. The host appears to be marked as Down at about the time the last UgentTask handler thread hangs.

threaddump.txt

somejfn commented Oct 26, 2018

@rhtyd

rohityadavcloud self-assigned this Oct 26, 2018
rohityadavcloud added this to the 4.11.2.0 milestone Oct 27, 2018
borisstoyanov (Contributor) commented Oct 29, 2018

hi @csquire @somejfn, thanks for this issue!

I think it's correct that the host goes into the 'Down' state after losing its grip on the storage, since this basically makes it inoperative. Going into the 'Disconnected' state would only mean the connection between management server and host is compromised.

On the other hand, duplicated VMs are definitely something that needs to be addressed before marking the host as 'Down' when VM-HA is enabled. Just to be sure, can you please confirm you don't see these duplicated VMs on instances without VM-HA enabled? I'd like to narrow this issue down and make sure it's in the VM-HA logic.

somejfn commented Oct 29, 2018 via email

borisstoyanov (Contributor) commented:

I think the leanest way to fence the resource would be, prior to setting the host Down, to iterate over all its VMs and shut them down, and only then proceed to mark the host as 'Down'. Once we're there, there's no issue with VM-HA starting a new instance on a separate host.
I guess this needs further investigation and a fix as described.
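
As a rough sketch of that ordering (all names here, HostFencer, VmApi and HostApi, are hypothetical and not existing CloudStack interfaces), the idea would be something like:

```java
import java.util.List;

// Hypothetical sketch of the proposed fencing order; not real CloudStack code.
class HostFencer {

    interface VmApi {
        List<String> listVmsOnHost(long hostId);
        void forceStop(String vmId);     // power off the domain on the failed host
        boolean isStopped(String vmId);
    }

    interface HostApi {
        void markDown(long hostId);      // only called after fencing succeeds
    }

    private final VmApi vms;
    private final HostApi hosts;

    HostFencer(VmApi vms, HostApi hosts) {
        this.vms = vms;
        this.hosts = hosts;
    }

    /**
     * Stop every VM on the host first; only then mark the host Down, so VM-HA can
     * start replacements elsewhere without risking duplicate VMs.
     */
    void fenceAndMarkDown(long hostId) {
        for (String vmId : vms.listVmsOnHost(hostId)) {
            vms.forceStop(vmId);
            if (!vms.isStopped(vmId)) {
                // If any VM cannot be confirmed stopped, do not mark the host Down:
                // a replacement started elsewhere could end up as a duplicate.
                return;
            }
        }
        hosts.markDown(hostId);
    }
}
```
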

somejfn commented Oct 29, 2018 via email

rohityadavcloud (Member) commented Oct 30, 2018

@csquire I could not reproduce the issue. I guess it depends on how you're simulating the NFS/shared-storage failure and what your iptables rules are. In my case I simply shut down the NFS server instead of using iptables rules and observed the following:

2018-10-30 08:04:04,842 WARN  [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 172.20.0.1 -p /export/testing/primary -m /mnt/05c30f6d-5725-3369-a8d6-b5fb37c7ba8f -h 172.20.1.10 .  Output is: 
2018-10-30 08:04:04,843 WARN  [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) write heartbeat failed: timeout, try: 5 of 5
2018-10-30 08:04:04,843 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) [ignored] interupted between heartbeat retries.
2018-10-30 08:04:04,843 WARN  [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) write heartbeat failed: timeout; stopping cloudstack-agent
2018-10-30 08:04:04,843 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Executing: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 172.20.0.1 -p /export/testing/primary -m /mnt/05c30f6d-5725-3369-a8d6-b5fb37c7ba8f -c 
2018-10-30 08:04:04,845 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Executing while with timeout : 60000
2018-10-30 08:04:05,007 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) (logid:) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms 
2018-10-30 08:04:05,009 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) (logid:) Executing while with timeout : 1800000
2018-10-30 08:04:05,109 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) (logid:) Execution is successful.
2018-10-30 08:04:05,109 DEBUG [kvm.resource.LibvirtConnection] (UgentTask-5:null) (logid:) Looking for libvirtd connection at: qemu:///system
2018-10-30 08:04:05,114 DEBUG [cloud.agent.Agent] (UgentTask-5:null) (logid:) Sending ping: Seq 1-157:  { Cmd , MgmtId: -1, via: 1, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"_hostVmStateReport":{"r-5-VM":{"state":"PowerOn","host":"centos7-kvm1"},"s-1-VM":{"state":"PowerOn","host":"centos7-kvm1"},"v-2-VM":{"state":"PowerOn","host":"centos7-kvm1"},"i-2-4-VM":{"state":"PowerOn","host":"centos7-kvm1"}},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":1,"wait":0}}] }
2018-10-30 08:04:05,166 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null) (logid:) Received response: Seq 1-157:  { Ans: , MgmtId: 2485476061376, via: 1, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":1,"wait":0},"result":true,"wait":0}}] }
2018-10-30 08:04:06,762 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:0aa26d2d) Request:Seq 1-3303108851699548248:  { Cmd , MgmtId: 2485476061376, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.GetVncPortCommand":{"id":4,"name":"i-2-4-VM","wait":0}}] }
2018-10-30 08:04:06,762 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:0aa26d2d) Processing command: com.cloud.agent.api.GetVncPortCommand
2018-10-30 08:04:06,762 DEBUG [kvm.resource.LibvirtConnection] (agentRequest-Handler-5:null) (logid:0aa26d2d) Looking for libvirtd connection at: qemu:///system
2018-10-30 08:04:06,765 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:0aa26d2d) Seq 1-3303108851699548248:  { Ans: , MgmtId: 2485476061376, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.GetVncPortAnswer":{"address":"172.20.1.10","port":5903,"result":true,"wait":0}}] }
2018-10-30 08:04:09,871 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Exit value is 143
2018-10-30 08:04:09,872 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Redirecting to /bin/systemctl stop cloudstack-agent.service
2018-10-30 08:04:09,873 INFO  [cloud.agent.Agent] (AgentShutdownThread:null) (logid:) Stopping the agent: Reason = sig.kill
2018-10-30 08:04:09,874 DEBUG [cloud.agent.Agent] (AgentShutdownThread:null) (logid:) Sending shutdown to management server
2018-10-30 08:04:09,876 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.20.0.1,port=8250,localport=43646] closed on read.  Probably -1 returned: Connection closed with -1 on reading size.
2018-10-30 08:04:09,876 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.20.0.1,port=8250,localport=43646]
2018-10-30 08:04:10,875 DEBUG [kvm.resource.LibvirtConnection] (AgentShutdownThread:null) (logid:) Looking for libvirtd connection at: qemu:///system

I could see kvmheartbeat.sh being tried 5 times, every 60 seconds, and after the final failure the agent is shut down via /bin/systemctl stop cloudstack-agent.service. On the management server side, I saw the host state go from Up to Disconnected:

[Screenshot from 2018-10-30 13-38-20: management server UI showing the host state going from Up to Disconnected]

The only change in behaviour I see is that the KVM host is not rebooted; only the agent gets shut down. The downside of this approach is that when the KVM agent is restarted and the NFS server is brought back up, the KVM host and its VMs are not responsive, and I had to reboot the host manually to start from a clean slate; perhaps the previous behaviour of triggering a host reboot was better. Can you advise whether we should revert that behaviour - @DaanHoogland @PaulAngus @borisstoyanov?
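
For readers following the log above, here is a simplified sketch of the observed retry-then-stop behaviour. It is an illustration only, not the actual KVMHAMonitor code, and the script arguments are reduced to two of the flags seen in the log:

```java
import java.util.concurrent.TimeUnit;

// Simplified illustration of the behaviour seen in the log; not the real KVMHAMonitor code.
class HeartbeatLoopSketch {
    static final int MAX_TRIES = 5;            // matches "try: 5 of 5" in the agent log
    static final long SCRIPT_TIMEOUT_SEC = 60; // matches the 60-second script timeout above

    static void writeHeartbeatOrStopAgent(String mountPoint, String hostIp) throws Exception {
        for (int attempt = 1; attempt <= MAX_TRIES; attempt++) {
            Process p = new ProcessBuilder(
                    "/usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh",
                    "-m", mountPoint, "-h", hostIp)  // real invocations pass more flags (-i, -p, -r, -t)
                    .start();
            if (p.waitFor(SCRIPT_TIMEOUT_SEC, TimeUnit.SECONDS) && p.exitValue() == 0) {
                return; // heartbeat written successfully
            }
            p.destroyForcibly(); // the script hung on the unreachable NFS mount
        }
        // Every attempt timed out: stop the agent so the management server sees the host
        // as Disconnected, instead of hard-resetting the hypervisor.
        new ProcessBuilder("/bin/systemctl", "stop", "cloudstack-agent.service")
                .inheritIO().start();
    }
}
```
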

rohityadavcloud (Member) commented:

Based on the triaging exercise, I've moved this to 4.11.3.0 as further discussion is pending. I've taken the least-risk approach of reverting part of the change in behaviour and submitted #2984.

somejfn commented Oct 30, 2018 via email

rohityadavcloud removed this from the 4.11.3.0 milestone May 27, 2019
rohityadavcloud removed their assignment Jun 26, 2019
DaanHoogland (Contributor) commented:

@rhtyd Closing, as this is two years old this weekend (#2890).
