
KVMHAMonitor thread blocks indefinitely while NFS not available #2890

Closed

csquire opened this issue Oct 8, 2018 · 13 comments

csquire (Contributor) commented Oct 8, 2018

ISSUE TYPE
  • Bug Report
COMPONENT NAME
KVM Agent
CLOUDSTACK VERSION
4.11.2.0-41120rc2
CONFIGURATION
OS / ENVIRONMENT
SUMMARY

Also see comment thread on PR #2722

We installed an RC release that includes PR #2722 on a test system, expecting the host to be marked as Disconnected after using iptables to drop NFS requests, but instead the host gets marked as Down. My investigation shows that the line `storage = conn.storagePoolLookupByUUIDString(uuid);` blocks indefinitely. As a result, kvmheartbeat.sh is never executed, a host investigation is started, the host with blocked NFS is marked as Down, and finally all VMs on that host are rescheduled, resulting in duplicate VMs.

I pulled a thread dump and found the KVMHAMonitor thread will hang here until NFS is unblocked.

java.lang.Thread.State: RUNNABLE
      at com.sun.jna.Native.invokePointer(Native Method)
      at com.sun.jna.Function.invokePointer(Function.java:470)
      at com.sun.jna.Function.invoke(Function.java:404)
      at com.sun.jna.Function.invoke(Function.java:315)
      at com.sun.jna.Library$Handler.invoke(Library.java:212)
      at com.sun.proxy.$Proxy3.virStoragePoolLookupByUUIDString(Unknown Source)
      at org.libvirt.Connect.storagePoolLookupByUUIDString(Unknown Source)
      at com.cloud.hypervisor.kvm.resource.KVMHAMonitor$Monitor.runInContext(KVMHAMonitor.java:95)
      - locked <1afb3370> (a java.util.concurrent.ConcurrentHashMap)
      at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
      at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
      at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
      at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
      at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
      at java.lang.Thread.run(Thread.java:748)

 Locked ownable synchronizers:
      - None
STEPS TO REPRODUCE

EXPECTED RESULTS
The host still runs kvmheartbeat.sh and shows as `Disconnected`
ACTUAL RESULTS
The host heartbeat hangs and the host gets marked as `Down` via host investigation
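
For illustration, here is a minimal sketch of how the blocking lookup could be bounded with a timeout so the heartbeat path cannot hang forever. This is not the change that was actually merged; `TimedPoolLookup` is a hypothetical helper around the existing `org.libvirt.Connect` API:

```java
// Hypothetical sketch only, not the actual CloudStack fix.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.libvirt.Connect;
import org.libvirt.StoragePool;

class TimedPoolLookup {
    private static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor();

    /**
     * Looks up a storage pool but gives up after timeoutSeconds, so the caller can
     * still run kvmheartbeat.sh (or treat the pool as unreachable) instead of
     * blocking indefinitely inside the native libvirt call.
     */
    static StoragePool lookupWithTimeout(Connect conn, String uuid, long timeoutSeconds)
            throws Exception {
        Future<StoragePool> future =
                EXECUTOR.submit(() -> conn.storagePoolLookupByUUIDString(uuid));
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Cancelling does not interrupt the native JNA call; the worker thread may
            // stay blocked until NFS recovers, but the monitor loop can move on.
            future.cancel(true);
            throw new Exception("Lookup of storage pool " + uuid + " timed out after "
                    + timeoutSeconds + "s; treating the pool as unreachable", e);
        }
    }
}
```
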
csquire (Contributor, author) commented Oct 9, 2018

After performing more tests, virtlockd does help prevent duplicate VMs. With Host HA disabled, the blocked host goes to a Down state, but the VMs continue to run on the blocked host and new copies don't start on other hosts because of the lock on the root disk.

But if Host HA is enabled, the blocked host gets rebooted because of the bug above. The VMs on the host get rescheduled on that same host when it comes back up, but the host still should not have been rebooted just because NFS was blocked.

somejfn commented Oct 17, 2018

This morning I confirmed the behavior on 4.9 is different from 4.11. When there's a long-lasting (say 15 minutes) NFS hang, the agent stays Up, and when NFS operations resume everyone's happy. Note we did disable the automatic reboot in the heartbeat script for that to work. This saved us from massive reboots and VM outages previously, when a network maintenance cut all KVM hosts off from NFS for 22 minutes.

somejfn commented Oct 23, 2018

Confirmed we see similar behavior on 4.11.2rc3, and the agent went into the Down state. Agent logs:

2018-10-23 13:14:40,391 INFO [kvm.resource.LibvirtConnection] (agentRequest-Handler-4:null) (logid:f8cd7cf7) No existing libvirtd connection found. Opening a new one
2018-10-23 13:14:40,392 WARN [kvm.resource.LibvirtConnection] (agentRequest-Handler-4:null) (logid:f8cd7cf7) Can not find a connection for Instance i-4-24-VM. Assuming the default connection.
2018-10-23 13:14:40,399 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:f8cd7cf7) Trying to fetch storage pool 4e49054a-463f-306f-9678-b0d9b02af9a1 from libvirt
2018-10-23 13:14:51,496 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:3a0df8e5) Trying to fetch storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c from libvirt
2018-10-23 13:14:51,498 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:3a0df8e5) Asking libvirt to refresh storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c
2018-10-23 13:15:25,027 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:581a1d95) Trying to fetch storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c from libvirt
2018-10-23 13:15:25,029 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:581a1d95) Asking libvirt to refresh storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c
2018-10-23 13:15:25,590 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:581a1d95) Trying to fetch storage pool 3e810986-e702-36ea-a87b-fd48064ecb12 from libvirt
2018-10-23 13:15:25,592 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:581a1d95) Asking libvirt to refresh storage pool 3e810986-e702-36ea-a87b-fd48064ecb12

2018-10-23 13:21:28,804 WARN [kvm.resource.KVMHAChecker] (Script-3:null) (logid:) Interrupting script.
2018-10-23 13:21:28,806 WARN [kvm.resource.KVMHAChecker] (pool-15160-thread-1:null) (logid:c3d5dcaf) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.73.96.232 -p /vol/t500_0_fls3_pool36_root -m /mnt/d05f1c9d-9454-3707-a6c4-781398af198d -h 10.73.96.212 -r -t 60 . Output is:
2018-10-23 13:21:32,826 WARN [kvm.resource.KVMHAChecker] (Script-7:null) (logid:) Interrupting script.
2018-10-23 13:21:32,827 WARN [kvm.resource.KVMHAChecker] (pool-15161-thread-1:null) (logid:c3d5dcaf) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.73.96.232 -p /vol/t500_0_fls3_pool36_root -m /mnt/d05f1c9d-9454-3707-a6c4-781398af198d -h 10.73.96.212 -r -t 60 . Output is:
2018-10-23 13:21:36,846 WARN [kvm.resource.KVMHAChecker] (Script-4:null) (logid:) Interrupting script.
2018-10-23 13:21:36,847 WARN [kvm.resource.KVMHAChecker] (pool-15162-thread-1:null) (logid:4a3cb34f) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.73.96.232 -p /vol/t500_0_fls3_pool36_root -m /mnt/d05f1c9d-9454-3707-a6c4-781398af198d -h 10.73.96.212 -r -t 60 . Output is:
2018-10-23 13:24:44,205 INFO [cloud.agent.Agent] (Agent-Handler-1:null) (logid:5a5a7500) Lost connection to host: 10.73.96.19. Attempting reconnection while we still have 5 commands in progress.
2018-10-23 13:24:44,206 INFO [utils.nio.NioClient] (Agent-Handler-1:null) (logid:5a5a7500) NioClient connection closed
2018-10-23 13:24:44,206 INFO [cloud.agent.Agent] (Agent-Handler-1:null) (logid:5a5a7500) Reconnecting to host:10.73.96.19
2018-10-23 13:24:44,207 INFO [utils.nio.NioClient] (Agent-Handler-1:null) (logid:5a5a7500) Connecting to 10.73.96.19:8250
2018-10-23 13:24:44,207 INFO [utils.nio.Link] (Agent-Handler-1:null) (logid:5a5a7500) Conf file found: /etc/cloudstack/agent/agent.properties

Note that sometimes the agent does successfully go into the Disconnected state, but the Host HA framework might still fire after the kvm.ha.degraded.max.period timer, which is not expected. In any case, we want to avoid massive KVM host resets via IPMI for storage-related problems, because that is more damaging than waiting for primary storage to come back.

csquire (Contributor, author) commented Oct 25, 2018

When you block NFS on a host, eventually all the agentRequest-Handler and UgentTask threads gradually hang as well. The host appears to be marked as Down at about the time the last UgentTask handler thread hangs.

threaddump.txt

somejfn commented Oct 26, 2018

@rhtyd

rohityadavcloud self-assigned this Oct 26, 2018
rohityadavcloud added this to the 4.11.2.0 milestone Oct 27, 2018
borisstoyanov (Contributor) commented Oct 29, 2018

hi @csquire @somejfn, thanks for this issue!

I think it's correct that the host goes into the 'Down' state after losing its grip on the storage, since this basically makes it inoperative. Going into the 'Disconnected' state would only mean the connection between management server and host is compromised.

On the other hand, duplicated VMs are definitely something that needs to be addressed before marking the host as 'Down' when VM-HA is enabled. Just to be sure, can you please confirm you don't see these duplicated VMs on instances without VM-HA enabled? I'd like to narrow this issue down and make sure it's in the VM-HA logic.

somejfn commented Oct 29, 2018 via email

borisstoyanov (Contributor) commented:

I think the leanest way to fence the resource would be, prior to setting the host Down, to iterate over all its VMs and shut them down, and only then proceed to mark the host as 'Down'. Once we're there, there's no issue with VM-HA starting a new instance on a separate host.
I guess this needs further investigation and a fix as described.
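
As a rough sketch of that ordering (all names here, HostFencer, VmApi and HostApi, are hypothetical and not existing CloudStack interfaces), the idea would be something like:

```java
import java.util.List;

// Hypothetical sketch of the proposed fencing order; not real CloudStack code.
class HostFencer {

    interface VmApi {
        List<String> listVmsOnHost(long hostId);
        void forceStop(String vmId);     // power off the domain on the failed host
        boolean isStopped(String vmId);
    }

    interface HostApi {
        void markDown(long hostId);      // only called after fencing succeeds
    }

    private final VmApi vms;
    private final HostApi hosts;

    HostFencer(VmApi vms, HostApi hosts) {
        this.vms = vms;
        this.hosts = hosts;
    }

    /**
     * Stop every VM on the host first; only then mark the host Down, so VM-HA can
     * start replacements elsewhere without risking duplicate VMs.
     */
    void fenceAndMarkDown(long hostId) {
        for (String vmId : vms.listVmsOnHost(hostId)) {
            vms.forceStop(vmId);
            if (!vms.isStopped(vmId)) {
                // If any VM cannot be confirmed stopped, do not mark the host Down:
                // a replacement started elsewhere could end up as a duplicate.
                return;
            }
        }
        hosts.markDown(hostId);
    }
}
```
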

somejfn commented Oct 29, 2018 via email

rohityadavcloud (Member) commented Oct 30, 2018

@csquire I could not reproduce the issue. I guess it depends on how you're simulating the NFS/shared-storage failure and what your iptables rules are. In my case I simply shut down the NFS server instead of using iptables rules and observed the following:

2018-10-30 08:04:04,842 WARN  [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 172.20.0.1 -p /export/testing/primary -m /mnt/05c30f6d-5725-3369-a8d6-b5fb37c7ba8f -h 172.20.1.10 .  Output is: 
2018-10-30 08:04:04,843 WARN  [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) write heartbeat failed: timeout, try: 5 of 5
2018-10-30 08:04:04,843 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) [ignored] interupted between heartbeat retries.
2018-10-30 08:04:04,843 WARN  [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) write heartbeat failed: timeout; stopping cloudstack-agent
2018-10-30 08:04:04,843 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Executing: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 172.20.0.1 -p /export/testing/primary -m /mnt/05c30f6d-5725-3369-a8d6-b5fb37c7ba8f -c 
2018-10-30 08:04:04,845 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Executing while with timeout : 60000
2018-10-30 08:04:05,007 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) (logid:) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms 
2018-10-30 08:04:05,009 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) (logid:) Executing while with timeout : 1800000
2018-10-30 08:04:05,109 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) (logid:) Execution is successful.
2018-10-30 08:04:05,109 DEBUG [kvm.resource.LibvirtConnection] (UgentTask-5:null) (logid:) Looking for libvirtd connection at: qemu:///system
2018-10-30 08:04:05,114 DEBUG [cloud.agent.Agent] (UgentTask-5:null) (logid:) Sending ping: Seq 1-157:  { Cmd , MgmtId: -1, via: 1, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"_hostVmStateReport":{"r-5-VM":{"state":"PowerOn","host":"centos7-kvm1"},"s-1-VM":{"state":"PowerOn","host":"centos7-kvm1"},"v-2-VM":{"state":"PowerOn","host":"centos7-kvm1"},"i-2-4-VM":{"state":"PowerOn","host":"centos7-kvm1"}},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":1,"wait":0}}] }
2018-10-30 08:04:05,166 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null) (logid:) Received response: Seq 1-157:  { Ans: , MgmtId: 2485476061376, via: 1, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":1,"wait":0},"result":true,"wait":0}}] }
2018-10-30 08:04:06,762 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:0aa26d2d) Request:Seq 1-3303108851699548248:  { Cmd , MgmtId: 2485476061376, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.GetVncPortCommand":{"id":4,"name":"i-2-4-VM","wait":0}}] }
2018-10-30 08:04:06,762 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:0aa26d2d) Processing command: com.cloud.agent.api.GetVncPortCommand
2018-10-30 08:04:06,762 DEBUG [kvm.resource.LibvirtConnection] (agentRequest-Handler-5:null) (logid:0aa26d2d) Looking for libvirtd connection at: qemu:///system
2018-10-30 08:04:06,765 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:0aa26d2d) Seq 1-3303108851699548248:  { Ans: , MgmtId: 2485476061376, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.GetVncPortAnswer":{"address":"172.20.1.10","port":5903,"result":true,"wait":0}}] }
2018-10-30 08:04:09,871 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Exit value is 143
2018-10-30 08:04:09,872 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Redirecting to /bin/systemctl stop cloudstack-agent.service
2018-10-30 08:04:09,873 INFO  [cloud.agent.Agent] (AgentShutdownThread:null) (logid:) Stopping the agent: Reason = sig.kill
2018-10-30 08:04:09,874 DEBUG [cloud.agent.Agent] (AgentShutdownThread:null) (logid:) Sending shutdown to management server
2018-10-30 08:04:09,876 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.20.0.1,port=8250,localport=43646] closed on read.  Probably -1 returned: Connection closed with -1 on reading size.
2018-10-30 08:04:09,876 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.20.0.1,port=8250,localport=43646]
2018-10-30 08:04:10,875 DEBUG [kvm.resource.LibvirtConnection] (AgentShutdownThread:null) (logid:) Looking for libvirtd connection at: qemu:///system

I could see kvmheartbeat.sh being tried 5 times, every 60 seconds, and after the final failure the agent is shut down via /bin/systemctl stop cloudstack-agent.service. On the management server side, I saw the host state go from Up to Disconnected:

[Screenshot from 2018-10-30 13-38-20: management server UI showing the host state going from Up to Disconnected]

The only change in behaviour I see is that the KVM host is not rebooted; only the agent gets shut down. The downside of this approach is that when the KVM agent is restarted and the NFS server is brought back up, the KVM host and its VMs are not responsive, and I had to reboot the host manually to start from a clean slate; perhaps the previous behaviour of triggering a host reboot was better. Can you advise whether we should revert that behaviour - @DaanHoogland @PaulAngus @borisstoyanov?
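
For readers following the log above, here is a simplified sketch of the observed retry-then-stop behaviour. It is an illustration only, not the actual KVMHAMonitor code, and the script arguments are reduced to two of the flags seen in the log:

```java
import java.util.concurrent.TimeUnit;

// Simplified illustration of the behaviour seen in the log; not the real KVMHAMonitor code.
class HeartbeatLoopSketch {
    static final int MAX_TRIES = 5;            // matches "try: 5 of 5" in the agent log
    static final long SCRIPT_TIMEOUT_SEC = 60; // matches the 60-second script timeout above

    static void writeHeartbeatOrStopAgent(String mountPoint, String hostIp) throws Exception {
        for (int attempt = 1; attempt <= MAX_TRIES; attempt++) {
            Process p = new ProcessBuilder(
                    "/usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh",
                    "-m", mountPoint, "-h", hostIp)  // real invocations pass more flags (-i, -p, -r, -t)
                    .start();
            if (p.waitFor(SCRIPT_TIMEOUT_SEC, TimeUnit.SECONDS) && p.exitValue() == 0) {
                return; // heartbeat written successfully
            }
            p.destroyForcibly(); // the script hung on the unreachable NFS mount
        }
        // Every attempt timed out: stop the agent so the management server sees the host
        // as Disconnected, instead of hard-resetting the hypervisor.
        new ProcessBuilder("/bin/systemctl", "stop", "cloudstack-agent.service")
                .inheritIO().start();
    }
}
```
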

rohityadavcloud (Member) commented:

Based on the triaging exercise, I've moved this to 4.11.3.0 as further discussion is pending. I've taken the least-risk approach of reverting part of the change in behaviour and submitted #2984.

somejfn commented Oct 30, 2018 via email

rohityadavcloud removed this from the 4.11.3.0 milestone May 27, 2019
rohityadavcloud removed their assignment Jun 26, 2019
DaanHoogland (Contributor) commented:

@rhtyd Closing, as this is two years old this weekend (#2890).
