Fix: API Thread held forever during force deleting across MS #12968
@blueorangutan package |
|
@nvazquez a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## 4.22 #12968 +/- ##
============================================
- Coverage 17.60% 17.60% -0.01%
- Complexity 15677 15678 +1
============================================
Files 5918 5918
Lines 531681 531711 +30
Branches 65005 65008 +3
============================================
- Hits 93623 93622 -1
- Misses 427498 427526 +28
- Partials 10560 10563 +3
|
|
Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 17371 |
|
@blueorangutan package |
|
@nvazquez a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✖️ debian ✔️ suse15. SL-JID 17372 |
|
@blueorangutan test |
|
@nvazquez a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-15815)
|
DaanHoogland
left a comment
clgtm, how can this be tested, @nvazquez?
|
@DaanHoogland this was impacting large multi-management-server environments on forced host deletion. While it may be hard to replicate, no regressions should be observed in multi-management-server environments when force-removing hosts. |
kiranchavala
left a comment
LGTM
Tested manually:
Deploy a multi-management-server CloudStack environment
Add a KVM host such that it is owned by/connected to MS-1
Issue the API call (deleteHost) with the forced option from MS-2
KVM host successfully deleted
2026-04-15 05:04:28,540 DEBUG [c.c.a.ApiServlet] (qtp1390913202-7010:[ctx-b2bd9068]) (logid:acab41d5) ===START=== 10.0.3.251 -- POST
command=deleteHost
response=json
id=1d668143-f03d-4d1f-8155-35d31c656c87
forced=true
sessionkey=GDVEQO5rpCmmxA_VCngA2VlqSnM
2026-04-15 05:04:28,540 DEBUG [c.c.a.ApiServlet] (qtp1390913202-7010:[ctx-b2bd9068]) (logid:acab41d5) Two factor authentication is already verified for the user 2, so skipping
2026-04-15 05:04:28,550 DEBUG [c.c.a.ApiServer] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) CIDRs from which account 'Account [{"accountName":"admin","id":2,"uuid":"630244b9-3739-11f1-9978-1e00e20001cf"}]' is allowed to perform API calls: 0.0.0.0/0,::/0
2026-04-15 05:04:28,553 INFO [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) Account for user id 6302a12a-3739-11f1-9978-1e00e20001cf is Root Admin or Domain Admin, all APIs are allowed.
2026-04-15 05:04:28,553 DEBUG [o.a.c.a.StaticRoleBasedAPIAccessChecker] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) RoleService is enabled. We will use it instead of StaticRoleBasedAPIAccessChecker.
2026-04-15 05:04:28,553 DEBUG [o.a.c.r.ApiRateLimitServiceImpl] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) API rate limiting is disabled. We will not use ApiRateLimitService.
2026-04-15 05:04:28,558 DEBUG [c.c.r.ResourceManagerImpl] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) Propagating resource request event:DeleteHost to agent:2
2026-04-15 05:04:28,558 DEBUG [c.c.c.ClusterManagerImpl] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) 32985382387974 -> 32989140484559.2 [{"com.cloud.agent.api.PropagateResourceEventCommand":{"hostId":2,"event":"DeleteHost","forced":true,"forceDeleteStorage":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:04:28,562 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Worker-2:[ctx-64afa57f]) (logid:f503771c) Cluster PDU 32985382387974 -> 32989140484559. agent: 2, pdu seq: 12473, pdu ack seq: 0, json: [{"com.cloud.agent.api.PropagateResourceEventCommand":{"hostId":2,"event":"DeleteHost","forced":true,"forceDeleteStorage":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:04:28,562 DEBUG [c.c.c.ClusterServiceServletImpl] (Cluster-Worker-2:[ctx-64afa57f]) (logid:f503771c) Executing ClusterServicePdu with service URL: https://10.0.33.60:9090/clusterservice
2026-04-15 05:04:28,564 DEBUG [c.c.c.ClusterServiceServletImpl] (Cluster-Worker-2:[ctx-64afa57f]) (logid:f503771c) POST https://10.0.33.60:9090/clusterservice response :true, responding time: 2 ms
2026-04-15 05:04:28,564 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Worker-2:[ctx-64afa57f]) (logid:f503771c) Cluster PDU 32985382387974 -> 32989140484559 completed. time: 2ms. agent: 2, pdu seq: 12473, pdu ack seq: 0, json: [{"com.cloud.agent.api.PropagateResourceEventCommand":{"hostId":2,"event":"DeleteHost","forced":true,"forceDeleteStorage":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:04:28,624 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (Cluster-Worker-7:[ctx-4790e955]) (logid:fad31311) Dispatch ->2, json: [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":2,"event":"AgentDisconnected","contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:04:28,625 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (Cluster-Worker-7:[ctx-4790e955]) (logid:fad31311) Intercepting command for agent change: agent 2 event: AgentDisconnected
2026-04-15 05:04:28,625 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (Cluster-Worker-7:[ctx-4790e955]) (logid:fad31311) Received agent disconnect event for host 2 (null)
2026-04-15 05:04:28,625 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (Cluster-Worker-7:[ctx-4790e955]) (logid:fad31311) Result is true
2026-04-15 05:04:28,627 DEBUG [c.c.c.ClusterManagerImpl] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) 32985382387974 -> 32989140484559.2 completed. result: [{"com.cloud.agent.api.Answer":{"result":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:04:28,627 DEBUG [c.c.r.ResourceManagerImpl] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) Result for agent change is true
2026-04-15 05:04:28,628 DEBUG [c.c.a.ApiServlet] (qtp1390913202-7010:[ctx-b2bd9068, ctx-1fc766ec]) (logid:acab41d5) ===END=== 10.0.3.251 -- POST
command=deleteHost
response=json
id=1d668143-f03d-4d1f-8155-35d31c656c87
forced=true
sessionkey=GDVEQO5rpCmmxA_VCngA2VlqSnM
Tested the deleteHost API without the forced parameter
Host deleted successfully
2026-04-15 05:15:47,621 DEBUG [c.c.a.ApiServlet] (qtp1390913202-23:[ctx-1bf5b649]) (logid:832f240c) ===START=== 10.0.3.251 -- POST
command=deleteHost
response=json
id=7d807b7c-092c-4994-86e1-3b751bbca11e
sessionkey=GDVEQO5rpCmmxA_VCngA2VlqSnM
2026-04-15 05:15:47,621 DEBUG [c.c.a.ApiServlet] (qtp1390913202-23:[ctx-1bf5b649]) (logid:832f240c) Two factor authentication is already verified for the user 2, so skipping
2026-04-15 05:15:47,628 DEBUG [c.c.a.ApiServer] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) CIDRs from which account 'Account [{"accountName":"admin","id":2,"uuid":"630244b9-3739-11f1-9978-1e00e20001cf"}]' is allowed to perform API calls: 0.0.0.0/0,::/0
2026-04-15 05:15:47,630 INFO [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) Account for user id 6302a12a-3739-11f1-9978-1e00e20001cf is Root Admin or Domain Admin, all APIs are allowed.
2026-04-15 05:15:47,630 DEBUG [o.a.c.a.StaticRoleBasedAPIAccessChecker] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) RoleService is enabled. We will use it instead of StaticRoleBasedAPIAccessChecker.
2026-04-15 05:15:47,630 DEBUG [o.a.c.r.ApiRateLimitServiceImpl] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) API rate limiting is disabled. We will not use ApiRateLimitService.
2026-04-15 05:15:47,633 DEBUG [c.c.r.ResourceManagerImpl] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) Propagating resource request event:DeleteHost to agent:5
2026-04-15 05:15:47,633 DEBUG [c.c.c.ClusterManagerImpl] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) 32985382387974 -> 32989140484559.5 [{"com.cloud.agent.api.PropagateResourceEventCommand":{"hostId":5,"event":"DeleteHost","forced":false,"forceDeleteStorage":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:15:47,633 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Worker-3:[ctx-f561d9a6]) (logid:12c9c152) Cluster PDU 32985382387974 -> 32989140484559. agent: 5, pdu seq: 12535, pdu ack seq: 0, json: [{"com.cloud.agent.api.PropagateResourceEventCommand":{"hostId":5,"event":"DeleteHost","forced":false,"forceDeleteStorage":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:15:47,633 DEBUG [c.c.c.ClusterServiceServletImpl] (Cluster-Worker-3:[ctx-f561d9a6]) (logid:12c9c152) Executing ClusterServicePdu with service URL: https://10.0.33.60:9090/clusterservice
2026-04-15 05:15:47,636 DEBUG [c.c.c.ClusterServiceServletImpl] (Cluster-Worker-3:[ctx-f561d9a6]) (logid:12c9c152) POST https://10.0.33.60:9090/clusterservice response :true, responding time: 2 ms
2026-04-15 05:15:47,636 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Worker-3:[ctx-f561d9a6]) (logid:12c9c152) Cluster PDU 32985382387974 -> 32989140484559 completed. time: 2ms. agent: 5, pdu seq: 12535, pdu ack seq: 0, json: [{"com.cloud.agent.api.PropagateResourceEventCommand":{"hostId":5,"event":"DeleteHost","forced":false,"forceDeleteStorage":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:15:47,685 DEBUG [c.c.c.ClusterManagerImpl] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) 32985382387974 -> 32989140484559.5 completed. result: [{"com.cloud.agent.api.Answer":{"result":true,"contextMap":{},"wait":0,"bypassHostMaintenance":false}}]
2026-04-15 05:15:47,685 DEBUG [c.c.r.ResourceManagerImpl] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) Result for agent change is true
2026-04-15 05:15:47,685 DEBUG [c.c.a.ApiServlet] (qtp1390913202-23:[ctx-1bf5b649, ctx-94b2a3bb]) (logid:832f240c) ===END=== 10.0.3.251 -- POST
command=deleteHost
response=json
id=7d807b7c-092c-4994-86e1-3b751bbca11e
sessionkey=GDVEQO5rpCmmxA_VCngA2VlqSnM
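The two calls logged above differ only in the `forced` parameter. As a minimal sketch of how such a request could be assembled, the Java snippet below builds the query string for `deleteHost` from the parameters visible in the logs (`command`, `response`, `id`, `forced`). The class and method names are hypothetical, and authentication (the `sessionkey` seen in the logs, or API key signing) is deliberately left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only; not part of CloudStack.
class DeleteHostRequest {

    // Builds the deleteHost query string; "forced" is only sent when requested,
    // mirroring the two manual test calls in the logs.
    static String buildQuery(String hostUuid, boolean forced) {
        // TreeMap keeps parameters in a stable, alphabetical order.
        Map<String, String> params = new TreeMap<>();
        params.put("command", "deleteHost");
        params.put("response", "json");
        params.put("id", hostUuid);
        if (forced) {
            params.put("forced", "true");
        }
        return params.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // Host UUIDs taken from the two test runs in the logs.
        System.out.println(buildQuery("1d668143-f03d-4d1f-8155-35d31c656c87", true));
        System.out.println(buildQuery("7d807b7c-092c-4994-86e1-3b751bbca11e", false));
    }
}
```

To reproduce the scenario tested here, the forced request would be sent to a management server that does not own the host (MS-2 in the steps above).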
Description
This PR fixes an indefinite hang in the deleteHost operation in multi-management-server (clustered) environments. In such an environment, a forced deleteHost API call causes the calling management server (MS) to hang indefinitely, eventually exhausting its API threads and rendering the entire environment unresponsive (502 gateway errors).
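The failure mode described above (a request thread waiting on a peer with no upper bound) can be illustrated generically. The sketch below is not the change made in this PR, whose details are not listed here; it only shows, under the assumption that the peer's answer is modeled as a `CompletableFuture`, how a bounded wait releases the calling thread when the peer never replies. All names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not CloudStack code and not the fix in this PR.
class BoundedPeerWait {

    // Waits for a peer's answer, but gives up after timeoutMs instead of blocking forever.
    static String awaitPeerAnswer(CompletableFuture<String> answer, long timeoutMs) {
        try {
            return answer.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The peer did not answer in time: the calling thread is released.
            return "timeout";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed";
        }
    }

    public static void main(String[] args) {
        // A future that is never completed stands in for a peer that never replies.
        CompletableFuture<String> neverAnswered = new CompletableFuture<>();
        System.out.println(awaitPeerAnswer(neverAnswered, 100)); // prints "timeout"
        System.out.println(awaitPeerAnswer(CompletableFuture.completedFuture("ok"), 100)); // prints "ok"
    }
}
```

With an unbounded `answer.get()`, the first call would never return, which is how a pool of API threads can be drained one request at a time.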
Fixed by adding:
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
How did you try to break this feature and the system with this change?