Backup virtual router of a redundant VPC stays in Starting state after restart with cleanup #8055

Closed
soreana opened this issue Oct 9, 2023 · 21 comments · Fixed by #8385

@soreana
Member

soreana commented Oct 9, 2023

ISSUE TYPE
  • Bug Report
COMPONENT NAME
VR, VPC
CLOUDSTACK VERSION
4.18.1
CONFIGURATION

Advanced zone with isolated networks

SUMMARY

One of our CloudStack environments was upgraded from 4.17.2 to 4.18.1 last week. We then restarted a big VPC with 11 tiers and ACLs that have 8 rules each.
After restarting that VPC, we noticed the following behavior:

  1. The backup VR was stopped and destroyed (expected).
  2. A new VR was created and put in the Starting state (expected).
  3. The new VR stayed in the Starting state for approximately 30 minutes, then turned into the Stopped state.
  4. One hour later, the async job poll returned the error "Unable to serialize: Job is cancelled as it has been blocking others for too long".
  5. The new VR remained in the Stopped state.

We checked the following; none of them turned out to be the cause:

  • The pod, host and cluster have enough capacity to start the VR.
  • The VR uses the correct system VM template.
  • We removed all of our custom features from CloudStack and ran the exact community version, but that didn't help.
STEPS TO REPRODUCE
1. Create a VPC with redundant VRs and 11 tiers in it.
2. Create ACLs with 10 rules and use them for all the tiers, so that there is a large number of firewall rules in each VPC tier.
3. Attach VMs to each tier and make sure the VMs are started, so all VPC tiers turn into the "Implemented" state.
4. Try to restart the VPC with cleanup.
EXPECTED RESULTS
Both VRs are reprovisioned in a reasonable amount of time (several minutes).
ACTUAL RESULTS
One of the VRs is recreated but never starts.
ADDITIONAL INFORMATION

We noticed that no attempt was ever made to provision the VR's libvirt domain on the hypervisor.
It looks like the management server holds this job until it gets cancelled by a timeout.

Watching the management server log, grepping for the VR instance name, shows the following behavior:

tail -f /var/log/cloudstack/management/management-server.log | grep r-27209-VM
2023-10-09 12:57:50,398 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Found 0 firewall Egress rule(s) to apply as a part of domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start.
2023-10-09 12:57:50,399 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Found 0 Ipv6 firewall rule(s) to apply as a part of domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start.
2023-10-09 12:57:50,408 DEBUG [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Found 3 network ACLs to apply as a part of VPC VR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start for guest network id=4140
2023-10-09 12:57:50,455 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Creating  monitoring services on VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start...

# 1.5 minutes later

2023-10-09 12:59:22,010 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Reapplying dhcp entries as a part of domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start...
2023-10-09 12:59:22,023 DEBUG [c.c.n.r.CommandSetupHelper] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Creating dhcp entry for vm VM instance {"id":27189,"instanceName":"i-877-27189-VM","type":"User","uuid":"f34641cc-1885-430e-9eef-0f2560527f69"} on domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"}.
2023-10-09 12:59:22,037 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Reapplying vm data (userData and metaData) entries as a part of domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start...
2023-10-09 12:59:22,048 DEBUG [c.c.n.r.CommandSetupHelper] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Creating user data entry for vm VM instance {"id":27189,"instanceName":"i-877-27189-VM","type":"User","uuid":"f34641cc-1885-430e-9eef-0f2560527f69"} on domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"}
2023-10-09 12:59:22,092 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Found 0 firewall Egress rule(s) to apply as a part of domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start.
2023-10-09 12:59:22,092 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Found 0 Ipv6 firewall rule(s) to apply as a part of domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start.
2023-10-09 12:59:22,100 DEBUG [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Found 3 network ACLs to apply as a part of VPC VR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start for guest network id=4141
2023-10-09 12:59:22,146 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Creating  monitoring services on VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start...

# 1.5 minutes later

2023-10-09 13:00:56,723 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Reapplying dhcp entries as a part of domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start...
2023-10-09 13:00:56,734 DEBUG [c.c.n.r.CommandSetupHelper] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Creating dhcp entry for vm VM instance {"id":27190,"instanceName":"i-877-27190-VM","type":"User","uuid":"4aa04dae-091d-4f78-bc32-af5da5a1ca2d"} on domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"}.
2023-10-09 13:00:56,747 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Reapplying vm data (userData and metaData) entries as a part of domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start...
2023-10-09 13:00:56,758 DEBUG [c.c.n.r.CommandSetupHelper] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Creating user data entry for vm VM instance {"id":27190,"instanceName":"i-877-27190-VM","type":"User","uuid":"4aa04dae-091d-4f78-bc32-af5da5a1ca2d"} on domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"}
2023-10-09 13:00:56,797 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Found 0 firewall Egress rule(s) to apply as a part of domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start.
2023-10-09 13:00:56,798 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Found 0 Ipv6 firewall rule(s) to apply as a part of domR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start.
2023-10-09 13:00:56,805 DEBUG [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Found 3 network ACLs to apply as a part of VPC VR VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start for guest network id=4142
2023-10-09 13:00:56,851 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-6:ctx-ba2a3ea5 job-2776242/job-2776251 ctx-b405daf9) (logid:46201edd) Creating  monitoring services on VM instance {"id":27209,"instanceName":"r-27209-VM","type":"DomainRouter","uuid":"b78b470a-2a88-4705-9909-26ba84277fca"} start...

... and it continues iterating over the networks, one network every 1.5 minutes (with 11 tiers, that alone is roughly 17 minutes). The router still hasn't started on the node.
@soreana
Member Author

soreana commented Oct 9, 2023

Any suggestion appreciated.

@weizhouapache
Member

@soreana
What hypervisor do you use?
Can you check whether all management servers and hypervisors are upgraded to 4.18.1.0?

Can you SSH into the backup router (from the KVM host) and share /var/log/cloud.log?
I suspect you are facing the issue fixed by #7953.

@phsm
Contributor

phsm commented Oct 9, 2023

@weizhouapache
Thanks for the prompt reply.

Yes, both management server and hypervisors are updated to 4.18.1.0.

However, this doesn't seem to be the bug you mentioned.
According to the CloudStack management server, the job for creating a new virtual router is created, but the hypervisor never receives the command to start the new router VM.

Here are the last messages from the management server regarding the start of the new VM:

2023-10-09 10:49:49,488 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4140
2023-10-09 10:49:49,510 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4141
2023-10-09 10:49:49,531 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4142
2023-10-09 10:49:49,552 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4143
2023-10-09 10:49:49,574 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4144
2023-10-09 10:49:49,596 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4145
2023-10-09 10:49:49,616 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4146
2023-10-09 10:49:49,637 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4147
2023-10-09 10:49:49,678 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4148
2023-10-09 10:49:49,699 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4149
2023-10-09 10:49:49,721 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4150
2023-10-09 10:49:50,028 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4140
2023-10-09 10:49:50,060 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4140
2023-10-09 10:49:50,087 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4140
2023-10-09 10:49:50,140 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4141
2023-10-09 10:49:50,172 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4141
2023-10-09 10:49:50,200 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4141
2023-10-09 10:49:50,253 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4142
2023-10-09 10:49:50,285 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4142
2023-10-09 10:49:50,313 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4142
2023-10-09 10:49:50,363 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4143
2023-10-09 10:49:50,394 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4143
2023-10-09 10:49:50,421 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4143
2023-10-09 10:49:50,472 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4144
2023-10-09 10:49:50,503 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4144
2023-10-09 10:49:50,533 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4144
2023-10-09 10:49:50,581 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4145
2023-10-09 10:49:50,612 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4145
2023-10-09 10:49:50,641 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4145
2023-10-09 10:49:50,692 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4146
2023-10-09 10:49:50,721 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4146
2023-10-09 10:49:50,750 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4146
2023-10-09 10:49:50,808 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4147
2023-10-09 10:49:50,842 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4147
2023-10-09 10:49:50,874 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4147
2023-10-09 10:49:50,929 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4148
2023-10-09 10:49:50,960 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4148
2023-10-09 10:49:50,989 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4148
2023-10-09 10:49:51,041 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4149
2023-10-09 10:49:51,075 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4149
2023-10-09 10:49:51,102 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4149
2023-10-09 10:49:51,151 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4150
2023-10-09 10:49:51,182 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4150
2023-10-09 10:49:51,211 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4150
2023-10-09 10:49:51,242 DEBUG [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 static routes to apply as a part of vpc route VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start
2023-10-09 10:49:51,250 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating  monitoring services on VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 10:57:21,381 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Resending ipAssoc, port forwarding, load balancing rules as a part of Virtual router start
2023-10-09 10:57:21,389 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 firewall Egress rule(s) to apply as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start.
2023-10-09 10:57:21,389 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 Ipv6 firewall rule(s) to apply as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start.
2023-10-09 10:57:21,397 DEBUG [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 3 network ACLs to apply as a part of VPC VR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start for guest network id=4140
2023-10-09 10:57:21,415 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4140
2023-10-09 10:57:21,441 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating  monitoring services on VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:04:49,515 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Reapplying dhcp entries as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:04:49,525 DEBUG [c.c.n.r.CommandSetupHelper] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating dhcp entry for vm VM instance {"id":27189,"instanceName":"i-877-27189-VM","type":"User","uuid":"f34641cc-1885-430e-9eef-0f2560527f69"} on domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"}.
2023-10-09 11:04:49,539 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Reapplying vm data (userData and metaData) entries as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:04:49,546 DEBUG [c.c.n.r.CommandSetupHelper] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating user data entry for vm VM instance {"id":27189,"instanceName":"i-877-27189-VM","type":"User","uuid":"f34641cc-1885-430e-9eef-0f2560527f69"} on domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"}
2023-10-09 11:04:49,581 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Resending ipAssoc, port forwarding, load balancing rules as a part of Virtual router start
2023-10-09 11:04:49,590 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 firewall Egress rule(s) to apply as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start.
2023-10-09 11:04:49,590 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 Ipv6 firewall rule(s) to apply as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start.
2023-10-09 11:04:49,600 DEBUG [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 3 network ACLs to apply as a part of VPC VR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start for guest network id=4141
2023-10-09 11:04:49,625 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4141
2023-10-09 11:04:49,654 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating  monitoring services on VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:12:18,234 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Reapplying dhcp entries as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:12:18,245 DEBUG [c.c.n.r.CommandSetupHelper] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating dhcp entry for vm VM instance {"id":27190,"instanceName":"i-877-27190-VM","type":"User","uuid":"4aa04dae-091d-4f78-bc32-af5da5a1ca2d"} on domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"}.
2023-10-09 11:12:18,262 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Reapplying vm data (userData and metaData) entries as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...
2023-10-09 11:12:18,270 DEBUG [c.c.n.r.CommandSetupHelper] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating user data entry for vm VM instance {"id":27190,"instanceName":"i-877-27190-VM","type":"User","uuid":"4aa04dae-091d-4f78-bc32-af5da5a1ca2d"} on domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"}
2023-10-09 11:12:18,307 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Resending ipAssoc, port forwarding, load balancing rules as a part of Virtual router start
2023-10-09 11:12:18,315 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 firewall Egress rule(s) to apply as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start.
2023-10-09 11:12:18,315 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 0 Ipv6 firewall rule(s) to apply as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start.
2023-10-09 11:12:18,325 DEBUG [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Found 3 network ACLs to apply as a part of VPC VR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start for guest network id=4142
2023-10-09 11:12:18,345 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Service SecurityGroup is not supported in the network id=4142
2023-10-09 11:12:18,372 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating  monitoring services on VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...

After this line, there is no more activity on this job (the new router VM appears as "Starting").

The hypervisor chosen for the router VM has no mention of the VM's name in its agent.log:

# grep r-27201-VM /var/log/cloudstack/agent/agent.log
#

It looks like the management server doesn't even send the VM creation command to the hypervisor.

An hour later, the job ends with the following message:

(VPC restart test ) Unable to serialize: Job is cancelled as it has been blocking others for too long

The running router VM (the one in the Running state while CloudStack is trying to create the second router VM) indeed shows the lines from the bug you mentioned:

root@r-27188-VM:~# cat /var/log/cloud.log  | grep md5
Mon 09 Oct 2023 08:39:31 AM UTC Scripts checksum detected: oldmd5=da16d6a2533fd19960304152dd78e02c newmd5=da16d6a2533fd19960304152dd78e02c
Mon 09 Oct 2023 08:39:39 AM UTC Scripts checksum detected: oldmd5=da16d6a2533fd19960304152dd78e02c newmd5=28d602db87f545cced0664df0ed9d437

However, the timestamps in the log correspond to the VPC creation time (I created this test VPC on 09 Oct 2023 at 08:39:10).

I'm still attempting to figure out the cause of the bug by removing VPC elements and trying the restart again. So far I suspect that the large number of tiers (the VPC has 11) is the cause, but this is yet to be confirmed.

I will post an update here with the reproduction scenario once I have it.

@weizhouapache
Member

@soreana
Do other VPCs have the same issue?

@soreana
Member Author

soreana commented Oct 9, 2023

@soreana
Do other VPCs have the same issue?

@weizhouapache When a VPC has a lot of tiers (around 10 or 11), the issue happens.

@weizhouapache
Member

@soreana
Do other VPCs have the same issue?

@weizhouapache When a VPC has a lot of tiers (around 10 or 11), the issue happens.

It is strange that the primary router works but the backup does not.
There is no difference between them on the management server. (The primary/backup role is determined by keepalived running inside the VRs.)

Can you try the steps below?

  • Restart the cloudstack-agent on the KVM host where the backup router is supposed to be running.
  • The backup router should then be in the Stopped state; if so, destroy it.
  • Restart the network without cleanup. The primary router will not be touched; ACS will try to create a new backup router.

@soreana
Member Author

soreana commented Oct 10, 2023

@weizhouapache Thanks for the suggestions; unfortunately, they didn't help.

I've updated the issue description with more information. Any help appreciated 🙏

@weizhouapache
Member

It is strange that the primary router works but the backup does not. There is no difference between them on the management server. (The primary/backup role is determined by keepalived running inside the VRs.)

The primary router keeps working as it is still running on 4.17. The restart with cleanup first tries to recreate the backup router, which is never created.

When you delete both the primary and backup routers and then restart with cleanup, no router is created.

@kriegsmanj
Oh, I thought that the primary router had been recreated successfully ...
This seems to be a critical issue.

As @soreana said, you created another VPC with the same configuration and it works, right?

@soreana
Member Author

soreana commented Oct 11, 2023

@weizhouapache

We did several tests:

  1. We created another VPC with the same configuration on the same platform: it also fails to recreate the VR when restarting with cleanup.

  2. We created a VPC with the same configuration on a different CloudStack installation that has a lot less load on it: we could not reproduce the issue there.
    However, we noticed that it takes approximately 20 minutes before the VR instance even appears in the hypervisor's "virsh list" output. Once it appeared on the hypervisor, it reported as online in CloudStack within a couple of minutes.

  3. We tried reproducing this behavior on 4.17: it seems that 4.17 is not affected. The same VPC configuration fully restarts in 9 minutes.

Do you have any suggestions?

@weizhouapache
Member

@soreana @kriegsmanj

I created a VPC with 12 tiers (each tier has a VM); restart with cleanup worked well.
There is no such long interval between these two log lines:

2023-10-09 11:04:49,654 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Creating  monitoring services on VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...

2023-10-09 11:12:18,234 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-29:ctx-50e781f7 job-2775939/job-2775942 ctx-4b8d8442) (logid:32899a45) Reapplying dhcp entries as a part of domR VM instance {"id":27201,"instanceName":"r-27201-VM","type":"DomainRouter","uuid":"40d534e2-c6cd-43a6-8d1c-4c2f5bb6e377"} start...

I did not see any code changes between 4.17 and 4.18 that might cause the issue.

@weizhouapache
Member

@soreana @kriegsmanj
Any update on this issue?

@DaanHoogland added this to the unplanned milestone Oct 18, 2023
@phsm
Contributor

phsm commented Oct 18, 2023

We have tried to reproduce the bug multiple times under different circumstances.
Unfortunately, there is no reliable reproduction scenario so far. Here are our findings:

  • On a big CloudStack platform that has been running for 10 years: we can reproduce the issue. It takes 20+ minutes for each router VM instance to even appear on the selected hypervisor.
    If a VPC has many tiers, the time for the router VM to appear becomes too long. In that case the hypervisor fencing mechanism reports to the management server that the VM is missing, and it is turned into the "Stopped" state.
  • On a small but old CloudStack platform: we can also reproduce the issue; it takes 20+ minutes for each router VM to start.
  • On smaller or freshly installed platforms, the issue doesn't exist.

We continue our investigation.

@soreana
Member Author

soreana commented Nov 3, 2023

@weizhouapache We have been performing more tests and now suspect that ACLs with multiple rules could be the cause of this delay in deploying routers. A VPC with 6 tiers and 7 custom ACL rules took more than 30 minutes to create, while the same VPC with the default ACL rule took about 15 minutes (50% quicker).

Can you please explain the process of creating ACL rules within a VPC and where it takes so long to apply? Also, how could we speed this process up, in order to verify whether it helps?

Thanks in advance!

@weizhouapache
Member

@soreana @kriegsmanj

Can you search for "vr_cfg.sh" in the management-server.log and agent.log?

@weizhouapache
Member

@soreana @kriegsmanj

Can you search for "vr_cfg.sh" in the management-server.log and agent.log?
Is your issue the same as #8201?

@soreana
Member Author

soreana commented Nov 10, 2023

@weizhouapache Sorry about the delay, it was a busy day.

I searched for vr_cfg.sh in the management server and agent logs. There was nothing in the management server logs, and only three agents have vr_cfg.sh in their logs.

node1

2023-11-08 13:49:12,258 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-10c9e64e-6ad9-4d9b-8d24-fe1496b59146.cfg
2023-11-08 13:49:15,310 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:15,488 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-e7dfd01c-8c30-43be-bb1d-38229c8c1edf.cfg
2023-11-08 13:49:18,352 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:18,524 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-5d36d38c-a1ce-49af-9e01-15d4f19054fa.cfg
2023-11-08 13:49:21,423 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:21,601 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-67f79b72-a3ed-4d62-a6e3-919a7f3dfd92.cfg
2023-11-08 13:49:24,574 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:24,760 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-a3318f66-9666-41e3-91b9-3ede5ab05962.cfg
2023-11-08 13:49:27,821 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:28,005 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-b1fd4fbd-c7db-4970-b17a-9eb374543fa9.cfg
2023-11-08 13:49:31,281 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:31,450 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-0fcabd27-eeee-4061-aeac-663daf9775f0.cfg
2023-11-08 13:49:34,570 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh
2023-11-08 13:49:34,776 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.1.144 -c /var/cache/cloud/VR-216b0865-9cd1-45d3-a313-650e7d96d299.cfg
2023-11-08 13:49:38,086 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing script in VR: vr_cfg.sh

node2:

2023-11-08 14:25:55,894 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-b5d6ddb5-b5e8-4366-b8d3-ab4772c22b66.cfg
2023-11-08 14:25:58,480 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:25:58,670 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-fc6ccf8d-1d74-4766-9a96-46308e1eada7.cfg
2023-11-08 14:26:01,337 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:26:01,507 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-3acdb46e-5d06-4744-b324-8c18bc407a74.cfg
2023-11-08 14:26:04,199 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:26:04,373 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-75c0d306-2a39-44a5-95d7-bcd207433203.cfg
2023-11-08 14:26:07,147 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:26:07,315 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-09efdd5f-60fe-4ac1-b588-43ccb8ac8a98.cfg
2023-11-08 14:26:10,067 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:26:10,233 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-927f2fe4-389a-4275-91e3-17ab8c569861.cfg
2023-11-08 14:26:13,000 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:26:13,170 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-32c655e4-44f8-4e90-9a27-218b07f76c17.cfg
2023-11-08 14:26:15,955 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh
2023-11-08 14:26:16,126 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.248 -c /var/cache/cloud/VR-6d6d1c86-235e-4c5c-a0d1-6cd1d3f54852.cfg
2023-11-08 14:26:18,982 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh

node3:

2023-11-08 11:02:13,577 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.155 -c /var/cache/cloud/VR-eaaf92a8-5a49-4c58-bebf-ffae9cefa301.cfg
2023-11-08 11:02:22,879 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:null) Executing script in VR: vr_cfg.sh
2023-11-08 11:02:30,521 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.3.155 -c /var/cache/cloud/VR-d3be0b78-a04d-4437-89f2-40a83650edd0.cfg
2023-11-08 11:02:38,899 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:null) Executing script in VR: vr_cfg.sh

The IPs are as follows:
169.254.1.144 is the primary router IP of VPC-01
169.254.3.248 is the backup router IP of VPC-01
169.254.3.155 is the router IP of an isolated network

@phsm
Contributor

phsm commented Dec 19, 2023

@weizhouapache

An update on this issue:
This bug only occurs with VPCs on large platforms "with history": platforms that have a lot of both live and removed VMs, NICs, etc.

The mechanics of the process:

  1. The VPC restart is initiated.
  2. The first VR is destroyed.
  3. The new VR is created and remains in the "Starting" state for 30+ minutes.
  4. The hypervisor node fencing mechanism detects that the VM instance doesn't exist in libvirt.
  5. The node notifies the management server about it.
  6. The management server abandons the start and puts the VR into the Stopped state.

With trusty s_logger.debug() calls I was able to narrow the VR start procedure down to a specific method that introduces such a big delay: getRouterHealthChecksConfig()

More specifically, it's this call inside the method that introduces the delay: userVmJoinDao.search(scvm, null)
It is executed in a loop for each virtual router NIC. Each call to search() is expensive, so getRouterHealthChecksConfig() takes 10-20 seconds to complete.

But there is more: getRouterHealthChecksConfig() is executed for each VPC tier as part of the createMonitorServiceCommand() call during the VR startup process.

The createMonitorServiceCommand() call is part of the finalizeMonitorService() method.

And the finalizeMonitorService() method is executed twice here: 1, 2

So eventually getRouterHealthChecksConfig() is executed more than a hundred times during the VR startup, and each time it adds its 10-20 seconds to the process.
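
To put the magnitude in perspective, here is a tiny back-of-the-envelope model in Java; the 10-20 second timing and the execution count come from the observations above, while the class and variable names are illustrative stand-ins, not CloudStack code:

// Rough cost of recomputing getRouterHealthChecksConfig() on every iteration.
// The per-call time and execution count are taken from the analysis above;
// everything else here is a hypothetical stand-in.
public class VrStartCostModel {
    public static void main(String[] args) {
        double secondsPerCall = 15.0; // observed: 10-20 s per getRouterHealthChecksConfig()
        int executions = 100;         // observed: executed "more than a hundred times" per start

        double minutes = executions * secondsPerCall / 60.0;
        // ~25 minutes spent rebuilding an identical per-router map, which is on
        // the order of the 61-minute VPC restart measured later in this thread.
        System.out.printf("~%.0f minutes rebuilding the same config%n", minutes);
    }
}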

What do I propose
It seems that getRouterHealthChecksConfig() returns the same result for a given VR object. That means it can be executed once at the beginning of VR initialisation instead of deep inside the foreach loops; see the sketch below.
But my Java knowledge unfortunately isn't good enough to actually implement it.
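
For illustration, the shape of that change would be roughly the following sketch; the signatures are simplified, hypothetical stand-ins for the methods named above, not the actual CloudStack code:

import java.util.Map;

// Hoisting pattern: compute the expensive per-router config once at the top of
// the start flow and pass it down, instead of rebuilding it for every tier.
abstract class HoistedHealthChecksSketch {

    // Expensive: one DB search per router NIC (10-20 s in total, as observed).
    abstract Map<String, String> getRouterHealthChecksConfig(long routerId);

    void finalizeCommandsOnStart(long routerId, long[] guestNetworkIds) {
        // Hoisted: one expensive computation per router start...
        Map<String, String> healthCheckConfig = getRouterHealthChecksConfig(routerId);

        for (long networkId : guestNetworkIds) {
            // ...reused for every tier instead of being recomputed in the loop.
            finalizeMonitorService(routerId, networkId, healthCheckConfig);
        }
    }

    void finalizeMonitorService(long routerId, long networkId,
                                Map<String, String> healthCheckConfig) {
        // Would build the SetMonitorServiceCommand from the precomputed map.
    }
}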

@weizhouapache
Member

@phsm
Great, thanks for the update.

Maybe @soreana can pick this up?

@phsm
Contributor

phsm commented Dec 20, 2023

@weizhouapache
Hi Wei,

I've actually managed to implement the patch myself.

The test looks promising:
Without the patch, the 10-tier VPC took 61 minutes to fully restart.
With the patch, the same VPC took 9 minutes to fully restart.

Can you take a look at it? If it looks good to you, I can make a PR with it.

  • I changed getRouterHealthChecksConfig() to protected, as VpcVirtualNetworkApplianceManagerImpl is going to call it.
  • I moved getRouterHealthChecksConfig() up to finalizeCommandsOnStart() in the execution stack; the health check config is now passed down the stack as a function argument.
  • I adjusted the other methods that use getRouterHealthChecksConfig().
diff --git a/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java b/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java
index f7bfb1c4af..3d236c0a13 100644
--- a/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java
+++ b/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java
@@ -1623,7 +1623,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
     }

     private SetMonitorServiceCommand createMonitorServiceCommand(DomainRouterVO router, List<MonitorServiceTO> services,
-                                                                 boolean reconfigure, boolean deleteFromProcessedCache) {
+                                                                 boolean reconfigure, boolean deleteFromProcessedCache, Map<String, String> routerHealthCheckConfig) {
         final SetMonitorServiceCommand command = new SetMonitorServiceCommand(services);
         command.setAccessDetail(NetworkElementCommand.ROUTER_IP, _routerControlHelper.getRouterControlIp(router.getId()));
         command.setAccessDetail(NetworkElementCommand.ROUTER_NAME, router.getInstanceName());
@@ -1641,7 +1641,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
         }

         command.setAccessDetail(SetMonitorServiceCommand.ROUTER_HEALTH_CHECKS_EXCLUDED, excludedTests);
-        command.setHealthChecksConfig(getRouterHealthChecksConfig(router));
+        command.setHealthChecksConfig(routerHealthCheckConfig);
         command.setReconfigureAfterUpdate(reconfigure);
         command.setDeleteFromProcessedCache(deleteFromProcessedCache); // As part of updating
         return command;
@@ -1666,7 +1666,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
         s_logger.info("Updating data for router health checks for router " + router.getUuid());
         Answer origAnswer = null;
         try {
-            SetMonitorServiceCommand command = createMonitorServiceCommand(router, null, true, true);
+            SetMonitorServiceCommand command = createMonitorServiceCommand(router, null, true, true, getRouterHealthChecksConfig(router));
             origAnswer = _agentMgr.easySend(router.getHostId(), command);
         } catch (final Exception e) {
             s_logger.error("Error while sending update data for health check to router: " + router.getInstanceName(), e);
@@ -1891,7 +1891,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
                 .append(generateKeyValuePairOrEmptyString("server.maxqueue", serverMaxqueue));
     }

-    private Map<String, String> getRouterHealthChecksConfig(final DomainRouterVO router) {
+    protected Map<String, String> getRouterHealthChecksConfig(final DomainRouterVO router) {
         Map<String, String> data = new HashMap<>();
         List<DomainRouterJoinVO> routerJoinVOs = domainRouterJoinDao.searchByIds(router.getId());
         StringBuilder vmsData = new StringBuilder();
@@ -2464,7 +2464,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
             if (reprogramGuestNtwks) {
                 finalizeIpAssocForNetwork(cmds, router, provider, guestNetworkId, null);
                 finalizeNetworkRulesForNetwork(cmds, router, provider, guestNetworkId);
-                finalizeMonitorService(cmds, profile, router, provider, guestNetworkId, true);
+                finalizeMonitorService(cmds, profile, router, provider, guestNetworkId, true, getRouterHealthChecksConfig(router));
             }

             finalizeUserDataAndDhcpOnStart(cmds, router, provider, guestNetworkId);
@@ -2478,7 +2478,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
     }

     protected void finalizeMonitorService(final Commands cmds, final VirtualMachineProfile profile, final DomainRouterVO router, final Provider provider,
-                                          final long networkId, boolean onStart) {
+                                          final long networkId, boolean onStart, Map<String, String> routerHealthCheckConfig) {
         final NetworkOffering offering = _networkOfferingDao.findById(_networkDao.findById(networkId).getNetworkOfferingId());
         if (offering.isRedundantRouter()) {
             // service monitoring is currently not added in RVR
@@ -2528,7 +2528,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
         }

         // As part of aggregate command we don't need to reconfigure if onStart and persist in processed cache. Subsequent updates are not needed.
-        SetMonitorServiceCommand command = createMonitorServiceCommand(router, servicesTO, !onStart, false);
+        SetMonitorServiceCommand command = createMonitorServiceCommand(router, servicesTO, !onStart, false, routerHealthCheckConfig);
         command.setAccessDetail(NetworkElementCommand.ROUTER_GUEST_IP, _routerControlHelper.getRouterIpInNetwork(networkId, router.getId()));
         if (!isMonitoringServicesEnabled) {
             command.setAccessDetail(SetMonitorServiceCommand.ROUTER_MONITORING_ENABLED, isMonitoringServicesEnabled.toString());
diff --git a/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java b/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java
index 18801eb01f..27958126e7 100644
--- a/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java
+++ b/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java
@@ -482,8 +482,9 @@ public class VpcVirtualNetworkApplianceManagerImpl extends VirtualNetworkApplian
                 throw new CloudRuntimeException("Cannot find related provider of virtual router provider: " + vrProvider.getType().toString());
             }

+            Map<String, String> routerHealthCheckConfig = getRouterHealthChecksConfig(domainRouterVO);
             if (reprogramGuestNtwks && publicNics.size() > 0) {
-                finalizeMonitorService(cmds, profile, domainRouterVO, provider, publicNics.get(0).second().getId(), true);
+                finalizeMonitorService(cmds, profile, domainRouterVO, provider, publicNics.get(0).second().getId(), true, routerHealthCheckConfig);
             }

             for (final Pair<Nic, Network> nicNtwk : guestNics) {
@@ -495,7 +496,7 @@ public class VpcVirtualNetworkApplianceManagerImpl extends VirtualNetworkApplian
                 if (reprogramGuestNtwks) {
                     finalizeIpAssocForNetwork(cmds, domainRouterVO, provider, guestNetworkId, vlanMacAddress);
                     finalizeNetworkRulesForNetwork(cmds, domainRouterVO, provider, guestNetworkId);
-                    finalizeMonitorService(cmds, profile, domainRouterVO, provider, guestNetworkId, true);
+                    finalizeMonitorService(cmds, profile, domainRouterVO, provider, guestNetworkId, true, routerHealthCheckConfig);
                 }

                 finalizeUserDataAndDhcpOnStart(cmds, domainRouterVO, provider, guestNetworkId);
@@ -554,7 +555,7 @@ public class VpcVirtualNetworkApplianceManagerImpl extends VirtualNetworkApplian
             finalizeNetworkRulesForNetwork(cmds, router, provider, networkId);
         }

-        finalizeMonitorService(cmds, getVirtualMachineProfile(router), router, provider, networkId, false);
+        finalizeMonitorService(cmds, getVirtualMachineProfile(router), router, provider, networkId, false, getRouterHealthChecksConfig(router));

         return _nwHelper.sendCommandsToRouter(router, cmds);
     }

@weizhouapache
Member

@weizhouapache Hi Wei,

I've actually managed to implement the patch myself.

The test looks promising: without the patch, the 10-tier VPC took 61 minutes to fully restart; with the patch, the same VPC took 9 minutes.

Can you take a look at it? If it looks good to you, I can make a PR with it.

  • I changed getRouterHealthChecksConfig() to protected, as VpcVirtualNetworkApplianceManagerImpl is going to call it.
  • I moved getRouterHealthChecksConfig() up to finalizeCommandsOnStart() in the execution stack; the health check config is now passed down the stack as a function argument.
  • I adjusted the other methods that use getRouterHealthChecksConfig().
index 18801eb01f..27958126e7 100644
--- a/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java
+++ b/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java
@@ -482,8 +482,9 @@ public class VpcVirtualNetworkApplianceManagerImpl extends VirtualNetworkApplian
                 throw new CloudRuntimeException("Cannot find related provider of virtual router provider: " + vrProvider.getType().toString());
             }

+            Map<String, String> routerHealthCheckConfig = getRouterHealthChecksConfig(domainRouterVO);
             if (reprogramGuestNtwks && publicNics.size() > 0) {
-                finalizeMonitorService(cmds, profile, domainRouterVO, provider, publicNics.get(0).second().getId(), true);
+                finalizeMonitorService(cmds, profile, domainRouterVO, provider, publicNics.get(0).second().getId(), true, routerHealthCheckConfig);
             }

             for (final Pair<Nic, Network> nicNtwk : guestNics) {
@@ -495,7 +496,7 @@ public class VpcVirtualNetworkApplianceManagerImpl extends VirtualNetworkApplian
                 if (reprogramGuestNtwks) {
                     finalizeIpAssocForNetwork(cmds, domainRouterVO, provider, guestNetworkId, vlanMacAddress);
                     finalizeNetworkRulesForNetwork(cmds, domainRouterVO, provider, guestNetworkId);
-                    finalizeMonitorService(cmds, profile, domainRouterVO, provider, guestNetworkId, true);
+                    finalizeMonitorService(cmds, profile, domainRouterVO, provider, guestNetworkId, true, routerHealthCheckConfig);
                 }

                 finalizeUserDataAndDhcpOnStart(cmds, domainRouterVO, provider, guestNetworkId);
@@ -554,7 +555,7 @@ public class VpcVirtualNetworkApplianceManagerImpl extends VirtualNetworkApplian
             finalizeNetworkRulesForNetwork(cmds, router, provider, networkId);
         }

-        finalizeMonitorService(cmds, getVirtualMachineProfile(router), router, provider, networkId, false);
+        finalizeMonitorService(cmds, getVirtualMachineProfile(router), router, provider, networkId, false, getRouterHealthChecksConfig(router));

         return _nwHelper.sendCommandsToRouter(router, cmds);
     }

@phsm
looks good.
can you create a PR?

phsm added a commit to phsm/cloudstack that referenced this issue Dec 20, 2023
Avoid an expensive getRouterHealthChecksConfig() execution multiple times during VPC restart.

Fixes apache#8055
phsm added a commit to phsm/cloudstack that referenced this issue Feb 15, 2024
Avoid an expensive getRouterHealthChecksConfig() execution multiple times during VPC restart.

Fixes apache#8055
weizhouapache pushed a commit that referenced this issue Feb 19, 2024
* Optimize createMonitorServiceCommand() execution.
Avoid an expensive getRouterHealthChecksConfig() execution multiple times during VPC restart.

Fixes #8055

* Move getRouterHealthChecksConfig() outside of loop
@weizhouapache

fixed by #8385

weizhouapache modified the milestones: unplanned, 4.18.2.0 on Feb 19, 2024
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this issue Mar 5, 2024
* Optimize createMonitorServiceCommand() execution.
Avoid an expensive getRouterHealthChecksConfig() execution multiple times during VPC restart.

Fixes apache#8055

* Move getRouterHealthChecksConfig() outside of loop