Backup virtual router of a redundant VPC stays in starting state after restart with clean up. #8055
Comments
Any suggestions appreciated.
@weizhouapache Yes, both the management server and the hypervisors are updated to 4.18.1.0. However, it doesn't seem this is the bug you mentioned. Here are the last messages from the management server regarding starting the new VM:
After this line, there is no activity on this job anymore (the new routerVM appears as "Starting"). The hypervisor that was chosen for the routerVM doesn't have any mention of its name in the agent.log:
It looks like the management server doesn't even send the VM creation command to the hypervisor. An hour later the job ends with the following message:
The running routerVM (the one in Running state while Cloudstack is trying to create the second routerVM) indeed shows the lines from the bug you've mentioned:
However, the timestamps in the log correspond to the VPC creation time (I created this test VPC at 09 Oct 2023 08:39:10). I'm still attempting to figure out the cause of the bug by removing VPC elements and trying to restart it. So far I suspect that a large number of tiers (the VPC has 11 tiers) is the cause, but it is yet to be confirmed. I will post an update here with the reproduce scenario once I have it.
@soreana
@weizhouapache When a VPC has a lot of tiers (around 10 to 11), the issue happens.
It is strange that the primary router works but the backup does not. Can you try the steps below?
@weizhouapache Thanks for the comment. They didn't help. I've updated the issue description with more information. Any help appreciated 🙏
@kriegsmanj As @soreana said, you have created another VPC with the same configuration and it works, right?
We did several tests:
Do you have any suggestions?
I created a vpc with 12 tiers (each tier has a vm), restart with cleanup worked well.
I did not see any code changes between 4.17 and 4.18 which might cause the issue.
@soreana @kriegsmanj
We have tried to reproduce the bug multiple times under different circumstances.
We continue our investigation.
@weizhouapache We have been performing more tests and are now suspecting that an ACL with multiple rules could be the cause of this delay in deploying routers. A VPC with 6 tiers and 7 custom ACL rules took more than 30 minutes to create, while the same VPC with the default ACL rule took about 15 minutes (50% quicker). Can you please explain how ACL rules are created and applied within a VPC, and where this process spends so much time? Also, how could we speed up this process in order to verify whether it helps? Thanks in advance!
Can you search for "vr_cfg.sh" in the management-server.log and agent.log?
@weizhouapache Sorry about the delay, it was a busy day. I checked node1:
node2:
node3:
The IPs are as follows:
An update on this issue. The mechanics of the process:
With trusty s_logger.debug() I was able to narrow the VR start procedure down to a specific method that introduces the big delay: getRouterHealthChecksConfig(). More specifically, it's this call inside the method that introduces the delay: userVmJoinDao.search(scvm, null).
But there is more: getRouterHealthChecksConfig() is executed for each VPC tier as part of the createMonitorServiceCommand() call during the VR startup process. createMonitorServiceCommand() is called from the finalizeMonitorService() method, and finalizeMonitorService() is executed twice here: 1, 2.
So eventually getRouterHealthChecksConfig() is executed more than a hundred times during VR startup, and each time it adds its 10-20 seconds to the process.
What do I propose
@weizhouapache I've actually managed to implement the patch myself. The test looks promising: Can you take a look at it? If it looks good to you, I can make a PR with it.
diff --git a/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java b/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java
index f7bfb1c4af..3d236c0a13 100644
--- a/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java
+++ b/server/src/main/java/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java
@@ -1623,7 +1623,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
}
private SetMonitorServiceCommand createMonitorServiceCommand(DomainRouterVO router, List<MonitorServiceTO> services,
- boolean reconfigure, boolean deleteFromProcessedCache) {
+ boolean reconfigure, boolean deleteFromProcessedCache, Map<String, String> routerHealthCheckConfig) {
final SetMonitorServiceCommand command = new SetMonitorServiceCommand(services);
command.setAccessDetail(NetworkElementCommand.ROUTER_IP, _routerControlHelper.getRouterControlIp(router.getId()));
command.setAccessDetail(NetworkElementCommand.ROUTER_NAME, router.getInstanceName());
@@ -1641,7 +1641,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
}
command.setAccessDetail(SetMonitorServiceCommand.ROUTER_HEALTH_CHECKS_EXCLUDED, excludedTests);
- command.setHealthChecksConfig(getRouterHealthChecksConfig(router));
+ command.setHealthChecksConfig(routerHealthCheckConfig);
command.setReconfigureAfterUpdate(reconfigure);
command.setDeleteFromProcessedCache(deleteFromProcessedCache); // As part of updating
return command;
@@ -1666,7 +1666,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
s_logger.info("Updating data for router health checks for router " + router.getUuid());
Answer origAnswer = null;
try {
- SetMonitorServiceCommand command = createMonitorServiceCommand(router, null, true, true);
+ SetMonitorServiceCommand command = createMonitorServiceCommand(router, null, true, true, getRouterHealthChecksConfig(router));
origAnswer = _agentMgr.easySend(router.getHostId(), command);
} catch (final Exception e) {
s_logger.error("Error while sending update data for health check to router: " + router.getInstanceName(), e);
@@ -1891,7 +1891,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
.append(generateKeyValuePairOrEmptyString("server.maxqueue", serverMaxqueue));
}
- private Map<String, String> getRouterHealthChecksConfig(final DomainRouterVO router) {
+ protected Map<String, String> getRouterHealthChecksConfig(final DomainRouterVO router) {
Map<String, String> data = new HashMap<>();
List<DomainRouterJoinVO> routerJoinVOs = domainRouterJoinDao.searchByIds(router.getId());
StringBuilder vmsData = new StringBuilder();
@@ -2464,7 +2464,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
if (reprogramGuestNtwks) {
finalizeIpAssocForNetwork(cmds, router, provider, guestNetworkId, null);
finalizeNetworkRulesForNetwork(cmds, router, provider, guestNetworkId);
- finalizeMonitorService(cmds, profile, router, provider, guestNetworkId, true);
+ finalizeMonitorService(cmds, profile, router, provider, guestNetworkId, true, getRouterHealthChecksConfig(router));
}
finalizeUserDataAndDhcpOnStart(cmds, router, provider, guestNetworkId);
@@ -2478,7 +2478,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
}
protected void finalizeMonitorService(final Commands cmds, final VirtualMachineProfile profile, final DomainRouterVO router, final Provider provider,
- final long networkId, boolean onStart) {
+ final long networkId, boolean onStart, Map<String, String> routerHealthCheckConfig) {
final NetworkOffering offering = _networkOfferingDao.findById(_networkDao.findById(networkId).getNetworkOfferingId());
if (offering.isRedundantRouter()) {
// service monitoring is currently not added in RVR
@@ -2528,7 +2528,7 @@ Configurable, StateListener<VirtualMachine.State, VirtualMachine.Event, VirtualM
}
// As part of aggregate command we don't need to reconfigure if onStart and persist in processed cache. Subsequent updates are not needed.
- SetMonitorServiceCommand command = createMonitorServiceCommand(router, servicesTO, !onStart, false);
+ SetMonitorServiceCommand command = createMonitorServiceCommand(router, servicesTO, !onStart, false, routerHealthCheckConfig);
command.setAccessDetail(NetworkElementCommand.ROUTER_GUEST_IP, _routerControlHelper.getRouterIpInNetwork(networkId, router.getId()));
if (!isMonitoringServicesEnabled) {
command.setAccessDetail(SetMonitorServiceCommand.ROUTER_MONITORING_ENABLED, isMonitoringServicesEnabled.toString());
diff --git a/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java b/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java
index 18801eb01f..27958126e7 100644
--- a/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java
+++ b/server/src/main/java/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java
@@ -482,8 +482,9 @@ public class VpcVirtualNetworkApplianceManagerImpl extends VirtualNetworkApplian
throw new CloudRuntimeException("Cannot find related provider of virtual router provider: " + vrProvider.getType().toString());
}
+ Map<String, String> routerHealthCheckConfig = getRouterHealthChecksConfig(domainRouterVO);
if (reprogramGuestNtwks && publicNics.size() > 0) {
- finalizeMonitorService(cmds, profile, domainRouterVO, provider, publicNics.get(0).second().getId(), true);
+ finalizeMonitorService(cmds, profile, domainRouterVO, provider, publicNics.get(0).second().getId(), true, routerHealthCheckConfig);
}
for (final Pair<Nic, Network> nicNtwk : guestNics) {
@@ -495,7 +496,7 @@ public class VpcVirtualNetworkApplianceManagerImpl extends VirtualNetworkApplian
if (reprogramGuestNtwks) {
finalizeIpAssocForNetwork(cmds, domainRouterVO, provider, guestNetworkId, vlanMacAddress);
finalizeNetworkRulesForNetwork(cmds, domainRouterVO, provider, guestNetworkId);
- finalizeMonitorService(cmds, profile, domainRouterVO, provider, guestNetworkId, true);
+ finalizeMonitorService(cmds, profile, domainRouterVO, provider, guestNetworkId, true, routerHealthCheckConfig);
}
finalizeUserDataAndDhcpOnStart(cmds, domainRouterVO, provider, guestNetworkId);
@@ -554,7 +555,7 @@ public class VpcVirtualNetworkApplianceManagerImpl extends VirtualNetworkApplian
finalizeNetworkRulesForNetwork(cmds, router, provider, networkId);
}
- finalizeMonitorService(cmds, getVirtualMachineProfile(router), router, provider, networkId, false);
+ finalizeMonitorService(cmds, getVirtualMachineProfile(router), router, provider, networkId, false, getRouterHealthChecksConfig(router));
return _nwHelper.sendCommandsToRouter(router, cmds);
}
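The diff above amounts to a classic hoisting optimization: the expensive per-router lookup is computed once by the caller and passed into the per-tier loop instead of being recomputed for every tier. The following is a minimal, self-contained sketch of that pattern; the class and method names here are hypothetical stand-ins, not CloudStack's actual code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class HoistExpensiveCall {
    // Counts how many times the "expensive" lookup runs.
    static final AtomicInteger dbCalls = new AtomicInteger();

    // Stand-in for getRouterHealthChecksConfig(): imagine each call
    // issuing slow DAO searches (10-20 seconds in the reported issue).
    static Map<String, String> loadHealthChecksConfig() {
        dbCalls.incrementAndGet();
        Map<String, String> cfg = new HashMap<>();
        cfg.put("router", "r-1234-VM");
        return cfg;
    }

    // Stand-in for createMonitorServiceCommand(): consumes the config.
    static void buildMonitorCommand(int tier, Map<String, String> cfg) {
        // build and queue the per-tier command here
    }

    // Before the patch: the config is recomputed inside every
    // per-tier command, so cost grows linearly with tier count.
    static void configureTiersNaive(int tiers) {
        for (int i = 0; i < tiers; i++) {
            Map<String, String> cfg = loadHealthChecksConfig();
            buildMonitorCommand(i, cfg);
        }
    }

    // After the patch: the config is computed once by the caller
    // and reused, so there is a single expensive call per router.
    static void configureTiersHoisted(int tiers) {
        Map<String, String> cfg = loadHealthChecksConfig();
        for (int i = 0; i < tiers; i++) {
            buildMonitorCommand(i, cfg);
        }
    }

    public static void main(String[] args) {
        dbCalls.set(0);
        configureTiersNaive(11);
        System.out.println("naive: " + dbCalls.get());

        dbCalls.set(0);
        configureTiersHoisted(11);
        System.out.println("hoisted: " + dbCalls.get());
    }
}
```

For the 11-tier VPC from this report, the naive form performs 11 expensive lookups per pass (and in CloudStack the pass itself runs multiple times), while the hoisted form performs one; this is safe as long as the config does not change while the loop runs.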
@phsm
Avoid an expensive getRouterHealthChecksConfig() execution multiple times during VPC restart. Fixes apache#8055
* Optimize createMonitorServiceCommand() execution. Avoid an expensive getRouterHealthChecksConfig() execution multiple times during VPC restart. Fixes #8055
* Move getRouterHealthChecksConfig() outside of loop
fixed by #8385
ISSUE TYPE
COMPONENT NAME
CLOUDSTACK VERSION
CONFIGURATION
Advance zone with isolated network
SUMMARY
One of our CloudStack environments was upgraded from 4.17.2 to 4.18.1 last week. Then we restarted a big VPC with 11 tiers and ACLs that have 8 rules in them. After restarting that VPC, we noticed the following behavior:
We checked the following; they weren't the cause:
STEPS TO REPRODUCE
EXPECTED RESULTS
ACTUAL RESULTS
ADDITIONAL INFORMATION
We noticed that the actual libvirt domain of the VR was never even attempted to be provisioned on the hypervisor.
It looks like the management server holds this job until it gets cancelled by some timeout.
Watching the management server log with grep by VR instance name shows the following behavior: