Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLOUDSTACK-9319: Use timeout when applying config to virtual router #1451

Merged
merged 2 commits into from
Oct 25, 2016

Conversation

insom
Copy link
Contributor

@insom insom commented Mar 23, 2016

From the JIRA issue:

The timeout parameter is not passed down to applyConfigToVR inside VirtualRoutingResource in all cases.

This timeout is worked out as 3 seconds per command or 120 seconds (whichever is larger), but because it's not passed to the first invocation, the default (120 seconds, DEFAULT_EXECUTEINVR_TIMEOUT) is used.

In a recent upgrade of our Virtual Routers, the timeout was being hit and increasing router.aggregation.command.each.timeout had no effect. I built a custom 4.8 agent with the timeout increased to allow the upgrade to continue.

@DaanHoogland
Copy link
Contributor

@insom any chance you can supply some test-mech?

@alexandrelimassantana
Copy link
Contributor

@insom with this changes there is still one place where the applyConfigToVR is not provided with this timeout parameter (line 183, same class). This function is actually called only in those 2 places. If it comes to a state where you will always supply the timeout, remove the function which doesn't require you to supply it.

Also, if you plan to always supply the timeout, a change into lines 378 and 379 would provide more coersion:

if (timeout < 120) {
    timeout = 120;
}

into:

if (timeout < VRScripts.DEFAULT_EXECUTEINVR_TIMEOUT) {
    timeout = VRScripts.DEFAULT_EXECUTEINVR_TIMEOUT;
}

@insom
Copy link
Contributor Author

insom commented Mar 26, 2016

@alexandrelimassantana Agreed, thanks for the review. I've added a commit removing the prototype with the default. That has allowed me to remove that if altogether and place it inside the applyConfigToVR method.

@DaanHoogland Because the original commit is about the wrong signature for the method being used I'm not quite sure how to add tests for it. I hope that this second commit will help as it means the compiler can enforce things. Thanks for the review!

(Also it's Friday and I don't have a Maven install on this laptop - should pass the CI tests, if not, I may need to come back to it on Tuesday).

@@ -180,7 +179,7 @@ private Answer applyConfig(NetworkElementCommand cmd, List<ConfigItem> cfg) {
boolean finalResult = false;
for (ConfigItem configItem : cfg) {
long startTimestamp = System.currentTimeMillis();
ExecutionResult result = applyConfigToVR(cmd.getRouterAccessIp(), configItem);
ExecutionResult result = applyConfigToVR(cmd.getRouterAccessIp(), configItem, VRScripts.DEFAULT_EXECUTEINVR_TIMEOUT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@insom Since you're now checking the timeout's validity when calling applyConfigToVR shouldn't the last parameter passed through this call be timeout instead of VRScripts.DEFAULT_EXECUTEINVR_TIMEOUT?

@alexandrelimassantana
Copy link
Contributor

@cristofolini there is no timeout variable in the scope you commented. His change is valid because VRScripts.DEFAULT_EXECUTEINVR_TIMEOUT is mapped to the previous default timeout value. This is equivalent to the previous applyConfigToVR() call without passing the timout value.

@cristofolini
Copy link
Contributor

@alexandrelimassantana Ah, I see that now. Thanks for the clarification! :)

@@ -180,7 +179,7 @@ private Answer applyConfig(NetworkElementCommand cmd, List<ConfigItem> cfg) {
boolean finalResult = false;
for (ConfigItem configItem : cfg) {
long startTimestamp = System.currentTimeMillis();
ExecutionResult result = applyConfigToVR(cmd.getRouterAccessIp(), configItem);
ExecutionResult result = applyConfigToVR(cmd.getRouterAccessIp(), configItem, VRScripts.DEFAULT_EXECUTEINVR_TIMEOUT);
if (s_logger.isDebugEnabled()) {
long elapsed = System.currentTimeMillis() - startTimestamp;
s_logger.debug("Processing " + configItem + " took " + elapsed + "ms");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @insom .

I know that this isn't your code but could you use String.format to create the string that is used by logger?
It turns the code more readable than using multiple strings concatenation.

Ty.

@insom
Copy link
Contributor Author

insom commented Mar 29, 2016

@pedro-martins Thanks for the review but I don't want to make this change because (a) it'll mix two intents into on PR - fixing the bug and improving the logging - and (b) it would be less consistent with the rest of the CloudStack code base which, for better or worse, largely uses string concatenation for debug logging.

@bvbharatk
Copy link
Contributor

ACS CI BVT Run

Sumarry:
Build Number 152
Hypervisor xenserver
NetworkType Advanced
Passed=68
Failed=5
Skipped=3

Link to logs Folder (search by build_no): https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0

Failed tests:

  • test_vpc_vpn.py
    • ContextSuite context=TestRVPCSite2SiteVpn>:setup Failing since 13 runs
    • ContextSuite context=TestVpcRemoteAccessVpn>:setup Failing since 13 runs
    • ContextSuite context=TestVpcSite2SiteVpn>:setup Failing since 13 runs
  • test_volumes.py
    • test_06_download_detached_volume Failed
  • test_vm_life_cycle.py
    • test_10_attachAndDetach_iso Failed

Skipped tests:
test_vm_nic_adapter_vmxnet3
test_static_role_account_acls
test_deploy_vgpu_enabled_vm

Passed test suits:
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_routers.py
test_reset_vm_on_reboot.py
test_snapshots.py
test_deploy_vms_with_varied_deploymentplanners.py
test_login.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_disk_offerings.py

@insom
Copy link
Contributor Author

insom commented Jun 9, 2016

@bvbharatk Hi, am I supposed to do something with this CI output? It looks like all of the errors are related to other parts of CloudStack that aren't touched by my commits. I came back to check on this bug because I got bitten by this again today (timeout when adding rules to a virtual router, but the timeout is actually not getting applied properly).

@insom
Copy link
Contributor Author

insom commented Sep 23, 2016

I got contacted directly by someone else hit by this bug and looking for a fix. They saw I had commits relating to their problem. What needs to happen to get it merged?

@ustcweizhou
Copy link
Contributor

many people thought that changing router.aggregation.command.each.timeout in global setting will increase the timeout. Actually it is not, at least for KVM.

for KVM hypervisors, we need to put router.aggregation.command.each.timeout=??? in agent.properties and restart cloudstack-agent.

This PR increases the minimal timeout to 120 seconds which is helpful but does not really fix the issue.

I think it is better to add host-level setting for each host in the future, like we did for domain/account, zone/cluster.

@insom
Copy link
Contributor Author

insom commented Sep 23, 2016

@ustcweizhou A side effect of this PR is setting the minimal timeout to 120, but that's what the original code intended - except that because of not passing through the timeout to applyConfigToVR it wasn't taking effect.

What this PR really does is make the code work as intended. I'm up for improvements but if this is blocking upgrades (and at least iWeb and one other CloudStack user were hit by it) seems like merging this and then moving it to a per-host setting would be a good idea?

@ustcweizhou
Copy link
Contributor

@insom do not worry, actually I am +1 with this PR. I agree with what you said.

@ozhanka
Copy link

ozhanka commented Sep 26, 2016

Hi, last week we experience some problems because of the timeouts which mentioned on this PR. Could we get this PR to upcoming versions asap please. Without this timeout configuration functionality working, its very hard to make upgrades on VR, timeout issue generates real headaches.

@karuturi
Copy link
Member

Code LGTM.

BVTs are good(iso failures are URL access issues and not related to this PR).

merging this now

@asfgit asfgit merged commit ddfc368 into apache:master Oct 25, 2016
asfgit pushed a commit that referenced this pull request Oct 25, 2016
CLOUDSTACK-9319: Use timeout when applying config to virtual routerFrom the [JIRA issue](https://issues.apache.org/jira/browse/CLOUDSTACK-9319):

> The timeout parameter is not passed down to `applyConfigToVR` inside `VirtualRoutingResource` in all cases.
>
> This timeout is worked out as 3 seconds per command or 120 seconds (whichever is larger), but because it's not passed to the first invocation, the default (120 seconds, DEFAULT_EXECUTEINVR_TIMEOUT) is used.
>
> In a recent upgrade of our Virtual Routers, the timeout was being hit and increasing `router.aggregation.command.each.timeout` had no effect. I built a custom 4.8 agent with the timeout increased to allow the upgrade to continue.

* pr/1451:
  Remove dangerous prototype of applyConfigToVR
  Use timeout when applying config to virtual router

Signed-off-by: Rajani Karuturi <rajani.karuturi@accelerite.com>
@cloudmonger
Copy link

ACS CI BVT Run

Sumarry:
Build Number 121
Hypervisor xenserver
NetworkType Advanced
Passed=103
Failed=2
Skipped=5

Link to logs Folder (search by build_no): https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0

Failed tests:

  • test_deploy_vm_iso.py
    • test_deploy_vm_from_iso Failing since 6 runs
  • test_vm_life_cycle.py
    • test_10_attachAndDetach_iso Failing since 7 runs

Skipped tests:
test_01_test_vm_volume_snapshot
test_vm_nic_adapter_vmxnet3
test_static_role_account_acls
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

Passed test suits:
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_snapshots.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_non_contigiousvlan.py
test_login.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_volumes.py
test_ssvm.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_routers_network_ops.py
test_disk_offerings.py

@insom insom deleted the CLOUDSTACK-9319 branch October 25, 2016 12:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.