Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLOUDSTACK-9652 Job framework - Cancelling async jobs #1832

Closed
wants to merge 11 commits into from
Closed

CLOUDSTACK-9652 Job framework - Cancelling async jobs #1832

wants to merge 11 commits into from

Conversation

karuturi
Copy link
Member

enabled cancellation of long running or subsequent queued up async jobs

@karuturi karuturi changed the title cloudstack-9652 Job framework - Cancelling async jobs CLOUDSTACK-9652 Job framework - Cancelling async jobs Dec 15, 2016
@marcaurele
Copy link
Member

@karuturi It's a good feature but I don't see where the job gets actually cancelled. Reading the changes I can only see that the response of the job will become a cancelled operation, but the actual job does not get kill/cancel on the hypervisor for example. Did I miss something or is it not fully implemented yet?

@karuturi
Copy link
Member Author

@marcaurele The thread which is running the job gets killed due to OperationCancelledException. But, if any command is already sent and is being executed on hypervisor, that wont be cancelled. check for the changes in AgentAttache where this new exception is thrown.
For example, if a deployvm is cancelled and the command is already sent to hypervisor, the vm will continue to launch on hypervisor. but, on cloudstack side, the threads and the jobs will be cleanedup and cancelled. Eventually, vm sync will sync the states. There can be instances when the job cancellation was successful but, the vm is in running state after sometime.
The cancellation should be used with caution only by admin keeping in mind that some resource cleanup on hypervisor might be required.
This will only unblock the jobs in cloudstack which are waiting for a long time for a certain job to complete.

@marcaurele
Copy link
Member

@karuturi Ok thanks for the clarifications, and it's the scenario I thought about too. That being said, I'm currently thinking of a new approach for the command sequencer because having implemented the live migration, the non-parallel commands isn't optimal at all when you have long running sequential commands on a hypervisor. And I tend to think that's the reason behind your PR, isn't it? The way it's currently done is too simple (if a job cannot be run in parallel on the HV, it will put in a queue any other coming job that needs to run on this same HV). IMO this sequencing should take into account what kind of job is coming and for which type of resources. For example a security group update, a VM start and a migration for different VMs should be able to run in parallel because they are unrelated. With today design, it isn't possible.

So don't you think we're better of rewriting the sequencer to let more commands being executed in parallel to avoid this bottleneck on the AgentAttache? It would normally make the cancellation not needed in the way you implemented it since less jobs will be queued.

If we wish to be able to cancel a job, IMHO it should cancel the job down on the hypervisor too, thus clearing normally the resources involved as if the execution didn't go well.

Otherwise, the way you implemented it. I would not let a job being cancel if it has been sent to the hypervisor to clearly return to the user that it wasn't cancelable anymore (you're too late! -> seq number isn't in _requests anymore so it has been sent to the HV). I'm putting more comment in the code.

What do you think?

try {
answers = sl.waitFor(wait);
job = _agentMgr._asyncJobDao.findById(jobId);
if (job != null && job.getStatus() == JobInfo.Status.CANCELLED) {
throw new OperationCancelledException(req.getCommands(), _id, seq, wait, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to throw an OperationCancelledException if the we have the job answer. It's better to let the normal response come back to the user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. will do the change.

@rohityadavcloud
Copy link
Member

Nice feature, @karuturi can you add suitable marvin tests, squash your changes.

@karuturi
Copy link
Member Author

@marcaurele out for the next three days. will get back
@rhtyd marvin test is in progress. I will take it up in a separate PR.
Regarding squashing, a humongous commit would be difficult. I already grouped them into individual logical units.

@rohityadavcloud
Copy link
Member

rohityadavcloud commented Jan 3, 2017

@karuturi please squash your changes and fix the merge conflict. We've merged very large changes that were squashed into a single commit, it becomes easier to then cherry-pick them or revert them.

@karuturi
Copy link
Member Author

karuturi commented Jan 4, 2017

@rhtyd I dont buy that argument. Just for the ease of revert or cherry-pick we shouldnt commit all the changes together. The revert or cherry-pick quickly becomes irrelevant as more code gets committed on top of it. I would like to commit them as is.
regarding the conflicts, will resolve them as soon as I get time. right now, busy with $dayjob.

@karuturi
Copy link
Member Author

karuturi commented Feb 28, 2017

@marcaurele

@karuturi Ok thanks for the clarifications, and it's the scenario I thought about too. That being said, I'm currently thinking of a new approach for the command sequencer because having implemented the live migration, the non-parallel commands isn't optimal at all when you have long running sequential commands on a hypervisor. And I tend to think that's the reason behind your PR, isn't it?

Yes, thats right.

The way it's currently done is too simple (if a job cannot be run in parallel on the HV, it will put in
So don't you think we're better of rewriting the sequencer to let more commands being executed in parallel to avoid this bottleneck on the AgentAttache? It would normally make the cancellation not needed in the way you implemented it since less jobs will be queued.

As you already said, with todays design it isn't possible. Rewriting is obviously better. But, thats a bigger job. In the current design, this was the only possible way to allow cloudstack to process queued up jobs.

If we wish to be able to cancel a job, IMHO it should cancel the job down on the hypervisor too, thus clearing normally the resources involved as if the execution didn't go well.

I agree. But, thats a huge task given the number of hypervisors we support and their versions.

@marcaurele
Copy link
Member

@karuturi Ok

@cloudmonger
Copy link

ACS CI BVT Run

Sumarry:
Build Number 551
Hypervisor xenserver
NetworkType Advanced
Passed=110
Failed=2
Skipped=7

Link to logs Folder (search by build_no): https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0

Failed tests:

  • test_routers_network_ops.py

  • test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false Failing since 2 runs

  • test_03_RVR_Network_check_router_state Failing since 2 runs

Skipped tests:
test_01_test_vm_volume_snapshot
test_vm_nic_adapter_vmxnet3
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_nested_virtualization_vmware
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

Passed test suits:
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_non_contigiousvlan.py
test_login.py
test_deploy_vm_iso.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_metrics_api.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_volumes.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_vm_life_cycle.py
test_disk_offerings.py

return isCancelled;
}

public void setIsCancelled(boolean isCancelled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This setter is getting used only within the ctor, if not used anywhere else this can be removed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

/**
* job can be cancelled using async job cancel api
*/
public class OperationCancelledException extends CloudException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider adding a serial version ID if this is getting (de)serialised

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return _isCancelled;
}

public void setCancelled(boolean isCancelled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not getting used anywhere, please remove

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -67,4 +68,12 @@ public int getWaitTime() {
public boolean isActive() {
return _isActive;
}

public boolean isCancelled() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -38,6 +38,7 @@
//
transient Command[] _cmds;
boolean _isActive;
boolean _isCancelled;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since setter/getter not used, this can be removed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -202,6 +205,14 @@ public String getInstanceUuid() {
return instanceUuid;
}

public void setRelated(String related) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setter is not used please remove

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

s_logger.info("Moving on to the next host because " + h.toString() + " is unavailable");
continue;
} catch (OperationTimedoutException e) {
} catch (AgentUnavailableException | OperationCancelledException | OperationTimedoutException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cancelled exception why it should move to the next host?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

throw new AgentUnavailableException("Unable to send commands to virtual router ", router.getHostId(), e);
} catch (final OperationCancelledException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cancelled exception is handled without any rethrow, can it cause issues somewhere up the chain

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added rethrow.

@@ -2793,6 +2795,8 @@ public long getMemoryOrCpuCapacityByHost(final Long hostId, final short capacity
cmdList.add(UpdateIsoCmd.class);
cmdList.add(UpdateIsoPermissionsCmd.class);
cmdList.add(ListAsyncJobsCmd.class);
cmdList.add(ListLongRunningAsyncJobsCmd.class);
cmdList.add(ListQueuedUpAsyncJobsCmd.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about CancelAsyncJob?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its in asyncjobmanagerimpl

@@ -3761,6 +3767,8 @@ public boolean setupVmForPvlan(boolean add, Long hostId, NicProfile nic) {
} catch (AgentUnavailableException e) {
s_logger.warn("Agent Unavailable ", e);
return false;
} catch (OperationCancelledException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no return value for operation cancelled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it answer is null, false is returned at the end. removed return in the above exception handling.

@shwetaag
Copy link

  1. Deploy a VM ... === TestName: test_Cancel_Add_NIC_to_VM | Status : SUCCESS ===
    ok
  2. Deploy a VM ... === TestName: test_Cancel_Destory_VM | Status : SUCCESS ===
    ok
  3. Deploy a VM ... === TestName: test_Cancel_Reboot_VM | Status : SUCCESS ===
    ok
    Negative test to verify Admin can not cancel restore VM ... === TestName: test_Cancel_Reset_VM | Status : SUCCESS ===
    ok
  4. Deploy a VM ... === TestName: test_Cancel_Start_VM | Status : SUCCESS ===
    ok
  5. Deploy a VM ... === TestName: test_Cancel_Stop_VM | Status : SUCCESS ===
    ok

Ran 6 tests in 4308.224s

OK

@shwetaag
Copy link

  1. Start Deploying a VM ... === TestName: test_Cancel_VM_Deployment | Status : SUCCESS ===
    ok
    Test to cancel volume snapshot that is being deployed ... === TestName: test_Cancel_Volume_Snapshot_Creation | Status : SUCCESS ===
    ok

karuturi and others added 11 commits June 6, 2017 11:23
added listlongrunnningjobs api

added listqueuedupasyncjobs api
Throwing an exception in agentattache incase the job is cancelled. The
top layers should handle the exception and take necessary action for
cleaning of resources.
Added exception handling at various agent commands for this new
checked exception.
check the state of the parent job before submitting the worker thread.
starting work thread only if parent job is not done.
…operation is cancelled while implementing a new network as part of it

In this case the cancel resulted in failure to start VR and so the network couldn't be implemented. This triggered network cleanup which
attempted to stop VR. The stop failed as there is no VM state transition defined from 'Starting' state for event 'StopRequested'. Failure
to stop VR resulted in it getting stuck in 'Starting' state thus preventing any subsequent VM creation. As part of the fix added the missing
VM state transition.
yvsubhash pushed a commit to yvsubhash/cloudstack that referenced this pull request Aug 30, 2017
* CLOUDSTACK-9652:
  CLOUDSTACK-9652: updated with review comments
  CLOUDSTACK-9652 VM life cycle test cases for cancel async job
  CLOUDSTACK-9652: Subsequent VM creation fails if current VM creation operation is cancelled while implementing a new network as part of it In this case the cancel resulted in failure to start VR and so the network couldn't be implemented. This triggered network cleanup which attempted to stop VR. The stop failed as there is no VM state transition defined from 'Starting' state for event 'StopRequested'. Failure to stop VR resulted in it getting stuck in 'Starting' state thus preventing any subsequent VM creation. As part of the fix added the missing VM state transition.
  CLOUDSTACK-9652 cleaning up async jobs on graceful MS shutdown
  CLOUDSTACK-9652 cancelling a job in queue should not throw exception
  CLOUDSTACK-9652 unittests for cancelAsyncJob cmd
  CLOUDSTACK-9652: added OperationCancelledException for a cancelled job
  CLOUDSTACK-9652 Annotating async APIs as cancellable or not
  CLOUDSTACK-9652 Cleanup at Agent Layer
  CLOUDSTACK-9652 API to find long running jobs
  CLOUDSTACK-9652: added new cancel async job api
@rohityadavcloud
Copy link
Member

Ping @karuturi -- looks like good feature, can you fix conflicts?

@DaanHoogland
Copy link
Contributor

@karuturi please rebase and re-open if still relevant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants