CLOUDSTACK-9652 Job framework - Cancelling async jobs #1832

karuturi · 2016-12-15T11:17:03Z

Enabled cancellation of long-running or subsequently queued-up async jobs.

marcaurele · 2016-12-15T12:42:49Z

@karuturi It's a good feature, but I don't see where the job actually gets cancelled. Reading the changes, I can only see that the job's response becomes a cancelled operation; the actual job does not get killed or cancelled on the hypervisor, for example. Did I miss something, or is it not fully implemented yet?

karuturi · 2016-12-15T17:52:27Z

@marcaurele The thread that is running the job gets killed by the OperationCancelledException. But if a command has already been sent and is executing on the hypervisor, it won't be cancelled. Check the changes in AgentAttache, where this new exception is thrown.
For example, if a deployvm is cancelled after the command has already been sent to the hypervisor, the VM will continue to launch on the hypervisor, but on the CloudStack side the threads and jobs will be cleaned up and cancelled. Eventually, VM sync will reconcile the states. There can be cases where the job cancellation succeeded but the VM is in the running state some time later.
Cancellation should be used with caution, and only by an admin, keeping in mind that some resource cleanup on the hypervisor might be required.
It only unblocks jobs in CloudStack that have been waiting a long time for a certain job to complete.

marcaurele · 2016-12-15T19:00:21Z

@karuturi OK, thanks for the clarifications; that's the scenario I had in mind too. That being said, I'm currently thinking about a new approach for the command sequencer: having implemented live migration, I found that non-parallel commands are far from optimal when you have long-running sequential commands on a hypervisor. I tend to think that's the reason behind your PR, isn't it? The current approach is too simple (if a job cannot run in parallel on the HV, any other incoming job that needs to run on the same HV is put in a queue). IMO the sequencing should take into account what kind of job is coming in and which type of resource it targets. For example, a security group update, a VM start and a migration for different VMs should be able to run in parallel because they are unrelated. With today's design, that isn't possible.

So don't you think we'd be better off rewriting the sequencer to let more commands execute in parallel and avoid this bottleneck on the AgentAttache? That would normally make cancellation, in the way you implemented it, unnecessary, since fewer jobs would be queued.
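As an illustration of the per-resource sequencing idea, here is a minimal sketch. It assumes a hypothetical `ResourceSequencer` class and string resource keys such as `vm-1`; neither exists in CloudStack, and the real sequencing in AgentAttache is more involved.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

// Hypothetical sketch, not CloudStack code: serialises commands per resource key
// (for example "vm-1") instead of per host, so unrelated commands can overlap.
final class ResourceSequencer {
    // Last scheduled command for each resource key. Entries are never removed
    // in this sketch; a real implementation would need to clean them up.
    private final Map<String, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();
    private final Executor executor;

    ResourceSequencer(Executor executor) {
        this.executor = executor;
    }

    // Chains the command behind the previous command for the same key.
    // Commands for different keys are independent and may run in parallel.
    CompletableFuture<Void> submit(String resourceKey, Runnable command) {
        return tails.compute(resourceKey, (key, tail) -> {
            CompletableFuture<Void> previous =
                    (tail == null) ? CompletableFuture.completedFuture(null) : tail;
            // exceptionally() swallows a failure of the previous command so the chain keeps moving.
            return previous.exceptionally(e -> null).thenRunAsync(command, executor);
        });
    }
}
```

With a sequencer like this, a VM start for `vm-1` and a migration for `vm-2` would not wait on each other, while two commands for `vm-1` would still run in submission order.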

If we want to be able to cancel a job, IMHO it should cancel the job down on the hypervisor too, cleaning up the resources involved as if the execution had failed.

Otherwise, with the way you implemented it, I would not let a job be cancelled once it has been sent to the hypervisor, so that the user is clearly told it is no longer cancellable (you're too late! -> the seq number isn't in _requests anymore, so it has been sent to the HV). I'm putting more comments in the code.
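The "too late" rule could be sketched roughly as follows. `PendingRequests`, `markSent` and `CancelResult` are hypothetical names used only for illustration; the only detail taken from the discussion is that a sequence number stays in a pending collection until the command is sent to the hypervisor.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a command can only be cancelled while its sequence
// number is still tracked as pending, i.e. before it is sent to the hypervisor.
final class PendingRequests {
    enum CancelResult { CANCELLED, TOO_LATE }

    private final Map<Long, String> pending = new ConcurrentHashMap<>();

    void enqueue(long seq, String command) {
        pending.put(seq, command);
    }

    // Called when the command is handed to the hypervisor.
    // Returns null if the command was cancelled before it could be sent.
    String markSent(long seq) {
        return pending.remove(seq);
    }

    // remove() is atomic, so exactly one of cancel() and markSent() wins for a given seq.
    CancelResult cancel(long seq) {
        return pending.remove(seq) != null ? CancelResult.CANCELLED : CancelResult.TOO_LATE;
    }
}
```

The caller can then map `TOO_LATE` to an explicit "not cancellable anymore" response instead of reporting a cancellation that did not reach the hypervisor.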

What do you think?

marcaurele · 2016-12-15T19:03:15Z

engine/orchestration/src/com/cloud/agent/manager/AgentAttache.java

                try {
                    answers = sl.waitFor(wait);
+                    job = _agentMgr._asyncJobDao.findById(jobId);
+                    if (job != null && job.getStatus() == JobInfo.Status.CANCELLED) {
+                        throw new OperationCancelledException(req.getCommands(), _id, seq, wait, false);


Why throw an OperationCancelledException if we already have the job's answer? It's better to let the normal response come back to the user.

I agree. Will make the change.
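A minimal sketch of the ordering agreed on here, assuming hypothetical names (`AnswerFirstWait`, `JobCancelledException`) rather than the actual AgentAttache code: an answer that has already arrived is returned as is, and the cancelled status is only consulted when there is no answer.

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the exception added in this PR.
class JobCancelledException extends Exception {
    private static final long serialVersionUID = 1L;

    JobCancelledException(String message) {
        super(message);
    }
}

final class AnswerFirstWait {
    // Returns the answer when one arrived; only a missing answer consults the job status.
    static String resolve(String answer, BooleanSupplier jobCancelled) throws JobCancelledException {
        if (answer != null) {
            return answer; // the command completed, so report its real result
        }
        if (jobCancelled.getAsBoolean()) {
            throw new JobCancelledException("job was cancelled while waiting for an answer");
        }
        return null; // no answer and not cancelled: the caller keeps waiting or times out
    }
}
```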

rohityadavcloud · 2016-12-19T07:13:51Z

Nice feature. @karuturi, can you add suitable marvin tests and squash your changes?

karuturi · 2016-12-19T11:52:39Z

@marcaurele I'm out for the next three days; will get back to you.
@rhtyd The marvin test is in progress. I will take it up in a separate PR.
Regarding squashing, a humongous commit would be difficult. I have already grouped the changes into individual logical units.

rohityadavcloud · 2017-01-03T07:35:19Z

@karuturi please squash your changes and fix the merge conflict. We've merged very large changes that were squashed into a single commit; it then becomes easier to cherry-pick or revert them.

karuturi · 2017-01-04T04:28:59Z

@rhtyd I don't buy that argument. We shouldn't commit all the changes together just for ease of revert or cherry-pick. A revert or cherry-pick quickly becomes irrelevant as more code gets committed on top of it. I would like to commit them as is.
Regarding the conflicts, I will resolve them as soon as I get time. Right now I'm busy with $dayjob.

karuturi · 2017-02-28T11:42:00Z

@marcaurele

@karuturi OK, thanks for the clarifications; that's the scenario I had in mind too. That being said, I'm currently thinking about a new approach for the command sequencer: having implemented live migration, I found that non-parallel commands are far from optimal when you have long-running sequential commands on a hypervisor. I tend to think that's the reason behind your PR, isn't it?

Yes, that's right.

The way it's currently done is too simple (if a job cannot be run in parallel on the HV, it will put in
So don't you think we'd be better off rewriting the sequencer to let more commands execute in parallel and avoid this bottleneck on the AgentAttache? That would normally make cancellation, in the way you implemented it, unnecessary, since fewer jobs would be queued.

As you already said, with today's design it isn't possible. Rewriting is obviously better, but that's a bigger job. In the current design, this was the only possible way to let CloudStack process queued-up jobs.

If we want to be able to cancel a job, IMHO it should cancel the job down on the hypervisor too, cleaning up the resources involved as if the execution had failed.

I agree. But that's a huge task, given the number of hypervisors and versions we support.

marcaurele · 2017-02-28T13:37:49Z

@karuturi Ok

cloudmonger · 2017-04-18T15:07:17Z

ACS CI BVT Run

Summary:
Build Number 551
Hypervisor xenserver
NetworkType Advanced
Passed=110
Failed=2
Skipped=7

Link to logs Folder (search by build_no): https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0

Failed tests:

test_routers_network_ops.py
test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false Failing since 2 runs
test_03_RVR_Network_check_router_state Failing since 2 runs

Skipped tests:
test_01_test_vm_volume_snapshot
test_vm_nic_adapter_vmxnet3
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_nested_virtualization_vmware
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

Passed test suites:
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_non_contigiousvlan.py
test_login.py
test_deploy_vm_iso.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_metrics_api.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_volumes.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_vm_life_cycle.py
test_disk_offerings.py

koushik-das · 2017-05-02T06:26:16Z

api/src/com/cloud/exception/AgentUnavailableException.java

+        return isCancelled;
+    }
+
+    public void setIsCancelled(boolean isCancelled) {


This setter is only used within the ctor; if it is not used anywhere else, it can be removed.

koushik-das · 2017-05-02T06:27:11Z

api/src/com/cloud/exception/OperationCancelledException.java

+/**
+ * job can be cancelled using async job cancel api
+ */
+public class OperationCancelledException extends CloudException {


Please consider adding a serial version ID if this is getting (de)serialised
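For reference, the suggestion amounts to something like the sketch below. The class names are hypothetical stand-ins (`BaseCloudException` for CloudException, `CancelledOperationException` for the new exception), and the ID value is arbitrary.

```java
// Hypothetical stand-in for CloudException, which is not reproduced here.
class BaseCloudException extends Exception {
    private static final long serialVersionUID = 1L;

    BaseCloudException(String message) {
        super(message);
    }
}

// Sketch of the suggestion: pin the serialised form with an explicit ID.
class CancelledOperationException extends BaseCloudException {
    // The value is arbitrary; it only has to stay stable across compatible versions of the class.
    private static final long serialVersionUID = 1L;

    CancelledOperationException(String message) {
        super(message);
    }
}
```

Without an explicit ID, the JVM derives one from the class structure, so an otherwise compatible change to the class can break deserialisation between different builds.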

koushik-das · 2017-05-02T06:27:47Z

api/src/com/cloud/exception/OperationTimedoutException.java

+        return _isCancelled;
+    }
+
+    public void setCancelled(boolean isCancelled) {


This is not used anywhere; please remove it.

koushik-das · 2017-05-02T06:28:30Z

api/src/com/cloud/exception/OperationTimedoutException.java

@@ -67,4 +68,12 @@ public int getWaitTime() {
    public boolean isActive() {
        return _isActive;
    }
+
+    public boolean isCancelled() {


koushik-das · 2017-05-02T06:29:01Z

api/src/com/cloud/exception/OperationTimedoutException.java

@@ -38,6 +38,7 @@
    //
    transient Command[] _cmds;
    boolean _isActive;
+    boolean _isCancelled;


Since the setter/getter is not used, this can be removed.

koushik-das · 2017-05-02T09:13:23Z

server/src/com/cloud/api/query/vo/AsyncJobJoinVO.java

@@ -202,6 +205,14 @@ public String getInstanceUuid() {
        return instanceUuid;
    }

+    public void setRelated(String related) {


The setter is not used; please remove it.

koushik-das · 2017-05-02T09:14:23Z

server/src/com/cloud/ha/KVMFencer.java

-                    s_logger.info("Moving on to the next host because " + h.toString() + " is unavailable");
-                    continue;
-                } catch (OperationTimedoutException e) {
+                } catch (AgentUnavailableException | OperationCancelledException | OperationTimedoutException e) {


For a cancelled exception, why should it move to the next host?
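One possible reading of this question is that cancellation should end the loop rather than fall through to the next host. The thread does not say how this was resolved, so the sketch below is only an assumption; `FenceLoop`, `FenceSender` and both exception types are hypothetical and do not mirror KVMFencer.

```java
import java.util.List;

// Hypothetical exception types standing in for the ones in the diff.
class HostUnavailableException extends Exception {
    private static final long serialVersionUID = 1L;

    HostUnavailableException(String message) {
        super(message);
    }
}

class FenceCancelledException extends Exception {
    private static final long serialVersionUID = 1L;

    FenceCancelledException(String message) {
        super(message);
    }
}

interface FenceSender {
    boolean fence(String host) throws HostUnavailableException, FenceCancelledException;
}

final class FenceLoop {
    // Tries each host in turn and returns true on the first success.
    static boolean fenceVia(List<String> hosts, FenceSender sender) throws FenceCancelledException {
        for (String host : hosts) {
            try {
                if (sender.fence(host)) {
                    return true;
                }
            } catch (HostUnavailableException e) {
                // This host cannot help, so moving on to the next one is reasonable.
            }
            // FenceCancelledException is deliberately not caught: it propagates and ends the loop.
        }
        return false;
    }
}
```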

koushik-das · 2017-05-02T09:16:12Z

server/src/com/cloud/network/router/NetworkHelperImpl.java

            throw new AgentUnavailableException("Unable to send commands to virtual router ", router.getHostId(), e);
+        } catch (final OperationCancelledException e) {


The cancelled exception is handled without any rethrow. Can it cause issues somewhere up the chain?

Added rethrow.

koushik-das · 2017-05-02T09:17:03Z

server/src/com/cloud/server/ManagementServerImpl.java

@@ -2793,6 +2795,8 @@ public long getMemoryOrCpuCapacityByHost(final Long hostId, final short capacity
        cmdList.add(UpdateIsoCmd.class);
        cmdList.add(UpdateIsoPermissionsCmd.class);
        cmdList.add(ListAsyncJobsCmd.class);
+        cmdList.add(ListLongRunningAsyncJobsCmd.class);
+        cmdList.add(ListQueuedUpAsyncJobsCmd.class);


What about CancelAsyncJob?

It's in AsyncJobManagerImpl.

koushik-das · 2017-05-02T09:17:50Z

server/src/com/cloud/vm/UserVmManagerImpl.java

@@ -3761,6 +3767,8 @@ public boolean setupVmForPvlan(boolean add, Long hostId, NicProfile nic) {
        } catch (AgentUnavailableException e) {
            s_logger.warn("Agent Unavailable ", e);
            return false;
+        } catch (OperationCancelledException e) {


Why is there no return value for operation cancelled?

If answer is null, false is returned at the end. Removed the return in the exception handling above.

shwetaag · 2017-05-15T09:15:06Z

Deploy a VM ... === TestName: test_Cancel_Add_NIC_to_VM | Status : SUCCESS ===
ok
Deploy a VM ... === TestName: test_Cancel_Destory_VM | Status : SUCCESS ===
ok
Deploy a VM ... === TestName: test_Cancel_Reboot_VM | Status : SUCCESS ===
ok
Negative test to verify Admin can not cancel restore VM ... === TestName: test_Cancel_Reset_VM | Status : SUCCESS ===
ok
Deploy a VM ... === TestName: test_Cancel_Start_VM | Status : SUCCESS ===
ok
Deploy a VM ... === TestName: test_Cancel_Stop_VM | Status : SUCCESS ===
ok

Ran 6 tests in 4308.224s

OK

shwetaag · 2017-05-15T09:20:29Z

Start Deploying a VM ... === TestName: test_Cancel_VM_Deployment | Status : SUCCESS ===
ok
Test to cancel volume snapshot that is being deployed ... === TestName: test_Cancel_Volume_Snapshot_Creation | Status : SUCCESS ===
ok

* CLOUDSTACK-9652: updated with review comments
* CLOUDSTACK-9652 VM life cycle test cases for cancel async job
* CLOUDSTACK-9652: Subsequent VM creation fails if current VM creation operation is cancelled while implementing a new network as part of it. In this case the cancel resulted in failure to start VR and so the network couldn't be implemented. This triggered network cleanup which attempted to stop VR. The stop failed as there is no VM state transition defined from 'Starting' state for event 'StopRequested'. Failure to stop VR resulted in it getting stuck in 'Starting' state thus preventing any subsequent VM creation. As part of the fix added the missing VM state transition.
* CLOUDSTACK-9652 cleaning up async jobs on graceful MS shutdown
* CLOUDSTACK-9652 cancelling a job in queue should not throw exception
* CLOUDSTACK-9652 unittests for cancelAsyncJob cmd
* CLOUDSTACK-9652: added OperationCancelledException for a cancelled job
* CLOUDSTACK-9652 Annotating async APIs as cancellable or not
* CLOUDSTACK-9652 Cleanup at Agent Layer
* CLOUDSTACK-9652 API to find long running jobs
* CLOUDSTACK-9652: added new cancel async job api
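The missing-transition failure described in the commit message can be reproduced with a heavily reduced state machine. `MiniVmFsm` is hypothetical, and the target state chosen for the added transition (`STOPPING`) is an assumption; the commit message only says that a transition from 'Starting' on 'StopRequested' was added.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, heavily reduced state machine; the real CloudStack VM state
// machine has many more states and events.
final class MiniVmFsm {
    enum State { STARTING, RUNNING, STOPPING, STOPPED }
    enum Event { OPERATION_SUCCEEDED, STOP_REQUESTED }

    private final Map<State, Map<Event, State>> transitions = new EnumMap<>(State.class);

    void addTransition(State from, Event event, State to) {
        transitions.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(event, to);
    }

    // Returns the next state, or throws when no transition is defined,
    // which is the failure mode described in the commit message.
    State next(State from, Event event) {
        Map<Event, State> byEvent = transitions.get(from);
        State to = (byEvent == null) ? null : byEvent.get(event);
        if (to == null) {
            throw new IllegalStateException("No transition from " + from + " on " + event);
        }
        return to;
    }
}
```

Until `addTransition(State.STARTING, Event.STOP_REQUESTED, State.STOPPING)` is registered, `next(State.STARTING, Event.STOP_REQUESTED)` throws, which corresponds to the VR staying stuck in 'Starting'.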

rohityadavcloud · 2017-11-05T16:38:19Z

Ping @karuturi -- looks like a good feature, can you fix conflicts?

DaanHoogland · 2020-03-06T15:03:43Z

@karuturi please rebase and re-open if still relevant

karuturi changed the title ~~cloudstack-9652 Job framework - Cancelling async jobs~~ CLOUDSTACK-9652 Job framework - Cancelling async jobs Dec 15, 2016

marcaurele reviewed Dec 15, 2016

koushik-das reviewed May 2, 2017

karuturi added the status:waiting-for-reviewer label May 15, 2017

karuturi and others added 11 commits June 6, 2017 11:23

CLOUDSTACK-9652: added new cancel async job api

ddb056b

CLOUDSTACK-9652 API to find long running jobs

2db309a

added listlongrunnningjobs api added listqueuedupasyncjobs api

CLOUDSTACK-9652 Cleanup at Agent Layer

ef471a5

Throwing an exception in agentattache incase the job is cancelled. The top layers should handle the exception and take necessary action for cleaning of resources.

CLOUDSTACK-9652 Annotating async APIs as cancellable or not

8bf9f8c

CLOUDSTACK-9652: added OperationCancelledException for a cancelled job

c99de8d

Added exception handling at various agent commands for this new checked exception.

CLOUDSTACK-9652 unittests for cancelAsyncJob cmd

cf05b23

CLOUDSTACK-9652 cancelling a job in queue should not throw exception

830d63c

check the state of the parent job before submitting the worker thread. starting work thread only if parent job is not done.

CLOUDSTACK-9652 cleaning up async jobs on graceful MS shutdown

3f0bea9

CLOUDSTACK-9652 VM life cycle test cases for cancel async job

1dbeab5

CLOUDSTACK-9652: updated with review comments

b1b586b
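The commit "cancelling a job in queue should not throw exception" in the list above describes checking the parent job before a worker thread is started. A minimal sketch of that check, using a hypothetical `WorkerSubmitter` and a plain `BooleanSupplier` in place of the real job lookup:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of "start the worker thread only if the parent job is not done".
final class WorkerSubmitter {
    private final ExecutorService executor;

    WorkerSubmitter(ExecutorService executor) {
        this.executor = executor;
    }

    // Returns the worker's Future, or null when the parent job has already
    // finished or been cancelled.
    Future<?> submitIfParentActive(BooleanSupplier parentDone, Runnable work) {
        if (parentDone.getAsBoolean()) {
            return null; // skip quietly instead of throwing
        }
        return executor.submit(work);
    }
}
```

There is still a window between the check and the submit in which the parent can be cancelled, so the worker itself would also need to tolerate a cancelled parent.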

DaanHoogland closed this Mar 6, 2020
