New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLOUDSTACK-9652 Job framework - Cancelling async jobs #1832
Conversation
@karuturi It's a good feature but I don't see where the job gets actually cancelled. Reading the changes I can only see that the response of the job will become a cancelled operation, but the actual job does not get kill/cancel on the hypervisor for example. Did I miss something or is it not fully implemented yet? |
@marcaurele The thread which is running the job gets killed due to OperationCancelledException. But, if any command is already sent and is being executed on hypervisor, that wont be cancelled. check for the changes in AgentAttache where this new exception is thrown. |
@karuturi Ok thanks for the clarifications, and it's the scenario I thought about too. That being said, I'm currently thinking of a new approach for the command sequencer because having implemented the live migration, the non-parallel commands isn't optimal at all when you have long running sequential commands on a hypervisor. And I tend to think that's the reason behind your PR, isn't it? The way it's currently done is too simple (if a job cannot be run in parallel on the HV, it will put in a queue any other coming job that needs to run on this same HV). IMO this sequencing should take into account what kind of job is coming and for which type of resources. For example a security group update, a VM start and a migration for different VMs should be able to run in parallel because they are unrelated. With today design, it isn't possible. So don't you think we're better of rewriting the sequencer to let more commands being executed in parallel to avoid this bottleneck on the AgentAttache? It would normally make the cancellation not needed in the way you implemented it since less jobs will be queued. If we wish to be able to cancel a job, IMHO it should cancel the job down on the hypervisor too, thus clearing normally the resources involved as if the execution didn't go well. Otherwise, the way you implemented it. I would not let a job being cancel if it has been sent to the hypervisor to clearly return to the user that it wasn't cancelable anymore (you're too late! -> seq number isn't in What do you think? |
try { | ||
answers = sl.waitFor(wait); | ||
job = _agentMgr._asyncJobDao.findById(jobId); | ||
if (job != null && job.getStatus() == JobInfo.Status.CANCELLED) { | ||
throw new OperationCancelledException(req.getCommands(), _id, seq, wait, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to throw an OperationCancelledException
if the we have the job answer. It's better to let the normal response come back to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. will do the change.
Nice feature, @karuturi can you add suitable marvin tests, squash your changes. |
@marcaurele out for the next three days. will get back |
@karuturi please squash your changes and fix the merge conflict. We've merged very large changes that were squashed into a single commit, it becomes easier to then cherry-pick them or revert them. |
@rhtyd I dont buy that argument. Just for the ease of revert or cherry-pick we shouldnt commit all the changes together. The revert or cherry-pick quickly becomes irrelevant as more code gets committed on top of it. I would like to commit them as is. |
Yes, thats right.
As you already said, with todays design it isn't possible. Rewriting is obviously better. But, thats a bigger job. In the current design, this was the only possible way to allow cloudstack to process queued up jobs.
I agree. But, thats a huge task given the number of hypervisors we support and their versions. |
@karuturi Ok |
ACS CI BVT RunSumarry: Link to logs Folder (search by build_no): https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0 Failed tests:
Skipped tests: Passed test suits: |
return isCancelled; | ||
} | ||
|
||
public void setIsCancelled(boolean isCancelled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This setter is getting used only within the ctor, if not used anywhere else this can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
/** | ||
* job can be cancelled using async job cancel api | ||
*/ | ||
public class OperationCancelledException extends CloudException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider adding a serial version ID if this is getting (de)serialised
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
return _isCancelled; | ||
} | ||
|
||
public void setCancelled(boolean isCancelled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not getting used anywhere, please remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
@@ -67,4 +68,12 @@ public int getWaitTime() { | |||
public boolean isActive() { | |||
return _isActive; | |||
} | |||
|
|||
public boolean isCancelled() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
@@ -38,6 +38,7 @@ | |||
// | |||
transient Command[] _cmds; | |||
boolean _isActive; | |||
boolean _isCancelled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since setter/getter not used, this can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -202,6 +205,14 @@ public String getInstanceUuid() { | |||
return instanceUuid; | |||
} | |||
|
|||
public void setRelated(String related) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setter is not used please remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
s_logger.info("Moving on to the next host because " + h.toString() + " is unavailable"); | ||
continue; | ||
} catch (OperationTimedoutException e) { | ||
} catch (AgentUnavailableException | OperationCancelledException | OperationTimedoutException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For cancelled exception why it should move to the next host?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
throw new AgentUnavailableException("Unable to send commands to virtual router ", router.getHostId(), e); | ||
} catch (final OperationCancelledException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cancelled exception is handled without any rethrow, can it cause issues somewhere up the chain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added rethrow.
@@ -2793,6 +2795,8 @@ public long getMemoryOrCpuCapacityByHost(final Long hostId, final short capacity | |||
cmdList.add(UpdateIsoCmd.class); | |||
cmdList.add(UpdateIsoPermissionsCmd.class); | |||
cmdList.add(ListAsyncJobsCmd.class); | |||
cmdList.add(ListLongRunningAsyncJobsCmd.class); | |||
cmdList.add(ListQueuedUpAsyncJobsCmd.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about CancelAsyncJob?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its in asyncjobmanagerimpl
@@ -3761,6 +3767,8 @@ public boolean setupVmForPvlan(boolean add, Long hostId, NicProfile nic) { | |||
} catch (AgentUnavailableException e) { | |||
s_logger.warn("Agent Unavailable ", e); | |||
return false; | |||
} catch (OperationCancelledException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why no return value for operation cancelled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it answer is null, false is returned at the end. removed return in the above exception handling.
Ran 6 tests in 4308.224s OK |
|
added listlongrunnningjobs api added listqueuedupasyncjobs api
Throwing an exception in agentattache incase the job is cancelled. The top layers should handle the exception and take necessary action for cleaning of resources.
Added exception handling at various agent commands for this new checked exception.
check the state of the parent job before submitting the worker thread. starting work thread only if parent job is not done.
…operation is cancelled while implementing a new network as part of it In this case the cancel resulted in failure to start VR and so the network couldn't be implemented. This triggered network cleanup which attempted to stop VR. The stop failed as there is no VM state transition defined from 'Starting' state for event 'StopRequested'. Failure to stop VR resulted in it getting stuck in 'Starting' state thus preventing any subsequent VM creation. As part of the fix added the missing VM state transition.
* CLOUDSTACK-9652: CLOUDSTACK-9652: updated with review comments CLOUDSTACK-9652 VM life cycle test cases for cancel async job CLOUDSTACK-9652: Subsequent VM creation fails if current VM creation operation is cancelled while implementing a new network as part of it In this case the cancel resulted in failure to start VR and so the network couldn't be implemented. This triggered network cleanup which attempted to stop VR. The stop failed as there is no VM state transition defined from 'Starting' state for event 'StopRequested'. Failure to stop VR resulted in it getting stuck in 'Starting' state thus preventing any subsequent VM creation. As part of the fix added the missing VM state transition. CLOUDSTACK-9652 cleaning up async jobs on graceful MS shutdown CLOUDSTACK-9652 cancelling a job in queue should not throw exception CLOUDSTACK-9652 unittests for cancelAsyncJob cmd CLOUDSTACK-9652: added OperationCancelledException for a cancelled job CLOUDSTACK-9652 Annotating async APIs as cancellable or not CLOUDSTACK-9652 Cleanup at Agent Layer CLOUDSTACK-9652 API to find long running jobs CLOUDSTACK-9652: added new cancel async job api
Ping @karuturi -- looks like good feature, can you fix conflicts? |
@karuturi please rebase and re-open if still relevant |
enabled cancellation of long running or subsequent queued up async jobs