-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fail on failed split #780
Fail on failed split #780
Conversation
@@ -1472,6 +1471,9 @@ def submit(self, keep_going=None, keep_on_fail=None, prepare=False): | |||
# split into subjobs | |||
rjobs = self._doSplitting() | |||
|
|||
if not rjobs or len(rjobs) == 1 and rjobs[0] is self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we care if there is only one subjob? Is it ever the case that we have a splitter which simply sets itself as rjobs[0]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as per the code in _doSplitting
it may return [self]
when there are no subjobs which means that no splitting occured but we have a splitter.
Adding check for splitter
@@ -1472,6 +1471,9 @@ def submit(self, keep_going=None, keep_on_fail=None, prepare=False): | |||
# split into subjobs | |||
rjobs = self._doSplitting() | |||
|
|||
if not rjobs or len(rjobs) == 1 and rjobs[0] is self and self.splitter is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes more sense to just check against len(self.subjobs)
and if there is a splitter. I actually think this rjobs
stuff probably isn't needed (or at least can be greatly simplified) but for that's probably for another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a lot of cases you will get a list containing only a reference to the single master job and in this case the exception is not thrown as you would expect.
We could fix the rjobs stuff but frankly I'd rather not go near something in-case I break it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean a conditional like:
if self.splitter and len(self.subjobs) == 0:
raise ...
If you have a splitter then subjobs should have been filled by this point. If not it's an error.
Changing condition
retest this please |
I strongly feel we should not have DIRAC tests which can timeout in the integration tests. I'll bump this to an external in another PR as we want the integration tests to fail when and only when there is a problem with the PR imo. Any objections to merging this PR? |
@@ -1472,6 +1471,10 @@ def submit(self, keep_going=None, keep_on_fail=None, prepare=False): | |||
# split into subjobs | |||
rjobs = self._doSplitting() | |||
|
|||
# Some old jobs may still contain None assigned to the self.subjobs so be careful when checking length | |||
if self.subjobs is not None and len(self.subjobs) == 0 and self.splitter is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: You should just need the len
check as subjobs
defaults to a GangaList
. And can you just do and self.splitter
rather than the explicit None
check?
Other than that 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are still potentially jobs before the default was changed. Minor number of jobs from minor number of users but I'm erring on the side of caution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even before the default was changed it was set to []
I think which is safe to do a len()
on? Regardless, as this code is run on submit()
, the chance of someone having a job in new
from that long ago which they have not submitted yet is very small.
It should be as simple as:
if self.splitter is not None and not self.subjobs:
raise ...
Changing exit condition
well replacing the original condition with what has been suggested here seems to cause more problems. I'm going to return back to the original check condition unless anyone can understand why the other suggestions here fail? |
@@ -1472,6 +1471,10 @@ def submit(self, keep_going=None, keep_on_fail=None, prepare=False): | |||
# split into subjobs | |||
rjobs = self._doSplitting() | |||
|
|||
# Some old jobs may still contain None assigned to the self.subjobs so be careful when checking length | |||
if self.subjobs is not None and not self.subjobs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be:
if self.splitter....
That's why it fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah nvm, missed that 👍
OK this should be working now. I plan to merge this if nobody objects? |
Adding a SplitterError if a spitter fails to split a job.
This should only pick up the case where we have 1 master job which has failed to have subjobs associated with it from a splitter but I'm keen to hear feedback.