Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail on failed split #780

Merged
merged 9 commits into from
Sep 19, 2016
Merged

Fail on failed split #780

merged 9 commits into from
Sep 19, 2016

Conversation

rob-c
Copy link
Member

@rob-c rob-c commented Sep 7, 2016

Adding a SplitterError if a spitter fails to split a job.

This should only pick up the case where we have 1 master job which has failed to have subjobs associated with it from a splitter but I'm keen to hear feedback.

@rob-c rob-c added this to the 6.2.1 milestone Sep 7, 2016
@rob-c rob-c self-assigned this Sep 7, 2016
@@ -1472,6 +1471,9 @@ def submit(self, keep_going=None, keep_on_fail=None, prepare=False):
# split into subjobs
rjobs = self._doSplitting()

if not rjobs or len(rjobs) == 1 and rjobs[0] is self:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we care if there is only one subjob? Is it ever the case that we have a splitter which simply sets itself as rjobs[0]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as per the code in _doSplitting it may return [self] when there are no subjobs which means that no splitting occured but we have a splitter.

@@ -1472,6 +1471,9 @@ def submit(self, keep_going=None, keep_on_fail=None, prepare=False):
# split into subjobs
rjobs = self._doSplitting()

if not rjobs or len(rjobs) == 1 and rjobs[0] is self and self.splitter is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to just check against len(self.subjobs) and if there is a splitter. I actually think this rjobs stuff probably isn't needed (or at least can be greatly simplified) but for that's probably for another PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a lot of cases you will get a list containing only a reference to the single master job and in this case the exception is not thrown as you would expect.
We could fix the rjobs stuff but frankly I'd rather not go near something in-case I break it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean a conditional like:

if self.splitter and len(self.subjobs) == 0:
    raise ...

If you have a splitter then subjobs should have been filled by this point. If not it's an error.

Changing condition
@rob-c
Copy link
Member Author

rob-c commented Sep 8, 2016

retest this please

@rob-c
Copy link
Member Author

rob-c commented Sep 8, 2016

I strongly feel we should not have DIRAC tests which can timeout in the integration tests. I'll bump this to an external in another PR as we want the integration tests to fail when and only when there is a problem with the PR imo.

Any objections to merging this PR?

@@ -1472,6 +1471,10 @@ def submit(self, keep_going=None, keep_on_fail=None, prepare=False):
# split into subjobs
rjobs = self._doSplitting()

# Some old jobs may still contain None assigned to the self.subjobs so be careful when checking length
if self.subjobs is not None and len(self.subjobs) == 0 and self.splitter is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: You should just need the len check as subjobs defaults to a GangaList. And can you just do and self.splitter rather than the explicit None check?

Other than that 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still potentially jobs before the default was changed. Minor number of jobs from minor number of users but I'm erring on the side of caution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even before the default was changed it was set to [] I think which is safe to do a len() on? Regardless, as this code is run on submit(), the chance of someone having a job in new from that long ago which they have not submitted yet is very small.

It should be as simple as:

if self.splitter is not None and not self.subjobs:
    raise ...

Changing exit condition
@rob-c
Copy link
Member Author

rob-c commented Sep 19, 2016

well replacing the original condition with what has been suggested here seems to cause more problems. I'm going to return back to the original check condition unless anyone can understand why the other suggestions here fail?

@@ -1472,6 +1471,10 @@ def submit(self, keep_going=None, keep_on_fail=None, prepare=False):
# split into subjobs
rjobs = self._doSplitting()

# Some old jobs may still contain None assigned to the self.subjobs so be careful when checking length
if self.subjobs is not None and not self.subjobs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be:

if self.splitter....

That's why it fails.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nvm, missed that 👍

@rob-c
Copy link
Member Author

rob-c commented Sep 19, 2016

OK this should be working now. I plan to merge this if nobody objects?

@rob-c rob-c merged commit 12ed73f into develop Sep 19, 2016
@rob-c rob-c deleted the failOnFailedSplit branch September 19, 2016 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants