Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SLURM: when scontrol fails, try to get job state with sacct #2403

Merged
merged 4 commits into from Jun 5, 2016

Conversation

Projects
None yet
5 participants
@nsoranzo
Copy link
Member

commented May 24, 2016

Also:

  • Merge _complete_terminal_job() from BaseJobRunner into DRMAAJobRunner
  • Clarify _complete_terminal_job() return value
job_info = __get_jobinfo()
if job_info['JobState'] == 'NOT_FOUND':
slurm_state = _get_slurm_state()
if slurm_state == 'NOT_FOUND':
log.warning( '(%s/%s) Job not found, assuming job check exceeded MinJobAge and completing as successful', ajs.job_wrapper.get_id_tag(), ajs.job_id )

This comment has been minimized.

Copy link
@natefoo

natefoo May 24, 2016

Member

With the new sacct check this message isn't entirely accurate, but the logic is probably still correct (since not all sites may have accounting configured properly)

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo May 24, 2016

Author Member

Which part of the message is not accurate? I can try to rephrase.

This comment has been minimized.

Copy link
@natefoo

natefoo May 26, 2016

Member

MinJobAge is how long a job's details will be available from scontrol. After that, the only thing you can about a job is from sacct. If it's not found in sacct that means either there's an error or the site hasn't configured accounting.

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Jun 2, 2016

Author Member

If the accounting storage is not properly configured, a warning will be now logged. I think that the message is essentially correct, let me know if you have a better version!

@@ -101,7 +121,7 @@ def __get_jobinfo():
f.write(line)
f.truncate()
# by default, finish as if the job was successful.
super( SlurmJobRunner, self )._complete_terminal_job( ajs, drmaa_state=drmaa_state )
return super( SlurmJobRunner, self )._complete_terminal_job( ajs, drmaa_state=drmaa_state )

This comment has been minimized.

Copy link
@natefoo

natefoo May 24, 2016

Member

This is in #2402 as well

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo May 24, 2016

Author Member

I'll rebase this once it is merged.

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo May 25, 2016

Author Member

Rebased.

@@ -23,7 +23,21 @@ class SlurmJobRunner( DRMAAJobRunner ):
restrict_job_name_length = False

def _complete_terminal_job( self, ajs, drmaa_state, **kwargs ):
def __get_jobinfo():
"""
Returns not None if job was not actually terminal.

This comment has been minimized.

Copy link
@natefoo

natefoo May 24, 2016

Member

Returns True if not actually terminal. Returns None otherwise. Not sure that it makes sense but that's the way it currently is.

@natefoo

This comment has been minimized.

Copy link
Member

commented May 24, 2016

The BaseJobRunner version of _complete_terminal_job() was there because theoretically other runners should probably make use of it as well. But none of them were actually updated to do that.

@nsoranzo nsoranzo force-pushed the nsoranzo:slurm branch from ff00269 to 9a7b370 May 25, 2016

@nsoranzo nsoranzo force-pushed the nsoranzo:slurm branch from 9a7b370 to 301f7db May 25, 2016

stdout, stderr = p.communicate()
if p.returncode != 0:
if stderr == 'SLURM accounting storage is disabled':
return

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo May 25, 2016

Author Member

@natefo: Not sure if we want to log something in this case.

This comment has been minimized.

Copy link
@natefoo

natefoo May 26, 2016

Member

Probably okay to log, it wouldn't be encountered very often.

@nsoranzo nsoranzo added status/review and removed status/WIP labels Jun 2, 2016

@galaxybot galaxybot added the status/WIP label Jun 2, 2016

@nsoranzo nsoranzo changed the title [WIP] SLURM: when scontrol fails, try to get job state with sacct SLURM: when scontrol fails, try to get job state with sacct Jun 2, 2016

@nsoranzo nsoranzo removed the status/WIP label Jun 2, 2016

@nsoranzo

This comment has been minimized.

Copy link
Member Author

commented Jun 2, 2016

Ready for review!

@jmchilton jmchilton merged commit a97cae3 into galaxyproject:dev Jun 5, 2016

4 checks passed

api test Build finished. 213 tests run, 0 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 107 tests run, 0 skipped, 0 failed.
Details
toolshed test Build finished. 581 tests run, 0 skipped, 0 failed.
Details
@martenson

This comment has been minimized.

Copy link
Member

commented Jun 5, 2016

thanks a bunch @nsoranzo !

@nsoranzo nsoranzo deleted the nsoranzo:slurm branch Jul 20, 2016

@natefoo

This comment has been minimized.

Copy link
Member

commented on lib/galaxy/jobs/runners/slurm.py in 301f7db Aug 17, 2016

@nsoranzo str.split()'s keyword arguments only exist in Python 3. Do you have any objection to reverting this line to its previous form?

This comment has been minimized.

Copy link
Member Author

replied Aug 17, 2016

@natefoo Of course, sorry for that! Are you going to open the PR for 16.07 or should I?

This comment has been minimized.

Copy link
Member

replied Aug 17, 2016

I'll do it, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.