A new runner for DRMAA (currently UNIVA) #7004

Merged
merged 1 commit into from Nov 19, 2018

4 participants
@bernt-matthias
Contributor

commented Nov 12, 2018

Reimplementation of the DRMAA runner inspired by the SLURM runner. Currently tested only for the UNIVA grid engine (but I'm optimistic that it should work as well for other drmaa systems).

This solves the problem that the current DRMAAJobRunner does not work when jobs are submitted as the real user (because jobs that are started in a different drmaa session cannot be accessed from the session that is open in Galaxy):

  • this is done by resorting to the command line tools qstat and qacct if the drmaa library cannot be used to check the job status and to get run time information.
  • this has the additional advantage that if the drmaa library functions are not working (DRMAAJobRunner had implemented repeated checking to handle this problem), the runner can still use the command line tools.
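
A minimal sketch of the fallback idea described above, not the code in this PR. The exception class and the two callables are hypothetical stand-ins: `via_session` would wrap the drmaa library call and `via_command_line` would wrap a qstat/qacct lookup.

```python
from typing import Callable


class JobNotInSession(Exception):
    """Hypothetical error: the job is not visible in the current DRMAA session."""


def query_job_state(
    job_id: str,
    via_session: Callable[[str], str],
    via_command_line: Callable[[str], str],
) -> str:
    """Ask the DRMAA session first; fall back to the command line tools."""
    try:
        return via_session(job_id)
    except JobNotInSession:
        # Typical for real-user submission: the job belongs to another session.
        return via_command_line(job_id)
```

Passing the two lookups in as callables keeps the fallback logic independent of any particular grid engine.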

Furthermore (in contrast to the original drmaa runner) the new one tests for run time and memory violations:

  • memory violations are determined by comparing the used and the requested memory
  • run time violations are determined by checking the signal that killed the job and by comparing the used and the requested run time

The used memory and time are determined with drmaa.wait() or qacct.
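
The two checks could look roughly like the sketch below. Everything here is an assumption made for illustration: the `JobUsage` fields, the set of signal names, and the choice to require both the signal and the exceeded run time before reporting a run time violation. The runner in this PR may combine the conditions differently.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: signal names that indicate the scheduler enforced a limit.
LIMIT_SIGNALS = frozenset({"SIGXCPU", "SIGKILL"})


@dataclass
class JobUsage:
    max_memory_bytes: float
    wallclock_seconds: float
    signal: Optional[str] = None  # name of the signal that killed the job, if any


def detect_violation(
    usage: JobUsage,
    requested_memory_bytes: Optional[float] = None,
    requested_seconds: Optional[float] = None,
) -> Optional[str]:
    """Return "memory", "runtime", or None if no limit was violated."""
    if (requested_memory_bytes is not None
            and usage.max_memory_bytes > requested_memory_bytes):
        return "memory"
    if (requested_seconds is not None
            and usage.signal in LIMIT_SIGNALS
            and usage.wallclock_seconds >= requested_seconds):
        return "runtime"
    return None
```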

TODO:

  • There is still a problem if the Galaxy user kills the job before it has entered the scheduler's job database (so it cannot be accessed by qstat or qacct). The bug is that nonexistent members of extdata are accessed, so I need to add some tests for members of extdata or set useful defaults.
  • The external_kill script needs a bit of testing. I guess it also does not work if jobs are submitted as the real user (since only jobs that are started in the same session can be accessed).
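
One way the "useful defaults" from the first TODO item might look is sketched here. The key names are hypothetical and only illustrate the pattern; the real members of extdata may differ.

```python
from typing import Any, Dict, Optional

# Hypothetical keys; the real extdata members may differ.
EXTDATA_DEFAULTS: Dict[str, Any] = {
    "exit_status": None,
    "signal": None,
    "time_wasted": None,
    "memory_wasted": None,
}


def normalize_extdata(extdata: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of extdata in which every expected key is present."""
    merged = dict(EXTDATA_DEFAULTS)
    merged.update(extdata or {})
    return merged
```

With this, code that reads a member of extdata for a job unknown to qstat and qacct gets `None` instead of raising a `KeyError`.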

Open (or better perspective):

  • Adaptations to other grid engines. The current implementation (the command line calls and result parsing) might be specific to the Univa grid engine. To include other GEs one could determine the GE (+ version) and make the calls and result parsing depend on this.
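
A rough sketch of such engine-dependent parsing, under stated assumptions: the registry, the engine name "univa", and the function names are made up for illustration, and the parser assumes accounting output made of whitespace-separated `key value` lines.

```python
from typing import Callable, Dict


def parse_key_value(text: str) -> Dict[str, str]:
    """Parse lines of the form 'key value'; lines without a value are skipped."""
    result: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            result[parts[0]] = parts[1].strip()
    return result


# Hypothetical registry: one parser per grid engine.
PARSERS: Dict[str, Callable[[str], Dict[str, str]]] = {
    "univa": parse_key_value,
}


def parse_accounting(engine: str, text: str) -> Dict[str, str]:
    """Select the parser for the detected grid engine."""
    try:
        parser = PARSERS[engine]
    except KeyError:
        raise NotImplementedError(f"no parser registered for {engine!r}")
    return parser(text)
```

Support for another engine would then mean registering one more parser rather than touching the runner logic.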

Implementation note:

The changes in drmaa.py do not change the functionality at all, but only reorganize the code. In particular, part of the function check_watched_items was moved into a new function check_watched_item in order to make subclassing more convenient.
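
The shape of this refactoring, reduced to a toy example: the classes, the item structure, and the boolean return value are assumptions for illustration and do not reproduce Galaxy's actual signatures. The loop stays in the base class and a subclass overrides only the per-item hook.

```python
from typing import Dict, List


class BaseRunner:
    def __init__(self) -> None:
        self.watched: List[Dict[str, str]] = []

    def check_watched_items(self) -> None:
        """Loop over all watched jobs; keep those the per-item hook wants to keep."""
        self.watched = [item for item in self.watched if self.check_watched_item(item)]

    def check_watched_item(self, item: Dict[str, str]) -> bool:
        """Return True to keep watching the job. Default: keep everything."""
        return True


class CommandLineRunner(BaseRunner):
    def check_watched_item(self, item: Dict[str, str]) -> bool:
        # A subclass changes only how a single job is checked.
        return item["state"] != "finished"
```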

Replaces #4857 (which replaced #4275), since I messed up with git again (there were some duplicated commits).

@galaxybot galaxybot added the triage label Nov 12, 2018

@galaxybot galaxybot added this to the 19.01 milestone Nov 12, 2018

@bernt-matthias bernt-matthias force-pushed the bernt-matthias:topic/univa3 branch from 1b921d6 to 17e2b7a Nov 12, 2018

@datakid

commented Nov 12, 2018

Because I'm new around here and not 100% sure what this implies, I'm going to ask some questions.

  1. Does this replace or offer an alternative to natefoo's slurm-drmaa?
  2. Alternatively, is this just the interface that natefoo's plugin uses?
@bernt-matthias

Contributor Author

commented Nov 13, 2018

Hi @datakid

Does this replace or offer an alternative to natefoo's slurm-drmaa?

No. Slurm-drmaa is for SLURM clusters (which use sacct, ... for querying jobs). univa-drmaa is for clusters running the UNIVA grid engine (which use qacct, ... for querying), but I guess it also works for the Sun grid engine (I cannot test this).

Both SlurmJobRunner and UnivaJobRunner derive from DRMAAJobRunner, which cannot be used in a setup that submits jobs as the real user. This is because the (Python) drmaa library can only query jobs that were created in the same drmaa session, but in the real-user setup jobs are started by an external script (drmaa_external_run|kill), which uses its own drmaa session that cannot be accessed by the Galaxy process. Hence Galaxy cannot query the job state.

The solution of SlurmJobRunner and UnivaJobRunner is to use the corresponding command line tools to query the job state.

Alternatively, is this just the interface that natefoo's plugin uses?

I do not understand this question.

@bernt-matthias bernt-matthias force-pushed the bernt-matthias:topic/univa3 branch 2 times, most recently from 763dc12 to 9d71988 Nov 13, 2018

A new runner for DRMAA (currently UNIVA)
Reimplementation of the DRMAA runner inspired by the SLURM runner.
Currently tested only for the UNIVA grid engine (but I'm optimistic
that it should work as well for other drmaa systems).

This solves the problem that the current DRMAAJobRunner does
not work when jobs are submitted as real user (because jobs that are
started in a different drmaa session can not be accessed from the
session that is open in galaxy):
- this is done by resorting to command line tools qstat and qacct if
  the drmaa library can not be used to check the job status and to get run
  time information.
- this has the additional advantage that if the drmaa library
  functions are not working (DRMAAJobRunner had implemented a repeated
  checking to handle this problem) the runner can still use the command
  line tools.

Furthermore (in contrast to the original drmaa runner) the new one
tests for run time and memory violations:
- memory violations are determined by comparing the used and the
  requested memory
- run time violations are determined by checking the signal that
  killed the job and by comparing the used and the requested run time.
  The used memory and time are determined with drmaa.wait() or
  qacct

Open (or better perspective):
- adaptations to other grid engines. The current implementation (the
  command line calls and result parsing) might be specific to the
  Univa grid engine. To include other GEs one could determine the
  GE (+ version) and make the calls and result parsing depend
  on this.

Implementation note:

The changes in drmaa.py do not change the functionality at all,
but only reorganize the code. In particular part of
the function `check_watched_items` was put into a new function
`check_watched_item` in order to make subclassing more convenient.

Replaces #6931 (which replaced #4275), since I messed up with git
again (there were some duplicated commits).

@bernt-matthias bernt-matthias force-pushed the bernt-matthias:topic/univa3 branch from 9d71988 to 55f5235 Nov 14, 2018

@jmchilton jmchilton merged commit 72b2781 into galaxyproject:dev Nov 19, 2018

5 of 6 checks passed

selenium test Build finished. 151 tests run, 3 skipped, 1 failed.
api test Build finished. 439 tests run, 1 skipped, 0 failed.
continuous-integration/travis-ci/pr The Travis CI build passed
framework test Build finished. 190 tests run, 0 skipped, 0 failed.
integration test Build finished. 269 tests run, 10 skipped, 0 failed.
toolshed test Build finished. 577 tests run, 0 skipped, 0 failed.
@jmchilton

Member

commented Nov 19, 2018

This looks like a good, isolated first start so I'm merging. Thanks so much for the work, and sorry for making you jump through hoops about the memory handling.

@bernt-matthias bernt-matthias deleted the bernt-matthias:topic/univa3 branch Jan 2, 2019
