This repository has been archived by the owner on Feb 10, 2021. It is now read-only.

Add Torque job info #78

Merged
merged 2 commits into from Apr 14, 2018

Conversation

maxnoe
Contributor

@maxnoe maxnoe commented Apr 10, 2018

This makes dask-drmaa work on PBS/Torque

I'm not sure what exactly the difference between JOB_PARAM and JOB_ID should be.

Here are all PBS related env variables:

PBS_ENVIRONMENT=PBS_INTERACTIVE
PBS_GPUFILE=/var/spool/torque/aux//7641024.hpc-main3.cm.clustergpu
PBS_JOBCOOKIE=D57B25C527004153B093AC7B41FC9CA8
PBS_JOBID=7641024.hpc-main3.cm.cluster
PBS_JOBNAME=STDIN
PBS_MICFILE=/var/spool/torque/aux//7641024.hpc-main3.cm.clustermic
PBS_MOMPORT=15003
PBS_NODEFILE=/var/spool/torque/aux//7641024.hpc-main3.cm.cluster
PBS_NODENUM=0
PBS_NP=1
PBS_NUM_NODES=1
PBS_NUM_PPN=1
PBS_O_HOME=/home/mnoethe
PBS_O_HOST=hpc-gw3.cm.cluster
PBS_O_LANG=en_US.UTF-8
PBS_O_LOGNAME=mnoethe
PBS_O_MAIL=/var/spool/mail/mnoethe
PBS_O_PATH=/home/mnoethe/.conda/envs/dask/bin:/sl6/sw/python/anaconda36/bin:/sl6/sw/cmake/3.5.1/bin:/sl6/sw/gcc/5.1.0/rtf/bin:/sl6/sw/java/jdk1.8.0_112/bin:/home/mnoethe/.local/bin:/sl6/sw/maven/3.2.1/bin:/sl6/sw/vim/7.4/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/mnoethe/bin
PBS_O_QUEUE=batch
PBS_O_SERVER=hpc-main3.cm.cluster
PBS_O_SHELL=/bin/bash
PBS_O_WORKDIR=/home/mnoethe/dask-drmaa
PBS_QUEUE=short
PBS_TASKNUM=1
PBS_VERSION=TORQUE-4.2.8
PBS_VNODENUM=0
PBS_WALLTIME=7200

@jakirkham
Member

Supporting Torque sounds like a great idea. Had looked into it before, but lacked access to such a cluster or a user to ask these questions. Thanks @maxnoe for both raising and providing suggestions on how to improve.

Am in the midst of a conference ATM. So should be able to look at this more closely next week. Though this generally seems like a good idea.

Noticed you had raised a few other topics for discussion. Would you be open to chatting briefly at some point in a call (on Google Chat for instance)? It would be really helpful to understand your use case generally. May help with making recommendations about things to try and play with as well as how we might want to improve dask-drmaa generally.

@kbruegge kbruegge left a comment

Looks good. This is similar to how we did it when hacking pygridmap.

@maxnoe
Contributor Author

maxnoe commented Apr 11, 2018

Noticed you had raised a few other topics for discussion. Would you be open to chatting briefly at some point in a call (on Google Chat for instance)?

Sure!

@maxnoe
Contributor Author

maxnoe commented Apr 11, 2018

Just to describe our situation:

We are astroparticle physicists with lots of data at https://github.com/fact-project/ . We have access to three or four clusters: one ancient SGE, two old PBS/Torque, and one pretty new SLURM cluster.

Our use case is mainly abstracting those different engines away and being able to run our Python- and Java-based analysis on each of them without thinking too much about the underlying scheduler.

For the largest part of the data munging, the function would

  • subprocess.run a java program
  • read the output
  • return a dataframe

And the client process would write that to an output file.
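A minimal sketch of such a mapped function, under stated assumptions: the external program writes CSV to stdout, and the name `run_and_parse` is hypothetical. To keep the example dependency-free it returns a list of dicts; in practice something like `pandas.read_csv(io.StringIO(result.stdout))` would produce the dataframe instead.

```python
import csv
import io
import subprocess


def run_and_parse(command):
    """Run an external program and parse its CSV output into a list of dicts.

    `command` is an argument list, e.g. ["java", "-jar", "analysis.jar", "input"]
    (that jar name is purely illustrative).
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode stdout as text
        check=True,               # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: the program prints a CSV table with a header row.
    return list(csv.DictReader(io.StringIO(result.stdout)))
```

Passing `check=True` means a failing external process surfaces as an exception in the worker rather than as silently empty output.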

For our higher-level analysis, the mapped functions would basically apply pretrained scikit-learn models and do some relatively expensive astronomical calculations.

@lesteve
Member

lesteve commented Apr 13, 2018

Your commit history looks weird, can you fix that? Edit: By weird I mean that you have old commits (e.g. November 2017) from other people (e.g. jakirkham) in your branch.

Also, out of curiosity, if you know how to reproduce a weird history like this, I am interested. I always wonder how people do that!

On the PR itself, I know nothing about Torque but it looks fine. In the medium term, it may be better to just try a bunch of environment variables rather than rely on _drm_info or _drmaa_implementation. If we had done that previously it "would have just worked" for Torque since it uses the same environment variables as PBS, unless I missed something.
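A rough sketch of what the environment-variable approach could look like, not what dask-drmaa actually does. `PBS_JOBID` is taken from the listing earlier in this thread; `JOB_ID` (SGE) and `SLURM_JOB_ID` (SLURM) are the usual names on those schedulers. The function name and the lookup order are assumptions made for illustration.

```python
import os

# Candidate job-id variables, checked in order (the order is an arbitrary choice).
JOB_ID_VARIABLES = ("PBS_JOBID", "JOB_ID", "SLURM_JOB_ID")


def find_job_id(environ=None):
    """Return (variable_name, value) for the first job-id variable that is set.

    Returns None if none of the candidates is present or non-empty.
    """
    if environ is None:
        environ = os.environ
    for name in JOB_ID_VARIABLES:
        value = environ.get(name)
        if value:
            return name, value
    return None
```

Because the lookup does not depend on which DRMAA implementation is reported, a scheduler that reuses the PBS variable names would be picked up without any extra case.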

@maxnoe
Contributor Author

maxnoe commented Apr 13, 2018

@lesteve I rebased against upstream/master and now it looks fine. No idea what went wrong; I forked it and started a new branch from master (or so I thought).

@lesteve
Member

lesteve commented Apr 14, 2018

Merging this one, thanks @maxnoe!

@lesteve lesteve merged commit 5d3b57f into dask:master Apr 14, 2018
@maxnoe maxnoe deleted the add_torque branch April 14, 2018 06:26
@jakirkham
Member

Thanks for reviewing @lesteve. Also thanks for contributing @maxnoe.

@maxnoe maxnoe mentioned this pull request May 2, 2018