HTCondorCluster #245
Conversation
Many many thanks @matyasselmeci! This is great to finally have this!
HTCondor looks really specific though...
In order to minimize code, I ended up using the command-line tools instead of the Python bindings and only submitting one job at a time
I like that! From what I understand, this makes the HTCondor implementation closer to the others.
Looks like some of the other resource managers use docker-compose to set up test clusters so I'll try to get something like that going for HTCondor too
Yep, but obviously this can wait for another PR!
I have a few comments/suggestions/questions on the changes, but overall this looks really good! I have a small concern about overriding `_script_template` (and so `job_script()`), but maybe using the provided ones would have led to something difficult to understand.
For the test failure, apparently you've got a flake8 check error.
cc @szs8 @jrbourbeau @mivade @simone-codeluppi or anyone with an access to an HTCondor cluster for trying this!
# condor sets argv[0] of the executable to "condor_exec.exe", which confuses
# Python (can't find its libs), so we have to go through the shell.
executable = "/bin/sh"
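To illustrate the workaround in this excerpt (the dict layout below is an assumption for the sketch, not the PR's actual code):

```python
import sys

# HTCondor copies the executable into the job sandbox and sets argv[0] to
# "condor_exec.exe", which breaks Python's library lookup. Running the job
# through /bin/sh lets the shell invoke the real interpreter by its path.
# (Illustrative layout only; the PR builds its submit description differently.)
submit_description = {
    "Executable": "/bin/sh",
    "Arguments": "-c '%s -m distributed.cli.dask_worker'" % sys.executable,
}
```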
So this means we cannot put something like `dask-worker` here?
I think dask-worker would work well, I just noticed that none of the other implementations were using it so I thought it was deprecated or something.
We actually use `'%(python)s -m distributed.cli.dask_worker' % dict(python=sys.executable)`; see #77.
This is just to be sure to find the `dask-worker` script if it is not in the PATH or the Python bin directory.
But you may not be able to pass several arguments here... So maybe it's just simpler to use /bin/sh.
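For reference, the pattern from #77 mentioned above can be sketched as:

```python
import sys

# Build the worker command with the running interpreter so the
# distributed.cli.dask_worker module is found even when the dask-worker
# script is not on PATH or in the Python bin directory.
worker_command = "%(python)s -m distributed.cli.dask_worker" % dict(python=sys.executable)
```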
dask_jobqueue/htcondor.py
Outdated
for item in split_env_line:
    if '=' in item:
        k, v = item.split('=', 1)
        self.env_dict[k] = v
I'd make at least the above lines a function. Could it also be moved directly into `quote_environment`?
I'll move it to a function but I'd like to keep it separate from quote_environment; I think turning the config lines into a dict is a separate task from turning the dict into a string, and I'd like to leave the option open to add to/edit the dict after parsing.
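A minimal sketch of what such a standalone parsing helper might look like (the function name here is hypothetical, not the PR's):

```python
def env_lines_to_dict(env_lines):
    """Turn 'KEY=value' tokens from config lines into a dict.

    Hypothetical helper name. Kept separate from quote_environment so the
    dict can still be edited after parsing and before serialization.
    """
    env_dict = {}
    for line in env_lines:
        for item in line.split():
            if "=" in item:
                # split on the first '=' only, so values may contain '='
                k, v = item.split("=", 1)
                env_dict[k] = v
    return env_dict
```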
Ok, fine by me.
dask_jobqueue/htcondor.py
Outdated
self.make_job_header()

def make_job_header(self):
Is it really useful to have a function for a one liner?
I can inline it, but in that case, I'd like to move it into job_script(). I'd like a subclass to be able to edit the job header via the job_header_dict and have those changes be reflected in the generated script.
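To illustrate the idea (the names below are assumptions for the sketch, not the PR's exact attributes), rendering the header from the dict inside job_script() might look like:

```python
def render_job_header(job_header_dict):
    """Serialize a submit-description dict into HTCondor 'key = value' lines.

    Illustrative sketch only: because the lines are generated from the dict
    at job_script() time, a subclass that edits job_header_dict before
    submission sees its changes reflected in the generated script.
    """
    return "\n".join("%s = %s" % (k, v) for k, v in job_header_dict.items())
```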
I see your point, but this is somewhat different from what we do for the other implementations. Do you foresee there being some subclasses?
Well, I'd like to subclass it to add HTCondor file transfer support and wrapper scripts and other things needed to get rid of the shared file system dependency. As I suggested in #100, I'd put this in a separate repo so you folks wouldn't have to maintain the extra code.
Sorry, should have tested my code changes locally first. Let me fix.
cluster.job_header is no longer directly accessible; instead, I check a few things in job_header_dict and test later in test_job_script() that the attributes got properly translated.
Thanks @matyasselmeci for the work, this looks quite good to me. However, I'd like to have someone else review this, ping @lesteve and @jhamman. It would also be good if somebody actually using HTCondor could give this a try! If nobody shows up in the following days, I'm still happy to merge this as a first implementation for HTCondor, and we'll get feedback later on as people find time to play with it!
This is a feature we have been waiting for, and many thanks to @matyasselmeci and @guillaumeeb for getting it this far. We would love to try out the feature once it's merged in and give you all feedback.
I've been pretty busy since this PR was submitted, but I hope to be able to test it out this weekend.
I took a quick pass through this PR. Thanks @matyasselmeci for pulling it together. I've not personally used HTCondor before, so I'm just going to take you at your word on the specifics of the job script implementation. I'd like to see this go in and let people try it out.
Of course, at some point it would be nice to get a docker test environment for this cluster setup, but I think that can come later.
Okay, this is in. Please give some feedback, HTCondor users!
Thanks @matyasselmeci!
This implements HTCondorCluster, as requested in #100. This allows using HTCondor as a resource manager, provided you have a shared file system.
In order to minimize code, I ended up using the command-line tools instead of the Python bindings and only submitting one job at a time (i.e. run `condor_submit` N times with `-queue 1` instead of once with `-queue N`). However, I did leave some hooks for subclassing.

Some differences usage-wise from the other resource managers:

- `disk` is a required parameter, for how much storage space to request per job; specify it the same way as memory, e.g. "1 GB"
- `job-extra`, which is arbitrary extra submit file attributes, is a dict instead of a list

I tested this with Python 2.7 and Python 3.6, but only on a single machine. Looks like some of the other resource managers use docker-compose to set up test clusters, so I'll try to get something like that going for HTCondor too.
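The one-job-at-a-time submission strategy described above could be sketched roughly like this (a simplified illustration with a hypothetical `dry_run` switch; the PR's internals differ):

```python
import subprocess

def submit_n_jobs(submit_file, n, dry_run=False):
    """Run condor_submit N times with '-queue 1' instead of once with
    '-queue N', as described above. Illustrative sketch only; the dry_run
    flag is an assumption added so the logic can be exercised without an
    HTCondor installation."""
    commands = []
    for _ in range(n):
        cmd = ["condor_submit", submit_file, "-queue", "1"]
        commands.append(cmd)
        if not dry_run:
            # Requires condor_submit on PATH; skipped in dry_run mode.
            subprocess.check_call(cmd)
    return commands
```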
, which is arbitrary extra submit file attributes, is a dict instead of a listI tested this with Python 2.7 and Python 3.6, but only on a single machine. Looks like some of the other resource managers use docker-compose to set up test clusters so I'll try to get something like that going for HTCondor too.