Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTCondorCluster #245

Merged
merged 12 commits into from Mar 7, 2019

Conversation

@matyasselmeci
Copy link
Contributor

matyasselmeci commented Mar 2, 2019

This implements HTCondorCluster, as requested in #100. This allows using HTCondor as a resource manager, provided you have a shared file system.

In order to minimize code, I ended up using the command-line tools instead of the Python bindings and only submitting one job at a time (i.e. run condor_submit N times with -queue 1 instead of once with -queue N). However, I did leave some hooks for subclassing.

Some differences usage-wise from the other resource managers:

  • disk is a required parameter, for how much storage space to request per job; specify the same way as memory, e.g. "1 GB"
  • job-extra, which is arbitrary extra submit file attributes, is a dict instead of a list

I tested this with Python 2.7 and Python 3.6, but only on a single machine. Looks like some of the other resource managers use docker-compose to set up test clusters so I'll try to get something like that going for HTCondor too.

Copy link
Member

guillaumeeb left a comment

Many many thanks @matyasselmeci! This is great to finally have this!

HTCondor looks really specific though...

In order to minimize code, I ended up using the command-line tools instead of the Python bindings and only submitting one job at a time

I like that! From what I understand, this makes HTCondor implementation closer to others.

Looks like some of the other resource managers use docker-compose to set up test clusters so I'll try to get something like that going for HTCondor too

Yep, but obviously this can wait for another PR!

I've a few comments/suggestions/questions on the changes, but overall this looks really good! I've a small concern on overriding _script_template and so job_script(), but maybe using the provided ones would have led to something difficult to understand.

For the test failure, apparently you've got a flake8 check error.

cc @szs8 @jrbourbeau @mivade @simone-codeluppi or anyone with an access to an HTCondor cluster for trying this!


# condor sets argv[0] of the executable to "condor_exec.exe", which confuses
# Python (can't find its libs), so we have to go through the shell.
executable = "/bin/sh"

This comment has been minimized.

Copy link
@guillaumeeb

guillaumeeb Mar 2, 2019

Member

So this means we cannot put something like dask-worker here?

This comment has been minimized.

Copy link
@matyasselmeci

matyasselmeci Mar 4, 2019

Author Contributor

I think dask-worker would work well, I just noticed that none of the other implementations were using it so I thought it was deprecated or something.

This comment has been minimized.

Copy link
@guillaumeeb

guillaumeeb Mar 4, 2019

Member

We actually use

'%(python)s -m distributed.cli.dask_worker' % dict(python=sys.executable)

see #77

This is just to be sure to find dask-worker script if not in the PATH or python bin directory.

But you may not be able to pass several arguments here... So maybe it's just simpler to use /bin/sh.

for item in split_env_line:
if '=' in item:
k, v = item.split('=', 1)
self.env_dict[k] = v

This comment has been minimized.

Copy link
@guillaumeeb

guillaumeeb Mar 2, 2019

Member

I'd make at least the above lines a function.

Could it also be move directly into quote_environment?

This comment has been minimized.

Copy link
@matyasselmeci

matyasselmeci Mar 4, 2019

Author Contributor

I'll move it to a function but I'd like to keep it separate from quote_environment; I think turning the config lines into a dict is a separate task from turning the dict into a string, and I'd like to leave the option open to add to/edit the dict after parsing.

This comment has been minimized.

Copy link
@guillaumeeb

guillaumeeb Mar 4, 2019

Member

Ok, fine by me.


self.make_job_header()

def make_job_header(self):

This comment has been minimized.

Copy link
@guillaumeeb

guillaumeeb Mar 2, 2019

Member

Is it really useful to have a function for a one liner?

This comment has been minimized.

Copy link
@matyasselmeci

matyasselmeci Mar 4, 2019

Author Contributor

I can inline it, but in that case, I'd like to move it into job_script(). I'd like a subclass to be able to edit the job header via the job_header_dict and have those changes be reflected in the generated script.

This comment has been minimized.

Copy link
@guillaumeeb

guillaumeeb Mar 4, 2019

Member

I see your point, but this is somehow different from what we do for other implementation. Do you foresee there will be some subclasses?

This comment has been minimized.

Copy link
@matyasselmeci

matyasselmeci Mar 5, 2019

Author Contributor

Well, I'd like to subclass it to add HTCondor file transfer support and wrapper scripts and other things needed to get rid of the shared file system dependency. As I suggested in #100, I'd put this in a separate repo so you folks wouldn't have to maintain the extra code.

dask_jobqueue/htcondor.py Show resolved Hide resolved
dask_jobqueue/htcondor.py Show resolved Hide resolved
@guillaumeeb guillaumeeb requested review from lesteve and jhamman Mar 4, 2019
@matyasselmeci

This comment has been minimized.

Copy link
Contributor Author

matyasselmeci commented Mar 5, 2019

Sorry, should have tested my code changes locally first. Let me fix.

cluster.job_header is no longer directly accessible; instead,
check a few things in job_header_dict and test later in test_job_script()
that the attributes got properly translated.
@guillaumeeb

This comment has been minimized.

Copy link
Member

guillaumeeb commented Mar 6, 2019

Thanks @matyasselmeci for the work, this looks quite good to me.

However, I'd like to have someone else review this, ping @lesteve and @jhamman.

It would also be good if somebody actually using HTCondor could give this a try!

If nobody shows up in the following days, I'm still happy to merge this as a first implementation for HTCondor, and we'll get feedback later on as people find time to play with it!

@cloudronin

This comment has been minimized.

Copy link

cloudronin commented Mar 6, 2019

This is a feature we have been waiting for and much thanks to @matyasselmeci and @guillaumeeb for getting it this far. We would love to try out the feature once its merged in and give you all feedback.

@mivade

This comment has been minimized.

Copy link

mivade commented Mar 7, 2019

I've been pretty busy since this PR was submitted, but I hope to be able to test it out this weekend.

@jhamman
jhamman approved these changes Mar 7, 2019
Copy link
Member

jhamman left a comment

I took a quick pass through this PR. Thanks @matyasselmeci for pulling it together. I've not personally used HTCondor before so I'm just going to take you at your word on the specifics of the job script implementation. I'd like to see this go it and let people try it out.

Of course, at some point it would be nice to get a docker test environment for this cluster setup but I think that come come later.

@guillaumeeb guillaumeeb merged commit dee88f0 into dask:master Mar 7, 2019
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@guillaumeeb

This comment has been minimized.

Copy link
Member

guillaumeeb commented Mar 7, 2019

Okay this is in, please give some feedback HTCondor users!

This was referenced Mar 7, 2019
@mivade

This comment has been minimized.

Copy link

mivade commented Mar 7, 2019

Thanks @matyasselmeci!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.