HTCondorCluster #245

Merged: 12 commits merged into dask:master on Mar 7, 2019

Conversation

matyasselmeci (Contributor)

This implements HTCondorCluster, as requested in #100. It allows using HTCondor as a resource manager, provided you have a shared file system.

In order to minimize code, I ended up using the command-line tools instead of the Python bindings and only submitting one job at a time (i.e. running condor_submit N times with -queue 1 instead of once with -queue N). However, I did leave some hooks for subclassing.

Some usage differences from the other resource managers:

  • disk is a required parameter for how much storage space to request per job; it is specified the same way as memory, e.g. "1 GB"
  • job-extra, which holds arbitrary extra submit file attributes, is a dict instead of a list
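
For illustration, usage along the lines described above might look roughly like this (a sketch based on this description; exact argument names and defaults should be checked against the merged API):

    from dask.distributed import Client
    from dask_jobqueue import HTCondorCluster

    # disk is required; job_extra is a dict of raw submit-file attributes
    cluster = HTCondorCluster(
        cores=1,
        memory="2 GB",
        disk="1 GB",
        job_extra={"+IsDaskWorker": "true"},  # illustrative custom ClassAd attribute
    )
    cluster.scale(4)   # one condor_submit call per worker job
    client = Client(cluster)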

I tested this with Python 2.7 and Python 3.6, but only on a single machine. It looks like some of the other resource managers use docker-compose to set up test clusters, so I'll try to get something like that going for HTCondor too.

@guillaumeeb (Member) left a comment

Many, many thanks @matyasselmeci! It's great to finally have this!

HTCondor looks really specific though...

In order to minimize code, I ended up using the command-line tools instead of the Python bindings and only submitting one job at a time

I like that! From what I understand, this makes the HTCondor implementation closer to the others.

Looks like some of the other resource managers use docker-compose to set up test clusters so I'll try to get something like that going for HTCondor too

Yep, but obviously this can wait for another PR!

I have a few comments/suggestions/questions on the changes, but overall this looks really good! I have a small concern about overriding _script_template and thus job_script(), but maybe using the provided ones would have led to something difficult to understand.

For the test failure, apparently you've got a flake8 check error.

cc @szs8 @jrbourbeau @mivade @simone-codeluppi or anyone with an access to an HTCondor cluster for trying this!


# condor sets argv[0] of the executable to "condor_exec.exe", which confuses
# Python (can't find its libs), so we have to go through the shell.
executable = "/bin/sh"

Member:

So this means we cannot put something like dask-worker here?

Contributor (author):

I think dask-worker would work well, I just noticed that none of the other implementations were using it so I thought it was deprecated or something.

Member:

We actually use

'%(python)s -m distributed.cli.dask_worker' % dict(python=sys.executable)

see #77

This is just to be sure to find the dask-worker script if it is not on the PATH or in the Python bin directory.

But you may not be able to pass several arguments here... So maybe it's just simpler to use /bin/sh.
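
For illustration, combining /bin/sh with that module invocation in a submit description might look roughly like the following (the paths, scheduler address, and worker options are placeholders, not necessarily what this PR generates):

    # hypothetical submit-file fragment
    executable = /bin/sh
    arguments = "-c '/path/to/python -m distributed.cli.dask_worker tcp://scheduler:8786 --nthreads 1'"
    request_memory = 2 GB
    request_disk = 1 GB
    queue 1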

for item in split_env_line:
    if '=' in item:
        # keep only NAME=VALUE entries, splitting on the first '=' only
        k, v = item.split('=', 1)
        self.env_dict[k] = v

Member:

I'd make at least the above lines a function.

Could it also be moved directly into quote_environment?

Contributor (author):

I'll move it to a function but I'd like to keep it separate from quote_environment; I think turning the config lines into a dict is a separate task from turning the dict into a string, and I'd like to leave the option open to add to/edit the dict after parsing.

Member:

Ok, fine by me.
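
For reference, a minimal sketch of that separation (function names and the exact quoting are illustrative; the rendering below follows HTCondor's newer environment syntax, where the whole value is double-quoted, entries are whitespace-separated, and values are single-quoted):

    def env_lines_to_dict(env_lines):
        # illustrative helper: collect NAME=VALUE items into a dict
        env = {}
        for line in env_lines:
            for item in line.split():
                if '=' in item:
                    k, v = item.split('=', 1)
                    env[k] = v
        return env

    def quote_environment(env):
        # illustrative helper: render the dict as an HTCondor 'environment' value
        entries = []
        for k, v in env.items():
            v = str(v).replace('"', '""').replace("'", "''")
            entries.append("%s='%s'" % (k, v))
        return '"%s"' % ' '.join(entries)

Keeping the two steps separate leaves room to edit the dict in between, which is the point made above.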


self.make_job_header()

def make_job_header(self):

Member:

Is it really useful to have a function for a one-liner?

Contributor (author):

I can inline it, but in that case, I'd like to move it into job_script(). I'd like a subclass to be able to edit the job header via the job_header_dict and have those changes be reflected in the generated script.

Member:

I see your point, but this is somewhat different from what we do for the other implementations. Do you foresee there being some subclasses?

Contributor (author):

Well, I'd like to subclass it to add HTCondor file transfer support and wrapper scripts and other things needed to get rid of the shared file system dependency. As I suggested in #100, I'd put this in a separate repo so you folks wouldn't have to maintain the extra code.
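
A rough sketch of the kind of subclass being described (the class is hypothetical; should_transfer_files and related keys are standard HTCondor submit commands, but how they would plug into job_header_dict is an assumption):

    class FileTransferHTCondorCluster(HTCondorCluster):
        # hypothetical subclass: drop the shared-file-system requirement
        # by using HTCondor file transfer
        def job_script(self):
            self.job_header_dict.update({
                "should_transfer_files": "YES",
                "when_to_transfer_output": "ON_EXIT",
                "transfer_input_files": "worker_env.tar.gz",  # placeholder
            })
            return super(FileTransferHTCondorCluster, self).job_script()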

(Two further review threads on dask_jobqueue/htcondor.py were marked resolved and are not shown.)
@matyasselmeci (Contributor, author)

Sorry, should have tested my code changes locally first. Let me fix.

cluster.job_header is no longer directly accessible; instead,
check a few things in job_header_dict and test later in test_job_script()
that the attributes got properly translated.
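
For reference, a test along those lines might look roughly like this (keys such as RequestMemory are standard submit attributes, but the exact dict keys and test shape here are assumptions, not a copy of the PR's tests):

    def test_header_dict_and_job_script():
        with HTCondorCluster(cores=1, memory="500 MB", disk="500 MB") as cluster:
            # attributes from the constructor should land in the header dict...
            assert "RequestMemory" in cluster.job_header_dict
            assert "RequestDisk" in cluster.job_header_dict
            # ...and be translated into the generated submit description
            job_script = cluster.job_script()
            assert "RequestMemory" in job_script
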
@guillaumeeb (Member)

Thanks @matyasselmeci for the work, this looks quite good to me.

However, I'd like to have someone else review this, ping @lesteve and @jhamman.

It would also be good if somebody actually using HTCondor could give this a try!

If nobody shows up in the following days, I'm still happy to merge this as a first implementation for HTCondor, and we'll get feedback later on as people find time to play with it!

@cloudronin

This is a feature we have been waiting for, and many thanks to @matyasselmeci and @guillaumeeb for getting it this far. We would love to try out the feature once it's merged in and give you all feedback.

@mivade commented Mar 7, 2019

I've been pretty busy since this PR was submitted, but I hope to be able to test it out this weekend.

@jhamman (Member) left a comment

I took a quick pass through this PR. Thanks @matyasselmeci for pulling it together. I've not personally used HTCondor before, so I'm just going to take you at your word on the specifics of the job script implementation. I'd like to see this go in and let people try it out.

Of course, at some point it would be nice to get a docker test environment for this cluster setup, but I think that can come later.

@guillaumeeb merged commit dee88f0 into dask:master on Mar 7, 2019
@guillaumeeb (Member)

Okay, this is in; please give some feedback, HTCondor users!

@mivade commented Mar 7, 2019

Thanks @matyasselmeci!
