
HTCondor support #100

Closed
szs8 opened this issue Jul 18, 2018 · 30 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@szs8

szs8 commented Jul 18, 2018

Does this project have plans for HTCondor support?

@mrocklin
Member

mrocklin commented Jul 18, 2018 via email

@jhamman
Member

jhamman commented Jul 19, 2018

I don't think we have any current plans to implement an HTCondorCluster but it does seem in scope for this project. If an interested developer wanted to give it a try, I suspect it would be relatively straightforward. That said, I don't have any experience with HTCondor so take my measure of "straightforward" with a grain of salt.

@mrocklin
Member

mrocklin commented Jul 19, 2018 via email

@jrbourbeau
Member

Sorry for the delayed response. @mrocklin is correct; I'm currently working on an HTCondorCluster implementation to add to dask-jobqueue and plan to open up a PR soon.

@szs8
Author

szs8 commented Jul 19, 2018

@jrbourbeau That's great! Looking forward to it.

@guillaumeeb
Member

@jrbourbeau any update? Do you have something to share? Maybe @szs8 would like to take a look at this?

@jrbourbeau
Member

Apologies for letting this linger. I got side-tracked with other work and haven't come back to this yet. @szs8 if you, or anyone else, are interested in contributing here, please feel free to take over.

@mivade

mivade commented Oct 16, 2018

@jrbourbeau Where does the existing implementation live?

@guillaumeeb guillaumeeb added the help wanted Extra attention is needed label Oct 17, 2018
@guillaumeeb guillaumeeb added enhancement New feature or request good first issue Good for newcomers labels Oct 27, 2018
@guillaumeeb
Member

@jrbourbeau do you have any prototype of HTCondorCluster to share, as mentioned above?

@simone-codeluppi

Hi
I am also interested in a prototype of HTCondorCluster. At the moment I run some sub-optimal Python code to start/stop the cluster and run jobs, without any possibility of scaling, etc.

@guillaumeeb
Member

If anyone with access to an HTCondor cluster wants to give it a try, I'd be happy to help!

@simone-codeluppi

Hi
our lab has a small cluster with 12 nodes that is managed by HTCondor.
I made a script for submitting the jobs and starting a dask cluster, but it is not flexible and doesn't allow scaling, etc. It just submits jobs for the scheduler and workers and starts a processing job. @guillaumeeb I will be happy to run some tests! Thank you!

@djhoese

djhoese commented Jan 8, 2019

I was just asked by a supervisor about this compatibility. Google searches brought me to @matyasselmeci and their repository here. I'm hoping they or someone else can give another status update.

@matyasselmeci
Contributor

Hi @djhoese and others,

My project was written to work with dask/distributed instead of dask/dask-jobqueue so some work would have to be done to adapt it.

My code differs from the other batch system interfaces primarily because we wanted it to work without a shared filesystem, by having the user build a special tarball (using build-worker-tarball or build-worker-tarball-conda) and having the worker jobs use HTCondor file transfer to transfer it.
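
Concretely, the relevant part of the worker job's submit file looks something like this (illustrative fragment only, not the actual dask_condor template; the wrapper script and tarball names are placeholders):

    # Illustrative submit-file fragment: ship a pre-built worker environment to
    # the execute node with HTCondor file transfer, so no shared filesystem is
    # needed. Names below are placeholders.
    executable              = run_dask_worker.sh
    transfer_input_files    = dask-worker-env.tar.gz
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue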

The main difficulty I ran into in the end was that I couldn't figure out a way of building the worker tarball that was sufficiently user-friendly; because of this and because of a lack of interest in our department, the project fizzled.

I don't have time to work on the project anymore but I'm happy to answer any questions if someone wants to use the code as a starting point...

@djhoese

djhoese commented Jan 13, 2019

@matyasselmeci That's too bad. Do you have any idea how much work would be needed to migrate your distributed PR to jobqueue? What if you assumed that users had a shared/networked file system?

@matyasselmeci
Contributor

I haven't looked at jobqueue at all, so I don't know how much the interface is different from distributed. I can get you an estimate later this week.

Since it sounds like there's interest in this again, I'll talk to my PI about getting some time to work on it.

@djhoese

djhoese commented Jan 14, 2019

@matyasselmeci That would be great. My "secret" goal is to possibly get a JupyterHub instance running on the University of Wisconsin clusters, or at least at the SSEC where I work. I assume you also work on campus? If you want to meet in person to discuss some of this stuff and my ideas, let me know.

@guillaumeeb
Member

@matyasselmeci or @simone-codeluppi, again I would be happy to help extract the relevant bits of your scripts to a JobQueueCluster implementation for HTCondor. I think this should be quite easy if you've already run a Dask cluster using dask-scheduler or dask-worker commands wrapped into job scripts.

@matyasselmeci
Contributor

@guillaumeeb yes, it looks like the changes wouldn't be too severe. One thing is that dask_condor uses HTCondor's Python bindings (which include some compiled code) to submit and control jobs instead of command-line tools (as the other JobQueueClusters do), so it may have to stay a separate project.
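
For reference, submitting a worker through the bindings looks roughly like this (a minimal sketch rather than the actual dask_condor code; the interpreter path and scheduler address are placeholders):

    # Minimal sketch using the htcondor Python bindings (the classic API,
    # available since roughly 8.6). All values are placeholders.
    import htcondor

    schedd = htcondor.Schedd()  # talk to the local condor_schedd
    sub = htcondor.Submit({
        "executable": "/usr/bin/python3",
        "arguments": "-m distributed.cli.dask_worker tcp://scheduler:8786",
        "output": "dask-worker-$(Cluster).$(Process).out",
        "error": "dask-worker-$(Cluster).$(Process).err",
        "log": "dask-worker.log",
    })
    with schedd.transaction() as txn:
        cluster_id = sub.queue(txn, count=2)  # returns the ClusterId
    print("submitted cluster", cluster_id)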

@djhoese yes, I work at Comp Sci. My colleagues and I would be very interested in meeting with you and discussing your projects; mind if I contact you via email? (I found your SSEC email address in the UW directory, if that's a good one to use.)

@djhoese

djhoese commented Jan 15, 2019

@matyasselmeci Yes that works.

@matyasselmeci
Contributor

I've started work on this and it shouldn't be too difficult; I can have a branch in dask_condor that's usable for external testing in a week or two. @mrocklin, if you're OK with adding an optional dependency on htcondor (extras_require=dict(htcondor="htcondor>=8.6.0") in setup.py) then I can turn it into a pull request after that.
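
For concreteness, the change would be something along these lines (fragment only; the existing setup.py arguments stay as they are):

    # setup.py fragment: expose the htcondor bindings as an optional extra so
    # plain installs are unaffected; users opt in with
    #   pip install dask-jobqueue[htcondor]
    from setuptools import setup

    setup(
        name="dask-jobqueue",
        # ... existing metadata and install_requires unchanged ...
        extras_require={"htcondor": ["htcondor>=8.6.0"]},
    )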

@mrocklin
Member

mrocklin commented Jan 22, 2019 via email

@guillaumeeb
Member

I've no strong argument against it, provided it's not too big and doesn't break other things!

I have some questions though:

  • Would it be possible to follow the same scheme as for the other job queuing systems instead? Just checking...
  • Is this method compatible with the JobQueueCluster abstract class? Does it imply overriding many parts of it?

My main concern is future ease of maintenance; I'd rather not have a solution that is too specific to HTCondor. But I think the priority is to have a working solution!

@matyasselmeci
Contributor

I'm using JobQueueCluster as the base class; I'll try to make it as compatible as possible and will follow the conventions the other batch system implementations are using.

@matyasselmeci
Contributor

I mostly have to override start_workers and stop_workers/stop_jobs, since it doesn't work by creating a submission script and running it. If you want, I can split out some of the advanced features (e.g. file transfer) into a subclass that won't be part of the PR and that I can maintain separately.

@jhamman
Member

jhamman commented Jan 23, 2019

@matyasselmeci - thanks for sticking around here and your continued interest. I have not personally used HTCondor, so forgive me if my questions are a bit off base. I'm mostly interested in discussing whether there is value in having the HTCondorCluster use the same machinery that the other cluster managers do. I quickly looked at these docs and it seems this should be quite easy to do.

It seems we could basically template out a job script (just like in the PBSCluster) that looks like:

# /tmp/tempfile.sub -- run one instance of dask_worker
executable              = %(python)s -m distributed.cli.dask_worker
arguments               = "%(worker_command)s"
log                     = %(log_directory)s/dask-worker-log.txt
queue

and then the submit/cancel commands:

    submit_command = 'condor_submit'
    cancel_command = 'condor_rm'

It seems like additional arguments are possible in the job script to specify wall time, resources, etc.

Now, I could be missing a fundamental part of HTCondor, so please correct me if I'm wrong here.
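
To make that concrete, here is a very rough sketch of the kind of subclass this would imply (the hook names are modelled on PBSCluster, but I haven't checked them against the current code, so treat all of this as illustrative rather than the real dask-jobqueue API):

    # Illustrative only: mirrors the PBSCluster pattern of a templated job
    # script plus class-level submit/cancel commands. Values interpolated into
    # the template are placeholders.
    from dask_jobqueue import JobQueueCluster


    class HTCondorCluster(JobQueueCluster):
        submit_command = "condor_submit"
        cancel_command = "condor_rm"

        def job_script(self):
            template = (
                "executable = %(python)s\n"
                'arguments  = "-m distributed.cli.dask_worker %(worker_args)s"\n'
                "log        = %(log_directory)s/dask-worker-log.txt\n"
                "queue\n"
            )
            return template % {
                "python": "/usr/bin/python3",
                "worker_args": "tcp://scheduler:8786 --nthreads 1",
                "log_directory": "dask-logs",
            }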

@matyasselmeci
Contributor

It's a little more complicated than that; executable needs to be a single file (whatever happened to the dask-worker executable, by the way?); the quoting for arguments is... nonstandard...; and log is actually the record of what the HTCondor scheduler does to the job, not the dask-worker's output.

HTCondor is used for submitting multiple jobs at the same time so the job ID has the form 13.5 where the 13 is the ID of the group of jobs submitted ("ClusterId" -- bad name but it's historical) and the 5 is the specific job within that group ("ProcId" -- same here); condor_submit only gives you the ClusterId in the output (but the ProcIds are 0..n-1 as expected)...

Basically the devil is in the details. It's possible to do it that way but you'll need to do a lot of pre-processing of the parameters anyway or else it won't behave the way you expect it to.
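
For example, if you go the command-line route, recovering job IDs ends up looking something like this (a hypothetical helper; the output format shown is the usual one, but you'd want to parse defensively):

    # Hypothetical helper: run condor_submit, pull the ClusterId out of its
    # stdout, and build the "ClusterId.ProcId" job IDs for n jobs.
    import re
    import subprocess

    def submit_and_get_job_ids(submit_file, n_jobs):
        out = subprocess.check_output(["condor_submit", submit_file], text=True)
        # Typical last line: "2 job(s) submitted to cluster 13."
        match = re.search(r"submitted to cluster (\d+)", out)
        if match is None:
            raise RuntimeError("could not parse ClusterId from: %r" % out)
        cluster_id = int(match.group(1))
        return ["%d.%d" % (cluster_id, proc) for proc in range(n_jobs)]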

@guillaumeeb
Member

I've no strong opinion here. I would prefer to have the same scheme for all implementations, as said before, but I'm happy if we can have a different working solution that still fits in dask-jobqueue.

If you've started work on this and have some interest in it, I say keep going, and ping us as soon as you have something that's understandable so that we can give feedback!

@lesteve
Member

lesteve commented Jan 30, 2019

Just a note: it may be useful to look at IPython.parallel implementation for HTCondor:
https://github.com/ipython/ipyparallel/blob/beb400fd87f59504f231baa28c9b89ea76ab4f79/ipyparallel/apps/launcher.py#L1436

@guillaumeeb
Member

Closed by #245.


10 participants