
Tracking DRMAA Job IDs in Dask Scheduler #14

mrocklin opened this issue Jan 18, 2017 · 1 comment

@mrocklin

When DRMAA launches a worker it gives us a job ID like 34.1. That job starts a worker process on a random port, which then registers itself with the scheduler. As a result, each worker has two identifiers:

  1. A DRMAA job id like 34.1
  2. A Dask ip:port address like 192.168.0.1:37482

The Dask scheduler knows when it should clean up a worker, and it terminates the worker process remotely. However, even though the process terminates, the DRMAA job appears to keep running. This causes confusion, especially when trying to work out how many workers are in flight in order to decide whether to scale down the cluster.

I see two solutions here:

  1. Somehow ensure that DRMAA jobs finish when their process finishes
  2. Maintain a mapping between DRMAA Job IDs and Dask ip:port worker addresses

The second option would be generally useful. We can do it by passing the job ID as the worker's name/alias:

dask-worker scheduler-address:8786 --name 34.1
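
If workers register with their job ID as their name, the mapping in option 2 can be read straight out of scheduler state. A minimal sketch, using the same scheduler attributes the test below relies on (the helper here is illustrative, not existing code):

def job_id_to_address(scheduler):
    # Invert worker registration: each worker's reported name (the DRMAA
    # job ID) maps back to its Dask ip:port address
    return {scheduler.worker_info[addr]['name']: addr
            for addr in scheduler.workers}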

Ideally we would use a job-scheduler-provided environment variable here:

dask-worker scheduler-address:8786 --name $JOBID
dask-worker scheduler-address:8786 --name $drmaa_incr_ph$
...

However, it appears that environment variables cannot be used within the args, only within a batch script. Currently we specify a job template by pointing at the dask-worker executable directly:

wt = get_session().createJobTemplate()
wt.jobName = 'dask-drmaa'
wt.remoteCommand = 'dask-worker'
wt.args = [scheduler_address, '--name', '$JOBID']
wt.outputPath = ...
wt.errorPath = ...
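
For reference, submitting this template is what produces the DRMAA job ID in the first place. A minimal sketch with the python drmaa session API used above:

jobid = get_session().runJob(wt)     # returns an ID such as '34.1'
get_session().deleteJobTemplate(wt)  # templates can be released after submission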

However, as stated above, using environment variables within args does not seem to work with DRMAA. Instead, it is recommended to use environment variables within scripts:

# worker script: "$@" carries the scheduler address passed via the job template args
dask-worker --name "$JOBID" "$@"

wt = get_session().createJobTemplate()
wt.jobName = 'dask-drmaa'
wt.remoteCommand = 'my-worker-script'
wt.args = [scheduler_address]
wt.outputPath = ...
wt.errorPath = ...

I tried this out and didn't have much success. I suspect that I'm missing something simple. Additionally, it would be good to have different output paths for different jobs. Currently they all dump to the same file.
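
One possible route to distinct log files would be to put scheduler pseudo variables into the template's paths. This is untested and assumes the underlying DRM (e.g. Grid Engine) expands $JOB_ID/$TASK_ID in output paths the way it does for qsub -o/-e:

import os

# Assumption: the DRM substitutes $JOB_ID/$TASK_ID in the path; the leading
# ':' is the DRMAA "[hostname]:path" output-path syntax
wt.outputPath = ':' + os.path.join(os.getcwd(), 'dask-worker.$JOB_ID.$TASK_ID.out')
wt.errorPath = ':' + os.path.join(os.getcwd(), 'dask-worker.$JOB_ID.$TASK_ID.err')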

There is an xfailed test in test_core.py that checks that the job IDs come through as the worker names. If anyone more familiar with DRMAA can help make this test pass, I would be grateful:

@pytest.mark.xfail(reason="Can't use job name environment variable as arg")
def test_job_name_as_name(loop):
    with DRMAACluster(scheduler_port=0) as cluster:
        cluster.start_workers(2)
        while len(cluster.scheduler.workers) < 1:
            sleep(0.1)

        names = {cluster.scheduler.worker_info[w]['name']
                 for w in cluster.scheduler.workers}

        assert names == set(cluster.workers)

cc @davidr

@mrocklin

I've managed to close worker jobs using only dask mechanisms. Previously we were closing workers but not cleanly terminating their nanny processes.

mrocklin added a commit to mrocklin/dask-drmaa that referenced this issue Jan 24, 2017

This allows for a few things:

1.  Users can customize this script to their liking, introducing various
    environment variables as they see fit.

2.  We get to use DRMAA/SGE environment variables within this script
    See dask#14

mrocklin added a commit that referenced this issue Mar 2, 2017

* Use bash script rather than dask-worker executable

This allows for a few things:

1.  Users can customize this script to their liking, introducing various
    environment variables as they see fit.

2.  We get to use DRMAA/SGE environment variables within this script
    See #14

* add SLURM and LSF JOB/TASK ids as well

* Name log files based on job and task ID

* synchronize with dask/master

* clean up scripts after running

* pip install drmaa

* run test from source directory