
Tracking DRMAA Job IDs in Dask Scheduler #14

mrocklin opened this issue Jan 18, 2017 · 1 comment

@mrocklin

When DRMAA launches a worker it gives us a job ID like 34.1. That job starts a worker process on a random port, which then registers itself with the scheduler. As a result, each worker has two identifiers:

  1. A DRMAA job id like 34.1
  2. A Dask ip:port address like 192.168.0.1:37482

The Dask scheduler knows when it should clean up a worker, and it terminates the worker process remotely. However, even though the process terminates, the DRMAA job appears to keep running. This causes confusion, especially when trying to work out how many workers are in flight in order to decide whether to scale down the cluster.

I see two solutions here:

  1. Somehow ensure that DRMAA jobs finish when their process finishes
  2. Maintain a mapping between DRMAA Job IDs and Dask ip:port worker addresses

The second option would be generally useful. We can do it by passing the job ID as the worker's name/alias:

dask-worker scheduler-address:8786 --name 34.1
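
If workers register with their job ID as their name, the mapping in option 2 can be read straight out of scheduler state. A minimal sketch, using the same scheduler attributes the test below relies on (the helper here is illustrative, not existing code):

def job_id_to_address(scheduler):
    # Invert worker registration: each worker's reported name (the DRMAA
    # job ID) maps back to its Dask ip:port address
    return {scheduler.worker_info[addr]['name']: addr
            for addr in scheduler.workers}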

Ideally we would use a job-scheduler-provided environment variable here:

dask-worker scheduler-address:8786 --name $JOBID
dask-worker scheduler-address:8786 --name $drmaa_incr_ph$
...

However, it appears that environment variables cannot be used within the args, only within a batch script. Currently we specify a job template by pointing at the dask-worker executable directly:

wt = get_session().createJobTemplate()
wt.jobName = 'dask-drmaa'
wt.remoteCommand = 'dask-worker'
wt.args = [scheduler_address, '--name', '$JOBID']
wt.outputPath = ...
wt.errorPath = ...
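
For reference, submitting this template is what produces the DRMAA job ID in the first place. A minimal sketch with the python drmaa session API used above:

jobid = get_session().runJob(wt)     # returns an ID such as '34.1'
get_session().deleteJobTemplate(wt)  # templates can be released after submission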

However, as stated above, using environment variables within args does not seem to work with DRMAA. Instead, it is recommended to use environment variables within scripts:

# worker script: "$@" carries the scheduler address passed via the job template args
dask-worker --name "$JOBID" "$@"

wt = get_session().createJobTemplate()
wt.jobName = 'dask-drmaa'
wt.remoteCommand = 'my-worker-script'
wt.args = [scheduler_address]
wt.outputPath = ...
wt.errorPath = ...

I tried this out and didn't have much success. I suspect that I'm missing something simple. Additionally, it would be good to have different output paths for different jobs. Currently they all dump to the same file.
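
One possible route to distinct log files would be to put scheduler pseudo variables into the template's paths. This is untested and assumes the underlying DRM (e.g. Grid Engine) expands $JOB_ID/$TASK_ID in output paths the way it does for qsub -o/-e:

import os

# Assumption: the DRM substitutes $JOB_ID/$TASK_ID in the path; the leading
# ':' is the DRMAA "[hostname]:path" output-path syntax
wt.outputPath = ':' + os.path.join(os.getcwd(), 'dask-worker.$JOB_ID.$TASK_ID.out')
wt.errorPath = ':' + os.path.join(os.getcwd(), 'dask-worker.$JOB_ID.$TASK_ID.err')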

There is an xfailed test in test_core.py that checks that the job IDs come through as the worker names. If anyone more familiar with DRMAA can help make this test pass, I would be grateful:

@pytest.mark.xfail(reason="Can't use job name environment variable as arg")
def test_job_name_as_name(loop):
    with DRMAACluster(scheduler_port=0) as cluster:
        cluster.start_workers(2)
        while len(cluster.scheduler.workers) < 1:
            sleep(0.1)

        names = {cluster.scheduler.worker_info[w]['name']
                 for w in cluster.scheduler.workers}

        assert names == set(cluster.workers)

cc @davidr

@mrocklin

I've managed to close worker jobs using only dask mechanisms. Previously we were closing workers but not cleanly terminating their nanny processes.

mrocklin added a commit to mrocklin/dask-drmaa that referenced this issue Jan 24, 2017

This allows for a few things:

1.  Users can customize this script to their liking, introducing various
    environment variables as they see fit.

2.  We get to use DRMAA/SGE environment variables within this script
    See dask#14

mrocklin added a commit that referenced this issue Mar 2, 2017

* Use bash script rather than dask-worker executable

This allows for a few things:

1.  Users can customize this script to their liking, introducing various
    environment variables as they see fit.

2.  We get to use DRMAA/SGE environment variables within this script
    See #14

* add SLURM and LSF JOB/TASK ids as well

* Name log files based on job and task ID

* synchronize with dask/master

* clean up scripts after running

* pip install drmaa

* run test from source directory