When DRMAA launches a worker, it gives us a job ID like 34.1. This starts a worker on a random port, which registers with the scheduler. Now each worker has two identifiers:
* A DRMAA job ID like 34.1
* A Dask ip:port address like 192.168.0.1:37482
The Dask scheduler knows when it should clean up a worker, and it terminates the worker process remotely. However, even though the process terminates, the DRMAA job appears to keep running. This causes confusion, especially when we try to count how many workers are in flight in order to decide whether to scale the cluster down.
I see two solutions here:
1. Somehow ensure that DRMAA jobs finish when their process finishes
2. Maintain a mapping between DRMAA job IDs and Dask ip:port worker addresses
The second option would be useful more generally. We can do it by passing the job ID as the worker's name/alias:
dask-worker scheduler-address:8786 --name 34.1
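If each worker is named by its DRMAA job ID, the mapping in the second option falls straight out of scheduler state. Here is a minimal sketch, assuming a running DRMAACluster and using the same scheduler attributes as the test further down; the example addresses in the comment are invented:

from time import sleep
from dask_drmaa import DRMAACluster

with DRMAACluster(scheduler_port=0) as cluster:
    cluster.start_workers(2)
    while len(cluster.scheduler.workers) < 2:   # wait for workers to register
        sleep(0.1)
    # With --name set to the DRMAA job ID, worker_info ties the two
    # identifiers together, e.g.
    # {'34.1': '192.168.0.1:37482', '34.2': '192.168.0.1:40211'}
    job_to_address = {
        cluster.scheduler.worker_info[addr]['name']: addr
        for addr in cluster.scheduler.workers
    }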
Ideally we would fill in that --name value from a job-scheduler-provided environment variable rather than hard-coding it.
However, it appears that environment variables cannot be used within the args, only within a batch script. Currently we specify a job template by pointing at the dask-worker executable directly.
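For concreteness, here is a rough sketch of that direct approach with the drmaa Python bindings; the scheduler address and worker options are placeholders, not dask-drmaa's actual template. Because DRMAA hands the args to the job scheduler verbatim, with no shell in between, something like $JOB_ID would arrive as a literal string rather than the scheduler-assigned ID:

import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    # Current approach: point straight at the dask-worker executable.
    jt.remoteCommand = 'dask-worker'
    # These args are passed through verbatim; no shell ever expands them,
    # so ['--name', '$JOB_ID'] would name the worker the literal string
    # "$JOB_ID".
    jt.args = ['scheduler-address:8786']
    jt.jobName = 'dask-worker'
    job_ids = session.runBulkJobs(jt, 1, 2, 1)   # e.g. ['34.1', '34.2']
    session.deleteJobTemplate(jt)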
However, as stated above, using environment variables within args does not seem to work with DRMAA; instead, the recommendation is to use environment variables inside a script that the job runs.
I tried this out and didn't have much success; I suspect that I'm missing something simple. Additionally, it would be good to have different output paths for different jobs. Currently they all dump to the same file.
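For the record, the direction I was trying looks roughly like the sketch below. It is untested here, and the script contents, file paths, and SGE variable names ($JOB_ID, $SGE_TASK_ID) are assumptions about the target cluster rather than anything dask-drmaa ships: write a small shell wrapper that passes the scheduler's job ID as the worker name, point remoteCommand at it, and use DRMAA's parametric-index placeholder so each task logs to its own file.

import os
import stat
import drmaa

# A generated wrapper script lets the shell expand scheduler environment
# variables such as $JOB_ID and $SGE_TASK_ID (SGE names; SLURM and LSF
# use different ones).
script = """#!/bin/bash
dask-worker scheduler-address:8786 --name $JOB_ID.$SGE_TASK_ID
"""
path = os.path.join(os.getcwd(), 'dask-worker-wrapper.sh')
with open(path, 'w') as f:
    f.write(script)
os.chmod(path, os.stat(path).st_mode | stat.S_IEXEC)

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = path
    # The parametric-index placeholder is filled in per task by the job
    # scheduler, so tasks 1 and 2 of job 34 write to separate files
    # instead of all dumping into one.
    jt.outputPath = ':' + os.path.join(
        os.getcwd(), 'dask-worker.' + drmaa.JobTemplate.PARAMETRIC_INDEX + '.out')
    jt.errorPath = ':' + os.path.join(
        os.getcwd(), 'dask-worker.' + drmaa.JobTemplate.PARAMETRIC_INDEX + '.err')
    session.runBulkJobs(jt, 1, 2, 1)
    session.deleteJobTemplate(jt)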
There is an xfailed test in test_core.py that checks that the worker names are the DRMAA job names. If anyone more familiar with DRMAA can help make this test pass, I would be grateful. cc @davidr
import pytest
from time import sleep
from dask_drmaa import DRMAACluster

@pytest.mark.xfail(reason="Can't use job name environment variable as arg")
def test_job_name_as_name(loop):
    with DRMAACluster(scheduler_port=0) as cluster:
        cluster.start_workers(2)
        while len(cluster.scheduler.workers) < 1:
            sleep(0.1)
        # Each worker's reported name should be its DRMAA job ID
        names = {cluster.scheduler.worker_info[w]['name']
                 for w in cluster.scheduler.workers}
        assert names == set(cluster.workers)
* Use bash script rather than dask-worker executable
This allows for a few things:
1. Users can customize this script to their liking, introducing various environment variables as they see fit.
2. We get to use DRMAA/SGE environment variables within this script.
See #14
* add SLURM and LSF JOB/TASK ids as well
* Name log files based on job and task ID
* synchronize with dask/master
* clean up scripts after running
* pip install drmaa
* run test from source directory