Skip to content

Adds the ability to provide a list of worker options to SSHCluster similar to connect_options#5044

Open
BENR0 wants to merge 7 commits intodask:mainfrom
BENR0:feat_sshcluster_list_worker_options
Open

Adds the ability to provide a list of worker options to SSHCluster similar to connect_options#5044
BENR0 wants to merge 7 commits intodask:mainfrom
BENR0:feat_sshcluster_list_worker_options

Conversation

@BENR0
Copy link
Copy Markdown

@BENR0 BENR0 commented Jul 9, 2021

I ran tests but they did not finish (they break off on distributed/deploy/tests/test_ssh.py::test_defer_to_old. When skipping the test for the feature does not pass but other tests using SSHCluster fail to with a PermissionDenied error for asyncssh. I am not really familiar with testing async routines so I have no solution.

Black passes, flake8 has one SyntaxError which was not introduced by me.

Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BENR0! Apologies for all the unrelated test failures, those should be resolved soon. Though the distributed/deploy/tests/test_ssh.py::test_list_of_worker_options failure does look relevant.

cc @jacobtomlinson

@BENR0
Copy link
Copy Markdown
Author

BENR0 commented Jul 15, 2021

It seems that the dicts in the client.scheduler_info()["workers"] values have no nprocs key even though I think nprocs is a valid argument to worker_options. I don't know if this can not be tested then. I will remove the test for nprocs and hopefully the test will pass then.

Copy link
Copy Markdown
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Made a couple of docstrings suggestions but this looks ready to go!

Comment thread distributed/deploy/ssh.py Outdated
Comment thread distributed/deploy/ssh.py Outdated
@jacobtomlinson
Copy link
Copy Markdown
Member

Thanks @BENR0! Happy to merge on passing tests.

Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the dicts in the client.scheduler_info()["workers"] values have no nprocs key even though I think nprocs is a valid argument to worker_options. I don't know if this can not be tested then. I will remove the test for nprocs and hopefully the test will pass then.

nprocs is a somewhat special option in that it specifies the number of worker processes, while nthreads specifies the size of the threadpool for each worker. So in the test you've added you have

worker_options=[{"nprocs": 4, "nthreads": 1}, {"nprocs": 2, "nthreads": 1}]

for which I would expect there to be 6 total workers, each of which have a nthreads=1. Would you mind adding an additional assert on the total number of workers. This should be something like

assert len(client.scheduler_info()["workers"]) == 6

though I haven't tested the above snippet at all

Comment thread distributed/deploy/tests/test_ssh.py Outdated
result = await client.submit(lambda x: x + 1, 10)
assert result == 11
d = client.scheduler_info()["workers"]
assert all(v["threads"] == 1 for v in d.values())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert all(v["threads"] == 1 for v in d.values())
assert all(v["nthreads"] == 1 for v in d.values())

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. Added the test.

@jrbourbeau
Copy link
Copy Markdown
Member

I added an extra await client.wait_for_workers(6) as a check to make sure the cluster has 6 workers. Interestingly, we get past that line in the test but things fail later when the cluster attempts to update its current state. @jacobtomlinson any thoughts about what might be causing this?

@jacobtomlinson
Copy link
Copy Markdown
Member

I have spent some time today trying to reproduce this locally. I haven't been able to get the test to pass, but for me it gets stuck at the wait for workers call.

The test is failing consistently on Linux (and skipped on mac and windows).

Having the test fail after wait for workers suggests that maybe the idle-timeout is causing the scheduler to close. Perhaps extending that timeout will avoid a race condition?

@BENR0 BENR0 requested a review from fjetter as a code owner January 23, 2024 10:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SSHCluster multiple worker options

3 participants