Fix TMCS starts too many processes and dies #329
Conversation
It is almost the same as the one from the base Executor class, but it escapes the start characters because Sphinx complains about a starting emphasis character without a matching ending character.
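For context, this is the kind of escaping Sphinx needs in an RST docstring; a hypothetical sketch, not the PR's actual class or wording:

```python
from concurrent.futures import Executor, Future


class RayExecutor(Executor):
    def submit(self, fn, *args, **kwargs) -> Future:
        r"""Submit a callable to be executed with the given arguments.

        The escaped asterisks (``\*args``) keep Sphinx from reading them as
        the start of an emphasis span without a matching end.

        :param fn: the callable to execute.
        :param \*args: positional arguments passed on to ``fn``.
        :param \**kwargs: keyword arguments passed on to ``fn``.
        """
        raise NotImplementedError  # placeholder: only the docstring matters here
```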
Besides the cancelling of tasks, this PR has highlighted our inconsistent (and possibly bogus) use of n_jobs everywhere. I think we need to fix it.

More generally, I think that we are not really using ray as it's supposed to be used. For one, ParallelConfig.n_local_workers is used for num_cpus in ray.init(), which does not have the effect we document: instead it's the number of CPUs for a "raylet" (which I guess is the number of CPUs available to each node). That is fine when starting a local cluster, but probably not when using an existing one.

What do you think about the idea of setting max_workers in the parallel config (with it being a no-op for ray, or maybe a check against the number of nodes available for a running cluster), then using n_jobs as the number of tasks, and setting num_cpus to 1 in the call to ray.remote()?
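Roughly, the proposal would look like this; a sketch with made-up numbers (max_workers, n_jobs and the task body are placeholders, not the PR's code):

```python
import ray

max_workers = 4  # hypothetical: CPUs for a locally started cluster
n_jobs = 8       # hypothetical: number of tasks submitted at once

# num_cpus here sizes the local raylet when we start our own cluster;
# when connecting to an existing cluster (address=...) it would be omitted.
ray.init(num_cpus=max_workers)

# Each task reserves exactly one CPU, so n_jobs (not num_cpus) controls
# how much work is in flight.
@ray.remote(num_cpus=1)
def square(x: int) -> int:
    return x * x

futures = [square.remote(i) for i in range(n_jobs)]
print(ray.get(futures))
```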
…te_period, rename n_concurrent_computations to max_workers
- Make RayExecutor respect max_workers
- Use a local thread and queues to manage work items and submit futures
- Make init_executor take max_workers as n_workers from parallel config
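A minimal sketch of that work-item pattern, illustrative only: a background thread drains a queue and keeps at most max_workers tasks in flight. Here an inner ThreadPoolExecutor stands in for the ray backend; the class and method names are made up.

```python
import queue
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class QueueingExecutor:
    """Illustrative sketch: a background thread drains a queue of work items
    and keeps at most max_workers tasks in flight on an inner backend."""

    def __init__(self, max_workers: int):
        self._inner = ThreadPoolExecutor(max_workers)
        self._slots = threading.Semaphore(max_workers)
        self._work_items: queue.Queue = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, fn, *args, **kwargs) -> Future:
        outer: Future = Future()
        self._work_items.put((outer, fn, args, kwargs))
        return outer

    def _drain(self) -> None:
        while True:
            outer, fn, args, kwargs = self._work_items.get()
            self._slots.acquire()  # wait until fewer than max_workers tasks run
            inner = self._inner.submit(fn, *args, **kwargs)
            inner.add_done_callback(self._make_callback(outer))

    def _make_callback(self, outer: Future):
        def _copy_result(inner: Future) -> None:
            self._slots.release()
            exc = inner.exception()
            if exc is not None:
                outer.set_exception(exc)
            else:
                outer.set_result(inner.result())

        return _copy_result
```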
@mdbenito let's discuss this during the next meeting.
It was not used.
I still think that there are some inconsistencies wrt. max_workers. The number of CPUs available in the cluster is an external factor over which the code has no control. So we must ignore that, in particular in ray.init(), where num_cpus does not refer to the number of CPUs used for a local cluster.
max_workers could then be used as either of:
- the maximum number of vCPUs to be used by the executor (n_jobs * n_cpus_per_job)
- the maximum number of tasks to be run by the executor (so that effective_cpus_used = max_workers * n_cpus_per_job).
We need to fix the names once and for all:
- task = job
- worker = single-core process = CPU
I find the second one horrible, but that seems to be ray's convention, right? We don't have to follow it though: in the ParallelConfig and elsewhere we could use max_cpus instead of max_workers. The question is then what to do when we allow for additional resources like GPUs.
@mdbenito I read the ray documentation and architecture more thoroughly and here's what I found:
I finally understand this better. Thanks for the link to that document. Here's what I suggest:
What do you think?
…s kwargs
This is done because mypy complains if we don't have the same signature as the base Executor class.
Description
This PR closes #292
It does so by using an abstraction based on concurrent.futures instead of using actors.
I first tried to use ray queues to avoid passing the coordinator to the workers, but they also rely on an actor and it kept dying.
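For illustration, usage would mirror concurrent.futures. A rough sketch, where the import paths, the backend argument and the exact signature of init_executor are assumptions rather than the PR's final API:

```python
from concurrent.futures import as_completed

# Import paths and signatures below are assumptions for illustration only.
from pydvl.utils.config import ParallelConfig
from pydvl.utils.parallel import init_executor

config = ParallelConfig(backend="ray")

with init_executor(max_workers=4, config=config) as executor:
    futures = [executor.submit(pow, 2, n) for n in range(8)]
    for future in as_completed(futures):
        print(future.result())
```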
Changes
- Add RayExecutor class based on concurrent.futures to the parallel package.

EDIT More changes:

- Remove n_concurrent_computations and use n_jobs instead of it to mean the number of tasks to submit before waiting for results.
- Replace n_local_workers in the ParallelConfig class with n_workers and use that to set the number of max_workers in the given Executor.
- Add a __post_init__ method to ParallelConfig to make sure that n_workers is None when address is set.
- Use a ThreadPoolExecutor with max_workers=1.

EDIT 2 More changes:

- Add an n_cpus_per_job field to ParallelConfig.
- Add a cancel_futures_on_exit boolean parameter to RayExecutor.

EDIT 3 More changes:

- Rename n_workers in ParallelConfig to n_cpus_local to align more closely with its meaning in ray.
- Remove n_cpus_per_job from ParallelConfig and pass it instead as an option to the executor's submit method as part of the kwargs parameter. Otherwise mypy will complain that the method does not have the same signature as the one defined in the base Executor class.
- Use max_workers in RayExecutor as the maximum number of submitted jobs, taking its value from n_jobs instead of n_workers (which was renamed to n_cpus_local).
- Use 2 * effective_n_jobs to represent the total number of submitted jobs, including the jobs that are running.

Checklist
"nbsphinx":"hidden"