New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable multiprocessing groups within project config #10774
Enable multiprocessing groups within project config #10774
Conversation
@explosion-bot please test_gpu |
URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/68 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be a great feature to have, thanks for looking into it!
As a generic suggestion, when introducing new functionality like this it might be worth showing how it can be used in the PR description. This is useful for reviewers, but also for anyone who might stumble upon this PR at some point in the future :-)
More specifically, I have been trying to experiment with this functionality but couldn't define a good example (in yaml) that works - can you provide one just for quick testing purposes?
Retrying my previous experiment with the current PR code does work - some of the error handling you improved must have fixed it. Also thanks for the cool example in the PR description! That's definitely helpful. One comment I have there is that when you execute
which implies an order between |
One more thing I was realising is that this change does introduce a potential for breaking compatibility: when users create a .yml with this parallelization for a new version of spaCy, that same .yml will result in errors for older versions of spaCy, and the error is actually quite cryptic:
I don't think we can do much about that though... |
What we could do is to introduce an additional field at the top declaring a schema version number: older versions of spaCy would then fail because they wouldn't recognise the new field and this would be less cryptic for the user. However, it would also mean that all config files would no longer be backwards-compatible rather than just config files that involve parallelism, which would be an unacceptably large price to pay for a slightly clearer error message. I think the best course of action is just to live with this. UPDATE: during internal discussions we considered various fields (like the max_parallel_processes field) as a possible route to a more user-friendly error message. In fact it turns out that an unrecognized top-level field in the config file does not trigger an error, so that there really is no other option than to live with the current situation. |
@explosion-bot please test_gpu |
URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/113 |
I'm adding this one to our internal backlog - when we continue working on this, we'll need to move this into 'weasel'. |
Description
Enable the specification of a group of commands within a spaCy project workflow that are to be executed in parallel.
Features
n
operating-system-level commands that are executed in series. The spaCy projects commands within a parallel group are executed in parallel, but the operating-system-level commands within each spaCy projects command are still executed in series.n
, only the firstn
commands within the group are started. Whenever a command completes, the next command in the group is started, and so on until all the commands in the group have executed.docker pull
command-line interface.Architecture
multiprocessing
module to spawn two or more worker processes, each of which is responsible for a single spaCy project command, and each of which executes — in series, as subprocesses, and using thesubprocess
module — the operating-system-level commands that it contains.failed
andterminated
processes that is only meainingful on POSIX platforms; on Windows, both states are reported asfailed/terminated
.The background to the high-level decision to use
subprocess
in conjunction withmultiprocessing
and a status queue is documented here.Design decisions in need of review
These design decisions reflect various tradeoffs that others may well judge differently; all of them could be altered with minimal effort.
STDERR
output is treated identically toSTDOUT
output and there is no option to change this.STDERR
output is only likely to be important when debugging complex issues, in which case the user is in any case likely to get the problematic commands working in series before moving them into a parallel group.SIGTERM
signal and there is no option to use alternatives likeSIGKILL
.SIGTERM
would be to specify a different signal.parallel.py
module without any options to change them.Demonstration
This demonstration has been tested on Linux, macOS and Windows 10. To try out the functionality, create these two files in a directory:
script.py
:project.yml
:Then type from within the directory:
spacy project run all
(successful and then unsuccessful execution of a parallel group; running commands are terminated when another command fails in the second group; command with unchanged output is not run the second time; status table is displayed)spacy project run all --force
spacy project run all --dry
spacy project run
spacy project run all --help
spacy project document
spacy project dvc
(can only be attempted ifdvc
is already installed; exits with an error message asdvc
does not support parallel groups)Types of change
Enhancement or new feature
Checklist