Enable multiprocessing groups within project config #10774

richardpaulhudson · 2022-05-09T15:33:39Z

Description

Enable the specification of a group of commands within a spaCy project workflow that are to be executed in parallel.

Features

Each spaCy projects command is made up of n operating-system-level commands that are executed in series. The spaCy projects commands within a parallel group are executed in parallel, but the operating-system-level commands within each spaCy projects command are still executed in series.
spaCy project workflows support the definition of dependencies and outputs (files created by commands) in order to ensure that spaCy projects commands are not re-executed unnecessarily on consecutive workflow runs. All this functionality works for each command in a parallel group with respect to the rest of the project file in the same way as it does for a serial command. However, the management of dependencies between the members of a parallel group is out of scope: the user is responsible for ensuring that no problems occur.
It is possible to specify a project-file-wide maximum number of parallel processes. If a parallel group contains more commands than this maximum n, only the first n commands within the group are started. Whenever a command completes, the next command in the group is started, and so on until all the commands in the group have executed.
A table is displayed with real-time status information about the executing commands. The status table works in a similar fashion to the docker pull command-line interface.
If any command within a group returns a non-zero return code, the execution of the other commands within the group is terminated.
Console output relating to a command, together with the output of the OS-level commands it comprises, is initially redirected to a logfile in a temporary directory, which the user can monitor in real time if they wish. When all the commands in a parallel group have completed, the contents of these logfiles are reproduced in the console. This ensures that output from different commands is not interleaved in the console. It also means that the console output after a parallel group has executed is very similar to the console output that would have resulted from executing the group in series.
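To make the scheduling and logging behaviour described above more concrete, here is a minimal sketch. It is not the code in this PR: `run_group` and its arguments are hypothetical names, it uses `subprocess` directly rather than worker processes, and it deliberately leaves out the status table and the termination of other commands on failure. It only illustrates the cap on simultaneously running commands and the per-command logfiles that are read back once the group has finished.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def run_group(commands, max_parallel):
    """Run named commands with at most `max_parallel` running at once.

    `commands` maps a command name to an argument list. stdout and stderr of
    each command go to one logfile per command; the log contents are only
    returned after the whole group has finished.
    """
    log_dir = Path(tempfile.mkdtemp())
    pending = list(commands.items())
    running = {}  # name -> (Popen object, open logfile handle)
    return_codes = {}
    while pending or running:
        # Start further commands until the limit is reached.
        while pending and len(running) < max_parallel:
            name, args = pending.pop(0)
            log = open(log_dir / f"{name}.log", "w")
            proc = subprocess.Popen(args, stdout=log, stderr=subprocess.STDOUT)
            running[name] = (proc, log)
        # Collect finished commands, which frees slots for pending ones.
        for name, (proc, log) in list(running.items()):
            if proc.poll() is not None:
                log.close()
                return_codes[name] = proc.returncode
                del running[name]
        time.sleep(0.05)
    # Read the logs back in the order in which the commands were listed.
    output = {name: (log_dir / f"{name}.log").read_text() for name in commands}
    return return_codes, output


# Three trivial commands, at most two of which run at the same time.
demo = {
    name: [sys.executable, "-c", f"print('hello from {name}')"]
    for name in ("a", "b", "c")
}
codes, logs = run_group(demo, max_parallel=2)
for name, text in logs.items():
    print(f"--- {name} (rc={codes[name]}) ---")
    print(text, end="")
```

Because each command writes to its own file, the replayed output is grouped by command regardless of the order in which the commands actually finished.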

Architecture

The main process that executes the project file uses the multiprocessing module to spawn two or more worker processes. Each worker process is responsible for a single spaCy projects command and executes the operating-system-level commands that it contains in series, as subprocesses, using the subprocess module.
Each worker process sends status information about the command and operating-system-level commands that it is executing back to the main process via a queue. This status information includes the process ids of the subprocesses that the worker process starts for each operating-system level command.
If a subprocess has failed or been terminated, the worker process that started it does not execute any remaining operating-system-level commands within the spaCy project command for which it is responsible.
If the main process receives notice from a worker process that a subprocess has failed, it terminates directly any other subprocesses that are currently running. Each worker process controlling one of these subprocesses detects that its subprocess has been terminated and notifies the main process of this fact. Doing things this way round has the advantage that subprocesses terminated by the main process are detected, treated and reported identically to subprocesses terminated from outside the library, i.e. directly from the operating system console.
The main process maintains a simple state machine that reflects its knowledge of the activities of the worker processes. It uses this knowledge to decide when to start new worker processes or terminate subprocesses, and as the basis for the status table. Note that the state machine maintains a distinction between failed and terminated processes that is only meaningful on POSIX platforms; on Windows, both states are reported as failed/terminated.
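The worker side of this architecture can be sketched roughly as follows. This is an illustration under assumptions, not the implementation in `parallel.py`: the function name `run_command` and the message fields (`command`, `index`, `status`, `pid`, `rc`) are invented for the example. In the PR the worker runs in its own process and the queue is a multiprocessing queue; here a plain `queue.Queue` stands in for it so that the sketch runs in a single process.

```python
import queue
import subprocess
import sys


def run_command(name, os_commands, status_queue):
    """Run the OS-level commands of one project command in series.

    Each state change is reported by putting a dict on `status_queue`.
    Execution stops at the first OS-level command that returns non-zero.
    """
    for index, args in enumerate(os_commands):
        proc = subprocess.Popen(args)
        # Report the PID so that the receiver could terminate the
        # subprocess directly if another command in the group fails.
        status_queue.put(
            {"command": name, "index": index, "status": "started", "pid": proc.pid}
        )
        rc = proc.wait()
        if rc != 0:
            status_queue.put(
                {"command": name, "index": index, "status": "failed", "rc": rc}
            )
            return  # remaining OS-level commands are not executed
    status_queue.put({"command": name, "status": "completed", "rc": 0})


# The second OS-level command fails, so the third one is never started.
ok = [sys.executable, "-c", "pass"]
bad = [sys.executable, "-c", "import sys; sys.exit(3)"]
messages = queue.Queue()
run_command("demo", [ok, bad, ok], messages)
while not messages.empty():
    print(messages.get())
```

The receiver only ever sees messages, which is what allows it to treat a subprocess terminated by the main process in the same way as one terminated from the operating system console.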

The background to the high-level decision to use subprocess in conjunction with multiprocessing and a status queue is documented here.

Design decisions in need of review

These design decisions reflect various tradeoffs that others may well judge differently; all of them could be altered with minimal effort.

Nr	Decision	Rationale
1.	The console output for each command, having initially been redirected to its own logfile, is reproduced on the console once execution of the parallel group is complete, and there is no option to switch this off.	Reproducing the log output has the advantage that the console output is virtually unchanged whether commands are executed in series or in parallel; an option to switch off reproducing log output would complicate the project file syntax; in most cases commands do not log significant amounts of output, so reproducing it is unproblematic.
2.	The logfiles for the individual commands are written to a temporary directory; the name of each logfile is derived from the name of the command whose output it contains (this is possible because parallel groups are not allowed to contain the same command twice); this behaviour is fixed.	The output is reproduced on the console which is where most users will view it; adding options around the location of the log files would complicate the project file syntax.
3.	The console output from each worker process is sent to the main process via the queue, although it would also be possible for the main process to read it directly from the file system.	Although this procedure means the output has to be serialized an extra time, the files are expected to remain small, so the improved encapsulation the procedure provides is more important; I confirmed with a short load test that the queue can handle much larger amounts of data than it would ever realistically need to transfer.
4.	`STDERR` output is treated identically to `STDOUT` output and there is no option to change this.	In most cases, the most important requirement is that commands being executed in parallel do not output to the console in real time, as this disrupts the status table display. `STDERR` output is only likely to be important when debugging complex issues, and in that situation the user is likely to get the problematic commands working in series before moving them into a parallel group.
5.	Execution of a parallel group is halted if any command returns a non-zero return code, although there are situations in which a non-zero return code might be expected.	The same is true of spaCy projects in general when commands are executed in series.
6.	Processes are terminated using the `SIGTERM` signal and there is no option to use alternatives like `SIGKILL`.	Terminating the other commands in a group when one has failed is a convenience feature. The user's next step is likely to be to debug the failed command on its own, i.e. serially outside a parallel group. It seems most unlikely that a user's response to a process within a parallel group failing and it not having been possible to terminate some other process in the group with `SIGTERM` would be to specify a different signal.
7.	The keepalive message interval, maximum width of a parallel command group name within a divider and temporary logfile directory name are all hardcoded in the `parallel.py` module without any options to change them.	Options to change these parameters would complicate the syntax of the project file; it does not seem likely that anyone would have a reason to want values different from the defaults.
8.	The main process spawns worker processes regardless of the platform, although forking is more efficient on Unix/Linux.	Because the main process does not have a significant memory footprint, the additional cost of spawning is not relevant; on the other hand, it makes sense to keep the behaviour as consistent as possible across platforms.
9.	The execution of serial commands is left as it was, meaning that there are differences in how a command is executed depending on whether it is part of a parallel group or not.	It would be possible to execute each serial command as a 1-member parallel group, but this would unnecessarily complicate what is still the standard way of executing commands, and would also increase the risk of this PR.
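Decision 6 and the failed/terminated distinction mentioned under Architecture can be illustrated with a short sketch. The helper names `terminate` and `classify` are hypothetical and are not taken from the PR. The sketch relies on two documented standard-library behaviours: `os.kill` raises an `OSError` subclass if the process cannot be signalled, and on POSIX `subprocess` reports a child killed by signal N with the return code -N.

```python
import os
import signal
import subprocess
import sys


def terminate(pid):
    """Send SIGTERM to `pid`; return False if it could not be signalled."""
    try:
        os.kill(pid, signal.SIGTERM)
        return True
    except OSError:
        # The process has already exited, or the OS refused the request.
        return False


def classify(returncode):
    """Map a subprocess return code to a status string."""
    if returncode == 0:
        return "succeeded"
    if os.name == "posix":
        # A negative return code means the child was killed by a signal.
        return "terminated" if returncode < 0 else "failed"
    # Elsewhere the two cases cannot be told apart from the return code.
    return "failed/terminated"


# Start a long-running subprocess and terminate it by PID.
sleeper = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
terminate(sleeper.pid)
sleeper.wait()
print(classify(sleeper.returncode))
```

On Windows, `os.kill` with `SIGTERM` ends the process unconditionally and the resulting return code is positive, which is why a single combined status is the most that can be reported there.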

Demonstration

This demonstration has been tested on Linux, macOS and Windows 10. To try out the functionality, create these two files in a directory:

script.py:

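import sys
from time import sleep

# Usage: python script.py <seconds to sleep> <return code>
_, sleep_secs, rc = sys.argv
print("Output before sleep to stdout")
sleep(int(sleep_secs))
print("Output after sleep to stderr", file=sys.stderr)
# Exit with the requested return code so that a failure can be simulated
sys.exit(int(rc))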

project.yml:

workflows:
  all:
    - sleepC
    - parallel: [sleepC, sleepA, sleepB, sleepD, sleepE]
    - parallel: [sleepE, sleepA, fail, sleepC, sleepD]
    - sleepB

max_parallel_processes: 2
commands:

  - name: sleepA
    script:
    - "python script.py 2 0"
    - "python script.py 2 0"
    - "python script.py 3 0"
    - "python script.py 4 0"

  - name: sleepB
    script:
    - "python script.py 1 0"
    - "python script.py 2 0"

  - name: sleepC
    script:
    - "python script.py 4 0"

  - name: sleepD
    script:
    - "python script.py 2 0"
    outputs:
    - someOutput

  - name: sleepE
    script:
    - "python script.py 1 0"
    - "python script.py 1 0"
    - "python script.py 1 0"
    - "python script.py 1 0"
    - "python script.py 1 0"

  - name: fail
    script:
    - "python script.py 1 1"

Then type from within the directory:

spacy project run all (successful and then unsuccessful execution of a parallel group; running commands are terminated when another command fails in the second group; command with unchanged output is not run the second time; status table is displayed)
spacy project run all --force
spacy project run all --dry
spacy project run
spacy project run all --help
spacy project document
spacy project dvc (can only be attempted if dvc is already installed; exits with an error message as dvc does not support parallel groups)
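For readers who want to reason about the workflow syntax used in project.yml above, the following sketch shows one way the `all` workflow could be interpreted once the YAML has been loaded. It is an assumption-laden illustration rather than the PR's parsing code: `split_workflow` is a hypothetical helper, and the workflow is written out as the plain Python list that a YAML loader would be expected to return. The duplicate check mirrors the rule stated in design decision 2 that a parallel group may not contain the same command twice.

```python
def split_workflow(steps):
    """Normalise workflow steps into (kind, [command names]) tuples.

    A plain string is a serial step; a dict whose only key is "parallel"
    is a parallel group. Duplicate names within one group are rejected.
    """
    result = []
    for step in steps:
        if isinstance(step, str):
            result.append(("serial", [step]))
        elif isinstance(step, dict) and list(step) == ["parallel"]:
            names = list(step["parallel"])
            if len(set(names)) != len(names):
                raise ValueError(f"Duplicate command in parallel group: {names}")
            result.append(("parallel", names))
        else:
            raise ValueError(f"Unrecognised workflow step: {step!r}")
    return result


# The "all" workflow from project.yml, as a YAML loader would return it.
workflow = [
    "sleepC",
    {"parallel": ["sleepC", "sleepA", "sleepB", "sleepD", "sleepE"]},
    {"parallel": ["sleepE", "sleepA", "fail", "sleepC", "sleepD"]},
    "sleepB",
]

for kind, names in split_workflow(workflow):
    print(kind, names)
```

Note that the same command (for example sleepC) may appear both as a serial step and inside a group; only repetition within a single group is rejected.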

Types of change

Enhancement or new feature

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

richardpaulhudson · 2022-05-10T08:32:43Z

@explosion-bot please test_gpu

explosion-bot · 2022-05-10T08:33:22Z

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/68

svlandeg

This will be a great feature to have, thanks for looking into it!

As a generic suggestion, when introducing new functionality like this it might be worth showing how it can be used in the PR description. This is useful for reviewers, but also for anyone who might stumble upon this PR at some point in the future :-)

More specifically, I have been trying to experiment with this functionality but couldn't define a good example (in yaml) that works - can you provide one just for quick testing purposes?

spacy/cli/_util.py

spacy/cli/project/run.py

spacy/cli/project/dvc.py

svlandeg · 2022-06-09T10:23:43Z

> I have been trying to experiment with this functionality but couldn't define a good example (in yaml) that works

Retrying my previous experiment with the current PR code does work - some of the error handling you improved must have fixed it. Also thanks for the cool example in the PR description! That's definitely helpful. One comment I have there is that when you execute spacy project run it shows

Available workflows in project.yml
Usage: python -m spacy project run [WORKFLOW]

all setup -> commandA -> commandB

which implies an order between commandA and commandB -> could you look into fixing that as well?
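One possible way to render a workflow without implying an order inside a parallel group is sketched below. The bracket-and-pipe notation and the function name `describe_workflow` are invented for this illustration and are not necessarily what the PR ended up displaying. The example input assumes that the workflow in question consisted of `setup` followed by a parallel group containing `commandA` and `commandB`.

```python
def describe_workflow(steps):
    """Render workflow steps so that parallel groups are visually distinct."""
    parts = []
    for step in steps:
        if isinstance(step, str):
            parts.append(step)
        else:
            # Members of a parallel group are shown side by side, not chained.
            parts.append("[" + " | ".join(step["parallel"]) + "]")
    return " -> ".join(parts)


print(describe_workflow(["setup", {"parallel": ["commandA", "commandB"]}]))
# prints: setup -> [commandA | commandB]
```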

svlandeg · 2022-06-14T13:29:16Z

One more thing I was realising is that this change does introduce a potential for breaking compatibility: when users create a .yml with this parallelization for a new version of spaCy, that same .yml will result in errors for older versions of spaCy, and the error is actually quite cryptic:

✘ Invalid project.yml. Double-check that the YAML is correct.
[workflows -> all -> 1] str type expected

I don't think we can do much about that though...

richardpaulhudson · 2022-06-15T15:07:22Z

> One more thing I was realising is that this change does introduce a potential for breaking compatibility: when users create a .yml with this parallelization for a new version of spaCy, that same .yml will result in errors for older versions of spaCy, and the error is actually quite cryptic:
> ✘ Invalid project.yml. Double-check that the YAML is correct.
> [workflows -> all -> 1] str type expected
> I don't think we can do much about that though...

What we could do is to introduce an additional field at the top declaring a schema version number: older versions of spaCy would then fail because they wouldn't recognise the new field and this would be less cryptic for the user. However, it would also mean that all config files would no longer be backwards-compatible rather than just config files that involve parallelism, which would be an unacceptably large price to pay for a slightly clearer error message. I think the best course of action is just to live with this.

UPDATE: during internal discussions we considered various fields (like the max_parallel_processes field) as a possible route to a more user-friendly error message. In fact it turns out that an unrecognized top-level field in the config file does not trigger an error, so that there really is no other option than to live with the current situation.

richardpaulhudson · 2022-10-04T16:18:12Z

@explosion-bot please test_gpu

explosion-bot · 2022-10-04T16:18:48Z

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/113

svlandeg · 2023-06-15T22:40:57Z

I'm adding this one to our internal backlog - when we continue working on this, we'll need to move this into 'weasel'.

richardpaulhudson added 2 commits May 9, 2022 12:50

Permit multiprocessing groups in YAML

8d08a68

Basic multiprocessing functionality

12e8600

richardpaulhudson added the enhancement, ⚠️ wip, feat / cli and scaling labels May 9, 2022

richardpaulhudson added 4 commits May 9, 2022 19:03

Mypy corrections

e3b4ee7

Secondary functionality and documentation

a2bd489

Fixed formatting issues

8c8b81a

Corrections

a481698

richardpaulhudson removed the ⚠️ wip Work in progress label May 10, 2022

richardpaulhudson marked this pull request as ready for review May 10, 2022 08:57

svlandeg self-requested a review May 17, 2022 16:14

svlandeg reviewed May 20, 2022

spacy/cli/_util.py (outdated)

spacy/cli/_util.py (outdated)

spacy/cli/_util.py (outdated)

Changes after review

ae82568

richardpaulhudson commented May 20, 2022

spacy/cli/project/run.py (outdated)

richardpaulhudson marked this pull request as draft May 22, 2022 11:50

Changes based on review

4daffdd

richardpaulhudson commented May 23, 2022

spacy/cli/project/dvc.py (outdated)

richardpaulhudson commented May 23, 2022

spacy/cli/project/dvc.py (outdated)

Readability improvement

2eb13f2

richardpaulhudson marked this pull request as ready for review May 23, 2022 08:36

richardpaulhudson requested a review from svlandeg May 23, 2022 09:18

richardpaulhudson marked this pull request as draft June 14, 2022 16:09

svlandeg removed their request for review June 20, 2022 09:25

svlandeg requested review from njsmith, honnibal and svlandeg and removed request for njsmith August 16, 2022 14:09

richardpaulhudson added 2 commits August 24, 2022 19:14

Widened errors caught from os.kill()

1bf82db

Revert to diagnose error

78ee9c3

svlandeg removed request for honnibal, svlandeg and njsmith August 30, 2022 15:24

richardpaulhudson added 9 commits October 4, 2022 09:45

Merge branch 'master' into feature/projects-multiprocessing

5aa95ce

Copied changes from spaCy/tmp/project-multiprocess

70fa1ce

Improve error logging

afba051

Correction

b48f2e1

Handle PermissionError in Windows CI

522b0ed

Correction

c8b7912

Switch to use TemporaryDirectory

786473d

Merge branch 'explosion:master' into feature/projects-multiprocessing

b8a299f

Use mkdtemp()

2cc2cc1

svlandeg added the projects spaCy projects and project templates label Oct 10, 2022

adrianeboyd changed the base branch from v3.4.x to master October 21, 2022 06:29

Merge branch 'master' into feature/projects-multiprocessing

cfaa902

richardpaulhudson marked this pull request as ready for review January 23, 2023 19:33

Empty commit to trigger CI

b3bcfe5

svlandeg requested review from svlandeg and removed request for svlandeg March 28, 2023 18:15

svlandeg closed this Jun 15, 2023
