Enable multiprocessing groups within project config #10774

richardpaulhudson
Contributor

@richardpaulhudson richardpaulhudson commented May 9, 2022

Description

Enable the specification of a group of commands within a spaCy project workflow that are to be executed in parallel.

Features

  1. Each spaCy projects command is made up of n operating-system-level commands that are executed in series. The spaCy projects commands within a parallel group are executed in parallel, but the operating-system-level commands within each spaCy projects command are still executed in series.
  2. spaCy project workflows support the definition of dependencies and outputs (files created by commands) in order to ensure that spaCy projects commands are not re-executed unnecessarily on consecutive workflow runs. All this functionality works for each command in a parallel group with respect to the rest of the project file in the same way as it does for a serial command. However, the management of dependencies between the members of a parallel group is out of scope: the user is responsible for ensuring that no problems occur.
  3. It is possible to specify a project-file-wide maximum number of parallel processes. If a parallel group contains more commands than this maximum n, only the first n commands within the group are started. Whenever a command completes, the next command in the group is started, and so on until all the commands in the group have executed.
  4. A table is displayed with real-time status information about the executing commands. The status table works in a similar fashion to the docker pull command-line interface.
  5. If any command within a group returns a non-zero return code, the execution of the other commands within the group is terminated.
  6. Console output relating to a command, as well as output produced by the OS-level commands it comprises, is initially redirected to a logfile in a temporary directory that a user can monitor in real time if they so wish. When all the commands in a parallel group have completed, the contents of these logfiles are reproduced in the console. This ensures that output from different commands is not mixed up in the console. It also means that the console output after a parallel group has executed is very similar to the console output that would have resulted from the group being executed in series.
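The scheduling rule in point 3 can be sketched as follows. This is a minimal illustration rather than the code in this PR: it starts plain subprocesses directly instead of worker processes, the name `run_bounded` and the 0.05-second polling interval are assumptions made for the example, and it does not terminate the remaining commands when one fails.

```python
import subprocess
import sys
import time
from typing import Dict, List, Tuple


def run_bounded(
    commands: Dict[str, List[str]], max_parallel: int
) -> Tuple[Dict[str, int], int]:
    """Run commands with at most `max_parallel` of them running at once.

    Returns the return code of each command and the highest number of
    commands that were running simultaneously.
    """
    pending = list(commands.items())  # not yet started, in group order
    running: Dict[str, subprocess.Popen] = {}
    return_codes: Dict[str, int] = {}
    peak = 0
    while pending or running:
        # Start commands until the limit is reached or none are left.
        while pending and len(running) < max_parallel:
            name, argv = pending.pop(0)
            running[name] = subprocess.Popen(argv)
        peak = max(peak, len(running))
        # Collect finished commands, which frees slots for the next ones.
        for name in list(running):
            rc = running[name].poll()
            if rc is not None:
                return_codes[name] = rc
                del running[name]
        time.sleep(0.05)  # hypothetical polling interval
    return return_codes, peak


if __name__ == "__main__":
    sleep_cmd = [sys.executable, "-c", "import time; time.sleep(0.2)"]
    codes, peak = run_bounded({name: sleep_cmd for name in "ABCD"}, max_parallel=2)
    print(codes, peak)
```

With four commands and a maximum of two, the third command is only started once one of the first two has completed.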

Architecture

  1. The main process that is executing the project file uses the multiprocessing module to spawn two or more worker processes, each of which is responsible for a single spaCy project command. Each worker process uses the subprocess module to execute, in series and as subprocesses, the operating-system-level commands that its spaCy project command contains.
  2. Each worker process sends status information about the command and operating-system-level commands that it is executing back to the main process via a queue. This status information includes the process ids of the subprocesses that the worker process starts for each operating-system-level command.
  3. If a subprocess has failed or been terminated, the worker process that started it does not execute any remaining operating-system-level commands within the spaCy project command for which it is responsible.
  4. If the main process receives notice from a worker process that a subprocess has failed, it terminates directly any other subprocesses that are currently running. Each worker process controlling one of these subprocesses detects that its subprocess has been terminated and notifies the main process of this fact. Doing things this way round has the advantage that subprocesses terminated by the main process are detected, treated and reported identically to subprocesses terminated from outside the library, i.e. directly from the operating system console.
  5. The main process maintains a simple state machine that reflects its knowledge of the activities of the worker processes. It uses this knowledge to decide when to start new worker processes or terminate subprocesses, and as the basis for the status table. Note that the state machine maintains a distinction between failed and terminated processes that is only meaningful on POSIX platforms; on Windows, both states are reported as failed/terminated.
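The state machine in point 5 might look roughly like the sketch below. The state names, the transition table and the `CommandStatus` class are assumptions made for illustration and are not taken from the PR code; the only behaviour carried over from the description is that failed and terminated are distinct on POSIX and shown as a single failed/terminated status elsewhere.

```python
from enum import Enum
from typing import Dict, Set


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    TERMINATED = "terminated"


# Allowed transitions (hypothetical); finished states have no successors.
TRANSITIONS: Dict[State, Set[State]] = {
    State.PENDING: {State.RUNNING},
    State.RUNNING: {State.SUCCEEDED, State.FAILED, State.TERMINATED},
    State.SUCCEEDED: set(),
    State.FAILED: set(),
    State.TERMINATED: set(),
}


def state_from_returncode(returncode: int) -> State:
    """Classify a finished subprocess from its return code."""
    if returncode == 0:
        return State.SUCCEEDED
    # On POSIX, subprocess reports -N when the child was ended by signal N.
    return State.TERMINATED if returncode < 0 else State.FAILED


class CommandStatus:
    """What the main process knows about one command in a parallel group."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = State.PENDING

    def change_state(self, new_state: State) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"{self.name}: {self.state.value} -> {new_state.value} not allowed"
            )
        self.state = new_state

    def label(self, posix: bool) -> str:
        """Status table text; non-POSIX platforms merge the two failure states."""
        if not posix and self.state in (State.FAILED, State.TERMINATED):
            return "failed/terminated"
        return self.state.value
```

Keeping the transitions in one table makes it straightforward to reject a status message that arrives in an unexpected order.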

The background to the high-level decision to use subprocess in conjunction with multiprocessing and a status queue is documented here.
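As a rough sketch of how subprocess, multiprocessing and a status queue fit together: the function names `worker` and `run_group` and the dictionary message format are assumptions made for this example and do not reflect the actual code in `parallel.py`. Terminating other subprocesses on failure, the status table and log handling are all left out.

```python
import multiprocessing
import subprocess
import sys
from typing import Any, Dict, List


def worker(name: str, os_commands: List[List[str]], status_queue: Any) -> None:
    """Run one project command's OS-level commands in series, reporting to the queue."""
    for index, argv in enumerate(os_commands):
        proc = subprocess.Popen(argv)
        # Report the pid so that the main process could terminate it directly.
        status_queue.put(
            {"command": name, "type": "started", "index": index, "pid": proc.pid}
        )
        rc = proc.wait()
        if rc != 0:
            # Skip the remaining OS-level commands after a failure or termination.
            status_queue.put({"command": name, "type": "completed", "rc": rc})
            return
    status_queue.put({"command": name, "type": "completed", "rc": 0})


def run_group(group: Dict[str, List[List[str]]]) -> Dict[str, int]:
    """Spawn one worker process per command and collect the final return codes."""
    ctx = multiprocessing.get_context("spawn")  # spawn on every platform
    status_queue = ctx.Queue()
    processes = [
        ctx.Process(target=worker, args=(name, os_commands, status_queue))
        for name, os_commands in group.items()
    ]
    for process in processes:
        process.start()
    return_codes: Dict[str, int] = {}
    while len(return_codes) < len(group):
        message = status_queue.get()
        if message["type"] == "completed":
            return_codes[message["command"]] = message["rc"]
    for process in processes:
        process.join()
    return return_codes
```

Because the workers are spawned, `run_group` has to be called from a script protected by an `if __name__ == "__main__":` guard, for example `run_group({"first": [[sys.executable, "-c", "pass"]]})`.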

Design decisions in need of review

These design decisions reflect various tradeoffs that others may well judge differently; all of them could be altered with minimal effort.

Nr | Decision | Rationale
--- | --- | ---
1 | The console output for each command, having initially been redirected to its own logfile, is reproduced on the console once execution of the parallel group is complete, and there is no option to switch this off. | Reproducing the log output has the advantage that the console output is virtually unchanged whether commands are executed in series or in parallel; an option to switch off reproducing log output would complicate the project file syntax; in most cases commands do not log significant amounts of output, so reproducing it is unproblematic.
2 | The logfiles for the individual commands are written to a temporary directory; the name of each logfile is derived from the name of the command whose output it contains (this is possible because parallel groups are not allowed to contain the same command twice); this behaviour is fixed. | The output is reproduced on the console, which is where most users will view it; adding options around the location of the log files would complicate the project file syntax.
3 | The console output from each worker process is sent to the main process via the queue, although it would also be possible for the main process to read it directly from the file system. | Although this procedure means the output has to be serialized an extra time, the files are expected to remain small, so the improved encapsulation the procedure provides is more important; I confirmed with a short load test that the queue can handle much larger amounts of data than it would ever realistically need to transfer.
4 | STDERR output is treated identically to STDOUT output and there is no option to change this. | In most cases, the most important requirement is that commands being executed in parallel do not output to the console in real time, as this messes up the status table display. STDERR output is only likely to be important when debugging complex issues, in which case the user is in any case likely to get the problematic commands working in series before moving them into a parallel group.
5 | Execution of a parallel group is halted if any command returns a non-zero return code, although there are situations in which a non-zero return code might be expected. | The same is true of spaCy projects in general when commands are executed in series.
6 | Processes are terminated using the SIGTERM signal and there is no option to use alternatives like SIGKILL. | Terminating the other commands in a group when one has failed is a convenience feature. The user's next step is likely to be to debug the failed command on its own, i.e. serially outside a parallel group. If a process in a parallel group fails and some other process in the group cannot be terminated with SIGTERM, it seems most unlikely that the user's response would be to specify a different signal.
7 | The keepalive message interval, maximum width of a parallel command group name within a divider and temporary logfile directory name are all hardcoded in the parallel.py module without any options to change them. | Options to change these parameters would complicate the syntax of the project file; it does not seem likely that anyone would have a reason to want values different from the defaults.
8 | The main process spawns worker processes regardless of the platform, although forking is more efficient on Unix/Linux. | Because the main process does not have a significant memory footprint, the additional cost of spawning is not relevant; on the other hand, it makes sense to keep the behaviour as consistent as possible across platforms.
9 | The execution of serial commands is left as it was, meaning that there are differences in how a command is executed depending on whether it is part of a parallel group or not. | It would be possible to execute each serial command as a 1-member parallel group, but this would unnecessarily complicate what is still the standard way of executing commands, and would also increase the risk of this PR.
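Decisions 1, 2 and 4 can be illustrated together with a small sketch. It is not the PR's implementation: the function names, the `.log` suffix and the divider format are assumptions, and the logfiles are read back from the file system here rather than sent through the queue as described in decision 3.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def run_logged(name: str, os_commands: List[List[str]], log_dir: Path) -> int:
    """Run OS-level commands in series, sending stdout and stderr to one logfile."""
    log_path = log_dir / f"{name}.log"  # logfile name derived from the command name
    with open(log_path, "w", encoding="utf8") as logfile:
        for argv in os_commands:
            # stderr is merged into stdout, so both end up in the same logfile.
            rc = subprocess.run(
                argv, stdout=logfile, stderr=subprocess.STDOUT
            ).returncode
            if rc != 0:
                return rc
    return 0


def replay_logs(names: List[str], log_dir: Path) -> str:
    """Concatenate the logfiles in group order, with one divider per command."""
    parts = []
    for name in names:
        parts.append(f"===== {name} =====")  # hypothetical divider format
        parts.append((log_dir / f"{name}.log").read_text(encoding="utf8"))
    return "\n".join(parts)


if __name__ == "__main__":
    noisy = [sys.executable, "-c", "import sys; print('out'); print('err', file=sys.stderr)"]
    with tempfile.TemporaryDirectory() as tmp:
        run_logged("sleepA", [noisy, noisy], Path(tmp))
        print(replay_logs(["sleepA"], Path(tmp)))
```

Nothing reaches the console while the commands run, which is what keeps a real-time status display intact; the output only appears once the logs are replayed.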

Demonstration

This demonstration has been tested on Linux, macOS and Windows 10. To try out the functionality, create these two files in a directory:

script.py:

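import sys
from time import sleep

# Usage: python script.py <sleep_secs> <return_code>
_, sleep_secs, rc = sys.argv
print("Output before sleep to stdout")
sleep(int(sleep_secs))
print("Output after sleep to stderr", file=sys.stderr)
# Exit with the requested return code; a non-zero value simulates a failing command.
sys.exit(int(rc))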

project.yml:

workflows:
  all:
    - sleepC
    - parallel: [sleepC, sleepA, sleepB, sleepD, sleepE]
    - parallel: [sleepE, sleepA, fail, sleepC, sleepD]
    - sleepB

max_parallel_processes: 2
commands:

  - name: sleepA
    script:
      - "python script.py 2 0"
      - "python script.py 2 0"
      - "python script.py 3 0"
      - "python script.py 4 0"

  - name: sleepB
    script:
     - "python script.py 1 0"
     - "python script.py 2 0"

  - name: sleepC
    script:
    - "python script.py 4 0"

  - name: sleepD
    script:
    - "python script.py 2 0"
    outputs:
    - someOutput

  - name: sleepE
    script:
    - "python script.py 1 0"
    - "python script.py 1 0"
    - "python script.py 1 0"
    - "python script.py 1 0"
    - "python script.py 1 0"

  - name: fail
    script:
    - "python script.py 1 1"

Then type from within the directory:

  1. spacy project run all (successful and then unsuccessful execution of a parallel group; running commands are terminated when another command fails in the second group; command with unchanged output is not run the second time; status table is displayed)
  2. spacy project run all --force
  3. spacy project run all --dry
  4. spacy project run
  5. spacy project run all --help
  6. spacy project document
  7. spacy project dvc (can only be attempted if dvc is already installed; exits with an error message as dvc does not support parallel groups)
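The workflow syntax used in project.yml above mixes plain command names with `parallel` groups. A minimal sketch of how such a workflow could be checked is shown below, assuming the YAML has already been loaded into Python lists and dictionaries. The function `validate_workflow` and its error messages are hypothetical and are not the validation spaCy performs; the duplicate check reflects the rule that a parallel group may not contain the same command twice.

```python
from typing import Any, List, Sequence


def validate_workflow(steps: Sequence[Any], command_names: Sequence[str]) -> List[str]:
    """Return the problems found in one workflow; an empty list means none."""
    errors: List[str] = []
    for position, step in enumerate(steps):
        if isinstance(step, str):
            names = [step]  # a serial step is just a command name
        elif (
            isinstance(step, dict)
            and list(step) == ["parallel"]
            and isinstance(step["parallel"], list)
        ):
            names = step["parallel"]
            duplicates = sorted({n for n in names if names.count(n) > 1})
            if duplicates:
                errors.append(
                    f"step {position}: duplicated in parallel group: {duplicates}"
                )
        else:
            errors.append(
                f"step {position}: expected a command name or a parallel group"
            )
            continue
        for n in names:
            if n not in command_names:
                errors.append(f"step {position}: unknown command {n!r}")
    return errors


if __name__ == "__main__":
    commands = ["sleepA", "sleepB", "sleepC", "sleepD", "sleepE", "fail"]
    workflow = [
        "sleepC",
        {"parallel": ["sleepC", "sleepA", "sleepB", "sleepD", "sleepE"]},
        {"parallel": ["sleepE", "sleepA", "fail", "sleepC", "sleepD"]},
        "sleepB",
    ]
    print(validate_workflow(workflow, commands))  # prints []
```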

Types of change

Enhancement or new feature

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@richardpaulhudson richardpaulhudson added the enhancement (Feature requests and improvements), ⚠️ wip (Work in progress), feat / cli (Feature: Command-line interface) and scaling (Scaling, serving and parallelizing spaCy) labels May 9, 2022
@richardpaulhudson
Contributor Author

@explosion-bot please test_gpu

@explosion-bot
Collaborator

explosion-bot commented May 10, 2022

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/68

@richardpaulhudson richardpaulhudson removed the ⚠️ wip Work in progress label May 10, 2022
@richardpaulhudson richardpaulhudson marked this pull request as ready for review May 10, 2022 08:57
@svlandeg svlandeg self-requested a review May 17, 2022 16:14
Member

@svlandeg svlandeg left a comment

This will be a great feature to have, thanks for looking into it!

As a generic suggestion, when introducing new functionality like this it might be worth showing how it can be used in the PR description. This is useful for reviewers, but also for anyone who might stumble upon this PR at some point in the future :-)

More specifically, I have been trying to experiment with this functionality but couldn't define a good example (in yaml) that works - can you provide one just for quick testing purposes?

@richardpaulhudson richardpaulhudson marked this pull request as draft May 22, 2022 11:50
@richardpaulhudson richardpaulhudson marked this pull request as ready for review May 23, 2022 08:36
@svlandeg
Member

svlandeg commented Jun 9, 2022

I have been trying to experiment with this functionality but couldn't define a good example (in yaml) that works

Retrying my previous experiment with the current PR code does work - some of the error handling you improved must have fixed it. Also thanks for the cool example in the PR description! That's definitely helpful. One comment I have there is that when you execute spacy project run it shows

Available workflows in project.yml
Usage: python -m spacy project run [WORKFLOW]

all setup -> commandA -> commandB

which implies an order between commandA and commandB -> could you look into fixing that as well?

@svlandeg
Member

svlandeg commented Jun 14, 2022

One more thing I was realising is that this change does introduce a potential for breaking compatibility: when users create a .yml with this parallelization for a new version of spaCy, that same .yml will result in errors for older versions of spaCy, and the error is actually quite cryptic:

✘ Invalid project.yml. Double-check that the YAML is correct.
[workflows -> all -> 1] str type expected

I don't think we can do much about that though...

@richardpaulhudson richardpaulhudson marked this pull request as draft June 14, 2022 16:09
@richardpaulhudson
Contributor Author

richardpaulhudson commented Jun 15, 2022

One more thing I was realising is that this change does introduce a potential for breaking compatibility: when users create a .yml with this parallelization for a new version of spaCy, that same .yml will result in errors for older versions of spaCy, and the error is actually quite cryptic:

✘ Invalid project.yml. Double-check that the YAML is correct.
[workflows -> all -> 1] str type expected

I don't think we can do much about that though...

What we could do is to introduce an additional field at the top declaring a schema version number: older versions of spaCy would then fail because they wouldn't recognise the new field and this would be less cryptic for the user. However, it would also mean that all config files would no longer be backwards-compatible rather than just config files that involve parallelism, which would be an unacceptably large price to pay for a slightly clearer error message. I think the best course of action is just to live with this.

UPDATE: during internal discussions we considered various fields (like the max_parallel_processes field) as a possible route to a more user-friendly error message. In fact it turns out that an unrecognized top-level field in the config file does not trigger an error, so that there really is no other option than to live with the current situation.

@svlandeg svlandeg removed their request for review June 20, 2022 09:25
@svlandeg svlandeg requested review from njsmith, honnibal and svlandeg and removed request for njsmith August 16, 2022 14:09
@richardpaulhudson
Contributor Author

@explosion-bot please test_gpu

@explosion-bot
Collaborator

explosion-bot commented Oct 4, 2022

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/113

@svlandeg svlandeg added the projects spaCy projects and project templates label Oct 10, 2022
@adrianeboyd adrianeboyd changed the base branch from v3.4.x to master October 21, 2022 06:29
@richardpaulhudson richardpaulhudson marked this pull request as ready for review January 23, 2023 19:33
@svlandeg svlandeg requested review from svlandeg and removed request for svlandeg March 28, 2023 18:15
@svlandeg
Member

svlandeg commented Jun 15, 2023

I'm adding this one to our internal backlog - when we continue working on this, we'll need to move this into 'weasel'.

@svlandeg svlandeg closed this Jun 15, 2023
Labels
enhancement (Feature requests and improvements), feat / cli (Feature: Command-line interface), projects (spaCy projects and project templates), scaling (Scaling, serving and parallelizing spaCy)
7 participants