Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly clean up multiprocessing workers #13259

Merged
merged 1 commit into from Jan 23, 2024

Conversation

danieldk
Copy link
Contributor

@danieldk danieldk commented Jan 22, 2024

Description

Before this change, the workers of pipe call with n_process != 1 were
stopped by calling terminate on the processes. However, terminating a
process can leave queues, pipes, and other concurrent data structures in
an invalid state.

With this change, we stop using terminate and take the following approach
instead:

  • When the all documents are processed, the parent process puts a
    sentinel in the queue of each worker.
  • The parent process then calls join on each worker process to
    let them finish up gracefully.
  • Worker processes break from the queue processing loop when the
    sentinel is encountered, so that they exit.

We need special handling when one of the workers encounters an error and
the error handler is set to raise an exception. In this case, we cannot
rely on the sentinel to finish all workers -- the queue is a FIFO queue
and there may be other work queued up before the sentinel. We use the
following approach to handle error scenarios:

  • The parent puts the end-of-work sentinel in the queue of each worker.
  • The parent closes the reading-end of the channel of each worker.
  • Then:
    • If the worker was waiting for work, it will encounter the sentinel
      and break from the processing loop.
    • If the worker was processing a batch, it will attempt to write
      results to the channel. This will fail because the channel was
      closed by the parent and the worker will break from the processing
      loop.

Types of change

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added tests New, missing or incorrect tests bug Bugs and behaviour differing from documentation scaling Scaling, serving and parallelizing spaCy labels Jan 22, 2024
@danieldk danieldk force-pushed the bugfix/ci-fail branch 5 times, most recently from 36d88b9 to e58fbdd Compare January 23, 2024 07:54
Before this change, the workers of pipe call with n_process != 1 were
stopped by calling `terminate` on the processes. However, terminating a
process can leave queues, pipes, and other concurrent data structures in
an invalid state.

With this change, we stop using terminate and take the following approach
instead:

* When the all documents are processed, the parent process puts a
  sentinel in the queue of each worker.
* The parent process then calls `join` on each worker process to
  let them finish up gracefully.
* Worker processes break from the queue processing loop when the
  sentinel is encountered, so that they exit.

We need special handling when one of the workers encounters an error and
the error handler is set to raise an exception. In this case, we cannot
rely on the sentinel to finish all workers -- the queue is a FIFO queue
and there may be other work queued up before the sentinel. We use the
following approach to handle error scenarios:

* The parent puts the end-of-work sentinel in the queue of each worker.
* The parent closes the reading-end of the channel of each worker.
* Then:
  - If the worker was waiting for work, it will encounter the sentinel
    and break from the processing loop.
  - If the worker was processing a batch, it will attempt to write
    results to the channel. This will fail because the channel was
    closed by the parent and the worker will break from the processing
    loop.
@danieldk danieldk changed the title WIP: Properly clean up multiprocessing workers Properly clean up multiprocessing workers Jan 23, 2024
@danieldk danieldk marked this pull request as ready for review January 23, 2024 09:04
Copy link
Member

@svlandeg svlandeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - thanks for the detailed breakdown & documentation of the code! 🙏

@svlandeg svlandeg merged commit 128197a into explosion:master Jan 23, 2024
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and behaviour differing from documentation scaling Scaling, serving and parallelizing spaCy tests New, missing or incorrect tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants