
Avoid pickling Doc inputs passed to Language.pipe() #10864

Merged · 16 commits · Jun 2, 2022

Conversation

@shadeMe (Contributor) commented May 27, 2022

Description

A user reported significantly slower performance when calling Language.pipe() with as_tuples=True (as compared to as_tuples=False). Closer investigation revealed the following:

  • When as_tuples is set to True, the inputs passed as texts are preemptively converted to Doc objects (to store their respective context data in the _context attribute). Then, pipe() is called again with the augmented inputs.
  • When multiprocessing is enabled (n_process > 1), Doc objects passed to pipe() are pickled as-is and sent to child processes along with their shared Vocab and Vectors instances.
  • This incurs a significant serialization/deserialization overhead, which is further exacerbated by small batch sizes.

This PR addresses the issue through the following changes:

  • Serialize Doc objects in the input to byte arrays using Doc.to_bytes() when multiprocessing is enabled. The overhead of pickling a byte array is negligible.
  • In child processes, deserialize incoming byte arrays to their corresponding Doc representations before executing the pipeline.
  • Ensure that the _context attribute is serialized when calling Doc.to_bytes/to_dict() in order to prevent the loss of context information.
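The cost difference behind the first two changes can be illustrated with a toy sketch (plain Python; ToyVocab and ToyDoc are stand-ins, not spaCy's actual classes): pickling rich objects drags the shared vocabulary along with every batch sent to a worker, while pickling a compact byte payload does not.

```python
import pickle

# Toy stand-ins for spaCy's Doc/Vocab (NOT the real API): a "doc" whose
# pickle drags a large shared vocabulary along with every batch.
class ToyVocab:
    def __init__(self):
        self.strings = {i: f"word{i}" for i in range(10_000)}

class ToyDoc:
    def __init__(self, vocab, text):
        self.vocab = vocab  # shared, but serialized with each pickled batch
        self.text = text

    def to_bytes(self):
        # Compact serialization that omits the shared vocabulary.
        return self.text.encode("utf-8")

    @classmethod
    def from_bytes(cls, vocab, data):
        return cls(vocab, data.decode("utf-8"))

vocab = ToyVocab()
docs = [ToyDoc(vocab, f"document {i}") for i in range(8)]

# Naive path: pickling the docs includes the whole shared vocabulary.
naive = pickle.dumps(docs)

# The PR's approach: serialize each doc to bytes first, pickle only those.
compact = pickle.dumps([doc.to_bytes() for doc in docs])

print(len(naive) > 10 * len(compact))  # True: the payload shrinks drastically
```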

Relevant issues

Types of change

Performance enhancement

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@shadeMe shadeMe added feat / pipeline Feature: Processing pipeline and components perf / speed Performance: speed labels May 27, 2022
@shadeMe shadeMe requested a review from adrianeboyd May 27, 2022 16:33
@shadeMe (Contributor, Author) commented May 27, 2022

@explosion-bot please test_slow

@explosion-bot (Collaborator) commented May 27, 2022

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-slow-tests/builds/95

@adrianeboyd (Contributor) commented:
_context is just internals used for multiprocessing with pipe, so we don't want to add it to Doc.to_bytes(), which affects every serialized DocBin/.spacy.

The existing implementation added it to pickle_doc to send docs and added it to the tuples that are sent back to receive docs.

byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]

For this, I think we can take it out of pickle_doc since we're not using pickle anymore, and use the same approach (some kind of tuple?) when sending so that _context stays more as internals.
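The tuple-based hand-off suggested here can be sketched in plain Python (serialize/deserialize are illustrative stand-ins for Doc.to_bytes()/Doc.from_bytes(), not spaCy calls): the context rides alongside the serialized doc instead of inside it, so to_bytes() never needs to know about _context.

```python
# Hypothetical sketch of the tuple-based hand-off: context travels next to
# the serialized doc, not inside its byte representation.

def serialize(text):          # stand-in for Doc.to_bytes()
    return text.encode("utf-8")

def deserialize(data):        # stand-in for Doc.from_bytes()
    return data.decode("utf-8")

def send_to_worker(docs_with_context):
    # Parent process: pair the compact bytes with the context object.
    return [(serialize(doc), context) for doc, context in docs_with_context]

def receive_in_worker(payload):
    # Child process: rebuild the docs; the context never touched to_bytes().
    return [(deserialize(data), context) for data, context in payload]

inputs = [("first doc", {"id": 1}), ("second doc", {"id": 2})]
roundtripped = receive_in_worker(send_to_worker(inputs))
print(roundtripped == inputs)  # True
```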

@adrianeboyd adrianeboyd linked an issue May 30, 2022 that may be closed by this pull request
@shadeMe (Contributor, Author) commented May 30, 2022

If _context is only ever used in pipe and we want to pass the context data directly as part of a tuple, shall we then remove _context from Doc entirely?

@adrianeboyd (Contributor) commented:
Hmm, maybe we don't actually need Doc._context at all here anymore?

@shadeMe (Contributor, Author) commented May 31, 2022

Posted here for context:

Seems like we can't get rid of _context so easily. pipe is allowed to return processed batches that are smaller than the input (this PR talks about custom error handlers potentially skipping specific documents). So, maintaining the alignment between the source Docs and their corresponding contexts would require either storing the context object on the document itself (as we do currently) or using a secondary mapping of unique Doc IDs (that cannot be mutated after init) to context objects. So, we'll continue using the Doc._context field when passing Docs to the pipeline.
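A toy illustration of the alignment problem (plain Python, not spaCy code): once the pipeline drops a document, contexts tracked in a separate, position-aligned list pair up with the wrong docs, while a context stored on the doc itself survives the drop.

```python
# ToyDoc is a stand-in; _context mirrors the idea of spaCy's Doc._context.
class ToyDoc:
    def __init__(self, text, context=None):
        self.text = text
        self._context = context

def pipeline(docs):
    # An error handler that silently skips "bad" documents.
    return [d for d in docs if "bad" not in d.text]

docs = [ToyDoc("ok one"), ToyDoc("bad two"), ToyDoc("ok three")]
contexts = ["ctx1", "ctx2", "ctx3"]

# Positional zip misaligns once a doc is dropped:
misaligned = list(zip(pipeline(docs), contexts))
print(misaligned[1][0].text, misaligned[1][1])  # ok three ctx2

# Storing the context on the doc itself keeps the pairing intact:
for d, c in zip(docs, contexts):
    d._context = c
aligned = [(d.text, d._context) for d in pipeline(docs)]
print(aligned)  # [('ok one', 'ctx1'), ('ok three', 'ctx3')]
```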

@adrianeboyd (Contributor) commented:
So, longer term we'd like to remove as_tuples as an option and only accept string or Doc input.

While nothing looks technically incorrect, my main concern about the current proposal is that it seems to add a ton of machinery that really obscures the much more common no-context, no-multiprocessing path through nlp.pipe and would then need to be removed when we remove as_tuples.

Maybe I'm overlooking some details, but I thought it would be possible to leave the rest as is and have the multiprocessing code use a multiprocessing-specific ensure_doc that works from (str_or_doc, context) tuples?
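A hedged sketch of what such a multiprocessing-specific helper might look like (ToyDoc and ensure_doc_with_context are illustrative names, not spaCy's actual implementation): accept either a string or a prebuilt doc plus a context, and always return a doc with the context attached.

```python
# Illustrative stand-ins only; the real spaCy helper may differ.
class ToyDoc:
    def __init__(self, text):
        self.text = text
        self._context = None

def ensure_doc_with_context(item):
    """Normalize a (str_or_doc, context) tuple into a doc with context."""
    str_or_doc, context = item
    doc = str_or_doc if isinstance(str_or_doc, ToyDoc) else ToyDoc(str_or_doc)
    doc._context = context
    return doc

batch = [("plain text", {"page": 1}), (ToyDoc("prebuilt"), {"page": 2})]
docs = [ensure_doc_with_context(item) for item in batch]
print([(d.text, d._context["page"]) for d in docs])
# [('plain text', 1), ('prebuilt', 2)]
```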

@shadeMe (Contributor, Author) commented Jun 1, 2022

I've simplified the code further.

@shadeMe (Contributor, Author) commented Jun 1, 2022

@explosion-bot please test_slow

@explosion-bot (Collaborator) commented Jun 1, 2022

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-slow-tests/builds/100

@adrianeboyd (Contributor) left a review:

Looks good! A few very minor naming/formatting suggestions...

Review comments (outdated, resolved) on: spacy/language.py, spacy/tokens/doc.pyx, spacy/errors.py
@adrianeboyd (Contributor) left a review:

I'm surprised the types weren't a problem with the renaming...

Review comments (outdated, resolved) on: spacy/language.py
@adrianeboyd adrianeboyd merged commit 41389ff into explosion:master Jun 2, 2022
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request Jun 2, 2022

* `Language.pipe()`: Serialize `Doc` objects to bytes when using multiprocessing to avoid pickling overhead

* `Doc.to_dict()`: Serialize `_context` attribute (keeping in line with `(un)pickle_doc()`)

* Correct type annotations

* Fix typo

* `Doc`: Do not serialize `_context`

* `Language.pipe`: Send context objects to child processes, Simplify `as_tuples` handling

* Fix type annotation

* `Language.pipe`: Simplify `as_tuple` multiprocessor handling

* Cleanup code, fix typos

* MyPy fixes

* Move doc preparation function into `_multiprocessing_pipe`
Whitespace changes

* Remove superfluous comma

* Rename `prepare_doc` to `prepare_input`

* Update spacy/errors.py

* Undo renaming for error

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
@shadeMe shadeMe deleted the refactor/pipe-docs-as-bytes branch June 2, 2022 18:46
danieldk pushed a commit that referenced this pull request Jun 7, 2022
Labels
feat / pipeline Feature: Processing pipeline and components perf / speed Performance: speed
Successfully merging this pull request may close these issues.

Pipe() process time is very slow with as_tuples attribute
3 participants