
Avoid pickling Doc inputs passed to Language.pipe() #10864

Merged · 16 commits · Jun 2, 2022

Conversation

@shadeMe (Contributor) commented May 27, 2022

Description

A user reported significantly slower performance when calling Language.pipe() with as_tuples=True (as compared to as_tuples=False). Closer investigation revealed the following:

  • When as_tuples is set to True, the inputs passed as texts are preemptively converted to Doc objects (to store their respective context data in the _context attribute). Then, pipe() is called again with the augmented inputs.
  • When multiprocessing is enabled (n_process > 1), Doc objects passed to pipe() are pickled as-is and sent to child processes along with their shared Vocab and Vectors instances.
  • This incurs a significant serialization/deserialization overhead, which is further exacerbated by small batch sizes.

This PR addresses the issue through the following changes:

  • Serialize Doc objects in the input to byte arrays using Doc.to_bytes() when multiprocessing is enabled. The overhead of pickling a byte array is negligible.
  • In child processes, deserialize incoming byte arrays to their corresponding Doc representations before executing the pipeline.
  • Ensure that the _context attribute is serialized when calling Doc.to_bytes/to_dict() in order to prevent the loss of context information.
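The cost difference behind the first two changes can be illustrated with a toy sketch (plain Python; ToyVocab and ToyDoc are stand-ins, not spaCy's actual classes): pickling rich objects drags the shared vocabulary along with every batch sent to a worker, while pickling a compact byte payload does not.

```python
import pickle

# Toy stand-ins for spaCy's Doc/Vocab (NOT the real API): a "doc" whose
# pickle drags a large shared vocabulary along with every batch.
class ToyVocab:
    def __init__(self):
        self.strings = {i: f"word{i}" for i in range(10_000)}

class ToyDoc:
    def __init__(self, vocab, text):
        self.vocab = vocab  # shared, but serialized with each pickled batch
        self.text = text

    def to_bytes(self):
        # Compact serialization that omits the shared vocabulary.
        return self.text.encode("utf-8")

    @classmethod
    def from_bytes(cls, vocab, data):
        return cls(vocab, data.decode("utf-8"))

vocab = ToyVocab()
docs = [ToyDoc(vocab, f"document {i}") for i in range(8)]

# Naive path: pickling the docs includes the whole shared vocabulary.
naive = pickle.dumps(docs)

# The PR's approach: serialize each doc to bytes first, pickle only those.
compact = pickle.dumps([doc.to_bytes() for doc in docs])

print(len(naive) > 10 * len(compact))  # True: the payload shrinks drastically
```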

Relevant issues

Types of change

Performance enhancement

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@shadeMe shadeMe added feat / pipeline Feature: Processing pipeline and components perf / speed Performance: speed labels May 27, 2022
@shadeMe shadeMe requested a review from adrianeboyd May 27, 2022 16:33
@shadeMe (Contributor, Author) commented May 27, 2022

@explosion-bot please test_slow

@explosion-bot (Collaborator) commented May 27, 2022

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-slow-tests/builds/95

@adrianeboyd (Contributor) commented:
_context is just internals used for multiprocessing with pipe, so we don't want to add it to Doc.to_bytes(), which affects every serialized DocBin/.spacy.

The existing implementation added it to pickle_doc to send docs and added it to the tuples that are sent back to receive docs.

byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]

For this, I think we can take it out of pickle_doc since we're not using pickle anymore, and use the same approach (some kind of tuple?) when sending so that _context stays more as internals.
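The tuple-based hand-off suggested here can be sketched in plain Python (serialize/deserialize are illustrative stand-ins for Doc.to_bytes()/Doc.from_bytes(), not spaCy calls): the context rides alongside the serialized doc instead of inside it, so to_bytes() never needs to know about _context.

```python
# Hypothetical sketch of the tuple-based hand-off: context travels next to
# the serialized doc, not inside its byte representation.

def serialize(text):          # stand-in for Doc.to_bytes()
    return text.encode("utf-8")

def deserialize(data):        # stand-in for Doc.from_bytes()
    return data.decode("utf-8")

def send_to_worker(docs_with_context):
    # Parent process: pair the compact bytes with the context object.
    return [(serialize(doc), context) for doc, context in docs_with_context]

def receive_in_worker(payload):
    # Child process: rebuild the docs; the context never touched to_bytes().
    return [(deserialize(data), context) for data, context in payload]

inputs = [("first doc", {"id": 1}), ("second doc", {"id": 2})]
roundtripped = receive_in_worker(send_to_worker(inputs))
print(roundtripped == inputs)  # True
```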

@adrianeboyd adrianeboyd linked an issue May 30, 2022 that may be closed by this pull request
@shadeMe (Contributor, Author) commented May 30, 2022

If _context is only ever used in pipe and we want to pass the context data directly as part of a tuple, shall we then remove _context from Doc entirely?

@adrianeboyd (Contributor) commented:
Hmm, maybe we don't actually need Doc._context at all here anymore?

@shadeMe (Contributor, Author) commented May 31, 2022

Posted here for context:

Seems like we can't get rid of _context so easily. pipe is allowed to return processed batches that are smaller than the input (this PR talks about custom error handlers potentially skipping specific documents). So, maintaining the alignment between the source Docs and their corresponding contexts would require either storing the context object on the document itself (as we do currently) or using a secondary mapping of unique Doc IDs (that cannot be mutated after init) to context objects. So, we'll continue using the Doc._context field when passing Docs to the pipeline.
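A toy illustration of the alignment problem (plain Python, not spaCy code): once the pipeline drops a document, contexts tracked in a separate, position-aligned list pair up with the wrong docs, while a context stored on the doc itself survives the drop.

```python
# ToyDoc is a stand-in; _context mirrors the idea of spaCy's Doc._context.
class ToyDoc:
    def __init__(self, text, context=None):
        self.text = text
        self._context = context

def pipeline(docs):
    # An error handler that silently skips "bad" documents.
    return [d for d in docs if "bad" not in d.text]

docs = [ToyDoc("ok one"), ToyDoc("bad two"), ToyDoc("ok three")]
contexts = ["ctx1", "ctx2", "ctx3"]

# Positional zip misaligns once a doc is dropped:
misaligned = list(zip(pipeline(docs), contexts))
print(misaligned[1][0].text, misaligned[1][1])  # ok three ctx2

# Storing the context on the doc itself keeps the pairing intact:
for d, c in zip(docs, contexts):
    d._context = c
aligned = [(d.text, d._context) for d in pipeline(docs)]
print(aligned)  # [('ok one', 'ctx1'), ('ok three', 'ctx3')]
```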

@adrianeboyd (Contributor) commented:
So, longer term we'd like to remove as_tuples as an option and only accept string or Doc input.

While nothing looks technically incorrect, my main concern about the current proposal is that it seems to add a ton of machinery that really obscures the much more common no-context, no-multiprocessing path through nlp.pipe and would then need to be removed when we remove as_tuples.

Maybe I'm overlooking some details, but I thought it would be possible to leave the rest as is and have the multiprocessing code use a multiprocessing-specific ensure_doc that works from (str_or_doc, context) tuples?
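A hedged sketch of what such a multiprocessing-specific helper might look like (ToyDoc and ensure_doc_with_context are illustrative names, not spaCy's actual implementation): accept either a string or a prebuilt doc plus a context, and always return a doc with the context attached.

```python
# Illustrative stand-ins only; the real spaCy helper may differ.
class ToyDoc:
    def __init__(self, text):
        self.text = text
        self._context = None

def ensure_doc_with_context(item):
    """Normalize a (str_or_doc, context) tuple into a doc with context."""
    str_or_doc, context = item
    doc = str_or_doc if isinstance(str_or_doc, ToyDoc) else ToyDoc(str_or_doc)
    doc._context = context
    return doc

batch = [("plain text", {"page": 1}), (ToyDoc("prebuilt"), {"page": 2})]
docs = [ensure_doc_with_context(item) for item in batch]
print([(d.text, d._context["page"]) for d in docs])
# [('plain text', 1), ('prebuilt', 2)]
```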

@shadeMe (Contributor, Author) commented Jun 1, 2022

I've simplified the code further.

@shadeMe (Contributor, Author) commented Jun 1, 2022

@explosion-bot please test_slow

@explosion-bot (Collaborator) commented Jun 1, 2022

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-slow-tests/builds/100

@adrianeboyd (Contributor) left a review:

Looks good! A few very minor naming/formatting suggestions...

Review comments (outdated, resolved) on: spacy/language.py, spacy/tokens/doc.pyx, spacy/errors.py
@adrianeboyd (Contributor) left a review:

I'm surprised the types weren't a problem with the renaming...

Review comments (outdated, resolved) on: spacy/language.py
@adrianeboyd adrianeboyd merged commit 41389ff into explosion:master Jun 2, 2022
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this pull request Jun 2, 2022

* `Language.pipe()`: Serialize `Doc` objects to bytes when using multiprocessing to avoid pickling overhead

* `Doc.to_dict()`: Serialize `_context` attribute (keeping in line with `(un)pickle_doc()`)

* Correct type annotations

* Fix typo

* `Doc`: Do not serialize `_context`

* `Language.pipe`: Send context objects to child processes, Simplify `as_tuples` handling

* Fix type annotation

* `Language.pipe`: Simplify `as_tuple` multiprocessor handling

* Cleanup code, fix typos

* MyPy fixes

* Move doc preparation function into `_multiprocessing_pipe`
Whitespace changes

* Remove superfluous comma

* Rename `prepare_doc` to `prepare_input`

* Update spacy/errors.py

* Undo renaming for error

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
@shadeMe shadeMe deleted the refactor/pipe-docs-as-bytes branch June 2, 2022 18:46
danieldk pushed a commit that referenced this pull request Jun 7, 2022
Labels
feat / pipeline Feature: Processing pipeline and components perf / speed Performance: speed
Successfully merging this pull request may close these issues.

Pipe() process time is very slow with as_tuples attribute
3 participants