Avoid pickling `Doc` inputs passed to `Language.pipe()` #10864
@explosion-bot please test_slow
URL: https://buildkite.com/explosion-ai/spacy-slow-tests/builds/95
The existing implementation added it to Line 2192 in 87adb32
For this, I think we can take it out of …
If …
Hmm, maybe we don't actually need …
Posted here for context: Seems like we can't get rid of …
So longer term we'd like to remove …
While nothing looks technically incorrect, my main concern about the current proposal is that it seems to add a ton of machinery that really obscures the much more common no-context, no-multiprocessing path through …
Maybe I'm overlooking some details, but I thought it would be possible to leave the rest as is and have the multiprocessing code use a multiprocessing-specific …
I've simplified the code further.
@explosion-bot please test_slow
URL: https://buildkite.com/explosion-ai/spacy-slow-tests/builds/100
Looks good! A few very minor naming/formatting suggestions...
I'm surprised the types weren't a problem with the renaming...
* `Language.pipe()`: Serialize `Doc` objects to bytes when using multiprocessing to avoid pickling overhead
* `Doc.to_dict()`: Serialize `_context` attribute (keeping in line with `(un)pickle_doc()`
* Correct type annotations
* Fix typo
* `Doc`: Do not serialize `_context`
* `Language.pipe`: Send context objects to child processes, Simplify `as_tuples` handling
* Fix type annotation
* `Language.pipe`: Simplify `as_tuple` multiprocessor handling
* Cleanup code, fix typos
* MyPy fixes
* Move doc preparation function into `_multiprocessing_pipe`
  Whitespace changes
* Remove superfluous comma
* Rename `prepare_doc` to `prepare_input`
* Update spacy/errors.py
* Undo renaming for error
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Description
A user reported significantly slower performance when calling `Language.pipe()` with `as_tuples=True` (as compared to `as_tuples=False`). Closer investigation revealed the following:
* When `as_tuples` is set to `True`, the inputs passed as `texts` are preemptively converted to `Doc` objects (to store their respective context data in the `_context` attribute). Then, `pipe()` is called again with the augmented inputs.
* When multiprocessing is enabled (`n_process > 1`), `Doc` objects passed to `pipe()` are pickled as-is and sent to child processes along with their shared `Vocab` and `Vectors` instances.
This PR addresses the issue through the following changes:
* Serialize `Doc` objects in the input to byte arrays using `Doc.to_bytes()` when multiprocessing is enabled. The overhead of pickling a byte array is negligible.
* Deserialize the `Doc` representations before executing the pipeline.
* Ensure the `_context` attribute is serialized when calling `Doc.to_bytes/to_dict()` in order to prevent the loss of context information.
Relevant issues
Types of change
Performance enhancement
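To make the performance difference from the Description concrete, here is a minimal sketch that uses only the Python standard library. It is not spaCy's implementation. The dict-based "doc", the `shared_vocab` list, and the helpers `make_doc`, `doc_to_bytes` and `doc_from_bytes` are hypothetical stand-ins, and `prepare_input` only borrows its name from the commit notes (its signature here is an assumption). The sketch compares pickling a document that references a large shared object with pickling a lightweight (bytes, context) pair that the receiving side rebuilds against its own copy of the shared object.

```python
import json
import pickle

# Hypothetical stand-ins: the "vocab" is a large list of strings shared by all
# documents, and a "doc" is a dict that references it. Not spaCy's data model.
shared_vocab = [f"word{i}" for i in range(50_000)]


def make_doc(vocab, words, context=None):
    """Build a toy doc that keeps a reference to the shared vocab."""
    return {"vocab": vocab, "words": words, "context": context}


def doc_to_bytes(doc):
    """Serialize only the per-document data, leaving the shared vocab out."""
    return json.dumps(doc["words"]).encode("utf8")


def doc_from_bytes(vocab, data, context=None):
    """Rebuild a toy doc from bytes, reattaching the receiver's own vocab."""
    return make_doc(vocab, json.loads(data.decode("utf8")), context)


def prepare_input(doc):
    """Parent side: turn a doc into a lightweight (bytes, context) pair."""
    return doc_to_bytes(doc), doc["context"]


doc = make_doc(shared_vocab, ["Hello", "world"], context={"id": 1})

# Pickling the doc as-is drags the whole shared vocab along with it.
naive_payload = pickle.dumps(doc)

# Pickling the prepared input only carries the words and the context.
lean_payload = pickle.dumps(prepare_input(doc))

# Child side: unpickle the pair and rebuild the doc with the local vocab.
data, context = pickle.loads(lean_payload)
restored = doc_from_bytes(shared_vocab, data, context)

print(len(naive_payload), len(lean_payload))
print(restored["words"], restored["context"])
```

In this toy setup the first printed size grows with the size of the shared vocab, while the second depends only on the document's own words and context. That is the effect the byte-array approach aims for when inputs are sent to child processes.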
Checklist