Memory leak issue #10015
-
How to reproduce the behaviour

Hi, I'm facing a memory leak issue with the following code.
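Roughly along these lines (a representative sketch, not the original snippet; the model name and generated texts are placeholders, and `psutil` is only used here to watch resident memory):

```python
import os

import psutil
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model name
process = psutil.Process(os.getpid())

# Streaming many texts with novel tokens makes resident memory climb steadily.
for i in range(100_000):
    doc = nlp(f"Sample text number {i} mentioning novel token tok{i}.")
    if i % 10_000 == 0:
        print(i, process.memory_info().rss // 1_000_000, "MB")
```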
Your Environment
I have attached the … Also, I would like to know whether spaCy is caching the results. @adrianeboyd, any views on this, or anything you can help me out with?
-
The memory usage increases slightly during processing because the pipeline vocab in `nlp.vocab` is not static. The lexeme cache (`nlp.vocab`) and string store (`nlp.vocab.strings`) grow as texts are processed and new tokens are seen. The lexeme cache is just for speed, but the string store is needed to map the string hashes back to strings.

If you're saving `Doc` objects directly for future processing, you'd need the string store cache to know which token strings were in the docs, since the `Doc` object just includes the hashes (ints) for the tokens and not the strings.

If you're saving the resulting annotation/analysis in some other form, then you probably don't need to keep track of the full string store for all your docs at once. The recommended solution if the memory usage is a problem is to periodically reload the pipeline with `spacy.load`.
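To see the growth directly, you can compare the cache sizes before and after processing; a small illustrative snippet (the model name and text are placeholders):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model name
print("lexemes:", len(nlp.vocab), "strings:", len(nlp.vocab.strings))

# Processing text with previously unseen tokens grows both caches.
doc = nlp("Some text with a previously unseen word like floofberry.")
print("lexemes:", len(nlp.vocab), "strings:", len(nlp.vocab.strings))
```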
-
You can reload the model when it has grown too large:
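A minimal sketch of that approach, assuming you only keep extracted annotations rather than the `Doc` objects themselves; the model name and threshold are arbitrary placeholders:

```python
import spacy

MODEL = "en_core_web_sm"  # placeholder model name
LIMIT = 1_000_000         # arbitrary cap on the string store size

nlp = spacy.load(MODEL)
texts = ["..."] * 10      # stand-in for your real text stream
results = []

for text in texts:
    doc = nlp(text)
    results.append([tok.text for tok in doc])  # save plain data, not the Doc
    if len(nlp.vocab.strings) > LIMIT:
        # Replacing the pipeline drops the grown lexeme cache and string store.
        nlp = spacy.load(MODEL)
```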
-
I need help with two aspects of pre-loading.
-
Having a built-in memory leak seems pretty strange to me. Reloading the model isn't necessarily the easiest thing to do. Is there seriously no way to control the size of the vocab cache?
-
I seem to be running into this while training a transformer model on a very large dataset. Memory consumption keeps increasing throughout training until it runs out and gives an OOM error. It would be annoying to have to stop training periodically and then restart from the saved checkpoint (especially if we wanted to resume the learning rate etc. at the point it was before). Is there another solution to this that would work for model training? @adrianeboyd @svlandeg
-
It's not a vocab problem; even passing the same text causes the memory to grow.
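One way to test that claim is to measure allocations while repeatedly feeding the pipeline an identical string; if the caches were the only source of growth, usage should plateau after the first call. A diagnostic sketch (the model name is a placeholder):

```python
import tracemalloc

import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model name
text = "The exact same sentence, every single iteration."

nlp(text)  # first call: vocab and string store grow once
tracemalloc.start()
for _ in range(10_000):
    nlp(text)
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
tracemalloc.stop()
```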