Memory leak issue #10015
-
How to reproduce the behaviour

Hi, I'm facing a memory leak issue with the following code.
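Roughly along these lines (a representative sketch, not the original snippet; the model name and generated texts are placeholders, and `psutil` is only used here to watch resident memory):

```python
import os

import psutil
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model name
process = psutil.Process(os.getpid())

# Streaming many texts with novel tokens makes resident memory climb steadily.
for i in range(100_000):
    doc = nlp(f"Sample text number {i} mentioning novel token tok{i}.")
    if i % 10_000 == 0:
        print(i, process.memory_info().rss // 1_000_000, "MB")
```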
Your Environment
I have attached the … Also, I would like to know whether spaCy is caching the results. @adrianeboyd, any views on this, or anything you can help me out with?
-
The memory usage increases slightly during processing because the pipeline vocab in `nlp.vocab` is not static. The lexeme cache (`nlp.vocab`) and string store (`nlp.vocab.strings`) grow as texts are processed and new tokens are seen. The lexeme cache is just for speed, but the string store is needed to map the string hashes back to strings.

If you're saving `Doc` objects directly for future processing, you'd need the string store cache to know which token strings were in the docs, since the `Doc` object just includes the hashes (ints) for the tokens and not the strings.

If you're saving the resulting annotation/analysis in some other form, then you probably don't need to keep track of the full string store for all your docs at once. The recommended solution if the memory usage is a problem is to periodically reload the pipeline with `spacy.load`.
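To see the growth directly, you can compare the cache sizes before and after processing; a small illustrative snippet (the model name and text are placeholders):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model name
print("lexemes:", len(nlp.vocab), "strings:", len(nlp.vocab.strings))

# Processing text with previously unseen tokens grows both caches.
doc = nlp("Some text with a previously unseen word like floofberry.")
print("lexemes:", len(nlp.vocab), "strings:", len(nlp.vocab.strings))
```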
-
You can reload the model when it has grown too large:
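A minimal sketch of that approach, assuming you only keep extracted annotations rather than the `Doc` objects themselves; the model name and threshold are arbitrary placeholders:

```python
import spacy

MODEL = "en_core_web_sm"  # placeholder model name
LIMIT = 1_000_000         # arbitrary cap on the string store size

nlp = spacy.load(MODEL)
texts = ["..."] * 10      # stand-in for your real text stream
results = []

for text in texts:
    doc = nlp(text)
    results.append([tok.text for tok in doc])  # save plain data, not the Doc
    if len(nlp.vocab.strings) > LIMIT:
        # Replacing the pipeline drops the grown lexeme cache and string store.
        nlp = spacy.load(MODEL)
```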
-
I need help with two aspects of pre-loading.
-
Having a built-in memory leak seems pretty strange to me. Reloading the model isn't necessarily the easiest thing to do. Is there seriously no way to control the size of the vocab cache?
-
I seem to be running into this while training a transformer model on a very large dataset. Memory consumption keeps increasing throughout training until it runs out and gives an OOM error. It would be annoying to have to stop training periodically and then restart from the saved checkpoint (especially if we wanted to resume the learning rate etc. at the point it was before). Is there another solution to this that would work for model training? @adrianeboyd @svlandeg
-
It's not a vocab problem; even passing the same text causes the memory to grow.
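One way to test that claim is to measure allocations while repeatedly feeding the pipeline an identical string; if the caches were the only source of growth, usage should plateau after the first call. A diagnostic sketch (the model name is a placeholder):

```python
import tracemalloc

import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model name
text = "The exact same sentence, every single iteration."

nlp(text)  # first call: vocab and string store grow once
tracemalloc.start()
for _ in range(10_000):
    nlp(text)
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
tracemalloc.stop()
```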