
This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Memory leak when processing a large number of documents with Spacy transformers #12037

Closed
saketsharmabmb opened this issue Dec 30, 2022 · 3 comments
Labels: feat / transformer (Feature: Transformer), perf / memory (Performance: memory use)

Comments

saketsharmabmb commented Dec 30, 2022

I have a spaCy DistilBERT transformer model trained for NER. When I use this model for predictions on a large corpus of documents, the RAM usage spikes very quickly and then keeps increasing over time until I run out of memory and my process gets killed.
I am running this on a CPU-only AWS machine (m5.12xlarge).
I see the same behavior when using the en_core_web_trf model.

The following code can be used to reproduce the error with the en_core_web_trf model:


import os
import pickle

import spacy

print(f"CPU Count: {os.cpu_count()}")

model = spacy.load("en_core_web_trf")

## Docs are English text documents with average character length of 2479, std dev 3487, max 69000
with open("memory_analysis/data/docs.p", "rb") as f:
    docs = pickle.load(f)
print(len(docs))

for i, body in enumerate(docs):
    if i == 10000:
        break
    ## spaCy prediction
    list(model.pipe([body], disable=["tok2vec", "parser", "attribute_ruler", "lemmatizer"]))
    if i % 400 == 0:
        print(f"Doc number: {i}")

Environment:

spacy-transformers==1.1.8
spacy==3.4.3
torch==1.12.1

Additional info:
I notice that the model's vocab length and cached string store grow with the number of processed documents as well, although I am unsure whether this is causing the memory leak.
I tried periodically reloading the model, but that does not help either.

Using Memray for memory usage analysis:

python3 -m memray run -o memory_usage_trf_max.bin memory_analysis.py
python3 -m memray flamegraph memory_usage_trf_max.bin
shadeMe added the perf / memory (Performance: memory use) and feat / transformer (Feature: Transformer) labels on Jan 3, 2023
shadeMe (Contributor) commented Jan 3, 2023

Thanks for the report and the code! A couple of questions:

  • Is this behaviour dependent on the documents being processed? Specifically, does it also happen with other languages?
  • Could you upload the memray profile to the issue so that we have a baseline for comparison?

saketsharmabmb (Author) commented

Hi @shadeMe,
Happy 2023!
[Attached images: model_predict_stream_long_run, model_predict, vocab_len]

I have attached three files:

  • model_predict_stream_long_run: This is the memory usage when I run the trained NER DistilBERT model on a stream of English web pages. As you can see, the memory usage spikes, then flattens but keeps increasing asymptotically. I find it strange that the DistilBERT model (I only have "ner" and "transformer" in the pipeline) takes 10-12 GB of memory. This memory analysis is printed by the Kubernetes container the model is running in.

  • model_predict: This is when I run the same model on a small batch of data and report the memray analysis. Again, the usage seems to increase over time. (It does not reach the same level because I do not run it for as long; see the timestamps.)

  • vocab_len: This is an example of how model.vocab.length changes as I process 200,000 articles. Similar behavior is seen for len(list(model.vocab.strings)) (see the sketch below).
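
A minimal sketch of that measurement, assuming a small placeholder corpus (the real run iterates over the article stream):

import spacy

model = spacy.load("en_core_web_trf")
texts = ["Example article one.", "Example article two."]  # placeholder corpus

# Record how the vocab and string store grow as documents are processed.
sizes = []
for i, text in enumerate(texts):
    model(text)
    sizes.append((i, model.vocab.length, len(model.vocab.strings)))
print(sizes)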

Regarding the documents being processed: this is a stream of English web pages. I am quite confident that these are all English documents (barring some minimal noise) because of the way I source them, so I am not sure whether this also happens with other languages. It is a stream of new content, and the shape of the memory usage graph always looks like model_predict_stream_long_run.

shadeMe (Contributor) commented Jan 11, 2023

Thanks for the extra context. After some extensive testing, we were able to reproduce the same memory behaviour, but the potential causes do not point to a memory leak. Let's move this to the discussion forum, as the underlying issue is not a bug per se.

Background

A bit of background on how the inference transformer pipeline works: The user passes in strings or Doc instances to the model's pipe method. In the case of the former, the model initially runs the tokenizer on the strings and constructs Doc objects, since pipeline components only work with Doc inputs.

When a batch of documents is passed to the predict method of the Transformer pipe, it has to split the tokens in each Doc object into spans to ensure that the inputs do not exceed the maximum input length of the Hugging Face transformer model. Furthermore, it performs a second tokenization pass using the tokenizer associated with the transformer model and computes the alignments between the result and spaCy's tokenization. The newly tokenized input is passed to the transformer model, which returns its corresponding representations. To convert this output back to spaCy's format, we re-align the outputs to the original spaCy tokens and merge the spans to reconstruct the per-document, per-token representations. These are then stored on each Doc object so that downstream components can use them as inputs for their own models.
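
As a rough illustration only (this is not spaCy's actual implementation), the span-splitting step behaves like a strided window over the Doc's tokens, similar to what the configured span getter (e.g. spacy-transformers' strided spans) produces:

def strided_spans(doc, window=128, stride=96):
    # Split a Doc into token spans of at most `window` tokens, stepping by
    # `stride` tokens, so that no span exceeds the transformer's input length.
    spans = []
    start = 0
    while start < len(doc):
        spans.append(doc[start : start + window])
        if start + window >= len(doc):
            break
        start += stride
    return spans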

As you can imagine, the complexity of the above process requires us to maintain additional state for book-keeping and lazy-evaluation purposes. When combined with the transformer representations, each Doc instance ultimately ends up storing a significant amount of data when it's annotated with a transformer model.

Profiling results

During our testing, we only noticed ballooning memory usage when the Doc instances generated by the pipeline were kept around in memory, i.e., there was at least one reference to the Doc object, which prevented it from being garbage-collected. Is this also the case on your end? It does seem to be the case in your example code, in any event: Even though you immediately discard the output of model.pipe, the inputs are Doc objects from the docs container. Since the docs container keeps around references to them until the end of the program, the transformer data also sticks around with it.

[Screenshot: memory usage graph from the profiling session described below]

The above graph is from a profiling session where the en_core_web_trf model was used to run inference on ~18000 documents using a modified version of your example code. The only change was to pass the documents as strings to model.pipe, resulting in the immediate disposal of the Doc objects returned by the method. The spike seen towards the end corresponds to a batch of documents with a large number of tokens.
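
For reference, a minimal sketch of that pattern (not the exact modified script; the path and batch size are illustrative, and it assumes the pickled file contains plain strings):

import pickle

import spacy

model = spacy.load("en_core_web_trf")

# Assumes the pickled file holds a list of raw strings rather than Doc objects.
with open("memory_analysis/data/docs.p", "rb") as f:
    texts = pickle.load(f)

# Passing strings means each returned Doc (and the transformer data attached
# to it) can be garbage-collected as soon as we are done with it.
for doc in model.pipe(texts, batch_size=32):
    ents = [(ent.text, ent.label_) for ent in doc.ents]
    # ... consume `ents` here; keep no reference to `doc` itself.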

Re. vocab length: Given that you're processing webpages, there's a high chance of the model encountering novel tokens that are not found in its pre-trained vocabulary. This results in their being added to its string store. While this also contributes to the increase in memory usage, it will likely be eclipsed by the transformer data. Nevertheless, periodically reloading the model should reset its vocabulary.
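
If the string store growth ever becomes a concern on its own, a periodic reload could look roughly like this (the interval and the text source are illustrative assumptions):

import spacy

RELOAD_EVERY = 50_000  # arbitrary illustrative interval

def stream_of_texts():
    # Placeholder for the real stream of web-page texts.
    yield from ["First page text ...", "Second page text ..."]

model = spacy.load("en_core_web_trf")
for i, text in enumerate(stream_of_texts()):
    if i and i % RELOAD_EVERY == 0:
        # Reloading discards the grown Vocab/StringStore along with the old pipeline.
        model = spacy.load("en_core_web_trf")
    doc = model(text)
    # ... use doc.ents, then let `doc` go out of scope.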

Misc

One further point of note: Hugging Face transformer models are built with either PyTorch or TensorFlow, and spaCy only supports the former. PyTorch uses a custom memory allocator that caches allocations. This means that when a particular block of PyTorch-allocated memory is freed, it doesn't immediately get released to the OS - the framework instead attempts to reuse the freed blocks as efficiently as possible. So, increasing memory usage doesn't necessarily mean that there's a memory leak behind the scenes.

explosion locked and limited conversation to collaborators on Jan 11, 2023
shadeMe converted this issue into discussion #12093 on Jan 11, 2023
