
Memory issues for long-running parsing processes #3618

Closed
oterrier opened this issue Apr 19, 2019 · 35 comments
Labels
help wanted Contributions welcome! perf / memory Performance: memory use

Comments

@oterrier
Contributor

How to reproduce the behaviour

Hi,
I suspect a memory leak when using nlp.pipe() intensively: my process keeps growing in memory and it looks like it is never garbage collected. Do you think that is possible?
Here is the small Python script I'm using to reproduce the problem:

import random
import spacy
import plac
import psutil
import sys


def load_data():
    return ["This is a fake test document number %d."%i for i in random.sample(range(100_000), 10_000)]


def parse_texts(nlp, texts, iterations=1_000):
    for i in range(iterations):
        for doc in nlp.pipe(texts):
            yield doc


@plac.annotations(
    iterations=("Number of iterations", "option", "n", int),
    model=("spaCy model to load", "positional", None, str)
)
def main(model='en_core_web_sm', iterations=1_000):
    nlp = spacy.load(model)
    texts = load_data()
    for i, doc in enumerate(parse_texts(nlp, texts, iterations=iterations)):
        if i % 10_000 == 0:
            print(i, psutil.virtual_memory().percent)
            sys.stdout.flush()


if __name__ == '__main__':
    plac.call(main)

For me the output looks like

0 28.3
10000 28.3
20000 28.7
30000 28.9
40000 28.9
50000 29.1
60000 29.3
70000 29.7
80000 29.9
90000 30.1
...
420000 37.0
430000 37.2
440000 37.6
450000 37.8
460000 37.6
470000 38.0
...
920000 53.4
930000 53.9
940000 54.0
950000 54.2
960000 54.4
970000 54.4
980000 54.8
990000 54.9
1000000 54.9

At the end, the pmap output for the process shows:
total 4879812K
Can you confirm?

Your Environment

  • Operating System: Ubuntu 18.04 Bionic Beaver (64-bit)
  • spaCy version: 2.1.3
  • Platform: Linux-4.15.0-47-generic-x86_64-with-debian-buster-sid
  • Python version: 3.6.7
  • Models: en, xx, fr

Best regards, and congratulations on the amazing work on spaCy.

Olivier Terrier

@honnibal
Member

Unfortunately it's really hard to tell whether a Python program is leaking memory or whether the Python interpreter is just refusing to give memory back to the operating system.

This blog post explains some good tips: https://medium.com/zendesk-engineering/hunting-for-memory-leaks-in-python-applications-6824d0518774

A small amount of increasing memory usage is expected as you pass over new text, as we add new strings to the StringStore. For the first period of work, the Vocab will also cache some new Lexeme objects. However, you're looping over the same text, so this cannot be the problem. My suspicion is there's no actual memory leak --- Python is just being a jerk. There's a way to change spaCy to use malloc and free instead of the default PyMem_Malloc, but you can't do it from Python --- you need to do it from Cython. For reference, the code to do that would be:

cimport cymem.cymem

from libc.stdlib cimport malloc, free

cymem.cymem.Default_Malloc._set(malloc)
cymem.cymem.Default_Free._set(free)
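
As a small illustration of the growth described above (not from the original comment; the throwaway texts are just an example), you can watch the StringStore expand as genuinely new strings are processed:

import spacy

nlp = spacy.load("en_core_web_sm")
print(len(nlp.vocab.strings))            # baseline number of interned strings
for i in range(10_000):
    nlp("completely new token %d" % i)   # new strings get added to the StringStore
print(len(nlp.vocab.strings))            # noticeably larger; entries are not evicted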

@oterrier
Contributor Author

Hi Honnibal,
Thanks for the fast answer and the nice link to the blog post.
You're probably right: I've been playing around with objgraph and no real leaks are detected using
objgraph.show_growth()
between calls.
Calling gc.collect() after each iteration does not help either...

Best

Olivier

@ines ines added the perf / memory Performance: memory use label Apr 19, 2019
@zsavvas

zsavvas commented Apr 20, 2019

Hi Honnibal,

I have a similar issue: I loop over different texts and observe increasing memory usage over time until memory fills up completely, because I process millions of different texts (using nlp.pipe and the en_core_web_lg model). Is there a way to limit the memory use over time, or to manually free some memory?

Thanks,
Savvas

@honnibal
Member

@zsavvas The simplest strategy is to just reload the NLP object periodically: `nlp = spacy.load("en_core_web_lg")`. It's not a very satisfying solution, but everything we've tried for flushing the string store periodically ends up really difficult, because we can't guarantee what other objects are still alive and trying to reference those strings or vocab items.
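
A minimal sketch of that periodic-reload strategy (the chunking helper and the BATCH_SIZE/RELOAD_EVERY values are illustrative, not part of spaCy's API):

import itertools
import spacy

BATCH_SIZE = 1_000      # texts handed to nlp.pipe() at a time (illustrative)
RELOAD_EVERY = 50       # chunks processed before reloading the pipeline (illustrative)

def chunked(iterable, size):
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def process_stream(texts, model="en_core_web_lg"):
    nlp = spacy.load(model)
    for n, chunk in enumerate(chunked(texts, BATCH_SIZE)):
        if n and n % RELOAD_EVERY == 0:
            nlp = spacy.load(model)   # old Vocab/StringStore becomes collectable once nothing references it
        yield from nlp.pipe(chunk)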

@BramVanroy
Contributor

@honnibal Just so I know: what is the string store for? A cache for faster lookup? How does it differ from the Vocab?

@oterrier
Contributor Author

@honnibal I'm not totally convinced that reloading the NLP object actually helps

I tried with this little program, and the memory grows the same way:

import random
import spacy
import plac
import psutil
import sys
import objgraph
import gc

gc.set_debug(gc.DEBUG_SAVEALL)  # note: DEBUG_SAVEALL keeps collectable objects in gc.garbage instead of freeing them, which itself inflates memory

def load_data():
    return ["This is a fake test document number %d."%i for i in random.sample(range(100_000), 10_000)]


# def print_memory_usage():
#     print(objgraph.show_growth(limit=5))
#     print("GC count="+str(gc.get_count()))
#     gc.collect()

def parse_texts(model, texts, iterations=1_000):
    for i in range(iterations):
        nlp = spacy.load(model)
        for doc in nlp.pipe(texts, cleanup=True):
            yield doc

@plac.annotations(
    iterations=("Number of iterations", "option", "n", int),
    model=("spaCy model to load", "positional", None, str)
)
def main(model='en_core_web_sm', iterations=1_000):
    texts = load_data()
    for i, doc in enumerate(parse_texts(model, texts, iterations=iterations)):
        if i % 10_000 == 0:
            print(i, psutil.virtual_memory().percent)
            #print_memory_usage()
            sys.stdout.flush()


if __name__ == '__main__':
    plac.call(main)

So I'm a bit puzzled by Python's memory management...

Best

Olivier

@BramVanroy
Contributor

BramVanroy commented Apr 23, 2019

Interesting. Could you try a more explicit approach where you explicitly del nlp, do a garbage collect, and only then re-load spaCy? Running a quick experiment, that does seem to work for me.
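
A bare-bones sketch of what that would look like (the model name and the point at which you reload are placeholders):

import gc
import spacy

nlp = spacy.load("en_core_web_sm")
# ... process a batch of texts ...
del nlp          # drop the only reference to the pipeline
gc.collect()     # force a collection before loading a fresh pipeline
nlp = spacy.load("en_core_web_sm")
# ... continue processing with the fresh pipeline ...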

@oterrier
Contributor Author

@BramVanroy , Sure here is the modified code snippet

import random
import spacy
import plac
import psutil
import sys
import objgraph
import gc

gc.set_debug(gc.DEBUG_SAVEALL)  # note: DEBUG_SAVEALL keeps collectable objects in gc.garbage instead of freeing them, which itself inflates memory

def load_data():
    return ["This is a fake test document number %d."%i for i in random.sample(range(100_000), 10_000)]


# def print_memory_usage():
#     print(objgraph.show_growth(limit=5))
#     print("GC count="+str(gc.get_count()))
#     gc.collect()

class ReloadableNlp:
    def __init__(self, model, reload=1000):
        self.model = model
        self.reload = reload
        self.count = 0
        self.nlp = spacy.load(model)

    def get_nlp(self):
        self.count += 1
        if self.count % self.reload == 0:  # reload the pipeline every `self.reload` calls
            del self.nlp
            gc.collect()
            self.nlp = spacy.load(self.model)
        return self.nlp



def parse_texts(reloadable, texts, iterations=1_000):
    for i in range(iterations):
        for doc in reloadable.get_nlp().pipe(texts, cleanup=True):
            yield doc

@plac.annotations(
    iterations=("Number of iterations", "option", "n", int),
    model=("spaCy model to load", "positional", None, str)
)
def main(model='en_core_web_sm', iterations=1_000):
    texts = load_data()
    reloadable = ReloadableNlp(model)
    for i, doc in enumerate(parse_texts(reloadable, texts, iterations=iterations)):
        if i % 10_000 == 0:
            print(i, psutil.virtual_memory().percent)
            #print_memory_usage()
            sys.stdout.flush()


if __name__ == '__main__':
    plac.call(main)

And the output

0 88.4
10000 88.6
20000 88.8
30000 88.7
40000 88.9
50000 89.3
60000 89.3
70000 89.5
80000 89.8
90000 89.9
100000 90.1
110000 90.3
120000 90.4
130000 90.5
140000 90.4
150000 90.9
160000 91.2
170000 91.0
180000 91.3
190000 91.2
200000 91.6
210000 91.8
220000 91.8
230000 91.8
240000 92.7
250000 92.6
260000 92.7
270000 92.8
280000 93.2
290000 93.4

I would say that the memory is still growing, but maybe more slowly...

Best

Olivier

@BramVanroy
Contributor

BramVanroy commented Apr 23, 2019

Okay, then I am out of ideas. I understand @honnibal's response that it is hard to tell who is responsible, spaCy or Python itself. I'm not sure how to debug this, especially since it is happening at a low level where deleting the object itself doesn't help. But that only makes this issue all the more dangerous/important, I believe.

Even if this issue only arises when processing large batches of text, a memory leak is still a memory leak. Personally, I think this should be a priority bug. But of course I have all the respect for the maintainers of this package and their priorities, and I'm not skilled enough to contribute to this issue, except for testing.

Some good Stack Overflow posts

From further testing, I also found that this is not only related to pipe(). The same issue occurs when you use the slower approach of calling nlp(text) for each text in texts.

@BramVanroy
Contributor

BramVanroy commented Apr 29, 2019

One sneaky thing I found you can do is use a multiprocessing Pool but limit the number of tasks a child process can do. E.g. (untested, off the top of my head):

from functools import partial
from multiprocessing import Pool
import spacy
from spacy.util import minibatch

with Pool(16, maxtasksperchild=5) as pool:
    nlp = spacy.load('en')
    proc_func = partial(process_batch_func, nlp)  # process_batch_func: your own batch worker
    for batch in minibatch(...):
        pool.apply_async(proc_func, (batch,))

This way, after having processed 5 batches, a child process will be killed and replaced by a new one. This ensures that all 'leaked memory' from that child process is freed.

I can confirm that with this method, I could parse a single file of 300M sentences. Using batches of 80K sentences, maxtasksperchild=10, 24 active cores and a machine with 384GB of RAM, memory usage never exceeded 40%.
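
For reference, one plausible (hypothetical) shape for the process_batch_func used above, which is not defined in this thread; it returns plain Python data so the Doc objects never have to travel back to the parent process:

def process_batch_func(nlp, batch):
    # Runs entirely inside a worker process.
    results = []
    for doc in nlp.pipe(batch):
        results.append([(token.text, token.pos_) for token in doc])
    return results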

Here is a worked-out example script, with comments.

https://github.com/BramVanroy/spacy-extreme

@ines ines added the help wanted Contributions welcome! label May 3, 2019
@BramVanroy
Contributor

BramVanroy commented May 7, 2019

We were having a discussion about this on Twitter, but I propose to keep the discussion here to make it easily accessible to others.

@honnibal asked how the lingering memory leak should be discussed and communicated to users going forward.

I do not think this should be stated explicitly anywhere. It may scare people off who would never run into the problem anyway. The problem only arises when you have a huge dataset, and I assume that people who do are knowledgeable enough to look on GitHub for issues they run into. (But I might be too naive here?)

To that end, though, I think that the title of this issue should be changed. The issue is not specific to pipe(). Maybe a title such as 'Memory issues when parsing a lot of data' is a better fit so that it can be easily found.

As I said before, I do not have the skills to debug this on a low-level, nor do I have the time if I did. I will, however, try to improve the repo I shared above. I will also add an example where instead of one huge file, you'd want to parse many (many) smaller files, i.e. where you want to parse all files in a directory efficiently. If requested, these examples can reside in spaCy/examples/.

@oterrier
Contributor Author

oterrier commented May 7, 2019

Hi all,
As originator of the issue I'm perfectly fine with the proposal of renaming

Best

Olivier

@oterrier oterrier changed the title Memory leak in Language.pipe Memory issues when parsing a lot of data May 7, 2019
@honnibal honnibal changed the title Memory issues when parsing a lot of data Memory issues for long-running parsing processes May 11, 2019
@fabio-reale

Hello,

I'm facing issues which fit the description of "Memory issues for long-running processes". It doesn't seem to involve memory leaks. I'm getting inconsistent errors, around the 200th processed text (roughly 2000 characters each).

I'm running Python under gdb. I mostly get a SIGSEGV reporting just a segmentation fault. I also get a SIGABRT, which is usually a double free or corruption. The backtraces from these SIGABRT cases usually have a sequence of calls from spacy/strings.cpython (std::_Rb_tree<unsigned long, unsigned long, std::_Identity, std::less, std::allocator >::_M_erase(std::_Rb_tree_node*)) leading up to the problem.

Because the issue is inconsistent and the problematic code is part of a large project, I've so far been unable to create a reasonably sized reproduction, but I'm working on it.

If there is any extra information I can provide to help, or tips on how I can better explain the problem, please let me know.

@ines
Member

ines commented May 14, 2019

@fabio-reale Thanks for the detailed report. Which version of spaCy are you running?

@fabio-reale

@fabio-reale Thanks for the detailed report. Which version of spaCy are you running?

I'm running version 2.1.3

@sadovnychyi
Contributor

explosion/srsly#4 – the source of at least one leak. Hopefully there are no more!

In our case spaCy actually held up pretty well on not-so-long-running batch jobs (meaning we didn't hit OOM), but we've been using srsly.ujson a lot, and it has been a huge pain trying to understand why we were leaking memory and crashing.

I guess monkey-patching srsly.ujson with the stdlib json before importing spacy could be a temporary workaround.

@honnibal
Member

honnibal commented Jun 7, 2019

Thanks to @sadovnychyi 's hard sleuthing and patch, we've now fixed the ujson memory leak.

I'm still not sure whether this would be the problem in this thread, since the problem should have been the same between v2.0.18 and v2.1, as both were using the same ujson code. It could be that we converted some call over from json to ujson though?

@BramVanroy and @oterrier If you have time, could you check whether your long-running jobs are more memory-stable after doing pip install -U srsly? (It should give you v0.0.6).

@fabio-reale I would say that's a different issue, so we should move it out into a different thread. If possible, could you log the texts during parsing? Then see if parsing just those texts triggers the issue for you. This should help us isolate the problem.

@oterrier
Contributor Author

oterrier commented Jun 11, 2019

@honnibal Sorry Matthew, but I confirm that the problem with long-running jobs remains, so +1 to moving the ujson issue into a different thread.

Best

Olivier

@fabio-reale

@fabio-reale I would say that's a different issue, so we should move it out into a different thread. If possible, could you log the texts during parsing? Then see if parsing just those texts triggers the issue for you. This should help us isolate the problem.

I was able to process all the texts I needed in small batches, and those also ran into the problem sometimes. This tells me two things:

  1. It's not the specific texts.
  2. It's not really about long-running processes.

I will keep investigating further. When I have a meaningful issue title, I'll be sure to start a new one.

Thanks

@p-sodmann
Contributor

I ran into a memory leak as well; maybe it is related.

When I used beam parsing extensively while training an NER model and calculating the loss on a test dataset every epoch, the memory usage quickly grew to 10 GB after about 20 epochs.

If I remove the beam parser, everything is fine and the RAM usage stays at around 500 MB.

@shaked571

shaked571 commented Jul 9, 2019

I also suffered from a memory leak, but it was actually in spacy.pipeline.EntityRuler.
My objective was to create a ruler with about 50K patterns; every time, it just exploded to 30 GB of RAM (my machine has 32 GB).
At first I tried removing the pipeline components I didn't use from the nlp object (i.e. nlp = en_core_web_md.load(disable=['parser', 'tagger', 'ner'])), but it didn't help (I only use it for tokenization).

Finally I wrote the following script that checks the memory use; if it exceeds a threshold, I dump the ruler to a file and start a new one:

def _create_ruler(self, data_path):
    from spacy.pipeline import EntityRuler
    import en_core_web_md
    nlp = en_core_web_md.load(disable=['parser', 'tagger', 'ner'])
    for filename in os.listdir(data_path):
        ruler = EntityRuler(nlp, overwrite_ents=True)
        with open(os.path.join(data_path, filename), errors='ignore', encoding='utf-8') as f:
            for line in f:
                if psutil.virtual_memory().available < 10_000_000_000:
                    print(f"virtual memory is under 10 GB. currently: {psutil.virtual_memory().available / 1_000_000_000} GB")
                    # dump the current ruler to disk and start a fresh one to release memory
                    ruler.to_disk(data_path)
                    ruler = EntityRuler(nlp, overwrite_ents=True)
                # `label` is defined elsewhere in the original project
                ruler.add_patterns([{"label": label, "pattern": [{"lower": i.text.lower()} for i in nlp(line.strip())]}])

I used the patch above to do so.

@ines
Member

ines commented Jul 9, 2019

@shaked571 Thanks for updating and sharing your code. That's very interesting 🤔 I wonder if this could be related to #3541 – we've been suspecting that there might be some memory issue in the Matcher (which is what both the EntityRuler and PhraseMatcher call into).

@usptact

usptact commented Jul 16, 2019

I am having the same issue with Python 3.6.8 on Mac with spaCy 2.1.4.

My script loads the "en_core_web_lg" model and gets NER tags for relatively short text utterances.

for line in f:
    doc = nlp(line.strip())
    for token in doc:
        iob_tag, ent_type = token.ent_iob_, token.ent_type_  # getting token.ent_iob_ and token.ent_type_

@azhuchkov

It seems like disabling the parser and NER prevents the memory from leaking:

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

Spacy: 2.1.4.

Hope this helps.

@usptact

usptact commented Aug 4, 2019

@azhuchkov Except I need NER in my application. It may be different for others, of course.

@Richie94

Richie94 commented Sep 5, 2019

Running with spaCy 2.1.8 at least seems to run smoother? With the script from @oterrier, without deleting the nlp object or calling gc, the memory usage no longer moves in only one direction...

10000 57.7
20000 58.0
30000 58.0
40000 58.1
50000 56.6
60000 56.7
70000 56.8
80000 56.8
90000 57.0
100000 57.1
110000 57.2
120000 57.3
130000 57.4
140000 57.5
150000 57.5
160000 57.6
170000 57.8
180000 57.8
190000 58.0
200000 58.6
210000 58.5
220000 58.5
230000 58.6
240000 58.7
250000 59.0
260000 59.1
270000 59.1
280000 59.1

@radoslawkrolikowski

radoslawkrolikowski commented Sep 19, 2019

Hello.
I stumbled across a very similar problem. When processing a large file (50,000 reviews) with nlp.pipe(), I noticed very high memory usage (near the end I actually run out of memory). I compared the performance of three functions that do the same job:

  1. Use nlp.pipe()
nlp = spacy.load('en_core_web_sm', disable = ['ner', 'parser', 'textcat', '...'])
required_tags = ['PROPN', 'PUNCT', 'NOUN', 'ADJ', 'VERB']

def pos(df, batch_size, n_threads, required_tags):
    review_dict = collections.defaultdict(dict)
    for i, doc in enumerate(nlp.pipe(df, batch_size=batch_size, n_threads=n_threads)):
         for token in doc:
            pos = token.pos_
            if pos in required_tags:
                review_dict[i].setdefault(pos, 0)
                review_dict[i][pos] = review_dict[i][pos] + 1
    return pd.DataFrame(review_dict).transpose()
  2. Use nlp.pipe(as_tuples=True)
nlp = spacy.load('en_core_web_sm', disable = ['ner', 'parser', 'textcat', '...'])
required_tags = ['PROPN', 'PUNCT', 'NOUN', 'ADJ', 'VERB']

def pos(df, batch_size, n_threads, required_tags):
    # Add index to reviews and change column order
    reviews = df.reset_index(drop=False)[['review', 'index']]
    # Convert dataframe to list of tuples (review, index)
    review_list = list(zip(*[reviews[c].values.tolist() for c in reviews]))
    # Create empty dictionary
    review_dict = collections.defaultdict(dict)
    
    for doc, context in list(nlp.pipe(review_list, as_tuples=True, batch_size=batch_size, n_threads=n_threads)):
        review_dict[context] = {}
        for token in doc:
            pos = token.pos_
            if pos in required_tags:
                review_dict[context].setdefault(pos, 0)
                review_dict[context][pos] = review_dict[context][pos] + 1
    # Transpose data frame to shape (index, tags)
    return pd.DataFrame(review_dict).transpose()
  3. Without nlp.pipe()
nlp = spacy.load('en_core_web_sm', disable = ['ner', 'parser', 'textcat', '...'])
required_tags = ['PROPN', 'PUNCT', 'NOUN', 'ADJ', 'VERB']

def pos(df, required_tags):
    pos_list = []
    for i in range(df.shape[0]):
        doc = nlp(df[i])
        pos_dict = {}
        for token in doc:
            pos = token.pos_
            if pos in required_tags:
                pos_dict.setdefault(pos, 0)
                pos_dict[pos] = pos_dict[pos] + 1
        pos_list.append(pos_dict)
    return pd.DataFrame(pos_list)

The results of processing 1000 examples are the following:

  • increase in memory usage by python:
    1. Use nlp.pipe() - 60 MB
    2. Use nlp.pipe(as_tuples=True) - 170 MB
    3. Without nlp.pipe() - 20 MB
  • using disable=['parser', 'ner'] decreases the memory usage a bit and speeds up the processing, but the memory leak is still huge.
  • using batch_size = 512 or batch_size = 4 gives almost equal memory usage.
  • using gc.collect() didn't help either.

I think you can conclude from these results that using nlp.pipe(), especially with as_tuples=True, can lead to a huge memory leak, while the standard approach without nlp.pipe() on the same number of examples gives much better results in terms of memory usage, and not much worse results in terms of processing time. What really surprised me was the difference between nlp.pipe() and nlp.pipe(as_tuples=True); why does using as_tuples almost triple the RAM usage?
In summary, considering these results, I am forced to give up using nlp.pipe(), at least until the problem is solved. I know it was said that this issue is not related to nlp.pipe(), but in my case it looks like it is...?

Spacy version: 2.1.8

Edit: I confirm that dividing the document into smaller chunks and then reloading nlp helped.

Best regards.
Radek.

@flawaetz

flawaetz commented Oct 2, 2019

I think I might be hitting this issue too. I'm feeding a spacy nlp.pipe() with a generator that yields elasticsearch documents. I run into trouble even before hitting 100MB of text. This is on spaCy 2.2.0, Python 3.6.5 (and 3.5), Ubuntu 16.04.6 running dockerized.

I do not encounter this memory leak running spaCy 1.10.1.
Edit: Incorrect, mostly. In spaCy 1.10.1 there is no issue when using the built-in en model. With en_core_web_sm, however, the pipeline stops progressing at about the same point as below. Instead of memory growth, it just sits there using a couple of cores.

On Python 3.6.5 and spaCy 2.2.0 with the en_core_web_sm model, this interesting stack trace was eventually generated:

Traceback (most recent call last):
  File "/x.py", line 48, in <module>
    for doc in nlp.pipe(gen):
  File "/usr/local/lib/python3.5/dist-packages/spacy/language.py", line 751, in pipe
    for doc in docs:
  File "nn_parser.pyx", line 221, in pipe
  File "/usr/local/lib/python3.5/dist-packages/spacy/util.py", line 463, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "pipes.pyx", line 399, in pipe
  File "pipes.pyx", line 411, in spacy.pipeline.pipes.Tagger.predict
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/model.py", line 169, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/feed_forward.py", line 40, in predict
    X = layer(X)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/model.py", line 169, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/dist-packages/thinc/api.py", line 310, in predict
    X = layer(layer.ops.flatten(seqs_in, pad=pad))
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/model.py", line 169, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/feed_forward.py", line 40, in predict
    X = layer(X)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/model.py", line 169, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/resnet.py", line 14, in predict
    Y = self._layers[0](X)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/model.py", line 169, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/feed_forward.py", line 40, in predict
    X = layer(X)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/model.py", line 169, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/dist-packages/thinc/neural/_classes/convolution.py", line 30, in predict
    return self.ops.seq2col(X, self.nW)
  File "ops.pyx", line 553, in thinc.neural.ops.NumpyOps.seq2col
  File "cymem.pyx", line 68, in cymem.cymem.Pool.alloc
MemoryError: Error assigning 18446744070330497280 bytes

That's quite the memory allocation request (18 exabytes, for the curious). Maybe the memory issue is with thinc and not spaCy?

I'm not much of a gdb expert (so take this all with a grain of salt), but I re-ran this job with a Python debug build and halted the program at times when it appeared to be aggressively requesting memory. Looking at backtrace full, I frequently noticed this recursion error:

 implementing_args = {<numpy.ndarray at remote 0x7f1b846cc8f0>, <unknown at remote 0x30>, (None,), <code at remote 0x7f1ba6b5f420>, <unknown at remote 0x7f1b9020d908>, '__builtins__',
Python Exception <class 'RecursionError'> maximum recursion depth exceeded in comparison:
          <unknown at remote 0xffffffff>, <unknown at remote 0x7ffde922d440>, < at remote 0x7ffde922d428>, < at remote 0x7ffde922d420>, <unknown at remote 0x7ffde922d418>, <unknown at remote 0x5a2db0>,
          ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((...(truncated), 

as well as:

        implementing_args = {<numpy.ndarray at remote 0x7f9fc036c210>, <unknown at remote 0x7f9f00000001>, (None,), <code at remote 0x7f9fe28ff4b0>, , '__builtins__', <unknown at remote 0xffffffff>,
Python Exception <class 'UnicodeDecodeError'> 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128):
          <unknown at remote 0x7ffc8cf046c0>, , <unknown at remote 0x7ffc8cf046a0>, <unknown at remote 0x7ffc8cf04698>, <unknown at remote 0x5a2db0>, <unknown at remote 0x8>,
Python Exception <class 'UnicodeDecodeError'> 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128):
          <unknown at remote 0x2732ef0>, 0x0, <unknown at remote 0x7f9fffffffff>, 'drop', <unknown at remote 0x55565613444d6b00>, , <code at remote 0x7f9fe28ff4b0>,
Python Exception <class 'UnicodeDecodeError'> 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128):


I'm happy to provide any other info or re-run under different configurations as might be helpful.

@BramVanroy
Contributor

@flawaetz Very useful! Especially the RecursionError seems to indicate an issue that can explain the memory leak.

@honnibal
Member

honnibal commented Oct 3, 2019

@flawaetz Great analysis, thanks!

The huge allocation is especially interesting. I wonder whether there might be a memory corruption behind that? If something wrote out-of-bounds, that might explain how we end up with such a ridiculous size to allocate...

@flawaetz

flawaetz commented Oct 3, 2019

@honnibal Could be memory corruption. It's interesting to me that spaCy 1.10.1 with the same en_core_web_sm model grinds to a halt around the same place. No memory explosion, just churn with no progress.

I'm happy to provide (out of band) code and source data to reproduce if that would be helpful.

@ines
Member

ines commented Oct 22, 2019

Closing this one since we're pretty sure the leak fixed in #4486 was the underlying problem.

@adrianeboyd
Contributor

adrianeboyd commented Oct 22, 2019

After applying #4486, here is my memory usage for the script from @oterrier's comment above, without reloading nlp at all:

0 22.1
10000 22.2
20000 22.1
30000 22.7
40000 22.2
50000 22.2
60000 22.0
70000 21.9
80000 22.0
90000 22.0
100000 22.1
110000 22.2
120000 22.9
130000 21.6
140000 21.6
150000 21.6
160000 21.6
170000 21.7
180000 21.6
190000 21.6
200000 21.4
210000 21.8
220000 22.0
230000 22.3
240000 22.4
250000 21.9

For the unusual behavior related to thinc/numpy, I wonder if these bugs might be related?

cython/cython#2828
cython/cython#3046

@BramVanroy
Contributor

This looks great. Thanks for your fix @adrianeboyd !

@lock

lock bot commented Nov 21, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Nov 21, 2019