Matcher OOM after pathological pattern #3541

Closed
RandomJungle opened this issue Apr 4, 2019 · 8 comments
Labels
feat / matcher Feature: Token, phrase and dependency matcher perf / speed Performance: speed

Comments

@RandomJungle

First of all, thanks a lot for all the great new pattern matching features in version 2.1, they are amazing! However, I keep getting an error with several patterns that I had to update from version 2.0 to 2.1. Here's the description:

How to reproduce the behaviour

The following code produces an error that might be caused by the many '*' operators in the pattern (perhaps because of too many combinations of matches?):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("fr_core_news_md-2.1.0")
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN', None,
            [
                {'LOWER': {'IN': ['société', 'banque', 'crédit']}},
                {
                    'ENT_TYPE': {'IN': ['MISC', 'LOC', 'PER', 'ORG']},
                    'OP': '+'
                },
                {'LOWER': {'IN': ['et', ',', '-', '&', 'de']}, 'OP': '*'},
                {
                    'ENT_TYPE': {'IN': ['MISC', 'LOC', 'PER', 'ORG']},
                    'OP': '*'
                },
                {'LOWER': {'IN': ['et', ',', '-', '&', 'de']}, 'OP': '*'},
                {
                    'ENT_TYPE': {'IN': ['MISC', 'LOC', 'PER', 'ORG']},
                    'OP': '*'
                },
                {'LOWER': {'IN': ['et', ',', '-', '&', 'de']}, 'OP': '*'},
                {
                    'ENT_TYPE': {'IN': ['MISC', 'LOC', 'PER', 'ORG']},
                    'OP': '*'
                },
            ])

doc = nlp(" des sociétés Crédit lyonnais, Banque nationale de Paris, Société générale, Banque populaire du Val-de-Loire et Caisse d'épargne et de prévoyance,")

matches = matcher(doc)
print(matches)

The error :

  File "matcher.pyx", line 177, in spacy.matcher.matcher.Matcher.__call__
  File "matcher.pyx", line 238, in spacy.matcher.matcher.find_matches
  File "matcher.pyx", line 295, in spacy.matcher.matcher.transition_states
MemoryError: std::bad_alloc

Your Environment

  • Python Version Used: 3.5
  • spaCy Version Used: 2.1.3
@ines ines added feat / matcher Feature: Token, phrase and dependency matcher perf / memory Performance: memory use labels Apr 5, 2019
@ines
Member

ines commented Apr 5, 2019

Thanks for the report! Just to double-check: did you actually run out of memory during matching or not?

@ines ines added the more-info-needed This issue needs more information label Apr 5, 2019
@RandomJungle
Author

Thanks for the quick answer! Just to be clear, we did not get an explicit "out of memory" error; we just got this bad allocation error.

@no-response no-response bot removed the more-info-needed This issue needs more information label Apr 5, 2019
@RandomJungle
Author

Coming back to this issue, as we run into a lot of allocation errors on our documents, with several different patterns. I don't know whether the error is connected in some way to issue #3618 and the solution that was proposed in spacy extreme, but it seems to be linked either to too many potential matches in a sentence (e.g. patterns that are too complex and generate many potential matches) or to documents simply being too long (we tend to have long documents, which we segment into shorter sentences that can still be long to parse).

  File "matcher.pyx", line 177, in spacy.matcher.matcher.Matcher.__call__
  File "matcher.pyx", line 238, in spacy.matcher.matcher.find_matches
  File "matcher.pyx", line 333, in spacy.matcher.matcher.transition_states
MemoryError: std::bad_alloc

Do you think there's a connection between this issue and the ones described in #3618? We didn't have such errors before upgrading to spaCy 2.1; before that we were running spaCy 2.0.16. We get better results with the new version, but we can't use it in production because of these memory allocation errors.

I wasn't able to put together a reproducible version of the other patterns that raised the error, because of their size (we use several alternatives of very large patterns in our matchers), so please let me know if I can provide any additional information or details about the error.

@RandomJungle
Author

Hello @ines, I don't know whether there has been any progress on this issue, but it is causing errors we can't work around in matchers for tasks that used to run smoothly. We have tried a lot of different approaches, especially multiprocessing, to avoid these errors, but we can't find a way to resolve the problem. The odd part is that some of these errors happen on rather short sentences, so it is strange to me that all the memory should be consumed on such short spans of text.

Thanks a lot for the great work on spacy, hope this issue is solvable :)

@paulrinckens

Hi all,

I can confirm this: I'm running into the same issue when using many Matchers. The issue seems especially likely to appear when the processed text contains German umlauts (ä, ü, ö).

@honnibal
Member

honnibal commented Jul 9, 2019

It can be tricky to track down these memory errors, so thanks for your patience on this.

At first I thought I'd tracked this to what must be an out-of-bounds access. Unfortunately not: I think the bug I fixed in #3839 is unlikely to be the same as this one.

@honnibal honnibal added bug Bugs and behaviour differing from documentation and removed perf / memory Performance: memory use labels Jul 9, 2019
honnibal added a commit that referenced this issue Jul 11, 2019
honnibal added a commit that referenced this issue Jul 11, 2019
…3949)

* Add regression test for issue #3541

* Add comment on bugfix

* Remove incorrect test

* Un-xfail test
@honnibal
Member

After running the code a bit, I do think this is just a pathological pattern, unfortunately. I don't think we can really do anything about this, so I wouldn't class it as a bug: it's just the nature of the operator semantics that you'll be able to construct patterns like this.
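
To illustrate why the operator semantics can blow up, here is a minimal pure-Python sketch. It is not spaCy's matcher and does not reproduce its internal state handling; it only counts, for a toy pattern with the same shape as the one reported above (one fixed token, one '+' slot, then alternating '*' slots), how many distinct ways the slots can be assigned to a run of matching tokens. The names `count_alignments`, `PATTERN`, and the labels "X", "A", "B" are hypothetical and chosen for the illustration.

```python
from functools import lru_cache

# Toy pattern mirroring the shape of the reported one (labels are made up):
# X  A+  B*  A*  B*  A*  B*  A*
PATTERN = [("X", "1"), ("A", "+"), ("B", "*"), ("A", "*"),
           ("B", "*"), ("A", "*"), ("B", "*"), ("A", "*")]


def count_alignments(pattern, tokens):
    """Count every way the pattern slots can be aligned to some span of tokens.

    Each alignment is a distinct choice of how many tokens each slot consumes.
    This is a simplified model of the search space, not spaCy's algorithm.
    """

    @lru_cache(maxsize=None)
    def ways(i, j, used):
        # i: next token index, j: current slot, used: slot j consumed >= 1 token
        if j == len(pattern):
            return 1  # all slots satisfied: one complete alignment
        label, op = pattern[j]
        total = 0
        # Move on to the next slot if this one is satisfied.
        if used or op == "*":
            total += ways(i, j + 1, False)
        # Let the current slot consume the next token, if allowed.
        can_repeat = op in ("+", "*")
        if i < len(tokens) and tokens[i] == label and (can_repeat or not used):
            total += ways(i + 1, j, True)
        return total

    # A match may start at any token position.
    return sum(ways(start, 0, False) for start in range(len(tokens)))


for n in (5, 10, 20):
    tokens = ["X"] + ["A"] * n
    print(n, count_alignments(PATTERN, tokens))
# 5 70
# 10 715
# 20 8855
```

In this toy model the count for a run of n matching tokens grows like n^4 with four repeatable slots of the same kind, and each additional '*' slot raises the exponent. A matcher that tracks candidate states per alignment would therefore see its state count grow quickly even on a single sentence, which fits the observation that short inputs can trigger the error.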

@honnibal honnibal added perf / speed Performance: speed and removed bug Bugs and behaviour differing from documentation labels Jul 11, 2019
@honnibal honnibal changed the title std::bad_alloc error on Pattern Matcher Matcher OOM after pathological pattern Jul 11, 2019
@ines ines closed this as completed Jul 23, 2019
polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019
…erators (explosion#3949)

* Add regression test for issue explosion#3541

* Add comment on bugfix

* Remove incorrect test

* Un-xfail test
@lock

lock bot commented Aug 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 22, 2019