
Extremely long single-word tokenization in French #1078

Closed
raphael0202 opened this issue May 22, 2017 · 5 comments
Labels
lang / fr French language data and models

Comments

@raphael0202
Contributor

Minimal example to reproduce:

import spacy

nlp = spacy.load('fr')
doc = nlp("heiiiiiiiiiiiiiiiiiiiiiiiin")

The process hangs during tokenization. Depending on the number of 'i' characters, the process sometimes finishes, but only after a minute or more.

If we stop the process, we get this stack trace:

^C---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-13-6eb60d541cb6> in <module>()
----> 1 doc = nlp("heiiiiiiiiiiiiiiiiiiiiiiiiin")

/home/raphael/.virtualenvs/spacy/lib/python3.5/site-packages/spacy/language.py in __call__(self, text, tag, parse, entity)
    318             ('An', 'NN')
    319         """
--> 320         doc = self.make_doc(text)
    321         if self.entity and entity:
    322             # Add any of the entity labels already set, in case we don't have them.

/home/raphael/.virtualenvs/spacy/lib/python3.5/site-packages/spacy/language.py in <lambda>(text)
    291             self.make_doc = overrides['create_make_doc'](self)
    292         elif not hasattr(self, 'make_doc'):
--> 293             self.make_doc = lambda text: self.tokenizer(text)
    294         if 'pipeline' in overrides:
    295             self.pipeline = overrides['pipeline']

/home/raphael/.virtualenvs/spacy/lib/python3.5/site-packages/spacy/tokenizer.pyx in spacy.tokenizer.Tokenizer.__call__ (spacy/tokenizer.cpp:5486)()

/home/raphael/.virtualenvs/spacy/lib/python3.5/site-packages/spacy/tokenizer.pyx in spacy.tokenizer.Tokenizer._tokenize (spacy/tokenizer.cpp:6047)()

/home/raphael/.virtualenvs/spacy/lib/python3.5/site-packages/spacy/tokenizer.pyx in spacy.tokenizer.Tokenizer._split_affixes (spacy/tokenizer.cpp:6185)()

This issue doesn't occur with the English model, so it seems to be related to the French tokenization exceptions.

Your Environment

  • spaCy version: 1.8.2
  • Python version: 3.5.2
  • Platform: Linux-4.4.0-78-generic-x86_64-with-Ubuntu-16.04-xenial
  • Installed models: fr, en
@honnibal added the lang / fr (French language data and models) and performance labels May 22, 2017
@Arnie0426

Arnie0426 commented May 30, 2017

It also happens with repeated 'f', 'i', and 's' characters.

I also wouldn't call this a performance issue; it's definitely a bug, because the tokenizer is essentially unusable on input with those repeated characters.

@Arnie0426

Played around with it a bit, and found that the URL pattern is the culprit here: https://github.com/explosion/spaCy/blob/master/spacy/fr/tokenizer_exceptions.py#L211
https://github.com/explosion/spaCy/blob/master/spacy/language_data/tokenizer_exceptions.py#L35

Removing that pattern fixes it. I'm not sure why it's even there; do we really need to match URL patterns for tokens?
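
For context, this is the classic catastrophic-backtracking failure mode: a pattern that nests quantifiers over overlapping character sets forces the engine, on a non-matching input, to try exponentially many ways of splitting a repeated run before it can give up. A toy illustration (not spaCy's actual _URL_PATTERN):

import re
import time

# Nested quantifiers over the same character: a long run of 'a' followed
# by a character that breaks the match triggers exponential backtracking.
pattern = re.compile(r'(a+)+b')

for n in (18, 20, 22, 24):
    text = 'a' * n + 'c'  # cannot match, so every split must be tried
    start = time.perf_counter()
    pattern.match(text)
    print(n, round(time.perf_counter() - start, 3), 'seconds')

Each additional 'a' roughly doubles the running time, which mirrors how each extra 'i' in the "heiiii...n" example made the tokenizer hang longer.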

@raphael0202
Contributor Author

raphael0202 commented Aug 14, 2017

This URL pattern has already caused some hanging issues in the past: #957

I've dug further, and it seems the matching hangs when we use the re.IGNORECASE flag here: TOKEN_MATCH = re.compile('|'.join('(?:{})'.format(m) for m in REGULAR_EXP), re.IGNORECASE).match. Here, the regex module is used (import regex as re at the top of the file). If the re standard-library module is used instead, the process no longer hangs.

So it seems there is a bug in how the regex module handles the IGNORECASE flag.
One easy solution would be to use the re module instead of regex. Alternatively, we could rewrite (and simplify) the _URL_PATTERN regex.

@honnibal What do you think would be the best?

edit: I agree with @Arnie0426: this is not just a performance issue but a blocker, as spaCy is unreliable when using the French pipeline.
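
A quick harness for reproducing the engine comparison described above (a sketch: it assumes _URL_PATTERN is importable as a module-level name from the spacy/language_data/tokenizer_exceptions.py file linked earlier, and that both engines can compile it):

import time
import re      # standard-library engine
import regex   # third-party engine that the file imports as `re`

from spacy.language_data.tokenizer_exceptions import _URL_PATTERN

text = 'heiiiiiiiiiiiiiiiiiiiiiiiin'
for engine in (re, regex):
    matcher = engine.compile(_URL_PATTERN, engine.IGNORECASE).match
    start = time.perf_counter()
    matcher(text)  # expect the regex run to take far longer, or hang
    print(engine.__name__, time.perf_counter() - start, 'seconds')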

raphael0202 pushed a commit to raphael0202/spaCy that referenced this issue Oct 11, 2017
- avoid catastrophic backtracking
- reduce character range of host name, domain name and TLD identifier
ines added a commit that referenced this issue Oct 11, 2017
Resolve issue #1078 by simplifying URL pattern
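
The shape of that simplification, as an illustrative sketch (not the actual pattern merged in PR #1411): anchor each host label so adjacent quantifiers cannot overlap, and narrow the character ranges so a bare word fails fast.

import re

# Each host label must end in a literal '.', so the engine cannot
# re-partition a long letter run while backtracking.
_URL_PATTERN = (
    r'^https?://'                                    # scheme
    r'(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+'   # host labels
    r'[a-z]{2,6}'                                    # TLD
    r'(?:/\S*)?$'                                    # optional path
)
TOKEN_MATCH = re.compile(_URL_PATTERN, re.IGNORECASE).match

assert TOKEN_MATCH('https://example.com/page') is not None
assert TOKEN_MATCH('heiiiiiiiiiiiiiiiiiiiiiiiin') is None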
@raphael0202
Contributor Author

Closing issue (see PR #1411)

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018