
Extremely long single-word tokenization in French #1078

Closed
raphael0202 opened this issue May 22, 2017 · 5 comments
Labels
lang / fr French language data and models

Comments

@raphael0202
Contributor

Minimal example to reproduce:

import spacy

nlp = spacy.load('fr')
doc = nlp("heiiiiiiiiiiiiiiiiiiiiiiiin")

The process hangs during tokenization. Depending on the number of 'i' characters, the process sometimes finishes, but only after a minute or more.

If we stop the process, we get this stack trace:

^C---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-13-6eb60d541cb6> in <module>()
----> 1 doc = nlp("heiiiiiiiiiiiiiiiiiiiiiiiiin")

/home/raphael/.virtualenvs/spacy/lib/python3.5/site-packages/spacy/language.py in __call__(self, text, tag, parse, entity)
    318             ('An', 'NN')
    319         """
--> 320         doc = self.make_doc(text)
    321         if self.entity and entity:
    322             # Add any of the entity labels already set, in case we don't have them.

/home/raphael/.virtualenvs/spacy/lib/python3.5/site-packages/spacy/language.py in <lambda>(text)
    291             self.make_doc = overrides['create_make_doc'](self)
    292         elif not hasattr(self, 'make_doc'):
--> 293             self.make_doc = lambda text: self.tokenizer(text)
    294         if 'pipeline' in overrides:
    295             self.pipeline = overrides['pipeline']

/home/raphael/.virtualenvs/spacy/lib/python3.5/site-packages/spacy/tokenizer.pyx in spacy.tokenizer.Tokenizer.__call__ (spacy/tokenizer.cpp:5486)()

/home/raphael/.virtualenvs/spacy/lib/python3.5/site-packages/spacy/tokenizer.pyx in spacy.tokenizer.Tokenizer._tokenize (spacy/tokenizer.cpp:6047)()

/home/raphael/.virtualenvs/spacy/lib/python3.5/site-packages/spacy/tokenizer.pyx in spacy.tokenizer.Tokenizer._split_affixes (spacy/tokenizer.cpp:6185)()

This issue doesn't occur with the English model, so it seems to be related to the French tokenization exceptions.

Your Environment

  • spaCy version: 1.8.2
  • Python version: 3.5.2
  • Platform: Linux-4.4.0-78-generic-x86_64-with-Ubuntu-16.04-xenial
  • Installed models: fr, en
@honnibal added the lang / fr (French language data and models) and performance labels May 22, 2017
@Arnie0426

Arnie0426 commented May 30, 2017

It also happens with repeated 'f', 'i', and 's' characters.

I also wouldn't call this a performance issue; it's definitely a bug, because the tokenizer is essentially unusable on input with those repeated characters.

@Arnie0426

Played around with it a bit, and found that the URL pattern is the culprit here: https://github.com/explosion/spaCy/blob/master/spacy/fr/tokenizer_exceptions.py#L211
https://github.com/explosion/spaCy/blob/master/spacy/language_data/tokenizer_exceptions.py#L35

Removing that pattern fixes it. I'm not sure why it's even there; do we really need to match URL patterns for tokens?
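
For context, this is the classic catastrophic-backtracking failure mode: a pattern that nests quantifiers over overlapping character sets forces the engine, on a non-matching input, to try exponentially many ways of splitting a repeated run before it can give up. A toy illustration (not spaCy's actual _URL_PATTERN):

import re
import time

# Nested quantifiers over the same character: a long run of 'a' followed
# by a character that breaks the match triggers exponential backtracking.
pattern = re.compile(r'(a+)+b')

for n in (18, 20, 22, 24):
    text = 'a' * n + 'c'  # cannot match, so every split must be tried
    start = time.perf_counter()
    pattern.match(text)
    print(n, round(time.perf_counter() - start, 3), 'seconds')

Each additional 'a' roughly doubles the running time, which mirrors how each extra 'i' in the "heiiii...n" example made the tokenizer hang longer.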

@raphael0202
Contributor Author

raphael0202 commented Aug 14, 2017

This URL pattern has already caused some hanging issues in the past: #957

I've dug further, and it seems the matching hangs when we use the re.IGNORECASE flag here: TOKEN_MATCH = re.compile('|'.join('(?:{})'.format(m) for m in REGULAR_EXP), re.IGNORECASE).match. Here, the regex module is used (import regex as re at the top of the file). If the re standard-library module is used instead, the process no longer hangs.

So it seems there is a bug in how the regex module handles the IGNORECASE flag.
One easy solution would be to use the re module instead of regex. Alternatively, we could rewrite (and simplify) the _URL_PATTERN regex.

@honnibal What do you think would be the best?

edit: I agree with @Arnie0426: this is not just a performance issue but a blocker, as spaCy is unreliable when using the French pipeline.
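
A quick harness for reproducing the engine comparison described above (a sketch: it assumes _URL_PATTERN is importable as a module-level name from the spacy/language_data/tokenizer_exceptions.py file linked earlier, and that both engines can compile it):

import time
import re      # standard-library engine
import regex   # third-party engine that the file imports as `re`

from spacy.language_data.tokenizer_exceptions import _URL_PATTERN

text = 'heiiiiiiiiiiiiiiiiiiiiiiiin'
for engine in (re, regex):
    matcher = engine.compile(_URL_PATTERN, engine.IGNORECASE).match
    start = time.perf_counter()
    matcher(text)  # expect the regex run to take far longer, or hang
    print(engine.__name__, time.perf_counter() - start, 'seconds')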

raphael0202 pushed a commit to raphael0202/spaCy that referenced this issue Oct 11, 2017
- avoid catastrophic backtracking
- reduce character range of host name, domain name and TLD identifier
ines added a commit that referenced this issue Oct 11, 2017
Resolve issue #1078 by simplifying URL pattern
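
The shape of that simplification, as an illustrative sketch (not the actual pattern merged in PR #1411): anchor each host label so adjacent quantifiers cannot overlap, and narrow the character ranges so a bare word fails fast.

import re

# Each host label must end in a literal '.', so the engine cannot
# re-partition a long letter run while backtracking.
_URL_PATTERN = (
    r'^https?://'                                    # scheme
    r'(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+'   # host labels
    r'[a-z]{2,6}'                                    # TLD
    r'(?:/\S*)?$'                                    # optional path
)
TOKEN_MATCH = re.compile(_URL_PATTERN, re.IGNORECASE).match

assert TOKEN_MATCH('https://example.com/page') is not None
assert TOKEN_MATCH('heiiiiiiiiiiiiiiiiiiiiiiiin') is None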
@raphael0202
Contributor Author

Closing issue (see PR #1411)

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018