Spacy 2.1.4 fails to initialize or load fasttext model using python2.7.5 #3734

mohamed-ali · 2019-05-13T11:07:39Z

How to reproduce the behaviour

The following code chunk from https://github.com/explosion/spaCy/blob/master/spacy/lang/tokenizer_exceptions.py fails in python 2.7.5 which makes init-model and spacy.load fail:

from __future__ import unicode_literals
import re

# URL validation regex courtesy of: https://mathiasbynens.be/demo/url-regex
# A few minor mods to this regex to account for use cases represented in test_urls
URL_PATTERN = (
    r"^"
    # in order to support the prefix tokenization (see prefix test cases in test_urls).
    r"(?=[\w])"
    # protocol identifier
    r"(?:(?:https?|ftp|mailto)://)?"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    # IP address exclusion
    # private & local networks
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    # MH: Do we really need this? Seems excessive, and seems to have caused
    # Issue #957
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    # host name
    r"(?:(?:[a-z0-9\-]*)?[a-z0-9]+)"
    # domain name
    r"(?:\.(?:[a-z0-9])(?:[a-z0-9\-])*[a-z0-9])?"
    # TLD identifier
    r"(?:\.(?:[a-z]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    # query parameters
    r"\??(:?\S*)?"
    # in order to support the suffix tokenization (see suffix test cases in test_urls),
    r"(?<=[\w/])"
    r"$"
).strip()

TOKEN_MATCH = re.compile(URL_PATTERN, re.UNICODE).match

The error message:

 Traceback (most recent call last):
  File "test.py", line 50, in <module>
    TOKEN_MATCH = re.compile(URL_PATTERN, re.UNICODE).match
  File "/usr/lib64/python2.7/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib64/python2.7/re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: nothing to repeat

The issues is known: (SO discussion):

https://stackoverflow.com/questions/3675144/regex-error-nothing-to-repeat)

Was fixed:

https://bugs.python.org/issue18647

Info about spaCy & enviroment:

Python version: 2.7.5
Platform: Linux-3.10.0-327.4.5.el7.x86_64-x86_64-with-centos-7.2.1511-Core
spaCy version: 2.1.4

Potential solutions

This issue doesn't occur when working with spacy 2.0.18.

Based on the blame history here the issue started to occur since the switch from regex to re three months ago.

Some quick fixes would be:

Drop support for python 2.7.5
Default to using regex if python version is 2.7.5

The text was updated successfully, but these errors were encountered:

BramVanroy · 2019-05-13T13:00:17Z

Considering 2.7.5 dates from May, 2013 I am in favour of dropping support. On a related note, NLTK seems to push to drop 2.7 support as a whole (nltk/nltk#2296). If this would help development, this may be a good idea for spaCy as well.

mohamed-ali · 2019-05-16T07:34:41Z

@BramVanroy Actually python2.7 is going to retire next year (you can see how much time is left for it here: https://pythonclock.org/ ). Also many DS/machine learning libraries will drop it including tensorflow, scikit-learn, numpy, pandas and many others. (check this link https://python3statement.org/)

However, it will take more time for production servers to actually drop python2.7 especially that some of them, like centos use it within yum package manager.

BramVanroy · 2019-05-16T08:14:29Z

Oh wow, I didn't know the deprecation was that wide-spread. Thanks for the link, nice to see that many packages join in.

honnibal · 2019-05-16T19:59:16Z

Does it only affect Python 2.7.5? We test a more recent Python2.7 in our CI, and it works.

We'll drop Python2.7 eventually, sure. We've already dropped Python2.7 on Windows. Tentatively I would suggest December 31 2020 as a date to start developing versions which only work on Python 3.

BramVanroy · 2019-05-17T07:22:32Z

Does it only affect Python 2.7.5? We test a more recent Python2.7 in our CI, and it works.

We'll drop Python2.7 eventually, sure. We've already dropped Python2.7 on Windows. Tentatively I would suggest December 31 2020 as a date to start developing versions which only work on Python 3.

Yes. From the discussion on the Python boards it seems that the fix was pushed between 2.7.5 and 2.7.6. 2.7.6 should be safe, and as such it doesn't seem bad to just drop support for <= 2.7.5 specifically. In the bug report there is also a mention of Python 3.3 and 3.4, but I can't seem to find out which subversion they are talking about.

mohamed-ali · 2019-05-17T14:55:22Z

@BramVanroy @honnibal I did more debugging, and the part of regex that causes the problem is

# host name
r"(?:(?:[a-z0-9\-]*)?[a-z0-9]+)"

Despite that it's considered a bug in python2.7.5 . I think there is a small error in the code as well. Because the regex syntax (seomthing*)? means that optionally we can much a null, which doesn't make sense. I think the regex can be tweaked a bit, for example like this:

(?:(?:[a-z0-9]*)\-?[a-z0-9]+)

to match a string like server-local-host.

PS: the above tweak might not be the optimal alternative but just to make a statement that the regex can be fixed without dropping the support for python2.7.5.

honnibal · 2019-10-03T19:54:00Z

@mohamed-ali I think this should be fixed in v2.2. Could you try and see whether it works for you?

no-response · 2019-10-17T19:56:05Z

This issue has been automatically closed because there has been no response to a request for more information from the original author. With only the information that is currently in the issue, there's not enough information to take action. If you're the original author, feel free to reopen the issue if you have or find the answers needed to investigate further.

lock · 2019-11-16T20:54:40Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

mohamed-ali changed the title ~~Spacy 2.1.4 fails to load fasttext French model~~ Spacy 2.1.4 fails to initialize or load fasttext model using python2.7.5 May 13, 2019

honnibal added the compat Cross-platform and cross-Python compatibility label May 16, 2019

ines added feat / tokenizer Feature: Tokenizer lang / all Global language data labels May 16, 2019

honnibal added the more-info-needed This issue needs more information label Oct 3, 2019

no-response bot closed this as completed Oct 17, 2019

lock bot locked as resolved and limited conversation to collaborators Nov 16, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spacy 2.1.4 fails to initialize or load fasttext model using python2.7.5 #3734

Spacy 2.1.4 fails to initialize or load fasttext model using python2.7.5 #3734

mohamed-ali commented May 13, 2019 •

edited

Loading

BramVanroy commented May 13, 2019 •

edited

Loading

mohamed-ali commented May 16, 2019 •

edited

Loading

BramVanroy commented May 16, 2019

honnibal commented May 16, 2019

BramVanroy commented May 17, 2019 •

edited

Loading

mohamed-ali commented May 17, 2019

honnibal commented Oct 3, 2019

no-response bot commented Oct 17, 2019

lock bot commented Nov 16, 2019

Spacy 2.1.4 fails to initialize or load fasttext model using python2.7.5 #3734

Spacy 2.1.4 fails to initialize or load fasttext model using python2.7.5 #3734

Comments

mohamed-ali commented May 13, 2019 • edited Loading

How to reproduce the behaviour

Info about spaCy & enviroment:

Potential solutions

BramVanroy commented May 13, 2019 • edited Loading

mohamed-ali commented May 16, 2019 • edited Loading

BramVanroy commented May 16, 2019

honnibal commented May 16, 2019

BramVanroy commented May 17, 2019 • edited Loading

mohamed-ali commented May 17, 2019

honnibal commented Oct 3, 2019

no-response bot commented Oct 17, 2019

lock bot commented Nov 16, 2019

mohamed-ali commented May 13, 2019 •

edited

Loading

BramVanroy commented May 13, 2019 •

edited

Loading

mohamed-ali commented May 16, 2019 •

edited

Loading

BramVanroy commented May 17, 2019 •

edited

Loading