Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spacy 2.1.4 fails to initialize or load fasttext model using python2.7.5 #3734

Closed
mohamed-ali opened this issue May 13, 2019 · 9 comments
Closed
Labels
compat Cross-platform and cross-Python compatibility feat / tokenizer Feature: Tokenizer lang / all Global language data more-info-needed This issue needs more information

Comments

@mohamed-ali
Copy link

mohamed-ali commented May 13, 2019

How to reproduce the behaviour

The following code chunk from https://github.com/explosion/spaCy/blob/master/spacy/lang/tokenizer_exceptions.py fails in python 2.7.5 which makes init-model and spacy.load fail:

from __future__ import unicode_literals
import re

# URL validation regex courtesy of: https://mathiasbynens.be/demo/url-regex
# A few minor mods to this regex to account for use cases represented in test_urls
URL_PATTERN = (
    r"^"
    # in order to support the prefix tokenization (see prefix test cases in test_urls).
    r"(?=[\w])"
    # protocol identifier
    r"(?:(?:https?|ftp|mailto)://)?"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    # IP address exclusion
    # private & local networks
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    # MH: Do we really need this? Seems excessive, and seems to have caused
    # Issue #957
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    # host name
    r"(?:(?:[a-z0-9\-]*)?[a-z0-9]+)"
    # domain name
    r"(?:\.(?:[a-z0-9])(?:[a-z0-9\-])*[a-z0-9])?"
    # TLD identifier
    r"(?:\.(?:[a-z]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    # query parameters
    r"\??(:?\S*)?"
    # in order to support the suffix tokenization (see suffix test cases in test_urls),
    r"(?<=[\w/])"
    r"$"
).strip()

TOKEN_MATCH = re.compile(URL_PATTERN, re.UNICODE).match

The error message:

 Traceback (most recent call last):
  File "test.py", line 50, in <module>
    TOKEN_MATCH = re.compile(URL_PATTERN, re.UNICODE).match
  File "/usr/lib64/python2.7/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib64/python2.7/re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: nothing to repeat

The issues is known: (SO discussion):

Was fixed:

Info about spaCy & enviroment:

  • Python version: 2.7.5
  • Platform: Linux-3.10.0-327.4.5.el7.x86_64-x86_64-with-centos-7.2.1511-Core
  • spaCy version: 2.1.4

Potential solutions

This issue doesn't occur when working with spacy 2.0.18.

Based on the blame history here the issue started to occur since the switch from regex to re three months ago.

Some quick fixes would be:

  • Drop support for python 2.7.5
  • Default to using regex if python version is 2.7.5
@mohamed-ali mohamed-ali changed the title Spacy 2.1.4 fails to load fasttext French model Spacy 2.1.4 fails to initialize or load fasttext model using python2.7.5 May 13, 2019
@BramVanroy
Copy link
Contributor

BramVanroy commented May 13, 2019

Considering 2.7.5 dates from May, 2013 I am in favour of dropping support. On a related note, NLTK seems to push to drop 2.7 support as a whole (nltk/nltk#2296). If this would help development, this may be a good idea for spaCy as well.

@mohamed-ali
Copy link
Author

mohamed-ali commented May 16, 2019

@BramVanroy Actually python2.7 is going to retire next year (you can see how much time is left for it here: https://pythonclock.org/ ). Also many DS/machine learning libraries will drop it including tensorflow, scikit-learn, numpy, pandas and many others. (check this link https://python3statement.org/)

However, it will take more time for production servers to actually drop python2.7 especially that some of them, like centos use it within yum package manager.

@BramVanroy
Copy link
Contributor

Oh wow, I didn't know the deprecation was that wide-spread. Thanks for the link, nice to see that many packages join in.

@honnibal
Copy link
Member

Does it only affect Python 2.7.5? We test a more recent Python2.7 in our CI, and it works.

We'll drop Python2.7 eventually, sure. We've already dropped Python2.7 on Windows. Tentatively I would suggest December 31 2020 as a date to start developing versions which only work on Python 3.

@honnibal honnibal added the compat Cross-platform and cross-Python compatibility label May 16, 2019
@ines ines added feat / tokenizer Feature: Tokenizer lang / all Global language data labels May 16, 2019
@BramVanroy
Copy link
Contributor

BramVanroy commented May 17, 2019

Does it only affect Python 2.7.5? We test a more recent Python2.7 in our CI, and it works.

We'll drop Python2.7 eventually, sure. We've already dropped Python2.7 on Windows. Tentatively I would suggest December 31 2020 as a date to start developing versions which only work on Python 3.

Yes. From the discussion on the Python boards it seems that the fix was pushed between 2.7.5 and 2.7.6. 2.7.6 should be safe, and as such it doesn't seem bad to just drop support for <= 2.7.5 specifically. In the bug report there is also a mention of Python 3.3 and 3.4, but I can't seem to find out which subversion they are talking about.

@mohamed-ali
Copy link
Author

@BramVanroy @honnibal I did more debugging, and the part of regex that causes the problem is

# host name
r"(?:(?:[a-z0-9\-]*)?[a-z0-9]+)"

Despite that it's considered a bug in python2.7.5 . I think there is a small error in the code as well. Because the regex syntax (seomthing*)? means that optionally we can much a null, which doesn't make sense. I think the regex can be tweaked a bit, for example like this:

(?:(?:[a-z0-9]*)\-?[a-z0-9]+)

to match a string like server-local-host.

PS: the above tweak might not be the optimal alternative but just to make a statement that the regex can be fixed without dropping the support for python2.7.5.

@honnibal
Copy link
Member

honnibal commented Oct 3, 2019

@mohamed-ali I think this should be fixed in v2.2. Could you try and see whether it works for you?

@honnibal honnibal added the more-info-needed This issue needs more information label Oct 3, 2019
@no-response
Copy link

no-response bot commented Oct 17, 2019

This issue has been automatically closed because there has been no response to a request for more information from the original author. With only the information that is currently in the issue, there's not enough information to take action. If you're the original author, feel free to reopen the issue if you have or find the answers needed to investigate further.

@no-response no-response bot closed this as completed Oct 17, 2019
@lock
Copy link

lock bot commented Nov 16, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Nov 16, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compat Cross-platform and cross-Python compatibility feat / tokenizer Feature: Tokenizer lang / all Global language data more-info-needed This issue needs more information
Projects
None yet
Development

No branches or pull requests

4 participants