In spacy 2.1, linking a model from within a python interpreter fails #3435

Closed
matt-gardner opened this issue Mar 19, 2019 · 7 comments
Labels
feat / cli Feature: Command-line interface models Issues related to the statistical models upgrade Issues related to upgrading spaCy

Comments

@matt-gardner

I'm not certain I named this right, as I don't know enough of spacy's internals to know if this is a linking issue or not, but it sure looks like it to me.

Prior to spacy 2.1, we were able to download a model from within a python interpreter and then, without restarting the interpreter, load the model successfully. It appears that in 2.1 we can no longer do this. Reproducing this was hit-and-miss: my Mac laptop originally reproduced it, then stopped (maybe a symlink was created that didn't get removed when I tried to clean the environment?), but on a fresh docker image the issue reproduces reliably.

How to reproduce the behaviour

In spacy 2.0.x:

>>> import spacy
>>> from spacy.cli.download import download
>>> download('en_core_web_sm')
Collecting en_core_web_sm==2.0.0 ...
...
You can now load the model via spacy.load('en_core_web_sm')
>>> spacy.load('en_core_web_sm')
<spacy.lang.en.English object at 0x11b479fd0>
>>>

In spacy 2.1.0:

>>> import spacy
>>> from spacy.cli.download import download
>>> download('en_core_web_sm')
Collecting en_core_web_sm==2.1.0 ...
...
You can now load the model via spacy.load('en_core_web_sm')
>>> spacy.load('en_core_web_sm')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 22, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 136, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
>>>

If I exit the interpreter and restart it after this, loading the model works without issue.

Context for why we want this functionality

In allennlp, we want to install spacy models when they are requested, inside of a training command: https://github.com/allenai/allennlp/blob/1b07b481007a52c76531fb4295120448493e6a41/allennlp/common/util.py#L289-L294. This dramatically simplifies our setup, so we don't have to worry about having a default model installed when you pip install allennlp.
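A minimal sketch of the install-on-demand pattern described above, not the actual allennlp code. The helper name `load_or_download` and its injectable `load` / `download` parameters are hypothetical, added so the control flow can be exercised without spaCy installed. By default it falls back to `spacy.load` and `spacy.cli.download.download`, the two calls used in the reproduction.

```python
def load_or_download(name, load=None, download=None):
    """Load a spaCy model, downloading it first if it is missing.

    `load` and `download` are hypothetical injection points; when omitted,
    the spaCy functions from the reproduction above are used.
    """
    if load is None or download is None:
        # Imported lazily so the helper can be defined without spaCy installed.
        import spacy
        from spacy.cli.download import download as spacy_download

        load = load or spacy.load
        download = download or spacy_download
    try:
        return load(name)
    except OSError:
        # spaCy reports a missing model as OSError (E050 in the traceback above).
        download(name)
        # This second load is the call that fails in the spacy 2.1.0 session above.
        return load(name)
```

Calling `load_or_download("en_core_web_sm")` in a fresh environment would go through the download branch and then load again in the same interpreter, which is the step this issue is about.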

Your Environment

  • Operating System: Ubuntu (via docker)
  • Python Version Used: 3.6.8
  • spaCy Version Used: 2.1.0
  • Environment Information: fresh install on a docker image (in particular, this one at commit 1b07b481007a52c76531fb4295120448493e6a41)
@ines
Member

ines commented Mar 19, 2019

Ah, I think this is a side-effect of not creating the symlink for models that are also packages and can be loaded via the package name. I'm surprised that it can't find the model from within the same session... I guess if you install a new package, importlib only finds it if you reload?

To explain this in more detail: There are essentially 3 ways of loading that are supported by spacy.load: from a symlink, from an installed package, and from a path. Previously, the download command would always create a symlink for a model, not only for the en shortcuts but also for the full model names like en_core_web_sm. This was kinda redundant and caused various issues: even if you only wanted to load the model from a package, you had to go through the symlink process for pretty much no reason. Creating the symlink can fail if you don't have the right user permissions set, and produce confusing errors. It was also much easier to end up with stale symlinks. Like, in theory, a user could have something else linked as en_core_web_sm.

Does creating the symlink explicitly via link work? Within the download function, we also pass in the package path again – pretty sure this was intended to trick importlib.

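from spacy.cli import link
from spacy.util import get_package_path

model_name = "en_core_web_sm"
# Look up where the freshly installed model package lives on disk.
package_path = get_package_path(model_name)
# Create the symlink explicitly; link takes the explicit path as model_path.
link(model_name, model_name, force=True, model_path=package_path)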

If not, we could re-add an argument to the download function that re-enables the symlinks for all model names. (It shouldn't be the default, because we kinda want to transition away from symlinks.)
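The three loading routes described above can be pictured with a toy resolver. This is an illustrative sketch only, not spaCy's implementation. The function name `resolve_model_source`, the `links` and `packages` sets, and the link, package, path lookup order are all assumptions made for the example.

```python
from pathlib import Path


def resolve_model_source(name, links, packages):
    """Return which route would be used to load `name` (toy model).

    `links` and `packages` are hypothetical sets standing in for the
    shortcut links and installed packages known to the current session.
    """
    if name in links:
        return "link"
    if name in packages:
        return "package"
    if Path(name).exists():
        return "path"
    # Mirrors the shape of the E050 error: none of the three routes matched.
    raise OSError("Can't find model '{}'".format(name))
```

Under this picture, a model that has no symlink and is missing from the session's view of installed packages falls through every branch. That would explain why creating the link explicitly makes the load succeed.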

@ines ines added models Issues related to the statistical models feat / cli Feature: Command-line interface upgrade Issues related to upgrading spaCy labels Mar 19, 2019
@matt-gardner
Author

Yes, putting an explicit link in there works, thanks for the workaround. Personally, I'd say that finding a solution for getting importlib to do the right thing here is the best way to solve this - no need to add a flag to the download function, as this workaround accomplishes the same thing.
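On the importlib side, the standard library provides `importlib.invalidate_caches()` for modules that are created or installed while the program is already running. This thread doesn't establish whether that alone would make spaCy 2.1 find a newly installed model, since its package check may not go through importlib at all. The sketch below only shows the general mechanism, and the helper and module names are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_freshly_created(module_name, source):
    """Write a module into a new directory on sys.path and import it in-process."""
    target_dir = Path(tempfile.mkdtemp())
    sys.path.insert(0, str(target_dir))
    (target_dir / (module_name + ".py")).write_text(source)
    # Import finders cache directory contents; ask them to look again so the
    # file written a moment ago is visible to the import system.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


module = import_freshly_created("demo_late_module", "VALUE = 42\n")
print(module.VALUE)
```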

@ines
Member

ines commented Mar 19, 2019

Yay, glad it worked 👍

Already added the flag, and I guess it's okay to have it, just so there's a clear way to replicate the exact behaviour of 2.0. We'll think about this some more!

@onema

onema commented Mar 21, 2019

I'm getting this same error on a brand new virtual env using spacy 2.1.1 and python 3.6

pip install spacy==2.1.1
... 

And then:

[I] ➜ python3
Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 03:02:14)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>> spacy_model_name = "en_core_web_sm"
>>> nlp = spacy.load(spacy_model_name)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/juant/.virtualenvs/machine_learning/lib/python3.6/site-packages/spacy/__init__.py", line 27, in load
    return util.load_model(name, **overrides)
  File "/Users/juant/.virtualenvs/machine_learning/lib/python3.6/site-packages/spacy/util.py", line 136, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

@matt-gardner
Author

You forgot these two lines:

>>> from spacy.cli.download import download
>>> download('en_core_web_sm')

@jklaise
Contributor

jklaise commented Mar 28, 2019

@ines what was the reason for reverting the always_link flag?

@lock

lock bot commented Apr 27, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Apr 27, 2019