💫 Finalise vector support and add vector specs to model meta #1457

Closed
ines opened this issue Oct 24, 2017 · 3 comments
Labels
docs Documentation and website enhancement Feature requests and improvements models Issues related to the statistical models 🌙 nightly Discussion and contributions related to nightly builds

Comments

@ines
Member

ines commented Oct 24, 2017

Related issues: #1092, #1341, #1204

Finalise vector support

The vector support of the en_core_web_sm model in v2.0 is still being finalised. However, the stable version will definitely include some vectors, and will let you get context-sensitive token vectors from the tensorizer. This needs to be wired up properly again.

Documentation of model vector specs

The way the included word vectors are documented in the current models documentation and new v2.0 model directory still isn't ideal. Vector details are only present in the "description". Instead, they should get their own "vectors" key in the meta.json. The details could be read off the model automatically after training, e.g. by spacy train, so users training their own models would also have this information added automatically.

Example

{
    "lang": "en",
    "name": "core_web_sm",
    "version": "2.0.0",
    "pipeline": ["tagger", "parser", "ner"],
    "vectors": {
        "width": 300,
        "entries": 5000
    }
}
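
As a rough illustration of how the "vectors" entry could be filled in automatically after training, here is a minimal sketch. It assumes the vectors table can be treated as a 2D structure with one row per entry. The helper names `vectors_meta` and `add_vectors_meta` and the toy table are hypothetical and not part of spaCy's API.

```python
import json


def vectors_meta(table):
    """Build the proposed "vectors" entry from a 2D table (one row per entry)."""
    entries = len(table)
    # Width is the length of a row; an empty table has width 0.
    width = len(table[0]) if entries else 0
    return {"width": width, "entries": entries}


def add_vectors_meta(meta, table):
    """Return a copy of the meta dict with a "vectors" key describing the table."""
    updated = dict(meta)
    updated["vectors"] = vectors_meta(table)
    return updated


# Hypothetical meta and a toy 3 x 4 table standing in for real vectors.
meta = {
    "lang": "en",
    "name": "core_web_sm",
    "version": "2.0.0",
    "pipeline": ["tagger", "parser", "ner"],
}
toy_table = [[0.0] * 4 for _ in range(3)]

print(json.dumps(add_vectors_meta(meta, toy_table), indent=4))
```

Returning a copy rather than mutating the input keeps the original meta untouched if writing the file fails later in the build.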

The v2.0 model directory requests each model's meta.json and uses this info to populate the model details. This ensures that the website is always up to date with the latest release. On the front-end, all that has to be done is add a row for the vectors info, and populate it via the ModelLoader script if a "vectors" object is present in the meta. We'll also need to update our internal model build process to make sure the vectors info is added to each individual model release.
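
The "only if a vectors object is present" logic can be sketched as follows. The real ModelLoader script runs in the browser, so this Python version only illustrates the conditional. The function name `vectors_row` and the row wording are assumptions made for the example.

```python
def vectors_row(meta):
    """Return display text for the vectors row, or None if the meta has no vectors."""
    vectors = meta.get("vectors")
    if not isinstance(vectors, dict):
        # No "vectors" object in the meta: the row is simply not rendered.
        return None
    entries = vectors.get("entries", 0)
    width = vectors.get("width", 0)
    return "{:,} entries, {} dimensions".format(entries, width)


# Meta with the proposed "vectors" key, using the values from the example above.
with_vectors = {"name": "core_web_sm", "vectors": {"width": 300, "entries": 5000}}
# Meta from an older release that lacks the key.
without_vectors = {"name": "core_web_sm"}

print(vectors_row(with_vectors))
print(vectors_row(without_vectors))
```

Treating a missing key as "no row" means older model releases without the new key would still render without errors.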

Other documentation

While fixing this, we also need to revisit the word vectors & similarity guide to make sure it doesn't contain any misleading information about the vectors included in the models.

@ines
Member Author

ines commented Nov 1, 2017

Fixed and documented on develop, and will be included in the next version.

@beowulfenator

beowulfenator commented Dec 27, 2017

Hi! All tokens still seem to return True for is_oov. Checked with spaCy 2.0.5 and en_core_web_sm 2.0.0.

>>> import spacy
>>> en = spacy.load('en_core_web_sm')
>>> x = en('This is a tessssst')
>>> [w.is_oov for w in x]
[True, True, True, True]

However, en_core_web_md 2.0.0 works just fine!

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018