Strange behaviour for german word vectors #2523
An answer would be very nice. I think there is something very wrong with the German model. Maybe a bug?
I read that the OOV issue will be fixed. But what about the standard word vector features? The
What's going on here is that the small models don't have "true" word vectors – only context-sensitive tensors that are shared across the pipeline. See here:
For spaCy (we've been going back and forth on how the
Many thanks. Sounds interesting. When will you release version 2.1.0? Will a better German lemmatizer also be shipped with it? Are there currently any larger models available?
What are those attributes? You also get a norm for the words. What is this when you have no vectors? Which properties are available within the small model?
The nightly version (2.1.0 alpha) will hopefully be available for public testing this week – getting the version ready has taken a lot longer than expected, because we had to take care of a bunch of infrastructure stuff, in order to re-train our full model families for all different languages, plus the additional larger models and word vectors.
The lemmatizer is separate from the models. For English, spaCy implements a rule-based lemmatizer that's usually more accurate. For all other languages that support lemmatization, we currently only have lookup tables. However, we did make various improvements to the way the lookup tables are integrated, so the previous bugs should be fixed. In the future, we'd love to move towards rule-based and/or statistical lemmatization for all languages, but this obviously takes time, because the implementation needs to be language-specific.
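A lookup-table lemmatizer as described above is essentially a dictionary mapping inflected forms to lemmas. A minimal sketch (the entries here are illustrative toy examples, not spaCy's actual German lookup table):

```python
# Minimal sketch of lookup-based lemmatization: a plain dict from
# inflected surface form to lemma. Entries are illustrative only.
LEMMA_LOOKUP = {
    "Häuser": "Haus",
    "ging": "gehen",
    "Autos": "Auto",
}

def lookup_lemma(word):
    # Fall back to the surface form when the word is not in the table,
    # which is why lookup lemmatizers degrade on unseen words.
    return LEMMA_LOOKUP.get(word, word)

print(lookup_lemma("Häuser"))  # Haus
print(lookup_lemma("laufen"))  # laufen (not in table, returned as-is)
```

This also illustrates why rule-based or statistical lemmatization is more work: the table can never cover all inflected forms, so productive morphology needs language-specific rules.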
Right now, the only German model available is
We're calling them "context tensors", but they're essentially the output of the CNN layer. They're rows of the tensor, so also vectors and representations of the words in context. Compared to "real" word vectors, they're trained with a different objective, on different data and with different features. You could also think of them as a byproduct of the neural network models.
Everything, just not the word vectors, because the small models don't ship with word vectors. You can find more details of the individual models and their capabilities in the models overview, for example here.
I would just be interested in the larger models for German. Could you then use the md and lg models with v2.0.9?
No, changes in the parser and the model features require re-training the models, so models are generally not backwards compatible across major or minor versions (as of spaCy 2.x, we'll make sure to always express this via the model versions, though – so models with version v2.1.x are compatible with spaCy v2.1.x and so on). If you just want to add word vectors, you can already do that with the existing German model – for example, you could add the pre-trained fastText vectors. You could also use a library like Gensim to train your very own, super custom word vectors specific to your data (if you have lots of text).
Thanks, I will try that. But I saw that there are larger models available for v1.7. Could you use such a model with v2.0.9, i.e. the other direction?
No, the models for v1.x are linear models, so they're very different from the neural network models in v2.x. (It's also difficult to really compare the sizes, because the linear models were much larger in general, even without word vectors.)
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hey,
I wanted to check whether words are in the German spaCy model and found something strange. For almost every garbage word I get True for has_vector, while at the same time the word is reported as OOV! Look at:
gives:
It gives a norm and says it has a vector, but also that it is OOV!
spacy: 2.0.9
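With a real word-vectors table, has_vector and is_oov line up as expected. A minimal sketch using a blank pipeline and a hand-set vector (the vector values and the 300-dimension width are arbitrary choices for illustration):

```python
import numpy as np
import spacy

nlp = spacy.blank("de")  # blank pipeline, no pretrained model needed
# Hand-set a single 300-dimensional vector; the values are arbitrary.
nlp.vocab.set_vector("Haus", np.ones((300,), dtype="float32"))

known = nlp("Haus")[0]
garbage = nlp("asdfgh")[0]

print(known.has_vector, garbage.has_vector)  # True False
print(known.is_oov, garbage.is_oov)
```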