Strange behaviour for German word vectors #2523

Closed
ctrado18 opened this issue Jul 6, 2018 · 9 comments
Labels
feat / vectors (Feature: Word vectors and similarity) · lang / de (German language data and models)


ctrado18 commented Jul 6, 2018

Hey,

I wanted to check whether words are in the German spaCy model and found something strange. For almost every garbage word I get True for has_vector, while at the same time the token is marked as OOV! Look at:

```python
import spacy

nlp = spacy.load('de_core_news_sm')
doc = nlp(u'was kostet stein zzcfkfjduhüüü')

for token in doc:
    print(token.text, token.lemma_, token.vector_norm, token.is_oov)
    print("has_vector:", token.has_vector)
```

gives:

```
was wer 43.980087 True
has_vector: True
kostet kosten 45.899117 True
has_vector: True
stein stein 44.52947 True
has_vector: True
zzcfkfjduhüüü zzcfkfjduhüüü 39.83339 True
has_vector: True
```

It gives a norm and says the token has a vector, but also that it is OOV!

spacy: 2.0.9


ctrado18 commented Jul 9, 2018

An answer would be very nice. I think there is something very wrong with the German model. Maybe a bug?

```python
doc = nlp(u'wie viel entgelte als selbstständige')
for token in doc:
    print(token.text)
    print("has_vector:", token.has_vector)

for token in doc:
    print(token.text in nlp.vocab)
```

```
wie
has_vector: True
viel
has_vector: True
entgelte
has_vector: True
als
has_vector: True
selbstständige
has_vector: True
True
True
False
True
False
```

I read that the OOV issue will be fixed. But what about the standard word vector features? Is has_vector bugged?


ines commented Jul 9, 2018

What's going on here is that the small models don't have "true" word vectors – only context-sensitive tensors that are shared across the pipeline. See here:

> To make them compact and fast, spaCy's small models (all packages that end in sm) don't ship with word vectors, and only include context-sensitive tensors. This means you can still use the similarity() methods to compare documents, spans and tokens – but the result won't be as good, and individual tokens won't have any vectors assigned. So in order to use real word vectors, you need to download a larger model.

For spaCy v2.1.0, we're training new models, including our own word vectors for German, so you'll be able to download pre-trained German md and lg packages. (You can also always add your own vectors to the sm models.) spaCy will also show an explicit warning if you're accessing the similarity methods with a model that doesn't include "real" word vectors (if you have warnings enabled).

(We've been going back and forth on how has_vector should behave in cases like this. There is a vector, so having it return False would be misleading. Similarly, if the model doesn't come with a pre-trained vocab, technically all lexemes are OOV.)
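
To see the distinction directly, here's a minimal sketch (assuming spaCy v2.0.x and the de_core_news_sm package): the table of real word vectors is empty, while every token still gets a row of the shared context tensor.

```python
import spacy

nlp = spacy.load('de_core_news_sm')

# The table of "real" word vectors: expected to be empty for sm models.
print(len(nlp.vocab.vectors))

doc = nlp(u'was kostet stein zzcfkfjduhüüü')
for token in doc:
    # has_vector is True because the token falls back to its row of the
    # shared context tensor; is_oov is True because there's no
    # pre-trained vocab with word vectors behind it.
    print(token.text, token.has_vector, token.is_oov, token.vector.shape)
```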

ines added the labels lang / de (German language data and models) and feat / vectors (Feature: Word vectors and similarity) on Jul 9, 2018

ctrado18 commented Jul 9, 2018

Many thanks, sounds interesting. When will you release version 2.1.0? Will a better German lemmatizer ship with it? Are there any larger models available right now?

> only context-sensitive tensors that are shared across the pipeline

What are those attributes? You also get a norm for the words. What is that when you have no vectors? What properties are available in the small model?


ines commented Jul 9, 2018

> Many thanks, sounds interesting. When will you release version 2.1.0?

The nightly version (2.1.0 alpha) will hopefully be available for public testing this week – getting the version ready has taken a lot longer than expected, because we had to take care of a bunch of infrastructure stuff, in order to re-train our full model families for all different languages, plus the additional larger models and word vectors.

> Will a better German lemmatizer ship with it?

The lemmatizer is separate from the models. For English, spaCy implements a rule-based lemmatizer that's usually more accurate. For all other languages that support lemmatization, we currently only have lookup tables. However, we did make various improvements to the way the lookup tables are integrated, so the previous bugs should be fixed. In the future, we'd love to move towards rule-based and/or statistical lemmatization for all languages, but this obviously takes time, because the implementation needs to be language-specific.
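
As a minimal illustration (assuming spaCy v2.0.x and the de_core_news_sm package): with a lookup-table lemmatizer, words that are in the table get mapped to their lemma, and anything else falls back to its own surface form.

```python
import spacy

nlp = spacy.load('de_core_news_sm')

# German lemmas come from a static lookup table, so a garbage token
# like the last one keeps its own text as the lemma.
for token in nlp(u'was kostet stein zzcfkfjduhüüü'):
    print(token.text, token.lemma_)
```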

> Are there any larger models available right now?

Right now, the only German model available is de_core_news_sm.

> What are those attributes? You also get a norm for the words. What is that when you have no vectors?

We're calling them "context tensors", but they're essentially the output of the CNN layer. They're rows of the tensor, so also vectors and representations of the words in context. Compared to "real" word vectors, they're trained with a different objective, on different data and with different features. You could also think of them as a byproduct of the neural network models.
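
To make that concrete, here's a sketch (assuming spaCy v2.0.x behaviour): the tensor is attached to the Doc, one row per token, and without word vectors, token.vector falls back to the token's row, which is also what vector_norm is computed from.

```python
import numpy
import spacy

nlp = spacy.load('de_core_news_sm')
doc = nlp(u'wie viel entgelte als selbstständige')

# One row per token, produced by the pipeline's shared CNN layer;
# the width depends on the model's settings.
print(doc.tensor.shape)

# With no real word vectors loaded, token.vector is just the token's
# row of the context tensor.
print(numpy.allclose(doc[0].vector, doc.tensor[0]))
```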

> What properties are available in the small model?

Everything, just not the word vectors, because the small models don't ship with word vectors. You can find more details of the individual models and their capabilities in the models overview, for example here.


ctrado18 commented Jul 9, 2018

I would just be interested in the larger models for German. Could you then use the md and lg models with v2.0.9?


ines commented Jul 9, 2018

> I would just be interested in the larger models for German. Could you then use the md and lg models with v2.0.9?

No, changes in the parser and the model features require re-training the models, so models are generally not backwards compatible across major or minor versions (as of spaCy 2.x, we'll make sure to always express this via the model versions, though – so models with version v2.1.x are compatible with spaCy v2.1.x and so on).

If you just want to add word vectors, you can already do that with the existing German model – for example, you could add the pre-trained fastText vectors. You could also use a library like Gensim to train your very own, super custom word vectors specific to your data (if you have lots of text).
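
For example, here's a sketch of adding vectors by hand (the file name cc.de.300.vec and the plain-text .vec format are assumptions about the fastText download, not something spaCy ships):

```python
import numpy
import spacy

nlp = spacy.load('de_core_news_sm')

# Assumed: pre-trained German fastText vectors in the plain-text .vec
# format, i.e. a "<n_words> <n_dims>" header followed by lines of
# "word float float ...".
with open('cc.de.300.vec', encoding='utf8') as file_:
    file_.readline()  # skip the header line
    for line in file_:
        pieces = line.rstrip().split(' ')
        word = pieces[0]
        vector = numpy.asarray(pieces[1:], dtype='float32')
        nlp.vocab.set_vector(word, vector)

# Tokens covered by the new table now have real word vectors.
print(nlp(u'stein')[0].has_vector)
```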

ctrado18 commented

Thanks, I will try that. But I saw that there are larger models available for v1.7. Could you use such a model with v2.0.9, i.e. the other direction?


ines commented Jul 10, 2018

No, the models for v1.x are linear models, so they're very different from the neural network models in v2.x. (It's also difficult to really compare the sizes, because the linear models were much larger in general, even without word vectors.)

ines closed this as completed Jul 10, 2018

lock bot commented Aug 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Aug 9, 2018