Transformer-based word vectors #6511
-
I was wondering if spaCy nightly supports the similarity method for a trained BERT transformer model. I have trained the model, but I can't find a way to compute semantic similarity on its output.

Your Environment

============================== Info about spaCy ==============================

spaCy version: 3.0.0rc2
-
I don't think that functionality is readily available, but you should be able to implement something yourself. If you've trained a spaCy pipeline with a transformer component, you can access the data in `doc._.trf_data`. You can also specify a custom annotation setter on the transformer component if you want to store additional annotations on the `Doc`.
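For example, something along these lines (a minimal sketch; `en_core_web_trf` stands in here for whatever transformer pipeline you trained):

```python
import spacy

nlp = spacy.load("en_core_web_trf")  # placeholder: any pipeline with a transformer
doc = nlp("Apple is an IT company.")

trf_data = doc._.trf_data
# Wordpiece hidden states: shape (n_spans, seq_len, hidden_width).
print(trf_data.tensors[0].shape)
# Number of wordpieces aligned to each spaCy token.
print(trf_data.align.lengths)
```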
-
Thanks for the response. Basically, what I need to do is take the output hidden states and calculate the average using the alignment information; then I can set the vector value in the vocabulary for the token and use the similarity method as-is?
-
Yes - that sounds about right :-)
-
Thank you very much :-)
-
@svlandeg sorry for reopening the post, but I have one more question. When setting the vector in the vocabulary, a word with the same string should have a different vector depending on its context. For example, in the two sentences "Apple je računalna tvrtka" ("Apple is an IT company") and "Imao je psa koji se zvao Apple" ("He had a dog named Apple"), the word "Apple" has a different vector based on its context, but the last occurrence of the word "Apple" overrides the initial vector. So every occurrence of a word ends up with the same vector (in my implementation, the last one), as in FastText or Word2Vec models. I suppose there is another way to set contextual vectors, but I could not find anything in the documentation. Here is my implementation based on the previous suggestions:
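A minimal sketch of that approach (not the exact snippet; the pipeline name is a placeholder), which reproduces the overwriting behaviour described above:

```python
import spacy

nlp = spacy.load("hr_example_trf")  # placeholder: any transformer pipeline

doc = nlp("Imao je psa koji se zvao Apple")

trf_data = doc._.trf_data
# Flatten the wordpiece output to (n_wordpieces, hidden_width).
tensor = trf_data.tensors[0]
rows = tensor.reshape(-1, tensor.shape[-1])

for token in doc:
    # The alignment maps each spaCy token to its wordpiece rows.
    idx = trf_data.align[token.i].data.flatten()
    if len(idx) == 0:
        continue
    vec = rows[idx].mean(axis=0)
    # The vocab stores one vector per string, so a later occurrence of the
    # same word silently overwrites any earlier, context-specific vector.
    nlp.vocab.set_vector(token.text, vec)
```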
-
@danielvasic I think each word in the vocab can only have one vector, so in the above code you end up with the last one you assigned. I would assign transformer-based vectors to an extension attribute and override the similarity function (see the sketch below). Or you may be able to use user hooks to define your own vector property for tokens.

@svlandeg and @danielvasic The requested functionality is available in the transformer models for spaCy 2: https://explosion.ai/blog/spacy-transformers. It would be nice if it could be made available in spaCy 3 as well.
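A minimal sketch of the extension-attribute idea (the attribute name `trf_vector` and the cosine-based similarity are assumptions for illustration, not spaCy API):

```python
import numpy as np
from spacy.tokens import Token

# Hypothetical custom attribute to hold a per-token contextual vector.
Token.set_extension("trf_vector", default=None)

def trf_similarity(tok1, tok2):
    """Cosine similarity over the custom per-token vectors."""
    v1, v2 = tok1._.trf_vector, tok2._.trf_vector
    if v1 is None or v2 is None:
        return 0.0
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / norm) if norm else 0.0
```

You would fill `token._.trf_vector` from the aligned transformer rows (as in the snippet above) and call `trf_similarity` instead of the built-in `Token.similarity`.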
-
Dear Peter, yes, I was looking into the word hash; maybe that could be the key. But take the sentences "Računalo je stroj" ("A computer is a machine") and "Računalo se da će doći" ("It was expected that he would come"): in the first sentence "Računalo" is correctly classified as a noun, and in the second sentence "Računalo" is correctly classified as a verb, yet the orth attribute for these tokens is the same, which seems strange given that the tags differ. I should also mention that lemmatization is not available even though spacy-nightly[lookups] is installed, so I'm certainly doing something wrong. Thanks for the suggestion; I was hoping for an out-of-the-box solution, because reimplementing the similarity method also means reimplementing similarity for lists of words (Span, Doc), and just averaging word vectors gives me poor results. I was wondering what kind of method is used for computing similarity over multiple vectors: Word Mover's Distance or something else?
-
The relevant distinction here is between Tokens and Lexemes (I think): Lexemes are entries in the Vocab, and they are unique (i.e. one string corresponds to one Lexeme). Tokens represent tokens in a text; they contain the corresponding Lexeme, but are otherwise distinct and can carry different annotations (otherwise PoS tagging and parsing would be impossible). So the two "Računalo" in your examples are different Tokens, but they contain (or point to) the same Lexeme holding the lexical properties. Lemmas used to be lexical properties; I think this is changing in spaCy 3 to allow for lexical ambiguity, but I'm not sure if it is going to be the same across all languages. The default word vectors are also associated with Lexemes, but you access them through the `Token.vector` attribute.
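To illustrate the distinction (a sketch; the model name is a placeholder):

```python
import spacy

nlp = spacy.load("hr_example_pipeline")  # placeholder model name
doc = nlp("Računalo je stroj. Računalo se da će doći.")

toks = [t for t in doc if t.text == "Računalo"]
tok1, tok2 = toks[0], toks[1]

# Same Lexeme: one Vocab entry per string, so the orth IDs match.
print(tok1.orth == tok2.orth)   # True
# Different Tokens: contextual annotations can differ.
print(tok1.pos_, tok2.pos_)     # e.g. NOUN vs. VERB
```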
-
Have a look at how it was done in the previous version of spacy-transformers: https://github.com/explosion/spacy-transformers/blob/ebef817af5c0077e8f7019b77d6768da1fd482eb/spacy_transformers/pipeline/tok2vec.py#L202-L246 (To be clear, this may not be exactly what you want, in particular the use of `sum`, but it shows how to set up the user hooks.)
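A rough sketch of that user-hooks setup, adapted for spaCy v3 (averaging token vectors for Doc/Span similarity is an assumption here, not the exact logic at that link):

```python
import numpy as np
from spacy.language import Language

def token_vector(token):
    """Average the aligned wordpiece rows for one token."""
    trf_data = token.doc._.trf_data
    tensor = trf_data.tensors[0]
    rows = tensor.reshape(-1, tensor.shape[-1])
    idx = trf_data.align[token.i].data.flatten()
    return rows[idx].mean(axis=0) if len(idx) else np.zeros(rows.shape[-1])

def vector(obj):
    """Vector for a Doc, Span or Token from the transformer output."""
    if hasattr(obj, "__iter__"):  # Doc and Span are iterable, Token is not
        return np.mean([token_vector(t) for t in obj], axis=0)
    return token_vector(obj)

def similarity(obj1, obj2):
    v1, v2 = vector(obj1), vector(obj2)
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / norm) if norm else 0.0

@Language.component("trf_hooks")
def trf_hooks(doc):
    # Route .vector and .similarity through the transformer output.
    doc.user_hooks["vector"] = vector
    doc.user_hooks["similarity"] = similarity
    doc.user_span_hooks["vector"] = vector
    doc.user_span_hooks["similarity"] = similarity
    doc.user_token_hooks["vector"] = token_vector
    doc.user_token_hooks["similarity"] = similarity
    return doc
```

Adding `nlp.add_pipe("trf_hooks", last=True)` after the transformer component should then make `doc.similarity(other)` use these vectors.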
-
Thanks @adrianeboyd, I have reimplemented the similarity hooks as well; they seem to be working well, but I need to test them more thoroughly (I'm not happy with just averaging the word vectors for span similarity, but it seems to work well for simple clauses). Many thanks @adrianeboyd @svlandeg @peter-exos, you guys are awesome. 😊
-
I just wanted to add my little solution to this, which calculates transformer vectors for each token for spaCy versions > 3.0. Here is a gist if anyone wants to check it out: https://gist.github.com/yeus/a4d7cc6c97485597eb1e0d7fd720b4e3