Transformer-based word vectors #6511
-
I was wondering if spaCy nightly supports the similarity method for a trained BERT transformer model. I have trained the model, but I can't find a way to compute semantic similarity on its output.

Your Environment

============================== Info about spaCy ==============================

spaCy version: 3.0.0rc2
-
I don't think that functionality is readily available, but you should be able to implement something yourself. If you've trained a spaCy pipeline with a transformer component, you can access the data in `doc._.trf_data`. You can also specify a custom annotation setter on the transformer component if you want to store additional annotations on the `Doc`.
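For example, something along these lines (a minimal sketch; `en_core_web_trf` stands in here for whatever transformer pipeline you trained):

```python
import spacy

nlp = spacy.load("en_core_web_trf")  # placeholder: any pipeline with a transformer
doc = nlp("Apple is an IT company.")

trf_data = doc._.trf_data
# Wordpiece hidden states: shape (n_spans, seq_len, hidden_width).
print(trf_data.tensors[0].shape)
# Number of wordpieces aligned to each spaCy token.
print(trf_data.align.lengths)
```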
-
Thanks for the response. Basically, what I need to do is take the output hidden states and calculate the average using the alignment information; then I can set the vector value in the vocabulary for the token and use the similarity method as-is?
-
Yes - that sounds about right :-)
-
Thank you very much :-)
-
@svlandeg sorry for reopening the post, but I have one more question. When setting the vector in the vocabulary, a word with the same string should have a different vector depending on its context. For example, in the two sentences "Apple je računalna tvrtka" ("Apple is an IT company") and "Imao je psa koji se zvao Apple" ("He had a dog named Apple"), the word "Apple" has a different vector based on its context, but the last occurrence of the word "Apple" overrides the initial vector. So every occurrence of a word ends up with the same vector (in my implementation, the last one), as in FastText or Word2Vec models. I suppose there is another way to set contextual vectors, but I could not find anything in the documentation. Here is my implementation based on the previous suggestions:
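A minimal sketch of that approach (not the exact snippet; the pipeline name is a placeholder), which reproduces the overwriting behaviour described above:

```python
import spacy

nlp = spacy.load("hr_example_trf")  # placeholder: any transformer pipeline

doc = nlp("Imao je psa koji se zvao Apple")

trf_data = doc._.trf_data
# Flatten the wordpiece output to (n_wordpieces, hidden_width).
tensor = trf_data.tensors[0]
rows = tensor.reshape(-1, tensor.shape[-1])

for token in doc:
    # The alignment maps each spaCy token to its wordpiece rows.
    idx = trf_data.align[token.i].data.flatten()
    if len(idx) == 0:
        continue
    vec = rows[idx].mean(axis=0)
    # The vocab stores one vector per string, so a later occurrence of the
    # same word silently overwrites any earlier, context-specific vector.
    nlp.vocab.set_vector(token.text, vec)
```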
-
@danielvasic I think each word in the vocab can only have one vector, so in the above code you end up with the last one you assigned. I would assign transformer-based vectors to an extension attribute and override the similarity function (see the sketch below). Or you may be able to use user hooks to define your own vector property for tokens.

@svlandeg and @danielvasic The requested functionality is available in the transformer models for spaCy 2: https://explosion.ai/blog/spacy-transformers. It would be nice if it could be made available in spaCy 3 as well.
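A minimal sketch of the extension-attribute idea (the attribute name `trf_vector` and the cosine-based similarity are assumptions for illustration, not spaCy API):

```python
import numpy as np
from spacy.tokens import Token

# Hypothetical custom attribute to hold a per-token contextual vector.
Token.set_extension("trf_vector", default=None)

def trf_similarity(tok1, tok2):
    """Cosine similarity over the custom per-token vectors."""
    v1, v2 = tok1._.trf_vector, tok2._.trf_vector
    if v1 is None or v2 is None:
        return 0.0
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / norm) if norm else 0.0
```

You would fill `token._.trf_vector` from the aligned transformer rows (as in the snippet above) and call `trf_similarity` instead of the built-in `Token.similarity`.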
-
Dear Peter, yes, I was looking into the word hash; maybe that could be the key. But take the sentences "Računalo je stroj" ("A computer is a machine") and "Računalo se da će doći" ("It was expected that he would come"): in the first sentence "Računalo" is correctly classified as a noun, and in the second sentence "Računalo" is correctly classified as a verb, yet the orth attribute for these tokens is the same, which seems strange given that the tags differ. I should also mention that lemmatization is not available even though spacy-nightly[lookups] is installed, so I'm certainly doing something wrong. Thanks for the suggestion; I was hoping for an out-of-the-box solution, because reimplementing the similarity method also means reimplementing similarity for lists of words (Span, Doc), and just averaging word vectors gives me poor results. I was wondering what kind of method is used for computing similarity over multiple vectors: Word Mover's Distance or something else?
-
The relevant distinction here is between Tokens and Lexemes (I think): Lexemes are entries in the Vocab, and they are unique (i.e. one string corresponds to one Lexeme). Tokens represent tokens in a text; they contain the corresponding Lexeme, but are otherwise distinct and can carry different annotations (otherwise PoS tagging and parsing would be impossible). So the two "Računalo" in your examples are different Tokens, but they contain (or point to) the same Lexeme holding the lexical properties. Lemmas used to be lexical properties; I think this is changing in spaCy 3 to allow for lexical ambiguity, but I'm not sure if it is going to be the same across all languages. The default word vectors are also associated with Lexemes, but you access them through the `Token.vector` attribute.
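To illustrate the distinction (a sketch; the model name is a placeholder):

```python
import spacy

nlp = spacy.load("hr_example_pipeline")  # placeholder model name
doc = nlp("Računalo je stroj. Računalo se da će doći.")

toks = [t for t in doc if t.text == "Računalo"]
tok1, tok2 = toks[0], toks[1]

# Same Lexeme: one Vocab entry per string, so the orth IDs match.
print(tok1.orth == tok2.orth)   # True
# Different Tokens: contextual annotations can differ.
print(tok1.pos_, tok2.pos_)     # e.g. NOUN vs. VERB
```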
-
Have a look at how it was done in the previous version of spacy-transformers: https://github.com/explosion/spacy-transformers/blob/ebef817af5c0077e8f7019b77d6768da1fd482eb/spacy_transformers/pipeline/tok2vec.py#L202-L246 (To be clear, this may not be exactly what you want, in particular the use of `sum`, but it shows how to set up the user hooks.)
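A rough sketch of that user-hooks setup, adapted for spaCy v3 (averaging token vectors for Doc/Span similarity is an assumption here, not the exact logic at that link):

```python
import numpy as np
from spacy.language import Language

def token_vector(token):
    """Average the aligned wordpiece rows for one token."""
    trf_data = token.doc._.trf_data
    tensor = trf_data.tensors[0]
    rows = tensor.reshape(-1, tensor.shape[-1])
    idx = trf_data.align[token.i].data.flatten()
    return rows[idx].mean(axis=0) if len(idx) else np.zeros(rows.shape[-1])

def vector(obj):
    """Vector for a Doc, Span or Token from the transformer output."""
    if hasattr(obj, "__iter__"):  # Doc and Span are iterable, Token is not
        return np.mean([token_vector(t) for t in obj], axis=0)
    return token_vector(obj)

def similarity(obj1, obj2):
    v1, v2 = vector(obj1), vector(obj2)
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / norm) if norm else 0.0

@Language.component("trf_hooks")
def trf_hooks(doc):
    # Route .vector and .similarity through the transformer output.
    doc.user_hooks["vector"] = vector
    doc.user_hooks["similarity"] = similarity
    doc.user_span_hooks["vector"] = vector
    doc.user_span_hooks["similarity"] = similarity
    doc.user_token_hooks["vector"] = token_vector
    doc.user_token_hooks["similarity"] = similarity
    return doc
```

Adding `nlp.add_pipe("trf_hooks", last=True)` after the transformer component should then make `doc.similarity(other)` use these vectors.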
-
Thanks @adrianeboyd, I have reimplemented the similarity hooks as well; they seem to be working well, but I need to test them more thoroughly (I'm not happy with just averaging the word vectors for span similarity, but it seems to work well for simple clauses). Many thanks @adrianeboyd @svlandeg @peter-exos, you guys are awesome. 😊
-
I just wanted to add my little solution to this, which calculates transformer vectors for each token for spaCy versions > 3.0. Here is a gist if anyone wants to check it out: https://gist.github.com/yeus/a4d7cc6c97485597eb1e0d7fd720b4e3