Is it possible to use a RoBERTa-like model (Microsoft/codebert-base) for NER sequence-tagging #437
Comments
We expect to support non-BERT models for NER embeddings in an upcoming release. Right now, the embeddings are constructed in a way that is specific to BERT/DistilBERT, etc. I have marked this issue as an enhancement and will update it as soon as the feature is released.
Alright! That would be of great use for my research. What is the ETA on this enhancement, as I currently only have a few more weeks for my experiments? Perhaps I could fork it and add the embeddings myself?
@amaiya I've tried to implement a RoBERTa preprocessor, so far without success. Could you point me in the right direction as to where I should add the embedding process for RoBERTa?
Hello @Niekvdplas: Looks like I missed your earlier post.
Hi @amaiya, I've got some time to try to make this change in the code. After some research I came up with an implementation that should be correct: matching the shapes and outputs of the original implementation, it checks out. However, when running lr_find() as a test I receive a graph execution error: "ConcatOp : Dimension 1 in both shapes must be equal: shape[0] = [32,85,100] vs. shape[1] = [32,97,768]" (batch size, length, layers). Here is my implementation:
and of course add 'TFRobertaModel' to the accepted models.
I am not quite sure that model_output[0] is directly in the correct form; perhaps I've made a mistake there. Maybe you could help out.
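For what it's worth, here is a minimal sketch of how to check what model_output[0] actually contains for a TFRobertaModel (the checkpoint name and example sentence are assumptions for illustration):

```python
# Sketch: inspect the first element of a TFRobertaModel output. It is the
# last hidden state, with one vector per subword token, shaped
# (batch, subword_sequence_length, hidden_size). Model/text are illustrative.
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = TFAutoModel.from_pretrained("roberta-base")

enc = tokenizer(["I use ktrain 99.9% of the time."], return_tensors="tf")
out = model(enc)

print(out[0].shape)  # e.g. (1, 14, 768); also available as out.last_hidden_state
```

In the quoted error, dimension 1 disagrees (85 vs. 97), which is consistent with a word-level padded length being concatenated with a subword-level sequence length.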
Hi @Niekvdplas: Consider an input like this: 'I use ktrain 99.9% of the time.' We would like to assign an embedding vector to each word in the input that we'd like a prediction for. However, the input gets tokenized like this by RoBERTa: ['I', 'Ġuse', 'Ġk', 'train', 'Ġ99', '.', '9', '%', 'Ġof', 'Ġthe', 'Ġtime', '.'] So we have 12 embedding vectors but only 8 input tokens, because words like "ktrain" are split up into multiple subwords. One way to resolve this is to take the embedding of the first subword and use it as the representation for the entire word. Right now, the way this is done is suboptimal and specific to BERT.
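A quick way to see that splitting in action (a sketch; roberta-base stands in here for any RoBERTa-style checkpoint such as microsoft/codebert-base):

```python
# Sketch: show how a RoBERTa-style (byte-level BPE) tokenizer splits words into
# subword pieces; 'Ġ' marks pieces that start a new whitespace-separated word.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
ids = tokenizer("I use ktrain 99.9% of the time.")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# Roughly: ['<s>', 'I', 'Ġuse', 'Ġk', 'train', 'Ġ99', '.', '9', '%', 'Ġof', 'Ġthe', 'Ġtime', '.', '</s>']
```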
Hi @amaiya, I have figured out how to get the mapping of words to subwords, but I found some contradictory results with regard to your example. The way to find the correct mapping of subwords to actual words is through word_ids(), which is only available on 'fast' tokenizers. (I've also gone down the rabbit hole with 'return_offsets_mapping', which works if you treat subwords as adjacent pieces with no spaces in between; I'll share my thoughts on that as well.) I'll share the values I got back from those functions for your example here:
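For reference, both mappings can be pulled from a fast tokenizer like this (a sketch; the printed values depend on the tokenizer and version):

```python
# Sketch: obtain word_ids() and the character offset mapping for the example
# sentence. Both come from the 'fast' (Rust-backed) tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
enc = tokenizer("I use ktrain 99.9% of the time.", return_offsets_mapping=True)

print(enc.word_ids())         # e.g. [None, 0, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, None]
print(enc["offset_mapping"])  # (start, end) character spans, one per subword token
```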
According to the word_ids() mapping, only 'train' is a subword belonging to 'k', while according to the offset mapping 'train' is a subword of 'k' (its start index equals the previous token's end index), '99.9%' is a single word (there are no spaces between its pieces), and 'time.' is a single word as well. Two different ways of determining the input tokens give two different results, both of which also differ from your count of 8 input tokens. Based on the offset mapping (as it is closer to what you were expecting) I've implemented the following code:
This implementation got rid of my former bug and lr_find() actually started running, though I did get a similar error a few steps in.
Halfway through the steps I got a similar error again: "Dimension 1 in both shapes must be equal: shape[0] = [32,509,100] vs. shape[1] = [32,449,768]". This implementation is a generalized solution for all transformers. Here is the same code, but filtering by word_ids() instead (which is the correct way according to the Hugging Face documentation):
Now I get the same error as the other implementation, but right at the start: ConcatOp : Dimension 1 in both shapes must be equal: shape[0] = [32,509,1] vs. shape[1] = [32,450,768]
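For reference, here is a hedged sketch of the general idea both variants are after: keep one vector per word by selecting the first subword of each word id. Everything below (model choice, NumPy post-processing) is an illustration, not the ktrain internals:

```python
# Sketch: reduce per-subword transformer embeddings to one vector per word by
# keeping only the first subword of every word id. Illustrative only.
import numpy as np
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = TFAutoModel.from_pretrained("roberta-base")

enc = tokenizer("I use ktrain 99.9% of the time.", return_tensors="tf")
hidden = model(enc)[0].numpy()[0]        # (num_subwords, hidden_size)

keep, seen = [], set()
for pos, wid in enumerate(enc.word_ids()):
    if wid is None or wid in seen:       # skip special tokens and later subwords
        continue
    seen.add(wid)
    keep.append(hidden[pos])

word_embeddings = np.stack(keep)         # (num_words, hidden_size), e.g. (11, 768)
print(word_embeddings.shape)
```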
@Niekvdplas This is fantastic - thanks. I will take a closer look at this early next week. We should hopefully have a working generalized solution soon.
Yes, I hope so too! :-) I created a fork for this feature. Please let me know what I am missing so I can implement those facets as well |
In my earlier post, I may have stated something inaccurately. In the sentence "I use ktrain 99.9% of the time.", "ktrain" is assigned a single embedding (e.g., the average over its subwords, or the first subword vector), but "99.9%" is assigned 4 embeddings (one for each token). This is the way things are done now and is consistent with the word IDs returned by word_ids(): there are 11 word embeddings for this sentence (not 12), since "ktrain" is assigned a single word embedding:
[None, 0, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, None]
The issue is that the input text may or may not have "99.9%" tokenized as separate tokens (e.g., separate rows in a CoNLL-formatted training set), which causes problems like the shape mismatch you're seeing. The current solution is to transform the input to be consistent with the tokenization scheme of the transformers model (minus the subword tokenizations for words like "ktrain"). That is, if "99.9%" appears as a single token in the input training set, it will be transformed into separate tokens if that's what the transformers model does:
This is done in the preprocessing code. If you generalize that part as well, it should cover this case.
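A hedged sketch of that kind of transformation: re-split each training token the way the tokenizer's own word segmentation would, and repeat the tag on every resulting piece (the helper name and the tag-repetition policy are assumptions for illustration, not necessarily what ktrain does):

```python
# Sketch: split a CoNLL-style training token (e.g. "99.9%") into the pieces the
# fast tokenizer treats as separate words, repeating the original tag on each.
# Hypothetical helper; not taken from the ktrain source.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def resplit(token, tag):
    enc = tokenizer(token, add_special_tokens=False, return_offsets_mapping=True)
    spans, current = [], None
    for wid, (start, end) in zip(enc.word_ids(), enc["offset_mapping"]):
        if wid != current:        # first subword of a new "word" inside this token
            spans.append([start, end])
            current = wid
        else:                     # later subword of the same word: extend its span
            spans[-1][1] = end
    return [(token[s:e], tag) for s, e in spans]

print(resplit("99.9%", "O"))   # e.g. [('99', 'O'), ('.', 'O'), ('9', 'O'), ('%', 'O')]
print(resplit("ktrain", "O"))  # e.g. [('ktrain', 'O')]; subwords of one word stay merged
```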
Alright, I will start on that tomorrow morning. Hopefully by midday I will come back with results. If everything tests out, I will create a PR.
Hi @amaiya, so I've given it quite some time now and here is what I've come up with:
As you can see, I do some string manipulation to convert special characters to plain ones (é to e, etc.), as they break the tokenization for some reason; I did the same with accents, since they caused the same problem. The rest is relatively in line with the existing implementation. I now get "logits and labels must be broadcastable: logits_size=[3808,9] labels_size=[3872,9]" (always off by a multiple of 32, the batch size). I can't find where the problem is; perhaps you've encountered a similar one. Let me know if you can think of any better fixes than the string manipulation I've done above.
Thanks, @Niekvdplas. I've never seen this error before. My guess is it's something in the preprocessing. Are you using BERT or CodeBERT/RoBERTa to test? If the latter, I would recommend testing with the example notebook first. Also, the string manipulation may be problematic.
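For context, this is roughly the kind of normalization being discussed, using only the standard library (a sketch, not the exact code from the comments above). The concern is that it rewrites the text itself, so anything that later matches tokens or offsets against the original string can drift:

```python
# Sketch: strip accents/diacritics with the standard library. The output no
# longer matches the original text, which is how this kind of normalization
# can silently break token/offset alignment downstream.
import unicodedata

def strip_accents(text):
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("café résumé"))  # 'cafe resume'
```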
@amaiya I ran it with the notebook and
Okay, so I made some alterations in the code. Let me know if you find a solution, @amaiya.
@Niekvdplas Thanks a lot. It turned out that only a minor extra fix was needed to get things working. The shape mismatches with RoBERTa were caused by an issue in the preprocessing. I tested this with both BERT and RoBERTa models. You did the heavy lifting on this, so if you want to copy the changes into a PR, please go ahead.
Alright, I did that @amaiya. Thanks for the help!
Right now I am getting a tensor shape error, and I feel like this has to be because the model expects a BERT-like input and something goes wrong there. Could this be the case?
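For anyone hitting the same wall: the two model families tokenize and mark their inputs differently, which is usually where BERT-specific code breaks. A small sketch (public checkpoint names; the sentence is arbitrary):

```python
# Sketch: BERT-style WordPiece vs. RoBERTa/CodeBERT-style byte-level BPE use
# different special tokens and subword markers, so embedding code written
# around the BERT conventions will not line up for a RoBERTa-based checkpoint.
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-cased")
codebert = AutoTokenizer.from_pretrained("microsoft/codebert-base")

text = "I use ktrain 99.9% of the time."
print(bert.convert_ids_to_tokens(bert(text)["input_ids"]))
# e.g. ['[CLS]', 'I', 'use', 'k', '##train', ..., '[SEP]']  ('##' marks continuation pieces)
print(codebert.convert_ids_to_tokens(codebert(text)["input_ids"]))
# e.g. ['<s>', 'I', 'Ġuse', 'Ġk', 'train', ..., '</s>']     ('Ġ' marks word-initial pieces)
```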