
Is it possible to use a RoBERTa-like model (microsoft/codebert-base) for NER sequence tagging? #437

Closed
Niekvdplas opened this issue Apr 13, 2022 · 17 comments
Labels
enhancement New feature or request

Comments

@Niekvdplas
Contributor

Right now I am getting a tensor-shape error, and I suspect it is because the model expects BERT-style inputs. Could this be the case?

@amaiya amaiya added the enhancement New feature or request label Apr 14, 2022
@amaiya
Owner

amaiya commented Apr 14, 2022

We expect to support non-BERT models for NER embeddings in an upcoming release. Right now, the embeddings are constructed in a way that is specific to BERT/DistilBert, etc.

I have marked this issue as enhancement and will update it as soon as this is released.

@Niekvdplas
Contributor Author

Niekvdplas commented Apr 19, 2022

Alright! That would be of great use for my research. What is the ETA on this enhancement, as I currently only have a few more weeks for my experiments? Perhaps I could fork it and add the embeddings myself?

@Niekvdplas
Contributor Author

@amaiya I've tried to implement a RoBERTa preprocessor, so far without success. Could you point me in the right direction as to where I should add the embedding process for RoBERTa?

@amaiya
Owner

amaiya commented Apr 25, 2022

Hello @Niekvdplas : Looks like I missed your earlier post. The embed method in the text/preprocessor module needs to be generalized, which probably involves using tokenizer.encode (in this case it will be the tokenizer for CodeBert) instead of manually constructing the input IDs for BERT models. If you make the change and submit a PR, I can merge it.

@Niekvdplas
Contributor Author

Niekvdplas commented Apr 28, 2022

Hi @amaiya ,

I've had some time to try to make this change in the code. After some research, I came up with the implementation below, which should be correct: its shapes and outputs match those of the original implementation. However, when running lr_find() as a test, I am receiving a Graph execution error: "ConcatOp : Dimension 1 in both shapes must be equal: shape[0] = [32,85,100] vs. shape[1] = [32,97,768]" (batch size, length, layers).

Here is my implementation:

def embed(self, texts, word_level=True, max_length=512):
    if isinstance(texts, str): texts = [texts]
    if not isinstance(texts[0], str): texts = [" ".join(text) for text in texts]
    sentences = []

    for text in texts:
        sentences.append(self.tokenizer.tokenize(text))
    maxlen = len(max([tokens for tokens in sentences], key=len,)) + 2
    if max_length is not None and maxlen > max_length: maxlen = max_length # added due to issue #270

    encoded_input = self.tokenizer.batch_encode_plus(texts, max_length=maxlen, padding='max_length', truncation=True, return_tensors='tf')
    model_output = self.model(**encoded_input)  # source: https://discuss.huggingface.co/t/get-word-embeddings-from-transformer-model/6929/2
    raw_embeddings = model_output[0] 

    if not word_level:
        k = np.mean(raw_embeddings, axis=1)
        return k

    embeddings = []
    for e in raw_embeddings:
        embeddings.append(np.array(e))
    return np.array(embeddings)

and of course add 'TFRobertaModel' to the accepted models:

if type(self.model).__name__ not in ['TFBertModel', 'TFDistilBertModel', 'TFAlbertModel', 'TFRobertaModel']:

I am not quite sure that model_output[0] is directly in the correct form; perhaps I've made a mistake there. Maybe you could help out.

@amaiya
Owner

amaiya commented Apr 28, 2022

Hi @Niekvdplas: Consider an input like this: 'I use ktrain 99.9% of the time.' We would like to assign an embedding vector to each word in the input that we'd like a prediction for.

However, the input gets tokenized like this by Roberta:

['I', 'Ġuse', 'Ġk', 'train', 'Ġ99', '.', '9', '%', 'Ġof', 'Ġthe', 'Ġtime', '.']

So, we have 12 embedding vectors but only 8 input tokens, because words like "ktrain" are split up into multiple subwords. One way to resolve this is to take the embedding of the first subword and use that as the representation for the entire word. Right now, the way this is done is suboptimal and specific to BERT (see the rest of the embed method). This code basically needs to be removed and replaced with something cleaner and more general that will work for any model. Fortunately, there are tokenizer methods to map the subword tokens back to the original input word (e.g., Ġk and train come from ktrain), so we would just need to use something like that.
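To illustrate the idea, here is a minimal sketch (assuming the Hugging Face fast-tokenizer API, with roberta-base as a stand-in model and random vectors in place of real model outputs) that keeps the first subword's vector for each word via word_ids():

    from transformers import AutoTokenizer
    import numpy as np

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # a "fast" tokenizer
    enc = tokenizer("I use ktrain 99.9% of the time.")
    word_ids = enc.word_ids()  # e.g. [None, 0, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, None]

    # Placeholder for the model's subword embeddings: shape (num_subwords, hidden_size).
    subword_vectors = np.random.rand(len(word_ids), 768)

    word_vectors, seen = [], set()
    for idx, wid in enumerate(word_ids):
        if wid is None or wid in seen:   # skip special tokens and non-initial subwords
            continue
        seen.add(wid)
        word_vectors.append(subword_vectors[idx])  # first-subword strategy
    word_vectors = np.array(word_vectors)          # one vector per word in the input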

@Niekvdplas
Contributor Author

Niekvdplas commented Apr 30, 2022

Hi @amaiya,

I have figured out how to get the mapping between words and subwords, but I found some results that contradict your example. The correct way to map subwords back to their words is word_ids(), which is only available on 'fast' tokenizers. (I also went down the rabbit hole with return_offsets_mapping, which works if you treat subwords as pieces with no spaces between them; I'll share my thoughts on that as well.) Here are the values those functions returned for your example:

tokens = ['I', 'Ġuse', 'Ġk', 'train', 'Ġ99', '.', '9', '%', 'Ġof', 'Ġthe', 'Ġtime', '.']

word_mapping = [0, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, None, None]

offset_mapping = [[0, 1], [2, 5], [6, 7], [7, 12], [13, 15], [15, 16], [16, 17], [17, 18], [19, 21], [22, 25], [26, 30], [30, 31], [0, 0], [0, 0]]

According to word_mapping, only 'train' is a subword belonging to 'k'. According to offset_mapping, 'train' is a subword of 'k' (its starting index equals the previous end index), '99.9%' is a full word (there are no spaces between the tokens), and 'time.' is a full word (likewise).

Two different results from two different ways of determining the input tokens, both of which also differ from your count of 8 input tokens.
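For reference, a minimal snippet (assuming a Hugging Face fast tokenizer such as roberta-base) that prints both mappings for the example sentence; the exact output depends on the special-token and padding settings:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # loads the "fast" tokenizer by default
    enc = tokenizer("I use ktrain 99.9% of the time.", return_offsets_mapping=True)

    print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # the subword tokens
    print(enc.word_ids())          # word index of each subword (None for special tokens)
    print(enc["offset_mapping"])   # (start, end) character spans; (0, 0) for special tokens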

Based on the offset_mapping (as it is closer to what you were expecting and without having to resort to fast tokenizers) I've implemented the following code:

    def embed(self, texts, word_level=True, max_length=512):
        """
        ```
        get embedding for word, phrase, or sentence
        Args:
          text(str|list): word, phrase, or sentence or list of them representing a batch
          word_level(bool): If True, returns embedding for each token in supplied texts.
                            If False, returns embedding for each text in texts
          max_length(int): max length of tokens
        Returns:
            np.ndarray : embeddings
        ```
        """
        if isinstance(texts, str): texts = [texts]
        if not isinstance(texts[0], str): texts = [" ".join(text) for text in texts]

        sentences = []
        for text in texts:
            sentences.append(self.tokenizer.tokenize(text))
        maxlen = len(max([tokens for tokens in sentences], key=len,)) + 2
        if max_length is not None and maxlen > max_length: maxlen = max_length # added due to issue #270

        encoded_input = self.tokenizer.batch_encode_plus(texts, max_length=maxlen, padding='max_length', return_tensors='tf', return_offsets_mapping=True, add_special_tokens=False, truncation=True)
        offset_mapping = encoded_input['offset_mapping'].numpy()
        #Remove offset_mapping as it breaks the model_output
        del encoded_input['offset_mapping']
        model_output = self.model(**encoded_input)
        raw_embeddings = model_output[0].numpy() # output_hidden_states=True


        if not word_level:
            k = np.mean(raw_embeddings, axis=1)
            return k


        #Remove subword vectors and replace with mean of the subword vectors
        filtered_embeddings = []
        for i in range(len(raw_embeddings)):
            filtered_embedding = []
            raw_embedding = raw_embeddings[i]
            subvectors = []
            last_index = -1
            for j in range(len(raw_embedding)):
                if offset_mapping[i][j][0] == last_index:
                    subvectors.append(raw_embedding[j])
                    last_index = offset_mapping[i][j][1]
                if offset_mapping[i][j][0] > last_index:
                    if len(subvectors) > 0:
                        filtered_embedding.append(np.mean(subvectors, axis=0))
                        subvectors = []
                    subvectors.append(raw_embedding[j])
                    last_index = offset_mapping[i][j][1]
            if len(subvectors) > 0:
                filtered_embedding.append(np.mean(subvectors, axis=0))
                subvectors = []
            filtered_embeddings.append(filtered_embedding) 


        #Pad to max length
        max_length = max([len(e) for e in filtered_embeddings])
        embeddings = []
        for e in filtered_embeddings:
            for i in range(max_length - len(e)):
                e.append(np.zeros((self.embsize, )))
            embeddings.append(np.array(e))
        return np.array(embeddings)

And this implementation got rid of my former bug and actually started running lr_find(), but I got a similar error a few steps in:

"Can't convert non-rectangular Python sequence to Tensor." Perhaps I am missing another small issue, or am I not converting the tensors the right way? My guess is it has something to do with how I calculate the mean of the subvectors.

(This was fixed by adding truncation=True.)

Halfway through the steps I got a similar error again:

"Dimension 1 in both shapes must be equal: shape[0] = [32,509,100] vs. shape[1] = [32,449,768]"

This implementation is a generalized solution for all transformers.

Here is the same code but filtering by word_ids, which is the correct way according to Hugging Face (link):


        filtered_embeddings = []
        for i in range(len(raw_embeddings)):
            word_ids = encoded_input.word_ids(i)
            filtered_embedding = []
            raw_embedding = raw_embeddings[i]
            subvectors = []
            last_id = -1
            for j in range(len(raw_embedding)):
                if word_ids[j] == None:
                    continue
                if word_ids[j] == last_id:
                    subvectors.append(raw_embedding[j])
                if word_ids[j] > last_id:
                    if len(subvectors) > 0:
                        filtered_embedding.append(np.mean(subvectors, axis=0))
                        subvectors = []
                    subvectors.append(raw_embedding[j])
                    last_id = word_ids[j]
            if len(subvectors) > 0:
                filtered_embedding.append(np.mean(subvectors, axis=0))
                subvectors = []
            filtered_embeddings.append(filtered_embedding) 

Now I get the same error as with the other implementation, but right at the start:

ConcatOp : Dimension 1 in both shapes must be equal: shape[0] = [32,509,1] vs. shape[1] = [32,450,768]

@amaiya
Owner

amaiya commented May 1, 2022

@Niekvdplas This is fantastic - thanks. I will take a closer look at this early next week. We should hopefully have a working generalized solution soon.

@Niekvdplas
Contributor Author

Niekvdplas commented May 1, 2022

Yes, I hope so too! :-) I created a fork for this feature. Please let me know what I am missing so I can implement those facets as well.

@amaiya
Owner

amaiya commented May 1, 2022

In my earlier post, I may have stated something inaccurately. In the sentence, "I use ktrain 99.9% of the time.", "ktrain" is assigned a single embedding (e.g., average over subwords or take first subword vector), but "99.9%" is assigned 4 embeddings (one for each token). This is the way things are done now and is consistent with the word IDs returned from transformers:

#11 word embeddings for this sentence (not 12) since ktrain is assigned a single word embedding
[None, 0, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, None] 

The issue is that the input text may or may not have "99.9%" tokenized as separate tokens (e.g., separate rows in a CoNLL-formatted training set), which causes problems like the shape mismatch you're seeing. The current solution is to transform the input to be consistent with the tokenization scheme of the transformers model (minus the subword tokenizations for words like "ktrain"). That is, if "99.9%" appears as a single token in the input training set, then it will be transformed into separate tokens if that's what the transformers model does:

Niiek   B-PER
uses    O
ktrain  O
99      O   
.       O
9       O
%       O
of      O
the     O
time    O
.       O

This is done in text.ner.anago.preprocessing.IndexTransformer.fix_tokenization. Right now, the tokenization fix is implemented in a way that is specific to BERT, so it just needs a cleaner, generalized solution like you did for the embed method (i.e., look at word IDs to determine if a token like "99.9%" needs to be split into separate "rows" with separate/duplicated targets).

If you generalize the fix_tokenization method (similar to what you did in embed by looking at word IDs), then this should hopefully resolve the shape errors.
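A minimal sketch of the tag-duplication rule described above (illustrative only, with a hypothetical helper name; this is not the actual fix_tokenization code):

    def split_token_tags(pieces, tag):
        """pieces: the words one input token was split into; tag: its original tag."""
        if len(pieces) <= 1:
            return [tag]
        # continuation rows inherit the tag, with B-XXX turned into I-XXX
        cont_tag = "I-" + tag[2:] if tag.startswith("B-") else tag
        return [tag] + [cont_tag] * (len(pieces) - 1)

    print(split_token_tags(["99", ".", "9", "%"], "O"))   # ['O', 'O', 'O', 'O']
    print(split_token_tags(["Ni", "iek"], "B-PER"))       # ['B-PER', 'I-PER']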

@Niekvdplas
Contributor Author

Alright, I will start on that tomorrow morning; hopefully by midday I will come back with results. If everything is tested and working, I will create a PR.

@Niekvdplas
Contributor Author

Niekvdplas commented May 2, 2022

Hi @amaiya, so I've given it quite some time now and here is what I've come up with:

    import unidecode
    def fix_tokenization(self, X, Y, maxlen=U.DEFAULT_TRANSFORMER_MAXLEN, num_special=U.DEFAULT_TRANSFORMER_NUM_SPECIAL):
        """
        Should be called prior training
        """
        if not self.transformer_is_activated():
            return X, Y
        encode = self.te.tokenizer
        new_X = []
        new_Y = []
        for i, x in enumerate(X):
            new_x = []
            new_y =[]
            seq_len = 0
            for j,s in enumerate(x):
                s = unidecode.unidecode(s)
                s = s.replace("'", "")
                encoded = encode(s, add_special_tokens=False, return_tensors='tf')
                subtokens = encoded.tokens()
                word_ids = encoded.word_ids()
                token_len = len(subtokens)
                if seq_len + token_len > (maxlen - num_special): 
                    break
                seq_len += token_len
                last_id = -1
                word = ''
                words = []
                for idx, id in enumerate(word_ids):
                    if id == None:
                        continue
                    elif id == last_id:
                        word += subtokens[idx]
                    elif id > last_id:
                        if len(word) > 0:
                            words.append(word)
                            word = ''
                        word += subtokens[idx]
                        last_id = id
                if len(word) > 0:
                    words.append(word)
                new_x.extend(words)
                if Y is not None:
                    tag = Y[i][j]
                    new_y.extend([tag])
                    if len(words) > 1:
                        new_tag = tag
                        if tag.startswith('B-'):
                            new_tag = 'I-'+tag[2:]
                        new_y.extend([new_tag]*(len(words)-1) )
                    #if tag.startswith('B-'): tag = 'I-'+tag[2:]
            new_X.append(new_x)
            new_Y.append(new_y)
        new_Y = None if Y is None else new_Y
        return new_X, new_Y

As you can see, I do some string manipulation, converting accented and special characters to plain ones (é to e, etc.) and stripping apostrophes, as these break the tokenization for some reason.
Here is an example: 'drop' became ['d, rop, ']. When it came back to embed and was encoded again, it returned the tokens [', d, rop, '], which accounted for quite a few shape mismatches.

The rest is relatively in line with embed. This runs and gets quite far into training, but at some point I get:

"logits and labels must be broadcastable: logits_size=[3808,9] labels_size=[3872,9]" (always a multiple of 32 (batch size) off), I can't find where the problem is and perhaps you've encountered a similar one.

Let me know if you can think of any better fixes than the string manipulation I've done in the above.

@amaiya
Owner

amaiya commented May 2, 2022

Thanks, @Niekvdplas. I've never seen this error before. My guess is it's something in fix_tokenization.

Are you using BERT or CodeBert/Roberta to test? If the latter, I would recommend testing wietsedv/bert-base-dutch-cased with this notebook. You can use your new embed method with the old fix_tokenization method. If everything works with Dutch BERT, this will confirm that your new embed method is a generalized drop-in replacement for what is there now. After that, you can modify fix_tokenization until it also works with Dutch BERT. If everything works after that, then CodeBert can be tested. This might shed light on where the problem is. If you've already done this, then please disregard, of course.

Also, the string manipulation may be problematic. For instance, running unidecode on Chinese tokens sounds like it might be a problem. But, I would try the above first before tackling the issue you're seeing with accented characters. I didn't quite follow why embed and fix_tokenization would be different for something like 'drop' - perhaps you can clarify.

@Niekvdplas
Contributor Author

@amaiya I ran it with the notebook: embed is implemented correctly, as it runs without any problems, so something must be wrong with the fix_tokenization function. The string manipulation tricks should definitely not be used, so I will try something different.

@Niekvdplas
Contributor Author

Niekvdplas commented May 3, 2022

    def fix_tokenization(self, X, Y, maxlen=U.DEFAULT_TRANSFORMER_MAXLEN, num_special=U.DEFAULT_TRANSFORMER_NUM_SPECIAL):
        """
        Should be called prior training
        """
        if not self.transformer_is_activated():
            return X, Y
        encode = self.te.tokenizer
        new_X = []
        new_Y = []
        for i, x in enumerate(X):
            new_x = []
            new_y =[]
            seq_len = 0
            for j,s in enumerate(x):
                encoded = encode(s, add_special_tokens=False, return_offsets_mapping=True, return_tensors='tf')
                offsets = encoded['offset_mapping'].numpy()[0]
                word_ids = encoded.word_ids()
                token_len = len(offsets)
                if seq_len + token_len > (maxlen - num_special): 
                    break
                seq_len += token_len
                last_id = -1
                word = ''
                words = []
                for idx, id in enumerate(word_ids):
                    if id == None:
                        continue
                    elif id == last_id:
                        word += s[offsets[idx][0]:offsets[idx][1]]
                    elif id > last_id:
                        if len(word) > 0:
                            words.append(word)
                            word = ''
                        word += s[offsets[idx][0]:offsets[idx][1]]
                        last_id = id
                if len(word) > 0:
                    words.append(word)
                new_x.extend(words)
                if Y is not None:
                    tag = Y[i][j]
                    new_y.extend([tag])
                    if len(words) > 1:
                        new_tag = tag
                        if tag.startswith('B-'):
                            new_tag = 'I-'+tag[2:]
                        new_y.extend([new_tag]*(len(words)-1) )
            new_X.append(new_x)
            new_Y.append(new_y)
        new_Y = None if Y is None else new_Y
        return new_X, new_Y

Okay, so with some alterations to fix_tokenization, wietsedv/bert-base-dutch-cased runs without problems on the Dutch NER dataset. It is when running it on Roberta that the shape error occurs. I have no clue what is going wrong and am running out of time to find out, so I will unfortunately have to put this on a side track and focus on my research experiments, as my thesis is coming to a close in two months.

Let me know if you find a solution @amaiya

@amaiya
Owner

amaiya commented May 3, 2022

@Niekvdplas Thanks a lot. It turned out that there was only an extra minor fix that was needed to get things working.

The shape mismatches with Roberta were caused by embed further tokenizing words that were already tokenized. All that appeared to be needed was to ensure embed generates an embedding vector for each space-separated token (which is easy to do with offset mappings, as you demonstrated). If 99.9% is a single token in the input, then it gets a single vector. If the input contains 99 . 9 %, then it gets four vectors. The embedding shapes are then always consistent with the inputs to the model.
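To illustrate the grouping idea (a sketch assuming offset mappings from a fast tokenizer; this is not the actual code on the nerroberta branch): subword vectors whose character spans are contiguous belong to the same space-separated token and are averaged into one vector.

    import numpy as np

    def group_by_space_separated_token(subword_vectors, offsets):
        """Average contiguous subword vectors into one vector per space-separated token.

        subword_vectors: array of shape (num_subwords, hidden_size) from the model
        offsets: list of (start, end) character spans; special/pad tokens are (0, 0)
        """
        grouped, current, prev_end = [], [], None
        for vec, (start, end) in zip(subword_vectors, offsets):
            if start == end:                 # (0, 0) spans are special or padding tokens
                continue
            if prev_end is not None and start > prev_end and current:
                grouped.append(np.mean(current, axis=0))   # gap => a new input token starts
                current = []
            current.append(vec)
            prev_end = end
        if current:
            grouped.append(np.mean(current, axis=0))
        return np.array(grouped)

    # With offsets for "99.9%" like (13,15),(15,16),(16,17),(17,18), the four subword
    # vectors are contiguous and collapse into one vector; for "99 . 9 %", each piece is
    # separated by whitespace, so four vectors are produced.

Averaging is just one choice here; taking the first subword's vector, as discussed earlier in the thread, would also work.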

I tested this with both wietsedv/bert-base-dutch-cased (BERT) and delobelle/robbert-v2-dutch-base (Roberta), in addition to CodeBert (also Roberta), on the Dutch NER dataset. All models trained successfully with no errors. The updates have been pushed to a feature branch called nerroberta. Note that your original version was yielding a lower validation F1 score for some reason, so I re-applied the changes in a more minimally invasive way (e.g., no batch encoding) and F1 went back up. I will have to look into this later.

You did the heavy lifting on this, so if you want to copy the embed method and the fix_tokenization method from the feature branch into a PR (with develop as the base for the PR), you can submit it and get credited as the contributor for these updates when it is merged into develop.

@Niekvdplas
Contributor Author

Alright, I did that @amaiya. Thanks for the help!
