I have some problems with your dataset preprocessing code.
In the model/transformers_embedder.py file there is the TransformersEmbedder class. Its forward function returns (marked as code [1]):
return torch.gather(word_rep[:, :, :], 1, orig_to_token_index.unsqueeze(-1).expand(batch_size, max_sent_len, rep_size))
As we know, this gather uses orig_to_token_index as its index, but within a batch different sentences have different lengths. So I looked at your batch preprocessing code in data/transformers_dataset.py (marked as code [2]):
orig_to_tok_index = feature.orig_to_tok_index + [0] * padding_word_len
label_ids = feature.label_ids + [0] * padding_word_len
You pad orig_to_tok_index and label_ids with [0] * padding_word_len. So when code [1] runs, every padded position gathers index 0, which means every padded position receives the [CLS] embedding vector from BERT, and that [CLS] embedding is then predicted as label index 0 (the PAD label in your code).
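To make the question concrete, here is a minimal self-contained sketch of what I think happens (the sentence, shapes, and values below are made-up examples of mine, not taken from the repo; my assumption is that orig_to_tok_index stores, for each word, the position of its first subword, with subword position 0 being [CLS]): when orig_to_tok_index is padded with 0, torch.gather copies the [CLS] vector into every padded word slot.

import torch

# Made-up example: words ["Allan", "likes", "huggingface"] become
# wordpieces ["[CLS]", "Allan", "likes", "hugging", "##face", "[SEP]"],
# so the words' first subwords sit at positions 1, 2, 3 and position 0 is [CLS].
batch_size, max_subword_len, max_sent_len, rep_size = 1, 6, 5, 4
word_rep = torch.randn(batch_size, max_subword_len, rep_size)  # subword reps from BERT

# 3 real words, padded to max_sent_len = 5 with 0, as in code [2]
orig_to_tok_index = torch.tensor([[1, 2, 3, 0, 0]])

# code [1]: gather the first-subword representation for every word slot
word_level_rep = torch.gather(
    word_rep, 1,
    orig_to_tok_index.unsqueeze(-1).expand(batch_size, max_sent_len, rep_size))

# The two padded slots hold exactly the [CLS] vector
print(torch.equal(word_level_rep[0, 3], word_rep[0, 0]))  # True
print(torch.equal(word_level_rep[0, 4], word_rep[0, 0]))  # True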
It seems a little weird to pad orig_to_tok_index with 0 and then predict the PAD label for the [CLS] embedding. Do we need to change it, or have I misunderstood the logic?
I look forward to your explanation.
Thank you very much!
Now I understand what you meant: you use a mask in the CRF layer. Thank you for your explanation.
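For anyone else who was confused like me, here is how I now picture the masking (a minimal sketch using the third-party torchcrf package purely as an illustration, not the CRF implementation in this repo; all names and values are made up): a mask built from the real sentence lengths is passed to the CRF, so the padded positions, with their [CLS] representations and PAD labels, never enter the loss.

import torch
from torchcrf import CRF   # illustrative third-party CRF, not the repo's implementation

num_tags, batch_size, max_sent_len = 5, 1, 5
crf = CRF(num_tags, batch_first=True)

emissions = torch.randn(batch_size, max_sent_len, num_tags)  # scores for each word slot
label_ids = torch.tensor([[1, 2, 3, 0, 0]])                  # last two slots are PAD (label 0)
word_seq_len = torch.tensor([3])                             # real length of the sentence

# mask is True only for real word positions, so padded slots are ignored by the CRF
mask = torch.arange(max_sent_len).unsqueeze(0) < word_seq_len.unsqueeze(1)
loss = -crf(emissions, label_ids, mask=mask, reduction='mean')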
It seems I should read the CRF code more carefully. Thank you, Allan.