About orig_to_token_index padding problem #31

Closed
EternalEep opened this issue Sep 6, 2021 · 2 comments

@EternalEep

I have a question about your dataset preprocessing code.

In model/transformers_embedder.py there is the TransformersEmbedder class. Its forward function returns the following (marked as code [1]):

return torch.gather(word_rep[:, :, :], 1, orig_to_token_index.unsqueeze(-1).expand(batch_size, max_sent_len, rep_size))

As we know, this is a gather operation that uses orig_to_token_index as the index. But within a batch, different sentences have different lengths. So I looked at your batch preprocessing code in data/transformers_dataset.py (marked as code [2]):
orig_to_tok_index = feature.orig_to_tok_index + [0] * padding_word_len
label_ids = feature.label_ids + [0] * padding_word_len

You pad orig_to_tok_index and label_ids with [0] * padding_word_len. So when code [1] runs, the padded positions get index 0 and therefore pick up the [CLS] embedding vector from BERT, and the model then predicts label index 0 (PAD's index in your code) from that [CLS] embedding.
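To make the concern concrete, here is a minimal sketch (not the repository's code; the tensor shapes and the position of [CLS] at subword index 0 are assumptions) showing that a padded index of 0 makes torch.gather copy the representation at subword position 0 into every padding slot:

```python
import torch

batch_size, num_subwords, max_sent_len, rep_size = 1, 6, 4, 3

# word_rep: subword-level encoder output; row 0 plays the role of [CLS]
word_rep = torch.arange(batch_size * num_subwords * rep_size, dtype=torch.float)
word_rep = word_rep.view(batch_size, num_subwords, rep_size)

# Suppose the sentence has two real words mapping to subwords 1 and 3,
# and the remaining two positions are padded with 0.
orig_to_tok_index = torch.tensor([[1, 3, 0, 0]])  # (batch_size, max_sent_len)

gathered = torch.gather(
    word_rep, 1,
    orig_to_tok_index.unsqueeze(-1).expand(batch_size, max_sent_len, rep_size),
)

# The padded positions now hold the [CLS] row.
print(torch.equal(gathered[0, 2], word_rep[0, 0]))  # True
print(torch.equal(gathered[0, 3], word_rep[0, 0]))  # True
```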

I think it's a little weird to pad the orig_to_token_index list with 0 and then predict the PAD label from [CLS]. Do we need to change it, or have I just misunderstood the logic?

Looking forward to your explanation.
Thank you very much!

@allanj
Owner

allanj commented Sep 6, 2021

For code [2], it is practically the same if we use some other index instead of 0.

Because in the CRF, we take the subsequence representation instead of the full (padded) representation.

Particularly in this step (https://github.com/allanj/pytorch_neural_crf/blob/master/src/model/module/linear_crf_inferencer.py#L102): if word_seq_len is only 8 but the max length is 10, we only take the scores up to position 8. The rest are not considered.
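In other words, a length-based mask keeps the padded positions out of the CRF score, so whatever value sits in those slots never matters. A minimal sketch (assumed shapes and names, not the linked CRF code):

```python
import torch

batch_size, max_sent_len = 2, 10
word_seq_lens = torch.tensor([8, 10])  # real sentence lengths

# e.g. per-position emission scores from the encoder
scores = torch.randn(batch_size, max_sent_len)

# mask[i, j] is True only for positions j < word_seq_lens[i]
mask = torch.arange(max_sent_len).unsqueeze(0) < word_seq_lens.unsqueeze(1)

# Only the first word_seq_lens[i] scores contribute; padded positions are dropped.
masked_sum = (scores * mask).sum(dim=1)
print(mask[0])  # last two entries are False for the length-8 sentence
```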

@EternalEep
Author

I understand what you mean now. So you use a mask in the CRF layer. Thank you for your explanation.
It seems I should read the CRF code more carefully. Thank you, Allan.
