Symbols don't match between model and embedding #2

Closed
noe opened this issue Oct 18, 2017 · 6 comments

noe commented Oct 18, 2017

It seems that the symbols from the sentencepiece model and the associated embeddings are not the same; specifically, control symbols are not present in the embedding. For instance, de.wiki.bpe.op50000.model contains the symbols <s>, </s> and <unk>, but in the associated embedding de.wiki.bpe.vs50000.d300.vectors.bin only <unk> is defined.

Update: further exploration reveals that the number of symbols common to the two files above is 49,631.
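A minimal sketch of how one could check this overlap, assuming the .bin file is in word2vec binary format readable by gensim; the file names are the ones mentioned above, everything else is illustrative:

```python
# Sketch only: compare the sentencepiece vocabulary with the embedding vocabulary.
# Assumes the .bin file can be loaded with gensim's word2vec binary reader.
import sentencepiece as spm
from gensim.models import KeyedVectors

sp = spm.SentencePieceProcessor()
sp.Load("de.wiki.bpe.op50000.model")
model_pieces = {sp.IdToPiece(i) for i in range(sp.GetPieceSize())}

vectors = KeyedVectors.load_word2vec_format(
    "de.wiki.bpe.vs50000.d300.vectors.bin", binary=True)
embedding_pieces = set(vectors.index_to_key)  # gensim 4.x attribute name

# Per the report above, the intersection comes out to 49,631 pieces.
print(len(model_pieces), len(embedding_pieces), len(model_pieces & embedding_pieces))
```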

bheinzerling (Owner) commented:

<s>, </s> and <unk> are generated automatically by sentencepiece. <s> and </s> do not occur in the texts I used for training the embeddings, since I didn't apply any preprocessing other than the preproc_text.sh script. That's why there are no embeddings for <s> and </s>.

Do you need <s> and </s> for your specific application?

noe commented Oct 18, 2017

Any NMT system is likely to need </s> to signal the end of the sentence in the translation.

noe commented Oct 18, 2017

Also, if tokenization is performed with sentencepiece and the output is then looked up in the embeddings from the .bin file (as suggested in the Readme), the symbol mismatch means there will be unknowns that shouldn't appear, right?

bheinzerling commented Oct 18, 2017

Yes, there will be unknowns, because I set the GloVe minimum token frequency to 100 for high-resource languages, and to 10 for low-resource languages, since you probably won't get good embeddings for tokens that occur too rarely.

I know this is not ideal, but so far I haven't found a good solution. In the original BPE implementation (subword-nmt), you can set a minimum symbol frequency, meaning that no BPE symbols with a lower frequency will be generated; in that case there wouldn't be any unknowns.

I haven't found a similar option in sentencepiece, but I have to use sentencepiece because it is so much faster.

Regarding <s>: Reading the docs again, it seems sentencepiece can add sentence markers automatically, but so far I haven't used that option since I didn't need it for my use case. Let me look into how well this works across languages.
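As a practical workaround for the frequency cutoff, one can look embeddings up by piece string and fall back to the <unk> vector for pieces that were dropped. This is a minimal sketch, not from this thread, reusing the `sp` and `vectors` objects from the snippet above:

```python
import numpy as np

def embed_sentence(sentence):
    # Encode to subword pieces, then map each piece to its vector,
    # substituting the <unk> vector for pieces without an embedding.
    pieces = sp.EncodeAsPieces(sentence)
    unk_vec = vectors["<unk>"]
    return np.stack([vectors[p] if p in vectors else unk_vec for p in pieces])

print(embed_sentence("Das ist ein Test").shape)  # (number_of_pieces, 300)
```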

noe commented Oct 18, 2017

Thanks @bheinzerling. Also, I would suggest adding a warning to the Readme about the difference in tokens. It would save time for anyone thinking about using the sentencepiece Python wrapper and then using the embeddings directly with the token IDs generated by sentencepiece.SentencePieceProcessor.EncodeAsIds.
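To make the pitfall concrete: the IDs from EncodeAsIds index the sentencepiece vocabulary, which here is larger than the embedding vocabulary, so they cannot be used directly as row indices into the embedding matrix. A minimal sketch of the check, reusing the names from the snippets above (nothing here is from the repository):

```python
ids = sp.EncodeAsIds("Das ist ein Test")
for i in ids:
    piece = sp.IdToPiece(i)
    if piece not in vectors:
        # The sentencepiece ID exists, but there is no matching embedding,
        # so a direct ID-to-row lookup would not line up with the vectors.
        print(f"piece {piece!r} (id {i}) has no embedding; map it to <unk> instead")
```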

bheinzerling (Owner) commented:

This is (finally) fixed in the latest version.
