
How to create vocab for other language? #13

Closed
008karan opened this issue Oct 4, 2019 · 5 comments

Comments

@008karan

008karan commented Oct 4, 2019

Thanks for the implementation!
I would like to know how you created the vocab for training ALBERT.
I am using SentencePiece for this, but which model_type should I choose: bpe, word, or char?

@brightmart
Owner

Hi, we used the vocabulary from BERT; for Chinese it is a single character per token.

You can check BERT or XLNet for reference.

@008karan
Author

008karan commented Oct 4, 2019

I have seen BERT's vocab. They used WordPiece for it, which generates subwords like ##word, but in SentencePiece that's not the case. Any suggestions?

@brightmart
Owner

XLNet uses SentencePiece. You can easily generate a vocabulary from your own corpus: check the XLNet repo, and then take a look at its tokenizer.

You can also check BERT's non-English support: https://github.com/google-research/bert/blob/master/multilingual.md

@lonePatient
Collaborator

@008karan SentencePiece does quite the opposite of WordPiece. From the documentation: SentencePiece first escapes whitespace with the meta symbol "▁" (U+2581), as follows:

Hello▁World.

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]

Subwords that occur after whitespace (which are also those most words begin with) are prepended with '▁', while the others are unchanged. The exception is subwords that only ever occur at the beginning of a sentence, but such cases should be quite rare.
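The '▁' marker also makes detokenization trivial: concatenate the pieces and replace '▁' with a space. A minimal sketch using the example pieces above:

```python
# Reassemble the example segmentation from the SentencePiece docs.
pieces = ["Hello", "▁Wor", "ld", "."]
text = "".join(pieces).replace("▁", " ")
print(text)  # → Hello World.
```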

So, to obtain a vocabulary analogous to WordPiece, we need to perform a simple conversion: remove "▁" from the tokens that contain it and add "##" to the ones that don't.

def parse_sentencepiece_token(token):
    # Word-initial pieces carry the "▁" whitespace marker; drop it.
    if token.startswith("▁"):
        return token[1:]
    # Word-internal pieces get the WordPiece-style "##" prefix.
    return "##" + token
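For example, applying this conversion to a few SentencePiece vocabulary entries (a self-contained sketch that repeats the helper so it runs on its own; the sample tokens are illustrative):

```python
def parse_sentencepiece_token(token):
    # Word-initial pieces carry the "▁" whitespace marker; drop it.
    if token.startswith("▁"):
        return token[1:]
    # Word-internal pieces get the WordPiece-style "##" prefix.
    return "##" + token

# Hypothetical SentencePiece vocab entries and their WordPiece-style forms.
vocab = ["▁Hello", "▁Wor", "ld", "."]
print([parse_sentencepiece_token(t) for t in vocab])
# → ['Hello', 'Wor', '##ld', '##.']
```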

@brightmart
Owner

good job
