How to create vocab for other languages? #13
Thanks for the implementation!
I would like to know how you created the vocab for training ALBERT. I am using SentencePiece for this, but which `model_type` should I choose: `bpe`, `word`, or `char`?
Hi, we used the vocabulary from BERT; for Chinese it is single characters. You can check BERT or XLNet for reference.
I have seen BERT's vocab. They used WordPiece for it, which generates subwords marked with a `##` prefix.
In XLNet, SentencePiece is used, so you can easily generate a vocabulary based on your own corpus. Check here: xlnet, and then take a look at the tokenizer. You can also check the non-English support from BERT: https://github.com/google-research/bert/blob/master/multilingual.md
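For reference, here is a minimal sketch of training a SentencePiece model on your own corpus (the file name `corpus.txt` and model prefix `m` are placeholders, not from this thread). `model_type` accepts `unigram` (the default), `bpe`, `char`, or `word`:

```python
import sentencepiece as spm

# Train a SentencePiece model on a plain-text corpus (one sentence per line).
# "corpus.txt" and the "m" prefix are placeholders for your own files.
spm.SentencePieceTrainer.train(
    input="corpus.txt",    # raw training text
    model_prefix="m",      # writes m.model and m.vocab
    vocab_size=32000,      # size of the resulting vocabulary
    model_type="unigram",  # or "bpe", "char", "word"
)
```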
@008karan SentencePiece does quite the opposite of WordPiece. From the documentation: SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows:

Hello▁World.

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]
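Roughly, the same segmentation with the Python bindings looks like this, assuming a trained model file `m.model` (a placeholder name); the exact pieces depend on the learned vocabulary:

```python
import sentencepiece as spm

# Load a trained model ("m.model" is a placeholder) and segment a sentence.
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# encode_as_pieces returns the subword pieces, with "▁" marking
# pieces that begin a new whitespace-delimited word.
print(sp.encode_as_pieces("Hello World."))
```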
Subwords which occur after whitespace (which are also those that most words begin with) are prepended with "▁", while others are unchanged. This excludes subwords which only occur at the beginning of sentences and nowhere else, but those cases should be quite rare. So, in order to obtain a vocabulary analogous to WordPiece, we need to perform a simple conversion: remove "▁" from the tokens that contain it and add "##" to the ones that don't.

```python
def parse_sentencepiece_token(token):
    # Word-initial piece: strip SentencePiece's "▁" whitespace marker.
    if token.startswith("▁"):
        return token[1:]
    # Word-internal piece: mark it with WordPiece's "##" prefix.
    else:
        return "##" + token
```
Good job!