
How to create vocab for other language? #13

Closed
008karan opened this issue Oct 4, 2019 · 5 comments

Comments

@008karan

008karan commented Oct 4, 2019

Thanks for the implementation!
I would like to know how you created the vocab for training ALBERT.
I am using SentencePiece for this, but which model_type should I choose: bpe, word, or char?

@brightmart
Owner

Hi, we used the vocabulary from BERT; for Chinese it is a single character per token.

You can check BERT or XLNet for reference.

@008karan
Author

008karan commented Oct 4, 2019

I have seen BERT's vocab. They used WordPiece for it, which generates subwords like ##word, but in SentencePiece that's not the case. Any suggestions?

@brightmart
Owner

XLNet uses SentencePiece. You can easily generate a vocabulary from your own corpus: check the XLNet repo, and then take a look at its tokenizer.

You can also check BERT's non-English support: https://github.com/google-research/bert/blob/master/multilingual.md

@lonePatient
Collaborator

@008karan SentencePiece does quite the opposite of WordPiece. From the documentation: SentencePiece first escapes whitespace with the meta symbol "▁" (U+2581), as follows:

Hello▁World.

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]

Subwords that occur after whitespace (which are also those most words begin with) are prepended with '▁', while the others are unchanged. The exception is subwords that only ever occur at the beginning of a sentence, but such cases should be quite rare.
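The '▁' marker also makes detokenization trivial: concatenate the pieces and replace '▁' with a space. A minimal sketch using the example pieces above:

```python
# Reassemble the example segmentation from the SentencePiece docs.
pieces = ["Hello", "▁Wor", "ld", "."]
text = "".join(pieces).replace("▁", " ")
print(text)  # → Hello World.
```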

So, to obtain a vocabulary analogous to WordPiece, we need to perform a simple conversion: remove "▁" from the tokens that contain it and add "##" to the ones that don't.

def parse_sentencepiece_token(token):
    # Word-initial pieces carry the "▁" whitespace marker; drop it.
    if token.startswith("▁"):
        return token[1:]
    # Word-internal pieces get the WordPiece-style "##" prefix.
    return "##" + token
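For example, applying this conversion to a few SentencePiece vocabulary entries (a self-contained sketch that repeats the helper so it runs on its own; the sample tokens are illustrative):

```python
def parse_sentencepiece_token(token):
    # Word-initial pieces carry the "▁" whitespace marker; drop it.
    if token.startswith("▁"):
        return token[1:]
    # Word-internal pieces get the WordPiece-style "##" prefix.
    return "##" + token

# Hypothetical SentencePiece vocab entries and their WordPiece-style forms.
vocab = ["▁Hello", "▁Wor", "ld", "."]
print([parse_sentencepiece_token(t) for t in vocab])
# → ['Hello', 'Wor', '##ld', '##.']
```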

@brightmart
Owner

good job
