How to train a Chinese RoBERTa #2196
Comments
Hi, I have run into this problem too. Have you solved it? If so, please help me! @RyanHuangNLP
fairseq is not very friendly for everyday use... I am also wondering how to integrate other Chinese PLM models into the fairseq framework.
Here is my implementation for Chinese BERT; it may help you.
I ported Chinese-BERT-wwm/Chinese-RoBERTa-wwm to Fairseq. I made minimal changes to Fairseq so that it is compatible with (Chinese) BERT-based tasks, and provided an option for the Chinese Whole Word Masking strategy.
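For reference, a minimal sketch of how whole-word masking candidates can be grouped from WordPiece tokens; the function name and masking probability are illustrative, not taken from that implementation:

```python
import random

def whole_word_mask_indices(tokens, mask_prob=0.15):
    """Group WordPiece pieces ("##x") with their head token and decide masking
    per whole word, so a word is either fully masked or left untouched."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)   # continuation piece joins the previous word
        else:
            words.append([i])     # head piece starts a new word
    masked = []
    for word in words:
        if random.random() < mask_prob:
            masked.extend(word)   # mask every piece of the selected word
    return masked
```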
❓ Questions and Help
I want to train a Chinese RoBERTa. Following the pretraining tutorial, I was wondering how to generate the Chinese equivalents of these three files:
https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
I found that the vocab file is different from the one in Google's BERT.
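For reference, a minimal sketch of how Google's vocab.txt could be turned into a fairseq-style dict.txt; the helper name, file paths, and dummy counts are placeholders:

```python
# Hypothetical helper: convert Google's BERT vocab.txt into a fairseq-style dict.txt.
# fairseq's dict.txt holds one "<token> <count>" pair per line; the count only
# drives frequency ordering, so a dummy value is enough. fairseq adds its own
# special symbols (<s>, <pad>, </s>, <unk>, and later <mask>) on top of this file.
def bert_vocab_to_fairseq_dict(vocab_path, dict_path):
    with open(vocab_path, encoding="utf-8") as fin, \
         open(dict_path, "w", encoding="utf-8") as fout:
        for line in fin:
            token = line.strip()
            if token:
                fout.write(f"{token} 1\n")

bert_vocab_to_fairseq_dict("vocab.txt", "dict.txt")
```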
I have tried to change the script multiprocessing_bpe_encoder.py (class MultiprocessingEncoder) to use the BertTokenizer, roughly as sketched below, and to preprocess the corpus with Google's vocab.txt. But when I start training the RoBERTa, the vocab embedding becomes [21133, 768], while Google's vocab.txt has only 21128 tokens. Does fairseq add 5 special tokens? When I convert the weights to the transformers format, I found that the tokens are not equal.
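A minimal sketch of such a change, assuming the transformers BertTokenizer and that the encoder writes out space-separated WordPiece tokens for fairseq-preprocess to map against dict.txt (this is an illustration, not the exact original edit):

```python
from transformers import BertTokenizer

class MultiprocessingEncoder(object):
    """Sketch: swap the GPT-2 BPE for Google's WordPiece tokenizer."""

    def __init__(self, vocab_file):
        self.vocab_file = vocab_file   # Google's vocab.txt

    def initializer(self):
        global bpe
        bpe = BertTokenizer(vocab_file=self.vocab_file)

    def encode(self, line):
        global bpe
        # Emit WordPiece token strings; fairseq-preprocess later maps them to
        # ids using the dict.txt built from the same vocab.txt.
        return bpe.tokenize(line)

    def encode_lines(self, lines):
        enc_lines = []
        for line in lines:
            line = line.strip()
            if len(line) == 0:
                return ["EMPTY", None]
            enc_lines.append(" ".join(self.encode(line)))
        return ["PASS", enc_lines]
```

As for the [21133, 768] embedding: fairseq's Dictionary registers <s>, <pad>, </s>, and <unk>, and the masked-LM task adds <mask>, which would account for the 5 extra entries on top of the 21128 tokens in vocab.txt.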
Another question: I also wonder how to initialize the weights from another pretrained Chinese BERT, because the token embedding sizes are not equal (21133 != 21128) and the max position embedding sizes are also not equal (514 != 512).
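For reference, a minimal sketch of one way to handle the size mismatch: copy rows by matching token strings instead of indices. The function and argument names are assumptions; fairseq_emb and bert_emb stand for the two torch.nn.Embedding modules:

```python
import torch

def init_from_bert(fairseq_symbols, bert_vocab, fairseq_emb, bert_emb):
    """Copy the BERT embedding row for every token both vocabularies share.

    fairseq_symbols: list of token strings in fairseq dictionary order
    bert_vocab: dict mapping token string -> id, read from Google's vocab.txt
    Tokens without a match (e.g. <s>, <pad>, </s>, <unk>, <mask>) keep their
    random initialization, so the 21133 vs 21128 mismatch is harmless.
    """
    with torch.no_grad():
        for fs_idx, token in enumerate(fairseq_symbols):
            bert_idx = bert_vocab.get(token)
            if bert_idx is not None:
                fairseq_emb.weight[fs_idx].copy_(bert_emb.weight[bert_idx])

# Positions: fairseq's learned positional embeddings are offset by padding_idx + 1,
# which is why the matrix has 514 rows for 512 usable positions; a common mapping
# is to copy BERT's 512 position rows into fairseq rows 2..513.
```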