How to train a chinese roberta #2196

RyanHuangNLP · 2020-05-29T07:32:25Z

❓ Questions and Help

Before asking:

search the issues.
search the docs.

I want to train a chinese roberta, following the pretrain tutorial, I was wonder how to generate `https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json`、`https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe`、`https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt` these three file in chinese

I found that the vocab file is different from the one in google's bert

I have try to change script multiprocessing_bpe_encoder.py class MultiprocessingEncoder to use the BertTokenizer, the changing is in the following code, and preprocess the corpus with the google's vocab.txt.

but while I begin to train the roberta, the vocab embedding become [21133, 768], but the google's vocab.txt just have 21128 tokens, fairseq will add special 5 token ? when I convert the weight to the transformer one, I found that tokens are not equal

Another question

I also wonder how to initial the weight by other pretrain chinese bert, because of the token embedding size not equal(21133 != 21128) and max position embedding is also not equal(514 != 512)

Code

class MultiprocessingEncoder(object):

    def __init__(self, args):
        self.args = args

    def initializer(self):
        global bpe
        bpe = BertTokenizer.from_pretrained('bert-base-uncased')

    def encode(self, line):
        global bpe
        subword = bpe._tokenize(line)
        return subword

    def decode(self, tokens):
        global bpe
        return bpe.decode(tokens)

    def encode_lines(self, lines):
        """
        Encode a set of lines. All lines will be encoded together.
        """
        enc_lines = []
        for line in lines:
            line = line.strip()
            if len(line) == 0 and not self.args.keep_empty:
                return ["EMPTY", None]
            tokens = self.encode(line)
            enc_lines.append(" ".join(tokens))
        return ["PASS", enc_lines]

    def decode_lines(self, lines):
        dec_lines = []
        for line in lines:
            tokens = map(int, line.strip().split())
            dec_lines.append(self.decode(tokens))
        return ["PASS", dec_lines]

What have you tried?

What's your environment?

fairseq Version (e.g., 1.0 or master):
PyTorch Version (e.g., 1.0)
OS (e.g., Linux):
How you installed fairseq (pip, source):
Build command you used (if compiling from source):
Python version:
CUDA/cuDNN version:
GPU models and configuration:
Any other relevant information:

The text was updated successfully, but these errors were encountered:

zhazhuanling · 2020-08-04T04:03:56Z

Hi, I have met this problem. Have you solved it? If so, please help me! @RyanHuangNLP

CheungZeeCn · 2020-10-19T05:55:58Z

fairseq is not so friendly in common usage... I am wondering how to integrate other Chinese PLM models into fairseq framework too...

CheungZeeCn · 2020-12-13T17:59:16Z

My implementation for Chinese bert, may it helps you.
https://github.com/CheungZeeCn/fairseq/blob/master/Sentence_prediction_Chinese_bert_demo.md
@RyanHuangNLP @zhazhuanling

Summary: adds finetuned robust w2v models and updates readme fixes #3721 Pull Request resolved: fairinternal/fairseq-py#2196 Reviewed By: wnhsu Differential Revision: D30367999 Pulled By: alexeib fbshipit-source-id: 616b373bf31265c89f694fba7dccce2961d394f3

) Summary: adds finetuned robust w2v models and updates readme fixes facebookresearch#3721 Pull Request resolved: fairinternal/fairseq-py#2196 Reviewed By: wnhsu Differential Revision: D30367999 Pulled By: alexeib fbshipit-source-id: 616b373bf31265c89f694fba7dccce2961d394f3

Once2gain · 2023-11-16T08:47:44Z

I implemented Chinese-BERT-wwm/Chinese-RoBERTa-wwm to Fariseq. I performed minimal changes on Fairseq to let it be compatible with (Chinese) BERT based tasks, and provided an option to Chinese Whole Word Masking strategy.
https://github.com/Once2gain/Chinese-RoBERTa-wwm-Fairseq
It could help you.

RyanHuangNLP added needs triage question labels May 29, 2020

RyanHuangNLP closed this as completed Jun 26, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to train a chinese roberta #2196

How to train a chinese roberta #2196

RyanHuangNLP commented May 29, 2020 •

edited

Loading

zhazhuanling commented Aug 4, 2020

CheungZeeCn commented Oct 19, 2020

CheungZeeCn commented Dec 13, 2020

Once2gain commented Nov 16, 2023 •

edited

Loading

How to train a chinese roberta #2196

How to train a chinese roberta #2196

Comments

RyanHuangNLP commented May 29, 2020 • edited Loading

❓ Questions and Help

Before asking:

Another question

Code

What have you tried?

What's your environment?

zhazhuanling commented Aug 4, 2020

CheungZeeCn commented Oct 19, 2020

CheungZeeCn commented Dec 13, 2020

Once2gain commented Nov 16, 2023 • edited Loading

RyanHuangNLP commented May 29, 2020 •

edited

Loading

Once2gain commented Nov 16, 2023 •

edited

Loading