
How to train a chinese roberta #2196

Closed

RyanHuangNLP opened this issue May 29, 2020 · 4 comments

Comments

RyanHuangNLP commented May 29, 2020

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

I want to train a Chinese RoBERTa. Following the pretraining tutorial, I was wondering how to generate the Chinese equivalents of these three files:

  • https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
  • https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
  • https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt

I also noticed that this vocab file is different from the vocab.txt that ships with Google's BERT.
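My current guess for dict.txt is something like the script below (just a sketch on my side: the dummy frequency of 1 and keeping BERT's own special tokens are assumptions, since the counts seem to matter only for optional pruning):

# Sketch: turn Google's Chinese vocab.txt into a fairseq-style dict.txt,
# i.e. one "token count" pair per line. The count of 1 is a dummy value.
# BERT's control symbols ([PAD], [CLS], [SEP], [MASK], [UNK]) could arguably
# be dropped, because fairseq's Dictionary adds its own specials.
with open("vocab.txt", encoding="utf-8") as fin, \
        open("dict.txt", "w", encoding="utf-8") as fout:
    for token in (line.strip() for line in fin):
        if token:
            fout.write(f"{token} 1\n")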

I have tried changing the MultiprocessingEncoder class in the multiprocessing_bpe_encoder.py script to use BertTokenizer (the change is shown in the code below), and preprocessed the corpus with Google's vocab.txt.

But when I start training the RoBERTa model, the vocab embedding becomes [21133, 768], while Google's vocab.txt only has 21128 tokens. Does fairseq add 5 special tokens? When I convert the weights to the transformers format, I find that the token counts do not match.
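My guess is that the extra 5 symbols come from fairseq's Dictionary reserving <s>, <pad>, </s> and <unk>, plus the masked-LM task appending <mask>. A quick check I plan to run (the data-bin path below is just a placeholder for my preprocessed data):

from fairseq.data import Dictionary

d = Dictionary.load("data-bin/dict.txt")  # Dictionary itself prepends <s>, <pad>, </s>, <unk>
d.add_symbol("<mask>")                    # the masked LM task adds <mask> on top
print(len(d))                             # 21128 vocab entries -> 21133 symbols here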

Another question

I also wonder how to initialize the model with the weights of another pretrained Chinese BERT, since the token embedding sizes do not match (21133 != 21128) and neither do the max position embeddings (514 != 512). The kind of checkpoint surgery I have in mind is sketched after the code below.

Code

from transformers import BertTokenizer


class MultiprocessingEncoder(object):

    def __init__(self, args):
        self.args = args

    def initializer(self):
        global bpe
        # use the Chinese vocab (21128 tokens) rather than bert-base-uncased
        bpe = BertTokenizer.from_pretrained('bert-base-chinese')

    def encode(self, line):
        global bpe
        # tokenize into subword strings (public API instead of _tokenize)
        return bpe.tokenize(line)

    def decode(self, tokens):
        global bpe
        # encode() produces subword strings, not ids, so join them back to text
        return bpe.convert_tokens_to_string(tokens)

    def encode_lines(self, lines):
        """
        Encode a set of lines. All lines will be encoded together.
        """
        enc_lines = []
        for line in lines:
            line = line.strip()
            if len(line) == 0 and not self.args.keep_empty:
                return ["EMPTY", None]
            tokens = self.encode(line)
            enc_lines.append(" ".join(tokens))
        return ["PASS", enc_lines]

    def decode_lines(self, lines):
        dec_lines = []
        for line in lines:
            # the encoded lines contain subword strings, not integer ids
            tokens = line.strip().split()
            dec_lines.append(self.decode(tokens))
        return ["PASS", dec_lines]

What have you tried?

What's your environment?

  • fairseq Version (e.g., 1.0 or master):
  • PyTorch Version (e.g., 1.0)
  • OS (e.g., Linux):
  • How you installed fairseq (pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:
@zhazhuanling

Hi, I have run into this problem as well. Have you solved it? If so, please help me! @RyanHuangNLP

@CheungZeeCn

fairseq is not so friendly for this kind of everyday use... I am also wondering how to integrate other Chinese pretrained language models into the fairseq framework...


Once2gain commented Nov 16, 2023

I have implemented Chinese-BERT-wwm/Chinese-RoBERTa-wwm in Fairseq. I made minimal changes to Fairseq so that it is compatible with (Chinese) BERT-based tasks, and added an option for the Chinese Whole Word Masking strategy.
https://github.com/Once2gain/Chinese-RoBERTa-wwm-Fairseq
It could help you.
