
Use Roberta in pytorch transformers #39

Open
Vimos opened this issue Sep 28, 2019 · 5 comments

Comments

Vimos commented Sep 28, 2019

I have been using BertPreTrainedModel to load this RoBERTa model, which works well.

I noticed that pytorch_transformers also supports RoBERTa:

from pytorch_transformers import (BertConfig, BertTokenizer,
                                  RobertaConfig, RobertaTokenizer)

Should I switch to the Roberta classes? If so, what should I pass for the merges_file parameter of RobertaTokenizer?

brightmart (Owner) commented

No. This roberta_zh is modified from BERT using the main ideas of the RoBERTa paper, but it may not be compatible with the RoBERTa PyTorch implementation from the Facebook AI project. So loading it through BertPreTrainedModel is the right way.

Can you paste code here showing how you load roberta_zh using BertPreTrainedModel, and how you use it?

(You can try the Roberta classes anyway, but they should fail.)


Vimos commented Sep 28, 2019

It's the same as in the original pytorch-transformers:

bert_config = MODEL_CLASSES[args.model_type][0].from_json_file(args.bert_config_file)
tokenizer = MODEL_CLASSES[args.model_type][2](vocab_file=args.vocab_file,
                                              do_lower_case=args.do_lower_case)
model = MODEL_CLASSES[args.model_type][1].from_pretrained(
    args.model_name_or_path,
    from_tf=bool('.ckpt' in args.model_name_or_path),
    config=bert_config)

However, you cannot use RobertaTokenizer to load roberta_zh:

  • It requires a merges_file parameter, which roberta_zh does not provide:
Traceback (most recent call last):
  File "run_transformers.py", line 447, in <module>
    main()
  File "run_transformers.py", line 389, in main
    tokenizer = MODEL_CLASSES[args.model_type][2](vocab_file=args.vocab_file, do_lower_case=args.do_lower_case)
TypeError: __init__() missing 1 required positional argument: 'merges_file'
  • If you use from_pretrained, it complains about an encoding problem in roberta_zh:
09/28/2019 14:23:53 - INFO - pytorch_transformers.tokenization_utils -   Model name '/home/vimos/.pytorch_pretrained_bert/roberta_zh/pytorch_model.bin' not found in model shortcut name list (roberta-base, roberta-large, roberta-large-mnli). Assuming '/home/vimos/.pytorch_pretrained_bert/roberta_zh/pytorch_model.bin' is a path or url to a directory containing tokenizer files.
09/28/2019 14:23:53 - INFO - pytorch_transformers.tokenization_utils -   Didn't find file /home/vimos/.pytorch_pretrained_bert/roberta_zh/added_tokens.json. We won't load it.
09/28/2019 14:23:53 - INFO - pytorch_transformers.tokenization_utils -   Didn't find file /home/vimos/.pytorch_pretrained_bert/roberta_zh/special_tokens_map.json. We won't load it.
09/28/2019 14:23:53 - INFO - pytorch_transformers.tokenization_utils -   Didn't find file /home/vimos/.pytorch_pretrained_bert/roberta_zh/tokenizer_config.json. We won't load it.
09/28/2019 14:23:53 - INFO - pytorch_transformers.tokenization_utils -   loading file /home/vimos/.pytorch_pretrained_bert/roberta_zh/pytorch_model.bin
09/28/2019 14:23:53 - INFO - pytorch_transformers.tokenization_utils -   loading file /home/vimos/.pytorch_pretrained_bert/roberta_zh/pytorch_model.bin
09/28/2019 14:23:53 - INFO - pytorch_transformers.tokenization_utils -   loading file None
09/28/2019 14:23:53 - INFO - pytorch_transformers.tokenization_utils -   loading file None
09/28/2019 14:23:53 - INFO - pytorch_transformers.tokenization_utils -   loading file None
Traceback (most recent call last):
  File "run_transformers.py", line 448, in <module>
    main()
  File "run_transformers.py", line 389, in main
    tokenizer = MODEL_CLASSES['roberta'][2].from_pretrained(args.model_name_or_path)
  File "/home/vimos/anaconda3/envs/rcqa/lib/python3.7/site-packages/pytorch_transformers/tokenization_utils.py", line 293, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/home/vimos/anaconda3/envs/rcqa/lib/python3.7/site-packages/pytorch_transformers/tokenization_utils.py", line 421, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/vimos/anaconda3/envs/rcqa/lib/python3.7/site-packages/pytorch_transformers/tokenization_roberta.py", line 82, in __init__
    mask_token=mask_token, **kwargs)
  File "/home/vimos/anaconda3/envs/rcqa/lib/python3.7/site-packages/pytorch_transformers/tokenization_gpt2.py", line 118, in __init__
    self.encoder = json.load(open(vocab_file, encoding="utf-8"))
  File "/home/vimos/anaconda3/envs/rcqa/lib/python3.7/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/home/vimos/anaconda3/envs/rcqa/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
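The UnicodeDecodeError is consistent with what the traceback shows: RobertaTokenizer inherits GPT-2's byte-level BPE and tries to json.load its vocab_file as UTF-8 text, while the path here resolves to the binary pytorch_model.bin. A minimal, self-contained reproduction of that failure mode (the file contents below are made up):

```python
import json
import os
import tempfile

# RobertaTokenizer (via the GPT-2 tokenizer) expects vocab_file to be a UTF-8
# JSON mapping. Feeding it binary checkpoint bytes fails exactly as above:
# 0x80 is not a valid UTF-8 start byte.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(b"\x80\x02binary checkpoint bytes")
    path = f.name

try:
    with open(path, encoding="utf-8") as fp:
        json.load(fp)
    error = None
except UnicodeDecodeError as exc:
    error = exc
finally:
    os.remove(path)

print(type(error).__name__)  # UnicodeDecodeError
```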

brightmart (Owner) commented

yeah, it is.

mayktian commented

Can you please suggest which PyTorch BERT package is compatible with this pre-trained model?
I had the same failure as Vimos when loading it with Facebook's transformers.

brightmart (Owner) commented

This RoBERTa is BERT: you can load it with pytorch-transformers by selecting bert as the model type.
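A sketch of that advice, assuming the checkpoint directory holds the usual BERT-style files (pytorch_model.bin, bert_config.json, vocab.txt); the directory path and task-agnostic BertModel are placeholders, not confirmed by this thread:

```python
def load_roberta_zh(model_dir, do_lower_case=False):
    """Load roberta_zh through the BERT classes of pytorch-transformers.

    model_dir is a hypothetical directory containing pytorch_model.bin,
    bert_config.json and vocab.txt. The import is deferred so the sketch
    can be read without the package installed.
    """
    from pytorch_transformers import BertConfig, BertModel, BertTokenizer

    tokenizer = BertTokenizer(vocab_file=f"{model_dir}/vocab.txt",
                              do_lower_case=do_lower_case)
    config = BertConfig.from_json_file(f"{model_dir}/bert_config.json")
    model = BertModel.from_pretrained(model_dir, config=config)
    return tokenizer, model
```

Note the BERT tokenizer only needs the single vocab.txt, so the merges_file problem from earlier in the thread never arises.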
