
Load BioBERT pre-trained weights into a BERT model with the Hugging Face PyTorch run_classifier.py code #26

Closed
sheetalsh456 opened this issue Apr 8, 2019 · 6 comments


@sheetalsh456

These are the steps I followed to get BioBERT working with the existing Hugging Face PyTorch BERT code.

  1. I downloaded the pre-trained weights 'biobert_pubmed_pmc.tar.gz' from the Releases page.

  2. I ran this command to convert the TF checkpoint to a PyTorch model:

python pytorch-pretrained-BERT/pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py --tf_checkpoint_path="biobert/pubmed_pmc_470k/biobert_model.ckpt.index" --bert_config_file="biobert/pubmed_pmc_470k/bert_config.json" --pytorch_dump_path="biobert/pubmed_pmc_470k/Pytorch/biobert.model"

This created a file 'biobert.model' in the specified path.
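
The same conversion can also be driven from Python rather than the CLI. A minimal sketch, assuming the convert_tf_checkpoint_to_pytorch function in pytorch-pretrained-BERT takes the same three arguments as the script (the TF checkpoint is typically passed as the prefix, without the .index suffix):

# Sketch only: programmatic version of the CLI conversion above.
# Assumes the function signature matches the script's arguments.
from pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch import (
    convert_tf_checkpoint_to_pytorch,
)

convert_tf_checkpoint_to_pytorch(
    tf_checkpoint_path="biobert/pubmed_pmc_470k/biobert_model.ckpt",
    bert_config_file="biobert/pubmed_pmc_470k/bert_config.json",
    pytorch_dump_path="biobert/pubmed_pmc_470k/Pytorch/biobert.model",
)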

  3. As mentioned in this link, I compressed the 'biobert.model' created above and 'biobert/pubmed_pmc_470k/bert_config.json' together into a biobert_model.tar.gz (see the tarfile sketch below).

  4. I then ran run_classifier.py from Hugging Face BERT with the following command, using the tar.gz created above.

python pytorch-pretrained-BERT/examples/run_classifier.py --data_dir="Data/" --bert_model="biobert_model.tar.gz" --task_name="qqp" --output_dir="OutputModels/Pretrained/" --do_train --do_eval --do_lower_case
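
For step 3, the archive can be built with Python's tarfile module. A sketch, assuming from_pretrained expects the weights under the member name pytorch_model.bin next to bert_config.json (worth checking against the library version you use):

import tarfile

# Sketch: bundle the converted weights and the config into one .tar.gz.
# The member names are an assumption about what from_pretrained looks for.
with tarfile.open("biobert_model.tar.gz", "w:gz") as tar:
    tar.add("biobert/pubmed_pmc_470k/Pytorch/biobert.model",
            arcname="pytorch_model.bin")
    tar.add("biobert/pubmed_pmc_470k/bert_config.json",
            arcname="bert_config.json")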

I get the error

'UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte' 

in the line

tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
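
(Aside, for anyone debugging the same error: byte 0x8b at position 1 is the second byte of the gzip magic number 0x1f 0x8b, which means the tokenizer is reading the compressed archive itself as if it were a plain-text vocab file. A quick check:)

# If this prints True, the path points at gzip data, not a text file.
with open("biobert_model.tar.gz", "rb") as f:
    print(f.read(2) == b"\x1f\x8b")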

Am I doing something wrong?

I just want to run the run_classifier.py code provided by Hugging Face with BioBERT pre-trained weights, the same way we run it with BERT. Is there a way to do this?

@JohnGiorgi

JohnGiorgi commented May 21, 2019

Hi @sheetalsh456,

I had the same issue until I placed the correct files under one folder. I wrote instructions for myself here, if you want to take a look and see if they solve your issue.

@jhyuklee
Member

Hi, we've updated our weights so that they no longer contain the optimizer's parameters. You can download them again, and please tell us what you get.
Thanks.

@nikhilsid


Hi @JohnGiorgi,

I followed your instructions here to convert the checkpoint, then placed the files (pytorch_model.bin, bert_config.json, and vocab.txt) in one folder and compressed it.

I copied the compressed folder to the home folder of 'pytorch-transformers', and ran the given example code ('run_glue.py') from there with the following command:

python ./examples/run_glue.py \
  --model_type bert \
  --model_name_or_path biobert.gz \
  --task_name=sts-b \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir=$DIR \
  --max_seq_length 128 \
  --per_gpu_eval_batch_size=8 \
  --per_gpu_train_batch_size=8

But I get the same error as mentioned in the main discussion:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

at this location:

File "./examples/run_glue.py", line 424, in main config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels, finetuning_task=args.task_name)

Can you please tell me what to change?

@JohnGiorgi

@nikhilsid Can you try unzipping it and pointing --model_name_or_path to the unzipped folder? My guess is that run_glue.py isn't unzipping the archive before trying to read it.

Also, I believe that in pytorch_pretrained_bert the config file was named bert_config.json, but in pytorch_transformers it is just config.json, so watch out for that!
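
A small sketch of that rename, with the folder path as a placeholder:

import os

folder = "path/to/biobert/unzipped"  # hypothetical path
# pytorch_transformers looks for config.json; the older
# pytorch_pretrained_bert shipped the same file as bert_config.json.
os.rename(os.path.join(folder, "bert_config.json"),
          os.path.join(folder, "config.json"))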

If that fails, I have converted BioBERT V1.1 PubMed and placed the weights here. I was able to load this model with:

from pytorch_transformers import BertForTokenClassification

self.model = BertForTokenClassification.from_pretrained('path/to/biobert_v1.1/unzipped')
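
The tokenizer can be loaded from the same unzipped directory, provided vocab.txt sits alongside the weights and config:

from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('path/to/biobert_v1.1/unzipped')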

@nikhilsid

nikhilsid commented Aug 5, 2019


@JohnGiorgi Thanks a lot for replying.

I tried unzipping it and pointing everything to that folder, but now I get the following error:

Traceback (most recent call last):
  File "./examples/run_glue.py", line 485, in <module>
    main()
  File "./examples/run_glue.py", line 439, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "./examples/run_glue.py", line 129, in train
    outputs = model(**inputs)
  File "/home/siddhartha_nikhil/myenv/env1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/siddhartha_nikhil/myenv/env1/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/siddhartha_nikhil/myenv/env1/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/siddhartha_nikhil/myenv/env1/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/siddhartha_nikhil/myenv/env1/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/siddhartha_nikhil/myenv/env1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/siddhartha_nikhil/n2c2/pytorch-transformers/pytorch_transformers/modeling_bert.py", line 977, in forward
    attention_mask=attention_mask, head_mask=head_mask)
  File "/home/siddhartha_nikhil/myenv/env1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/siddhartha_nikhil/n2c2/pytorch-transformers/pytorch_transformers/modeling_bert.py", line 713, in forward
    embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
  File "/home/siddhartha_nikhil/myenv/env1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/siddhartha_nikhil/n2c2/pytorch-transformers/pytorch_transformers/modeling_bert.py", line 269, in forward
    embeddings = self.LayerNorm(embeddings)
  File "/home/siddhartha_nikhil/myenv/env1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/siddhartha_nikhil/n2c2/pytorch-transformers/pytorch_transformers/modeling_bert.py", line 237, in forward
    u = x.mean(-1, keepdim=True)
RuntimeError: CUDA error: device-side assert triggered

where I have kept the generated PyTorch model (pytorch_model.bin), 'bert_config.json', and 'vocab.txt' files in one location.

The other two files (bert_config.json, vocab.txt) are the ones that BioBERT provides.
Does this make any sense to you?
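
(A device-side assert raised during the embedding/LayerNorm forward pass is often an out-of-range token id, i.e. the vocab the tokenizer uses is larger than the embedding table in the checkpoint. A quick consistency check, assuming the file layout above; rerunning on CPU also surfaces the actual indexing error:)

import json

folder = "path/to/biobert/unzipped"  # hypothetical path
with open(folder + "/config.json") as f:
    config_vocab = json.load(f)["vocab_size"]
with open(folder + "/vocab.txt", encoding="utf-8") as f:
    vocab_lines = sum(1 for _ in f)
# A mismatch here can produce token ids past the embedding table.
print(config_vocab, vocab_lines)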

@JohnGiorgi

@nikhilsid Hmm, I am not sure. Did you try using the weights I referenced in my last comment (here)? I am able to use those across multiple machines without issue.

Otherwise, have you tried browsing the pytorch-transformers issue tracker? The problem could be on their end.
