
DocBERT #59

Closed
Zhou3983 opened this issue May 18, 2020 · 3 comments

Comments

@Zhou3983

I tried to run DocBERT with

python -m models.bert --dataset Reuters --model bert-base-uncased --max-seq-length 256 --batch-size 16 --lr 2e-5 --epochs 30

but got the following error:

r66xu@hedwig:~/RX/hedwig$ python -m models.bert --dataset Reuters --model bert-base-uncased --max-seq-length 256 --batch-size 16 --lr 2e-5 --epochs 30
Device: CUDA
Number of GPUs: 1
FP16: False
Traceback (most recent call last):
  File "/jet/var/python/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/jet/var/python/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/r66xu/RX/hedwig/models/bert/__main__.py", line 75, in <module>
    tokenizer = BertTokenizer.from_pretrained(pretrained_vocab_path)
  File "/jet/var/python/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 282, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/jet/var/python/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 346, in _from_pretrained
    list(cls.vocab_files_names.values())))
OSError: Model name '../hedwig-data/models/bert_pretrained/bert-base-uncased-vocab.txt' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased). We assumed '../hedwig-data/models/bert_pretrained/bert-base-uncased-vocab.txt' was a path or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

To fix it, I downloaded the pre-trained model for BERT and moved it into hedwig-data, and then it was able to run. Is this the correct way to fix it?
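For anyone else hitting this: models/bert/__main__.py passes a local vocab file path to BertTokenizer.from_pretrained, so nothing is downloaded automatically. A minimal one-off sketch of the workaround, assuming transformers 2.x and the directory layout shown in the traceback above (the exact file names hedwig expects for the model weights may differ):

# One-off script: download the BERT files and place them where hedwig looks for them.
import os
from transformers import BertTokenizer, BertModel

target_dir = "hedwig-data/models/bert_pretrained"  # adjust to your checkout
os.makedirs(target_dir, exist_ok=True)

# Resolve by model name so transformers downloads from the Hugging Face hub.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# save_vocabulary writes vocab.txt; rename it to the file name in the traceback.
vocab_file = tokenizer.save_vocabulary(target_dir)[0]
os.rename(vocab_file, os.path.join(target_dir, "bert-base-uncased-vocab.txt"))

# Save the weights alongside it (check __main__.py for the exact names it loads).
model.save_pretrained(target_dir)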

@mrkarezina

I encountered this as well. Would it be possible to add a command-line argument that downloads the Hugging Face model weights for the specified BERT model instead of looking for them locally?

I tried implementing this in #58 to make it easier to work in a Colab notebook.
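Roughly the behaviour I have in mind (the --pretrained-from-hub flag is hypothetical, and hedwig's real argument handling lives elsewhere), as a sketch:

import argparse
from transformers import BertTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="bert-base-uncased")
parser.add_argument("--pretrained-from-hub", action="store_true",
                    help="download weights from the Hugging Face hub instead of hedwig-data")
args, _ = parser.parse_known_args()

if args.pretrained_from_hub:
    # Let transformers resolve the model name, download, and cache the files.
    tokenizer = BertTokenizer.from_pretrained(args.model)
else:
    # Current behaviour: look for the vocab file shipped in hedwig-data.
    vocab_path = f"../hedwig-data/models/bert_pretrained/{args.model}-vocab.txt"
    tokenizer = BertTokenizer.from_pretrained(vocab_path)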

@achyudh
Member

achyudh commented May 21, 2020

@richard3983 yup, that's the right way to fix it. The simplest way to solve this issue would be to add the pre-trained model weights to https://git.uwaterloo.ca/jimmylin/hedwig-data

@achyudh
Member

achyudh commented May 29, 2020

Just an update: I've added the weights to the data repo (https://git.uwaterloo.ca/jimmylin/hedwig-data), so it should work out of the box now without needing the fix suggested by @mrkarezina.
