Using LongformerTokenizer directly raises this error; should BertTokenizer be used instead? #2

Closed
PolarisRisingWar opened this issue Feb 28, 2022 · 2 comments

Comments

@PolarisRisingWar

I'm not sure whether it's okay to ask you questions directly in Chinese. If it causes any misunderstanding, I can switch to English.

Hello! I am currently using your pre-trained model, downloaded from https://huggingface.co/ValkyriaLenneth/longformer_zh. When I use AutoTokenizer directly, the code automatically calls LongformerTokenizer and then raises the following error:

Traceback (most recent call last):
  File "mypath/trylongformerzh1.py", line 3, in <module>
    tokenizer = LongformerTokenizer.from_pretrained("pretrain_path/longformer_zh")
  File "virtualenv_path/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1744, in from_pretrained
    return cls._from_pretrained(
  File "virtualenv_path/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1872, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "virtualenv_path/lib/python3.8/site-packages/transformers/models/roberta/tokenization_roberta.py", line 159, in __init__
    super().__init__(
  File "virtualenv_path/lib/python3.8/site-packages/transformers/models/gpt2/tokenization_gpt2.py", line 179, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType
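
If it helps with diagnosing this, my understanding is that LongformerTokenizer is a RobertaTokenizer subclass and therefore looks for vocab.json and merges.txt, whereas this repository only provides a BERT-style vocab.txt, so vocab_file resolves to None (a quick check, just for illustration):

from transformers import BertTokenizer, LongformerTokenizer

# Each (slow) tokenizer class declares the vocabulary files it expects.
print(LongformerTokenizer.vocab_files_names)  # {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}
print(BertTokenizer.vocab_files_names)        # {'vocab_file': 'vocab.txt'}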

I noticed that your code uses BertTokenizerFast, so should the tokenizer for longformer_zh also be loaded with BertTokenizer?
Loading it with BertTokenizer directly does work for me.

Also: I load your model with transformers.LongformerModel.from_pretrained. I have not tested other functionality yet, but loading the model this way seems to work.

My transformers version is 4.12.5, and the code that runs successfully for me is:

from transformers import BertTokenizer, LongformerModel

# Load the vocabulary with BertTokenizer and the weights with LongformerModel;
# pretrain_path/longformer_zh is the local directory holding the downloaded model.
tokenizer = BertTokenizer.from_pretrained("pretrain_path/longformer_zh")
model = LongformerModel.from_pretrained("pretrain_path/longformer_zh")
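
For reference, here is a minimal forward-pass sketch continuing the snippet above (the same placeholder path; the sample sentence is only illustrative):

import torch
from transformers import BertTokenizer, LongformerModel

tokenizer = BertTokenizer.from_pretrained("pretrain_path/longformer_zh")
model = LongformerModel.from_pretrained("pretrain_path/longformer_zh")

# Encode a short Chinese sentence and run one forward pass.
inputs = tokenizer("这是一个测试句子。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)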

If you have time to look at this issue, I would be very grateful!

@ValkyriaLenneth
Owner

Thanks for your question.

This model is built on top of Roberta_ZH. Although that project is called Roberta, its author actually used the BertModel architecture and only switched to the RoBERTa training procedure.

This model was then further modified and trained starting from Roberta_ZH.

So loading it with BertTokenizer is indeed correct.

The AutoTokenizer behavior is a bit strange. My guess is that Transformers assumes LongformerModel is based on the RobertaModel class, so LongformerTokenizer essentially calls RobertaTokenizer rather than the BertTokenizer this model is built on, which is why this bug appears.

I hope this solves your problem.
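
If you would rather keep using AutoTokenizer, one possible workaround (just a sketch, assuming a local copy of the model; pretrain_path/longformer_zh is a placeholder path) is to pin the tokenizer class in tokenizer_config.json, since AutoTokenizer reads the tokenizer_class field and will then instantiate BertTokenizer:

import json
import os

from transformers import AutoTokenizer

model_dir = "pretrain_path/longformer_zh"  # placeholder local path

# Write (or update) tokenizer_config.json with an explicit tokenizer class
# so AutoTokenizer no longer falls back to LongformerTokenizer.
config_path = os.path.join(model_dir, "tokenizer_config.json")
tok_config = {}
if os.path.exists(config_path):
    with open(config_path, encoding="utf-8") as f:
        tok_config = json.load(f)
tok_config["tokenizer_class"] = "BertTokenizer"
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(tok_config, f, ensure_ascii=False, indent=2)

# AutoTokenizer now resolves to BertTokenizer for this directory.
tokenizer = AutoTokenizer.from_pretrained(model_dir)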

@PolarisRisingWar
Author

Great, thank you very much! I will write my code with BertTokenizer then.
