Using LongformerTokenizer directly raises this error; should BertTokenizer be used instead? #2

Closed
PolarisRisingWar opened this issue Feb 28, 2022 · 2 comments

Comments

@PolarisRisingWar

I'm not sure whether it's okay to ask you questions directly in Chinese. If it causes any misunderstanding, I can switch to English.

Hello! I am currently using your pre-trained model, downloaded from https://huggingface.co/ValkyriaLenneth/longformer_zh. When I use AutoTokenizer directly, the code automatically calls LongformerTokenizer and then raises the following error:

Traceback (most recent call last):
  File "mypath/trylongformerzh1.py", line 3, in <module>
    tokenizer = LongformerTokenizer.from_pretrained("pretrain_path/longformer_zh")
  File "virtualenv_path/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1744, in from_pretrained
    return cls._from_pretrained(
  File "virtualenv_path/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1872, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "virtualenv_path/lib/python3.8/site-packages/transformers/models/roberta/tokenization_roberta.py", line 159, in __init__
    super().__init__(
  File "virtualenv_path/lib/python3.8/site-packages/transformers/models/gpt2/tokenization_gpt2.py", line 179, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType
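
If it helps with diagnosing this, my understanding is that LongformerTokenizer is a RobertaTokenizer subclass and therefore looks for vocab.json and merges.txt, whereas this repository only provides a BERT-style vocab.txt, so vocab_file resolves to None (a quick check, just for illustration):

from transformers import BertTokenizer, LongformerTokenizer

# Each (slow) tokenizer class declares the vocabulary files it expects.
print(LongformerTokenizer.vocab_files_names)  # {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt'}
print(BertTokenizer.vocab_files_names)        # {'vocab_file': 'vocab.txt'}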

I noticed that your code uses BertTokenizerFast, so should the tokenizer for longformer_zh also be loaded with BertTokenizer?
Loading it with BertTokenizer directly does work for me.

Also: I load your model with transformers.LongformerModel.from_pretrained. I have not tested other functionality yet, but loading the model this way seems to work.

My transformers version is 4.12.5, and the code that runs successfully for me is:

from transformers import BertTokenizer, LongformerModel

# Load the vocabulary with BertTokenizer and the weights with LongformerModel;
# pretrain_path/longformer_zh is the local directory holding the downloaded model.
tokenizer = BertTokenizer.from_pretrained("pretrain_path/longformer_zh")
model = LongformerModel.from_pretrained("pretrain_path/longformer_zh")
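
For reference, here is a minimal forward-pass sketch continuing the snippet above (the same placeholder path; the sample sentence is only illustrative):

import torch
from transformers import BertTokenizer, LongformerModel

tokenizer = BertTokenizer.from_pretrained("pretrain_path/longformer_zh")
model = LongformerModel.from_pretrained("pretrain_path/longformer_zh")

# Encode a short Chinese sentence and run one forward pass.
inputs = tokenizer("这是一个测试句子。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)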

If you have time to look at this issue, I would be very grateful!

@ValkyriaLenneth
Owner

Thanks for your question.

This model is built on top of Roberta_ZH. Although that project is called Roberta, its author actually used the BertModel architecture and only switched to the RoBERTa training procedure.

This model was then further modified and trained starting from Roberta_ZH.

So loading it with BertTokenizer is indeed correct.

The AutoTokenizer behavior is a bit strange. My guess is that Transformers assumes LongformerModel is based on the RobertaModel class, so LongformerTokenizer essentially calls RobertaTokenizer rather than the BertTokenizer this model is built on, which is why this bug appears.

I hope this solves your problem.
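
If you would rather keep using AutoTokenizer, one possible workaround (just a sketch, assuming a local copy of the model; pretrain_path/longformer_zh is a placeholder path) is to pin the tokenizer class in tokenizer_config.json, since AutoTokenizer reads the tokenizer_class field and will then instantiate BertTokenizer:

import json
import os

from transformers import AutoTokenizer

model_dir = "pretrain_path/longformer_zh"  # placeholder local path

# Write (or update) tokenizer_config.json with an explicit tokenizer class
# so AutoTokenizer no longer falls back to LongformerTokenizer.
config_path = os.path.join(model_dir, "tokenizer_config.json")
tok_config = {}
if os.path.exists(config_path):
    with open(config_path, encoding="utf-8") as f:
        tok_config = json.load(f)
tok_config["tokenizer_class"] = "BertTokenizer"
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(tok_config, f, ensure_ascii=False, indent=2)

# AutoTokenizer now resolves to BertTokenizer for this directory.
tokenizer = AutoTokenizer.from_pretrained(model_dir)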

@PolarisRisingWar
Author

Great, thank you very much! I will write my code with BertTokenizer then.
