
Why is BertTokenizer used instead of BartTokenizer? #16

Closed
Chen-Wang-CUHK opened this issue Nov 19, 2021 · 2 comments

Comments

@Chen-Wang-CUHK

Thank you for your nice work!

When preprocessing the data, I followed your code and used BertTokenizer to load the cpt-base tokenizer. The tokenizer loads successfully, but I get the following warning message:

"""
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BartTokenizer'.
The class this function is called from is 'BertTokenizer'.
"""

Then I tried to load it with BartTokenizer, but that failed.

My question is: should I ignore the warning and keep using BertTokenizer? Thank you.

@choosewhatulike
Member

Thank you for your attention.

BertTokenizer should be used instead of BartTokenizer, because our tokenizer follows the tokenizer of BERT-base-chinese.
You can ignore the warning message from Hugging Face; don't worry about it.
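For reference, a minimal sketch of what the maintainer describes, assuming the checkpoint is available on the Hugging Face Hub under the id "fnlp/cpt-base" (substitute your own model id or local path):

```python
# Minimal sketch: load the CPT tokenizer with BertTokenizer, as recommended above.
# "fnlp/cpt-base" is an assumed checkpoint id; replace it with your actual path/id.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("fnlp/cpt-base")

# The warning that the checkpoint declares 'BartTokenizer' while this class is
# 'BertTokenizer' is expected here and can be safely ignored.
print(tokenizer.tokenize("北京是中国的首都"))
```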

@Chen-Wang-CUHK
Author


Got it. Thank you.
