[Error Message] Improve error message in SentencepieceTokenizer when arguments are not expected. #1449
Comments
Actually, I can load the model:
import gluonnlp
from gluonnlp.data.tokenizers import SentencepieceTokenizer
tokenizer = SentencepieceTokenizer(model_path='spm.model', vocab='spm.vocab')
print(tokenizer)
Output:
@preeyank5 Would you try again?
I find that the root cause is that we need better error handling. The way to fix the issue is to revise gluon-nlp/src/gluonnlp/data/tokenizers/sentencepiece.py, lines 99 to 101 in 08dc6ed.
Marked it as a "good first issue" because it's a good issue for early contributors. We can just ensure that the correct error is raised when …
Thanks Xingjian, I am now able to load the model.
Let's keep this issue to track the error message. We should raise the error if the user has specified some unexpected kwargs.
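To illustrate the kind of check being proposed, here is a minimal sketch. It assumes a constructor that collects extra keyword arguments via `**kwargs`. `StrictTokenizerSketch` and `try_create` are hypothetical names used only for illustration, not the gluonnlp implementation, and the real constructor may legitimately accept some extra keywords, in which case an allow-list would be needed instead of rejecting everything.

```python
class StrictTokenizerSketch:
    """Hypothetical stand-in for a tokenizer class; NOT the gluonnlp API."""

    def __init__(self, model_path=None, vocab=None, **kwargs):
        # Fail early and name the offending arguments, rather than letting
        # them flow onward and surface later as an unrelated error.
        if kwargs:
            unexpected = ', '.join(sorted(kwargs))
            raise ValueError(
                f'Unexpected keyword argument(s): {unexpected}. '
                'Supported arguments in this sketch: model_path, vocab.'
            )
        self.model_path = model_path
        self.vocab = vocab


def try_create(**kwargs):
    """Return 'ok' on success, otherwise the error message (illustration only)."""
    try:
        StrictTokenizerSketch(**kwargs)
        return 'ok'
    except ValueError as err:
        return str(err)


# The misspelled argument from the report (vocab_path instead of vocab)
# is now reported by name.
print(try_create(model_path='spm.model', vocab='spm.vocab'))
print(try_create(model_path='spm.model', vocab_path='spm.vocab'))
```

With a check like this, passing `vocab_path` would produce a message that names the unexpected argument instead of the "Mismatch vocabulary!" error.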
Hi, I am new to this project and would like to tackle this issue.
Have you solved it yet?
Description
While using tokenizers.create with the model and vocab files for a custom corpus, the code throws an error and is not able to generate the BERT vocab file.
Error Message
ValueError: Mismatch vocabulary! All special tokens specified must be control tokens in the sentencepiece vocabulary.
To Reproduce
from gluonnlp.data import tokenizers
tokenizers.create('spm', model_path='lsw1/spm.model', vocab_path='lsw1/spm.vocab')
spm.zip