[Error Message] Improve error message in SentencepieceTokenizer when arguments are not expected. #1449

preeyank5 · 2020-12-03T00:29:00Z

Description

When I call tokenizers.create with the model and vocab files for a custom corpus, the code throws an error and cannot generate the BERT vocab file.

Error Message

ValueError: Mismatch vocabulary! All special tokens specified must be control tokens in the sentencepiece vocabulary.

To Reproduce

from gluonnlp.data import tokenizers
tokenizers.create('spm', model_path='lsw1/spm.model', vocab_path='lsw1/spm.vocab')

spm.zip

sxjscience · 2020-12-03T02:02:28Z

Actually I can load the model:

import gluonnlp
from gluonnlp.data.tokenizers import SentencepieceTokenizer
tokenizer = SentencepieceTokenizer(model_path='spm.model', vocab='spm.vocab')
print(tokenizer)

Output:

SentencepieceTokenizer(
   model_path = /home/ubuntu/spm.model
   lowercase = False, nbest = 0, alpha = 0.0
   vocab = Vocab(size=3500, unk_token="<unk>", bos_token="<s>", eos_token="</s>", pad_token="<pad>")
)

@preeyank5 Would you try again?

sxjscience · 2020-12-03T02:05:20Z

The root cause is that we need better error handling of **kwargs here. The argument should be vocab, not vocab_path. Because vocab_path is not a named parameter, it was collected into **kwargs instead of being rejected.
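To illustrate the mechanism, here is a minimal sketch of how a misspelled keyword is silently absorbed by **kwargs. The function `load_tokenizer` is a hypothetical stand-in, not the gluonnlp constructor; only the parameter names `model_path` and `vocab` are taken from the example above.

```python
def load_tokenizer(model_path, vocab=None, **kwargs):
    """Hypothetical stand-in for a constructor that accepts **kwargs."""
    # Any keyword that is not a named parameter lands in kwargs,
    # so Python itself raises no TypeError for a misspelled name.
    return {"model_path": model_path, "vocab": vocab, "extra": kwargs}


result = load_tokenizer("spm.model", vocab_path="spm.vocab")
print(result["vocab"])   # the intended vocab was never set
print(result["extra"])   # the misspelled argument ended up here
```

In this sketch the call succeeds, `vocab` stays at its default, and `vocab_path` sits in the extra keywords. Judging by the error text reported above, the real tokenizer then appears to treat such a leftover keyword as a special token, which would explain the misleading message.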

The way to fix the issue is to revise

gluon-nlp/src/gluonnlp/data/tokenizers/sentencepiece.py

Lines 99 to 101 in 08dc6ed

    
    for k, v in kwargs.items():
        if k in special_tokens_kv:
            if v != special_tokens_kv[k]:

sxjscience · 2020-12-03T02:06:29Z

Marked it as a "good first issue" because it's a good issue for early contributors. We can just ensure that the correct error is raised when kwargs contains unexpected values.
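One possible shape for such a check is sketched below. It is not the actual gluonnlp patch: the helper name `validate_special_token_kwargs` is hypothetical, and the rule that extra keywords must end in `_token` is an assumption about the naming convention for special tokens.

```python
def validate_special_token_kwargs(kwargs):
    """Raise a descriptive error for keyword arguments that are not expected.

    Assumption: every extra keyword is meant to name a special token
    and therefore ends with '_token' (for example 'cls_token').
    """
    unexpected = sorted(k for k in kwargs if not k.endswith("_token"))
    if unexpected:
        raise ValueError(
            "Unexpected keyword argument(s): {}. Extra keyword arguments are "
            "interpreted as special tokens and must be named like 'cls_token'. "
            "To pass a vocabulary, use the 'vocab' argument.".format(
                ", ".join(unexpected)
            )
        )


# A well-formed special token passes silently.
validate_special_token_kwargs({"cls_token": "[CLS]"})

# A misspelled argument now produces a message that names the culprit.
try:
    validate_special_token_kwargs({"vocab_path": "spm.vocab"})
except ValueError as err:
    message = str(err)
    print(message)
```

Calling a helper like this before the loop over kwargs would make the failure point at the offending argument name rather than at the vocabulary contents.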

preeyank5 · 2020-12-03T18:18:15Z

Thanks Xingjian, I am now able to load the model.

sxjscience · 2020-12-03T18:19:37Z

Let's keep this issue to track the error message. We should raise the error if the user has specified some unexpected kwargs.

ConaGo · 2021-07-02T01:22:40Z

Hi, I am new to this project and would like to tackle this issue.

Abdullium · 2022-08-07T09:14:00Z

Hi, I am new to this project and would like to tackle this issue.

Have you solved it yet?

preeyank5 added the bug Something isn't working label Dec 3, 2020

sxjscience added the good first issue Good for newcomers label Dec 3, 2020

preeyank5 closed this as completed Dec 3, 2020

sxjscience changed the title ~~tokenizers.create throwing an error~~ [Error Message] Improve error message in SentencepieceTokenizer when arguments are not expected. Dec 3, 2020

sxjscience added enhancement New feature or request and removed bug Something isn't working labels Dec 3, 2020

sxjscience reopened this Dec 3, 2020

AdarshAcharya5 mentioned this issue Sep 3, 2023

[DOC] : Improved error message in SentencePiece tokenizer #1600

Closed
