This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[Error Message] Improve error message in SentencepieceTokenizer when arguments are not expected. #1449

Open
preeyank5 opened this issue Dec 3, 2020 · 7 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@preeyank5

Description

When I call tokenizers.create with the model and vocab files for a custom corpus, the code throws an error and cannot generate the BERT vocab file.

Error Message

ValueError: Mismatch vocabulary! All special tokens specified must be control tokens in the sentencepiece vocabulary.

To Reproduce

from gluonnlp.data import tokenizers
tokenizers.create('spm', model_path='lsw1/spm.model', vocab_path='lsw1/spm.vocab')

spm.zip

@preeyank5 added the bug (Something isn't working) label on Dec 3, 2020
@sxjscience
Member

Actually I can load the model:

import gluonnlp
from gluonnlp.data.tokenizers import SentencepieceTokenizer
tokenizer = SentencepieceTokenizer(model_path='spm.model', vocab='spm.vocab')
print(tokenizer)

Output:

SentencepieceTokenizer(
   model_path = /home/ubuntu/spm.model
   lowercase = False, nbest = 0, alpha = 0.0
   vocab = Vocab(size=3500, unk_token="<unk>", bos_token="<s>", eos_token="</s>", pad_token="<pad>")
)

@preeyank5 Would you try again?

@sxjscience
Member

The root cause is that we need better error handling of **kwargs. The argument should be vocab instead of vocab_path; because vocab_path is not a named parameter, it ends up in **kwargs.

The way to fix the issue is to revise

for k, v in kwargs.items():
    if k in special_tokens_kv:
        if v != special_tokens_kv[k]:
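
As a rough illustration of the kind of check being proposed, here is a minimal, standalone sketch. It is not the GluonNLP implementation: the helper name check_special_token_kwargs, its signature, and the error wording are hypothetical, and it assumes that every accepted keyword appears as a key in special_tokens_kv (the real tokenizer may accept additional special-token keys).

```python
def check_special_token_kwargs(special_tokens_kv, kwargs):
    """Validate extra keyword arguments against the known special tokens.

    Hypothetical helper for illustration only. Assumes `special_tokens_kv`
    maps every accepted keyword (e.g. 'unk_token') to its token string.
    """
    for k, v in kwargs.items():
        if k not in special_tokens_kv:
            # An unknown key such as `vocab_path` is named explicitly, rather
            # than surfacing later as an unrelated vocabulary mismatch error.
            raise ValueError(
                'Unexpected keyword argument "{}". Expected one of: {}.'.format(
                    k, ', '.join(sorted(special_tokens_kv))))
        if v != special_tokens_kv[k]:
            # A known key must agree with the value stored in the model.
            raise ValueError(
                '"{}" is set to "{}" in the sentencepiece model and cannot '
                'be reset to "{}".'.format(k, special_tokens_kv[k], v))


# Example values are made up for the demonstration.
special_tokens_kv = {'unk_token': '<unk>', 'bos_token': '<s>', 'eos_token': '</s>'}

# A matching special token passes silently.
check_special_token_kwargs(special_tokens_kv, {'unk_token': '<unk>'})

# A misspelled argument is reported by name.
try:
    check_special_token_kwargs(special_tokens_kv, {'vocab_path': 'spm.vocab'})
except ValueError as err:
    print(err)
```

With a check along these lines, passing vocab_path would produce a message that names the offending argument, which is the behavior this issue asks for.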

@sxjscience added the good first issue (Good for newcomers) label on Dec 3, 2020
@sxjscience
Member

Marked it as a "good first issue" because it's a good issue for early contributors. We can just ensure that the correct error is raised when kwargs contains unexpected values.

@preeyank5
Author

Thanks, Xingjian. I am now able to load the model.

@sxjscience
Member

Let's keep this issue to track the error message. We should raise the error if the user has specified some unexpected kwargs.

@sxjscience changed the title from "tokenizers.create throwing an error" to "[Error Message] Improve error message in SentencepieceTokenizer when arguments are not expected." on Dec 3, 2020
@sxjscience added the enhancement (New feature or request) label and removed the bug (Something isn't working) label on Dec 3, 2020
@sxjscience reopened this on Dec 3, 2020
@ConaGo

ConaGo commented Jul 2, 2021

Hi, I am new to this project and would like to tackle this issue.

@Abdullium

> Hi, I am new to this project and would like to tackle this issue.

Have you solved it yet?
