@cebtenzzre
Collaborator

This PR clears up confusion about the purpose of HfVocab by making it explicit that it is only for LLaMA "SPM" vocabularies stored in tokenizer.json format, not for generic HuggingFace fast tokenizer (tokenizer.json) vocabs. (There is one exception: HfVocab is still used for WordPiece. This will be corrected in a follow-up PR.)

PR #5821 fixed some of the confusion as to which files map to which tokenizers, but in adding the automatic fallback to HfVocab it unintentionally caused a few issues.

This PR makes it the job of each vocab class to attempt to load the vocab from the appropriate files, and to fail if tokenizer.json represents the wrong vocab type.
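A minimal sketch of that idea, not the code from this PR: a vocab class that looks for its own file and refuses a tokenizer.json of the wrong type. The class name `BpeVocabSketch`, the helper `try_load`, and the choice to signal a mismatch with `FileNotFoundError` are assumptions made for illustration; only the `model.type` field of tokenizer.json is relied on.

```python
import json
import tempfile
from pathlib import Path


class BpeVocabSketch:
    """Hypothetical loader that accepts only a BPE tokenizer.json."""

    name = "bpe"

    def __init__(self, base_path: Path) -> None:
        fname = base_path / "tokenizer.json"
        if not fname.is_file():
            raise FileNotFoundError(f"{fname} not found")
        with open(fname, encoding="utf-8") as f:
            model = json.load(f).get("model", {})
        # The class itself rejects a tokenizer.json of the wrong vocab type,
        # so a caller can move on to the next candidate class.
        if model.get("type") != "BPE":
            raise FileNotFoundError(f"{fname} does not describe a BPE vocab")
        self.vocab = model["vocab"]


def try_load(model_type: str) -> bool:
    """Write a toy tokenizer.json of the given type and report whether it loads."""
    with tempfile.TemporaryDirectory() as tmp:
        data = {"model": {"type": model_type, "vocab": {"a": 0, "b": 1}}}
        (Path(tmp) / "tokenizer.json").write_text(json.dumps(data), encoding="utf-8")
        try:
            BpeVocabSketch(Path(tmp))
            return True
        except FileNotFoundError:
            return False
```

In this sketch `try_load("BPE")` succeeds while `try_load("WordPiece")` is rejected, because the type check lives inside the class rather than in the caller.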

I also changed the Vocab Union to a pair of Protocols to make the API a little more explicit.
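As a rough illustration of why a pair of Protocols can be more explicit than a Union, the sketch below separates what every vocab must provide from what only a "real" vocab provides. The protocol names, member names, and the two toy classes are assumptions for this example and may not match convert.py.

```python
from typing import Iterable, Protocol, Tuple, runtime_checkable


@runtime_checkable
class BaseVocab(Protocol):
    """Members assumed to be common to every vocab, including a placeholder."""
    tokenizer_model: str
    name: str


@runtime_checkable
class Vocab(BaseVocab, Protocol):
    """Members assumed to exist only on vocabs that actually hold tokens."""
    vocab_size: int

    def all_tokens(self) -> Iterable[Tuple[bytes, float]]: ...


class NoVocab:
    """Hypothetical placeholder: satisfies BaseVocab but not Vocab."""
    tokenizer_model = "no_vocab"
    name = "no_vocab"


class ToyVocab:
    """Hypothetical vocab with two tokens: satisfies both protocols."""
    tokenizer_model = "llama"
    name = "toy"
    vocab_size = 2

    def all_tokens(self) -> Iterable[Tuple[bytes, float]]:
        yield b"<s>", 0.0
        yield b"a", -1.0
```

With this split, a function that only needs the vocab's name can accept `BaseVocab`, while one that iterates over tokens can require `Vocab`, and a type checker can flag a placeholder being passed where tokens are needed.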


With these changes, converting e.g. deepseek-llm-7b-chat results in this exception with the default --vocab-type:

FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']

And converting with --vocab-type bpe --pad-vocab works as expected.
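The behavior described above can be sketched as a loop over candidate vocab types, where each loader either succeeds or raises. This is a simplified illustration under stated assumptions: the loader functions, the `LOADERS` table, and the `byte_fallback` check used to recognize a LLaMA-style tokenizer.json are hypothetical and not taken from the PR.

```python
import json
import tempfile
from pathlib import Path


def _read_model(base_path: Path) -> dict:
    fname = base_path / "tokenizer.json"
    if not fname.is_file():
        raise FileNotFoundError("tokenizer.json not found")
    return json.loads(fname.read_text(encoding="utf-8")).get("model", {})


def load_spm(base_path: Path) -> str:
    if not (base_path / "tokenizer.model").is_file():
        raise FileNotFoundError("tokenizer.model not found")
    return "spm"


def load_hfft(base_path: Path) -> str:
    model = _read_model(base_path)
    # Assumed heuristic for "LLaMA SPM vocab stored in tokenizer.json".
    if model.get("type") != "BPE" or not model.get("byte_fallback", False):
        raise FileNotFoundError("tokenizer.json is not a LLaMA-style vocab")
    return "hfft"


def load_bpe(base_path: Path) -> str:
    if _read_model(base_path).get("type") != "BPE":
        raise FileNotFoundError("tokenizer.json is not a BPE vocab")
    return "bpe"


LOADERS = {"spm": load_spm, "hfft": load_hfft, "bpe": load_bpe}


def load_vocab(base_path: Path, vocab_types: list) -> str:
    """Return the first vocab type whose loader accepts the model directory."""
    for vocab_type in vocab_types:
        try:
            return LOADERS[vocab_type](base_path)
        except FileNotFoundError:
            continue  # wrong or missing files: try the next candidate
    raise FileNotFoundError(
        f"Could not find a tokenizer matching any of {vocab_types}")


def demo():
    """Toy directory holding only a plain BPE tokenizer.json."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        data = {"model": {"type": "BPE", "byte_fallback": False, "vocab": {}}}
        (base / "tokenizer.json").write_text(json.dumps(data), encoding="utf-8")
        try:
            load_vocab(base, ["spm", "hfft"])
            default_error = None
        except FileNotFoundError as exc:
            default_error = str(exc)
        return default_error, load_vocab(base, ["bpe"])
```

In this toy setup the default candidate list fails up front with a clear error, and asking for `bpe` explicitly succeeds, so a mismatch surfaces at conversion time instead of at runtime.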

With #5821, the model would appear to convert successfully with the default --vocab-type but fail at runtime, and --vocab-type bpe did not recognize the model.

Prior to #5821, the presence of tokenizer.json caused convert.py to attempt to load it as a sentencepiece model:

RuntimeError: Internal: could not parse ModelProto from /home/jared/dirs/text-ai-models/dl/deepseek-llm-7b-chat/tokenizer.json

Closes #6245
Fixes #6238
Fixes #6216
Fixes #5973

@cebtenzzre cebtenzzre requested a review from ggerganov March 27, 2024 22:20
@cebtenzzre cebtenzzre merged commit be55134 into master Mar 28, 2024
@cebtenzzre cebtenzzre deleted the ceb/fix-convert-bpe-hf branch March 28, 2024 15:44
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024