Misc. bug: BPE tokenizer not recognized for mistralai/Mistral-Nemo-Instruct-2407 #16970

@jim-plus

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
version: 6937 (48bd265)
built with MSVC 19.44.35219.0 for x64

Built from a clean install, so no residual files were present.

python convert_hf_to_gguf_update.py [hf_token] was run.
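
For context: the converter identifies a BPE pre-tokenizer by hashing the token IDs it produces for a fixed probe string, and convert_hf_to_gguf_update.py regenerates that hash table from the upstream tokenizers. A minimal sketch of the fingerprinting step, assuming transformers is installed (PROBE_TEXT is a hypothetical stand-in; the real check string lives in the script):

# Minimal sketch of how convert_hf_to_gguf.py fingerprints a pre-tokenizer.
# PROBE_TEXT is a placeholder for the script's real check string.
from hashlib import sha256
from transformers import AutoTokenizer

PROBE_TEXT = "Hello World!\n\t 3 33 333 (probe placeholder)"

def compute_chkhsh(model_dir: str) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # Hash the token-ID list: tokenizers that pre-tokenize identically
    # produce identical hashes, which key a lookup table in the script.
    chktok = tokenizer.encode(PROBE_TEXT)
    return sha256(str(chktok).encode()).hexdigest()

print(compute_chkhsh("./text-generation-webui/user_data/models/n12"))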

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

Python/Bash scripts

Command line

My template command:

python llama.cpp/convert_hf_to_gguf.py ./text-generation-webui/user_data/models/%1 --outfile temp.gguf --outtype f32

Here %1 is the folder containing the model files. This command template worked with earlier versions of the converter.

Problem description & steps to reproduce

I made sure the target model had tokenizer files matching the original mistralai/Mistral-Nemo-Instruct-2407. For some reason, the converter is looking for the file "tokenizer.model", which this model does not include.

In short: attempt the conversion from an HF model to GGUF at F32 precision on this family of models.
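
A quick sanity check (my own, not part of the converter) is to list which tokenizer artifacts the model folder actually contains; this model ships the BPE files (tokenizer.json, tokenizer_config.json) but no SentencePiece tokenizer.model:

# Check which tokenizer artifacts are present in the model folder.
from pathlib import Path

model_dir = Path("./text-generation-webui/user_data/models/n12")
for name in ("tokenizer.json", "tokenizer_config.json", "tokenizer.model"):
    status = "present" if (model_dir / name).is_file() else "MISSING"
    print(f"{name}: {status}")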

First Bad Commit

Unknown; I didn't keep the prior files on hand.

Relevant log output

Attempting the conversion from an HF model to GGUF on this family of models produces an error like the following:

WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:**          There are 2 possible reasons for this:
WARNING:hf-to-gguf:**          - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:**          - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:**          Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref:     https://github.com/ggml-org/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh:  aa78fe8b04bc622b077520b1fb3d3a5c6f7a53dd375e2361e62599be3cf58de1
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:

Traceback (most recent call last):
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 2277, in set_vocab
    self._set_vocab_sentencepiece()
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 1150, in _set_vocab_sentencepiece
    tokens, scores, toktypes = self._create_vocab_sentencepiece()
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 1167, in _create_vocab_sentencepiece
    raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: text-generation-webui\user_data\models\n12\tokenizer.model

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 2280, in set_vocab
    self._set_vocab_llama_hf()
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 1252, in _set_vocab_llama_hf
    vocab = gguf.LlamaHfVocab(self.dir_model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\cygwin64\home\Jim\chat\llama.cpp\gguf-py\gguf\vocab.py", line 515, in __init__
    raise TypeError('Llama 3 must be converted with BpeVocab')
TypeError: Llama 3 must be converted with BpeVocab

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 10231, in <module>
    main()
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 10225, in main
    model_instance.write()
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 584, in write
    self.prepare_metadata(vocab_only=False)
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 705, in prepare_metadata
    self.set_vocab()
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 2283, in set_vocab
    self._set_vocab_gpt2()
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 1086, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
                               ^^^^^^^^^^^^^^^^^^^^^
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 801, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 1074, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
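
Reading the traceback bottom-up clarifies the fallback order: set_vocab() first tries the SentencePiece path (hence the tokenizer.model lookup), then the llama-hf vocab, and finally the GPT-2/BPE path, which aborts because the chkhsh above has no entry in get_vocab_base_pre(). Recognized pre-tokenizers are keyed by that hash with entries of the following shape; mapping this hash to "tekken" is an unverified assumption on my part, based on what the upstream Mistral-Nemo tokenizer maps to, and a hash that differs from upstream may instead mean the local tokenizer.json has diverged:

# Hypothetical entry for get_vocab_base_pre() in convert_hf_to_gguf.py.
# chkhsh is the value from the warning above; res = "tekken" is an
# unverified assumption and would need maintainer confirmation.
if chkhsh == "aa78fe8b04bc622b077520b1fb3d3a5c6f7a53dd375e2361e62599be3cf58de1":
    # ref: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
    res = "tekken"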
