Misc. bug: BPE tokenizer not recognized for mistralai/Mistral-Nemo-Instruct-2407 #16970

@jim-plus

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
version: 6937 (48bd265)
built with MSVC 19.44.35219.0 for x64

Built from a clean install, so no residual files were present.

python convert_hf_to_gguf_update.py [hf_token] was run.
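
For context: the converter identifies a BPE pre-tokenizer by hashing the token IDs it produces for a fixed probe string, and convert_hf_to_gguf_update.py regenerates that hash table from the upstream tokenizers. A minimal sketch of the fingerprinting step, assuming transformers is installed (PROBE_TEXT is a hypothetical stand-in; the real check string lives in the script):

# Minimal sketch of how convert_hf_to_gguf.py fingerprints a pre-tokenizer.
# PROBE_TEXT is a placeholder for the script's real check string.
from hashlib import sha256
from transformers import AutoTokenizer

PROBE_TEXT = "Hello World!\n\t 3 33 333 (probe placeholder)"

def compute_chkhsh(model_dir: str) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # Hash the token-ID list: tokenizers that pre-tokenize identically
    # produce identical hashes, which key a lookup table in the script.
    chktok = tokenizer.encode(PROBE_TEXT)
    return sha256(str(chktok).encode()).hexdigest()

print(compute_chkhsh("./text-generation-webui/user_data/models/n12"))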

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

Python/Bash scripts

Command line

My template command:

python llama.cpp/convert_hf_to_gguf.py ./text-generation-webui/user_data/models/%1 --outfile temp.gguf --outtype f32

Here %1 is the folder containing the model files. This command template worked with earlier versions of the converter.

Problem description & steps to reproduce

I made sure the target model had tokenizer files matching the original mistralai/Mistral-Nemo-Instruct-2407. For some reason, the converter is looking for the file "tokenizer.model", which this model does not include.

In short: attempt the conversion from an HF model to GGUF at F32 precision on this family of models.
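
A quick sanity check (my own, not part of the converter) is to list which tokenizer artifacts the model folder actually contains; this model ships the BPE files (tokenizer.json, tokenizer_config.json) but no SentencePiece tokenizer.model:

# Check which tokenizer artifacts are present in the model folder.
from pathlib import Path

model_dir = Path("./text-generation-webui/user_data/models/n12")
for name in ("tokenizer.json", "tokenizer_config.json", "tokenizer.model"):
    status = "present" if (model_dir / name).is_file() else "MISSING"
    print(f"{name}: {status}")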

First Bad Commit

Unknown; I didn't keep the prior files on hand.

Relevant log output

Attempting the conversion from an HF model to GGUF on this family of models produces an error like the following:

WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:**          There are 2 possible reasons for this:
WARNING:hf-to-gguf:**          - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:**          - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:**          Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref:     https://github.com/ggml-org/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh:  aa78fe8b04bc622b077520b1fb3d3a5c6f7a53dd375e2361e62599be3cf58de1
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:

Traceback (most recent call last):
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 2277, in set_vocab
    self._set_vocab_sentencepiece()
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 1150, in _set_vocab_sentencepiece
    tokens, scores, toktypes = self._create_vocab_sentencepiece()
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 1167, in _create_vocab_sentencepiece
    raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: text-generation-webui\user_data\models\n12\tokenizer.model

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 2280, in set_vocab
    self._set_vocab_llama_hf()
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 1252, in _set_vocab_llama_hf
    vocab = gguf.LlamaHfVocab(self.dir_model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\cygwin64\home\Jim\chat\llama.cpp\gguf-py\gguf\vocab.py", line 515, in __init__
    raise TypeError('Llama 3 must be converted with BpeVocab')
TypeError: Llama 3 must be converted with BpeVocab

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 10231, in <module>
    main()
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 10225, in main
    model_instance.write()
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 584, in write
    self.prepare_metadata(vocab_only=False)
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 705, in prepare_metadata
    self.set_vocab()
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 2283, in set_vocab
    self._set_vocab_gpt2()
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 1086, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
                               ^^^^^^^^^^^^^^^^^^^^^
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 801, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\cygwin64\home\Jim\chat\llama.cpp\convert_hf_to_gguf.py", line 1074, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
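
Reading the traceback bottom-up clarifies the fallback order: set_vocab() first tries the SentencePiece path (hence the tokenizer.model lookup), then the llama-hf vocab, and finally the GPT-2/BPE path, which aborts because the chkhsh above has no entry in get_vocab_base_pre(). Recognized pre-tokenizers are keyed by that hash with entries of the following shape; mapping this hash to "tekken" is an unverified assumption on my part, based on what the upstream Mistral-Nemo tokenizer maps to, and a hash that differs from upstream may instead mean the local tokenizer.json has diverged:

# Hypothetical entry for get_vocab_base_pre() in convert_hf_to_gguf.py.
# chkhsh is the value from the warning above; res = "tekken" is an
# unverified assumption and would need maintainer confirmation.
if chkhsh == "aa78fe8b04bc622b077520b1fb3d3a5c6f7a53dd375e2361e62599be3cf58de1":
    # ref: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
    res = "tekken"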
