Post-processing steps of tokenizer (string replacements) are not included in the GGUF model

It seems that the string replacements in the tokenizer's post-processing (decoder) steps are not carried over into the GGUF model.
As a result, some LLMs with more elaborate tokenizers produce slightly odd output text in tools such as ollama that rely on GGUF models.

I noticed it with Lucie Instruct: https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct#test-with-ollama

The tokenizer includes several post-processing steps that are discarded:
https://huggingface.co/OpenLLM-France/Lucie-7B/raw/main/tokenizer.json
```
"decoder": {
    "type": "Sequence",
    "decoders": [
      {
        "type": "ByteFallback"
      },
      {
        "type": "Metaspace",
        "replacement": "▁",
        "add_prefix_space": true,
        "prepend_scheme": "always"
      },
      {
        "type": "Fuse"
      },
      {
        "type": "Replace",
        "pattern": {
          "String": "\n "
        },
        "content": "\n"
      },
      {
        "type": "Replace",
        "pattern": {
          "String": "\t "
        },
        "content": "\t"
      },
...
```

Those are supposed to remove the extra space introduced in pre-processing to get "uniform" subword tokens, i.e. the same token representing a word whether it comes after a space or after something that starts a new sentence (start of string, apostrophe, quotation mark, ...). When these steps are dropped, the decoded text keeps a stray space after each newline or tab.

![Image](https://github.com/user-attachments/assets/d8b47bfc-8330-4b9a-969e-e3e551f241de)
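To make the expected behavior concrete, here is a minimal Python sketch of what the discarded `Replace` steps do to already-detokenized text. It is only an illustration, not how llama.cpp or ollama handle decoding: the helper names `string_replacements` and `apply_replacements` are hypothetical, the JSON is a shortened copy of the excerpt above, and it assumes the `Replace` steps come after `Fuse` (as in the excerpt) so they act on the whole fused string. Only literal `"String"` patterns are handled; regex patterns are skipped.

```python
import json

# Shortened copy of the "decoder" section quoted above (assumption: the
# elided entries are left out; only what the excerpt shows is reproduced).
DECODER_JSON = r'''
{
  "type": "Sequence",
  "decoders": [
    {"type": "ByteFallback"},
    {"type": "Metaspace", "replacement": "▁",
     "add_prefix_space": true, "prepend_scheme": "always"},
    {"type": "Fuse"},
    {"type": "Replace", "pattern": {"String": "\n "}, "content": "\n"},
    {"type": "Replace", "pattern": {"String": "\t "}, "content": "\t"}
  ]
}
'''


def string_replacements(decoder):
    """Collect (pattern, content) pairs from literal-string Replace steps."""
    # A "Sequence" decoder wraps a list; otherwise treat it as a single step.
    steps = decoder.get("decoders", [decoder])
    pairs = []
    for step in steps:
        pattern = step.get("pattern", {})
        if step.get("type") == "Replace" and "String" in pattern:
            pairs.append((pattern["String"], step["content"]))
    return pairs


def apply_replacements(text, pairs):
    """Apply the replacements in order to text that was already detokenized."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


pairs = string_replacements(json.loads(DECODER_JSON))

# Output as it might come from a GGUF-based tool: extra space after "\n" and "\t".
raw = "First line\n second line\t tabbed"
print(repr(apply_replacements(raw, pairs)))
```

A fix inside the GGUF conversion or the detokenizer would presumably need to store these pairs with the model and apply them at the same point in the decoding pipeline; the sketch only shows the string-level effect.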

@ggerganov  I would be happy to contribute to this repo to solve this bug :)

