Post-processing steps of tokenizer (string replacements) are not included in the GGUF model #1093

@Jeronymous

Description

It seems that the string replacements in the tokenizer's post-processing (decoder) steps are not included in the GGUF model.
As a result, some LLMs with more elaborate tokenizers can produce slightly odd output text in tools such as ollama that use GGUF models.

I noticed it with Lucie Instruct: https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct#test-with-ollama

The tokenizer includes several post-processing steps that are discarded:
https://huggingface.co/OpenLLM-France/Lucie-7B/raw/main/tokenizer.json

"decoder": {
    "type": "Sequence",
    "decoders": [
      {
        "type": "ByteFallback"
      },
      {
        "type": "Metaspace",
        "replacement": "▁",
        "add_prefix_space": true,
        "prepend_scheme": "always"
      },
      {
        "type": "Fuse"
      },
      {
        "type": "Replace",
        "pattern": {
          "String": "\n "
        },
        "content": "\n"
      },
      {
        "type": "Replace",
        "pattern": {
          "String": "\t "
        },
        "content": "\t"
      },
...

Those are supposed to remove the extra space introduced in pre-processing to get "uniform" subword tokens, i.e. the same token represents a word whether it comes after a space or after something starting a new sentence (start of string, apostrophe, quotation mark, ...).
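
To make the effect concrete, here is a minimal pure-Python sketch of the decoder sequence above. It is an illustration only, not the `tokenizers` implementation: the function names and the sample token list are hypothetical, the ByteFallback step is omitted, and only the two `Replace` rules quoted above are modelled (the rest of the list is elided in the excerpt).

```python
# Hypothetical sketch of the decoder steps listed in tokenizer.json.
# Only the two Replace rules visible in the excerpt are included.
REPLACEMENTS = [("\n ", "\n"), ("\t ", "\t")]


def metaspace_decode(tokens):
    """Map the '▁' marker back to a space and drop the leading prefix space."""
    pieces = []
    for i, tok in enumerate(tokens):
        piece = tok.replace("▁", " ")
        if i == 0 and piece.startswith(" "):
            piece = piece[1:]  # undo the space prepended before tokenization
        pieces.append(piece)
    return pieces


def decode(tokens, apply_replacements=True):
    """Fuse the pieces, then optionally apply the string replacements."""
    text = "".join(metaspace_decode(tokens))  # "Fuse" step
    if apply_replacements:
        for pattern, content in REPLACEMENTS:
            text = text.replace(pattern, content)
    return text


# Assumed token sequence for "Hello\nworld"; the real tokenization may differ.
tokens = ["▁Hello", "\n", "▁world"]
print(repr(decode(tokens, apply_replacements=False)))  # 'Hello\n world'
print(repr(decode(tokens)))                            # 'Hello\nworld'
```

With the replacements skipped, as appears to happen with the GGUF model, a stray space is left after the newline; with them applied, the original text is recovered.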

@ggerganov I would be happy to contribute to this repo to solve this bug :)
