Description
It seems that the string replacements in the post-processing (`decoder`) section of the tokenizer are not included in the GGUF model.
As a result, some LLMs with fancy tokenizers produce slightly garbled output text with tools like ollama that use GGUF models.
I noticed it with Lucie Instruct: https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct#test-with-ollama
The tokenizer includes several post-processing steps that are discarded:
https://huggingface.co/OpenLLM-France/Lucie-7B/raw/main/tokenizer.json
"decoder": {
"type": "Sequence",
"decoders": [
{
"type": "ByteFallback"
},
{
"type": "Metaspace",
"replacement": "▁",
"add_prefix_space": true,
"prepend_scheme": "always"
},
{
"type": "Fuse"
},
{
"type": "Replace",
"pattern": {
"String": "\n "
},
"content": "\n"
},
{
"type": "Replace",
"pattern": {
"String": "\t "
},
"content": "\t"
},
...
These steps are supposed to remove the extra space introduced during pre-processing to make subword tokens "uniform", i.e. so the same token represents a word whether it comes after a space or after something that starts a new sentence (start of string, apostrophe, quotation mark, ...).
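As a rough illustration of what gets lost, here is a minimal sketch in plain Python (no dependencies) of what the two `Replace` decoder steps above do to the decoded text. The function name and the sample string are hypothetical; the actual library applies these rules as part of its decoder `Sequence`:

```python
# The pre-tokenizer inserts a space after "\n" and "\t" so that subword
# tokens look the same in every context; the "Replace" decoder steps
# strip that space back out when decoding. If a GGUF conversion drops
# these steps, the extra spaces leak into the model output.
REPLACEMENTS = [
    ("\n ", "\n"),  # from the first "Replace" decoder above
    ("\t ", "\t"),  # from the second "Replace" decoder above
]

def apply_replace_decoders(text: str) -> str:
    """Apply the string replacements from the tokenizer's decoder config."""
    for pattern, content in REPLACEMENTS:
        text = text.replace(pattern, content)
    return text

# Example: without the decoder steps, the newline keeps a spurious space.
raw = "first line\n second line\t\tindented"
print(apply_replace_decoders(raw))  # -> "first line\nsecond line\t\tindented"
```

If the GGUF conversion preserved these replacement rules (or ollama applied them on decode), the spurious spaces after newlines and tabs would disappear.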
@ggerganov I would be happy to contribute to this repo to solve this bug :)
