
convert.py: Mistral models converted from tokenizer.json display <0x0A> instead of newlines. #4622

Closed
TheBloke opened this issue Dec 24, 2023 · 6 comments · Fixed by #4641

@TheBloke
Contributor

This issue follows on from the discussion we had at the end of @strutive07's PR, which added support for tokenizer.json: #3633

Summary

Llama and Mistral models converted to GGUF from tokenizer.json have an issue with newlines: they print the literal token <0x0A> instead of \n. The issue does not exist when tokenizer.model is used for the same model.

This is a problem for some new fine-tunes that do not include tokenizer.model. Sometimes that is simply a mistake and the base model's file can be used, but in other cases the vocab has been extended or changed in tokenizer.json, so a new SPM model would need to be created. (Something that I've not yet been able to figure out how to do.)
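
For context, SentencePiece-style vocabs store raw bytes as byte-fallback tokens whose text is the literal string <0xNN>; the newline byte 0x0A is one of them. Below is a minimal sketch of why the token type matters when detokenizing. This is illustrative Python only, not llama.cpp code, and the function/flag names are made up:

def detokenize_piece(text: str, is_byte_token: bool) -> str:
    # A byte-fallback entry such as "<0x0A>" only turns back into a real
    # newline if the detokenizer knows the token is a BYTE token.
    if is_byte_token and len(text) == 6 and text.startswith("<0x") and text.endswith(">"):
        return chr(int(text[3:5], 16))  # "<0x0A>" -> "\n"
    return text  # otherwise the literal token text is shown verbatim

print(repr(detokenize_piece("<0x0A>", is_byte_token=True)))   # '\n'
print(repr(detokenize_piece("<0x0A>", is_byte_token=False)))  # '<0x0A>'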

Steps to reproduce

1. Download any Llama or Mistral 7B repo which contains both tokenizer.model and tokenizer.json:
pip3 install --upgrade 'huggingface-hub>=0.18'     # if not installed
huggingface-cli download mistralai/Mistral-7B-v0.1 --local-dir test-mistral  --local-dir-use-symlinks False
2. Run convert.py on it, and verify that the output is as expected. Because tokenizer.model is present, it is used in preference to tokenizer.json, and no issue occurs:
$ ls -al /workspace/test-mistral/tokenizer.model
-rw-rw-r-- 1 quant quant 482K Dec 24 17:14 /workspace/test-mistral/tokenizer.model

$ python3 ./convert.py /workspace/test-mistral --outtype f16 --outfile /workspace/test-mistral/with-tokenizer.model.fp16.gguf

$ ./main -m /workspace/test-mistral/with-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0

 A haiku example is 5-7-5 syllables.

The first line has five syllables, the second line has seven syllables and the
3. Remove tokenizer.model to force tokenizer.json to be used and re-run convert.py:
$ mv /workspace/test-mistral/tokenizer.model /workspace/test-mistral/dead.tokenizer.model

$ python3 ./convert.py /workspace/test-mistral --outtype f16 --outfile /workspace/test-mistral/no-tokenizer.model.fp16.gguf
4. Test inference and note that \n is now represented as <0x0A> in the output:
$ ./main -m /workspace/test-mistral/no-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0

 A haiku example is 5-7-5 syllables.<0x0A><0x0A>The first line has five syllables, the second line has seven syllables and the

Testing the same using Hugging Face transformers does not show an issue:

In [1]: import os
   ...: from transformers import AutoTokenizer
   ...: print ("tokenizer.model exists:", os.path.exists("/workspace/test-mistral/tokenizer.model"))
   ...: tokenizer = AutoTokenizer.from_pretrained("/workspace/test-mistral/")
   ...: encoded = tokenizer(""" A haiku example is 5-7-5 syllables.
   ...:
   ...: The first line has five syllables, the second line has seven syllables and the""")
   ...: print(f"Tokens: {encoded.input_ids}")
   ...: print(f"Decoded again: '{tokenizer.decode(encoded.input_ids)}'")
tokenizer.model exists: False
Tokens: [1, 28705, 330, 3631, 23550, 2757, 349, 28705, 28782, 28733, 28787, 28733, 28782, 5747, 584, 2561, 28723, 13, 13, 1014, 907, 1407, 659, 3359, 5747, 584, 2561, 28725, 272, 1676, 1407, 659, 6671, 5747, 584, 2561, 304, 272]
Decoded again: '<s>  A haiku example is 5-7-5 syllables.

The first line has five syllables, the second line has seven syllables and the'

In [2]: tokenizer.decode(13)
Out[2]: '\n'
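
(For a closer look at where the <0x0A> text comes from: in tokenizer.json the newline is stored as a byte-fallback vocab entry, and the fast tokenizer's decoder converts it back to a real byte. The snippet below is only illustrative; the exact JSON layout can differ between models.)

import json

# Inspect the downloaded tokenizer.json directly (same path as used above).
with open("/workspace/test-mistral/tokenizer.json") as f:
    tok = json.load(f)

print(tok["model"]["vocab"]["<0x0A>"])                # 13 -- the same id decoded to '\n' above
print("ByteFallback" in json.dumps(tok["decoder"]))   # True -- the decoder chain has a byte-fallback step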

Llama example

$ huggingface-cli download meta-llama/Llama-2-7b-chat-hf  --local-dir test-llama2  --local-dir-use-symlinks False

$ mv test-llama2/tokenizer.model test-llama2/dead.tokenizer.model

$ python3 ./convert.py /workspace/test-llama2 --outtype f16 --outfile /workspace/test-llama2/no-tokenizer.model.fp16.gguf

$ ./main -m /workspace/test-llama2/no-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0

A haiku example is 5 syllables, 7 syllables, and 5 syllables.<0x0A>A haiku is a traditional form of Japanese poetry that

CC @ArthurZucker

@slaren
Collaborator

slaren commented Dec 24, 2023

The HF tokenizer treats all tokens in the format <0x..> as special byte tokens:

            let bytes = if token.len() == 6 && token.starts_with("<0x") && token.ends_with('>') {
                if let Ok(byte) = u8::from_str_radix(&token[3..5], 16) {
                    Some(byte)
                } else {
                    None
                }
            } else {
                None
            };

https://github.com/huggingface/tokenizers/blob/11462596d11d886e501091e28acbe1174385087a/tokenizers/src/decoders/byte_fallback.rs#L30-L38
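
For comparison, here is the same check transliterated to Python (the function name is mine, just for illustration):

def byte_fallback_value(token: str) -> int | None:
    # Mirrors the Rust above: "<0xNN>" tokens map to the byte value NN.
    if len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
        try:
            return int(token[3:5], 16)
        except ValueError:
            return None
    return None

assert byte_fallback_value("<0x0A>") == 0x0A  # the newline byte
assert byte_fallback_value("hello") is None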

@slaren
Collaborator

slaren commented Dec 24, 2023

This should fix it:

diff --git a/convert.py b/convert.py
index 7a3cd615..710f196b 100755
--- a/convert.py
+++ b/convert.py
@@ -394,10 +394,13 @@ class VocabLoader:
             if self.spm.is_byte(token_id):
                 toktype = gguf.TokenType.BYTE
         else:
+            token = self.reverse_vocab[token_id]
             if token_id == self.unk_token_id:
                 toktype = gguf.TokenType.UNKNOWN
-            if token_id in self.special_ids:
+            elif token_id in self.special_ids:
                 toktype = gguf.TokenType.CONTROL
+            elif len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
+                toktype = gguf.TokenType.BYTE

         return toktype

@strutive07
Collaborator

strutive07 commented Dec 25, 2023

Generation

 A haiku example is 5-7-5 syllables.

The first line has five syllables, the second line has seven syllables and the

Vocab Type Check

from pathlib import Path

# Params, VocabLoader and load_some_model come from llama.cpp's convert.py;
# run this from the repo root so they can be imported.
from convert import Params, VocabLoader, load_some_model

def is_same_vocab(v1, v2):
    # Compare the (text, token type) pairs produced by two vocab loaders.
    v1_set = set()
    v2_set = set()

    for text, score, toktype in v1.all_tokens():
        v1_set.add((text, toktype))

    for text, score, toktype in v2.all_tokens():
        v2_set.add((text, toktype))

    return v1_set == v2_set

model_path = Path("/workspace/Mistral-7B-v0.1")

# Load the vocab while tokenizer.model is still present (SPM path).
params = Params.load(load_some_model(model_path))
vocab_tokenizer_model = VocabLoader(params, model_path)

# remove tokenizer.model here

# Load again, now falling back to tokenizer.json (HF path).
params = Params.load(load_some_model(model_path))
vocab_hf = VocabLoader(params, model_path)

is_same_vocab(vocab_hf, vocab_tokenizer_model)
>> True

Code Diff

diff --git a/convert.py b/convert.py
index 7a3cd61..a9ccb69 100755
--- a/convert.py
+++ b/convert.py
@@ -357,6 +357,7 @@ class VocabLoader:
             for tok in self.tokenizer.all_special_tokens
         }
         self.special_ids: set[int] = set(self.tokenizer.all_special_ids)
+        self.reverse_vocab = {id: encoded_tok for encoded_tok, id in self.tokenizer.get_vocab().items()}
         self.vocab_size_base: int = self.tokenizer.vocab_size
         self.vocab_size: int = self.vocab_size_base + len(self.added_tokens_dict)
         self.fname_tokenizer: Path = fname_tokenizer
@@ -371,14 +372,13 @@ class VocabLoader:
 
     def hf_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
         tokenizer = self.tokenizer
-        reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.get_vocab().items()}
         added_tokens_ids = set(self.added_tokens_dict.values())
 
         for i in range(self.vocab_size_base):
             if i in added_tokens_ids:
                 continue
 
-            text = reverse_vocab[i].encode("utf-8")
+            text = self.reverse_vocab[i].encode("utf-8")
             yield text, self.get_token_score(i), self.get_token_type(i)
 
     def get_token_type(self, token_id: int) -> gguf.TokenType:
@@ -394,10 +394,13 @@ class VocabLoader:
             if self.spm.is_byte(token_id):
                 toktype = gguf.TokenType.BYTE
         else:
+            token = self.reverse_vocab[token_id]
             if token_id == self.unk_token_id:
                 toktype = gguf.TokenType.UNKNOWN
-            if token_id in self.special_ids:
+            elif token_id in self.special_ids:
                 toktype = gguf.TokenType.CONTROL
+            elif len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
+                toktype = gguf.TokenType.BYTE
 
         return toktype
 

@slaren

I've verified that the code you posted works, and I've also checked that the vocab types match. Thank you.
Since you suggested the fix, would you be willing to open a PR for it? Moving reverse_vocab into __init__, as in the diff I posted above, works fine, so please include that change in the PR.

@teleprint-me
Contributor

teleprint-me commented Dec 25, 2023

I'll look into it. There are a bunch of issues related to the vocab and the conversion script, e.g. #4493, #4360, etc.

Vocab mismatches are also occurring more frequently since the latest merge of the updated convert.py:

llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).

Phi models aren't the only ones affected; it also shows up with Mistral, Mixtral, Llama-1, Llama-2, etc.

@slaren
Collaborator

slaren commented Dec 25, 2023

@strutive07 Please open a PR yourself; I am not familiar with the convert.py code.

@TheBloke
Contributor Author

Thanks very much for the diagnosis and fixes, @slaren and @strutive07!
