
llama : more tokenizer fixes #2810

Merged: 13 commits from fix-tokenizer into master on Aug 27, 2023

Conversation

@ggerganov (Owner) commented on Aug 26, 2023:

TLDR: after this PR, devs should no longer add a leading space in user code to the text they need to tokenize

In other words, if your code uses the llama.cpp library for tokenization and currently does this:

     s.insert(0, 1, ' '); // add a space to match OG LLaMA tokenizer
     p = ::llama_tokenize(ctx, s, add_bos); 

It should now just do this:

     p = ::llama_tokenize(ctx, s, add_bos); 

Note: by "user" below I mean developers, i.e. users of the llama.cpp library.

This PR should not change the tokenization results compared to master. It improves the tokenizer API, fixes the decoding logic and the cause of the leading space in detokenized results, adds a few more tokenizer tests, and includes a helper Python script to compare the behaviour of the two implementations. I also confirmed that wiki.test.raw tokenizes to 335688 tokens in exactly the same way as the OG Python tokenizer, and fixed a minor bug in the perplexity tool.

The main mistake we have been making is leaving it to the user to add the leading whitespace before calling llama_tokenize(). For example:

if (llama_vocab_type(ctx) == LLAMA_VOCAB_TYPE_SPM) {
    // Add a space in front of the first character to match OG llama tokenizer behavior
    params.prompt.insert(0, 1, ' ');
}

{
    s.insert(0, 1, ' '); // add a space if it's the first
    p = ::llama_tokenize(ctx, s, add_bos);

{
    auto s = json_prompt.template get<std::string>();
    s.insert(0, 1, ' '); // always add a first space
    prompt_tokens = ::llama_tokenize(ctx, s, add_bos);
}

This leading space is indeed needed: without it, we do not get the same results as the OG Python tokenizer. It does not matter whether the provided text has one, several, or no leading spaces - we must always prefix it with a leading space. Therefore, this has to be part of the internal implementation, so that the user (developer) cannot forget to do it:

llama.cpp/llama.cpp

Lines 3391 to 3398 in e4324cb

case LLAMA_VOCAB_TYPE_SPM:
    {
        // without adding this leading whitespace, we do not get the same results as the original tokenizer
        raw_text = " " + raw_text;

        llm_tokenizer_spm tokenizer(vocab);
        llama_escape_whitespace(raw_text);
        tokenizer.tokenize(raw_text, output);

For example, we have technically always had a mistake in the perplexity tool, because we never added the whitespace there:

const bool is_spm = llama_vocab_type(ctx) == LLAMA_VOCAB_TYPE_SPM;
const bool add_bos = is_spm;
fprintf(stderr, "%s: tokenizing the input ..\n", __func__);
auto tokens = ::llama_tokenize(ctx, params.prompt, add_bos);
const int n_chunk_max = tokens.size() / params.n_ctx;

This shouldn't lead to any dramatic changes in the results, but it's better to have it done right. With the current PR, I verified that the tokens produced by llama.cpp exactly match the ones from Python for wiki.test.raw:

# tokenize with Python
python3 tests/test-tokenizer-0-llama.py ../llama2/ --fname-tok ./build/wikitext-2-raw/wiki.test.raw

# tokenize with llama.cpp
make -j && ./bin/test-tokenizer-0-llama ../models/ggml-vocab-llama.gguf ./wikitext-2-raw/wiki.test.raw

# diff
diff wikitext-2-raw/wiki.test.raw.tok wikitext-2-raw/wiki.test.raw.tokcpp 

I've also updated the C-style API with a better name for the de-tokenizer function: llama_token_to_str -> llama_token_to_piece:

llama.cpp/llama.h

Lines 384 to 393 in e4324cb

// Token Id -> Piece.
// Uses the vocabulary in the provided context.
// Does not write null terminator to the buffer.
// User code is responsible to remove the leading whitespace of the first non-BOS token when decoding multiple tokens.
LLAMA_API int llama_token_to_piece(
        const struct llama_context * ctx,
                       llama_token   token,
                              char * buf,
                               int   length);
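
The common-lib std::string wrapper described below builds on this C API. A rough sketch of such a wrapper (assuming the usual llama.cpp convention that a negative return value reports the required buffer size; illustrative, not necessarily the exact common.cpp code):

std::string token_to_piece_wrapper(const struct llama_context * ctx, llama_token token) {
    std::vector<char> buf(8, 0);
    int n = llama_token_to_piece(ctx, token, buf.data(), buf.size());
    if (n < 0) {
        // assumed convention: a negative result means the buffer was too small and -n is the required size
        buf.resize(-n);
        n = llama_token_to_piece(ctx, token, buf.data(), buf.size());
    }
    return std::string(buf.data(), n);
}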

The common lib now provides a llama_detokenize() function that correctly detokenizes a vector of tokens into text, without adding a leading space at the start:

llama.cpp/common/common.h

Lines 119 to 137 in e4324cb

// tokenizes a string into a vector of tokens
// should work similar to Python's `tokenizer.encode`
std::vector<llama_token> llama_tokenize(
        struct llama_context * ctx,
           const std::string & text,
                        bool   add_bos);

// tokenizes a token into a piece
// should work similar to Python's `tokenizer.id_to_piece`
std::string llama_token_to_piece(
        const struct llama_context * ctx,
                       llama_token   token);

// detokenizes a vector of tokens into a string
// should work similar to Python's `tokenizer.decode`
// removes the leading space from the first non-BOS token
std::string llama_detokenize(
                         llama_context * ctx,
        const std::vector<llama_token> & tokens);

The logic is that we have to drop the leading space of the first non-BOS token when detokenizing:

llama.cpp/common/common.cpp

Lines 750 to 768 in e4324cb

std::string llama_detokenize(llama_context * ctx, const std::vector<llama_token> & tokens) {
    const llama_token bos_id = llama_token_bos(ctx);

    std::string piece;
    std::string result;

    for (size_t i = 0; i < tokens.size(); ++i) {
        piece = llama_token_to_piece(ctx, tokens[i]);

        // remove the leading space of the first non-BOS token
        if (((tokens[0] == bos_id && i == 1) || (tokens[0] != bos_id && i == 0)) && piece[0] == ' ') {
            piece = piece.substr(1);
        }

        result += piece;
    }

    return result;
}

This function should behave exactly the same as Python's tokenizer.decode().
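
For illustration, a minimal roundtrip using these helpers might look like this (a sketch assuming ctx is a valid llama_context with an SPM vocab; not code from this PR):

const std::string prompt = "Hello world";

// no manual leading space needed anymore -- llama_tokenize() prefixes it internally
std::vector<llama_token> tokens = ::llama_tokenize(ctx, prompt, /*add_bos=*/true);

// llama_detokenize() drops the leading space of the first non-BOS piece,
// so the roundtrip reproduces the original text
const std::string decoded = ::llama_detokenize(ctx, tokens);  // decoded == "Hello world"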

Notice that in main we still get a leading space inserted when we print the input prompt. This is because we decode the generated text token by token, i.e. piece by piece, so we don't know which token is the first one. We could add state to main.cpp to track this (see the sketch below), but I don't think it is worth it.
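
For reference, such state tracking could look roughly like this (an illustrative sketch, not part of this PR; tokens stands for the ids decoded so far):

bool is_first_piece = true;

for (const llama_token id : tokens) {
    std::string piece = ::llama_token_to_piece(ctx, id);

    // strip the leading space only from the very first non-BOS piece
    if (is_first_piece && id != llama_token_bos(ctx)) {
        if (!piece.empty() && piece[0] == ' ') {
            piece.erase(0, 1);
        }
        is_first_piece = false;
    }

    printf("%s", piece.c_str());
}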

Another small bug fix: we now correctly tokenize the empty string '' to [1] with bos == true and to [] with bos == false.
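
For example, using the common-lib llama_tokenize() wrapper (illustrative):

const std::vector<llama_token> e0 = ::llama_tokenize(ctx, "", /*add_bos=*/true);   // -> [ 1 ]  (just BOS)
const std::vector<llama_token> e1 = ::llama_tokenize(ctx, "", /*add_bos=*/false);  // -> [ ]    (empty)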

Please give this branch a thorough test to make sure I haven't made some mistake, and let's hope we have finally resolved all tokenizer-related issues :)


@SlyEcho (Sponsor, Collaborator) left a comment:

Seems to be okay with the server, at least.

@ghost commented on Aug 26, 2023:

interactive in main looks good, no issues in my tests.

@ggerganov (Owner, Author) commented:

@SlyEcho

I'm looking into this code in more detail:

            else
            {
                if (first)
                {
                    first = false;
                }
                prompt_tokens.push_back(p.template get<llama_token>());
            }
        }
    }

I think it should be:

         { 
             if (first) 
             { 
                 prompt_tokens.push_back(bos_token);
                 first = false; 
             } 
             prompt_tokens.push_back(p.template get<llama_token>()); 
         } 

Not sure how important this is, but mentioning it just in case. You will need to propagate bos_token from somewhere, though.

@klosax (Collaborator) commented on Aug 26, 2023:

The LF token is not correct; it should not be tokenized with a prepended space.

This PR:

llm_load_print_meta: general.name   = llama2-7b-hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 29871 '▁'
llm_load_tensors: ggml ctx size =    0.09 MB

Master:

llm_load_print_meta: general.name   = llama2-7b-hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB

@ghost commented on Aug 26, 2023:

The LF token is not correct; it should not be tokenized with a prepended space.

I missed this, but yeah, there's a difference for the LF token in main between the PR and master.

@ggerganov (Owner, Author) commented:

The LF token is not correct; it should not be tokenized with a prepended space.

Ah, good catch. The actual problem is that it should not be tokenized at all; we should use the piece_to_id function instead.
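
Roughly, the idea is to resolve the LF token id with a direct vocabulary lookup instead of running the tokenizer on "\n". A sketch of the idea (the struct and member names here are assumptions about the internals, not the actual fix):

// sketch only: look up the SPM byte-fallback piece for '\n' directly,
// so no whitespace gets prepended by the tokenizer
static llama_token llama_find_linefeed_id(const llama_vocab & vocab) {
    const auto it = vocab.token_to_id.find("<0x0A>");
    return it != vocab.token_to_id.end() ? it->second : -1;
}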

@klosax (Collaborator) commented on Aug 26, 2023:

Now LF should work in both llama and falcon.

@ghost commented on Aug 26, 2023:

Now LF should work in both llama and falcon.

llama2 now correctly shows LF token = 13 '<0x0A>' in main.

@klosax (Collaborator) left a comment:

Seems to work well.

@ggerganov (Owner, Author) commented on Aug 27, 2023:

I have verified that this branch compared to master produces:

  • identical Hellaswag results for F16 Falcon 7B and F16 LLaMA-v2 7B
  • identical perplexity for Falcon 7B Q4_0
  • nearly identical perplexity for LLaMA-v2 7B Q4_0 (this is expected due to the new correct tokenization of wiki text):
    • master Q4_0: ppl 5.9693
    • PR Q4_0: ppl 5.9694

@ikawrakow (Contributor) commented:

So, I approved the changes, but the PR basically goes back to auto-adding white space at the beginning of strings to be tokenized, which changes HellaSwag scores. I.e., it kind of reverts @klosax's fix in #2806. With PR #2805 merged, this no longer leads to the dramatic difference we saw for HellaSwag after the GGUF merge, but still. I guess the values in #2321 will need to be updated (for instance, we now have a HellaSwag score of 76.75 after 400 tasks for LLaMA-v2-7B instead of the 77.5 shown in the table in #2321).

@ggerganov (Owner, Author) commented on Aug 27, 2023:

Yes, after digging into this yesterday, I believe that the implementation proposed in this PR is the correct one and the old values have to be updated. The auto-prefix with whitespace inside the llama_tokenize() implementation is required in order to match tokenizer.encode() behaviour in all cases and for all strings. Before this PR we were mostly doing this manually in user code, but not always. Now we should never need to prefix with whitespace in user code (unless one wants to concatenate tokenization results, but in that case one must take into account the specific whitespace behaviour of the tokenizer and know how to perform the concatenation correctly -- in general, we should avoid concatenating tokenization results the way we currently do).

It's still possible that I am misunderstanding some of the tokenizer logic, but atm I'm fairly confident that the proposal is better than what we have on master and my unit tests confirm this.

Edit: Regarding the concatenation comment, here is an example to try to make it clear:

Running this Python code first with LLaMA's SPM tokenizer and then with Falcon's BPE tokenizer:

print(tokenizer.encode('hello'))         # x, beginning
print(tokenizer.encode('world'))         # y, ending
print(tokenizer.encode(' world'))        # z, ending with whitespace prefix
print(tokenizer.encode('hello world'))   # r, concat

# results with SPM
[22172]             # x
[3186]              # y
[29871, 3186]       # z
[22172, 3186]       # r = [ x, y ]

# results with BPE
[30835]             # x
[9091]              # y
[1079]              # z
[30835, 1079]       # r = [ x, z ]

@ggerganov (Owner, Author) commented:

And one more observation regarding the built-in Falcon BPE tokenization in llama.cpp:

We currently know that it does not handle Unicode strings correctly. However, running the unit test, we can see that a roundtrip of (tokenize + detokenize) always recovers the "source" string:

$ ./bin/test-tokenizer-0-falcon ../models/falcon-7b/ggml-model-f16.gguf 

src: ' '
res: ' '
tok: 204 

src: '  '
res: '  '
tok: 258 

src: '   '
res: '   '
tok: 466 

src: '    Hello'
res: '    Hello'
tok: 466 23090 

src: '    Hello
    Hello'
res: '    Hello
    Hello'
tok: 466 23090 742 23090 

src: '   Hello'
res: '   Hello'
tok: 258 23090 

src: '  Hello'
res: '  Hello'
tok: 204 23090 

src: ' Hello'
res: ' Hello'
tok: 23090 

src: ' Hello World'
res: ' Hello World'
tok: 23090 2889 

src: ' Hello World!'
res: ' Hello World!'
tok: 23090 2889 12 

src: ' Hello world'
res: ' Hello world'
tok: 23090 1079 

src: ' Hello, world!'
res: ' Hello, world!'
tok: 23090 23 1079 12 

src: ' this is 🦙.cpp'
res: ' this is 🦙.cpp'
tok: 414 304 204 182 237 111 231 25 29247 
main : failed test:    ' this is 🦙.cpp'
main : detokenized to: ' this is 🦙.cpp' instead of ' this is 🦙.cpp'
main : expected tokens:    414,    304,   3346,    111,    231,     25,  29247, 
main : got tokens:         414,    304,    204,    182,    237,    111,    231,     25,  29247, 

src: 'Hello'
res: 'Hello'
tok: 9856 

src: 'Hello World'
res: 'Hello World'
tok: 9856 2889 

src: 'Hello world'
res: 'Hello world'
tok: 9856 1079 

src: 'Hello, world!'
res: 'Hello, world!'
tok: 9856 23 1079 12 

src: 'w048 7tuijk dsdfhu'
res: 'w048 7tuijk dsdfhu'
tok: 98 55866 204 34 16682 7149 36190 6869 11481 

src: 'нещо на Български'
res: 'нещо на Български'
tok: 150 133 6207 151 215 150 134 204 150 133 6279 204 150 223 151 216 15401 150 123 6279 9310 9649 16153 7795 
main : failed test:    'нещо на Български'
main : detokenized to: 'нещо на Български' instead of 'нещо на Български'
main : expected tokens:    150,    133,   6207,    151,    215,    150,    134,   5052,    133,   6279,   5052,    223,    151,    216,  49679,    123,  53110,  47043,   7795, 
main : got tokens:         150,    133,   6207,    151,    215,    150,    134,    204,    150,    133,   6279,    204,    150,    223,    151,    216,  15401,    150,    123,   6279,   9310,   9649,  16153,   7795, 

src: 'កាន់តែពិសេសអាចខលចេញ'
res: 'កាន់តែពិសេសអាចខលចេញ'
tok: 167 236 206 167 236 126 167 236 225 167 237 217 167 236 221 167 237 208 167 236 228 167 236 127 167 236 237 167 237 207 167 236 237 167 236 107 167 236 126 167 236 211 167 236 207 167 236 233 167 236 211 167 237 207 167 236 215 
main : failed test:    'កាន់តែពិសេសអាចខលចេញ'
main : detokenized to: 'កាន់តែពិសេសអាចខលចេញ' instead of 'កាន់តែពិសេសអាចខលចេញ'
main : expected tokens:  38154,    206,  38154,    126,  38154,    225,    167,    237,    217,  38154,    221,    167,    237,    208,  38154,    228,  38154,    127,  38154,    237,    167,    237,    207,  38154,    237,  38154,    107,  38154,    126,  38154,    211,  38154,    207,  38154,    233,  38154,    211,    167,    237,    207,  38154,    215, 
main : got tokens:         167,    236,    206,    167,    236,    126,    167,    236,    225,    167,    237,    217,    167,    236,    221,    167,    237,    208,    167,    236,    228,    167,    236,    127,    167,    236,    237,    167,    237,    207,    167,    236,    237,    167,    236,    107,    167,    236,    126,    167,    236,    211,    167,    236,    207,    167,    236,    233,    167,    236,    211,    167,    237,    207,    167,    236,    215, 

src: '🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
res: '🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
tok: 182 237 232 206 204 19 11003 20 204 182 237 230 126 168 206 219 182 237 218 116 13392 204 19 51831 732 63209 1741 7955 522 20 204 57196 204 19 7927 53360 325 504 701 946 10930 20 
main : failed test:    '🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
main : detokenized to: '🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)' instead of '🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
main : expected tokens:   2571,    232,    206,    204,     19,  11003,     20,   8196,    126,    283,    219,  48778,    116,  13392,    204,     19,  51831,    732,  63209,   1741,   7955,    522,     20,  22438,    211,    204,     19,   7927,  53360,    325,    504,    701,    946,  10930,     20, 
main : got tokens:         182,    237,    232,    206,    204,     19,  11003,     20,    204,    182,    237,    230,    126,    168,    206,    219,    182,    237,    218,    116,  13392,    204,     19,  51831,    732,  63209,   1741,   7955,    522,     20,    204,  57196,    204,     19,   7927,  53360,    325,    504,    701,    946,  10930,     20, 

However, the obtained string is sometimes constructed from a different set of tokens than the correct set we obtain with the OG tokenizer. This happens when Unicode characters are involved.

All of this is expected and indicates that, for now, projects should avoid llama.cpp's built-in BPE tokenizer if tokenization accuracy is important. But for quick tests, the generated texts should look OK.

@ggerganov ggerganov merged commit edd4c14 into master Aug 27, 2023
25 checks passed
@ggerganov ggerganov deleted the fix-tokenizer branch August 27, 2023 11:19
@ikawrakow (Contributor) commented:

Btw, did anyone besides me notice that tokenization is now almost 6 times slower than it used to be?

Tokenizing Wikitext used to take ~540 ms up to bae5c5f, became 3.1 seconds after merging #2806, and now takes 3.2 seconds after this PR.

@ghost commented on Aug 27, 2023:

I don't have an accurate timing for tokenizing the input in perplexity, but counting in my head it took 23 seconds on both master & bae5c5f.

If there were a slowdown, I'd expect to notice it on Android, unless my device is unaffected.

akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* tests : write a Python tokenizer test (wip)

* llama : prefix input text for tokenization with whitespace

* llama : distinguish pieces from decoded text + fix detokenization

* common : add comments

* examples : no longer manually add leading space when tokenizing

* tests : use Python to generate tokenizer tests for C++

* tests : add option to tokenize text files

ggml-ci

* tests : add test-tokenizer-1.py

* llama.cpp : fix LF token

* hellaswag : move the concat space for clarity

* tests : add falcon tests (py + cpp, currently do not pass Unicode)

ggml-ci

* common : temporary separate llama_detokenize calls for SPM and BPE

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>

@shibe2 (Collaborator) commented on Sep 8, 2023:

So how do I tokenize partial text for Llama 2 now? I need tokenize(x+y) == tokenize(x) + tokenize(y) (given that the ending of x and the beginning of y don't form a token).

@goerch (Collaborator) commented on Sep 9, 2023:

So how do I tokenize partial text for Llama 2 now? I need tokenize(x+y) == tokenize(x) + tokenize(y) (given that the ending of x and the beginning of y don't form a token).

Interesting. This is an invariant I haven't looked into yet. For now:

        // invariant 1: encode both halves of every split of the string, then check that
        // decoding each half and concatenating the results reproduces the original string
        for(auto i = 0; i < 1024; ++ i) {
            auto str = random_string(64);
            for(auto j = 0; j < str.length(); ++ j) {
                auto ids1 = spp.EncodeAsIds(str.substr(0, j));
                auto ids2 = spp.EncodeAsIds(str.substr(j));
                auto str1 = spp.DecodeIds(ids1);
                auto str2 = spp.DecodeIds(ids2);
                if (str1 + str2 != str)
                    std::cout << "1: >" << str1 << "< >" << str2 << "< >" << str << "<" << std::endl;
            }
        }
        // invariant 2: encode the whole string, split the token ids at every position, then check
        // that decoding each half and concatenating the results reproduces the original string
        for(auto i = 0; i < 1024; ++ i) {
            auto str = random_string(64);
            auto ids = spp.EncodeAsIds(str);
            for(auto j = 0; j < ids.size(); ++j) {
                auto str1 = spp.DecodeIds(std::vector(ids.begin(), ids.begin() + j));
                auto str2 = spp.DecodeIds(std::vector(ids.begin() + j, ids.end()));
                if (str1 + str2 != str)
                    std::cout << "2: >" << str1 << "< >" << str2 << "< >" << str << "<" << std::endl;
            }
        }

Invariant 1 seems to hold for Meta's tokenizer, invariant 2 breaks as soon as spaces are involved.

@goerch (Collaborator) commented on Sep 9, 2023:

If my quick check

        for(auto i = 0; i < 1024; ++ i) {
            auto str = random_string(64);
            auto ids = spp.EncodeAsIds(str);
            for(auto j = 0; j < str.length(); ++ j) {
                auto ids1 = spp.EncodeAsIds(str.substr(0, j));
                auto ids2 = spp.EncodeAsIds(str.substr(j));
                auto check = ids1;
                check.insert(check.end(), ids2.begin(), ids2.end());
                if (check != ids) {
                    auto str1 = spp.DecodeIds(ids1);
                    auto str2 = spp.DecodeIds(ids2);
                    std::cout << "3: >" << str1 << "< >" << str2 << "< >" << str << "<" << std::endl;
                }
            }
        }

is correct, this invariant also doesn't hold for Meta's tokenizer in the presence of spaces.

@cztomsik cztomsik mentioned this pull request Sep 13, 2023
@shibe2 (Collaborator) commented on Sep 15, 2023:

So how do I tokenize partial text for Llama 2 now?

I have found a way. I prefix the string with LF, then remove tokens up to and including the first llama_token_nl. I added a parameter to server /tokenize, which triggers this behavior (only effective for LLAMA_VOCAB_TYPE_SPM).
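
A sketch of that workaround (illustrative only, not the exact server change; std::find needs <algorithm>):

std::vector<llama_token> tokenize_partial(llama_context * ctx, const std::string & text) {
    // prefix with LF so the tokenizer's auto-inserted leading space only affects the LF prefix
    std::vector<llama_token> tokens = ::llama_tokenize(ctx, "\n" + text, /*add_bos=*/false);

    // drop everything up to and including the first LF token to recover the partial-text tokens
    const llama_token nl_id = llama_token_nl(ctx);
    auto it = std::find(tokens.begin(), tokens.end(), nl_id);
    if (it != tokens.end()) {
        tokens.erase(tokens.begin(), it + 1);
    }

    return tokens;
}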

By the way, server documentation as of now says that this should be the default behavior:

Note that the special BOS token is not added in front of the text and also a space character is not inserted automatically as it is for /completion.
