
sentencepiece bpe compatible tokenizer #252

Merged (1 commit) on Mar 20, 2023

Conversation

eiz (Collaborator) commented Mar 18, 2023

I believe this largely fixes the tokenization issues. The example mentioned in #167, as well as my local tests (e.g. "accurately" should tokenize as [7913, 2486]), are fixed by it. I have not tested extensively though, especially with Unicode.

I saw some discussion around file format updates, so just take this as an RFC; I just hacked something in.

Sorry if my coding style is not to your liking ;)

eiz force-pushed the mack/sentencepiece-bpe branch 6 times, most recently from c67e5ef to 448f398, on March 18, 2023 04:03
slaren (Collaborator) commented Mar 18, 2023

This currently breaks quantize.cpp; its tokenizer-reading code needs to be updated to handle the score as well.
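
(For context, a minimal sketch of what reading the vocab section involves once a float score follows each token, which is the part quantize.cpp has to pick up; the struct and helper names here are illustrative, not the actual quantize.cpp code.)

    // Illustrative only: each vocab entry is a uint32 length, the token bytes,
    // and now a float score, which is the field this PR adds to the file format.
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    struct vocab_entry {
        std::string text;
        float       score;
    };

    static bool read_vocab(std::ifstream & finp, int32_t n_vocab, std::vector<vocab_entry> & vocab) {
        vocab.resize(n_vocab);
        for (int32_t i = 0; i < n_vocab; ++i) {
            uint32_t len = 0;
            finp.read((char *) &len, sizeof(len));      // token length
            std::string word(len, '\0');
            finp.read(&word[0], len);                   // token bytes
            float score = 0.0f;
            finp.read((char *) &score, sizeof(score));  // per-token score (new)
            vocab[i] = { word, score };
        }
        return finp.good();
    }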

eiz (Collaborator, Author) commented Mar 18, 2023

Doh, thanks for pointing that out; I've only been using fp16 =) Will fix.

(Review thread on convert-pth-to-ggml.py: outdated, resolved)
j-f1 (Collaborator) left a review comment

Neat! I think this is pretty close but the Unicode handling isn’t quite right. In particular I don’t believe the tokenizer should be UTF-8 aware, since LLaMA should be perfectly capable of handling invalid UTF-8 strings. It seems to operate on the byte level so I believe this PR as-is will prevent characters that are not in the token dataset from being tokenized. Unrecognized characters are currently represented using their UTF-8 bytes as separate tokens.

eiz (Collaborator, Author) commented Mar 18, 2023

The handling of UTF-8 here is exactly the same as what SentencePiece does: multi-byte characters that don't form tokens are output one byte at a time.
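
(A minimal sketch of the byte-fallback behavior described here, mirroring the output loop quoted later in this thread; the helper name and the token_to_id map are illustrative.)

    // Illustrative sketch of SentencePiece-style byte fallback: a symbol that
    // is in the vocabulary is emitted as its token id, otherwise each of its
    // raw UTF-8 bytes is emitted as (byte value + 3), since ids 0..2 are
    // reserved for special tokens.
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    using token_id = int32_t;

    static void emit_symbol(const std::string & sym,
                            const std::unordered_map<std::string, token_id> & token_to_id,
                            std::vector<token_id> & output) {
        auto it = token_to_id.find(sym);
        if (it != token_to_id.end()) {
            output.push_back(it->second);       // whole symbol is a known token
        } else {
            for (unsigned char c : sym) {       // fall back to raw bytes
                output.push_back((token_id) c + 3);
            }
        }
    }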

j-f1 (Collaborator) commented Mar 18, 2023

That’s what happens when I do code review at 1am! Everything looks great now (but it’s still 1am so I am not going to approve it until tomorrow morning when I can take a proper look)

ggerganov (Owner) commented

@eiz
Before you merge this, add a temporary notice at the top of the README to regenerate models (like this: 8a01f56).

Also, let's start bumping the magic when the ggml models change:
https://github.com/ggerganov/llama.cpp/blob/master/convert-pth-to-ggml.py#L96

Or add a ggml version number, whichever you prefer. Check it during loading, and if it is not correct, print a message asking the user to regenerate the models.

P.S. I need a few more days before I start looking into the details, so I appreciate all the help from the collaborators so far.

slaren (Collaborator) commented Mar 18, 2023

The tokenization looks great; I couldn't find any differences from the original LLaMA tokenizer.

Ronsor (Contributor) commented Mar 18, 2023

@ggerganov I would suggest a version number. That allows for better error messages, like "unsupported version" versus a generic "invalid model file".

eiz (Collaborator, Author) commented Mar 18, 2023

"why not both?"

  • changed the file magic so existing unversioned files don't misparse (ggml -> ggmf, "gg model file")
  • added a version number to the header

    uint32_t format_version;
    finp.read((char *) &format_version, sizeof(format_version));

    if (format_version != 1) {
        fprintf(stderr, "%s: invalid model file '%s' (unsupported format version %" PRIu32 ")\n",
                __func__, fname.c_str(), format_version);
        return false;
    }
A collaborator left a review comment on these lines:

Suggestion: move the format_version to a shared header file of some sort, and then say (unsupported version 2, expected 1)
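
(A minimal sketch of that suggestion; the constant name, the standalone helper, and its placement are hypothetical.)

    // Sketch: keep the expected version in one shared place and report both
    // the file's version and the expected one in the error message.
    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>

    static const uint32_t LLAMA_FILE_VERSION = 1;   // would live in a shared header

    static bool check_format_version(const char * fname, uint32_t format_version) {
        if (format_version != LLAMA_FILE_VERSION) {
            fprintf(stderr, "%s: invalid model file '%s' (unsupported version %" PRIu32 ", expected %" PRIu32 ")\n",
                    __func__, fname, format_version, LLAMA_FILE_VERSION);
            return false;
        }
        return true;
    }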

ggerganov (Owner) commented Mar 19, 2023

@eiz Apologies for the convert-pth-to-ggml.py refactoring; please resolve the conflicts and merge when you are ready.

bakkot (Contributor) commented Mar 20, 2023

Good news: I ran this using the scoring logic in #270 and saw an improvement in perplexity for the 7B FP16 on wikitext-2/test from 10.4625 before this PR to 5.8149 after. That's a huge improvement.

Squashed commit message:

* potential out of bounds read

* fix quantize

* style

* Update convert-pth-to-ggml.py

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>

* mild cleanup

* don't need the space-prefixing here rn since main.cpp already does it

* new file magic + version header field

* readme notice

* missing newlines
eiz merged commit 074bea2 into ggerganov:master on Mar 20, 2023
Green-Sky (Collaborator) commented Mar 21, 2023

Just wanted to note that this change had the positive side effect of the model now producing most common English words as a single token; before, words were pieced together. This results in a significant speedup (~2x?) of generated text, even though tokens/sec stayed the same. 🎉

mudler pushed a commit to go-skynet/llama that referenced this pull request Mar 21, 2023
anzz1 mentioned this pull request on Mar 22, 2023
vjeux commented Jun 23, 2023

I believe that there's an issue with this algorithm. It can only merge tokens if they are composites of existing tokens, and on top of that, the combination that works must be the highest-scoring one.

I was curious how many tokens in the LLaMA vocabulary would not tokenize to themselves, and I got 1631 out of 32000.
https://gist.github.com/vjeux/5a466b4c47dc19ec9630f6fbf0cc3a1b

Note that I reimplemented the algorithm in this commit, so I may have made a mistake.

I'm trying to figure out right now whether this is the same algorithm as in the sentencepiece project, in which case it's an issue with the original tokenizer, or whether it's an issue with this implementation only.

Edit: this is the output of the sentencepiece Python implementation. It looks like it is able to produce those tokens, so they must be using a different algorithm than this one (or I messed up the implementation).

    >>> print(tt.encode("""▁–"""))
    [1, 56805, 702, 2]

    1      <s>
    56805  ▁
    702    ▁–
    2      </s>

vjeux commented Jun 23, 2023

@bobakfb asked ChatGPT to find the differences between this algorithm and the SentencePiece one: https://github.com/google/sentencepiece/blob/master/src/bpe_model.cc

Both of these algorithms appear to be implementations of the Byte-Pair Encoding (BPE) algorithm, which is often used for tokenizing text in natural language processing (NLP) tasks.

The key difference between the two is the post-processing step, specifically how they handle symbols or characters that were not matched to any token in the vocabulary.

Here are the notable differences:

Post-processing:
In the first algorithm (which is implemented within the SentencePiece framework), if a symbol sequence doesn't exist in the vocabulary, the algorithm performs a recursive process of segmentation (resegmentation), trying to break down the sequence into smaller subwords or symbols that are present in the vocabulary. This means it attempts to encode unrecognized sequences to known pieces.

In the second algorithm (the llama_tokenizer), if a symbol sequence is not found in the vocabulary, it treats each individual character as a separate token and outputs their corresponding byte values (plus 3, as the first three positions seem reserved for special tokens). This approach treats unrecognized sequences as a series of individual characters.

It looks correct. There's a "resegment" step at the end of the sentencepiece algorithm (https://github.com/google/sentencepiece/blob/master/src/bpe_model.cc#L175-L200) that isn't present in this implementation:

llama.cpp/llama.cpp, lines 1870 to 1883 at commit 7487137:

    for (int i = 0; i != -1; i = symbols_[i].next) {
        auto & symbol = symbols_[i];
        auto token = vocab_.token_to_id.find(std::string(symbol.text, symbol.n));
        if (token == vocab_.token_to_id.end()) {
            // output any symbols that did not form tokens as bytes.
            for (int j = 0; j < (int) symbol.n; ++j) {
                llama_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;
                output.push_back(token_id);
            }
        } else {
            output.push_back((*token).second);
        }
    }

So we should add it as well.
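
(A sketch of what a resegment step modeled on SentencePiece's rev_merge map could look like: during the merge loop, remember which pair each merged symbol came from, then at output time recursively split any symbol that is not itself in the vocabulary. All names and the surrounding state are illustrative, not llama.cpp's actual code.)

    // Illustrative resegment step in the spirit of bpe_model.cc: fall back to
    // the pieces a symbol was merged from before falling back to raw bytes.
    #include <cstdint>
    #include <map>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    using token_id = int32_t;

    struct bpe_state {
        std::unordered_map<std::string, token_id> token_to_id;
        // text of a merged symbol -> the two texts it was merged from
        std::map<std::string, std::pair<std::string, std::string>> rev_merge;
    };

    static void resegment(const bpe_state & st, const std::string & text,
                          std::vector<token_id> & output) {
        auto token = st.token_to_id.find(text);
        if (token != st.token_to_id.end()) {
            output.push_back(token->second);    // the piece is a known token
            return;
        }
        auto p = st.rev_merge.find(text);
        if (p == st.rev_merge.end()) {
            // no recorded merge for this piece: emit raw bytes, as the current code does
            for (unsigned char c : text) {
                output.push_back((token_id) c + 3);
            }
            return;
        }
        resegment(st, p->second.first,  output);    // recurse into the two halves
        resegment(st, p->second.second, output);
    }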

slaren mentioned this pull request on Jun 27, 2023