Differences with the llama tokenizer #167

Closed
slaren opened this issue Mar 15, 2023 · 19 comments
Labels: bug

Comments

slaren (Collaborator) commented Mar 15, 2023

In this case the llama.cpp and the llama tokenizers produce different output:

main: prompt: 'This is 🦙.cpp'
main: number of tokens in prompt = 10
     1 -> ''
  4013 -> 'This'
   338 -> ' is'
 29871 -> ' '
   243 -> '�'
   162 -> '�'
   169 -> '�'
   156 -> '�'
 29889 -> '.'
  8223 -> 'cpp'

Meanwhile the llama tokenizer produces:

text = "This is 🦙.cpp"
t = tokenizer.encode(text, bos=True, eos=False)

[1, 910, 338, 29871, 243, 162, 169, 156, 29889, 8223]

So in one case "This" is encoded as 4013 and in the other as 910. I have verified that both ids decode to the same text:

t1 = tokenizer.decode([4013])
t2 = tokenizer.decode([910])
print(t1, [int(b) for b in bytes(t1, "UTF-8")])
print(t2, [int(b) for b in bytes(t2, "UTF-8")])

This [84, 104, 105, 115]
This [84, 104, 105, 115]

I am not sure if this causes any significant differences in the generation but it may be a good idea to check it.

beiller (Contributor) commented Mar 15, 2023

The recent tokenizer fixes are here:

https://github.com/ggerganov/llama.cpp/pull/79/files#diff-7696a3039a95b41e744a08272f14d2a4345fddbd06ac482deb37f03a3afad2b5R122

Maybe it's possible that it does some other preprocessing, like lowercasing/uppercasing?

slaren (Collaborator, Author) commented Mar 15, 2023

Same result using the current master and after reconverting the model. More interestingly, the llama tokenizer seems to produce different results when decoding tokens one at a time than when decoding them as a group. For example:

llama.cpp
     1 -> ''
 10994 -> 'Hello'
  2787 -> ' World'

llama:
[1, 15043, 2787]
     1 -> ''
 15043 -> 'Hello'
  2787 -> 'World'
Hello World

Note that when decoding token by token the space is absent. This is the code I am using to test the llama tokenizer if you want to try it: https://gist.github.com/slaren/9f26fc4cb24685d42601b1d91d70a13a

It seems that re-implementing the SentencePiece tokenizer is not going to be entirely trivial, and it may be a good idea to use their library after all.
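For reference, the ground-truth ids can be pulled straight from the SentencePiece library in a few lines (a minimal sketch; it assumes the LLaMA tokenizer.model file is in the working directory):

# Minimal sketch: query the reference SentencePiece tokenizer directly.
# Assumes the LLaMA "tokenizer.model" file is in the working directory.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "This is 🦙.cpp"
print(sp.encode(text, out_type=int))  # ids without BOS/EOS, e.g. [910, 338, 29871, ...]
print(sp.encode(text, out_type=str))  # the pieces; byte-fallback tokens show up as <0x..> pieces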

gjmulder added the bug label Mar 15, 2023
Ronsor (Contributor) commented Mar 15, 2023

https://guillaume-be.github.io/2020-05-30/sentence_piece seems to document the SentencePiece algorithm fairly well

beiller (Contributor) commented Mar 15, 2023

I have SentencePiece integrated in C++; the branch was a PR, but it has been closed since we want to avoid using additional libraries. Try building that branch.

Edit: the link is https://github.com/beiller/llama.cpp/tree/feature/tokenization

thement (Collaborator) commented Mar 16, 2023

What about writing tests that compare the Python implementation of the tokenizer from the original llama code with the current tokenizer implementation in llama.cpp, and then fixing the llama.cpp tokenizer? This way we wouldn't have to add another dependency on libsentencepiece.
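One way to do that without linking libsentencepiece into llama.cpp would be to dump the expected ids offline with the Python library and compare against them from a C++ test. A rough sketch of the generation side (the prompt list and output file are placeholders, not anything in the repo):

# Rough sketch: dump reference tokenizations for a llama.cpp-side test to compare against.
# The prompt list and output file are placeholders.
import json
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
prompts = ["This is 🦙.cpp", "Hello World", "########"]

cases = [{"text": p, "ids": sp.encode(p, out_type=int)} for p in prompts]
with open("tokenizer_test_cases.json", "w", encoding="utf-8") as f:
    json.dump(cases, f, ensure_ascii=False, indent=2)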

jarcen commented Mar 17, 2023

My observations:
Token 4013 = 'This'. Token 910 = ' This'.
Token 10994 = 'Hello'. Token 15043 = ' Hello'.

Notice the whitespace; they're different. I don't know why the Python library doesn't show it, but that's how it is when talking directly to the C++ library. SentencePiece always encodes the first token with a leading whitespace, even if you ask it to prepend the <bos> token. This is even demonstrated on their repo:

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

That's the difference from the current tokenizer.

slaren (Collaborator, Author) commented Mar 17, 2023

It looks like SentencePiece has an option --add_dummy_prefix which adds a dummy whitespace at the beginning of text, so that may well explain it.
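A quick way to see the effect from Python (a sketch, assuming the LLaMA tokenizer.model is at hand):

# Sketch of the dummy-prefix effect; assumes the LLaMA tokenizer.model is available.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# decode() strips the '▁' marker, which is why both 910 and 4013 printed as "This" above.
print(sp.id_to_piece(910))   # '▁This' (word at the start of text or after whitespace)
print(sp.id_to_piece(4013))  # 'This'  (word glued to the previous piece)

# With add_dummy_prefix the text is treated as if it started with whitespace,
# so encoding picks the '▁This' piece (910) rather than 'This' (4013):
print(sp.encode("This is 🦙.cpp", out_type=int))
# [910, 338, 29871, 243, 162, 169, 156, 29889, 8223]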

slaren (Collaborator, Author) commented Mar 17, 2023

Extracted these options from the tokenizer model protobuf:

trainer_spec {
  input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
  model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
  model_type: BPE
  vocab_size: 32000
  self_test_sample_size: 0
  input_format: "text"
  character_coverage: 0.99995
  input_sentence_size: 200000000
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  num_threads: 80
  num_sub_iterations: 2
  max_sentence_length: 4192
  shuffle_input_sentence: true
  max_sentencepiece_length: 16
  split_by_unicode_script: true
  split_by_whitespace: true
  split_by_number: true
  treat_whitespace_as_suffix: false
  split_digits: true
  allow_whitespace_only_pieces: true
  vocabulary_output_piece_score: true
  hard_vocab_limit: true
  use_all_vocab: false
  byte_fallback: true
  required_chars: ""
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_surface: " \342\201\207 "
  unk_piece: "<unk>"
  bos_piece: "<s>"
  eos_piece: "</s>"
  pad_piece: "<pad>"
  train_extremely_large_corpus: false
  enable_differential_privacy: false
  differential_privacy_noise_level: 0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: "identity"
  precompiled_charsmap: ""
  add_dummy_prefix: true
  remove_extra_whitespaces: false
  normalization_rule_tsv: ""
}
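For anyone who wants to reproduce this dump: the model file is a serialized ModelProto, and the sentencepiece Python package ships generated protobuf bindings for it (a sketch; the file path is an assumption):

# Sketch: read the trainer/normalizer options out of the tokenizer model protobuf.
# Assumes "tokenizer.model" is in the working directory; requires the protobuf package.
from sentencepiece import sentencepiece_model_pb2 as model_pb2

m = model_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    m.ParseFromString(f.read())

print(m.trainer_spec)     # model_type, vocab_size, byte_fallback, ...
print(m.normalizer_spec)  # add_dummy_prefix, remove_extra_whitespaces, ...
# m.pieces also holds every piece together with its score, which a
# reimplementation needs for correct merges.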

thement (Collaborator) commented Mar 17, 2023

https://guillaume-be.github.io/2020-05-30/sentence_piece seems to document the SentencePiece algorithm fairly well

I implemented something very similar here:
#242

I haven't tested it yet; I'm on a train and can't download the datasets to run llama, but I'll do it this evening or tomorrow.

slaren (Collaborator, Author) commented Mar 17, 2023

The recently merged #242 still isn't accurate, for example:

llama.cpp:
     1 -> ''
 29871 -> ' '
  7346 -> '########'
 13383 -> '################'
    13 -> '
'
llama:
     1 -> '<s>'
   835 -> '▁###'
 13383 -> '################'
  4136 -> '####'
 29937 -> '#'
    13 -> '<0x0A>'

Note that in the PR max_len (and with it MAX_TOKEN_LEN) is never used, so I imagine that there are some oversights there.

eiz (Collaborator) commented Mar 17, 2023

Yeah, it's still not right. For starters, you need the actual scores from the tokenizer model. Also, the article above discusses the SentencePiece unigram model, not its BPE model, which is what LLaMA uses.
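To illustrate what using the scores means here: SentencePiece's BPE encoder greedily merges the adjacent pair of symbols whose concatenation is a vocabulary piece with the highest score, rather than applying GPT-2-style merge ranks. A toy sketch of that loop (not llama.cpp's actual code; the piece_score lookup is assumed to come from the tokenizer model):

# Toy sketch of score-driven BPE merging (not the actual llama.cpp implementation).
# piece_score(piece) is assumed to return the score from the tokenizer model,
# or None if the piece is not in the vocabulary.
def bpe_encode(text, piece_score):
    symbols = list(text)  # start from single characters ('▁' already substituted for spaces)
    while True:
        best_i, best_score = -1, float("-inf")
        # find the adjacent pair whose merged form is a known piece with the highest score
        for i in range(len(symbols) - 1):
            merged = symbols[i] + symbols[i + 1]
            score = piece_score(merged)
            if score is not None and score > best_score:
                best_i, best_score = i, score
        if best_i < 0:
            break  # no adjacent pair forms a known piece; stop merging
        symbols[best_i:best_i + 2] = [symbols[best_i] + symbols[best_i + 1]]
    return symbols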

lofcz commented Mar 18, 2023

#242 doesn't improve the responses in any way I was able to measure.

jarcen commented Mar 18, 2023

I don't expect it will. I read the SentencePiece front-page documentation and it says it uses regularization at training time. Basically, it randomly creates suboptimal tokenizations to improve robustness. It is likely that the model is already robust to variances between different tokenizers.

Subword regularization and BPE-dropout

Subword regularization [Kudo] and BPE-dropout [Provilkov et al.] are simple regularization methods that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as robustness of NMT models.

To enable subword regularization, you would like to integrate SentencePiece library (C++/Python) into the NMT system to sample one segmentation for each parameter update, which is different from the standard off-line data preparations. Here's the example of Python library. You can find that 'New York' is segmented differently on each SampleEncode (C++) or encode with enable_sampling=True (Python) calls. The details of sampling parameters are found in sentencepiece_processor.h.

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']

beiller (Contributor) commented Mar 18, 2023

The problem is that in Python it's taking raw tokens and placing them in the file. It may be missing real tokens and proper whitespace. You can use my branch to build and compare tokenizers. It's important to fix, but how it affects the output is probably very subtle.

#66

Ronsor (Contributor) commented Mar 18, 2023

I think we'd have to do a backwards-incompatible file format change to support all the tokenizer's features; it also gives us a chance to do some things needed by #91, like proper data alignment ahead of time.

beiller (Contributor) commented Mar 18, 2023

The commit where it was fixed to support UTF-8 is here, but I fear it might be responsible for the whitespace breaks:

cb8c464

Ronsor (Contributor) commented Mar 18, 2023

I think we'd have to do a backwards-incompatible file format change to support all the tokenizer's features; it also gives us a chance to do some things needed by #91, like proper data alignment ahead of time.

On this note, I think we'd be best served parsing the tokenizer model directly (it's just a protobuf) and converting it that way. I can work on doing that, if there aren't any objections.

beiller (Contributor) commented Mar 18, 2023

That was accomplished in this PR (using Python protobuf to parse the tokenizer directly):

#73

But the protobuf methods are directly exposed in SentencePiece, which was determined to be a cleaner solution.

slaren (Collaborator, Author) commented Mar 20, 2023

Fixed in #252
