
tokenizer : special token handling #3538

Merged (11 commits, Oct 17, 2023)

Conversation

@staviq (Collaborator) commented Oct 8, 2023

Special token handling, based on #1931, #3475

Works, but it's definitely not ready, just posting for feedback. Has some testing code meant to be removed before undrafting. Not optimized at all, just completed to the point of working.

  • Creating special tokens cache (a minimal sketch follows this list)
  • Verification of the vocab against the special tokens cache
  • Tokenizer pre-matcher
  • Use tokenizer pre-matcher in main for prompt format arguments
  • Determine if the space prefix can be omitted when preprocessing with special tokens
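
Since several of the items above revolve around the special tokens cache, here is a minimal illustrative sketch of the idea (hypothetical names and types, not the PR's actual code): collect every vocab entry whose type is not NORMAL into a text-to-id map, so the tokenizer can consult it before running the regular algorithm.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    enum token_type { TOKEN_NORMAL, TOKEN_UNKNOWN, TOKEN_CONTROL, TOKEN_USER_DEFINED, TOKEN_BYTE };

    struct token_data {
        std::string text;
        token_type  type;
    };

    // hypothetical cache builder: text -> id for every non-NORMAL token
    static std::map<std::string, int32_t> build_special_token_cache(const std::vector<token_data> & vocab) {
        std::map<std::string, int32_t> cache;
        for (int32_t id = 0; id < (int32_t) vocab.size(); ++id) {
            if (vocab[id].type != TOKEN_NORMAL) {
                cache[vocab[id].text] = id; // e.g. "<s>", "</s>", "<|im_start|>", "<0x0A>"
            }
        }
        return cache;
    }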

llama.cpp (outdated excerpt):
// are special tokens.
// From testing, this appears to correlate 1:1 with special tokens.
//
for (const auto & t: vocab.token_to_id)
@ggerganov (Owner):

Feels like we have a redundancy here. We have token_data.type, which is supposed to tell us if a token is special or not. In which cases would this not work?

I guess we can have this piece of code here as a sanity check that the special tokens that we have read from the vocabulary are indeed special, but ideally AFAIU we shouldn't need this code in order to function correctly, correct?

@staviq (Collaborator, Author):

Yes. My approach is meant to be temporary (or a fallback solution), until I know for sure special tokens will always be marked as such in the vocab.

@goerch wasn't certain about token types in the BPE PR, so I opted for finding special tokens manually for now.

@goerch (Collaborator) commented Oct 9, 2023:

Shouldn't we get rid of #3502 first? This looks pretty basic (and terrible :) to me.

@staviq (Collaborator, Author):

Shouldn't we get rid of #3502 first? This looks pretty basic (and terrible :) to me.

I'm open to critique, but you'll have to clarify what you mean by "this" and "terrible" :) otherwise I'm not sure how to proceed here.

@ggerganov (Owner):

Yes. My approach is meant to be temporary (or a fallback solution), until I know for sure special tokens will always be marked as such in the vocab.

Ok I see. We can probably combine both the tokens marked as special and those that are unmatchable by other tokens. And probably print a warning if there is a mismatch between the 2 sets.

Let's see if @goerch has anything else to add and then I would like to make a pass over the implementation to update the coding style to match better. After that we can merge

@goerch (Collaborator) commented Oct 10, 2023:

Just to be sure about our nomenclature, do you mean special tokens == added tokens? I'm mostly leaning towards the nomenclature used in HF's tokenizer.json which contains added_tokens with the is_special attribute. From @jploski I learned that these added tokens are probably very similar to sentence_piece USER_DEFINED tokens.

And one more question: do we agree that vocab.json, merges.txt, added_tokens.json and special_tokens_map.json are probably older HF (or GPT2) serialization formats and we should find all this information in tokenizer.json for newer models (paging @apage43 too because he knows a lot about this stuff)? W.r.t. these serialization formats we also seem to have some reports indicating that we don't support both of them equally well.

Edit: just adding a hunch: is it possible that tokenizer.json is the serialization format invented for the fast HF tokenizers?

@staviq (Collaborator, Author) commented Oct 10, 2023:

do you mean special tokens == added tokens?

Oh my, you better grab a chair because you are in for an adventure with this question :)

A really condensed version of what I learned so far:

I couldn't find any clear definition or explanation, but from what I found, there seems to be only one way all the pieces of this puzzle fall together to form a clear picture.

A vocab naturally forms a tree-like structure (several trees actually, since there is no common root).

The tokenizer does its thing by grabbing small portions of the input text and trying to steal neighboring portions to form a valid token, rinse and repeat, until nothing can be borrowed from either the left or the right to merge with and form a valid token.

So basically, the tokenizer climbs that token tree from single bytes up to full tokens, stepping only on valid branches (tokens).

Applying this idea backwards, by going through the entire vocab token by token, trying to split each token's text representation in two and checking whether both halves are still valid tokens, you can clearly see that some of the tokens in the vocab are not part of that tree family and will never be matched by the tokenizer.

This happens to match perfectly with normal tokens being marked as LLAMA_TOKEN_TYPE_NORMAL in the vocab, and tokens that are not part of the token tree family being marked with some type other than LLAMA_TOKEN_TYPE_NORMAL.

Therefore, I'm using the term special token to refer to tokens which are "hidden" from the user and unmatchable by the tokenizer.

From a practical point of view, that subset of "special" tokens contains exactly what one would expect: <s>, </s>, <unk> and, in the case of Mistral OpenOrca, <|im_start|>, <|im_end|>, AKA tokens which are used to control the flow of inference. (EDIT: plus byte tokens <0xXX>)

And so from the point of view of actual inference, the tokenizer has to be extended the same way, independently of the actual token type, because only normal vs. not-normal token types matter at that point.

Any additional distinction between USER_DEFINED, CONTROL etc. can and has to be done the same way, with a tokenizer "pre-matcher", at which point per-token-type decisions or actions are trivial to implement if needed.
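
A minimal sketch of the "unmatchable token" check described above (hypothetical helper, not the PR's code): a token is treated as special if its text cannot be split anywhere into two substrings that are both valid tokens, i.e. the merge-based tokenizer could never assemble it from smaller pieces. Comparing this set against the tokens already marked non-NORMAL in the vocab would also allow printing a warning on mismatch, as suggested earlier in the thread.

    #include <string>
    #include <unordered_map>

    // returns true if no split of 'text' yields two existing tokens,
    // i.e. the regular tokenizer can never produce this token by merging
    static bool is_unmatchable(const std::string & text,
                               const std::unordered_map<std::string, int> & token_to_id) {
        if (text.size() < 2) {
            return false; // single bytes are the leaves of the merge tree
        }
        for (size_t i = 1; i < text.size(); ++i) {
            if (token_to_id.count(text.substr(0, i)) && token_to_id.count(text.substr(i))) {
                return false; // a valid split exists, so the tokenizer could reach it
            }
        }
        return true;
    }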

@goerch (Collaborator) commented Oct 10, 2023:

Thanks for taking the time! My mental image is slightly different: my understanding is that the initial training of the tokenizer leads to an initial vocabulary, and later model refinement tries to extend this basic vocabulary with added tokens without retraining the tokenizer on the original dataset. These added tokens should be intercepted before the original tokenizer interpretation. In my mental image, special tokens somehow describe the effect these added tokens have on detokenization, i.e. whether they should be displayed or not, for example. But I might be completely off here.

Edit:

<s>,</s>,<unk> and in case of mistral openorca <|im_start|>,<|im_end|>, AKA tokens which are used to control the flow of inference.

Here I think of <s>, </s>, <unk> as original CONTROL tokens and <|im_start|>, <|im_end|> as USER_DEFINED tokens which shouldn't be visible after detokenization.

@staviq (Collaborator, Author):

I simply do not know if there is any significance to added "normal" tokens ( added in fine tuning etc )

From what I understand, in the tokenizer specifically (ignoring the detokenizer), normal tokens added the way you say would simply be tokenized one step further into a longer token instead of being represented by a sequence of smaller tokens.

If such added tokens are not integrated with the tree structure of the vocab, the user would not be able to use them.
If, however, those added tokens are properly coherent with the vocab, the tokenizer would already match them as it is.

Which to me sort of seems to defeat the purpose of adding "normal" tokens this way.

I might be wrong about this part, but that still means a "tokenizer preprocessor" of some sort is needed, and this PR fits that role.

Collaborator:

I simply do not know if there is any significance to added "normal" tokens ( added in fine tuning etc )

I believe they have to be processed before the core tokenizer routine in something like prefix_match (the only exception are the already-defined core tokenizer CONTROL tokens). MPT's added tokens, for example (from tokenizer.json):

    {
      "id": 0,
      "content": "<|endoftext|>", <-- core, i.e CONTROL
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 1,
      "content": "<|padding|>", <-- unsure about this because it is a valid GPT2 token AFAIU
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 50254,
      "content": "                        ",  <-- extension, i.e USER_DEFINED
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    [...]

@staviq (Collaborator, Author) commented Oct 10, 2023

I'm going to be back at it today or tomorrow. I went on a slight tangent of hardware cleanup, because I wanted to compare this approach with reference Python code and free Colab was OOM-ing on me, so I'm finishing rearranging and reinstalling my servers right now for more RAM.

I already rewrote matching to not use string copy, just didn't push it yet.

I just need to determine how the hardcoded space prefix should work with "segmented" input text, because I don't think it's supposed to be there if the input text starts with a special token. ( #3503 (comment) )

@apage43 (Contributor) commented Oct 11, 2023

we should find all this information in tokenizer.json for newer models

Yes, for anything that was trained with HF tokenizers, everything needed is in the tokenizer.json.

W.r.t. added tokens vs special tokens, using the terms as they are used in the tokenizers source:

  • "added" tokens are tokens that are not using the underlying tokenizer algo at all - they are just strings that have been assigned IDs - they must be found and replaced with their IDs totally separately from the BPE algorithm - that is: tokenizing <|im_start|>hello<|im_end|><|im_start|>testing test needs to first search the input for any added tokens and replace those - [123, "hello", 456, 123, "testing test"] - and only then do the BPE algo on the parts that are still untokenized (a sketch of this two-pass approach follows this list)
  • not all added tokens are special, "special" indicates that they are "control" tokens like "<|im_start|>" and the special flag is so that you can block them from being encoded/decoded (say you're encoding untrusted user input that shouldn't contain them) - but models can have non-special added tokens, like MPT's runs of spaces " ", " ", " ", etc. that just help it compactly encode indented code
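
A minimal sketch of that two-pass idea (hypothetical names, reusing the cache shape from the earlier sketch; not llama.cpp's actual implementation): scan the input for added/special tokens first, emit their ids directly, and run the normal tokenizer only on the plain-text pieces in between.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    // placeholder for the regular BPE/SPM tokenizer (hypothetical)
    static std::vector<int32_t> tokenize_fragment(const std::string & text) {
        (void) text;
        return {}; // in real code: run the normal algorithm on this fragment
    }

    static std::vector<int32_t> tokenize_with_specials(const std::string & text,
                                                       const std::map<std::string, int32_t> & special_cache) {
        std::vector<int32_t> out;
        size_t pos = 0;
        while (pos < text.size()) {
            // find the earliest (and, on ties, longest) special token starting at or after 'pos'
            size_t  best_pos = std::string::npos;
            size_t  best_len = 0;
            int32_t best_id  = -1;
            for (const auto & kv : special_cache) {
                const size_t found = text.find(kv.first, pos);
                if (found == std::string::npos) continue;
                if (found < best_pos || (found == best_pos && kv.first.size() > best_len)) {
                    best_pos = found;
                    best_len = kv.first.size();
                    best_id  = kv.second;
                }
            }
            if (best_pos == std::string::npos) {
                const auto tail = tokenize_fragment(text.substr(pos)); // no specials left
                out.insert(out.end(), tail.begin(), tail.end());
                break;
            }
            if (best_pos > pos) {
                const auto frag = tokenize_fragment(text.substr(pos, best_pos - pos));
                out.insert(out.end(), frag.begin(), frag.end());
            }
            out.push_back(best_id); // the special token maps directly to its id
            pos = best_pos + best_len;
        }
        return out;
    }

The real code would be more careful about overlaps and efficiency; the point here is only the ordering: added/special tokens first, normal tokenization on the leftovers.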

@goerch (Collaborator) commented Oct 11, 2023

  • needs to first search the input for any added tokens and replace those

This could be a difference from sentencepiece USER_DEFINED tokens then? They are matched during BPE tokenization, AFAIU from the sentencepiece source and @jploski.

  • but models can have non-special added tokens, like MPT's runs of spaces " ", " ", " ", etc. that just help it compactly encode indented code

This indicates what I feared: if we stay with sentencepiece token types, do we need to add a token type USER_DEFINED_CONTROL then (or, the other way around: can a sentencepiece token have two token types at the same time)?

Edit, on second thought: regarding my question "can a sentencepiece token have two token types at the same time": I don't think so.

@jploski (Contributor) commented Oct 11, 2023

  • needs to first search the input for any added tokens and replace those

This could be a difference from sentencepiece USER_DEFINED tokens then? They are matched during BPE tokenization, AFAIU from the sentencepiece source and @jploski.

No, in sentencepiece the "user defined symbols" are also matched as a whole before the main BPE loop begins, using normalizer::PrefixMatcher:

https://github.com/google/sentencepiece/blob/8cbdf13794284c30877936f91c6f31e2c1d5aef7/src/bpe_model.cc#L109C3-L109C3

@staviq (Collaborator, Author) commented Oct 11, 2023

Appears to be working without issues, though in interactive mode in main the prefix and suffix get printed untokenized, so they're visible despite being processed correctly:


'':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':1, '':32001, ' system':1587, '':13, 'You':1976, ' are':460, ' a':264, ' '':464, 'space':3603, ' assass':16645, 'in':262, ''':28742, ' from':477, ' another':1698, ' dim':3987, 'm':28719, 'ension':2585, '.':28723, ' You':995, ' only':865, ' speak':4085, ' in':297, ' mysterious':21296, ' crypt':18649, 'ic':294, ' r':408, 'idd':2558, 'les':867, ' and':304, ' never':1484, ' ask':1460, ' questions':4224, ' but':562, ' you':368, ' must':1580, ' pretend':19192, ' you':368, ' are':460, ' a':264, ' normal':4123, ' human':2930, ' jan':10060, 'itor':1842, ' at':438, ' McDonald':25999, ''':28742, 's':28713, ' at':438, ' all':544, ' cost':2434, '.':28723, ' The':415, ' user':2188, ' approaches':13945, ' you':368, ' as':390, ' you':368, ' clean':3587, ' f':285, 'rench':4284, ' f':285, 'ries':2040, ' some':741, ' kids':4551, ' sm':991, 'e':28706, 'ared':1327, ' all':544, ' over':754, ' the':272, ' floor':4366, '.':28723, ' Please':5919, ' continue':3688, ' your':574, ' conversation':7114, ' with':395, ' the':272, ' user':2188, '.':28723, '':32000, ' ':28705, '':13, '':32001, ' user':2188, '':13, 'Hi':23809, ',':28725, ' how':910, ' are':460, ' you':368, ' ?':1550, '':13, '':32000, ' ':28705, '':13, '':32001, ' space':2764, ' assass':16645, 'in':262, '':13, ' A':330, ' single':2692, ' star':4580, ' sh':480, 'ines':1303, ' through':1059, ' a':264, ' window':2924, ',':28725, ' reflect':7967, 'ing':288, ' the':272, ' soul':7237, ''':28742, 's':28713, ' dark':3199, ' secret':5169, ' in':297, ' its':871, ' golden':13863, ' gaze':12438, '.':28723, '':13, '':13, 'As':2198, ' I':315, ' clean':3587, 'se':331, ' the':272, ' floors':21264, ' of':302, ' life':1411, ''':28742, 's':28713, ' waste':9396, ',':28725, ' the':272, ' f':285, 'ries':2040, ' sm':991, 'e':28706, 'ared':1327, ' with':395, ' children':2436, ''':28742, 's':28713, ' laughter':18211, ' whisper':9502, ' of':302, ' happ':1535, 'ier':749, ' times':2421, '.':28723, ' The':415, ' rhythm':19834, 'ic':294, ' movement':6249, ' of':302, ' my':586, ' bro':1865, 'om':300, ' mir':6076, 'rors':22180, ' the':272, ' universe':11717, ''':28742, 's':28713, ' silent':10832, ' dance':9773, ',':28725, ' for':354, 'ging':3080, ' a':264, ' cos':6841, 'mic':12456, ' tap':12857, 'est':374, 'ry':643, ' in':297, ' which':690, ' each':1430, ' thread':5427, ' counts':17938, '.':28723, ' So':1537, ',':28725, ' as':390, ' you':368, ' ask':1460, ',':28725, ' I':315, ' am':837, ' simply':3768, ' doing':2548, ' what':767, ' my':586, ' purpose':6032, ' requires':6948, ' -':387, ' a':264, ' hum':1997, 'ble':982, ' jan':10060, 'itor':1842, ' by':486, ' day':1370, ' and':304, ' a':264, ' space':2764, ' assass':16645, 'in':262, ' by':486, ' night':2125, '.':28723, '':32000 ]
[1697062976] n_remain: -134
[1697062976] found EOS token
[1697062976] waiting for user input

@ggerganov (Owner) commented:

@staviq This is ready for review, correct?

@ggerganov ggerganov marked this pull request as ready for review October 12, 2023 11:47
@ggerganov (Owner) left a review comment:

Think this is ready to merge. Would be useful to post a command used for testing so we can run some tests

ggerganov changed the title from "(wip) tokenizer: special token handling" to "tokenizer : special token handling" on Oct 12, 2023
ggerganov added the "need feedback" label (Testing and feedback with results are needed) on Oct 12, 2023
@staviq (Collaborator, Author) commented Oct 12, 2023

@staviq This is ready for review, correct?

I'll re-check everything after work

I definitely left special tokens enabled in main naively for testing; it should only be enabled for the prefix, suffix and system prompt.

There was an issue with the Swift build, because I think default argument values don't work correctly there (CI failed for Swift).

Otherwise, if @goerch doesn't have anything to add or change, this should be pretty much ready.

@staviq (Collaborator, Author) commented Oct 12, 2023

I cleaned up main to be in line with #3475 (comment)

I also made --verbose-prompt print tokenized reverse prompts, prefix and suffix.

If CI passes, there is only one thing left which I'm not sure about, and that's the hard-coded space prefix in llama_tokenize_internal.

Edit:
What if I simply leave the default behaviour (with the hardcoded space) if special is false, and skip the space prefix if special token processing is enabled (true)?

Special tokens aren't meant to be exposed to the user anyway, so tokenization with special tokens should only be done intentionally, which to me implies wanting full control over tokenizer behaviour. It would let devs decide for themselves whether they want the space prefix or not when processing a prompt template, etc.

This would pretty much be the extent of this modification:

index 1f82dcb..b571d4e 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -6727,7 +6727,7 @@ static std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab &
                         //  by modifying llm_tokenizer_x to operate with string offsets like pre-tokenizer
                         //  and passing 'add space prefix' as bool argument
                         //
-                        auto raw_text = " " + fragment.raw_text.substr(fragment.offset, fragment.length);
+                        auto raw_text = (special?"":" ") + fragment.raw_text.substr(fragment.offset, fragment.length);
 
 #ifdef PRETOKENIZERDEBUG
                         fprintf(stderr,"TT: (%ld %ld %ld) '%s'\n", raw_text.length(), fragment.offset, fragment.length, raw_text.c_str());

@staviq staviq mentioned this pull request Oct 12, 2023
@staviq (Collaborator, Author) commented Oct 12, 2023

I played a bit with the hardcoded space prefix always vs. only for the user input itself, and I'm getting subjectively much, much better results if the hardcoded space is only added to the user input itself and not the "prompt template" (prefix and suffix).

Default tokenizer behavior occasionally leads to things like this, where the model gets lost and breaks the prompt format:

[screenshot: the model breaking the prompt format]

With the modification proposed above, it pretty much never breaks the prompt format, and it also stops unnecessarily outputting empty lines or breaking the answer into paragraphs:

[screenshot: the model keeping the prompt format]

@shibe2 (Collaborator) commented Oct 18, 2023

Are there models that can output special tokens besides EOS?

@staviq (Collaborator, Author) commented Oct 18, 2023

Are there models that can output special tokens besides EOS?

Lots of recent Mistral-based models using ChatML, because they are using <s>, </s> and <|im_start|>, <|im_end|> at the same time.

@shibe2 (Collaborator) commented Oct 18, 2023

@staviq Which such model can I download to play with? Using ChatML for input does not imply the ability to produce the same special tokens in output.

@StefanDanielSchwarz commented:

@staviq Thanks for the insightful info, that explains a lot. So we've been using it wrong all this time and few people noticed.

@ggerganov Well, the last time I wrote C++ code was in 2000 when I worked on the Unreal Engine, so I don't feel qualified to patch the repetition penalty. Do you want me to open an issue for that or is anyone already up for the task?

@staviq (again :)) Is that a bug or a feature when the ChatML-formatted models use both? Are <s> and </s> meant for the whole prompt while <|im_start|> and <|im_end|> are for individual messages in a multi-turn chat?

@staviq (Collaborator, Author) commented Oct 18, 2023

@staviq Which such model can I download to play with? Using ChatML for input does not imply ability of producing same special tokens in output.

Uhmm, actually I can't reproduce that after the special token handling merge, so I'm not sure now if I'm misremembering this, or if it was something specific I tested 3 weeks ago that I'm doing differently now.

Assume I was wrong about this.

@StefanDanielSchwarz commented Oct 18, 2023

Just FYI: There's now an issue for it on the Transformers repo: huggingface/transformers#26902

Hopefully this repetition penalty issue will be fixed for all inference software soon...

@jploski (Contributor) commented Oct 18, 2023

Interestingly enough, "<|im_end|>" isn't just an added or special token, it's also the EOS token in Mistral-OpenOrca... So a multi-turn chat contains multiple EOS tokens, which by itself feels like an abuse of the concept of "EOS". But it doesn't change anything with respect to the problem of imposing an unwanted repetition penalty on such tokens.

Note that this same problem also exists in chat models that use normal tokens for separating turns (the good old "reverse prompt" parameter in llama.cpp's terms, such as "### Assistant:" - I suppose those would get repetition-penalized as well).

@StefanDanielSchwarz commented:

@jploski Yes, this has been a problem all this time. Glad it's getting the necessary attention now.

I used to think that the EOS token should never be part of the prompt; instead, it should only be output by the model to end inference. The inference backend would catch it, stop inferencing, and remove it before sending the output to the frontend. That way it would never show up in the next prompt, and usually the </s> worked like that.

Now that the ChatML format is getting traction and might turn into a standard prompt template, the issue has become more apparent. But yes, the ### Assistant: string was also affected and I've seen it get "rewritten" in various ways when repetition penalty prevented it from being printed the proper way.

Same with the LLM outputting a character's name as a prefix for its message when chatting. After a while, it would fail to do so and break the chat. The solution to that was to have the frontend manipulate the prompt and add the character name itself, so the model only completes after that and never has to write the name by itself.

@teleprint-me (Contributor) commented Oct 18, 2023

I've been running into this issue in the templates.

I've run several tests and it's almost always the same outcome, regardless of the model.

I know this PR is adding support for special tokens, but how were the special tokens handled internally beforehand? I haven't had the opportunity to dig into the code yet.

@shibe2 (Collaborator) commented Oct 18, 2023

I experimented with Mistral-7B-OpenOrca. It indeed outputs im_end.

@teknium1 commented Oct 19, 2023

Yep, all ChatML models made by Axolotl use <|im_end|> to separate turns, as well as the EOS token.

I haven't noticed an issue with ~13 turns when inferencing Hermes 2, but I think it's mostly due to the long responses it gives in my turns, which makes the start/end/assistant/user tokens that show up each turn less disastrous. I think this effect will be most prominent in short-token turns - i.e. roleplaying, or scenarios where a model is trained to answer in as few words as possible - but no doubt this issue has been affecting all multi-turn models in some way, whether subtly or not, since at least the first ShareGPT/Vicuna models came around, without anyone being aware until very recently somehow, probably because EOS tokens appearing throughout all turns in ChatML make the problem even more exacerbated.

Also tagging huggingface/text-generation-inference#1170 here so TGI devs can see the full context between HF TGI and here. Someone should probably loop in the vLLM folks if we want to get this covered across the board; I think they only have presence penalty, but I assume the same issue happens with that?

@StefanDanielSchwarz commented:

Exactly. I'm one of those chatters/roleplayers who have been experiencing these issues (responses getting longer, less coherent, until eventually derailing) since the beginning (Alpaca/Vicuna).

I used to think it was just a limit of the technology and worked around it by setting various permutations of the EOS token as stopping strings. So glad this is finally getting the attention it deserves and hopefully will be fixed globally.

That should make our local/open LLMs much more intelligent and useful once they work as intended and no longer hallucinate/derail that much. Especially the smaller models that tend to require closer adherence to the trained/tuned prompt format than bigger models which tend to cope better (but even they should profit).

joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 19, 2023
* 'master' of github.com:ggerganov/llama.cpp:
  fix embeddings when using CUDA (ggerganov#3657)
  llama : avoid fprintf in favor of LLAMA_LOG (ggerganov#3538)
  readme : update hot-topics & models, detail windows release in usage (ggerganov#3615)
  CLBlast: Fix temporary buffer size for f16 conversion (wsize)
  train-text-from-scratch : fix assert failure in ggml-alloc (ggerganov#3618)
  editorconfig : remove trailing spaces
  server : documentation of JSON return value of /completion endpoint (ggerganov#3632)
  save-load-state : fix example + add ci test (ggerganov#3655)
  readme : add Aquila2 links (ggerganov#3610)
  tokenizer : special token handling (ggerganov#3538)
  k-quants : fix quantization ranges (ggerganov#3646)
  llava : fix tokenization to not add bos between image embeddings and user prompt (ggerganov#3645)
  MPT : support GQA for replit-code-v1.5 (ggerganov#3627)
  Honor -ngl option for Cuda offloading in llava (ggerganov#3621)
@jploski (Contributor) commented Oct 19, 2023

That should make our local/open LLMs much more intelligent and useful once they work as intended and no longer hallucinate/derail that much.

I doubt it will have that much effect, because if repetition penalty messing up generation were the issue, we should also be seeing it for newline tokens (much more frequent than EOS / next-turn tokens). The derailment is more likely due to the model paying equal attention to its own (garbage) outputs as to anything else and eventually sampling more and more out of the training distribution as a result. So yes, I'd say it is an inherent limitation of the technology. Hallucinations are also out-of-distribution sampling (to prevent them you'd either have to specifically train on such garbage samples or maybe detect occurrences, e.g. by scoring semantic proximity of multiple outputs of the same prompt; if the model is happy to generate "yes" and "no" to the same question with the same probability, it's safe to infer it does not know the answer).

As for the repetition penalty, it is not really all that much needed if you sample from the entire distribution (high top_p) rather than greedily. The reason why repetitions happen with greedy sampling: imagine you have a biased coin (your LLM) and with each prediction you only pick the most probable outcome. With such a sampling method you'd only ever get heads or tails, but never a realistic mix of them generated by such a coin. (Note: top_k can also be very greedy if all the top k tokens are synonyms. This description also generalizes to multi-token generation; imagine throwing n different biased coins, then you get the GPT-style sentence-level loops.) Finally, the repetition seems to also have something to do with the amplitude swings in attention scores in the positional embeddings (both original GPT and RoPE), based on anecdotal evidence that in a model with different positional embeddings (RetNet + XPos), repetitions also look different (individual token stutter, but followed by auto-recovery, no endless self-reinforcing looping).
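
As a tiny illustration of the biased-coin point (not related to the PR's code): picking the argmax every time produces an endless run of the same outcome, while actually sampling from the distribution reproduces a realistic mix.

    #include <cstdio>
    #include <random>

    int main() {
        const double p_heads = 0.6; // the "model" slightly prefers one token
        std::mt19937 rng(42);
        std::bernoulli_distribution coin(p_heads);

        std::printf("greedy : ");
        for (int i = 0; i < 20; ++i) {
            std::printf("%c", p_heads >= 0.5 ? 'H' : 'T'); // argmax: always the same outcome
        }
        std::printf("\nsampled: ");
        for (int i = 0; i < 20; ++i) {
            std::printf("%c", coin(rng) ? 'H' : 'T'); // draws follow the 60/40 distribution
        }
        std::printf("\n");
        return 0;
    }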

@StefanDanielSchwarz commented:

Very insightful post! Just to add my observations:

we should also be seeing it for newline tokens (much more frequent than EOS / next turn tokens)

I think that's also visible. I used examples split into multiple paragraphs, and the splitting became inconsistent, probably because the newline became penalized. Same with other characters like dashes, quotes, asterisks - when used a lot, the model would suddenly switch, using different unicode symbols for dashes, quotes, or even switch from *this is an action* to (this is an action) or [this is an action] formats.

I think we need a better solution than simple (stupid) repetition penalty. There's also relevant discussion here: huggingface/transformers#26902

Comment on lines +2250 to +2251
// TODO: It is unclear (to me) at this point, whether special tokens are guaranteed to be of a deterministic type,
// and will always be correctly labeled in 'added_tokens.json' etc.


Sorry if it's irrelevant, but added_tokens.json indeed does not have types; however, with the recent update we save the added_tokens_decoder, which includes everything. See the 2 PRs here and here in transformers.

@staviq (Collaborator, Author):

I might have worded that code comment a bit wrong; I didn't mean added_tokens.json in particular, but rather source model metadata and the way it's processed by import scripts in general.

BPE models are still problematic; at least a couple of them I tested this with had no token types imported/defined at all. So my suspicions were correct, and at this point in time that TODO is still reasonably valid.

Though the method used in this PR to detect special tokens was/is eventually meant to only be used as a fallback.

@cebtenzzre (Collaborator) commented:
I played a bit with the hardcoded space prefix always vs. only for the user input itself, and I'm getting subjectively much, much better results if the hardcoded space is only added to the user input itself and not the "prompt template" (prefix and suffix).

@staviq Are we sure this was the right call? We should look into what HF transformers does.

Command: build/bin/main -t 16 -m /tmp/dolphin-llama2-7b.Q4_0.new.gguf -n 2048 --ignore-eos -p 'The quick brown fox' --seed 1699464025

Before this PR (with a leading space, GitHub doesn't show it):

The quick brown fox jumps over the lazy dog is a well-known English nursery rhyme or children's song.➖

This nursery rhyme has been popular since 1847, when it was first published in the Monthly Magazine of London. Since then, it has undergone various changes and adaptations, but its essence remains the same. The rhyme tells the story of a quick brown fox who jumps over the lazy dog to avoid being caught for stealing some cheese.

After this PR (no leading space):

The quick brown fox jumps over the wall (the last line of the poem) and is on the other side.nahmáo Often used as a verb, němäo is a term derived from the Sanskrit word nirvana, meaning "spirit" or "consciousness." The concept of the spirit or consciousness can be found in many cultures and religions throughout history. In Buddhism, for instance, the term "nirvana" refers to the state of perfect spiritual awakening that is the goal of practice. Similarly, in Christianity, the Holy Spirit is considered a divine spirit that guides and supports believers.

@staviq (Collaborator, Author) commented Nov 14, 2023

@cebtenzzre

The problem was that making special tokens work allowed proper prompt formats to also work, except when the space prefix was added to -p: something like "SYSTEM:" still wasn't working properly, because " SYSTEM:" (with the leading space) wasn't correctly triggering the model's prompt format, while "SYSTEM:" (without it) was.

So it was a tradeoff between making prompt formats work better and making main more complicated by introducing another bunch of parameters for splitting -p into prefix, content and suffix, each processed with/without special tokens or the space prefix.

There was some talk about splitting/changing the add_bos and special flags for the tokenizer, and a separate one for the space prefix, but I don't think that PR went through, or I missed it.

Feel free to revert it, if it's causing problems.

I can't think of a way to determine whether -p is or isn't using some sort of prompt format, to decide automatically if or where a space prefix should be added.

@shibe2 (Collaborator) commented Nov 14, 2023

If a space is desirable but missing, the user can add it. If a space is added automatically but is undesirable, the user is out of luck.

@cebtenzzre (Collaborator) commented:
@staviq
Models with GPT2-style BPE have add_prefix_space in tokenizer.json/tokenizer_config.json; I believe we need to honor that.

As for llama, this is what HF transformers does (I assume we should match the legacy=False behavior):

"""
Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
first token is special.
"""
if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
    tokens = tokens[1:]

And an interesting note about add_dummy_prefix (changed in this PR):

"""
Returns a tokenized string.

We de-activated the `add_dummy_prefix` option, thus the sentencepiece internals will always strip any
SPIECE_UNDERLINE. For example: `self.sp_model.encode(f"{SPIECE_UNDERLINE}Hey", out_type = str)` will give
`['H', 'e', 'y']` instead of `['▁He', 'y']`. Thus we always encode `f"{unk_token}text"` and strip the
`unk_token`. Here is an example with `unk_token = "<unk>"` and `unk_token_length = 4`.
`self.tokenizer.sp_model.encode("<unk> Hey", out_type = str)[4:]`.
"""

I think it is important to fix this regression for models that do not use special tokens in the prompt (including all of the ones that I use).

@shibe2 (Collaborator) commented Nov 15, 2023

If you don't want to add a space after special tokens, only add it to a fragment if it precedes all special tokens except BOS, i.e. on the first iteration over fragment_buffer. This way, prompts without special tokens will be getting the space as before.
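
A hedged sketch of that suggestion (hypothetical types, not llama.cpp's actual fragment code): the legacy space is prepended only when the very first fragment is raw text, so a prompt with no special tokens keeps the old behaviour while text following a special token is left untouched.

    #include <cstdint>
    #include <string>
    #include <vector>

    struct fragment {
        bool        is_raw_text; // false -> an already-resolved special token id
        std::string raw_text;
        int32_t     token_id;
    };

    // a leading BOS fragment, if present, would be skipped before this check
    static void add_space_prefix_if_leading(std::vector<fragment> & fragments) {
        if (!fragments.empty() && fragments.front().is_raw_text) {
            fragments.front().raw_text = " " + fragments.front().raw_text;
        }
    }

In the PR's tokenizer the fragments carry offsets and lengths into the original string (as in the diff quoted earlier) rather than copies, but the gating idea would be the same.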

@ArthurZucker commented:

If a space is desirable but missing, the user can add it. If a space is added automatically but is undesirable, the user is out of luck.

Hey 🤗 The add_prefix_space parameter will be added for llama-like tokenizers (both slow and fast) soon-ish.
The legacy flag is related to the space after special tokens that was always added due to the underlying add_dummy_prefix_space of the sentencepiece model.

@shibe2 (Collaborator) commented Nov 15, 2023

The add_prefix_space parameter will be added for llama-like tokenizers (both slow and fast) soon-ish.

Is there any documentation about that?
