Improve support for special tokens #1931

Closed · wants to merge 13 commits

Conversation

@Igoorx (Author) commented Jun 19, 2023

Fixes #1812
Fixes #1501

Hello! This is my first attempt to contribute to this project, so I apologize in advance for any mistakes.
This PR adds basic support for special tokens and improves the support for added tokens. All special tokens come from the file added_tokens.json plus either special_tokens_map.json or tokenizer_config.json (I have no idea if it's safe to rely on only one, so I added tokenizer_config.json as a fallback).
EDIT: The loading of the JSON files now follows the huggingface implementation more closely.

The most important points of this PR are:

  • The GGML format was changed due to the requirement of a way to know which tokens are the special tokens.
    EDIT: This isn't necessary anymore.
  • The tokenizer now uses a trie algorithm to efficiently split the prompt on the special tokens. This was necessary because the BPE tokenizer isn't able to tokenize the special tokens by itself.
    Please note that this algorithm was ported from the huggingface/transformers repository, so I wonder if this could cause license issues?

    EDIT: The algorithm is now just a linear search (see the sketch below this list).
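
For illustration, here is a minimal sketch of the linear-search splitting idea. This is hypothetical code, not the PR's actual implementation; all names are made up:

```cpp
// Hypothetical sketch: split `text` into plain segments (to be BPE-tokenized)
// and special-token segments (to be mapped directly to their token ids).
#include <string>
#include <vector>

struct segment {
    std::string text;
    bool        is_special;
};

static std::vector<segment> split_on_special(
        const std::string & text,
        const std::vector<std::string> & specials) {
    std::vector<segment> out;
    size_t pos = 0;
    while (pos < text.size()) {
        // linear search: find the earliest (and, on ties, longest) special token
        size_t best_pos = std::string::npos;
        size_t best_len = 0;
        for (const auto & sp : specials) {
            const size_t p = text.find(sp, pos);
            if (p == std::string::npos) continue;
            if (p < best_pos || (p == best_pos && sp.size() > best_len)) {
                best_pos = p;
                best_len = sp.size();
            }
        }
        if (best_pos == std::string::npos) {
            out.push_back({text.substr(pos), false}); // no more special tokens
            break;
        }
        if (best_pos > pos) {
            out.push_back({text.substr(pos, best_pos - pos), false});
        }
        out.push_back({text.substr(best_pos, best_len), true});
        pos = best_pos + best_len;
    }
    return out;
}
```

The plain segments then go through the normal tokenizer, while each special segment is emitted as its token id directly, which is what produces the `</s>` entries in the --verbose-prompt output below.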

Using this PR, this is the output of --verbose-prompt:

main: prompt: ' One Two</s>Three</s> Four '
main: number of tokens in prompt = 8
     1 -> '<s>'
  3118 -> ' One'
  7803 -> ' Two'
     2 -> '</s>'
 28575 -> 'Three'
     2 -> '</s>'
 12458 -> ' Four'
 29871 -> ' '

when the model is converted to ggml with this special_tokens_map.json:

{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<unk>",
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
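
For what it's worth, the fallback logic described above might look roughly like this. This is a hedged sketch only, using nlohmann/json as an assumed dependency; the PR's actual loading happens at conversion time, and the names here are hypothetical:

```cpp
// Sketch: collect special tokens from special_tokens_map.json, falling back
// to tokenizer_config.json if the first file is missing (assumed behavior).
#include <fstream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

// An entry may be a plain string ("<unk>") or an AddedToken-style object
// with a "content" field, as in the special_tokens_map.json above.
static std::string token_content(const nlohmann::json & entry) {
    return entry.is_string() ? entry.get<std::string>()
                             : entry.at("content").get<std::string>();
}

static std::vector<std::string> load_special_tokens(const std::string & dir) {
    std::vector<std::string> specials;
    for (const char * name : {"special_tokens_map.json", "tokenizer_config.json"}) {
        std::ifstream f(dir + "/" + name);
        if (!f) continue; // try the fallback file instead
        const nlohmann::json j = nlohmann::json::parse(f);
        for (const char * key : {"bos_token", "eos_token", "unk_token", "pad_token"}) {
            if (j.contains(key) && !j[key].is_null()) {
                specials.push_back(token_content(j[key]));
            }
        }
        break; // the first file that exists wins
    }
    return specials;
}
```

Note how `"pad_token": "<unk>"` above is a plain string while the other entries are objects; any loader has to handle both shapes, which is what `token_content` does here.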

@Igoorx force-pushed the specialtokens branch 3 times, most recently from 6a8e3ff to 6c55fe1 on Jun 19, 2023
@KerfuffleV2 (Collaborator)

A breaking change to the GGML format might be a tough sell (but don't take my personal opinion as speaking for the project in any way). You might consider adding a command-line option and/or API addition that allows reading the special tokens from a separate file or list.

so I wonder if this could cause license issues?

llama.cpp is under MIT and transformers seems to be Apache 2.0. I'm not qualified to say what the issue is or how to fix it, but the projects do have different licenses. I don't know if there's any kind of policy for dealing with that situation already in place. Perhaps someone else can give you a better answer, but until that part is resolved I'd guess you shouldn't expect your PR to get merged (again, not speaking with any kind of authority).

@Igoorx (Author) commented Jun 19, 2023

A breaking change to the GGML format might be a tough sell (but don't take my personal opinion as speaking for the project in any way).
You might consider adding a command-line option and/or API addition that allows reading the special tokens from a separate file or list.

On the bright side, the old format is still supported... But I agree with you, and it's also something I would want to avoid. The thing is, after looking at the code, everything related to the vocab seems to be integrated inside the GGML file (e.g. the added_tokens.json file and the whole vocab itself), so I thought the only right choice was to include the list of special tokens too.
I don't know what the maintainers of the project would prefer though, so I'm open to making any changes.

llama.cpp is under MIT and transformers seems to be Apache 2.0. I'm not qualified to say what the issue is or how to fix it, but the projects do have different licenses. I don't know if there's any kind of policy for dealing with that situation already in place.

Yes, this is something that concerns me a bit. I would appreciate feedback on whether I should be concerned about the license; after all, I'm not simply copy-pasting the code. Perhaps the comment I added with the link to the original implementation would be sufficient?
Nevertheless, even if the license should be respected to the fullest, I am aware that Apache-2.0 is compatible with MIT, so I believe there should be no issue with including that code in the repository. However, I also need feedback on this matter. This seems to be the first time this issue has arisen here, so I'm not sure whether the maintainers would want to add the Hugging Face license file specifically for this piece of code. If they do, I'm not sure where the license should go; perhaps just appended to the LICENSE file?

Perhaps someone else can give you a better answer, but until that part is resolved I'd guess you shouldn't expect your PR to get merged (again, not speaking with any kind of authority).

You're absolutely right, no doubt about that; that's why I brought it up in the PR message. This is quite important for an open-source project, so it needs to be sorted out before the PR is merged.

@bullno1 (Contributor) left a comment

AFAIK, there is a discussion on a new format here: ggerganov/ggml#220

You may want to chime in.

llama-util.h: two review comment threads (outdated, resolved)
@Igoorx (Author) commented Jun 20, 2023

AFAIK, there is a discussion on a new format here: ggerganov/ggml#220

You may want to chime in.

I don't believe I have anything to contribute to the discussion... but upon a closer look, it appears there is also a discussion regarding the tokenizer in that issue 🤔
I'm uncertain whether there is a possibility of this PR being merged or if the maintainers would prefer to wait for GGUF.
I suppose I will change the approach of this PR and alter the GGML format in a way that keeps the new model format compatible with older versions. That way, even with doubts about when GGUF will be merged, this pull request could serve as a temporary solution to the special-token issue until then.

@grantbey commented Aug 7, 2023

Hey @Igoorx what's the status of this PR? I'm really interested in this work.

I fine-tune using LoRA and add a few special tokens, but these aren't tokenized correctly when running inference with llama.cpp. I'm going to try your PR and see if it helps.

@Igoorx (Author) commented Aug 7, 2023

@grantbey It's finished, but since the maintainers showed no interest whatsoever in merging it, I didn't resolve the merge conflicts. If you can't do that yourself, you should just wait for GGUF; it's right around the corner: #2398

@grantbey commented Aug 7, 2023 via email

@Igoorx (Author) commented Aug 7, 2023

@grantbey I rebased the PR to the last master commit 👍

@grantbey commented Aug 7, 2023

Ok wow @Igoorx you're amazing. Thank you so much!

@klosax mentioned this pull request Aug 7, 2023
@goerch (Collaborator) commented Aug 7, 2023

@Igoorx: @klosax just made me aware that we are working on weaknesses of the tokenizer(s) at the same time. I'd greatly appreciate any cooperation on this. If I were to incorporate your changes, my first question would be: how do I test them?

@Igoorx (Author) commented Aug 7, 2023

@goerch
Here is a simple test: 6f7daba

@goerch (Collaborator) commented Aug 7, 2023

@goerch Here is a simple test: 6f7daba

Nice, thank you! Then my plan would be to try to first integrate your changes into #2315 and afterwards migrate the PR over to gguf. Would that be OK for you?

@Igoorx (Author) commented Aug 7, 2023

@goerch Yeah, that's fine. You just have to be aware of this:

llama.cpp/llama-util.h

Lines 554 to 555 in 6f7daba

// Trie in C++. Creates a Trie out of a list of words. The trie is used to split on multiple delimiters in one pass
// Ported from: https://github.com/huggingface/transformers/blob/ee88ae59940fd4b2c8fc119373143d7a1175c651/src/transformers/tokenization_utils.py#L52

The trie algorithm used in the PR is a port from the huggingface repository, as noted in that comment, so maybe something needs to be done about it. I'm not sure whether the comment is enough or whether it would be necessary to add the HF license somewhere.

@klosax (Collaborator) commented Aug 7, 2023

The gguf gpt2 tokenizer also has a Trie implementation. That tokenizer is under the MIT license; maybe it could be reused for the llama tokenizer.

@goerch (Collaborator) commented Aug 7, 2023

@goerch Yeah, that's fine. You just have to be aware of this: ...

IANAL, but I'm equally concerned about compatibility with the sentencepiece license (Apache-2.0). They use a trie too, which might be the basis of the HF implementation. When developing #2315 I already experimented with a simple clean-room implementation of a trie; I'll try to understand the differences.

@ggerganov and others: any opinions on this?

@klosax (Collaborator) commented Aug 7, 2023

The author of the gpt2 tokenizer gave permission to use it and stated that it is under the MIT license here: #2398 (comment)

@Igoorx (Author) commented Aug 7, 2023

The important part is the split method:

llama.cpp/llama-util.h

Lines 575 to 577 in 6f7daba

// Will look for the words added to the trie within `text`. Output is the boundaries of the words found.
// Note that this trie will match the longest possible word first!
std::vector<size_t> split(const std::string & text) const {

If using the ported version isn't an option, it would be necessary to reimplement it using the other trie algorithm.
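
For reference, a clean-room trie of this shape is fairly small. Here is a hypothetical sketch written from the description above, not ported from transformers:

```cpp
// Clean-room sketch: a trie that finds occurrences of the added words in a
// string and returns split boundaries, preferring the longest match.
#include <map>
#include <memory>
#include <string>
#include <vector>

struct trie_node {
    std::map<char, std::unique_ptr<trie_node>> children;
    bool is_end = false;
};

struct trie {
    trie_node root;

    void add(const std::string & word) {
        trie_node * node = &root;
        for (const char c : word) {
            auto & child = node->children[c];
            if (!child) child = std::make_unique<trie_node>();
            node = child.get();
        }
        node->is_end = true;
    }

    // Returns the boundaries of the words found within `text`,
    // matching the longest possible word first.
    std::vector<size_t> split(const std::string & text) const {
        std::vector<size_t> bounds = {0};
        size_t i = 0;
        while (i < text.size()) {
            const trie_node * node = &root;
            size_t best_end = 0; // end of the longest match starting at i
            for (size_t j = i; j < text.size(); ++j) {
                const auto it = node->children.find(text[j]);
                if (it == node->children.end()) break;
                node = it->second.get();
                if (node->is_end) best_end = j + 1;
            }
            if (best_end > 0) {
                if (bounds.back() != i) bounds.push_back(i);
                bounds.push_back(best_end);
                i = best_end;
            } else {
                ++i;
            }
        }
        if (bounds.back() != text.size()) bounds.push_back(text.size());
        return bounds;
    }
};
```

With `"</s>"` added, splitting `" One Two</s>Three"` would yield the boundaries `{0, 8, 12, 17}`, i.e. the segments `" One Two"`, `"</s>"` and `"Three"`.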

IANAL, but I'm equally concerned about compatibility with the sentencepiece license (Apache-2.0).

IANAL either, but I think Apache-2.0 and MIT should be compatible with each other: https://law.stackexchange.com/a/6732

@klosax (Collaborator) commented Aug 7, 2023

It looks like the MIT and Apache licenses are compatible, but a copy of the Apache license and a Notice file must be included:
https://softwareengineering.stackexchange.com/questions/51987/how-to-include-an-apache-library-with-my-opensource-code#52223

@ggerganov (Owner)

I'd recommend either implementing a trie from scratch or using a linear search algorithm; we are not tokenizing billions of tokens, so I'm not sure what we gain from using a trie.

@Igoorx (Author) commented Aug 8, 2023

I'd recommend either implementing a trie from scratch or using a linear search algorithm; we are not tokenizing billions of tokens, so I'm not sure what we gain from using a trie.

If you say so, then the trie probably really is a premature optimization in this case... I changed the code to use linear search.

@goerch mentioned this pull request Aug 8, 2023
@goerch (Collaborator) commented Aug 8, 2023

@Igoorx: I took a look at your PR, but I believe we first have to sort out the open problems with #2549. Sorry for the delay.

@goerch (Collaborator) commented Sep 19, 2023

@Igoorx: #2549 is done. I'm now waiting for review of #3252 and am already downloading Vicuna-7B-v1.1.

@goerch (Collaborator) commented Sep 20, 2023

Currently discussing this at #2820. Maybe close this one and participate over there?

@Igoorx (Author) commented Oct 12, 2023

Looks like this PR was superseded by #3538, from what I could see it looks great. Thanks for your attention @goerch! I don't think I have anything more to contribute.

@Igoorx closed this Oct 12, 2023
ggerganov added a commit that referenced this pull request Oct 17, 2023
* Rewrite special token handling from #1931

* shorten param name, add st verification by type

* use offsets instead of copy by substr

* formatting, remove copying iterator on delete

* llama : normalize code-style

* swift fix

* print pfx/sfx if verb, main: split pfx input sfx

* dont add space when using special tokens

* minor : comment + spacing

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>