Improve support for special tokens #1931

Closed · wants to merge 13 commits

Conversation

@Igoorx (Author) commented Jun 19, 2023

Fixes #1812
Fixes #1501

Hello! This is my first attempt to contribute to this project, so I apologize in advance for any mistakes.
This PR adds basic support for special tokens and improves the support for added tokens. All special tokens come from the file added_tokens.json plus either special_tokens_map.json or tokenizer_config.json (I have no idea if it's safe to rely on only one, so I added tokenizer_config.json as a fallback).
EDIT: The loading of the JSON files now follows the huggingface implementation more closely.

The most important points of this PR are:

  • The GGML format was changed due to the requirement of a way to know which tokens are the special tokens.
    EDIT: This isn't necessary anymore.
  • The tokenizer now uses a trie algorithm to efficiently split the prompt on the special tokens. This was necessary because the BPE tokenizer isn't able to tokenize the special tokens by itself.
    Please note that this algorithm was ported from the huggingface/transformers repository, so I wonder if this could cause license issues?

    EDIT: The algorithm is now just a linear search (see the sketch below this list).
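
For illustration, here is a minimal sketch of the linear-search splitting idea. This is hypothetical code, not the PR's actual implementation; all names are made up:

```cpp
// Hypothetical sketch: split `text` into plain segments (to be BPE-tokenized)
// and special-token segments (to be mapped directly to their token ids).
#include <string>
#include <vector>

struct segment {
    std::string text;
    bool        is_special;
};

static std::vector<segment> split_on_special(
        const std::string & text,
        const std::vector<std::string> & specials) {
    std::vector<segment> out;
    size_t pos = 0;
    while (pos < text.size()) {
        // linear search: find the earliest (and, on ties, longest) special token
        size_t best_pos = std::string::npos;
        size_t best_len = 0;
        for (const auto & sp : specials) {
            const size_t p = text.find(sp, pos);
            if (p == std::string::npos) continue;
            if (p < best_pos || (p == best_pos && sp.size() > best_len)) {
                best_pos = p;
                best_len = sp.size();
            }
        }
        if (best_pos == std::string::npos) {
            out.push_back({text.substr(pos), false}); // no more special tokens
            break;
        }
        if (best_pos > pos) {
            out.push_back({text.substr(pos, best_pos - pos), false});
        }
        out.push_back({text.substr(best_pos, best_len), true});
        pos = best_pos + best_len;
    }
    return out;
}
```

The plain segments then go through the normal tokenizer, while each special segment is emitted as its token id directly, which is what produces the `</s>` entries in the --verbose-prompt output below.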

Using this PR, this is the output of --verbose-prompt:

main: prompt: ' One Two</s>Three</s> Four '
main: number of tokens in prompt = 8
     1 -> '<s>'
  3118 -> ' One'
  7803 -> ' Two'
     2 -> '</s>'
 28575 -> 'Three'
     2 -> '</s>'
 12458 -> ' Four'
 29871 -> ' '

when the model is converted to ggml with this special_tokens_map.json:

{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<unk>",
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
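
For what it's worth, the fallback logic described above might look roughly like this. This is a hedged sketch only, using nlohmann/json as an assumed dependency; the PR's actual loading happens at conversion time, and the names here are hypothetical:

```cpp
// Sketch: collect special tokens from special_tokens_map.json, falling back
// to tokenizer_config.json if the first file is missing (assumed behavior).
#include <fstream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

// An entry may be a plain string ("<unk>") or an AddedToken-style object
// with a "content" field, as in the special_tokens_map.json above.
static std::string token_content(const nlohmann::json & entry) {
    return entry.is_string() ? entry.get<std::string>()
                             : entry.at("content").get<std::string>();
}

static std::vector<std::string> load_special_tokens(const std::string & dir) {
    std::vector<std::string> specials;
    for (const char * name : {"special_tokens_map.json", "tokenizer_config.json"}) {
        std::ifstream f(dir + "/" + name);
        if (!f) continue; // try the fallback file instead
        const nlohmann::json j = nlohmann::json::parse(f);
        for (const char * key : {"bos_token", "eos_token", "unk_token", "pad_token"}) {
            if (j.contains(key) && !j[key].is_null()) {
                specials.push_back(token_content(j[key]));
            }
        }
        break; // the first file that exists wins
    }
    return specials;
}
```

Note how `"pad_token": "<unk>"` above is a plain string while the other entries are objects; any loader has to handle both shapes, which is what `token_content` does here.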

@Igoorx force-pushed the specialtokens branch 3 times, most recently from 6a8e3ff to 6c55fe1 on Jun 19, 2023
@KerfuffleV2 (Collaborator)

A breaking change to the GGML format might be a tough sell (but don't take my personal opinion as speaking for the project in any way). You might consider adding a command-line option and/or API addition that allows reading the special tokens from a separate file or list.

so I wonder if this could cause license issues?

llama.cpp is under MIT and transformers seems to be Apache 2.0. I'm not qualified to say what the issue is or how to fix it, but the projects do have different licenses. I don't know if there's any kind of policy for dealing with that situation already in place. Perhaps someone else can give you a better answer, but until that part is resolved I'd guess you shouldn't expect your PR to get merged (again, not speaking with any kind of authority).

@Igoorx (Author) commented Jun 19, 2023

A breaking change to the GGML format might be a tough sell (but don't take my personal opinion as speaking for the project in any way).
You might consider adding a command-line option and/or API addition that allows reading the special tokens from a separate file or list.

On the bright side, the old format is still supported... But I agree with you, and it's also something I would want to avoid. The thing is, after looking at the code, everything related to the vocab seems to be integrated inside the GGML file (e.g. the added_tokens.json file and the whole vocab itself), so I thought the only right choice was to include the list of special tokens too.
I don't know what the maintainers of the project would prefer though, so I'm open to making any changes.

llama.cpp is under MIT and transformers seems to be Apache 2.0. I'm not qualified to say what the issue is or how to fix it, but the projects do have different licenses. I don't know if there's any kind of policy for dealing with that situation already in place.

Yes, this is something that concerns me a bit. I would appreciate feedback on whether I should be concerned about the license; after all, I'm not simply copy-pasting the code. Perhaps the comment I added with the link to the original implementation would be sufficient?
Nevertheless, even if the license should be respected to the fullest, I am aware that Apache-2.0 is compatible with MIT, so I believe there should be no issue with including that code in the repository. However, I also need feedback on this matter. This seems to be the first time this issue has arisen here, so I'm not sure whether the maintainers would want to add the Hugging Face license file specifically for this piece of code. If they do, I'm not sure where the license should go; perhaps just appended to the LICENSE file?

Perhaps someone else can give you a better answer, but until that part is resolved I'd guess you shouldn't expect your PR to get merged (again, not speaking with any kind of authority).

You're absolutely right, no doubt about that; that's why I brought it up in the PR message. This is quite important for an open-source project, so it needs to be sorted out before the PR is merged.

@bullno1 (Contributor) left a comment

AFAIK, there is a discussion on a new format here: ggerganov/ggml#220

You may want to chime in.

llama-util.h: two review comment threads (outdated, resolved)
@Igoorx (Author) commented Jun 20, 2023

AFAIK, there is a discussion on a new format here: ggerganov/ggml#220

You may want to chime in.

I don't believe I have anything to contribute to the discussion... but upon a closer look, it appears there is also a discussion regarding the tokenizer in that issue 🤔
I'm uncertain whether there is a possibility of this PR being merged or if the maintainers would prefer to wait for GGUF.
I suppose I will change the approach of this PR and alter the GGML format in a way that keeps the new model format compatible with older versions. That way, even with doubts about when GGUF will be merged, this pull request could serve as a temporary solution to the special-token issue until then.

@grantbey commented Aug 7, 2023

Hey @Igoorx what's the status of this PR? I'm really interested in this work.

I fine-tune using LoRA and add a few special tokens, but these aren't tokenized correctly when running inference with llama.cpp. I'm going to try your PR and see if it helps.

@Igoorx (Author) commented Aug 7, 2023

@grantbey It's finished, but since the maintainers showed no interest whatsoever in merging it, I didn't resolve the merge conflicts. If you can't do that yourself, you should just wait for GGUF; it's right around the corner: #2398

@grantbey commented Aug 7, 2023 via email

@Igoorx (Author) commented Aug 7, 2023

@grantbey I rebased the PR to the last master commit 👍

@grantbey commented Aug 7, 2023

Ok wow @Igoorx you're amazing. Thank you so much!

@klosax mentioned this pull request Aug 7, 2023
@goerch (Collaborator) commented Aug 7, 2023

@Igoorx: @klosax just made me aware that we are working on weaknesses of the tokenizer(s) at the same time. I'd greatly appreciate any cooperation on this. If I were to incorporate your changes, my first question would be: how do I test them?

@Igoorx (Author) commented Aug 7, 2023

@goerch
Here is a simple test: 6f7daba

@goerch (Collaborator) commented Aug 7, 2023

@goerch Here is a simple test: 6f7daba

Nice, thank you! Then my plan would be to try to first integrate your changes into #2315 and afterwards migrate the PR over to gguf. Would that be OK for you?

@Igoorx (Author) commented Aug 7, 2023

@goerch Yeah, that's fine. You just have to be aware of this:

llama.cpp/llama-util.h

Lines 554 to 555 in 6f7daba

// Trie in C++. Creates a Trie out of a list of words. The trie is used to split on multiple delimiters in one pass
// Ported from: https://github.com/huggingface/transformers/blob/ee88ae59940fd4b2c8fc119373143d7a1175c651/src/transformers/tokenization_utils.py#L52

The trie algorithm used in the PR is a port from the huggingface repository, as noted in that comment, so maybe something needs to be done about it. I'm not sure whether the comment is enough or whether it would be necessary to add the HF license somewhere.

@klosax (Collaborator) commented Aug 7, 2023

The gguf gpt2 tokenizer also has a Trie implementation. That tokenizer is under the MIT license; maybe it could be reused for the llama tokenizer.

@goerch (Collaborator) commented Aug 7, 2023

@goerch Yeah, that's fine. You just have to be aware of this: ...

IANAL, but I'm equally concerned about compatibility with the sentencepiece license (Apache-2.0). They use a trie too, which might be the basis of the HF implementation. When developing #2315 I already experimented with a simple clean-room implementation of a trie; I'll try to understand the differences.

@ggerganov and others: any opinions on this?

@klosax (Collaborator) commented Aug 7, 2023

The author of the gpt2 tokenizer gave permission to use it and stated that it is under the MIT license here: #2398 (comment)

@Igoorx (Author) commented Aug 7, 2023

The important part is the split method:

llama.cpp/llama-util.h

Lines 575 to 577 in 6f7daba

// Will look for the words added to the trie within `text`. Output is the boundaries of the words found.
// Note that this trie will match the longest possible word first!
std::vector<size_t> split(const std::string & text) const {

If using the ported version isn't an option, it would be necessary to reimplement it using the other trie algorithm.
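
For reference, a clean-room trie of this shape is fairly small. Here is a hypothetical sketch written from the description above, not ported from transformers:

```cpp
// Clean-room sketch: a trie that finds occurrences of the added words in a
// string and returns split boundaries, preferring the longest match.
#include <map>
#include <memory>
#include <string>
#include <vector>

struct trie_node {
    std::map<char, std::unique_ptr<trie_node>> children;
    bool is_end = false;
};

struct trie {
    trie_node root;

    void add(const std::string & word) {
        trie_node * node = &root;
        for (const char c : word) {
            auto & child = node->children[c];
            if (!child) child = std::make_unique<trie_node>();
            node = child.get();
        }
        node->is_end = true;
    }

    // Returns the boundaries of the words found within `text`,
    // matching the longest possible word first.
    std::vector<size_t> split(const std::string & text) const {
        std::vector<size_t> bounds = {0};
        size_t i = 0;
        while (i < text.size()) {
            const trie_node * node = &root;
            size_t best_end = 0; // end of the longest match starting at i
            for (size_t j = i; j < text.size(); ++j) {
                const auto it = node->children.find(text[j]);
                if (it == node->children.end()) break;
                node = it->second.get();
                if (node->is_end) best_end = j + 1;
            }
            if (best_end > 0) {
                if (bounds.back() != i) bounds.push_back(i);
                bounds.push_back(best_end);
                i = best_end;
            } else {
                ++i;
            }
        }
        if (bounds.back() != text.size()) bounds.push_back(text.size());
        return bounds;
    }
};
```

With `"</s>"` added, splitting `" One Two</s>Three"` would yield the boundaries `{0, 8, 12, 17}`, i.e. the segments `" One Two"`, `"</s>"` and `"Three"`.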

IANAL, but I'm equally concerned about compatibility with the sentencepiece license (Apache-2.0).

IANAL either, but I think Apache-2.0 and MIT should be compatible with each other: https://law.stackexchange.com/a/6732

@klosax (Collaborator) commented Aug 7, 2023

It looks like the MIT and Apache licenses are compatible, but a copy of the Apache license and a Notice file must be included:
https://softwareengineering.stackexchange.com/questions/51987/how-to-include-an-apache-library-with-my-opensource-code#52223

@ggerganov (Owner)

I'd recommend either implementing a trie from scratch or using a linear search algorithm; we are not tokenizing billions of tokens, so I'm not sure what we gain from using a trie.

@Igoorx (Author) commented Aug 8, 2023

I'd recommend either implementing a trie from scratch or using a linear search algorithm; we are not tokenizing billions of tokens, so I'm not sure what we gain from using a trie.

If you say so, then the trie probably really is a premature optimization in this case... I changed the code to use linear search.

@goerch mentioned this pull request Aug 8, 2023
@goerch (Collaborator) commented Aug 8, 2023

@Igoorx: I took a look at your PR, but I believe we first have to sort out the open problems with #2549. Sorry for the delay.

@goerch (Collaborator) commented Sep 19, 2023

@Igoorx: #2549 is done. I'm now waiting for review of #3252 and am already downloading Vicuna-7B-v1.1.

@goerch (Collaborator) commented Sep 20, 2023

Currently discussing this at #2820. Maybe close this one and participate over there?

@Igoorx (Author) commented Oct 12, 2023

Looks like this PR was superseded by #3538, from what I could see it looks great. Thanks for your attention @goerch! I don't think I have anything more to contribute.

@Igoorx closed this Oct 12, 2023
ggerganov added a commit that referenced this pull request Oct 17, 2023
* Rewrite special token handling from #1931

* shorten param name, add st verification by type

* use offsets instead of copy by substr

* formatting, remove copying iterator on delete

* llama : normalize code-style

* swift fix

* print pfx/sfx if verb, main: split pfx input sfx

* dont add space when using special tokens

* minor : comment + spacing

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>