Add special token modification capability #7166

Merged
18 commits merged on May 9, 2024

Conversation


@CISC CISC commented May 9, 2024

To make it possible to fix or amend special tokens in a GGUF, this adds two new arguments:

  • --special-token <name> <value>, where <name> can be bos, eos, prefix, middle, etc. and <value> is the token text, e.g. "<|fim▁begin|>"
  • --special-token-by-id <name> <id>, where <id> is the ID of the token, e.g. 32006
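
As a rough illustration of how two repeatable, two-value options like these can be parsed, here is a minimal `argparse` sketch. It is not the code from this PR: `build_parser` and `parse_special_tokens` are hypothetical helpers, and the actual script may declare its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the two options described above."""
    parser = argparse.ArgumentParser(description="Sketch of the special-token options")
    parser.add_argument("input", help="input GGUF file")
    parser.add_argument("output", help="output GGUF file")
    # action="append" with nargs=2 stores every occurrence as a [name, value] pair.
    parser.add_argument("--special-token", action="append", nargs=2,
                        metavar=("name", "value"), default=None)
    parser.add_argument("--special-token-by-id", action="append", nargs=2,
                        metavar=("name", "id"), default=None)
    return parser


def parse_special_tokens(argv):
    """Return (name -> token text, name -> token ID) from the command line."""
    args = build_parser().parse_args(argv)
    by_value = {name: value for name, value in (args.special_token or [])}
    # IDs arrive as strings, so convert them explicitly.
    by_id = {name: int(token_id) for name, token_id in (args.special_token_by_id or [])}
    return by_value, by_id


if __name__ == "__main__":
    print(parse_special_tokens([
        "input.gguf", "output.gguf",
        "--special-token", "prefix", "<fim_prefix>",
        "--special-token-by-id", "eos", "32006",
    ]))
```

Collecting the pairs with `action="append"` lets the same option be given several times on one command line, which is what the examples in this description rely on.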

For example, to add fill-in-middle tokens to a GGUF you would run:

gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<|fim▁begin|>" --special-token middle "<|fim▁end|>" --special-token suffix "<|fim▁hole|>"

(Yes, fim_end is the middle token, because a completion is a prefix/suffix/middle sequence in which middle is left unfilled.)
or

gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<fim_prefix>" --special-token middle "<fim_middle>" --special-token suffix "<fim_suffix>"

etc...

NB: The tokens have to exist already. A non-existent token name or ID is ignored (with a warning), while a non-existent token value fails (with an error).
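
The warn-versus-fail behaviour in the note above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation in this PR: `resolve_special_tokens`, `KNOWN_NAMES`, and the `tokenizer.ggml.<name>_token_id` key pattern are assumptions made for the example, and the vocabulary is modelled as a plain list of token strings.

```python
import logging

logger = logging.getLogger("special-token-sketch")

# Assumption: the set of accepted short names; the real script may accept others.
KNOWN_NAMES = {"bos", "eos", "unk", "sep", "pad", "prefix", "middle", "suffix"}


def metadata_key(name: str) -> str:
    # Assumption: short names map onto metadata keys following this pattern.
    return f"tokenizer.ggml.{name}_token_id"


def resolve_special_tokens(vocab, by_value, by_id):
    """Turn requested overrides into a {metadata key: token ID} dict.

    Unknown names and out-of-range IDs are skipped with a warning;
    a token text missing from the vocabulary raises ValueError.
    """
    token_to_id = {token: index for index, token in enumerate(vocab)}
    updates = {}
    for name, value in by_value.items():
        if name not in KNOWN_NAMES:
            logger.warning("Unknown special token %r, ignoring", name)
            continue
        if value not in token_to_id:
            raise ValueError(f"Token {value!r} not found in vocabulary")
        updates[metadata_key(name)] = token_to_id[value]
    for name, token_id in by_id.items():
        if name not in KNOWN_NAMES:
            logger.warning("Unknown special token %r, ignoring", name)
            continue
        if not 0 <= token_id < len(vocab):
            logger.warning("Token ID %d is not in the vocabulary, ignoring", token_id)
            continue
        updates[metadata_key(name)] = token_id
    return updates
```

Failing hard on a missing token text, while only warning on an unknown name or ID, matches the behaviour described in the note: a mistyped token string would otherwise silently leave the model without the intended special token.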

CISC added 17 commits April 20, 2024 08:33
To be able to fix/amend special tokens in a GGUF let's add two new arguments:
* `--special-token <name> <value>` where `<name>` can be bos, eos, prefix, middle, etc. while `<value>` is the token value, f.ex. `"<|fim▁begin|>"`
* `--special-token-by-id <name> <id>` where `<id>` is the ID of the token, f.ex. 32006

So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following:
```bash
python3 gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<|fim▁begin|>" --special-token middle "<|fim▁hole|>" --special-token suffix "<|fim▁end|>"
```

CISC commented May 9, 2024

Messed up merge of #6778 so just a resubmission.


CISC commented May 9, 2024

Uhm, not sure what's going on with the failed ios/tvos checks?


CISC commented May 9, 2024

@ggerganov Ooops, was just making a minor tweak, nothing earthshattering. :)

@CISC CISC requested a review from ggerganov May 9, 2024 10:25
@ggerganov ggerganov merged commit 2284216 into ggerganov:master May 9, 2024
21 checks passed
@CISC CISC deleted the modify-special-tokens-metadata branch May 9, 2024 10:57
@mofosyne mofosyne added the "Review Complexity : Medium" (generally requires more time to grok but manageable by beginner to medium expertise level) and "enhancement" (new feature or request) labels May 10, 2024