fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false #5487

vriesdemichael · 2024-02-14T13:22:15Z

This fixes one of the problems in #5040.

The add_bos_token and add_eos_token config params determine whether the corresponding special tokens should be added to the sequence automatically. They are described at https://huggingface.co/docs/transformers/en/model_doc/llama2#transformers.LlamaTokenizer.add_bos_token.

Regardless of add_bos_token and add_eos_token, the special tokens should be added to the metadata, as they can still be used in the chat_template.
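To illustrate the intended behavior, here is a minimal sketch, not the actual gguf-py code. The function name `collect_special_tokens`, the `SPECIAL_TOKEN_TYPES` tuple, and the shape of the inputs are assumptions made for illustration. The point is only that the `add_<token>_token` flag is recorded separately and never causes the token id itself to be skipped.

```python
from typing import Any

# Assumed list of token types, for illustration only.
SPECIAL_TOKEN_TYPES = ("bos", "eos", "unk", "sep", "pad")


def collect_special_tokens(
    tokenizer_config: dict[str, Any],
    added_tokens: list[dict[str, Any]],
) -> tuple[dict[str, int], dict[str, bool]]:
    """Return (token ids, add_<token>_token flags) from HF-style config data.

    Hypothetical helper: the flags are stored next to the ids, but a flag
    set to False does not prevent the id from being collected.
    """
    content_to_id = {tok["content"]: tok["id"] for tok in added_tokens}
    token_ids: dict[str, int] = {}
    add_flags: dict[str, bool] = {}

    for typ in SPECIAL_TOKEN_TYPES:
        flag = tokenizer_config.get(f"add_{typ}_token")
        if isinstance(flag, bool):
            add_flags[typ] = flag  # remember the flag, but keep going

        entry = tokenizer_config.get(f"{typ}_token")
        if isinstance(entry, str):
            content = entry
        elif isinstance(entry, dict) and isinstance(entry.get("content"), str):
            content = entry["content"]
        else:
            continue  # this token type is not defined at all

        if content in content_to_id:
            token_ids[typ] = content_to_id[content]

    return token_ids, add_flags
```

With `add_bos_token` set to false in the config, the bos id still ends up in the result, so a chat_template that refers to it keeps working.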

slaren · 2024-02-14T13:50:26Z

Should the list of special_token_types also be updated to include all the token types listed in #5040 (comment)? Otherwise only these will be exported:

https://github.com/ggerganov/llama.cpp/blob/f428652b2ab852fe84f39d6877a8f38d63fa4df7/gguf-py/gguf/vocab.py#L29-L32

vriesdemichael · 2024-02-14T15:34:28Z

Good idea, but I guess that would also require adding functions here
https://github.com/ggerganov/llama.cpp/blob/f428652b2ab852fe84f39d6877a8f38d63fa4df7/gguf-py/gguf/gguf_writer.py#L402-L417

and constants here
https://github.com/ggerganov/llama.cpp/blob/f428652b2ab852fe84f39d6877a8f38d63fa4df7/gguf-py/gguf/constants.py#L64-L81

I'll gladly add them later on.
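As a rough sketch of why three places would need to change together: each new token type needs a metadata key, a writer method, and an entry in the list of types. Everything below is hypothetical and self-contained. `SketchWriter`, `write_special_tokens`, and the key strings are illustrative assumptions, not the real gguf-py classes or constants.

```python
# Illustrative key names only; the real constants live in constants.py.
KEY_CLS_ID = "tokenizer.ggml.cls_token_id"
KEY_MASK_ID = "tokenizer.ggml.mask_token_id"


class SketchWriter:
    """Stand-in for a metadata writer; stores key/value pairs in a dict."""

    def __init__(self) -> None:
        self.metadata: dict[str, int] = {}

    def add_cls_token_id(self, token_id: int) -> None:
        self.metadata[KEY_CLS_ID] = token_id

    def add_mask_token_id(self, token_id: int) -> None:
        self.metadata[KEY_MASK_ID] = token_id


def write_special_tokens(writer: SketchWriter, token_ids: dict[str, int]) -> list[str]:
    """Write each token id through a matching add_<type>_token_id method.

    Returns the token types that had no writer method, which is what
    happens when the type list is extended without adding the functions.
    """
    skipped: list[str] = []
    for typ, token_id in token_ids.items():
        handler = getattr(writer, f"add_{typ}_token_id", None)
        if handler is None:
            skipped.append(typ)
            continue
        handler(token_id)
    return skipped
```

A type that appears in the list but has no matching writer method is reported as skipped instead of being written, which is the gap described above.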

gguf-py/gguf/vocab.py

ggerganov · 2024-02-15T09:15:07Z

The lint check needs a fix too.

…oken is set to false (ggml-org#5487) * fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false * fix(gguf-py): added missing cls and mask token ids to the gguf metadata

fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false

f428652

ggerganov reviewed Feb 15, 2024

vriesdemichael force-pushed the fix/5040-eos-bos-token-missing-in-gguf branch 2 times, most recently from b656423 to 10f9651 Compare February 15, 2024 09:12

ggerganov approved these changes Feb 15, 2024

ggerganov requested a review from slaren February 15, 2024 09:14

fix(gguf-py): added missing cls and mask token ids to the gguf metadata

3b5dc11

vriesdemichael force-pushed the fix/5040-eos-bos-token-missing-in-gguf branch from 10f9651 to 3b5dc11 Compare February 15, 2024 09:36

slaren approved these changes Feb 15, 2024

slaren merged commit 7312247 into ggml-org:master Feb 15, 2024
