
Fix the tokenizer #2023

Closed
slaren opened this issue Jun 27, 2023 · 6 comments
slaren (Collaborator) commented Jun 27, 2023

We should fix the issues found in the llama tokenizer by @vjeux. They are explained in detail here: #252 (comment)

Might be a good first issue.

@slaren slaren added the bug Something isn't working label Jun 27, 2023
JWNoctis commented Jun 28, 2023

Related to #1931 - Some other changes to EOS token behaviour, maybe hidden behind a command line switch, would be nice to have too.

Right now, models that use the EOS token as a separator (e.g. Vicuna 1.1/1.3 from lmsys, Nous-Hermes-13b) are somewhat broken: the current code doesn't tokenize EOS in the prompt, and replaces a generated EOS with a newline in interactive mode.
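To illustrate the first half of that (the prompt side), here is a minimal Python sketch, not llama.cpp code: all token ids, the toy vocab, and both function names are invented for illustration. A tokenizer that treats `</s>` as literal characters never emits the EOS id, while a special-token-aware one does:

```python
# Hypothetical sketch of why treating "</s>" as plain text breaks
# EOS-separated prompt formats. Ids and vocab are illustrative only.
EOS_ID = 2  # sentencepiece convention for LLaMA-family models

def naive_tokenize(text, vocab):
    # Tokenizes the literal characters, so "</s>" becomes ordinary
    # pieces and the EOS id never appears in the output.
    return [vocab[ch] for ch in text]

def special_aware_tokenize(text, vocab):
    # Splits out the special token first, then tokenizes the rest,
    # emitting the real EOS id between the turns.
    out = []
    for part in text.split("</s>"):
        out.extend(vocab[ch] for ch in part)
        out.append(EOS_ID)
    return out[:-1]  # drop the trailing separator

vocab = {ch: i + 100 for i, ch in enumerate("Hi</s>ok")}
prompt = "Hi</s>ok"
print(naive_tokenize(prompt, vocab))          # [100, 101, 102, 103, 104, 105, 106, 107] -- no EOS
print(special_aware_tokenize(prompt, vocab))  # [100, 101, 2, 106, 107] -- real EOS between turns
```

A model trained with EOS-separated turns sees the naive encoding as one run-on turn, which is the breakage described above.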

howard0su self-assigned this Jun 30, 2023
howard0su (Collaborator) commented:

I will try this over the weekend.

huichen commented Jul 4, 2023

Have you thought about supporting text normalization in the tokenizer, as described in https://github.com/google/sentencepiece/blob/master/doc/normalization.md ?

This is essential to get correct encoding (NFKC) for Unicode-heavy languages like Chinese.

For example, encoding and then decoding the fullwidth parenthesis '（' wouldn't yield the original string.
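A minimal, self-contained illustration of this (assuming the example is the fullwidth parenthesis U+FF08, common in Chinese text; this is plain Python, not llama.cpp code): NFKC folds the fullwidth character into the ASCII '(', so a tokenizer that NFKC-normalizes on encode cannot reproduce the original string on decode.

```python
# NFKC normalization is lossy for compatibility characters such as
# the fullwidth parenthesis （ (U+FF08): it is folded to ASCII '('.
import unicodedata

original = "\uFF08"                                   # （ fullwidth parenthesis
normalized = unicodedata.normalize("NFKC", original)  # "(" after folding
print(normalized == original)                         # False: round trip is lossy
```

This is exactly the trade-off sentencepiece documents: normalization improves token consistency but makes exact encode/decode round trips impossible for such characters.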

goerch (Collaborator) commented Sep 9, 2023

> Have you thought about supporting text normalization in tokenizer like https://github.com/google/sentencepiece/blob/master/doc/normalization.md ?

Spot on: somebody probably should have thought about this earlier. Not sure about the best way forward here.

> Might be a good first issue.

I don't think so.

goerch (Collaborator) commented Sep 16, 2023

@slaren: as soon as #3170 is merged, I'm happy (for now) with the character encoding/decoding behavior of the sentencepiece tokenizer for LLaMA, although we only check invariants, and I have only looked at literal representations sporadically. Do you see a need, or have ideas for how to improve test coverage?

@huichen: I just checked your example of '（' in test-tokenizer-0-llama and it seems to work for me. On the other hand, I could imagine deviations from the sentencepiece tokenizer for non-normalized input strings (even for normalized ones, as sentencepiece seems to do some kind of unique normalization according to the documentation you referenced). Do you have better examples?

@ggerganov: Should we test the character behavior of the Falcon tokenizer the same way as the Llama one? Do we have a strategy for coping with Unicode normalization, if necessary?
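The invariant checks mentioned above can be sketched roughly like this: a stand-in byte-level `encode`/`decode` pair (hypothetical names, not the real test-tokenizer-0-llama / test-tokenizer-1-llama.cpp code) and a round-trip assertion over a small corpus.

```python
# Round-trip invariant sketch: a byte-level tokenizer can promise
# decode(encode(s)) == s exactly; a normalizing tokenizer could only
# promise the weaker decode(encode(s)) == NFKC(s).
import unicodedata

def encode(text):
    # Stand-in for a tokenizer's encode: here, raw UTF-8 bytes as ids.
    return list(text.encode("utf-8"))

def decode(ids):
    # Stand-in for the matching decode.
    return bytes(ids).decode("utf-8")

for s in ["hello world", "（全角括号）", "llama 🦙"]:
    # Exact round trip holds for the byte-level stand-in, including
    # the fullwidth characters that NFKC would have folded away.
    assert decode(encode(s)) == s
print("all round trips ok")
```

Checking only such invariants is why deviations in literal token representations could still go unnoticed, as the comment above concedes.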

slaren (Collaborator, Author) commented Sep 16, 2023

@goerch If I am not mistaken, test-tokenizer-1-llama.cpp covers the issue described here very closely, so maybe this should be closed now? If there are still different issues with the llama tokenizer, then it would be better to open a new issue. I suspect that the biggest issue with the tokenizer at this point is the handling of special tokens, but I haven't been following the recent developments very closely.

slaren closed this as completed Sep 19, 2023