
llama : fix bpe tokenize from byte #2889

Merged · 1 commit · Sep 3, 2023

Conversation

@opparco (Contributor) commented Aug 30, 2023

In llm_tokenizer_bpe::tokenize, if the lookup in vocab.token_to_id fails, llama_byte_to_token(vocab, ch) should be tried as a fallback.
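
A minimal sketch of that fallback, assuming the lookup loop in llm_tokenizer_bpe::tokenize (the BPE merge step itself is omitted):

    // look up the merged symbol in the vocab
    const auto token = vocab.token_to_id.find(str);
    if (token != vocab.token_to_id.end()) {
        output.push_back(token->second);
    } else {
        // fallback: emit one byte token (e.g. <0x0A> for '\n') per byte of the symbol
        for (const char ch : str) {
            output.push_back(llama_byte_to_token(vocab, ch));
        }
    }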

@opparco (Contributor, Author) commented Aug 30, 2023

Convert hf elyza/ELYZA-japanese-Llama-2-7b-fast to gguf and run main:
...
DEBUG: string not found in vocab: '
'
DEBUG: char token found in vocab: '<0x0A>' 13
llm_load_print_meta: format         = GGUF V1 (support until nov 2023)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = BPE
llm_load_print_meta: n_vocab        = 45043
llm_load_print_meta: n_merges       = 64040
...
llm_load_print_meta: LF token  = 13 '<0x0A>'  # LF token detected correctly
...
DEBUG: string not found in vocab: '
'
DEBUG: char token found in vocab: '<0x0A>' 13
DEBUG: string not found in vocab: ' '
DEBUG: char token found in vocab: '<0x20>' 35

@KerfuffleV2 (Collaborator) commented Aug 30, 2023

This helps, but it seems like there are still pretty serious issues with BPE handling. For BPE, convert.py adds all tokens as type USER_DEFINED, but llama_token_to_piece_with_model in common/common.cpp doesn't handle that case, so the content of the tokens is always a blank string.

I tried changing:

llama.cpp/llama.cpp, lines 6155 to 6157 in ad9ddcf:

    int llama_token_to_piece_with_model(const struct llama_model * model, llama_token token, char * buf, int length) {
        if (0 <= token && token < llama_model_n_vocab(model)) {
            if (llama_is_normal_token(model->vocab, token)) {

to

if (llama_is_normal_token(model->vocab, token)
    || llama_is_user_defined_token(model->vocab, token)) {

which helps, but since everything is marked as user-defined, things like byte tokens still don't get decoded. I'd guess convert.py is the part that is most wrong, but it also seems like llama_token_to_piece needs to do something other than return a blank string for user-defined tokens.
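
For illustration only, a hypothetical sketch of the direction that could take (not a proposed fix), assuming the vocab stores the raw token text in model->vocab.id_to_token[token].text:

    int llama_token_to_piece_with_model(const struct llama_model * model, llama_token token, char * buf, int length) {
        if (0 <= token && token < llama_model_n_vocab(model)) {
            if (llama_is_normal_token(model->vocab, token) ||
                llama_is_user_defined_token(model->vocab, token)) {
                // copy the stored token text instead of returning an empty piece
                const std::string & result = model->vocab.id_to_token[token].text;
                if (length < (int) result.length()) {
                    return -(int) result.length();
                }
                memcpy(buf, result.data(), result.length());
                return (int) result.length();
            }
            // byte tokens and control tokens would still need their own handling
        }
        return 0;
    }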

@ggerganov (Owner) left a comment


This wasn't needed in the reference implementation, which is supposed to be 100% correct:

https://github.com/cmp-nct/ggllm.cpp/blob/2b487f2e1b7284847bcc1f8a368604f356815098/libfalcon.cpp#L2737-L2751

I think this is a workaround for the missing Unicode support in llama.cpp's BPE tokenizer.
Is it better to add this sort of patch, or should we try to implement a 100% correct BPE tokenizer? For the latter to happen, we would need help from someone with enough knowledge to implement it in a compact way.

@KerfuffleV2 (Collaborator) commented:

> Is it better to add this sort of patch, or should we try to implement a 100% correct BPE tokenizer?

From what you said over here (#2938 (review)), it seems like an argument against the workaround. Basically, at least for the BPE models I tested, the workaround got them a little closer to doing something other than just crashing, but they were still pretty far from producing usable output. I might be wrong, but my impression is that without actually supporting the Unicode handling it won't be possible to really use them, with or without the workaround.

@opparco Seems like the model you were testing with is ELYZA-japanese-Llama-2-7b-fast. With your patch here, do you actually get good/usable results from it?

@opparco (Contributor, Author) commented Sep 2, 2023

The byte-fallback feature is an option in the training phase, but it is mandatory for byte-level BPE.

In google/sentencepiece:
sentencepiece supports the byte-fallback feature: google/sentencepiece#621 (comment)

In huggingface/tokenizers:
https://github.com/huggingface/tokenizers/blob/main/bindings/python/py_src/tokenizers/models/__init__.pyi#L110

class BPE(Model):
    """
    An implementation of the BPE (Byte-Pair Encoding) algorithm
...
        byte_fallback (:obj:`bool`, `optional`):
            Whether to use spm byte-fallback trick (defaults to False)
    """

Implementation:
https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/decoders/byte_fallback.rs
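
For a rough illustration of what byte-fallback decoding does (this is a hypothetical standalone helper, not the huggingface/tokenizers code): pieces of the form <0xNN> are turned back into the raw byte 0xNN, and anything else passes through unchanged.

    #include <cstdio>
    #include <cstdlib>
    #include <string>

    // hypothetical helper: map a piece like "<0x0A>" back to the raw byte 0x0A
    static std::string decode_byte_fallback(const std::string & piece) {
        if (piece.size() == 6 && piece.compare(0, 3, "<0x") == 0 && piece.back() == '>') {
            const long byte = strtol(piece.substr(3, 2).c_str(), nullptr, 16);
            return std::string(1, (char) byte);
        }
        return piece; // not a byte token: pass through unchanged
    }

    int main() {
        printf("%s", decode_byte_fallback("<0x0A>").c_str());  // prints a newline
        printf("%s\n", decode_byte_fallback("hello").c_str()); // prints "hello"
        return 0;
    }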

From sentencepiece bpe compatible tokenizer #252:

        for (int i = 0; i != -1; i = symbols_[i].next) {
            auto& symbol = symbols_[i];
            auto token = vocab_.token_to_id.find(std::string(symbol.text));

            if (token == vocab_.token_to_id.end()) {
                // output any symbols that did not form tokens as bytes.
                for (int j = 0; j < symbol.text.size(); ++j) {
                    gpt_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;
                    output.push_back(token_id);
                }
            } else {
                output.push_back((*token).second);
            }
        }

This code maps bytes to ids correctly (byte value + 3, because the first three ids are reserved for special tokens):

<0x0A> -> 13  (0x0A = 10, 10 + 3 = 13)
<0x20> -> 35  (0x20 = 32, 32 + 3 = 35)

libfalcon.cpp does not have the byte-fallback feature, so it breaks on such input.

@ggerganov (Owner) left a comment


@opparco

Thank you for the explanation. So it seems that if byte-fallback was enabled during training, we do need this patch.
I guess it is OK to merge then.

One more question: if we use the byte-fallback feature with a model that did not have it enabled during training, would the tokenization end up wrong? Or would we never even hit the byte-fallback branch?
My guess is the latter, but I just want to confirm.

@opparco (Contributor, Author) commented Sep 3, 2023

Even in the current code, there are invalid input prompts that reach the ERROR branch:

                const auto token = vocab.token_to_id.find(str);

                if (token == vocab.token_to_id.end()) {
                    for (auto j = str.begin(); j != str.end(); ++j) {
                        std::string byte_str(1, *j);
                        auto token_multibyte = vocab.token_to_id.find(byte_str);
                        if (token_multibyte == vocab.token_to_id.end()) {
                            fprintf(stderr,"ERROR: byte not found in vocab: '%s'\n", byte_str.c_str());
                        }
                        output.push_back((*token_multibyte).second);
                    }
                } else {
                    output.push_back((*token).second);
                }

Even if the ERROR branch is reached with a model trained with byte-fallback = false, the byte-fallback feature is not related in that case.

@ggerganov merged commit 3730134 into ggerganov:master on Sep 3, 2023 (25 checks passed).
@opparco deleted the fix-bpe-tokenize-from-byte branch on September 3, 2023 at 12:09.
Sam2much96 pushed a commit to Sam2much96/llama.cpp that referenced this pull request Sep 11, 2023