Add Unigram tokenizer needed by T5 and FLAN-T5 model families #8089

Merged · 5 commits · Jun 25, 2024

Conversation

fairydreaming (Collaborator)

This is the second PR in a series of PRs adding support for the T5 and FLAN-T5 model families.

This PR adds an implementation of the Unigram tokenizer used in the T5 and FLAN-T5 models. It also adds the T5 model architecture, tensors, and model header parameters so that the tokenizer can be tested with the llama-tokenize command.
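
For readers unfamiliar with the algorithm, here is a minimal, self-contained sketch of the core idea behind Unigram tokenization: a Viterbi search for the piece sequence with the highest total score. This is illustrative only and is not the PR's code; the actual llm_tokenizer_ugm additionally handles normalization (including the precompiled_charsmap) and UNK handling, among other details, and all names below are invented for the example.

    // Toy Unigram (Viterbi) segmentation over a hand-written vocabulary.
    #include <cfloat>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    int main() {
        // toy vocabulary: piece -> log-probability score (higher is better)
        const std::unordered_map<std::string, float> vocab = {
            {"hel", -2.5f}, {"lo", -2.0f}, {"hello", -5.5f},
            {"h", -4.0f}, {"e", -4.0f}, {"l", -4.0f}, {"o", -4.0f},
        };
        const std::string input = "hello";

        // dp[i] = best way to tokenize the first i bytes of the input
        struct best { int prev; float score; std::string piece; };
        std::vector<best> dp(input.size() + 1, {-1, -FLT_MAX, ""});
        dp[0].score = 0.0f; // the empty prefix has score 0

        // forward pass: extend every reachable prefix with every vocabulary piece
        for (size_t end = 1; end <= input.size(); ++end) {
            for (size_t start = 0; start < end; ++start) {
                if (dp[start].score == -FLT_MAX) continue; // unreachable prefix
                const auto it = vocab.find(input.substr(start, end - start));
                if (it == vocab.end()) continue;
                const float score = dp[start].score + it->second;
                if (score > dp[end].score) dp[end] = { (int) start, score, it->first };
            }
        }

        // backtrack from the end of the input to recover the best segmentation
        std::vector<std::string> pieces;
        for (int pos = (int) input.size(); pos > 0; pos = dp[pos].prev) {
            pieces.push_back(dp[pos].piece);
        }
        for (auto it = pieces.rbegin(); it != pieces.rend(); ++it) {
            printf("%s ", it->c_str()); // prints: hel lo
        }
        printf("\n");
        return 0;
    }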

@mofosyne added the Review Complexity : Medium label (generally requires more time to grok but manageable by beginner to medium expertise level) on Jun 24, 2024
llama : fix preventing crashes when precompiled_charsmap is not present
llama.cpp Outdated
Comment on lines 4977 to 4979
    vocab.n_precompiled_charsmap = gguf_get_arr_n(ctx, precompiled_charsmap_keyidx);
    vocab.precompiled_charsmap = (char *) malloc(vocab.n_precompiled_charsmap);
    memcpy((void*) vocab.precompiled_charsmap, gguf_get_arr_data(ctx, precompiled_charsmap_keyidx), vocab.n_precompiled_charsmap);
Owner

There's a memory leak here. Use std::vector<char> instead of:

    uint32_t n_precompiled_charsmap = 0;
    char * precompiled_charsmap = NULL;
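
A minimal sketch of what the vector-based loading could look like (the member and variable names are assumptions for illustration, not the final commit):

    // in llama_vocab, replacing the raw pointer + size pair (assumed member):
    //     std::vector<char> precompiled_charsmap;

    // in llm_load_vocab: copy the GGUF array into the vector; the vector owns the
    // memory, so nothing has to be freed manually and the leak disappears
    const size_t n_charsmap = gguf_get_arr_n(ctx, precompiled_charsmap_keyidx);
    const char * charsmap   = (const char *) gguf_get_arr_data(ctx, precompiled_charsmap_keyidx);
    vocab.precompiled_charsmap.assign(charsmap, charsmap + n_charsmap);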

fairydreaming (Collaborator, Author) Jun 25, 2024

Good catch! I replaced it with a vector as suggested, but I had to move the endianness-correction code from llm_tokenizer_ugm to llm_load_vocab: the reference to vocab is const in the tokenizer, so manipulating the precompiled_charsmap vector buffer there would require const casts.

@fairydreaming merged commit 6fcbf68 into ggerganov:master on Jun 25, 2024
63 checks passed
Comment on lines +14067 to +14070
        // initialize score_sum to -FLT_MAX so it will be always lower than sums of token scores
        std::vector<struct best_tokenization> tokenization_results(input_len + 1, {0, 0, -FLT_MAX});
        // at the beginning tokenization score is zero
        tokenization_results[0] = { 0, 0, 0 };
Owner

Is this supposed to be:

        // initialize score_sum to -FLT_MAX so it will be always lower than sums of token scores
        std::vector<struct best_tokenization> tokenization_results(input_len + 1, {vocab.special_unk_id, 0, -FLT_MAX});
        // at the beginning tokenization score is zero
        tokenization_results[0] = { vocab.special_unk_id, 0, 0 };

Currently, a string consisting of a single space character tokenizes to a single PAD token [0], while the AutoTokenizer returns an empty array of tokens in this case. With the change above, llama.cpp returns a single UNK token [2] instead, which is still incorrect, or at least does not match the AutoTokenizer result.

fairydreaming (Collaborator, Author) Jul 2, 2024

To fix this problem, I added a commit to PR #8141 that introduces an early return in the tokenizer when the normalized input is empty, matching the behavior of the SentencePiece implementation: 78675f3
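
For context, the referenced commit boils down to an early return at the start of the UGM tokenize path, roughly like the sketch below (names assumed, not the actual diff):

    std::string normalized;
    normalize(text, &normalized);
    // SentencePiece produces no tokens for an input that normalizes to the empty
    // string, so bail out before running the Viterbi search
    if (normalized.empty()) {
        return;
    }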
