Add Unigram tokenizer needed by T5 and FLAN-T5 model families #8089

Merged · 5 commits · Jun 25, 2024

Conversation

fairydreaming (Collaborator)

This is the second PR in a series of PRs adding support for the T5 and FLAN-T5 model families.

This PR adds an implementation of the Unigram tokenizer used in the T5 and FLAN-T5 models. It also adds the T5 model architecture, tensors, and model header parameters so that the tokenizer can be tested with the llama-tokenize command.
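
For readers unfamiliar with the algorithm, here is a minimal, self-contained sketch of the core idea behind Unigram tokenization: a Viterbi search for the piece sequence with the highest total score. This is illustrative only and is not the PR's code; the actual llm_tokenizer_ugm additionally handles normalization (including the precompiled_charsmap) and UNK handling, among other details, and all names below are invented for the example.

    // Toy Unigram (Viterbi) segmentation over a hand-written vocabulary.
    #include <cfloat>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    int main() {
        // toy vocabulary: piece -> log-probability score (higher is better)
        const std::unordered_map<std::string, float> vocab = {
            {"hel", -2.5f}, {"lo", -2.0f}, {"hello", -5.5f},
            {"h", -4.0f}, {"e", -4.0f}, {"l", -4.0f}, {"o", -4.0f},
        };
        const std::string input = "hello";

        // dp[i] = best way to tokenize the first i bytes of the input
        struct best { int prev; float score; std::string piece; };
        std::vector<best> dp(input.size() + 1, {-1, -FLT_MAX, ""});
        dp[0].score = 0.0f; // the empty prefix has score 0

        // forward pass: extend every reachable prefix with every vocabulary piece
        for (size_t end = 1; end <= input.size(); ++end) {
            for (size_t start = 0; start < end; ++start) {
                if (dp[start].score == -FLT_MAX) continue; // unreachable prefix
                const auto it = vocab.find(input.substr(start, end - start));
                if (it == vocab.end()) continue;
                const float score = dp[start].score + it->second;
                if (score > dp[end].score) dp[end] = { (int) start, score, it->first };
            }
        }

        // backtrack from the end of the input to recover the best segmentation
        std::vector<std::string> pieces;
        for (int pos = (int) input.size(); pos > 0; pos = dp[pos].prev) {
            pieces.push_back(dp[pos].piece);
        }
        for (auto it = pieces.rbegin(); it != pieces.rend(); ++it) {
            printf("%s ", it->c_str()); // prints: hel lo
        }
        printf("\n");
        return 0;
    }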

@mofosyne added the Review Complexity : Medium label (generally requires more time to grok but manageable by beginner to medium expertise level) on Jun 24, 2024
llama : fix preventing crashes when precompiled_charsmap is not present
llama.cpp Outdated
Comment on lines 4977 to 4979
    vocab.n_precompiled_charsmap = gguf_get_arr_n(ctx, precompiled_charsmap_keyidx);
    vocab.precompiled_charsmap = (char *) malloc(vocab.n_precompiled_charsmap);
    memcpy((void*) vocab.precompiled_charsmap, gguf_get_arr_data(ctx, precompiled_charsmap_keyidx), vocab.n_precompiled_charsmap);
Owner

There's a memory leak here. Use std::vector<char> instead of:

    uint32_t n_precompiled_charsmap = 0;
    char * precompiled_charsmap = NULL;
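
A minimal sketch of what the vector-based loading could look like (the member and variable names are assumptions for illustration, not the final commit):

    // in llama_vocab, replacing the raw pointer + size pair (assumed member):
    //     std::vector<char> precompiled_charsmap;

    // in llm_load_vocab: copy the GGUF array into the vector; the vector owns the
    // memory, so nothing has to be freed manually and the leak disappears
    const size_t n_charsmap = gguf_get_arr_n(ctx, precompiled_charsmap_keyidx);
    const char * charsmap   = (const char *) gguf_get_arr_data(ctx, precompiled_charsmap_keyidx);
    vocab.precompiled_charsmap.assign(charsmap, charsmap + n_charsmap);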

fairydreaming (Collaborator, Author) Jun 25, 2024

Good catch! I replaced it with a vector as suggested, but I had to move the endianness-correction code from llm_tokenizer_ugm to llm_load_vocab: the reference to vocab is const in the tokenizer, so manipulating the precompiled_charsmap vector buffer there would require const casts.

@fairydreaming merged commit 6fcbf68 into ggerganov:master on Jun 25, 2024
63 checks passed
Comment on lines +14067 to +14070
        // initialize score_sum to -FLT_MAX so it will be always lower than sums of token scores
        std::vector<struct best_tokenization> tokenization_results(input_len + 1, {0, 0, -FLT_MAX});
        // at the beginning tokenization score is zero
        tokenization_results[0] = { 0, 0, 0 };
Owner

Is this supposed to be:

        // initialize score_sum to -FLT_MAX so it will be always lower than sums of token scores
        std::vector<struct best_tokenization> tokenization_results(input_len + 1, {vocab.special_unk_id, 0, -FLT_MAX});
        // at the beginning tokenization score is zero
        tokenization_results[0] = { vocab.special_unk_id, 0, 0 };

Currently, a string consisting of a single space character tokenizes to a single PAD token [0], while the AutoTokenizer returns an empty array of tokens in this case. With the change above, llama.cpp returns a single UNK token [2] instead, which is still incorrect, or at least does not match the AutoTokenizer result.

fairydreaming (Collaborator, Author) Jul 2, 2024

To fix this problem, I added a commit to PR #8141 that introduces an early return in the tokenizer when the normalized input is empty, matching the behavior of the SentencePiece implementation: 78675f3
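
For context, the referenced commit boils down to an early return at the start of the UGM tokenize path, roughly like the sketch below (names assumed, not the actual diff):

    std::string normalized;
    normalize(text, &normalized);
    // SentencePiece produces no tokens for an input that normalizes to the empty
    // string, so bail out before running the Viterbi search
    if (normalized.empty()) {
        return;
    }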
