Inference support for T5 and FLAN-T5 model families #8141

Merged (18 commits) on Jul 4, 2024

Conversation

@fairydreaming (Collaborator) commented on Jun 26, 2024:

This PR is the third in a series of PRs adding support for the T5 and FLAN-T5 model families.

This PR adds:

  • model types for the T5 and FLAN-T5 model families,
  • inference support for these models,
  • three new API functions: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token() (see the sketch at the end of this description),
  • support for encoder-decoder models in llama-cli.

Example model for testing: https://huggingface.co/google-t5/t5-small

Example usage:

./llama-cli -m models/t5-small.gguf -p 'translate English to German: The house is wonderful.'

Supported models:

I think it fixes #5763, #3393, #247, #4316, #7238
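
Below is a minimal sketch of how the new API functions fit together for an encoder-decoder model. The model path and the greedy token-id loop are illustrative only, error handling is reduced to the basics, and the signatures are those of the llama.cpp C API at the time of this PR:

#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

int main() {
    llama_backend_init();

    llama_model * model = llama_load_model_from_file("models/t5-small.gguf",
                                                     llama_model_default_params());
    llama_context * ctx = llama_new_context_with_model(model, llama_context_default_params());

    // tokenize the prompt
    const std::string prompt = "translate English to German: The house is wonderful.";
    std::vector<llama_token> tokens(prompt.size() + 8);
    const int n_prompt = llama_tokenize(model, prompt.c_str(), (int32_t) prompt.size(),
                                        tokens.data(), (int32_t) tokens.size(),
                                        /*add_special*/ true, /*parse_special*/ false);
    tokens.resize(n_prompt);

    if (!llama_model_has_encoder(model)) {
        fprintf(stderr, "this sketch only covers encoder-decoder models\n");
        return 1;
    }

    // 1. run the encoder once over the whole prompt; the encoder output is stored
    //    internally for the decoder cross-attention layers
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size(), 0, 0);
    if (llama_encode(ctx, batch) != 0) {
        fprintf(stderr, "llama_encode() failed\n");
        return 1;
    }

    // 2. start decoding from the model's decoder start token (fall back to BOS)
    llama_token token = llama_model_decoder_start_token(model);
    if (token == -1) {
        token = llama_token_bos(model);
    }

    // 3. greedy decoding loop, printing token ids (detokenization omitted for brevity)
    for (int i = 0; i < 64; i++) {
        batch = llama_batch_get_one(&token, 1, i, 0);
        if (llama_decode(ctx, batch) != 0) {
            fprintf(stderr, "llama_decode() failed\n");
            return 1;
        }
        const float * logits = llama_get_logits(ctx);
        llama_token best = 0;
        for (llama_token t = 1; t < llama_n_vocab(model); t++) {
            if (logits[t] > logits[best]) {
                best = t;
            }
        }
        if (best == llama_token_eos(model)) {
            break;
        }
        printf("%d ", best);
        token = best;
    }
    printf("\n");

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}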

sszymczy and others added 4 commits June 26, 2024 15:03
…l families

llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()

common, llama-cli : use new API functions to support encoder-decoder models

convert-hf : handle shared token embeddings tensors in T5Model

convert-hf : handle SentencePiece BPE tokenizer in T5Model (for Pile-T5 models)

convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model
@mofosyne added the "Review Complexity : Medium" label (generally requires more time to grok, but manageable by beginner to medium expertise level) on Jun 27, 2024
@@ -768,6 +775,14 @@ extern "C" {
// Frees a batch of tokens allocated with llama_batch_init()
LLAMA_API void llama_batch_free(struct llama_batch batch);

// Processes a batch of tokens with the encoder part of the encoder-decoder model.
// Stores the encoder output internally for later use by the decoder cross-attention layers.

@vladfaust commented:

In my case, a prompt consists of a static part, which is unchanged and makes use of the KV cache, and a dynamic part, which changes frequently. This works well with GPT-style models, where I can call llama_kv_cache_seq_rm to clean up the dynamic part of the KV cache and start evaluating again. Would a similar approach work with T5? In other words, what is the degree of control over the encoder output? Thank you.

@fairydreaming (Collaborator, Author) replied:

@vladfaust No, the encoder requires all input tokens to be present in the input batch. This is because attention in the encoder is not causal, so each token in the input sequence attends to every other token in the input sequence. It doesn't even use the KV cache, because there is no need to.

I guess it would theoretically be possible to implement this in a way that allows "adding" tokens to the encoder output by calling llama_encode() multiple times, but the implementation would be much more complicated and is definitely outside the scope of this PR.

@vladfaust replied:

Just to clarify, @fairydreaming: one of my use cases is converting a growing chat history to some structured representation for each new message. Do I understand correctly that, for now, I'd have to encode the whole history again for each inference, without any form of caching? (No offence, obviously, as I'm very grateful for the T5 support at all!)

@fairydreaming (Collaborator, Author) replied:

@vladfaust Yes, there is no caching in the encoder, so if the input sequence grows even by one token, you have to encode it again, and in the process all previous calculations for this token sequence are repeated.
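
To make this concrete, here is a minimal compilable sketch of the pattern. The helper name reencode_full_history is made up, and llama_batch_get_one has the signature it had at the time of this PR:

#include "llama.h"

#include <vector>

// Hypothetical helper: append the newest tokens and push the *entire*
// sequence back through the encoder. llama_encode() has no incremental
// mode and no KV cache on the encoder side, so every previous token is
// processed again on every call.
static bool reencode_full_history(llama_context * ctx,
                                  std::vector<llama_token> & history,
                                  const std::vector<llama_token> & new_tokens) {
    history.insert(history.end(), new_tokens.begin(), new_tokens.end());

    llama_batch batch = llama_batch_get_one(history.data(), (int32_t) history.size(), 0, 0);

    return llama_encode(ctx, batch) == 0;
}

A caller would invoke something like this once per new chat message, before running the decoder as usual.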

@fairydreaming (Collaborator, Author) commented:

@ggerganov take a look at the new API in this PR when you have some time

include/llama.h (resolved review thread)

inpL = llm_build_inp_embd(ctx0, lctx, hparams, batch, model.tok_embd, cb);

if (lctx.is_encoding) {
@ggerganov (Owner) commented:

In which cases would this be false during llama_decode()?

@fairydreaming (Collaborator, Author) replied on Jun 29, 2024:

Always: llama_decode_internal sets is_encoding to false at the start, so it is true only during a llama_encode_internal call.
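
A toy illustration of that control flow (everything except the is_encoding flag name is made up; this shows the pattern, not the actual llama.cpp code):

#include <cstdio>

struct toy_context {
    bool is_encoding = false;
};

static void build_graph(const toy_context & ctx) {
    if (ctx.is_encoding) {
        std::printf("encoder graph: non-causal self-attention, no KV cache\n");
    } else {
        std::printf("decoder graph: causal self-attention + cross-attention over encoder output\n");
    }
}

static void encode_internal(toy_context & ctx) {
    ctx.is_encoding = true;   // the only place the flag becomes true
    build_graph(ctx);
    ctx.is_encoding = false;
}

static void decode_internal(toy_context & ctx) {
    ctx.is_encoding = false;  // always false while decoding
    build_graph(ctx);
}

int main() {
    toy_context ctx;
    encode_internal(ctx); // takes the encoder branch
    decode_internal(ctx); // takes the decoder branch
    return 0;
}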

@ggerganov (Owner) commented on Jul 2, 2024:

I think I found a small problem with the tokenization. Tried to tokenize the string !!!!!!:

main : expected tokens:      3 ' ',     55 '!',  17065 '!!!!!', 
main : got tokens:           3 ' ',  17065 '!!!!!',     55 '!',

Using: https://huggingface.co/google-t5/t5-small

src/llama.cpp (outdated, resolved review thread)
@fairydreaming (Collaborator, Author) replied:

I think I found a small problem with the tokenization. Tried to tokenize the string !!!!!!:

main : expected tokens:      3 ' ',     55 '!',  17065 '!!!!!', 
main : got tokens:           3 ' ',  17065 '!!!!!',     55 '!',

Using: https://huggingface.co/google-t5/t5-small

@ggerganov Can you give me a full example that produced the 3, 55, 17065 tokenization? I did some tests and got 3, 17065, 55 both in llama.cpp and in the transformers library.
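
For reference, token ids like the ones above can be dumped with a minimal program along these lines (the model path is illustrative, and add_special is left false so only the raw tokenization of the string is shown):

#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

int main() {
    llama_backend_init();

    llama_model * model = llama_load_model_from_file("models/t5-small.gguf",
                                                     llama_model_default_params());

    const std::string text = "!!!!!!";

    std::vector<llama_token> tokens(text.size() + 8);
    const int n = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                                 tokens.data(), (int32_t) tokens.size(),
                                 /*add_special*/ false, /*parse_special*/ false);

    for (int i = 0; i < n; i++) {
        printf("%d ", tokens[i]);   // e.g. "3 17065 55" vs. "3 55 17065"
    }
    printf("\n");

    llama_free_model(model);
    llama_backend_free();
    return 0;
}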

@ggerganov (Owner) replied:

I opened a PR in your repo with instructions to reproduce:

fairydreaming#1

@fairydreaming (Collaborator, Author) replied:

I opened a PR in your repo with instructions to reproduce:

fairydreaming#1

These test failures are caused by differences in tokenization between T5Tokenizer and T5TokenizerFast in the HF transformers library. More info in fairydreaming#1.

@ggerganov (Owner) left a review comment:

Pushed some relatively minor changes:

  • updated some variable names
  • simplified the logic in llama_encode_internal by removing micro-batching support
  • extended the llama-batched example to work with T5 models

Feel free to merge if this looks good to you

Labels: examples, python (python script changes), Review Complexity : Medium
Successfully merging this pull request may close: llama : add T5 (encoder-decoder) support
5 participants