Inference support for T5 and FLAN-T5 model families #8141

Merged (18 commits) on Jul 4, 2024

Conversation

@fairydreaming (Collaborator) commented on Jun 26, 2024:

This PR is the third in a series of PRs adding support for the T5 and FLAN-T5 model families.

This PR adds:

  • model types for the T5 and FLAN-T5 model families,
  • inference support for these models,
  • three new API functions: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token() (see the sketch at the end of this description),
  • support for encoder-decoder models in llama-cli.

Example model for testing: https://huggingface.co/google-t5/t5-small

Example usage:

./llama-cli -m models/t5-small.gguf -p 'translate English to German: The house is wonderful.'

Supported models:

I think it fixes #5763, #3393, #247, #4316, #7238
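
Below is a minimal sketch of how the new API functions fit together for an encoder-decoder model. The model path and the greedy token-id loop are illustrative only, error handling is reduced to the basics, and the signatures are those of the llama.cpp C API at the time of this PR:

#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

int main() {
    llama_backend_init();

    llama_model * model = llama_load_model_from_file("models/t5-small.gguf",
                                                     llama_model_default_params());
    llama_context * ctx = llama_new_context_with_model(model, llama_context_default_params());

    // tokenize the prompt
    const std::string prompt = "translate English to German: The house is wonderful.";
    std::vector<llama_token> tokens(prompt.size() + 8);
    const int n_prompt = llama_tokenize(model, prompt.c_str(), (int32_t) prompt.size(),
                                        tokens.data(), (int32_t) tokens.size(),
                                        /*add_special*/ true, /*parse_special*/ false);
    tokens.resize(n_prompt);

    if (!llama_model_has_encoder(model)) {
        fprintf(stderr, "this sketch only covers encoder-decoder models\n");
        return 1;
    }

    // 1. run the encoder once over the whole prompt; the encoder output is stored
    //    internally for the decoder cross-attention layers
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size(), 0, 0);
    if (llama_encode(ctx, batch) != 0) {
        fprintf(stderr, "llama_encode() failed\n");
        return 1;
    }

    // 2. start decoding from the model's decoder start token (fall back to BOS)
    llama_token token = llama_model_decoder_start_token(model);
    if (token == -1) {
        token = llama_token_bos(model);
    }

    // 3. greedy decoding loop, printing token ids (detokenization omitted for brevity)
    for (int i = 0; i < 64; i++) {
        batch = llama_batch_get_one(&token, 1, i, 0);
        if (llama_decode(ctx, batch) != 0) {
            fprintf(stderr, "llama_decode() failed\n");
            return 1;
        }
        const float * logits = llama_get_logits(ctx);
        llama_token best = 0;
        for (llama_token t = 1; t < llama_n_vocab(model); t++) {
            if (logits[t] > logits[best]) {
                best = t;
            }
        }
        if (best == llama_token_eos(model)) {
            break;
        }
        printf("%d ", best);
        token = best;
    }
    printf("\n");

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}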

sszymczy and others added 4 commits June 26, 2024 15:03
…l families

llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()

common, llama-cli : use new API functions to support encoder-decoder models

convert-hf : handle shared token embeddings tensors in T5Model

convert-hf : handle SentencePiece BPE tokenizer in T5Model (for Pile-T5 models)

convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model
@mofosyne added the "Review Complexity : Medium" label (generally requires more time to grok, but manageable by beginner to medium expertise level) on Jun 27, 2024
@@ -768,6 +775,14 @@ extern "C" {
// Frees a batch of tokens allocated with llama_batch_init()
LLAMA_API void llama_batch_free(struct llama_batch batch);

// Processes a batch of tokens with the encoder part of the encoder-decoder model.
// Stores the encoder output internally for later use by the decoder cross-attention layers.

@vladfaust commented:

In my case, a prompt consists of a static part, which is unchanged and makes use of the KV cache, and a dynamic part, which changes frequently. This works well with GPT-style models, where I can call llama_kv_cache_seq_rm to clean up the dynamic part of the KV cache and start evaluating again. Would a similar approach work with T5? In other words, what is the degree of control over the encoder output? Thank you.

@fairydreaming (Collaborator, Author) replied:

@vladfaust No, the encoder requires all input tokens to be present in the input batch. This is because attention in the encoder is not causal, so each token in the input sequence attends to every other token in the input sequence. It doesn't even use the KV cache, because there is no need to.

I guess it would theoretically be possible to implement this in a way that allows "adding" tokens to the encoder output by calling llama_encode() multiple times, but the implementation would be much more complicated and is definitely outside the scope of this PR.

@vladfaust replied:

Just to clarify, @fairydreaming: one of my use cases is converting a growing chat history to some structured representation for each new message. Do I understand correctly that, for now, I'd have to encode the whole history again for each inference, without any form of caching? (No offence, obviously, as I'm very grateful for the T5 support at all!)

@fairydreaming (Collaborator, Author) replied:

@vladfaust Yes, there is no caching in the encoder, so if the input sequence grows even by one token, you have to encode it again, and in the process all previous calculations for this token sequence are repeated.
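
To make this concrete, here is a minimal compilable sketch of the pattern. The helper name reencode_full_history is made up, and llama_batch_get_one has the signature it had at the time of this PR:

#include "llama.h"

#include <vector>

// Hypothetical helper: append the newest tokens and push the *entire*
// sequence back through the encoder. llama_encode() has no incremental
// mode and no KV cache on the encoder side, so every previous token is
// processed again on every call.
static bool reencode_full_history(llama_context * ctx,
                                  std::vector<llama_token> & history,
                                  const std::vector<llama_token> & new_tokens) {
    history.insert(history.end(), new_tokens.begin(), new_tokens.end());

    llama_batch batch = llama_batch_get_one(history.data(), (int32_t) history.size(), 0, 0);

    return llama_encode(ctx, batch) == 0;
}

A caller would invoke something like this once per new chat message, before running the decoder as usual.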

@fairydreaming (Collaborator, Author) commented:

@ggerganov take a look at the new API in this PR when you have some time

include/llama.h (resolved review thread)

inpL = llm_build_inp_embd(ctx0, lctx, hparams, batch, model.tok_embd, cb);

if (lctx.is_encoding) {
@ggerganov (Owner) commented:

In which cases would this be false during llama_decode()?

@fairydreaming (Collaborator, Author) replied on Jun 29, 2024:

Always: llama_decode_internal sets is_encoding to false at the start, so it is true only during a llama_encode_internal call.
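
A toy illustration of that control flow (everything except the is_encoding flag name is made up; this shows the pattern, not the actual llama.cpp code):

#include <cstdio>

struct toy_context {
    bool is_encoding = false;
};

static void build_graph(const toy_context & ctx) {
    if (ctx.is_encoding) {
        std::printf("encoder graph: non-causal self-attention, no KV cache\n");
    } else {
        std::printf("decoder graph: causal self-attention + cross-attention over encoder output\n");
    }
}

static void encode_internal(toy_context & ctx) {
    ctx.is_encoding = true;   // the only place the flag becomes true
    build_graph(ctx);
    ctx.is_encoding = false;
}

static void decode_internal(toy_context & ctx) {
    ctx.is_encoding = false;  // always false while decoding
    build_graph(ctx);
}

int main() {
    toy_context ctx;
    encode_internal(ctx); // takes the encoder branch
    decode_internal(ctx); // takes the decoder branch
    return 0;
}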

@ggerganov (Owner) commented on Jul 2, 2024:

I think I found a small problem with the tokenization. Tried to tokenize the string !!!!!!:

main : expected tokens:      3 ' ',     55 '!',  17065 '!!!!!', 
main : got tokens:           3 ' ',  17065 '!!!!!',     55 '!',

Using: https://huggingface.co/google-t5/t5-small

src/llama.cpp (outdated, resolved review thread)
@fairydreaming (Collaborator, Author) replied:

I think I found a small problem with the tokenization. Tried to tokenize the string !!!!!!:

main : expected tokens:      3 ' ',     55 '!',  17065 '!!!!!', 
main : got tokens:           3 ' ',  17065 '!!!!!',     55 '!',

Using: https://huggingface.co/google-t5/t5-small

@ggerganov Can you give me a full example that produced the 3, 55, 17065 tokenization? I did some tests and got 3, 17065, 55 both in llama.cpp and in the transformers library.
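
For reference, token ids like the ones above can be dumped with a minimal program along these lines (the model path is illustrative, and add_special is left false so only the raw tokenization of the string is shown):

#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

int main() {
    llama_backend_init();

    llama_model * model = llama_load_model_from_file("models/t5-small.gguf",
                                                     llama_model_default_params());

    const std::string text = "!!!!!!";

    std::vector<llama_token> tokens(text.size() + 8);
    const int n = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                                 tokens.data(), (int32_t) tokens.size(),
                                 /*add_special*/ false, /*parse_special*/ false);

    for (int i = 0; i < n; i++) {
        printf("%d ", tokens[i]);   // e.g. "3 17065 55" vs. "3 55 17065"
    }
    printf("\n");

    llama_free_model(model);
    llama_backend_free();
    return 0;
}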

@ggerganov (Owner) replied:

I opened a PR in your repo with instructions to reproduce:

fairydreaming#1

@fairydreaming (Collaborator, Author) replied:

I opened a PR in your repo with instructions to reproduce:

fairydreaming#1

These test failures are caused by differences in tokenization between T5Tokenizer and T5TokenizerFast in the HF transformers library. More info in fairydreaming#1.

@ggerganov (Owner) left a review comment:

Pushed some relatively minor changes:

  • updated some variable names
  • simplified the logic in llama_encode_internal by removing micro-batching support
  • extended the llama-batched example to work with T5 models

Feel free to merge if this looks good to you

Labels: examples, python (python script changes), Review Complexity : Medium
Successfully merging this pull request may close: llama : add T5 (encoder-decoder) support
5 participants