Cannot run T5-based models #1587
Comments
@kyteinsky Inference with T5-like models requires using some new API functions (e.g. llama_encode()), which the Python bindings don't use yet.
Hey @fairydreaming, thanks for the T5 implementation! Which Python module do you refer to? I was under the impression that llama-cpp-python is the Python wrapper to use. Also, it doesn't depend on any such package related to llama.cpp.
That's where we are :)
@kyteinsky lol, didn't notice that all this time, sorry
Same problem here, really hoping the T5 architecture will get implemented soon!
Hi, can you help with the Python code? I tried adding the missing functions myself, but the output is gibberish. I added them to _internals.py (https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/_internals.py#L354) and then used them in llama.py (https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L633).
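For reference, a minimal sketch of what such an encode wrapper could look like, assuming the ctypes binding llama_cpp.llama_encode() is exposed in the installed version (illustrative only, not the exact code from the comment):

```python
import llama_cpp


def context_encode(ctx: llama_cpp.llama_context_p, batch: llama_cpp.llama_batch) -> None:
    """Run the encoder half of an encoder-decoder model on a prepared batch.

    Mirrors how the existing decode path calls llama_cpp.llama_decode();
    llama_encode() is the entry point added to llama.cpp for T5 support.
    """
    return_code = llama_cpp.llama_encode(ctx, batch)
    if return_code != 0:
        raise RuntimeError(f"llama_encode returned {return_code}")
```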
@Sadeghi85 One thing that is missing in your code is the preparation of the input to llama_decode(): before calling it, the decoder batch has to be initialized with the decoder start token (after running the prompt through llama_encode()).
I tried, but I really don't know what I'm doing. Waiting for @abetlen to take a look at this.
@Sadeghi85 Have you been able to fix this issue?
I see there are serious problems with using T5, so I added a branch with a high-level example of inference with a T5 model: https://github.com/fairydreaming/llama-cpp-python/tree/t5 There is also a second branch containing a low-level example of T5 inference: https://github.com/fairydreaming/llama-cpp-python/tree/fix-low-level-examples Hope it helps.
@fairydreaming Thanks a lot for the early response and the scripts you provided. Can we pass a batch of prompts to the model rather than a single prompt? If yes, how can we do that?
@yugaljain1999 Yes, you can pass multiple prompts. I don't know how it works in the llama-cpp-python high-level API, but in llama.cpp (low-level API) you do it by creating a batch containing tokens tagged with different seq_id values. For example, if you have two prompts tokenized into pt1 and pt2 token arrays, you put all their tokens into one batch (each token carrying its sequence's seq_id) and call llama_encode() on that batch. For llama_decode() you initialize the batch with decoder_start_token for both sequences, call llama_decode() on it, sample the next token for each sequence, add the sampled tokens to the batch, call llama_decode() again to generate the next tokens for both sequences, and so on.
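A rough Python sketch of that flow with the low-level ctypes bindings is shown below. It assumes model and ctx are already created, pt1 and pt2 are lists of token ids, and that the bindings for llama_encode() and llama_model_decoder_start_token() are available in your version; it illustrates the batch layout rather than being a complete program.

```python
import llama_cpp


def fill_batch(batch, prompts, want_logits=False):
    """Pack several sequences into one llama_batch, tagging each token with its seq_id."""
    i = 0
    for seq_id, tokens in enumerate(prompts):
        for pos, tok in enumerate(tokens):
            batch.token[i] = tok
            batch.pos[i] = pos
            batch.n_seq_id[i] = 1
            batch.seq_id[i][0] = seq_id
            # Only the last token of each sequence needs logits when decoding.
            batch.logits[i] = want_logits and pos == len(tokens) - 1
            i += 1
    batch.n_tokens = i


# One batch sized for both prompts, with room for two sequences.
batch = llama_cpp.llama_batch_init(len(pt1) + len(pt2), 0, 2)

# 1. Encode both prompts in a single llama_encode() call.
fill_batch(batch, [pt1, pt2])
llama_cpp.llama_encode(ctx, batch)

# 2. Start decoding: one decoder_start_token per sequence.
start_token = llama_cpp.llama_model_decoder_start_token(model)
fill_batch(batch, [[start_token], [start_token]], want_logits=True)
llama_cpp.llama_decode(ctx, batch)

# 3. Sample a token for each sequence from its logits, put the sampled tokens
#    into the batch (same seq_ids, next positions), call llama_decode() again,
#    and repeat until both sequences emit EOS.
```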
@fairydreaming Thanks for this info. When I run the model with llama-server I get an error. Any idea how we can resolve it? Thanks
@yugaljain1999 T5 models are still not supported in llama-server.
@fairydreaming Thanks for letting me know. Since llama-server won't work with T5, I am continuing with the low-level T5 API example you shared. May I know which params we need to change in that script to fully leverage GPU acceleration? Thanks
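For GPU offload with the low-level ctypes API, the main knob is n_gpu_layers in the model params (together with a CUDA-enabled build of the wheel). A minimal sketch, with an illustrative model path:

```python
import llama_cpp

llama_cpp.llama_backend_init()

# Offload as many layers as possible to the GPU (requires a CUDA build).
model_params = llama_cpp.llama_model_default_params()
model_params.n_gpu_layers = 99

model = llama_cpp.llama_load_model_from_file(
    b"./flan-t5-small/flan-t5-small-Q8_0.gguf",  # illustrative path
    model_params,
)

ctx_params = llama_cpp.llama_context_default_params()
ctx = llama_cpp.llama_new_context_with_model(model, ctx_params)
```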
Expected Behavior
I'm trying to use models based on the T5 architecture, like "google/flan-t5-small" and "google/madlad400-3b-mt", after converting them to gguf format (8 bit). It works nicely with llama-cli. T5 support was a recent addition to llama.cpp: ggerganov/llama.cpp#8141
Here is a related issue, but I don't understand what needs to be changed in the Python package: ggerganov/llama.cpp#8398
Current Behavior
It loads the model into GPU/system memory, but execution fails with a GGML_ASSERT followed by (core dumped).
Environment and Context
I'm running v0.2.82 using the pre-compiled wheel provided for CUDA 12.4.
$ lscpu
$ uname -a
Failure Information (for bugs)
Steps to Reproduce
python convert_hf_to_gguf.py ./flan-t5-small --outtype q8_0
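After converting, loading the resulting file with the high-level API and running a single prompt reproduces the crash (file name and prompt below are illustrative):

```python
from llama_cpp import Llama

# Illustrative path to the q8_0 gguf produced by the conversion step above.
llm = Llama(model_path="./flan-t5-small/flan-t5-small-Q8_0.gguf", n_gpu_layers=-1)

# Generation aborts with the GGML_ASSERT and (core dumped) described above.
print(llm("translate English to German: The house is wonderful.", max_tokens=32))
```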
Failure Logs
The failure output is always the same.