TurboPrefill: Multi-GPU prefill acceleration for llama.cpp #24092
sergey-automation
started this conversation in
Show and tell
Update: I also validated the same scheduling approach on Vision Language Models (Qwen2.5-VL). The experiments suggest that VLM response waiting time can also be reduced by about 2× without changing model weights, prompts, quantization, or inference mathematics.
TurboPrefill is an attempt to make layer-split multi-GPU configurations spend less time waiting and more time computing during prefill.
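
To make the "waiting versus computing" point concrete, here is a toy timing model of a layer-split setup. It is only an illustrative sketch, not TurboPrefill's actual scheduler: the function names, the equal per-stage cost, and the idea of overlapping prompt chunks across GPUs are all assumptions made for this example. In a real prefill, later chunks attend to more context and cost more, and transfer time between GPUs is not free.

```python
def sequential_prefill_time(n_gpus: int, n_chunks: int, stage_time: float) -> float:
    """Total time when only one GPU works at a time.

    Each prompt chunk passes through every GPU's layers in turn, and the next
    chunk starts only after the previous one has left the last GPU.
    """
    return n_chunks * n_gpus * stage_time


def pipelined_prefill_time(n_gpus: int, n_chunks: int, stage_time: float) -> float:
    """Total time when chunks overlap across GPUs (classic pipeline fill/drain).

    GPU 0 starts chunk k+1 as soon as it hands chunk k to GPU 1, so after the
    pipeline fills, one chunk completes every stage_time.
    """
    return (n_chunks + n_gpus - 1) * stage_time


def gpu_utilization(n_chunks: int, stage_time: float, total_time: float) -> float:
    """Fraction of the total time a single GPU spends computing."""
    busy_time = n_chunks * stage_time  # each GPU processes every chunk once
    return busy_time / total_time


if __name__ == "__main__":
    n_gpus, n_chunks, stage_time = 2, 8, 1.0  # hypothetical values
    seq = sequential_prefill_time(n_gpus, n_chunks, stage_time)
    pipe = pipelined_prefill_time(n_gpus, n_chunks, stage_time)
    print("sequential:", seq, "utilization:", gpu_utilization(n_chunks, stage_time, seq))
    print("pipelined: ", pipe, "utilization:", gpu_utilization(n_chunks, stage_time, pipe))
    print("speedup:   ", seq / pipe)
```

With 2 GPUs and 8 equal chunks, this model gives 16 time units sequentially versus 9 when overlapped. Under these simplifying assumptions the speedup approaches the number of GPUs as the chunk count grows.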