-
Huh! Quite interesting. Is there a catch? If not, this seems like a major breakthrough. Will be taking a deeper look - thanks for sharing.
-
The 11x speedup is kind of cherry-picked because the llama.cpp GPU code for Falcon 40b is just not well-optimized. I also didn't put much effort into FP16 performance in general because there is no significant difference in logits compared to q8_0. For LLaMA 2 70b q4_0 the paper reports a ~3-4x speedup, which is still very good though. I think there are some issues that would need to be investigated to determine how useful this actually is.
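To illustrate why q8_0 logits track FP16 so closely: q8_0 stores weights in blocks of 32 int8 values with one scale per block (scale = max|x|/127), so the per-weight round-trip error is at most half a quantization step. Below is a minimal, self-contained sketch of that round trip and its effect on a single dot product; it is illustrative only, not llama.cpp's actual quantization code, and the sizes and distributions are made up:

```cpp
// Minimal sketch (not llama.cpp code) of a q8_0-style round trip.
// q8_0 stores blocks of 32 weights as int8 plus one scale per block
// (scale = max|x| / 127), so the round-trip error per weight is at
// most scale/2, i.e. about 0.4% of the block's largest magnitude.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

constexpr int QK8_0 = 32; // q8_0 block size

// Quantize each block to int8, then immediately dequantize into y.
static void q8_0_roundtrip(const float *x, float *y, int n) {
    for (int i = 0; i < n; i += QK8_0) {
        float amax = 0.0f;
        for (int j = 0; j < QK8_0; ++j) {
            amax = std::max(amax, std::fabs(x[i + j]));
        }
        const float d  = amax / 127.0f;                  // per-block scale
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        for (int j = 0; j < QK8_0; ++j) {
            const int8_t q = (int8_t) std::lroundf(x[i + j] * id);
            y[i + j] = q * d;                            // dequantized weight
        }
    }
}

int main() {
    const int n = 4096; // one hypothetical weight row
    std::mt19937 rng(42);
    std::normal_distribution<float> dist(0.0f, 0.02f);
    std::vector<float> w(n), wq(n), a(n);
    for (int i = 0; i < n; ++i) { w[i] = dist(rng); a[i] = dist(rng); }
    q8_0_roundtrip(w.data(), wq.data(), n);

    // One "logit" = dot(weights, activations); compare full precision
    // against the q8_0 round trip.
    double logit = 0.0, logit_q = 0.0;
    for (int i = 0; i < n; ++i) {
        logit   += (double) w[i]  * a[i];
        logit_q += (double) wq[i] * a[i];
    }
    std::printf("logit fp32 = %.6f, q8_0 = %.6f, |diff| = %.2e\n",
                logit, logit_q, std::fabs(logit - logit_q));
    return 0;
}
```

With weights and activations of typical magnitude, the dot-product error this prints is a couple of orders of magnitude smaller than the logit itself, which is why FP16 vs q8_0 makes little practical difference in the output distribution.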
-
I've been thinking about PowerInfer and have come to the following conclusion: as of right now, trying to integrate it into llama.cpp would not be a good investment of my time. The first issue is that the code in the PowerInfer repository is simply incomplete. The second issue is that while a speedup of ~3-4x would be quite good, it would not be universal: it would apply only to partial offloading, only to models that use ReLU as their activation function, and there are diminishing returns in combination with sampling-based speedup techniques. The paper also shows that PowerInfer provides less speedup for batched inference, but honestly I don't care about that either way. Overall I think right now the best investment of my time is to simply focus on the basics: faster matrix multiplication, more efficient quantization, and better training code.
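For intuition on why the speedup is tied to ReLU specifically: ReLU produces exact zeros, and for a zero activation the corresponding column of the FFN down-projection contributes nothing, so its weights never need to be loaded or multiplied. The sketch below is illustrative only, in plain C++ rather than PowerInfer's or llama.cpp's actual implementation, and the ~90% sparsity figure is an assumption for the demo:

```cpp
// Illustrative sketch (not PowerInfer's or llama.cpp's code) of why exact
// zeros from ReLU allow skipping work in the FFN down-projection
// y = W_down * h: any column of W_down whose activation h[j] is zero
// contributes nothing and never needs to be loaded or multiplied.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Dense mat-vec: y[i] = sum_j W[i*n + j] * h[j]
static void matvec_dense(const std::vector<float> &W, const std::vector<float> &h,
                         std::vector<float> &y, int m, int n) {
    for (int i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j) acc += W[i * (size_t) n + j] * h[j];
        y[i] = acc;
    }
}

// Sparsity-aware mat-vec: touch only the columns with nonzero activations.
static void matvec_relu_sparse(const std::vector<float> &W, const std::vector<float> &h,
                               std::vector<float> &y, int m, int n) {
    std::fill(y.begin(), y.end(), 0.0f);
    for (int j = 0; j < n; ++j) {
        if (h[j] == 0.0f) continue; // exact zero from ReLU: skip whole column
        const float hj = h[j];
        for (int i = 0; i < m; ++i) y[i] += W[i * (size_t) n + j] * hj;
    }
}

int main() {
    const int m = 512, n = 2048; // made-up FFN dimensions
    std::mt19937 rng(0);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> W(m * (size_t) n), h(n), y1(m), y2(m);
    for (auto &w : W) w = dist(rng);

    // Simulate post-ReLU activations with ~90% exact zeros (assumed figure).
    std::bernoulli_distribution active(0.1);
    int nonzero = 0;
    for (auto &v : h) {
        v = active(rng) ? std::fabs(dist(rng)) : 0.0f;
        nonzero += v != 0.0f;
    }

    matvec_dense(W, h, y1, m, n);
    matvec_relu_sparse(W, h, y2, m, n);

    float max_diff = 0.0f;
    for (int i = 0; i < m; ++i) max_diff = std::max(max_diff, std::fabs(y1[i] - y2[i]));
    std::printf("nonzero activations: %d/%d, max |dense - sparse| = %g\n",
                nonzero, n, max_diff);
    return 0;
}
```

Activation functions like SiLU or GELU produce small but nonzero values rather than exact zeros, so the same skipping trick does not apply directly; that is why the technique is limited to ReLU models.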
-
https://github.com/SJTU-IPADS/PowerInfer
Claims an 11x speedup for a 40B model on an RTX 4090 vs llama.cpp.