-
Huh! Quite interesting. Is there a catch? If not, this seems like a major breakthrough. Will be taking a deeper look - thanks for sharing.
-
The 11x speedup is kind of cherry-picked because the llama.cpp GPU code for Falcon 40b is just not well-optimized. I also didn't put much effort into FP16 performance in general because there is no significant difference in logits compared to q8_0. For LLaMA 2 70b q4_0 the paper reports a ~3-4x speedup, which is still very good though. I think there are some issues that would need to be investigated to determine how useful this actually is.
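To illustrate why q8_0 logits track FP16 so closely: q8_0 stores weights in blocks of 32 int8 values with one scale per block (scale = max|x|/127), so the per-weight round-trip error is at most half a quantization step. Below is a minimal, self-contained sketch of that round trip and its effect on a single dot product; it is illustrative only, not llama.cpp's actual quantization code, and the sizes and distributions are made up:

```cpp
// Minimal sketch (not llama.cpp code) of a q8_0-style round trip.
// q8_0 stores blocks of 32 weights as int8 plus one scale per block
// (scale = max|x| / 127), so the round-trip error per weight is at
// most scale/2, i.e. about 0.4% of the block's largest magnitude.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

constexpr int QK8_0 = 32; // q8_0 block size

// Quantize each block to int8, then immediately dequantize into y.
static void q8_0_roundtrip(const float *x, float *y, int n) {
    for (int i = 0; i < n; i += QK8_0) {
        float amax = 0.0f;
        for (int j = 0; j < QK8_0; ++j) {
            amax = std::max(amax, std::fabs(x[i + j]));
        }
        const float d  = amax / 127.0f;                  // per-block scale
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        for (int j = 0; j < QK8_0; ++j) {
            const int8_t q = (int8_t) std::lroundf(x[i + j] * id);
            y[i + j] = q * d;                            // dequantized weight
        }
    }
}

int main() {
    const int n = 4096; // one hypothetical weight row
    std::mt19937 rng(42);
    std::normal_distribution<float> dist(0.0f, 0.02f);
    std::vector<float> w(n), wq(n), a(n);
    for (int i = 0; i < n; ++i) { w[i] = dist(rng); a[i] = dist(rng); }
    q8_0_roundtrip(w.data(), wq.data(), n);

    // One "logit" = dot(weights, activations); compare full precision
    // against the q8_0 round trip.
    double logit = 0.0, logit_q = 0.0;
    for (int i = 0; i < n; ++i) {
        logit   += (double) w[i]  * a[i];
        logit_q += (double) wq[i] * a[i];
    }
    std::printf("logit fp32 = %.6f, q8_0 = %.6f, |diff| = %.2e\n",
                logit, logit_q, std::fabs(logit - logit_q));
    return 0;
}
```

With weights and activations of typical magnitude, the dot-product error this prints is a couple of orders of magnitude smaller than the logit itself, which is why FP16 vs q8_0 makes little practical difference in the output distribution.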
-
I've been thinking about PowerInfer and have come to the following conclusion: as of right now, trying to integrate it into llama.cpp would not be a good investment of my time. The first issue is that the code in the PowerInfer repository is simply incomplete. The second issue is that while a speedup of ~3-4x would be quite good, it would not be universal: it would apply only to partial offloading, only to models that use ReLU as their activation function, and there are diminishing returns in combination with sampling-based speedup techniques. The paper also shows that PowerInfer provides less speedup for batched inference, but honestly I don't care about that either way. Overall I think right now the best investment of my time is to simply focus on the basics: faster matrix multiplication, more efficient quantization, and better training code.
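For intuition on why the speedup is tied to ReLU specifically: ReLU produces exact zeros, and for a zero activation the corresponding column of the FFN down-projection contributes nothing, so its weights never need to be loaded or multiplied. The sketch below is illustrative only, in plain C++ rather than PowerInfer's or llama.cpp's actual implementation, and the ~90% sparsity figure is an assumption for the demo:

```cpp
// Illustrative sketch (not PowerInfer's or llama.cpp's code) of why exact
// zeros from ReLU allow skipping work in the FFN down-projection
// y = W_down * h: any column of W_down whose activation h[j] is zero
// contributes nothing and never needs to be loaded or multiplied.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Dense mat-vec: y[i] = sum_j W[i*n + j] * h[j]
static void matvec_dense(const std::vector<float> &W, const std::vector<float> &h,
                         std::vector<float> &y, int m, int n) {
    for (int i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j) acc += W[i * (size_t) n + j] * h[j];
        y[i] = acc;
    }
}

// Sparsity-aware mat-vec: touch only the columns with nonzero activations.
static void matvec_relu_sparse(const std::vector<float> &W, const std::vector<float> &h,
                               std::vector<float> &y, int m, int n) {
    std::fill(y.begin(), y.end(), 0.0f);
    for (int j = 0; j < n; ++j) {
        if (h[j] == 0.0f) continue; // exact zero from ReLU: skip whole column
        const float hj = h[j];
        for (int i = 0; i < m; ++i) y[i] += W[i * (size_t) n + j] * hj;
    }
}

int main() {
    const int m = 512, n = 2048; // made-up FFN dimensions
    std::mt19937 rng(0);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> W(m * (size_t) n), h(n), y1(m), y2(m);
    for (auto &w : W) w = dist(rng);

    // Simulate post-ReLU activations with ~90% exact zeros (assumed figure).
    std::bernoulli_distribution active(0.1);
    int nonzero = 0;
    for (auto &v : h) {
        v = active(rng) ? std::fabs(dist(rng)) : 0.0f;
        nonzero += v != 0.0f;
    }

    matvec_dense(W, h, y1, m, n);
    matvec_relu_sparse(W, h, y2, m, n);

    float max_diff = 0.0f;
    for (int i = 0; i < m; ++i) max_diff = std::max(max_diff, std::fabs(y1[i] - y2[i]));
    std::printf("nonzero activations: %d/%d, max |dense - sparse| = %g\n",
                nonzero, n, max_diff);
    return 0;
}
```

Activation functions like SiLU or GELU produce small but nonzero values rather than exact zeros, so the same skipping trick does not apply directly; that is why the technique is limited to ReLU models.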
-
https://github.com/SJTU-IPADS/PowerInfer
Claims an 11x speedup for a 40B model on an RTX 4090 vs llama.cpp.