Multi GPU CUDA - 8x performance degradation when splitting tensors -> let's split by layer as an option #4055
See the conversation starting at #3776 (comment). I am aware of the parallelization scheme where the model is split into blocks of layers instead of splitting each layer into slices. As I said before: I have no intention of implementing it. Multi GPU only really makes sense for running something like 70b, and for that purpose I think the best buys are either multiple P40s or multiple RTX 3090s. For multiple P40s the current scheme works better, while for multiple RTX 3090s NVLink is available, which should also result in low parallelization overhead. Synchronization overhead may also vary by OS: on Windows, for example, peer access between devices is only available via NVLink, so the performance for multiple GPUs working on small batches should be worse.
Also: see #3814 and check whether that PR has unintentionally resulted in a performance regression for your hardware setup.
That's a pity. NVLink was deprecated in 2022 and is not likely to come back to consumer GPUs. I am aware of the theory, but in practice we have an 800-1000% slowdown with the current implementation of tensor split. Best would be to fix the synchronization problem; splitting by layers would be a simple solution to that problem until synchronization works better.
From what I see there might be an issue with the produced values being inconsistent between single and multi-GPU setups. Single-GPU:
Multi-GPU:
@jezzarax There's something really, really weird going on here. According to your logs you get 8+ ppl for single GPU and ~6.4 for multi-GPU, which is a gigantic difference. Also, multi-GPU is the "weird" scenario, but apparently the more typical one is where you get the unexpected result. I'm very skeptical about 8+ being correct; 6.4 sounds much more reasonable. I don't know anything about multi-GPU so I can't help diagnose the actual problem.
I also assume something weird is happening, in addition to the performance problem.
I think, given the high-quality state of llama.cpp and considering new models like Llama 2 70B and Falcon 180B being open for our use, it would be quite important to get multi GPU working better and close the performance gap to Python.
The case where they got the unexpected result was for single GPU, as far as I could see. That's what makes it so weird.
As I said before:
Regarding multi-GPU:
Regarding the ppl differences: We need to understand what is going on there.
I can do both, got access to 1x node as well. Would
You'd need to build without GPU support, prompt processing (which is all
`export CUDA_VISIBLE_DEVICES="-1"` (no spaces around the `=`, otherwise the shell rejects the assignment). That should enumerate 0 devices to the CUDA backend, so nothing could be initialized or sent to a GPU.
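As a hedged sketch of that check (the commented perplexity invocation is illustrative, not a verified command line for any particular build):

```shell
# Hide all GPUs from the CUDA runtime: "-1" matches no device index,
# so the CUDA backend enumerates zero devices and nothing is offloaded.
export CUDA_VISIBLE_DEVICES="-1"

# Then run the perplexity tool as usual; all compute stays on the CPU.
# (illustrative invocation, adjust model/file paths for your setup)
# ./perplexity -m models/model.gguf -f wiki.test.raw

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```

Building without GPU support entirely (no cuBLAS at compile time) is the stronger guarantee; the environment variable is just the quicker test on an existing binary.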
Likely a bug was introduced in 4760e7c |
I made multiple runs over two commits and two quantisation levels. I used a commit from two-ish weeks ago and one from yesterday. It looks like there's something strange about f16; the q8 results seem more consistent.
I'm not able to run f16 on bigger models with the current version of the code for now, due to #3930 (comment). If there are any other tests I can run on the multi-A100 setup, happy to contribute.
When 2400 tokens/second drops to 300 tokens/second despite using twice the processing hardware, while inferencing the same model, we have a problem that needs solving. That's almost an order of magnitude in performance lost when adding a second card. I didn't intend to trigger emotions when I used the term "bad" in my later comment, just to point to the problem.
It's not only about the performance drop. The numbers differ between single and multi-GPU runs; please check the table I've posted above. Producing correct results is crucial.
Problem:
I am aware everyone has different results; in my case I am running llama.cpp with a 4090 as the primary card and a 3090 as the secondary, so both are quite capable cards for LLMs.
I am getting around 800% slowdowns when using both cards on the same model and settings (basically regardless of which model I tried): batch processing speed can drop from 2400 t/s to 200-300 t/s (8-10 times slower than on a single GPU).
This happens as soon as any tiny bit of processing (via -ts) is shifted to the second card.
I assume it is a synchronization problem in the CUDA loops. I also assume the issue does not affect every combination of GPUs, especially if one GPU is significantly slower.
Suggestion:
My suggestion is to add a parameter like -layer-split: when it is used, the tensors are not split up; instead, whole layers are distributed across the cards (using -ls instead of -ts).
This means each part of the calculation runs entirely on a single GPU, without cross-device synchronization, at the highest possible performance of that GPU.
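To make the idea concrete, here is a minimal sketch (purely illustrative; `assign_layers` and the proportional scheme are my assumptions, not llama.cpp code) of how contiguous layer ranges could be assigned to devices according to the requested split fractions:

```python
def assign_layers(n_layers, fractions):
    """Assign contiguous blocks of layers to devices, proportional to fractions.

    Returns a list of (device, first_layer, end_layer) tuples, where each
    device owns the half-open range [first_layer, end_layer).
    """
    total = sum(fractions)
    assignment = []
    start = 0
    acc = 0.0
    for device, f in enumerate(fractions):
        acc += f
        end = round(n_layers * acc / total)  # cumulative boundary, rounded
        assignment.append((device, start, end))
        start = end
    return assignment

# e.g. an 80-layer model split 60/40 between a faster and a slower card
print(assign_layers(80, [0.6, 0.4]))  # [(0, 0, 48), (1, 48, 80)]
```

Each device then runs its block of layers sequentially and hands off activations once at the block boundary, instead of synchronizing inside every layer.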
Caveat:
In theory, tensor split should boost performance, since both cards can process a split tensor at the same time, so it is the better solution. Currently, however, that is so far from reality that the suggested layer split should significantly boost processing speed.
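An illustrative back-of-envelope model of why this happens (all timings here are assumptions chosen to reproduce the reported ~8x slowdown, not measurements): with tensor split, every layer pays a cross-GPU synchronization, so as soon as the per-layer sync cost exceeds the per-layer compute, adding a second card makes the whole run slower.

```python
n_layers = 80            # e.g. a 70b-class model
t_compute_ms = 0.010     # assumed per-layer compute time on one GPU
t_sync_ms = 0.080        # assumed per-layer cross-GPU sync (no NVLink/peer access)

single_gpu = n_layers * t_compute_ms
# Tensor split: each layer's compute is halved across two cards,
# but every single layer pays the synchronization cost.
tensor_split = n_layers * (t_compute_ms / 2 + t_sync_ms)
# Layer split: layers run at full single-GPU speed,
# with only one hand-off at the block boundary.
layer_split = n_layers * t_compute_ms + t_sync_ms

print(f"single GPU:   {single_gpu:.2f} ms/token")
print(f"tensor split: {tensor_split:.2f} ms/token "
      f"({tensor_split / single_gpu:.1f}x slower)")
print(f"layer split:  {layer_split:.2f} ms/token")
```

With these assumed numbers, tensor split comes out around 8.5x slower than a single GPU while layer split stays close to single-GPU speed, matching the shape (not the exact figures) of the slowdowns reported above. Note that layer split does not make generation faster than one GPU; it just stops the second card from making it slower, while still letting a model that is too big for one card fit across two.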
@JohannesGaessler, what do you think?