Problem:
On build b2849 (and earlier ones as well), the -nkvo argument to keep the KV cache in RAM results in a huge compute buffer size when all of a model's layers are offloaded with cuBLAS on a heterogeneous dual-GPU configuration (3090 24 GB + 3060 12 GB). When the non-repeating layers are not offloaded, the compute buffer size decreases massively, back to something more "normal".
Example on a Yi 34B model (60+1 layers):
perplexity -m U:\text-generation-webui\models\Merged-RP-Stew-V2-34B.i1-IQ4_XS.gguf -f wiki.test.raw -ngl 61 -b 512 -ts 1,1 -fa --no-mmap -c 2048 --chunks 10
perplexity -m U:\text-generation-webui\models\Merged-RP-Stew-V2-34B.i1-IQ4_XS.gguf -f wiki.test.raw -ngl 61 -nkvo -b 512 -ts 1,1 -fa --no-mmap -c 2048 --chunks 10 (the problematic case)
perplexity -m U:\text-generation-webui\models\Merged-RP-Stew-V2-34B.i1-IQ4_XS.gguf -f wiki.test.raw -ngl 60 -nkvo -b 512 -ts 1,1 -fa --no-mmap -c 2048 --chunks 10
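For reference, a minimal sketch of what the flags above map to in the llama.cpp C API (roughly as of b2849); this is an illustration of where -nkvo lands, not the exact code path the perplexity tool uses, and the model path and split array are placeholders taken from the commands above:

```c
// Sketch only: how -ngl / -ts / -fa / -nkvo correspond (as far as I can tell)
// to n_gpu_layers, tensor_split, flash_attn and offload_kqv in llama.h.
#include "llama.h"

int main(void) {
    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 61;                  // -ngl 61 (60 repeating + 1 output layer)
    static float split[16] = { 1.0f, 1.0f };    // -ts 1,1 (sized to cover llama_max_devices())
    mparams.tensor_split = split;

    struct llama_model * model = llama_load_model_from_file(
        "Merged-RP-Stew-V2-34B.i1-IQ4_XS.gguf", mparams);   // placeholder path

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx       = 2048;    // -c 2048
    cparams.n_batch     = 512;     // -b 512
    cparams.flash_attn  = true;    // -fa
    cparams.offload_kqv = false;   // -nkvo: keep the KV cache in host RAM

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... evaluate chunks / observe the reported compute buffer sizes here ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

The only difference between the problematic run and the "normal" one is n_gpu_layers 61 vs 60, with offload_kqv left false in both cases.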
It also happens on a Llama 2 70B model, and in similar proportions without flash attention.
Is this "as intended"/"as necessary", or a bug?