Questions related to llama.cpp options #3111
Answered by staviq
zastroyshchik asked this question in Q&A
Replies: 2 comments · 9 replies
From the replies:

- It was in fact not. Llama 2 was pretrained on a 4096-token context.
- No, you do. It defaults to 512.
- Can you elaborate on that? Does the output seem cut off?
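The "it defaults to 512" reply presumably refers to the context size. As a point of reference, a minimal sketch that sets both the context size and the batch size explicitly might look like the following; the model path and prompt are placeholders, not taken from this thread:

```sh
# Minimal sketch; ./main is llama.cpp's example binary, model path and prompt are placeholders.
# -c / --ctx-size: prompt context size (the value that reportedly defaults to 512 if omitted)
# -b / --batch-size: batch size used for prompt processing
./main -m ./models/model-q4_0.gguf -c 2048 -b 512 -p "your prompt here"
```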
The original question from zastroyshchik:

I mainly use this model, quantized at q4_0 and q5_1. For now, I am testing it on two 1070 Ti GPUs, and I get ~4 t/s with q4_0. I have some questions (for the sake of simplicity, we will only consider the q4_0-quantized model):
1. If I increase `--ctx-size` to, say, 7296, should I get any improvement? Or is it just a waste of memory? Or does it do nothing? Indeed, comparing the runs with `--ctx-size 2048` and with 7296, it seems to impact neither the output length nor the memory usage.
2. With `-ngl 38` and `--low-vram`, I am still "surprised" to see that llama.cpp uses around 20 GB of RAM in addition to ~15 GB of VRAM, while the q4_0 model is 17.7 GB. So why is so much RAM used? The options I run with are `--ctx-size 2048 -ngl 38 -t 6 --keep -1 --main-gpu 1 --multiline-input --tensor-split 54,46 --low-vram --mlock` (a sketch of the full invocation is given after this list). Yet llama.cpp uses an additional 20 GB of RAM. Does it mean that `--mlock` implies loading the whole model into RAM, in addition to the VRAM?
3. Should I set `--batch-size` to some huge value, or should I keep it the same as `--ctx-size`? Are these options related?
4. The option `--tensor-split` seems to round to one decimal place, right? I tried a value like `--tensor-split 54.5,44.5` but did not get the expected result (I get `CUDA error 2 at ggml-cuda.cu:5031: out of memory`). See also the note on split ratios after this list.
5. llama.cpp still produces log files (main..log and llama..log) despite the use of `--log-disable`. What am I missing?
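For reference, the options quoted in question 2 correspond to an invocation along these lines; the binary name, model path, and prompt are placeholders, and only the flags come from the question:

```sh
# Sketch of the command line described in question 2 (placeholders: binary, model path, prompt).
./main -m ./models/model-q4_0.gguf \
  --ctx-size 2048 -ngl 38 -t 6 --keep -1 \
  --main-gpu 1 --multiline-input \
  --tensor-split 54,46 --low-vram --mlock \
  -p "your prompt here"
```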
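On question 4: assuming the `--tensor-split` values are treated as relative proportions rather than exact percentages (an assumption here, not something confirmed in this thread), the same split can be expressed with whole numbers, which avoids relying on decimal precision:

```sh
# Assumption: only the ratio between the two values matters, not their absolute magnitude.
# 54.5 : 44.5 is the same ratio as 109 : 89.
./main -m ./models/model-q4_0.gguf -ngl 38 --low-vram --tensor-split 109,89 -p "your prompt here"
```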