Questions related to llama.cpp options #3111
Answered by staviq
zastroyshchik asked this question in Q&A
Replies: 2 comments · 9 replies
From the replies:

- It was in fact not. Llama 2 was pretrained on a 4096-token context.
- No, you do. It defaults to 512.
- Can you elaborate on that? Does the output seem cut off?
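The "it defaults to 512" reply presumably refers to the context size. As a point of reference, a minimal sketch that sets both the context size and the batch size explicitly might look like the following; the model path and prompt are placeholders, not taken from this thread:

```sh
# Minimal sketch; ./main is llama.cpp's example binary, model path and prompt are placeholders.
# -c / --ctx-size: prompt context size (the value that reportedly defaults to 512 if omitted)
# -b / --batch-size: batch size used for prompt processing
./main -m ./models/model-q4_0.gguf -c 2048 -b 512 -p "your prompt here"
```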
The original question from zastroyshchik:

I mainly use this model, quantized at q4_0 and q5_1. For now, I am testing it on two 1070 Ti GPUs, and I get ~4 t/s with q4_0. I have some questions (for the sake of simplicity, we will only consider the q4_0-quantized model):
1. If I increase `--ctx-size` to, say, 7296, should I get any improvement? Or is it just a waste of memory? Or does it do nothing? Indeed, comparing the runs with `--ctx-size 2048` and with 7296, it seems to impact neither the output length nor the memory usage.
2. With `-ngl 38` and `--low-vram`, I am still "surprised" to see that llama.cpp uses around 20 GB of RAM in addition to ~15 GB of VRAM, while the q4_0 model is 17.7 GB. So why is so much RAM used? The options I run with are `--ctx-size 2048 -ngl 38 -t 6 --keep -1 --main-gpu 1 --multiline-input --tensor-split 54,46 --low-vram --mlock` (a sketch of the full invocation is given after this list). Yet llama.cpp uses an additional 20 GB of RAM. Does it mean that `--mlock` implies loading the whole model into RAM, in addition to the VRAM?
3. Should I set `--batch-size` to some huge value, or should I keep it the same as `--ctx-size`? Are these options related?
4. The option `--tensor-split` seems to round to one decimal place, right? I tried a value like `--tensor-split 54.5,44.5` but did not get the expected result (I get `CUDA error 2 at ggml-cuda.cu:5031: out of memory`). See also the note on split ratios after this list.
5. llama.cpp still produces log files (main..log and llama..log) despite the use of `--log-disable`. What am I missing?
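For reference, the options quoted in question 2 correspond to an invocation along these lines; the binary name, model path, and prompt are placeholders, and only the flags come from the question:

```sh
# Sketch of the command line described in question 2 (placeholders: binary, model path, prompt).
./main -m ./models/model-q4_0.gguf \
  --ctx-size 2048 -ngl 38 -t 6 --keep -1 \
  --main-gpu 1 --multiline-input \
  --tensor-split 54,46 --low-vram --mlock \
  -p "your prompt here"
```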
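On question 4: assuming the `--tensor-split` values are treated as relative proportions rather than exact percentages (an assumption here, not something confirmed in this thread), the same split can be expressed with whole numbers, which avoids relying on decimal precision:

```sh
# Assumption: only the ratio between the two values matters, not their absolute magnitude.
# 54.5 : 44.5 is the same ratio as 109 : 89.
./main -m ./models/model-q4_0.gguf -ngl 38 --low-vram --tensor-split 109,89 -p "your prompt here"
```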