Update 04 - Model Tab.md
oobabooga committed Oct 23, 2023
1 parent 306d764 commit 82c11be
Showing 1 changed file with 1 addition and 1 deletion.
docs/04 ‐ Model Tab.md (1 addition, 1 deletion)
@@ -86,7 +86,7 @@ Loads: GGUF models. Note: GGML models have been deprecated and do not work anymore.
Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF

* **n-gpu-layers**: The number of layers to allocate to the GPU. If set to 0, only the CPU will be used. If you want to offload all layers, you can simply set this to the maximum value.
-* **n-ctx**: Context length of the model. In llama.cpp, the context is preallocated, so the higher this value, the higher the RAM/VRAM usage will be. It gets automatically updated with the value in the GGUF metadata for the model when you select it in the Model dropdown.
+* **n_ctx**: Context length of the model. In llama.cpp, the cache is preallocated, so the higher this value, the higher the VRAM usage. It is automatically set to the maximum sequence length for the model based on the metadata inside the GGUF file, but you may need to lower this value to be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "n_ctx" so that you don't have to set the same thing twice.
* **threads**: Number of threads. Recommended value: your number of physical cores.
* **threads_batch**: Number of threads for batch processing. Recommended value: your total number of cores (physical + virtual).
* **n_batch**: Batch size for prompt processing. Higher values are supposed to make generation faster, but I have never obtained any benefit from changing this value.
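The settings above correspond to the load-time parameters of the llama.cpp backend. As an illustration only (not code from this commit), here is a minimal llama-cpp-python sketch showing roughly how these settings translate when loading a GGUF model; the model path and numeric values are placeholder assumptions for a 7B chat model on a single GPU.

```python
# Illustrative sketch only: loading a GGUF model with llama-cpp-python using
# the same knobs described above. Path and values are example assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=35,     # 0 = CPU only; set to a large value to offload every layer
    n_ctx=4096,          # context length; the cache for this many tokens is preallocated
    n_threads=8,         # generation threads, roughly your physical core count
    n_threads_batch=16,  # prompt-processing threads, roughly physical + virtual cores
    n_batch=512,         # batch size for prompt processing
)

out = llm("Q: What does n_ctx control? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

Lowering n_ctx (or n_gpu_layers) is the usual way to fit a model that would otherwise exceed your VRAM, at the cost of a shorter usable context.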
