tools/run/README.md (60 changes: 20 additions & 40 deletions)
@@ -3,50 +3,30 @@
The purpose of this example is to demonstrate a minimal usage of llama.cpp for running models.

```bash
-llama-run granite3-moe
+llama-run -hf llama.cpp/example/run
```

```bash
-Description:
-  Runs a llm
+Usage: llama-run [server-options]

-Usage:
-  llama-run [options] model [prompt]
+This tool starts a llama-server process and provides an interactive chat interface.
+All options except --port are passed through to llama-server.

-Options:
-  -c, --context-size <value>
-      Context size (default: 2048)
-  -n, -ngl, --ngl <value>
-      Number of GPU layers (default: 0)
-  --temp <value>
-      Temperature (default: 0.8)
-  -v, --verbose, --log-verbose
-      Set verbosity level to infinity (i.e. log all messages, useful for debugging)
-  -h, --help
-      Show help message
+Common options:
+  -h,  --help                Show this help
+  -m,  --model FNAME         model path (default: `models/$filename` with filename from `--hf-file`
+                             or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
+  -hf, -hfr, --hf-repo <user>/<model>[:quant]
+                             Hugging Face model repository; quant is optional, case-insensitive,
+                             default to Q4_K_M, or falls back to the first file in the repo if
+                             Q4_K_M doesn't exist.
+                             mmproj is also downloaded automatically if available. to disable, add
+                             --no-mmproj
+                             example: unsloth/phi-4-GGUF:q4_k_m
+                             (default: unused)
+  -c,  --ctx-size N          Context size
+  -n,  --predict N           Number of tokens to predict
+  -t,  --threads N           Number of threads

-Commands:
-  model
-    Model is a string with an optional prefix of
-    huggingface:// (hf://), ollama://, https:// or file://.
-    If no protocol is specified and a file exists in the specified
-    path, file:// is assumed, otherwise if a file does not exist in
-    the specified path, ollama:// is assumed. Models that are being
-    pulled are downloaded with .partial extension while being
-    downloaded and then renamed as the file without the .partial
-    extension when complete.

-Examples:
-  llama-run llama3
-  llama-run ollama://granite-code
-  llama-run ollama://smollm:135m
-  llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
-  llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
-  llama-run ms://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
-  llama-run modelscope://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
-  llama-run https://example.com/some-file1.gguf
-  llama-run some-file2.gguf
-  llama-run file://some-file3.gguf
-  llama-run --ngl 999 some-file4.gguf
-  llama-run --ngl 999 some-file5.gguf Hello World
+For all server options, run: llama-server --help
```
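
Since llama-run now forwards everything except `--port` to llama-server, a couple of illustrative invocations might look like the sketch below. This is only a sketch based on the options quoted above: the repository tag `unsloth/phi-4-GGUF:q4_k_m` is the example given in the help text, and the local model path is a placeholder.

```bash
# Sketch: pull a model from Hugging Face (quant tag optional, defaults to Q4_K_M)
# and start an interactive chat; repo name taken from the help text above.
llama-run -hf unsloth/phi-4-GGUF:q4_k_m

# Sketch: run a local GGUF file and pass extra llama-server options straight through
# (-c sets the context size, -ngl offloads layers to the GPU); the path is a placeholder.
llama-run -m models/7B/ggml-model-f16.gguf -c 4096 -ngl 99
```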