3. Engine Options
Below is the full list of options available for the Aphrodite Engine API server.
Name or path of the HuggingFace model to use, example:
Name or path of the HuggingFace tokenizer. Defaults to the same value as --model.
The specific model branch to use. It can be a branch, a tag, or a commit ID. If unspecified, the main branch will be used.
The tokenizer mode to use. "auto" will use the fast tokenizer if available.
Trust remote code. Certain models ship custom code in their repository, and this option allows it to be executed.
The directory to download the weights to and load them from. Defaults to the default HF cache (~/.cache).
The format of the model weights to load. Also restricts which files are downloaded from HuggingFace when the repository provides the model in multiple formats.
- "auto" will load in safetensors format by default, and will fall back to pytorch bin.
- "pt" will load in pytorch bin format.
- "safetensors" will load in safetensors format.
- "npcache" will load in pytorch format and store a numpy cache to speed up the loading.
- "dummy" will init the weights with random values - for testing purposes.
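The format choices above can be sketched as a small file filter. This is only an illustration of the described behaviour, not Aphrodite's loader: `select_weight_files` and the file-extension checks are assumptions made for the example.

```python
from typing import List


def select_weight_files(files: List[str], load_format: str = "auto") -> List[str]:
    """Pick the weight files a given load format would use.

    Hypothetical helper: the extension matching is an assumption, not the
    engine's actual logic.
    """
    safetensors = [f for f in files if f.endswith(".safetensors")]
    pytorch_bin = [f for f in files if f.endswith(".bin") or f.endswith(".pt")]

    if load_format == "auto":
        # Prefer safetensors, fall back to pytorch bin if none are present.
        return safetensors or pytorch_bin
    if load_format == "safetensors":
        return safetensors
    if load_format in ("pt", "npcache"):
        # "npcache" also reads the pytorch files (and would cache them as numpy).
        return pytorch_bin
    if load_format == "dummy":
        # Random weights: nothing needs to be read from disk.
        return []
    raise ValueError(f"unknown load format: {load_format!r}")
```

With a repository listing that contains both `model.safetensors` and `pytorch_model.bin`, "auto" would return only the safetensors file.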
The data type to use for loading the model.
- "auto" will use FP16 for FP32/FP16 models, and BF16 for BF16 models.
- "half" = FP16.
- "float16" = FP16.
- "bfloat16" will load the weights in BF16.
- "float" = FP32.
- "float32" = FP32.
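The mapping above can be written out as a short lookup. This is a minimal sketch of the rules as listed here; `resolve_dtype` and the plain string names are illustrative and not part of the engine's API.

```python
# Aliases taken from the list above; values are canonical dtype names.
_DTYPE_ALIASES = {
    "half": "float16",
    "float16": "float16",
    "bfloat16": "bfloat16",
    "float": "float32",
    "float32": "float32",
}


def resolve_dtype(option: str, checkpoint_dtype: str) -> str:
    """Return the dtype the weights would be loaded in.

    `checkpoint_dtype` is the dtype the model was saved in
    ("float32", "float16" or "bfloat16").
    """
    if option == "auto":
        # FP32/FP16 checkpoints load as FP16, BF16 checkpoints stay BF16.
        return "bfloat16" if checkpoint_dtype == "bfloat16" else "float16"
    if option not in _DTYPE_ALIASES:
        raise ValueError(f"unknown dtype option: {option!r}")
    return _DTYPE_ALIASES[option]
```

Note that with "auto", an FP32 checkpoint is loaded in FP16; "float" or "float32" has to be requested explicitly to keep full precision.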
Model context length. If unspecified, it is automatically set to the model's default. If set higher than the model's default, RoPE scaling is adjusted automatically to extend the model's maximum context length.
Use Ray for distributed serving. Enabled by default when using more than 1 GPU. It can also give a slight increase in throughput in 1 GPU scenarios.
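As a rough illustration of the RoPE adjustment, the scaling factor can be thought of as the ratio between the requested and the default context length. This is an assumption made for the example: the text here does not state the exact formula or scaling method the engine uses, and `rope_scaling_factor` is a hypothetical helper.

```python
def rope_scaling_factor(requested_len: int, default_len: int) -> float:
    """Estimate the RoPE scaling factor for a requested context length.

    Assumption: the factor is requested / default, and no scaling is
    applied when the request does not exceed the model's default.
    """
    if requested_len <= 0 or default_len <= 0:
        raise ValueError("context lengths must be positive")
    return max(1.0, requested_len / default_len)


# Example: asking for 8192 tokens on a model with a 4096-token default.
factor = rope_scaling_factor(8192, 4096)
```

Under this assumption, doubling the context length corresponds to a factor of 2.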
Number of pipeline stages. Currently unsupported.
The number of tensor parallel replicas. Set this to the number of GPUs you want to use.
Load the model sequentially in multiple batches to avoid running out of RAM when using tensor parallelism with larger models.
Token block size for contiguous chunks of tokens.
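To see what the block size means in practice, here is a minimal sketch that counts how many blocks a sequence occupies, assuming tokens are stored in fixed-size blocks and a partially filled block still takes a whole block. `blocks_needed` is a hypothetical helper, and the block size of 16 is only an example value.

```python
def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Number of fixed-size token blocks required to hold `num_tokens`."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    # Ceiling division: the last block may be only partially filled.
    return -(-num_tokens // block_size)


# Example: a 100-token sequence with 16-token blocks.
n = blocks_needed(100, 16)
```

A smaller block size wastes fewer slots in the last block, at the cost of more blocks to track per sequence.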
The initial random seed to use.
CPU swap space size (GiB) per GPU.
The fraction of GPU memory to be used by Aphrodite, which can range from 0 to 1. For example, a value of 0.5 would reserve 50% of the GPU's memory for Aphrodite. If unspecified, it is set to 0.9 (90% of the GPU memory).
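The arithmetic behind this option is a simple multiplication. The sketch below uses a hypothetical helper, `gpu_memory_budget_gib`, and a 24 GiB card as an example; neither comes from the engine itself.

```python
def gpu_memory_budget_gib(total_gib: float, fraction: float = 0.9) -> float:
    """Memory (GiB) reserved on one GPU for a given utilization fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be greater than 0 and at most 1")
    return total_gib * fraction


# Example: a 24 GiB GPU.
half = gpu_memory_budget_gib(24, 0.5)   # half of the card
default = gpu_memory_budget_gib(24)     # default of 0.9
```

Lowering the fraction leaves room for other processes on the same GPU.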
Maximum number of batched tokens per iteration.
Maximum number of sequences per iteration. 256 by default. Increase if you want to process more requests per iteration.
Maximum number of paddings in a batch.
Disable logging statistics.
Disable the request logging.
Maximum length of the prompts printed in the logs. By default it is unlimited. Set to 0 to disable logging the prompts.
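The effect of the log length limit can be sketched as a truncation step applied before a prompt is written to the log. `prompt_for_log` is a hypothetical helper, and using `None` to stand for the unlimited default is an assumption of this example.

```python
from typing import Optional


def prompt_for_log(prompt: str, max_log_len: Optional[int] = None) -> str:
    """Return the part of the prompt that would appear in the log."""
    if max_log_len is None:
        # Default: no limit, the whole prompt is logged.
        return prompt
    if max_log_len < 0:
        raise ValueError("max_log_len must be non-negative")
    # A limit of 0 yields an empty string, so the prompt is not logged.
    return prompt[:max_log_len]
```

For instance, a limit of 5 would log only `Hello` for the prompt `Hello, world`.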
Method used to quantize the weights. Pair with --dtype float16.
Always use eager-mode PyTorch and disable CUDA graphs. If False, eager mode and CUDA graphs are used together in a hybrid fashion for maximum performance and flexibility.
Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode.
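The interaction between the eager-mode option and the CUDA graph context limit can be summarized as a single decision. This is a sketch of the behaviour described above, with illustrative parameter names; it is not the engine's actual code path.

```python
def uses_cuda_graph(context_len: int, enforce_eager: bool, max_captured_len: int) -> bool:
    """Decide whether a sequence would run through a captured CUDA graph.

    `enforce_eager` and `max_captured_len` are illustrative names for the
    two options described above.
    """
    if enforce_eager:
        # Eager mode was forced: CUDA graphs are never used.
        return False
    # Hybrid mode: graphs cover contexts up to the captured limit,
    # longer contexts fall back to eager execution.
    return context_len <= max_captured_len
```

With a limit of 8192, a 4096-token context would use a CUDA graph while a 16384-token context would run in eager mode.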
The data type to use for the KV cache. Set to fp8 for memory savings with a slight increase to throughput.
Pretty name to use in the API. If not specified, the model name will be the same as the --model arg.
The maximum length to report to the API. Does not affect the engine. Kobold API only.
Authorization API key for the OpenAI server. Not applicable to Kobold API. Optional.
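On the client side, the key is typically sent with each request. The sketch below assumes the server accepts the same `Authorization: Bearer` header scheme that the OpenAI API uses; `auth_headers` is a hypothetical helper and the key value is a placeholder.

```python
from typing import Dict, Optional


def auth_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Build HTTP headers for a request to an OpenAI-style endpoint.

    Assumption: the key is passed as a Bearer token, as with the OpenAI API.
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        # The key is optional, so the header is only added when one is set.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


# Example with a placeholder key.
headers = auth_headers("my-secret-key")
```

Most OpenAI client libraries set this header automatically when given an API key.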
Path to the Jinja file with the chat template. By default, the model's specified template is used (if available). If no template is found, Chat Completions is disabled. OpenAI API only.
The role name to return for OpenAI chat completions.
Path to the SSL key file.
Path to the SSL certificate file.