3. Engine Options

AlpinDale edited this page Jan 8, 2024 · 22 revisions

Below is the full list of options available for the Aphrodite Engine API server.

--model <model_name_or_path>

Name or path of the HuggingFace model to use, example:

--tokenizer <tokenizer_name_or_path>

Name or path of the HuggingFace tokenizer. Defaults to the same value as --model.

--revision <revision>

The specific model revision to use. It can be a branch, a tag, or a commit ID. If unspecified, the main branch will be used.

--tokenizer-mode {auto,slow}

The tokenizer mode to use. "auto" will use the fast tokenizer if available; "slow" will always use the slow tokenizer.

--trust-remote-code

Trust remote code from certain models.

--download-dir <directory>

The directory to download the weights to and load them from. Defaults to the default HF cache (~/.cache).

--load-format {auto,pt,safetensors,npcache,dummy}

The format of the model weights to load. Will also filter the download from HuggingFace if they provide multiple formats for the model.

  • "auto" will load in safetensors by default, and will fall back to pytorch bin.
  • "pt" will load in pytorch bin format.
  • "safetensors" will load in safetensors format.
  • "npcache" will load in pytorch format and store a numpy cache to speed up the loading.
  • "dummy" will init the weights with random values - for testing purposes.

--dtype {auto,half,float16,bfloat16,float,float32}

The data type to use for loading the model.

  • "auto" will use FP16 for FP32/FP16 models, and BF16 for BF16 models.
  • "half" = FP16.
  • "float16" = FP16.
  • "bfloat16" will load the weights in BF16.
  • "float" = FP32
  • "float32" = FP32
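
The --dtype rules above can be written down as a small lookup. This is only a sketch of the documented behavior, not the engine's own code; `resolve_dtype` and `ALIASES` are hypothetical names, and `checkpoint_dtype` stands for whatever data type the model's weights were saved in.

```python
# Hypothetical helper that mirrors the --dtype rules listed above.
# It is an illustration, not code taken from the engine.
ALIASES = {
    "half": "float16",
    "float16": "float16",
    "bfloat16": "bfloat16",
    "float": "float32",
    "float32": "float32",
}


def resolve_dtype(requested: str, checkpoint_dtype: str) -> str:
    """Return the data type the weights would be loaded in."""
    if requested == "auto":
        # FP32 and FP16 checkpoints load as FP16; BF16 checkpoints stay BF16.
        return "bfloat16" if checkpoint_dtype == "bfloat16" else "float16"
    if requested not in ALIASES:
        raise ValueError(f"unsupported --dtype value: {requested}")
    return ALIASES[requested]


print(resolve_dtype("auto", "float32"))
print(resolve_dtype("half", "bfloat16"))
```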

--max-model-len <length>

Model context length. If unspecified, will be automatically set to the model's default. If set to a higher number, will automatically adjust RoPE scaling to increase model's max context length.

--worker-use-ray

Use Ray for distributed serving. Enabled by default when using more than one GPU. It can also give a slight increase in throughput on a single GPU.

--pipeline-parallel-size (-pp) <size>

Number of pipeline stages. Currently unsupported.

--tensor-parallel-size (-tp) <size>

The number of tensor parallel replicas. Set this to the number of GPUs you want to use.

--max-parallel-loading-workers <workers>

Load model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and larger models.

--block-size {8,16,32}

Token block size for contiguous chunks of tokens.
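
To get a feel for what the block size means, the sketch below counts how many blocks a sequence would occupy. It assumes tokens are stored in fixed-size blocks, with the last block possibly only partly filled; `blocks_needed` is a hypothetical helper, not part of the engine.

```python
import math


def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Number of fixed-size blocks required to hold num_tokens tokens."""
    if block_size not in (8, 16, 32):
        raise ValueError("--block-size must be one of 8, 16, 32")
    # The last block may be only partly filled, hence the ceiling.
    return math.ceil(num_tokens / block_size)


for size in (8, 16, 32):
    print(size, blocks_needed(100, size))
```

Smaller blocks waste less space in the final, partly filled block, at the cost of tracking more blocks per sequence.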

--seed <seed>

The initial random seed to use.

--swap-space <size>

CPU swap space size (GiB) per GPU.

--gpu-memory-utilization (-gmu) <fraction>

The fraction of GPU memory to be used for the model, which can range from 0 to 1. For example, a value of 0.5 would reserve 50% of the GPU's memory for Aphrodite. If unspecified, will be set to 0.9 (90% of the GPU memory).
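
As a quick worked example of the arithmetic, the sketch below converts a fraction into a memory budget. `memory_budget_gib` is a hypothetical helper, and the 24 GiB card is just an example figure.

```python
def memory_budget_gib(total_gib: float, gpu_memory_utilization: float = 0.9) -> float:
    """GPU memory (GiB) made available for a given utilization fraction."""
    if not 0 < gpu_memory_utilization <= 1:
        raise ValueError("--gpu-memory-utilization must be in the range (0, 1]")
    return total_gib * gpu_memory_utilization


# On a 24 GiB card, a value of 0.5 leaves half of the memory untouched.
print(memory_budget_gib(24, 0.5))
```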

--max-num-batched-tokens <tokens>

Maximum number of batched tokens per iteration.

--max-num-seqs <sequences>

Maximum number of sequences per iteration. 256 by default. Increase if you want to process more requests per iteration.

--max-paddings <paddings>

Maximum number of paddings in a batch.

--disable-log-stats

Disable logging statistics.

--disable-log-requests

Disable the request logging.

--max-log-len <length>

Maximum length of the logs. By default it's set to infinite. Set to 0 to disable logging the prompts.
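
The effect of this option can be pictured as a simple truncation before the prompt is written to the log. This is a sketch only: `prompt_for_log` is a hypothetical helper, and it assumes the length is counted in characters and that "infinite" is represented as `None`.

```python
from typing import Optional


def prompt_for_log(prompt: str, max_log_len: Optional[int] = None) -> str:
    """Return the part of the prompt that would appear in the log."""
    if max_log_len is None:
        # No limit: the whole prompt is logged.
        return prompt
    # A limit of 0 yields an empty string, so the prompt is not logged at all.
    return prompt[:max_log_len]


print(repr(prompt_for_log("Once upon a time", 4)))
print(repr(prompt_for_log("Once upon a time", 0)))
```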

--quantization (-q) {awq,gptq,squeezellm,None}

Method used to quantize the weights. Pair with --dtype float16.

--enforce-eager

Use eager-mode PyTorch and disable CUDA graphs. If this flag is not set, eager mode and CUDA graphs are used in hybrid for maximum performance and flexibility.

--max-context-len-to-capture <length>

Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode.
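
How --enforce-eager and --max-context-len-to-capture interact can be summarized as a single decision. The sketch below restates the rule from the descriptions above; `uses_cuda_graph` is a hypothetical name and the limit of 8192 is only an example value.

```python
def uses_cuda_graph(
    context_len: int,
    max_context_len_to_capture: int,
    enforce_eager: bool = False,
) -> bool:
    """True if a sequence of this context length would run under a CUDA graph."""
    if enforce_eager:
        # --enforce-eager disables CUDA graphs entirely.
        return False
    # Contexts longer than the captured limit fall back to eager mode.
    return context_len <= max_context_len_to_capture


print(uses_cuda_graph(4096, 8192))
print(uses_cuda_graph(16384, 8192))
```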

--kv-cache-dtype {fp8,None}

The data type to use for the KV cache. Set to fp8 for memory savings with a slight increase to throughput.

--served-model-name <name>

Pretty name to use in the API. If not specified, the model name will be the same as the --model arg.

--max-length <length>

The maximum length to report to the API. Does not affect the engine. Kobold API only.

--api-keys [key1,key2]

Authorization API key for the OpenAI server. Not applicable to Kobold API. Optional.
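
On the client side, a key would typically be sent in the Authorization header. The sketch below builds such a request without sending it. It rests on assumptions: the server follows the OpenAI convention of `Authorization: Bearer <key>`, it exposes `/v1/models`, and it listens on `http://localhost:2242`. Adjust these to match your deployment. `build_models_request` is a hypothetical helper.

```python
import urllib.request

# Assumption: host and port of the running server; change to match your setup.
BASE_URL = "http://localhost:2242"


def build_models_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated request to list models."""
    return urllib.request.Request(
        f"{BASE_URL}/v1/models",
        # Assumption: OpenAI-style bearer token authentication.
        headers={"Authorization": f"Bearer {api_key}"},
    )


req = build_models_request("key1")
print(req.full_url)
print(req.get_header("Authorization"))
```

Passing the request to `urllib.request.urlopen(req)` would send it to a running server.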

--chat-template <path>

Path to the jinja file with the chat template. By default, it will use the model's specified template (if available). If not found, will disable Chat Completions. OpenAI API only.

--response-role <role>

The role name to return for OpenAI chat completions.

--ssl-keyfile <path>

Path to the SSL key file.

--ssl-certfile <path>

Path to the SSL certificate file.
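
To show how several of the options on this page combine, the sketch below assembles a launch command from a dictionary of flags. The module path in `ENTRYPOINT` is an assumption and may differ between versions, so check your installation; `build_command` is a hypothetical helper, and the model name is left as a placeholder.

```python
import shlex

# Assumption: module path of the OpenAI-compatible API server.
ENTRYPOINT = ["python", "-m", "aphrodite.endpoints.openai.api_server"]


def build_command(options: dict) -> list:
    """Turn a {flag: value} mapping into an argument list."""
    argv = list(ENTRYPOINT)
    for flag, value in options.items():
        if value is True:
            # Boolean switches such as --disable-log-requests take no value.
            argv.append(flag)
        elif value is not None and value is not False:
            argv.extend([flag, str(value)])
    return argv


cmd = build_command({
    "--model": "<model_name_or_path>",
    "--dtype": "float16",
    "--gpu-memory-utilization": 0.9,
    "--max-num-seqs": 256,
    "--disable-log-requests": True,
})
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and value separate, which also makes it safe to hand to `subprocess.run` without shell quoting issues.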
