2. Usage

There are two ways to use Aphrodite Engine: through one of its API servers, or directly through the provided LLM class.

API Server

Aphrodite provides two REST API servers: OpenAI-compatible and KoboldAI-compatible. Below are examples of running Mistral 7B on 2 GPUs:

OpenAI

python -m aphrodite.endpoints.openai.api_server --model mistralai/Mistral-7B-v0.1 -tp 2 --api-keys sk-example

You can query the server via curl:

curl http://localhost:2242/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-example" \
-d '{
  "model": "mistralai/Mistral-7B-v0.1",
  "prompt": "Every age it seems is tainted by the greed of men. Rubbish to one such as I,",
  "stream": false,
  "mirostat_mode": 2,
  "mirostat_tau": 6.5,
  "mirostat_eta": 0.2
}'
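
The same request can also be sent from Python. What follows is a minimal sketch using only the standard library, not an official client: the helper names build_completion_request and complete are hypothetical, and the URL, API key, and sampler fields are copied from the curl example above. Reading choices[0].text assumes the server replies with the usual OpenAI completions response shape.

```python
import json
import urllib.request

# Values copied from the curl example; adjust them to your deployment.
API_URL = "http://localhost:2242/v1/completions"
API_KEY = "sk-example"


def build_completion_request(prompt, model="mistralai/Mistral-7B-v0.1"):
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "mirostat_mode": 2,
        "mirostat_tau": 6.5,
        "mirostat_eta": 0.2,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def complete(prompt):
    """Send the request and return the generated text.

    Assumption: the response body looks like {"choices": [{"text": ...}]}.
    """
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server from the previous command running, calling complete("Every age it seems is tainted by the greed of men.") should return the continuation as a string.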

KoboldAI

python -m aphrodite.endpoints.kobold.api_server --model mistralai/Mistral-7B-v0.1 -tp 2

The corresponding curl request:

curl -X 'POST' \
  'http://localhost:2242/api/v1/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "Niko the kobold stalked carefully down the alley, his small scaly figure obscured by a dusky cloak that fluttered lightly in the cold winter breeze.",
  "max_context_length": 32768,
  "max_length": 512,
  "stream": false,
  "mirostat_mode": 2,
  "mirostat_tau": 6.5,
  "mirostat_eta": 0.2
}'

Keep in mind that -tp 2 uses the first 2 visible GPUs. Adjust that value based on the number of available GPUs.
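
Which GPUs count as "visible" can be controlled with the standard CUDA_VISIBLE_DEVICES environment variable. The sketch below assumes the server is started as a subprocess from Python; the helper name launch_spec and the GPU indices 2 and 3 are illustrative only, and the command line is the OpenAI one shown earlier with -tp matched to the number of GPUs exposed.

```python
import os
import sys


def launch_spec(gpu_ids, model="mistralai/Mistral-7B-v0.1"):
    """Return (argv, env) for starting the OpenAI server on specific GPUs."""
    env = dict(os.environ)
    # CUDA exposes only the listed devices to the process, in this order.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    argv = [
        sys.executable, "-m", "aphrodite.endpoints.openai.api_server",
        "--model", model,
        "-tp", str(len(gpu_ids)),  # one tensor-parallel rank per exposed GPU
    ]
    return argv, env


argv, env = launch_spec([2, 3])
print(env["CUDA_VISIBLE_DEVICES"], " ".join(argv[1:]))
# To actually start the server: subprocess.run(argv, env=env, check=True)
```

Deriving -tp from the length of the GPU list keeps the two values from drifting apart when the list changes.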

Offline Inference

You can also use Aphrodite without setting up a REST API server, for example directly from your own scripts.

First, import the LLM class to handle the model-related configurations, and SamplingParams for specifying sampler settings.

from aphrodite import LLM, SamplingParams

Then, define a single prompt or a list of prompts for the model.

prompts = [
    "What is a man? A miserable little",
    "Once upon a time",
]

Specify the sampling parameters:

sampling_params = SamplingParams(temperature=1.1, min_p=0.05)

Define the model to use and run generation:

llm = LLM(model="mistralai/Mistral-7B-v0.1", tensor_parallel_size=2)
outputs = llm.generate(prompts, sampling_params)

The llm.generate method will use the loaded model to process the prompts. You can then print out the responses:

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Output: {generated_text!r}")
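
In a script it is often more convenient to collect the results than to print them. The sketch below relies only on the attributes used in the loop above (output.prompt and output.outputs[0].text); the helper name collect_results is hypothetical, and the stand-in objects exist solely so the helper can be tried without loading a model.

```python
from types import SimpleNamespace


def collect_results(outputs):
    """Map each prompt to the text of its first completion."""
    return {out.prompt: out.outputs[0].text for out in outputs}


# Stand-ins exposing the same attributes as the objects in the loop above.
# Real code would pass the return value of llm.generate(...) instead.
fake_outputs = [
    SimpleNamespace(
        prompt="Once upon a time",
        outputs=[SimpleNamespace(text=" there was a kobold.")],
    ),
]
results = collect_results(fake_outputs)
print(results)
```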

Quantization

Aphrodite supports a variety of quantization formats. To get started, download (or convert) the appropriate quantized model from HuggingFace, and launch the engine with -q $QUANT_FORMAT --dtype half, where $QUANT_FORMAT is one of awq, gguf, gptq, quip, or squeezellm. GGUF models need a bit more work, so read below.
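
When the engine is launched from a script, the flags can be assembled and validated before starting the server. This is a minimal sketch under stated assumptions: the helper name quantized_launch_argv and the model path are hypothetical, the OpenAI server module is taken from the API Server section, and the accepted formats are just the five listed in the paragraph above.

```python
import sys

# Formats named in the paragraph above.
SUPPORTED_QUANT_FORMATS = ("awq", "gguf", "gptq", "quip", "squeezellm")


def quantized_launch_argv(model, quant_format):
    """Return the argv for launching the OpenAI server with a quantized model."""
    if quant_format not in SUPPORTED_QUANT_FORMATS:
        raise ValueError(f"unsupported quantization format: {quant_format!r}")
    return [
        sys.executable, "-m", "aphrodite.endpoints.openai.api_server",
        "--model", model,
        "-q", quant_format,
        "--dtype", "half",
    ]


# "/path/to/awq-model" is a placeholder, not a real checkpoint.
print(" ".join(quantized_launch_argv("/path/to/awq-model", "awq")[1:]))
```

Rejecting an unknown format up front gives a clear error before any model weights are loaded.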

GGUF

To run GGUF models, you first need to convert them to PyTorch format. The conversion does not alter the model; it is comparable to converting a pytorch_model.bin file to model.safetensors, where the weights are simply repackaged into a compatible format.

python examples/gguf-to-torch.py --input /path/to/model.gguf --output /path/to/output/directory

Then launch the engine with your --model flag pointing to the output directory.
