
2. Usage

AlpinDale edited this page Jan 30, 2024 · 7 revisions

Usage

There are two ways to use Aphrodite Engine: through one of its API servers, or directly through the provided LLM class.

API Server

Aphrodite provides two REST API servers: an OpenAI-compatible server and a KoboldAI server. The examples below run Mistral 7B on 2 GPUs:

OpenAI

python -m aphrodite.endpoints.openai.api_server --model mistralai/Mistral-7B-v0.1 -tp 2 --api-keys sk-example

You can query the server via curl:

curl http://localhost:2242/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-example" \
-d '{
  "model": "mistralai/Mistral-7B-v0.1",
  "prompt": "Every age it seems is tainted by the greed of men. Rubbish to one such as I,",
  "stream": false,
  "mirostat_mode": 2,
  "mirostat_tau": 6.5,
  "mirostat_eta": 0.2
}'
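The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl example above; the helper names `build_completion_request` and `send` are illustrative, not part of Aphrodite, and the URL and API key are the example values from this page.

```python
import json
import urllib.request

# Values taken from the curl example; adjust them for your own deployment.
BASE_URL = "http://localhost:2242"
API_KEY = "sk-example"


def build_completion_request(prompt, model="mistralai/Mistral-7B-v0.1"):
    """Build (but do not send) a POST request to /v1/completions."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "mirostat_mode": 2,
        "mirostat_tau": 6.5,
        "mirostat_eta": 0.2,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires a running server):
#   result = send(build_completion_request("Every age it seems is tainted"))
#   print(result)
```

Building the request separately from sending it makes it easy to inspect the payload and headers before anything goes over the network.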

KoboldAI

python -m aphrodite.endpoints.kobold.api_server --model mistralai/Mistral-7B-v0.1 -tp 2

And the curl request:

curl -X 'POST' \
  'http://localhost:2242/api/v1/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "Niko the kobold stalked carefully down the alley, his small scaly figure obscured by a dusky cloak that fluttered lightly in the cold winter breeze.",
  "max_context_length": 32768,
  "max_length": 512,
  "stream": false,
  "mirostat_mode": 2,
  "mirostat_tau": 6.5,
  "mirostat_eta": 0.2
}'

Keep in mind that -tp 2 uses the first 2 visible GPUs. Adjust that value based on the number of available GPUs.
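Because -tp counts from the first visible GPU, one way to pick specific devices is to restrict visibility with the standard CUDA_VISIBLE_DEVICES environment variable before launching. The sketch below assumes the server honors that variable as ordinary CUDA programs do; `launch_spec` is a hypothetical helper, not an Aphrodite function.

```python
import os


def launch_spec(gpu_ids, model="mistralai/Mistral-7B-v0.1"):
    """Return (command, env) for starting the OpenAI server on the given GPUs."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    env = dict(os.environ)
    # Only the listed devices will be visible to the launched process.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    command = [
        "python", "-m", "aphrodite.endpoints.openai.api_server",
        "--model", model,
        # Tensor-parallel size matches the number of visible GPUs.
        "-tp", str(len(gpu_ids)),
    ]
    return command, env


# Usage:
#   import subprocess
#   command, env = launch_spec([2, 3])
#   subprocess.run(command, env=env, check=True)
```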

Offline Inference

You can also use Aphrodite without setting up a REST API server, for example to call it directly from your own scripts.

First, import the LLM class, which handles the model configuration, and SamplingParams, which specifies the sampler settings.

from aphrodite import LLM, SamplingParams

Then, define a single prompt or a list of prompts for the model.

prompts = [
    "What is a man? A miserable little",
    "Once upon a time",
]

Specify the sampling parameters:

sampling_params = SamplingParams(temperature=1.1, min_p=0.05)

Load the model and generate completions for the prompts:

llm = LLM(model="mistralai/Mistral-7B-v0.1", tensor_parallel_size=2)
outputs = llm.generate(prompts, sampling_params)

The llm.generate method will use the loaded model to process the prompts. You can then print out the responses:

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Output: {generated_text!r}")
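If you need the results as data rather than printed text, the same two fields can be gathered into a list. This is a minimal sketch: `collect_completions` is a hypothetical helper, and the stand-in objects only imitate the `prompt` and `outputs[0].text` attributes used above so the snippet runs without a loaded model.

```python
from types import SimpleNamespace


def collect_completions(outputs):
    """Return one {"prompt", "text"} record per output, using its first completion."""
    return [
        {"prompt": out.prompt, "text": out.outputs[0].text}
        for out in outputs
    ]


# Stand-ins shaped like the attributes read in the loop above (not real engine output).
fake_outputs = [
    SimpleNamespace(
        prompt="Once upon a time",
        outputs=[SimpleNamespace(text=" there was a kobold")],
    ),
]

records = collect_completions(fake_outputs)
print(records)
```

With a real model, pass the list returned by llm.generate instead of `fake_outputs`.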

Quantization

Aphrodite supports a variety of quantization formats. To get started, download (or convert) the appropriate quantized model from HuggingFace, and launch the engine with -q $QUANT_FORMAT --dtype half, where $QUANT_FORMAT is one of awq, gguf, gptq, quip, or squeezellm. GGUF models need an extra conversion step, described below.
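When launching from a script, it can help to validate the format name before building the command line. The sketch below only encodes the format list and flags stated above; `quant_args` is a hypothetical helper, not part of Aphrodite.

```python
# Format names as listed in this section.
SUPPORTED_QUANT_FORMATS = ("awq", "gguf", "gptq", "quip", "squeezellm")


def quant_args(quant_format):
    """Return the extra CLI arguments for a quantized model."""
    if quant_format not in SUPPORTED_QUANT_FORMATS:
        raise ValueError(
            f"unsupported quantization format {quant_format!r}; "
            f"expected one of {', '.join(SUPPORTED_QUANT_FORMATS)}"
        )
    return ["-q", quant_format, "--dtype", "half"]


# Usage: append the result to the server command, e.g.
#   command = ["python", "-m", "aphrodite.endpoints.openai.api_server",
#              "--model", "/path/to/quantized/model"] + quant_args("gptq")
```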

GGUF

To run GGUF models, you first need to convert them to PyTorch format. The conversion does not alter the model: much like converting a pytorch_model.bin file to model.safetensors, it only repackages the weights into a compatible format.

python examples/gguf-to-torch.py --input /path/to/model.gguf --output /path/to/output/directory

Then launch the engine with your --model flag pointing to the output directory.
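The two steps can be chained in a small script. This is a sketch under assumptions: it uses the OpenAI server module from earlier on this page, and it assumes the converted model is launched with -q gguf --dtype half as described in the Quantization section. `gguf_commands` is a hypothetical helper.

```python
def gguf_commands(gguf_path, output_dir):
    """Return the (convert, serve) commands for a GGUF model."""
    convert = [
        "python", "examples/gguf-to-torch.py",
        "--input", gguf_path,
        "--output", output_dir,
    ]
    serve = [
        "python", "-m", "aphrodite.endpoints.openai.api_server",
        "--model", output_dir,  # point --model at the converted directory
        "-q", "gguf", "--dtype", "half",  # assumed flags, see lead-in
    ]
    return convert, serve


# Usage (run from the repository root so the examples/ path resolves):
#   import subprocess
#   convert, serve = gguf_commands("/path/to/model.gguf", "/path/to/output/directory")
#   subprocess.run(convert, check=True)  # stops here if the conversion fails
#   subprocess.run(serve, check=True)    # blocks while the server runs
```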
