2. Usage
There are two ways to use Aphrodite Engine: through one of its REST API servers, or directly in Python through the provided LLM class.
Aphrodite provides two REST API servers, OpenAI and KoboldAI. Below are examples of running Mistral 7B on 2 GPUs. To start the OpenAI server:
python -m aphrodite.endpoints.openai.api_server --model mistralai/Mistral-7B-v0.1 -tp 2 --api-keys sk-example
You can query the server via curl:
curl http://localhost:2242/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-example" \
-d '{
"model": "mistralai/Mistral-7B-v0.1",
"prompt": "Every age it seems is tainted by the greed of men. Rubbish to one such as I,",
"stream": false,
"mirostat_mode": 2,
"mirostat_tau": 6.5,
"mirostat_eta": 0.2
}'
To start the KoboldAI server instead:
python -m aphrodite.endpoints.kobold.api_server --model mistralai/Mistral-7B-v0.1 -tp 2
And the curl request:
curl -X 'POST' \
'http://localhost:2242/api/v1/generate' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Niko the kobold stalked carefully down the alley, his small scaly figure obscured by a dusky cloak that fluttered lightly in the cold winter breeze.",
"max_context_length": 32768,
"max_length": 512,
"stream": false,
"mirostat_mode": 2,
"mirostat_tau": 6.5,
"mirostat_eta": 0.2
}'
Keep in mind that -tp 2 uses the first 2 visible GPUs. Adjust that value based on the number of available GPUs.
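The same two requests can also be built from Python instead of curl. The following is a minimal sketch using only the standard library; the helper names (build_openai_request, build_kobold_request, send) are hypothetical and not part of Aphrodite, and the URL, port, and JSON fields are copied from the curl examples above rather than from a full API reference.

```python
import json
import urllib.request

# Host and port as used in the curl examples; adjust to your own setup.
BASE_URL = "http://localhost:2242"


def build_openai_request(prompt, model, api_key, base_url=BASE_URL):
    """Build (but do not send) a POST to the OpenAI-style completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Sampler settings mirrored from the curl example.
        "mirostat_mode": 2,
        "mirostat_tau": 6.5,
        "mirostat_eta": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def build_kobold_request(prompt, max_length=512, base_url=BASE_URL):
    """Build (but do not send) a POST to the KoboldAI-style generate route."""
    payload = {
        "prompt": prompt,
        "max_context_length": 32768,
        "max_length": max_length,
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send a prepared request and return the decoded JSON body.

    Requires one of the servers above to be running.
    """
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Example (needs a running server):
# print(send(build_openai_request("Once upon a time",
#                                 "mistralai/Mistral-7B-v0.1", "sk-example")))
```

Keeping request construction separate from sending makes it easy to inspect the payload before any network call is made.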
You can also use Aphrodite without setting up a REST API server, for example directly inside your own scripts.
First, import the LLM class to handle the model-related configurations, and SamplingParams for specifying sampler settings.
from aphrodite import LLM, SamplingParams
Then, define a single prompt or a list of prompts for the model:
prompts = [
    "What is a man? A miserable little",
    "Once upon a time",
]
Specify the sampling parameters:
sampling_params = SamplingParams(temperature=1.1, min_p=0.05)
Define the model to use:
llm = LLM(model="mistralai/Mistral-7B-v0.1", tensor_parallel_size=2)
outputs = llm.generate(prompts, sampling_params)
The llm.generate method will use the loaded model to process the prompts. You can then print out the responses:
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Output: {generated_text!r}")
A variety of quantization formats are supported in Aphrodite. To get started, download (or convert) the appropriate quantized model from HuggingFace, and launch the engine with -q $QUANT_FORMAT --dtype half, where $QUANT_FORMAT is one of awq, gguf, gptq, quip, or squeezellm. GGUF models need a bit more work, as described below.
To run GGUF models, you first need to convert them to PyTorch format. This doesn't alter the model: much like converting a pytorch_model.bin file to model.safetensors, the weights are simply re-packaged into a compatible format.
python examples/gguf-to-torch.py --input /path/to/model.gguf --output /path/to/output/directory
Then launch the engine with your --model flag pointing to the output directory.
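If you launch the engine from a script, the commands above can be assembled programmatically. This is a rough sketch, not an Aphrodite API: launch_command and gguf_convert_command are hypothetical helpers, and they only use the flags and the format list mentioned in this section. Whether a converted GGUF model also needs -q gguf at launch is not spelled out here, so the sketch leaves quant_format optional.

```python
# Quantization formats listed in this section.
SUPPORTED_QUANT_FORMATS = ("awq", "gguf", "gptq", "quip", "squeezellm")


def launch_command(model, quant_format=None, tensor_parallel=None):
    """Return the argv list for starting the OpenAI-compatible server."""
    cmd = [
        "python", "-m", "aphrodite.endpoints.openai.api_server",
        "--model", model,
    ]
    if tensor_parallel is not None:
        cmd += ["-tp", str(tensor_parallel)]
    if quant_format is not None:
        if quant_format not in SUPPORTED_QUANT_FORMATS:
            raise ValueError(f"unsupported quantization format: {quant_format!r}")
        # Quantized models are launched with half precision, as described above.
        cmd += ["-q", quant_format, "--dtype", "half"]
    return cmd


def gguf_convert_command(gguf_path, output_dir):
    """Return the argv list for converting a GGUF file to PyTorch format."""
    return [
        "python", "examples/gguf-to-torch.py",
        "--input", gguf_path,
        "--output", output_dir,
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True): first the conversion command, then launch_command with model set to the output directory.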