- https://docs.vllm.ai/en/latest/getting_started/quickstart.html
- vllm
    - easy, fast, cheap llm serving
    - serving/deploying/hosting
        - fastapi-based (uvicorn) server for online serving
        - OpenAI-Compatible Server
- OpenAI API parameters and return values
    - stop: list
        - `['\n\nHuman']`
    - finish_reason
        - https://platform.openai.com/docs/api-reference/chat/object
        - length: the maximum number of tokens specified in the request was reached
        - stop: the API returned the full chat completion generated by the model without running into any limits
            - this is the value when the model hit a natural stop point or a provided stop sequence
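
A minimal sketch of how client code might branch on `finish_reason`, based on the two values described above. The helper name `describe_finish_reason` and the sample choice are hypothetical, written by hand for illustration; they are not output from a real server.

```python
def describe_finish_reason(choice: dict) -> str:
    """Map a choice's finish_reason to a short human-readable note."""
    reason = choice.get("finish_reason")
    if reason == "length":
        # max_tokens was reached, so the text is probably cut off.
        return "truncated: max_tokens was reached"
    if reason == "stop":
        # Natural stop point, or one of the sequences passed in `stop`.
        return "complete: natural stop point or stop sequence"
    return f"other: {reason}"


# Hand-written sample choice (assumed shape), not real server output.
sample_choice = {
    "index": 0,
    "message": {"role": "assistant", "content": "Hello!"},
    "finish_reason": "stop",
}
print(describe_finish_reason(sample_choice))
```

If `length` shows up often, raising `max_tokens` in the request is the usual first thing to try.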

### basics

```
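from vllm import LLM

# Two plain-text prompts for the base (non-instruct) model to continue.
prompts = ['Hello, my name is ', 'The capital of China is ']
# max_model_len caps the context length (prompt + generated tokens).
llm = LLM(model='meta-llama/Meta-Llama-3.1-8B', max_model_len=4096)
# generate() returns one result per prompt, in the same order.
outputs = llm.generate(prompts)
print(outputs[0].outputs[0].text)
print(outputs[1].outputs[0].text)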
```

- the current vLLM instance can use total_gpu_memory (23.65GiB) x gpu_memory_utilization (0.90) = 21.28GiB
    - model weights take 14.99GiB;
    - non_torch_memory takes 0.09GiB;
    - PyTorch activation peak memory takes 1.20GiB;
    - the rest of the memory reserved for KV Cache is 5.01GiB.
- the current vLLM instance can use total_gpu_memory (23.65GiB) x gpu_memory_utilization (0.95) = 22.47GiB
    - model weights take 14.99GiB;
    - non_torch_memory takes 0.09GiB;
    - PyTorch activation peak memory takes 1.20GiB;
    - the rest of the memory reserved for KV Cache is 6.19GiB.
- the current vLLM instance can use total_gpu_memory (23.65GiB) x gpu_memory_utilization (0.95) = 22.47GiB
    - model weights take **7.51GiB**;
    - non_torch_memory takes 0.28GiB;
    - PyTorch activation peak memory takes 1.20GiB;
    - the rest of the memory reserved for KV Cache is **13.47GiB.**
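
The three log excerpts above follow the same arithmetic: the usable budget is `total_gpu_memory x gpu_memory_utilization`, and whatever is left after weights, non-torch memory, and the activation peak goes to the KV cache. The sketch below only reproduces that subtraction from the logged numbers. The function name `kv_cache_budget_gib` is hypothetical, and results can differ from the log by about 0.01 GiB because the log rounds each term.

```python
def kv_cache_budget_gib(total_gpu_gib, gpu_memory_utilization,
                        weights_gib, non_torch_gib, activation_peak_gib):
    """Memory left for the KV cache, using the breakdown printed in the log."""
    usable = total_gpu_gib * gpu_memory_utilization
    return usable - weights_gib - non_torch_gib - activation_peak_gib


# (gpu_memory_utilization, model weights GiB, non_torch_memory GiB) from the notes
runs = [
    (0.90, 14.99, 0.09),
    (0.95, 14.99, 0.09),
    (0.95, 7.51, 0.28),
]
for util, weights, non_torch in runs:
    kv = kv_cache_budget_gib(23.65, util, weights, non_torch, 1.20)
    print(f"utilization={util:.2f} weights={weights:.2f}GiB -> KV cache ~ {kv:.2f}GiB")
```

Raising utilization from 0.90 to 0.95 adds a little over 1 GiB of KV cache, while halving the weight footprint frees far more.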

### OpenAI-Compatible Server

```
$ vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192
$ vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype auto --api-key keytest --gpu-memory-utilization 0.95 --max-model-len 8192
$ nohup vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype auto --api-key keytest --gpu-memory-utilization 0.95 --max-model-len 8192 &

$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 8192
```

- `http://localhost:8000/`
- default parameters
    - ip: localhost
    - port: 8000
    - dtype: auto
    - device: auto
    - api_key: None
    - gpu_memory_utilization: 0.9
    - **max_model_len**: None

#### completions

- interactive API docs (FastAPI Swagger UI): http://192.168.101.16:8000/docs
- curl
    ```
    curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
        }'
    ```
- postman
    

In [None]:
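from openai import OpenAI

# Point the OpenAI client at vLLM's API server instead of api.openai.com.
# If the server was started with --api-key keytest, pass "keytest" here;
# "EMPTY" only works when no API key was configured.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="meta-llama/Llama-3.1-8B-Instruct",
                                      prompt="San Francisco is a")
print("Completion result:", completion)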

#### chat

```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "How many Rs in strawberry? Let us think step by step."}
        ]
    }'
```
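
The same chat request can be assembled with only the Python standard library, which makes the wire format explicit. This is a sketch under assumptions: `build_chat_request` is a hypothetical helper, and the `Authorization: Bearer ...` header is only needed when the server was started with `--api-key` (here `keytest`, as in the serve commands earlier). The request is built but not sent.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key=None):
    """Build (but do not send) a POST request for /chat/completions."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # OpenAI-style bearer token; assumed to match the server's --api-key.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions", data=body, headers=headers, method="POST"
    )


request = build_chat_request(
    "http://localhost:8000/v1",
    "meta-llama/Llama-3.1-8B-Instruct",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
    api_key="keytest",
)
print(request.get_method(), request.full_url)
```

With a server running, `urllib.request.urlopen(request)` sends it, and the reply text is found under `choices[0]["message"]["content"]` in the decoded JSON.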

In [None]:
chat_response = client.chat.completions.create(
    # model="Qwen/Qwen2.5-1.5B-Instruct",
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)

In [None]:
print(chat_response.choices[0].message.content)

### distributed inference & serving

https://docs.vllm.ai/en/latest/serving/distributed_serving.html

- `--tensor-parallel-size 2`
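
When launching the server from a script, the flag can be added to an argument list programmatically. The helper below is a hypothetical convenience, not part of vLLM. It only assembles the command line from flags that already appear in these notes and prints it; actually running it on 2 GPUs is left to the caller (for example via `subprocess.run(cmd)`).

```python
import shlex


def build_serve_command(model, tensor_parallel_size=1, max_model_len=None):
    """Assemble a `vllm serve` argv list; nothing is executed here."""
    cmd = ["vllm", "serve", model,
           "--tensor-parallel-size", str(tensor_parallel_size)]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


cmd = build_serve_command(
    "meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    max_model_len=8192,
)
print(shlex.join(cmd))
```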