- https://docs.vllm.ai/en/latest/getting_started/quickstart.html

### OpenAI-Compatible Server

```
$ vllm serve Qwen/Qwen2.5-1.5B-Instruct
$ vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype auto --api-key keytest --gpu-memory-utilization 0.95 --max-model-len 4096
$ nohup vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype auto --api-key keytest --gpu-memory-utilization 0.95 --max-model-len 4096 &
```
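
When the server is started in the background with `nohup ... &`, it takes a while to load the weights before it accepts requests. The sketch below polls the OpenAI-compatible `/v1/models` route until it answers. The helper names `server_is_up` and `wait_for_server`, the retry count, and the delay are assumptions made for illustration, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url="http://localhost:8000", api_key=None, timeout=2.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    request = urllib.request.Request(f"{base_url}/v1/models")
    if api_key is not None:
        # Only needed when the server was started with --api-key.
        request.add_header("Authorization", f"Bearer {api_key}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_for_server(probe=server_is_up, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() up to `attempts` times; return True as soon as it succeeds."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # hypothetical back-off; tune for the model size
    return False
```

For the Llama command above, which sets `--api-key keytest`, the probe would be `lambda: server_is_up(api_key="keytest")`.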

- Default server address: `http://localhost:8000/`
- Default parameters
    - port: 8000
    - dtype: auto
    - device: auto
    - api_key: None
    - gpu_memory_utilization: 0.9
        - Example log from the Llama-3.1-8B run above, which overrides the default with 0.95: the current vLLM instance can use total_gpu_memory (23.65 GiB) x gpu_memory_utilization (0.95) = 22.47 GiB
    - **max_model_len**: None
- curl
    ```
    curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "Qwen/Qwen2.5-1.5B-Instruct",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
        }'
    ```
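
The memory budget quoted in the default-parameter list above is a single multiplication. As a quick check, here is a minimal sketch; the function name `gpu_memory_budget_gib` is an assumption made for illustration and is not a vLLM API.

```python
def gpu_memory_budget_gib(total_gpu_memory_gib, gpu_memory_utilization=0.9):
    """Memory one vLLM instance may claim: total GPU memory x utilization fraction."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_gpu_memory_gib * gpu_memory_utilization


# Figures from the log line above: 23.65 GiB total, utilization 0.95.
print(f"{gpu_memory_budget_gib(23.65, 0.95):.2f} GiB")  # prints: 22.47 GiB

# The same card with the default utilization of 0.9.
print(f"{gpu_memory_budget_gib(23.65):.2f} GiB")
```

Raising the fraction leaves more room for the KV cache but less headroom for anything else that shares the GPU.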

#### OpenAI API

In [24]:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "keytest"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

In [None]:
completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
                                      prompt="San Francisco is a")

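In [16]:
# Note: `completion` still holds the result of the "What is the meaning of life?" request
# (cell In [7] below); the "San Francisco is a" cell above was not re-run (prompt_tokens=7).
completion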

Completion(id='cmpl-b997458da9a24b62946f616268f404df', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' This is a question that people have debated for centuries, and there is no definitive', stop_reason=None, prompt_logprobs=None)], created=1736569230, model='Qwen/Qwen2.5-1.5B-Instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=7, total_tokens=23, completion_tokens_details=None, prompt_tokens_details=None))

In [17]:
completion.choices[0].text

' This is a question that people have debated for centuries, and there is no definitive'

In [7]:
completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
                                      prompt="What is the meaning of life?")

In [8]:
completion.choices[0].text

' This is a question that people have debated for centuries, and there is no definitive'
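
The completion above stops after 16 tokens with `finish_reason='length'`, which is consistent with no `max_tokens` being passed in the request. The curl example sets `max_tokens` and `temperature` explicitly, and the same can be done through the Python client. This is a minimal sketch: the helper names `completion_kwargs` and `complete` and the default of 64 tokens are assumptions chosen for illustration.

```python
def completion_kwargs(model, prompt, max_tokens=64, temperature=0.0):
    """Keyword arguments for client.completions.create with explicit sampling settings."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,    # upper bound on generated tokens
        "temperature": temperature,  # 0 selects greedy decoding
    }


def complete(client, prompt, model="Qwen/Qwen2.5-1.5B-Instruct", **overrides):
    """Send one completion request and return (text, finish_reason)."""
    response = client.completions.create(**completion_kwargs(model, prompt, **overrides))
    choice = response.choices[0]
    return choice.text, choice.finish_reason
```

With the `client` created earlier, `complete(client, "San Francisco is a", max_tokens=7)` mirrors the curl request; a `finish_reason` of `'length'` means the token cap was reached before the model stopped on its own.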

#### chat

```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
```

In [26]:
chat_response = client.chat.completions.create(
    # model="Qwen/Qwen2.5-1.5B-Instruct",
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)

Chat response: ChatCompletion(id='chatcmpl-8e5990b0a11f46d1b8d7b9ce9ab79d4e', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='A man walked into a library and asked the librarian, "Do you have any books on Pavlov\'s dogs and Schrödinger\'s cat?" \n\nThe librarian replied, "It rings a bell, but I\'m not sure if it\'s here or not."', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[]), stop_reason=None)], created=1736574759, model='meta-llama/Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=55, prompt_tokens=46, total_tokens=101, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)


In [27]:
print(chat_response.choices[0].message.content)

A man walked into a library and asked the librarian, "Do you have any books on Pavlov's dogs and Schrödinger's cat?" 

The librarian replied, "It rings a bell, but I'm not sure if it's here or not."
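
The chat call above can be wrapped so that only the reply text comes back and earlier turns can be replayed. This is a sketch under assumptions: `build_messages` and `chat` are hypothetical helper names, and `history` is assumed to be a sequence of message dicts in the same `role`/`content` shape used above.

```python
def build_messages(user_prompt, system_prompt="You are a helpful assistant.", history=()):
    """Assemble the messages list: system prompt, earlier turns, then the new user turn."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_prompt})
    return messages


def chat(client, user_prompt, model="meta-llama/Llama-3.1-8B-Instruct", history=()):
    """Send one chat request and return the assistant's reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(user_prompt, history=history),
    )
    return response.choices[0].message.content
```

Calling `chat(client, "Tell me a joke.")` reproduces the request above. For a follow-up turn, pass the previous user and assistant messages as `history`, since the server keeps no conversation state between requests.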


### distributed inference & serving

https://docs.vllm.ai/en/latest/serving/distributed_serving.html

- `--tensor-parallel-size 2`: shard the model across 2 GPUs with tensor parallelism
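
To keep single-GPU and multi-GPU launches consistent, the command line can be assembled in code. A minimal sketch follows: `vllm_serve_command` is a hypothetical helper, and it only emits flags that already appear in these notes.

```python
import shlex


def vllm_serve_command(model, tensor_parallel_size=1, api_key=None, max_model_len=None):
    """Build the argv list for `vllm serve`, adding optional flags only when set."""
    argv = ["vllm", "serve", model]
    if tensor_parallel_size > 1:
        # Number of GPUs the model is sharded across.
        argv += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if api_key is not None:
        argv += ["--api-key", api_key]
    if max_model_len is not None:
        argv += ["--max-model-len", str(max_model_len)]
    return argv


command = vllm_serve_command("meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)
print(shlex.join(command))
# prints: vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2
```

The returned list can be passed directly to `subprocess.Popen` to start the server without going through a shell.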