- `pip install vllm==0.8.1`

Run the vLLM server with the model `Qwen/Qwen2.5-7B`:

```
$ trl vllm-serve --model Qwen/Qwen2.5-7B
...
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
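
Before connecting a client, it can help to confirm that something is listening on the server port. The following is a minimal sketch using only the standard library. `wait_for_port` is a hypothetical helper, not part of TRL or vLLM, and the host and port are assumed from the Uvicorn log line above.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper for illustration; not part of TRL or vLLM.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect only shows that something is listening,
            # not that the model has finished loading.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait a little and retry.
            time.sleep(interval)
    return False


# Example call, assuming the default port 8000 shown in the log:
# wait_for_port("127.0.0.1", 8000, timeout=120.0)
```

An open port is only a rough readiness signal; the `Application startup complete.` log line remains the more reliable one.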

Use the client to generate completions and **update model weights**:

```python
>>> from trl.extras.vllm_client import VLLMClient
>>> client = VLLMClient()
>>> client.generate(["Hello, AI!", "Tell me a joke"])
[[2980, 498, 1492, 752, 448, 264, 13027, 8645, 30, 358, 2776, 4460, 311, 3270, 264, 2025],
 [911, 7988, 1251, 382, 3838, 653, 498, 1618, 4325, 879, 2581, 20027, 264, 21428, 30, 362]]

>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", device_map="cuda")
>>> client.update_model_params(model)
```

### generation & training

- When using vLLM, ensure that the GPUs assigned for training and generation are separate to avoid resource conflicts. For instance, if you plan to use 4 GPUs for training and another 4 for vLLM generation, you can specify GPU allocation using `CUDA_VISIBLE_DEVICES`.
    - In other words, you dedicate some GPUs to training and the rest to inference, so neither side has to offload to make room for the other.
- Set GPUs 0-3 for vLLM generation:

    ```
    CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model <model_name>
    ```

- And GPUs 4-7 for training:

    ```
    CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py
    ```

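    - Note: if accelerate already has its own configuration (for example, a default of 8 GPU processes), launch training as follows:
        - `CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch --num_processes=4 train.py`
        - The `--num_processes=4` flag overrides the setting in `~/.cache/huggingface/accelerate/default_config.yaml`.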
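
The GPU split described above can also be computed rather than typed by hand. The sketch below is one possible way to do it: `split_gpus`, `launch_plan`, and `as_shell` are hypothetical helper names introduced here for illustration, and the only CLI flags used are the ones already shown in this section.

```python
import shlex


def split_gpus(gpu_ids, n_generation):
    """Split GPU ids into non-overlapping (generation, training) groups."""
    if not 0 < n_generation < len(gpu_ids):
        raise ValueError("need at least one GPU on each side")
    return list(gpu_ids[:n_generation]), list(gpu_ids[n_generation:])


def launch_plan(model_name, gpu_ids, n_generation, script="train.py"):
    """Build the environment and argv for the vLLM server and the training job."""
    gen, train = split_gpus(gpu_ids, n_generation)
    server = {
        "env": {"CUDA_VISIBLE_DEVICES": ",".join(map(str, gen))},
        "argv": ["trl", "vllm-serve", "--model", model_name],
    }
    trainer = {
        "env": {"CUDA_VISIBLE_DEVICES": ",".join(map(str, train))},
        # Pin the process count to the training GPU count so an existing
        # accelerate default config (e.g. 8 processes) does not apply.
        "argv": ["accelerate", "launch", f"--num_processes={len(train)}", script],
    }
    return server, trainer


def as_shell(step):
    """Render one step as a shell command line for copy-pasting."""
    env = " ".join(f"{key}={value}" for key, value in step["env"].items())
    return f"{env} {shlex.join(step['argv'])}"


server, trainer = launch_plan("Qwen/Qwen2.5-7B", range(8), n_generation=4)
print(as_shell(server))
# CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-7B
print(as_shell(trainer))
# CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch --num_processes=4 train.py
```

To start the processes from Python instead of printing them, each step could be passed to `subprocess.Popen(step["argv"], env={**os.environ, **step["env"]})`, so the rest of the environment is inherited unchanged.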