## Accelerating generation with vLLM

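Key changes:

1.  **Model loading**:
    -   Use `vllm.LLM` in place of `AutoModelForCausalLM`.
    -   `tensor_parallel_size` sets how many GPUs are used for tensor parallelism.
    -   `trust_remote_code=True` is passed for the Qwen model (it is needed by checkpoints that ship custom code, such as the original Qwen releases).
2.  **Stop conditions**:
    -   vLLM natively accepts stop strings (e.g. `stop=["\n\n"]`).
    -   No custom `StoppingCriteria` is needed.
3.  **Generation control**:
    -   `SamplingParams` replaces the original `generate` arguments:
        -   `n=3` corresponds to `num_return_sequences`
        -   `best_of=6` corresponds to `num_beams` when beam search is enabled
        -   `use_beam_search=True` enables beam search
    -   Advanced parameters such as `num_beam_groups` were removed (not supported by vLLM at the time of writing).
4.  **Output handling**:
    -   `llm.generate` returns one `RequestOutput` per prompt; its `outputs` attribute is the list of completions.
    -   The generated text is read directly from `output.text`.
5.  **Performance**:
    -   vLLM manages the KV cache automatically.
    -   Continuous batching is supported, so requests that arrive together are batched automatically.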
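The parameter mapping in item 3 can be sketched as a small helper. This is a minimal illustration, assuming the `SamplingParams` fields of the vLLM version used in this notebook (where `use_beam_search` and `best_of` are still accepted). `hf_to_vllm_kwargs` is a hypothetical name, not part of either library, and it only builds a plain dict.

```python
def hf_to_vllm_kwargs(hf_kwargs):
    """Translate a few Hugging Face `generate` kwargs into SamplingParams kwargs.

    Hypothetical helper for illustration only; returns (kwargs, dropped_names).
    """
    # Arguments with no counterpart in this mapping are reported, not translated.
    unsupported = {"num_beam_groups", "diversity_penalty"}
    dropped = sorted(hf_kwargs.keys() & unsupported)

    out = {}
    if "num_return_sequences" in hf_kwargs:
        out["n"] = hf_kwargs["num_return_sequences"]
    if "max_new_tokens" in hf_kwargs:
        out["max_tokens"] = hf_kwargs["max_new_tokens"]

    num_beams = hf_kwargs.get("num_beams", 1)
    if num_beams > 1:
        # Beam search: best_of plays the role of the beam width.
        out["use_beam_search"] = True
        out["best_of"] = num_beams
        out["temperature"] = 0
    return out, dropped


kwargs, dropped = hf_to_vllm_kwargs(
    {"num_return_sequences": 3, "num_beams": 6,
     "num_beam_groups": 3, "max_new_tokens": 2048}
)
print(kwargs)
print(dropped)
```

The resulting dict could then be unpacked as `SamplingParams(**kwargs)`; the dropped names are the ones that need a different strategy, such as the penalty-based settings later in this notebook.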

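If you need finer control over the stop token (for example, when the tokenizer splits the stop string in an unexpected way), you can stop on token IDs instead: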

```python
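# Get the stop token IDs (fallback option)
stop_token_ids = tokenizer.encode("\n\n", add_special_tokens=False)
# stop_token_ids expects a flat list of ints; generation stops when ANY one of them is produced
sampling_params.stop_token_ids = stop_token_ids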
```

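Prefer the stop-string approach: it is more intuitive and does not depend on how the tokenizer splits the text. The code was reported as tested with vLLM 0.4.2 and Qwen2.5-7B; the cells below run vLLM 0.6.1.post2 (per the engine log) with Qwen2.5-Math-1.5B-Instruct.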
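To make the difference between the two stop mechanisms concrete, here is a toy sketch. It assumes a made-up four-entry vocabulary (real tokenizers differ) and hypothetical helper names. It only imitates the idea that `stop` matches on the decoded text while `stop_token_ids` matches individual token IDs.

```python
# Toy vocabulary: an assumption for illustration, not a real tokenizer.
VOCAB = {0: "Step 1", 1: ".\n\n", 2: "\n\n", 3: "Step 2"}


def decode_until_stop_string(token_ids, stop):
    """Stop as soon as the decoded text contains the stop string."""
    text = ""
    for tid in token_ids:
        text += VOCAB[tid]
        pos = text.find(stop)
        if pos != -1:
            # Keep the stop string, as include_stop_str_in_output=True would.
            return text[: pos + len(stop)]
    return text


def decode_until_stop_id(token_ids, stop_token_ids):
    """Stop as soon as one of the given token IDs is produced."""
    text = ""
    for tid in token_ids:
        text += VOCAB[tid]
        if tid in stop_token_ids:
            break
    return text


# The blank line is emitted as ONE token ".\n\n" (id 1), not as the "\n\n" token (id 2).
generated = [0, 1, 3, 2]

by_string = decode_until_stop_string(generated, "\n\n")
by_token_id = decode_until_stop_id(generated, [2])
print(repr(by_string))
print(repr(by_token_id))
```

In this toy run the string match stops after the first step, while the token-ID match runs on until the standalone `"\n\n"` token appears. That is the kind of tokenizer-dependent behavior the stop-string approach avoids.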

In [1]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [2]:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_name = "/data/cuiluyi/resources/models/Qwen/Qwen2.5-Math-1.5B-Instruct"



In [None]:
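# ===================== vLLM initialization =====================
# Note: vLLM places the model on the GPU(s) itself; there is no device_map to set
llm = LLM(
    model=model_name,
    tensor_parallel_size=1,       # increase when using multiple GPUs
    dtype="auto",                 # pick the precision automatically
    trust_remote_code=True        # passed for the Qwen model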
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

INFO 03-08 12:21:11 llm_engine.py:223] Initializing an LLM engine (v0.6.1.post2) with config: model='/data/cuiluyi/resources/models/Qwen/Qwen2.5-Math-1.5B-Instruct', speculative_config=None, tokenizer='/data/cuiluyi/resources/models/Qwen/Qwen2.5-Math-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/cuiluyi/resources/models/Qwen/Qwen2.

OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 9.75 MiB is free. Process 2204549 has 22.59 GiB memory in use. Including non-PyTorch memory, this process has 1.07 GiB memory in use. Of the allocated memory 785.78 MiB is allocated by PyTorch, and 18.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
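The `OutOfMemoryError` above is raised because another process already holds 22.59 GiB of the 23.68 GiB card, not because the 1.5B model is too large. A rough pre-flight check can be sketched as follows. It assumes vLLM's `gpu_memory_utilization` is a fraction of total GPU memory (default 0.9) and uses the figures from the error message. `has_room` is a hypothetical helper; on a real machine the free and total byte counts could come from `torch.cuda.mem_get_info()`.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def has_room(free_bytes, total_bytes, gpu_memory_utilization=0.9):
    """Rough heuristic: is the requested share of the card actually free?"""
    requested = gpu_memory_utilization * total_bytes
    return free_bytes >= requested


# Figures copied from the error message above.
total = 23.68 * GIB
free = 9.75 * MIB

busy_card_ok = has_room(free, total)        # the default 0.9 share is not available
idle_card_ok = has_room(22.0 * GIB, total)  # an assumed, mostly idle card
print(busy_card_ok, idle_card_ok)
```

For this run the practical fix is to free GPU 0 or to point `CUDA_VISIBLE_DEVICES` at an idle device before constructing `LLM`.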


In [None]:
# ===================== Input processing =====================
# prompt = "Find the value of $x$ that satisfies the equation $4x+5 = 6x+7$."
prompt = ("A sequence $(a_n)$ is defined as follows:\n\\[a_{i + 1} = \\frac{1}{1 - a_i}\\]for $i \\ge 1.$  If $a_3 = a_1,$ compute $(a_9)^9.$"
"To determine the length of \\(DE\\) in the given right triangle \\(DEF\\), we start by identifying the given information and the relationships between the sides of the triangle.\n\n1. Identify the given values:\n   \\[\n   \\sin D = 0.7\n   \\]\n   and \n   \\[\n   EF = 7\n   \\]\n\n3. **Solve for \\(DE\\):**\n   To find \\(DE\\), solve the equation:\n   \\[\n   0.7 = \\frac{7}{DE}\n   \\]\n   Multiply both sides by \\(DE\\):\n   \\[\n   0.7 \\cdot DE = 7\n   \\]\n   Then, divide both sides by 0.7:\n   \\[\n   DE = \\frac{7}{0.7}\n   \\]\n   Simplify the division:\n   \\[\n   DE = 10\n   \\]\n\n5. **Solve for \\(DE\\):**\n   Multiply both sides by \\(DE\\) to isolate \\(DE\\):\n   \\[\n   0.7 \\cdot DE = 7\n   \\]\n\n"
)

messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": prompt}
]

# Apply the chat template (vLLM 0.4.0+ can apply templates automatically, but it is done explicitly here)
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
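For reference, the string produced by `apply_chat_template` can be approximated by hand. The sketch below assumes the ChatML layout used by Qwen2.5 chat models (`<|im_start|>role ... <|im_end|>`). The authoritative template ships with the tokenizer, and `render_chatml` is a hypothetical helper for illustration only.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate a ChatML-style chat template (assumed layout)."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_messages = [
    {"role": "system",
     "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user",
     "content": "Find the value of $x$ that satisfies the equation $4x+5 = 6x+7$."},
]
demo_text = render_chatml(demo_messages)
print(demo_text)
```

Because `llm.generate` receives plain text, whatever precedes the open assistant turn (including a partial solution appended to the user prompt, as above) is what the model conditions on.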

### Temperature sampling

In [None]:
# ===================== vLLM generation parameters =====================
# Stop condition (vLLM natively supports stop strings)
stop_tokens = ["\n\n"]  # plain strings rather than token IDs

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    top_k=-1,
    max_tokens=2048,
    stop=stop_tokens,            # stop strings passed directly
    n=5,                         # return 5 completions (like num_return_sequences)
    include_stop_str_in_output=True,
    seed=43,
)
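What `temperature` and `top_p` do to the next-token distribution can be illustrated on a toy example. This is a pure-Python sketch with made-up logits, not vLLM's sampler; `softmax_with_temperature` and `top_p_filter` are hypothetical names.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    norm = sum(filtered)
    return [p / norm for p in filtered]


toy_logits = [4.0, 3.0, 1.0, -2.0]  # made-up scores for four tokens

sharp = softmax_with_temperature(toy_logits, 0.5)  # lower temperature: peakier
flat = softmax_with_temperature(toy_logits, 0.8)   # the value used in this cell
nucleus = top_p_filter(flat, 0.95)                 # unlikely tail removed

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print([round(p, 3) for p in nucleus])
```

With `n=5`, five completions are drawn independently from distributions shaped this way, which is why they differ from one another.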

In [None]:
# ===================== Run generation =====================
outputs = llm.generate([text], sampling_params)

# ===================== Process results =====================
# for i, output in enumerate(outputs[0].outputs):  # completions for the first prompt
#     print(f"\\section{{step {i + 1}}}\n")
#     print(output.text)

In [None]:
for item in outputs[0].outputs:
    print(repr(item.text))

### Beam search decoding

In [None]:
# ===================== vLLM generation parameters =====================
# Stop condition (vLLM natively supports stop strings)
stop_tokens = ["\n\n"]  # plain strings rather than token IDs

sampling_params = SamplingParams(
    temperature=0,
    top_p=1,
    top_k=-1,
    max_tokens=2048,
    stop=stop_tokens,            # stop strings passed directly
    n=5,                         # return 5 completions (like num_return_sequences)
    best_of=30,
    use_beam_search=True,
    include_stop_str_in_output=True,
    repetition_penalty=1.2,
)
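How the beam width (`best_of`) and the number of returned sequences (`n`) interact can be shown with a toy implementation. The sketch below is plain Python over a made-up next-token table, assumed purely for illustration. It has no length normalization and is not how vLLM implements beam search internally.

```python
import math

# Made-up next-token probabilities keyed by the previous token (an assumption).
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.5, "</s>": 0.4},
    "b": {"a": 0.3, "b": 0.1, "</s>": 0.6},
}


def beam_search(beam_width, n, max_steps=4):
    """Keep the `beam_width` best candidates per step; return the top `n` finished."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)

        beams = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == "</s>":
                finished.append((score, tokens))  # sequence is complete
            else:
                beams.append((score, tokens))     # keep expanding
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:n]


results = beam_search(beam_width=3, n=2)
for score, tokens in results:
    print(round(math.exp(score), 3), " ".join(tokens))
```

A wider beam explores more candidates per step but never returns more than `n` of them, which mirrors `best_of=30` with `n=5` in the cell above.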

In [None]:
# ===================== Run generation =====================
outputs = llm.generate([text], sampling_params)

# ===================== Process results =====================
for i, output in enumerate(outputs[0].outputs):  # completions for the first prompt
    print(f"\\section{{step {i + 1}}}\n")
    print(output.text)

In [None]:
for item in outputs[0].outputs:
    print(repr(item.text))

### Diversity penalty

In [None]:
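# ===================== Tuned vLLM generation parameters =====================
sampling_params = SamplingParams(
    # Core diversity controls
    temperature=0.7,             # moderate randomness (0.5-1.0)
    top_p=0.95,                 # sample from the top 95% of probability mass
    # top_k=50,                   # consider the 50 most likely tokens per step

    # Repetition controls (these combine)
    repetition_penalty=1.5,     # stronger repetition penalty
    presence_penalty=0.6,       # penalizes tokens already generated, favoring new ones
    frequency_penalty=0.4,      # penalizes tokens in proportion to how often they appeared

    # Multi-candidate generation
    use_beam_search=False,      # beam search off
    n=5,                        # return 5 different completions
    best_of=15,                 # sample 15 candidates and keep the best 5

    # Auxiliary controls
    max_tokens=2048,
    # stop=["\n\n"],              # stop string for structural control (disabled here)
    seed=42,                    # fixed seed for reproducibility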
)
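The three penalties act on the logits before sampling. The sketch below follows the commonly documented definitions (OpenAI-style presence and frequency penalties, plus a multiplicative repetition penalty) on made-up logits. It is a simplified illustration that assumes an empty prompt, so only generated tokens are penalized; `apply_penalties` is a hypothetical name.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, repetition_penalty=1.0,
                    presence_penalty=0.0, frequency_penalty=0.0):
    """Return a penalized copy of `logits` given the tokens generated so far."""
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        # Repetition penalty: shrink positive logits, push negative ones further down.
        if out[token_id] > 0:
            out[token_id] /= repetition_penalty
        else:
            out[token_id] *= repetition_penalty
        # Frequency penalty grows with the count; presence penalty is a flat charge.
        out[token_id] -= frequency_penalty * count
        out[token_id] -= presence_penalty
    return out


toy_logits = [3.0, 1.5, -1.0, 0.5]  # made-up scores for token IDs 0..3
history = [0, 0, 2]                 # token 0 seen twice, token 2 once

penalized = apply_penalties(toy_logits, history, repetition_penalty=1.5,
                            presence_penalty=0.6, frequency_penalty=0.4)
print(penalized)
```

Tokens that never appeared keep their original logits, so the combined effect is to shift probability toward unseen tokens. That is the "diversity" this section approximates without `num_beam_groups`.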

In [None]:
# ===================== Run generation =====================
outputs = llm.generate([text], sampling_params)

# ===================== Process results =====================
for i, output in enumerate(outputs[0].outputs):  # completions for the first prompt
    print(f"\\section{{step {i + 1}}}\n")
    print(output.text)

In [None]:
for item in outputs[0].outputs:
    print(repr(item.text))