Skip to content

fix: remove redundent prefill when n > 1 and no beam search#809

Open
zhangjianning-zjn wants to merge 1 commit intoalibaba:mainfrom
zhangjianning-zjn:fix/multi_seq_redundent_prefill
Open

fix: remove redundent prefill when n > 1 and no beam search#809
zhangjianning-zjn wants to merge 1 commit intoalibaba:mainfrom
zhangjianning-zjn:fix/multi_seq_redundent_prefill

Conversation

@zhangjianning-zjn
Copy link
Copy Markdown
Collaborator

No description provided.

@LLLLKKKK
Copy link
Copy Markdown
Collaborator

Code Review: PR #809 — fix: remove redundant prefill when n > 1 and no beam search

Reviewer: Automated CR (2026-03-20)


Summary

This PR optimizes multi-sequence generation (n > 1, no beam search) by running prefill with batch_size=1 instead of n, then expanding at the decode step. It also fixes several latent bugs in the tiling/logits-processor path exposed by this change.

Files: GenerateStream.cc, LogitsProcessorFactory.cc/.h, NormalBatchStreamProcessor.cc (+18/-12)


Findings

1. isDoneWithoutLock(i) -> isDoneWithoutLock(cur_batch) [Bug Fix, HIGH]
When tiling is active, i ranges up to nextBatchSize() but sub-generate-status only has currentBatchSize() entries. Using cur_batch (clamped) prevents potential out-of-bounds access. Good catch.

2. Logits tiling offset: 0 -> total_decode_batch_size_in [Bug Fix, HIGH]
The old code started writing context-stream tiled logits at offset 0, which would overwrite decode-stream logits already copied there. Starting at total_decode_batch_size_in correctly places context logits after decode logits in the buffer.

3. batchSize(0) returning 1 for non-beam n>1 [Optimization, HIGH]
Core change. Mirrors the existing beam-search pattern where numBeams(0) = 1. The perf_test_ guard is appropriate.

Question: resizeSubGenerateStatus(init_batch_size) in the constructor will now be called with 1 instead of n. Is the resize from 1 -> n at the context-to-decode transition handled correctly in all cases?

4. setLogitsProcessorInputs batch size fix [Bug Fix, MEDIUM]
Using needTilingForSampling() ? nextBatchSize() : currentBatchSize() ensures logits processor state ranges match the actual tiled buffer layout. Previously currentBatchSize() was always used, which would be wrong during tiling.

5. updateLogitProcessorMultiSeqStatus guard removal [Correctness, MEDIUM]
Removing !hasNumBeams() allows updateMultiSeqStatus to run for n>1 non-beam cases. Necessary for ThinkModeLogitsProcessor and MultiSeqLogitsProcessor to track sequence reordering.

6. max_batch_size parameter removal from LogitsProcessorFactory [API Change]
ThinkModeLogitsProcessor was previously initialized with maxBatchSize(), now uses logits_processor_init_batch_size (= next_batch_size = n for non-beam n>1). Values are equivalent, so this is safe.


Suggestions

  • Testing: No tests are included. Given the bug fixes (isDoneWithoutLock OOB, logits offset corruption), unit tests covering n>1 non-beam-search generation would be valuable to prevent regressions.
  • PR description: Currently empty. A brief note about the optimization and the bug fixes would help future readers.
  • The TODO comments about moving tiling after logits processors are good breadcrumbs for follow-up work.

Verdict

Approve with minor comments. The optimization is sound, the bug fixes are correct, and the beam-search path is properly guarded. Main ask: verify resizeSubGenerateStatus transition and consider adding test coverage.

@LLLLKKKK
Copy link
Copy Markdown
Collaborator

🤖 AI Code Review — PR #809 fix: remove redundent prefill when n > 1 and no beam search

Files changed: 4 (+18 / -12)


Summary

Optimizes the n > 1 generation path (non-beam-search) by running prefill with batch_size=1 and expanding to n only at decode, mirroring the existing beam-search pattern. Also fixes several latent bugs exposed by this change.


Key Findings

isDoneWithoutLock(cur_batch) — Bug Fix [HIGH]

When tiling is active, the loop variable i ranges 0..n-1, but isDoneWithoutLock(i) indexed into sub-generate-status which only has currentBatchSize() entries (= 1 during context-to-decode transition). Using cur_batch (clamped) prevents out-of-bounds access. Good fix.

Logits Tiling Offset — Bug Fix [HIGH]

The old code started writing tiled context-stream logits at offset 0, overwriting decode-stream logits. Starting at total_decode_batch_size_in aligns with the buffer layout where decode streams occupy indices 0..total_decode_batch_size_in-1. Correct fix.

Logits Processor Init Batch Size [MEDIUM]

Logits processors (e.g. ThinkModeLogitsProcessor) maintain per-sequence state and must be initialized with the decode-phase batch size (n), not the prefill batch size (1). The new logits_processor_init_batch_size logic handles this correctly.

updateLogitProcessorMultiSeqStatus Guard Removal [MEDIUM]

Removing the !hasNumBeams() early-return allows updateMultiSeqStatus to run for n>1 non-beam-search, which is necessary for ThinkModeLogitsProcessor and MultiSeqLogitsProcessor.


Suggestions

  1. Verify resize transition: resizeSubGenerateStatus(1) at init followed by a later resize to n — please confirm this transition is handled correctly and no state is lost.
  2. Add test coverage: Given the bug fixes (isDoneWithoutLock OOB, logits offset corruption), unit tests covering n > 1 non-beam-search generation would be valuable to prevent regressions.
  3. PR description: Consider adding a brief explanation of the optimization and included bug fixes to the PR body for future reference.

Verdict: Approve with minor comments. The optimization is sound and the bug fixes are correct.

@zhangjianning-zjn
Copy link
Copy Markdown
Collaborator Author

  • Testing: No tests are included. Given the bug fixes (isDoneWithoutLock OOB, logits offset corruption), unit tests covering n>1 non-beam-search generation would be valuable to prevent regressions.

Testing has been covered by existing smoke tests already. However, it is worth noting that smoke tests only check correctness but do not detect performance regression, which the pr aims to address.

@LLLLKKKK LLLLKKKK self-requested a review as a code owner April 6, 2026 17:15
@LLLLKKKK
Copy link
Copy Markdown
Collaborator

LLLLKKKK commented Apr 9, 2026

🤖 AI Code Review — PR #809
Head SHA: c1e3b7ab596f | Verdict: P2

Summary

Fixes redundant prefill when n > 1 without beam search. Prefill now runs with batch_size=1 and expands on step 2. Also fixes a real bug: isDoneWithoutLock(i) -> isDoneWithoutLock(cur_batch) for correct finished-mask indexing.

Findings

P2: logits_processor_init_batch_size coupling is fragile
The new variable uses batchSize(1) when init != next && !hasNumBeams(). Three scattered TODOs say "tiling for logits should be moved after logits processors". This coupling between step-0/step-1 batch sizes and processor allocation is fragile -- if tiling logic changes without updating this, processors get wrong batch size. Track the TODO as follow-up.

P2: Bug fix buried in feature PR
The isDoneWithoutLock(i) -> isDoneWithoutLock(cur_batch) fix in gatherSamplerInput is a genuine correctness bug fix for multi-sequence streams. Should ideally be called out in commit message or split into a separate commit.

Clean removal of unused max_batch_size parameter from LogitsProcessorFactory.

@LLLLKKKK
Copy link
Copy Markdown
Collaborator

AI Code Review — PR #809

PR 概述

Title: fix: remove redundent prefill when n > 1 and no beam search
Author: zhangjianning-zjn
规模: 4 files, +18/-12

num_return_sequences > 1 且不使用 beam search 时,prefill 阶段所有序列共享同一个 prompt,因此只需计算一次 prefill(batch_size=1),在第二步再 tile 到 n 个序列。此 PR 通过修改 batchSize(0) 的返回值实现这一优化,并修复了 tiling 路径中 logits processor 索引、logits 拷贝偏移、isDoneWithoutLock 参数等相关 bug。

Review 意见

  1. 缺少测试覆盖 [P1]

    此 PR 修改了 batchSize() 的核心语义(prefill 阶段从 n 变为 1),影响 logits processor 初始化、sampler input 拼装、logits tiling 等多个关键路径。但未添加任何单元测试或 smoke 测试来验证 n > 1 场景下的正确性。

    建议:至少添加一个 num_return_sequences > 1 的端到端 smoke 测试,覆盖 n>1 无 beam search 的正常生成、n>1 启用 think mode、n==1 的回归验证。

  2. PR 基于过时代码,存在合并冲突 [P1]

    GitHub 显示 mergeable_state: "dirty"。PR 中修改的 NormalBatchStreamProcessor.cc(含 gatherSamplerInputsetLogitsProcessorInputs)在当前 main 分支中已被重构到 NormalSamplerInputGatherer.ccisDoneWithoutLock 已被重命名为 isSubGenerateDoneWithoutLockLogitsProcessorFactory::createLogitsProcessors 签名也已变更。PR 需要 rebase 到最新 main 并重新适配。

  3. PR description 为空 [P2]

    对于涉及 batch size 语义变更的 PR,应说明优化动机、影响范围和设计决策。

  4. 三处 TODO 注释无跟踪 [P2]

    新增三处 // TODO(zhangjianning.zjn): tiling for logits should be moved after logits processors。建议创建 issue 跟踪此技术债务。

  5. batchSize(0)perf_test_ 的特殊处理缺少注释 [P2]

    perf_test_ 排除条件的设计意图不够直观,建议补充注释。

整体评价

核心优化思路正确且有价值。isDoneWithoutLock(cur_batch) 的修复也是一个实际 bug fix。但 PR 基于过时代码存在严重合并冲突,且缺少测试覆盖。建议 rebase 后补充测试再合入。

存在阻塞/重要问题,不建议合入。

@LLLLKKKK
Copy link
Copy Markdown
Collaborator

AI Code Review - PR #809

Status: LGTM

Summary: P0/0 · P1/0 · P2/2 · P3/1

lgtm ready to ci

Non-blocking Suggestions

P2

  • logits_processor_init_batch_size 计算依赖 batchSize(1) 但 perf_test_ 分支未覆盖 @ rtp_llm/cpp/engine_base/stream/GenerateStream.cc:63
    • 建议:在 TODO 注释旁补充一句说明 perf_test 场景下 init_batch_size == next_batch_size 因此不受影响,方便后续维护者理解。
  • logits tiling 循环仅对 contextStreams 做 per-row copy,decode 部分用 bulk copy 但未校验 total_decode_batch_size_in 一致性 @ rtp_llm/cpp/normal_engine/NormalBatchStreamProcessor.cc:344
    • 建议:考虑在 tiling 分支入口加 RTP_LLM_CHECK(total_decode_batch_size_in <= model_output.logits->shape()[0]) 防御性断言。

P3

  • 三处相同 TODO 注释重复 @ rtp_llm/cpp/normal_engine/NormalBatchStreamProcessor.cc:343
    • 建议:保留一处 TODO 并在其他位置用 // see TODO in GenerateStream constructor 引用,或统一到一个 tracking issue。
Review Checklist: 7 pass / 0 fail

General Principles Checklist

Passed

  • [PASS] 6.1 Software Engineering: 接口变更是否保持向后兼容(createLogitsProcessors 移除 max_batch_size 参数)
  • [PASS] 6.1 Architecture: batchSize 语义变更是否影响所有调用方
  • [PASS] 6.1 Tests: n>1 场景是否有对应测试覆盖
  • [PASS] 6.1 Quality: 逻辑变更与格式变更是否分离

RTP-LLM Checklist

Passed

  • [PASS] B 正确性与逻辑: isDoneWithoutLock 参数是否与实际 batch index 对应
  • [PASS] D 性能: prefill 阶段是否避免了不必要的多 batch 计算
  • [PASS] B 正确性与逻辑: logits tiling offset 是否正确跳过 decode 区域

Strengths

  • 修复了 n>1 场景下 prefill 阶段冗余计算的核心问题:batchSize(0) 返回 1 避免了不必要的多 batch 分配
  • isDoneWithoutLock(cur_batch) 修复了 finished_mask 越界访问的正确性 bug
  • logits tiling offset 从 total_decode_batch_size_in 开始,修复了 decode logits 被覆盖的问题
  • updateLogitProcessorMultiSeqStatus 移除 hasNumBeams 守卫,使 ThinkMode/Tree processor 在 n>1 场景正确同步状态
  • 接口简化:移除 max_batch_size 参数,语义更清晰

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants