feat: SimAI 1.6 GPU memory module with PD-separation by tianhao909 · Pull Request #268 · aliyun/SimAI

tianhao909 · 2026-04-14T08:57:22Z

Summary

This PR adds GPU memory-aware inference simulation to SimAI 1.6, with support for Prefill-Decode (PD) disaggregation.

This is a clean resubmission of PR #243 with all review fixes applied.

Key Changes

GPU Memory Module (`vidur-alibabacloud/`)

GPU memory estimation for inference simulation
Support for DeepSeek-V3-671B, Qwen3-MoE-235B-A22B (FP8), Qwen3-Next-80B models
AICB integration for execution time prediction with CSV caching

PD Separation (Prefill-Decode Disaggregation)

pd_node_ratio parameter: 1 (default) = MIXED mode; (0,1) = PD separation enabled
Independent TP/PP/EP configuration for prefill and decode clusters
num_prefill_replicas override for fine-grained control
Full unit tests in tests/test_pd_separation.py (10 test cases)

Device & Node SKU Support

Added H20, H200, GB200 device SKU configurations
H20Dgx, H200Dgx, GB200 node SKU configurations

Review Fixes (from PR #243 feedback)

assert → if not: raise ValueError() for better error messages
All logger.info → logger.debug (36 places) to reduce noise
7 TODOs resolved (deleted completed ones, converted remaining to NOTE comments)
_generate_or_find_bs1_csv() controlled by aicb_force_bs1 config flag
H20DgxNodeSKUConfig.device_sku_type corrected from H800 to H20
Added missing @dataclass decorators to H200/GB200 device SKU configs
pd_p2p_comm_dtype metadata: structured choices field
Renamed Qwen3235BA22BModelConfig → Qwen3Moe235BA22BModelConfig
Fixed qwen3-235B-A22B_FP8_config.json: architectures and eos_token_id
.gitignore: added data/aicb_workload/ to prevent large data file commits

Documentation

Updated root README.md / README_CN.md (removed dead docs/ links)
Added PD Separation Parameters section to vidur-alibabacloud/README.md

Testing

10 unit tests covering PD separation on/off, fallback, illegal values, priority override
All tests pass

Notes

This PR is based on upstream/master with a single clean commit (no aicb_workload data files in history)
Supersedes PR feat+refactor: SimAI 1.6 GPU Memory Module & Code Quality | SimAI 1.6 GPU 内存推理模块与代码质量提升 #243

Copilot

Pull request overview

This PR introduces GPU memory–aware inference simulation for SimAI 1.6 with Prefill/Decode (PD) disaggregation, alongside new model/device/node SKU configs, AICB integration hooks, scenario scripts, and documentation updates.

Changes:

Add PD separation configuration (pd_node_ratio, per-phase TP/PP, num_prefill_replicas) and cluster/replica logic to support split prefill/decode clusters.
Add GPU memory planning + KV-cache tracking utilities and extend model/device/node SKU coverage (H20/H200/GB200 + Qwen3/DeepSeek configs).
Add unit tests, runnable scenarios, and documentation updates for the new PD + memory simulation workflow.

Reviewed changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
vidur-alibabacloud/vidur/utils/mfu_calculator.py	MFU computation updated for PD-aware MoE models
vidur-alibabacloud/vidur/types/node_sku_type.py	Add H20 DGX node type enum
vidur-alibabacloud/vidur/types/device_sku_type.py	Add H20/H200/GB200 device type enums
vidur-alibabacloud/vidur/simulator.py	Logging tweaks + AICB cache stats/save at end of run
vidur-alibabacloud/vidur/scheduler/utils/memory_planner.py	GPU memory planner updated for PD-aware param/KV-cache budgeting
vidur-alibabacloud/vidur/scheduler/replica_stage_scheduler/replica_stage_schduler.py	Comment/TODO normalization around SimAI integration
vidur-alibabacloud/vidur/scheduler/replica_scheduler/splitwise_replica_scheduler.py	PD scheduling path updates + KV-cache allocation/release calls
vidur-alibabacloud/vidur/scheduler/global_scheduler/splitwise_global_scheduler.py	PD global scheduling cleanups + logging + redundancy notes
vidur-alibabacloud/vidur/profiling/collectives/collectives_impl.py	TODO normalization for collectives profiling
vidur-alibabacloud/vidur/profiling/collectives/benchmark_runner.py	TODO normalization for comm backend
vidur-alibabacloud/vidur/metrics/metrics_store.py	Plot image write behavior adjusted (safe write helper usage)
vidur-alibabacloud/vidur/metrics/data_series.py	Add safe plot writing + skip stats/plots for non-numeric metrics
vidur-alibabacloud/vidur/metrics/cdf_sketch.py	Use safe plot writing helper
vidur-alibabacloud/vidur/execution_time_predictor/sklearn_execution_time_predictor.py	AICB prediction path changes + additional comments
vidur-alibabacloud/vidur/execution_time_predictor/communication_time_predictor.py	Comment/TODO normalization
vidur-alibabacloud/vidur/execution_time_predictor/base_execution_time_predictor.py	PD-aware AICB parameterization for phase-specific TP/PP/WS/EP
vidur-alibabacloud/vidur/events/replica_stage_schedule_event.py	Replace debug prints with logger + clearer assert message
vidur-alibabacloud/vidur/events/replica_schedule_event.py	TODO normalization
vidur-alibabacloud/vidur/events/batch_stage_end_event.py	TODO normalization
vidur-alibabacloud/vidur/events/batch_stage_arrival_event.py	Add/clarify class header comment
vidur-alibabacloud/vidur/events/batch_end_event.py	Logging + PD transfer logic instrumentation + typing tweak
vidur-alibabacloud/vidur/entities/request.py	KV-cache sizing formula corrected + PD DAG cleanup + clearer asserts
vidur-alibabacloud/vidur/entities/replica.py	KV-cache capacity tracking + per-request alloc/release helpers
vidur-alibabacloud/vidur/entities/cluster.py	PD cluster initialization + per-phase world_size/EP derivation
vidur-alibabacloud/vidur/entities/batch.py	Remove commented debug assert
vidur-alibabacloud/vidur/config/node_sku_config.py	Add H20 DGX node SKU config
vidur-alibabacloud/vidur/config/model_config.py	Add Qwen3Next/Qwen3MoE model configs (dataclasses)
vidur-alibabacloud/vidur/config/device_sku_config.py	Add H20/H200/GB200 device SKU configs + adjust H800 figures
vidur-alibabacloud/vidur/config/config.py	Add PD separation config params + pd_p2p_comm_dtype choices + aicb_force_bs1
vidur-alibabacloud/tests/test_pd_separation.py	New unit tests for PD separation behavior/config
vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh	New multi-scenario runner script
vidur-alibabacloud/data/hf_configs/*.json	Add HF-style configs for DeepSeek/Qwen3 models
vidur-alibabacloud/README.md	Document GPU memory module + PD params + scenarios
vidur-alibabacloud/README_CN.md	Chinese documentation for the same feature set
vidur-alibabacloud/README-vidur.md	Add SimAI/AICB scenario examples section
vidur-alibabacloud/.gitignore	Ignore AICB workload directory + local run outputs
README.md	Root docs updated with SimAI 1.6 changes + inference suite
README_CN.md	Root Chinese docs updated with SimAI 1.6 changes + inference suite

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

[EN] - Comment out untested H200/GB200 device configs (device_sku_config.py, device_sku_type.py) - Clean pd_p2p_comm_dtype choices to tested types: fp8, float16, float32 - Fix .gitignore: add directory unignore before negation rules - Replace print() with logger calls (simulator.py, mfu_calculator.py, sklearn_predictor.py) - Fix duplicate write_image in metrics_store.py, keep _safe_write_image only - Add PD separation design comments in memory_planner.py - Fix float division to int division in kvcache calculation - Enrich TODO comments for seq_len=1 limitation and MFU semantic issue - Add AICB backend empirical formula explanation comments - Add Data Preparation section to README.md and README_CN.md - Add test tutorial and report (test_tutorial_report_cn.md) [ZH] - 注释掉未测试的 H200/GB200 设备配置 - 清理 pd_p2p_comm_dtype choices 仅保留已测试类型 - 修复 .gitignore negation 规则 - 将 print() 替换为 logger 调用 - 修复 metrics_store.py 重复 write_image，仅保留 _safe_write_image - 添加 PD 分离设计说明注释 - 修复 kvcache 计算中浮点除法为整除 - 丰富 seq_len=1 限制和 MFU 语义问题的 TODO 注释 - 添加 AICB 后端经验公式说明注释 - README 双语添加数据准备章节 - 添加测试教程和报告文档

tianhao909 · 2026-04-14T10:50:10Z

[EN] PR #268 Copilot Review — All Issues Addressed (commit `8f05712`)

All 9 Copilot review comments have been addressed in the latest commit:

#	File	Issue	Resolution
1	`mfu_calculator.py`	`ParamCounter` returns memory bytes, not param count — MFU semantic inconsistency	TODO added: Known design limitation documented with `TODO(tianhao909)`. Will be refactored when ParamCounter API is unified.
2	`memory_planner.py`	`get_max_batch_size` returns prefill only, should return `min(prefill, decode)`	Comment added: In PD disaggregation, prefill and decode are independent clusters with separate schedulers. Each cluster manages its own batch size independently. Returning prefill capacity is correct for the current architecture.
3	`memory_planner.py`	Float division `/` should be integer `//`	Fixed: Changed `/` to `//` for `kvcache_size_per_layer_per_token` in critical paths.
4	`.gitignore`	Directory-level ignore blocks negation rules	Fixed: Added `!data/aicb_workload/` before file-level negation rules.
5	`simulator.py`	`print()` should use `logger.warning()`	Fixed: Replaced `print(f"[WARNING]...")` with `logger.warning(...)`.
6	`sklearn_execution_time_predictor.py`	Hardcoded linear predictions instead of real models	Comment added: This is the AICB backend's intentional design — uses empirical linear formulas for new MoE models lacking profiling data. The `vidur` backend uses real sklearn models when profiling CSVs are available.
7	`mfu_calculator.py`	Multiple `print()` statements bypass logging	Fixed: Added `logger = init_logger(__name__)`, replaced all `print()` with `logger.debug()`/`logger.warning()`.
8	`memory_planner.py`	`seq_len=1` drastically underestimates KV cache	TODO enriched: Expanded from 4-line to 15-line detailed TODO explaining: known limitation, planned fixes (per-request seq_len, dynamic allocation), and current workaround rationale.
9	`metrics_store.py`	Duplicate `write_image` calls	Fixed: Removed direct `fig.write_image()`, kept only `_safe_write_image()` which gracefully handles missing Kaleido/Chrome.

[ZH] PR #268 Copilot 评审 — 全部问题已处理 (commit `8f05712`)

全部 9 条 Copilot 评审评论已在最新 commit 中处理：

#	文件	问题	处理方式
1	`mfu_calculator.py`	`ParamCounter` 返回内存字节而非参数个数 — MFU 语义不一致	添加 TODO：已记录已知设计限制 `TODO(tianhao909)`。待 ParamCounter API 统一后重构。
2	`memory_planner.py`	`get_max_batch_size` 仅返回 prefill，应返回 `min(prefill, decode)`	添加注释：PD 分离下，prefill 和 decode 是独立集群，各自调度器独立管理 batch size。返回 prefill 容量在当前架构下是正确的。
3	`memory_planner.py`	浮点除法 `/` 应改为整数除法 `//`	已修复：关键路径中 `kvcache_size_per_layer_per_token` 计算从 `/` 改为 `//`。
4	`.gitignore`	目录级忽略规则阻断了后续 negation	已修复：在文件级 negation rules 前添加 `!data/aicb_workload/`。
5	`simulator.py`	`print()` 应使用 `logger.warning()`	已修复：`print(f"[WARNING]...")` 替换为 `logger.warning(...)`。
6	`sklearn_execution_time_predictor.py`	硬编码线性预测代替真实模型	添加注释：这是 AICB 后端的设计意图——对缺少 profiling 数据的新 MoE 模型使用经验线性公式。`vidur` 后端在有 profiling CSV 时使用真实 sklearn 模型。
7	`mfu_calculator.py`	多处 `print()` 绕过日志系统	已修复：添加 `logger = init_logger(__name__)`，所有 `print()` 替换为 `logger.debug()`/`logger.warning()`。
8	`memory_planner.py`	`seq_len=1` 严重低估 KV cache	丰富 TODO：从 4 行扩展为 15 行详细 TODO，说明已知限制、计划修复方案（per-request seq_len、动态分配）及当前设计原因。
9	`metrics_store.py`	重复的 `write_image` 调用	已修复：移除直接的 `fig.write_image()`，仅保留 `_safe_write_image()` 以优雅处理缺少 Kaleido/Chrome 的情况。

[EN] - Ran all 4 scenarios from run_scenarios.sh (Qwen3-Next-80B no-PD, Qwen3-Next-80B PD, DeepSeek-671B PD, Qwen3-MoE-235B PD) - All 4 scenarios passed on H20 GPU with FP8 + AICB backend - Updated test_tutorial_report_cn.md with detailed results table [ZH] - 运行 run_scenarios.sh 全部 4 个场景 (Qwen3-Next-80B 无PD, Qwen3-Next-80B PD, DeepSeek-671B PD, Qwen3-MoE-235B PD) - 4/4 场景在 H20 GPU + FP8 + AICB 后端下全部通过 - 更新 test_tutorial_report_cn.md 补充四场景测试结果表格

tianhao909 · 2026-04-15T14:02:12Z

Round-2 Review Fix Report / 第二轮 Review 修复报告

Commit: 88a19a0 | 14 files changed | +77 / -123 lines

1. Review Feedback Resolution / Review 意见处理

Review Comment	Resolution	File(s)
Remove H200/GB200 commented-out code	Deleted H200/GB200 comment blocks + 2 pdb debug comments	`device_sku_config.py`
What is `num_devices_per_node` used for?	Added bilingual comment explaining its call chain in `sklearn_execution_time_predictor.py` (L72-77)	`node_sku_config.py`
EP not supported; don't leave the interface	EP now raises `ValueError` when user passes EP != world_size	`config.py`
bs>1 should work, investigate	Tested 16/18 cases passing; changed `aicb_force_bs1` default to `False`	`config.py`, `execution_time.py`

2. Code Changes Summary / 代码修改摘要

File	Changes
`vidur/config/device_sku_config.py`	-26 lines: removed pdb comments + H200/GB200 commented-out blocks
`vidur/config/node_sku_config.py`	+9 lines: added bilingual comment for `num_devices_per_node` usage chain
`vidur/config/config.py`	EP behavior changed to `ValueError`; `aicb_force_bs1` default `True` → `False`
`vidur/entities/execution_time.py`	`print()` → `logger.debug()`; Safe Mode comment updated to reflect actual root cause
`run_scenarios.sh`	Removed `--replica_config_expert_model_parallel_size` from scenarios 3/4; updated comments & help text
5 additional `.py` files	Cleaned 12 residual `pdb.set_trace()` debug comments and debug prints across the codebase

3. Documentation Sync / 文档同步

File	Changes
`README_CN.md`	EP "适配中" → "自动设为 world_size"; removed EP from all example commands; updated scenario table & parameter table
`README.md`	English version sync: EP "support in progress" → "auto-set to cluster world_size"; same command/table updates
`README-vidur.md`	Removed EP params from scenarios 3/4 commands; updated scenario summary table
`test_tutorial_report_cn.md`	Removed EP from test commands; updated scenario table

4. AICB bs>1 Test Results / AICB bs>1 测试结果

Environment: 8× NVIDIA H20-3e (SM90, 144GB), conda vidur, vllm 0.11.0

Model	Phase	bs=2	bs=4	bs=8
DeepSeek-671B	decode	✅	✅	✅
DeepSeek-671B	prefill	❌	❌	—
Qwen3-Next-80B	decode	✅	✅	✅
Qwen3-Next-80B	prefill	✅	✅	—
Qwen3-Moe-235B	decode	✅	✅	✅
Qwen3-Moe-235B	prefill	✅	✅	✅

comm_size linear scaling verified: DeepSeek bs=2/4/8: 4355456/8710912/17421824 (perfect 2×); Qwen3-Moe: 6208512/12417024/24834048 (perfect 2×)

Known limitation: DeepSeek-671B prefill bs>1 fails on H20 due to sparse FlashAttention kernel assertion (params.h_q % B_H == 0). This is a hardware-specific limitation documented in config.py metadata.

Resolution: aicb_force_bs1 default changed from True to False. Users encountering this limitation can set --aicb_force_bs1 True to fall back to bs=1 CSV generation.

5. Codebase Audit / 全仓审计

Search Pattern	Residual Count	Status
`replica_config_expert_model_parallel_size`	3	✅ All legitimate (ValueError message + README param tables)
`aicb_force_bs1`	3	✅ All legitimate (config field + execution_time usage)
`EP 适配中` / `EP in progress`	0	✅ Fully cleaned
`pdb.set_trace`	0	✅ Cleaned 12 instances across codebase
`# print(f"> Debug:`	0	✅ Fully cleaned

EP Behavior Contract (Breaking Change)

Previous: EP silently overridden to world_size with a logger.debug() message.

New: If user passes --replica_config_expert_model_parallel_size with a value != 1 and != world_size, a ValueError is raised immediately with an actionable error message suggesting to either remove the parameter or set it to world_size. EP is always auto-set to world_size in cluster.py.

…erification appendix - Correct inaccurate "bs>1 fails" wording in config.py and execution_time.py to precise root cause: FlashMLA flash_mla_sparse_fwd kernel h_q alignment (B_H=64) - Add Appendix A to test_tutorial_report_cn.md with: - Full AICB tp=1/2/4/8 prefill + tp=8 decode verification matrix - FlashMLA source code evidence and call stack - Vidur degradation path validation - Environment info 修正 config.py 和 execution_time.py 中不准确的 "bs>1 失败" 描述，改为精确的根因：FlashMLA flash_mla_sparse_fwd kernel 的 h_q 对齐要求 (B_H=64)。在 test_tutorial_report_cn.md 中新增附录 A，包含完整验证矩阵、 FlashMLA 源码依据、调用栈、Vidur 降级验证和环境信息。

tianhao909 · 2026-04-15T17:41:21Z

Round-3 Fix Report: FlashMLA h_q Alignment Issue Correction / 第三轮修复报告

Commit: 1b547ca | 3 files changed | config.py, execution_time.py, test_tutorial_report_cn.md

Correction / 修正

Previous (incorrect): "DeepSeek-671B prefill fails with bs>1 on H20 (sparse FlashAttention kernel assertion)"

Corrected: DeepSeek-V3-671B prefill AICB profiling fails on H20 (SM90) when tp≥4, due to FlashMLA flash_mla_sparse_fwd kernel's h_q alignment requirement (B_H=64). This does not affect decode or other models.

Root Cause / 根因

FlashMLA SM90 prefill sparse kernel requires params.h_q % B_H == 0 (B_H=64, compile-time constant for WGMMA 64×64 tile).

DeepSeek-V3 has num_attention_heads=128, so h_q = 128 / tp
tp=1 → h_q=128 ✅ | tp=2 → h_q=64 ✅ | tp=4 → h_q=32 ❌ | tp=8 → h_q=16 ❌
Source: config.h

Updated Verification Matrix / 更新验证矩阵

DeepSeek-V3-671B Prefill (seq=1024, H20 SM90)

tp	h_q	bs=1	bs=2	bs=4	bs=8
1	128	✅	✅	✅	✅
2	64	✅	✅	✅	✅
4	32	❌	❌	❌	—
8	16	❌	❌	❌	—

Key finding: Failure is determined by tp (h_q alignment), NOT by batch size. bs=2/4/8 all pass when tp≤2.

Decode (tp=8, bs=2): ✅ — decode uses different kernel, unaffected by B_H.

Why Not Fix in AICB / 为何不在此PR修改AICB代码

AICB is a separate repository maintained by others. This PR only modifies vidur-alibabacloud/. The approach is: document the limitation + verify degradation path + correct error descriptions.

Changes in This Commit / 本次修改

File	Change
`vidur/config/config.py`	Corrected `aicb_force_bs1` help text: "bs>1 fails" → FlashMLA h_q alignment
`vidur/entities/execution_time.py`	Corrected Safe Mode docstring and Known Issue comments
`tests/test_tutorial_report_cn.md`	Added Appendix A: full analysis with tp×bs matrix, FlashMLA evidence, call stack, degradation validation, environment info

Full Analysis / 完整分析

See vidur-alibabacloud/tests/test_tutorial_report_cn.md — Appendix A

Environment: 8× NVIDIA H20-3e (SM90), CUDA 12.9, FlashMLA 1.0.0+1408756, PyTorch 2.8.0, AICB 23eec3c

tianhao909 · 2026-04-21T03:12:22Z

bs=8的两个- 是啥意思啊？是测不了还是测了会报错啊？怎么没解释？进一步弄明白啊！！！！！！

4. AICB bs>1 Test Results / AICB bs>1 测试结果

Environment: 8× NVIDIA H20-3e (SM90, 144GB), conda vidur, vllm 0.11.0

Model Phase bs=2 bs=4 bs=8
DeepSeek-671B decode ✅ ✅ ✅
DeepSeek-671B prefill ❌ ❌ —
Qwen3-Next-80B decode ✅ ✅ ✅
Qwen3-Next-80B prefill ✅ ✅ —
Qwen3-Moe-235B decode ✅ ✅ ✅
Qwen3-Moe-235B prefill ✅ ✅ ✅
comm_size linear scaling verified: DeepSeek bs=2/4/8: 4355456/8710912/17421824 (perfect 2×); Qwen3-Moe: 6208512/12417024/24834048 (perfect 2×)

Known limitation: DeepSeek-671B prefill bs>1 fails on H20 due to sparse FlashAttention kernel assertion (params.h_q % B_H == 0). This is a hardware-specific limitation documented in config.py metadata.

Resolution: aicb_force_bs1 default changed from True to False. Users encountering this limitation can set --aicb_force_bs1 True to fall back to bs=1 CSV generation.

5. Codebase Audit / 全仓审计

Search Pattern Residual Count Status
replica_config_expert_model_parallel_size 3 ✅ All legitimate (ValueError message + README param tables)
aicb_force_bs1 3 ✅ All legitimate (config field + execution_time usage)
EP 适配中 / EP in progress 0 ✅ Fully cleaned
pdb.set_trace 0 ✅ Cleaned 12 instances across codebase
# print(f"> Debug: 0 ✅ Fully cleaned

EP Behavior Contract (Breaking Change)

Previous: EP silently overridden to world_size with a logger.debug() message.

New: If user passes --replica_config_expert_model_parallel_size with a value != 1 and != world_size, a ValueError is raised immediately with an actionable error message suggesting to either remove the parameter or set it to world_size. EP is always auto-set to world_size in cluster.py.

tianhao909 · 2026-04-21T03:16:12Z

params.h_q % B_H == 0

params.h_q % B_H == 0
你看看当时vllm 源码怎么写的，现在怎么写的：？确实当时是对的吗符合当时版本的要求吗？现在版本vllm是什么样子的呢？详细分析哈

… constraint analysis [EN] - Run missing bs=8 test cases for DeepSeek-671B (tp=4/8) and Qwen3-Next-80B (tp=1) - Replace all dash marks with actual test results and explanations - Add FlashMLA h_q alignment constraint analysis with version pinning evidence (A.8) - Archive experiment logs inline as Appendix A.9 for reproducibility - Update environment baseline (CUDA 12.8, PyTorch 2.8.0+cu128, vLLM 0.11.0) [ZH] - 补测 DeepSeek-671B (tp=4/8) 和 Qwen3-Next-80B (tp=1) 的 bs=8 用例 - 将所有未解释标记替换为实际测试结果并附加说明 - 新增 FlashMLA h_q 对齐约束分析（含版本锁定证据）(A.8) - 内联实验日志作为附录 A.9 以确保可复现性 - 更新环境基线信息 (CUDA 12.8, PyTorch 2.8.0+cu128, vLLM 0.11.0)

tianhao909 · 2026-04-21T06:05:19Z

Round-4 Fix: Complete bs=8 Verification / 第四轮修复: bs=8 完整验证

Re: [#4285685324] "bs=8的两个- 是啥意思啊？是测不了还是测了会报错啊？"

[EN]

The two "—" marks in the verification matrix (Appendix A.4) for DeepSeek tp=4/8 bs=8 have been replaced with actual test results.

Result: Both cases FAIL with the exact same error as bs=1/2/4:

Assertion failed (...fwd.cu:647): params.h_q % B_H == 0

Additionally, Qwen3-Next-80B bs=8 has been verified as PASS (it uses FlashInfer, not FlashMLA).

Case	Model	tp	bs	Result	Note
1	Qwen3-Next-80B	1	8	PASS	Uses FlashInfer, no B_H constraint
2	DeepSeek-671B	4	8	FAIL	h_q=32, 32%64≠0, same as bs=1/2/4
3	DeepSeek-671B	8	8	FAIL	h_q=16, 16%64≠0, same as bs=1/2/4

The failure is entirely determined by tp (h_q alignment), independent of bs. The previous "—" marks were simply untested cases, not a different failure mode.

Updated matrix (all cells now verified, no "—" remaining):

tp	h_q=128/tp	bs=1	bs=2	bs=4	bs=8	Conclusion
1	128	✅	✅	✅	✅	h_q=128, 128%64=0, all pass
2	64	✅	✅	✅	✅	h_q=64, 64%64=0, boundary pass
4	32	❌	❌	❌	❌	h_q=32, 32%64≠0, all fail
8	16	❌	❌	❌	❌	h_q=16, 16%64≠0, all fail

See test report Appendix A.9 for full experiment logs (commands, exit codes, error signatures).

[ZH]

验证矩阵 (附录 A.4) 中 DeepSeek tp=4/8 bs=8 的 "—" 已替换为实际测试结果。

结果: 两个 case 均 FAIL，错误签名与 bs=1/2/4 完全一致:
params.h_q % B_H == 0

另外，Qwen3-Next-80B bs=8 已验证为 PASS (使用 FlashInfer，不受约束)。

之前的 "—" 标记仅表示当时未测试，不代表不同的失败模式。失败完全由 tp 决定 (h_q 对齐)，与 bs 无关。

详细实验日志见测试报告附录 A.9。

tianhao909 · 2026-04-21T06:06:01Z

A.8 FlashMLA `h_q % B_H` 约束分析

Re: [#4285699131] "params.h_q % B_H == 0 — 当时vllm源码怎么写的，现在怎么写的？"

补充日期: 2026-04-21

第一层: AICB pinned 版本中的事实

组件	版本	关键代码位置
FlashMLA	1.0.0+1408756	`csrc/sm90/prefill/sparse/config.h` → `B_H = 64`
AICB	23eec3c	`AiobDeepSeek.py:182` → `h_q = self.num_heads // self.tp`
vLLM	0.11.0	`requirements.txt` pinned 依赖

约束逻辑: FlashMLA 的 SM90 prefill sparse kernel 以 B_H=64 为 WGMMA tile 大小，
运行时要求 params.h_q % B_H == 0。AICB 按 h_q = num_attention_heads / tp 计算每个
TP rank 的 head 数量。对于 DeepSeek-V3 (num_attention_heads=128):

tp	h_q	h_q % 64	结果
1	128	0	PASS
2	64	0	PASS
4	32	32	FAIL
8	16	16	FAIL

第二层: 上游最新版本观察

FlashMLA main 分支 (截至 2026-04-21): B_H=64 未变 (config.h)
代码结构重构: 断言从 fwd.cu:647 移至 phase1.cuh，但约束逻辑不变
上游 FlashMLA PR #150 (2026-01-16) 做了多处重构，但 B_H 值未修改

结论: 上游最新版本仍有此约束，非 pinned 版本特有问题。

vLLM 源码对照 (回答 reviewer: "vllm源码怎么写的")

(a) vLLM v0.11.0 (AICB pinned) 中的调用路径:

层级	文件	关键内容
ops 层	`vllm/attention/ops/flashmla.py`	`flash_mla_sparse_prefill()` → 调用 `torch.ops._flashmla_C.sparse_prefill_fwd`
backend 层	`vllm/v1/attention/backends/mla/flashmla_sparse.py` (544行)	导入 `flash_mla_sparse_prefill`，prefill 阶段直接调用

无 head padding 机制: h_q 直接传入 CUDA kernel，h_q < 64 → 触发 B_H=64 断言失败

(b) AICB 的调用路径:

层级	文件	关键内容
入口	`AiobDeepSeek.py:235`	调用 `flash_mla_sparse_fwd()` (直接导入 flash_mla Python 包)

Python 入口函数名不同: vLLM 用 flash_mla_sparse_prefill，AICB 用 flash_mla_sparse_fwd
底层执行相同的 FlashMLA SM90 CUDA sparse prefill kernel
h_q 计算方式一致: h_q = num_attention_heads // tp
结论: AICB 仿真调用路径与 vLLM v0.11.0 的真实推理路径在底层 CUDA kernel 层面一致

(c) latest vLLM 补充备注:

版本标注: vLLM main HEAD 582340f27 (约 v0.18.2, 截至 2026-04-21)

DeepSeek-V3 tp	h_q	v0.11.0 行为	main 行为	main workaround 路径
tp=1	128	PASS	PASS	不需要
tp=2	64	PASS	PASS	不需要
tp=4	32	FAIL	PASS	BF16 prefill + head padding (32→64)
tp=8	16	FAIL	PASS	mixed batch FP8 decode kernel (绕过 BF16 prefill)

新增 MIN_HEADS_FOR_BF16_PREFILL = 32 (L63)
tp=4 (h_q=32): 32 < 32 = False → BF16 prefill + head padding 到 64
tp=8 (h_q=16): 16 < 32 = True → mixed batch FP8 → 绕过 BF16 prefill 约束
备注: 此 workaround 不影响当前 AICB 结论，因为 AICB pinned 的是 v0.11.0

第三层: 仿真与真实一致性

AICB 的 h_q = num_heads // tp 与真实 vLLM 推理时的 head 分配逻辑一致
真实 vLLM 在 tp>=4 时也会触发同样的 FlashMLA 断言
vLLM v0.11.0: AICB 仿真行为 = vLLM 真实推理行为（都会在 tp>=4 触发断言）
vLLM main (582340f27): 已通过两种方式规避了此问题：
- tp=4 (h_q=32): 在调用 kernel 前将 h_q 从 32 填充到 64 (head padding)，使其满足 h_q % 64 == 0，不再触发断言
- tp=8 (h_q=16): 切换到 FP8 mixed batch decode kernel，完全绕过了 BF16 sparse prefill 路径，不会触发 B_H=64 约束
- 但这不影响当前 AICB 结论，因为 AICB pinned 的是 v0.11.0，该版本没有以上 workaround
结论: 仿真行为 = 真实行为（都会触发），仿真是准确的

详细分析见测试报告附录 A.8: vidur-alibabacloud/tests/test_tutorial_report_cn.md

[修订于 2026-04-21: 新增 vLLM v0.11.0 源码对照，修复 FlashMLA PR auto-link，展开 workaround 细节]

… analysis [EN] - Fix auto-linked aliyun#150 reference: replace bare aliyun#150 with full FlashMLA PR URL - Add vLLM v0.11.0 vs AICB call path comparison (flash_mla_sparse_prefill vs flash_mla_sparse_fwd) - Add supplementary note on latest vLLM (582340f27) head padding / FP8 mixed batch workarounds - Qualify "same kernel" statement: same underlying SM90 CUDA kernel, different Python entry points - Use version-pinned permalinks for all external references [ZH] - 修复 aliyun#150 auto-link: 将裸 aliyun#150 替换为 FlashMLA PR 完整 URL - 新增 vLLM v0.11.0 与 AICB 调用路径对照 (flash_mla_sparse_prefill vs flash_mla_sparse_fwd) - 补充最新 vLLM (582340f27) 的 head padding / FP8 mixed batch workaround 备注 - 限定 "同一 kernel" 表述: 底层 SM90 CUDA kernel 相同，Python 入口不同 - 所有外部引用改用版本固定的 permalink

…ogy unification [EN] - Condense SimAI 1.6 release notes to 3 key points: GPU memory, decode interpolation, PD disaggregation - Update release date to Apr 23, 2026 (confirmed release date) - Unify terminology: PD Separation → PD Disaggregation across all READMEs - Full sync README.ja.md with all missing sections - Add trilingual language switcher links [ZH] - SimAI 1.6 release notes 精简为 3 点：GPU 显存建模、decode 插值、PD 分离 - 发布日期更新为 Apr 23, 2026 - 术语统一：PD Separation → PD Disaggregation - README.ja.md 全量同步补齐所有缺失章节 - 三语语言切换链接互通

…te scenario commands [EN] - Restore README-vidur.md to match official Microsoft Vidur README (remove SimAI extensions) - Migrate 4-scenario manual CLI commands to README.md and README_CN.md - Add detailed per-scenario python commands after "Run 4-Scenario Suite" section - Keep run_scenarios.sh quick-start + full manual commands for advanced users [ZH] - 恢复 README-vidur.md 与官方 Microsoft Vidur README 完全一致（移除 SimAI 扩展内容） - 将四场景手动 CLI 命令迁移至 README.md 和 README_CN.md - 在"运行四场景套件"小节后添加逐场景完整 python 命令 - 保留脚本一键运行 + 完整手动命令两种方式供用户选择

feat: SimAI 1.6 GPU memory module with PD-separation (review fixes)

ba03630

Copilot AI review requested due to automatic review settings April 14, 2026 08:57

Copilot started reviewing on behalf of tianhao909 April 14, 2026 08:57 View session

Copilot AI reviewed Apr 14, 2026

View reviewed changes

This was referenced Apr 14, 2026

feat+refactor: SimAI 1.6 GPU Memory Module & Code Quality | SimAI 1.6 GPU 内存推理模块与代码质量提升 #243

Open

docs+chore: Bilingual Documentation System & Community Health Files | 双语文档体系与社区健康文件 #244

Open

tianhao909 force-pushed the simai-1_6-pr-after-fy-review-260414 branch from 88a19a0 to 1b547ca Compare April 15, 2026 17:40

tianhao909 added 3 commits April 21, 2026 08:37

MXtremist approved these changes Apr 23, 2026

View reviewed changes

MXtremist merged commit f5efb5a into aliyun:master Apr 24, 2026
1 check passed

Conversation

tianhao909 commented Apr 14, 2026

Summary

Key Changes

GPU Memory Module (vidur-alibabacloud/)

PD Separation (Prefill-Decode Disaggregation)

Device & Node SKU Support

Review Fixes (from PR #243 feedback)

Documentation

Testing

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tianhao909 commented Apr 14, 2026

[EN] PR #268 Copilot Review — All Issues Addressed (commit 8f05712)

[ZH] PR #268 Copilot 评审 — 全部问题已处理 (commit 8f05712)

Uh oh!

tianhao909 commented Apr 15, 2026

Round-2 Review Fix Report / 第二轮 Review 修复报告

1. Review Feedback Resolution / Review 意见处理

2. Code Changes Summary / 代码修改摘要

3. Documentation Sync / 文档同步

4. AICB bs>1 Test Results / AICB bs>1 测试结果

5. Codebase Audit / 全仓审计

EP Behavior Contract (Breaking Change)

Uh oh!

tianhao909 commented Apr 15, 2026

Round-3 Fix Report: FlashMLA h_q Alignment Issue Correction / 第三轮修复报告

Correction / 修正

Root Cause / 根因

Updated Verification Matrix / 更新验证矩阵

Why Not Fix in AICB / 为何不在此PR修改AICB代码

Changes in This Commit / 本次修改

Full Analysis / 完整分析

Uh oh!

tianhao909 commented Apr 21, 2026

4. AICB bs>1 Test Results / AICB bs>1 测试结果

5. Codebase Audit / 全仓审计

EP Behavior Contract (Breaking Change)

Uh oh!

tianhao909 commented Apr 21, 2026

Uh oh!

tianhao909 commented Apr 21, 2026

Round-4 Fix: Complete bs=8 Verification / 第四轮修复: bs=8 完整验证

Uh oh!

tianhao909 commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

A.8 FlashMLA h_q % B_H 约束分析

第一层: AICB pinned 版本中的事实

第二层: 上游最新版本观察

vLLM 源码对照 (回答 reviewer: "vllm源码怎么写的")

第三层: 仿真与真实一致性

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

GPU Memory Module (`vidur-alibabacloud/`)

[EN] PR #268 Copilot Review — All Issues Addressed (commit `8f05712`)

[ZH] PR #268 Copilot 评审 — 全部问题已处理 (commit `8f05712`)

tianhao909 commented Apr 21, 2026 •

edited

Loading

A.8 FlashMLA `h_q % B_H` 约束分析