feat: SimAI 1.6 GPU memory module with PD-separation #268
MXtremist merged 8 commits into aliyun:master
Conversation
Pull request overview
This PR introduces GPU memory–aware inference simulation for SimAI 1.6 with Prefill/Decode (PD) disaggregation, alongside new model/device/node SKU configs, AICB integration hooks, scenario scripts, and documentation updates.
Changes:
- Add PD separation configuration (pd_node_ratio, per-phase TP/PP, num_prefill_replicas) and cluster/replica logic to support split prefill/decode clusters.
- Add GPU memory planning + KV-cache tracking utilities and extend model/device/node SKU coverage (H20/H200/GB200 + Qwen3/DeepSeek configs).
- Add unit tests, runnable scenarios, and documentation updates for the new PD + memory simulation workflow.
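As a rough illustration of the first bullet, below is a minimal sketch of how pd_node_ratio and num_prefill_replicas could translate into a prefill/decode replica split. The function name and rounding policy are assumptions for illustration, not the exact cluster/replica logic in this PR.

```python
from typing import Optional, Tuple

def split_replicas(
    num_replicas: int,
    pd_node_ratio: float,
    num_prefill_replicas: Optional[int] = None,
) -> Tuple[int, int]:
    """Illustrative sketch (not the actual cluster.py logic): derive
    (prefill, decode) replica counts from the PD-separation parameters.

    pd_node_ratio == 1 keeps the default MIXED mode (no split); a value in
    (0, 1) enables PD disaggregation; num_prefill_replicas, when given,
    overrides the ratio for fine-grained control.
    """
    if pd_node_ratio == 1 and num_prefill_replicas is None:
        return num_replicas, 0  # MIXED mode: every replica serves both phases
    if num_prefill_replicas is None:
        # Rounding policy here is an assumption for illustration only.
        num_prefill_replicas = max(1, round(num_replicas * pd_node_ratio))
    num_decode_replicas = num_replicas - num_prefill_replicas
    if num_decode_replicas < 1:
        raise ValueError("PD separation needs at least one decode replica")
    return num_prefill_replicas, num_decode_replicas

# e.g. 8 replicas with pd_node_ratio=0.25 -> (2, 6): 2 prefill + 6 decode
print(split_replicas(8, 0.25))
```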
Reviewed changes
Copilot reviewed 47 out of 47 changed files in this pull request and generated 9 comments.
Summary per file:
| File | Description |
|---|---|
| vidur-alibabacloud/vidur/utils/mfu_calculator.py | MFU computation updated for PD-aware MoE models |
| vidur-alibabacloud/vidur/types/node_sku_type.py | Add H20 DGX node type enum |
| vidur-alibabacloud/vidur/types/device_sku_type.py | Add H20/H200/GB200 device type enums |
| vidur-alibabacloud/vidur/simulator.py | Logging tweaks + AICB cache stats/save at end of run |
| vidur-alibabacloud/vidur/scheduler/utils/memory_planner.py | GPU memory planner updated for PD-aware param/KV-cache budgeting |
| vidur-alibabacloud/vidur/scheduler/replica_stage_scheduler/replica_stage_schduler.py | Comment/TODO normalization around SimAI integration |
| vidur-alibabacloud/vidur/scheduler/replica_scheduler/splitwise_replica_scheduler.py | PD scheduling path updates + KV-cache allocation/release calls |
| vidur-alibabacloud/vidur/scheduler/global_scheduler/splitwise_global_scheduler.py | PD global scheduling cleanups + logging + redundancy notes |
| vidur-alibabacloud/vidur/profiling/collectives/collectives_impl.py | TODO normalization for collectives profiling |
| vidur-alibabacloud/vidur/profiling/collectives/benchmark_runner.py | TODO normalization for comm backend |
| vidur-alibabacloud/vidur/metrics/metrics_store.py | Plot image write behavior adjusted (safe write helper usage) |
| vidur-alibabacloud/vidur/metrics/data_series.py | Add safe plot writing + skip stats/plots for non-numeric metrics |
| vidur-alibabacloud/vidur/metrics/cdf_sketch.py | Use safe plot writing helper |
| vidur-alibabacloud/vidur/execution_time_predictor/sklearn_execution_time_predictor.py | AICB prediction path changes + additional comments |
| vidur-alibabacloud/vidur/execution_time_predictor/communication_time_predictor.py | Comment/TODO normalization |
| vidur-alibabacloud/vidur/execution_time_predictor/base_execution_time_predictor.py | PD-aware AICB parameterization for phase-specific TP/PP/WS/EP |
| vidur-alibabacloud/vidur/events/replica_stage_schedule_event.py | Replace debug prints with logger + clearer assert message |
| vidur-alibabacloud/vidur/events/replica_schedule_event.py | TODO normalization |
| vidur-alibabacloud/vidur/events/batch_stage_end_event.py | TODO normalization |
| vidur-alibabacloud/vidur/events/batch_stage_arrival_event.py | Add/clarify class header comment |
| vidur-alibabacloud/vidur/events/batch_end_event.py | Logging + PD transfer logic instrumentation + typing tweak |
| vidur-alibabacloud/vidur/entities/request.py | KV-cache sizing formula corrected + PD DAG cleanup + clearer asserts |
| vidur-alibabacloud/vidur/entities/replica.py | KV-cache capacity tracking + per-request alloc/release helpers |
| vidur-alibabacloud/vidur/entities/cluster.py | PD cluster initialization + per-phase world_size/EP derivation |
| vidur-alibabacloud/vidur/entities/batch.py | Remove commented debug assert |
| vidur-alibabacloud/vidur/config/node_sku_config.py | Add H20 DGX node SKU config |
| vidur-alibabacloud/vidur/config/model_config.py | Add Qwen3Next/Qwen3MoE model configs (dataclasses) |
| vidur-alibabacloud/vidur/config/device_sku_config.py | Add H20/H200/GB200 device SKU configs + adjust H800 figures |
| vidur-alibabacloud/vidur/config/config.py | Add PD separation config params + pd_p2p_comm_dtype choices + aicb_force_bs1 |
| vidur-alibabacloud/tests/test_pd_separation.py | New unit tests for PD separation behavior/config |
| vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh | New multi-scenario runner script |
| vidur-alibabacloud/data/hf_configs/*.json | Add HF-style configs for DeepSeek/Qwen3 models |
| vidur-alibabacloud/README.md | Document GPU memory module + PD params + scenarios |
| vidur-alibabacloud/README_CN.md | Chinese documentation for the same feature set |
| vidur-alibabacloud/README-vidur.md | Add SimAI/AICB scenario examples section |
| vidur-alibabacloud/.gitignore | Ignore AICB workload directory + local run outputs |
| README.md | Root docs updated with SimAI 1.6 changes + inference suite |
| README_CN.md | Root Chinese docs updated with SimAI 1.6 changes + inference suite |
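Among the rows above, entities/replica.py adds per-request KV-cache allocation/release helpers. A minimal sketch of that kind of accounting is shown below; the class and method names (`KVCacheTracker`, `allocate`, `release`) are illustrative assumptions, not the actual vidur API.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class KVCacheTracker:
    """Hypothetical per-replica KV-cache accountant (names are assumptions,
    not the actual vidur entities/replica.py API)."""
    capacity_bytes: int                                       # budget available for KV cache
    allocated: Dict[int, int] = field(default_factory=dict)   # request_id -> bytes held

    def used_bytes(self) -> int:
        return sum(self.allocated.values())

    def can_allocate(self, num_tokens: int, bytes_per_token: int) -> bool:
        return self.used_bytes() + num_tokens * bytes_per_token <= self.capacity_bytes

    def allocate(self, request_id: int, num_tokens: int, bytes_per_token: int) -> None:
        if not self.can_allocate(num_tokens, bytes_per_token):
            raise ValueError(f"KV cache exhausted; cannot admit request {request_id}")
        self.allocated[request_id] = (
            self.allocated.get(request_id, 0) + num_tokens * bytes_per_token
        )

    def release(self, request_id: int) -> None:
        # Called when a request finishes (or after its KV blocks migrate in PD mode).
        self.allocated.pop(request_id, None)
```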
- Comment out untested H200/GB200 device configs (device_sku_config.py, device_sku_type.py)
- Clean pd_p2p_comm_dtype choices to tested types: fp8, float16, float32
- Fix .gitignore: add directory unignore before negation rules
- Replace print() with logger calls (simulator.py, mfu_calculator.py, sklearn_predictor.py)
- Fix duplicate write_image in metrics_store.py, keep _safe_write_image only (a sketch of such a helper follows this list)
- Add PD separation design comments in memory_planner.py
- Fix float division to int division in kvcache calculation
- Enrich TODO comments for seq_len=1 limitation and MFU semantic issue
- Add AICB backend empirical formula explanation comments
- Add Data Preparation section to README.md and README_CN.md
- Add test tutorial and report (test_tutorial_report_cn.md)
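For context on the "_safe_write_image only" change above, a plausible shape for such a helper is sketched below. This is an assumption about the implementation, not the actual metrics code: it wraps Plotly's `fig.write_image` and degrades to a warning when the Kaleido/Chrome export backend is missing.

```python
import logging

logger = logging.getLogger(__name__)

def _safe_write_image(fig, path: str) -> None:
    """Sketch of a safe plot-writing helper (assumed shape, not the exact
    metrics_store.py code): attempt the Plotly/Kaleido export and log a
    warning instead of crashing when the export backend is unavailable."""
    try:
        fig.write_image(path)  # Plotly figure export; requires the kaleido package
    except Exception as exc:   # e.g. missing Kaleido/Chrome on the host
        logger.warning("Skipping plot export to %s: %s", path, exc)
```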
PR #268 Copilot Review — All Issues Addressed (commit 8f05712)
All 9 Copilot review comments have been addressed in the latest commit:
| # | File | Issue | Resolution |
|---|---|---|---|
| 1 | mfu_calculator.py | ParamCounter returns memory bytes, not param count — MFU semantic inconsistency | TODO added: known design limitation documented with TODO(tianhao909); will be refactored when the ParamCounter API is unified. |
| 2 | memory_planner.py | get_max_batch_size returns prefill only, should return min(prefill, decode) | Comment added: in PD disaggregation, prefill and decode are independent clusters with separate schedulers; each cluster manages its own batch size independently. Returning prefill capacity is correct for the current architecture. |
| 3 | memory_planner.py | Float division / should be integer // | Fixed: changed / to // for kvcache_size_per_layer_per_token in critical paths (see the sizing sketch after this table). |
| 4 | .gitignore | Directory-level ignore blocks negation rules | Fixed: added !data/aicb_workload/ before file-level negation rules. |
| 5 | simulator.py | print() should use logger.warning() | Fixed: replaced print(f"[WARNING]...") with logger.warning(...). |
| 6 | sklearn_execution_time_predictor.py | Hardcoded linear predictions instead of real models | Comment added: this is the AICB backend's intentional design — it uses empirical linear formulas for new MoE models lacking profiling data. The vidur backend uses real sklearn models when profiling CSVs are available. |
| 7 | mfu_calculator.py | Multiple print() statements bypass logging | Fixed: added logger = init_logger(__name__), replaced all print() with logger.debug()/logger.warning(). |
| 8 | memory_planner.py | seq_len=1 drastically underestimates KV cache | TODO enriched: expanded from a 4-line to a 15-line TODO explaining the known limitation, planned fixes (per-request seq_len, dynamic allocation), and the rationale for the current workaround. |
| 9 | metrics_store.py | Duplicate write_image calls | Fixed: removed direct fig.write_image(), kept only _safe_write_image(), which gracefully handles missing Kaleido/Chrome. |
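Issues 3 and 8 both concern how the planner turns a GPU-memory budget into KV-cache capacity. A minimal sketch of that computation is below, assuming the standard 2 × num_kv_heads × head_dim × dtype_bytes per-layer-per-token formula; the function name and all numeric values are illustrative, not the exact memory_planner.py code.

```python
def kvcache_size_per_layer_per_token(num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per transformer layer per token (K and V tensors).

    Standard sizing formula shown for illustration; the exact expression in
    memory_planner.py may differ (e.g. TP sharding factors).
    """
    return 2 * num_kv_heads * head_dim * dtype_bytes

# Turning a byte budget into a whole-token capacity uses integer division (//),
# matching the fix in issue #3 — fractional tokens must never be counted.
budget_bytes = 8 * 1024**3   # hypothetical per-GPU KV-cache budget
num_layers = 32              # illustrative values only
per_token_bytes = num_layers * kvcache_size_per_layer_per_token(
    num_kv_heads=8, head_dim=128, dtype_bytes=2
)
max_tokens = budget_bytes // per_token_bytes
print(max_tokens)
```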
- Ran all 4 scenarios from run_scenarios.sh (Qwen3-Next-80B no-PD, Qwen3-Next-80B PD, DeepSeek-671B PD, Qwen3-MoE-235B PD)
- All 4 scenarios passed on H20 GPU with the FP8 + AICB backend
- Updated test_tutorial_report_cn.md with a detailed results table
Round-2 Review Fix Report
Commit:
1. Review Feedback Resolution
2. Code Changes Summary
3. Documentation Sync
4. AICB bs>1 Test Results
Environment: 8× NVIDIA H20-3e (SM90, 144GB), conda
comm_size linear scaling verified: DeepSeek bs=2/4/8
Known limitation: DeepSeek-671B prefill bs>1 fails on H20 due to a sparse FlashAttention kernel assertion
Resolution:
5. Codebase Audit
EP Behavior Contract (Breaking Change)
Previous: EP silently overridden to …
New: If user passes …
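The exact values in the EP contract above are truncated in this thread, but the direction of the change (validate and raise instead of silently overriding) can be sketched as follows. The function name `validate_ep` and the divisibility rule are illustrative assumptions, not the actual config.py contract.

```python
def validate_ep(expert_parallel_size: int, num_experts: int) -> int:
    """Hypothetical sketch of the new EP contract: reject invalid values
    instead of silently overriding them (the previous behavior)."""
    if expert_parallel_size <= 0 or num_experts % expert_parallel_size != 0:
        raise ValueError(
            f"expert_parallel_size={expert_parallel_size} is invalid for "
            f"num_experts={num_experts}; it must be a positive divisor of num_experts"
        )
    return expert_parallel_size
```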
…erification appendix
- Correct inaccurate "bs>1 fails" wording in config.py and execution_time.py to the precise root cause: FlashMLA flash_mla_sparse_fwd kernel h_q alignment (B_H=64)
- Add Appendix A to test_tutorial_report_cn.md with:
  - Full AICB tp=1/2/4/8 prefill + tp=8 decode verification matrix
  - FlashMLA source code evidence and call stack
  - Vidur degradation path validation
  - Environment info
88a19a0 to 1b547ca
Round-3 Fix Report: FlashMLA h_q Alignment Issue Correction
Commit:
Correction
Previous (incorrect): "DeepSeek-671B prefill fails with bs>1 on H20 (sparse FlashAttention kernel assertion)"
Corrected: DeepSeek-V3-671B prefill AICB profiling fails on H20 (SM90) when tp≥4, due to the FlashMLA h_q alignment constraint.
Root Cause
The FlashMLA SM90 prefill sparse kernel requires params.h_q % B_H == 0.
Updated Verification Matrix
DeepSeek-V3-671B Prefill (seq=1024, H20 SM90)
Key finding: failure is determined by tp (h_q alignment), NOT by batch size; bs=2/4/8 all pass when tp≤2.
Decode (tp=8, bs=2): ✅ — decode uses a different kernel and is unaffected by B_H.
Why Not Fix in AICB
AICB is a separate repository maintained by others. This PR only modifies …
Changes in This Commit
Full Analysis
See …
Environment: 8× NVIDIA H20-3e (SM90), CUDA 12.9, FlashMLA 1.0.0+1408756, PyTorch 2.8.0, AICB
… constraint analysis
- Run missing bs=8 test cases for DeepSeek-671B (tp=4/8) and Qwen3-Next-80B (tp=1)
- Replace all dash marks with actual test results and explanations
- Add FlashMLA h_q alignment constraint analysis with version-pinning evidence (A.8)
- Archive experiment logs inline as Appendix A.9 for reproducibility
- Update environment baseline (CUDA 12.8, PyTorch 2.8.0+cu128, vLLM 0.11.0)
Round-4 Fix: Complete bs=8 Verification
The two "—" marks in the verification matrix (Appendix A.4) for DeepSeek tp=4/8 bs=8 have been replaced with actual test results.
Result: both cases FAIL with the exact same error signature as bs=1/2/4. Additionally, Qwen3-Next-80B bs=8 has been verified as PASS (it uses FlashInfer, not FlashMLA).
The failure is entirely determined by tp (h_q alignment), independent of bs. The previous "—" marks were simply untested cases, not a different failure mode.
Updated matrix (all cells now verified, no "—" remaining):
See test report Appendix A.9 for full experiment logs (commands, exit codes, error signatures).
A.8 FlashMLA h_q Alignment Constraint Analysis
| Component | Version | Key code location |
|---|---|---|
| FlashMLA | 1.0.0+1408756 | csrc/sm90/prefill/sparse/config.h → B_H = 64 |
| AICB | 23eec3c | AiobDeepSeek.py:182 → h_q = self.num_heads // self.tp |
| vLLM | 0.11.0 | requirements.txt pinned dependency |
Constraint logic: FlashMLA's SM90 prefill sparse kernel uses B_H=64 as the WGMMA tile size and requires
params.h_q % B_H == 0 at runtime. AICB computes the per-TP-rank head count as
h_q = num_attention_heads / tp. For DeepSeek-V3 (num_attention_heads=128), this gives the table below
(recomputed in the short script after the table):
| tp | h_q | h_q % 64 | Result |
|---|---|---|---|
| 1 | 128 | 0 | PASS |
| 2 | 64 | 0 | PASS |
| 4 | 32 | 32 | FAIL |
| 8 | 16 | 16 | FAIL |
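The PASS/FAIL column follows directly from the two facts above (B_H = 64 and h_q = num_attention_heads // tp); the short script below simply recomputes it for DeepSeek-V3 and is included only to make the constraint concrete.

```python
B_H = 64                    # WGMMA tile size in FlashMLA's SM90 sparse prefill kernel (config.h)
NUM_ATTENTION_HEADS = 128   # DeepSeek-V3

for tp in (1, 2, 4, 8):
    h_q = NUM_ATTENTION_HEADS // tp  # per-TP-rank head count, as computed by AICB
    status = "PASS" if h_q % B_H == 0 else "FAIL"
    print(f"tp={tp}: h_q={h_q}, h_q % {B_H} = {h_q % B_H} -> {status}")
```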
Layer 2: Observations on the latest upstream version
- FlashMLA main branch (as of 2026-04-21): B_H=64 unchanged (config.h)
- Code restructuring: the assertion moved from fwd.cu:647 to phase1.cuh, but the constraint logic is unchanged
- Upstream FlashMLA PR #150 (2026-01-16) refactored several areas, but the B_H value was not modified
Conclusion: the latest upstream version still carries this constraint; it is not specific to the pinned version.
vLLM source code comparison (answering the reviewer question: "how does the vLLM source do this?")
(a) Call path in vLLM v0.11.0 (the version AICB pins):
| Layer | File | Key content |
|---|---|---|
| ops layer | vllm/attention/ops/flashmla.py | flash_mla_sparse_prefill() → calls torch.ops._flashmla_C.sparse_prefill_fwd |
| backend layer | vllm/v1/attention/backends/mla/flashmla_sparse.py (544 lines) | imports flash_mla_sparse_prefill and calls it directly in the prefill phase |
- No head-padding mechanism: h_q is passed directly into the CUDA kernel, so h_q < 64 triggers the B_H=64 assertion failure
(b) AICB call path:
| Layer | File | Key content |
|---|---|---|
| entry point | AiobDeepSeek.py:235 | calls flash_mla_sparse_fwd() (imports the flash_mla Python package directly) |
- Python entry-point names differ: vLLM uses flash_mla_sparse_prefill, AICB uses flash_mla_sparse_fwd
- Both execute the same underlying FlashMLA SM90 CUDA sparse prefill kernel
- h_q is computed the same way: h_q = num_attention_heads // tp
- Conclusion: at the CUDA kernel level, the AICB simulation call path matches vLLM v0.11.0's real inference path
(c) Supplementary note on the latest vLLM:
Version reference: vLLM main HEAD 582340f27 (roughly v0.18.2, as of 2026-04-21)
| DeepSeek-V3 tp | h_q | v0.11.0 behavior | main behavior | main workaround path |
|---|---|---|---|---|
| tp=1 | 128 | PASS | PASS | not needed |
| tp=2 | 64 | PASS | PASS | not needed |
| tp=4 | 32 | FAIL | PASS | BF16 prefill + head padding (32→64) |
| tp=8 | 16 | FAIL | PASS | mixed-batch FP8 decode kernel (bypasses BF16 prefill) |
- New constant MIN_HEADS_FOR_BF16_PREFILL = 32 (L63)
- tp=4 (h_q=32): 32 < 32 is False → BF16 prefill + head padding to 64
- tp=8 (h_q=16): 16 < 32 is True → mixed-batch FP8 → bypasses the BF16 prefill constraint
- Note: this workaround does not affect the current AICB conclusion, because AICB pins v0.11.0
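To make the two branches above easier to follow, here is a paraphrased sketch of that dispatch. It is not verbatim vLLM code; only the constant name MIN_HEADS_FOR_BF16_PREFILL is taken from the note above, and the padding helper is an assumption.

```python
MIN_HEADS_FOR_BF16_PREFILL = 32  # constant reported from vLLM main (582340f27); logic paraphrased
B_H = 64                         # FlashMLA SM90 sparse prefill alignment requirement

def choose_prefill_path(h_q: int) -> str:
    """Paraphrased sketch of the upstream workaround dispatch, not actual vLLM code."""
    if h_q < MIN_HEADS_FOR_BF16_PREFILL:
        # tp=8, h_q=16: use the mixed-batch FP8 decode kernel, bypassing BF16 sparse prefill
        return "fp8_mixed_batch"
    # tp=4, h_q=32: stay on BF16 prefill, but pad heads up to a multiple of B_H (32 -> 64)
    padded_h_q = ((h_q + B_H - 1) // B_H) * B_H
    return f"bf16_prefill_padded_to_{padded_h_q}"

for tp, h_q in ((1, 128), (2, 64), (4, 32), (8, 16)):
    print(f"tp={tp}, h_q={h_q} -> {choose_prefill_path(h_q)}")
```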
Layer 3: Consistency between simulation and reality
- AICB's h_q = num_heads // tp matches the head-partitioning logic used in real vLLM inference
- Real vLLM also triggers the same FlashMLA assertion at tp>=4
- vLLM v0.11.0: AICB simulation behavior = vLLM real inference behavior (both hit the assertion at tp>=4)
- vLLM main (582340f27) avoids the issue in two ways:
  - tp=4 (h_q=32): pads h_q from 32 to 64 (head padding) before calling the kernel, so h_q % 64 == 0 holds and the assertion no longer fires
  - tp=8 (h_q=16): switches to the FP8 mixed-batch decode kernel, bypassing the BF16 sparse prefill path entirely, so the B_H=64 constraint is never hit
  - This does not affect the current AICB conclusion, because AICB pins v0.11.0, which has neither workaround
- Conclusion: simulation behavior = real behavior (both trigger the assertion), so the simulation is accurate
See test report Appendix A.8 for the detailed analysis: vidur-alibabacloud/tests/test_tutorial_report_cn.md
[Revised 2026-04-21: added the vLLM v0.11.0 source comparison, fixed the FlashMLA PR auto-link, expanded workaround details]
… analysis
- Fix auto-linked aliyun#150 reference: replace bare aliyun#150 with the full FlashMLA PR URL
- Add vLLM v0.11.0 vs AICB call path comparison (flash_mla_sparse_prefill vs flash_mla_sparse_fwd)
- Add a supplementary note on the latest vLLM (582340f27) head padding / FP8 mixed-batch workarounds
- Qualify the "same kernel" statement: same underlying SM90 CUDA kernel, different Python entry points
- Use version-pinned permalinks for all external references
…ogy unification
- Condense SimAI 1.6 release notes to 3 key points: GPU memory, decode interpolation, PD disaggregation
- Update release date to Apr 23, 2026 (confirmed release date)
- Unify terminology: PD Separation → PD Disaggregation across all READMEs
- Fully sync README.ja.md with all missing sections
- Add trilingual language-switcher links
…te scenario commands
- Restore README-vidur.md to match the official Microsoft Vidur README (remove SimAI extensions)
- Migrate the 4-scenario manual CLI commands to README.md and README_CN.md
- Add detailed per-scenario python commands after the "Run 4-Scenario Suite" section
- Keep both the run_scenarios.sh quick start and the full manual commands for advanced users

Summary
This PR adds GPU memory-aware inference simulation to SimAI 1.6, with support for Prefill-Decode (PD) disaggregation.
This is a clean resubmission of PR #243 with all review fixes applied.
Key Changes
GPU Memory Module (vidur-alibabacloud/)
PD Separation (Prefill-Decode Disaggregation)
- `pd_node_ratio` parameter: `1` (default) = MIXED mode; `(0,1)` = PD separation enabled
- `num_prefill_replicas` override for fine-grained control
- `tests/test_pd_separation.py` (10 test cases)
Device & Node SKU Support
Review Fixes (from PR #243 feedback)
- `assert` → `if not: raise ValueError()` for better error messages
- `logger.info` → `logger.debug` (36 places) to reduce noise
- `_generate_or_find_bs1_csv()` controlled by the `aicb_force_bs1` config flag
- `H20DgxNodeSKUConfig.device_sku_type` corrected from H800 to H20
- Added `@dataclass` decorators to H200/GB200 device SKU configs
- `pd_p2p_comm_dtype` metadata: structured `choices` field
- `Qwen3235BA22BModelConfig` → `Qwen3Moe235BA22BModelConfig`
- `qwen3-235B-A22B_FP8_config.json`: architectures and eos_token_id
- `.gitignore`: added `data/aicb_workload/` to prevent large data file commits
Documentation
- `README.md` / `README_CN.md` (removed dead docs/ links)
- `vidur-alibabacloud/README.md`
Testing
Notes
- `upstream/master` with a single clean commit (no `aicb_workload` data files in history)