diff --git a/README.ja.md b/README.ja.md index dd96f29f..5539f64c 100644 --- a/README.ja.md +++ b/README.ja.md @@ -1,6 +1,36 @@ +
+ +# SimAI + +[](LICENSE) +[](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf) + # 最新ニュース -### SimCCLのアップデート -[2025/06] SimCCLのコードが最初に[SimCCL](https://github.com/aliyun/SimAI/tree/SimCCL)ブランチで公開され、まもなくSimCCLリポジトリでリリースされます。 + +### 最近のアップデート + +- [2026/04] **SimAI 1.6 リリース!** 主な更新: + - 推論シミュレーション向け GPU メモリモデリング(パラメータカウント&KV Cache)。 + - Decode 時間推定の線形補間(最近傍探索の代替)。 + - PD Disaggregation メモリプランニング(Prefill/Decode 独立バジェット)。 + +- [2025/12] **SimAI 1.5 リリース!** このリリースでは、マルチリクエスト**推論**ワークロード向けのエンドツーエンドシミュレーションが実現されました。主な機能: + + - **高度な推論シミュレーション:** Prefill/Decode 分離を用いた複雑なシナリオのモデリング。 + - **最新モデルサポート:** DeepSeek、Qwen3Moe、Qwen3Next に対応。詳細は [AICB の README](./aicb/README.md) を参照してください。 + - **リクエストスケジューリング:** リクエストスケジューリングは、Microsoft の [Vidur](https://github.com/microsoft/vidur) から適応したコンポーネントによって処理されます。詳細は [Vidur-Alibabacloud の README](./vidur-alibabacloud/README.md) を参照してください。 + +- [2025/11] [AICB](https://github.com/aliyun/aicb/tree/master) が **DeepSeek**、**Qwen3-MoE**、**Qwen3-Next** 向けの **prefill/decode** 推論ワークロード生成に対応しました。 + +- [2025/09] [AICB](https://github.com/aliyun/aicb/tree/master) が DeepSeek 向けのトレーニングワークロード生成に対応しました。[@parthpower](https://github.com/parthpower) 氏のコントリビューションに感謝します。 + +- [2025/06] SimCCLのコードが最初に[SimCCL](https://github.com/aliyun/SimAI/tree/SimCCL)ブランチで公開され、まもなくSimCCLリポジトリでリリースされます。 + +**コミュニティからの貢献を歓迎します!** SimAI の未来を一緒に作りたい方は、お気軽に Issue を開いてアイデアを議論したり、プルリクエストを送信してください。 + +
-
+
+ 中文  |  English  |  日本語 +
+ +# SimAI + +[](LICENSE) +[](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf) + # Latest News ### Recent Updates +- [2026/04] **SimAI 1.6 Released!** Key updates: + - GPU memory modeling for inference simulation (parameter counting & KV cache). + - Linear interpolation for decode time estimation (replacing nearest-neighbor). + - Prefill-Decode Disaggregation memory planning (independent budgets for Prefill/Decode). + - [2025/12] **SimAI 1.5 Released!** This release brings end-to-end simulation for multi-request **inference** workloads. Key features include: - - - **Advanced Inference Simulation:** Model complex scenarios with Prefill/Decode separation. - - **Modern Model Support:** Now includes DeepSeek, Qwen3Moe and Qwen3Next. See [AICB's README](./aicb/README.md) for more detailed information. - - **Request Scheduling:** Request scheduling is now handled by a component adapted from Microsoft's [Vidur](https://github.com/microsoft/vidur). See [Vidur-Alibabacloud's README](./vidur-alibabacloud/README.md) for more detailed information. + + - **Advanced Inference Simulation:** Model complex scenarios with Prefill/Decode separation. + - **Modern Model Support:** Now includes DeepSeek, Qwen3Moe and Qwen3Next. See [AICB's README](./aicb/README.md) for more detailed information. + - **Request Scheduling:** Request scheduling is now handled by a component adapted from Microsoft's [Vidur](https://github.com/microsoft/vidur). See [Vidur-Alibabacloud's README](./vidur-alibabacloud/README.md) for more detailed information. - [2025/11] [AICB](https://github.com/aliyun/aicb/tree/master) now supports generating **prefill/decode** inference workloads for **DeepSeek**, **Qwen3-MoE** and **Qwen3-Next**. @@ -14,7 +28,8 @@ - [2025/06] The code of SimCCL is first released in the branch [SimCCL](https://github.com/aliyun/SimAI/tree/SimCCL) and will be released in SimCCL repository soon. -**We warmly welcome contributions from the community!** If you are interested in helping shape the future of SimAI, please feel free to open an issue to discuss your ideas or submit a pull request. +**We warmly welcome contributions from the community!** If you are interested in helping shape the future of SimAI, please feel free to open an issue to discuss your ideas or submit a pull request. +
+ 中文  |  English  |  日本語 +
+
+# SimAI
+
+[](LICENSE)
+[](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf)
+
+# 最新动态
+
+### 近期更新
+
+- [2026/04] **SimAI 1.6 正式发布!** 主要更新:
+  - 推理仿真 GPU 显存建模(参数计数与 KV Cache 管理)。
+  - Decode 耗时线性插值估算(替代最近邻查找)。
+  - PD 分离内存规划(Prefill/Decode 独立预算)。
+
+- [2025/12] **SimAI 1.5 正式发布!** 本版本新增对多请求**推理**工作负载的端到端仿真支持,主要特性包括:
+
+  - **高级推理仿真:** 支持 Prefill/Decode 分离等复杂场景建模。
+  - **主流模型支持:** 新增 DeepSeek、Qwen3Moe 和 Qwen3Next 模型。详见 [AICB README](./aicb/README.md)。
+  - **请求调度:** 请求调度组件基于微软 [Vidur](https://github.com/microsoft/vidur) 适配,详见 [Vidur-Alibabacloud README](./vidur-alibabacloud/README_CN.md)。
+
+- [2025/11] [AICB](https://github.com/aliyun/aicb/tree/master) 新增对 **DeepSeek**、**Qwen3-MoE** 和 **Qwen3-Next** 的 **prefill/decode** 推理工作负载生成支持。
+
+- [2025/09] [AICB](https://github.com/aliyun/aicb/tree/master) 新增 DeepSeek 训练工作负载生成支持。感谢 [@parthpower](https://github.com/parthpower) 的贡献。
+
+- [2025/06] SimCCL 代码首次在 [SimCCL](https://github.com/aliyun/SimAI/tree/SimCCL) 分支发布,后续将在独立仓库正式开源。
+
+**欢迎社区贡献!** 如有想法,欢迎提交 Issue 讨论或发起 Pull Request。
+
+```
+        |--- AICB
+SimAI --|--- SimCCL
+        |--- astra-sim-alibabacloud
+        |--- ns-3-alibabacloud
+        |--- vidur-alibabacloud
+```
+
+在纯仿真能力基础上,SimAI 已演进为一个由五个组件([aicb](https://github.com/aliyun/aicb)、[SimCCL](https://github.com/aliyun/SimCCL)、[astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)、[ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud)、[vidur-alibabacloud](./vidur-alibabacloud))构成的全栈工具套件。这些组件可以灵活组合以实现不同功能。我们鼓励用户探索更多可能性。
+
+下图为 SimAI 模拟器架构图:
+
+
+astra-sim-alibabacloud 基于 [astra-sim](https://github.com/astra-sim/astra-sim/tree/ASTRA-sim-1.0) 扩展开发。感谢 astra-sim 团队的优秀工作和开源贡献。我们在其基础上集成了 NCCL 算法并添加了若干新特性。
+
+## 应用场景
+
+SimAI 支持三种主要运行模式:
+
+**SimAI-Analytical** 通过使用总线带宽(busbw)抽象网络通信细节来估算集合通信时间,实现快速仿真。目前支持用户自定义 busbw,自动计算 busbw 功能即将推出。
+
+**SimAI-Simulation** 提供基于细粒度网络通信建模的全栈仿真。利用 NS-3 或其他网络模拟器(当前 NS-3 已开源)实现对所有通信行为的详细仿真,力求高保真还原真实训练环境。
+
+**SimAI-Physical** *(Beta)* 支持在 CPU RDMA 集群环境下生成物理流量,通过生成类 NCCL 的流量模式深入研究 LLM 训练中的 NIC 行为。当前处于内测阶段。
+
+| 场景 | 描述 | 组件组合 |
+|------|------|----------|
+| 1. AICB 测试套件 | 在 GPU 集群上使用 AICB 测试套件运行通信模式 | [AICB](https://github.com/aliyun/aicb) |
+| 2. AICB/AIOB 工作负载 | 建模**推理**/训练过程的计算/通信模式以生成工作负载 | [AICB](https://github.com/aliyun/aicb) |
+| 3. 集合通信分析 | 将集合通信操作分解为点对点通信集合 | [SimCCL](https://github.com/aliyun/SimCCL) |
+| 4. 无 GPU 集合通信 | 在非 GPU 集群上执行 RDMA 集合通信流量 | [AICB](https://github.com/aliyun/aicb) + [SimCCL](https://github.com/aliyun/SimCCL) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(physical) |
+| 5. SimAI-Analytical | 在任意服务器上快速进行 AICB 工作负载分析与仿真(忽略底层网络细节) | [AICB](https://github.com/aliyun/aicb) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(analytical) |
+| 6. SimAI-Simulation | 在任意服务器上进行全栈仿真 | [AICB](https://github.com/aliyun/aicb) + [SimCCL](https://github.com/aliyun/SimCCL) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(simulation) + [ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) |
+| 7. 多请求推理仿真 | 在单 GPU 服务器上进行多请求**推理**全栈仿真 | [AICB](https://github.com/aliyun/aicb) + [SimCCL](https://github.com/aliyun/SimCCL) + [vidur-alibabacloud](./vidur-alibabacloud) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(analytical/simulation) |
+
+## 引用
+
+SimAI 论文已被 NSDI'25 Spring 接收,详情请参阅:
+
+*SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision.*
+
+[[pdf](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf)] / [[slides](./docs/SimAI_Intro_Online.pdf)] / [[video](https://n.dingtalk.com/dingding/live-room/index.html?roomId=OF5BkBUXVxmgsK7x&liveUuid=305736cd-aa70-498b-8003-2b471a53decd)]
+
+欢迎基于 SimAI 开展创新研究和功能扩展;也欢迎加入社区群或通过邮件与我们交流,我们可提供技术支持。
+
+# 快速开始
+
+以下为简单示例。完整教程请参见:[**SimAI@Tutorial**](./docs/Tutorial.md)、[**aicb@Tutorial**](https://github.com/aliyun/aicb/blob/master/training/tutorial.md)、[SimCCL@Tutorial]、[ns-3-alibabacloud@Tutorial]
+
+## 环境搭建
+
+请按照以下步骤快速搭建环境并运行 SimAI。
+
+### 从源码安装
+
+以下步骤已在 Ubuntu 20.04 的 GCC/G++ 9.4.0、Python 3.8.10 环境下验证。
+
+可使用官方 Ubuntu 20.04 镜像,**不要安装 ninja**。
+
+(对于工作负载生成场景,推荐直接使用 NGC 容器镜像。)
+
+```bash
+# 克隆仓库
+$ git clone https://github.com/aliyun/SimAI.git
+$ cd ./SimAI/
+
+# 初始化子模块
+$ git submodule update --init --recursive
+# 更新到最新提交
+$ git submodule update --remote
+
+# 编译 SimAI-Analytical
+$ ./scripts/build.sh -c analytical
+
+# 编译 SimAI-Simulation (ns3)
+$ ./scripts/build.sh -c ns3
+```
+
+## 使用 SimAI-Analytical
+
+```bash
+$ ./bin/SimAI_analytical -w example/workload_analytical.txt -g 9216 -g_p_s 8 -r test- -busbw example/busbw.yaml
+```
+
+若需自动计算总线带宽,请尝试:
+
+```bash
+$ ./bin/SimAI_analytical -w ./example/workload_analytical.txt -g 9216 -nv 360 -nic 48.5 -n_p_s 8 -g_p_s 8 -r example-
+```
+
+## 使用 SimAI-Simulation
+
+```bash
+# 生成网络拓扑
+$ python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps
+
+# 运行仿真
+$ AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 16 -w ./example/microAllReduce.txt -n ./Spectrum-X_128g_8gps_100Gbps_A100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf
+```
+
+## 使用多请求推理仿真
+
+详情请参见 `vidur-alibabacloud` 目录下的 [README](./vidur-alibabacloud/README_CN.md)。该模块利用 AICB 对**推理**工作负载的计算时间进行 profiling。由于依赖 DeepGEMM 和 FlashMLA 等特定硬件加速库,目前仅兼容基于 **Hopper(SM90)** 和 **Blackwell(SM100)** 架构的 NVIDIA GPU。
+
+```bash
+# 从 Dockerfile 构建
+docker build -t image:latest .
+docker run --gpus all -it --rm image:latest +``` + +**注意:** 若使用 Hopper GPU,请在 Dockerfile 中添加 `ENV FLASH_MLA_DISABLE_SM100=1`。 + +如需快速验证所有支持的推理场景(Qwen3-Next-80B、DeepSeek-671B、Qwen3-MoE-235B),可使用内置的四场景测试套件: + +```bash +# 前置条件:conda activate vidur +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all +# 或单独运行某个场景: +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --scenario 1 +``` + +> **前置条件:** 需先激活 `conda activate vidur` 环境。详见 [环境配置](./vidur-alibabacloud/README_CN.md#-环境配置)。 +> +> 完整场景配置表与输出文件说明请参见 [Vidur-AlibabaCloud README](./vidur-alibabacloud/README_CN.md#四场景配置说明)。 + +# 致谢 + +衷心感谢以下人员和机构对本项目的贡献: + + +- TianHao Fu (Peking University) and [TELOS-syslab](https://github.com/TELOS-syslab/) +- Parth Parikh (KEYSIGHT) +- Sarah-Michelle Hammer & Ziyi Wang (TU-Berlin) +- Xinyue Li (BUPT) +- Tong Chen (Zhejiang University) +- Ming Wang (BUPT) +- Tao Jiang (Institute of Computing Technology, Chinese Academy of Sciences) + +……以及众多来自社区的个人贡献者(详见 [Contributors to aliyun/SimAI](https://github.com/aliyun/SimAI/graphs/contributors))。 + +同时感谢 Chenning Li(MIT CSAIL)发起了将 SimAI 集成到 [M4](https://github.com/netiken/m4) 的合作——M4 是一个新型创新模拟器。 + +**本项目持续欢迎更多贡献与建议。** + +# 贡献指南 + +欢迎参与贡献!开始前请阅读以下指引: + +| | | +|---|---| +| [贡献指南](./CONTRIBUTING.zh-CN.md) | 如何提交 Issue 和 Pull Request | +| [安全政策](./SECURITY_CN.md) | 如何报告安全漏洞 | +| [行为准则](./CODE_OF_CONDUCT_CN.md) | 社区行为规范 | +| [更新日志](./CHANGELOG_CN.md) | v1.5 起的版本历史 | + +# 联系我们 + +如有任何问题,欢迎发送邮件至:Gang Lu(yunding.lg@alibaba-inc.com)、Feiyang Xue(xuefeiyang.xfy@alibaba-inc.com)或 Qingxu Li(qingxu.lqx@alibaba-inc.com)。 + +欢迎加入 SimAI 社区交流群,左侧为钉钉群,右侧为微信群。 + +
+
+
+ 中文&nbsp;&nbsp;|&nbsp;&nbsp;English
+
+# Vidur-AlibabaCloud -Vidur ([original](https://github.com/microsoft/vidur)) is a simulation framework for large language model (LLM) inference systems. -**Vidur-AlibabaCloud** (this repository) is a customized version optimized for Alibaba Cloud **SimAI** scenarios. It supports advanced features such as **Prefill–Decode (PD) disaggregation** and includes dedicated adaptations for state-of-the-art (SOTA) LLM models including **DeepSeek-V3-671B**, **Qwen3-MoE-235B**, **Qwen3-Next-80B**, and other models. +[](https://www.python.org/downloads/) +[](LICENSE) + +Vidur ([original](https://github.com/microsoft/vidur)) is a simulation framework for large language model (LLM) inference systems. +**Vidur-AlibabaCloud** (this repository) is a customized version optimized for Alibaba Cloud **SimAI** scenarios. It supports advanced features such as **Prefill–Decode (PD) disaggregation** and includes dedicated adaptations for SOTA LLM models including **DeepSeek-V3-671B**, **Qwen3-MoE-235B**, **Qwen3-Next-80B**, and others. + + +--- + +## Table of Contents + +- [Key Features](#key-features) +- [GPU Memory Calculation](#gpu-memory-calculation) +- [Supported Models](#supported-models) +- [Environment Setup](#-environment-setup) +- [Running Examples](#%EF%B8%8F-running-examples) + - [4-Scenario Configuration](#4-scenario-configuration) + - [Output Files](#output-files) +- [Key Input Parameters](#-key-input-parameter-reference) +- [Key Output Interpretation](#-key-output-interpretation) +- [Known Issues](#%EF%B8%8F-known-issues) +- [Help](#-help) --- ## Key Features -+ **Prefill–Decode (PD) Separation** – Enables running the prefill and decode stages on different nodes, allowing elastic resource allocation and performance isolation. -(Inspired by [splitwise-sim](https://github.com/Mutinifni/splitwise-sim)). -+ **Flexible Parallelism** – Supports: - - **Data Parallel (DP)** - - **Tensor Parallel (TP)** - - **Pipeline Parallel (PP)** - - **Expert Parallel (EP)** (support in progress) -Works for both **dense** and **Mixture-of-Experts (MoE)** models (MoE support in progress). -+ **Multiple Execution-Time Prediction Backends** – Choose from: - - **AICB/AIOB** - Partially supports computation kernels and TP, DP, PP, EP communication size for DeepSeek-V3-671B, Qwen3-Moe-235B, Qwen3-Next-80B - - **SimAi_simulation** – SimAI NS-3-based network simulation (supports TP) - - **SimAi_analytical** – SimAI analytical performance model (supports TP) - - **Native Vidur [original]** – Supports TP, DP, PP -+ **Workload Generation & Replay** – Replay real-world traces or generate synthetic requests using fixed or Poisson distributions. -+ **Fine-Grained Metrics** – Records: - - TTFT – Time to First Token - - TBT / TPOT – Time Between Tokens / Time Per Output Token - - End-to-end latency - - Communication cost - - Computation cost - - Scheduling delay + +- **Prefill–Decode (PD) Disaggregation** — Enables running the prefill and decode stages on different nodes, allowing elastic resource allocation and performance isolation. + (Inspired by [splitwise-sim](https://github.com/Mutinifni/splitwise-sim)) +- **Flexible Parallelism** — Supports: + - **Data Parallel (DP)** + - **Tensor Parallel (TP)** + - **Pipeline Parallel (PP)** + - **Expert Parallel (EP)** (auto-set to cluster world_size, manual override not supported) + + Works for both **dense** and **Mixture-of-Experts (MoE)** models (MoE support in progress). 
+- **Multiple Execution-Time Prediction Backends** — Choose from:
+  - **AICB/AIOB** — Partially supports computation kernels and TP, DP, PP, EP communication sizes for DeepSeek-V3-671B, Qwen3-MoE-235B, Qwen3-Next-80B
+  - **SimAI Simulation** — SimAI NS-3-based network simulation (supports TP)
+  - **SimAI Analytical** — SimAI analytical performance model (supports TP)
+  - **Native Vidur [original]** — Supports TP, DP, PP
+- **Workload Generation & Replay** — Replay real-world traces or generate synthetic requests using fixed or Poisson distributions.
+- **Fine-Grained Metrics** — Records:
+  - TTFT — Time to First Token
+  - TBT / TPOT — Time Between Tokens / Time Per Output Token
+  - End-to-end latency
+  - Communication cost
+  - Computation cost
+  - Scheduling delay
+
+---
+
+## GPU Memory Calculation
+
+This module provides accurate GPU memory estimation for modern MoE (Mixture-of-Experts) models during inference simulation, covering **model parameter memory**, **KV cache memory**, and **maximum batch size** calculation under Prefill–Decode (PD) disaggregation.
+
+### Supported Attention Architectures
+
+| Architecture | Model | Description |
+|---|---|---|
+| **MLA** (Multi-head Latent Attention) | DeepSeek-V3-671B | Uses LoRA-compressed KV cache (`kv_lora_rank` + `qk_rope_head_dim`) for a reduced memory footprint |
+| **MHA / GQA** (Multi-Head / Grouped-Query Attention) | Qwen3-MoE-235B | Standard KV cache with `num_kv_heads * head_dim` per token per layer |
+| **Hybrid Full + Linear Attention** | Qwen3-Next-80B | Alternates between full attention and linear (GDN) attention every 4 layers |
+
+### Key Components
+
+- **`ParamCounter`** (`vidur/utils/param_counter.py`) — Computes per-layer and per-device parameter counts for MLA, MHA/GQA, linear attention, and MoE expert weights, with FP8 quantization support. Under PD disaggregation, it returns separate `(total_params, prefill_params, decode_params)` based on `prefill_world_size` / `decode_world_size`.
+- **`MemoryPlanner`** (`vidur/scheduler/utils/memory_planner.py`) — Plans the GPU memory budget: `available = GPU_mem * (1 - margin) - param_mem`, then computes KV cache capacity and maximum concurrent requests (see the worked example below). Includes OOM detection with actionable suggestions.
+- **Per-request KV cache tracking** (`vidur/entities/replica.py`) — Allocates and releases KV cache memory on a per-request basis, enabling accurate remaining-capacity queries at runtime.
+
+### References & Acknowledgments
+
+The GPU memory calculation module was developed with reference to the following works:
+
+- [InferSim](https://github.com/alibaba/InferSim) — Parameter counting and KV cache estimation methodology
+- [DeepSeek V3 Parameter Size Analysis](https://yangwenbo.com/articles/deepseek-v3-parameter-size.html) — DeepSeek V3 MLA parameter derivation
+- [DeepSeek V3 Parameter Derivation (Chinese)](https://zhuanlan.zhihu.com/p/21455638257) — Detailed MLA weight decomposition
+
+We gratefully acknowledge these resources for providing the foundational analysis that guided our implementation.
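+
+### Worked Example (Illustrative)
+
+To make the budgeting arithmetic above concrete, the sketch below restates the `available = GPU_mem * (1 - margin) - param_mem` rule and the per-token KV-cache sizes from the table above in plain Python. It is a simplified illustration only: the function names, signatures, and all numeric inputs (GPU size, safety margin, per-GPU parameter memory, request length) are assumptions for demonstration, not the actual `ParamCounter` / `MemoryPlanner` APIs.
+
+```python
+# Illustrative sketch of the memory-planning arithmetic; names and numbers
+# are assumptions for demonstration, not the real vidur implementation.
+
+def kv_bytes_per_token(arch, num_layers, num_kv_heads=0, head_dim=0,
+                       kv_lora_rank=0, qk_rope_head_dim=0, bytes_per_elem=1):
+    """Per-token KV-cache bytes across all layers (FP8 => 1 byte/element)."""
+    if arch == "mla":    # DeepSeek-V3 style: compressed latent + RoPE component
+        per_layer = kv_lora_rank + qk_rope_head_dim
+    elif arch == "gqa":  # Qwen3-MoE style: K and V, num_kv_heads * head_dim each
+        per_layer = 2 * num_kv_heads * head_dim
+    else:
+        raise ValueError(f"unknown attention architecture: {arch}")
+    return num_layers * per_layer * bytes_per_elem
+
+def max_concurrent_requests(gpu_mem_gb, margin, param_mem_gb, kv_bytes_per_request):
+    """available = GPU_mem * (1 - margin) - param_mem, divided by per-request KV."""
+    available_gb = gpu_mem_gb * (1 - margin) - param_mem_gb
+    if available_gb <= 0:
+        raise MemoryError("OOM: parameters alone exceed the usable memory budget")
+    return int(available_gb * 2**30 // kv_bytes_per_request)
+
+# MLA example with DeepSeek-V3-like shapes (61 layers, kv_lora_rank=512,
+# qk_rope_head_dim=64) and a 4096-token request. The 96 GB device, 10% margin,
+# and 40 GB of per-GPU parameter memory are made-up inputs.
+kv_per_request = 4096 * kv_bytes_per_token("mla", num_layers=61,
+                                           kv_lora_rank=512, qk_rope_head_dim=64)
+print(max_concurrent_requests(96, 0.10, 40, kv_per_request))  # -> 346
+```
+
+Swapping in the `gqa` branch with Qwen3-MoE-style `num_kv_heads` / `head_dim` values yields the corresponding estimate for standard KV caches; the hybrid Qwen3-Next case would apply the full-attention formula only to the subset of layers that use full attention.
+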
--- ## Supported Models -+ **DeepSeek-V3-671B** (SimAI PP/EP communication、GPU memory allocation module adaptations in progress) -+ **Qwen3-Moe-235B**, **Qwen3-Next-80B** (SimAI PP/EP communication、GPU memory allocation module adaptations in progress) -+ **meta-llama/Meta-Llama-3-8B** / **Meta-Llama-3-70B** -+ **meta-llama/Llama-2-7b-hf** / **Llama-2-70b-hf** -+ **codellama/CodeLlama-34b-Instruct-hf** -+ **internlm/internlm-20b** -+ **Qwen/Qwen-72B** + +- **DeepSeek-V3-671B** (SimAI PP communication module in progress; EP auto-set to world_size; GPU memory management supported) +- **Qwen3-MoE-235B**, **Qwen3-Next-80B** (SimAI PP communication module in progress; EP auto-set to world_size; GPU memory management supported) +- **meta-llama/Meta-Llama-3-8B** / **Meta-Llama-3-70B** +- **meta-llama/Llama-2-7b-hf** / **Llama-2-70b-hf** +- **codellama/CodeLlama-34b-Instruct-hf** +- **internlm/internlm-20b** +- **Qwen/Qwen-72B** --- ## 📦 Environment Setup + ### 1. Create Conda Environment + ```bash conda env create -p ./env -f ./environment.yml ``` ### 2. (Optional) Update Dev Dependencies + ```bash conda env update -f environment-dev.yml ``` ### 3. Activate Environment + ```bash conda activate vidur ``` ### 4. Install Python Dependencies (Using Alibaba Cloud PyPI Mirror) + ```bash pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ pip install -r requirements-dev.txt -i https://mirrors.aliyun.com/pypi/simple/ @@ -66,13 +127,46 @@ pip install -r requirements-dev.txt -i https://mirrors.aliyun.com/pypi/simple/ --- -## ▶️ Running Example -### Run DeepSeek-671B **with** AICB -**Requirements: **SimAI and AICB Docker environment (see [README](../README.md) for setup instructions). +### 5. Data Preparation + +The examples below use trace files from `data/processed_traces/`. These files are provided by the upstream [microsoft/vidur](https://github.com/microsoft/vidur) project. + +**Option A**: Clone upstream vidur and copy the trace files: + +```bash +git clone https://github.com/microsoft/vidur.git /tmp/vidur +cp -r /tmp/vidur/data/processed_traces ./data/ +``` + +**Option B**: If you already have the vidur data locally: + +```bash +cp -r /path/to/vidur/data/processed_traces ./data/ +``` + +After preparation, your directory structure should look like: + +``` +data/ +├── processed_traces/ +│ ├── splitwise_conv.csv +│ ├── splitwise_code.csv +│ └── arxiv_summarization_stats_llama2_tokenizer_filtered_v2.csv +└── hf_configs/ # Already included in this repo +``` + +--- + +## ▶️ Running Examples + +### Run DeepSeek-671B with AICB -After setting up the environment, run the following commands: +**Requirements:** SimAI and AICB Docker environment (see [README](../README.md) for setup instructions). 
+ +After setting up the environment, run the following commands: + +#### DeepSeek-671B with AICB (Fixed Length Generator) -#### Run DeepSeek-671B **with** AICB (Fixed Length Generator) ```bash cd SimAI/vidur-alibabacloud @@ -93,11 +187,11 @@ python -m vidur.main --replica_config_pd_p2p_comm_bandwidth 800 \ --replica_config_model_name deepseek-671B \ --replica_config_tensor_parallel_size 2 \ --replica_config_num_pipeline_stages 1 \ - --replica_config_expert_model_parallel_size 8 \ - --random_forrest_execution_time_predictor_config_backend aicb + --random_forrest_execution_time_predictor_config_backend aicb ``` -#### Run DeepSeek-671B **with** AICB (Trace Length Generator) +#### DeepSeek-671B with AICB (Trace Length Generator) + ```bash cd SimAI/vidur-alibabacloud @@ -119,16 +213,13 @@ python -m vidur.main \ --replica_config_model_name deepseek-671B \ --replica_config_tensor_parallel_size 2 \ --replica_config_num_pipeline_stages 1 \ - --replica_config_expert_model_parallel_size 8 \ --random_forrest_execution_time_predictor_config_backend aicb ``` > ✅ Full parameter descriptions are available via `python -m vidur.main -h`. -> - +### Run Llama-3-8B with SimAI Simulation -### Run Llama-3-8B **with** simai_simulation ```bash cd SimAI @@ -136,8 +227,8 @@ cd SimAI ./scripts/build.sh -c ns3 # Create network topo (Spectrum-X_128g_8gps_100Gbps_A100) -python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps - +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps cd SimAI/vidur-alibabacloud @@ -159,22 +250,19 @@ python -m vidur.main \ --replica_config_model_name meta-llama/Meta-Llama-3-8B \ --replica_config_tensor_parallel_size 4 \ --replica_config_num_pipeline_stages 1 \ - --replica_config_expert_model_parallel_size 1 \ --random_forrest_execution_time_predictor_config_backend simai_simulation \ --random_forrest_execution_time_predictor_config_simai_dir ../ \ --random_forrest_execution_time_predictor_config_simai_simulation_topo ../Spectrum-X_128g_8gps_100Gbps_A100 \ - --random_forrest_execution_time_predictor_config_simai_simulation_config ../astra-sim-alibabacloud/inputs/config/SimAI.conf + --random_forrest_execution_time_predictor_config_simai_simulation_config ../astra-sim-alibabacloud/inputs/config/SimAI.conf ``` -> -> +### Run Llama-3-8B with SimAI Analytical -### Run Llama-3-8B **with** simai_analytical ```bash cd SimAI # Compile SimAI-Analytical -$ ./scripts/build.sh -c analytical +./scripts/build.sh -c analytical cd SimAI/vidur-alibabacloud @@ -196,14 +284,11 @@ python -m vidur.main \ --replica_config_model_name meta-llama/Meta-Llama-3-8B \ --replica_config_tensor_parallel_size 4 \ --replica_config_num_pipeline_stages 1 \ - --replica_config_expert_model_parallel_size 1 \ --random_forrest_execution_time_predictor_config_backend simai_analytical ``` -> -> +### Run Llama-3-8B with Native Vidur [original] -### Run Llama-3-8B **with** native Vidur [original] ```bash cd SimAI/vidur-alibabacloud @@ -225,129 +310,342 @@ python -m vidur.main \ --replica_config_model_name meta-llama/Meta-Llama-3-8B \ --replica_config_tensor_parallel_size 4 \ --replica_config_num_pipeline_stages 1 \ - --replica_config_expert_model_parallel_size 1 \ --random_forrest_execution_time_predictor_config_backend vidur ``` -> -> +### Run 4-Scenario Suite + +For a quick validation of all supported configurations, use the bundled test script: + +```bash +bash 
examples/vidur-ali-scenarios/run_scenarios.sh --all +``` + +See `bash examples/vidur-ali-scenarios/run_scenarios.sh --help` for details. +#### 4-Scenario Configuration +The following scenarios are pre-configured in `run_scenarios.sh`. All scenarios share the hardware configuration below. + +**Shared Hardware Configuration:** +- GPU: H20 (h20_dgx), NVLink: 1600 Gbps, RDMA: 800 Gbps +- PD P2P bandwidth: 800 Gbps, dtype: fp8 +- Request: Poisson QPS=100, 4 requests, fixed prefill=100 / decode=8 tokens + +| Scenario | Model | PD Disaggregation | World Size | TP | PP | EP | Global Scheduler | +|----------|-------|---------------|------------|----|----|------------|------------------| +| 1 | Qwen3-Next-80B (MoE) | No | 32 (dp=32) | 1 | 1 | auto (=world_size) | lor | +| 2 | Qwen3-Next-80B (MoE) | Yes (P=2, D=6) | 8 | 1 | 1 | auto (=world_size) | split_wise | +| 3 | DeepSeek-671B (MoE) | Yes (P=2, D=6) | 8 | 8 | 1 | auto (=world_size) | split_wise | +| 4 | Qwen3-MoE-235B (MoE) | Yes (P=2, D=6) | 8 | 4 | 1 | auto (=world_size) | split_wise | + +> **Note:** All four models use Mixture-of-Experts (MoE) architecture. EP is automatically set to the cluster world_size at runtime and cannot be manually overridden. + +#### Usage + +```bash +# Activate environment +conda activate vidur + +# Run a single scenario (1~4) +bash examples/vidur-ali-scenarios/run_scenarios.sh --scenario 1 + +# Run all scenarios sequentially +bash examples/vidur-ali-scenarios/run_scenarios.sh --all + +# Show help +bash examples/vidur-ali-scenarios/run_scenarios.sh --help +``` + +#### Manual Commands (Per Scenario) + +**Scenario 1: Qwen3-Next-80B without PD Disaggregation (ws=32, lor)** + +```bash +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype fp8 \ + --replica_config_network_device h20_dgx \ + --replica_config_device h20 \ + --request_generator_config_type synthetic \ + --interval_generator_config_type poisson \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 4 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 100 \ + --fixed_request_length_generator_config_decode_tokens 8 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --random_forrest_execution_time_predictor_config_backend aicb \ + --metrics_config_output_dir examples/vidur-ali-scenarios/simulator_output \ + --cluster_config_num_replicas 32 \ + --replica_config_pd_node_ratio 1 \ + --global_scheduler_config_type lor \ + --replica_scheduler_config_type sarathi \ + --replica_config_model_name qwen3-next-80B \ + --replica_config_tensor_parallel_size 1 \ + --replica_config_num_pipeline_stages 1 +``` + +**Scenario 2: Qwen3-Next-80B with PD Disaggregation (P=2, D=6, split_wise)** + +```bash +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype fp8 \ + --replica_config_network_device h20_dgx \ + --replica_config_device h20 \ + --request_generator_config_type synthetic \ + --interval_generator_config_type poisson \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 4 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 100 \ + 
--fixed_request_length_generator_config_decode_tokens 8 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --random_forrest_execution_time_predictor_config_backend aicb \ + --metrics_config_output_dir examples/vidur-ali-scenarios/simulator_output \ + --cluster_config_num_replicas 8 \ + --replica_config_pd_node_ratio 0.25 \ + --replica_config_num_prefill_replicas 2 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name qwen3-next-80B \ + --replica_config_tensor_parallel_size 1 \ + --replica_config_num_pipeline_stages 1 \ + --replica_config_prefill_tensor_parallel_size 1 \ + --replica_config_prefill_num_pipeline_stages 1 \ + --replica_config_decode_tensor_parallel_size 1 \ + --replica_config_decode_num_pipeline_stages 1 +``` + +**Scenario 3: DeepSeek-671B with PD Disaggregation (tp=8, EP=auto, split_wise)** + +```bash +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype fp8 \ + --replica_config_network_device h20_dgx \ + --replica_config_device h20 \ + --request_generator_config_type synthetic \ + --interval_generator_config_type poisson \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 4 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 100 \ + --fixed_request_length_generator_config_decode_tokens 8 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --random_forrest_execution_time_predictor_config_backend aicb \ + --metrics_config_output_dir examples/vidur-ali-scenarios/simulator_output \ + --cluster_config_num_replicas 8 \ + --replica_config_pd_node_ratio 0.25 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name deepseek-671B \ + --replica_config_tensor_parallel_size 8 \ + --replica_config_num_pipeline_stages 1 +``` + +**Scenario 4: Qwen3-MoE-235B with PD Disaggregation (tp=4, EP=auto, split_wise)** + +```bash +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype fp8 \ + --replica_config_network_device h20_dgx \ + --replica_config_device h20 \ + --request_generator_config_type synthetic \ + --interval_generator_config_type poisson \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 4 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 100 \ + --fixed_request_length_generator_config_decode_tokens 8 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --random_forrest_execution_time_predictor_config_backend aicb \ + --metrics_config_output_dir examples/vidur-ali-scenarios/simulator_output \ + --cluster_config_num_replicas 8 \ + --replica_config_pd_node_ratio 0.25 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name qwen3-moe-235B \ + --replica_config_tensor_parallel_size 4 \ + --replica_config_num_pipeline_stages 1 +``` + +#### Output Files + +**Output path depends on how you run the simulation:** + +- **`run_scenarios.sh`** --- outputs to 
`examples/vidur-ali-scenarios/simulator_output/` +- **Direct `python -m vidur.main`** --- outputs to `./simulator_output/` (or the path specified by `--metrics_config_output_dir`) + +Each run produces the following directory: + +``` ++ 中文  |  English +
+ +# Vidur-AlibabaCloud + +[](https://www.python.org/downloads/) +[](LICENSE) + +Vidur([原版](https://github.com/microsoft/vidur))是一个大语言模型(LLM)推理系统的模拟框架。 +**Vidur-AlibabaCloud**(本仓库)是针对阿里云 **SimAI** 场景优化的定制版本。支持 **Prefill–Decode(PD)分离**等高级特性,并针对 **DeepSeek-V3-671B**、**Qwen3-MoE-235B**、**Qwen3-Next-80B** 等 SOTA 大模型进行了专门适配。 + +--- + +## 目录 + +- [主要特性](#主要特性) +- [GPU 显存计算模块](#gpu-显存计算模块) +- [支持的模型](#支持的模型) +- [📦 环境配置](#-环境配置) +- [▶️ 运行示例](#️-运行示例) + - [四场景配置说明](#四场景配置说明) + - [输出文件说明](#输出文件说明) +- [🔧 关键输入参数参考](#-关键输入参数参考) +- [📊 输出结果解读](#-输出结果解读) +- [⚠️ 已知问题](#️-已知问题) +- [📚 帮助](#-帮助) + +--- + +## 主要特性 + +- **Prefill–Decode(PD)分离** — 支持 prefill 和 decode 阶段在不同节点运行,实现弹性资源分配和性能隔离。 + (参考 [splitwise-sim](https://github.com/Mutinifni/splitwise-sim)) +- **灵活的并行策略** — 支持: + - **数据并行(DP)** + - **张量并行(TP)** + - **流水线并行(PP)** + - **专家并行(EP)**(自动设为 cluster world_size,不支持手动指定) + + 同时支持 **Dense** 模型和 **混合专家(MoE)** 模型(MoE 适配中)。 +- **多种执行时间预测后端** — 可选: + - **AICB/AIOB** — 部分支持 DeepSeek-V3-671B、Qwen3-MoE-235B、Qwen3-Next-80B 的计算核与 TP、DP、PP、EP 通信量建模 + - **SimAI 仿真(Simulation)** — 基于 SimAI NS-3 的网络通信全栈仿真(支持 TP) + - **SimAI 解析(Analytical)** — SimAI 解析性能模型(支持 TP) + - **原版 Vidur [original]** — 支持 TP、DP、PP +- **负载生成与回放** — 支持真实 trace 回放,或使用固定/泊松分布生成合成请求。 +- **细粒度指标** — 记录: + - TTFT — 首 token 时延 + - TBT / TPOT — 相邻 token 时延 / 每输出 token 耗时 + - 端到端延迟 + - 通信开销 + - 计算开销 + - 调度延迟 + +--- + +## GPU 显存计算模块 + +本模块为现代 MoE(混合专家)模型的推理仿真提供精确的 GPU 显存估算,涵盖**模型参数显存**、**KV Cache 显存**以及 Prefill–Decode(PD)分离架构下的**最大批处理量**计算。 + +### 支持的注意力架构 + +| 架构 | 模型 | 说明 | +|---|---|---| +| **MLA**(多头潜在注意力) | DeepSeek-V3-671B | 使用 LoRA 压缩的 KV Cache(`kv_lora_rank` + `qk_rope_head_dim`),显著降低显存占用 | +| **MHA / GQA**(多头 / 分组查询注意力) | Qwen3-MoE-235B | 标准 KV Cache,每 token 每层使用 `num_kv_heads * head_dim` | +| **混合全注意力 + 线性注意力** | Qwen3-Next-80B | 每 4 层交替使用全注意力和线性(GDN)注意力 | + +### 核心组件 + +- **`ParamCounter`**(`vidur/utils/param_counter.py`)— 计算每层和每设备的参数量,支持 MLA、MHA/GQA、线性注意力和 MoE 专家权重,支持 FP8 量化。在 PD 分离架构下,根据 `prefill_world_size` / `decode_world_size` 分别返回 `(total_params, prefill_params, decode_params)` 三元组。 +- **`MemoryPlanner`**(`vidur/scheduler/utils/memory_planner.py`)— 规划 GPU 显存预算:`available = GPU_mem * (1 - margin) - param_mem`,计算 KV Cache 容量和最大并发请求数,包含 OOM 检测与建议输出。 +- **逐请求 KV Cache 追踪**(`vidur/entities/replica.py`)— 按请求粒度分配和释放 KV Cache 显存,支持运行时精确查询剩余容量。 + +### 参考与致谢 + +本 GPU 显存计算模块的开发参考了以下工作: + +- [InferSim](https://github.com/alibaba/InferSim) — 参数量计算与 KV Cache 估算方法论 +- [DeepSeek V3 Parameter Size Analysis](https://yangwenbo.com/articles/deepseek-v3-parameter-size.html) — DeepSeek V3 MLA 参数推导 +- [DeepSeek V3 参数推导详解](https://zhuanlan.zhihu.com/p/21455638257) — MLA 权重分解详细分析 + +衷心感谢以上资源为我们的实现提供了基础性的分析与指导。 + +--- + +## 支持的模型 + +- **DeepSeek-V3-671B**(SimAI PP 通信模块适配中;EP 自动设为 world_size;GPU 显存管理已支持) +- **Qwen3-MoE-235B**、**Qwen3-Next-80B**(SimAI PP 通信模块适配中;EP 自动设为 world_size;GPU 显存管理已支持) +- **meta-llama/Meta-Llama-3-8B** / **Meta-Llama-3-70B** +- **meta-llama/Llama-2-7b-hf** / **Llama-2-70b-hf** +- **codellama/CodeLlama-34b-Instruct-hf** +- **internlm/internlm-20b** +- **Qwen/Qwen-72B** + +--- + +## 📦 环境配置 + +### 1. 创建 Conda 环境 + +```bash +conda env create -p ./env -f ./environment.yml +``` + +### 2.(可选)更新开发依赖 + +```bash +conda env update -f environment-dev.yml +``` + +### 3. 激活环境 + +```bash +conda activate vidur +``` + +### 4. 安装 Python 依赖(使用阿里云 PyPI 镜像) + +```bash +pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ +pip install -r requirements-dev.txt -i https://mirrors.aliyun.com/pypi/simple/ +``` + +--- + +### 5. 
数据准备 + +下面的示例使用 `data/processed_traces/` 中的 trace 文件。这些文件来自上游 [microsoft/vidur](https://github.com/microsoft/vidur) 项目。 + +**方式一**:从上游 vidur 克隆并拷贝 trace 文件: + +```bash +git clone https://github.com/microsoft/vidur.git /tmp/vidur +cp -r /tmp/vidur/data/processed_traces ./data/ +``` + +**方式二**:如果本地已有 vidur 数据: + +```bash +cp -r /path/to/vidur/data/processed_traces ./data/ +``` + +准备完成后,目录结构应如下: + +``` +data/ +├── processed_traces/ +│ ├── splitwise_conv.csv +│ ├── splitwise_code.csv +│ └── arxiv_summarization_stats_llama2_tokenizer_filtered_v2.csv +└── hf_configs/ # 本仓库已包含 +``` + +--- + +## ▶️ 运行示例 + +### 使用 AICB 运行 DeepSeek-671B + +**前置条件:** 需要 SimAI 和 AICB Docker 环境(参见 [README](../README.md) 了解搭建方法)。 + +完成环境配置后,运行以下命令: + +#### DeepSeek-671B + AICB(固定长度生成器) + +```bash +cd SimAI/vidur-alibabacloud + +python -m vidur.main --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype float32 \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 5 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 1024 \ + --fixed_request_length_generator_config_decode_tokens 10 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --cluster_config_num_replicas 4 \ + --replica_config_pd_node_ratio 0.5 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name deepseek-671B \ + --replica_config_tensor_parallel_size 2 \ + --replica_config_num_pipeline_stages 1 \ + --random_forrest_execution_time_predictor_config_backend aicb +``` + +#### DeepSeek-671B + AICB(Trace 长度生成器) + +```bash +cd SimAI/vidur-alibabacloud + +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype float32 \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 10 \ + --length_generator_config_type trace \ + --trace_request_length_generator_config_max_tokens 1024 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --interval_generator_config_type poisson \ + --cluster_config_num_replicas 4 \ + --replica_config_pd_node_ratio 0.5 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name deepseek-671B \ + --replica_config_tensor_parallel_size 2 \ + --replica_config_num_pipeline_stages 1 \ + --random_forrest_execution_time_predictor_config_backend aicb +``` + +> ✅ 完整参数说明可通过 `python -m vidur.main -h` 查看。 + +### 使用 SimAI 仿真运行 Llama-3-8B + +```bash +cd SimAI + +# 编译 SimAI-Simulation(ns3) +./scripts/build.sh -c ns3 + +# 生成网络拓扑(Spectrum-X_128g_8gps_100Gbps_A100) +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps + +cd SimAI/vidur-alibabacloud + +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype float32 \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 10 \ + --length_generator_config_type trace \ + --trace_request_length_generator_config_max_tokens 2048 \ + 
--trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --interval_generator_config_type poisson \ + --cluster_config_num_replicas 4 \ + --replica_config_pd_node_ratio 0.5 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name meta-llama/Meta-Llama-3-8B \ + --replica_config_tensor_parallel_size 4 \ + --replica_config_num_pipeline_stages 1 \ + --random_forrest_execution_time_predictor_config_backend simai_simulation \ + --random_forrest_execution_time_predictor_config_simai_dir ../ \ + --random_forrest_execution_time_predictor_config_simai_simulation_topo ../Spectrum-X_128g_8gps_100Gbps_A100 \ + --random_forrest_execution_time_predictor_config_simai_simulation_config ../astra-sim-alibabacloud/inputs/config/SimAI.conf +``` + +### 使用 SimAI 解析模型运行 Llama-3-8B + +```bash +cd SimAI + +# 编译 SimAI-Analytical +./scripts/build.sh -c analytical + +cd SimAI/vidur-alibabacloud + +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype float32 \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 10 \ + --length_generator_config_type trace \ + --trace_request_length_generator_config_max_tokens 2048 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --interval_generator_config_type poisson \ + --cluster_config_num_replicas 4 \ + --replica_config_pd_node_ratio 0.5 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name meta-llama/Meta-Llama-3-8B \ + --replica_config_tensor_parallel_size 4 \ + --replica_config_num_pipeline_stages 1 \ + --random_forrest_execution_time_predictor_config_backend simai_analytical +``` + +### 使用原版 Vidur 运行 Llama-3-8B + +```bash +cd SimAI/vidur-alibabacloud + +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype float32 \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 10 \ + --length_generator_config_type trace \ + --trace_request_length_generator_config_max_tokens 2048 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --interval_generator_config_type poisson \ + --cluster_config_num_replicas 4 \ + --replica_config_pd_node_ratio 0.5 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name meta-llama/Meta-Llama-3-8B \ + --replica_config_tensor_parallel_size 4 \ + --replica_config_num_pipeline_stages 1 \ + --random_forrest_execution_time_predictor_config_backend vidur +``` + +### 运行四场景套件 + +使用内置脚本快速验证所有支持的配置: + +```bash +bash examples/vidur-ali-scenarios/run_scenarios.sh --all +``` + +详细信息请运行 `bash examples/vidur-ali-scenarios/run_scenarios.sh --help`。 + +#### 四场景配置说明 + +以下场景已在 `run_scenarios.sh` 中预配置,所有场景共享下方硬件配置。 + +**共用硬件配置:** +- GPU:H20(h20_dgx),NVLink:1600 Gbps,RDMA:800 Gbps +- PD P2P 带宽:800 Gbps,数据类型:fp8 +- 请求生成:Poisson QPS=100,4 requests,固定 prefill=100 / decode=8 tokens + +| 场景 | 模型 | PD 分离 | World Size | TP | PP | EP | 全局调度器 | +|------|------|---------|------------|----|----|------------|------------| +| 1 | Qwen3-Next-80B (MoE) | 无 | 32 (dp=32) | 1 
| 1 | auto (=world_size) | lor | +| 2 | Qwen3-Next-80B (MoE) | 是(P=2, D=6) | 8 | 1 | 1 | auto (=world_size) | split_wise | +| 3 | DeepSeek-671B (MoE) | 是(P=2, D=6) | 8 | 8 | 1 | auto (=world_size) | split_wise | +| 4 | Qwen3-MoE-235B (MoE) | 是(P=2, D=6) | 8 | 4 | 1 | auto (=world_size) | split_wise | + +> **说明:** 四个模型均使用混合专家(MoE)架构。EP 在运行时自动设为 cluster world_size,不支持手动指定。 + +#### run_scenarios.sh 使用方法 + +```bash +# 激活环境 +conda activate vidur + +# 运行单个场景(1~4) +bash examples/vidur-ali-scenarios/run_scenarios.sh --scenario 1 + +# 顺序运行所有场景 +bash examples/vidur-ali-scenarios/run_scenarios.sh --all + +# 查看帮助 +bash examples/vidur-ali-scenarios/run_scenarios.sh --help +``` + +#### 手动运行命令(逐场景) + +以下为四个场景的完整 CLI 命令,可直接复制运行。所有命令均在 `vidur-alibabacloud/` 目录下执行。 + +**场景 1:Qwen3-Next-80B 无PD分离(ws=32, lor)** + +```bash +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype fp8 \ + --replica_config_network_device h20_dgx \ + --replica_config_device h20 \ + --request_generator_config_type synthetic \ + --interval_generator_config_type poisson \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 4 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 100 \ + --fixed_request_length_generator_config_decode_tokens 8 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --random_forrest_execution_time_predictor_config_backend aicb \ + --metrics_config_output_dir examples/vidur-ali-scenarios/simulator_output \ + --cluster_config_num_replicas 32 \ + --replica_config_pd_node_ratio 1 \ + --global_scheduler_config_type lor \ + --replica_scheduler_config_type sarathi \ + --replica_config_model_name qwen3-next-80B \ + --replica_config_tensor_parallel_size 1 \ + --replica_config_num_pipeline_stages 1 +``` + +**场景 2:Qwen3-Next-80B PD分离(P=2, D=6, split_wise)** + +```bash +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype fp8 \ + --replica_config_network_device h20_dgx \ + --replica_config_device h20 \ + --request_generator_config_type synthetic \ + --interval_generator_config_type poisson \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 4 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 100 \ + --fixed_request_length_generator_config_decode_tokens 8 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --random_forrest_execution_time_predictor_config_backend aicb \ + --metrics_config_output_dir examples/vidur-ali-scenarios/simulator_output \ + --cluster_config_num_replicas 8 \ + --replica_config_pd_node_ratio 0.25 \ + --replica_config_num_prefill_replicas 2 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name qwen3-next-80B \ + --replica_config_tensor_parallel_size 1 \ + --replica_config_num_pipeline_stages 1 \ + --replica_config_prefill_tensor_parallel_size 1 \ + --replica_config_prefill_num_pipeline_stages 1 \ + --replica_config_decode_tensor_parallel_size 1 \ + --replica_config_decode_num_pipeline_stages 1 +``` + +**场景 3:DeepSeek-671B PD分离(tp=8, EP=auto, split_wise)** + 
+```bash +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype fp8 \ + --replica_config_network_device h20_dgx \ + --replica_config_device h20 \ + --request_generator_config_type synthetic \ + --interval_generator_config_type poisson \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 4 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 100 \ + --fixed_request_length_generator_config_decode_tokens 8 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --random_forrest_execution_time_predictor_config_backend aicb \ + --metrics_config_output_dir examples/vidur-ali-scenarios/simulator_output \ + --cluster_config_num_replicas 8 \ + --replica_config_pd_node_ratio 0.25 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name deepseek-671B \ + --replica_config_tensor_parallel_size 8 \ + --replica_config_num_pipeline_stages 1 +``` + +**场景 4:Qwen3-MoE-235B PD分离(tp=4, EP=auto, split_wise)** + +```bash +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype fp8 \ + --replica_config_network_device h20_dgx \ + --replica_config_device h20 \ + --request_generator_config_type synthetic \ + --interval_generator_config_type poisson \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 4 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 100 \ + --fixed_request_length_generator_config_decode_tokens 8 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --random_forrest_execution_time_predictor_config_backend aicb \ + --metrics_config_output_dir examples/vidur-ali-scenarios/simulator_output \ + --cluster_config_num_replicas 8 \ + --replica_config_pd_node_ratio 0.25 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name qwen3-moe-235B \ + --replica_config_tensor_parallel_size 4 \ + --replica_config_num_pipeline_stages 1 +``` + +#### 输出文件说明 + +**输出路径取决于运行方式:** + +- **`run_scenarios.sh`** --- 输出到 `examples/vidur-ali-scenarios/simulator_output/` +- **直接 `python -m vidur.main`** --- 输出到 `./simulator_output/`(或通过 `--metrics_config_output_dir` 指定的路径) + +每次运行产生如下目录: + +``` +