From 19aa92b1f0ae2d020ba0c48f29a410b3813931f1 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Mon, 17 Mar 2025 23:31:08 -0700 Subject: [PATCH 1/6] [Blog]: DeepSeek R1 inference performance: MI300X vs. H200 --- .../posts/h200-mi300x-deepskeek-benchmark.md | 226 ++++++++++++++++++ 1 file changed, 226 insertions(+) create mode 100644 docs/blog/posts/h200-mi300x-deepskeek-benchmark.md diff --git a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md new file mode 100644 index 000000000..4d1ed5c5e --- /dev/null +++ b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md @@ -0,0 +1,226 @@ +--- +title: "DeepSeek R1 inference performance: MI300X vs. H200" +date: 2025-03-18 +description: "TBA" +slug: h200-mi300x-deepskeek-benchmark +image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/h200-mi300x-deepskeek-benchmark-v2.png?raw=true +categories: + - Benchmarks + - AMD + - NVIDIA +--- + +# DeepSeek R1 inference performance: MI300X vs. H200 + +DeepSeek-R1, with its innovative architecture combining Multi-head Latent Attention (MLA) and DeepSeekMoE, presents +unique challenges for inference workloads. As a reasoning-focused model, it generates intermediate chain-of-thought +outputs, placing significant demands on memory capacity and bandwidth. + +In this benchmark, we evaluate the performance of three inference backends—SGLang, vLLM, and TensorRT-LLM—on two hardware +configurations: 8x NVIDIA H200 and 8x AMD MI300X. Our goal is to compare throughput, latency, and overall efficiency to +determine the optimal backend and hardware pairing for DeepSeek-R1's demanding requirements. + + + +This benchmark was made possible through the generous support of our partners at +[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} and +[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"}, +who provided access to the necessary hardware. + + + +## Benchmark setup + +### Hardware configurations + +1. AMD 8xMI300x + * 2x Intel Xeon Platinum 8468, 48C/96T, 16GT/s, 105M Cache (350W) + * 8x AMD MI300x GPU, 192GB, 750W + * 32x 64GB DDR5, 4800MT/s +2. NVIDIA 8xH200 SXM5 + * 2x Intel Xeon Platinum 8570, 56C/112T, 20GT/s, 300M Cache (350W) + * 8x NVIDIA H200 SXM5 GPU, 141GB, 700W + * 32x 64GB DDR5, 5600MT/s + +### Benchmark methodology + +#### Online inference + +We utilized SGLang's [`Deepseek-R1/bench_serving.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/main/Deepseek-R1/bench_serving.py){:target="_blank"} +script, modified to incorporate TensorRT-LLM. Tests were conducted across multiple request concurrencies and output token lengths, with input token length fixed at 3200. + +| Request Concurrencies | Output Token Lengths | Prefix-Cached | +|------------------------|----------------------|----------------| +| 4,8,16,...,128 | 800 | No | +| 128 | 1600, 3200, 6400 | No | +| 128 | 800 | Yes | + +#### Offline Inference + +For offline inference, we used vLLM’s [`benchmark_throughput.py` :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py){:target="_blank"}, +modified for SGLang. TensorRT-LLM was tested using a custom +[`benchmark_throughput_trt.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/blob/deepseek-r1-benchmark/Deepseek-R1/benchmark_throughput_trt.py){:target="_blank"}. 
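Conceptually, the offline numbers reported later come down to one measurement: generate completions for a fixed batch of prompts and divide the total number of generated tokens by the wall-clock time. The sketch below illustrates only that calculation and is not the code of any of the scripts above; `generate_batch` is a hypothetical stand-in for the backend-specific batched generation call (vLLM, SGLang, or TensorRT-LLM).

```python
import time


def measure_offline_throughput(generate_batch, prompts, output_len):
    """Offline throughput = generated tokens / wall-clock time (illustrative only)."""
    start = time.perf_counter()
    completions = generate_batch(prompts, max_tokens=output_len)
    elapsed = time.perf_counter() - start
    generated_tokens = sum(len(c) for c in completions)  # token IDs per completion
    return generated_tokens / elapsed  # tokens/s


if __name__ == "__main__":
    # Dummy backend that "generates" max_tokens token IDs per prompt,
    # so the sketch runs end to end without a real inference server.
    def dummy_generate_batch(prompts, max_tokens):
        time.sleep(0.01 * len(prompts))
        return [[0] * max_tokens for _ in prompts]

    batch = ["example prompt"] * 32
    print(f"{measure_offline_throughput(dummy_generate_batch, batch, output_len=800):.0f} tokens/s")
```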
+The benchmark examined performance across various batch sizes and output token lengths. + +| Batch Sizes | Output Token Lengths | +|--------------------|----------------------| +| 32,64,128,...,1024 | 800 | +| 256, 512, 1024 | 1600 | +| 256, 512, 1024 | 3200 | + +## Key observations + +### Throughput and End-to-End Latency + +**NVIDIA H200 performance** + +* TensorRT-LLM outperformed both vLLM and SGLang, achieving the highest online throughput of 4176 tokens/s on H200. +* At concurrencies below 128, vLLM led in online throughput and end-to-end latency. +* In offline scenarios, H200 achieved the highest overall throughput of 6311 tokens/s with SGLang. + + + +**AMD MI300X performance** + +* vLLM outperformed SGLang in both online and offline throughput and end-to-end latency. +* MI300X with vLLM achieved the highest overall throughput of 4574 tokens/s in online scenarios. +* At request concurrencies below 32, SGLang outperformed vLLM in online throughput and latency. + + + +While MI300X's larger memory capacity and higher bandwidth should theoretically enable higher throughput at larger batch +sizes, the results suggest that inference backends for MI300X may require further optimization to fully leverage its +architectural advantages. + +### Throughput and Latency vs. Output Token Length + +**NVIDIA H200 performance** + +* SGLang delivered slightly higher throughput and better latency as output token length increased in online scenarios. +* In offline scenarios, SGLang with H200 outperformed MI300X as output token length increased. + +=== "Throughput" + + +=== "Latency" + + +**AMD MI300X performance** + +vLLM maintained the lead in both online and offline scenarios as output token length increased. + +=== "Throughput" + + +=== "Latency" + + +### Time to First Token (TTFT) + +**NVIDIA H200 performance** + +TensorRT-LLM maintained the lowest and most consistent TTFT up to concurrency 64. + + + +**AMD MI300X performance** + +vLLM achieved the lowest TTFT at concurrency 128. Below 128, vLLM and SGLang had similar TTFT. + +TTFT, being compute-intensive, highlights H200's advantage, aligning with [SemiAnalysis’s MI300X vs. H200 TFLOPS benchmark :material-arrow-top-right-thin:{ .external }](https://semianalysis.com/2024/12/22/mi300x-vs-h100-vs-h200-benchmark-part-1-training/){:target="_blank"}. +However, at 128 concurrent requests, MI300X's memory capacity and bandwidth advantages become evident. + +### Time Per Output Token (TPOT) + +**NVIDIA H200 performance** + +vLLM maintained the lowest TPOT across all request concurrencies. + + + +**AMD MI300X performance** + +SGLang delivered the lowest TPOT up to concurrency 32. Beyond that, vLLM took the lead. + +Given that TPOT is memory-bound, MI300X should have a stronger advantage with further optimizations. + +### TTFT vs. Output Token Length + +**NVIDIA H200 performance** + +* SGLang demonstrated stable TTFT across increasing output token lengths. +* vLLM and TensorRT-LLM showed significant increases in TTFT as output token length grew, likely due to KV cache memory pressure. + + + +**AMD MI300X performance** + +Both vLLM and SGLang demonstrated stable TTFT across increasing output token lengths, with vLLM maintaining lower TTFT. + + + +### TPOT vs. Output Token Length + +**NVIDIA H200 performance** + +SGLang and TensorRT-LLM demonstrated stable TPOT across increasing output token lengths. + + + +vLLM maintained the lowest TPOT up to 3200 tokens but showed a sudden increase at 6400 tokens, likely due to memory pressure. 
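To make the TTFT, TPOT, and end-to-end latency figures in these sections easier to relate to each other, the snippet below shows how such metrics are commonly derived from per-request streaming timestamps. It is a simplified illustration of the usual definitions, not the code of the benchmark script used here.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    send_time: float         # when the request was issued
    first_token_time: float  # when the first output token arrived
    end_time: float          # when the last output token arrived
    output_tokens: int       # number of generated tokens


def ttft(t: RequestTiming) -> float:
    return t.first_token_time - t.send_time


def e2e_latency(t: RequestTiming) -> float:
    return t.end_time - t.send_time


def tpot(t: RequestTiming) -> float:
    # Decode time spread over the output tokens after the first one.
    return (e2e_latency(t) - ttft(t)) / max(t.output_tokens - 1, 1)


# Example: a request that streamed 800 output tokens
timing = RequestTiming(send_time=0.0, first_token_time=1.8, end_time=25.0, output_tokens=800)
print(f"TTFT={ttft(timing):.2f}s  TPOT={tpot(timing) * 1000:.1f}ms  E2E={e2e_latency(timing):.1f}s")
```

Under these definitions, a rising TPOT at long output lengths points at the decode phase (memory-bandwidth and KV-cache bound) rather than at prefill, which is consistent with the memory-pressure explanation above.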
+ +**AMD MI300X performance** + +Both SGLang and vLLM demonstrated stable TPOT across increasing output token lengths, with vLLM maintaining the lowest TPOT. + +### Prefix caching + +**NVIDIA H200 performance** + +vLLM outperformed SGLang in online throughput, TTFT, and end-to-end latency with prefix caching enabled. However, vLLM's +TPOT increased after prefix caching, which requires further investigation. + +=== "Throughput" + +=== "TTFT" + +=== "TPOT" + +=== "End-to-end Latency" + + +## Limitations + +1. The offline benchmark results for TensorRT-LLM were obtained using the DeepSeek-R1 model engine built from the + [`deepseek` branch :material-arrow-top-right-thin:{ .external }](https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek){:target="_blank"}. + However, the TensorRT-LLM team recommends using the TorchFlow-based approach for deployment. +2. The impact of dynamic batching on inference efficiency was not tested. +3. vLLM's prefix caching support for MI300X is a work in progress and can be tracked [here :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/vllm/issues/457){:target="_blank"}. + +## Source code + +All source code and findings are available in +[our GitHub repo :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/deepseek-r1-benchmark/Deepseek-R1){:target="_blank"}. + +## References + +* [Unlock DeepSeek-R1 Inference Performance on AMD Instinct MI300X GPU :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html){:target="_blank"} +* [Deploy DeepSeek-R1 671B on 8x NVIDIA H200 with SGLang :material-arrow-top-right-thin:{ .external }](https://datacrunch.io/blog/deploy-deepseek-r1-on-8x-nvidia-h200){:target="_blank"} +* [vLLM Prefix Caching :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching){:target="_blank"} +* [SgLang Prefix Caching :material-arrow-top-right-thin:{ .external }](https://lmsys.org/blog/2024-01-17-sglang/){:target="_blank"} + +## Acknowledgments + +### Vultr + +[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} provided access to 8x NVIDIA H200 GPUs. We are truly thankful for their support. + +If you're looking for top-tier bare metal compute with AMD GPUs, we highly recommend Vultr. With `dstack`, provisioning +and accessing compute via `dstack` is seamless and straightforward. + +### Lambda + +[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"} provided access to 8x +NVIDIA H200 GPUs. We are truly thankful for their support + +Both Vultr and Lambda are natively supported and can be seamlessly integrated with `dstack`. From d1ca8c0e2d09f8bc0b8072702b3830df66b17fd5 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Mon, 17 Mar 2025 23:35:42 -0700 Subject: [PATCH 2/6] [Blog]: DeepSeek R1 inference performance: MI300X vs. 
H200 --- docs/blog/posts/h200-mi300x-deepskeek-benchmark.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md index 4d1ed5c5e..851688a9d 100644 --- a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md +++ b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md @@ -165,7 +165,7 @@ Both vLLM and SGLang demonstrated stable TTFT across increasing output token len SGLang and TensorRT-LLM demonstrated stable TPOT across increasing output token lengths. - + vLLM maintained the lowest TPOT up to 3200 tokens but showed a sudden increase at 6400 tokens, likely due to memory pressure. From 0591640f11fc7804a0210d7baf5fee300f336488 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Mon, 17 Mar 2025 23:36:12 -0700 Subject: [PATCH 3/6] [Blog]: DeepSeek R1 inference performance: MI300X vs. H200 --- docs/blog/posts/h200-mi300x-deepskeek-benchmark.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md index 851688a9d..213c9cb58 100644 --- a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md +++ b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md @@ -186,7 +186,7 @@ TPOT increased after prefix caching, which requires further investigation. === "TPOT" -=== "End-to-end Latency" +=== "Latency" ## Limitations From 8f4224598300b7591d5009001d43c02022c10121 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Tue, 18 Mar 2025 00:35:20 -0700 Subject: [PATCH 4/6] [Blog]: DeepSeek R1 inference performance: MI300X vs. H200 --- docs/blog/posts/h200-mi300x-deepskeek-benchmark.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md index 213c9cb58..7fc54df90 100644 --- a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md +++ b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md @@ -47,7 +47,9 @@ who provided access to the necessary hardware. #### Online inference We utilized SGLang's [`Deepseek-R1/bench_serving.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/main/Deepseek-R1/bench_serving.py){:target="_blank"} -script, modified to incorporate TensorRT-LLM. Tests were conducted across multiple request concurrencies and output token lengths, with input token length fixed at 3200. +script, modified to incorporate TensorRT-LLM. + +Tests were conducted across multiple request concurrencies and output token lengths, with input token length fixed at 3200. | Request Concurrencies | Output Token Lengths | Prefix-Cached | |------------------------|----------------------|----------------| @@ -55,6 +57,8 @@ script, modified to incorporate TensorRT-LLM. Tests were conducted across multip | 128 | 1600, 3200, 6400 | No | | 128 | 800 | Yes | +To test prefix caching ability, about 62.5% of each ~3200-token prompt (i.e., 2000 out of 3200 tokens) is a repeated prefix across multiple requests. + #### Offline Inference For offline inference, we used vLLM’s [`benchmark_throughput.py` :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py){:target="_blank"}, @@ -196,6 +200,9 @@ TPOT increased after prefix caching, which requires further investigation. However, the TensorRT-LLM team recommends using the TorchFlow-based approach for deployment. 2. 
The impact of dynamic batching on inference efficiency was not tested. 3. vLLM's prefix caching support for MI300X is a work in progress and can be tracked [here :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/vllm/issues/457){:target="_blank"}. +4. The inference backends are being optimized for the DeepSeek-R1 model. Given these continuous updates, the current + results reflect only the performance tested at the time of the benchmark. Overall, performance for all backends is + expected to improve as more optimizations are made by the backend teams. ## Source code From b92dc29b658592e49b661c0788cba5d3b55699eb Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Tue, 18 Mar 2025 00:45:32 -0700 Subject: [PATCH 5/6] [Blog]: DeepSeek R1 inference performance: MI300X vs. H200 --- docs/blog/posts/h200-mi300x-deepskeek-benchmark.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md index 7fc54df90..b44377345 100644 --- a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md +++ b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md @@ -220,7 +220,7 @@ All source code and findings are available in ### Vultr -[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} provided access to 8x NVIDIA H200 GPUs. We are truly thankful for their support. +[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} provided access to 8x AMD MI300X GPUs. We are truly thankful for their support. If you're looking for top-tier bare metal compute with AMD GPUs, we highly recommend Vultr. With `dstack`, provisioning and accessing compute via `dstack` is seamless and straightforward. From 8452a3a952261fe28d51c0022e8c6336d8dee28f Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Tue, 18 Mar 2025 00:48:38 -0700 Subject: [PATCH 6/6] [Blog]: DeepSeek R1 inference performance: MI300X vs. H200 --- docs/blog/posts/h200-mi300x-deepskeek-benchmark.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md index b44377345..c07782c5b 100644 --- a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md +++ b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md @@ -44,7 +44,7 @@ who provided access to the necessary hardware. ### Benchmark methodology -#### Online inference +**Online inference** We utilized SGLang's [`Deepseek-R1/bench_serving.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/main/Deepseek-R1/bench_serving.py){:target="_blank"} script, modified to incorporate TensorRT-LLM. @@ -59,7 +59,7 @@ Tests were conducted across multiple request concurrencies and output token leng To test prefix caching ability, about 62.5% of each ~3200-token prompt (i.e., 2000 out of 3200 tokens) is a repeated prefix across multiple requests. -#### Offline Inference +**Offline inference** For offline inference, we used vLLM’s [`benchmark_throughput.py` :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py){:target="_blank"}, modified for SGLang. TensorRT-LLM was tested using a custom
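As a companion to the online-inference methodology above, the sketch below shows the core of a fixed-concurrency serving benchmark: a semaphore caps the number of in-flight requests while per-request TTFT and end-to-end latency are recorded, and aggregate output throughput is computed at the end. It is a self-contained illustration with a simulated streaming call, not the actual `bench_serving.py`; in a real run the simulated call would be replaced by streaming requests to the SGLang, vLLM, or TensorRT-LLM endpoint.

```python
import asyncio
import random
import time


async def fake_stream_completion(prompt: str, output_len: int):
    """Simulated streaming backend; a real benchmark would stream tokens
    from the serving endpoint (SGLang, vLLM, or TensorRT-LLM) instead."""
    await asyncio.sleep(random.uniform(0.05, 0.2))  # prefill -> first token
    yield "tok"
    for _ in range(output_len - 1):
        await asyncio.sleep(0.001)                  # one decode step per token
        yield "tok"


async def run_one(prompt, output_len, sem, results):
    async with sem:  # cap the number of in-flight requests (request concurrency)
        start = time.perf_counter()
        ttft, tokens = None, 0
        async for _ in fake_stream_completion(prompt, output_len):
            tokens += 1
            if ttft is None:
                ttft = time.perf_counter() - start
        results.append({"ttft": ttft, "e2e": time.perf_counter() - start, "tokens": tokens})


async def main(concurrency=16, num_requests=64, output_len=50):
    sem, results = asyncio.Semaphore(concurrency), []
    start = time.perf_counter()
    await asyncio.gather(*(run_one(f"prompt {i}", output_len, sem, results)
                           for i in range(num_requests)))
    elapsed = time.perf_counter() - start
    total_tokens = sum(r["tokens"] for r in results)
    mean_ttft_ms = 1000 * sum(r["ttft"] for r in results) / len(results)
    print(f"concurrency={concurrency}  output throughput={total_tokens / elapsed:.0f} tok/s  "
          f"mean TTFT={mean_ttft_ms:.0f} ms")


if __name__ == "__main__":
    asyncio.run(main())
```

Sweeping `concurrency` over 4–128 and `output_len` over the values in the methodology tables reproduces the shape, though not the absolute numbers, of the curves discussed above.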