From 19aa92b1f0ae2d020ba0c48f29a410b3813931f1 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Mon, 17 Mar 2025 23:31:08 -0700 Subject: [PATCH 1/6] [Blog]: DeepSeek R1 inference performance: MI300X vs. H200 --- .../posts/h200-mi300x-deepskeek-benchmark.md | 226 ++++++++++++++++++ 1 file changed, 226 insertions(+) create mode 100644 docs/blog/posts/h200-mi300x-deepskeek-benchmark.md diff --git a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md new file mode 100644 index 000000000..4d1ed5c5e --- /dev/null +++ b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md @@ -0,0 +1,226 @@ +--- +title: "DeepSeek R1 inference performance: MI300X vs. H200" +date: 2025-03-18 +description: "TBA" +slug: h200-mi300x-deepskeek-benchmark +image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/h200-mi300x-deepskeek-benchmark-v2.png?raw=true +categories: + - Benchmarks + - AMD + - NVIDIA +--- + +# DeepSeek R1 inference performance: MI300X vs. H200 + +DeepSeek-R1, with its innovative architecture combining Multi-head Latent Attention (MLA) and DeepSeekMoE, presents +unique challenges for inference workloads. As a reasoning-focused model, it generates intermediate chain-of-thought +outputs, placing significant demands on memory capacity and bandwidth. + +In this benchmark, we evaluate the performance of three inference backends—SGLang, vLLM, and TensorRT-LLM—on two hardware +configurations: 8x NVIDIA H200 and 8x AMD MI300X. Our goal is to compare throughput, latency, and overall efficiency to +determine the optimal backend and hardware pairing for DeepSeek-R1's demanding requirements. + + + +This benchmark was made possible through the generous support of our partners at +[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} and +[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"}, +who provided access to the necessary hardware. + + + +## Benchmark setup + +### Hardware configurations + +1. AMD 8xMI300x + * 2x Intel Xeon Platinum 8468, 48C/96T, 16GT/s, 105M Cache (350W) + * 8x AMD MI300x GPU, 192GB, 750W + * 32x 64GB DDR5, 4800MT/s +2. NVIDIA 8xH200 SXM5 + * 2x Intel Xeon Platinum 8570, 56C/112T, 20GT/s, 300M Cache (350W) + * 8x NVIDIA H200 SXM5 GPU, 141GB, 700W + * 32x 64GB DDR5, 5600MT/s + +### Benchmark methodology + +#### Online inference + +We utilized SGLang's [`Deepseek-R1/bench_serving.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/main/Deepseek-R1/bench_serving.py){:target="_blank"} +script, modified to incorporate TensorRT-LLM. Tests were conducted across multiple request concurrencies and output token lengths, with input token length fixed at 3200. + +| Request Concurrencies | Output Token Lengths | Prefix-Cached | +|------------------------|----------------------|----------------| +| 4,8,16,...,128 | 800 | No | +| 128 | 1600, 3200, 6400 | No | +| 128 | 800 | Yes | + +#### Offline Inference + +For offline inference, we used vLLM’s [`benchmark_throughput.py` :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py){:target="_blank"}, +modified for SGLang. TensorRT-LLM was tested using a custom +[`benchmark_throughput_trt.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/blob/deepseek-r1-benchmark/Deepseek-R1/benchmark_throughput_trt.py){:target="_blank"}. 
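Conceptually, the offline numbers reported later come down to one measurement: generate completions for a fixed batch of prompts and divide the total number of generated tokens by the wall-clock time. The sketch below illustrates only that calculation and is not the code of any of the scripts above; `generate_batch` is a hypothetical stand-in for the backend-specific batched generation call (vLLM, SGLang, or TensorRT-LLM).

```python
import time


def measure_offline_throughput(generate_batch, prompts, output_len):
    """Offline throughput = generated tokens / wall-clock time (illustrative only)."""
    start = time.perf_counter()
    completions = generate_batch(prompts, max_tokens=output_len)
    elapsed = time.perf_counter() - start
    generated_tokens = sum(len(c) for c in completions)  # token IDs per completion
    return generated_tokens / elapsed  # tokens/s


if __name__ == "__main__":
    # Dummy backend that "generates" max_tokens token IDs per prompt,
    # so the sketch runs end to end without a real inference server.
    def dummy_generate_batch(prompts, max_tokens):
        time.sleep(0.01 * len(prompts))
        return [[0] * max_tokens for _ in prompts]

    batch = ["example prompt"] * 32
    print(f"{measure_offline_throughput(dummy_generate_batch, batch, output_len=800):.0f} tokens/s")
```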
+The benchmark examined performance across various batch sizes and output token lengths. + +| Batch Sizes | Output Token Lengths | +|--------------------|----------------------| +| 32,64,128,...,1024 | 800 | +| 256, 512, 1024 | 1600 | +| 256, 512, 1024 | 3200 | + +## Key observations + +### Throughput and End-to-End Latency + +**NVIDIA H200 performance** + +* TensorRT-LLM outperformed both vLLM and SGLang, achieving the highest online throughput of 4176 tokens/s on H200. +* At concurrencies below 128, vLLM led in online throughput and end-to-end latency. +* In offline scenarios, H200 achieved the highest overall throughput of 6311 tokens/s with SGLang. + + + +**AMD MI300X performance** + +* vLLM outperformed SGLang in both online and offline throughput and end-to-end latency. +* MI300X with vLLM achieved the highest overall throughput of 4574 tokens/s in online scenarios. +* At request concurrencies below 32, SGLang outperformed vLLM in online throughput and latency. + + + +While MI300X's larger memory capacity and higher bandwidth should theoretically enable higher throughput at larger batch +sizes, the results suggest that inference backends for MI300X may require further optimization to fully leverage its +architectural advantages. + +### Throughput and Latency vs. Output Token Length + +**NVIDIA H200 performance** + +* SGLang delivered slightly higher throughput and better latency as output token length increased in online scenarios. +* In offline scenarios, SGLang with H200 outperformed MI300X as output token length increased. + +=== "Throughput" + + +=== "Latency" + + +**AMD MI300X performance** + +vLLM maintained the lead in both online and offline scenarios as output token length increased. + +=== "Throughput" + + +=== "Latency" + + +### Time to First Token (TTFT) + +**NVIDIA H200 performance** + +TensorRT-LLM maintained the lowest and most consistent TTFT up to concurrency 64. + + + +**AMD MI300X performance** + +vLLM achieved the lowest TTFT at concurrency 128. Below 128, vLLM and SGLang had similar TTFT. + +TTFT, being compute-intensive, highlights H200's advantage, aligning with [SemiAnalysis’s MI300X vs. H200 TFLOPS benchmark :material-arrow-top-right-thin:{ .external }](https://semianalysis.com/2024/12/22/mi300x-vs-h100-vs-h200-benchmark-part-1-training/){:target="_blank"}. +However, at 128 concurrent requests, MI300X's memory capacity and bandwidth advantages become evident. + +### Time Per Output Token (TPOT) + +**NVIDIA H200 performance** + +vLLM maintained the lowest TPOT across all request concurrencies. + + + +**AMD MI300X performance** + +SGLang delivered the lowest TPOT up to concurrency 32. Beyond that, vLLM took the lead. + +Given that TPOT is memory-bound, MI300X should have a stronger advantage with further optimizations. + +### TTFT vs. Output Token Length + +**NVIDIA H200 performance** + +* SGLang demonstrated stable TTFT across increasing output token lengths. +* vLLM and TensorRT-LLM showed significant increases in TTFT as output token length grew, likely due to KV cache memory pressure. + + + +**AMD MI300X performance** + +Both vLLM and SGLang demonstrated stable TTFT across increasing output token lengths, with vLLM maintaining lower TTFT. + + + +### TPOT vs. Output Token Length + +**NVIDIA H200 performance** + +SGLang and TensorRT-LLM demonstrated stable TPOT across increasing output token lengths. + + + +vLLM maintained the lowest TPOT up to 3200 tokens but showed a sudden increase at 6400 tokens, likely due to memory pressure. 
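To make the TTFT, TPOT, and end-to-end latency figures in these sections easier to relate to each other, the snippet below shows how such metrics are commonly derived from per-request streaming timestamps. It is a simplified illustration of the usual definitions, not the code of the benchmark script used here.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    send_time: float         # when the request was issued
    first_token_time: float  # when the first output token arrived
    end_time: float          # when the last output token arrived
    output_tokens: int       # number of generated tokens


def ttft(t: RequestTiming) -> float:
    return t.first_token_time - t.send_time


def e2e_latency(t: RequestTiming) -> float:
    return t.end_time - t.send_time


def tpot(t: RequestTiming) -> float:
    # Decode time spread over the output tokens after the first one.
    return (e2e_latency(t) - ttft(t)) / max(t.output_tokens - 1, 1)


# Example: a request that streamed 800 output tokens
timing = RequestTiming(send_time=0.0, first_token_time=1.8, end_time=25.0, output_tokens=800)
print(f"TTFT={ttft(timing):.2f}s  TPOT={tpot(timing) * 1000:.1f}ms  E2E={e2e_latency(timing):.1f}s")
```

Under these definitions, a rising TPOT at long output lengths points at the decode phase (memory-bandwidth and KV-cache bound) rather than at prefill, which is consistent with the memory-pressure explanation above.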
+ +**AMD MI300X performance** + +Both SGLang and vLLM demonstrated stable TPOT across increasing output token lengths, with vLLM maintaining the lowest TPOT. + +### Prefix caching + +**NVIDIA H200 performance** + +vLLM outperformed SGLang in online throughput, TTFT, and end-to-end latency with prefix caching enabled. However, vLLM's +TPOT increased after prefix caching, which requires further investigation. + +=== "Throughput" + +=== "TTFT" + +=== "TPOT" + +=== "End-to-end Latency" + + +## Limitations + +1. The offline benchmark results for TensorRT-LLM were obtained using the DeepSeek-R1 model engine built from the + [`deepseek` branch :material-arrow-top-right-thin:{ .external }](https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek){:target="_blank"}. + However, the TensorRT-LLM team recommends using the TorchFlow-based approach for deployment. +2. The impact of dynamic batching on inference efficiency was not tested. +3. vLLM's prefix caching support for MI300X is a work in progress and can be tracked [here :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/vllm/issues/457){:target="_blank"}. + +## Source code + +All source code and findings are available in +[our GitHub repo :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/deepseek-r1-benchmark/Deepseek-R1){:target="_blank"}. + +## References + +* [Unlock DeepSeek-R1 Inference Performance on AMD Instinct MI300X GPU :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html){:target="_blank"} +* [Deploy DeepSeek-R1 671B on 8x NVIDIA H200 with SGLang :material-arrow-top-right-thin:{ .external }](https://datacrunch.io/blog/deploy-deepseek-r1-on-8x-nvidia-h200){:target="_blank"} +* [vLLM Prefix Caching :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching){:target="_blank"} +* [SgLang Prefix Caching :material-arrow-top-right-thin:{ .external }](https://lmsys.org/blog/2024-01-17-sglang/){:target="_blank"} + +## Acknowledgments + +### Vultr + +[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} provided access to 8x NVIDIA H200 GPUs. We are truly thankful for their support. + +If you're looking for top-tier bare metal compute with AMD GPUs, we highly recommend Vultr. With `dstack`, provisioning +and accessing compute via `dstack` is seamless and straightforward. + +### Lambda + +[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"} provided access to 8x +NVIDIA H200 GPUs. We are truly thankful for their support + +Both Vultr and Lambda are natively supported and can be seamlessly integrated with `dstack`. From d1ca8c0e2d09f8bc0b8072702b3830df66b17fd5 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Mon, 17 Mar 2025 23:35:42 -0700 Subject: [PATCH 2/6] [Blog]: DeepSeek R1 inference performance: MI300X vs. 
H200 --- docs/blog/posts/h200-mi300x-deepskeek-benchmark.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md index 4d1ed5c5e..851688a9d 100644 --- a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md +++ b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md @@ -165,7 +165,7 @@ Both vLLM and SGLang demonstrated stable TTFT across increasing output token len SGLang and TensorRT-LLM demonstrated stable TPOT across increasing output token lengths. - + vLLM maintained the lowest TPOT up to 3200 tokens but showed a sudden increase at 6400 tokens, likely due to memory pressure. From 0591640f11fc7804a0210d7baf5fee300f336488 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Mon, 17 Mar 2025 23:36:12 -0700 Subject: [PATCH 3/6] [Blog]: DeepSeek R1 inference performance: MI300X vs. H200 --- docs/blog/posts/h200-mi300x-deepskeek-benchmark.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md index 851688a9d..213c9cb58 100644 --- a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md +++ b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md @@ -186,7 +186,7 @@ TPOT increased after prefix caching, which requires further investigation. === "TPOT" -=== "End-to-end Latency" +=== "Latency" ## Limitations From 8f4224598300b7591d5009001d43c02022c10121 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Tue, 18 Mar 2025 00:35:20 -0700 Subject: [PATCH 4/6] [Blog]: DeepSeek R1 inference performance: MI300X vs. H200 --- docs/blog/posts/h200-mi300x-deepskeek-benchmark.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md index 213c9cb58..7fc54df90 100644 --- a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md +++ b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md @@ -47,7 +47,9 @@ who provided access to the necessary hardware. #### Online inference We utilized SGLang's [`Deepseek-R1/bench_serving.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/main/Deepseek-R1/bench_serving.py){:target="_blank"} -script, modified to incorporate TensorRT-LLM. Tests were conducted across multiple request concurrencies and output token lengths, with input token length fixed at 3200. +script, modified to incorporate TensorRT-LLM. + +Tests were conducted across multiple request concurrencies and output token lengths, with input token length fixed at 3200. | Request Concurrencies | Output Token Lengths | Prefix-Cached | |------------------------|----------------------|----------------| @@ -55,6 +57,8 @@ script, modified to incorporate TensorRT-LLM. Tests were conducted across multip | 128 | 1600, 3200, 6400 | No | | 128 | 800 | Yes | +To test prefix caching ability, about 62.5% of each ~3200-token prompt (i.e., 2000 out of 3200 tokens) is a repeated prefix across multiple requests. + #### Offline Inference For offline inference, we used vLLM’s [`benchmark_throughput.py` :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py){:target="_blank"}, @@ -196,6 +200,9 @@ TPOT increased after prefix caching, which requires further investigation. However, the TensorRT-LLM team recommends using the TorchFlow-based approach for deployment. 2. 
The impact of dynamic batching on inference efficiency was not tested. 3. vLLM's prefix caching support for MI300X is a work in progress and can be tracked [here :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/vllm/issues/457){:target="_blank"}. +4. The inference backends are being optimized for the DeepSeek-R1 model. Given these continuous updates, the current + results reflect only the performance tested at the time of the benchmark. Overall, performance for all backends is + expected to improve as more optimizations are made by the backend teams. ## Source code From b92dc29b658592e49b661c0788cba5d3b55699eb Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Tue, 18 Mar 2025 00:45:32 -0700 Subject: [PATCH 5/6] [Blog]: DeepSeek R1 inference performance: MI300X vs. H200 --- docs/blog/posts/h200-mi300x-deepskeek-benchmark.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md index 7fc54df90..b44377345 100644 --- a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md +++ b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md @@ -220,7 +220,7 @@ All source code and findings are available in ### Vultr -[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} provided access to 8x NVIDIA H200 GPUs. We are truly thankful for their support. +[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} provided access to 8x AMD MI300X GPUs. We are truly thankful for their support. If you're looking for top-tier bare metal compute with AMD GPUs, we highly recommend Vultr. With `dstack`, provisioning and accessing compute via `dstack` is seamless and straightforward. From 8452a3a952261fe28d51c0022e8c6336d8dee28f Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Tue, 18 Mar 2025 00:48:38 -0700 Subject: [PATCH 6/6] [Blog]: DeepSeek R1 inference performance: MI300X vs. H200 --- docs/blog/posts/h200-mi300x-deepskeek-benchmark.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md index b44377345..c07782c5b 100644 --- a/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md +++ b/docs/blog/posts/h200-mi300x-deepskeek-benchmark.md @@ -44,7 +44,7 @@ who provided access to the necessary hardware. ### Benchmark methodology -#### Online inference +**Online inference** We utilized SGLang's [`Deepseek-R1/bench_serving.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/main/Deepseek-R1/bench_serving.py){:target="_blank"} script, modified to incorporate TensorRT-LLM. @@ -59,7 +59,7 @@ Tests were conducted across multiple request concurrencies and output token leng To test prefix caching ability, about 62.5% of each ~3200-token prompt (i.e., 2000 out of 3200 tokens) is a repeated prefix across multiple requests. -#### Offline Inference +**Offline inference** For offline inference, we used vLLM’s [`benchmark_throughput.py` :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py){:target="_blank"}, modified for SGLang. TensorRT-LLM was tested using a custom
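As a companion to the online-inference methodology above, the sketch below shows the core of a fixed-concurrency serving benchmark: a semaphore caps the number of in-flight requests while per-request TTFT and end-to-end latency are recorded, and aggregate output throughput is computed at the end. It is a self-contained illustration with a simulated streaming call, not the actual `bench_serving.py`; in a real run the simulated call would be replaced by streaming requests to the SGLang, vLLM, or TensorRT-LLM endpoint.

```python
import asyncio
import random
import time


async def fake_stream_completion(prompt: str, output_len: int):
    """Simulated streaming backend; a real benchmark would stream tokens
    from the serving endpoint (SGLang, vLLM, or TensorRT-LLM) instead."""
    await asyncio.sleep(random.uniform(0.05, 0.2))  # prefill -> first token
    yield "tok"
    for _ in range(output_len - 1):
        await asyncio.sleep(0.001)                  # one decode step per token
        yield "tok"


async def run_one(prompt, output_len, sem, results):
    async with sem:  # cap the number of in-flight requests (request concurrency)
        start = time.perf_counter()
        ttft, tokens = None, 0
        async for _ in fake_stream_completion(prompt, output_len):
            tokens += 1
            if ttft is None:
                ttft = time.perf_counter() - start
        results.append({"ttft": ttft, "e2e": time.perf_counter() - start, "tokens": tokens})


async def main(concurrency=16, num_requests=64, output_len=50):
    sem, results = asyncio.Semaphore(concurrency), []
    start = time.perf_counter()
    await asyncio.gather(*(run_one(f"prompt {i}", output_len, sem, results)
                           for i in range(num_requests)))
    elapsed = time.perf_counter() - start
    total_tokens = sum(r["tokens"] for r in results)
    mean_ttft_ms = 1000 * sum(r["ttft"] for r in results) / len(results)
    print(f"concurrency={concurrency}  output throughput={total_tokens / elapsed:.0f} tok/s  "
          f"mean TTFT={mean_ttft_ms:.0f} ms")


if __name__ == "__main__":
    asyncio.run(main())
```

Sweeping `concurrency` over 4–128 and `output_len` over the values in the methodology tables reproduces the shape, though not the absolute numbers, of the curves discussed above.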