---
title: "DeepSeek R1 inference performance: MI300X vs. H200"
date: 2025-03-18
description: "We benchmark DeepSeek R1 inference with SGLang, vLLM, and TensorRT-LLM on 8x NVIDIA H200 and 8x AMD MI300X."
slug: h200-mi300x-deepskeek-benchmark
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/h200-mi300x-deepskeek-benchmark-v2.png?raw=true
categories:
- Benchmarks
- AMD
- NVIDIA
---

# DeepSeek R1 inference performance: MI300X vs. H200

DeepSeek-R1, with its innovative architecture combining Multi-head Latent Attention (MLA) and DeepSeekMoE, presents
unique challenges for inference workloads. As a reasoning-focused model, it generates intermediate chain-of-thought
outputs, placing significant demands on memory capacity and bandwidth.

In this benchmark, we evaluate the performance of three inference backends—SGLang, vLLM, and TensorRT-LLM—on two hardware
configurations: 8x NVIDIA H200 and 8x AMD MI300X. Our goal is to compare throughput, latency, and overall efficiency to
determine the optimal backend and hardware pairing for DeepSeek-R1's demanding requirements.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h200-mi300x-deepskeek-benchmark-v2.png?raw=true" width="630"/>

This benchmark was made possible through the generous support of our partners at
[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} and
[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"},
who provided access to the necessary hardware.

<!-- more -->

## Benchmark setup

### Hardware configurations

1. AMD 8xMI300X
* 2x Intel Xeon Platinum 8468, 48C/96T, 16GT/s, 105M Cache (350W)
* 8x AMD MI300X GPU, 192GB, 750W
* 32x 64GB DDR5, 4800MT/s
2. NVIDIA 8xH200 SXM5
* 2x Intel Xeon Platinum 8570, 56C/112T, 20GT/s, 300M Cache (350W)
* 8x NVIDIA H200 SXM5 GPU, 141GB, 700W
* 32x 64GB DDR5, 5600MT/s

### Benchmark methodology

**Online inference**

We used a modified version of SGLang's serving benchmark script, [`Deepseek-R1/bench_serving.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/main/Deepseek-R1/bench_serving.py){:target="_blank"},
extended to support TensorRT-LLM.

Tests were conducted across multiple request concurrencies and output token lengths, with input token length fixed at 3200.

| Request Concurrencies | Output Token Lengths | Prefix-Cached |
|------------------------|----------------------|----------------|
| 4,8,16,...,128 | 800 | No |
| 128 | 1600, 3200, 6400 | No |
| 128 | 800 | Yes |

To test prefix caching, about 62.5% of each ~3200-token prompt (2000 of 3200 tokens) was a prefix shared across multiple requests.
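
To illustrate how such prompts can be constructed, here is a minimal sketch in Python. The tokenizer id, constants, and helper are illustrative assumptions, not code from the benchmark script:

```python
# Minimal sketch: build prompts that share a fixed 2000-token prefix and differ
# in the remaining ~1200 tokens, so the serving engine can reuse cached KV blocks.
# The tokenizer id and constants below are illustrative, not the benchmark's own code.
import random

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")  # assumed model id

PREFIX_TOKENS = 2000   # shared across all requests (~62.5% of the prompt)
UNIQUE_TOKENS = 1200   # unique tail per request
NUM_REQUESTS = 128     # matches the prefix-caching run at concurrency 128

vocab_ids = list(tokenizer.get_vocab().values())

def random_text(num_tokens: int) -> str:
    """Decode random token ids into text that is roughly `num_tokens` tokens long."""
    return tokenizer.decode(random.choices(vocab_ids, k=num_tokens))

shared_prefix = random_text(PREFIX_TOKENS)
prompts = [shared_prefix + " " + random_text(UNIQUE_TOKENS) for _ in range(NUM_REQUESTS)]
```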

**Offline inference**

For offline inference, we used vLLM’s [`benchmark_throughput.py` :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py){:target="_blank"},
modified for SGLang. TensorRT-LLM was tested using a custom
[`benchmark_throughput_trt.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/blob/deepseek-r1-benchmark/Deepseek-R1/benchmark_throughput_trt.py){:target="_blank"}.
The benchmark examined performance across various batch sizes and output token lengths.

| Batch Sizes | Output Token Lengths |
|--------------------|----------------------|
| 32,64,128,...,1024 | 800 |
| 256, 512, 1024 | 1600 |
| 256, 512, 1024 | 3200 |
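
In these scripts, offline throughput boils down to the number of generated tokens divided by the wall-clock time for the whole batch. Below is a minimal sketch of that calculation using vLLM's offline `LLM` API; the model id, batch size, and output length are illustrative rather than the exact benchmark configuration:

```python
# Minimal sketch of an offline throughput measurement with vLLM's batch API.
# Model id, batch size, and token lengths are illustrative assumptions.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=8)

prompts = ["Summarize the history of GPUs."] * 256        # batch size 256
params = SamplingParams(max_tokens=800, ignore_eos=True)  # fixed output length

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

output_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Offline throughput: {output_tokens / elapsed:.1f} output tokens/s")
```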

## Key observations

### Throughput and End-to-End Latency

**NVIDIA H200 performance**

* TensorRT-LLM outperformed both vLLM and SGLang, achieving the highest online throughput of 4176 tokens/s on H200.
* At concurrencies below 128, vLLM led in online throughput and end-to-end latency.
* In offline scenarios, H200 achieved the highest overall throughput of 6311 tokens/s with SGLang.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/online_throughput_vs_latency.png" />

**AMD MI300X performance**

* vLLM outperformed SGLang in both online and offline throughput and end-to-end latency.
* MI300X with vLLM achieved the highest overall throughput of 4574 tokens/s in online scenarios.
* At request concurrencies below 32, SGLang outperformed vLLM in online throughput and latency.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/offline_throughput_vs_latency.png" />

While MI300X's larger memory capacity and higher bandwidth should theoretically enable higher throughput at larger batch
sizes, the results suggest that inference backends for MI300X may require further optimization to fully leverage its
architectural advantages.

### Throughput and Latency vs. Output Token Length

**NVIDIA H200 performance**

* SGLang delivered slightly higher throughput and better latency as output token length increased in online scenarios.
* In offline scenarios, SGLang with H200 outperformed MI300X as output token length increased.

=== "Throughput"
<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/online-throughput-vs-output.png" />

=== "Latency"
<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/online-latency-vs-output.png" />

**AMD MI300X performance**

vLLM maintained the lead in both online and offline scenarios as output token length increased.

=== "Throughput"
<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/offline-throughput-vs-output-256.png" />

=== "Latency"
<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/offline-latency-vs-output-length-256.png" />

### Time to First Token (TTFT)

**NVIDIA H200 performance**

TensorRT-LLM maintained the lowest and most consistent TTFT up to concurrency 64.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/ttft-vs-concurrency.png" />

**AMD MI300X performance**

vLLM achieved the lowest TTFT at concurrency 128. Below 128, vLLM and SGLang had similar TTFT.

Because the prefill phase that determines TTFT is compute-bound, these results highlight H200's compute advantage, aligning with [SemiAnalysis’s MI300X vs. H200 TFLOPS benchmark :material-arrow-top-right-thin:{ .external }](https://semianalysis.com/2024/12/22/mi300x-vs-h100-vs-h200-benchmark-part-1-training/){:target="_blank"}.
At 128 concurrent requests, however, MI300X's memory capacity and bandwidth advantages become evident.

### Time Per Output Token (TPOT)

**NVIDIA H200 performance**

vLLM maintained the lowest TPOT across all request concurrencies.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/tpot-vs-concurrency.png" />

**AMD MI300X performance**

SGLang delivered the lowest TPOT up to concurrency 32. Beyond that, vLLM took the lead.

Given that TPOT is memory-bound, MI300X should have a stronger advantage with further optimizations.
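
The compute-bound vs. memory-bound split follows from how the two metrics are defined: TTFT covers the prefill phase, while TPOT averages the decode phase over the remaining output tokens. The sketch below mirrors these common definitions; variable names are illustrative and not taken from the benchmark script:

```python
# Common streaming-latency definitions: TTFT covers the prefill phase, TPOT averages
# the decode phase over the generated tokens after the first one.
# Timestamps are assumed to be recorded around a streaming request; names are illustrative.

def ttft(request_start: float, first_token_time: float) -> float:
    """Time to first token: prefill latency as observed by the client."""
    return first_token_time - request_start

def tpot(first_token_time: float, completion_time: float, num_output_tokens: int) -> float:
    """Time per output token: average decode latency per token after the first."""
    decode_time = completion_time - first_token_time
    return decode_time / max(num_output_tokens - 1, 1)

def e2e_latency(request_start: float, completion_time: float) -> float:
    """End-to-end latency: TTFT plus the full decode phase."""
    return completion_time - request_start
```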

### TTFT vs. Output Token Length

**NVIDIA H200 performance**

* SGLang demonstrated stable TTFT across increasing output token lengths.
* vLLM and TensorRT-LLM showed significant increases in TTFT as output token length grew, likely due to KV cache memory pressure.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/ttft-vs-output-length.png" />

**AMD MI300X performance**

Both vLLM and SGLang demonstrated stable TTFT across increasing output token lengths, with vLLM maintaining lower TTFT.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/ttft-vs-output-length-no-h200-vllm.png" />

### TPOT vs. Output Token Length

**NVIDIA H200 performance**

SGLang and TensorRT-LLM demonstrated stable TPOT across increasing output token lengths.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/tpot-vs-output-length.png" />

vLLM maintained the lowest TPOT up to 3200 tokens but showed a sudden increase at 6400 tokens, likely due to memory pressure.

**AMD MI300X performance**

Both SGLang and vLLM demonstrated stable TPOT across increasing output token lengths, with vLLM maintaining the lowest TPOT.

### Prefix caching

**NVIDIA H200 performance**

vLLM outperformed SGLang in online throughput, TTFT, and end-to-end latency with prefix caching enabled. However, vLLM's
TPOT was higher with prefix caching than without, which warrants further investigation.

=== "Throughput"
<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/prefix-cache-throughput-comparison.png" />
=== "TTFT"
<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/prefix-cache-ttft-comparison.png" />
=== "TPOT"
<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/prefix-cache-tpot-comparison.png" />
=== "Latency"
<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/prefix-cache-end-to-end-latency-comparison.png" />
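
For readers reproducing these runs, prefix caching is an explicit engine option in vLLM (`enable_prefix_caching`), while SGLang's RadixAttention-based prefix cache is enabled by default. The snippet below is a minimal sketch using vLLM's offline API; the model id and prompts are illustrative:

```python
# Minimal sketch: enabling automatic prefix caching in vLLM's offline API so the
# shared ~2000-token prefix is computed once and its KV blocks are reused.
# The model id and prompts are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,
    enable_prefix_caching=True,  # reuse KV cache blocks for repeated prompt prefixes
)

shared_prefix = "<the repeated ~2000-token context goes here> "
questions = ["Question A?", "Question B?"]

outputs = llm.generate(
    [shared_prefix + q for q in questions],
    SamplingParams(max_tokens=800),
)
```

For online serving, the equivalent vLLM engine flag is `--enable-prefix-caching`.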

## Limitations

1. The offline benchmark results for TensorRT-LLM were obtained using the DeepSeek-R1 model engine built from the
[`deepseek` branch :material-arrow-top-right-thin:{ .external }](https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek){:target="_blank"}.
However, the TensorRT-LLM team recommends using the PyTorch-based workflow for deployment.
2. The impact of dynamic batching on inference efficiency was not tested.
3. vLLM's prefix caching support for MI300X is a work in progress and can be tracked [here :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/vllm/issues/457){:target="_blank"}.
4. The inference backends are still being actively optimized for DeepSeek-R1, so these results reflect performance only
at the time of testing. Performance for all backends is expected to improve as the backend teams ship further
optimizations.

## Source code

All source code and findings are available in
[our GitHub repo :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/deepseek-r1-benchmark/Deepseek-R1){:target="_blank"}.

## References

* [Unlock DeepSeek-R1 Inference Performance on AMD Instinct MI300X GPU :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html){:target="_blank"}
* [Deploy DeepSeek-R1 671B on 8x NVIDIA H200 with SGLang :material-arrow-top-right-thin:{ .external }](https://datacrunch.io/blog/deploy-deepseek-r1-on-8x-nvidia-h200){:target="_blank"}
* [vLLM Prefix Caching :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching){:target="_blank"}
* [SGLang Prefix Caching :material-arrow-top-right-thin:{ .external }](https://lmsys.org/blog/2024-01-17-sglang/){:target="_blank"}

## Acknowledgments

### Vultr

[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} provided access to 8x AMD MI300X GPUs. We are truly thankful for their support.

If you're looking for top-tier bare-metal compute with AMD GPUs, we highly recommend Vultr. Provisioning and accessing
their compute via `dstack` is seamless and straightforward.

### Lambda

[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"} provided access to 8x
NVIDIA H200 GPUs. We are truly thankful for their support.

Both Vultr and Lambda are natively supported and can be seamlessly integrated with `dstack`.