ggml-zendnn : add ZenDNN backend for AMD CPUs #17690
Conversation
I was thinking of creating a backend with https://github.com/amd/blis (with FBGEMM), but ZenDNN is a good choice too.

Can you also include the benchmark results from #17684 in this PR?

@taronaeo Updated the PR description with the benchmark results.

@Djip007 Thanks! AMD BLIS is actually what ZenDNN uses under the hood.
taronaeo left a comment:
General implementation looks good. It just needs the unnecessary enum declarations fixed.
You should also look into supporting GGML_OP_MUL_MAT_ID for MoE, but that can probably be another PR in continuation of this one.
For quantised model support, you can disable the following: set `/* .buffer_from_host_ptr = */ true` to `false`, and weight tensors will then go through `.set_tensor()`, where you can manually upscale them to either BF16 or FP32 before they run the same matmul calculations. I'm quite interested to see if you'll still get a performance boost though :)
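For illustration, here is a minimal sketch of what that `.set_tensor()` upscaling path could look like against ggml's backend-buffer interface. The callback name and the FP32 storage handling are hypothetical, not code from this PR:

```cpp
// Minimal sketch, not the PR's code: with caps.buffer_from_host_ptr = false,
// weights arrive via the buffer's set_tensor callback, where quantized data
// can be dequantized to FP32 before the regular matmul path runs.
#include "ggml.h"
#include "ggml-backend-impl.h"

#include <cstring>
#include <vector>

// Hypothetical callback name; wiring into ggml_backend_buffer_i is omitted.
static void ggml_backend_zendnn_buffer_set_tensor(
        ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
        const void * data, size_t offset, size_t size) {
    GGML_UNUSED(buffer);
    if (ggml_is_quantized(tensor->type) && offset == 0 && size == ggml_nbytes(tensor)) {
        // Assumes whole-tensor writes; partial writes would need
        // block-aware handling of the quantized layout.
        const int64_t n = ggml_nelements(tensor);
        std::vector<float> f32((size_t) n);
        ggml_get_type_traits(tensor->type)->to_float(data, f32.data(), n);
        // ... copy f32 into backend-owned FP32 storage associated with `tensor`
    } else {
        memcpy((char *) tensor->data + offset, data, size);
    }
}
```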
Thanks @taronaeo for the review. MoE support will be added in a follow-up PR after this one merges. Quantized-model support via the upscaling approach may not be needed, since the ZenDNN team is also working on native quantized support.
This PR adds ZenDNN backend support for accelerated inference on AMD EPYC™ CPUs.
Background
ZenDNN is AMD's optimized deep learning library for EPYC processors, providing high-performance primitives for inference workloads. It uses the LowOHA (Low Overhead High-performance) MatMul operator for efficient matrix multiplication.
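As a rough illustration of what calling such a primitive looks like: ZenDNN's C++ API is derived from oneDNN, so an FP32 matmul can be sketched as below. The exact namespace, header, and argument macros are assumptions based on that lineage, and this is not the code path the PR uses:

```cpp
// Rough sketch of an FP32 matmul through ZenDNN's oneDNN-derived C++ API.
// Class and macro names are assumptions based on ZenDNN's oneDNN lineage.
#include <zendnn.hpp>

void matmul_f32(const float * A, const float * B, float * C,
                int64_t M, int64_t N, int64_t K) {
    using tag = zendnn::memory::format_tag;
    using dt  = zendnn::memory::data_type;

    zendnn::engine eng(zendnn::engine::kind::cpu, 0);
    zendnn::stream strm(eng);

    // Row-major 2D descriptors: C[M,N] = A[M,K] * B[K,N]
    zendnn::memory::desc a_md({M, K}, dt::f32, tag::ab);
    zendnn::memory::desc b_md({K, N}, dt::f32, tag::ab);
    zendnn::memory::desc c_md({M, N}, dt::f32, tag::ab);

    zendnn::memory a_m(a_md, eng, (void *) A);
    zendnn::memory b_m(b_md, eng, (void *) B);
    zendnn::memory c_m(c_md, eng, (void *) C);

    zendnn::matmul::desc           md(a_md, b_md, c_md);
    zendnn::matmul::primitive_desc pd(md, eng);
    zendnn::matmul(pd).execute(strm, {
        {ZENDNN_ARG_SRC,     a_m},
        {ZENDNN_ARG_WEIGHTS, b_m},
        {ZENDNN_ARG_DST,     c_m},
    });
    strm.wait();
}
```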
Changes
Backend implementation:
- `ggml/src/ggml-zendnn/`: `GGML_OP_MUL_MAT` acceleration using ZenDNN primitives

Build system:
- `-DGGML_ZENDNN=ON`
- `-DGGML_ZENDNN_PATH=/path/to/zendnn` (see the sample build command below)
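For reference, a full build might look like this (a typical llama.cpp CMake invocation; the ZenDNN path is a placeholder):

```sh
# Example only; point GGML_ZENDNN_PATH at your ZenDNN installation.
cmake -B build -DGGML_ZENDNN=ON -DGGML_ZENDNN_PATH=/path/to/zendnn
cmake --build build --config Release -j
```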
Documentation:
- `docs/backend/ZenDNN.md`
- `docs/build.md`

Hardware Support
Performance Notes
- `export ZENDNNL_MATMUL_ALGO=2` (Blocked AOCL BLIS backend)

Testing
Tested on AMD EPYC systems with llama-server and llama-cli using various models (LLaMA, Mistral, Qwen).
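As an example, a test run might look like the following; the model path and thread count are placeholders, not taken from the PR:

```sh
# Select the Blocked AOCL BLIS matmul path, then run llama-cli.
export ZENDNNL_MATMUL_ALGO=2
./build/bin/llama-cli -m /path/to/llama-3.1-8b-bf16.gguf -t 64 -p "Hello"
```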
Performance Results
Test Configuration
- `ZENDNNL_MATMUL_ALGO=2` (Blocked AOCL BLIS)

Benchmark Results
- LLaMA 3.1 8B (BF16)
- LLaMA 3.1 8B (F32)
- Qwen2 7B (BF16)
- Qwen2 7B (F32)
- LLaMA 2 7B (BF16)
- LLaMA 2 7B (F32)
- LLaMA 2 13B (BF16)
- LLaMA 2 13B (F32)
- Mixtral 8x7B (BF16)
Key Observations:
Related
AI usage disclosure: AI assistance was used for documentation writing, formatting, and CMake syntax. All code logic, implementation decisions, backend integration, and testing were done manually; the core ZenDNN backend implementation, performance optimizations, and benchmark testing were human-authored and validated.