Replies: 3 comments 4 replies
---
The ZenDNN backend implementation is now ready for review. I've opened a pull request with the complete integration:
PR: ggml-zendnn : add ZenDNN backend for AMD CPUs #17690
What's Included
---
Good job! Note that Mixtral 8x7B (BF16) is a MoE model, so it mostly uses MUL_MAT_ID, not MUL_MAT... Just a question out of curiosity: I know that when used like this, it uses "Blocked AOCL BLIS". So, do you plan to implement other operators (OPs), since you chose ZenDNN rather than AOCL (fbgemm)? I've tested AOCL, and it's very efficient, for two reasons:
But before creating a backend, I wanted to see how to manage OP merging in llama.cpp to take advantage of the post-processing options available with fbgemm (and learn how to code them as well). https://docs.amd.com/r/en-US/57404-AOCL-user-guide/AOCL-BLAS?section=lpgemm-in-aocl-blas
Note: with C_fp32[8192, 8192] = A_bf16[8192, 8192] @ B_bf16[8192, 8192] and lpgemm:
---
Some benchmarks:

```sh
OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="1,3,5,7,9,11,13,15" \
ZENDNNL_MATMUL_ALGO=2 \
llama-bench -ctk bf16 -ctv bf16 -ub 4096 -b 8192 \
  -r 3 \
  -p "1,1,2,3,4,8,12,16,24,32,48,64,96,128,192,256,384,512,768,1024,1536,2048,3072,4096,8192" \
  -n 16 \
  -pg "512,64" \
  -m Meta-Llama-3.1-8B-Instruct/BF16.gguf

OMP_NUM_THREADS=16 GOMP_CPU_AFFINITY="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15" \
ZENDNNL_MATMUL_ALGO=2 \
llama-bench -ctk bf16 -ctv bf16 -ub 4096 -b 8192 \
  -r 3 \
  -p "1,1,2,3,4,8,12,16,24,32,48,64,96,128,192,256,384,512,768,1024,1536,2048,3072,4096,8192" \
  -n 16 \
  -pg "512,64" \
  -m Meta-Llama-3.1-8B-Instruct/BF16.gguf
```
Good job.
---
RFC: ZenDNN Backend Integration for llama.cpp
Summary
This RFC proposes the integration of AMD ZenDNN (Zen Deep Neural Network library) as a backend for llama.cpp to accelerate LLM inference on AMD EPYC™ processors. ZenDNN provides optimized deep learning primitives specifically tuned for AMD Zen CPU microarchitecture, offering significant performance improvements for matrix multiplication operations.
Motivation
AMD EPYC processors are widely deployed in data centers and cloud infrastructure for AI/ML workloads. Currently, llama.cpp users on AMD EPYC systems rely on generic CPU implementations or BLAS libraries that may not fully utilize the processor's capabilities. By integrating ZenDNN, we can:
Design
Architecture
The ZenDNN backend follows llama.cpp's standard backend architecture:
Key Components
- Backend Implementation (`ggml/src/ggml-zendnn/ggml-zendnn.cpp`): implements `GGML_OP_MUL_MAT` using ZenDNN primitives
- Build System (`ggml/src/ggml-zendnn/CMakeLists.txt`)
- Documentation (`docs/backend/ZenDNN.md`)
Supported Operations
Data Type Support
Hardware Support
Performance Results
Test Configuration
- `ZENDNNL_MATMUL_ALGO=2` (Blocked AOCL BLIS)
Benchmark Results
LLaMA 3.1 8B (BF16)
LLaMA 3.1 8B (F32)
Qwen2 7B (BF16)
Qwen2 7B (F32)
LLaMA 2 7B (BF16)
LLaMA 2 7B (F32)
LLaMA 2 13B (BF16)
LLaMA 2 13B (F32)
Mixtral 8x7B (BF16)
Key Observations
Implementation Details
Build Options
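As a minimal sketch of what enabling the backend could look like, assuming the `GGML_ZENDNN` CMake option referenced under Migration Impact (any additional options for locating the ZenDNN/AOCL libraries would be documented in the PR):

```sh
# Configure with the ZenDNN backend enabled; assumes ZenDNN and its
# dependencies are installed where CMake can find them.
cmake -B build -DGGML_ZENDNN=ON
cmake --build build --config Release -j
```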
Runtime Configuration
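For illustration, the environment variables used in the benchmarks posted in this thread can be combined as follows (thread count, affinity list, and model path are examples only, not recommendations):

```sh
# 2 = "Blocked AOCL BLIS" matmul algorithm, as in the test configuration above
export ZENDNNL_MATMUL_ALGO=2

# Thread count and core pinning, mirroring the 16-thread benchmark run
export OMP_NUM_THREADS=16
export GOMP_CPU_AFFINITY="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15"

./build/bin/llama-cli -m Meta-Llama-3.1-8B-Instruct/BF16.gguf -p "Hello" -n 64
```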
Testing Plan
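The items listed below map onto existing llama.cpp tooling; a hedged sketch of how they might be run (the `-o` operator filter for `test-backend-ops` and the binary paths are assumptions to check against the PR):

```sh
# Operator-level check of the ZenDNN backend against the CPU reference
./build/bin/test-backend-ops test -o MUL_MAT

# End-to-end throughput at a couple of prompt sizes
./build/bin/llama-bench -m Meta-Llama-3.1-8B-Instruct/BF16.gguf -p 512,2048 -n 64
```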
- `test-backend-ops`
- `tools/server/tests/`
- `llama-bench` on various models and batch sizes
Alternatives Considered
Migration Impact
- `-DGGML_ZENDNN=ON`
References
Conclusion
The ZenDNN backend integration provides significant performance improvements for llama.cpp users on AMD EPYC systems, with up to 2.95x speedup on token generation and 2.23x on prompt processing. The implementation follows llama.cpp's standard backend patterns, maintains full compatibility, and provides an easy migration path for existing deployments.
cc @ggerganov @slaren @ngxson @JohannesGaessler @CISC @jeffbolznv @danbev @cebtenzzre @ikawrakow @avinashcpandey @amukho