Replies: 3 comments 4 replies
---
The ZenDNN backend implementation is now ready for review. I've opened a pull request with the complete integration:
PR: ggml-zendnn : add ZenDNN backend for AMD CPUs #17690
What's Included
---
Good job! Note that Mixtral 8x7B (BF16) is a MoE model, so it mostly uses MUL_MAT_ID, not MUL_MAT... Just a question out of curiosity: I know that when used like this, it uses "Blocked AOCL BLIS". So, do you plan to implement other operators (OPs), since you chose ZenDNN rather than AOCL (fbgemm)? I've tested AOCL, and it's very efficient, for two reasons:
But before creating a backend, I wanted to see how to manage OP merging in llama.cpp to take advantage of the post-processing options available with fbgemm (and learn how to code them as well). https://docs.amd.com/r/en-US/57404-AOCL-user-guide/AOCL-BLAS?section=lpgemm-in-aocl-blas
Note: with C_fp32[8192, 8192] = A_bf16[8192, 8192] @ B_bf16[8192, 8192] and lpgemm:
---
Some benchmarks:

```sh
OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="1,3,5,7,9,11,13,15" \
ZENDNNL_MATMUL_ALGO=2 \
llama-bench -ctk bf16 -ctv bf16 -ub 4096 -b 8192 \
  -r 3 \
  -p "1,1,2,3,4,8,12,16,24,32,48,64,96,128,192,256,384,512,768,1024,1536,2048,3072,4096,8192" \
  -n 16 \
  -pg "512,64" \
  -m Meta-Llama-3.1-8B-Instruct/BF16.gguf

OMP_NUM_THREADS=16 GOMP_CPU_AFFINITY="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15" \
ZENDNNL_MATMUL_ALGO=2 \
llama-bench -ctk bf16 -ctv bf16 -ub 4096 -b 8192 \
  -r 3 \
  -p "1,1,2,3,4,8,12,16,24,32,48,64,96,128,192,256,384,512,768,1024,1536,2048,3072,4096,8192" \
  -n 16 \
  -pg "512,64" \
  -m Meta-Llama-3.1-8B-Instruct/BF16.gguf
```
Good job.
---
RFC: ZenDNN Backend Integration for llama.cpp
Summary
This RFC proposes the integration of AMD ZenDNN (Zen Deep Neural Network library) as a backend for llama.cpp to accelerate LLM inference on AMD EPYC™ processors. ZenDNN provides optimized deep learning primitives specifically tuned for AMD Zen CPU microarchitecture, offering significant performance improvements for matrix multiplication operations.
Motivation
AMD EPYC processors are widely deployed in data centers and cloud infrastructure for AI/ML workloads. Currently, llama.cpp users on AMD EPYC systems rely on generic CPU implementations or BLAS libraries that may not fully utilize the processor's capabilities. By integrating ZenDNN, we can:
Design
Architecture
The ZenDNN backend follows llama.cpp's standard backend architecture:
Key Components
- Backend Implementation (`ggml/src/ggml-zendnn/ggml-zendnn.cpp`): implements `GGML_OP_MUL_MAT` using ZenDNN primitives
- Build System (`ggml/src/ggml-zendnn/CMakeLists.txt`)
- Documentation (`docs/backend/ZenDNN.md`)
Supported Operations
Data Type Support
Hardware Support
Performance Results
Test Configuration
- `ZENDNNL_MATMUL_ALGO=2` (Blocked AOCL BLIS)
Benchmark Results
LLaMA 3.1 8B (BF16)
LLaMA 3.1 8B (F32)
Qwen2 7B (BF16)
Qwen2 7B (F32)
LLaMA 2 7B (BF16)
LLaMA 2 7B (F32)
LLaMA 2 13B (BF16)
LLaMA 2 13B (F32)
Mixtral 8x7B (BF16)
Key Observations
Implementation Details
Build Options
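As a minimal sketch of what enabling the backend could look like, assuming the `GGML_ZENDNN` CMake option referenced under Migration Impact (any additional options for locating the ZenDNN/AOCL libraries would be documented in the PR):

```sh
# Configure with the ZenDNN backend enabled; assumes ZenDNN and its
# dependencies are installed where CMake can find them.
cmake -B build -DGGML_ZENDNN=ON
cmake --build build --config Release -j
```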
Runtime Configuration
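For illustration, the environment variables used in the benchmarks posted in this thread can be combined as follows (thread count, affinity list, and model path are examples only, not recommendations):

```sh
# 2 = "Blocked AOCL BLIS" matmul algorithm, as in the test configuration above
export ZENDNNL_MATMUL_ALGO=2

# Thread count and core pinning, mirroring the 16-thread benchmark run
export OMP_NUM_THREADS=16
export GOMP_CPU_AFFINITY="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15"

./build/bin/llama-cli -m Meta-Llama-3.1-8B-Instruct/BF16.gguf -p "Hello" -n 64
```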
Testing Plan
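The items listed below map onto existing llama.cpp tooling; a hedged sketch of how they might be run (the `-o` operator filter for `test-backend-ops` and the binary paths are assumptions to check against the PR):

```sh
# Operator-level check of the ZenDNN backend against the CPU reference
./build/bin/test-backend-ops test -o MUL_MAT

# End-to-end throughput at a couple of prompt sizes
./build/bin/llama-bench -m Meta-Llama-3.1-8B-Instruct/BF16.gguf -p 512,2048 -n 64
```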
- `test-backend-ops`
- `tools/server/tests/`
- `llama-bench` on various models and batch sizes
Alternatives Considered
Migration Impact
- `-DGGML_ZENDNN=ON`
References
Conclusion
The ZenDNN backend integration provides significant performance improvements for llama.cpp users on AMD EPYC systems, with up to 2.95x speedup on token generation and 2.23x on prompt processing. The implementation follows llama.cpp's standard backend patterns, maintains full compatibility, and provides an easy migration path for existing deployments.
cc @ggerganov @slaren @ngxson @JohannesGaessler @CISC @jeffbolznv @danbev @cebtenzzre @ikawrakow @avinashcpandey @amukho