ggml-zendnn : add ZenDNN backend for AMD CPUs #17690
Conversation
I was thinking of creating a backend with https://github.com/amd/blis (with FBGEMM), but ZenDNN is a good choice too.

Can you also include the benchmark results from #17684 in this PR?

@taronaeo Updated the PR description with the benchmark results.

@Djip007 Thanks! AMD BLIS is actually what ZenDNN uses under the hood.
taronaeo left a comment:
General implementation looks good. It just needs the unnecessary enum declarations fixed.
You should also look into supporting GGML_OP_MUL_MAT_ID for MoE, but that can probably be another PR in continuation of this one.
For quantised model support, you can disable the following: set `/* .buffer_from_host_ptr = */ true` to `false`, and weight tensors will then go through `.set_tensor()`, where you can manually upscale them to either BF16 or FP32 before they run the same matmul calculations. I'm quite interested to see if you'll still get a performance boost though :)
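For illustration, here is a minimal sketch of what that `.set_tensor()` upscaling path could look like against ggml's backend-buffer interface. The callback name and the FP32 storage handling are hypothetical, not code from this PR:

```cpp
// Minimal sketch, not the PR's code: with caps.buffer_from_host_ptr = false,
// weights arrive via the buffer's set_tensor callback, where quantized data
// can be dequantized to FP32 before the regular matmul path runs.
#include "ggml.h"
#include "ggml-backend-impl.h"

#include <cstring>
#include <vector>

// Hypothetical callback name; wiring into ggml_backend_buffer_i is omitted.
static void ggml_backend_zendnn_buffer_set_tensor(
        ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
        const void * data, size_t offset, size_t size) {
    GGML_UNUSED(buffer);
    if (ggml_is_quantized(tensor->type) && offset == 0 && size == ggml_nbytes(tensor)) {
        // Assumes whole-tensor writes; partial writes would need
        // block-aware handling of the quantized layout.
        const int64_t n = ggml_nelements(tensor);
        std::vector<float> f32((size_t) n);
        ggml_get_type_traits(tensor->type)->to_float(data, f32.data(), n);
        // ... copy f32 into backend-owned FP32 storage associated with `tensor`
    } else {
        memcpy((char *) tensor->data + offset, data, size);
    }
}
```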
Thanks @taronaeo for the review. MoE support will be added in a follow-up PR after this one merges. Quantized-model support via the upscaling approach may not be needed, since the ZenDNN team is also working on native quantized support.
This PR adds ZenDNN backend support for accelerated inference on AMD EPYC™ CPUs.
Background
ZenDNN is AMD's optimized deep learning library for EPYC processors, providing high-performance primitives for inference workloads. It uses the LowOHA (Low Overhead High-performance) MatMul operator for efficient matrix multiplication.
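As a rough illustration of what calling such a primitive looks like: ZenDNN's C++ API is derived from oneDNN, so an FP32 matmul can be sketched as below. The exact namespace, header, and argument macros are assumptions based on that lineage, and this is not the code path the PR uses:

```cpp
// Rough sketch of an FP32 matmul through ZenDNN's oneDNN-derived C++ API.
// Class and macro names are assumptions based on ZenDNN's oneDNN lineage.
#include <zendnn.hpp>

void matmul_f32(const float * A, const float * B, float * C,
                int64_t M, int64_t N, int64_t K) {
    using tag = zendnn::memory::format_tag;
    using dt  = zendnn::memory::data_type;

    zendnn::engine eng(zendnn::engine::kind::cpu, 0);
    zendnn::stream strm(eng);

    // Row-major 2D descriptors: C[M,N] = A[M,K] * B[K,N]
    zendnn::memory::desc a_md({M, K}, dt::f32, tag::ab);
    zendnn::memory::desc b_md({K, N}, dt::f32, tag::ab);
    zendnn::memory::desc c_md({M, N}, dt::f32, tag::ab);

    zendnn::memory a_m(a_md, eng, (void *) A);
    zendnn::memory b_m(b_md, eng, (void *) B);
    zendnn::memory c_m(c_md, eng, (void *) C);

    zendnn::matmul::desc           md(a_md, b_md, c_md);
    zendnn::matmul::primitive_desc pd(md, eng);
    zendnn::matmul(pd).execute(strm, {
        {ZENDNN_ARG_SRC,     a_m},
        {ZENDNN_ARG_WEIGHTS, b_m},
        {ZENDNN_ARG_DST,     c_m},
    });
    strm.wait();
}
```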
Changes
Backend implementation:
- `ggml/src/ggml-zendnn/`: `GGML_OP_MUL_MAT` acceleration using ZenDNN primitives

Build system:
- `-DGGML_ZENDNN=ON`
- `-DGGML_ZENDNN_PATH=/path/to/zendnn` (see the sample build command below)
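For reference, a full build might look like this (a typical llama.cpp CMake invocation; the ZenDNN path is a placeholder):

```sh
# Example only; point GGML_ZENDNN_PATH at your ZenDNN installation.
cmake -B build -DGGML_ZENDNN=ON -DGGML_ZENDNN_PATH=/path/to/zendnn
cmake --build build --config Release -j
```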
Documentation:
- `docs/backend/ZenDNN.md`
- `docs/build.md`

Hardware Support
Performance Notes
- `export ZENDNNL_MATMUL_ALGO=2` (Blocked AOCL BLIS backend)

Testing
Tested on AMD EPYC systems with llama-server and llama-cli using various models (LLaMA, Mistral, Qwen).
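As an example, a test run might look like the following; the model path and thread count are placeholders, not taken from the PR:

```sh
# Select the Blocked AOCL BLIS matmul path, then run llama-cli.
export ZENDNNL_MATMUL_ALGO=2
./build/bin/llama-cli -m /path/to/llama-3.1-8b-bf16.gguf -t 64 -p "Hello"
```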
Performance Results
Test Configuration
- `ZENDNNL_MATMUL_ALGO=2` (Blocked AOCL BLIS)

Benchmark Results
- LLaMA 3.1 8B (BF16)
- LLaMA 3.1 8B (F32)
- Qwen2 7B (BF16)
- Qwen2 7B (F32)
- LLaMA 2 7B (BF16)
- LLaMA 2 7B (F32)
- LLaMA 2 13B (BF16)
- LLaMA 2 13B (F32)
- Mixtral 8x7B (BF16)
Key Observations:
Related
AI usage disclosure: AI assistance was used for documentation writing, formatting, and CMake syntax. All code logic, implementation decisions, backend integration, and testing were done manually; the core ZenDNN backend implementation, performance optimizations, and benchmark testing were human-authored and validated.