sycl: Battlemage AOT build via spir64_gen + MMQ subgroup annotations #22147
aicss-genai wants to merge 2 commits into ggml-org:master
Signed-off-by: Chun Tao <chun.tao@intel.com>
Hi @aicss-genai, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
arthw left a comment
Supporting allocations larger than 4GB is mandatory, with or without AOT.
If no alternative can be provided for 4GB, we would rather disable AOT, which is less important than the 4GB issue.
What is the benefit of setting the sub-group size to 16? Are there any test results?
add_compile_definitions(GGML_SYCL_WARP_SIZE=16)
target_link_options(ggml-sycl PRIVATE -Xs -ze-intel-greater-than-4GB-buffer-required)
if (NOT GGML_SYCL_DEVICE_ARCH)
    target_link_options(ggml-sycl PRIVATE -Xs -ze-intel-greater-than-4GB-buffer-required)
"-ze-intel-greater-than-4GB-buffer-required" is needed to allocate more than 4GB of memory.
It is mandatory for supporting some LLMs.
If you remove it, please provide an alternative solution for 4GB in the same PR.
Hi Neo, thanks a lot for this input. We will fix it asap. This PR adds AOT support for bmg-g31, while the current SYCL --offload-arch only has AOT support up to bmg-g21.
Setting a single AOT target means the binary cannot run on other GPUs: a binary built with AOT cannot trigger JIT on other GPUs.
For example, SYCL code built AOT for bmg-g31 cannot run on an iGPU or on other dGPUs.
I suggest not setting AOT by default in the SYCL backend CMake file.
You could update SYCL.md to guide users if needed.
Thank you!
#endif
if ((src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) && use_fp16 && ggml_is_contiguous(src0) &&
    row_diff == src0->ne[1] && dst->op_params[0] == GGML_PREC_DEFAULT) {
    // NOTE: Fused dequant+GEMM and MMQ/DPAS were both attempted (Steps 10-11
Remove these lines.
They have no value to users or developers.
GGML_UNUSED(type);
return false;
// DPAS INT8 MMQ kernel exists in mmq.cpp but is slower than dequant+oneDNN.
// Disabled pending further optimization. See optimization-workbook.md Step 11.
This code does not change any behavior.
Please remove it.
Overview
Authors
Enables AOT builds for Intel GPUs (validated on Intel® Arc™ Pro B70, BMG-G31, Xe2-HPG):
- When GGML_SYCL_DEVICE_ARCH is set, switch to -fsycl-targets=spir64_gen with -Xsycl-target-backend="-device <arch>" and skip -ze-intel-greater-than-4GB-buffer-required (not accepted by the AOT path). Behavior is unchanged when GGML_SYCL_DEVICE_ARCH is unset.
- Add [[intel::reqd_sub_group_size(WARP_SIZE)]] to MMQ Q4_0/Q4_1/Q5_0/Q5_1 kernel launches. WARP_SIZE=16 on Intel targets; pinning the required subgroup size is required for spir64_gen AOT correctness and documents intent on JIT.
Also adds a documentation-only update to ggml_sycl_supports_mmq (still returns false) and a note in ggml_sycl_op_mul_mat_sycl recording that fused dequant+GEMM and MMQ/DPAS were both slower than dequant+oneDNN in our experiments.
Additional information
Split from #22066 per reviewer request for independent review.
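The build-flag switch described in the overview can be sketched in CMake roughly as follows. This is a minimal sketch, not the PR's actual diff: the ggml-sycl target, the GGML_SYCL_DEVICE_ARCH variable, and the flags come from the description and the quoted review snippets, while the SHELL: quoting and the overall layout are assumptions. It mirrors the behavior the description states, including dropping the 4GB flag on the AOT path, which is the point the reviewer objects to.

```cmake
# Hypothetical layout; only the target name, the GGML_SYCL_DEVICE_ARCH
# variable, and the flags themselves come from the PR discussion.
if (GGML_SYCL_DEVICE_ARCH)
    # AOT path: build device code ahead of time for a single architecture,
    # e.g. configured with -DGGML_SYCL_DEVICE_ARCH=bmg-g31.
    target_compile_options(ggml-sycl PRIVATE -fsycl-targets=spir64_gen)
    target_link_options(ggml-sycl PRIVATE
        -fsycl-targets=spir64_gen
        # SHELL: keeps the two-word backend option together (assumed quoting).
        "SHELL:-Xsycl-target-backend \"-device ${GGML_SYCL_DEVICE_ARCH}\"")
else()
    # JIT path (unchanged): keep support for allocations larger than 4GB.
    target_link_options(ggml-sycl PRIVATE -Xs -ze-intel-greater-than-4GB-buffer-required)
endif()
```

With this shape, a build configured without GGML_SYCL_DEVICE_ARCH keeps the existing JIT behavior, so AOT stays opt-in rather than the default.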
Requirements