ggml-cpu : add runtime rvv detection #17496
base: master
Conversation
VLEN-agnostic kernel selection is also added to `ggml_vec_dot_q2_K_q8_K` for RVV-disabled and wider devices.
Further refactoring of related kernels is planned for follow-up PRs.
```c
// allow benign data race here
static volatile ggml_vec_dot_t func_ptr = NULL;
ggml_vec_dot_t func = func_ptr;
if (func == NULL) {
    func = ggml_vec_dot_q2_K_q8_K_generic;
#if defined(__riscv_v)
    const int vlen = ggml_cpu_get_riscv_vlen();
    if (vlen >= 256) {
        func = ggml_vec_dot_q2_K_q8_K_rvv256;
    } else if (vlen >= 128) {
        func = ggml_vec_dot_q2_K_q8_K_rvv128;
    }
#endif
    func_ptr = func;
}
```
Although it is indeed benign, it would be better to avoid the race so as not to trip thread sanitizers. What is the concern that makes you introduce the `func_ptr` cache? Do you want to avoid repeated calls to `ggml_cpu_get_riscv_vlen()`?
To simplify compilation and avoid building the CPU library twice for different targets, I've implemented detection for devices lacking RVV support via an additional getauxval() call. This guarding is essential because the VLEN query instruction will trigger a SIGILL signal if RVV is absent or disabled, making ggml_cpu_get_riscv_vlen() inherently heavier than a direct __riscv_vlenb() call.
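For illustration, a minimal sketch of such a guarded query, assuming Linux riscv64 where `AT_HWCAP` reports single-letter ISA extensions as bits `'a'`..`'z'`; the actual helper in ggml may differ:

```c
#include <sys/auxv.h>       // getauxval, AT_HWCAP

#if defined(__riscv_v)
#include <riscv_vector.h>   // __riscv_vlenb
#endif

// returns VLEN in bits, or 0 if the vector extension is absent/disabled
int ggml_cpu_get_riscv_vlen(void) {
    // checking the 'v' bit in AT_HWCAP first avoids the SIGILL that the
    // VLEN query instruction raises when RVV is absent or disabled
    if (!(getauxval(AT_HWCAP) & (1UL << ('v' - 'a')))) {
        return 0;
    }
#if defined(__riscv_v)
    return (int) __riscv_vlenb() * 8;  // vlenb is in bytes, VLEN in bits
#else
    return 0;
#endif
}
```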
For improved TSAN compatibility, I propose modifying the initialization to use a relaxed atomic store. Currently, `ggml_vec_dot_q2_K_q8_K()` compiles down to a lightweight 4-instruction trampoline in front of its initialization logic, as shown here:
```
000000000005d8f4 <ggml_vec_dot_q2_K_q8_K>:
   5d8f4: 00089317    auipc t1,0x89
   5d8f8: ec433303    ld    t1,-316(t1) # e67b8 <func_ptr.0>
   5d8fc: 00030363    beqz  t1,5d902 <ggml_vec_dot_q2_K_q8_K+0xe>
   5d900: 8302        jr    t1
   // [snip]
```
I will proceed with implementing a TSAN-friendly version.
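For reference, a minimal sketch of what the relaxed-atomic variant of the dispatch fragment above could look like; this is illustrative only, not the final patch:

```c
#include <stdatomic.h>

static _Atomic(ggml_vec_dot_t) func_ptr = NULL;

// relaxed ordering is sufficient here: every racing thread computes the
// same pointer value, and TSAN sees a proper atomic access instead of a
// data race on a plain volatile
ggml_vec_dot_t func = atomic_load_explicit(&func_ptr, memory_order_relaxed);
if (func == NULL) {
    func = ggml_vec_dot_q2_K_q8_K_generic;
#if defined(__riscv_v)
    const int vlen = ggml_cpu_get_riscv_vlen();
    if (vlen >= 256) {
        func = ggml_vec_dot_q2_K_q8_K_rvv256;
    } else if (vlen >= 128) {
        func = ggml_vec_dot_q2_K_q8_K_rvv128;
    }
#endif
    atomic_store_explicit(&func_ptr, func, memory_order_relaxed);
}
```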
If you have an idea how to resolve the TSAN issue via atomics, that should be OK. I was thinking there is probably a simple way by caching the vlen result:

```c
// this is thread-safe
static const int vlen = ggml_cpu_get_riscv_vlen();
```

And then this `vlen` could be used to index a static table of functions, e.g.:

```c
const int idx = vlen/128;
func = func_table[idx];
```

Ignore my comment if I am missing something.
While a static lookup table isn't future-proof for wider VLEN devices, we can clamp the detected VLEN stored in the cache. I'll try your suggestion.
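For illustration, a hedged sketch of the clamped table dispatch; `select_vec_dot_q2_K_q8_K` is a hypothetical helper name, and the per-call query could be cached with the relaxed atomic discussed above:

```c
// hypothetical helper tying the two suggestions together: clamp the
// detected VLEN so future, wider devices fall back to the widest
// available kernel instead of indexing past the table
static ggml_vec_dot_t select_vec_dot_q2_K_q8_K(void) {
    static const ggml_vec_dot_t table[] = {
        ggml_vec_dot_q2_K_q8_K_generic,  // vlen <  128
        ggml_vec_dot_q2_K_q8_K_rvv128,   // 128 <= vlen < 256
        ggml_vec_dot_q2_K_q8_K_rvv256,   // vlen >= 256
    };
    int idx = ggml_cpu_get_riscv_vlen() / 128;
    if (idx > 2) {
        idx = 2;  // clamp: VLEN 512+ still selects the rvv256 kernel
    }
    return table[idx];
}
```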
Also added VLEN-agnostic kernel selection to `ggml_vec_dot_q2_K_q8_K` for RVV-disabled and wider devices.