
Conversation

@xctan (Collaborator) commented Nov 25, 2025

Also added VLEN-agnostic kernel selection to ggml_vec_dot_q2_K_q8_K for RVV-disabled and wider devices.

@xctan requested a review from ggerganov as a code owner on November 25, 2025 15:11
@xctan (Collaborator, Author) commented Nov 25, 2025

Further refactoring of related kernels is planned for follow-up PRs.

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 25, 2025
Comment on lines +743 to +757
    // allow benign data race here
    static volatile ggml_vec_dot_t func_ptr = NULL;
    ggml_vec_dot_t func = func_ptr;
    if (func == NULL) {
        func = ggml_vec_dot_q2_K_q8_K_generic;
#if defined(__riscv_v)
        const int vlen = ggml_cpu_get_riscv_vlen();
        if (vlen >= 256) {
            func = ggml_vec_dot_q2_K_q8_K_rvv256;
        } else if (vlen >= 128) {
            func = ggml_vec_dot_q2_K_q8_K_rvv128;
        }
#endif
        func_ptr = func;
    }
Member commented:

Although it is indeed benign, it would be better to avoid the race so it doesn't trip thread sanitizers. What is the concern that makes you introduce the func_ptr cache - do you want to avoid repeated calls to ggml_cpu_get_riscv_vlen()?

@xctan (Collaborator, Author) replied:

To simplify compilation and avoid building the CPU library twice for different targets, I've implemented detection for devices lacking RVV support via an additional getauxval() call. This guarding is essential because the VLEN query instruction will trigger a SIGILL signal if RVV is absent or disabled, making ggml_cpu_get_riscv_vlen() inherently heavier than a direct __riscv_vlenb() call.
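For illustration, the guarded query described above could look roughly like the sketch below; the helper name and the HWCAP bit handling are assumptions for the example, not code from this PR.

#include <sys/auxv.h>                       // getauxval(), AT_HWCAP

// Linux reports single-letter RISC-V ISA extensions as bits of AT_HWCAP;
// bit ('V' - 'A') corresponds to the vector extension.
#define RISCV_HWCAP_V (1UL << ('V' - 'A'))

// hypothetical stand-in for ggml_cpu_get_riscv_vlen(); returns 0 when RVV is unusable
static int riscv_vlen_bits(void) {
    if (!(getauxval(AT_HWCAP) & RISCV_HWCAP_V)) {
        return 0;                           // RVV absent or disabled: reading vlenb would SIGILL
    }
#if defined(__riscv_v)
    unsigned long vlenb;
    __asm__ ("csrr %0, vlenb" : "=r"(vlenb));   // bytes per vector register, what __riscv_vlenb() returns
    return (int) (vlenb * 8);                   // VLEN in bits
#else
    return 0;
#endif
}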

For improved TSAN compatibility, I propose modifying the initialization to use a relaxed atomic store. Currently, ggml_vec_dot_q2_K_q8_K() is compiled to a lightweight 4-instruction trampoline ahead of its initialization logic, as shown here:

000000000005d8f4 <ggml_vec_dot_q2_K_q8_K>:
   5d8f4:       00089317                auipc   t1,0x89
   5d8f8:       ec433303                ld      t1,-316(t1) # e67b8 <func_ptr.0>
   5d8fc:       00030363                beqz    t1,5d902 <ggml_vec_dot_q2_K_q8_K+0xe>
   5d900:       8302                    jr      t1
// [snip]

I will proceed with implementing a TSAN-friendly version.
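For illustration, a minimal sketch of the relaxed-atomic variant using C11 stdatomic; the signature is paraphrased from ggml's ggml_vec_dot_t, and the surrounding wrapper is an assumption, not the final implementation.

#include <stdatomic.h>

void ggml_vec_dot_q2_K_q8_K(int n, float * s, size_t bs, const void * vx, size_t bx,
                            const void * vy, size_t by, int nrc) {
    static _Atomic(ggml_vec_dot_t) func_ptr = NULL;

    // relaxed ordering is sufficient: every thread computes the same pointer value,
    // so the only requirement is that the load/store pair is atomic (no torn reads)
    // and therefore well-defined for TSAN
    ggml_vec_dot_t func = atomic_load_explicit(&func_ptr, memory_order_relaxed);
    if (func == NULL) {
        func = ggml_vec_dot_q2_K_q8_K_generic;
#if defined(__riscv_v)
        const int vlen = ggml_cpu_get_riscv_vlen();
        if (vlen >= 256) {
            func = ggml_vec_dot_q2_K_q8_K_rvv256;
        } else if (vlen >= 128) {
            func = ggml_vec_dot_q2_K_q8_K_rvv128;
        }
#endif
        atomic_store_explicit(&func_ptr, func, memory_order_relaxed);
    }

    func(n, s, bs, vx, bx, vy, by, nrc);
}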

Member replied:

If you have an idea of how to resolve the TSAN issue via atomics, that should be OK. I was thinking there is probably a simple way by caching the vlen result:

// this is thread-safe
static const int vlen = ggml_cpu_get_riscv_vlen();

And then this vlen could be used to index a static table of functions, f.ex:

const int idx = vlen/128;
func = func_table[idx];

Ignore my comment if I am missing something.

@xctan (Collaborator, Author) replied:

While a static lookup table isn't future-proof for wider VLEN devices, we can clamp the detected VLEN stored in the cache. I'll try your suggestion.
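For illustration, a clamped version of that table lookup might look like the sketch below. The selector function and table are made up for the example, and the run-once initialization of the local vlen relies on C++ thread-safe statics; a C translation unit would cache the value with the relaxed atomic shown earlier instead.

#if defined(__riscv_v)
static ggml_vec_dot_t ggml_vec_dot_q2_K_q8_K_select(void) {   // illustrative helper name
    static const ggml_vec_dot_t table[3] = {
        ggml_vec_dot_q2_K_q8_K_generic,   // VLEN < 128 or RVV unavailable
        ggml_vec_dot_q2_K_q8_K_rvv128,    // 128 <= VLEN < 256
        ggml_vec_dot_q2_K_q8_K_rvv256,    // VLEN >= 256
    };
    static const int vlen = ggml_cpu_get_riscv_vlen();   // queried once (C++ semantics)

    int idx = vlen / 128;
    if (idx > 2) {
        idx = 2;      // clamp: future wider-VLEN devices fall back to the widest kernel available
    }
    return table[idx];
}
#endif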
