hexagon: add SOLVE_TRI op by mengshengwu · Pull Request #21974 · ggml-org/llama.cpp

mengshengwu · 2026-04-16T01:05:29Z

Overview

This PR adds SOLVE_TRI op support to the Hexagon backend, using HVX to accelerate the calculation.
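
For readers unfamiliar with the op, here is a minimal scalar sketch of what a triangular solve computes, assuming a lower-triangular left-hand side and row-major storage. It is a reference for the math only, not the HVX kernel from this PR; the function name `solve_tri_lower_ref` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

// Scalar reference (assumption: lower-triangular A, row-major layout).
// Solves A * X = B by forward substitution.
//   A: n x n, only the lower triangle is read, diagonal must be non-zero
//   B: n x k right-hand sides
//   X: n x k output
void solve_tri_lower_ref(const float *A, const float *B, float *X, size_t n, size_t k) {
    for (size_t i = 0; i < n; ++i) {
        assert(A[i * n + i] != 0.0f);  // a zero diagonal makes the system singular
        const float inv_diag = 1.0f / A[i * n + i];
        for (size_t c = 0; c < k; ++c) {
            // Subtract the contribution of the rows that are already solved.
            float acc = B[i * k + c];
            for (size_t j = 0; j < i; ++j) {
                acc -= A[i * n + j] * X[j * k + c];
            }
            X[i * k + c] = acc * inv_diag;
        }
    }
}
```

Each right-hand-side column is solved independently, so the loop over `c` is the natural dimension to vectorize. With 128-byte HVX vectors that would be 32 f32 lanes per vector; this is an observation about the hardware, not a description of how the kernel in this PR is written.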

All tests pass with test-backend-ops.
Kernel profiles on the CPU and HTP0 backends:

| Shape | CPU runs | CPU time / run | CPU work / run | CPU throughput | HTP0 runs | HTP0 time / run | HTP0 work / run | HTP0 throughput |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `ne_lhs=[64,64,4,4], ne_rhs=[32,64,4,4]` | 11271 | 128.73 us | 2.13 MFLOP | 16.55 GFLOPS | 16380 | 81.78 us | 2.13 MFLOP | 26.04 GFLOPS |
| `ne_lhs=[128,128,4,2], ne_rhs=[32,128,4,2]` | 3786 | 275.89 us | 4.23 MFLOP | 15.32 GFLOPS | 8190 | 144.31 us | 4.23 MFLOP | 29.29 GFLOPS |
| `ne_lhs=[64,64,8,32], ne_rhs=[64,64,8,32]` | 354 | 4008.46 us | 68.16 MFLOP | 17.00 GFLOPS | 1468 | 8059.77 us | 68.16 MFLOP | 8.46 GFLOPS |
| `ne_lhs=[128,128,4,32], ne_rhs=[128,128,4,32]` | 60 | 17616.20 us | 270.53 MFLOP | 15.36 GFLOPS | 370 | 17405.15 us | 270.53 MFLOP | 15.54 GFLOPS |
| `ne_lhs=[256,256,4,2], ne_rhs=[128,256,4,2]` | 238 | 7489.87 us | 67.37 MFLOP | 8.99 GFLOPS | 1485 | 2717.44 us | 67.37 MFLOP | 24.79 GFLOPS |

Perf summary

HTP0 is faster on 64x64, rhs=32: 26.04 vs 16.55 GFLOPS.
HTP0 is faster on 128x128, rhs=32: 29.29 vs 15.32 GFLOPS.
CPU is faster on 64x64, rhs=64, batch=8x32: 17.00 vs 8.46 GFLOPS.
HTP0 is slightly faster on 128x128, rhs=128, batch=4x32: 15.54 vs 15.36 GFLOPS.
HTP0 is faster on 256x256, rhs=128: 24.79 vs 8.99 GFLOPS.
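
The throughput figures follow from work divided by time per run. As a sketch, the work column in the profile is reproduced by `n * (n + 1) * k * batch` FLOPs, where `n` is the matrix size, `k` the number of right-hand-side columns, and `batch` the product of the two outer dimensions. That formula is inferred from the reported numbers (an assumption, not taken from the test-backend-ops source), and the helper names are hypothetical.

```c
#include <stdint.h>

// Assumed work model: n * (n + 1) * k FLOPs per matrix, times the batch count.
// Inferred from the profile output, e.g. 64 * 65 * 32 * 16 = 2129920 (2.13 MFLOP).
uint64_t solve_tri_flops(uint64_t n, uint64_t k, uint64_t batch) {
    return n * (n + 1) * k * batch;
}

// Throughput in GFLOPS from a FLOP count and a time per run in microseconds.
// FLOP per microsecond is MFLOPS, so divide by a further 1e3 to get GFLOPS.
double solve_tri_gflops(uint64_t flops, double time_us) {
    return (double) flops / (time_us * 1e3);
}
```

For the 256x256, rhs=128 row, the time-per-run ratio 7489.87 us / 2717.44 us works out to roughly a 2.8x speedup for HTP0, while for the 64x64, rhs=64, batch=8x32 row HTP0 takes about twice as long as the CPU.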

I also tested this on device with a full model, Qwen3.5-0.8B-Q4_K_M.gguf. Here are representative tensor shapes from the delta-net chunked path around SOLVE_TRI:

common_debug_cb_eval:        attn_pre_solve-21 = (f32) NEG(HTP0#attn-21#0{64, 64, 1, 16}, }) = {64, 64, 1, 16}
common_debug_cb_eval: dnet_add_ch_attn_solved-21 = (f32) ADD(node_2574{64, 64, 1, 16}, HTP0#node_2571#0{64, 64, 1, 1}}) = {64, 64, 1, 16}

common_debug_cb_eval:       dnet_add_ch_lhs-22 = (f32) ADD(HTP0#attn-22#0{64, 64, 1, 16}, HTP0#node_2714#0{64, 64, 1, 1}}) = {64, 64, 1, 16}
common_debug_cb_eval:        attn_pre_solve-22 = (f32) NEG(HTP0#attn-22#0{64, 64, 1, 16}, }) = {64, 64, 1, 16}
common_debug_cb_eval: dnet_add_ch_attn_solved-22 = (f32) ADD(node_2717{64, 64, 1, 16}, HTP0#node_2714#0{64, 64, 1, 1}}) = {64, 64, 1, 16}

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:
YES, I used AI to learn the HVX API and the algorithms for solving triangular matrix equations, and to review my code.

mengshengwu · 2026-04-16T03:45:26Z

Waiting for DMA optimization.

tboinovski1 · 2026-04-23T19:01:38Z

I experimented with adding DMA to the kernel and working out of VTCM, but did not see a meaningful improvement on v75 or v81.

* hexagon: add SOLVE_TRI op
* ggml: fix TODO description for solve_tri
* hexagon: rm unused variable/function warnings
* hexagon: chunk vs batch processing for better thread utilization
* hexagon: vectorize partial f32 loads
* hexagon: move HVX f32 add/sub/mul wrappers to hvx-base.h

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
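
The conversation does not describe how the "chunk vs batch" change works, so the following is only a generic illustration of why splitting work into column chunks rather than whole batch entries can improve thread utilization. All names, the chunk width of 32 columns, and the thread count in the example are hypothetical and are not taken from the PR's code.

```c
#include <stddef.h>

// Hypothetical half-open range [begin, end) of work items owned by one thread.
typedef struct {
    size_t begin;
    size_t end;
} work_range;

// Evenly split n_items work items over n_threads; thread ith gets one range.
// Threads past the end of the work receive an empty range.
work_range split_work(size_t n_items, size_t n_threads, size_t ith) {
    const size_t per_thread = (n_items + n_threads - 1) / n_threads;  // ceil division
    work_range r;
    r.begin = per_thread * ith < n_items ? per_thread * ith : n_items;
    r.end   = r.begin + per_thread < n_items ? r.begin + per_thread : n_items;
    return r;
}

// Number of work items when every batch matrix is further split into
// chunks of chunk_cols right-hand-side columns.
size_t n_chunk_items(size_t n_batch, size_t k, size_t chunk_cols) {
    return n_batch * ((k + chunk_cols - 1) / chunk_cols);
}
```

Taking the `ne_rhs=[128,256,4,2]` shape as an example, there are 8 batch matrices. Splitting those 8 items over a hypothetical 6 threads leaves the last two threads with nothing to do, whereas splitting into 32-column chunks gives 32 items, so every thread receives work.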

mengshengwu requested review from a team and ggerganov as code owners April 16, 2026 01:05

mengshengwu changed the title ~~Hexagon solve tri op~~ hexagon: add SOLVE_TRI op Apr 16, 2026

github-actions bot added the ggml and Hexagon labels Apr 16, 2026

mengshengwu marked this pull request as draft April 16, 2026 03:45

tboinovski1 force-pushed the hexagon-solve-tri-op branch from d70d466 to a7f1c1d Compare April 23, 2026 18:52

max-krasnyansky reviewed Apr 23, 2026

Comment thread ggml/src/ggml-hexagon/htp/solve-tri-ops.c Outdated

mengshengwu marked this pull request as ready for review April 23, 2026 23:32

mengshengwu and others added 6 commits April 23, 2026 16:49

hexagon: add SOLVE_TRI op

96b32ff

ggml: fix TODO description for solve_tri

67e017e

hexagon: rm unused variable/function warnings

fdfa53b

hexagon: chunk vs batch processing for better thread utilization

82f2809

hexagon: vectorize partial f32 loads

976b9d9

hexagon: move HVX f32 add/sub/mul wrappers to hvx-base.h

a4ec494

tboinovski1 force-pushed the hexagon-solve-tri-op branch from a7f1c1d to a4ec494 Compare April 24, 2026 00:05

max-krasnyansky approved these changes Apr 24, 2026

lhez approved these changes Apr 24, 2026

max-krasnyansky merged commit 8bc492e into ggml-org:master Apr 24, 2026
45 of 49 checks passed

max-krasnyansky merged 6 commits into ggml-org:master from qualcomm:hexagon-solve-tri-op

4 participants
