hexagon: add SOLVE_TRI op #21974

Merged
max-krasnyansky merged 6 commits into ggml-org:master from qualcomm:hexagon-solve-tri-op on Apr 24, 2026

@mengshengwu

Contributor

Overview

This PR adds SOLVE_TRI op support to the Hexagon backend, using HVX to accelerate the calculation.
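For readers unfamiliar with the op, a triangular solve with a lower-triangular left-hand side can be computed by forward substitution. The scalar C sketch below illustrates only that idea. It is not the HVX kernel from this PR, and the function name `solve_tri_lower_ref`, the row-major layout, and the lower-triangular, left-side form `A * X = B` are illustrative assumptions rather than details confirmed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference (not the PR's HVX code).
 * Solves A * X = B for X, where A is an n x n lower-triangular matrix with a
 * non-zero diagonal, and B and X are n x k. All matrices are row-major. */
void solve_tri_lower_ref(const float *A, const float *B, float *X,
                         size_t n, size_t k) {
    assert(n > 0 && k > 0);
    for (size_t i = 0; i < n; ++i) {          /* rows, top to bottom      */
        for (size_t c = 0; c < k; ++c) {      /* each right-hand column   */
            float acc = B[i * k + c];
            for (size_t j = 0; j < i; ++j) {  /* subtract solved entries  */
                acc -= A[i * n + j] * X[j * k + c];
            }
            X[i * k + c] = acc / A[i * n + i];
        }
    }
}
```

Row `i` of `X` depends on all earlier rows, so the outer loop is inherently sequential, while the `k` right-hand columns within one row are independent of each other. That independence is the kind of parallelism a vector unit can exploit, although how the PR's kernel actually maps work to HVX is not shown in this conversation.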

  • All tests pass with test-backend-ops.
  • Profiled kernels on both the CPU and HTP0 backends:

| Shape | CPU runs | CPU time / run | Work / run | CPU throughput | HTP0 runs | HTP0 time / run | Work / run | HTP0 throughput |
|---|---|---|---|---|---|---|---|---|
| ne_lhs=[64,64,4,4], ne_rhs=[32,64,4,4] | 11271 | 128.73 us | 2.13 MFLOP | 16.55 GFLOPS | 16380 | 81.78 us | 2.13 MFLOP | 26.04 GFLOPS |
| ne_lhs=[128,128,4,2], ne_rhs=[32,128,4,2] | 3786 | 275.89 us | 4.23 MFLOP | 15.32 GFLOPS | 8190 | 144.31 us | 4.23 MFLOP | 29.29 GFLOPS |
| ne_lhs=[64,64,8,32], ne_rhs=[64,64,8,32] | 354 | 4008.46 us | 68.16 MFLOP | 17.00 GFLOPS | 1468 | 8059.77 us | 68.16 MFLOP | 8.46 GFLOPS |
| ne_lhs=[128,128,4,32], ne_rhs=[128,128,4,32] | 60 | 17616.20 us | 270.53 MFLOP | 15.36 GFLOPS | 370 | 17405.15 us | 270.53 MFLOP | 15.54 GFLOPS |
| ne_lhs=[256,256,4,2], ne_rhs=[128,256,4,2] | 238 | 7489.87 us | 67.37 MFLOP | 8.99 GFLOPS | 1485 | 2717.44 us | 67.37 MFLOP | 24.79 GFLOPS |
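The throughput figures above follow from dividing the work per run by the time per run. The C sketch below reproduces that arithmetic. The FLOP model `n * (n + 1) * k * batch` is an inference from the reported numbers (it matches every Work / run value listed), not something stated in the PR, and the helper names `solve_tri_flops` and `gflops` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed FLOP model, inferred from the reported Work / run values:
 * n = triangular matrix size, k = number of right-hand-side columns,
 * batch = product of the two outer tensor dimensions. */
uint64_t solve_tri_flops(uint64_t n, uint64_t k, uint64_t batch) {
    return n * (n + 1) * k * batch;
}

/* FLOP per microsecond equals MFLOPS; dividing by 1000 gives GFLOPS. */
double gflops(uint64_t flops, double time_us) {
    assert(time_us > 0.0);
    return (double)flops / time_us / 1e3;
}
```

For the first shape (n = 64, k = 32, batch = 4 * 4 = 16), `solve_tri_flops(64, 32, 16)` gives 2,129,920 FLOP, which rounds to the reported 2.13 MFLOP. Passing that to `gflops` with 128.73 us and 81.78 us gives values that round to the reported 16.55 and 26.04 GFLOPS.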

Perf summary

  • HTP0 is faster on 64x64, rhs=32: 26.04 vs 16.55 GFLOPS.
  • HTP0 is faster on 128x128, rhs=32: 29.29 vs 15.32 GFLOPS.
  • CPU is faster on 64x64, rhs=64, batch=8x32: 17.00 vs 8.46 GFLOPS.
  • HTP0 is slightly faster on 128x128, rhs=128, batch=4x32: 15.54 vs 15.36 GFLOPS.
  • HTP0 is faster on 256x256, rhs=128: 24.79 vs 8.99 GFLOPS.

I also tested this on device with a full model, Qwen3.5-0.8B-Q4_K_M.gguf. Here are representative tensor shapes from the delta-net chunked path around SOLVE_TRI:

common_debug_cb_eval:        attn_pre_solve-21 = (f32) NEG(HTP0#attn-21#0{64, 64, 1, 16}, }) = {64, 64, 1, 16}
common_debug_cb_eval: dnet_add_ch_attn_solved-21 = (f32) ADD(node_2574{64, 64, 1, 16}, HTP0#node_2571#0{64, 64, 1, 1}}) = {64, 64, 1, 16}

common_debug_cb_eval:       dnet_add_ch_lhs-22 = (f32) ADD(HTP0#attn-22#0{64, 64, 1, 16}, HTP0#node_2714#0{64, 64, 1, 1}}) = {64, 64, 1, 16}
common_debug_cb_eval:        attn_pre_solve-22 = (f32) NEG(HTP0#attn-22#0{64, 64, 1, 16}, }) = {64, 64, 1, 16}
common_debug_cb_eval: dnet_add_ch_attn_solved-22 = (f32) ADD(node_2717{64, 64, 1, 16}, HTP0#node_2714#0{64, 64, 1, 1}}) = {64, 64, 1, 16}

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    Yes. I used AI to learn the HVX API and the algorithms for solving triangular matrix equations, and to review my code.

@mengshengwu mengshengwu requested review from a team and ggerganov as code owners April 16, 2026 01:05
@mengshengwu mengshengwu changed the title Hexagon solve tri op hexagon: add SOLVE_TRI op Apr 16, 2026
@github-actions github-actions Bot added ggml changes relating to the ggml tensor library for machine learning Hexagon labels Apr 16, 2026
@mengshengwu mengshengwu marked this pull request as draft April 16, 2026 03:45
@mengshengwu

Contributor Author

Waiting for DMA optimization.

@tboinovski1 tboinovski1 force-pushed the hexagon-solve-tri-op branch from d70d466 to a7f1c1d Compare April 23, 2026 18:52
@tboinovski1

Contributor

I experimented with adding DMA to the kernel so that it works out of VTCM, but did not see a meaningful improvement on v75 or v81.

Comment thread ggml/src/ggml-hexagon/htp/solve-tri-ops.c Outdated
@mengshengwu mengshengwu marked this pull request as ready for review April 23, 2026 23:32
@tboinovski1 tboinovski1 force-pushed the hexagon-solve-tri-op branch from a7f1c1d to a4ec494 Compare April 24, 2026 00:05
@max-krasnyansky max-krasnyansky merged commit 8bc492e into ggml-org:master Apr 24, 2026
45 of 49 checks passed
IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request Apr 29, 2026
* hexagon: add SOLVE_TRI op

* ggml: fix TODO description for solve_tri

* hexagon: rm unused variable/function warnings

* hexagon: chunk vs batch processing for better thread utilization

* hexagon: vectorize partial f32 loads

* hexagon: move HVX f32 add/sub/mul wrappers to hvx-base.h

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
fewtarius pushed a commit to fewtarius/CachyLLama that referenced this pull request May 30, 2026