hexagon: MUL_MAT and MUL_MAT_ID rework : 32x32 tiled weight repack, kernel-params, cached graphs (#24954)
-
hex-mm: new weight layout and fusion updates
-
hvx-mm: unroll the new tiled vec_dots to optimize hvx register util
-
hex-mm: optimize dyn.quant format for q8_0 and q8_1 to reduce overhead in vec_dots.
-
hvx-mm: parallel quantizer per block for large rows
-
hvx-mm: simplify and further optimize dyn.quant and vec_dots
-
hvx-mm: keep intermediate per tile accumulators in fp16
-
hmx-mm: optimize weight dequant by aligning the repacked tiles with the DMA
-
hmx-mm: remove qweight scratch and just use vtcm_weight
-
hmx-mm: remove all unused and obsolete code
-
hmx-mm: the new tiled repack format is here to stay -- rename all x4x2 to _tiled
-
hmx-mm: improve activation processing with dma prefetch
-
hex-mm: fix hmx/hvx fallback logic and MUL_MAT_ID allocation (unbreaks OLMoE)
-
hex-mm: align the weight tiles with dma just like we did in hmx-mm
-
hex-mm: factor out common mm bits into htp/matmul-ops.h
-
hex-mm: start moving mm kernel selection to the host
-
hex-mm: move all of the matmul param compute into the host
-
hmx-mm: restore pipelined mode
-
hmx-mm: unroll the dequant functions to optimize register usage
-
hmx-mm: further improve activation processing
-
hex-mm: use vtcm_seq_alloc for all vtcm allocations and define more common functions
-
hex-mm: improve mm optimizer to account for number of activation threads
-
hex-mm: fix matmul-id kernel params selection (unbreaks OLMoE and LFM)
-
hexagon: remove support for arch < v73 since HMX is now required for most use-cases
-
hex-mm: cleanup naming for consistency
-
hex-mm: make sure matmul fusion accounts for vtcm allocation
-
hex-mm: minor cleanup for kernel_params definition
-
hex-mm: replace hardcoded limits with proper checks for vtcm requirements
-
hex-mm: add support for non-tiled mm as a fallback option and factor out hvx kernels into separate header
-
hex-mm: remove unused functions
-
hex-mm: add shorthand for MM_SELECT in run-tool script
-
hvx-mm: factor out hvx/hmx microkernels and unify matmul entry and dispatch
-
hex-mm: further cleanup matmul fallback path
-
hex-mm: refactor matmul entry point and dispatch a bit further
-
hexagon: update cmake build to enable hmx for everything
-
hex-ops: optimize kernel_param updates and include summary in the logs
-
hex-mm: add support for GGML_HEXAGON_MM_SELECT
-
hex-mm: add hex-common header
-
hex-mm: pass correct number of tasks to workpool
-
hex-mm: add proper checks for no-work in dyn.quant tasks
-
hex-mm: convert all quantizers into a macro
-
hex-mm: fix hvx-flat fallback to pass all MUL_MAT tests
-
hex-mm: vectorize q8_1 quantizer
-
hex-mm: improve fused ffn mm stride handling
-
hex-mm: consistent use of n_threads and pipeline in kernel_params
-
hexagon: minor formatting
-
hex-mm: update MUL_MAT_ID kernel_param handling to make sure host/npu are in sync
-
hvx-mm: go back to accumulating in fp32 in tiled hvx kernels, more accurate and same perf
-
hvx-mm: unroll the loops and remove masking that is not needed for tiled accums
-
hmx-mm: optimize activation processing (split loops, some unrolling, etc)
-
hmx-mm: minor optimization for output processing
-
hex-mm: consistent use of uint32_t and size_t in mm kernels
-
hex-mm: remove legacy restrictions for rows to be multiple of 256
-
hexagon: replace sprintf with snprintf
-
hex-mm: relax hardcoded nrows checks and rely on VTCM size requirements
-
hexagon: minor alignment fix
-
hexagon: fix trailing spaces
-
hex-mm: relax padding from 256 to 128 (leftovers)
-
hex-mm: remove redundant checks for weight align to 128
we always use 2D dma for the weights and align them properly
-
hmx-mm: MUL_MAT_ID better work distribution between hvx threads and hmx tracing
-
hex-mm: specialize per-token mmid activation handling
-
hex-profile: update python scripts to handle kernel-params section in the logging output
-
hex-mm: move n_prefetch (aka dma_depth) into kernel params and remove unused fields
-
hex-trace: use easier to parse format, simplify and fix post-proc scripts
-
hmx-mm: relax 32 row limit for output processing which helps utilization
-
hmx-mm: use start-chunk idx for tracing info
-
hmx-mm: parameterize activation dma pipeline
-
hexagon: add support for simple graph caching to avoid recomputing kernel-params
-
hex-mm: remove left-over repack functions
-
hex-mm: tighten n_prefetch asserts
-
hex-mm: remove duplicate round/align_up helper
-
hexagon: cleanup common header used in host/npu
-
hexagon: update early wakeup threshold
-
hmx-mm: define cost constants and update solver to assume that repacked ne[1] is padded to 32
-
hmx-mm: make precompute_matmul a bit more readable (split into smaller functions, etc)
-
hex-mm: remove n_threads constraint
-
hex-mm: minor formatting updates
-
hex-mm: remove obsolete profiling logs
-
hex-mm: restore hardcoded gate to refuse lm-head to avoid repacking that tensor
macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
- macOS Intel (x64)
- iOS XCFramework
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
- Ubuntu x64 (SYCL FP32)
- Ubuntu x64 (SYCL FP16)
Android:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows arm64 (OpenCL Adreno)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.3 DLLs
- Windows x64 (Vulkan)
- Windows x64 (OpenVINO)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler:
- DISABLED
- openEuler x86 (310p)
- openEuler x86 (910b, ACL Graph)
- openEuler aarch64 (310p)
- openEuler aarch64 (910b, ACL Graph)
UI: