opencl: add basic support for q5_k #21593
Merged
lhez merged 3 commits into ggml-org:master from Apr 11, 2026
max-krasnyansky approved these changes on Apr 11, 2026
XeonBloomfield added a commit to XeonBloomfield/llama.cpp that referenced this pull request on Apr 11, 2026
* model, mtmd: fix gguf conversion for audio/vision mmproj (ggml-org#21309) * fix gguf conversion for audio/vision mmproj * fix test * tests: allow exporting graph ops from HF file without downloading weights (ggml-org#21182) * tests: allow exporting graph ops from HF file without downloading weights * use unique_ptr for llama_context in HF metadata case * fix missing non-required tensors falling back to type f32 * use unique pointers where possible * use no_alloc instead of fixing f32 fallback * fix missing space * ggml-webgpu: add vectorized flash attention (ggml-org#20709) * naive vectorized version * add vectorized flash attention * update vec version * remove unused path and shader * remove unused helper functions * add comments * remove pad path * ggml-webgpu: fix flash-attn vec nwg=1 path and tighten vec specialization * change back to vec4 * enable multi split * enable vec path when: - Q->ne[1] < 20 - Q->ne[0] % 32 == 0 - V->ne[0] % 4 == 0 - K->type == f16 * update flash_attn_vec_split.wgsl to reduce redundant workgroup barrier usage and use select * enable vec path for q4 and q8 * flash-attn vec nwg=1 fast path (skip tmp/reduce staging) * use packed f16 K loads in flash-attn vec split * use packed f16 K loads in flash-attn vec split on host side * tune flash-attn vec f16 VEC_NE by head dim * cleanup * cleanup * keep host side clean * cleanup host side * change back to original host wait/submit behavior * formatting * reverted param-buffer pool refactor * add helper functions * ggml-webgpu: move flash-attn vec pipeline caching back into shader lib * ggml-webgpu: remove duplicate functions * ggml-webgpu: reserve flash-attn vec scratch in dst buffer allocation * ggml-webgpu: revert unrelated change * ggml-webgpu: revert deleted comment * disable uniformity check * remove unnecessary change * Update ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl * Update ggml/src/ggml-webgpu/ggml-webgpu.cpp --------- Co-authored-by: Reese Levine
<reeselevine1@gmail.com> * tests : add unit test coverage for llama_tensor_get_type (ggml-org#20112) * Add unit test coverage for llama_tensor_get_type * Fix merge conflicts, add more schemas * clang formatter changes * Trailing whitespace * Update name * Start rebase * Updating files with upstream changes prior to rebase * Changes needed from rebase * Update attn_qkv schema, change throw behaviour * Fix merge conflicts * White space * Update with latest changes to state counters * Revert accidental personal CLAUDE.md changes * Change quotation mark * Reuse metadata.name since we have it * Move test-only stuff out of llama-quant.cpp * Hide the regex functionality back in llama-quant.cpp, use a unique pointer to a new struct 'compiled_tensor_type_patterns' which contains the patterns * cont : inital deslop guidelines * Cleanup based on review comments * Continue cleanup * Small cleanup * Manually set proper ordering of tensors, mostly applies to gemma * Formatting * Update tests/test-quant-type-selection.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix merge conflicts --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix: gemma 4 template (ggml-org#21326) * [HIP] Bump ROCm version to 7.2.1 (ggml-org#21066) Bump ROCm version on Linux from 7.2 to 7.2.1 Add gfx1102 target Delete LLVM workaround since ROCm 7.2.1 has fix for ROCm 7.2 perf regression ROCm/rocm-systems#2865 --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * ci : add AMD ZenDNN label to PR labeler (ggml-org#21345) * ci : add AMD CPU label to PR labeler Add automatic labeling for PRs that modify AMD CPU (ZenDNN) backend files * ci : rename label AMD CPU to AMD ZenDNN in labeler config Co-authored-by: Aaron Teo <taronaeo@gmail.com> --------- Co-authored-by: Aaron Teo <taronaeo@gmail.com> * (revert) kv-cache : do not quantize SWA KV cache (ggml-org#21332) This reverts commit 17193cc. 
* chat : avoid including json in chat.h (ggml-org#21306) * rpc : reuse compute graph buffers (ggml-org#21299) Reuse the buffer for the ggml context which is used for creating the compute graph on the server side. This partially addresses a memory leak created by the CUDA backend due to using buffer addresses as cache keys. ref: ggml-org#21265 ref: ggml-org#20315 * vocab: fix Gemma4 tokenizer (ggml-org#21343) * seems to work * fix case with new line Co-authored-by: sayap <sokann@gmail.com> * gemma 4: fix pre tok regex --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: sayap <sokann@gmail.com> * ggml-zendnn : add MUL_MAT_ID op support for MoE models (ggml-org#21315) * ggml-zendnn : add MUL_MAT_ID op support for MoE models - Add MUL_MAT_ID op acceleration for Mixture-of-Experts models - MUL_MAT_ID op fallback to CPU backend if total experts > 32 - Point ZenDNN lib to latest bits ZenDNN-2026-WW13 * ggml-zendnn : add braces to sgemm failure condition for consistency Co-authored-by: Aaron Teo <taronaeo@gmail.com> --------- Co-authored-by: Aaron Teo <taronaeo@gmail.com> * fix: add openssl to nix dependencies (ggml-org#21353) (ggml-org#21355) * HIP: build each ci build test for a different architecture (ggml-org#21337) This helps improve our chances of finding build failures before the release workflow builds for all architectures. * fix: remove stale assert (ggml-org#21369) * ci: add more binary checks (ggml-org#21349) * jinja: coerce input for string-specific filters (ggml-org#21370) * docs: Update build.md: HSA_OVERRIDE_GFX_VERSION clarification (ggml-org#21331) The `HSA_OVERRIDE_GFX_VERSION` variable can be used in ROCm to override an unsupported target architecture with a similar but supported target architecture. This does not and has never worked on Windows. I think the clarification could avoid driving Windows people towards this solution that does not work.
* docker : bump cuda12 to 12.9.1 (ggml-org#20920) Co-authored-by: M1DNYT3 <m1dnyt3@MacBookPro.lan> Co-authored-by: CISC <CISC@users.noreply.github.com> * common : fix tool call type detection for nullable and enum schemas (ggml-org#21327) * common : fix tool call type detection for nullable and enum schemas * common, tests : fix grammar delegation for nullable/enum schemas and add tests Fix enum type inference to scan all enum values (not just index 0) so schemas like {"enum": [0, "celsius"]} correctly detect string type. Fix schema_delegates in peg-parser to handle nullable type arrays (["string", "null"]) and typeless enum schemas in raw mode, allowing the tagged parser to use raw text instead of JSON-formatted strings. Add test cases for Qwen3-Coder (TAG_WITH_TAGGED format): - nullable string ["string", "null"] - nullable string with null first ["null", "string"] - nullable integer ["integer", "null"] - enum without explicit type key * common/parser: fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers (ggml-org#21230) * Fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers * Rename * Update common/chat-auto-parser-generator.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * server: save and clear idle slots on new task (`--clear-idle`) (ggml-org#20993) * server: clear idle slots KV from VRAM (LLAMA_KV_KEEP_ONLY_ACTIVE) * server: move idle slot KV clearing to slot release The save "cost" is now paid by the finishing request. 
* server: add --kv-clear-idle flag, enable by default * server: skip clearing last idle slot, clear on launch * server: test --no-kv-clear-idle flag * server: simplify on-release clearing loop * server: remove on-release KV clearing, keep launch-only * cont : clean-up * tests: update log strings after --clear-idle rename * tests: use debug tags instead of log message matching * test: fix Windows CI by dropping temp log file unlink --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * ci: Add Windows Vulkan backend testing on Intel (ggml-org#21292) * experimenting CI * Experimenting CI fix for MinGW * experimenting CI on Windows * modified script for integration with VisualStudio * added proxy handling * adding python version for Windows execution * fix iterator::end() dereference * fixed proxy handling * Fix errors occurring on Windows * fixed ci script * Reverted to master * Stripping test items to simplify Windows test * adjusting script for windows testing * Changed shell * Fixed shell * Fixed shell * Fix CI setting * Fix CI setting * Fix CI setting * Experimenting ci fix * Experimenting ci fix * Experimenting ci fix * Experimenting ci fix * experimenting fix for unit test error * Changed to use BUILD_LOW_PERF to skip python tests * Fix CI * Added option to specify Ninja generator * Reverted proxy related changes * ggml-webgpu: move from parameter buffer pool to single buffer with offsets (ggml-org#21278) * Work towards removing bitcast * Move rest of existing types over * Add timeout back to wait and remove synchronous set_tensor/memset_tensor * move to unpackf16 for wider compatibility * cleanup * Remove deadlock condition in free_bufs * Start work on removing parameter buffer pools * Simplify and optimize further * simplify profile futures * Fix stride * Try using a single command buffer per batch * formatting * llama: add custom newline split for Gemma 4 (ggml-org#21406) * llama-model: read final_logit_softcapping for Gemma 4 (ggml-org#21390) * 
common : respect specified tag, only fallback when tag is empty (ggml-org#21413) Signed-off-by: Adrien Gallouët <angt@huggingface.co> * server: Fix undefined timing measurement errors in server context (ggml-org#21201) Co-authored-by: Dan Hoffman <dhoffman@cyket.net> * common : add gemma 4 specialized parser (ggml-org#21418) * common : add gemma4 dedicated parser * cont : add '<|tool_response>' as eog * cont : emit JSON from Gemma4 tool call AST * cont : more fixes * cont : refactor convert function * cont : refine rules and mapping * cont : add more tests * cont : clean up * cont : remove autoparser gemma4 implementation * cont : more cleanup * cont : rename gemma4.jinja to match the others * cont : add custom template to support interleaved thinking * cont : preserve reasoning in model turns * cont : fix initializer error * cont : fix unused vars * cont : fix accidental static * cont : fix specialized_template signature * fix extra semicolon * remove debug line and extra space [no ci] * ci: fix vulkan workflow referencing non-existent action (ggml-org#21442) * ci: lower cuda12 floor to 12.8.1 for broader host compatibility (ggml-org#21438) Co-authored-by: M1DNYT3 <m1dnyt3@MacBookPro.lan> * server : fix logging of build + system info (ggml-org#21460) This PR changes the logging that occurs at startup of llama-server. Currently, it is redundant (including CPU information twice) and it is missing the build + commit info. * ci : use default RISE RISC-V Runners (ggml-org#21263) * model : add HunyuanOCR support (ggml-org#21395) * HunyuanOCR: add support for text and vision models - Add HunyuanOCR vision projector (perceiver-based) with Conv2d merge - Add separate HUNYUAN_OCR chat template (content-before-role format) - Handle HunyuanOCR's invalid pad_token_id=-1 in converter - Fix EOS/EOT token IDs from generation_config.json - Support xdrope RoPE scaling type - Add tensor mappings for perceiver projector (mm.before_rms, mm.after_rms, etc.) 
- Register HunYuanVLForConditionalGeneration for both text and mmproj conversion * fix proper mapping * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * address comments * update * Fix typecheck * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * llama : correct platform-independent loading of BOOL metadata (ggml-org#21428) * model-loader : fix GGUF bool array conversion * model-loader : fix remaining GGUF bool pointer uses * hexagon: slight optimization for argsort output init (ggml-org#21463) * sycl : handle other FA case (ggml-org#21377) * convert : set "add bos" == True for Gemma 4 (ggml-org#21500) * convert : set "add bos" == True for Gemma 4 * cont : handle old GGUFs * docs: add hunyuan-ocr gguf, also add test [no ci] (ggml-org#21490) * server : handle unsuccessful sink.write in chunked stream provider (ggml-org#21478) Check the return value of sink.write() in the chunked content provider and return false when the write fails, matching cpp-httplib's own streaming contract. This prevents logging chunks as sent when the sink rejected them and properly aborts the stream on connection failure.
* convert : fix block_ff_dim retrieval for lfm2 (ggml-org#21508) * vocab : add byte token handling to BPE detokenizer for Gemma4 (ggml-org#21488) * llama-bench: add `-fitc` and `-fitt` to arguments (ggml-org#21304) * llama-bench: add `-fitc` and `-fitt` to arguments * update README.md * address review comments * update compare-llama-bench.py * [CUDA] Write an optimized flash_attn_stream_k_fixup kernel (ggml-org#21159) * Write an optimized flash_attn_stream_k_fixup kernel Write a specialized and more optimized kernel for cases where nblocks_stream_k is a multiple of ntiles_dst. Make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst * Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs * Address review comments * Address review comments * Revert variable names to original * cli: fix stripping of \n in multiline input (ggml-org#21485) * llama-cli: fix stripping of \n in multiline input * Change & string to string_view * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix EditorConfig linter error --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * ggml: add Q1_0 1-bit quantization support (CPU) (ggml-org#21273) * ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU) * add generic fallback for x86 * remove Q1_0 (group size 32) * rename Q1_0_g128 => Q1_0 * fix Q1_0 LlamaFileType Enum * Fix trailing spaces; add generic fallback for other backends * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix /r/n spacing + arch-fallback --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * ggml-webgpu: Add the support of `MUL_MAT_ID` (ggml-org#21147) * Add mul_mat_id support to WebGPU * Apply suggestion from @reeselevine --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com> * docs: fix typo in build.md (emdawbwebgpu ->
emdawnwebgpu) (ggml-org#21518) * [SYCL] Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (ggml-org#21527) Extend the existing reorder optimization to Q8_0. The reorder separates scale factors from weight data for coalesced memory access -- was implemented for Q4_0/Q4_K/Q6_K but Q8_0 was missing. On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x) on Qwen3.5-27B. BW utilization: 21% -> 66%. The key fix beyond the kernels: Q8_0 was missing from the type check in ggml_backend_sycl_buffer_init_tensor() that allocates the extra struct carrying the reorder flag -- so the optimization was silently skipped. AI (Claude) was used to assist with root cause investigation and writing the kernel code. All code was human-reviewed and tested on real hardware. Fixes: ggml-org#21517 * Fix rtl text rendering (ggml-org#21382) * Fix Arabic RTL text rendering in web UI - Add dir='auto' attributes to markdown containers and blocks - Implement post-processing to add dir='auto' to all text elements - Replace directional CSS properties with logical properties for proper RTL list alignment - Ensure bidirectional text support for mixed Arabic/English content * Clean up commented duplicate function Remove the commented-out duplicate transformMdastNode function that was left over from refactoring. * Fix Arabic RTL text rendering in web UI - Add dir='auto' attributes to markdown containers and blocks - Implement post-processing to add dir='auto' to all text elements - Replace directional CSS properties with logical properties for proper RTL list alignment - Minor code formatting improvements This ensures bidirectional text support for mixed Arabic/English content in the llama.cpp web UI. 
* Implement rehype plugin for comprehensive RTL text support - Add rehypeRtlSupport plugin that applies dir='auto' to all elements with children - Replace DOMParser-based approach with efficient HAST tree processing - Remove hardcoded element lists for better maintainability - Ensure proper bidirectional text rendering for mixed RTL/LTR content * Fix RTL text rendering with rehype plugin and cleanup * fix: prettier formatting * fix: Detect streaming state in reasoning content blocks (ggml-org#21549) * ggml-cuda : fix CDNA2 compute capability constant for gfx90a (MI210) (ggml-org#21519) GGML_CUDA_CC_CDNA2 was set to 0x910 Fix by setting the constant to 0x90a to match the actual gfx90a ISA. * webui : store reasoning_content so it is sent back in subsequent requests (ggml-org#21249) * vulkan: add FA dequant for q4_1, q5_0, q5_1, iq4_nl (ggml-org#21029) Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL in the flash attention base shader. Register them in the shader generator, pipeline creation, and enable in the scalar/coopmat1 FA support check. * ggml: Vulkan build, Linux -- output error string for errno on fork failure (ggml-org#20868) (ggml-org#20904) * ggml : deprecate GGML_OP_ADD1 (ggml-org#21363) * ggml : deprecate GGML_OP_ADD1 * cont : remove tests * cont : re-enable vulkan check * server : fix restore for checkpoints with pos_min == 0 (ggml-org#21510) * llama: remove per-arch tensor name lists (ggml-org#21531) * unicode : add custom Qwen2 regex handler to fix segfault on long input (ggml-org#21257) * unicode : add custom Qwen2 regex handler to fix segfault on long input std::regex uses recursive backtracking internally, which causes a stack overflow (segfault) when tokenizing long sequences of repeated characters (e.g. 43K 'A's). The Qwen2 tokenizer regex differs from Llama3 only in the digit pattern (\p{N} vs \p{N}{1,3}), so it was falling through to the std::regex fallback path instead of using a custom handler. 
Add unicode_regex_split_custom_qwen2() following the established pattern used by gpt2, llama3, kimi_k2, and afmoe custom handlers. Closes: ggml-org#21113 * cont : remove TODO comment * cont : update comment to reflect original regex * use the correct regex in the comment this time... [no ci] --------- Co-authored-by: Aldehir Rojas <hello@alde.dev> * llama-server: fix model params not propagated (ggml-org#21509) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * CUDA: check for buffer overlap before fusing (ggml-org#21566) * CUDA: check for buffer overlap before fusing * use ggml_cuda_check_fusion_memory_ranges * ggml-webgpu: parameterize submission size and add iOS specific limits (ggml-org#21533) * Work towards removing bitcast * Move rest of existing types over * Add timeout back to wait and remove synchronous set_tensor/memset_tensor * move to unpackf16 for wider compatibility * cleanup * Remove deadlock condition in free_bufs * Start work on removing parameter buffer pools * Simplify and optimize further * simplify profile futures * Fix stride * Try using a single command buffer per batch * formatting * Add parameters for different browsers in-flight submissions * Update handling of batch size too * Throttle ios as much as possible * Increase timeout for llvm-pipe testing * kv-cache : support attention rotation for heterogeneous iSWA (ggml-org#21513) * kv-cache : support attention rotation for heterogeneous iSWA * cont : remove assert * gguf-py : fix missing comma after bad merge in tensor-mapping (ggml-org#21558) This commit adds a missing comma in the vision encoder attention qkv block. The motivation for this change is that without the comma there will be a string concatenation of the Kimi-K2.5 and the Nemotron Nano v2 VL tensor mappings which will be broken. 
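The missing-comma bug described in ggml-org#21558 comes from Python's implicit concatenation of adjacent string literals: dropping one comma inside a tuple silently merges two entries into one. The sketch below shows the effect; the tensor names are hypothetical placeholders, not the actual entries in the gguf-py tensor mapping.

```python
# Hypothetical tensor names, for illustration only.
with_missing_comma = (
    "vision.blocks.{bid}.attn.qkv_a"   # <- no comma after this literal
    "vision.blocks.{bid}.attn.qkv_b",
)

with_comma = (
    "vision.blocks.{bid}.attn.qkv_a",
    "vision.blocks.{bid}.attn.qkv_b",
)

# Adjacent string literals are joined at compile time, so the first tuple
# holds a single, fused (and useless) name instead of two separate ones.
print(len(with_missing_comma), with_missing_comma)
print(len(with_comma), with_comma)
```

Because the fused string is still a valid tuple element, nothing fails at import time; the problem only shows up when neither original name matches during tensor lookup.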
* ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (ggml-org#21168) * ds_read_b128 for q4_0 and q4_1 mmq kernels Current for loop generates ds_read_b32 instructions with hip compiler, the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, it's faster on both. * Vectorized lds load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for generic implementation * Explicit for loop in mmq, renamed vec into tmp * Fixed max_cpy usage in the loading loop * Fixed typo in q4_1 kernel * Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Removed trailing white line 500 * Update mmq.cuh removed other whitelines * Remove trailing whitespaces --------- Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: iacopPBK <iacop@deneb.com> * CUDA: make cuda graphs props check faster (ggml-org#21472) * CUDA: compute fast hash instead of expensive props check * use seen node * use memcp * devops: kleidiai: provide KleidiAI-Enabled ARM Release Artifact (ggml-org#21259) * Unified macOS release setup with strategy-matrix block * Added KleidiAI arm64 macOS release definition Change-Id: I05520889ffc646488a178d06817a17f29274465a Signed-off-by: Martin Klacer <martin.klacer@arm.com> * webui: fix syntax highlighting lost after streaming for non-common languages (ggml-org#21206) * webui: fix syntax highlighting lost for non-common languages after streaming rehype-highlight uses lowlight internally, which only bundles 37 "common" languages. The streaming code path uses highlight.js directly (192 languages), so languages like Haskell highlight correctly while streaming but lose all color once the code block closes.
Pass the full lowlight language set to rehype-highlight so both paths support the same languages. * webui: rebuild static files after rebase * model : support step3-vl-10b (ggml-org#21287) * feat: support step3-vl-10b * use fused QKV && mapping tensor in tensor_mapping.py * guard hardcoded params and drop crop metadata * get understand_projector_stride from global config * img_u8_resize_bilinear_to_f32 move in step3vl class * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix the \r\n mess * add width and heads to MmprojModel.set_gguf_parameters --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * chore: Remove legacy files (ggml-org#21606) * chore: Update labeler to have separate labels for `server/webui` and `server` changes (ggml-org#21567) * tests : remove obsolete .mjs script (ggml-org#21615) * parser: fix MiniMax handling (ggml-org#21573) * examples : disable cb_eval callback for --save-logits (ggml-org#21553) This commit updates the debug example to not create the base_callback_data. The motivation for this is that when using `--save-logits`, which is used by examples/model-conversion scripts, we often don't care about the tensor outputs and they just add noise to the output. This change makes `--save-logits` quiet by default; we can always remove --save-logits to get the tensor outputs when debugging.
* gemma : perform per-layer projections in the first layer (ggml-org#21612) * gemma : reduce graph splits by keeping per-layer ops in the input layer * gemma : put the per-layer proj in the first layer * cont : move the projection before the layer loop * metal: Q1_0 backend (ggml-org#21528) * initial Q1_0 Metal backend * tuning q1_0 metal kernels * add Q1_0 to test-backend-ops * add Q1_0<->F32 copy test * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * webgpu : Query for adapter support when registering WebGPU backend (ggml-org#21579) * kv-cache : extend cache quantization checks (ggml-org#21586) to also check for enabled flash attention, instead of just auto. * Propose fix a couple of typos (ggml-org#21581) Signed-off-by: John E <jeis4wpi@outlook.com> * webui : send both backend_sampling == false/true (ggml-org#18781) * webui : send both backend_sampling == false/true * feat: Parameter sync --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * vocab : remove </s> eog token if gemma4 (ggml-org#21492) * server: respect the ignore eos flag (ggml-org#21203) * fix: free ctx_copy in ggml_opt_free to plug per-training-session leak (ggml-org#21592) * fix: free ctx_copy in ggml_opt_free to plug per-training-session leak ggml_opt_alloc populates opt_ctx->ctx_copy via a free+init pair every time the allocated graph shape changes. The last ctx_copy from the final ggml_opt_alloc call survives until ggml_opt_free is invoked, but ggml_opt_free was only freeing ctx_static and ctx_cpu, never ctx_copy. Each opt_ctx lifetime therefore leaks the final per-batch context — ~900 KB for a typical GNN training session in sindarin-pkg-tensor, surfaced via AddressSanitizer. ctx_copy is nullptr-initialized and ggml_free() handles NULL safely, so the new release is guard-free. 
* Update ggml/src/ggml-opt.cpp Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: realorko <realorko@nowhere.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * CUDA: also store `node->src->data` ptrs for equality check (ggml-org#21635) * CUDA: also store node->src->data ptrs for equality check * address review comments * common : skip non-primary GGUF split files when selecting model (ggml-org#21633) We should not assume files are listed in order. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * vulkan: unify type macros to use Vx instead of _VECx (ggml-org#21605) * ci: drop v5 `all:` composition from labeler.yml (ggml-org#21627) actions/labeler@v6 removed the `all:` / `any:` composition keys. The `server/webui` and `server` entries used `all:` to combine `any-glob-to-any-file` with negated `all-globs-to-all-files`, which now errors on every PR with: Unknown config options were under "changed-files": all Flatten both entries to a single `any-glob-to-any-file`. PRs touching both webui and other server files will now receive both labels instead of only `server/webui`. Co-authored-by: Marxist-Leninist <noreply@users.noreply.github.com> * sycl : add flash-attn support for head size 512 (ggml-org#21654) * sycl : add flash-attn support for head size 512 This patch extends the SYCL Flash Attention implementation to support head sizes (DKQ/DV) of 512. Changes: - Added DKQ/DV 512 cases to both tile and vector Flash Attention kernels. - Updated kernel selection logic to allow vector kernels for head sizes up to 512 (previously 256). - Removed unused/redundant AMD and RDNA-specific configuration functions in `fattn-tile.hpp`. - Refactored `ggml_backend_sycl_buffer_init_tensor` to use a switch statement for clearer tensor extra buffer initialization. - Added necessary template instances for the new 512 head size across various quantization types. 
* remove defunct mxfp4 reorder from setting buffer type * webui: Add option to pre-encode conversation for faster next turns (ggml-org#21034) * server : fix grammar commandline args (ggml-org#21543) Co-authored-by: AUTOMATIC <-> * fix: Model Selector choice sync (ggml-org#21628) * metal : add missing mm-id specializations for q1_0 (ggml-org#21662) * jinja : support ensure_ascii=true, string repetition and int/float self-filtering (ggml-org#21623) * feat: jinja engine improvements for reka-edge Port three Jinja engine improvements needed for the reka-edge model: 1. Python-style string repetition ("ab" * 3 → "ababab") 2. ensure_ascii=true support for tojson filter (escapes non-ASCII to \uXXXX) 3. int() builtin on value_int_t (identity, needed for Reka Edge template) * fix: escape invalid utf8 bytes when ensure_ascii=true The json_ensure_ascii_preserving_format function does not correctly handle an edge case where if UTF-8 parsing fails, it adds the non-ascii character back to the output as a raw byte. This commit fixes that by adding the unicode standard replacement character \\ufffd to the output instead. This is the standard behavior for various programming languages like Python, Rust, Go, etc. * chore: address PR comments 1. Add todo comment for supporting string repetition for array/tuples 2. Add support for float identity operation 3. Move invalid ascii test case to test_fuzzing * chore: accept suggestion for common/jinja/value.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * vocab: add gemma4 tokenizer tests, fix edge case (ggml-org#21534) * YATF (Yet Another Tokenizer Fix) for Gemma 4. With tests! * Remove unnecessary hash from update script. 
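The Jinja engine changes listed above (ggml-org#21623) mirror behaviours that templates inherit from Python. As a reference point, the short sketch below shows the Python side only: string repetition, `ensure_ascii` escaping in JSON output, replacement of invalid UTF-8 with U+FFFD, and `int`/`float` acting as identity on values of their own type. It does not call into llama.cpp; the variable names are illustrative.

```python
import json

# 1. Python-style string repetition.
repeated = "ab" * 3

# 2. ensure_ascii=True escapes non-ASCII characters to \uXXXX sequences.
escaped = json.dumps("café", ensure_ascii=True)
unescaped = json.dumps("café", ensure_ascii=False)

# 3. Invalid UTF-8 decodes to the Unicode replacement character U+FFFD
#    when errors="replace" is requested, and is then escaped like any
#    other non-ASCII character.
replaced = b"\xff".decode("utf-8", errors="replace")
replaced_json = json.dumps(replaced, ensure_ascii=True)

# 4. int() and float() are identity operations on int and float values.
same_int = int(7)
same_float = float(1.5)

print(repeated, escaped, unescaped, replaced_json, same_int, same_float)
```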
* minor: move constant * mtmd: support dots.ocr (ggml-org#17575) * convert gguf * clip impl * fix conversion * wip * corrections * update docs * add gguf to test script * model: fix multimodal padding token for gemma3n/gemma4 (ggml-org#21625) * model: fix multimodal padding token for gemma3n/gemma4 * nits * common : simplify autoparser tagged parser rules (ggml-org#21216) * common : simplify autoparser tagged parser rules * cont : remove upper limit on optional args * cont : revert changes to parsing at the end * cont : undo arbitrary ordering of optional args * cont : fix uninitialized required parameters * revert to simplify merge * re-apply patches * restore flexible optional arg ordering tests * common : fix ambiguous grammar rule in gemma4 (ggml-org#21661) * common : fix ambiguous grammar rule in gemma4 * cont : fix missing comma... * webui: add "Send message on Enter" setting (ggml-org#21577) * webui: make Enter to send chat a setting * Shorten description * Use isMobile hook from $lib/hooks * Rebuild static output * requirements : update transformers to 5.5.1 (ggml-org#21617) * requirements : update transformers to 5.5.0 This commit updates the transformers dependency to version 5.5.0. The motivation for this is that transformers 5.5.0 includes support for Gemma4 and is required to be able to convert Gemma4 models. This is also causing issues for user of gguf-my-repo. Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/202 * fix huggingface_hub version * set version of transformers to 5.5.0 * convert : add ty ignore directives to convert_hf_to_gguf.py This commit adds `ty: ignore` directives to transformers tokenizers field/methods to avoid type check errors. There might be better ways to handle this and perhaps this can be done in a follow up commit. 
The motivation for this is that it looks like in transformers 5.5.0 AutoTokenizer.from_pretrained can return generic tokenizer types or None and the type checker now produces an error when the conversion script accesses fields like tokenizer.vocab. * convert : add ty ignore to suppress type check errors * convert : remove incorrect type ignores * convert : fix remaining python checks I was running a newer version of ty locally but I've switched to version 0.0.26 which is what CI uses and I was then able to reproduce the errors. Sorry about the noise. * update transformers version to 5.5.1 * ggml : check return value of CUB calls used in argsort and top-k (they all return cudaError_t) (ggml-org#21676) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * ggml: backend-agnostic tensor parallelism (experimental) (ggml-org#19378) * ggml: backend-agnostic tensor parallelism * support for GPT-OSS, Qwen 3 MoE * partial Vulkan fix * add support for 4/8 GPUs * unconditional peer access * re-use buffers + ggml contexts * fix output pattern * NCCL support * GGML: HIP: add RCCL support * Remove shfl and AllReduce from backend interface * move allocation workaround out of ggml-alloc.c * 2d tensor set/get support * Fix the seg fault without NCCL * Apply suggestion from JohannesGaessler * support for tensor dims % n_devs != 0 * fix view_offs scaling * arbitrary num. of GPUs/tensor split * fix compilation * better granularity estimate * Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors. * partial Qwen 3 Next support * Fix qwen3 30b (ggml-org#8) * Fix crash with Qwen-30B-A3B Q4_0 Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
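The arithmetic behind the uneven split can be sketched as follows. This is an illustration under stated assumptions, not the meta backend's actual code: it assumes two GPUs (the commit message does not say how many) and uses a hypothetical helper `split_rows` that hands out whole granularity-sized blocks as evenly as possible. The finer granularity of 32 is chosen because Q4_0 quantizes in blocks of 32 values.

```python
def split_rows(dim: int, granularity: int, n_devs: int) -> list[int]:
    """Hypothetical helper: distribute `dim` rows over `n_devs` devices
    in whole blocks of `granularity` rows, as evenly as possible."""
    if dim % granularity != 0:
        raise ValueError("dim must be a multiple of granularity")
    n_blocks = dim // granularity
    base, extra = divmod(n_blocks, n_devs)
    # The first `extra` devices receive one additional block each.
    return [(base + (1 if i < extra else 0)) * granularity for i in range(n_devs)]


# 768 / 256 = 3 blocks, which cannot be shared equally by 2 devices.
coarse = split_rows(768, 256, 2)

# 768 / 32 = 24 blocks, which split evenly into 12 + 12.
fine = split_rows(768, 32, 2)

print(coarse, fine)
```

With the coarse granularity one device ends up with twice the rows of the other, which is the uneven case the commit describes; deriving the granularity from the quantization block size avoids it for this tensor shape.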
* Decide block size based on tensor quantization type * Fix crashes due to KV cache serialization (ggml-org#9) KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset. * metal : fix build (ggml-org#7) * static memory allocations, fix usage count * fix tensor granularity * more even memory distribution * use BF16 for allreduce * rebase fixup * better error message for unsupported architectures * Fix device mismatch during scatter of allReduce. (ggml-org#11) There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies * Enable the previous allreduce implementation. It is better in both perf and stability (ggml-org#12) * delay AllReduce for Moe for less I/O * build : clean-up compile warnings * backend : move most of the meta backend API to ggml-backend-impl.h * cont : hide unused public API in the implementation * llama : use llama_device + remove ggml_backend_dev_is_meta() * ggml-backend : remove unused alloc include * minor : remove regex include * ggml : introduce ggml-ext.h for staging new APIs * rebase fixup * fix tests * llama : more robust logic for determining Meta devices (ggml-org#16) * llama : more robust logic for determining Meta devices * cont : fix devs size check Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cont : fix log type Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * disable roundtrip for meta backend * fix arch selection * Qwen 3.5 support * fix Gemma 4 MoE * fix OpenVino, SYCL * fix test-llama-archs for CPU-only builds * Fix Qwen 3.5 MoE * disable meta backend tests for WebGPU * tests : filter CPU-based devices from the Meta backend tests (ggml-org#17) * meta : formatting, naming, indentation (ggml-org#18) * formatting : llama-model.cpp * formatting : ggml-ext.h * formatting : ggml-backend-meta.cpp * meta : add TODO * add documentation * better 
error messages * fix GPT-OSS --------- Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * HIP: add CDNA4 (gfx950) architecture support for MI350X/MI355X (ggml-org#21570) Add AMD Instinct MI350X/MI355X (gfx950, CDNA4) support: - vendors/hip.h: Add CDNA4 preprocessor define for __gfx950__ - common.cuh: Add GGML_CUDA_CC_CDNA4 and GGML_CUDA_CC_IS_CDNA4 macros - mma.cuh: Route CDNA4 to compatible MFMA instructions: * f32 matmul: mfma_f32_16x16x4f32 (xf32 variant unavailable on gfx950) * bf16 matmul: mfma_f32_16x16x16bf16_1k (same as CDNA3) * int8 matmul: mfma_i32_16x16x32_i8/32x32x16 (same as CDNA3) - mmq.cuh: Include CDNA4 in stream-k kernel dispatch CDNA4 is largely compatible with CDNA3 except: - No xf32 MFMA (mfma_f32_16x16x8_xf32) — routes to f32 path - Different FP8 format (e4m3fn vs e4m3_fnuz) — not changed here Tested on AMD Instinct MI355X (gfx950), ROCm 7.0.1: - Build: compiles cleanly with -DAMDGPU_TARGETS=gfx950 - llama-bench (Qwen2.5-1.5B Q4_K_M, single GPU): * f16+FA: 40,013 tok/s prefill, 254 tok/s decode * q8_0+FA: functional - Flash attention: works correctly - MMQ: works correctly with stream-k dispatch Co-authored-by: Andy Luo <andyluo7@users.noreply.github.com> * CUDA: fuse muls (ggml-org#21665) * common : add fluidity to the progress bar (ggml-org#21671) Signed-off-by: Adrien Gallouët <angt@huggingface.co> * vulkan: Support Q1_0 (ggml-org#21539) * vulkan: Support Q1_0 * use get_dm * docs : fix broken link to ggml-openvino in OPENVINO.md (ggml-org#21709) * common : enable reasoning budget sampler for gemma4 (ggml-org#21697) * fix: enable reasoning budget sampler for gemma4 Add thinking_start_tag and thinking_end_tag to common_chat_params_init_gemma4(). Without these, the reasoning budget sampler never activates for gemma4. Make the newline after "thought" optional in the PEG parser to handle budget=0 (sampler forces end tag before the newline). 
Add test case for empty thinking block. Fixes ggml-org#21487 * use p.space() instead of p.optional(p.literal("\n")) in gemma4 thought parser * webui: Static build output improvements (ggml-org#21667) * refactor: Build improvements * chore: Formatting + package lock update * common: mark --split-mode tensor as experimental (ggml-org#21684) * common : fix when loading a cached HF models with unavailable API (ggml-org#21670) Signed-off-by: Adrien Gallouët <angt@huggingface.co> * server : ignore --alias when using --models-preset (ggml-org#21380) I'm not sure what the purpose of keeping `--alias` was when using `--models-preset`, but the result is really weird, as shown in the following logs: $ build/bin/llama-server --models-preset preset.ini --alias "Gemma 4 E4B UD Q8_K_XL" ... init: using 31 threads for HTTP server srv load_models: Loaded 2 cached model presets srv load_models: Loaded 1 custom model presets from preset.ini main: failed to initialize router models: alias 'Gemma 4 E4B UD Q8_K_XL' for model 'angt/test-split-model-stories260K:F32' conflicts with existing model name So I propose to simply ignore `--alias` too in this case. With this commit, the server starts in routing mode correctly. 
Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml-webgpu: address quantization precision and backend lifecycle managment (ggml-org#21521) * ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after ggml-org#20618, and remove the busy webgpu log * Merge with upstream * Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants * Update Unary wgsl EXP and EXPM1 for f16 stability * Fix GET_ROWS IQ4_XS strcut for NaN f16 canonicalization * Fix numerical percision for unary sqrt when working with f16 * Fix NaN canonicalization for packed integers using f16 * Update err threshold for binary div ops when using f16 * backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend * clean: uncomment existing code logs * clean: clean the unncessary debug info * Refactor and generalize dequant helpers * Remove deprecated quant structs * Refactor shader defines to reduce repetition * Remove error override for F16 type * fix: fix the accidential removal of the proper initialization of ctx * clean: clean legacy and format code * fix: did not modify tests ops --------- Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv> * ggml-webgpu: support non-square subgroup matrix configs for Intel GPUs (ggml-org#21669) * model : make Gemma 4 shared-KV tail attn_k tensors optional on load (ggml-org#21739) * common : add callback interface for download progress (ggml-org#21735) Signed-off-by: Adrien Gallouët <angt@huggingface.co> * common : better align to the updated official gemma4 template (ggml-org#21704) * hexagon: improved Op queuing, buffer and cache management (ggml-org#21705) * hexagon: introduce op request batching and rewrite buffer managment The host now prepares batches of requests and dispatches them via a single dspqueue message. Buffers are mapped explicitly by NPU while processing batches. 
* hex-dma: disable l2 bypass since to work around new issue due to no flushes between Ops * hex-utils: add explicit l2flush and l2clear helpers * hex-opreq: use fine-grain per tensor l2 management * hex-opreq: avoid redundant invalidates for tensors we already flushed * hex-opreq: update debug messages * htp-opreq: reuse ops_context * hex-opreq: do not flush or invalidate cache lines beyond buffer boundry * hex-opreq: fix errors in log message * Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundry" This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d. * hexagon: limit l2 flushes to 1MB which covers l2 cache * hex-opreq: limit cache flush to 4MB Looks like 4MB cont. vitual space should cover the 1MB cache. * hexagon: drop cache flush size to 2MB * hex-opreq: start reworking opreq packing * hex-opreq: introduce new way of packing opbatch where tensors are stored separately * hex-opreq: add a simple fastrpc call to force unmap all buffers * hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size * hex-opreq: bump opreq batch size to 256 * hex-mm: place src1 spad at the top of vtcm for easy reuse * hex-ops: introduce internal types and disable src1 reuse for now Nothing new just formalizing the repack / qyn.quant types we've been using. * htp-opreq: use tensor pointers instead of copies * hex-opreq: introduce more robust way for tracking vtcm/spad reuse This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops. * hex-cumsum: fix error post opreq merge * hex-opreq: move request batch handling into the session Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner. 
* hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx * hex-bufs: introduce pinned mmapings and use non-pinned ones for model buffers * hex-buf: add support for allocating shared/pinned buffer for opreqs * hex-opbatch: make opbatches configurable * hex-naming: better name for ggml_hexagon_shared_buffer * hex-naming: add session->c_name() helper * hex-opbatch: start using shm but still copy for now * hex-opbatch: use shared buffer for packing opbatch * hex-opbatch: beter naming for opbatch related classes and code * hex-opbatch: reuse batched tensors with same data/dims/strides * hex-opbatch: update logging * hex-opbatch: add support for vmem limit for op batching * hex-opbatch: update htp side to properly support dynamic mmap/unmap * hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing * hex-opbatch: fixed src1 handling in act ops * hex-act: fix empty src1 handling in swiglu and friends Simplify preamble macro while at it * hex-mm: minor fix vtcm and dma handling in matmul cleaning up some left-overs from merges * hex-opbatch: allocate extra 1KB for dspqueue overhead * hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc * hex-mm: properly handle hmx_disabled flag * hex-ops: update comments * hex-ops: add debug output for get/set-rows * hex-mmap: optimize un/mapping of buffers * hex-opreq: global cache flush and invalidate beyond 128KB threshold * hex-ops: add super simple opfilter regex for debugging If an Op matches the regex hex backend will reject it. * hex-opbatch: wireup newer ops missed in merge and update main switch to detect this in future * hexagon: improved vtcm acquision to remove inter-op overhead Fully compatible with QNN-HTP coex * hex-mm: fixed hvx fallback path * hex-mm: lower the vmem threshold a bit further to ~3GB * hexagon: update debug & error logs This also fixes an issue with newer llvm merging repack and non-repack functions. 
We use those pointer to distinguish between buffer types. * hexagon: move ops context into main context Just a cleanup. We don't need separate contexts at this point. * hex-opbatch: cleanup naming and headers for opbatch and related descriptors * hex-fa: it's now better to enable FA during TG to reduce graph splits * hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var It's no longer useful. Please use more flexible GGML_HEXAGON_OPFILTER to disable Ops if needed for debugging or validation. * hexagon: fixed editorconfig check * Update ggml/src/ggml-hexagon/ggml-hexagon.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * hexagon: add support for linux on snapdragon (ggml-org#21707) * hexagon: add support for debian on ex2 * hexagon: add -fvectotize to c/c++ cmake flags * hexagon: remove trailing white space * update onboarding steps * hexagon: update linux setup documentation * hexagon: update intallation scripts * Hexagon: update docs * hexagon: update onboarding scripts --------- Co-authored-by: Zack Li <zackli@qti.qualcomm.com> * fix: Fix broken structured output when using $refs in json_schema (ggml-org#21699) * CUDA: also store node->src ne/nb for graph equality (ggml-org#21736) * py : Bump typer to latest to fix huggingface_hub issue (ggml-org#21701) * ggml : fix a few instances of missing GGML_TYPE_Q1_0 cases (ggml-org#21716) * TP: fix Qwen 3 Next data split (ggml-org#21732) * opencl: add basic support for q5_k (ggml-org#21593) * opencl: add general q5_k mv * opencl: add flattened Q5_K mv and general Q5_K mm * opencl: fix Q5_K unit tests --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co> Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Signed-off-by: Martin Klacer <martin.klacer@arm.com> Signed-off-by: John E <jeis4wpi@outlook.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Ruben 
Ortlam <rortlam@redhat.com> Co-authored-by: Zheyuan Chen <sephirotheca17@gmail.com> Co-authored-by: Reese Levine <reeselevine1@gmail.com> Co-authored-by: Bartowski <3266127+bartowski1182@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> Co-authored-by: Slobodan Josic <127323561+slojosic-amd@users.noreply.github.com> Co-authored-by: Vishal Singh <vishal@zettabolt.com> Co-authored-by: Aaron Teo <taronaeo@gmail.com> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com> Co-authored-by: sayap <sokann@gmail.com> Co-authored-by: Tillerino <Tillerino@users.noreply.github.com> Co-authored-by: uvos <carl@uvos.xyz> Co-authored-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: jeromew <jerome.wagner@m4x.org> Co-authored-by: M1DNYT3 <42499082+M1DNYT3@users.noreply.github.com> Co-authored-by: M1DNYT3 <m1dnyt3@MacBookPro.lan> Co-authored-by: CISC <CISC@users.noreply.github.com> Co-authored-by: Samanvya Tripathi <samanu09@gmail.com> Co-authored-by: Yes You Can Have Your Own <188969017+yychyo@users.noreply.github.com> Co-authored-by: Masato Nakasaka <masato.nakasaka@intel.com> Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: SamareshSingh <97642706+ssam18@users.noreply.github.com> Co-authored-by: Adrien Gallouët <angt@huggingface.co> Co-authored-by: Dan Hoffman <43101339+thedanhoffman@users.noreply.github.com> Co-authored-by: Dan Hoffman <dhoffman@cyket.net> Co-authored-by: Aldehir Rojas <hello@alde.dev> Co-authored-by: Nicholas Sparks <157740354+nisparks@users.noreply.github.com> Co-authored-by: ddh0 <chemist-mulches-39@icloud.com> Co-authored-by: Ludovic Henry <ludovic@rivosinc.com> Co-authored-by: Richard Davison <richard.davison1@gmail.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> Co-authored-by: anchortense <daniel.redshaw@uqconnect.edu.au> Co-authored-by: Yarden Tal 
<yardent@qti.qualcomm.com> Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> Co-authored-by: lainon1 <271530700+lainon1@users.noreply.github.com> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> Co-authored-by: Bipin Yadav <83943505+bipinyadav3175@users.noreply.github.com> Co-authored-by: Pasha Khosravi <khosravipasha@users.noreply.github.com> Co-authored-by: Masashi Yoshimura <yoshimura.masashi.frbs@gmail.com> Co-authored-by: Dmytro Romanov <casteldazur@gmail.com> Co-authored-by: PMZFX <georgiopapairo@gmail.com> Co-authored-by: Kabir08 <62639358+Kabir08@users.noreply.github.com> Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> Co-authored-by: Antoine Viallon <antoine@lesviallon.fr> Co-authored-by: mkoker <132301062+mkoker@users.noreply.github.com> Co-authored-by: Tom Overlund <tomov@dilacero.org> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Son H. Nguyen <33925625+nhs000@users.noreply.github.com> Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: iacopPBK <iacopogiottorossi@gmail.com> Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com> Co-authored-by: iacopPBK <iacop@deneb.com> Co-authored-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Hamish M. 
Blair <hmblair@stanford.edu> Co-authored-by: forforever73 <63285796+forforever73@users.noreply.github.com> Co-authored-by: Erik Scholz <Green-Sky@users.noreply.github.com> Co-authored-by: John Eismeier <42679190+jeis4wpi@users.noreply.github.com> Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com> Co-authored-by: RealOrko <45273739+RealOrko@users.noreply.github.com> Co-authored-by: realorko <realorko@nowhere.com> Co-authored-by: Marxist-Leninist <31905382+Marxist-Leninist@users.noreply.github.com> Co-authored-by: Marxist-Leninist <noreply@users.noreply.github.com> Co-authored-by: Akarshan Biswas <akarshan@menlo.ai> Co-authored-by: AUTOMATIC1111 <16777216c@gmail.com> Co-authored-by: Kwa Jie Hao <31984694+kwajiehao@users.noreply.github.com> Co-authored-by: JvM <mourix@live.nl> Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com> Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: andyluo7 <43718156+andyluo7@users.noreply.github.com> Co-authored-by: Andy Luo <andyluo7@users.noreply.github.com> Co-authored-by: Jeff Bolz <jbolz@nvidia.com> Co-authored-by: Belem Zhang <belem.zhang@intel.com> Co-authored-by: Berk Idem <55372926+berkidem@users.noreply.github.com> Co-authored-by: Chen Yuan <constant.chen@uwaterloo.ca> Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv> Co-authored-by: Rithik Sharma <rithiksh02@gmail.com> Co-authored-by: MoonRide303 <130458190+MoonRide303@users.noreply.github.com> Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com> Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com> Co-authored-by: Zack Li <zackli@qti.qualcomm.com> Co-authored-by: Galunid <karolek1231456@gmail.com> Co-authored-by: shaofeiqi <shaoqi@qti.qualcomm.com>
Overview
This PR adds basic Q5_K quantization support to the OpenCL backend. With this change, Q5_K operations stay on the GPU instead of falling back to the CPU, which improves performance for models that use Q5_K quantization.
This is a general implementation. A follow‑up PR will introduce a more optimized, Adreno‑specific implementation.
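For readers unfamiliar with the format, here is a minimal Python sketch of how one Q5_K super-block can be dequantized. It is not the OpenCL kernel from this PR. It assumes the block layout used by ggml's CPU reference path: two fp16 values (`d`, `dmin`), 12 bytes of packed 6-bit scales and mins, 32 bytes of high bits, and 128 bytes of low nibbles, covering 256 weights. The names `scale_min` and `dequantize_block` are illustrative and do not exist in the codebase.

```python
import struct

QK_K = 256          # weights per super-block (assumed, as in ggml)
K_SCALE_SIZE = 12   # bytes holding 8 six-bit scales and 8 six-bit mins
BLOCK_BYTES = 2 + 2 + K_SCALE_SIZE + QK_K // 8 + QK_K // 2  # 176 bytes


def scale_min(j, scales):
    """Unpack the 6-bit scale and 6-bit min of sub-block j (0..7)."""
    if j < 4:
        return scales[j] & 63, scales[j + 4] & 63
    sc = (scales[j + 4] & 0x0F) | ((scales[j - 4] >> 6) << 4)
    mn = (scales[j + 4] >> 4) | ((scales[j] >> 6) << 4)
    return sc, mn


def dequantize_block(block):
    """Return the 256 float weights encoded in one 176-byte block."""
    assert len(block) == BLOCK_BYTES
    d, dmin = struct.unpack_from("<ee", block, 0)  # fp16 super-block scales
    scales = block[4:16]
    qh = block[16:48]    # fifth (high) bit of every weight
    qs = block[48:176]   # low four bits, two weights per byte
    out = []
    for group in range(4):  # each pass yields two sub-blocks of 32 weights
        ql = qs[32 * group: 32 * group + 32]
        sc1, m1 = scale_min(2 * group, scales)
        sc2, m2 = scale_min(2 * group + 1, scales)
        bit1 = 1 << (2 * group)  # high-bit mask for the low-nibble weights
        bit2 = 2 << (2 * group)  # high-bit mask for the high-nibble weights
        for l in range(32):
            q = (ql[l] & 0x0F) + (16 if qh[l] & bit1 else 0)
            out.append(d * sc1 * q - dmin * m1)
        for l in range(32):
            q = (ql[l] >> 4) + (16 if qh[l] & bit2 else 0)
            out.append(d * sc2 * q - dmin * m2)
    return out
```

Under this layout a block stores 256 weights in 176 bytes, or 5.5 bits per weight. A GPU kernel has to perform the same unpacking per work-item, which is why a dedicated matrix-vector and matrix-matrix path is needed to keep Q5_K tensors off the CPU.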
Additional information
With Qwen3.5-9B-Q5_K_M.gguf on Snapdragon 8 Elite Gen 5:
master: (benchmark output not preserved)
this PR: (benchmark output not preserved)
Requirements