Upgrade llama.cpp to b9106 by bernardladenthin · Pull Request #125 · bernardladenthin/java-llama.cpp

bernardladenthin · 2026-05-11T16:00:15Z

Summary

Upgrade the pinned llama.cpp version from b9103 to b9106, incorporating improvements to Vulkan flash attention, CUDA argsort, and Mistral Medium 3.5 model conversion support.

Changes

Vulkan flash attention refactor: The pipeline_flash_attn_f32_f16 pipeline changed from a per-type array of maps to a single map. Mixed K/V quantization types (e.g., Q4_0 K + F16 V) are now supported across all Vulkan FA paths (scalar, cm1, cm2) instead of only coopmat2. Per-type SPIR-V variants were replaced with two generic modules (flash_attn_f32_f16 and flash_attn_f32_f16_int8) that select K/V type at runtime via spec constants. A new flash_attn_dequant.glsl shader provides aliased SSBO views and unified dequantization logic.
CUDA argsort enhancement: Added #include <cuda/iterator> to support the strided-iterator path in CCCL ≥ 3.1.
Mistral Medium 3.5 conversion support: Updated convert_hf_to_gguf.py to read the "dim" key instead of "hidden_dim" for n_embd_text, and added resolution of negative img_break_tok_id placeholders from tekken.json or tokenizer.json.
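
The conversion change above can be illustrated with a minimal sketch. This is not the code in convert_hf_to_gguf.py: the function names, the "[IMG_BREAK]" token string, and the assumed JSON layouts (a "special_tokens" list with "token_str"/"rank" entries for tekken.json, an "added_tokens" list with "content"/"id" entries for tokenizer.json) are assumptions made for illustration only.

```python
from typing import Any, Dict, Optional

# Assumed token string for the image-break token (hypothetical, for illustration).
IMG_BREAK_TOKEN = "[IMG_BREAK]"


def read_n_embd_text(text_params: Dict[str, Any]) -> int:
    """Read the text embedding width from the "dim" key, not "hidden_dim"."""
    return int(text_params["dim"])


def resolve_img_break_tok_id(
    configured_id: int,
    tekken: Optional[Dict[str, Any]] = None,
    tokenizer: Optional[Dict[str, Any]] = None,
) -> int:
    """Return a usable token id, resolving a negative placeholder if needed.

    A non-negative id is returned unchanged. A negative id is treated as a
    placeholder and looked up first in the tekken.json data, then in the
    tokenizer.json data (both layouts are assumptions, see the lead-in).
    """
    if configured_id >= 0:
        return configured_id

    if tekken is not None:
        for entry in tekken.get("special_tokens", []):
            if entry.get("token_str") == IMG_BREAK_TOKEN:
                return int(entry["rank"])

    if tokenizer is not None:
        for entry in tokenizer.get("added_tokens", []):
            if entry.get("content") == IMG_BREAK_TOKEN:
                return int(entry["id"])

    raise ValueError("could not resolve img_break_tok_id placeholder")
```

Raising an error when neither file contains the token is a design choice of this sketch; the upstream script may handle that case differently.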

Notes

All changes are internal backend improvements and conversion tool enhancements. No Java API changes are required.

https://claude.ai/code/session_013S1x9KhvBL6Pr3xuYoTtKh

b9103→b9106 is a Vulkan flash-attention internal refactor (no project API changes):
- pipeline_flash_attn_f32_f16 collapsed from per-type array to single map
- Mixed K/V quant types now supported on all Vulkan FA paths (scalar/cm1/cm2)
- Per-type SPIR-V variants replaced by generic modules + FaTypeK/FaTypeV spec constants
- new flash_attn_dequant.glsl with aliased SSBO views and uber dequantize4() switch
- Minor CUDA argsort fix (#include <cuda/iterator> for CCCL >= 3.1)
- convert_hf_to_gguf.py Mistral Medium 3.5 mmproj support

No changes to common/, include/llama.h, server sources, or JNI layer.

https://claude.ai/code/session_013S1x9KhvBL6Pr3xuYoTtKh

claude · 2026-05-11T16:01:05Z

Review Summary

✅ Version bump executed correctly — All three required files updated per CLAUDE.md upgrade procedure:

CMakeLists.txt: GIT_TAG b9103 → b9106
README.md: badge and link updated
CLAUDE.md: pinned version and breaking changes table updated
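
A check like the one described above could be automated. The following is a minimal sketch, assuming only that each of the three files mentions the new tag somewhere in its text; the helper names are hypothetical and are not part of this repository.

```python
import re
from typing import Dict, List, Optional


def cmake_git_tag(cmake_text: str) -> Optional[str]:
    """Extract the tag that follows GIT_TAG in CMakeLists.txt text, if any."""
    match = re.search(r"GIT_TAG\s+(b\d+)", cmake_text)
    return match.group(1) if match else None


def files_missing_tag(files: Dict[str, str], expected: str) -> List[str]:
    """Return the names of files whose text never mentions the expected tag.

    Only the presence of the new tag is checked; an older tag may still
    legitimately appear, for example in a change-history table.
    """
    pattern = re.compile(r"\b" + re.escape(expected) + r"\b")
    return sorted(name for name, text in files.items() if not pattern.search(text))
```

For example, passing the contents of CMakeLists.txt, README.md and CLAUDE.md with expected="b9106" would return an empty list once all three files have been updated.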

✅ Upstream changes properly documented — The three new rows in the breaking changes table (Vulkan flash attention refactor, CUDA argsort fix, Mistral Medium 3.5 mmproj support) accurately reflect that all changes are internal to GGML/CUDA backends or conversion tooling with no Java/JNI/server-API impact.

✅ No architectural changes — The b9103→b9106 delta touches only:

Internal GPU backend optimizations (Vulkan, CUDA)
Model conversion tooling (convert_hf_to_gguf.py)
None of these affect the compiled upstream server sources or the project's C++ headers.

Ready to merge — This is a low-risk version bump with proper documentation.

bernardladenthin merged commit 62274ca into main May 11, 2026
8 of 13 checks passed

bernardladenthin deleted the claude/update-b9106-compatibility-SPPw1 branch May 11, 2026 16:00
