Conversation

@kpouget (Collaborator) commented Aug 29, 2025

Summary by CodeRabbit

  • New Features
    • Added diffusion text-generation CLI and examples.
    • Introduced Granite and GPT-OSS chat formats with template kwargs and BOS/EOS controls.
    • New model-conversion toolkit with logits/embeddings verification, quantization, and HF upload helpers.
    • WebGPU (Dawn) build paths added; CANN runtime images introduced.
  • Bug Fixes
    • More robust tool-call argument parsing.
    • Safer tokenization and eval-callback handling.
    • Improved embeddings handling for SEP/EOS cases.
  • Documentation
    • New GGML ops matrix, expanded build guides (s390x, WebGPU), multimodal docs, and README updates.
  • Chores
    • CI/CD workflows added/updated; Vulkan/ROCm/MUSA versions refreshed; Makefile deprecated in favor of CMake.

lhez and others added 30 commits August 1, 2025 13:15
* support hunyuan_v1_dense

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* update hunyuan_moe to hunyuan_v1_moe

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* fix rope alpha assert and bos token

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* add blank line

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* Revert "update hunyuan_moe to hunyuan_v1_moe"

This reverts commit aa973ca21913aba77f6e81a935270ef7be222e75.

* use hunyuan_dense instead of hunyuan_v1_dense

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* fix hunyuan_moe chat template

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* remove leftover code

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* update hunyuan dense chat template

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

* fix hunyuan dense vocab and chat template

Signed-off-by: stevenkuang <stevenkuang@tencent.com>

---------

Signed-off-by: stevenkuang <stevenkuang@tencent.com>
* vendor : update vendored copy of google/minja

Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>

* Re-remove trailing whitespace

Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>

* Remove another trailing whitespace

Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>

---------

Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>
* vulkan: optimizations for direct convolution

- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill
  the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat.
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply fastdiv opt.
- Disable shuffles for NV.

* Three tiles sizes for CONV_2D, and a heuristic to choose

* reallow collectives for pre-Turing

* make SHMEM_PAD a spec constant

* fixes for intel perf - no shmem padding, placeholder shader core count

* shader variants with/without unrolling

* 0cc4m's fixes for AMD perf

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it
  interacts with split_k
- Allow larger/non-power of two split_k, and make the splits a multiple of 256
- Use split_k==3 when more than 1/2 but at most 2/3 of the SMs would have been used (see the sketch below)
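A minimal sketch of how such a heuristic could be expressed; `workgroups_unsplit` and `num_sms` are illustrative names, not the actual Vulkan backend variables:

```cpp
#include <cstdint>

// Hypothetical split_k choice: if running the matmul unsplit would occupy more
// than 1/2 but at most 2/3 of the streaming multiprocessors, splitting the K
// dimension three ways roughly fills the GPU.
static uint32_t pick_split_k(uint32_t workgroups_unsplit, uint32_t num_sms) {
    if (workgroups_unsplit > num_sms / 2 && workgroups_unsplit <= (2 * num_sms) / 3) {
        return 3;
    }
    return 1; // defer to the other tile-size/split_k heuristics
}
```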
* torch is not required for convert_hf_to_gguf_update

* add --check-missing parameter

* check that pre-tokenizer hashes are up-to-date
* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1

ggml-ci

* cont : fix cont types

ggml-ci

* cont : adopt variable names and comment from the other branch
…5040)

This commit removes the right alignment of the `n_stream` value in the
log message in the `llama_kv_cache_unified` constructor.

The motivation for this change is to enhance the readability of the log
message. Currently the output looks like this:
```console
llama_kv_cache_unified: size = 2048.00 MiB (  4096 cells,  32 layers,  1/ 1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
```
Notice that the `n_stream` value is right aligned, which makes it a
little harder to read.

With the change in this commit, the output will look like this:
```console
llama_kv_cache_unified: size = 2048.00 MiB (  4096 cells,  32 layers, 1/1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
```
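A self-contained illustration of the kind of format-string change involved (the real log call in `llama_kv_cache_unified` differs; the width-2 field here is an assumption):

```cpp
#include <cstdio>

int main() {
    const int n_stream  = 1;
    const int n_seq_max = 1;
    std::printf("%2d/%2d seqs\n", n_stream, n_seq_max); // right-aligned: " 1/ 1 seqs"
    std::printf("%d/%d seqs\n",   n_stream, n_seq_max); // no padding:    "1/1 seqs"
    return 0;
}
```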
… text_config) (#15051)

* basic kimi-vl textmodel conversion

* check config["text_config"] for special tokens
…(#14994)

* imatrix : use a single count for dense 3d tensors

* imatrix : fix 3d activations when model tensor is 2d

* imatrix : fix 3d tensor counts
* imatrix : use GGUF by default

* imatrix : use GGUF regardless of the output filename

The legacy format can only be produced with --output-format dat
* Add parameter buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow
* model: Add GLM 4.5 (#14921)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Merge in PR suggestions

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model: Add GLM 4.5 family of models (#14921)

1. Updated tensor_mapping.py with NextN tensor mappings

- Added proper tensor mappings for all NextN/MTP tensors in /Users/samm/git/llama.cpp/gguf-py/gguf/tensor_mapping.py
- Added mappings for: eh_proj, embed_tokens, enorm, hnorm, shared_head.head, shared_head.norm

2. Added num_nextn_predict_layers configuration

- Added LLM_KV_NUM_NEXTN_PREDICT_LAYERS constant to llama-arch.h and llama-arch.cpp
- Added num_nextn_predict_layers field to llama_hparams struct
- Updated GLM4_MOE parameter loading in llama-model.cpp to read this parameter (see the sketch after this list)
- Modified tensor loading logic to conditionally load NextN tensors based on num_nextn_predict_layers
- Added GGUF writer support in gguf_writer.py with add_num_nextn_predict_layers() method
- Updated conversion script to extract and write this parameter from HuggingFace config

3. Added FIM tokens for GLM4_MOE

- Added GLM-4.5's FIM tokens to llama-vocab.cpp:
  - <|code_prefix|> for FIM_PRE
  - <|code_suffix|> for FIM_SUF
  - <|code_middle|> for FIM_MID

4. Removed manual NextN tensor handling

- Removed the special-case handling in convert_hf_to_gguf.py that manually mapped NextN tensors
- NextN tensors are now handled automatically through the proper tensor mapping system
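As a rough illustration of item 2 above, here is a sketch of how the new layer count could be read back from a converted GGUF file using ggml's GGUF API; the key name `glm4moe.num_nextn_predict_layers` is an assumption, and the actual loading happens inside `llama-model.cpp` rather than in a standalone helper like this:

```cpp
#include "gguf.h"
#include <cstdint>

// Hypothetical: read the GLM 4.5 NextN/MTP layer count from a GGUF file,
// returning 0 when the key is absent (older conversions).
static uint32_t read_nextn_layers(const char * path) {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * gctx = gguf_init_from_file(path, params);
    if (!gctx) {
        return 0;
    }
    uint32_t n_nextn = 0;
    const int64_t key = gguf_find_key(gctx, "glm4moe.num_nextn_predict_layers");
    if (key >= 0) {
        n_nextn = gguf_get_val_u32(gctx, key);
    }
    gguf_free(gctx);
    return n_nextn;
}
```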

* glm 4.5 update tensors names

* model: glm 4.5 apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model: glm 4.5 apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model: glm 4.5 apply suggestions from code review

* Apply suggestions from code review

* patch broken chat template

* typings fix

* add TENSOR_SKIP flag


Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update src/llama-model-loader.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
ngxson and others added 22 commits August 26, 2025 12:54
* convert : fix tensor naming conflict for llama 4 vision

* convert ok

* support kimi vision model

* clean up

* fix style

* fix calc number of output tokens

* refactor resize_position_embeddings

* add test case

* rename build fn

* correct a small bug
* metal : optimize FA vec for large heads and sequences

* metal : adjust small-batch mul mv kernels

ggml-ci

* batched-bench : fix total speed computation

ggml-ci

* cont : add comments

ggml-ci
This commit adds two targets to the Makefile for quantizing
Quantization Aware Trained (QAT) models to the Q4_0 format.

The motivation for this is that these targets set the token embedding and the
output tensor data types to Q8_0 instead of the default Q6_K. This is
something that we wish to enforce for QAT Q4_0 models that are to be
uploaded to ggml-org on Hugging Face to guarantee the best quality.
This patch improves GEMM for the FP32 data type on PowerPC.

Implements GEMM on large blocks with configurable block sizes mc, nc, kc
(default: 256, 256, 256).
Packing function optimized to access blocks as per memory layout.
GEMM optimized to work on larger blocks.
Isolated packing from GEMM operations for better MMA utilization.
(A rough sketch of this blocking scheme follows this commit message.)

Verified functionality and correctness using llama-cli and a standalone
test case (performs matmul and compares the final matrix C result with the base).

Minor code refactoring changes:
Replace macro with inline function
Code indentation made consistent with 4 spaces

Performance Testing:

Observed 50% ~ 70% improvement in Prompt Processing Speed measured using
llama-bench with the Meta-Llama3-8B FP32 model. Similar gains observed with the
Mistral-7b-Instruct-v0.3 model.

| Model | Size | Params | Backend | Threads | Test | Patch | Base |
|---|---|---|---|---|---|---|---|
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU | 20 | pp512 | 98.58 | 60.3 |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU | 20 | pp1024 | 95.88 | 57.36 |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU | 20 | pp2048 | 85.46 | 53.26 |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU | 20 | pp4096 | 68.66 | 45.78 |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU | 20 | pp6144 | 57.35 | 40.44 |

25 ~ 30% improvement in llama-batched-bench with Meta-Llama3-8B in
Prompt Processing Speed for large prompts (256, 512, 1024, 2048, 4096 tokens) with various
batch sizes (1, 2, 4, 8, 16).

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
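A generic, plain-C++ sketch of the mc/nc/kc blocking scheme described above (not the PowerPC/MMA implementation; packing is only hinted at in a comment):

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative blocked SGEMM: C (MxN) += A (MxK) * B (KxN), all row-major.
// Work proceeds over kc x nc panels of B and mc x kc panels of A, mirroring the
// configurable block sizes (default 256/256/256) mentioned in the commit.
static void gemm_blocked(const float * A, const float * B, float * C,
                         size_t M, size_t N, size_t K,
                         size_t mc = 256, size_t nc = 256, size_t kc = 256) {
    for (size_t jc = 0; jc < N; jc += nc) {
        const size_t nb = std::min(nc, N - jc);
        for (size_t pc = 0; pc < K; pc += kc) {
            const size_t kb = std::min(kc, K - pc);
            for (size_t ic = 0; ic < M; ic += mc) {
                const size_t mb = std::min(mc, M - ic);
                // The real implementation packs these A/B blocks into a layout
                // suited to the MMA units before running the micro-kernel.
                for (size_t i = 0; i < mb; ++i) {
                    for (size_t j = 0; j < nb; ++j) {
                        float sum = 0.0f;
                        for (size_t p = 0; p < kb; ++p) {
                            sum += A[(ic + i) * K + (pc + p)] * B[(pc + p) * N + (jc + j)];
                        }
                        C[(ic + i) * N + (jc + j)] += sum;
                    }
                }
            }
        }
    }
}
```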
…#15592)

The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp.

This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
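A minimal sketch of the described check; the helper name and the local WARP_SIZE constant are illustrative, the real value comes from the SYCL backend headers:

```cpp
#include "ggml.h"

static constexpr int64_t WARP_SIZE = 32; // illustrative; backend-defined in ggml-sycl

// Mirrors the described ggml_backend_sycl_device_supports_op() change: report
// support for GGML_OP_RMS_NORM only when the tensor's first dimension is a
// multiple of WARP_SIZE, matching GGML_ASSERT(ncols % WARP_SIZE == 0) in
// ggml-sycl/norm.cpp.
static bool sycl_supports_rms_norm(const ggml_tensor * op) {
    return op->ne[0] % WARP_SIZE == 0;
}
```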
* add fused group_norm/norm, mul, add

* fix spacing

* revert rms_norm logic

* fix trailing whitespace
This commit updates the bash completion script to include the -m
short option for the --model argument.

The motivation for this is that currently tab completion only works for the
full --model option, and it is nice to have it work for the short option
as well.
* ggml-cpu : add basic RVV support for vector f32 ops

* ggml-cpu : add RVV support for f32 softmax
* CANN(flash-attn): refactor mask handling and improve performance

1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.

Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: fix review

Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: Optimization FA BNSD to BSND

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
@openshift-merge-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai bot commented Aug 29, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Introduces new diffusion CLI and model-conversion tooling, extends chat formats (Granite, GPT-OSS), adds speculative drafting across incompatible vocabularies, broadens CLI/common params (diffusion, LR/optimizer, server/API), updates build/CI (Makefile stub, CMake presets, multiple workflows), revises Dockerfiles, refreshes docs/templates/labels, and adds ownership/config files.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Build system & packaging**<br>`CMakeLists.txt`, `CMakePresets.json`, `Makefile`, `common/CMakeLists.txt` | Version var tweaks, log build type, remove KOMPUTE deprecation, add gcc/linux presets, add remoting presets; replace Makefile with CMake-only stub; CURL linking via CURL_LIBRARIES; llguidance ext updated. |
| **Common params, CLI, and utilities**<br>`common/common.h`, `common/common.cpp`, `common/arg.cpp`, `common/json-schema-to-grammar.cpp` | Adds diffusion, LR/optimizer, kv_unified, API prefix, template kwargs, EOG bias handling; new string utility; tokenizer overflow guard; centralized tensor buffer overrides; deprecate defrag-thold; replace custom string_view with std::string_view. |
| **Chat formats & templating**<br>`common/chat.h`, `common/chat.cpp`, `common/chat-parser.cpp` | Adds Granite and GPT-OSS formats; new template inputs (add_bos/eos, kwargs, extra_context); reasoning format AUTO + mapper; safer tool_call arguments serialization. |
| **Speculative drafting API**<br>`common/speculative.h`, `common/speculative.cpp` | Split into target/draft contexts, add vocab-compat checks, retokenization with replacement map; new init signature and replacement API. |
| **Conversion and tokenizer tooling**<br>`convert_hf_to_gguf_update.py`, `convert_lora_to_gguf.py` | Add --check-missing mode; expand models/hashes; robust download/token logic; adjust ModelBase.load_hparams call signature. |
| **Examples: diffusion**<br>`examples/diffusion/*` | New diffusion CLI, CMake, and README for diffusion-based generation with multiple algorithms/schedules. |
| **Examples: model-conversion suite**<br>`examples/model-conversion/**` | New end-to-end conversion/quantization/verification workflows, CMake target (llama-logits), Makefile, scripts (HF hub ops, logits/embeddings checks, perplexity), requirements. |
| **Examples: misc**<br>`examples/embedding/embedding.cpp`, `examples/eval-callback/eval-callback.cpp`, `examples/*.sh`, `examples/llama.vim`, `examples/batched.swift/README.md`, `examples/lookahead/README.md` | Embedding: cls_sep handling, unified KV when n_parallel=1; eval-callback: I64 print, NaN check, empty-input guard; several shebangs to env bash; sample flags updated; added sample commands. |
| **Docker & DevOps containers**<br>`.devops/*.Dockerfile`, `.devops/tools.sh` | New CANN multi-stage Dockerfile; CPU Dockerfile simplifies arch handling; CUDA adds pip --break-system-packages; MUSA and ROCm versions bumped; Vulkan SDK install switches to tarball + env; tools.sh shebang via env. |
| **CI/workflows**<br>`.github/workflows/*`, `.devops/cloud-v-pipeline` | Add packaging, RISC-V native, pre-tokenizer-hashes, copilot setup, ops-docs sync; update main/release workflows (ccache action swap, RPATH flags, runners, WebGPU/Dawn jobs); disable Vulkan cross-builds; remove old Jenkins node. |
| **GitHub repo meta**<br>`.github/ISSUE_TEMPLATE/*`, `.github/labeler.yml`, `.github/copilot-instructions.md`, `CODEOWNERS`, `OWNERS` | Update backend lists (add OpenCL, zDNN; remove Kompute); add labels for zDNN/OpenCL; add Copilot instructions; adjust code ownership; add OWNERS approvers/reviewers. |
| **Docs: backends & build**<br>`docs/backend/*.md`, `docs/build*.md`, `docs/docker.md` | CANN note on NZ weights; SYCL flag description tweak; build docs updated (curl deps, Vulkan/WebGPU sections, Windows/Linux notes); Docker images list changes (SYCL→Vulkan); MUSA tag bump in CI doc. |
| **Docs: multimodal**<br>`docs/multimodal/*.md` | Add MiniCPM-V 4/4.5 and Voxtral docs; relocate legacy MiniCPM scripts; remove image norm flags; repo URL updates. |
| **Ops table & enforcement**<br>`docs/ops.md`, `.github/workflows/update-ops-docs.yml`, `scripts/create_ops_docs.py` (referenced) | New ops support matrix doc and workflow to enforce sync with generator script. |
| **Repo config**<br>`.clang-format`, `.gitignore`, `.gitmodules`, `README.md`, `ci/run.sh`, `build-xcframework.sh`, `ci/README.md` | Format config categories updated; unignore models/templates and ignore .ccache; remove kompute submodule; README reorg (backends incl. WebGPU, hot topics); ci scripts: webgpu flag, wget resume, prompt tweak; env shebangs. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor U as User
  participant CLI as llama-diffusion-cli
  participant L as LLaMA Model (ctx)
  participant S as Samplers/CFG

  U->>CLI: Invoke with model, prompt, diffusion params
  CLI->>L: Load model + vocab, create context
  CLI->>CLI: Tokenize/format input (optional chat template)
  loop steps 1..N
    CLI->>L: Decode (conditional/unconditional if CFG)
    L-->>CLI: Logits
    CLI->>S: Sample/top-k/p + algorithm scoring
    S-->>CLI: Tokens to transfer/replace
    CLI->>CLI: Update masked sequence
    alt visual mode
      CLI-->>U: Render progress/text
    end
  end
  CLI->>L: Detokenize final tokens
  L-->>CLI: Text
  CLI-->>U: Output generated text + timings
```
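A very rough sketch of the decode/sample loop in the diagram above, using the public llama.h batch API; the masking policy and per-step resampling here are placeholders, not the actual algorithm in examples/diffusion:

```cpp
#include "llama.h"
#include <vector>

// Hypothetical single diffusion step: decode the partially masked sequence,
// then resample the positions that are still masked. A real implementation
// also scores candidates, decides which positions to "transfer", and resets
// or bypasses the KV cache between steps.
static void diffusion_step(llama_context * ctx, llama_sampler * smpl,
                           std::vector<llama_token> & tokens,
                           const std::vector<bool> & is_masked) {
    llama_batch batch = llama_batch_init((int32_t) tokens.size(), 0, 1);
    for (size_t i = 0; i < tokens.size(); ++i) {
        batch.token[i]     = tokens[i];
        batch.pos[i]       = (llama_pos) i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = true; // request logits at every position
    }
    batch.n_tokens = (int32_t) tokens.size();

    if (llama_decode(ctx, batch) == 0) {
        for (size_t i = 0; i < tokens.size(); ++i) {
            if (is_masked[i]) {
                tokens[i] = llama_sampler_sample(smpl, ctx, (int32_t) i);
            }
        }
    }
    llama_batch_free(batch);
}
```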
```mermaid
sequenceDiagram
  autonumber
  actor App as Caller
  participant Spec as common_speculative
  participant T as Target ctx_tgt
  participant D as Draft ctx_dft

  App->>Spec: init(ctx_tgt, ctx_dft)
  Spec->>Spec: Check vocab compatibility
  App->>Spec: add_replacement_tgt_dft(map)
  App->>Spec: gen_draft(prompt_tgt)
  alt compatible vocab
    Spec->>D: Decode on draft using target tokens
  else incompatible
    Spec->>Spec: Detokenize target → replace → retokenize for draft
    Spec->>D: Decode on draft with prompt_dft
  end
  D-->>Spec: Draft tokens
  alt incompatible
    Spec->>Spec: Detokenize draft → replace → retokenize for target
  end
  Spec-->>App: Draft tokens in target vocab
```
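For the incompatible-vocabulary branch in the diagram above, a minimal sketch of the detokenize → string-replace → retokenize round trip using the common helpers; the replacement map and its application are simplified assumptions:

```cpp
#include "common.h" // common_tokenize / common_detokenize
#include <map>
#include <string>
#include <vector>

// Hypothetical: convert target-vocab tokens into draft-vocab tokens when the
// two models do not share a vocabulary (illustration only).
static std::vector<llama_token> retokenize_for_draft(
        llama_context * ctx_tgt, llama_context * ctx_dft,
        const std::vector<llama_token> & tokens_tgt,
        const std::map<std::string, std::string> & replacements) {
    std::string text = common_detokenize(ctx_tgt, tokens_tgt);

    // Apply target -> draft string replacements (e.g. differing special tokens).
    for (const auto & [from, to] : replacements) {
        if (from.empty()) {
            continue;
        }
        for (size_t pos = 0; (pos = text.find(from, pos)) != std::string::npos; pos += to.size()) {
            text.replace(pos, from.size(), to);
        }
    }

    // Retokenize with the draft model's vocabulary.
    return common_tokenize(ctx_dft, text, /*add_special=*/false, /*parse_special=*/true);
}
```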

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • cfergeau
  • praveenkumar

Poem

Hop, hop—new gears align,
Diffusion dreams and chats combine.
Two vocab burrows, retokenize!
Our CI sky has brighter skies.
Docker carts roll, presets chime—
gguf to stars, one hop at a time. 🐇✨

@openshift-ci openshift-ci bot requested review from cfergeau and gbraad August 29, 2025 12:09
openshift-ci bot commented Aug 29, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cfergeau for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kpouget kpouget closed this Aug 29, 2025
@kpouget kpouget deleted the reshape-b6298 branch August 29, 2025 13:30