Conversation

@YaelGitAccount
Contributor

This PR adds initial support for NVIDIA's Eagle2-VL vision-language models in llama.cpp, addressing #16704.

The goal is to enable GGUF conversion of the Eagle2-VL mmproj and run basic multimodal inference via llama-mtmd-cli, while keeping the changes fully isolated to Eagle2-VL and leaving all other models unaffected.

What this PR does

1. GGUF conversion: Eagle2-VL mmproj

  • Extend convert_hf_to_gguf.py to recognize the Eagle2_5_VLForConditionalGeneration architecture when --mmproj is used.
  • Parse the Eagle2-VL HF config to extract the projector / hidden sizes needed for the mmproj graph.
  • Add a dedicated tensor mapping entry in gguf/tensor_mapping.py for the Eagle2-VL mmproj so that:
    • All projector weights are mapped deterministically into GGUF.
    • Shapes are consistent with the SigLIP image tower and the Qwen2.5 text model used by Eagle2-VL.
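As an illustration of what a deterministic mmproj tensor mapping involves (this is a hedged sketch, not the PR's actual diff — the HF-side names below are hypothetical), a mapping entry in the spirit of `gguf/tensor_mapping.py` pairs Hugging Face projector weight names with canonical GGUF names:

```python
# Hypothetical sketch of an mmproj tensor-name mapping, in the spirit of
# gguf/tensor_mapping.py. The HF-side names are illustrative placeholders,
# not the exact tensor names used by Eagle2-VL or by this PR.
EAGLE2_VL_MMPROJ_MAP = {
    # HF projector weight -> canonical GGUF mmproj tensor name
    "mlp1.0.weight": "mm.0.weight",
    "mlp1.0.bias":   "mm.0.bias",
    "mlp1.2.weight": "mm.2.weight",
    "mlp1.2.bias":   "mm.2.bias",
}

def map_tensor_name(hf_name: str) -> str:
    """Deterministically map an HF projector tensor name to its GGUF name.

    Raising on unknown names (rather than passing them through) is what
    makes the mapping deterministic: every projector weight is either
    accounted for or the conversion fails loudly.
    """
    try:
        return EAGLE2_VL_MMPROJ_MAP[hf_name]
    except KeyError:
        raise ValueError(f"unmapped mmproj tensor: {hf_name}") from None
```

The fail-loud behavior on unmapped names mirrors how the converter can guarantee "all projector weights are mapped deterministically into GGUF".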

2. Runtime: SigLIP → mmproj → text integration in mtmd CLIP graph

  • Extend tools/mtmd/clip.cpp with an Eagle2-VL–specific branch that:
    • Wires the SigLIP vision tower output into the Eagle2-VL mmproj.
    • Produces projector outputs in the expected [hidden, tokens] layout for the downstream text model.
    • Respects the scale / merge behavior described by the Eagle2-VL configuration.
  • The new path is fully guarded:
    • Only triggers when the model is detected as Eagle2-VL mmproj.
    • Falls back cleanly for all other CLIP / mmproj models.
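The merge behavior referred to above can be pictured with a small model: a factor x factor spatial merge concatenates the features of each patch block into one token, cutting the token count by factor^2 while widening the hidden dimension by the same amount. This is an illustrative Python sketch of the idea, not the clip.cpp implementation:

```python
def spatial_merge(tokens, grid_w, grid_h, factor=2):
    """Merge each factor x factor block of patch embeddings into one token.

    tokens: list of per-patch feature vectors, row-major over (grid_h, grid_w).
    Returns a shorter token list whose vectors are factor**2 times wider,
    mimicking the token-count reduction of a pixel-unshuffle-style merge.
    Illustrative only; the real graph does this with tensor reshapes.
    """
    assert grid_w % factor == 0 and grid_h % factor == 0
    merged = []
    for y in range(0, grid_h, factor):
        for x in range(0, grid_w, factor):
            block = []
            # Concatenate the features of the factor x factor neighborhood.
            for dy in range(factor):
                for dx in range(factor):
                    block.extend(tokens[(y + dy) * grid_w + (x + dx)])
            merged.append(block)
    return merged
```

For a 4x4 patch grid, this yields 4 merged tokens, each carrying the concatenated features of a 2x2 neighborhood — the shape the projector then maps into the text model's embedding space.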

3. Scope and safety

  • No changes to the core llama model, kv-cache, sampling, or quantization logic.
  • No behavioral changes for non-Eagle models:
    • Existing CLIP / mmproj users should see identical behavior.
  • Runtime guards keep the Eagle2-specific logic isolated to:
    • convert_hf_to_gguf.py (mmproj conversion)
    • gguf/tensor_mapping.py (mmproj tensor mapping)
    • tools/mtmd/clip.cpp (vision → projector → text graph)

Tested models

Conversion and inference were tested end-to-end on:

  • nvidia/Eagle2-1B
  • nvidia/Eagle2-2B

For both models:

  • convert_hf_to_gguf.py --mmproj --outtype f16 produces a GGUF mmproj that loads successfully.
  • llama-mtmd-cli can:
    • Encode an image through the SigLIP tower.
    • Run the Eagle2-VL projector.
    • Generate text conditioned on the image.

At this stage, 9B support is intentionally left out of scope to keep the diff small and focused.

Notes

  • The PR does not introduce any new public CLI flags.
  • Logging is kept minimal and aligned with existing mtmd style.
  • The Eagle2-VL path is designed so that, if the architecture is not detected, behavior is unchanged.

Co-authors

This work was done together with @YaelLogic as part of a focused effort to add Eagle2-VL multimodal support to llama.cpp.

YaelGitAccount and others added 8 commits November 3, 2025 16:12
…branch)

Co-authored-by: YaelLogic <y0548591250@gmail.com>
Co-authored-by: YaelGitAccount <ya0504124870@gmail.com>
…ing, and shape validation

- Scope strictly to Eagle2-VL with config/arch guards
- Remove temporary debug logs; preserve upstream logging semantics
- Keep tokenizer/encode path clean; no behavior change for non-vision models
@ngxson ngxson left a comment

I don't feel quite comfortable patching a lot of existing code just to make this model work. This PR will probably break many existing models.

Instead, you should add a dedicated conversion class, PROJECTOR_TYPE, and cgraph build_... function

if fullatt_block_indexes[i] - fullatt_block_indexes[i - 1] != n_wa_pattern:
    raise ValueError(f"Invalid fullatt_block_indexes: {fullatt_block_indexes}")
self.gguf_writer.add_vision_n_wa_pattern(n_wa_pattern)
elif model_type in ['eagle_2_5_vl', 'eagle2_vl', 'eagle2_5_vl']:

please add a dedicated conversion class and dedicated projector type

@@ -12,6 +9,7 @@

#include <cassert>
#include <cmath>
#include <cstdio>

unused header

@@ -185,6 +183,11 @@ struct clip_hparams {

patch_merge_type mm_patch_merge_type = PATCH_MERGE_FLAT;

int32_t patch_merge_factor = 1;
std::string patch_merge_mode = "flat";

use enum patch_merge_type instead

@@ -185,6 +183,11 @@ struct clip_hparams {

patch_merge_type mm_patch_merge_type = PATCH_MERGE_FLAT;

int32_t patch_merge_factor = 1;

use n_merge instead

Comment on lines +3660 to +3676
const size_t plane_sz = (size_t) dst.nx * (size_t) dst.ny;
dst.buf.resize(3 * plane_sz); // planar RGB

for (int y = 0; y < dst.ny; ++y) {
    for (int x = 0; x < dst.nx; ++x) {
        size_t base = (size_t) y * (size_t) dst.nx + (size_t) x;
        for (int c = 0; c < 3; ++c) {
            size_t src_idx = 3ull * base + (size_t) c; // interleaved in src
            float raw = static_cast<float>(src.buf[src_idx]) / 255.0f;
            float v = (raw - mean[c]) / std[c];
            size_t dst_idx = (size_t) c * plane_sz + base; // planar in dst
            dst.buf[dst_idx] = v;
        }
    }
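For readers following the review: the C++ loop above converts interleaved uint8 RGB (HWC layout) into planar float RGB (CHW layout) while applying per-channel mean/std normalization. An equivalent Python sketch of the same transform (illustrative, not code from the PR):

```python
def interleaved_to_planar_normalized(buf, nx, ny, mean, std):
    """Convert interleaved uint8 RGB (HWC) to planar float RGB (CHW).

    Each channel value is normalized as (v / 255 - mean[c]) / std[c],
    mirroring the C++ loop under review. buf holds 3 * nx * ny bytes
    with pixels stored as R, G, B, R, G, B, ...
    """
    plane_sz = nx * ny
    out = [0.0] * (3 * plane_sz)
    for y in range(ny):
        for x in range(nx):
            base = y * nx + x
            for c in range(3):
                raw = buf[3 * base + c] / 255.0            # interleaved source
                out[c * plane_sz + base] = (raw - mean[c]) / std[c]  # planar dest
    return out
```

With the common mean = std = 0.5 per channel, a byte of 255 maps to +1.0 and a byte of 0 maps to -1.0, which is the usual [-1, 1] normalization used by SigLIP-style towers.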

what happens here?

@github-actions github-actions bot added examples python python script changes labels Nov 13, 2025