feat(mtmd): add Eagle2-VL multimodal support (mmproj + SigLIP pipeline) #17224
Conversation
…branch) Co-authored-by: YaelLogic <y0548591250@gmail.com>
Co-authored-by: YaelGitAccount <ya0504124870@gmail.com>
…rker logs for Eagle2-VL
…ing, and shape validation
…ing, and shape validation

- Scope strictly to Eagle2-VL with config/arch guards
- Remove temporary debug logs; preserve upstream logging semantics
- Keep tokenizer/encode path clean; no behavior change for non-vision models
…L logic isolated in clip.cpp
I don't feel comfortable with patching this much existing code just to make this model work. This PR will probably break many existing models.
Instead, you should add a dedicated conversion class, `PROJECTOR_TYPE`, and cgraph `build_...` function for it.
```py
            if fullatt_block_indexes[i] - fullatt_block_indexes[i - 1] != n_wa_pattern:
                raise ValueError(f"Invalid fullatt_block_indexes: {fullatt_block_indexes}")
            self.gguf_writer.add_vision_n_wa_pattern(n_wa_pattern)
        elif model_type in ['eagle_2_5_vl', 'eagle2_vl', 'eagle2_5_vl']:
```
Please add a dedicated conversion class and a dedicated projector type.
```cpp
@@ -12,6 +9,7 @@
#include <cassert>
#include <cmath>
#include <cstdio>
```
Unused header.
```cpp
@@ -185,6 +183,11 @@ struct clip_hparams {
    patch_merge_type mm_patch_merge_type = PATCH_MERGE_FLAT;

    int32_t patch_merge_factor = 1;
    std::string patch_merge_mode = "flat";
```
Use the `patch_merge_type` enum instead.
```cpp
@@ -185,6 +183,11 @@ struct clip_hparams {
    patch_merge_type mm_patch_merge_type = PATCH_MERGE_FLAT;

    int32_t patch_merge_factor = 1;
```
Use `n_merge` instead.
```cpp
const size_t plane_sz = (size_t) dst.nx * (size_t) dst.ny;
dst.buf.resize(3 * plane_sz); // planar RGB

for (int y = 0; y < dst.ny; ++y) {
    for (int x = 0; x < dst.nx; ++x) {
        size_t base = (size_t) y * (size_t) dst.nx + (size_t) x;
        for (int c = 0; c < 3; ++c) {
            size_t src_idx = 3ull * base + (size_t) c; // interleaved in src
            float raw = static_cast<float>(src.buf[src_idx]) / 255.0f;
            float v = (raw - mean[c]) / std[c];
            size_t dst_idx = (size_t) c * plane_sz + base; // planar in dst
            dst.buf[dst_idx] = v;
        }
    }
}
```
What happens here?
This PR adds initial support for NVIDIA's Eagle2-VL vision-language models in llama.cpp, addressing #16704. The goal is to enable GGUF conversion of the Eagle2-VL mmproj and run basic multimodal inference via `llama-mtmd-cli`, while keeping the changes fully isolated to Eagle2-VL and leaving all other models unaffected.

**What this PR does**

1. GGUF conversion: Eagle2-VL mmproj
   - Extends `convert_hf_to_gguf.py` to recognize the `Eagle2_5_VLForConditionalGeneration` architecture when `--mmproj` is used.
   - Updates `gguf/tensor_mapping.py` with the tensor mappings for the Eagle2-VL mmproj.
2. Runtime: SigLIP → mmproj → text integration in `mtmd`
   - Extends the CLIP graph in `tools/mtmd/clip.cpp` with an Eagle2-VL-specific branch that produces the `[hidden, tokens]` layout for the downstream text model.
3. Scope and safety
   - No changes to the `llama` model, kv-cache, sampling, or quantization logic.
   - Files touched: `convert_hf_to_gguf.py` (mmproj conversion), `gguf/tensor_mapping.py` (mmproj tensor mapping), `tools/mtmd/clip.cpp` (vision → projector → text graph).

**Tested models**

Conversion and inference were tested end-to-end on:

- `nvidia/Eagle2-1B`
- `nvidia/Eagle2-2B`

For both models, `convert_hf_to_gguf.py --mmproj --outtype f16` produces a GGUF mmproj that loads successfully, and `llama-mtmd-cli` can run basic multimodal inference.

At this stage, 9B support is intentionally left out of scope to keep the diff small and focused.
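For reference, the tested flow looks roughly like the following; the paths and prompt are placeholders, and the `llama-mtmd-cli` flag names should be verified against the current CLI help:

```sh
# Convert the checkpoint's vision tower + projector to a GGUF mmproj (F16).
python convert_hf_to_gguf.py /path/to/Eagle2-2B --mmproj --outtype f16

# Run basic multimodal inference (placeholder file names).
llama-mtmd-cli -m eagle2-2b-text.gguf \
    --mmproj mmproj-eagle2-2b-f16.gguf \
    --image example.jpg \
    -p "Describe this image."
```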
**Notes**

- The implementation follows the existing `mtmd` style.

**Co-authors**

This work was done together with @YaelLogic as part of a focused effort to add Eagle2-VL multimodal support to llama.cpp.