Add dynamic high-resolution image preprocessing for InternVL model#20847
ngxson merged 13 commits into ggml-org:master
Conversation
gguf-py/gguf/constants.py
Outdated
MIN_DYNAMIC_PATCH = "clip.vision.min_dynamic_patch"
MAX_DYNAMIC_PATCH = "clip.vision.max_dynamic_patch"
you should convert patches to image_min/max_pixels and reuse the existing code path instead
Thanks for the review.
These are used when converting/making the mmproj file; min/max_dynamic_patch come from the model config.
In clip.cpp, if I reuse image_min/max_pixels = (image_size*image_size) * (min/max_dynamic_patch), will that be ok? I can then remove KEY_MIN/MAX_DYNAMIC_PATCH and min/max_dynamic_patch from clip_hparams.
No, that doesn't make sense. I think there is a mix-up here: the size of a patch is patch_size, not image_size.
If you mean the UHD slicing style, call them "slices" like what we are using in the code base.
No, these are (image-size x image-size) tiles, which are different from UHD slices. AFAIK, there is no grid. I can reuse image_min/max_pixels without introducing other clip_hparams fields.
> No, these are (image-size x image-size) tiles, which are different from UHD slices.
You sure? llava_uhd also uses square tiles of (image_size x image_size); the only difference is the way the tiles are split. The main vision encoder never supports true dynamic resolution (i.e. unlike qwen-vl, where tiling is not required).
I don't think it's worth reusing image_min/max_pixels; it is intended for models with native dynamic-resolution support (no tiling required). It is better to create a dedicated preproc_min/max_slices for this kind of preprocessing.
I have simplified the handling of the min/max number of patches. Please review.
ngxson left a comment
I cannot accept the current solution. image_min/max_pixels is used to determine the number of tokens that the cgraph needs to process, and to allocate backend buffers accordingly.
The vision encoder never processes more than image_size x image_size worth of pixels, so it does not make sense to reuse this metadata.
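To illustrate the buffer-sizing point, here is a small sketch (not from this PR; the defaults of 448x448 tiles, 14px patches, and a 2x2 pixel shuffle are typical InternVL values, assumed for illustration): the token count per tile is a fixed constant, so the cgraph workload is bounded per tile rather than by the original image's pixel budget.

```python
def n_tokens_per_tile(image_size=448, patch_size=14, pixel_shuffle=2):
    """Tokens produced by one tile after patch embedding and pixel shuffle.

    Defaults are typical InternVL values (illustrative assumption,
    not taken from this PR).
    """
    n_patches = (image_size // patch_size) ** 2  # 32*32 = 1024 patches
    return n_patches // (pixel_shuffle ** 2)     # 1024 / 4 = 256 tokens

# The encoder only ever sees one image_size x image_size tile at a time,
# regardless of how many tiles the preprocessing produces.
print(n_tokens_per_tile())  # 256
```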
tools/mtmd/clip.h
Outdated
bool clip_is_internvl(const struct clip_ctx * ctx);
bool clip_is_llava(const struct clip_ctx * ctx);
// note for contributor: this clip_is_(model) pattern is deprecated
// do NOT add new functions like this
// note for contributor: this clip_is_(model) pattern is deprecated
// do NOT add new functions like this
Sorry, I missed the comments. If no new functions like this are allowed, what are the alternatives? I see you have TODOs here. Does that mean all other models have to wait before batched encoding is supported?
use clip_get_projector_type(ctx_v)
tools/mtmd/clip.cpp
Outdated
#include <set>
#include <stdexcept>
#include <unordered_set>
#include <vector>
Can you give a pointer to how to find the min/max number of tiles? It has to come from the gguf metadata, right?
@bssrdf have you read my comment carefully? #20847 (comment)
Also, to make it clear: please avoid using the term "patch" in this context. The internvl model does NOT support a dynamic number of patches; it only supports a dynamic number of "tiles" or "slices". Only models like qwen-vl or pixtral support truly dynamic "patches", via 2D positional encoding.
I see your comments. I have code for …
OK, so to be clear: because you said that the tiles here are different from UHD slices, I pointed out that there is a fundamental misconception in your initial version. What I'm effectively explaining here is that you initially used the term "patch" where "tile"/"slice" is meant. Please update your code to remove any wrong usage of the term "patch".
convert_hf_to_gguf.py
Outdated
self.gguf_writer.add_vision_min_pixels(self.min_dynamic_patch*im2)
self.gguf_writer.add_vision_max_pixels(self.max_dynamic_patch*im2)
you need to add new metadata specific for the min/max number of slices, as technically they are not the same as min/max pixels. something like preproc_min/max_slices
Got it, thanks. Is preproc_min/max_tiles good? "Tiles" is the term used in the source literature.
tools/mtmd/clip.cpp
Outdated
|
|
static Ratio find_closest_aspect_ratio(float aspect_ratio,
                                       const std::vector<Ratio>& target_ratios,
                                       int width, int height, int image_size) {
it's unclear what the unit of width and height is here; better to name them n_slices_w/h
tools/mtmd/clip.cpp
Outdated
dynamic_preprocess(const clip_image_u8 & image_rgb,
                   int min_num = 1,
                   int max_num = 12,
                   int image_size = 448,
                   bool use_thumbnail = true) {
this whole function seems to overlap quite a bit with the llava_uhd logic, basically:
- `select_best_resolution` is roughly equivalent to `find_closest_aspect_ratio`
- "Slice into blocks" is equivalent to `slice_image`
I can try to reuse the logic in llava_uhd.
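For reference, here is a Python sketch of the tile-grid selection that this kind of dynamic preprocessing performs (a paraphrase of InternVL's reference preprocessing, not code from this PR; names like `find_closest_grid` are illustrative): enumerate every (cols, rows) grid whose tile count lies in [min_num, max_num], then pick the grid whose aspect ratio is closest to the input's, breaking ties toward more tiles only when the image is large enough to warrant them.

```python
def find_closest_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Pick a (cols, rows) tile grid for an image of the given pixel size.

    Paraphrased from InternVL's reference preprocessing; parameter and
    function names are illustrative assumptions, not from this PR.
    """
    aspect_ratio = width / height
    # all grids with min_num <= cols*rows <= max_num, in deterministic order
    grids = sorted(
        {(c, r) for n in range(min_num, max_num + 1)
                for c in range(1, n + 1) for r in range(1, n + 1)
                if min_num <= c * r <= max_num},
        key=lambda g: (g[0] * g[1], g))
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect_ratio - cols / rows)
        # tie-break: prefer the bigger grid only if the image has more than
        # half the pixels that grid would cover at native tile resolution
        if diff < best_diff or (diff == best_diff and
                                width * height > 0.5 * image_size ** 2 * cols * rows):
            best, best_diff = (cols, rows), diff
    return best

print(find_closest_grid(896, 448))   # (2, 1): two tiles side by side
print(find_closest_grid(448, 1344))  # (1, 3): three tiles stacked
```

The image is then resized to `cols*image_size` x `rows*image_size` and cut into square tiles, optionally with a downscaled thumbnail appended.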
@CISC Can you have a quick look? Thanks.
tools/mtmd/clip.cpp
Outdated
hparams.image_res_candidates.push_back(clip_image_size{
    a*hparams.image_size,
    b*hparams.image_size,
});
tools/mtmd/clip-model.h
Outdated
int32_t preproc_min_tiles = -1;
int32_t preproc_max_tiles = -1;

Suggested change:
- int32_t preproc_min_tiles = -1;
- int32_t preproc_max_tiles = -1;
+ uint32_t preproc_min_tiles = 0;
+ uint32_t preproc_max_tiles = 0;
Defaulting these to -1 is definitely not ok though.
I am just following other similar fields. They will be assigned values >= 0 after model init. Again, it's @ngxson's decision.
This is different, you're looping with <= and doing maths with these values.
no strong opinion; maybe better to change it to uint32_t and add a GGML_ASSERT accordingly, as suggested in the comment below
tools/mtmd/clip.cpp
Outdated
int min_num = hparams.preproc_min_tiles;
int max_num = hparams.preproc_max_tiles;
for (int a = min_num; a <= max_num; ++a) {
    int b_lo = (min_num + a - 1) / a;
    int b_hi = max_num / a;
    b_lo = std::max(b_lo, min_num);
    b_hi = std::min(b_hi, max_num);
    for (int b = b_lo; b <= b_hi; ++b) {
        hparams.image_res_candidates.push_back(clip_image_size {
            a*hparams.image_size,
            b*hparams.image_size,

Suggested change:
- int min_num = hparams.preproc_min_tiles;
- int max_num = hparams.preproc_max_tiles;
- for (int a = min_num; a <= max_num; ++a) {
-     int b_lo = (min_num + a - 1) / a;
-     int b_hi = max_num / a;
-     b_lo = std::max(b_lo, min_num);
-     b_hi = std::min(b_hi, max_num);
-     for (int b = b_lo; b <= b_hi; ++b) {
-         hparams.image_res_candidates.push_back(clip_image_size {
-             a*hparams.image_size,
-             b*hparams.image_size,
+ uint32_t min_num = hparams.preproc_min_tiles;
+ uint32_t max_num = hparams.preproc_max_tiles;
+ for (uint32_t a = min_num; a <= max_num; ++a) {
+     uint32_t b_lo = (min_num + a - 1) / a;
+     uint32_t b_hi = max_num / a;
+     b_lo = std::max(b_lo, min_num);
+     b_hi = std::min(b_hi, max_num);
+     for (uint32_t b = b_lo; b <= b_hi; ++b) {
+         hparams.image_res_candidates.push_back(clip_image_size {
+             static_cast<int>(a * hparams.image_size),
+             static_cast<int>(b * hparams.image_size),
Not sure why the image sizes are int though?
Probably not, seems to be a pattern with the signed ints. :)
maybe leave them as is for this PR?
Something needs to be changed, but I'll leave it for @ngxson to decide, the rest LGTM.
clip.cpp already uses int for image size before I took over it, so I kept it that way, not really a big problem though
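To sanity-check the loop discussed above, here is a direct Python translation (illustrative only, not part of the PR): for every grid (a, b) it emits a candidate resolution of a*image_size by b*image_size, and the b_lo/b_hi bounds keep the tile count a*b within [min_tiles, max_tiles].

```python
def res_candidates(min_tiles, max_tiles, image_size):
    """Python translation of the candidate-resolution loop (illustrative)."""
    out = []
    for a in range(min_tiles, max_tiles + 1):
        b_lo = max((min_tiles + a - 1) // a, min_tiles)  # ceil(min_tiles / a)
        b_hi = min(max_tiles // a, max_tiles)            # floor(max_tiles / a)
        for b in range(b_lo, b_hi + 1):
            out.append((a * image_size, b * image_size))
    return out

cands = res_candidates(1, 12, 448)
# every candidate corresponds to a tile count within [1, 12]
assert all(1 <= (w // 448) * (h // 448) <= 12 for w, h in cands)
print(len(res_candidates(1, 2, 448)))  # 3: (448,448), (448,896), (896,448)
```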

This PR adds support for the dynamic high-resolution tiling used by the InternVL model. This makes Qianfan-OCR work in llama.cpp.
Use the mmproj file from https://huggingface.co/bssrdf/Qianfan-OCR-gguf/blob/main/Qianfan-OCR-mmproj-bf16.gguf and the LLM from https://huggingface.co/Reza2kn/Qianfan-OCR-GGUF/tree/main
Example: OCR the above image
master
This PR