Add dynamic high-resolution image preprocessing for InternVL model #20847

Merged
ngxson merged 13 commits into ggml-org:master from bssrdf:dynamic-resolution-internvl on Mar 23, 2026
Conversation


@bssrdf bssrdf commented Mar 21, 2026

This PR adds support for the dynamic high-resolution tiling used by the InternVL model. This makes Qianfan-OCR work in llama.cpp.

Use mmproj file from https://huggingface.co/bssrdf/Qianfan-OCR-gguf/blob/main/Qianfan-OCR-mmproj-bf16.gguf and LLM in https://huggingface.co/Reza2kn/Qianfan-OCR-GGUF/tree/main

[image: document.png]

Example: OCR the above image

master

 bin\Release\llama-mtmd-cli.exe -m ..\models\Qianfan-OCR-bf16.gguf --mmproj ..\models\Qianfan-OCR-mmproj-f16.gguf  --image document.png -p "Parse this document to Markdown." 
main: loading model: ..\models\Qianfan-OCR-bf16.gguf
WARN: This is an experimental CLI for testing multimodal capability.
      For normal use cases, please use the standard llama-cli
encoding image slice...
image slice encoded in 68 ms
decoding image batch 1/1, n_tokens_batch = 256
image decoded (batch 1/1) in 93 ms

## Quantum Models Cookbook

Spectral decomposition of a quantum system is a fundamental problem in quantum physics. Given a Hamiltonian $H$, the goal is to find the eigenvalues and eigenvectors of $H$. This information is crucial for understanding the dynamics of the system.

### News

2023.03.21: Quantum-CR-A Unified End-to-End Model for Document Intelligence is released. Quantum-CR-A is a state-of-the-art model for document understanding. It is trained on a large corpus of documents and has achieved state-of-the-art performance on several benchmark datasets.

2023.03.18: Quantum-CR-A is now available on Hugging Face. The model can be easily fine-tuned and deployed on various platforms.

2023.03.15: Quantum-CR-A is now available on GitHub. The source code and training scripts are provided for further research and development.

2023.03.12: Quantum-CR-A is now available on Kaggle. The model can be used for various tasks such as text classification, named entity recognition, and relation extraction.

2023.03.09: Quantum-CR-A is now available on arXiv. The paper presents the model and its applications in detail.

2023.03.06: Quantum-CR-A is now available on Zenodo. The model and its training data are provided for further research and development.

2023.03.03: Quantum-CR-A is now available on GitHub. The source code and training scripts are provided for further research and development.

2023.02.28: Quantum-CR-A is now available on Kaggle. The model can be used for various tasks such as text classification, named entity recognition, and relation extraction.

2023.02.25: Quantum-CR-A is now available on arXiv. The paper presents the model and its applications in detail.

2023.02.22: Quantum-CR-A is now available on Zenodo. The model and its training data are provided for further research and development.

2023.02.19: Quantum-CR-A is now available on GitHub. The source code and training scripts are provided for further research and development.

2023.02.16: Quantum-CR-A is now available on Kaggle. The model can be used for various tasks such as text classification, named entity recognition, and relation extraction.

2023.02.13: Quantum-CR-A is now available on arXiv. The paper presents the model and its applications in detail.

2023.02.10: Quantum-CR-A is now available on Zenodo. The model and its training data are provided for further research and development.

2023.02.07: Quantum-CR-A is now available on GitHub. The source code and training scripts are provided for further research and development.

2023.02.04: Quantum-CR-A is now available on Kaggle. The model can be used for various tasks such as text classification, named entity recognition, and relation extraction.

2023.02.01: Quantum-CR-A is now available on arXiv. The paper presents the model and its applications in detail.

2023.01.28: Quantum-CR-A is now available on Zenodo. The model and its training data are provided for further research and development.

2023.01.25: Quantum-CR-A is now available on GitHub. The source code and training scripts are provided for further research and development. 

This PR

 bin\Release\llama-mtmd-cli.exe -m ..\models\Qianfan-OCR-bf16.gguf --mmproj ..\models\Qianfan-OCR-mmproj-bf16.gguf --image document.png -p "Parse this document to Markdown." 

main: loading model: ..\models\Qianfan-OCR-bf16.gguf
WARN: This is an experimental CLI for testing multimodal capability.
      For normal use cases, please use the standard llama-cli
encoding image slice...
image slice encoded in 323 ms
decoding image batch 1/2, n_tokens_batch = 2048
image decoded (batch 1/2) in 154 ms
decoding image batch 2/2, n_tokens_batch = 1280
image decoded (batch 2/2) in 136 ms

## Qianfan Models Cookbook

Example code and guides for basic tasks with LLMs Hosted by Qianfan Platform. You'll need a qianfan account and associated API key to accomplish all examples. We will provide example files, prompts within a cookbook. You can run all examples by setting the QIANFAN_TOKEN environment variable.

Note that this is a python code only repository.

## News

2026.03.12: Qianfan-OCR: A Unified End-to-End Model for Document Intelligence is released! Qianfan-OCR (4B+300M parameters) is now available on Baidu AI Cloud Open source weights coming soon!

Qianfan-OCR is a unified end-to-end document intelligence model, designed to help enterprises achieve digital transformation and move towards intelligent automation. Key highlights:

• Top Performance End-to-End OCR Model on OmniDocBench v1.5 and CmOCR Bench.

• General OCR: Top model performance on OCRBench and OCRBench v2.

• Document Understanding: Strong performance in document QA and information extraction.

• Layout-as-Thought: For documents with complex layouts and non-standard reading orders, Qianfan-OCR can perform layout-analysis-level reasoning via a novel Layout-as-Thought mechanism, achieving superior recognition results.

• Multilingual OCR: Supports up to 192 languages with top performance on CC-OCR.

2025.09.22: The Qianfan-VL Vision-Language model series from Baidu AI Cloud is now open source!

• Multimodal Large Language Models:

○ Qianfan-VL-3B, Qianfan-VL-8B, Qianfan-VL-70B

Designed for enterprise applications, these multimodal models combine excellent general capabilities with advanced performance in OCR and education.

2025.06.06: QianfanHuijin and QianfanHuijin-Reason series financial augmented models have been added to ModelBuilder:

• Financial Knowledge Augmented Models:

○ QianfanHuijin-70B-32K, QianfanHuijin-8B-32K

• Financial Reasoning Augmented Models:

○ QianfanHuijin-Reason-70B-32K, QianfanHuijin-Reason-8B-32K 

@bssrdf bssrdf requested review from a team and CISC as code owners March 21, 2026 22:27
@github-actions github-actions bot added the examples and python (python script changes) labels Mar 21, 2026
Comment on lines +304 to +305
MIN_DYNAMIC_PATCH = "clip.vision.min_dynamic_patch"
MAX_DYNAMIC_PATCH = "clip.vision.max_dynamic_patch"
Contributor

you should convert patches to image_min/max_pixels and reuse the existing code path instead

Contributor Author

@bssrdf bssrdf Mar 21, 2026

Thanks for the review.
These are used when converting/making the mmproj file; min/max_dynamic_patch come from the model config.
In clip.cpp, if I reuse image_min/max_pixels = (image_size*image_size) * (MIN/MAX_dynamic_patch), will that be OK? I can then remove KEY_MIN/MAX_DYNAMIC_PATCH and min/max_dynamic_patch from clip_hparams.

Contributor

no, that doesn't make sense. I think there is a mix-up here: the size of a patch is patch_size, not image_size.

if you mean the UHD-slices style, call them "slices", like what we already use in the code base.

Contributor Author

@bssrdf bssrdf Mar 21, 2026

No, these are (image-size x image-size) tiles, which are different from UHD slices. AFAIK, there is no grid. I can reuse image_min/max_pixels without introducing other clip_hparams fields.

Contributor

@ngxson ngxson Mar 21, 2026

No, these are (image-size x image-size) tiles, which are different from UHD slices.

you sure? llava-uhd also uses square tiles of (image_size x image_size); the only difference is how the tiles are split. The main vision encoder never supports true dynamic resolution (i.e., unlike qwen-vl, where tiling is not required).

I don't think it's worth reusing image_min/max_pixels; it is intended for models with native dynamic-resolution support (no tiling required). It is better to create a dedicated preproc_min/max_slices for this kind of preprocessing.

Contributor Author

I have simplified the handling of the min/max number of patches. Please review.

Contributor

@ngxson ngxson left a comment

I cannot accept the current solution. image_min/max_pixels is used to determine the number of tokens that the cgraph needs to process, and to allocate backend buffers accordingly.

The vision encoder never processes more than image_size x image_size worth of pixels; it does not make sense to reuse this metadata.

Comment on lines 110 to 113
bool clip_is_internvl(const struct clip_ctx * ctx);
bool clip_is_llava(const struct clip_ctx * ctx);
// note for contributor: this clip_is_(model) pattern is deprecated
// do NOT add new functions like this
Contributor

// note for contributor: this clip_is_(model) pattern is deprecated
//                       do NOT add new functions like this

Contributor Author

@bssrdf bssrdf Mar 22, 2026

Sorry, I missed the comments. If no new such functions are allowed, what are the alternatives? I see you have TODOs here. Does that mean all other models have to wait before batched encoding is supported?

Contributor

use clip_get_projector_type(ctx_v)

Comment on lines 20 to 23
#include <set>
#include <stdexcept>
#include <unordered_set>
#include <vector>
Contributor

use unordered_set


bssrdf commented Mar 22, 2026

I cannot accept the current solution. image_min/max_pixels is used to determine the number of tokens that the cgraph needs to process, and to allocate backend buffers accordingly.

The vision encoder never processes more than image_size x image_size worth of pixels; it does not make sense to reuse this metadata.

Can you give a pointer to the solution of how to find the min/max number of tiles? It has to come from the gguf meta data, right?


ngxson commented Mar 22, 2026

@bssrdf have you read my comment carefully? #20847 (comment)


ngxson commented Mar 22, 2026

also, to make it clear: please avoid using the term "patch" in this context. The internvl model does NOT support a dynamic number of patches; it only supports a dynamic number of "tiles" or "slices".

only models like qwen-vl or pixtral support truly dynamic "patches", via 2D positional encoding


bssrdf commented Mar 22, 2026

@bssrdf have you read my comment carefully? #20847 (comment)

I see your comments. I have code for preproc_min/max_slices dedicated to such slices/tiles (whatever you call them). The only problem is how to find the min/max, which has to come from the gguf file. I have done it both ways (adding new fields, or reusing the existing ones). I am wondering which solution you can accept.


ngxson commented Mar 22, 2026

ok so to be clear, because you said that:

image_min/max_pixels = (image_size*image_size)* (MIN/MAX_dynamic_patch)

so I pointed out that there is a fundamental misconception in your initial version: image_min/max_pixels must always be calculated based on patch_size, not image_size. In other words, the following is correct:

image_min/max_pixels = (PATCH_size*PATCH_size)* (MIN/MAX_dynamic_patch)

What I was effectively explaining is that because you initially used the term "patch", things got confusing; and because this model technically uses a fixed image size for the encoder, I suggested preproc_min/max_slices to avoid that confusion.

please update your code to remove any wrong usage of the term "patch"
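To make the magnitude of that mix-up concrete, here is a minimal Python sketch. The values patch_size = 14 and image_size = 448 are assumed InternVL-style defaults used purely for illustration; they are not taken from this PR:

```python
# Assumed InternVL-style values (illustration only; not from this PR).
patch_size = 14      # edge of one ViT patch, in pixels
image_size = 448     # fixed encoder input edge (one tile), in pixels
max_dynamic = 12     # max number of dynamic tiles

# Initial (incorrect) interpretation: scale the pixel budget by the tile edge.
wrong_max_pixels = image_size * image_size * max_dynamic

# Formula the reviewer gives as correct: scale by the patch edge.
right_max_pixels = patch_size * patch_size * max_dynamic

# The two differ by a factor of (image_size / patch_size)^2 = 32^2 = 1024.
print(wrong_max_pixels // right_max_pixels)  # -> 1024
```

The factor of 1024 is why the reviewer calls this a fundamental misconception rather than a naming nit.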

Comment on lines +4309 to +4310
self.gguf_writer.add_vision_min_pixels(self.min_dynamic_patch*im2)
self.gguf_writer.add_vision_max_pixels(self.max_dynamic_patch*im2)
Contributor

you need to add new metadata specific to the min/max number of slices, as technically they are not the same as min/max pixels; something like preproc_min/max_slices

Contributor Author

@bssrdf bssrdf Mar 22, 2026

Got it, thanks. Is preproc_min/max_tiles good? "Tiles" is the term used in the source literature.


static Ratio find_closest_aspect_ratio(float aspect_ratio,
                                       const std::vector<Ratio> & target_ratios,
                                       int width, int height, int image_size) {
Contributor

it's unclear what the unit of width and height is here; better to name them n_slices_w/h

Comment on lines +2623 to +2627
dynamic_preprocess(const clip_image_u8 & image_rgb,
int min_num = 1,
int max_num = 12,
int image_size = 448,
bool use_thumbnail = true) {
Contributor

this whole function seems to overlap quite a bit with the llava_uhd logic; basically:

  • select_best_resolution is roughly equivalent to find_closest_aspect_ratio
  • slicing into blocks is equivalent to slice_image
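For reference, the aspect-ratio selection being discussed can be sketched in Python, following InternVL's published preprocessing. This is a hedged reconstruction of the upstream reference logic, not the code from this PR; the area-based tie-break follows the reference implementation:

```python
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    """Pick the (cols, rows) tile grid whose aspect ratio best matches the image.

    Sketch after InternVL's reference preprocessing; target_ratios is a list of
    (cols, rows) pairs whose product lies within the allowed tile range.
    """
    best_diff = float("inf")
    best = (1, 1)
    area = width * height
    for cols, rows in target_ratios:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff = diff
            best = (cols, rows)
        elif diff == best_diff:
            # Tie-break: prefer the larger grid when the image is big enough.
            if area > 0.5 * image_size * image_size * cols * rows:
                best = (cols, rows)
    return best

# A 2:1 landscape image maps to a 2x1 grid of tiles.
print(find_closest_aspect_ratio(2.0, [(1, 1), (2, 1), (1, 2)], 896, 448, 448))  # -> (2, 1)
```

Naming the loop variables cols/rows rather than width/height sidesteps exactly the unit ambiguity the review points out.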

Contributor Author

@bssrdf bssrdf Mar 22, 2026

I can try to reuse the logic in llava_uhd.


ngxson commented Mar 22, 2026

@CISC Can you have a quick look? Thanks.

Comment on lines +2203 to +2206
hparams.image_res_candidates.push_back(clip_image_size{
a*hparams.image_size,
b*hparams.image_size,
});
Contributor

please fix indentation


ngxson commented Mar 22, 2026

fix indentation for all lines:

[image]

Comment on lines +45 to +46
int32_t preproc_min_tiles = -1;
int32_t preproc_max_tiles = -1;
Member

Suggested change:

-    int32_t preproc_min_tiles = -1;
-    int32_t preproc_max_tiles = -1;
+    uint32_t preproc_min_tiles = 0;
+    uint32_t preproc_max_tiles = 0;

Member

Defaulting these to -1 is definitely not ok though.

Contributor Author

@bssrdf bssrdf Mar 22, 2026

I am just following other similar fields. They will be assigned values >= 0 after model init. Again, it's @ngxson's decision.

Member

This is different, you're looping with <= and doing maths with these values.

Contributor

no strong opinion; maybe better to change them to uint32 and add a GGML_ASSERT accordingly, as suggested in the comment below

Contributor Author

Added protection.

Comment on lines +2195 to +2205
int min_num = hparams.preproc_min_tiles;
int max_num = hparams.preproc_max_tiles;
for (int a = min_num; a <= max_num; ++a) {
int b_lo = (min_num + a - 1) / a;
int b_hi = max_num / a;
b_lo = std::max(b_lo, min_num);
b_hi = std::min(b_hi, max_num);
for (int b = b_lo; b <= b_hi; ++b) {
hparams.image_res_candidates.push_back(clip_image_size {
a*hparams.image_size,
b*hparams.image_size,
Member

Suggested change:

-    int min_num = hparams.preproc_min_tiles;
-    int max_num = hparams.preproc_max_tiles;
-    for (int a = min_num; a <= max_num; ++a) {
-        int b_lo = (min_num + a - 1) / a;
-        int b_hi = max_num / a;
-        b_lo = std::max(b_lo, min_num);
-        b_hi = std::min(b_hi, max_num);
-        for (int b = b_lo; b <= b_hi; ++b) {
-            hparams.image_res_candidates.push_back(clip_image_size {
-                a*hparams.image_size,
-                b*hparams.image_size,
+    uint32_t min_num = hparams.preproc_min_tiles;
+    uint32_t max_num = hparams.preproc_max_tiles;
+    for (uint32_t a = min_num; a <= max_num; ++a) {
+        uint32_t b_lo = (min_num + a - 1) / a;
+        uint32_t b_hi = max_num / a;
+        b_lo = std::max(b_lo, min_num);
+        b_hi = std::min(b_hi, max_num);
+        for (uint32_t b = b_lo; b <= b_hi; ++b) {
+            hparams.image_res_candidates.push_back(clip_image_size {
+                static_cast<int>(a * hparams.image_size),
+                static_cast<int>(b * hparams.image_size),

Not sure why the image sizes are int though?
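For illustration, the candidate-resolution enumeration under review can be mirrored in Python. This is a minimal sketch of the same nested loop (including its clamping, reproduced as written in the C++ snippet), with assumed defaults of 1..12 tiles and a 448-pixel tile edge:

```python
def tile_grid_candidates(min_tiles, max_tiles, image_size):
    """Enumerate candidate resolutions (a*image_size, b*image_size) for each
    (a, b) tile grid whose tile count a*b lies in [min_tiles, max_tiles],
    mirroring the nested loop in the clip.cpp snippet above."""
    candidates = []
    for a in range(min_tiles, max_tiles + 1):
        b_lo = (min_tiles + a - 1) // a   # ceil(min_tiles / a)
        b_hi = max_tiles // a             # floor(max_tiles / a)
        b_lo = max(b_lo, min_tiles)       # clamping as in the C++ snippet
        b_hi = min(b_hi, max_tiles)
        for b in range(b_lo, b_hi + 1):
            candidates.append((a * image_size, b * image_size))
    return candidates

grids = tile_grid_candidates(1, 12, 448)
print(len(grids))            # -> 35 candidate resolutions for the 1..12 tile range
print((448, 448) in grids)   # -> True: the single-tile resolution is included
```

With these defaults every enumerated resolution is a multiple of the tile edge in both dimensions, which is what find_closest_aspect_ratio later selects among.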

Contributor Author

Thanks, @CISC, for the review. @ngxson, are you ok with these changes?

Member

Probably not, seems to be a pattern with the signed ints. :)

Contributor Author

maybe leave them as is for this PR?

Member

Something needs to be changed, but I'll leave it for @ngxson to decide, the rest LGTM.

Contributor

clip.cpp already used int for image sizes before I took it over, so I kept it that way; not really a big problem though

@ngxson ngxson merged commit ec2b787 into ggml-org:master Mar 23, 2026
52 checks passed
@bssrdf bssrdf deleted the dynamic-resolution-internvl branch March 23, 2026 00:51