Add dynamic high-resolution image preprocessing for InternVL model#20847
ngxson merged 13 commits into ggml-org:master
Conversation
gguf-py/gguf/constants.py
Outdated
MIN_DYNAMIC_PATCH = "clip.vision.min_dynamic_patch"
MAX_DYNAMIC_PATCH = "clip.vision.max_dynamic_patch"
you should convert patches to image_min/max_pixels and reuse the existing code path instead
Thanks for the review.
These are used when converting/making the mmproj file; min/max_dynamic_patch come from the model config.
In clip.cpp, if I reuse image_min/max_pixels = (image_size*image_size) * (min/max_dynamic_patch), will that be ok? I can then remove KEY_MIN/MAX_DYNAMIC_PATCH and min/max_dynamic_patch from clip_hparams.
No, that doesn't make sense. I think there is a mix-up here: the size of a patch is patch_size, not image_size.
If you mean the UHD slicing style, call them "slices" like what we are using in the code base.
No, these are (image-size x image-size) tiles, which are different from UHD slices. AFAIK, there is no grid. I can reuse image_min/max_pixels without introducing other clip_hparams fields.
> No, these are (image-size x image-size) tiles, which are different from UHD slices.
You sure? llava_uhd also uses square tiles of (image_size x image_size); the only difference is the way the tiles are split. The main vision encoder never supports true dynamic resolution (i.e. unlike qwen-vl, where tiling is not required).
I don't think it's worth reusing image_min/max_pixels; it is intended for models with native dynamic-resolution support (no tiling required). It is better to create a dedicated preproc_min/max_slices for this kind of preprocessing.
I have simplified the handling of the min/max number of patches. Please review.
ngxson left a comment
I cannot accept the current solution. image_min/max_pixels is used to determine the number of tokens that the cgraph needs to process, and to allocate backend buffers accordingly.
The vision encoder never processes more than image_size x image_size worth of pixels, so it does not make sense to reuse this metadata.
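To illustrate the buffer-sizing point, here is a small sketch (not from this PR; the defaults of 448x448 tiles, 14px patches, and a 2x2 pixel shuffle are typical InternVL values, assumed for illustration): the token count per tile is a fixed constant, so the cgraph workload is bounded per tile rather than by the original image's pixel budget.

```python
def n_tokens_per_tile(image_size=448, patch_size=14, pixel_shuffle=2):
    """Tokens produced by one tile after patch embedding and pixel shuffle.

    Defaults are typical InternVL values (illustrative assumption,
    not taken from this PR).
    """
    n_patches = (image_size // patch_size) ** 2  # 32*32 = 1024 patches
    return n_patches // (pixel_shuffle ** 2)     # 1024 / 4 = 256 tokens

# The encoder only ever sees one image_size x image_size tile at a time,
# regardless of how many tiles the preprocessing produces.
print(n_tokens_per_tile())  # 256
```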
tools/mtmd/clip.h
Outdated
bool clip_is_internvl(const struct clip_ctx * ctx);
bool clip_is_llava(const struct clip_ctx * ctx);
// note for contributor: this clip_is_(model) pattern is deprecated
// do NOT add new functions like this
// note for contributor: this clip_is_(model) pattern is deprecated
// do NOT add new functions like this
Sorry, I missed the comments. If no new functions like this are allowed, what are the alternatives? I see you have TODOs here. Does that mean all other models have to wait before batched encoding is supported?
use clip_get_projector_type(ctx_v)
tools/mtmd/clip.cpp
Outdated
#include <set>
#include <stdexcept>
#include <unordered_set>
#include <vector>
Can you give a pointer to how to find the min/max number of tiles? It has to come from the gguf metadata, right?
@bssrdf have you read my comment carefully? #20847 (comment)
Also, to make it clear: please avoid using the term "patch" in this context. The internvl model does NOT support a dynamic number of patches; it only supports a dynamic number of "tiles" or "slices". Only models like qwen-vl or pixtral support truly dynamic "patches", via 2D positional encoding.
I see your comments. I have code for …
OK, so to be clear: because you said that the tiles here are different from UHD slices, I pointed out that there is a fundamental misconception in your initial version. What I'm effectively explaining here is that you initially used the term "patch" where "tile"/"slice" is meant. Please update your code to remove any wrong usage of the term "patch".
convert_hf_to_gguf.py
Outdated
self.gguf_writer.add_vision_min_pixels(self.min_dynamic_patch*im2)
self.gguf_writer.add_vision_max_pixels(self.max_dynamic_patch*im2)
you need to add new metadata specific for the min/max number of slices, as technically they are not the same as min/max pixels. something like preproc_min/max_slices
Got it, thanks. Is preproc_min/max_tiles good? "Tiles" is the term used in the source literature.
tools/mtmd/clip.cpp
Outdated
|
|
static Ratio find_closest_aspect_ratio(float aspect_ratio,
                                       const std::vector<Ratio>& target_ratios,
                                       int width, int height, int image_size) {
it's unclear what the unit of width and height is here; better to name them n_slices_w/h
tools/mtmd/clip.cpp
Outdated
dynamic_preprocess(const clip_image_u8 & image_rgb,
                   int min_num = 1,
                   int max_num = 12,
                   int image_size = 448,
                   bool use_thumbnail = true) {
this whole function seems to overlap quite a bit with the llava_uhd logic, basically:
- `select_best_resolution` is roughly equivalent to `find_closest_aspect_ratio`
- "Slice into blocks" is equivalent to `slice_image`
I can try to reuse the logic in llava_uhd.
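For reference, here is a Python sketch of the tile-grid selection that this kind of dynamic preprocessing performs (a paraphrase of InternVL's reference preprocessing, not code from this PR; names like `find_closest_grid` are illustrative): enumerate every (cols, rows) grid whose tile count lies in [min_num, max_num], then pick the grid whose aspect ratio is closest to the input's, breaking ties toward more tiles only when the image is large enough to warrant them.

```python
def find_closest_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Pick a (cols, rows) tile grid for an image of the given pixel size.

    Paraphrased from InternVL's reference preprocessing; parameter and
    function names are illustrative assumptions, not from this PR.
    """
    aspect_ratio = width / height
    # all grids with min_num <= cols*rows <= max_num, in deterministic order
    grids = sorted(
        {(c, r) for n in range(min_num, max_num + 1)
                for c in range(1, n + 1) for r in range(1, n + 1)
                if min_num <= c * r <= max_num},
        key=lambda g: (g[0] * g[1], g))
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect_ratio - cols / rows)
        # tie-break: prefer the bigger grid only if the image has more than
        # half the pixels that grid would cover at native tile resolution
        if diff < best_diff or (diff == best_diff and
                                width * height > 0.5 * image_size ** 2 * cols * rows):
            best, best_diff = (cols, rows), diff
    return best

print(find_closest_grid(896, 448))   # (2, 1): two tiles side by side
print(find_closest_grid(448, 1344))  # (1, 3): three tiles stacked
```

The image is then resized to `cols*image_size` x `rows*image_size` and cut into square tiles, optionally with a downscaled thumbnail appended.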
@CISC Can you have a quick look? Thanks.
tools/mtmd/clip.cpp
Outdated
hparams.image_res_candidates.push_back(clip_image_size{
    a*hparams.image_size,
    b*hparams.image_size,
});
tools/mtmd/clip-model.h
Outdated
int32_t preproc_min_tiles = -1;
int32_t preproc_max_tiles = -1;

Suggested change:
- int32_t preproc_min_tiles = -1;
- int32_t preproc_max_tiles = -1;
+ uint32_t preproc_min_tiles = 0;
+ uint32_t preproc_max_tiles = 0;
Defaulting these to -1 is definitely not ok though.
I am just following other similar fields. They will be assigned values >= 0 after model init. Again, it's @ngxson's decision.
This is different, you're looping with <= and doing maths with these values.
no strong opinion; maybe better to change it to uint32_t and add a GGML_ASSERT accordingly, as suggested in the comment below
tools/mtmd/clip.cpp
Outdated
int min_num = hparams.preproc_min_tiles;
int max_num = hparams.preproc_max_tiles;
for (int a = min_num; a <= max_num; ++a) {
    int b_lo = (min_num + a - 1) / a;
    int b_hi = max_num / a;
    b_lo = std::max(b_lo, min_num);
    b_hi = std::min(b_hi, max_num);
    for (int b = b_lo; b <= b_hi; ++b) {
        hparams.image_res_candidates.push_back(clip_image_size {
            a*hparams.image_size,
            b*hparams.image_size,

Suggested change:
- int min_num = hparams.preproc_min_tiles;
- int max_num = hparams.preproc_max_tiles;
- for (int a = min_num; a <= max_num; ++a) {
-     int b_lo = (min_num + a - 1) / a;
-     int b_hi = max_num / a;
-     b_lo = std::max(b_lo, min_num);
-     b_hi = std::min(b_hi, max_num);
-     for (int b = b_lo; b <= b_hi; ++b) {
-         hparams.image_res_candidates.push_back(clip_image_size {
-             a*hparams.image_size,
-             b*hparams.image_size,
+ uint32_t min_num = hparams.preproc_min_tiles;
+ uint32_t max_num = hparams.preproc_max_tiles;
+ for (uint32_t a = min_num; a <= max_num; ++a) {
+     uint32_t b_lo = (min_num + a - 1) / a;
+     uint32_t b_hi = max_num / a;
+     b_lo = std::max(b_lo, min_num);
+     b_hi = std::min(b_hi, max_num);
+     for (uint32_t b = b_lo; b <= b_hi; ++b) {
+         hparams.image_res_candidates.push_back(clip_image_size {
+             static_cast<int>(a * hparams.image_size),
+             static_cast<int>(b * hparams.image_size),
Not sure why the image sizes are int though?
Probably not, seems to be a pattern with the signed ints. :)
maybe leave them as is for this PR?
Something needs to be changed, but I'll leave it for @ngxson to decide, the rest LGTM.
clip.cpp already uses int for image size before I took over it, so I kept it that way, not really a big problem though
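To sanity-check the loop discussed above, here is a direct Python translation (illustrative only, not part of the PR): for every grid (a, b) it emits a candidate resolution of a*image_size by b*image_size, and the b_lo/b_hi bounds keep the tile count a*b within [min_tiles, max_tiles].

```python
def res_candidates(min_tiles, max_tiles, image_size):
    """Python translation of the candidate-resolution loop (illustrative)."""
    out = []
    for a in range(min_tiles, max_tiles + 1):
        b_lo = max((min_tiles + a - 1) // a, min_tiles)  # ceil(min_tiles / a)
        b_hi = min(max_tiles // a, max_tiles)            # floor(max_tiles / a)
        for b in range(b_lo, b_hi + 1):
            out.append((a * image_size, b * image_size))
    return out

cands = res_candidates(1, 12, 448)
# every candidate corresponds to a tile count within [1, 12]
assert all(1 <= (w // 448) * (h // 448) <= 12 for w, h in cands)
print(len(res_candidates(1, 2, 448)))  # 3: (448,448), (448,896), (896,448)
```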

This PR adds support for the dynamic high-resolution tiling used by the InternVL model. This makes Qianfan-OCR work in llama.cpp.
Use the mmproj file from https://huggingface.co/bssrdf/Qianfan-OCR-gguf/blob/main/Qianfan-OCR-mmproj-bf16.gguf and the LLM from https://huggingface.co/Reza2kn/Qianfan-OCR-GGUF/tree/main
Example: OCR the above image
master
This PR