support MiniCPM-V-2.5 #7599
Conversation
sync master
The Python demo is impressive, even considering that the current PR is not the cleanest way of integrating it; if the results from Python can be replicated, it is definitely worth adding in my opinion. Major points:
The fewer redundant files the better, so that maintaining them does not become a big issue in the future.
For testing I've added a target to the CMake file and removed the path from the examples CMake file; it compiles fine that way.
Generation quality issue
Basically what @cmp-nct said. The generation quality is the biggest issue, especially when working with text. Have you tested whether the tokenizer works as you'd expect?
I think the problem is deeper than that; it also saw two images in my example instead of one. Most text was totally misread, as if tokens were not in the right sequence and/or the clip tensors were not correct. The two major problems (generation quality, conversion) need to be solved.
May I confirm that you are using a newly converted model in the process? The code in this PR should not be able to directly use our GGUF models on HF; those GGUFs are matched to the code in our own fork.
Hi, I was not able to convert a fresh one due to the error described, so I used the GGUF from your HF repository. Can you run the license image on your local copy and test what results you get? Your web demo is able to provide a flawless OCR of all IDs and numbers. If we can get the generation quality fixed and the conversion working, I'd want to get this PR merged as soon as possible.
Hi! Sorry for the inconvenience. After looking into it, we also found some differences between the behavior of the proposed PR version and the actual Python model demo (also different from our fork). In the end, we'll provide the complete steps to reproduce it.
This looks very promising! I've been looking into the conversion issue and made a bit of progress on that end:
Maybe I am missing something, but I think that's a flaw of the current conversion script: the checksum detection is nice for 90% of all cases, but any changed model breaks that method. @Galunid please take a look at the conversion process; we should have a way to force model compatibility without manually adding a checksum. Or did I miss something? Generation quality with the new model is significantly better, but still not as good as your example.
@cmp-nct In general you shouldn't add the checksum by hand; instead you should use the update script. For now maybe it's best to use the legacy converter.
When I tried the legacy converter I got a GGUF binary with the wrong magic (PK); using the manual checksum "hack" worked. The update process didn't work out for me, though it's always a pain when I am gone for 2-3 weeks and so much changes that it feels like years have passed - I probably did something wrong :)
Hi👋, any updates now? It looks like the results from the openbmb fork are fine, but the merge into this master branch is faulty?
I'm quite sure there are discrepancies on the fork side too. My guess is that the finetuning is slightly broken; even a single wrong token can cause projector-based LLMs to become stupid. I hope @tc-mb can finish his PR here; MiniCPM in the Python reference is quite stunning and would be a great benefit to llama.cpp (and higher-level projects like ollama).
Is there any progress?
@tc-mb
The response was "There is a calculator in the lower left corner." Below is the entire log:
I also tested the license demo, I used a slightly different prompt because it redacted everything before.
@cmp-nct Our team has limited manpower, and I was stuck on another urgent project until the end of last week. I will do my best to finish all the changes to the PR this week. I converted the model using the advice mentioned earlier and re-reviewed the code, finding two issues in the previous PR. Our model supports images of any size and uses a different approach from llava. Even though I was as careful as possible when merging the code into llava, a key parameter was not passed down to the lowest level, so the model only used the default parameters. This caused the model to produce poor results, but with some correct outputs, which is why I didn't catch the bug immediately. The version I just submitted fixes this issue. Additionally, I'm not certain that this version is completely consistent with the Python version in terms of accuracy. This week I will continue the work I hadn't completed before and quantitatively confirm the model's performance by running evaluation-set metrics, rather than just checking a few cases.
I ran your update and there are significant improvements in output quality! I ran it on a still image of store goods and it mentioned a mirror and text, neither of which was there. Something is still strangely off, maybe in terms of image preprocessing or patching? I'm excited for your evaluation results.
@cmp-nct @tc-mb My apologies if this has already been discovered, but from my quick research and experimentation yesterday I have been able to successfully use the openbmb/MiniCPM-Llama3-V-2_5-gguf VLM directly in LM Studio, upload an image, and have it described. The same works with nearly any other non-vision llama 3 variant using xtuner's llama 3 mmproj file found here: https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/tree/main. I believe this could hold clues to help speed up this merge and even potentially help llama.cpp expand its VLM capabilities, as VLMs are growing very quickly in relevance and usefulness. Here is an example of it working, though most likely with quality degraded by the llava influence: the example uses the openbmb/MiniCPM-Llama3-V-2_5-gguf model with xtuner's llama 3 mmproj file. I have also tried it with Lewdiculous/L3-8B-Stheno-v3.2-GGUF-IQ-Imatrix/L3-8B-Stheno-v3.2-IQ3_M-imat.gguf and a few other llama 3 variants with overall good results. I can only hope this helps somehow; llama.cpp is without a doubt a leading LLM project, but we really need to figure out universal VLM compatibility soon, because in the upcoming months we will see many new VLM models with full, consistent video vision, not only for on-device use (for example consistent screen-share capabilities) but also for consistent robotics vision, and I think llama.cpp would be a great foundation for all of that.
@@ -36,7 +37,7 @@ struct clip_image_f32_batch {
    size_t size;
};

CLIP_API struct clip_ctx * clip_model_load (const char * fname, int verbosity);
CLIP_API struct clip_ctx * clip_model_load (const char * fname, int verbosity, std::pair<int, int> load_image_size = {448, 448});
This is supposed to be a C header; as such, C++ headers and types should not be here.
Simply decompose the dimensions into width and height :)
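As a sketch of what that decomposition could look like (the parameter names here are mine, not from this PR; note that a C header also cannot carry the C++ default argument, so callers would pass the size explicitly):
/* clip.h sketch: width/height as plain ints keeps the declaration valid C. */
CLIP_API struct clip_ctx * clip_model_load(const char * fname, int verbosity,
                                           int load_image_width,   /* e.g. 448 */
                                           int load_image_height); /* e.g. 448 */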
Same with llava.h and your minicpmv_wrapper.h, since it contains
#ifdef __cplusplus
extern "C" {
#endif
suggesting that it should be a C header.
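For reference, the usual pattern (a generic skeleton, not code quoted from this PR) wraps all declarations and closes the guard at the end of the header, with only C types in between:
#ifdef __cplusplus
extern "C" {
#endif

/* ... declarations using only C types (no std::pair, std::vector, etc.) ... */

#ifdef __cplusplus
}
#endif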
We felt that we should have found the problem and have updated the C++ code; maybe you can try this version. By comparing it with the C++ implementation, we actually discovered a bug hidden in the Python code. Unfortunately, the model has already been trained, so for now I can only imitate the C++ behavior and reproduce the same mistake in Python so that minicpmv2.5 works well. We will fix this in future model training and provide the community with better-performing open-source models. I verified the accuracy of the model on MME and found that the GGUF F16 version loses tens of points, and the quantized version loses tens more. Considering that MME's perfect score is 2800, this does not seem unacceptable. When I traced the numerical differences from beginning to end, I found that they exist from the very start: the interpolation function (bicubic_resize) in clip also causes obvious numerical differences. I will try to make changes next week.
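To illustrate where such drift can come from (a sketch only, not the actual code in clip.cpp): a bicubic resampler typically weights the 4x4 neighbouring pixels with a cubic convolution kernel like the one below, and small differences in the coefficient, the edge clamping, or float-versus-double accumulation between the C++ and Python pipelines are enough to shift individual pixel values.
/* Illustrative only: Keys cubic convolution kernel (a = -0.5), the kind of
 * weight function a bicubic resize evaluates per neighbouring pixel.
 * Not taken from clip.cpp. */
#include <math.h>

static float cubic_weight(float x) {
    const float a = -0.5f; /* Catmull-Rom coefficient */
    x = fabsf(x);
    if (x <= 1.0f) {
        return (a + 2.0f)*x*x*x - (a + 3.0f)*x*x + 1.0f;
    }
    if (x < 2.0f) {
        return a*x*x*x - 5.0f*a*x*x + 8.0f*a*x - 4.0f*a;
    }
    return 0.0f;
}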
That sounds great, 10 points should not be a big hit in quality.
1) Model issues
2) Merging issues
3) Final step
I've summarized the issues for merging.
If possible, please also take a look at the two example images.
There might be a problem with some resolutions; the calculation image especially doesn't work at all.
CLIP_API struct ggml_tensor * clip_get_newline_tensor(const struct clip_ctx * ctx);

CLIP_API bool clip_image_encode (struct clip_ctx * ctx, int n_threads, struct clip_image_f32 * img, float * vec);
CLIP_API bool clip_image_batch_encode(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, float * vec);
CLIP_API bool clip_image_encode (struct clip_ctx * ctx, int n_threads, struct clip_image_f32 * img, float * vec, std::pair<int, int> load_image_size = {448, 448});
Please use a C struct as replacement for the <int, int> pair
the clip.h should stay C compatible, otherwise we'll probably not get it merged
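One way to do that, sketched here with a hypothetical struct name (C has no default arguments, so the size would be passed explicitly or set through a separate setter):
/* Sketch of a plain C aggregate replacing std::pair<int, int>. */
struct clip_image_size {
    int width;
    int height;
};

CLIP_API bool clip_image_encode(struct clip_ctx * ctx, int n_threads,
                                struct clip_image_f32 * img, float * vec,
                                struct clip_image_size load_image_size);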
@@ -118,6 +117,7 @@ poetry.toml
/tests/test-tokenizer-0
/tests/test-tokenizer-1-bpe
/tests/test-tokenizer-1-spm
/openbmb
should not be required
add_library(minicpmv_wrapper OBJECT
            minicpmv_wrapper.cpp
            )
target_link_libraries(minicpmv_wrapper PRIVATE llava ${CMAKE_THREAD_LIBS_INIT})
the new target is missing
I would remove that image; the speed shown depends heavily on the software version, the quantization, and the hardware used.
@@ -3,6 +3,7 @@

#include <stddef.h>
#include <stdint.h>
#include <utility>
needs to stay a C header
struct uhd_image_embed {
    std::vector<std::vector<struct llava_image_embed *>> image_embeds;
};
Should stay C-compatible; maybe it can be declared similarly to clip_ctx?
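A sketch of that clip_ctx-style approach: keep the struct opaque in the header and expose it only through functions, so the std::vector member stays in the C++ source. The function names below are hypothetical, not from this PR.
/* llava.h sketch: forward-declare only; the definition (with the std::vector
 * member) lives in the .cpp file, so the header stays C-compatible. */
struct uhd_image_embed;

LLAVA_API struct uhd_image_embed * uhd_image_embed_init(void);
LLAVA_API void uhd_image_embed_free(struct uhd_image_embed * embeds);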
Dear llama.cpp Official,
Hi, I'm writing about our new PR submission for integrating our model MiniCPM-Llama3-V 2.5 into llama.cpp; the model has been trending on Hugging Face for over a week and has garnered significant user demand. During the previous PR attempt for MiniCPM-V, we identified several critical implementation bugs. The official MiniCPM-V team has since fixed all of these issues, resulting in performance that matches our PyTorch version. These changes also distinguish our implementation significantly from the LLaVA example codebase.
Here are some key differences and improvements we've made:
While some aspects of our implementation may appear similar to the LLaVA example codebase, these distinct features and optimizations set our model apart. We can reference LLaVA for the overlapping components to maintain code integrity, but this might compromise the standalone nature of the different examples, akin to how Hugging Face Transformers ensures each model has its own implementation.
Given the extensive user interest and the robust performance of our implementation, merging this model would significantly benefit the community. We are open to collaborating on any adjustments you deem necessary and are committed to ensuring the highest code quality and usability.
Thank you for considering our request. We look forward to your feedback and hope for a positive resolution.
Best regards,
MiniCPM-V Official ^_^