
support MiniCPM-V-2.5 #7599

Open · wants to merge 41 commits into base: master

Conversation

@tc-mb commented May 28, 2024

Dear llama.cpp Official,

Hi, I'm writing to address our new PR submission for integrating our model MiniCPM-Llama3-V 2.5 into llama.cpp, which has been trending on Huggingface for over a week and has garnered significant user demand. During the previous PR attempt for MiniCPM-V, we identified several critical implementation bugs. The official MiniCPM-V team has since fixed all of these issues, resulting in performance that matches our PyTorch version. These changes also distinguish our implementation significantly from the LLaVA example codebase.

Here are some key differences and improvements we've made:

  1. Flexible Image Handling: We support arbitrary image sizes by dynamically segmenting images into sub-images, allowing our ViT to accept various aspect ratios, unlike the fixed dimensions required by other models (a toy grid-selection sketch follows this list).
  2. 2D Resampler: Our model uses a 2D resampler to downsample image features into shorter sequences, significantly speeding up inference.
  3. Enhanced Embedding: Unlike the original ViT positional encoding used in previous VLMs, we employ a new approach for image embedding with a PosEmbedding layer.
  4. Distinct Tokenizer: Our tokenizer is different from LLaVA's, leading to unique special token decoding.
  5. Upper Framework Support: We've optimized our model for better integration with frameworks like Ollama.
  6. CLI Optimization: We've made modifications to better adapt the CLI for Android use.
  7. NPU-Optimized ViT: We've rewritten the Vision Transformer (ViT) component to leverage NPU on mobile devices, optimizing I/O for Android inference. (this week)
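
To make point 1 a little more concrete, below is a toy sketch of the kind of grid selection such slicing implies; the candidate grids and the log-ratio scoring are assumptions for illustration only, not the actual uhd_slice_image logic in this PR:

#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Toy sketch only: pick a cols x rows grid whose aspect ratio best matches the
// input image, so each sub-image can be resized to the ViT's native resolution.
// The candidate set and the log-ratio score are illustrative assumptions, not
// the exact MiniCPM-V heuristic (which also accounts for overall scale).
static std::pair<int, int> pick_best_grid(int img_w, int img_h, int max_slices = 9) {
    std::pair<int, int> best = {1, 1};
    double best_err = 1e30;
    const double target = std::log((double) img_w / img_h);
    for (int rows = 1; rows <= max_slices; ++rows) {
        for (int cols = 1; cols <= max_slices; ++cols) {
            if (rows * cols > max_slices) continue;
            double err = std::fabs(std::log((double) cols / rows) - target);
            if (err < best_err) { best_err = err; best = {cols, rows}; }
        }
    }
    return best; // a wide input ends up with a wide grid, e.g. 2x1 or 3x1
}

int main() {
    auto grid = pick_best_grid(1183, 664); // an example wide image
    std::printf("grid: %d cols x %d rows\n", grid.first, grid.second);
    return 0;
}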

While some aspects of our implementation may appear similar to the LLaVA example codebase, these distinct features and optimizations set our model apart. We can reference LLaVA for the overlapping components to maintain code integrity, but this might compromise the standalone nature of the different examples, akin to how Huggingface Transformers ensures each model has its own implementation.

Given the extensive user interest and the robust performance of our implementation, merging this model would significantly benefit the community. We are open to collaborating on any adjustments you deem necessary and are committed to ensuring the highest code quality and usability.

Thank you for considering our request. We look forward to your feedback and hope for a positive resolution.

Best regards,
MiniCPM-V Official ^_^

@tc-mb marked this pull request as ready for review May 28, 2024 20:41
@cmp-nct (Contributor) commented Jun 12, 2024

@tc-mb

The python demo is impressive, even considering that the current PR is not the cleanest way of integrating it; if the results from python can be replicated, it is definitely worth adding in my opinion.
I reviewed the PR and there are some minor and major issues left:

Major:

  • The model running through llama.cpp responds significantly worse than on the web demo, below llava-1.5 in my examples
  • After getting that bad quality I tried to convert the model myself; the first hurdle is that the convert.py mentioned in the readme is outdated (no longer available in the main dir). I tried using the legacy convert.py, but the resulting GGUF file errors out with invalid magic characters "PK".

The fewer redundant files the better, so maintaining them does not become a big issue in the future.
Minor:

  • The examples subdir doesn't exist anymore but is referenced in the cmakefile
  • At the moment the new client is not being built by cmake; it has no section in the cmakefile
  • The readme references the old fork and outdated python tools.
  • It should also be possible to remove minicpmv-cli.cpp and just use llava-cli.cpp, built with a define flag (an #if could trigger the wrapper headers to be included; see the sketch after this list).
  • Long term the wrapper is not a nice solution, but for now imho it would be worth it if the generation quality issue is fixed.
  • I don't think we should keep the separate encoder and surgery python files:
    The current "llava-surgery-v2" was intended to handle all sorts of models (pytorch, safetensors and different types of projectors); the new one should just be added into it, similar to the old ones, instead of duplicating those tools.
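
To make the define-flag idea above concrete, here is a rough, hypothetical sketch (the LLAVA_CLI_MINICPMV flag, the printf placeholders and the commented-out include are assumptions, not code from this PR):

// Hypothetical sketch of one CLI serving both models behind a compile-time flag.
// The flag and the wrapper header name are assumptions for illustration only.
#include <cstdio>

#ifdef LLAVA_CLI_MINICPMV
// #include "minicpmv_wrapper.h"   // would pull in the MiniCPM-V specific helpers
static void process_image() { std::printf("MiniCPM-V path: UHD slicing + resampler projector\n"); }
#else
static void process_image() { std::printf("default LLaVA path\n"); }
#endif

int main() {
    // cmake would add -DLLAVA_CLI_MINICPMV for the MiniCPM-V build of llava-cli
    process_image();
    return 0;
}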

For testing I've added a target to the cmakefile and removed the path from the examples cmakefile:

set(TARGET minicpmv-cli)
add_executable(minicpmv-cli minicpmv-cli.cpp)
install(TARGETS minicpmv-cli RUNTIME)
target_link_libraries(minicpmv-cli PRIVATE common minicpmv_wrapper llava ${CMAKE_THREAD_LIBS_INIT})

It compiles fine that way.

Generation quality issue
The generation quality, however, is the biggest issue; something seems to be quite wrong:
I used the provided Q5_K and 16-bit projector from your HF site.
I ran it on a US driver's license that is OCR'd very well by your online python demo, but using minicpmv-cli the results are worse than llava-1.5 (almost everything wrong, tons of hallucinations)

https://www.ocregister.com/wp-content/uploads/2018/10/0418_nws_idb-l-allen-col-0418-1.jpg?w=1035

.\build\bin\minicpmv-cli.exe -m Q:\models\llava\minicpm-2.5\ggml-model-Q5_K_S.gguf --mmproj Q:\models\llava\minicpm-2.5\mmproj-model-f16.gguf --image C:\temp\LICENSE_DEMO.jpg  -ngl 99 -p "Describe all text visible of this demo image, do not censor anything and respond in json" --temp 0 -c 2048 -n 1000

<user>Describe all text visible of this demo image, do not censor anything and respond in json
<assistant>
The image is a composite of two different ID cards, likely for the same person, with various personal and biometric data. Here's the structured description of the text and elements visible in the image:

**Left ID Card:**
- **Header:** "California USA DRIVER LICENSE"
- **Photo:** A woman's photo
- **Name:** "Lina Rodriguez"
- **Date of Birth:** "08/31/1974"
- **Address:** "1234 Main St, San Diego, CA 92101"
- **License Number:** "DL 123456"
- **Expiration Date:** "08/31/2017"
- **Signature:** "Lina Rodriguez"
- **Biometric Data:** "SEX F" (Female), "HAIR BRN" (Brunette), "EYES BRN" (Brown), "FINGERNAIL POLISH" (No polish), "LIPSTICK" (No lipstick), "HAIR 09/30/2017", "EYES 09/30/2017", "FINGERNAIL POLISH 09/30/2017", "LIPSTICK 09/30/2017"
- **Additional Information:** "CLASS A", "08/31/2017", "RSTR NONE", "DONOR", "VOTER", "SEX F", "HAIR BRN", "EYES BRN", "FINGERNAIL POLISH", "LIPSTICK", "HAIR 09/30/2017", "EYES 09/30/2017", "FINGERNAIL POLISH 09/30/2017", "LIPSTICK 09/30/2017"

**Right ID Card:**
- **Header:** "USA DRIVER LICENSE"
- **Photo:** A woman's photo
- **Name:** "Lina Rodriguez"
...

@Galunid (Collaborator) commented Jun 13, 2024

Basically what @cmp-nct said. The generation quality is the biggest issue, especially when working with text. Have you tested whether the tokenizer works as you'd expect?

@cmp-nct (Contributor) commented Jun 13, 2024

> Basically what @cmp-nct said. The generation quality is the biggest issue, especially when working with text. Have you tested whether the tokenizer works as you'd expect?

I think the problem is deeper than that; it also saw two images in my example instead of one. Most text was totally misread, as if tokens were not in the right sequence and/or the clip tensors were not correct.
It looks like one or possibly multiple issues in CLIP and in image sampling/ordering.

The two major problems (generation quality, conversion) need to be solved.
Then I'd recommend a merge so this doesn't diverge further from master; getting rid of the minor redundancy can be done in later updates.

@tc-mb (Author) commented Jun 14, 2024

May I confirm whether you are using a newly converted model in this process? The code in this PR should not be able to directly use our GGUF models on HF, since those GGUFs are matched to our own fork's code.
But that means using the new conversion method, and I'm not sure whether it will work well, because I don't know the changes in llama.cpp's conversion script for Llama 3.

@cmp-nct (Contributor) commented Jun 14, 2024

> May I confirm whether you are using a newly converted model in this process? …

Hi, I was not able to convert a fresh one due to the error described, so I used the GGUF from your HF repository.
I fear those are quite outdated and some issues might come from that (input tokenization might be off?).
But that does not explain the severe problems; the CLIP model should not be different, right?

Can you run the license image on your local copy and check what results you get? Your web demo provides a flawless OCR of all IDs and numbers.

If we can get the generation quality fixed and the conversion working, I'd want to get this PR merged as soon as possible.
Every further week that passes, master changes, so merging gets harder over time.

@tc-mb (Author) commented Jun 14, 2024

> Can you run the license image on your local copy and check what results you get? Your web demo provides a flawless OCR of all IDs and numbers.

Hi! Sorry for the inconvenience. After looking into it, we also found some differences between the behavior of the proposed PR version and the actual Python model demo (also different from our fork).
Once we noticed the problem, we re-ran the results from our fork, and they looked fine. We think the issue might be due to some changes in other parts of the master branch. As you can see from the results below, the previous version seems pretty accurate. We're eager to get the code merged into the official repository as soon as possible. We're doing our best to find out what's going wrong and would appreciate your help in identifying the issue.

At the end, we'll provide the complete steps to reproduce it.

# running script:

./minicpmv-cli -m ../openbmb/MiniCPM-V-2_5/gguf/ggml-model-Q4_K_M.gguf --mmproj ../openbmb/MiniCPM-V-2_5/new/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image /Users/a0/Pictures/20240614-205027.jpeg -p "Describe all text visible of this demo image, do not censor anything and respond in json"

# result from our fork:

<assistant>
The image displays a driver license from California, USA, with various personal and identification details.
At the top of the image:
"California USA DRIVER LICENSE"

In the center, within a card-like border:
"DL 11234568"
"EXP 08/31/2014"
"LN CARDHOLDER"
"FN IMA"

On the left side within a blue rectangle:
"CLASS C"
"END NONE"

On the right side within a red rectangle:
"RSTRNONE"
"SEX F"
"HG BRN"
"HAIR BRN"
"EYES BRN"
"ISS HGT 125 lb"
"OBT 08/31/2009"

At the bottom of the image within a black rectangle:
"Ma. Cardholder"
"00/0000000000000NNAN/AFD/YY"
"08/31/2009"

On the left side, near the woman's photo:
"I'm"
"Cardholder"

@cmp-nct (Contributor) commented Jun 14, 2024

This looks very promising!

I've been looking into the conversion issue and made a bit of progress on that end:

  1. convert.py is not supported for llama3
  2. your surgery process creates a "model" directory with the new model which needs to be converted
    The new method is convert-hf-to-gguf.py
    This fails because it appears to need trust_remote_code=True (line 375) AND the checksum detection is not working out.
    I used this hack to make it detect it as llama3-bpe in line 426:
        if chkhsh == "1baddeb572cd9de2a6d36f2ad0c361490bf5447dafca20afbac625e9d37f18a5":
            # minicpm
            res = "llama-bpe"

Maybe I am missing something, but I think that's a flaw of the current conversion script: the checksum detection is nice for 90% of all cases, but any changed model breaks that method.
  3. With those two additions the conversion works like this:
    python .\convert-hf-to-gguf.py Q:\models\llava\minicpm-2.5\model
    .\build\bin\quantize Q:\models\llava\minicpm-2.5\model\ggml-model-f16.gguf q5_k 16

@Galunid please take a look at the conversion process; we should have a way to force model compatibility without manually adding a checksum. Or did I miss something?


Generation quality with the new model is significantly better, but still not as good as your example.

<user>Describe all text visible of this demo image, do not censor anything
<assistant>
The image is a composite of various identification documents and a photograph, likely used for illustrative purposes related to identity verification or security.
- On the top left corner: "California USA DRIVER LICENSE"
- License number: "DL 1234568"
- Address: "EXP 08/31/2014 2581 24TH ST 09/08/2017 2581 24TH ST 09/08/2017"
- Name: "Lina Cardoso"
- Date of Birth: "09/08/1977"
- Gender: "F"
- Hair Color: "BRN"
- Eye Color: "EXP 08/31/2014 2581 24TH ST 09/08/2017"
- Height: "5'8"""
- Weight: "125 lb"
- Blood Type: "O"
- Signature: "Lina Cardoso"
- Photo: A woman's portrait
- Bottom left corner: "CLASS A"
- Bottom right corner: "CLASS B"

- On the top right corner: "CLASS A"
- License number: "DL 1234568"
- Address: "EXP 08/31/2014 2581 24TH ST 09/08/2017 2581 24TH ST 09/08/2017"
- Name: "Lina Cardoso"
- Date of Birth: "09/08/1977"
- Gender: "F"
- Hair Color: "BRN"
- Eye Color: "EXP 08/31/2014 2581 24TH ST 09/08/2017"
- Height: "5'8"""
- Weight: "125 lb"
- Blood Type: "O"
- Signature: "Lina Cardoso"
- Photo: A woman's portrait

@Galunid (Collaborator) commented Jun 15, 2024

> @Galunid please take a look at the conversion process; we should have a way to force model compatibility without manually adding a checksum. Or did I miss something?

@cmp-nct In general you shouldn't add the checksum by hand; instead you should use convert-hf-to-gguf-update.py, which does that for you. You need to add the correct model there. There's some work done in #7379 to improve the process. You can check #6920 for details on why it was done this way. Unfortunately, convert-hf-to-gguf-update.py has a problem with loading remote code (as in, it doesn't download it from the repo and it doesn't run it).

For now maybe it's best to use examples/convert-legacy-llama.py and then gguf-py/scripts/gguf-new-metadata.py --pre-tokenizer llama-bpe. I tried my hacked convert-hf-to-gguf.py a while ago and there wasn't a difference in generation quality.

@cmp-nct (Contributor) commented Jun 15, 2024

> In general you shouldn't add the checksum by hand; instead you should use convert-hf-to-gguf-update.py … For now maybe it's best to use examples/convert-legacy-llama.py and then gguf-py/scripts/gguf-new-metadata.py --pre-tokenizer llama-bpe.

When I tried the legacy converter I got a GGUF binary with the wrong magic (PK); using the manual checksum "hack" worked. The update process didn't work out for me, though it's always a pain when I am gone for 2-3 weeks and so much has changed that it feels like years have passed - I probably did something wrong :)
I think a "force" option would be a good way to handle special cases, better than having to modify the python code (checksum) for a one-time conversion?

@Cuiunbo commented Jun 16, 2024

> This looks very promising! … Generation quality with the new model is significantly better, but still not as good as your example.

Hi👋, any updates now? It looks like the results from the openbmb fork are fine, but the merge into this master branch is faulty?
@tc-mb @cmp-nct

@cmp-nct (Contributor) commented Jun 18, 2024

> Hi👋, any updates now? It looks like the results from the openbmb fork are fine, but the merge into this master branch is faulty? @tc-mb @cmp-nct

I'm quite sure there are discrepancies on the fork side too. My guess is that the finetuning is slightly broken; even a single wrong token can cause projector-based LLMs to become stupid.
I am also not sure whether our CLIP implementation has a fundamental computation issue; in my previous work on llava-1.6 I noticed significant differences compared to the reference but had no time to dig into it.

I hope @tc-mb can finish his PR here; minicpm in the python reference is quite stunning and would be a great benefit to llama.cpp (and higher-level projects like ollama).

@Forevery1 commented

Is there any progress?

@cmp-nct (Contributor) commented Jun 24, 2024

@tc-mb
I tried your most recent commit on a reference image:
[image: reference_2]

.\build\bin\RelWithDebInfo\minicpmv-cli.exe -m Q:\models\llava\minicpm-2.5\model\ggml-model-f16.gguf --mmproj Q:\models\llava\minicpm-2.5\mmproj-model-f16.gguf --image C:\temp\reference_2.png -ngl 99 -p "What is in the lower left corner?" --temp 0 -c 2048 -n 1000 --verbose-prompt

The response was "There is a calculator in the lower left corner."
This is the same error we see with Microsoft's Phi-V, which uses SigLIP, as if the spatial patches are mixed up.
The number of image tokens was just 4 × 96.

Below is the entire log:

Log start
clip_model_load: description:  image encoder for MiniCPM-V
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    455
clip_model_load: n_kv:         18
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 18 key-value pairs and 455 tensors from Q:\models\llava\minicpm-2.5\mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                clip.has_minicpmv_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                        general.description str              = image encoder for MiniCPM-V
clip_model_load: - kv   6:                        clip.projector_type str              = resampler
clip_model_load: - kv   7:                     clip.vision.image_size u32              = 448
clip_model_load: - kv   8:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv   9:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  10:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  11:                 clip.vision.projection_dim u32              = 0
clip_model_load: - kv  12:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  13:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  14:                    clip.vision.block_count u32              = 26
clip_model_load: - kv  15:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  16:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  285 tensors
clip_model_load: - type  f16:  170 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  0
clip_model_load: minicpmv_projector:  1
clip_model_load: model size:     1044.86 MB
clip_model_load: metadata size:  0.16 MB
clip_model_load: params backend buffer size =  1044.86 MB (455 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 104.80 MB
uhd_slice_image: multiple 4
uhd_slice_image: image_size: 1183 664; source_image size: 602 336
uhd_slice_image: image_size: 1183 664; best_grid: 3 1
uhd_slice_image: refine_image_size: 1050 588; refine_size: 1050 588
llava_image_embed_make_with_bytes_uhd: 602 336
llava_image_embed_make_with_bytes_uhd: 350 588
llava_image_embed_make_with_bytes_uhd: 350 588
llava_image_embed_make_with_bytes_uhd: 350 588
encode_image_with_clip: image embedding created: 96 tokens

encode_image_with_clip: image encoded in   108.98 ms by CLIP (    1.14 ms per image patch)
encode_image_with_clip: image embedding created: 96 tokens

encode_image_with_clip: image encoded in    63.51 ms by CLIP (    0.66 ms per image patch)
encode_image_with_clip: image embedding created: 96 tokens

encode_image_with_clip: image encoded in    62.05 ms by CLIP (    0.65 ms per image patch)
encode_image_with_clip: image embedding created: 96 tokens

encode_image_with_clip: image encoded in    60.96 ms by CLIP (    0.63 ms per image patch)
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from Q:\models\llava\minicpm-2.5\model\ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = model
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 128002
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7997 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 14.96 GiB (16.00 BPW)
llm_load_print_meta: general.name     = model
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: UNK token        = 128002 '<unk>'
llm_load_print_meta: PAD token        = 0 '!'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =  1002.00 MiB
llm_load_tensors:      CUDA0 buffer size = 14315.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2

minicpmv_init: llava init in    13.17 ms.
process_image: image token past: 0
process_image: image token past: 400

minicpmv_init: llama process image in   332.57 ms.
<user>What is in the lower left corner?
<assistant>
There is a calculator in the lower left corner.

llama_print_timings:        load time =    5213.50 ms
llama_print_timings:      sample time =       2.70 ms /    11 runs   (    0.25 ms per token,  4075.58 tokens per second)
llama_print_timings: prompt eval time =     399.80 ms /   413 tokens (    0.97 ms per token,  1033.03 tokens per second)
llama_print_timings:        eval time =     194.29 ms /    10 runs   (   19.43 ms per token,    51.47 tokens per second)
llama_print_timings:       total time =    5414.02 ms /   423 tokens

I also tested the license demo; I used a slightly different prompt because it redacted everything before.
The results on the license demo are better than before, though still nowhere near the superb quality of the python reference.

minicpmv_init: llama process image in   758.64 ms.
<user>Describe all text visible of this demo image, do not redact, answer in JSON format
<assistant>
'''json
{
  "type": "driver-license",
  "image": "https://i.imgur.com/7JY8T9L.jpg",
  "data": {
    "name": "California",
    "issuing_state": "California",
    "license_number": "1234568",
    "expiration_date": "08/31/2014",
    "address": "2500 N. 24TH ST, 2570 24TH ST, SAN FRANCISCO, CA 94103",
    "date_of_birth": "09/08/1977",
    "gender": "F",
    "hair_color": "BRN",
    "eye_color": "EXP",
    "height": "5'8\"",
    "weight": "125 lb",
    "race": "WHITE",
    "signature": "Igna Cardoso",
    "issuing_date": "08/31/2009",
    "issuing_authority": "California Department of Motor Vehicles"
  }
}
'''

llama_print_timings:        load time =    5633.64 ms
llama_print_timings:      sample time =      51.57 ms /   214 runs   (    0.24 ms per token,  4149.86 tokens per second)
llama_print_timings: prompt eval time =     828.18 ms /   914 tokens (    0.91 ms per token,  1103.63 tokens per second)
llama_print_timings:        eval time =    4187.23 ms /   213 runs   (   19.66 ms per token,    50.87 tokens per second)
llama_print_timings:       total time =    9919.41 ms /  1127 tokens

@tc-mb (Author) commented Jun 25, 2024

@cmp-nct
Hi, I'm sorry for the delay over the last two weeks.

Our team has limited manpower, and I was stuck on another urgent project until the end of last week. I will do my best to finish all the changes to the PR this week.

I converted the model using the advice mentioned earlier and re-reviewed the code, finding two issues in the previous PR. Our model supports images of any size and uses a different approach from LLaVA. Even though I was as careful as possible when merging the code into the llava example, a key parameter was not passed down to the lowest level, so the model only used the default parameters. This caused the model to produce poor results, but with some correct outputs, which is why I didn't catch the bug immediately. The version I just submitted fixes this issue.
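
As a miniature illustration of that kind of failure (the names below are made up; this is not the PR's actual code), a defaulted size parameter lets a call compile even when the real value is never forwarded, so the callee silently falls back to 448x448:

#include <cstdio>

struct image_size { int width; int height; };

// callee: the default masks a missing argument instead of failing loudly
static void encode(image_size load_image_size = {448, 448}) {
    std::printf("encoding at %dx%d\n", load_image_size.width, load_image_size.height);
}

// caller: the real size is computed but never passed down
static void process(int img_w, int img_h) {
    image_size real = {img_w, img_h};
    encode();        // bug: compiles fine, always encodes at 448x448
    encode(real);    // fix: forward the size explicitly
}

int main() {
    process(1183, 664);
    return 0;
}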

Additionally, I'm not certain that this version is completely consistent with Python in terms of accuracy. This week, I will continue the work I hadn't completed before and quantitatively confirm the model's performance by running evaluation set metrics, rather than just checking a few cases.

@cmp-nct (Contributor) commented Jun 26, 2024

> I converted the model using the advice mentioned earlier and re-reviewed the code, finding two issues in the previous PR. … The version I just submitted fixes this issue.

I ran your update and there are significant improvements in output quality!
It's still not 100% where it should be based on the python reference.

I ran it on a still image of store goods and it mentioned a mirror and text, both of which are not there.
The driver's license is handled better than by any llava-1.5 model now, though still not as flawless as the reference.
On spatial questions it is now closer to the correct answer (yellow sticky note); somehow it does not see everything.
I tried to dig into that:
The green sticky note is in the lower left corner, and below it is a yellow sticky note.
Above the calculator, there is a pair of eyeglasses

Something is still strangely off, maybe in terms of image preprocessing or patching?
It answers as if the image patches were tokenized in the wrong order.

I'm excited for your evaluation results.

@AINXTGENStudio commented Jul 4, 2024

@cmp-nct @tc-mb My apologies if this has already been discovered, but from my quick research and experimentation yesterday I have been able to successfully use the openbmb/MiniCPM-Llama3-V-2_5-gguf VLM directly in LM Studio, upload an image, and have it describe the image, as well as nearly any other non-vision Llama 3 variant, using xtuner's Llama 3 mmproj file found here: https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/tree/main. I believe this could offer clues to help speed up this merge and even help llama.cpp expand its VLM capabilities, as VLMs are growing very quickly in relevance and usefulness!

Here is an example of it working, though most likely with degraded quality from the llava influence:
[image: LM Studio VLM test]

This example uses the openbmb/MiniCPM-Llama3-V-2_5-gguf model with xtuner's Llama 3 mmproj file. I have also tried it with Lewdiculous/L3-8B-Stheno-v3.2-GGUF-IQ-Imatrix/L3-8B-Stheno-v3.2-IQ3_M-imat.gguf and a few other Llama 3 variants with overall good results.

I can only hope this helps somehow. llama.cpp is without a doubt a leading LLM project, but we really need to figure out universal VLM compatibility soon, because in the upcoming months we will have many new VLM models with full, consistent video vision, not only for on-device use (for example consistent screen-share capabilities) but also for consistent robotics vision, and I think llama.cpp would be a great foundation for all of that.

@@ -36,7 +37,7 @@ struct clip_image_f32_batch {
size_t size;
};

CLIP_API struct clip_ctx * clip_model_load (const char * fname, int verbosity);
CLIP_API struct clip_ctx * clip_model_load (const char * fname, int verbosity, std::pair<int, int> load_image_size = {448, 448});
Collaborator:

This is supposed to be a C header, so C++ headers and types should not be here.
Simply decompose the dimensions into width and height :)
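
As a hedged sketch of what that could look like (the struct and parameter names below are only suggestions, not an agreed API):

/* Sketch only: a C-compatible replacement for std::pair<int, int> in clip.h;
 * the struct and parameter names are suggestions, not the PR's final API. */
struct clip_image_size {
    int width;
    int height;
};

/* C has no default arguments, so the 448x448 default would have to be passed
 * explicitly by callers (or via a small helper). */
CLIP_API struct clip_ctx * clip_model_load(const char * fname, int verbosity,
                                           struct clip_image_size load_image_size);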

Collaborator:

The same applies to llava.h and your minicpmv_wrapper.h, since it contains

#ifdef __cplusplus
extern "C" {
#endif

suggesting that it should be a C header.

@tc-mb (Author) commented Jul 7, 2024

> Something is still strangely off, maybe in terms of image preprocessing or patching? It answers as if the image patches were tokenized in the wrong order. I'm excited for your evaluation results.

We believe we have now found the problem and have updated the C++ code. Maybe you can try this code.

By comparing it with the C++ code, we actually discovered a bug hidden in the Python code. But unfortunately the model has already been trained, so I can only keep the same mistake in place so that MiniCPM-V 2.5 works well. We will address this in future model training and provide the community with better-performing open-source models.

I verified the accuracy of the model on MME and found that GGUF F16 loses a few tens of points, and the quantized version loses a few tens more. Considering that MME's perfect score is 2800, this does not seem unacceptable. When I traced the numerical differences from beginning to end, I found that they exist from the very start: the interpolation function (bicubic_resize) in clip also causes obvious numerical differences. I will try to make changes next week.

@cmp-nct (Contributor) commented Jul 10, 2024

> I verified the accuracy of the model on MME and found that GGUF F16 loses a few tens of points, and the quantized version loses a few tens more. … I will try to make changes next week.

That sounds great, 10 points should not be a big hit in quality.
Will I need to create a fresh GGUF? I'm using the last one.

1) Model issues
I ran your changes and noticed some significant improvements; however, the spatial test with reference_2.png is still strangely wrong.
Can you give reference_2.png a test run?
[image: reference_2]
Ask "what is below the green sticky note" or "what is in the lower left corner".
Your python reference works great, but this PR still exchanges the green and the yellow note in my test.
It is as if the spatial patches were not sorted, or some image sizes are causing an issue?
Maybe there is another issue here, something different on my PC than on yours. Please test that image with those two questions.

2) Merging issues
Please also look at the previous code comments; currently this PR cannot be compiled due to errors in the two cmakelist files.
The C headers (like clip.h) need to be C compatible, otherwise other projects will not compile anymore. So the std::pair would need to be changed to a struct or into two separate inputs.

3) Final step
I think it would be best if you did a clean pull of this PR onto your PC and configured/built it once (e.g. pull/7599/head:pull-request-test).
Your local llama.cpp branch seems to have differences (additional directories, different cmakefiles) from this PR; that's why it compiles fine for you but fails here.

Another test that totally fails on llama.cpp is this wide-aspect-ratio image:
[image: calculation]
<user>describe the image?
<assistant>
The image is a screenshot of a math problem related to fractions. The problem is presented in a step-by-step format, starting with a fraction 3/10 and then showing the process of dividing it by 15 to simplify it to 1/5. The answer to the problem is given as 3/5. Below the problem, there is a question asking which answer is correct, suggesting that the image is likely used for educational purposes to test understanding of fractions. The background is white, and the text is in black, making it easy to read. The overall layout is simple and straightforward, focusing on the mathematical content.

@cmp-nct (Contributor) left a comment

I've summarized the issues for merging.

If possible, please also take a look at the two example images.
There might be a problem with some resolutions; the calculation image especially doesn't work at all.

CLIP_API struct ggml_tensor * clip_get_newline_tensor(const struct clip_ctx * ctx);

CLIP_API bool clip_image_encode (struct clip_ctx * ctx, int n_threads, struct clip_image_f32 * img, float * vec);
CLIP_API bool clip_image_batch_encode(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, float * vec);
CLIP_API bool clip_image_encode (struct clip_ctx * ctx, int n_threads, struct clip_image_f32 * img, float * vec, std::pair<int, int> load_image_size = {448, 448});
Contributor:

Please use a C struct as a replacement for the <int, int> pair.
clip.h should stay C compatible, otherwise we'll probably not get it merged.

@@ -118,6 +117,7 @@ poetry.toml
/tests/test-tokenizer-0
/tests/test-tokenizer-1-bpe
/tests/test-tokenizer-1-spm
/openbmb
Contributor:

should not be required

add_library(minicpmv_wrapper OBJECT
minicpmv_wrapper.cpp
)
target_link_libraries(minicpmv_wrapper PRIVATE llava ${CMAKE_THREAD_LIBS_INIT})
Contributor:

the new target is missing

Contributor:

I would remove that image; the speed shown depends heavily on the software version, the quantization and the hardware used.

@@ -3,6 +3,7 @@

#include <stddef.h>
#include <stdint.h>
#include <utility>
Contributor:

needs to stay a C header

struct uhd_image_embed {
std::vector<std::vector<struct llava_image_embed *>> image_embeds;
};

Contributor:

Should stay C compatible; maybe it can be declared as an opaque type, similar to clip_ctx?
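
A minimal sketch of the opaque-handle pattern being suggested (the init/free function names are hypothetical): the header only forward-declares the type behind C linkage, and the C++ definition with std::vector stays in the .cpp file:

/* header side (C compatible) - sketch only, function names are hypothetical */
struct uhd_image_embed;                       /* opaque, like clip_ctx */
#ifdef __cplusplus
extern "C" {
#endif
struct uhd_image_embed * uhd_image_embed_init(void);
void                     uhd_image_embed_free(struct uhd_image_embed * embeds);
#ifdef __cplusplus
}
#endif

/* implementation side (.cpp, which includes the header, <vector> and llava.h) -
 * the C++ container type stays hidden here */
struct uhd_image_embed {
    std::vector<std::vector<struct llava_image_embed *>> image_embeds;
};
struct uhd_image_embed * uhd_image_embed_init(void) { return new uhd_image_embed(); }
void uhd_image_embed_free(struct uhd_image_embed * embeds) { delete embeds; }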

Labels: examples, python (python script changes), Review Complexity: Medium