
Implement multimodal models (LLaVA) #3436

Merged
merged 36 commits on Oct 12, 2023

Conversation

monatis
Collaborator

@monatis monatis commented Oct 2, 2023

closes #3332

This is still WIP and highly experimental.

The work started in lmm.cpp,
but it turned out to be fine to implement it in this repo as well, which I believe will be much simpler.

The plan is to perform surgery on LLaVA models and export:

  1. a regular llama.gguf file,
  2. a custom CLIP model with the multimodal projector on top of it (see the sketch after the task list below).
  • GGUF support for CLIP and LLaVA model surgery is already done.
  • E2E inference of LLaVA V1.5.
  • Use the GGML allocator API and clean up the code.
  • Better CLI args handling in llava executable.
  • Upload pre-converted models and write a readme.
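
For illustration, here is a minimal sketch of what that surgery could look like (assuming a Hugging Face-style PyTorch checkpoint and the model.mm_projector tensor prefix; the file names are hypothetical and the actual export script in this PR may differ):

  import torch

  ckpt_path = "pytorch_model-00002-of-00002.bin"  # hypothetical shard containing the projector
  checkpoint = torch.load(ckpt_path, map_location="cpu")

  # Collect the multimodal projector tensors so they can be packed into the
  # separate CLIP/projector GGUF file later.
  projector = {k: v for k, v in checkpoint.items() if k.startswith("model.mm_projector")}
  torch.save(projector, "llava.projector")

  # Strip them from the language-model checkpoint so the regular LLaMA
  # conversion only sees LLaMA tensors, then re-save the shard.
  for k in projector:
      del checkpoint[k]
  torch.save(checkpoint, ckpt_path)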

usage:

  • Build with cmake.
  • From this link, download `mmproj-model-f16.gguf` and one of `ggml-model-[f16|q5_k|q4_k].gguf`.
  • Run:
./bin/llava -m ggml-model-q5_k.gguf --mmproj mmproj-model-f16.gguf --image path/to/an/image.jpg

This will output a detailed description of the image.

Note: You can override the default text prompt "Describe the image in detail." by adding -p "custom prompt goes here". Run ./bin/llava to see other options.

Note: A lower temperature value like 0.1 is recommended. Add --temp 0.1 to your command to do so.
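
For example, both notes can be combined into a single invocation (same model files as in the command above; the prompt text is just an example):

./bin/llava -m ggml-model-q5_k.gguf --mmproj mmproj-model-f16.gguf --image path/to/an/image.jpg -p "What objects are visible in this image?" --temp 0.1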

@staviq
Collaborator

staviq commented Oct 2, 2023

Some time ago I was playing with the idea of allowing images to be uploaded via the server web UI. I had a working PoC, but dropped the idea since nobody was working on multimodal functionality back then.

Would it be helpful for testing if I made a PR with this change?

The idea was to import images client-side in the browser, draw them on a hidden canvas, and export them as PPM. This would allow such images to be processed server-side without relying on any external libraries/dependencies.
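
For reference, this is why PPM needs no external libraries: a P6 file is just a tiny ASCII header followed by raw RGB bytes. A minimal reader sketch in Python (illustrative only; the actual wrapper would live on the C++ side, and this assumes the common layout with width and height on one line and an 8-bit maxval):

  # Minimal P6 (binary PPM) reader sketch -- the header is plain text, the pixels are raw RGB.
  def read_ppm_p6(path):
      with open(path, "rb") as f:
          assert f.readline().strip() == b"P6"
          line = f.readline()
          while line.startswith(b"#"):      # skip comment lines
              line = f.readline()
          width, height = map(int, line.split())
          maxval = int(f.readline())        # assumed < 256, i.e. one byte per channel
          pixels = f.read(width * height * 3)
          return width, height, pixels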

I could add image upload to the server UI and a simple image wrapper class/functions on the C++ side.

Let me know if you are interested.

@monatis
Collaborator Author

monatis commented Oct 2, 2023

Thanks @staviq! We can work with images thanks to a single-header C library included in this branch (stb_image.h), but integration with the UI would be great once this PR matures. It seems to require some refactoring of the CLIP inference code, which was copied from another repo of mine, due to the different GGML versions used. Currently I'm trying to debug and fix it -- once done, I can move faster and we can collaborate on integration with the UI.

@staviq
Collaborator

staviq commented Oct 2, 2023

Thanks @staviq! We can work with images thanks to a single-header C library included in this branch (stb_image.h), but integration with the UI would be great once this PR matures. It seems to require some refactoring of the CLIP inference code, which was copied from another repo of mine, due to the different GGML versions used. Currently I'm trying to debug and fix it -- once done, I can move faster and we can collaborate on integration with the UI.

I completely missed that stb is licensed under MIT; that's cool. No format shenanigans necessary then.

OK, take your time then. I'll wait until you feel comfortable with UI integration.

@ggerganov ggerganov added the model Model specific label Oct 3, 2023
@monatis
Collaborator Author

monatis commented Oct 7, 2023

Sorry for the delay here. There was an issue with evaluating embedding input that I needed to debug, and it was too painful to do so on my physical machine, which is slow at generation. I obtained a faster VM in the cloud and hope to move faster this weekend.

@monatis
Collaborator Author

monatis commented Oct 7, 2023

This is now working with the recently published LLaVA V1.5. The CLIP part consumes a huge amount of memory -- I'll optimize it with ggml_allocr and clean up the implementation tomorrow.

@monatis
Collaborator Author

monatis commented Oct 8, 2023

@josephilome this shouldn't be that hard -- I can implement it once the current implementation is optimized.

@monatis
Collaborator Author

monatis commented Oct 9, 2023

There are still some tasks to do but I think this is ready for testing / feedback / reviews.

A pre-converted model can be found here.

You need to download one of the ggml-model-[f16|q5_k|q4_k].gguf models and mmproj-model-f16.gguf (the image encoder). This two-file format is faster to get moving with right now, but we can think about a single-file format in the future. Also see the readme.

I'll add more documentation, do code cleanup and address reviews this afternoon. Any feedback is welcome.

@monatis monatis requested a review from ggerganov October 9, 2023 06:55
@monatis monatis marked this pull request as ready for review October 9, 2023 06:55
@ggerganov
Owner

ggerganov commented Oct 9, 2023

@monatis Awesome stuff!

I haven't had a detailed look or run tests yet, but looking at the progress, it's quite amazing to have something that can understand images. Looking forward to giving this a try!

Just curious, how much of the total compute is done by CLIP? I.e. is it a bottleneck?

@ggerganov ggerganov added the high priority Very important issue label Oct 9, 2023
@TikaToka

@TikaToka Sounds interesting --thinking of a way to enable it.
@Lurrobert Nice demo! Thanks for sharing.
Also, I've just come across herrera-luis/vision-core-ai, which adds whisper to it as well. Love the community inspiring one another to new use cases!

@TikaToka: Are we talking about altering the prompt "Describe the image in detail." or something else? If so, implementing a command-line prompt argument and parameter such as --prompt "please describe any furniture in the picture" should be possible by imitating the argv/argc handling of, say, examples/parallel/parallel.cpp.

Sorry for the late reply. As mentioned in this link: https://replicate.com/blog/how-to-prompt-llama, the default input prompt is separated into a system prompt + an instruction prompt, and I was asking about the system prompt.

@ExtReMLapin
Contributor

Any plan to update the GGUF for LLaVA 1.6?

@Green-Sky
Collaborator

Green-Sky commented Jan 31, 2024

Oh, they released them: https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2

A few days ago I only saw the 1.6 preview in their HF space, but no mention of it anywhere else on the internet :)

Edit: blog post https://llava-vl.github.io/blog/2024-01-30-llava-1-6/

@ExtReMLapin
Contributor

ExtReMLapin commented Feb 1, 2024

Even if you convert the safetensors files into torch .bin files, you will get this error when trying to convert to GGUF:


  File "/opt/LLaVA/llama.cpp/convert.py", line 1474, in <module>
    main()
  File "/opt/LLaVA/llama.cpp/convert.py", line 1460, in main
    model   = convert_model_names(model, params)
  File "/opt/LLaVA/llama.cpp/convert.py", line 1198, in convert_model_names
    raise Exception(f"Unexpected tensor name: {name}")
Exception: Unexpected tensor name: model.image_newline

@gamester2665

gamester2665 commented Feb 1, 2024

Yup, I can confirm that following #2948 doesn't yield a valid llava-v1.6-mistral-7b GGUF... any suggestions?


$ python llama.cpp/convert.py llava-hf \
>   --outfile llava-v1.6-mistral-7b-GGUF.gguf \
>   --outtype f32
Loading model file llava-hf\model-00001-of-00004.safetensors
Loading model file llava-hf\model-00001-of-00004.safetensors
Loading model file llava-hf\model-00002-of-00004.safetensors
Loading model file llava-hf\model-00003-of-00004.safetensors
Loading model file llava-hf\model-00004-of-00004.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=32768, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=1000000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.AllF32: 0>, path_model=WindowsPath('llava-hf'))
Found vocab files: {'tokenizer.model': WindowsPath('llava-hf/tokenizer.model'), 'vocab.json': None, 'tokenizer.json': WindowsPath('llava-hf/tokenizer.json')}
Loading vocab file 'llava-hf\tokenizer.model', type 'spm'
Vocab info: <SentencePieceVocab with 32000 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0, 'pad': 0}, add special tokens {'bos': True, 'eos': False}>
Permuting layer 0
Permuting layer 1
Permuting layer 2
Permuting layer 3
Permuting layer 4
Permuting layer 5
Permuting layer 6
Permuting layer 7
Permuting layer 8
Permuting layer 9
Permuting layer 10
Permuting layer 11
Permuting layer 12
Permuting layer 13
Permuting layer 14
Permuting layer 15
Permuting layer 16
Permuting layer 17
Permuting layer 18
Permuting layer 19
Permuting layer 20
Permuting layer 21
Permuting layer 22
Permuting layer 23
Permuting layer 24
Permuting layer 25
Permuting layer 26
Permuting layer 27
Permuting layer 28
Permuting layer 29
Permuting layer 30
Permuting layer 31
model.embed_tokens.weight                        -> token_embd.weight                        | BF16   | [32000, 4096]
Traceback (most recent call last):
  File "F:\SANDBOX\convert_llava\llama.cpp\convert.py", line 1474, in <module>
    main()
  File "F:\SANDBOX\convert_llava\llama.cpp\convert.py", line 1460, in main
    model   = convert_model_names(model, params)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\SANDBOX\convert_llava\llama.cpp\convert.py", line 1198, in convert_model_names
    raise Exception(f"Unexpected tensor name: {name}")
Exception: Unexpected tensor name: model.image_newline
(llama-new) 

@ExtReMLapin
Contributor

And that's the first one that fails (pretty much the first or second layer lmao)

@chigkim

chigkim commented Feb 1, 2024

Looping in @haotian-liu and @cmp-nct in case they could help with Llava V1.6.

@cjpais
Contributor

cjpais commented Feb 1, 2024

I've got a hacked-up script that works for 1.6; I will share it shortly on a fork.

raw script (breaks llava 1.5 support): llava1.6-surgery-hack.py

  • loads safetensors
  • removes "model.image_newline" for convert.py, I don't know the impact of this
  • splits mm_projector into new file
  • saves the updated safetensors which have been modified

Note: the location of the mmproj tensors differs between the 34b and 7b models; it's probably best to search for all of the mmproj tensors, split them all out, save them, and re-save each checkpoint without them (a rough sketch of that approach follows).
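
A rough sketch of that approach (assuming the safetensors package and the tensor names mentioned above; the shard name is hypothetical and the actual llava1.6-surgery-hack.py may differ):

  # Illustrative only: strip "model.image_newline" and split out the mm_projector
  # tensors from a LLaVA 1.6 safetensors shard before running convert.py.
  from safetensors import safe_open
  from safetensors.torch import save_file

  shard = "model-00003-of-00004.safetensors"  # hypothetical shard holding the projector
  kept, projector = {}, {}

  with safe_open(shard, framework="pt") as f:
      for name in f.keys():
          tensor = f.get_tensor(name)
          if name.startswith("model.mm_projector"):
              projector[name] = tensor      # goes into the separate mmproj file
          elif name == "model.image_newline":
              continue                      # dropped so convert.py no longer errors on it
          else:
              kept[name] = tensor

  save_file(projector, "llava.projector.safetensors")
  save_file(kept, shard)                    # re-save the shard without the extra tensors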

@cmp-nct
Contributor

cmp-nct commented Feb 1, 2024

I'm also halfway there, but occupied with real-world stuff.
The main task for 1.6 is to implement the new 'unpad' mechanism.

I've created a draft PR to use as a base for 1.6: #5267
It uses a clean surgery script which should work with all variants of llava; it also supports searching for tensors (though it currently searches only for the ViT, not for the projector).
The projector gguf file is also prepared for the new features (spatial_unpad); the new tensor is moved in there.

Right now I am struggling with the new ViT:
size mismatch for vision_model.encoder.layers.1.mlp.fc1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([13824]).
That's ffn_down and ffn_up.

Even without using the correct ViT I could already test llava-1.6, and despite not including the proper image manipulation and resolution, it is already very good.

@cjpais
Contributor

cjpais commented Feb 2, 2024

Not sure if it's okay to share here...
For those who are looking, here are initial GGUF quants for llava 1.6:

Please note they are very early, built from the hacked surgery script. Improvements are coming in #5267 from @cmp-nct; I will try to contribute where I can, but I am nothing close to an expert.

7b mistral
34b

@gamester2665

Awesome! Thanks @cjpais. Throwing it into LM Studio for testing now.

@BBC-Esq

BBC-Esq commented Feb 2, 2024

Did it work in LM Studio?

@gamester2665

@BBC-Esq Yes! cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf is working successfully in LM Studio.

@BBC-Esq

BBC-Esq commented Feb 2, 2024

You guys move fast. I'm considering moving my stuff from ctranslate2 to llama.cpp. Are there any good issues/discussions to see whether you move that fast with whisper.cpp?

@ExtReMLapin
Contributor

  • removes "model.image_newline" for convert.py, I don't know the impact of this

bruh moment

@aymenabid-lab

I'm use the llava

How do I modify the batch size to avoid this error?

  • from python within terminal:
    python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path /home/dl_g15/llava-v1.5-13b
    =>
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.00 MiB. GPU 0 has a total capacty of 7.75 GiB of which 8.06 MiB is free. Including non-PyTorch memory, this process has 7.73 GiB memory in use. Of the allocated memory 7.60 GiB is allocated by PyTorch, and 7.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    From Anaconda:
    model_path = "/home/dl_g15/llava-v1.5-13b"

    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path)
    )
=>
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@cebtenzzre
Collaborator

I'm use the llava

You're almost certainly looking for https://github.com/haotian-liu/LLaVA. This is the llama.cpp repo.

Labels
high priority (Very important issue), llava (LLaVa and multimodal), model (Model specific), need feedback (Testing and feedback with results are needed)
Projects
None yet
Development

Successfully merging this pull request may close these issues:

llama : add multimodal support (LLaVA)