
MobileVLM native implementation #4954

Merged: 4 commits into ggerganov:master on Jan 22, 2024
Conversation

XiaotaoChen (Contributor)

MobileVLM

Currently this implementation supports MobileVLM-v1.7 variants.

For more information, please see Meituan-AutoML/MobileVLM.

The implementation is based on llava and is compatible with both llava and MobileVLM. The usage is basically the same as for llava.

Usage

Build with cmake, or build just this example with make llava-cli.

After building, run: ./llava-cli to see the usage. For example:

./llava-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
    --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \
    --image path/to/an/image.jpg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? Answer the question using a single word or phrase. ASSISTANT:"

Model conversion

  1. Clone MobileVLM-1.7B and clip-vit-large-patch14-336 locally:
git clone https://huggingface.co/mtgv/MobileVLM-1.7B

git clone https://huggingface.co/openai/clip-vit-large-patch14-336
  2. Use llava-surgery.py to split the LLaVA model into its LLaMA and multimodal projector constituents:
python ./examples/llava/llava-surgery.py -m path/to/MobileVLM-1.7B
  3. Use convert-image-encoder-to-gguf.py with --projector-type ldp to convert the LLaVA image encoder to GGUF:
python ./examples/llava/convert-image-encoder-to-gguf.py \
    -m path/to/clip-vit-large-patch14-336 \
    --llava-projector path/to/MobileVLM-1.7B/llava.projector \
    --output-dir path/to/MobileVLM-1.7B \
    --projector-type ldp
  4. Use convert.py to convert the LLaMA part to GGUF:
python ./convert.py path/to/MobileVLM-1.7B
  5. Use quantize to convert the LLaMA part's data type from fp16 to q4_k:
./quantize path/to/MobileVLM-1.7B/ggml-model-f16.gguf path/to/MobileVLM-1.7B/ggml-model-q4_k.gguf q4_k_s

Now both the LLaMA part and the image encoder are in the MobileVLM-1.7B directory.

Android compile and run

Compile

Refer to android/build_64.sh:

mkdir android/build_64
cd android/build_64
../build_64.sh

Run on Android

Refer to android/adb_run.sh, and modify the resource names and paths as needed.
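The script essentially pushes the binary, the model files, and an image to the device and then invokes llava-cli there. A minimal, untested sketch of that workflow, assuming the binary lands in android/build_64/bin and reusing the file names from the cases below (adjust everything to match your own build output and adb_run.sh):

adb push android/build_64/bin/llava-cli /data/local/tmp/
adb push MobileVLM-1.7B/ggml-model-q4_k.gguf /data/local/tmp/
adb push MobileVLM-1.7B/mmproj-model-f16.gguf /data/local/tmp/
adb push demo.jpg /data/local/tmp/
adb shell chmod +x /data/local/tmp/llava-cli
# then run llava-cli on the device with a command line like the ones in the cases below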

Some results on Android with a Snapdragon 888 chip

Case 1

Input

/data/local/tmp/llava-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/demo.jpg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:"

Output

encode_image_with_clip: image encoded in 21148.71 ms by CLIP (  146.87 ms per image patch)
 Susan Wise Bauer
llama_print_timings:        load time =   23574.72 ms
llama_print_timings:      sample time =       1.24 ms /     6 runs   (    0.21 ms per token,  4850.44 tokens per second)
llama_print_timings: prompt eval time =   12460.15 ms /   246 tokens (   50.65 ms per token,    19.74 tokens per second)
llama_print_timings:        eval time =     424.86 ms /     6 runs   (   70.81 ms per token,    14.12 tokens per second)
llama_print_timings:       total time =   34731.93 ms

Case 2

Input

/data/local/tmp/llava-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/cat.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:"

Output

encode_image_with_clip: image encoded in 21149.51 ms by CLIP (  146.87 ms per image patch)
 The image depicts a cat sitting in the grass near some tall green plants.
llama_print_timings:        load time =   23257.32 ms
llama_print_timings:      sample time =       5.25 ms /    18 runs   (    0.29 ms per token,  3430.53 tokens per second)
llama_print_timings: prompt eval time =   11900.73 ms /   232 tokens (   51.30 ms per token,    19.49 tokens per second)
llama_print_timings:        eval time =    1279.03 ms /    18 runs   (   71.06 ms per token,    14.07 tokens per second)
llama_print_timings:       total time =   34570.79 ms

TODO

  • Support non-CPU backends for the new operators, such as depthwise convolution, hardswish, and hardsigmoid

  • Optimize LDP projector performance

    - Optimize the structure definition to avoid unnecessary memory rearrangements and reduce the use of `ggml_permute_cpy`;
    - Optimize the operator implementations (ARM CPU / NVIDIA GPU), such as depthwise conv, hardswish, hardsigmoid, etc.
    
  • Run MobileVLM on Jetson Orin

  • Support more model variants, such as MobileVLM-3B.

Contributors

zhangjidong05, yangyang260, huyiming03, chenxiaotao03

ggml.h (outdated review thread, resolved)
@@ -5350,6 +5389,52 @@ GGML_API struct ggml_tensor * ggml_conv_transpose_1d(
    return result;
}

// ggml_conv_depthwise
struct ggml_tensor * ggml_conv_depthwise_2d(
ggerganov (Owner)

Can the ggml_conv_depthwise_2d operator be represented as a combination of existing operators? For example, ggml_conv_2d is expressed via ggml_im2col, ggml_mul_mat and ggml_reshape:

llama.cpp/ggml.c, lines 5404 to 5429 at ddb008d:

// a: [OC,IC, KH, KW]
// b: [N, IC, IH, IW]
// result: [N, OC, OH, OW]
struct ggml_tensor * ggml_conv_2d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int                   s0,
        int                   s1,
        int                   p0,
        int                   p1,
        int                   d0,
        int                   d1) {
    struct ggml_tensor * im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1, d0, d1, true); // [N, OH, OW, IC * KH * KW]

    struct ggml_tensor * result =
        ggml_mul_mat(ctx,
                ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[3] * im2col->ne[2] * im2col->ne[1]), // [N, OH, OW, IC * KH * KW] => [N*OH*OW, IC * KH * KW]
                ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2]), a->ne[3]));                       // [OC,IC, KH, KW] => [OC, IC * KH * KW]

    result = ggml_reshape_4d(ctx, result, im2col->ne[1], im2col->ne[2], a->ne[3], im2col->ne[3]); // [N, OC, OH, OW]

    return result;
}

If yes, it would be a better approach, since it would directly allow GPU offload support.

Collaborator

Reviewing the code implementation in PyTorch, it seems that this operation (ggml_conv_depthwise_2d) is a conv2d.

XiaotaoChen (Contributor, Author)

@ggerganov Thanks for your advice. We removed the native ggml_conv_depthwise_2d implementation and reconstructed it with ggml_im2col, ggml_mul_mat and ggml_reshape.
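For reference, a minimal, untested sketch of that decomposition, following the ggml_conv_2d pattern quoted above. The helper name and the assumed kernel layout [C, 1, KH, KW] are illustrative and may differ from the exact code merged in this PR; type/precision handling of the im2col result is omitted.

// a: [C, 1, KH, KW]  one kernel per channel (depthwise)
// b: [N, C, IH, IW]
// result: [N, C, OH, OW]
struct ggml_tensor * conv_depthwise_2d_via_im2col(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int s0, int s1, int p0, int p1, int d0, int d1) {
    // fold the channel dim of b into the batch dim so im2col sees a single input channel per item
    struct ggml_tensor * im2col = ggml_im2col(ctx, a,
            ggml_reshape_4d(ctx, b, b->ne[0], b->ne[1], 1, b->ne[2] * b->ne[3]),
            s0, s1, p0, p1, d0, d1, true);                                                 // [N*C, OH, OW, KH*KW]

    // one (1 x KH*KW) x (KH*KW x OH*OW) matmul per channel, broadcast over the batch dim
    struct ggml_tensor * result = ggml_mul_mat(ctx,
            ggml_reshape_4d(ctx, a, a->ne[0] * a->ne[1], 1, a->ne[2] * a->ne[3], 1),       // [1, C, 1, KH*KW]
            ggml_reshape_4d(ctx, im2col, im2col->ne[0],
                            im2col->ne[1] * im2col->ne[2], b->ne[2], b->ne[3]));           // [N, C, OH*OW, KH*KW]

    return ggml_reshape_4d(ctx, result, im2col->ne[1], im2col->ne[2], b->ne[2], b->ne[3]); // [N, C, OH, OW]
}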

ggerganov (Owner)

Nice! Does it run on the GPU now or is it still missing some operators?

XiaotaoChen (Contributor, Author)

@ggerganov It can't run on the GPU currently. There are two problems: 1. ggml-cuda doesn't support the pool2d operator; 2. ggml_cuda_can_mul_mat requires ne0 >= 32, ne1 >= 32, ne2 >= 3, but a depthwise conv with a 3x3 kernel has ne[0] = 9. What is your opinion on these two problems?

ggerganov (Owner)

We should extend support for these eventually

XiaotaoChen (Contributor, Author)

OK, we'll try to extend support for these.

@cmp-nct (Contributor)

cmp-nct commented Jan 16, 2024

Really looking forward to the fully offloaded support, great to see.
It looks like the projection is a lot more complex than llava-1.5's; the embeddings are tiny in comparison. It will be very interesting to compare them in terms of quality.

Commit: delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake
android/build_64.sh (outdated review thread, resolved)
@ggerganov (Owner) left a comment:

Fix the editor config checks

@XiaotaoChen (Contributor, Author)

Fix the editor config checks

Thanks, it's done.

ggerganov merged commit 3ce7e8f into ggerganov:master on Jan 22, 2024
41 of 43 checks passed
@cmp-nct (Contributor)

cmp-nct commented Jan 23, 2024

@XiaotaoChen I don't think your example would actually work: you use -p "" as the template, but llava-cli uses -p "" only as the user question and hard-codes the Vicuna template for llava-1.5.
I have not tested it yet, but my PR should solve it: #5093

P.S. From the discussion it sounds like only minimal work is missing for a full GPU offload; is that still planned?

@FSSRepo (Collaborator)

FSSRepo commented Jan 23, 2024

To add GPU support for this new model, it is necessary to create 3 new kernels: ggml_pool_2d, ggml_hardswish and ggml_hardsigmoid. Taking a quick look, they seem to be easy operations to implement in CUDA. I'll see if I can take some time to give it a try.
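For reference, the element-wise math the hardswish and hardsigmoid kernels need to apply is simple (ggml_pool_2d is a windowed average/max reduction and is not sketched here). A minimal sketch in plain C of the per-element expression a CUDA kernel would compute; the _ref function names are illustrative, not existing ggml or CUDA symbols:

#include <math.h>

// hardsigmoid(x) = clamp(x + 3, 0, 6) / 6
static float hardsigmoid_ref(float x) {
    return fminf(fmaxf(x + 3.0f, 0.0f), 6.0f) / 6.0f;
}

// hardswish(x) = x * hardsigmoid(x)
static float hardswish_ref(float x) {
    return x * fminf(fmaxf(x + 3.0f, 0.0f), 6.0f) / 6.0f;
}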

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* MobileVLM native implementation

* delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake

* move android script to example/llava directory

* Fix the editor config checks

---------

Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
(same commit list as above)