
MobileVLM native implementation #4954

Merged: 4 commits into ggerganov:master on Jan 22, 2024
Conversation

XiaotaoChen (Contributor)

MobileVLM

Currently this implementation supports MobileVLM-v1.7 variants.

For more information, please see Meituan-AutoML/MobileVLM.

The implementation is based on llava and is compatible with both llava and MobileVLM. The usage is basically the same as for llava.

Usage

Build with cmake, or build just this example with make llava-cli.

After building, run: ./llava-cli to see the usage. For example:

./llava-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
    --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \
    --image path/to/an/image.jpg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? Answer the question using a single word or phrase. ASSISTANT:"

Model conversion

  1. Clone MobileVLM-1.7B and clip-vit-large-patch14-336 locally:
git clone https://huggingface.co/mtgv/MobileVLM-1.7B

git clone https://huggingface.co/openai/clip-vit-large-patch14-336
  2. Use llava-surgery.py to split the LLaVA model into its LLaMA and multimodal projector constituents:
python ./examples/llava/llava-surgery.py -m path/to/MobileVLM-1.7B
  3. Use convert-image-encoder-to-gguf.py with --projector-type ldp to convert the LLaVA image encoder to GGUF:
python ./examples/llava/convert-image-encoder-to-gguf.py \
    -m path/to/clip-vit-large-patch14-336 \
    --llava-projector path/to/MobileVLM-1.7B/llava.projector \
    --output-dir path/to/MobileVLM-1.7B \
    --projector-type ldp
  4. Use convert.py to convert the LLaMA part to GGUF:
python ./convert.py path/to/MobileVLM-1.7B
  5. Use quantize to convert the LLaMA part's data type from fp16 to q4_k:
./quantize path/to/MobileVLM-1.7B/ggml-model-f16.gguf path/to/MobileVLM-1.7B/ggml-model-q4_k.gguf q4_k_s

Now both the LLaMA part and the image encoder are in the MobileVLM-1.7B directory.

Android compile and run

Compile

Refer to android/build_64.sh:

mkdir android/build_64
cd android/build_64
../build_64.sh

Run on Android

Refer to android/adb_run.sh, and modify the resource names and paths as needed.
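The script essentially pushes the binary, the model files, and an image to the device and then invokes llava-cli there. A minimal, untested sketch of that workflow, assuming the binary lands in android/build_64/bin and reusing the file names from the cases below (adjust everything to match your own build output and adb_run.sh):

adb push android/build_64/bin/llava-cli /data/local/tmp/
adb push MobileVLM-1.7B/ggml-model-q4_k.gguf /data/local/tmp/
adb push MobileVLM-1.7B/mmproj-model-f16.gguf /data/local/tmp/
adb push demo.jpg /data/local/tmp/
adb shell chmod +x /data/local/tmp/llava-cli
# then run llava-cli on the device with a command line like the ones in the cases below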

Some results on Android with a Snapdragon 888 chip

Case 1

Input

/data/local/tmp/llava-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/demo.jpg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:"

Output

encode_image_with_clip: image encoded in 21148.71 ms by CLIP (  146.87 ms per image patch)
 Susan Wise Bauer
llama_print_timings:        load time =   23574.72 ms
llama_print_timings:      sample time =       1.24 ms /     6 runs   (    0.21 ms per token,  4850.44 tokens per second)
llama_print_timings: prompt eval time =   12460.15 ms /   246 tokens (   50.65 ms per token,    19.74 tokens per second)
llama_print_timings:        eval time =     424.86 ms /     6 runs   (   70.81 ms per token,    14.12 tokens per second)
llama_print_timings:       total time =   34731.93 ms

Case 2

Input

/data/local/tmp/llava-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/cat.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:"

Output

encode_image_with_clip: image encoded in 21149.51 ms by CLIP (  146.87 ms per image patch)
 The image depicts a cat sitting in the grass near some tall green plants.
llama_print_timings:        load time =   23257.32 ms
llama_print_timings:      sample time =       5.25 ms /    18 runs   (    0.29 ms per token,  3430.53 tokens per second)
llama_print_timings: prompt eval time =   11900.73 ms /   232 tokens (   51.30 ms per token,    19.49 tokens per second)
llama_print_timings:        eval time =    1279.03 ms /    18 runs   (   71.06 ms per token,    14.07 tokens per second)
llama_print_timings:       total time =   34570.79 ms

TODO

  • Support non-CPU backends for the new operators, such as depthwise convolution, hardswish, and hardsigmoid

  • Optimize LDP projector performance

    - Optimize the structure definition to avoid unnecessary memory rearrangements and reduce the use of `ggml_permute_cpy`;
    - Optimize the operator implementations (ARM CPU / NVIDIA GPU), such as depthwise conv, hardswish, hardsigmoid, etc.
    
  • Run MobileVLM on Jetson Orin

  • Support more model variants, such as MobileVLM-3B.

Contributors

zhangjidong05, yangyang260, huyiming03, chenxiaotao03

ggml.h (outdated review thread, resolved)
@@ -5350,6 +5389,52 @@ GGML_API struct ggml_tensor * ggml_conv_transpose_1d(
    return result;
}

// ggml_conv_depthwise
struct ggml_tensor * ggml_conv_depthwise_2d(
ggerganov (Owner)

Can the ggml_conv_depthwise_2d operator be represented as a combination of existing operators? For example, ggml_conv_2d is expressed via ggml_im2col, ggml_mul_mat and ggml_reshape:

llama.cpp/ggml.c, lines 5404 to 5429 at ddb008d:

// a: [OC,IC, KH, KW]
// b: [N, IC, IH, IW]
// result: [N, OC, OH, OW]
struct ggml_tensor * ggml_conv_2d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int                   s0,
        int                   s1,
        int                   p0,
        int                   p1,
        int                   d0,
        int                   d1) {
    struct ggml_tensor * im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1, d0, d1, true); // [N, OH, OW, IC * KH * KW]

    struct ggml_tensor * result =
        ggml_mul_mat(ctx,
                ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[3] * im2col->ne[2] * im2col->ne[1]), // [N, OH, OW, IC * KH * KW] => [N*OH*OW, IC * KH * KW]
                ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2]), a->ne[3]));                       // [OC,IC, KH, KW] => [OC, IC * KH * KW]

    result = ggml_reshape_4d(ctx, result, im2col->ne[1], im2col->ne[2], a->ne[3], im2col->ne[3]); // [N, OC, OH, OW]

    return result;
}

If yes, it would be a better approach, since it would directly allow GPU offload support.

Collaborator

Reviewing the code implementation in PyTorch, it seems that this operation (ggml_conv_depthwise_2d) is a conv2d.

XiaotaoChen (Contributor, Author)

@ggerganov Thanks for your advice. We removed the native ggml_conv_depthwise_2d implementation and reconstructed it with ggml_im2col, ggml_mul_mat and ggml_reshape.
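For reference, a minimal, untested sketch of that decomposition, following the ggml_conv_2d pattern quoted above. The helper name and the assumed kernel layout [C, 1, KH, KW] are illustrative and may differ from the exact code merged in this PR; type/precision handling of the im2col result is omitted.

// a: [C, 1, KH, KW]  one kernel per channel (depthwise)
// b: [N, C, IH, IW]
// result: [N, C, OH, OW]
struct ggml_tensor * conv_depthwise_2d_via_im2col(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int s0, int s1, int p0, int p1, int d0, int d1) {
    // fold the channel dim of b into the batch dim so im2col sees a single input channel per item
    struct ggml_tensor * im2col = ggml_im2col(ctx, a,
            ggml_reshape_4d(ctx, b, b->ne[0], b->ne[1], 1, b->ne[2] * b->ne[3]),
            s0, s1, p0, p1, d0, d1, true);                                                 // [N*C, OH, OW, KH*KW]

    // one (1 x KH*KW) x (KH*KW x OH*OW) matmul per channel, broadcast over the batch dim
    struct ggml_tensor * result = ggml_mul_mat(ctx,
            ggml_reshape_4d(ctx, a, a->ne[0] * a->ne[1], 1, a->ne[2] * a->ne[3], 1),       // [1, C, 1, KH*KW]
            ggml_reshape_4d(ctx, im2col, im2col->ne[0],
                            im2col->ne[1] * im2col->ne[2], b->ne[2], b->ne[3]));           // [N, C, OH*OW, KH*KW]

    return ggml_reshape_4d(ctx, result, im2col->ne[1], im2col->ne[2], b->ne[2], b->ne[3]); // [N, C, OH, OW]
}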

ggerganov (Owner)

Nice! Does it run on the GPU now or is it still missing some operators?

XiaotaoChen (Contributor, Author)

@ggerganov It can't run on the GPU currently. There are two problems: 1. ggml-cuda doesn't support the pool2d operator; 2. ggml_cuda_can_mul_mat requires ne0 >= 32, ne1 >= 32, ne2 >= 3, but a depthwise conv with a 3x3 kernel has ne[0] = 9. What is your opinion on these two problems?

ggerganov (Owner)

We should extend support for these eventually

XiaotaoChen (Contributor, Author)

OK, we'll try to extend support for these.

@cmp-nct (Contributor)

cmp-nct commented Jan 16, 2024

Really looking forward to the fully offloaded support, great to see.
It looks like the projection is a lot more complex than llava-1.5's; the embeddings are tiny in comparison. It will be very interesting to compare them in terms of quality.

Commit: delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake
android/build_64.sh (outdated review thread, resolved)
@ggerganov (Owner) left a comment:

Fix the editor config checks

@XiaotaoChen (Contributor, Author)

Fix the editor config checks

Thanks, it's done.

ggerganov merged commit 3ce7e8f into ggerganov:master on Jan 22, 2024
41 of 43 checks passed
@cmp-nct (Contributor)

cmp-nct commented Jan 23, 2024

@XiaotaoChen I don't think your example would actually work: you use -p "" as the template, but llava-cli uses -p "" only as the user question and hard-codes the Vicuna template for llava-1.5.
I have not tested it yet, but my PR should solve it: #5093

P.S. From the discussion it sounds like only minimal work is missing for a full GPU offload; is that still planned?

@FSSRepo (Collaborator)

FSSRepo commented Jan 23, 2024

To add GPU support for this new model, it is necessary to create 3 new kernels: ggml_pool_2d, ggml_hardswish and ggml_hardsigmoid. Taking a quick look, they seem to be easy operations to implement in CUDA. I'll see if I can take some time to give it a try.
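For reference, the element-wise math the hardswish and hardsigmoid kernels need to apply is simple (ggml_pool_2d is a windowed average/max reduction and is not sketched here). A minimal sketch in plain C of the per-element expression a CUDA kernel would compute; the _ref function names are illustrative, not existing ggml or CUDA symbols:

#include <math.h>

// hardsigmoid(x) = clamp(x + 3, 0, 6) / 6
static float hardsigmoid_ref(float x) {
    return fminf(fmaxf(x + 3.0f, 0.0f), 6.0f) / 6.0f;
}

// hardswish(x) = x * hardsigmoid(x)
static float hardswish_ref(float x) {
    return x * fminf(fmaxf(x + 3.0f, 0.0f), 6.0f) / 6.0f;
}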

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* MobileVLM native implementation

* delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake

* move android script to example/llava directory

* Fix the editor config checks

---------

Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
(same commit list as above)