MobileVLM native implementation #4954
Conversation
@@ -5350,6 +5389,52 @@ GGML_API struct ggml_tensor * ggml_conv_transpose_1d(
    return result;
}

// ggml_conv_depthwise
struct ggml_tensor * ggml_conv_depthwise_2d(
Can the ggml_conv_depthwise_2d operator be represented as a combination of existing operators? For example, ggml_conv_2d is expressed via ggml_im2col, ggml_mul_mat and ggml_reshape:
Lines 5404 to 5429 in ddb008d
// a: [OC, IC, KH, KW]
// b: [N, IC, IH, IW]
// result: [N, OC, OH, OW]
struct ggml_tensor * ggml_conv_2d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int                   s0,
        int                   s1,
        int                   p0,
        int                   p1,
        int                   d0,
        int                   d1) {
    struct ggml_tensor * im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1, d0, d1, true); // [N, OH, OW, IC * KH * KW]

    struct ggml_tensor * result =
        ggml_mul_mat(ctx,
                ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[3] * im2col->ne[2] * im2col->ne[1]), // [N, OH, OW, IC * KH * KW] => [N*OH*OW, IC * KH * KW]
                ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2]), a->ne[3]));                       // [OC, IC, KH, KW] => [OC, IC * KH * KW]

    result = ggml_reshape_4d(ctx, result, im2col->ne[1], im2col->ne[2], a->ne[3], im2col->ne[3]); // [N, OC, OH, OW]

    return result;
}
If yes, that would be the better approach, since it would directly allow GPU offload support.
Reviewing the code implementation in PyTorch, it seems that this operation (ggml_conv_depthwise_2d) is a conv2d.
@ggerganov Thanks for your advice. We removed the ggml_conv_depthwise_2d native implementation and reconstructed it with ggml_im2col, ggml_mul_mat and ggml_reshape.
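For reference, below is a minimal sketch of what such a composition could look like, reusing the ggml_im2col signature and the shape-comment convention from the ggml_conv_2d snippet above. The idea is to fold the channel dimension into the batch dimension so that each channel is convolved with its own single-channel kernel; the function name is illustrative and the exact reshapes in the code merged by this PR may differ:

```c
// a (depthwise kernel): [C, 1, KH, KW], b (input): [N, C, IH, IW]
// result: [N, C, OH, OW]
struct ggml_tensor * conv_depthwise_2d_via_im2col( // hypothetical name, not the ggml API
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int s0, int s1, int p0, int p1, int d0, int d1) {
    // fold the channel dimension into the batch dimension: [N, C, IH, IW] => [N*C, 1, IH, IW]
    struct ggml_tensor * b_flat = ggml_reshape_4d(ctx, b, b->ne[0], b->ne[1], 1, b->ne[2] * b->ne[3]);
    struct ggml_tensor * a_flat = ggml_reshape_4d(ctx, a, a->ne[0], a->ne[1], 1, a->ne[2] * a->ne[3]); // [C, 1, KH, KW]

    // per-channel im2col: [N*C, OH, OW, KH * KW]
    struct ggml_tensor * im2col = ggml_im2col(ctx, a_flat, b_flat, s0, s1, p0, p1, d0, d1, true);

    // group the patches back by channel: [N, C, OH*OW, KH * KW]
    struct ggml_tensor * patches = ggml_reshape_4d(ctx, im2col, im2col->ne[0], im2col->ne[2] * im2col->ne[1], b->ne[2], b->ne[3]);

    // one 1 x (KH*KW) row of weights per channel: [1, C, 1, KH * KW]
    struct ggml_tensor * weights = ggml_reshape_4d(ctx, a_flat, a_flat->ne[0] * a_flat->ne[1], 1, a->ne[2] * a->ne[3], 1);

    // batched mat-mul over the channel dimension (weights broadcast over the batch), then restore [N, C, OH, OW]
    struct ggml_tensor * result = ggml_mul_mat(ctx, weights, patches);
    result = ggml_reshape_4d(ctx, result, im2col->ne[1], im2col->ne[2], b->ne[2], b->ne[3]);

    return result;
}
```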
Nice! Does it run on the GPU now or is it still missing some operators?
@ggerganov It can't run on the GPU currently. There are two problems: 1. ggml-cuda doesn't support the pool2d operator; 2. ggml_cuda_can_mul_mat requires ne0 >= 32, ne1 >= 32, ne2 >= 3, but a depthwise conv with a 3x3 kernel has ne[0] = 9. What is your opinion on these two problems?
We should extend support for these eventually
OK, we'll try to extend support for these.
Really looking forward to the full offload support, great to see.
Fix the editor config checks
Thanks, it's done.
@XiaotaoChen I don't think your example would actually work: you use -p "" as the template, but llava-cli uses -p "" only as the user question and hard-codes the Vicuna template for llava-1.5. P.S. From the discussion it sounds like only minimal work is missing for full GPU offload; is that still planned?
To add GPU support to this new model, it is necessary to create 3 new kernels for the new operators (depthwise conv, hardswish and hardsigmoid).
* MobileVLM native implementation
* delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake
* move android script to example/llava directory
* Fix the editor config checks

Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
MobileVLM
Currently this implementation supports MobileVLM-v1.7 variants.
For more information, please go to Meituan-AutoML/MobileVLM.
The implementation is based on llava, and is compatible with both llava and MobileVLM. The usage is basically the same as llava.
Usage
Build with cmake or run make llava-cli to build it.
After building, run ./llava-cli to see the usage. For example:
./llava-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
    --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \
    --image path/to/an/image.jpg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? Answer the question using a single word or phrase. ASSISTANT:"
Model conversion
1. Clone mobileVLM-1.7B and clip-vit-large-patch14-336 locally.
2. Use llava-surgery.py to split the LLaVA model into its LLaMA and multimodal projector constituents.
3. Use convert-image-encoder-to-gguf.py with --projector-type ldp to convert the LLaVA image encoder to GGUF.
4. Use convert.py to convert the LLaMA part of LLaVA to GGUF.
5. Use quantize to convert the LLaMA part's data type from fp16 to q4_k.
Now both the LLaMA part and the image encoder are in the MobileVLM-1.7B directory.
Android compile and run
compile
Refer to android/build_64.sh:
mkdir android/build_64
cd android/build_64
../build_64.sh
run on Android
Refer to android/adb_run.sh; modify the resources' name and path.
Some results on Android with a Snapdragon 888 chip
case 1
input
/data/local/tmp/llava-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/demo.jpg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:"
output
case 2
input
/data/local/tmp/llava-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/cat.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:"
output
TODO
- Support non-CPU backend for the new operators, such as depthwise, hardswish, hardsigmoid (scalar reference definitions for the activations are sketched after this list)
- Optimize LDP projector performance
- Run MobileVLM on Jetson Orin
- Support more model variants, such as MobileVLM-3B.
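For the first TODO item, these are the standard elementwise definitions of the two activations (the same definitions PyTorch uses); the helper names below are illustrative and not part of the ggml API, but they can serve as a scalar reference when porting the operators to another backend:

```c
#include <math.h>

// hardsigmoid(x) = clamp((x + 3) / 6, 0, 1)
static inline float hardsigmoid_ref(float x) { // hypothetical helper, not a ggml function
    return fminf(1.0f, fmaxf(0.0f, (x + 3.0f) / 6.0f));
}

// hardswish(x) = x * hardsigmoid(x)
static inline float hardswish_ref(float x) {   // hypothetical helper, not a ggml function
    return x * fminf(1.0f, fmaxf(0.0f, (x + 3.0f) / 6.0f));
}
```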
contributor