Benchmark results #1

Open · llhe opened this issue Jun 28, 2018 · 43 comments

@llhe

llhe commented Jun 28, 2018

Benchmark results of a previous version are available below (times in ms):

model_name                     device_name   soc        abi           runtime   init       warmup     run_avg    tuned
mobilenet_v2                   polaris       sdm845     armeabi-v7a   GPU       42.868     11.087     9.908      True
mobilenet_v2                   MI MAX        msm8952    armeabi-v7a   GPU       122.791    43.038     39.875     True
mobilenet_v2                   BKL-AL00      kirin970   armeabi-v7a   GPU       767.932    1226.373   47.597     True
mobilenet_v2                   polaris       sdm845     arm64-v8a     GPU       42.3       10.737     10.004     True
mobilenet_v2                   MI MAX        msm8952    arm64-v8a     GPU       129.123    42.584     39.552     True
mobilenet_v2                   BKL-AL00      kirin970   arm64-v8a     GPU       753.43     1170.291   48.016     True
mobilenet_v2                   polaris       sdm845     armeabi-v7a   CPU       16.035     69.761     41.627     False
mobilenet_v2                   MI MAX        msm8952    armeabi-v7a   CPU       31.319     86.206     67.586     False
mobilenet_v2                   BKL-AL00      kirin970   armeabi-v7a   CPU       22.521     137.963    132.012    False
mobilenet_v2                   polaris       sdm845     arm64-v8a     CPU       10.641     80.509     31.985     False
mobilenet_v2                   MI MAX        msm8952    arm64-v8a     CPU       32.225     86.345     54.7       False
mobilenet_v2                   BKL-AL00      kirin970   arm64-v8a     CPU       20.208     97.295     93.987     False
deeplab_v3_plus_mobilenet_v2   polaris       sdm845     armeabi-v7a   GPU       56.512     129.422    128.976    True
deeplab_v3_plus_mobilenet_v2   MI MAX        msm8952    armeabi-v7a   GPU       145.582    899.824    896.452    True
deeplab_v3_plus_mobilenet_v2   BKL-AL00      kirin970   armeabi-v7a   GPU       771.122    2096.33    651.999    True
deeplab_v3_plus_mobilenet_v2   polaris       sdm845     armeabi-v7a   CPU       34.084     951.812    932.764    False
deeplab_v3_plus_mobilenet_v2   MI MAX        msm8952    armeabi-v7a   CPU       91.383     1543.423   1628.255   False
deeplab_v3_plus_mobilenet_v2   BKL-AL00      kirin970   armeabi-v7a   CPU       67.022     2885.098   2872.558   False
deeplab_v3_plus_mobilenet_v2   polaris       sdm845     arm64-v8a     CPU       29.376     656.16     614.679    False
deeplab_v3_plus_mobilenet_v2   MI MAX        msm8952    arm64-v8a     CPU       99.986     1170.636   1469.199   False
deeplab_v3_plus_mobilenet_v2   BKL-AL00      kirin970   arm64-v8a     CPU       55.476     1796.491   1793.253   False
mobilenet_v1                   polaris       sdm845     armeabi-v7a   GPU       45.551     13.858     13.544     True
mobilenet_v1                   MI MAX        msm8952    armeabi-v7a   GPU       114.037    65.088     61.603     True
mobilenet_v1                   BKL-AL00      kirin970   armeabi-v7a   GPU       734.51     1211.078   49.318     True
mobilenet_v1                   polaris       sdm845     arm64-v8a     GPU       45.378     13.689     12.826     True
mobilenet_v1                   MI MAX        msm8952    arm64-v8a     GPU       110.526    64.566     61.696     True
mobilenet_v1                   BKL-AL00      kirin970   arm64-v8a     GPU       730.271    1135.675   48.124     True
mobilenet_v1                   polaris       sdm845     armeabi-v7a   CPU       6.874      79.032     49.676     False
mobilenet_v1                   MI MAX        msm8952    armeabi-v7a   CPU       18.332     121.923    88.207     False
mobilenet_v1                   BKL-AL00      kirin970   armeabi-v7a   CPU       13.0       172.239    164.469    False
mobilenet_v1                   polaris       sdm845     arm64-v8a     CPU       11.347     90.748     32.888     False
mobilenet_v1                   MI MAX        msm8952    arm64-v8a     CPU       18.358     113.023    71.16      False
mobilenet_v1                   BKL-AL00      kirin970   arm64-v8a     CPU       11.666     111.706    107.818    False
resnet_v2_50                   polaris       sdm845     armeabi-v7a   GPU       124.229    95.537     93.047     True
resnet_v2_50                   MI MAX        msm8952    armeabi-v7a   GPU       280.575    637.789    636.295    True
resnet_v2_50                   BKL-AL00      kirin970   armeabi-v7a   GPU       747.875    1596.039   450.651    True
resnet_v2_50                   polaris       sdm845     armeabi-v7a   CPU       18.57      556.961    394.792    False
resnet_v2_50                   MI MAX        msm8952    armeabi-v7a   CPU       44.175     1240.632   734.156    False
resnet_v2_50                   BKL-AL00      kirin970   armeabi-v7a   CPU       26.034     2505.979   1284.285   False
resnet_v2_50                   polaris       sdm845     arm64-v8a     CPU       17.241     438.925    261.949    False
resnet_v2_50                   MI MAX        msm8952    arm64-v8a     CPU       48.691     1143.032   566.313    False
resnet_v2_50                   BKL-AL00      kirin970   arm64-v8a     CPU       23.979     2169.373   499.587    False
vgg16                          polaris       sdm845     armeabi-v7a   CPU       15.537     924.855    438.6      False
vgg16                          MI MAX        msm8952    armeabi-v7a   CPU       40.055     2926.202   800.783    False
vgg16                          BKL-AL00      kirin970   armeabi-v7a   CPU       21.732     2514.862   1242.532   False
vgg16                          polaris       sdm845     arm64-v8a     CPU       12.837     786.419    332.642    False
vgg16                          MI MAX        msm8952    arm64-v8a     CPU       40.693     2794.225   666.285    False
vgg16                          BKL-AL00      kirin970   arm64-v8a     CPU       20.855     2581.558   1043.35    False
vgg16                          polaris       sdm845     armeabi-v7a   GPU       679.21     128.214    125.523    True
vgg16                          MI MAX        msm8952    armeabi-v7a   GPU       1527.823   806.779    761.073    True
vgg16                          BKL-AL00      kirin970   armeabi-v7a   GPU       1893.529   2551.389   1042.256   True
inception_v3_dsp               polaris       sdm845     armeabi-v7a   HEXAGON   585.899    77.921     38.875     False
inception_v3                   polaris       sdm845     armeabi-v7a   CPU       19.726     631.444    481.732    False
inception_v3                   MI MAX        msm8952    armeabi-v7a   CPU       47.674     958.758    839.108    False
inception_v3                   BKL-AL00      kirin970   armeabi-v7a   CPU       29.131     760.945    1194.063   False
inception_v3                   polaris       sdm845     arm64-v8a     CPU       22.251     578.611    425.145    False
inception_v3                   MI MAX        msm8952    arm64-v8a     CPU       50.948     888.531    761.826    False
inception_v3                   BKL-AL00      kirin970   arm64-v8a     CPU       27.106     668.552    789.08     False
inception_v3                   polaris       sdm845     armeabi-v7a   GPU       101.199    92.578     91.602     True
inception_v3                   MI MAX        msm8952    armeabi-v7a   GPU       257.311    588.829    586.779    True
inception_v3                   BKL-AL00      kirin970   armeabi-v7a   GPU       770.779    1621.834   436.877    True
squeezenet_v1_1                polaris       sdm845     armeabi-v7a   GPU       33.615     10.905     10.971     True
squeezenet_v1_1                MI MAX        msm8952    armeabi-v7a   GPU       83.183     47.273     44.548     True
squeezenet_v1_1                BKL-AL00      kirin970   armeabi-v7a   GPU       268.714    437.084    39.404     True
squeezenet_v1_0                polaris       sdm845     armeabi-v7a   GPU       45.145     16.719     15.0       True
squeezenet_v1_0                MI MAX        msm8952    armeabi-v7a   GPU       98.571     76.282     72.081     True
squeezenet_v1_0                BKL-AL00      kirin970   armeabi-v7a   GPU       403.515    1165.101   63.392     True
squeezenet_v1_0                polaris       sdm845     armeabi-v7a   CPU       7.393      94.284     60.057     False
squeezenet_v1_0                MI MAX        msm8952    armeabi-v7a   CPU       27.664     171.195    110.325    False
squeezenet_v1_0                BKL-AL00      kirin970   armeabi-v7a   CPU       14.84      169.715    93.174     False
squeezenet_v1_0                polaris       sdm845     arm64-v8a     CPU       11.9       117.696    49.342     False
squeezenet_v1_0                MI MAX        msm8952    arm64-v8a     CPU       27.554     170.987    95.552     False
squeezenet_v1_0                BKL-AL00      kirin970   arm64-v8a     CPU       13.76      121.544    79.353     False
squeezenet_v1_1                polaris       sdm845     arm64-v8a     CPU       9.583      61.783     25.376     False
squeezenet_v1_1                MI MAX        msm8952    arm64-v8a     CPU       21.424     98.661     53.031     False
squeezenet_v1_1                BKL-AL00      kirin970   arm64-v8a     CPU       11.005     67.381     41.086     False

More recent results will be available on the GitLab mirror project's CI page soon.

A dedicated mobile-device deep learning framework benchmark project, MobileAIBench, is available here: https://github.com/XiaoMi/mobile-ai-bench

@songruoningbupt

👍

@llhe

llhe commented Jun 29, 2018

The daily benchmark results are available here:

@DiamonJoy

I really appreciate the results, but I am curious why DSP results are only available for inception_v3?

@llhe

llhe commented Jun 29, 2018

@DiamonJoy The benchmark is actually the CI result of the MACE Model Zoo project.
Until now, our efforts have mainly focused on the float data type and the CPU/GPU runtimes, and we have not had enough time to add more quantized models to MACE Model Zoo. Quantization (CPU and DSP) support and adding more models to MACE Model Zoo are on our roadmap.

@robertwgh

@llhe Amazing results!
Can you explain a little more about the "tuned" column?

@llhe

llhe commented Jun 29, 2018

Tuned means the OpenCL kernel is tuned for the specific device type instead of using the general rule.

@robertwgh

Is this tuning process done manually offline, or is it done automatically at run time?
If I understand correctly, is it mainly work-group size tuning?

@llhe

llhe commented Jun 29, 2018

@robertwgh
In our original use case, we deploy each model against a specific device (usually a new product), so we want it to be ultimately optimized by a brute-force search over a list of work-group options. However, general application developers usually want to generate a library that applies to all devices.

It's offline now. We may consider improving the general rule or enabling online incremental tuning in the future.
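
For illustration, here is a minimal sketch of what such offline brute-force work-group tuning looks like conceptually. It is not MACE's actual tuner; `WorkGroupSize` and `run_kernel` are placeholder names, and the runner is assumed to enqueue the OpenCL kernel with the given local size and wait for completion:

```cpp
// Minimal sketch of offline brute-force work-group tuning (illustrative, not
// MACE's actual tuner): time each candidate local work-group size and keep
// the fastest one for this device/kernel pair.
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

struct WorkGroupSize { size_t x, y, z; };  // hypothetical placeholder type

// `run_kernel` is assumed to enqueue the OpenCL kernel with the given local
// size and block until it has finished (e.g. by calling clFinish).
WorkGroupSize TuneWorkGroupSize(
    const std::vector<WorkGroupSize>& candidates,
    const std::function<void(const WorkGroupSize&)>& run_kernel,
    int iterations = 10) {
  WorkGroupSize best = candidates.front();
  double best_us = std::numeric_limits<double>::max();
  for (const auto& lws : candidates) {
    run_kernel(lws);  // warm-up run so kernel compilation is not measured
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) run_kernel(lws);
    auto end = std::chrono::steady_clock::now();
    double us =
        std::chrono::duration<double, std::micro>(end - start).count() /
        iterations;
    if (us < best_us) {
      best_us = us;
      best = lws;
    }
  }
  // The winner would then be stored in a tuning file keyed by device + kernel
  // and shipped together with the model for that specific product.
  return best;
}
```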

@llhe

llhe commented Jun 29, 2018

Incorporating more advanced rules, such as ML-based models, is also a potential choice.

@robertwgh

Yeah, that would be interesting. It would also be extremely challenging given the large variety of Android devices and SoC chipsets.
Looking forward to seeing the results. 👍

@izp001

izp001 commented Jun 29, 2018

From the code, I found that the CPU benchmark uses the OpenMP default thread number, which should be 2 threads.
Can you confirm the CPU benchmark thread number?

@nolanliou

@izp001
The CPU benchmark thread number is equal to the number of big cores of the CPU.
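
For reference, a rough sketch of how the big-core count can be inferred on Android/Linux (illustrative only, not MACE's exact code): read each core's maximum frequency from sysfs, count the cores whose maximum equals the highest one, and use that count as the OpenMP thread number.

```cpp
// Illustrative sketch (not MACE's exact code): infer the number of "big"
// cores on an Android/Linux device from each core's maximum frequency and use
// that count as the OpenMP thread number.
#include <omp.h>

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

std::vector<long> ReadMaxFreqs() {
  std::vector<long> freqs;
  for (int cpu = 0;; ++cpu) {
    std::string path = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                       "/cpufreq/cpuinfo_max_freq";
    std::FILE* f = std::fopen(path.c_str(), "r");
    if (f == nullptr) break;  // no more cores
    long freq = 0;
    std::fscanf(f, "%ld", &freq);
    std::fclose(f);
    freqs.push_back(freq);
  }
  return freqs;
}

int BigCoreCount() {
  std::vector<long> freqs = ReadMaxFreqs();
  if (freqs.empty()) return 1;
  long max_freq = *std::max_element(freqs.begin(), freqs.end());
  // Cores whose maximum frequency equals the highest one are treated as big.
  return static_cast<int>(std::count(freqs.begin(), freqs.end(), max_freq));
}

int main() {
  omp_set_num_threads(BigCoreCount());  // e.g. 4 big cores on a typical big.LITTLE SoC
  return 0;
}
```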

@ligonzheng

It seems like CPU mode is much faster than GPU mode.
What's the reason to use the GPU on Android if it cannot accelerate inference?

@llhe

llhe commented Jul 17, 2018

@ligonzheng Only on some low-end SoCs is the CPU faster than the GPU. Usually the GPU is faster, or even much faster, than CPU mode. There are also other benefits, including power efficiency and multitasking (when using the GPU, the CPU can be used for other computations such as image processing algorithms).

@ligonzheng

Some other questions about using MACE:
1. What is the opencl_binary_file? I cannot find the OpenCL libraries in the builds directory. Could I pass a null when using the GPU?
2. What is KVStorageFactory? Does KV mean kernel verbose?
3. Does MACE support reading the proto file from memory? It is not convenient to use a model file by passing a path on Android, and sometimes we also don't want to include the model inside the code.

Thank you for your reply!

@nolanliou

@ligonzheng

  1. Please read the documentation.
  2. KVStorage is used to store built OpenCL binaries, to speed up initialization and the first run (a conceptual sketch follows below).
  3. We support converting the model to C++ code.
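
To make the idea concrete, here is a conceptual sketch of what such a binary cache does. This is not MACE's actual KVStorage interface; the class and method names are illustrative:

```cpp
// Conceptual sketch of what a KVStorage-style cache does (not MACE's actual
// interface; names here are illustrative): persist compiled OpenCL program
// binaries keyed by program name so later runs skip online compilation.
#include <map>
#include <string>
#include <utility>
#include <vector>

class OpenCLBinaryCache {
 public:
  explicit OpenCLBinaryCache(std::string path) : path_(std::move(path)) {
    // A real implementation would load the serialized map from `path_` here.
  }

  // Look up a previously compiled binary; returns false on a cold start,
  // in which case the program must be built from source once.
  bool Find(const std::string& key, std::vector<unsigned char>* binary) const {
    auto it = store_.find(key);
    if (it == store_.end()) return false;
    *binary = it->second;
    return true;
  }

  // Insert the binary obtained via clGetProgramInfo(CL_PROGRAM_BINARIES, ...)
  // after the first successful build; a real implementation would then flush
  // the map back to `path_`.
  void Insert(const std::string& key, std::vector<unsigned char> binary) {
    store_[key] = std::move(binary);
  }

 private:
  std::string path_;
  std::map<std::string, std::vector<unsigned char>> store_;
};
```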

@psyhtest

Happy to find out about this project, and thanks for sharing benchmark results! I'm wondering where your results would lie on the ReQuEST scoreboard.

Specifically for MobileNets v1/v2, are you using the baseline models (224-1.0)? It would be cool to add MACE to the ReQuEST MobileNets workflow and visualize results like below:

[image: ReQuEST MobileNets results visualization]

@psyhtest

We have already added a CK package for MACE.

@liyancas

@llhe
I benchmarked the models on the OnePlus 3T. The performance of the quantized models is worse than the float models. Is this normal?

model_name         device_name     soc       abi         runtime   MACE      SNPE       NCNN       TFLITE
InceptionV3        ONEPLUS A3010   msm8996   arm64-v8a   CPU       884.654   488.97     1616.671   730.468
InceptionV3        ONEPLUS A3010   msm8996   arm64-v8a   DSP       -         5.682      -          -
InceptionV3        ONEPLUS A3010   msm8996   arm64-v8a   GPU       153.473   144.353    -          -
InceptionV3Quant   ONEPLUS A3010   msm8996   arm64-v8a   CPU       -         -          -          1014.662
MobileNetV1        ONEPLUS A3010   msm8996   arm64-v8a   CPU       52.004    702.713    43.301     101.273
MobileNetV1        ONEPLUS A3010   msm8996   arm64-v8a   GPU       23.833    24.228     -          -
MobileNetV1Quant   ONEPLUS A3010   msm8996   arm64-v8a   CPU       36.565    -          -          143.806
MobileNetV2        ONEPLUS A3010   msm8996   arm64-v8a   CPU       40.742    415.117    29.985     56.101
MobileNetV2        ONEPLUS A3010   msm8996   arm64-v8a   GPU       16.403    14.566     -          -
MobileNetV2Quant   ONEPLUS A3010   msm8996   arm64-v8a   CPU       28.688    -          -          294.525
SqueezeNetV11      ONEPLUS A3010   msm8996   arm64-v8a   CPU       37.404    61.414     22.325     -
SqueezeNetV11      ONEPLUS A3010   msm8996   arm64-v8a   GPU       20.021    14.528     -          -
VGG16              ONEPLUS A3010   msm8996   arm64-v8a   CPU       455.553   1416.414   477.352    -
VGG16              ONEPLUS A3010   msm8996   arm64-v8a   DSP       -         137.22     -          -
VGG16              ONEPLUS A3010   msm8996   arm64-v8a   GPU       208.335   -          -          -

@llhe

llhe commented Nov 16, 2018

@liyancas From your results, the speed is ordered like this:

GPU > CPU quant > CPU float

This is the expected result on mid-range or high-end mobiles.

@liyancas

@llhe But for TFLite, the float CPU model is faster than the quantized one. I don't know the reason.

@llhe

llhe commented Nov 16, 2018

@liyancas Did you get the results from mobile-ai-bench? If so, could you please also report an issue in that project? We'll have a look.

@liyancas

Yes. I will double-check the results. If the issue still exists, I will open an issue there. Thanks for your help.

@liyancas

@llhe I posted at XiaoMi/mobile-ai-bench#20

@encorechow

encorechow commented Dec 20, 2018

I am curious what data type and how many iterations were used in these benchmarks?

@nolanliou

@liyancas The run time of the quantized model MobileNetV1Quant is 36.565 ms, while the float model MobileNetV1 on CPU is 52.004 ms, so the quantized model is faster than the float one. That is reasonable.

@achigeor

achigeor commented Feb 11, 2019

@llhe
Are there any insights as to why the Kirin 970 GPU is that much slower than the Snapdragon 845? The compute capabilities should be around the same for these two SoCs, right?

For example, the mobilenet_v2 benchmark runs at ~10 ms on sdm845 and 47.597 ms on kirin970 (tuned). I also see big differences with my custom model: on an sdm845 I get around 75 ms, while on a kirin980 I get around 250 ms.

Is it because the Snapdragon is actually that much faster, or because of how MACE ops/kernels are implemented? If it's the latter, what would be a good place to start working on possible optimizations?
Would you expect improvements if the ARM Compute Library were used for ARM SoCs?

@llhe

llhe commented Feb 12, 2019

@achigeor
MACE uses Image as the underlying OpenCL memory object. From our internal benchmarks, Image performs better than Buffer on Adreno and Mali Bifrost (which Kirin 970 is based on). However, for Mali Utgard and Midgard, Image is not that superior due to the architecture difference.

Generally speaking, Adreno 630 is indeed faster than Mali G72-MP12 (depending on the exact configuration), and of course you can check whether the ARM Compute Library is better optimized for Mali GPUs, which is not covered by https://github.com/XiaoMi/mobile-ai-bench.
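
To illustrate the difference being discussed, here is a minimal sketch of the two OpenCL allocation styles: a plain buffer versus a 2D image holding the same tensor. The RGBA half-float packing and the width/height mapping shown in the comments are a common convention, not necessarily MACE's exact layout for every tensor:

```cpp
// Illustrative sketch of the two OpenCL allocation styles being compared:
// a plain buffer versus a 2D image holding the same tensor, packed as RGBA
// half-float texels so the GPU's texture path and caches can be used.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

#include <cstddef>

cl_mem AllocAsBuffer(cl_context ctx, size_t elem_count, cl_int* err) {
  return clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                        elem_count * sizeof(cl_half), nullptr, err);
}

cl_mem AllocAsImage(cl_context ctx, size_t width, size_t height, cl_int* err) {
  cl_image_format fmt = {CL_RGBA, CL_HALF_FLOAT};  // 4 channels per texel
  cl_image_desc desc = {};
  desc.image_type = CL_MEM_OBJECT_TYPE_IMAGE2D;
  desc.image_width = width;    // e.g. ceil(channels / 4) * out_width
  desc.image_height = height;  // e.g. batch * out_height
  return clCreateImage(ctx, CL_MEM_READ_WRITE, &fmt, &desc, nullptr, err);
}
```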

@chiraggirdhar95

@llhe How is the list of work-group options that is used for finding the optimal tuning configuration (brute-force method) determined for different kernels?

@llhe

llhe commented Feb 14, 2019

@chiraggirdhar95 The candidate parameters to search (https://github.com/XiaoMi/mace/blob/master/mace/ops/opencl/helper.cc#L131, https://github.com/XiaoMi/mace/blob/master/mace/ops/opencl/image/conv_2d_1x1.cc#L143) are somewhat ad hoc, with some heuristics about data locality (the access pattern can differ between kernels but has a limited number of patterns), including cache, cache line, and vector register sizes.

However, as marked in a TODO, this brute-force search is naive and can be improved.
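
As a rough illustration of that kind of heuristic (the real per-kernel lists live in the files linked above and differ in detail), candidates might be generated from power-of-two sizes that match typical vector-register and cache-line widths, capped by the device's maximum work-group size:

```cpp
// Rough illustration of heuristic candidate generation for work-group tuning
// (not MACE's actual lists): power-of-two local sizes matching typical vector
// and cache-line widths, capped by the device's maximum work-group size.
#include <cstddef>
#include <vector>

struct LocalWorkSize { size_t x, y; };  // hypothetical placeholder type

std::vector<LocalWorkSize> CandidateLocalSizes(size_t gws_x, size_t gws_y,
                                               size_t max_wg_size) {
  // Widths of 4/8/16/... line up with float4 vector loads and cache lines.
  const size_t widths[] = {4, 8, 16, 32, 64};
  std::vector<LocalWorkSize> candidates;
  for (size_t w : widths) {
    if (w > gws_x) continue;
    for (size_t h = 1; h * w <= max_wg_size && h <= gws_y; h *= 2) {
      candidates.push_back({w, h});
    }
  }
  return candidates;
}
```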

@achigeor

@llhe Thank you for the prompt reply!

In our test app with our custom model, the Adreno 630 is about 3x faster than the Mali G76 on GPU with MACE as well.
On a OnePlus 6T (Snapdragon 845) we get around 75-80 ms, and on the Huawei Mate 20 Pro (Kirin 980) we get around 250 ms.

Do these results sound normal? I didn't expect that big of a difference.

@chiraggirdhar95

@achigeor Thanks for sharing the timing numbers for your custom model. Can you also share numbers for any open-source model?
