Benchmark results #1

Open · llhe opened this issue Jun 28, 2018 · 43 comments

@llhe

llhe commented Jun 28, 2018

Benchmark results of a previous version are available below (times in ms):

model_name                     device_name   soc        abi           runtime   init       warmup     run_avg    tuned
mobilenet_v2                   polaris       sdm845     armeabi-v7a   GPU       42.868     11.087     9.908      True
mobilenet_v2                   MI MAX        msm8952    armeabi-v7a   GPU       122.791    43.038     39.875     True
mobilenet_v2                   BKL-AL00      kirin970   armeabi-v7a   GPU       767.932    1226.373   47.597     True
mobilenet_v2                   polaris       sdm845     arm64-v8a     GPU       42.3       10.737     10.004     True
mobilenet_v2                   MI MAX        msm8952    arm64-v8a     GPU       129.123    42.584     39.552     True
mobilenet_v2                   BKL-AL00      kirin970   arm64-v8a     GPU       753.43     1170.291   48.016     True
mobilenet_v2                   polaris       sdm845     armeabi-v7a   CPU       16.035     69.761     41.627     False
mobilenet_v2                   MI MAX        msm8952    armeabi-v7a   CPU       31.319     86.206     67.586     False
mobilenet_v2                   BKL-AL00      kirin970   armeabi-v7a   CPU       22.521     137.963    132.012    False
mobilenet_v2                   polaris       sdm845     arm64-v8a     CPU       10.641     80.509     31.985     False
mobilenet_v2                   MI MAX        msm8952    arm64-v8a     CPU       32.225     86.345     54.7       False
mobilenet_v2                   BKL-AL00      kirin970   arm64-v8a     CPU       20.208     97.295     93.987     False
deeplab_v3_plus_mobilenet_v2   polaris       sdm845     armeabi-v7a   GPU       56.512     129.422    128.976    True
deeplab_v3_plus_mobilenet_v2   MI MAX        msm8952    armeabi-v7a   GPU       145.582    899.824    896.452    True
deeplab_v3_plus_mobilenet_v2   BKL-AL00      kirin970   armeabi-v7a   GPU       771.122    2096.33    651.999    True
deeplab_v3_plus_mobilenet_v2   polaris       sdm845     armeabi-v7a   CPU       34.084     951.812    932.764    False
deeplab_v3_plus_mobilenet_v2   MI MAX        msm8952    armeabi-v7a   CPU       91.383     1543.423   1628.255   False
deeplab_v3_plus_mobilenet_v2   BKL-AL00      kirin970   armeabi-v7a   CPU       67.022     2885.098   2872.558   False
deeplab_v3_plus_mobilenet_v2   polaris       sdm845     arm64-v8a     CPU       29.376     656.16     614.679    False
deeplab_v3_plus_mobilenet_v2   MI MAX        msm8952    arm64-v8a     CPU       99.986     1170.636   1469.199   False
deeplab_v3_plus_mobilenet_v2   BKL-AL00      kirin970   arm64-v8a     CPU       55.476     1796.491   1793.253   False
mobilenet_v1                   polaris       sdm845     armeabi-v7a   GPU       45.551     13.858     13.544     True
mobilenet_v1                   MI MAX        msm8952    armeabi-v7a   GPU       114.037    65.088     61.603     True
mobilenet_v1                   BKL-AL00      kirin970   armeabi-v7a   GPU       734.51     1211.078   49.318     True
mobilenet_v1                   polaris       sdm845     arm64-v8a     GPU       45.378     13.689     12.826     True
mobilenet_v1                   MI MAX        msm8952    arm64-v8a     GPU       110.526    64.566     61.696     True
mobilenet_v1                   BKL-AL00      kirin970   arm64-v8a     GPU       730.271    1135.675   48.124     True
mobilenet_v1                   polaris       sdm845     armeabi-v7a   CPU       6.874      79.032     49.676     False
mobilenet_v1                   MI MAX        msm8952    armeabi-v7a   CPU       18.332     121.923    88.207     False
mobilenet_v1                   BKL-AL00      kirin970   armeabi-v7a   CPU       13.0       172.239    164.469    False
mobilenet_v1                   polaris       sdm845     arm64-v8a     CPU       11.347     90.748     32.888     False
mobilenet_v1                   MI MAX        msm8952    arm64-v8a     CPU       18.358     113.023    71.16      False
mobilenet_v1                   BKL-AL00      kirin970   arm64-v8a     CPU       11.666     111.706    107.818    False
resnet_v2_50                   polaris       sdm845     armeabi-v7a   GPU       124.229    95.537     93.047     True
resnet_v2_50                   MI MAX        msm8952    armeabi-v7a   GPU       280.575    637.789    636.295    True
resnet_v2_50                   BKL-AL00      kirin970   armeabi-v7a   GPU       747.875    1596.039   450.651    True
resnet_v2_50                   polaris       sdm845     armeabi-v7a   CPU       18.57      556.961    394.792    False
resnet_v2_50                   MI MAX        msm8952    armeabi-v7a   CPU       44.175     1240.632   734.156    False
resnet_v2_50                   BKL-AL00      kirin970   armeabi-v7a   CPU       26.034     2505.979   1284.285   False
resnet_v2_50                   polaris       sdm845     arm64-v8a     CPU       17.241     438.925    261.949    False
resnet_v2_50                   MI MAX        msm8952    arm64-v8a     CPU       48.691     1143.032   566.313    False
resnet_v2_50                   BKL-AL00      kirin970   arm64-v8a     CPU       23.979     2169.373   499.587    False
vgg16                          polaris       sdm845     armeabi-v7a   CPU       15.537     924.855    438.6      False
vgg16                          MI MAX        msm8952    armeabi-v7a   CPU       40.055     2926.202   800.783    False
vgg16                          BKL-AL00      kirin970   armeabi-v7a   CPU       21.732     2514.862   1242.532   False
vgg16                          polaris       sdm845     arm64-v8a     CPU       12.837     786.419    332.642    False
vgg16                          MI MAX        msm8952    arm64-v8a     CPU       40.693     2794.225   666.285    False
vgg16                          BKL-AL00      kirin970   arm64-v8a     CPU       20.855     2581.558   1043.35    False
vgg16                          polaris       sdm845     armeabi-v7a   GPU       679.21     128.214    125.523    True
vgg16                          MI MAX        msm8952    armeabi-v7a   GPU       1527.823   806.779    761.073    True
vgg16                          BKL-AL00      kirin970   armeabi-v7a   GPU       1893.529   2551.389   1042.256   True
inception_v3_dsp               polaris       sdm845     armeabi-v7a   HEXAGON   585.899    77.921     38.875     False
inception_v3                   polaris       sdm845     armeabi-v7a   CPU       19.726     631.444    481.732    False
inception_v3                   MI MAX        msm8952    armeabi-v7a   CPU       47.674     958.758    839.108    False
inception_v3                   BKL-AL00      kirin970   armeabi-v7a   CPU       29.131     760.945    1194.063   False
inception_v3                   polaris       sdm845     arm64-v8a     CPU       22.251     578.611    425.145    False
inception_v3                   MI MAX        msm8952    arm64-v8a     CPU       50.948     888.531    761.826    False
inception_v3                   BKL-AL00      kirin970   arm64-v8a     CPU       27.106     668.552    789.08     False
inception_v3                   polaris       sdm845     armeabi-v7a   GPU       101.199    92.578     91.602     True
inception_v3                   MI MAX        msm8952    armeabi-v7a   GPU       257.311    588.829    586.779    True
inception_v3                   BKL-AL00      kirin970   armeabi-v7a   GPU       770.779    1621.834   436.877    True
squeezenet_v1_1                polaris       sdm845     armeabi-v7a   GPU       33.615     10.905     10.971     True
squeezenet_v1_1                MI MAX        msm8952    armeabi-v7a   GPU       83.183     47.273     44.548     True
squeezenet_v1_1                BKL-AL00      kirin970   armeabi-v7a   GPU       268.714    437.084    39.404     True
squeezenet_v1_0                polaris       sdm845     armeabi-v7a   GPU       45.145     16.719     15.0       True
squeezenet_v1_0                MI MAX        msm8952    armeabi-v7a   GPU       98.571     76.282     72.081     True
squeezenet_v1_0                BKL-AL00      kirin970   armeabi-v7a   GPU       403.515    1165.101   63.392     True
squeezenet_v1_0                polaris       sdm845     armeabi-v7a   CPU       7.393      94.284     60.057     False
squeezenet_v1_0                MI MAX        msm8952    armeabi-v7a   CPU       27.664     171.195    110.325    False
squeezenet_v1_0                BKL-AL00      kirin970   armeabi-v7a   CPU       14.84      169.715    93.174     False
squeezenet_v1_0                polaris       sdm845     arm64-v8a     CPU       11.9       117.696    49.342     False
squeezenet_v1_0                MI MAX        msm8952    arm64-v8a     CPU       27.554     170.987    95.552     False
squeezenet_v1_0                BKL-AL00      kirin970   arm64-v8a     CPU       13.76      121.544    79.353     False
squeezenet_v1_1                polaris       sdm845     arm64-v8a     CPU       9.583      61.783     25.376     False
squeezenet_v1_1                MI MAX        msm8952    arm64-v8a     CPU       21.424     98.661     53.031     False
squeezenet_v1_1                BKL-AL00      kirin970   arm64-v8a     CPU       11.005     67.381     41.086     False

More recent results will be available on the GitLab mirror project's CI page soon.

A dedicated mobile-device deep learning framework benchmark project, MobileAIBench, is available here: https://github.com/XiaoMi/mobile-ai-bench

@songruoningbupt

👍

@llhe

llhe commented Jun 29, 2018

The daily benchmark results are available here:

@DiamonJoy

I really appreciate the results, but I am curious why DSP results are only available for inception_v3?

@llhe

llhe commented Jun 29, 2018

@DiamonJoy The benchmark is actually the CI result of the MACE Model Zoo project.
Until now, our efforts have mainly focused on the float data type and the CPU/GPU runtimes, and we have not had enough time to add more quantized models to MACE Model Zoo. Quantization (CPU and DSP) support and adding more models to MACE Model Zoo are on our roadmap.

@robertwgh

@llhe Amazing results!
Can you explain a little more about the "tuned" column?

@llhe

llhe commented Jun 29, 2018

Tuned means the OpenCL kernel is tuned for the specific device type instead of using the general rule.

@robertwgh

Is this tuning process done manually offline, or is it done automatically at run time?
If I understand correctly, is it mainly work-group size tuning?

@llhe

llhe commented Jun 29, 2018

@robertwgh
In our original use case, we deploy each model against a specific device (usually a new product), so we want it to be ultimately optimized by a brute-force search over a list of work-group options. However, general application developers usually want to generate a library that applies to all devices.

It's offline now. We may consider improving the general rule or enabling online incremental tuning in the future.
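
For illustration, here is a minimal sketch of what such offline brute-force work-group tuning looks like conceptually. It is not MACE's actual tuner; `WorkGroupSize` and `run_kernel` are placeholder names, and the runner is assumed to enqueue the OpenCL kernel with the given local size and wait for completion:

```cpp
// Minimal sketch of offline brute-force work-group tuning (illustrative, not
// MACE's actual tuner): time each candidate local work-group size and keep
// the fastest one for this device/kernel pair.
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

struct WorkGroupSize { size_t x, y, z; };  // hypothetical placeholder type

// `run_kernel` is assumed to enqueue the OpenCL kernel with the given local
// size and block until it has finished (e.g. by calling clFinish).
WorkGroupSize TuneWorkGroupSize(
    const std::vector<WorkGroupSize>& candidates,
    const std::function<void(const WorkGroupSize&)>& run_kernel,
    int iterations = 10) {
  WorkGroupSize best = candidates.front();
  double best_us = std::numeric_limits<double>::max();
  for (const auto& lws : candidates) {
    run_kernel(lws);  // warm-up run so kernel compilation is not measured
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) run_kernel(lws);
    auto end = std::chrono::steady_clock::now();
    double us =
        std::chrono::duration<double, std::micro>(end - start).count() /
        iterations;
    if (us < best_us) {
      best_us = us;
      best = lws;
    }
  }
  // The winner would then be stored in a tuning file keyed by device + kernel
  // and shipped together with the model for that specific product.
  return best;
}
```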

@llhe

llhe commented Jun 29, 2018

Incorporating more advanced rules, such as ML-based models, is also a potential choice.

@robertwgh

Yeah, that would be interesting. It would also be extremely challenging given the large variety of Android devices and SoC chipsets.
Looking forward to seeing the results. 👍

@izp001

izp001 commented Jun 29, 2018

From the code, I found that the CPU benchmark uses the OpenMP default thread number, which should be 2 threads.
Can you confirm the CPU benchmark thread number?

@nolanliou

@izp001
The CPU benchmark thread number is equal to the number of big cores of the CPU.
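
For reference, a rough sketch of how the big-core count can be inferred on Android/Linux (illustrative only, not MACE's exact code): read each core's maximum frequency from sysfs, count the cores whose maximum equals the highest one, and use that count as the OpenMP thread number.

```cpp
// Illustrative sketch (not MACE's exact code): infer the number of "big"
// cores on an Android/Linux device from each core's maximum frequency and use
// that count as the OpenMP thread number.
#include <omp.h>

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

std::vector<long> ReadMaxFreqs() {
  std::vector<long> freqs;
  for (int cpu = 0;; ++cpu) {
    std::string path = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                       "/cpufreq/cpuinfo_max_freq";
    std::FILE* f = std::fopen(path.c_str(), "r");
    if (f == nullptr) break;  // no more cores
    long freq = 0;
    std::fscanf(f, "%ld", &freq);
    std::fclose(f);
    freqs.push_back(freq);
  }
  return freqs;
}

int BigCoreCount() {
  std::vector<long> freqs = ReadMaxFreqs();
  if (freqs.empty()) return 1;
  long max_freq = *std::max_element(freqs.begin(), freqs.end());
  // Cores whose maximum frequency equals the highest one are treated as big.
  return static_cast<int>(std::count(freqs.begin(), freqs.end(), max_freq));
}

int main() {
  omp_set_num_threads(BigCoreCount());  // e.g. 4 big cores on a typical big.LITTLE SoC
  return 0;
}
```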

@ligonzheng

It seems like CPU mode is much faster than GPU mode.
What's the reason to use the GPU on Android if it cannot accelerate inference?

@llhe

llhe commented Jul 17, 2018

@ligonzheng Only on some low-end SoCs is the CPU faster than the GPU. Usually the GPU is faster, or even much faster, than CPU mode. There are also other benefits, including power efficiency and multitasking (when using the GPU, the CPU can be used for other computations such as image processing algorithms).

@ligonzheng

Some other questions about using MACE:
1. What is the opencl_binary_file? I cannot find the OpenCL libraries in the builds directory. Could I pass a null when using the GPU?
2. What is KVStorageFactory? Does KV mean kernel verbose?
3. Does MACE support reading the proto file from memory? It is not convenient to use a model file by passing a path on Android, and sometimes we also don't want to include the model inside the code.

Thank you for your reply!

@nolanliou

@ligonzheng

  1. Please read the documentation.
  2. KVStorage is used to store built OpenCL binaries, to speed up initialization and the first run (a conceptual sketch follows below).
  3. We support converting the model to C++ code.
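
To make the idea concrete, here is a conceptual sketch of what such a binary cache does. This is not MACE's actual KVStorage interface; the class and method names are illustrative:

```cpp
// Conceptual sketch of what a KVStorage-style cache does (not MACE's actual
// interface; names here are illustrative): persist compiled OpenCL program
// binaries keyed by program name so later runs skip online compilation.
#include <map>
#include <string>
#include <utility>
#include <vector>

class OpenCLBinaryCache {
 public:
  explicit OpenCLBinaryCache(std::string path) : path_(std::move(path)) {
    // A real implementation would load the serialized map from `path_` here.
  }

  // Look up a previously compiled binary; returns false on a cold start,
  // in which case the program must be built from source once.
  bool Find(const std::string& key, std::vector<unsigned char>* binary) const {
    auto it = store_.find(key);
    if (it == store_.end()) return false;
    *binary = it->second;
    return true;
  }

  // Insert the binary obtained via clGetProgramInfo(CL_PROGRAM_BINARIES, ...)
  // after the first successful build; a real implementation would then flush
  // the map back to `path_`.
  void Insert(const std::string& key, std::vector<unsigned char> binary) {
    store_[key] = std::move(binary);
  }

 private:
  std::string path_;
  std::map<std::string, std::vector<unsigned char>> store_;
};
```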

@psyhtest

Happy to find out about this project, and thanks for sharing benchmark results! I'm wondering where your results would lie on the ReQuEST scoreboard.

Specifically for MobileNets v1/v2, are you using the baseline models (224-1.0)? It would be cool to add MACE to the ReQuEST MobileNets workflow and visualize results like below:

[image: ReQuEST MobileNets results visualization]

@psyhtest

We have already added a CK package for MACE.

@liyancas

@llhe
I benchmarked the models on the OnePlus 3T. The performance of the quantized models is worse than the float models. Is this normal?

model_name         device_name     soc       abi         runtime   MACE      SNPE       NCNN       TFLITE
InceptionV3        ONEPLUS A3010   msm8996   arm64-v8a   CPU       884.654   488.97     1616.671   730.468
InceptionV3        ONEPLUS A3010   msm8996   arm64-v8a   DSP       -         5.682      -          -
InceptionV3        ONEPLUS A3010   msm8996   arm64-v8a   GPU       153.473   144.353    -          -
InceptionV3Quant   ONEPLUS A3010   msm8996   arm64-v8a   CPU       -         -          -          1014.662
MobileNetV1        ONEPLUS A3010   msm8996   arm64-v8a   CPU       52.004    702.713    43.301     101.273
MobileNetV1        ONEPLUS A3010   msm8996   arm64-v8a   GPU       23.833    24.228     -          -
MobileNetV1Quant   ONEPLUS A3010   msm8996   arm64-v8a   CPU       36.565    -          -          143.806
MobileNetV2        ONEPLUS A3010   msm8996   arm64-v8a   CPU       40.742    415.117    29.985     56.101
MobileNetV2        ONEPLUS A3010   msm8996   arm64-v8a   GPU       16.403    14.566     -          -
MobileNetV2Quant   ONEPLUS A3010   msm8996   arm64-v8a   CPU       28.688    -          -          294.525
SqueezeNetV11      ONEPLUS A3010   msm8996   arm64-v8a   CPU       37.404    61.414     22.325     -
SqueezeNetV11      ONEPLUS A3010   msm8996   arm64-v8a   GPU       20.021    14.528     -          -
VGG16              ONEPLUS A3010   msm8996   arm64-v8a   CPU       455.553   1416.414   477.352    -
VGG16              ONEPLUS A3010   msm8996   arm64-v8a   DSP       -         137.22     -          -
VGG16              ONEPLUS A3010   msm8996   arm64-v8a   GPU       208.335   -          -          -

@llhe

llhe commented Nov 16, 2018

@liyancas From your results, the speed is ordered like this:

GPU > CPU quant > CPU float

This is the expected result on mid-range or high-end mobiles.

@liyancas

@llhe But for TFLite, the float CPU model is faster than the quantized one. I don't know the reason.

@llhe

llhe commented Nov 16, 2018

@liyancas Did you get the results from mobile-ai-bench? If so, could you please also report an issue in that project? We'll have a look.

@liyancas

Yes. I will double-check the results. If the issue still exists, I will open an issue there. Thanks for your help.

@liyancas

@llhe I posted at XiaoMi/mobile-ai-bench#20

@encorechow

encorechow commented Dec 20, 2018

I am curious what data type and how many iterations were used in these benchmarks?

@nolanliou

@liyancas The run time of the quantized model MobileNetV1Quant is 36.565 ms, while the float model MobileNetV1 on CPU is 52.004 ms, so the quantized model is faster than the float one. That is reasonable.

@achigeor

achigeor commented Feb 11, 2019

@llhe
Are there any insights as to why the Kirin 970 GPU is that much slower than the Snapdragon 845? The compute capabilities should be around the same for these two SoCs, right?

For example, the mobilenet_v2 benchmark runs at ~10 ms on sdm845 and 47.597 ms on kirin970 (tuned). I also see big differences with my custom model: on an sdm845 I get around 75 ms, while on a kirin980 I get around 250 ms.

Is it because the Snapdragon is actually that much faster, or because of how MACE ops/kernels are implemented? If it's the latter, what would be a good place to start working on possible optimizations?
Would you expect improvements if the ARM Compute Library were used for ARM SoCs?

@llhe

llhe commented Feb 12, 2019

@achigeor
MACE uses Image as the underlying OpenCL memory object. From our internal benchmarks, Image performs better than Buffer on Adreno and Mali Bifrost (which Kirin 970 is based on). However, for Mali Utgard and Midgard, Image is not that superior due to the architecture difference.

Generally speaking, Adreno 630 is indeed faster than Mali G72-MP12 (depending on the exact configuration), and of course you can check whether the ARM Compute Library is better optimized for Mali GPUs, which is not covered by https://github.com/XiaoMi/mobile-ai-bench.
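
To illustrate the difference being discussed, here is a minimal sketch of the two OpenCL allocation styles: a plain buffer versus a 2D image holding the same tensor. The RGBA half-float packing and the width/height mapping shown in the comments are a common convention, not necessarily MACE's exact layout for every tensor:

```cpp
// Illustrative sketch of the two OpenCL allocation styles being compared:
// a plain buffer versus a 2D image holding the same tensor, packed as RGBA
// half-float texels so the GPU's texture path and caches can be used.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

#include <cstddef>

cl_mem AllocAsBuffer(cl_context ctx, size_t elem_count, cl_int* err) {
  return clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                        elem_count * sizeof(cl_half), nullptr, err);
}

cl_mem AllocAsImage(cl_context ctx, size_t width, size_t height, cl_int* err) {
  cl_image_format fmt = {CL_RGBA, CL_HALF_FLOAT};  // 4 channels per texel
  cl_image_desc desc = {};
  desc.image_type = CL_MEM_OBJECT_TYPE_IMAGE2D;
  desc.image_width = width;    // e.g. ceil(channels / 4) * out_width
  desc.image_height = height;  // e.g. batch * out_height
  return clCreateImage(ctx, CL_MEM_READ_WRITE, &fmt, &desc, nullptr, err);
}
```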

@chiraggirdhar95

@llhe How is the list of work-group options that is used for finding the optimal tuning configuration (brute-force method) determined for different kernels?

@llhe

llhe commented Feb 14, 2019

@chiraggirdhar95 The candidate parameters to search (https://github.com/XiaoMi/mace/blob/master/mace/ops/opencl/helper.cc#L131, https://github.com/XiaoMi/mace/blob/master/mace/ops/opencl/image/conv_2d_1x1.cc#L143) are somewhat ad hoc, with some heuristics about data locality (the access pattern can differ between kernels but has a limited number of patterns), including cache, cache line, and vector register sizes.

However, as marked in a TODO, this brute-force search is naive and can be improved.
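
As a rough illustration of that kind of heuristic (the real per-kernel lists live in the files linked above and differ in detail), candidates might be generated from power-of-two sizes that match typical vector-register and cache-line widths, capped by the device's maximum work-group size:

```cpp
// Rough illustration of heuristic candidate generation for work-group tuning
// (not MACE's actual lists): power-of-two local sizes matching typical vector
// and cache-line widths, capped by the device's maximum work-group size.
#include <cstddef>
#include <vector>

struct LocalWorkSize { size_t x, y; };  // hypothetical placeholder type

std::vector<LocalWorkSize> CandidateLocalSizes(size_t gws_x, size_t gws_y,
                                               size_t max_wg_size) {
  // Widths of 4/8/16/... line up with float4 vector loads and cache lines.
  const size_t widths[] = {4, 8, 16, 32, 64};
  std::vector<LocalWorkSize> candidates;
  for (size_t w : widths) {
    if (w > gws_x) continue;
    for (size_t h = 1; h * w <= max_wg_size && h <= gws_y; h *= 2) {
      candidates.push_back({w, h});
    }
  }
  return candidates;
}
```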

@achigeor

@llhe Thank you for the prompt reply!

In our test app with our custom model, the Adreno 630 is about 3x faster than the Mali G76 on GPU with MACE as well.
On a OnePlus 6T (Snapdragon 845) we get around 75-80 ms, and on the Huawei Mate 20 Pro (Kirin 980) we get around 250 ms.

Do these results sound normal? I didn't expect that big of a difference.

@chiraggirdhar95

@achigeor Thanks for sharing the timing numbers for your custom model. Can you also share numbers for any open-source model?
