Add AVX,AVX2 support for ggml_vec_scale_f32 #285

Merged: 1 commit merged on Dec 17, 2022

katsu560
Contributor

@katsu560 katsu560 commented Dec 17, 2022

I added AVX intrinsics to the ggml_vec_scale_f32 function.
In my test app, the intrinsics give a modest speedup; please see the results below.
I also found that the function is unexpectedly slow at nloop = 1024, 2048, 4096, 32768 and 262144 when compiled with gcc 8.4.0 at -O3.
This does not happen with -Ofast.
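
For readers unfamiliar with the technique, here is a minimal sketch of what an AVX version of such a scaling loop can look like. It is not the code merged in this PR: the function name `vec_scale_f32_sketch` is hypothetical, and the unaligned loads and scalar tail loop are assumptions made for illustration. The AVX path is compiled only when the compiler defines `__AVX__` (for example with `-mavx`); otherwise the plain loop handles everything.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical stand-in for ggml_vec_scale_f32: computes y[i] *= v for i in [0, n).
static void vec_scale_f32_sketch(const int n, float * y, const float v) {
    int i = 0;
#if defined(__AVX__)
    // Broadcast the scale factor into all 8 lanes of a 256-bit register.
    const __m256 vv = _mm256_set1_ps(v);
    // Process 8 floats per iteration; unaligned load/store is assumed here
    // so the caller does not need to guarantee 32-byte alignment.
    for (; i + 8 <= n; i += 8) {
        __m256 ay = _mm256_loadu_ps(y + i);
        ay = _mm256_mul_ps(ay, vv);
        _mm256_storeu_ps(y + i, ay);
    }
#endif
    // Scalar tail (or the whole array when AVX is not enabled at compile time).
    for (; i < n; ++i) {
        y[i] *= v;
    }
}
```

Because a modern compiler at -O3 can often auto-vectorize the plain loop on its own, the gain from hand-written intrinsics may be small, which is consistent with the ratios close to 1.0 in the tables below.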

i5-3320M, ggml_vec_scale_f32: time (msec) and speedup ratio

| nloop | O3normal | O3avx | Ofastnormal | Ofastavx | Ofastnormal/Ofastavx |
|---|---|---|---|---|---|
| 8 | 0.000825 | 0.000828 | 0.000795 | 0.000790 | 1.006658 |
| 16 | 0.000883 | 0.000875 | 0.000790 | 0.000794 | 0.995787 |
| 32 | 0.000985 | 0.000958 | 0.000791 | 0.000794 | 0.996918 |
| 64 | 0.000781 | 0.000780 | 0.000796 | 0.000802 | 0.991909 |
| 128 | 0.000786 | 0.000788 | 0.000802 | 0.000801 | 1.001428 |
| 256 | 0.000814 | 0.000816 | 0.000817 | 0.000823 | 0.992932 |
| 512 | 0.000841 | 0.000816 | 0.000853 | 0.000842 | 1.012188 |
| 1024 | 0.006802 | 0.006493 | 0.000912 | 0.000876 | 1.040143 |
| 2048 | 0.012983 | 0.012320 | 0.001030 | 0.000967 | 1.064961 |
| 4096 | 0.023904 | 0.022840 | 0.001274 | 0.001142 | 1.115731 |
| 8192 | 0.001892 | 0.001779 | 0.001911 | 0.001818 | 1.051560 |
| 16384 | 0.002995 | 0.002916 | 0.003038 | 0.002963 | 1.025504 |
| 32768 | 0.206894 | 0.173663 | 0.005211 | 0.005053 | 1.031208 |
| 65536 | 0.010136 | 0.009971 | 0.010099 | 0.009781 | 1.032603 |
| 131072 | 0.027599 | 0.025793 | 0.028916 | 0.025023 | 1.155556 |
| 262144 | 0.971159 | 0.943179 | 0.052940 | 0.049019 | 1.079975 |
| 524288 | 0.123003 | 0.119847 | 0.111943 | 0.108255 | 1.034067 |

i3-10110U, ggml_vec_scale_f32: time (msec) and speedup ratio

| nloop | O3normal | O3avx | Ofastnormal | Ofastavx | Ofastnormal/Ofastavx |
|---|---|---|---|---|---|
| 8 | 0.000062 | 0.000064 | 0.000024 | 0.000023 | 1.002240 |
| 16 | 0.000097 | 0.000101 | 0.000024 | 0.000024 | 0.997858 |
| 32 | 0.000172 | 0.000166 | 0.000026 | 0.000025 | 1.019047 |
| 64 | 0.000030 | 0.000029 | 0.000028 | 0.000027 | 1.033117 |
| 128 | 0.000035 | 0.000033 | 0.000032 | 0.000031 | 1.034460 |
| 256 | 0.000044 | 0.000043 | 0.000041 | 0.000039 | 1.062638 |
| 512 | 0.000062 | 0.000057 | 0.000065 | 0.000053 | 1.225189 |
| 1024 | 0.004726 | 0.004575 | 0.000122 | 0.000091 | 1.340776 |
| 2048 | 0.009288 | 0.009128 | 0.000220 | 0.000164 | 1.335846 |
| 4096 | 0.017843 | 0.017766 | 0.000423 | 0.000293 | 1.444074 |
| 8192 | 0.000635 | 0.000602 | 0.000804 | 0.000598 | 1.344614 |
| 16384 | 0.001448 | 0.001350 | 0.001589 | 0.001378 | 1.152535 |
| 32768 | 0.139295 | 0.139047 | 0.003171 | 0.002728 | 1.162412 |
| 65536 | 0.006614 | 0.006076 | 0.006498 | 0.006171 | 1.053028 |
| 131072 | 0.013934 | 0.013880 | 0.013894 | 0.013764 | 1.009410 |
| 262144 | 0.829618 | 0.801937 | 0.030017 | 0.029203 | 1.027896 |
| 524288 | 0.065351 | 0.060838 | 0.060361 | 0.057639 | 1.047231 |
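
The test app that produced these numbers is not shown in this thread. As a rough illustration of how per-call timings like these could be collected, here is a minimal sketch under stated assumptions: the names `scale_f32` and `bench_scale_ms` are hypothetical, timing uses the standard C `clock()` function (CPU time, coarse resolution), and the scale factor alternates between 2.0 and 0.5 so the buffer values stay bounded across repetitions.

```c
#include <stdlib.h>
#include <time.h>

// Plain scalar scale, standing in for the function under test.
static void scale_f32(const int n, float * y, const float v) {
    for (int i = 0; i < n; ++i) {
        y[i] *= v;
    }
}

// Returns the average time in milliseconds per call over `reps` calls
// on a buffer of n floats, or -1.0 if the buffer cannot be allocated.
static double bench_scale_ms(const int n, const int reps) {
    float * y = malloc((size_t) n * sizeof(float));
    if (y == NULL) {
        return -1.0;
    }
    for (int i = 0; i < n; ++i) {
        y[i] = 1.0f;
    }

    const clock_t t0 = clock();
    for (int r = 0; r < reps; ++r) {
        // Alternate 2.0 and 0.5 so values neither overflow nor shrink toward zero.
        scale_f32(n, y, (r & 1) ? 0.5f : 2.0f);
    }
    const clock_t t1 = clock();

    // Read one result through a volatile so it is harder for the compiler
    // to discard the timed work as dead code.
    volatile float sink = y[n - 1];
    (void) sink;

    free(y);
    return 1000.0 * (double) (t1 - t0) / (double) CLOCKS_PER_SEC / (double) reps;
}
```

A caller would loop over the sizes of interest (8, 16, ..., 524288) and print `bench_scale_ms(n, reps)` for each, building the same binary once with -O3 and once with -Ofast to compare.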

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 8.4.0-1ubuntu1-18.04' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-8 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 8.4.0 (Ubuntu 8.4.0-1ubuntu1~18.04)

pkversion Ubuntu 8.4.0-1ubuntu1~18.04

@ggerganov
Owner

I don't see an improvement in the Encoder benchmarks because ggml_vec_scale_f32() takes negligible time overall in this case. Nevertheless, I think this change is nice to have. Thank you.

The slowdown at nloop = 1024, 2048, 4096, 32768 and 262144 is strange. I don't understand it.

Btw, if you have similar scripts or benchmark tools that you are willing to share, feel free to open a PR and add them to the tests or bench folder of the project. These can be useful in the future for re-evaluating the performance of different parts of the codebase.

@ggerganov ggerganov merged commit 419b8a6 into ggerganov:master Dec 17, 2022
@djthorpe
Contributor

I had to disable AVX2 on my Mac (Intel Xeon E5, MacPro6,1) by passing the following arguments through to cmake:
-D WHISPER_NO_AVX2=on

If I didn't do that, I got some "Illegal Instruction" kernel panics.
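
An "Illegal Instruction" failure like this typically means the binary contains instructions the CPU does not implement. Besides disabling AVX2 at build time as above, a program can check for the feature at run time before taking an AVX2 code path. The sketch below shows one way to do that with the GCC/Clang x86 builtins `__builtin_cpu_init` and `__builtin_cpu_supports`; the wrapper name `cpu_has_avx2` is hypothetical, and nothing here implies the project performs such a check.

```c
// Returns 1 if the running CPU reports AVX2 support, 0 otherwise.
// On compilers or architectures without the x86 builtins, this
// conservatively returns 0.
static int cpu_has_avx2(void) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // initialize the CPU feature data used by the builtin
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}
```

A dispatcher could call `cpu_has_avx2()` once at startup and fall back to a non-AVX2 routine when it returns 0. Note that this only helps if the AVX2 code is isolated in functions compiled separately with AVX2 enabled; a binary built entirely with `-mavx2` can still fault before the check runs.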

@katsu560
Contributor Author

Thank you for merging, Georgi.

I agree that the slowdown is strange. I don't understand why it happens either.
I'll share my test app later.

@katsu560 katsu560 deleted the devpr branch December 20, 2022 23:04
@katsu560
Contributor Author

Please review the pull request "Add ggml performance test" #306.
