Add AVX,AVX2 support for ggml_vec_scale_f32 #285

katsu560 · 2022-12-17T05:00:33Z

I added AVX intrinsics to the ggml_vec_scale_f32 function.
In my test app, the intrinsics give a small performance gain; please see the results below.
I also found that it is slow at nloop = 1024, 2048, 4096, 32768, and 262144 when compiled with gcc 8.4.0 and the -O3 option.
This does not happen with the -Ofast option.
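For readers who have not looked at the diff, the following is a minimal sketch of what an AVX version of an in-place scale can look like. It is not the code from this pull request: the function name `vec_scale_f32_sketch` is hypothetical, the loop is not unrolled, and it falls back to a plain scalar loop when the compiler is not targeting AVX (for example, when `-mavx` is not passed).

```c
#include <assert.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper (not the ggml implementation): y[i] *= v for i in [0, n).
static void vec_scale_f32_sketch(const int n, float *y, const float v) {
    assert(n >= 0);
    int i = 0;
#if defined(__AVX__)
    // Broadcast the scale factor to all 8 lanes of a 256-bit register.
    const __m256 vv = _mm256_set1_ps(v);
    // Largest multiple of 8 that is <= n.
    const int n8 = n & ~7;
    for (; i < n8; i += 8) {
        // Unaligned load/store, so y needs no particular alignment.
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_mul_ps(vy, vv);
        _mm256_storeu_ps(y + i, vy);
    }
#endif
    // Scalar tail (or the whole array when AVX is not enabled).
    for (; i < n; ++i) {
        y[i] *= v;
    }
}
```

Only AVX (not AVX2) instructions are used here, since a single-precision multiply needs nothing beyond `_mm256_mul_ps`.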

i5-3320M, ggml_vec_scale_f32: time in msec and ratio

nloop	O3normal	O3avx	Ofastnormal	Ofastavx	Ofastnormal/avx
8	0.000825	0.000828	0.000795	0.000790	1.006658
16	0.000883	0.000875	0.000790	0.000794	0.995787
32	0.000985	0.000958	0.000791	0.000794	0.996918
64	0.000781	0.000780	0.000796	0.000802	0.991909
128	0.000786	0.000788	0.000802	0.000801	1.001428
256	0.000814	0.000816	0.000817	0.000823	0.992932
512	0.000841	0.000816	0.000853	0.000842	1.012188
1024	0.006802	0.006493	0.000912	0.000876	1.040143
2048	0.012983	0.012320	0.001030	0.000967	1.064961
4096	0.023904	0.022840	0.001274	0.001142	1.115731
8192	0.001892	0.001779	0.001911	0.001818	1.051560
16384	0.002995	0.002916	0.003038	0.002963	1.025504
32768	0.206894	0.173663	0.005211	0.005053	1.031208
65536	0.010136	0.009971	0.010099	0.009781	1.032603
131072	0.027599	0.025793	0.028916	0.025023	1.155556
262144	0.971159	0.943179	0.052940	0.049019	1.079975
524288	0.123003	0.119847	0.111943	0.108255	1.034067

i3-10110U, ggml_vec_scale_f32: time in msec and ratio

nloop	O3normal	O3avx	Ofastnormal	Ofastavx	Ofastnormal/avx
8	0.000062	0.000064	0.000024	0.000023	1.002240
16	0.000097	0.000101	0.000024	0.000024	0.997858
32	0.000172	0.000166	0.000026	0.000025	1.019047
64	0.000030	0.000029	0.000028	0.000027	1.033117
128	0.000035	0.000033	0.000032	0.000031	1.034460
256	0.000044	0.000043	0.000041	0.000039	1.062638
512	0.000062	0.000057	0.000065	0.000053	1.225189
1024	0.004726	0.004575	0.000122	0.000091	1.340776
2048	0.009288	0.009128	0.000220	0.000164	1.335846
4096	0.017843	0.017766	0.000423	0.000293	1.444074
8192	0.000635	0.000602	0.000804	0.000598	1.344614
16384	0.001448	0.001350	0.001589	0.001378	1.152535
32768	0.139295	0.139047	0.003171	0.002728	1.162412
65536	0.006614	0.006076	0.006498	0.006171	1.053028
131072	0.013934	0.013880	0.013894	0.013764	1.009410
262144	0.829618	0.801937	0.030017	0.029203	1.027896
524288	0.065351	0.060838	0.060361	0.057639	1.047231

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 8.4.0-1ubuntu1-18.04' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-8 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 8.4.0 (Ubuntu 8.4.0-1ubuntu1~18.04)

pkgversion Ubuntu 8.4.0-1ubuntu1~18.04

ggerganov · 2022-12-17T17:40:04Z

I don't see the improvement in the Encoder benchmarks because ggml_vec_scale_f32() takes negligible time overall in this case. Nevertheless, I think this change is nice to have. Thank you.

The slowdown for 1024, 2048, 4096, 32768, and 262144 is strange. I don't understand it.

Btw, if you have similar scripts or benchmark tools that you are willing to share, feel free to open PR and add them in the tests or bench folder of the project. These can be useful in the future for re-evaluating the performance of different parts of the codebase.
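As a rough illustration of the kind of benchmark tool meant here, the sketch below times a scale function over a buffer of `n` floats and reports the average milliseconds per call. It is not the test app used for the tables above; the names `scale_fn`, `scale_scalar`, and `bench_scale_ms` are hypothetical, and `clock()` measures CPU time at a coarse resolution, so many repetitions are needed for small `n`.

```c
#include <stdlib.h>
#include <time.h>

// Hypothetical signature for the function under test.
typedef void (*scale_fn)(int n, float *y, float v);

// Plain scalar reference implementation.
static void scale_scalar(int n, float *y, float v) {
    for (int i = 0; i < n; ++i) {
        y[i] *= v;
    }
}

// Returns the average time per call in milliseconds,
// or -1.0 on invalid arguments or allocation failure.
static double bench_scale_ms(scale_fn fn, int n, int reps) {
    if (n <= 0 || reps <= 0) {
        return -1.0;
    }
    float *y = malloc((size_t)n * sizeof *y);
    if (y == NULL) {
        return -1.0;
    }
    for (int i = 0; i < n; ++i) {
        y[i] = 1.0f;
    }
    // Read the factor through a volatile so the compiler cannot
    // fold the multiplication by 1.0 away.
    volatile float v_src = 1.0f;
    const float v = v_src;

    const clock_t t0 = clock();
    for (int r = 0; r < reps; ++r) {
        fn(n, y, v);
    }
    const clock_t t1 = clock();

    // Consume one result so the work is observable.
    volatile float sink = y[n - 1];
    (void)sink;
    free(y);
    return 1000.0 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC / (double)reps;
}
```

Calling `bench_scale_ms(scale_scalar, 1024, 100000)` and the same for an intrinsic variant, across a range of `n`, would give a table similar in shape to the ones in the pull request description.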

djthorpe · 2022-12-18T11:03:25Z

I had to disable AVX2 on my Mac (Intel Xeon E5, MacPro6,1) by passing the following arguments through to cmake:
-D WHISPER_NO_AVX2=on

If I didn't do that, I got some "Illegal Instruction" kernel panics.
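An "Illegal Instruction" error is what one would typically expect when a binary built with AVX2 enabled runs on a CPU that does not implement AVX2. The sketch below shows one way to check this at run time with the GCC/Clang builtin `__builtin_cpu_supports`. The function name `cpu_has_avx2_sketch` is hypothetical, and this check is not part of the build; it is only a way to confirm whether the flag above is needed on a given machine.

```c
// Returns 1 if the CPU reports AVX2, 0 if it does not,
// and -1 if this compiler/target cannot answer the question.
static int cpu_has_avx2_sketch(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    // Initialise the CPU feature data; harmless if already done.
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return -1;
#endif
}
```

If this returns 0, building with `-D WHISPER_NO_AVX2=on` as described above is the appropriate choice for that machine.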

katsu560 · 2022-12-20T23:02:40Z

Thank you for merging, Georgi.

I agree that the slowdown is strange. I don't understand why it happens either.
I'll share my test app later.

katsu560 · 2022-12-22T14:34:24Z

Please take a look at pull request #306, "Add ggml performance test".

Add AVX,AVX2 support for ggml_vec_scale_f32

ceaccca

ggerganov merged commit 419b8a6 into ggerganov:master Dec 17, 2022

katsu560 deleted the devpr branch December 20, 2022 23:04
