
Try to add AVX 512-bit support #95

Closed
wants to merge 7 commits

Conversation

ggerganov (Owner) commented Oct 26, 2022

Update:

This PR adds AVX 512-bit support. The performance compared to AVX2 is worse. Either I am not utilising the 512-bit instruction set correctly, or it simply does not provide any benefit for this type of computation. I'll leave this draft PR open in case other people are interested in giving it a try, but for now I am not going to merge it.


OUTDATED BELOW

Work in progress

This is not tested because I don't have an AVX 512-bit CPU, so it is very likely that the code will fail.
Still, I would appreciate it if someone gives it a try and reports any issues.

@ArtyomZemlyak
Are you interested in giving it a try? I noticed you have a CPU with AVX 512-bit support.

ggerganov mentioned this pull request Oct 26, 2022
lerela commented Oct 26, 2022

Hi @ggerganov, I gave it a try out of curiosity on my i7-1165G7 on Ubuntu 22.04. It does not work, but unfortunately there isn't much to report: the script just runs forever. Let me know if there is a way to provide verbose logs.

avx512 branch:

➜ make
cc  -I.              -O3 -std=c11   -pthread -mavx512f -mavx512dq -mfma -mf16c   -c ggml.c
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main 

❯ ./main -v -l fr -m ../whisper.cpp/models/ggml-medium.bin -f ../whisper.cpp/testfile-16b.wav
whisper_model_load: loading model from '../whisper.cpp/models/ggml-medium.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 4
whisper_model_load: mem_required  = 2610.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 1644.98 MB
whisper_model_load: memory size =   182.62 MB 
whisper_model_load: model size  =  1462.12 MB

main: processing '../whisper.cpp/testfile-16b.wav' (1243847 samples, 77.7 sec), 4 threads, lang = fr, task = transcribe, timestamps = 1 ...

master branch for reference:

➜ make
cc  -I.              -O3 -std=c11   -pthread -mavx -mavx2 -mfma -mf16c   -c ggml.c
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main 

➜ ./main -l fr -m models/ggml-medium.bin -f testfile-16b.wav 
whisper_model_load: loading model from 'models/ggml-medium.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 4
whisper_model_load: mem_required  = 2610.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 1644.98 MB
whisper_model_load: memory size =   182.62 MB 
whisper_model_load: model size  =  1462.12 MB

main: processing 'testfile-16b.wav' (1243847 samples, 77.7 sec), 4 threads, lang = fr, task = transcribe, timestamps = 1 ...

whisper_print_timings:     load time =  1196.27 ms
whisper_print_timings:      mel time =   712.04 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 133803.97 ms / 5575.17 ms per layer
whisper_print_timings:   decode time = 28689.79 ms / 1195.41 ms per layer
whisper_print_timings:    total time = 164578.70 ms

system_info: n_threads = 4 / 8 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

And my cpuinfo:

➜ cat /proc/cpuinfo | grep avx512
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear ibt flush_l1d arch_capabilities

@ArtyomZemlyak

Yes, it's interesting!
Compiled.
Tried bench (tiny -t 8):

whisper_model_load: loading model from '../models/ggml-model-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  84.99 MB
whisper_model_load: memory size =    11.41 MB 
whisper_model_load: model size  =    73.54 MB

whisper_print_timings:     load time =   118.11 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3359.54 ms / 839.89 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3477.65 ms

system_info: n_threads = 8 | AVX2 = 1 | AVX512 = 1 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

@ArtyomZemlyak

It seems it's much slower for all models.

@ArtyomZemlyak

3 runs of tiny -t 8 on AVX512 and master (AVX2) branches:
[image: benchmark comparison of the 3 runs]

@ArtyomZemlyak

CPU info (all):

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect flush_l1d arch_capabilities

ggerganov (Owner)

@lerela @ArtyomZemlyak
Just pushed another version. I'm doing this blindly, so I'm not sure if it works.

lerela commented Oct 27, 2022

Still not working. I tried with the tiny model: it does terminate (I wasn't patient enough yesterday) and it's faster than before, but still behind master, and there is no output (it just prints the stats but no text).

There is a lot of variance between runs but here are some timings:

avx512:

whisper_print_timings:     load time =   253.09 ms
whisper_print_timings:      mel time =   710.95 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 16384.33 ms / 4096.08 ms per layer
whisper_print_timings:   decode time = 35269.12 ms / 8817.28 ms per layer
whisper_print_timings:    total time = 54139.26 ms

master:

whisper_print_timings:     load time =   240.68 ms
whisper_print_timings:      mel time =  1357.53 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 11729.50 ms / 2932.38 ms per layer
whisper_print_timings:   decode time =  8131.67 ms / 2032.92 ms per layer
whisper_print_timings:    total time = 21719.72 ms

Called with ./main -l fr -pc -m models/ggml-tiny.bin -f testfile.wav.

jaybinks (Contributor) commented Nov 5, 2022

Tested on my Xeon(R) Silver 4210R CPU @ 2.40GHz (VM with 8 cores only).

avx512 branch :

jay.binks@tools2:~/src/whisper.cpp$ ./main -v -l fr -m ../whisper.cpp/models/ggml-small.en.bin -f ../whisper.cpp/samples/jfk.wav 
whisper_model_load: loading model from '../whisper.cpp/models/ggml-small.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1048.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 533.05 MB
whisper_model_load: memory size =    68.48 MB 
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 8 | AVX2 = 1 | AVX512 = 1 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: WARNING: model is not multilingual, ignoring language and translation options
main: processing '../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...



whisper_print_timings:     load time =   762.59 ms
whisper_print_timings:      mel time =   154.38 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 10756.24 ms / 896.35 ms per layer
whisper_print_timings:   decode time = 10072.97 ms / 839.41 ms per layer
whisper_print_timings:    total time = 21918.74 ms

master:

jay.binks@tools2:~/src/whisper.cpp$ ./main -v -l fr -m ../whisper.cpp/models/ggml-small.en.bin -f ../whisper.cpp/samples/jfk.wav 
whisper_model_load: loading model from '../whisper.cpp/models/ggml-small.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1044.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 8 | AVX2 = 1 | AVX512 = 1 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: WARNING: model is not multilingual, ignoring language and translation options
main: processing '../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.000]   And so, my fellow Americans, ask not what your country can do for you.
[00:00:08.000 --> 00:00:11.000]   Ask what you can do for your country.


whisper_print_timings:     load time =   750.77 ms
whisper_print_timings:      mel time =   140.73 ms
whisper_print_timings:   sample time =    23.58 ms
whisper_print_timings:   encode time = 11561.66 ms / 963.47 ms per layer
whisper_print_timings:   decode time =  1224.77 ms / 102.06 ms per layer
whisper_print_timings:    total time = 13703.20 ms

ggml.c Outdated
const __m512 sum23 = _mm512_add_ps(sum2, sum3);
const __m512 sum0123 = _mm512_add_ps(sum01, sum23);

sumf = sum0123[0] + sum0123[1] + sum0123[2] + sum0123[3] + sum0123[4] + sum0123[5] + sum0123[6] + sum0123[7];
ggerganov (Owner)

I think this is the bug - should sum up to sum0123[15]
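
For reference, a minimal sketch of the corrected reduction, assuming AVX-512F is available (the helper name is hypothetical, not the PR's actual fix; _mm512_reduce_add_ps sums all 16 float lanes of a __m512):

#include <immintrin.h>

// Horizontal sum over all 16 float lanes of a 512-bit vector (AVX-512F).
// Hypothetical helper for illustration only.
static inline float hsum_f32x16(__m512 v) {
    return _mm512_reduce_add_ps(v);
}

// e.g. sumf = hsum_f32x16(sum0123);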

ggerganov (Owner)

@jaybinks
Just pushed another fix

jaybinks (Contributor) commented Nov 5, 2022 via email

jaybinks (Contributor) commented Nov 6, 2022

I have re-tested, and it seems to be no better (possibly worse):
./bench -m ./models/ggml-small.en.bin -t 4

before pulling your recent change:

whisper_print_timings:     load time =  1096.42 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 11793.09 ms / 982.76 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 12890.37 ms

after commit c435035:

whisper_print_timings:     load time =   785.93 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 13540.00 ms / 1128.33 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 14325.94 ms

I'm not convinced it's worse; some runs were "total time 12 sec".
It's just not heaps better.

ggerganov (Owner)

I found a machine with an AVX-512 CPU and fixed the code. It now produces correct results, but the performance compared to AVX2 is worse. Either I am not utilising the 512-bit instruction set correctly, or it simply does not provide any benefit for this type of computation. I'll leave this draft PR open in case other people are interested in giving it a try, but for now I am not going to merge it.

jaybinks (Contributor) commented Nov 6, 2022 via email

ggerganov (Owner)

Adding GPU support is not out of the question, but it's low priority atm. Here are some additional thoughts on this: #126

RndyP commented Dec 9, 2022

May I suggest #define-ing the SIMD intrinsics per flavor of SIMD instead of #define-ing big code blocks? It's much easier that way to support different flavors of SIMD. You need to #define the number of floats that fit in the SIMD register, of course, and alter all the code to use SIMD_Register_Float_Size instead of the hardcoded sizes. It turns out not to be much work. The example code below detects AVX and uses 256-bit AVX, else falls back to 128-bit SSE. For Whisper you would probably want 512-bit and 256-bit cases (see the sketch after the example).

#include <immintrin.h> // SSE/AVX intrinsics

#if defined(AVX) // user-defined switch; compilers define __AVX__ when AVX is enabled
#define SIMD_Register_Float_Size 8 // 8 floats fit in 256 bit AVX register
typedef __m256 SIMD_vFloat;
typedef __m256i SIMD_vInt;

// m256i_i32 / m256i_u32 / m256_f32 are MSVC-specific union members for per-lane access
#define SIMD_Int m256i_i32
#define SIMD_UInt m256i_u32
#define SIMD_Float m256_f32
#define SIMD_Add _mm256_add_ps
#define SIMD_Subtract _mm256_sub_ps
#define SIMD_Multiply _mm256_mul_ps
#define SIMD_Divide _mm256_div_ps
#define SIMD_Set _mm256_set1_ps
#define SIMD_Max _mm256_max_ps
#define SIMD_Sqrt _mm256_sqrt_ps
#define SIMD_Zero _mm256_setzero_ps
#define SIMD_FloatToInt _mm256_cvtps_epi32
#define SIMD_CastInt _mm256_castsi256_ps
#define SIMD_AddInt _mm256_add_epi32
#define SIMD_MultiplyInt _mm256_mullo_epi32
#define SIMD_SubtractInt _mm256_sub_epi32
#define SIMD_SetInt _mm256_set1_epi32
#define SIMD_Hypot _mm256_hypot_ps // SVML (Intel compiler/MSVC); not available in plain GCC/Clang
#define SIMD_Exp _mm256_exp_ps // SVML (Intel compiler/MSVC); not available in plain GCC/Clang
#else
#define SIMD_Register_Float_Size 4 // 4 floats fit in 128 bit SSE register
typedef __m128 SIMD_vFloat;
typedef __m128i SIMD_vInt;

#define SIMD_Int m128i_i32
#define SIMD_UInt m128i_u32
#define SIMD_Float m128_f32
#define SIMD_Add _mm_add_ps
#define SIMD_Subtract _mm_sub_ps
#define SIMD_Multiply _mm_mul_ps
#define SIMD_Divide _mm_div_ps
#define SIMD_Set _mm_set1_ps
#define SIMD_Max _mm_max_ps
#define SIMD_Sqrt _mm_sqrt_ps
#define SIMD_Zero _mm_setzero_ps
#define SIMD_FloatToInt _mm_cvtps_epi32
#define SIMD_CastInt _mm_castsi128_ps
#define SIMD_AddInt _mm_add_epi32
#define SIMD_MultiplyInt _mm_mullo_epi32 // requires SSE4.1
#define SIMD_SubtractInt _mm_sub_epi32
#define SIMD_SetInt _mm_set1_epi32
#define SIMD_Hypot _mm_hypot_ps // SVML
#define SIMD_Exp _mm_exp_ps // SVML
#endif
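
For the 512-bit case, a minimal sketch in the same style, assuming AVX-512F is enabled (gated on the compiler-defined __AVX512F__ macro; the SVML Hypot/Exp mappings and per-lane accessors are omitted here since their availability is compiler-dependent):

#if defined(__AVX512F__)
#define SIMD_Register_Float_Size 16 // 16 floats fit in 512 bit AVX-512 register
typedef __m512 SIMD_vFloat;
typedef __m512i SIMD_vInt;

#define SIMD_Add _mm512_add_ps
#define SIMD_Subtract _mm512_sub_ps
#define SIMD_Multiply _mm512_mul_ps
#define SIMD_Divide _mm512_div_ps
#define SIMD_Set _mm512_set1_ps
#define SIMD_Max _mm512_max_ps
#define SIMD_Sqrt _mm512_sqrt_ps
#define SIMD_Zero _mm512_setzero_ps
#define SIMD_FloatToInt _mm512_cvtps_epi32
#define SIMD_CastInt _mm512_castsi512_ps
#define SIMD_AddInt _mm512_add_epi32
#define SIMD_MultiplyInt _mm512_mullo_epi32
#define SIMD_SubtractInt _mm512_sub_epi32
#define SIMD_SetInt _mm512_set1_epi32
#endif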

ggerganov mentioned this pull request Dec 23, 2022
@xvallspl

FWIW, I remember from my own tests several years ago (around the first Coffee Lake processors) that using AVX-512 intrinsics caused frequency throttling that ended up in performance losses. This was confirmed by Intel engineers.

I don't know if that's still the case (and I can't test it), or what's happening on ARM platforms, though.

ggerganov closed this Apr 15, 2023