
Try to add AVX 512-bit support #95

Closed
wants to merge 7 commits

Conversation

ggerganov (Owner) commented Oct 26, 2022

Update:

This PR adds AVX 512-bit support. The performance compared to AVX2 is worse. Either I am not utilising the 512-bit instruction set correctly, or it simply does not provide any benefit for this type of computation. I'll leave this draft PR open in case other people are interested in giving it a try, but for now I am not going to merge it.


OUTDATED BELOW

Work in progress

This is not tested because I don't have an AVX 512-bit CPU, so it is very likely that the code will fail.
Still, I would appreciate it if someone gives it a try and reports any issues.

@ArtyomZemlyak
Are you interested in giving it a try? I noticed you have a CPU with AVX 512-bit support.

ggerganov mentioned this pull request Oct 26, 2022
lerela commented Oct 26, 2022

Hi @ggerganov, I gave it a try out of curiosity on my i7-1165G7 on Ubuntu 22.04. It does not work, but unfortunately there isn't much to report: the script just runs forever. Let me know if there is a way to provide verbose logs.

avx512 branch:

➜ make
cc  -I.              -O3 -std=c11   -pthread -mavx512f -mavx512dq -mfma -mf16c   -c ggml.c
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main 

❯ ./main -v -l fr -m ../whisper.cpp/models/ggml-medium.bin -f ../whisper.cpp/testfile-16b.wav
whisper_model_load: loading model from '../whisper.cpp/models/ggml-medium.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 4
whisper_model_load: mem_required  = 2610.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 1644.98 MB
whisper_model_load: memory size =   182.62 MB 
whisper_model_load: model size  =  1462.12 MB

main: processing '../whisper.cpp/testfile-16b.wav' (1243847 samples, 77.7 sec), 4 threads, lang = fr, task = transcribe, timestamps = 1 ...

master branch for reference:

➜ make
cc  -I.              -O3 -std=c11   -pthread -mavx -mavx2 -mfma -mf16c   -c ggml.c
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main 

➜ ./main -l fr -m models/ggml-medium.bin -f testfile-16b.wav 
whisper_model_load: loading model from 'models/ggml-medium.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 4
whisper_model_load: mem_required  = 2610.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 1644.98 MB
whisper_model_load: memory size =   182.62 MB 
whisper_model_load: model size  =  1462.12 MB

main: processing 'testfile-16b.wav' (1243847 samples, 77.7 sec), 4 threads, lang = fr, task = transcribe, timestamps = 1 ...

whisper_print_timings:     load time =  1196.27 ms
whisper_print_timings:      mel time =   712.04 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 133803.97 ms / 5575.17 ms per layer
whisper_print_timings:   decode time = 28689.79 ms / 1195.41 ms per layer
whisper_print_timings:    total time = 164578.70 ms

system_info: n_threads = 4 / 8 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

And my cpuinfo:

➜ cat /proc/cpuinfo | grep avx512
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear ibt flush_l1d arch_capabilities

@ArtyomZemlyak

Yes, it's interesting!
Compiled.
Tried bench (tiny -t 8):

whisper_model_load: loading model from '../models/ggml-model-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  84.99 MB
whisper_model_load: memory size =    11.41 MB 
whisper_model_load: model size  =    73.54 MB

whisper_print_timings:     load time =   118.11 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3359.54 ms / 839.89 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3477.65 ms

system_info: n_threads = 8 | AVX2 = 1 | AVX512 = 1 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

@ArtyomZemlyak

It seems it's much slower for all models.

@ArtyomZemlyak

3 runs of tiny -t 8 on AVX512 and master (AVX2) branches:
[image: benchmark comparison of the 3 runs]

@ArtyomZemlyak

CPU info (all):

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect flush_l1d arch_capabilities

ggerganov (Owner)

@lerela @ArtyomZemlyak
Just pushed another version. I'm doing this blindly, so I'm not sure if it works.

lerela commented Oct 27, 2022

Still not working. I tried with the tiny model: it does terminate (I wasn't patient enough yesterday) and it's faster than before, but still behind master, and there is no output (it just prints the stats but no text).

There is a lot of variance between runs but here are some timings:

avx512:

whisper_print_timings:     load time =   253.09 ms
whisper_print_timings:      mel time =   710.95 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 16384.33 ms / 4096.08 ms per layer
whisper_print_timings:   decode time = 35269.12 ms / 8817.28 ms per layer
whisper_print_timings:    total time = 54139.26 ms

master:

whisper_print_timings:     load time =   240.68 ms
whisper_print_timings:      mel time =  1357.53 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 11729.50 ms / 2932.38 ms per layer
whisper_print_timings:   decode time =  8131.67 ms / 2032.92 ms per layer
whisper_print_timings:    total time = 21719.72 ms

Called with ./main -l fr -pc -m models/ggml-tiny.bin -f testfile.wav.

jaybinks (Contributor) commented Nov 5, 2022

Tested on my Xeon(R) Silver 4210R CPU @ 2.40GHz (VM with 8 cores only).

avx512 branch :

jay.binks@tools2:~/src/whisper.cpp$ ./main -v -l fr -m ../whisper.cpp/models/ggml-small.en.bin -f ../whisper.cpp/samples/jfk.wav 
whisper_model_load: loading model from '../whisper.cpp/models/ggml-small.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1048.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 533.05 MB
whisper_model_load: memory size =    68.48 MB 
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 8 | AVX2 = 1 | AVX512 = 1 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: WARNING: model is not multilingual, ignoring language and translation options
main: processing '../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...



whisper_print_timings:     load time =   762.59 ms
whisper_print_timings:      mel time =   154.38 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 10756.24 ms / 896.35 ms per layer
whisper_print_timings:   decode time = 10072.97 ms / 839.41 ms per layer
whisper_print_timings:    total time = 21918.74 ms

master:

jay.binks@tools2:~/src/whisper.cpp$ ./main -v -l fr -m ../whisper.cpp/models/ggml-small.en.bin -f ../whisper.cpp/samples/jfk.wav 
whisper_model_load: loading model from '../whisper.cpp/models/ggml-small.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1044.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 8 | AVX2 = 1 | AVX512 = 1 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: WARNING: model is not multilingual, ignoring language and translation options
main: processing '../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.000]   And so, my fellow Americans, ask not what your country can do for you.
[00:00:08.000 --> 00:00:11.000]   Ask what you can do for your country.


whisper_print_timings:     load time =   750.77 ms
whisper_print_timings:      mel time =   140.73 ms
whisper_print_timings:   sample time =    23.58 ms
whisper_print_timings:   encode time = 11561.66 ms / 963.47 ms per layer
whisper_print_timings:   decode time =  1224.77 ms / 102.06 ms per layer
whisper_print_timings:    total time = 13703.20 ms

ggml.c Outdated
const __m512 sum23 = _mm512_add_ps(sum2, sum3);
const __m512 sum0123 = _mm512_add_ps(sum01, sum23);

sumf = sum0123[0] + sum0123[1] + sum0123[2] + sum0123[3] + sum0123[4] + sum0123[5] + sum0123[6] + sum0123[7];
ggerganov (Owner)

I think this is the bug - should sum up to sum0123[15]
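
For reference, a minimal sketch of the corrected reduction, assuming AVX-512F is available (the helper name is hypothetical, not the PR's actual fix; _mm512_reduce_add_ps sums all 16 float lanes of a __m512):

#include <immintrin.h>

// Horizontal sum over all 16 float lanes of a 512-bit vector (AVX-512F).
// Hypothetical helper for illustration only.
static inline float hsum_f32x16(__m512 v) {
    return _mm512_reduce_add_ps(v);
}

// e.g. sumf = hsum_f32x16(sum0123);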

ggerganov (Owner)

@jaybinks
Just pushed another fix

jaybinks (Contributor) commented Nov 5, 2022 via email

jaybinks (Contributor) commented Nov 6, 2022

I have re-tested, and it seems to be no better (possibly worse):
./bench -m ./models/ggml-small.en.bin -t 4

before pulling your recent change:

whisper_print_timings:     load time =  1096.42 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 11793.09 ms / 982.76 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 12890.37 ms

after commit c435035:

whisper_print_timings:     load time =   785.93 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 13540.00 ms / 1128.33 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 14325.94 ms

I'm not convinced it's worse; some runs were "total time 12 sec".
It's just not heaps better.

ggerganov (Owner)

I found a machine with an AVX-512 CPU and fixed the code. It now produces correct results, but the performance compared to AVX2 is worse. Either I am not utilising the 512-bit instruction set correctly, or it simply does not provide any benefit for this type of computation. I'll leave this draft PR open in case other people are interested in giving it a try, but for now I am not going to merge it.

jaybinks (Contributor) commented Nov 6, 2022 via email

ggerganov (Owner)

Adding GPU support is not out of the question, but it's low priority atm. Here are some additional thoughts on this: #126

RndyP commented Dec 9, 2022

May I suggest #define-ing the SIMD intrinsics per flavor of SIMD instead of #define-ing big code blocks? It's much easier that way to support different flavors of SIMD. You need to #define the number of floats that fit in the SIMD register, of course, and alter all the code to use SIMD_Register_Float_Size instead of the hardcoded sizes. It turns out not to be much work. The example code below detects AVX and uses 256-bit AVX, else falls back to 128-bit SSE. For Whisper you would probably want 512-bit and 256-bit cases (see the sketch after the example).

#include <immintrin.h> // SSE/AVX intrinsics

#if defined(AVX) // user-defined switch; compilers define __AVX__ when AVX is enabled
#define SIMD_Register_Float_Size 8 // 8 floats fit in 256 bit AVX register
typedef __m256 SIMD_vFloat;
typedef __m256i SIMD_vInt;

// m256i_i32 / m256i_u32 / m256_f32 are MSVC-specific union members for per-lane access
#define SIMD_Int m256i_i32
#define SIMD_UInt m256i_u32
#define SIMD_Float m256_f32
#define SIMD_Add _mm256_add_ps
#define SIMD_Subtract _mm256_sub_ps
#define SIMD_Multiply _mm256_mul_ps
#define SIMD_Divide _mm256_div_ps
#define SIMD_Set _mm256_set1_ps
#define SIMD_Max _mm256_max_ps
#define SIMD_Sqrt _mm256_sqrt_ps
#define SIMD_Zero _mm256_setzero_ps
#define SIMD_FloatToInt _mm256_cvtps_epi32
#define SIMD_CastInt _mm256_castsi256_ps
#define SIMD_AddInt _mm256_add_epi32
#define SIMD_MultiplyInt _mm256_mullo_epi32
#define SIMD_SubtractInt _mm256_sub_epi32
#define SIMD_SetInt _mm256_set1_epi32
#define SIMD_Hypot _mm256_hypot_ps // SVML (Intel compiler/MSVC); not available in plain GCC/Clang
#define SIMD_Exp _mm256_exp_ps // SVML (Intel compiler/MSVC); not available in plain GCC/Clang
#else
#define SIMD_Register_Float_Size 4 // 4 floats fit in 128 bit SSE register
typedef __m128 SIMD_vFloat;
typedef __m128i SIMD_vInt;

#define SIMD_Int m128i_i32
#define SIMD_UInt m128i_u32
#define SIMD_Float m128_f32
#define SIMD_Add _mm_add_ps
#define SIMD_Subtract _mm_sub_ps
#define SIMD_Multiply _mm_mul_ps
#define SIMD_Divide _mm_div_ps
#define SIMD_Set _mm_set1_ps
#define SIMD_Max _mm_max_ps
#define SIMD_Sqrt _mm_sqrt_ps
#define SIMD_Zero _mm_setzero_ps
#define SIMD_FloatToInt _mm_cvtps_epi32
#define SIMD_CastInt _mm_castsi128_ps
#define SIMD_AddInt _mm_add_epi32
#define SIMD_MultiplyInt _mm_mullo_epi32 // requires SSE4.1
#define SIMD_SubtractInt _mm_sub_epi32
#define SIMD_SetInt _mm_set1_epi32
#define SIMD_Hypot _mm_hypot_ps // SVML
#define SIMD_Exp _mm_exp_ps // SVML
#endif
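
For the 512-bit case, a minimal sketch in the same style, assuming AVX-512F is enabled (gated on the compiler-defined __AVX512F__ macro; the SVML Hypot/Exp mappings and per-lane accessors are omitted here since their availability is compiler-dependent):

#if defined(__AVX512F__)
#define SIMD_Register_Float_Size 16 // 16 floats fit in 512 bit AVX-512 register
typedef __m512 SIMD_vFloat;
typedef __m512i SIMD_vInt;

#define SIMD_Add _mm512_add_ps
#define SIMD_Subtract _mm512_sub_ps
#define SIMD_Multiply _mm512_mul_ps
#define SIMD_Divide _mm512_div_ps
#define SIMD_Set _mm512_set1_ps
#define SIMD_Max _mm512_max_ps
#define SIMD_Sqrt _mm512_sqrt_ps
#define SIMD_Zero _mm512_setzero_ps
#define SIMD_FloatToInt _mm512_cvtps_epi32
#define SIMD_CastInt _mm512_castsi512_ps
#define SIMD_AddInt _mm512_add_epi32
#define SIMD_MultiplyInt _mm512_mullo_epi32
#define SIMD_SubtractInt _mm512_sub_epi32
#define SIMD_SetInt _mm512_set1_epi32
#endif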

ggerganov mentioned this pull request Dec 23, 2022
@xvallspl

FWIW, I remember from my own tests several years ago (around the first Coffee Lake processors) that using AVX-512 intrinsics caused frequency throttling that ended up in performance losses. This was confirmed by Intel engineers.

I don't know if that's still the case (and I can't test it), or what's happening on ARM platforms, though.

ggerganov closed this Apr 15, 2023