-
Notifications
You must be signed in to change notification settings - Fork 9.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable Fused-Multiply-Add (FMA) and F16C/CVT16 vector extensions on MSVC #375
Conversation
It seems I'm too tired to find the button for converting to a draft, but anyway. |
I haven't checked out the compiled output at the disassembly level yet, so especially in the case of F16C there is the consideration as to which extent the compiler had already optimized the generic ggml_compute_fp16_to_fp32 and fp32_to_fp16 to use the cvt/f16c instructions. The answer to that question also answers the question whether this change can bring possibly a significant performance increase or do pretty much nothing at all. |
Avx2/avx512 also implies all the simd instructions being enabled like sse3 |
Yeah, but the e: i guess its a good addition anyway if possibly used in the future. won't hurt. |
does that macro even exist? https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-160 |
It doesn't, that is the entire point. |
Do you observe improved performance with this change? |
This got FMA enabled while building from VS, windows, on i7 8th gen. However, time per token seems to be the same (under 1% diff) |
I'll have to take a in-depth look later analysing the binary code and timing the performance, until then no idea. In the case of FMA the difference between That didn't really answer your question. 😄 Thanks @lofcz for providing some initial testing. If anyone else wants to chip in with their results including the processor and model parameters of the test, that'd be greatly appreciated. I'd expect this to increase performance in range of +0% to +X%, but especially important would be to make sure that this will not decrease performance in any case. |
These are my runtimes on my Ryzen 4500U (Zen 2) Without FMA is built upon e4412b4 while with FMA just adds on the commits in this pull request Without FMA runs faster? Values are from tokens / ms here: EDIT: |
I've tried to do the same as @nicknitewolf I have an Intel Xeon W-2295 So I guess on my system there is little to no influence on the performance for the eval time, however the sample time seems to be a bit better. However, the sample time has little effect on the total time. The original systems information and loading timings for the 7B and 65B are: |
Huge thanks @nicknitewolf and @KASR providing some statistics. 👍 🥳 I've concluded that unfortunately as my CPU is dog and only has 4 threads total, I can't provide useful statistics myself since even It seems that while @KASR's results are inconclusive being inside margin of error, @nicknitewolf did produce a significant 6.24% increase in performance. @KASR could you run the tests with |
__FMA__ macro does not exist in MSVC
__F16C__ macro does not exist in MSVC, but is implied with AVX2/AVX512
even though it's not currently used for anything when AVX is defined
d1e4a18
to
b8a80f9
Compare
Rebased the branch to master for easier testing. |
Ah I should have done that, setting a specified thread count, maybe thats why my results vary so much |
Even based on your tests alone I would think that merging this is a good idea. All the other platforms use the extensions already and MSVC is also supposed to use them as they are implied with /arch:AVX2, the only problem here is really the #define not being set so MSVC takes a slower codepath than all the other compilers. The reason for variance is probably just Windows, unfortunately, and there is not much to be done about it except running more tests to decrease their significance. I'm actually still maining Windows 7 for exactly this reason, since a fresh Win7 installation with the crap cut down runs under 100 threads total when idle. ~5 Watt and ~0,5% CPU usage. Windows 10 on the other hand has 800+ at any given time and can go up into the thousands when some updates or whatever Store/Xbox/Cortana nonsense is going in the background. Open your task manager and see ur threadcount when you are supposedly doing nothing and see all the bloat eating up your cpu cycles. It's really hard to properly perftest anything in the modern windowses and most of the crap is baked in so heavily into the system that its nigh-impossible to remove it all without crippling the OS. That being said, I do not recommend maining Win7 anymore since hardware and software support is on it's very last legs. Unfortunate, since the OS itself is pretty much perfect and 100% stable, haven't had a system crash or even a malfunction in years. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick test in godbolt shows that MSVC compiles GGML_COMPUTE_FP16_TO_FP32 and GGML_COMPUTE_FP32_TO_FP16 to vcvtph2ps and vcvtps2ph, same as GCC. I see no reason to not merge this.
@slaren And looking at the current alternative paints a pretty clear picture 😄 After PR GGML_COMPUTE_FP16_TO_FP32 PROC ; COMDAT
movzx eax, cx
vmovd xmm0, eax
vcvtph2ps xmm0, xmm0
ret 0
GGML_COMPUTE_FP16_TO_FP32 ENDP
GGML_COMPUTE_FP32_TO_FP16 PROC ; COMDAT
vmovaps xmm1, xmm0
vxorps xmm0, xmm0, xmm0
vmovss3 xmm1, xmm0, xmm1
vcvtps2ph xmm2, xmm1, 0
vpextrw eax, xmm2, 0
ret 0
GGML_COMPUTE_FP32_TO_FP16 ENDP Before PR GGML_COMPUTE_FP16_TO_FP32 PROC ; COMDAT
movzx eax, cx
shl eax, 16
mov edx, eax
and edx, -2147483648 ; 80000000H
lea ecx, DWORD PTR [rax+rax]
mov eax, ecx
shr eax, 4
add eax, 1879048192 ; 70000000H
mov DWORD PTR fp32$3[rsp], eax
mov eax, ecx
shr eax, 17
or eax, 1056964608 ; 3f000000H
mov DWORD PTR fp32$2[rsp], eax
cmp ecx, 134217728 ; 08000000H
jae SHORT $LN5@GGML_COMPU
vmovss xmm0, DWORD PTR fp32$2[rsp]
vsubss xmm1, xmm0, DWORD PTR __real@3f000000
vmovss DWORD PTR tv87[rsp], xmm1
mov eax, DWORD PTR tv87[rsp]
or eax, edx
mov DWORD PTR fp32$1[rsp], eax
vmovss xmm0, DWORD PTR fp32$1[rsp]
ret 0
$LN5@GGML_COMPU:
vmovss xmm0, DWORD PTR fp32$3[rsp]
vmulss xmm1, xmm0, DWORD PTR __real@07800000
vmovss DWORD PTR tv87[rsp], xmm1
mov eax, DWORD PTR tv87[rsp]
or eax, edx
mov DWORD PTR fp32$1[rsp], eax
vmovss xmm0, DWORD PTR fp32$1[rsp]
ret 0
GGML_COMPUTE_FP16_TO_FP32 ENDP
GGML_COMPUTE_FP32_TO_FP16 PROC ; COMDAT
$LN17:
sub rsp, 56 ; 00000038H
vmovaps XMMWORD PTR [rsp+32], xmm6
vmovaps xmm6, xmm0
vcvtss2sd xmm0, xmm6, xmm0
vmovq rcx, xmm0
call fabsf
vmovd r8d, xmm6
vmovaps xmm6, XMMWORD PTR [rsp+32]
mov ecx, 1895825408 ; 71000000H
vxorps xmm1, xmm1, xmm1
vcvtsi2ss xmm1, xmm1, eax
vmulss xmm2, xmm1, DWORD PTR __real@77800000
vmulss xmm3, xmm2, DWORD PTR __real@08800000
lea edx, DWORD PTR [r8+r8]
and r8d, -2147483648 ; 80000000H
mov eax, edx
and eax, -16777216 ; ff000000H
cmp eax, ecx
cmovb eax, ecx
shr eax, 1
add eax, 125829120 ; 07800000H
mov DWORD PTR fp32$1[rsp], eax
vaddss xmm1, xmm3, DWORD PTR fp32$1[rsp]
vmovd ecx, xmm1
mov eax, ecx
and ecx, 4095 ; 00000fffH
shr eax, 13
and eax, 31744 ; 00007c00H
add eax, ecx
mov ecx, 32256 ; 00007e00H
cmp edx, -16777216 ; ff000000H
cmova ax, cx
shr r8d, 16
or ax, r8w
add rsp, 56 ; 00000038H
ret 0
GGML_COMPUTE_FP32_TO_FP16 ENDP |
Hold merging this until #546 is merged. |
Yes sure, I've updated to the newest commit (at the time of writing) and enabled AVX512. I only performed with the 7B model, let me know if it's interesting to also see the results for the 65B. I've used the command: Original settings: After adjustments: I've also added the results using t 20, ( i list the xx ms/token as value ): so using t4 --> 6.55% speedup (which is very close to the value @nicknitewolf had) |
Much appreciated ! 👍 Thanks to everyone taking part to this, with special thanks to @nicknitewolf and @KASR to take the time to do benchmarks. While a 5% speed increase might not be very noticeable on its' own, all the performance increases add up and are important as parts of the big picture. After such well made research we can be definitely confident that this PR is a go. I'll merge this right after the CI is fixed #546 |
__FMA__
and__F16C__
are defined in GCC and Clang__FMA__
and__F16C__
are not defined in MSVC, however they are implied with AVX2/AVX512https://learn.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-160
Thus, enable FMA and F16C in MSVC if either AVX2/AVX512 is enabled