
Enable Fused-Multiply-Add (FMA) and F16C/CVT16 vector extensions on MSVC #375

Merged: 4 commits merged into master from llama-patch-enable-fma-msvc on Mar 28, 2023

Conversation

@anzz1 (Contributor) commented Mar 22, 2023

__FMA__ and __F16C__ are defined by GCC and Clang.

__FMA__ and __F16C__ are not defined by MSVC; however, they are implied by /arch:AVX2 and /arch:AVX512:
https://learn.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-160

Thus, enable FMA and F16C in MSVC if either AVX2 or AVX512 is enabled.
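A minimal sketch of the kind of preprocessor shim this describes (the exact form and placement in ggml.c may differ from what the PR actually adds):

```c
/* Sketch only: MSVC does not define __FMA__/__F16C__, but /arch:AVX2 and
   /arch:AVX512 imply both, so define them ourselves. MSVC does define
   __AVX2__ and __AVX512F__ when those /arch flags are used. */
#if defined(_MSC_VER) && (defined(__AVX2__) || defined(__AVX512F__))
#ifndef __FMA__
#define __FMA__ 1
#endif
#ifndef __F16C__
#define __F16C__ 1
#endif
/* __SSE3__ (discussed later in this thread) could be added the same way. */
#ifndef __SSE3__
#define __SSE3__ 1
#endif
#endif
```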

@anzz1 changed the title from "Enable Fused-Multiply-Add (FMA) instructions on MSVC" to "Enable Fused-Multiply-Add (FMA) and F16C/CVT16 vector extensions on MSVC" on Mar 22, 2023
@anzz1 (Contributor, Author) commented Mar 22, 2023

It seems I'm too tired to find the button for converting this to a draft, but anyway:
The _cvtss_sh and _cvtsh_ss intrinsics are still missing and not implemented yet, so don't merge yet.

@anzz1 (Contributor, Author) commented Mar 22, 2023

I haven't examined the compiled output at the disassembly level yet, so, especially in the case of F16C, there is the question of to what extent the compiler had already optimized the generic ggml_compute_fp16_to_fp32 and fp32_to_fp16 routines to use the CVT/F16C instructions. The answer to that question also determines whether this change can bring a significant performance increase or does pretty much nothing at all.
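For context, the F16C fast path being discussed would look roughly like this; a sketch built on the scalar F16C intrinsics named in this thread, not necessarily the exact macros in ggml.c:

```c
#include <immintrin.h>

#if defined(__F16C__)
/* Hardware fp16 <-> fp32 conversion in a single instruction each;
   without __F16C__ defined, the generic bit-twiddling fallback is used. */
#define GGML_COMPUTE_FP16_TO_FP32(x) _cvtsh_ss(x)
#define GGML_COMPUTE_FP32_TO_FP16(x) _cvtss_sh(x, 0)
#endif
```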

@anzz1 added the labels "bug (Something isn't working)" and "performance (Speed related topics)" on Mar 22, 2023
@niclimcy commented:

AVX2/AVX512 also implies all the SIMD instructions being enabled, like SSE3.

@anzz1 (Contributor, Author) commented Mar 22, 2023

> AVX2/AVX512 also implies all the SIMD instructions being enabled, like SSE3.

Yeah, but __SSE3__ isn't currently used, as __AVX__ takes precedence over it (#elif defined(__SSE3__) comes after #elif defined(__AVX__)), so I didn't add it.

Edit: I guess it's a good addition anyway if it's possibly used in the future. Won't hurt.

@niclimcy commented:

does that macro even exist? https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-160

@anzz1 (Contributor, Author) commented Mar 22, 2023

> does that macro even exist? https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-160

It doesn't, that is the entire point.

@ggerganov (Owner) commented:

Do you observe improved performance with this change?

@lofcz commented Mar 22, 2023

This got FMA enabled when building from VS on Windows, on an 8th-gen i7. However, time per token seems to be the same (under 1% difference).

@anzz1 (Contributor, Author) commented Mar 23, 2023

> Do you observe improved performance with this change?

I'll have to take an in-depth look later, analysing the binary code and timing the performance; until then, no idea. In the case of FMA, the difference between _mm256_fmadd_ps(b, c, a) and _mm256_add_ps(_mm256_mul_ps(b, c), a) is probably marginal. The impact from the F16C intrinsics could be greater. Obviously the results can also vary between different processor lines, but generally I'd expect functions baked into the processor to do exactly that computation to perform better than general computation. However, we've also seen before that this isn't always the case: for AVX-512, at least a few years ago in the earliest Intel/AMD consumer SKUs to include it, the implementation was less than stellar and in many cases using it made performance worse, though AFAIK that is still more an exception than the norm. That case was also an example of how hard these things can be to measure: IIRC, at least on the Intel models, the problem wasn't that the AVX-512 calculations themselves weren't faster, but that using AVX-512 slowed down other calculations, which made the total impact negative.
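To spell out the two forms being compared (illustrative helper names; this is not code from the PR):

```c
#include <immintrin.h>

/* b*c + a as a single fused instruction: one rounding step. */
static inline __m256 madd_fused(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(b, c, a);
}

/* b*c + a as separate multiply and add: two instructions, two rounding steps. */
static inline __m256 madd_split(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(b, c), a);
}
```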

That didn't really answer your question. 😄

Thanks @lofcz for providing some initial testing. If anyone else wants to chip in with their results, including the processor and model parameters of the test, that'd be greatly appreciated.

I'd expect this to increase performance in the range of +0% to +X%, but it is especially important to make sure that it does not decrease performance in any case.

@niclimcy commented Mar 23, 2023

These are my runtimes on my Ryzen 4500U (Zen 2).

The "without FMA" build is based on e4412b4, while the "with FMA" build just adds the commits in this pull request.

[screenshot: runtime comparison]

Without FMA runs faster?

Values here are tokens / ms:
[screenshot: tokens / ms comparison]

EDIT:
Maybe I'm wrong and that value is the runtime? In that case the FMA patch improves performance by 6.24%?

@KASR (Contributor) commented Mar 23, 2023

I've tried to do the same as @nicknitewolf.

I have an Intel Xeon W-2295.

So I guess on my system there is little to no influence on the performance for the eval time; the sample time, however, seems to be a bit better. But the sample time has little effect on the total time.

[screenshot: timing summary]

The original system information and loading timings for the 7B and 65B:
[screenshot]
The 7B timings:
[screenshot]
The 65B timings:
[screenshot]

After the modifications:
[screenshot]
The 7B timings:
[screenshot]
The 65B timings:
[screenshot]

@anzz1 (Contributor, Author) commented Mar 26, 2023

Huge thanks @nicknitewolf and @KASR for providing some statistics. 👍 🥳

I've concluded that, unfortunately, as my CPU is a dog with only 4 threads total, I can't provide useful statistics myself: even -t 4 would leave no free threads for the OS itself, so external factors have too much impact to produce reliable results.

It seems that while @KASR's results are inconclusive, being inside the margin of error, @nicknitewolf did measure a significant 6.24% increase in performance.

@KASR, could you run the tests with -t 4 and see if there is a difference then? Your beast of a processor running 18 threads might produce a different result than something less powerful. You could also simulate a lesser processor by locking the thread affinity to two cores, so the threads stay locked to the same cores and the run can't exploit the advantages of a high core count.
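For the affinity-locking idea, something like this should work on Windows (the 0x3 mask pins the process to the first two logical cores; the binary name and model path are illustrative):

```
start /affinity 0x3 /wait main.exe -m models\7B\ggml-model-q4_0.bin -n 128 -t 2
```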

Commits:
__FMA__ macro does not exist in MSVC
__F16C__ macro does not exist in MSVC, but is implied with AVX2/AVX512
even though it's not currently used for anything when AVX is defined

@anzz1 force-pushed the llama-patch-enable-fma-msvc branch from d1e4a18 to b8a80f9 on March 26, 2023
@anzz1 (Contributor, Author) commented Mar 26, 2023

Rebased the branch onto master for easier testing.

@niclimcy commented:

> Huge thanks @nicknitewolf and @KASR for providing some statistics. 👍 🥳
>
> I've concluded that, unfortunately, as my CPU is a dog with only 4 threads total, I can't provide useful statistics myself: even -t 4 would leave no free threads for the OS itself, so external factors have too much impact to produce reliable results.
>
> It seems that while @KASR's results are inconclusive, being inside the margin of error, @nicknitewolf did measure a significant 6.24% increase in performance.
>
> @KASR, could you run the tests with -t 4 and see if there is a difference then? Your beast of a processor running 18 threads might produce a different result than something less powerful. You could also simulate a lesser processor by locking the thread affinity to two cores, so the threads stay locked to the same cores and the run can't exploit the advantages of a high core count.

Ah, I should have done that, setting a specific thread count; maybe that's why my results vary so much.

@anzz1 (Contributor, Author) commented Mar 27, 2023

Even based on your tests alone, I would think that merging this is a good idea. All the other platforms use the extensions already, and MSVC is also supposed to use them, as they are implied by /arch:AVX2; the only problem here is really the #define not being set, so MSVC takes a slower codepath than all the other compilers.

The reason for the variance is probably just Windows, unfortunately, and there is not much to be done about it except running more tests to decrease its significance. I'm actually still maining Windows 7 for exactly this reason: a fresh Win7 installation with the crap cut down runs under 100 threads total when idle, at ~5 W and ~0.5% CPU usage. Windows 10, on the other hand, has 800+ threads at any given time and can go up into the thousands when updates or some Store/Xbox/Cortana nonsense is running in the background. Open your task manager and check your thread count when you are supposedly doing nothing, and see all the bloat eating up your CPU cycles. It's really hard to properly perf-test anything on modern Windows, and most of the crap is baked so heavily into the system that it's nigh-impossible to remove it all without crippling the OS.

That being said, I do not recommend maining Win7 anymore, since hardware and software support is on its very last legs. Unfortunate, since the OS itself is pretty much perfect and 100% stable; I haven't had a system crash or even a malfunction in years.

@slaren (Collaborator) left a comment:

A quick test in Godbolt shows that MSVC compiles GGML_COMPUTE_FP16_TO_FP32 and GGML_COMPUTE_FP32_TO_FP16 to vcvtph2ps and vcvtps2ph, the same as GCC. I see no reason not to merge this.

@anzz1 (Contributor, Author) commented Mar 27, 2023

@slaren And looking at the current alternative paints a pretty clear picture 😄
As seen in the "Before PR" part, VEX-encoded instructions like vcvtsi2ss are still used, since AVX2 implies their use; it's just that the most optimized version wasn't used because of the missing flag.

After PR

GGML_COMPUTE_FP16_TO_FP32 PROC                      ; COMDAT
        movzx   eax, cx
        vmovd   xmm0, eax
        vcvtph2ps xmm0, xmm0
        ret     0
GGML_COMPUTE_FP16_TO_FP32 ENDP

GGML_COMPUTE_FP32_TO_FP16 PROC                      ; COMDAT
        vmovaps xmm1, xmm0
        vxorps  xmm0, xmm0, xmm0
        vmovss  xmm1, xmm0, xmm1
        vcvtps2ph xmm2, xmm1, 0
        vpextrw eax, xmm2, 0
        ret     0
GGML_COMPUTE_FP32_TO_FP16 ENDP

Before PR

GGML_COMPUTE_FP16_TO_FP32 PROC                      ; COMDAT
        movzx   eax, cx
        shl     eax, 16
        mov     edx, eax
        and     edx, -2147483648              ; 80000000H
        lea     ecx, DWORD PTR [rax+rax]
        mov     eax, ecx
        shr     eax, 4
        add     eax, 1879048192                     ; 70000000H
        mov     DWORD PTR fp32$3[rsp], eax
        mov     eax, ecx
        shr     eax, 17
        or      eax, 1056964608                   ; 3f000000H
        mov     DWORD PTR fp32$2[rsp], eax
        cmp     ecx, 134217728                            ; 08000000H
        jae     SHORT $LN5@GGML_COMPU
        vmovss  xmm0, DWORD PTR fp32$2[rsp]
        vsubss  xmm1, xmm0, DWORD PTR __real@3f000000
        vmovss  DWORD PTR tv87[rsp], xmm1
        mov     eax, DWORD PTR tv87[rsp]
        or      eax, edx
        mov     DWORD PTR fp32$1[rsp], eax
        vmovss  xmm0, DWORD PTR fp32$1[rsp]
        ret     0
$LN5@GGML_COMPU:
        vmovss  xmm0, DWORD PTR fp32$3[rsp]
        vmulss  xmm1, xmm0, DWORD PTR __real@07800000
        vmovss  DWORD PTR tv87[rsp], xmm1
        mov     eax, DWORD PTR tv87[rsp]
        or      eax, edx
        mov     DWORD PTR fp32$1[rsp], eax
        vmovss  xmm0, DWORD PTR fp32$1[rsp]
        ret     0
GGML_COMPUTE_FP16_TO_FP32 ENDP

GGML_COMPUTE_FP32_TO_FP16 PROC                      ; COMDAT
$LN17:
        sub     rsp, 56                             ; 00000038H
        vmovaps XMMWORD PTR [rsp+32], xmm6
        vmovaps xmm6, xmm0
        vcvtss2sd xmm0, xmm6, xmm0
        vmovq   rcx, xmm0
        call    fabsf
        vmovd   r8d, xmm6
        vmovaps xmm6, XMMWORD PTR [rsp+32]
        mov     ecx, 1895825408                     ; 71000000H
        vxorps  xmm1, xmm1, xmm1
        vcvtsi2ss xmm1, xmm1, eax
        vmulss  xmm2, xmm1, DWORD PTR __real@77800000
        vmulss  xmm3, xmm2, DWORD PTR __real@08800000
        lea     edx, DWORD PTR [r8+r8]
        and     r8d, -2147483648              ; 80000000H
        mov     eax, edx
        and     eax, -16777216                            ; ff000000H
        cmp     eax, ecx
        cmovb   eax, ecx
        shr     eax, 1
        add     eax, 125829120                            ; 07800000H
        mov     DWORD PTR fp32$1[rsp], eax
        vaddss  xmm1, xmm3, DWORD PTR fp32$1[rsp]
        vmovd   ecx, xmm1
        mov     eax, ecx
        and     ecx, 4095               ; 00000fffH
        shr     eax, 13
        and     eax, 31744                                ; 00007c00H
        add     eax, ecx
        mov     ecx, 32256                                ; 00007e00H
        cmp     edx, -16777216                            ; ff000000H
        cmova   ax, cx
        shr     r8d, 16
        or      ax, r8w
        add     rsp, 56                             ; 00000038H
        ret     0
GGML_COMPUTE_FP32_TO_FP16 ENDP

@anzz1 (Contributor, Author) commented Mar 27, 2023

Hold merging this until #546 is merged.

@KASR (Contributor) commented Mar 28, 2023

> Huge thanks @nicknitewolf and @KASR for providing some statistics. 👍 🥳
>
> I've concluded that, unfortunately, as my CPU is a dog with only 4 threads total, I can't provide useful statistics myself: even -t 4 would leave no free threads for the OS itself, so external factors have too much impact to produce reliable results.
>
> It seems that while @KASR's results are inconclusive, being inside the margin of error, @nicknitewolf did measure a significant 6.24% increase in performance.
>
> @KASR, could you run the tests with -t 4 and see if there is a difference then? Your beast of a processor running 18 threads might produce a different result than something less powerful. You could also simulate a lesser processor by locking the thread affinity to two cores, so the threads stay locked to the same cores and the run can't exploit the advantages of a high core count.

Yes, sure. I've updated to the newest commit (at the time of writing) and enabled AVX512. I only ran the 7B model; let me know if it would be interesting to also see the results for the 65B.

I've used the command: ./main -m ./models/7B/ggml-model-q4_0.bin -s 1679164839 -n 128 -t 4

Original settings:

[screenshot]

[screenshot]

After adjustments:

[screenshot]

[screenshot]

I've also added the results using -t 20 (I list the ms/token values):

[screenshot]

So using -t 4 --> 6.55% speedup (which is very close to the value @nicknitewolf had)
Using -t 20 --> 5.15% speedup

@anzz1 (Contributor, Author) commented Mar 28, 2023

Much appreciated! 👍

Thanks to everyone taking part in this, with special thanks to @nicknitewolf and @KASR for taking the time to do benchmarks. While a 5% speed increase might not be very noticeable on its own, all the performance increases add up and are important as parts of the big picture.

After such well-made research, we can definitely be confident that this PR is a go.

I'll merge this right after the CI is fixed (#546).

@anzz1 merged commit 5a5f8b1 into master on Mar 28, 2023
@anzz1 deleted the llama-patch-enable-fma-msvc branch on March 28, 2023