
Enable Fused-Multiply-Add (FMA) and F16C/CVT16 vector extensions on MSVC #375

Merged: 4 commits merged into master from llama-patch-enable-fma-msvc on Mar 28, 2023

Conversation

@anzz1 (Contributor) commented Mar 22, 2023

__FMA__ and __F16C__ are defined by GCC and Clang.

__FMA__ and __F16C__ are not defined by MSVC; however, they are implied by /arch:AVX2 and /arch:AVX512:
https://learn.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-160

Thus, enable FMA and F16C in MSVC if either AVX2 or AVX512 is enabled.
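A minimal sketch of the kind of preprocessor shim this describes (the exact form and placement in ggml.c may differ from what the PR actually adds):

```c
/* Sketch only: MSVC does not define __FMA__/__F16C__, but /arch:AVX2 and
   /arch:AVX512 imply both, so define them ourselves. MSVC does define
   __AVX2__ and __AVX512F__ when those /arch flags are used. */
#if defined(_MSC_VER) && (defined(__AVX2__) || defined(__AVX512F__))
#ifndef __FMA__
#define __FMA__ 1
#endif
#ifndef __F16C__
#define __F16C__ 1
#endif
/* __SSE3__ (discussed later in this thread) could be added the same way. */
#ifndef __SSE3__
#define __SSE3__ 1
#endif
#endif
```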

@anzz1 changed the title from "Enable Fused-Multiply-Add (FMA) instructions on MSVC" to "Enable Fused-Multiply-Add (FMA) and F16C/CVT16 vector extensions on MSVC" on Mar 22, 2023
@anzz1 (Contributor, Author) commented Mar 22, 2023

It seems I'm too tired to find the button for converting this to a draft, but anyway:
The _cvtss_sh and _cvtsh_ss intrinsics are still missing and not implemented yet, so don't merge yet.

@anzz1 (Contributor, Author) commented Mar 22, 2023

I haven't examined the compiled output at the disassembly level yet, so, especially in the case of F16C, there is the question of to what extent the compiler had already optimized the generic ggml_compute_fp16_to_fp32 and fp32_to_fp16 routines to use the CVT/F16C instructions. The answer to that question also determines whether this change can bring a significant performance increase or does pretty much nothing at all.
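For context, the F16C fast path being discussed would look roughly like this; a sketch built on the scalar F16C intrinsics named in this thread, not necessarily the exact macros in ggml.c:

```c
#include <immintrin.h>

#if defined(__F16C__)
/* Hardware fp16 <-> fp32 conversion in a single instruction each;
   without __F16C__ defined, the generic bit-twiddling fallback is used. */
#define GGML_COMPUTE_FP16_TO_FP32(x) _cvtsh_ss(x)
#define GGML_COMPUTE_FP32_TO_FP16(x) _cvtss_sh(x, 0)
#endif
```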

@anzz1 added the labels "bug (Something isn't working)" and "performance (Speed related topics)" on Mar 22, 2023
@niclimcy commented:

AVX2/AVX512 also implies all the SIMD instructions being enabled, like SSE3.

@anzz1 (Contributor, Author) commented Mar 22, 2023

> AVX2/AVX512 also implies all the SIMD instructions being enabled, like SSE3.

Yeah, but __SSE3__ isn't currently used, as __AVX__ takes precedence over it (#elif defined(__SSE3__) comes after #elif defined(__AVX__)), so I didn't add it.

Edit: I guess it's a good addition anyway if it's possibly used in the future. Won't hurt.

@niclimcy commented:

does that macro even exist? https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-160

@anzz1 (Contributor, Author) commented Mar 22, 2023

> does that macro even exist? https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-160

It doesn't, that is the entire point.

@ggerganov (Owner) commented:

Do you observe improved performance with this change?

@lofcz commented Mar 22, 2023

This got FMA enabled when building from VS on Windows, on an 8th-gen i7. However, time per token seems to be the same (under 1% difference).

@anzz1 (Contributor, Author) commented Mar 23, 2023

> Do you observe improved performance with this change?

I'll have to take an in-depth look later, analysing the binary code and timing the performance; until then, no idea. In the case of FMA, the difference between _mm256_fmadd_ps(b, c, a) and _mm256_add_ps(_mm256_mul_ps(b, c), a) is probably marginal. The impact from the F16C intrinsics could be greater. Obviously the results can also vary between different processor lines, but generally I'd expect functions baked into the processor to do exactly that computation to perform better than general computation. However, we've also seen before that this isn't always the case: for AVX-512, at least a few years ago in the earliest Intel/AMD consumer SKUs to include it, the implementation was less than stellar and in many cases using it made performance worse, though AFAIK that is still more an exception than the norm. That case was also an example of how hard these things can be to measure: IIRC, at least on the Intel models, the problem wasn't that the AVX-512 calculations themselves weren't faster, but that using AVX-512 slowed down other calculations, which made the total impact negative.
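To spell out the two forms being compared (illustrative helper names; this is not code from the PR):

```c
#include <immintrin.h>

/* b*c + a as a single fused instruction: one rounding step. */
static inline __m256 madd_fused(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(b, c, a);
}

/* b*c + a as separate multiply and add: two instructions, two rounding steps. */
static inline __m256 madd_split(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(b, c), a);
}
```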

That didn't really answer your question. 😄

Thanks @lofcz for providing some initial testing. If anyone else wants to chip in with their results, including the processor and model parameters of the test, that'd be greatly appreciated.

I'd expect this to increase performance in the range of +0% to +X%, but it is especially important to make sure that it does not decrease performance in any case.

@niclimcy commented Mar 23, 2023

These are my runtimes on my Ryzen 4500U (Zen 2).

The "without FMA" build is based on e4412b4, while the "with FMA" build just adds the commits in this pull request.

[screenshot: runtime comparison]

Without FMA runs faster?

Values here are tokens / ms:
[screenshot: tokens / ms comparison]

EDIT:
Maybe I'm wrong and that value is the runtime? In that case the FMA patch improves performance by 6.24%?

@KASR (Contributor) commented Mar 23, 2023

I've tried to do the same as @nicknitewolf.

I have an Intel Xeon W-2295.

So I guess on my system there is little to no influence on the performance for the eval time; the sample time, however, seems to be a bit better. But the sample time has little effect on the total time.

[screenshot: timing summary]

The original system information and loading timings for the 7B and 65B:
[screenshot]
The 7B timings:
[screenshot]
The 65B timings:
[screenshot]

After the modifications:
[screenshot]
The 7B timings:
[screenshot]
The 65B timings:
[screenshot]

@anzz1 (Contributor, Author) commented Mar 26, 2023

Huge thanks @nicknitewolf and @KASR for providing some statistics. 👍 🥳

I've concluded that, unfortunately, as my CPU is a dog with only 4 threads total, I can't provide useful statistics myself: even -t 4 would leave no free threads for the OS itself, so external factors have too much impact to produce reliable results.

It seems that while @KASR's results are inconclusive, being inside the margin of error, @nicknitewolf did measure a significant 6.24% increase in performance.

@KASR, could you run the tests with -t 4 and see if there is a difference then? Your beast of a processor running 18 threads might produce a different result than something less powerful. You could also simulate a lesser processor by locking the thread affinity to two cores, so the threads stay locked to the same cores and the run can't exploit the advantages of a high core count.
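For the affinity-locking idea, something like this should work on Windows (the 0x3 mask pins the process to the first two logical cores; the binary name and model path are illustrative):

```
start /affinity 0x3 /wait main.exe -m models\7B\ggml-model-q4_0.bin -n 128 -t 2
```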

Commits:
__FMA__ macro does not exist in MSVC
__F16C__ macro does not exist in MSVC, but is implied with AVX2/AVX512
even though it's not currently used for anything when AVX is defined

@anzz1 force-pushed the llama-patch-enable-fma-msvc branch from d1e4a18 to b8a80f9 on March 26, 2023
@anzz1 (Contributor, Author) commented Mar 26, 2023

Rebased the branch onto master for easier testing.

@niclimcy commented:

> Huge thanks @nicknitewolf and @KASR for providing some statistics. 👍 🥳
>
> I've concluded that, unfortunately, as my CPU is a dog with only 4 threads total, I can't provide useful statistics myself: even -t 4 would leave no free threads for the OS itself, so external factors have too much impact to produce reliable results.
>
> It seems that while @KASR's results are inconclusive, being inside the margin of error, @nicknitewolf did measure a significant 6.24% increase in performance.
>
> @KASR, could you run the tests with -t 4 and see if there is a difference then? Your beast of a processor running 18 threads might produce a different result than something less powerful. You could also simulate a lesser processor by locking the thread affinity to two cores, so the threads stay locked to the same cores and the run can't exploit the advantages of a high core count.

Ah, I should have done that, setting a specific thread count; maybe that's why my results vary so much.

@anzz1 (Contributor, Author) commented Mar 27, 2023

Even based on your tests alone, I would think that merging this is a good idea. All the other platforms use the extensions already, and MSVC is also supposed to use them, as they are implied by /arch:AVX2; the only problem here is really the #define not being set, so MSVC takes a slower codepath than all the other compilers.

The reason for the variance is probably just Windows, unfortunately, and there is not much to be done about it except running more tests to decrease its significance. I'm actually still maining Windows 7 for exactly this reason: a fresh Win7 installation with the crap cut down runs under 100 threads total when idle, at ~5 W and ~0.5% CPU usage. Windows 10, on the other hand, has 800+ threads at any given time and can go up into the thousands when updates or some Store/Xbox/Cortana nonsense is running in the background. Open your task manager and check your thread count when you are supposedly doing nothing, and see all the bloat eating up your CPU cycles. It's really hard to properly perf-test anything on modern Windows, and most of the crap is baked so heavily into the system that it's nigh-impossible to remove it all without crippling the OS.

That being said, I do not recommend maining Win7 anymore, since hardware and software support is on its very last legs. Unfortunate, since the OS itself is pretty much perfect and 100% stable; I haven't had a system crash or even a malfunction in years.

@slaren (Collaborator) left a comment:

A quick test in Godbolt shows that MSVC compiles GGML_COMPUTE_FP16_TO_FP32 and GGML_COMPUTE_FP32_TO_FP16 to vcvtph2ps and vcvtps2ph, the same as GCC. I see no reason not to merge this.

@anzz1 (Contributor, Author) commented Mar 27, 2023

@slaren And looking at the current alternative paints a pretty clear picture 😄
As seen in the "Before PR" part, VEX-encoded instructions like vcvtsi2ss are still used, since AVX2 implies their use; it's just that the most optimized version wasn't used because of the missing flag.

After PR

GGML_COMPUTE_FP16_TO_FP32 PROC                      ; COMDAT
        movzx   eax, cx
        vmovd   xmm0, eax
        vcvtph2ps xmm0, xmm0
        ret     0
GGML_COMPUTE_FP16_TO_FP32 ENDP

GGML_COMPUTE_FP32_TO_FP16 PROC                      ; COMDAT
        vmovaps xmm1, xmm0
        vxorps  xmm0, xmm0, xmm0
        vmovss  xmm1, xmm0, xmm1
        vcvtps2ph xmm2, xmm1, 0
        vpextrw eax, xmm2, 0
        ret     0
GGML_COMPUTE_FP32_TO_FP16 ENDP

Before PR

GGML_COMPUTE_FP16_TO_FP32 PROC                      ; COMDAT
        movzx   eax, cx
        shl     eax, 16
        mov     edx, eax
        and     edx, -2147483648              ; 80000000H
        lea     ecx, DWORD PTR [rax+rax]
        mov     eax, ecx
        shr     eax, 4
        add     eax, 1879048192                     ; 70000000H
        mov     DWORD PTR fp32$3[rsp], eax
        mov     eax, ecx
        shr     eax, 17
        or      eax, 1056964608                   ; 3f000000H
        mov     DWORD PTR fp32$2[rsp], eax
        cmp     ecx, 134217728                            ; 08000000H
        jae     SHORT $LN5@GGML_COMPU
        vmovss  xmm0, DWORD PTR fp32$2[rsp]
        vsubss  xmm1, xmm0, DWORD PTR __real@3f000000
        vmovss  DWORD PTR tv87[rsp], xmm1
        mov     eax, DWORD PTR tv87[rsp]
        or      eax, edx
        mov     DWORD PTR fp32$1[rsp], eax
        vmovss  xmm0, DWORD PTR fp32$1[rsp]
        ret     0
$LN5@GGML_COMPU:
        vmovss  xmm0, DWORD PTR fp32$3[rsp]
        vmulss  xmm1, xmm0, DWORD PTR __real@07800000
        vmovss  DWORD PTR tv87[rsp], xmm1
        mov     eax, DWORD PTR tv87[rsp]
        or      eax, edx
        mov     DWORD PTR fp32$1[rsp], eax
        vmovss  xmm0, DWORD PTR fp32$1[rsp]
        ret     0
GGML_COMPUTE_FP16_TO_FP32 ENDP

GGML_COMPUTE_FP32_TO_FP16 PROC                      ; COMDAT
$LN17:
        sub     rsp, 56                             ; 00000038H
        vmovaps XMMWORD PTR [rsp+32], xmm6
        vmovaps xmm6, xmm0
        vcvtss2sd xmm0, xmm6, xmm0
        vmovq   rcx, xmm0
        call    fabsf
        vmovd   r8d, xmm6
        vmovaps xmm6, XMMWORD PTR [rsp+32]
        mov     ecx, 1895825408                     ; 71000000H
        vxorps  xmm1, xmm1, xmm1
        vcvtsi2ss xmm1, xmm1, eax
        vmulss  xmm2, xmm1, DWORD PTR __real@77800000
        vmulss  xmm3, xmm2, DWORD PTR __real@08800000
        lea     edx, DWORD PTR [r8+r8]
        and     r8d, -2147483648              ; 80000000H
        mov     eax, edx
        and     eax, -16777216                            ; ff000000H
        cmp     eax, ecx
        cmovb   eax, ecx
        shr     eax, 1
        add     eax, 125829120                            ; 07800000H
        mov     DWORD PTR fp32$1[rsp], eax
        vaddss  xmm1, xmm3, DWORD PTR fp32$1[rsp]
        vmovd   ecx, xmm1
        mov     eax, ecx
        and     ecx, 4095               ; 00000fffH
        shr     eax, 13
        and     eax, 31744                                ; 00007c00H
        add     eax, ecx
        mov     ecx, 32256                                ; 00007e00H
        cmp     edx, -16777216                            ; ff000000H
        cmova   ax, cx
        shr     r8d, 16
        or      ax, r8w
        add     rsp, 56                             ; 00000038H
        ret     0
GGML_COMPUTE_FP32_TO_FP16 ENDP

@anzz1 (Contributor, Author) commented Mar 27, 2023

Hold merging this until #546 is merged.

@KASR (Contributor) commented Mar 28, 2023

> Huge thanks @nicknitewolf and @KASR for providing some statistics. 👍 🥳
>
> I've concluded that, unfortunately, as my CPU is a dog with only 4 threads total, I can't provide useful statistics myself: even -t 4 would leave no free threads for the OS itself, so external factors have too much impact to produce reliable results.
>
> It seems that while @KASR's results are inconclusive, being inside the margin of error, @nicknitewolf did measure a significant 6.24% increase in performance.
>
> @KASR, could you run the tests with -t 4 and see if there is a difference then? Your beast of a processor running 18 threads might produce a different result than something less powerful. You could also simulate a lesser processor by locking the thread affinity to two cores, so the threads stay locked to the same cores and the run can't exploit the advantages of a high core count.

Yes, sure. I've updated to the newest commit (at the time of writing) and enabled AVX512. I only ran the 7B model; let me know if it would be interesting to also see the results for the 65B.

I've used the command: ./main -m ./models/7B/ggml-model-q4_0.bin -s 1679164839 -n 128 -t 4

Original settings:

[screenshot]

[screenshot]

After adjustments:

[screenshot]

[screenshot]

I've also added the results using -t 20 (I list the ms/token values):

[screenshot]

So using -t 4 --> 6.55% speedup (which is very close to the value @nicknitewolf had)
Using -t 20 --> 5.15% speedup

@anzz1 (Contributor, Author) commented Mar 28, 2023

Much appreciated! 👍

Thanks to everyone taking part in this, with special thanks to @nicknitewolf and @KASR for taking the time to do benchmarks. While a 5% speed increase might not be very noticeable on its own, all the performance increases add up and are important as parts of the big picture.

After such well-made research, we can definitely be confident that this PR is a go.

I'll merge this right after the CI is fixed (#546).

@anzz1 merged commit 5a5f8b1 into master on Mar 28, 2023
@anzz1 deleted the llama-patch-enable-fma-msvc branch on March 28, 2023