
Requesting Support for fmaddsub #88

Open
dannys4 opened this issue Jul 22, 2021 · 16 comments

Comments

@dannys4

dannys4 commented Jul 22, 2021

I was talking on Discourse about SIMD complex numbers (I hadn't seen #60 until a minute or two ago) and I'm trying to get a working implementation of Complex using Vec types, at least as a preliminary type. I can shave off a few clock cycles from the multiplication itself if I'm able to access the fmaddsub/fmsubadd intrinsics, but unfortunately, I can't get those working on my computer. I'm not sure what the issue is, but I'd put up a PR if I knew more LLVM. I'm putting my meager attempt at it below:

function fmaddsub(x::Vec{2,Float64}, y::Vec{2,Float64}, z::Vec{2,Float64})
    llvmcall("""
    %res = @llvm.x86.fma.vfmaddsub.pd <2 x double> %0, %1, %2
    ret <2 x double> %res
    """, Float64x2, Tuple{Vec{2,Float64},Vec{2,Float64},Vec{2,Float64}}, x, y, z)
end
@eschnett
Owner

This looks reasonable.

Are you sure this intrinsic exists on your system? The .fma. part means that this requires the fma instruction set extension, and you might need to explicitly enable it when calling LLVM. I don't know whether Julia does this. You could try the intrinsic llvm.x86.sse3.addsub.pd first, which should exist on any halfway modern system. (This intrinsic only adds and subtracts, it doesn't multiply, but it's good for testing.)

To go further, I recommend the following:

  • Fork the SIMD.jl repository
  • Make changes to your fork (even if they don't compile), commit, and push them to your fork
  • Open a pull request and mark it WIP (work in progress)

This makes it easy for people to see exactly what changes you made.

Then report exactly what system you are using (including the output of versioninfo()), what commands you're typing, and what error messages you get.

@dannys4
Author

dannys4 commented Jul 22, 2021 via email

@eschnett
Owner

See #89

@eschnett
Owner

LLVM supports the intrinsic. The question is whether the machine architecture that Julia tells LLVM to use includes the fma extensions.

What error message do you see?

@dannys4
Author

dannys4 commented Jul 24, 2021

I toyed around with it inside SIMD.jl on the PR (to integrate it better with the library). I tried adding :vfmaddsub to MULADD_INTRINSICS; it compiled, but produced this behavior:

julia> using SIMD

julia> x = Vec(1.5, 2.5)
<2 x Float64>[1.5, 2.5]

julia> xd = x.data
(VecElement{Float64}(1.5), VecElement{Float64}(2.5))

julia> SIMD.Intrinsics.fma_fmaddsub(xd, xd, xd)
(VecElement{Float64}(3.75), VecElement{Float64}(8.75))

For some reason it's just calling fmadd, and I'm not sure why, since it should be generating the right intrinsic.

Then, I tried implementing it a different way using

function fmaddsub(a::LVec{N, T}, b::LVec{N, T}, c::LVec{N, T}) where {N, T<:FloatingTypes}
    Base.llvmcall("llvm.x86.fma.vfmaddsub_pd", LVec{N, T}, (LVec{N, T}, LVec{N, T}, LVec{N, T}), a, b, c)
end

which (using the same previous example) gave me

julia> SIMD.Intrinsics.fmaddsub(xd, xd, xd)
Internal error: encountered unexpected error during compilation of fmaddsub:
TypeError(func=:llvmcall, context="", expected=Type, got=(Tuple{VecElement{Float64}, VecElement{Float64}}, Tuple{VecElement{Float64}, VecElement{Float64}}, 
Tuple{VecElement{Float64}, VecElement{Float64}}))
jl_type_error_rt at /cygdrive/c/buildbot/worker/package_win64/build/src\rtutils.c:119
jl_type_error at /cygdrive/c/buildbot/worker/package_win64/build/src\rtutils.c:127
emit_llvmcall at /cygdrive/c/buildbot/worker/package_win64/build/src\ccall.cpp:750
emit_intrinsic at /cygdrive/c/buildbot/worker/package_win64/build/src\intrinsics.cpp:893
emit_call at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:3524
emit_expr at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:4390
emit_ssaval_assign at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:4041
emit_stmtpos at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:4252
emit_function at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:6853
jl_emit_code at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:7215
jl_emit_codeinst at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:7260
_jl_compile_codeinst at /cygdrive/c/buildbot/worker/package_win64/build/src\jitlayers.cpp:124
jl_generate_fptr at /cygdrive/c/buildbot/worker/package_win64/build/src\jitlayers.cpp:352
jl_compile_method_internal at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:1970
jl_compile_method_internal at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:1924 [inlined]
_jl_invoke at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2229 [inlined]
jl_apply_generic at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2419
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1703 [inlined]
do_call at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:115
eval_value at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:204
eval_stmt_value at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:155 [inlined]
eval_body at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:576
jl_interpret_toplevel_thunk at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:670
top-level scope at REPL[5]:1
jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:877
jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:825
jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:825
jl_toplevel_eval at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:886 [inlined]
jl_toplevel_eval_in at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:929
eval at .\boot.jl:360 [inlined]
eval at .\Base.jl:39 [inlined]
repleval at c:\Users\danny\.vscode\extensions\julialang.language-julia-1.2.8\scripts\packages\VSCodeServer\src\repl.jl:149
#69 at c:\Users\danny\.vscode\extensions\julialang.language-julia-1.2.8\scripts\packages\VSCodeServer\src\repl.jl:115
unknown function (ip: 00000000611271a3)
with_logstate at .\logging.jl:491
with_logger at .\logging.jl:603 [inlined]
#68 at c:\Users\danny\.vscode\extensions\julialang.language-julia-1.2.8\scripts\packages\VSCodeServer\src\repl.jl:116
unknown function (ip: 0000000061127363)
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1703 [inlined]
jl_f__call_latest at /cygdrive/c/buildbot/worker/package_win64/build/src\builtins.c:714
#invokelatest#2 at .\essentials.jl:708 [inlined]
invokelatest at .\essentials.jl:706
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1703 [inlined]
do_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\builtins.c:670
macro expansion at c:\Users\danny\.vscode\extensions\julialang.language-julia-1.2.8\scripts\packages\VSCodeServer\src\eval.jl:34 [inlined]
#53 at .\task.jl:411
unknown function (ip: 000000006110b543)
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1703 [inlined]
start_task at /cygdrive/c/buildbot/worker/package_win64/build/src\task.c:839
ERROR: error statically evaluating llvmcall return type
Stacktrace:
 [1] fmaddsub(a::Tuple{Vararg{VecElement{T}, N}}, b::Tuple{Vararg{VecElement{T}, N}}, c::Tuple{Vararg{VecElement{T}, N}}) where {N, T<:Union{Float32, Float64}}
   @ SIMD.Intrinsics c:\Users\danny\.julia\dev\SIMD\src\LLVM_intrinsics.jl:434
 [2] top-level scope
   @ REPL[5]:1

I know that I can theoretically access it through Julia, because I was talking with some people over at VectorizationBase and someone got it to work there and pushed it up. I didn't want to just leave this hanging, but working with LLVM is definitely not in my wheelhouse.

@eschnett
Owner

This is my implementation:

function fmaddsub(a::LVec{2, Float64}, b::LVec{2, Float64}, c::LVec{2, Float64})
    ccall("llvm.x86.fma.vfmaddsub.pd", llvmcall, LVec{2, Float64}, (LVec{2, Float64}, LVec{2, Float64}, LVec{2, Float64}), a, b, c)
end

and this is the result I get:

julia> SIMD.Intrinsics.fmaddsub(xd, xd, xd)
(VecElement{Float64}(0.75), VecElement{Float64}(8.75))

which subtracts the first and adds the second element.

Note that ccall calls a single function, while llvmcall expects proper LLVM code, probably multiple lines, including a ret statement. The nearby intrinsics in the SIMD source code are good examples.

Note also the spelling of the intrinsic: there is a letter v in front of fmaddsub, and the suffix pd is separated by a dot, not by an underscore. The test cases in the LLVM source code are a good way of finding out how intrinsics are called.

@dannys4
Author

dannys4 commented Jul 24, 2021

Ah, I totally missed the underscore I left in; sloppy mistake. Check out the PR again and see how you feel about what I did (I tried to imitate the library's previous syntax). I added vfmsubadd since it's basically buy-one-get-one-free with these kinds of intrinsics, but I observed that, curiously, vfmsubadd and vfmaddsub give the same result. Do you have any thoughts on this?

Also, it seems like at least my version of LLVM doesn't support the AVX512-based intrinsics for these (I know that my CPU is capable of the AVX512 instructions, and I can perform those instructions normally through C), so I excluded those using comments and left open the possibility of adding them back in easily later.

@eschnett
Owner

I added comments. These comments assume that you plan to submit this to SIMD.jl, and include things such as test cases etc.

I don't know which LLVM version supports what intrinsics. I find LLVM's documentation in this respect sparse, i.e. I usually have to look at its source code to find out.

I also noticed that vfmaddsub seems to calculate vfmsubadd. I guessed that's because complex number arithmetic needs this. But now that I see that both intrinsics exist, this is strange indeed. Is this a bug in LLVM?

@dannys4
Author

dannys4 commented Jul 25, 2021

Sure, I've come too far to turn back now. I agree about the difficulty of finding which LLVM version supports which intrinsics; I've had the same issue, but it seems these were incorporated long enough ago that it shouldn't be a problem for Julia (though I'll check a baseline LLVM version).

I can't tell you why vfmaddsub calculates vfmsubadd; I know the recent push to VectorizationBase gets correct results, and to my knowledge they're doing virtually the same thing as the current PR. I'll poke around later when I get to making the changes, but let me know if you have any ideas as to why.

@dannys4
Author

dannys4 commented Jul 25, 2021

I'm looking at this and I'm starting to rethink whether this belongs in SIMD.jl at all. In the original thread I thought it did, because this library provides a lot of functionality for the Vec types. Now that I'm looking at it, though, the whole idea of detecting whether the CPU supports the vectorization seems to step on VectorizationBase's toes and is a little outside the scope of this library. I really don't mind working on this (hopefully it doesn't sound like I'm trying to weasel my way out), but I also don't want to make people even more confused with duplicated features between the two libraries.

@KristofferC
Collaborator

Made a comment about this on the PR: #89 (comment).

@photor

photor commented Nov 18, 2022

So, is there any progress about this?

@dannys4
Author

dannys4 commented Nov 29, 2022

Not really, at least for now (I will make no promises). I looked into it and started, but it turned out that VectorizationBase.jl had this (or rather, it was implemented around the time I opened this). I apologize for the dead issue 😅 .

@heltonmc

heltonmc commented Apr 3, 2023

It appears to be possible to emit these kinds of instructions with generic LLVM IR, without calling specific intrinsic functions. I generally agree with @KristofferC about letting LLVM generate the (admittedly fragile) specialized instructions and keeping this package lean. As mentioned, VectorizationBase.jl handles many different instructions for various architectures, so users who want them can use that package.

Nonetheless, here is a general version of these instructions, with some more details at heltonmc/SIMDMath.jl#4.

For example,

julia> @code_native SIMDMath.faddsub(a, b)

julia_faddsub_1812:                     # @julia_faddsub_1812
; ┌ @ none within `faddsub`
	.cfi_startproc
# %bb.0:                                # %top
; │┌ @ none within `macro expansion`
	vaddsubpd	%ymm1, %ymm0, %ymm0
	retq

This generates the target-specific instruction on my machine, while delivering a fairly optimal instruction sequence on my ARM computer, as shown in the other thread. It does not, however, combine the multiply with the vaddsubpd in the version discussed here. I'll play around a little with the LLVM IR, but that could be an issue on LLVM's side. The type system and intrinsics are based on this package, so the implementations are easily transferred.

@KristofferC
Collaborator

Maybe you need to emit "fast" fp instructions for LLVM to be allowed to change it to a fused operation?

@heltonmc

heltonmc commented Apr 3, 2023

Yes! I also tried a couple variations similar to

# a*b - c for i = 1, 3, ...
# a*b + c for i = 0, 2, ...
@inline @generated function fmsubadd(x::LVec{N, T}, y::LVec{N, T}, z::LVec{N, T}) where {N, T <: FloatTypes}
    @assert iseven(N) "Vector length must be even"
    shfl = join((string("i32 ", Int32(i-1), ", i32 ", Int32(N+i)) for i in 1:2:N), ", ")
    s = """
        %4 = fmul <$N x $(LLVMType[T])> %0, %1
        %5 = fadd fast <$N x $(LLVMType[T])> %4, %2
        %6 = fsub fast <$N x $(LLVMType[T])> %4, %2
        %7 = shufflevector <$N x $(LLVMType[T])> %5, <$N x $(LLVMType[T])> %6, <$N x i32> <$shfl>
        ret <$N x $(LLVMType[T])> %7
        """
    return :(
        llvmcall($s, LVec{N, T}, Tuple{LVec{N, T}, LVec{N, T}, LVec{N, T}}, x, y, z)
        )
end

But they all gave similar instructions (though I think this approach will work for this issue; it just needs to fit into a form LLVM likes...)

julia> @code_native SIMDMath.fmaddsub(a, b, c)
julia_fmaddsub_1848:                    # @julia_fmaddsub_1848
; ┌ @ none within `fmaddsub`
	.cfi_startproc
# %bb.0:                                # %top
; │┌ @ none within `macro expansion`
	vmulpd	%ymm1, %ymm0, %ymm0
	vaddsubpd	%ymm2, %ymm0, %ymm0
	retq
