
Requesting Support for fmaddsub #88

Open
dannys4 opened this issue Jul 22, 2021 · 16 comments

Comments

@dannys4

dannys4 commented Jul 22, 2021

I was talking on Discourse about SIMD complex numbers (I hadn't seen #60 until a minute or two ago) and I'm trying to get a working implementation of Complex using Vec types, at least as a preliminary type. I can shave off a few clock cycles from the multiplication itself if I'm able to access the fmaddsub/fmsubadd intrinsics, but unfortunately, I can't get those working on my computer. I'm not sure what the issue is, but I'd put up a PR if I knew more LLVM. I'm putting my meager attempt at it below:

function fmaddsub(x::Vec{2,Float64}, y::Vec{2,Float64}, z::Vec{2,Float64})
    llvmcall("""
    %res = @llvm.x86.fma.vfmaddsub.pd <2 x double> %0, %1, %2
    ret <2 x double> %res
    """, Float64x2, Tuple{Vec{2,Float64},Vec{2,Float64},Vec{2,Float64}}, x, y, z)
end
@eschnett
Owner

This looks reasonable.

Are you sure this intrinsic exists on your system? The .fma. part means that this requires the fma instruction set extension, and you might need to explicitly enable it when calling LLVM. I don't know whether Julia does this. You could try the intrinsic llvm.x86.sse3.addsub.pd first, which should exist on any halfway modern system. (This intrinsic only adds and subtracts, it doesn't multiply, but it's good for testing.)

To go further, I recommend the following:

  • Fork the SIMD.jl repository
  • Make changes to your fork (even if they don't compile), commit, and push them to your fork
  • Open a pull request and mark it WIP (work in progress)

This makes it easy for people to see exactly what changes you made.

Then report exactly what system you are using (including the output of versioninfo()), what commands you're typing, and what error messages you get.

@dannys4
Author

dannys4 commented Jul 22, 2021 via email

@eschnett
Owner

See #89

@eschnett
Owner

LLVM supports the intrinsic. The question is whether the machine architecture that Julia tells LLVM to use includes the fma extensions.

What error message do you see?

@dannys4
Author

dannys4 commented Jul 24, 2021

I toyed around with it inside SIMD.jl on the PR (to integrate it better with the library). I tried adding :vfmaddsub to MULADD_INTRINSICS; it compiled, but produced this behavior:

julia> using SIMD

julia> x = Vec(1.5, 2.5)
<2 x Float64>[1.5, 2.5]

julia> xd = x.data
(VecElement{Float64}(1.5), VecElement{Float64}(2.5))

julia> SIMD.Intrinsics.fma_fmaddsub(xd, xd, xd)
(VecElement{Float64}(3.75), VecElement{Float64}(8.75))

For some reason it's just calling fmadd, and I'm not sure why, since it should be generating the right intrinsic.

Then, I tried implementing it a different way using

function fmaddsub(a::LVec{N, T}, b::LVec{N, T}, c::LVec{N, T}) where {N, T<:FloatingTypes}
    Base.llvmcall("llvm.x86.fma.vfmaddsub_pd", LVec{N, T}, (LVec{N, T}, LVec{N, T}, LVec{N, T}), a, b, c)
end

which (using the same previous example) gave me

julia> SIMD.Intrinsics.fmaddsub(xd, xd, xd)
Internal error: encountered unexpected error during compilation of fmaddsub:
TypeError(func=:llvmcall, context="", expected=Type, got=(Tuple{VecElement{Float64}, VecElement{Float64}}, Tuple{VecElement{Float64}, VecElement{Float64}}, 
Tuple{VecElement{Float64}, VecElement{Float64}}))
jl_type_error_rt at /cygdrive/c/buildbot/worker/package_win64/build/src\rtutils.c:119
jl_type_error at /cygdrive/c/buildbot/worker/package_win64/build/src\rtutils.c:127
emit_llvmcall at /cygdrive/c/buildbot/worker/package_win64/build/src\ccall.cpp:750
emit_intrinsic at /cygdrive/c/buildbot/worker/package_win64/build/src\intrinsics.cpp:893
emit_call at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:3524
emit_expr at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:4390
emit_ssaval_assign at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:4041
emit_stmtpos at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:4252
emit_function at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:6853
jl_emit_code at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:7215
jl_emit_codeinst at /cygdrive/c/buildbot/worker/package_win64/build/src\codegen.cpp:7260
_jl_compile_codeinst at /cygdrive/c/buildbot/worker/package_win64/build/src\jitlayers.cpp:124
jl_generate_fptr at /cygdrive/c/buildbot/worker/package_win64/build/src\jitlayers.cpp:352
jl_compile_method_internal at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:1970
jl_compile_method_internal at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:1924 [inlined]
_jl_invoke at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2229 [inlined]
jl_apply_generic at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2419
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1703 [inlined]
do_call at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:115
eval_value at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:204
eval_stmt_value at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:155 [inlined]
eval_body at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:576
jl_interpret_toplevel_thunk at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:670
top-level scope at REPL[5]:1
jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:877
jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:825
jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:825
jl_toplevel_eval at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:886 [inlined]
jl_toplevel_eval_in at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:929
eval at .\boot.jl:360 [inlined]
eval at .\Base.jl:39 [inlined]
repleval at c:\Users\danny\.vscode\extensions\julialang.language-julia-1.2.8\scripts\packages\VSCodeServer\src\repl.jl:149
#69 at c:\Users\danny\.vscode\extensions\julialang.language-julia-1.2.8\scripts\packages\VSCodeServer\src\repl.jl:115
unknown function (ip: 00000000611271a3)
with_logstate at .\logging.jl:491
with_logger at .\logging.jl:603 [inlined]
#68 at c:\Users\danny\.vscode\extensions\julialang.language-julia-1.2.8\scripts\packages\VSCodeServer\src\repl.jl:116
unknown function (ip: 0000000061127363)
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1703 [inlined]
jl_f__call_latest at /cygdrive/c/buildbot/worker/package_win64/build/src\builtins.c:714
#invokelatest#2 at .\essentials.jl:708 [inlined]
invokelatest at .\essentials.jl:706
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1703 [inlined]
do_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\builtins.c:670
macro expansion at c:\Users\danny\.vscode\extensions\julialang.language-julia-1.2.8\scripts\packages\VSCodeServer\src\eval.jl:34 [inlined]
#53 at .\task.jl:411
unknown function (ip: 000000006110b543)
jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1703 [inlined]
start_task at /cygdrive/c/buildbot/worker/package_win64/build/src\task.c:839
ERROR: error statically evaluating llvmcall return type
Stacktrace:
 [1] fmaddsub(a::Tuple{Vararg{VecElement{T}, N}}, b::Tuple{Vararg{VecElement{T}, N}}, c::Tuple{Vararg{VecElement{T}, N}}) where {N, T<:Union{Float32, Float64}}
   @ SIMD.Intrinsics c:\Users\danny\.julia\dev\SIMD\src\LLVM_intrinsics.jl:434
 [2] top-level scope
   @ REPL[5]:1

I know that I can theoretically access it through Julia, because I was talking with some people over at VectorizationBase and someone got it to work there and pushed it up. I didn't want to just leave this hanging, but working with LLVM is definitely not in my wheelhouse.

@eschnett
Owner

This is my implementation:

function fmaddsub(a::LVec{2, Float64}, b::LVec{2, Float64}, c::LVec{2, Float64})
    ccall("llvm.x86.fma.vfmaddsub.pd", llvmcall, LVec{2, Float64}, (LVec{2, Float64}, LVec{2, Float64}, LVec{2, Float64}), a, b, c)
end

and this is the result I get:

julia> SIMD.Intrinsics.fmaddsub(xd, xd, xd)
(VecElement{Float64}(0.75), VecElement{Float64}(8.75))

which subtracts the first and adds the second element.

Note that ccall calls a single function, while llvmcall expects proper LLVM code, probably multiple lines, including a ret statement. The nearby intrinsics in the SIMD source code are good examples.

Note also the spelling of the intrinsic: there is a letter v in front of fmaddsub, and the suffix pd is separated by a dot, not by an underscore. The test cases in the LLVM source code are a good way of finding out how intrinsics are called.

@dannys4
Author

dannys4 commented Jul 24, 2021

Ah, I totally missed the underscore I left in; sloppy mistake. Check out the PR again and see how you feel about what I did (I tried to imitate the library's previous syntax). I added vfmsubadd since it's basically buy-one-get-one-free with these kinds of intrinsics, but I observed that, curiously, vfmsubadd and vfmaddsub give the same result. Do you have any thoughts on this?

Also, it seems like at least my version of LLVM doesn't support the AVX512-based intrinsics for these (I know that my CPU is capable of the AVX512 instructions, and I can perform those instructions normally through C), so I excluded those using comments and left open the possibility of adding them back in easily later.

@eschnett
Owner

I added comments. These comments assume that you plan to submit this to SIMD.jl, and include things such as test cases etc.

I don't know which LLVM version supports what intrinsics. I find LLVM's documentation in this respect sparse, i.e. I usually have to look at its source code to find out.

I also noticed that vfmaddsub seems to calculate vfmsubadd. I guessed that's because complex number arithmetic needs this. But now that I see that both intrinsics exist, this is strange indeed. Is this a bug in LLVM?

@dannys4
Author

dannys4 commented Jul 25, 2021

Sure, I've come too far to turn back now. I agree about the difficulty of finding which LLVM version supports which intrinsics; I've had the same issue, but it seems these were incorporated long enough ago that it shouldn't be a problem for Julia (though I'll check a baseline LLVM version).

I can't tell you why vfmaddsub calculates vfmsubadd; I know the recent push to VectorizationBase gets correct results, and to my knowledge they're doing virtually the same thing as the current PR. I'll poke around later when I get to making the changes, but let me know if you have any ideas as to why.

@dannys4
Author

dannys4 commented Jul 25, 2021

I'm looking at this and I'm starting to rethink whether this belongs in SIMD.jl at all. In the original thread I thought it did, because this library provides a lot of functionality for the Vec types. Now that I'm looking at it, though, the whole idea of detecting whether the CPU supports the vectorization seems to step on VectorizationBase's toes and is a little outside the scope of this library. I really don't mind working on this (hopefully it doesn't sound like I'm trying to weasel my way out), but I also don't want to make people even more confused with duplicated features between the two libraries.

@KristofferC
Collaborator

Made a comment about this on the PR: #89 (comment).

@photor

photor commented Nov 18, 2022

So, is there any progress about this?

@dannys4
Author

dannys4 commented Nov 29, 2022

Not really, at least for now (I will make no promises). I looked into it and started, but it turned out that VectorizationBase.jl had this (or rather, it was implemented around the time I opened this). I apologize for the dead issue 😅 .

@heltonmc

heltonmc commented Apr 3, 2023

It appears to be possible to emit these kinds of instructions with generic LLVM IR, without calling specific intrinsic functions. I generally agree with @KristofferC about letting LLVM generate the (admittedly fragile) specialized instructions and keeping this package lean. As mentioned, VectorizationBase.jl handles many different instructions for various architectures, so users who want them can use that package.

Nonetheless, here is a general version of these instructions, with some more details at heltonmc/SIMDMath.jl#4.

For example,

julia> @code_native SIMDMath.faddsub(a, b)

julia_faddsub_1812:                     # @julia_faddsub_1812
; ┌ @ none within `faddsub`
	.cfi_startproc
# %bb.0:                                # %top
; │┌ @ none within `macro expansion`
	vaddsubpd	%ymm1, %ymm0, %ymm0
	retq

This generates the target-specific instruction on my machine, while delivering a fairly optimal instruction sequence on my ARM computer, as shown in the other thread. It does not, however, combine the multiply with the vaddsubpd in the version discussed here. I'll play around a little with the LLVM IR, but that could be an issue on LLVM's side. The type system and intrinsics are based on this package, so the implementations are easily transferred.

@KristofferC
Collaborator

Maybe you need to emit "fast" fp instructions for LLVM to be allowed to change it to a fused operation?

@heltonmc

heltonmc commented Apr 3, 2023

Yes! I also tried a couple variations similar to

# a*b - c for i = 1, 3, ...
# a*b + c for i = 0, 2, ...
@inline @generated function fmsubadd(x::LVec{N, T}, y::LVec{N, T}, z::LVec{N, T}) where {N, T <: FloatTypes}
    @assert iseven(N) "Vector length must be even"
    shfl = join((string("i32 ", Int32(i-1), ", i32 ", Int32(N+i)) for i in 1:2:N), ", ")
    s = """
        %4 = fmul <$N x $(LLVMType[T])> %0, %1
        %5 = fadd fast <$N x $(LLVMType[T])> %4, %2
        %6 = fsub fast <$N x $(LLVMType[T])> %4, %2
        %7 = shufflevector <$N x $(LLVMType[T])> %5, <$N x $(LLVMType[T])> %6, <$N x i32> <$shfl>
        ret <$N x $(LLVMType[T])> %7
        """
    return :(
        llvmcall($s, LVec{N, T}, Tuple{LVec{N, T}, LVec{N, T}, LVec{N, T}}, x, y, z)
        )
end

But they all gave similar instructions (though I think this approach will work for this issue; it just needs to fit into a form LLVM likes...)

julia> @code_native SIMDMath.fmaddsub(a, b, c)
julia_fmaddsub_1848:                    # @julia_fmaddsub_1848
; ┌ @ none within `fmaddsub`
	.cfi_startproc
# %bb.0:                                # %top
; │┌ @ none within `macro expansion`
	vmulpd	%ymm1, %ymm0, %ymm0
	vaddsubpd	%ymm2, %ymm0, %ymm0
	retq
