Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider FMA #5

Open
jfbastien opened this issue Apr 25, 2017 · 6 comments
Open

Consider FMA #5

jfbastien opened this issue Apr 25, 2017 · 6 comments
Labels
in-overview Instruction has been added to Overview.md instruction-proposal

Comments

@jfbastien
Copy link
Member

Fused multiply-add is being discussed in WebAssembly/simd#8. Let's move the discussion here so we don't drop it.

I'd consider FMA after the first shot at SIMD because, similar to #4, it doesn't have universal platform support. Specific benchmarks may benefit from its presence. As with other ops, maybe we should consider adding support for scalar and vector FMA at the same time.

FMA is part of IEEE 754-2008, but isn't universally supported and can't be polyfilled efficiently. Maybe implementations should simply not advertise FMA if the hardware doesn't support it, forcing users to feature-detect, as opposed to offering non-fused fallback which yield different numerical results.

FMA offers multiply+add with one less rounding, sometimes better runtime performance (sometimes because the hardware itself is magical, it avoids register issues, has its own execution unit, etc), and may have better instruction-cache impact.

It's supported in languages like C or C++ through opt-in with pragma fp_contract, explicit math.h fma function and friends, or fast math flags. I would disallow such optimizations by WebAssembly compilers, only the developer-side compiler should be allowed to emit that opcode.

It's supported in recent x86, and all ARMv8.

@Maratyszcza
Copy link
Collaborator

It would help performance to just have a multiply-add instruction, which could be optionally implemented with FMA

@lemaitre
Copy link

lemaitre commented Apr 18, 2018

I think 3 operations are actually needed:

  • addmul(a, b, c) -> implement exactly a + (b * c) with intermediate rounding
  • fusedaddmul(a, b, c) -> implement exactly a + b * c without any intermediate rounding
  • fastaddmul(a, b, c) -> picks the fastest between addmul and fusedaddmul

The first operation is actually not needed at all: this can be easily detect by the VM as a sequence of a mul and add. To be noted that some hardware have special instructions for this very case.

I think here it would be really important to detect the presence of hadware instruction for FMA.

@gnzlbg
Copy link

gnzlbg commented Mar 2, 2019

Performing the operation with intermediate rounding can already be expressed using a + b * c. In my experience, it is rare for applications to actually require an FMA without any intermediate rounding, particularly when that might be much slower in practice. If there is a strong need for this, we can always add this later.

What most apps want is "perform a + b * c as fast as possible", so maybe we should start by adding that.

@munrocket
Copy link

munrocket commented Dec 9, 2019

Here my example: without FMA error-free Dekker’s multiplication algorithm requires 17 floating-point operations. But if it will be available in WASM, it will be required only 2 FLOP. [1] This algorithm probably have a biggest performance boost because it requires precision. Theoretically FMA can gives 17/2 * 100% - 100% = +750% to this algo, but here you can find real benchmark. You can find another applications and papers where it is usefull.

References:

  1. Mioara Joldes, Jean-Michel Muller, Valentina Popescu. Tight and rigourous error bounds for basic building blocks of double-word arithmetic, 2017
  2. Sylvie Boldo, Guillaume Melquiond. Emulation of a FMA and correctly-rounded sums:
    proved algorithms using rounding to odd
    , 2010
  3. Sylvie Boldo, Guillaume Melquiond. When double rounding is odd. 2005

@ngzhian ngzhian transferred this issue from WebAssembly/simd Mar 17, 2021
@ngzhian
Copy link
Member

ngzhian commented Mar 18, 2021

WebAssembly/simd#79 was an initial proposal for qfma, linking to here here since we can't transfer PRs.

@ngzhian ngzhian added the in-overview Instruction has been added to Overview.md label Feb 18, 2022
@jrus
Copy link

jrus commented May 16, 2022

it is rare for applications to actually require an FMA without any intermediate rounding [...] What most apps want is "perform a + b * c as fast as possible"

There are a whole bunch of fundamental computational building blocks where people care about the precise bits of a floating point number, and careful error analysis has been done to make guarantees about maximum errors, where using FMA greatly simplifies and speeds up the algorithm but naively performing "a + b * c" with intermediate rounding totally wrecks the result and a more complicated algorithm must be used instead if the mul and add operations are separate.

These are especially common in computational geometry where slight rounding errors can break invariants that cause higher-level algorithms to give hilariously incorrect results, go into an infinite loop, or crash. For context see https://www.cs.cmu.edu/~quake/robust.html or for a list of relevant papers, try https://scholar.google.com/scholar?q=geometric+predicate+FMA – as munrocket points out correct double-double arithmetic is another (closely related) example.

Likewise, substituting FMA for a + b * c will often wreck computations which expect (rely on) intermediate rounding. Opting people into an FMA they didn’t explicitly ask for will break their code.

FMA is widely supported by hardware. It would be great to have some syntax/function for an explicit FMA in wasm (and every other programming language). In contexts where it is unavailable the code author can write a separate fall-back algorithm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in-overview Instruction has been added to Overview.md instruction-proposal
Projects
None yet
Development

No branches or pull requests

7 participants