This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Floating-Point to Nearest Integer Conversions #247

Closed
wants to merge 1 commit

Conversation

Maratyszcza
Contributor

@Maratyszcza Maratyszcza commented Jun 6, 2020

Introduction

This PR adds four forms of floating-point-to-integer conversion with rounding to nearest (ties to even), in addition to existing instructions with rounding towards zero mode. This operation is natively supported in SSE2 and ARMv8 NEON, and can be efficiently simulated in native instructions on ARMv7 NEON.
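To make the proposed lane semantics concrete, here is a scalar C sketch of the signed f32 variant: round to nearest with ties to even, saturate on overflow, and convert NaN to 0. The function names are illustrative, not from the spec; the helper uses the classic magic-number trick so it needs no libm and assumes the default round-to-nearest-even mode.

```c
#include <stdint.h>

/* Magic-number rounding: for |x| < 2^23, (x + 2^23) - 2^23 rounds x to
 * an integer under the default round-to-nearest-even mode; values with
 * |x| >= 2^23 (and NaN) are returned unchanged, as they are already
 * integral (or NaN). */
static float round_ties_to_even(float x) {
  const float magic = 8388608.0f; /* 0x1.0p+23f */
  if (!(x < magic && x > -magic)) return x; /* also catches NaN */
  return x >= 0.0f ? (x + magic) - magic : (x - magic) + magic;
}

/* Scalar sketch of the proposed i32x4.nearest_sat_f32x4_s per-lane
 * behavior: nearest-even rounding, saturation, NaN -> 0. */
static int32_t nearest_sat_f32_s(float x) {
  float r = round_ties_to_even(x);
  if (r != r) return 0;                     /* NaN -> 0      */
  if (r >= 2147483648.0f) return INT32_MAX; /* saturate high */
  if (r < -2147483648.0f) return INT32_MIN; /* saturate low  */
  return (int32_t)r;                        /* exact: r is integral */
}
```

Note the ties-to-even behavior: both 1.5 and 2.5 round to 2, matching SSE2 `CVTPS2DQ` and ARM `FCVTNS` under default rounding modes.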

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86 processors with SSE2 instruction set

  • i32x4.nearest_sat_f32x4_s
    • y = i32x4.nearest_sat_f32x4_s(x) (y is NOT x) is lowered to
      • MOVAPS xmm_y, xmm_x
      • MOVAPS xmm_tmp, wasm_splat_f32(0x1.0p+31f)
      • CMPUNORDPS xmm_y, xmm_y
      • CMPLEPS xmm_tmp, xmm_x
      • ANDNPS xmm_y, xmm_x
      • CVTPS2DQ xmm_y, xmm_y
      • PXOR xmm_y, xmm_tmp
  • i32x4.nearest_sat_f32x4_u
    • y = i32x4.nearest_sat_f32x4_u(x) (y is NOT x) is lowered to
      • MOVAPS xmm_tmp0, wasm_splat_f32(0x1.0p+31f)
      • MOVAPS xmm_tmp1, xmm_x
      • CMPNLTPS xmm_tmp1, xmm_tmp0
      • MOVAPS xmm_y, xmm_x
      • MOVAPS xmm_tmp2, xmm_tmp0
      • ANDPS xmm_tmp0, xmm_tmp1
      • SUBPS xmm_y, xmm_tmp0
      • PSLLD xmm_tmp1, 31
      • CMPLEPS xmm_tmp2, xmm_y
      • CVTPS2DQ xmm_y, xmm_y
      • PXOR xmm_tmp1, xmm_y
      • PXOR xmm_y, xmm_y
      • PCMPGTD xmm_y, xmm_x
      • POR xmm_tmp1, xmm_tmp2
      • PANDN xmm_y, xmm_tmp1
  • i32x4.nearest_sat_f64x2_s_zero
    • y = i32x4.nearest_sat_f64x2_s_zero(x) is lowered to:
      • MOVAPS xmm_tmp, xmm_x
      • CMPEQPD xmm_tmp, xmm_tmp
      • MOVAPS xmm_y, xmm_x
      • ANDPS xmm_tmp, [wasm_f64x2_splat(2147483647.0)]
      • MINPD xmm_y, xmm_tmp
      • CVTPD2DQ xmm_y, xmm_y
  • i32x4.nearest_sat_f64x2_u_zero
    • y = i32x4.nearest_sat_f64x2_u_zero(x) is lowered to:
      • MOVAPD xmm_y, xmm_x
      • XORPD xmm_tmp, xmm_tmp
      • MAXPD xmm_y, xmm_tmp
      • MINPD xmm_y, [wasm_f64x2_splat(4294967295.0)]
      • ADDPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
      • SHUFPS xmm_y, xmm_tmp, 0x88
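Assuming an x86-64 target with the default MXCSR rounding mode (round-to-nearest-even), two of these sequences can be sketched with SSE2 intrinsics as follows. This is a sketch for illustration, not a normative lowering, and the function names are made up; a real code generator would emit the instructions directly.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Sketch of the i32x4.nearest_sat_f32x4_s sequence: CVTPS2DQ already
 * rounds to nearest-even, so only NaN lanes (-> 0) and overflowing
 * lanes (-> INT32_MAX) need fixing up. */
static __m128i nearest_sat_f32x4_s_sse2(__m128 x) {
  const __m128 two_p31 = _mm_set1_ps(0x1.0p+31f);
  __m128 nan_mask = _mm_cmpunord_ps(x, x);    /* all-ones where x is NaN */
  __m128 ovf_mask = _mm_cmple_ps(two_p31, x); /* lanes with x >= 2^31    */
  __m128 clean = _mm_andnot_ps(nan_mask, x);  /* NaN lanes -> 0.0        */
  __m128i y = _mm_cvtps_epi32(clean);         /* out-of-range -> INT32_MIN */
  /* lanes >= 2^31 hold INT32_MIN; flipping every bit gives INT32_MAX */
  return _mm_xor_si128(y, _mm_castps_si128(ovf_mask));
}

/* Sketch of the i32x4.nearest_sat_f64x2_u_zero sequence: clamp into
 * [0, 2^32-1] (MAXPD also sends NaN lanes to +0.0, since a NaN in the
 * first operand makes MAXPD return the second), then add 2^52 so the
 * low 32 bits of each double hold the nearest-even-rounded integer. */
static __m128i nearest_sat_f64x2_u_zero_sse2(__m128d x) {
  const __m128d zero = _mm_setzero_pd();
  __m128d clamped = _mm_max_pd(x, zero);      /* NaN/negative -> +0.0 */
  clamped = _mm_min_pd(clamped, _mm_set1_pd(4294967295.0));
  __m128d biased = _mm_add_pd(clamped, _mm_set1_pd(0x1.0p+52));
  /* SHUFPS 0x88 gathers the low dword of each qword, zeroing the rest */
  return _mm_castps_si128(_mm_shuffle_ps(_mm_castpd_ps(biased),
                                         _mm_castpd_ps(zero), 0x88));
}
```

The 2^52 trick works because doubles in [2^52, 2^53) have a fixed exponent, so the addition itself performs the nearest-even rounding and leaves the integer in the low mantissa bits.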

ARM64 processors

  • i32x4.nearest_sat_f32x4_s
    • y = i32x4.nearest_sat_f32x4_s(x) is lowered to FCVTNS Vy.4S, Vx.4S
  • i32x4.nearest_sat_f32x4_u
    • y = i32x4.nearest_sat_f32x4_u(x) is lowered to FCVTNU Vy.4S, Vx.4S
  • i32x4.nearest_sat_f64x2_s_zero
    • y = i32x4.nearest_sat_f64x2_s_zero(x) is lowered to:
      • FCVTNS Vy.2D, Vx.2D
      • SQXTN Vy.2S, Vy.2D
  • i32x4.nearest_sat_f64x2_u_zero
    • y = i32x4.nearest_sat_f64x2_u_zero(x) is lowered to:
      • FCVTNU Vy.2D, Vx.2D
      • UQXTN Vy.2S, Vy.2D
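The SQXTN/UQXTN steps are what supply the saturation required by the wasm semantics. As a sketch, the signed narrowing applied to each 64-bit lane produced by FCVTNS is (function name illustrative):

```c
#include <stdint.h>

/* Scalar sketch of the SQXTN step: narrow a 64-bit lane to 32 bits
 * with signed saturation. */
static int32_t sqxtn_lane(int64_t v) {
  if (v > (int64_t)INT32_MAX) return INT32_MAX;
  if (v < (int64_t)INT32_MIN) return INT32_MIN;
  return (int32_t)v; /* in range: exact */
}
```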

ARM processors with ARMv8 (32-bit) instruction set

  • i32x4.nearest_sat_f32x4_s
    • y = i32x4.nearest_sat_f32x4_s(x) is lowered to VCVTN.S32.F32 Qy, Qx
  • i32x4.nearest_sat_f32x4_u
    • y = i32x4.nearest_sat_f32x4_u(x) is lowered to VCVTN.U32.F32 Qy, Qx
  • i32x4.nearest_sat_f64x2_s_zero
    • y = i32x4.nearest_sat_f64x2_s_zero(x) is lowered to:
      • FCVTNS Vy.2D, Vx.2D
      • SQXTN Vy.2S, Vy.2D
  • i32x4.nearest_sat_f64x2_u_zero
    • y = i32x4.nearest_sat_f64x2_u_zero(x) is lowered to:
      • FCVTNU Vy.2D, Vx.2D
      • UQXTN Vy.2S, Vy.2D

ARM processors with ARMv7 (32-bit) instruction set

  • i32x4.nearest_sat_f32x4_s
    • y = i32x4.nearest_sat_f32x4_s(x) (y is NOT x) is lowered to
      • VMOV.I32 Qtmp, 0x80000000
      • VMOV.F32 Qy, 0x4B000000
      • VBSL Qtmp, Qx, Qy
      • VADD.F32 Qy, Qx, Qtmp
      • VSUB.F32 Qy, Qy, Qtmp
      • VACLT.F32 Qtmp, Qx, Qtmp
      • VBSL Qtmp, Qy, Qx
      • VCVT.S32.F32 Qy, Qtmp
  • i32x4.nearest_sat_f32x4_u
    • y = i32x4.nearest_sat_f32x4_u(x) (y is NOT x) is lowered to
      • VMOV.I32 Qtmp, 0x4B000000
      • VADD.F32 Qy, Qx, Qtmp
      • VSUB.F32 Qy, Qy, Qtmp
      • VCLT.U32 Qtmp, Qx, Qtmp
      • VBSL Qtmp, Qy, Qx
      • VCVT.U32.F32 Qy, Qtmp
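A scalar C rendition of the signed ARMv7 sequence above may clarify the trick (this is a sketch; NEON additionally saturates out-of-range and NaN inputs in VCVT, where a C cast would be undefined, so the sketch assumes in-range inputs). The constant 0x4B000000 is the bit pattern of 2^23 as a float.

```c
#include <stdint.h>

/* Sketch of the ARMv7 i32x4.nearest_sat_f32x4_s sequence:
 * copysign(2^23, x) is built with a VBSL on the sign bit, the add/sub
 * pair rounds to nearest-even, and a second VBSL keeps original lanes
 * with |x| >= 2^23, which are already integral. */
static int32_t nearest_f32_s_armv7_style(float x) {
  union { float f; uint32_t u; } xb = { x }, mb;
  mb.u = (xb.u & 0x80000000u) | 0x4B000000u; /* VBSL: copysign(2^23, x) */
  float magic = mb.f;
  float rounded = (x + magic) - magic;       /* VADD.F32 then VSUB.F32 */
  /* VACLT.F32: select the rounded value only where |x| < 2^23 */
  float selected = (xb.u & 0x7FFFFFFFu) < 0x4B000000u ? rounded : x;
  return (int32_t)selected;                  /* VCVT.S32.F32 (already integral) */
}
```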

@Maratyszcza Maratyszcza changed the title [WIP] i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u Jun 7, 2020
@dtig
Member

dtig commented Jun 9, 2020

These instructions are widely useful, and I agree that these are hard to emulate without operations being explicitly exposed. The suboptimal mapping on Intel hardware has been contentious in the past for the conversion and other operations, but we have included them as there's no good way to emulate them - explicitly asking @arunetm and other Intel folks for opinions here.

@zeux
Contributor

zeux commented Jun 10, 2020

Just noting that in all my high performance kernels that needed f32->i32 conversion, I had to stay away from the "native" Wasm instructions due to the large overhead.

I have three kernels that need this instruction; they run at 2 GB/s, 3.6 GB/s and 2.6 GB/s when using a fast emulation. When using the "native" instruction on latest v8, I get 1.25 GB/s, 2.4 GB/s and 2.1 GB/s, a very significant and noticeable penalty, especially since the "native" code doesn't perform the rounding that's required and that I get for free in the emulated version, so the real perf delta is even larger. As usual, these aren't microbenchmarks, and the conversion is merely part of the computational chain.

The instructions proposed here would perhaps help a bit, in that my emulation could at least be tested against a native rounding instruction, and I'd expect these to perform similarly to the existing variants. But I expect them to be similarly not useful for performance-sensitive code unless it's impossible to implement the algorithm without them.

That's not really an objection to adding these, as these instructions aren't worse than what we already have, merely an observation. In the examples linked I believe the expectation is that the lowering is much more optimal than the one proposed (because of the differences in handling saturation/NaNs).

@zeux
Contributor

zeux commented Jun 10, 2020

(on a less pessimistic note, if we decide to go ahead with these I'd be happy to contribute the kernels above as benchmarks for perf evaluation, we could compare "manual" rounding (adding 0.5 with the proper sign and using truncate), "assisted" rounding (using new fp32 rounding + truncate), proposed direct rounding and the fast emulation)
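For context on the rounding strategies mentioned above, "manual" rounding (adding 0.5 with the sign of the input and truncating) is not the same operation as round-to-nearest ties-to-even: the two disagree on exact .5 ties. A hypothetical sketch, not code from this thread:

```c
#include <stdint.h>

/* "Manual" rounding: add 0.5 with the sign of the input, then
 * truncate. Disagrees with ties-to-even on exact .5 ties (2.5 -> 3
 * here, but 2 under ties-to-even), and x + 0.5f itself rounds, so it
 * can also bump 0.49999997f up to 1. Assumes in-range inputs, since a
 * C cast of an out-of-range float is undefined. */
static int32_t manual_round(float x) {
  return (int32_t)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}
```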

@arunetm
Collaborator

arunetm commented Jun 10, 2020

I think the current state of the spec makes it too risky to include these. The x86 mapping for these instructions looks concerning, with the largest gap yet in instruction count (16 and 7). We already have significant asymmetry in the spec with respect to the cost of op implementations on x86. I am afraid that including these significantly increases the risk of hiding higher perf penalties/regressions on one popular platform vs. others, limiting their usability for developers and moving away from the spec goals.
The instructions look useful from a convenience standpoint, but the use-case perf benefits and tradeoffs seem unclear. Even if included, the perf cost on x86 may force users to rely on emulations, as @zeux pointed out, negating any benefits. Thanks @zeux for sharing the info.
I suggest not including these in the current spec and reconsidering them post-MVP given their usefulness.

@Maratyszcza
Contributor Author

Maratyszcza commented Jun 10, 2020

@arunetm please note that the x86 lowering differs only by one instruction (CVTTPS2DQ -> CVTPS2DQ) from the lowering of i32x4.trunc_sat_f32x4_s and i32x4.nearest_sat_f32x4_u instructions already in the spec. The other instructions are needed to handle the difference between the out-of-bounds behavior of x86 conversion instructions (return INT32_MIN) and WAsm conversion instructions (saturate between INT32_MIN and INT32_MAX and convert NaN to 0), and to simulate the floating-point -> unsigned integer conversion missing on pre-AVX512 x86.

Without i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u in the spec, developers would implement them as i32x4.trunc_sat_f32x4_s(f32x4.nearest(x)) and i32x4.trunc_sat_f32x4_u(f32x4.nearest(x)), which is strictly worse on all platforms, but particularly so on pre-SSE4 x86.

@arunetm
Collaborator

arunetm commented Jun 10, 2020

lowering of i32x4.trunc_sat_f32x4_s and i32x4.nearest_sat_f32x4_u instructions already in the spec.

Did you mean i32x4.trunc_sat_f32x4_s & i32x4.trunc_sat_f32x4_u here? Unfortunately, these ops are highly expensive to implement on x86, and we have open issues discussing their tradeoffs (#173). We need to clearly understand their real-world implications regarding perf cliffs before including trunc instructions. We may be compounding the problem by adding new ops that rely on these. Given that we cannot assume broad availability of AVX-512 instructions, the extra cost of handling out-of-bounds inputs will make it even worse.

Agree that not including these will force developers to choose workarounds that may not be ideal on certain platforms. IMO, it's a better tradeoff than letting them be vulnerable to hidden perf cliffs on certain common platforms when they rely on SIMD anticipating consistent performance gains. Also, runtimes always have the option of adding platform-specific optimizations in these cases, where developer expectations will not be broken by the spec and only enhanced by implementers.

@tlively
Member

tlively commented Oct 7, 2020

@Maratyszcza can you assign these proposed instructions new opcodes? The current opcodes conflict with pmin/pmax.

@dtig
Member

dtig commented Feb 4, 2021

Adding a preliminary vote for the inclusion of i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u to the SIMD proposal below. Please vote with -

👍 For including i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u operations
👎 Against including i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u operations

@Maratyszcza Maratyszcza changed the title i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u Floating-Point to Nearest Integer Conversions Feb 5, 2021
@Maratyszcza
Contributor Author

Added double-precision variants similar to #383. These instructions introduce floating-point-to-integer conversions that round to nearest-even rather than truncate. The lowering of these instructions on x86 and ARM64 differs only by one instruction from the truncation variants, and is more efficient than simulation. Lowering on x86 is somewhat inefficient due to the special handling of out-of-bounds inputs (albeit no more inefficient than the existing trunc_sat conversion instructions), but this would be fixed by the subsequent Fast SIMD proposal.

@zeux
Contributor

zeux commented Feb 5, 2021

@Maratyszcza Please update the PR text so that it's clear that this is now proposing 4 instructions

@dtig
Member

dtig commented Mar 5, 2021

Closing as per #436.

Labels
needs discussion Proposal with an unclear resolution

5 participants