Implements f64x2.convert_low_i32x4_u for x64 #2982
Conversation
Force-pushed from 7a4fb14 to 6fd70b0
Inst::new(
    "fcvt_low_from_uint",
    r#"
    Converts packed unsigned doubleword integers to packed double precision floating point.
Please use 32-bit integers instead of doubleword integers. You should also update the fcvt_low_from_sint description.
@akirilov-arm good catch, that can be ambiguous. Thanks!
Yes, exactly, the term "doubleword" means "64-bit" in the Arm architecture.
Force-pushed from 9af1890 to 90d4afc
Inst::new(
    "fcvt_low_from_uint",
    r#"
    Converts packed unsigned 32-bit integers to packed double precision floating point.
Can you copy the docs from fcvt_low_from_sint?
Force-pushed from 30210f6 to 8e88727
@@ -4411,6 +4411,19 @@ pub(crate) fn define(
        .operands_out(vec![a]),
    );

    ig.push(
Is there any specific reason to add a new IR operation? Correct me if I am wrong, but this is equivalent to uwiden_low + fcvt_from_uint, which should be straightforward to pattern-match in the backend (if it was a sequence of, say, 10 instructions, then a new IR operation would be perfectly understandable). I realize that this has already been done for fcvt_low_from_sint, so I guess the same question applies to it.
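In CLIF terms, the suggested decomposition would look roughly like the sketch below, with v0 standing for the incoming i32x4 value (a sketch of the equivalence being described, in the same notation used for the trunc_sat examples later in this thread; the exact type annotation required by the verifier is an assumption):

    v1 = uwiden_low v0
    v2 = fcvt_from_uint.f64x2 v1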
Hi @akirilov-arm, good question. I did the lowering for fcvt_low_from_sint but don't remember the reasoning; most likely I simply did not realize swiden_low could be used. This does raise the question of whether the goal is to minimize the number of instructions here by reusing existing ones and mapping many-to-one as much as possible, which has the consequence of a more generic definition of the instruction, or whether we instead want to be more specific in our instruction names and definitions and tie them closely to the mapped Wasm instruction. As you mention, how involved the pattern matching gets when instructions are shared may be the decider. I'll push another patch removing these new instructions and instead attempt to lower using uwiden_low and swiden_low.
@akirilov-arm I remember now: we were mimicking the implementation of F32x4ConvertI32x4S but couldn't, or didn't want to, use that same instruction. I think uwiden_low was never considered. Rereading your question, I am not sure what you mean by uwiden_low + fcvt_from_uint, as I initially thought you were just asking why I didn't reuse uwiden_low. Can you explain what you mean by uwiden_low + fcvt_from_uint with respect to the instruction definition in instructions.rs and code_translator.rs? The patch I just pushed should address all other comments except this one.
What I mean is that no changes would be necessary to instructions.rs. As for code_translator.rs, it would need to do something like:

    Operator::F64x2ConvertLowI32x4U => {
        let a = pop1_with_bitcast(state, I32X4, builder);
        let widened_a = builder.ins().uwiden_low(a);
        state.push1(builder.ins().fcvt_from_uint(F64X2, widened_a));
    }
Then if a backend needs to do anything special for that combination, it will match the pattern.
Hi @akirilov-arm, OK. I am following the other instructions as examples and don't see the others doing this. What does it mean to call two instruction builders here? I tried the above, and it wants something implemented for both uwiden_low and fcvt_from_uint. Is this what you are suggesting?
@akirilov-arm I see a mergable_load call that I assume doesn't apply here? Is it really best to try to merge instructions instead of just mapping explicitly 1:1 to the Wasm instruction as it is now? I don't think I follow exactly, but if I understand your suggestion, new logic would be needed in input_to_reg_mem that would get executed for every instruction calling that function. I guess the logic would check whether the instruction does a widen and a convert? That seems less straightforward. What's the advantage of avoiding the 1:1 mapping?
I haven't worked on the x64 backend, so I can't make specific implementation suggestions - my guess is that it would be similar to the AArch64 code, i.e. in the implementation of fcvt_from_uint you would check if the input is uwiden_low and act accordingly (input_to_reg_mem() doesn't seem like the right place). My point is that matching is possible and in this case easy because we need to match only one input, not a whole expression tree.
I am working on enabling the i16x8.q15mulr_sat_s and i32x4.trunc_sat_f64x2_s_zero operations, which also happen to be expressible in terms of existing IR operations. However, they result in much more complicated patterns, which are no longer easy to match, so in those cases I would not make the same suggestion.
The main advantage of avoiding the 1:1 mapping is that it reduces the work necessary for all backends to get basic support for the operation - once they implement the 2 other operations, which they are going to do anyway because they are required by the Wasm SIMD specification, they get f64x2.convert_low_i32x4_u for free (and the same would apply to f64x2.convert_low_i32x4_s if it is changed similarly). Also, the pattern might occur organically in the Wasm code and would be handled properly. There is probably an argument to be made that enums should be kept as lean as possible, and that other applications that try to generate or manipulate Cranelift IR (say, another compiler frontend or a peephole optimizer) would have an easier job because they will have fewer things to consider, but these are lesser concerns.
With all that said, I am definitely not categorically opposed to adding an IR operation, so perhaps it is best to hear a third opinion.
@jlb6740 we had some discussion on this topic last year -- see #2278 and #2376.
I think in general we want to limit new instructions at the CLIF level to those that are truly necessary. While combo-instructions make it easier to plumb through the new thing and get exactly the semantics you want, they impose ongoing maintenance cost and cost on new backend implementations, as the IR-level instruction set becomes more complex. Also, the operator combos that have efficient lowerings may differ between architectures; we then end up with the union of all combo-instructions, and while some backends will have efficient lowerings for the ones that were purpose-built for them, the other backends will have to add new logic that could have come from the combination of simpler ops automatically, as @akirilov-arm says. In general, pattern-matching is a good way to implement better lowerings for some combos without taking on this cost, I think.
In this particular case, the general helpers (e.g. input_to_reg_mem()) don't need to change at all; instead there would be some logic in the lowering case for fcvt_from_uint that looks for a uwiden_low. The matches_input() helper should be useful here.
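To make the shape of that check concrete, here is a deliberately simplified, self-contained toy (it does not use Cranelift's actual lowering APIs; the node representation and names are invented for illustration). It demonstrates the single-input matching being discussed: when lowering the conversion, peek at the producer of its one input and choose a fused lowering if that producer is the widening op.

    // Toy illustration only: not Cranelift's real lowering API.
    #[derive(Clone, Copy, PartialEq)]
    enum Op {
        UwidenLow,
        FcvtFromUint,
        Other,
    }

    // Each node records its opcode and (at most) one input, by index.
    struct Node {
        op: Op,
        input: Option<usize>,
    }

    // Decide how to lower a FcvtFromUint node by inspecting its single input.
    fn lower(nodes: &[Node], idx: usize) -> &'static str {
        match nodes[idx].op {
            Op::FcvtFromUint => {
                let input_is_widen = nodes[idx]
                    .input
                    .map(|i| nodes[i].op == Op::UwidenLow)
                    .unwrap_or(false);
                if input_is_widen {
                    "fused unsigned i32x4 -> f64x2 conversion"
                } else {
                    "generic fcvt_from_uint lowering"
                }
            }
            _ => "generic lowering",
        }
    }

    fn main() {
        // v0 = (some i32x4 value); v1 = uwiden_low v0; v2 = fcvt_from_uint v1
        let nodes = vec![
            Node { op: Op::Other, input: None },
            Node { op: Op::UwidenLow, input: Some(0) },
            Node { op: Op::FcvtFromUint, input: Some(1) },
        ];
        assert_eq!(lower(&nodes, 2), "fused unsigned i32x4 -> f64x2 conversion");
    }

The point of the toy is that only one producer needs to be inspected, so the match stays cheap compared with matching a whole expression tree.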
@cfallin It's probably reasonable to consider a "complexity bound" on the necessary matching, though, as I said - for instance, i32x4.trunc_sat_f64x2_s_zero is equivalent to:

    v1 = fcvt_to_sint_sat.i64x2 v0
    v2 = vconst.i64x2 0
    v3 = snarrow v1, v2

or:

    v1 = fcvt_to_sint_sat.i64x2 v0
    v2 = iconst.i64 0
    v3 = splat.i64x2 v2
    v4 = snarrow v1, v3

(and maybe I am forgetting another straightforward way to generate a vector of zeros)
I would lean towards introducing a new IR operation in that case.
Yes, definitely, the matching has some dynamic cost (in compile time and IR memory usage), so there's a tradeoff. When it's a simple A+B combo op as here, it seems reasonable to me to pattern-match the composition, but we should take it on a case-by-case basis. I'd be curious whether, in your examples, the lowering can give a more optimized instruction sequence for those 3/4 ops together than what would fall out of the simple lowering, but we can save that discussion for a future PR :-)
Force-pushed from 8e88727 to 52f8f89
    Considering only the low half of the register, each lane in `x` is interpreted as a
    unsigned 32-bit integer that is then converted to a double precision float. This
    which are converted to occupy twice the number of bits. No rounding should be needed
I think you missed a line in the description here.
@akirilov-arm @cfallin Hi, with some offline help from @cfallin I was able to understand what this should look like in the lowering. I used the existing algorithm, but I will investigate doing it another way that uses fewer instructions. I did not try to refactor any other lowerings such as f64x2.convert_low_i32x4_s, since this PR is about implementing f64x2.convert_low_i32x4_u and I figure we want to at least get the remaining instructions finished before optimizing and refactoring previous ones. That said, I do plan to refactor f64x2.convert_low_i32x4_s and maybe others in a different PR, if only for symmetry. Let me know if there is anything else needed for this PR.
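For readers unfamiliar with how x64 code generators commonly handle unsigned i32 -> f64 conversion without a dedicated instruction: one classic sequence places each 32-bit integer in the low mantissa bits of a double whose bit pattern encodes 2^52 and then subtracts 2^52, recovering the value exactly. Whether this PR's lowering emits exactly that sequence is an assumption; the scalar Rust sketch below only demonstrates why the trick is exact.

    // Scalar sketch of the "magic number" trick often used for unsigned
    // int -> double conversion on x64 (the vectorized form interleaves the
    // i32 lanes with the high half of 2^52 and then does a packed subtract).
    fn u32_to_f64_via_magic(x: u32) -> f64 {
        // Bit pattern of the double 2^52: exponent field only, zero mantissa.
        const MAGIC_BITS: u64 = 0x4330_0000_0000_0000;
        // OR-ing x into the low 32 mantissa bits yields exactly 2^52 + x,
        // because x < 2^32 fits in the 52-bit mantissa without rounding.
        let biased = f64::from_bits(MAGIC_BITS | x as u64);
        biased - f64::from_bits(MAGIC_BITS)
    }

    fn main() {
        for &x in &[0u32, 1, 0x8000_0000, u32::MAX] {
            assert_eq!(u32_to_f64_via_magic(x), x as f64);
        }
        println!("magic-number conversion matches the native casts");
    }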
Force-pushed from 3dabf08 to 6b5871c
Force-pushed from 81b4245 to 839b042
I can't really judge the x64-specific implementation, but the rest looks good to me.