
Implements f64x2.convert_low_i32x4_u for x64 #2982

Merged: 3 commits, Jul 9, 2021

Conversation

jlb6740
Contributor

@jlb6740 jlb6740 commented Jun 14, 2021

No description provided.

@jlb6740 jlb6740 force-pushed the implement_fcvt_low_from_unit branch from 7a4fb14 to 6fd70b0 Compare June 14, 2021 15:54
Inst::new(
"fcvt_low_from_uint",
r#"
Converts packed unsigned doubleword integers to packed double precision floating point.
Contributor

Please, use 32-bit integers instead of doubleword integers.

You should also update the fcvt_low_from_sint description.

Contributor Author

@jlb6740 jlb6740 Jun 14, 2021

@akirilov-arm good catch, that can be ambiguous. Thanks!

Contributor

Yes, exactly, the term "doubleword" means "64-bit" in the Arm architecture.

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen cranelift:meta Everything related to the meta-language. cranelift:wasm labels Jun 14, 2021
@jlb6740 jlb6740 force-pushed the implement_fcvt_low_from_unit branch 2 times, most recently from 9af1890 to 90d4afc Compare June 14, 2021 18:16
Inst::new(
"fcvt_low_from_uint",
r#"
Converts packed unsigned 32-bit integers to packed double precision floating point.
Contributor

Can you copy the docs from fcvt_low_from_sint?

@jlb6740 jlb6740 force-pushed the implement_fcvt_low_from_unit branch 2 times, most recently from 30210f6 to 8e88727 Compare June 14, 2021 19:56
@@ -4411,6 +4411,19 @@ pub(crate) fn define(
.operands_out(vec![a]),
);

ig.push(
Contributor

Is there any specific reason to add a new IR operation? Correct me if I am wrong, but this is equivalent to uwiden_low + fcvt_from_uint, which should be straightforward to pattern-match in the backend (if it was a sequence of, say, 10 instructions, then a new IR operation would be perfectly understandable).

I realize that this has already been done for fcvt_low_from_sint, so I guess the same question applies to it.

Contributor Author

Hi @akirilov-arm, good question. I did the lowering for fcvt_low_from_sint but don't remember the reasoning; likely I simply did not realize swiden_low could be used. This does raise a question, though: is the goal to minimize the number of IR instructions by reusing and mapping many-to-one as much as possible, which has the consequence of more generic instruction definitions? Or do we instead want to be more specific in our instruction names and definitions, tying them closely to the mapped Wasm instruction? As you mention, how involved the pattern matching gets when instructions are shared may be the deciding factor. I'll push another patch removing these new instructions and instead attempt to lower using uwiden_low and swiden_low.

Contributor Author

@akirilov-arm I remember now: we were mimicking the implementation of F32x4ConvertI32x4S, but couldn't use, or didn't want to use, that same instruction; I think uwiden_low was never considered. Rereading your question, I am not sure what you mean by uwiden_low + fcvt_from_uint, as I initially thought you were just asking why I didn't reuse uwiden_low. Can you explain what you mean by uwiden_low + fcvt_from_uint w.r.t. the instruction definitions in instructions.rs and code_translator.rs? The patch I just pushed should address all other comments except this one.

Contributor

What I mean is that no changes would be necessary to instructions.rs. As for code_translator.rs, it would need to do something like:

Operator::F64x2ConvertLowI32x4U => {
    let a = pop1_with_bitcast(state, I32X4, builder);
    let widened_a = builder.ins().uwiden_low(a);
    state.push1(builder.ins().fcvt_from_uint(F64X2, widened_a));
}

Then if a backend needs to do anything special for that combination, it will match the pattern.
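For intuition, the composed semantics can be written out lane by lane in plain Rust (a scalar sketch, not Cranelift code; the function name is just for illustration). The two low unsigned 32-bit lanes are widened and converted, and each u32 value is exactly representable as an f64:

```rust
// Scalar sketch of f64x2.convert_low_i32x4_u (i.e. uwiden_low followed by
// fcvt_from_uint): take the two low unsigned 32-bit lanes and convert each
// to double precision. No rounding occurs, since every u32 fits in an f64.
fn convert_low_i32x4_u(x: [u32; 4]) -> [f64; 2] {
    [x[0] as f64, x[1] as f64]
}

fn main() {
    // u32::MAX converts exactly to 4294967295.0
    println!("{:?}", convert_low_i32x4_u([u32::MAX, 1, 7, 9]));
}
```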

Contributor Author

Hi @akirilov-arm, OK. I am following the other instructions as an example and don't see any of them doing this. What does it mean to call two instruction builders here? I tried the above, and it wants implementations for uwiden_low and fcvt_from_uint. Is this what you are suggesting?

Contributor Author

@akirilov-arm I see a mergable_load call that I assume doesn't apply here? Is it really best to try to merge instructions instead of just mapping explicitly 1:1 to the Wasm instruction, as it is now? I don't know that I follow exactly, but I think that, following your suggestion, new logic would be needed in input_to_reg_mem that would be executed for every instruction calling that function; I guess the logic would check whether the instruction does a widen and convert? That seems less straightforward. What's the advantage of avoiding the 1:1 mapping?

Contributor

I haven't worked on the x64 backend, so I can't make specific implementation suggestions - my guess is that it would be similar to the AArch64 code, i.e. in the implementation of fcvt_from_uint you would check if the input is uwiden_low and act accordingly (input_to_reg_mem() doesn't seem like the right place). My point is that matching is possible and in this case easy because we need to match only one input, not a whole expression tree.

I am working on enabling the i16x8.q15mulr_sat_s and i32x4.trunc_sat_f64x2_s_zero operations, which also happen to be expressible in terms of existing IR operations. However, they result in much more complicated patterns, which are no longer easy to match, so in those cases I would not make the same suggestion.

The main advantage of avoiding the 1:1 mapping is that it reduces the work necessary for all backends to get basic support for the operation - once they implement 2 other operations, which they are going to do anyway because it is required by the Wasm SIMD specification, they get f64x2.convert_low_i32x4_u for free (and the same would apply to f64x2.convert_low_i32x4_s if it is changed similarly). Also, the pattern might occur organically in the Wasm code and would be handled properly. There is probably an argument to be made that enums should be kept as lean as possible, and that other applications that try to generate or manipulate Cranelift IR (say, another compiler frontend or a peephole optimizer) would have an easier job because they will have fewer things to consider, but these are lesser concerns.

With all that said, I am definitely not categorically opposed to adding an IR operation, so perhaps it is best to hear a third opinion.

cc @abrown @bnjbvr @bjorn3 @cfallin @uweigand

Member

@jlb6740 we had some discussion on this topic last year -- see #2278 and #2376.

I think in general, we want to try to limit new instructions we add at the CLIF level to those that are truly necessary; while combo-instructions make it easier to plumb through the new thing and get exactly the semantics you want, they impose ongoing maintenance cost and cost on new backend implementations, as the IR-level instruction set becomes more complex. Also, the operator combos for which there are efficient lowerings may differ between architectures; we then end up with the union of all combo-instructions, and while some backends will have efficient lowerings for the ones that were purpose-built for them, the other backends will have to add new logic that could have come from the combination of simpler ops automatically, as @akirilov-arm says. In general, pattern-matching is a good way to implement better lowerings for some combos without taking on this cost, I think.

In this particular case, the general helpers (e.g. input_to_reg_mem()) don't need to change at all; instead there would be some logic in the lowering case for fcvt_from_uint that looks for a uwiden_low. The matches_input() helper should be useful here.
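A rough sketch of that suggestion (hypothetical pseudocode following the matches_input() idea above, not the actual PR code; names and signatures are approximate):

```
// Hypothetical sketch inside the x64 lowering for Opcode::FcvtFromUint.
// If the operand is produced by uwiden_low, emit the fused
// f64x2.convert_low_i32x4_u sequence; otherwise fall back to the
// generic unsigned-integer-to-float lowering.
Opcode::FcvtFromUint => {
    if let Some(uwiden) = matches_input(ctx, inputs[0], Opcode::UwidenLow) {
        // ... emit the specialized conversion using uwiden's input ...
    } else {
        // ... generic fcvt_from_uint lowering ...
    }
}
```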

Contributor

@cfallin It's probably reasonable to consider a "complexity bound" on the necessary matching, though, as I said - for instance, i32x4.trunc_sat_f64x2_s_zero is equivalent to:

v1 = fcvt_to_sint_sat.i64x2 v0
v2 = vconst.i64x2 0
v3 = snarrow v1, v2

or:

v1 = fcvt_to_sint_sat.i64x2 v0
v2 = iconst.i64 0
v3 = splat.i64x2 v2
v4 = snarrow v1, v3

(and maybe I am forgetting another straightforward way to generate a vector of zeros)
I would lean towards introducing a new IR operation in that case.
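For comparison, the scalar semantics of that sequence can be sketched in plain Rust (an illustration, not Cranelift code): a saturating f64-to-i64 conversion, then a saturating narrow to i32 against a zero vector for the upper lanes.

```rust
// Scalar sketch of i32x4.trunc_sat_f64x2_s_zero: fcvt_to_sint_sat.i64x2
// followed by snarrow with a zero vector. Rust's float-to-int `as` casts
// already saturate (and map NaN to 0), matching the saturating IR ops.
fn trunc_sat_f64x2_s_zero(x: [f64; 2]) -> [i32; 4] {
    let wide = [x[0] as i64, x[1] as i64]; // fcvt_to_sint_sat.i64x2
    let narrow = |v: i64| v.clamp(i32::MIN as i64, i32::MAX as i64) as i32; // snarrow
    [narrow(wide[0]), narrow(wide[1]), 0, 0] // upper lanes from the zero vector
}

fn main() {
    println!("{:?}", trunc_sat_f64x2_s_zero([1e20, -3.7]));
}
```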

Member

Yes, definitely, the matching has some dynamic cost (in compile time and IR memory usage), so there's a tradeoff. When it's a simple A+B combo op as here, it seems reasonable to me to pattern-match the composition, but we should take it on a case-by-case basis. I'd be curious in your examples whether the lowering can give a more optimized instruction sequence for those 3/4 ops together than what would fall out of the simple lowering, but we can save such discussion for a future PR :-)


@jlb6740 jlb6740 force-pushed the implement_fcvt_low_from_unit branch from 8e88727 to 52f8f89 Compare June 16, 2021 01:38

Considering only the low half of the register, each lane in `x` is interpreted as a
unsigned 32-bit integer that is then converted to a double precision float. This
which are converted to occupy twice the number of bits. No rounding should be needed
Contributor
I think you missed a line in the description here.

Contributor Author

@jlb6740 jlb6740 Jul 8, 2021

@akirilov-arm @cfallin Hi, with some offline help from @cfallin I was able to understand what this should look like in the lowering. I used the existing algorithm, but will investigate another approach that uses fewer instructions. I did not try to refactor any other lowerings, such as f64x2.convert_low_i32x4_s, since this PR is about implementing f64x2.convert_low_i32x4_u, and I figure we want to at least finish the others before optimizing and refactoring previous instructions. That said, I do plan to refactor f64x2.convert_low_i32x4_s, and maybe others, in a different PR, if only for symmetry. Let me know if anything else is needed for this PR.

@jlb6740 jlb6740 force-pushed the implement_fcvt_low_from_unit branch 2 times, most recently from 3dabf08 to 6b5871c Compare July 8, 2021 19:37
@jlb6740 jlb6740 force-pushed the implement_fcvt_low_from_unit branch 2 times, most recently from 81b4245 to 839b042 Compare July 8, 2021 20:26
@jlb6740 jlb6740 requested a review from akirilov-arm July 9, 2021 02:13
Contributor

@akirilov-arm akirilov-arm left a comment

I can't really judge the x64-specific implementation, but the rest looks good to me.

@jlb6740 jlb6740 merged commit d8e8132 into bytecodealliance:main Jul 9, 2021
4 participants