Implements f64x2.convert_low_i32x4_u for x64 #2982
Conversation
Force-pushed from 7a4fb14 to 6fd70b0
Inst::new(
    "fcvt_low_from_uint",
    r#"
    Converts packed unsigned doubleword integers to packed double precision floating point.
Please use 32-bit integers instead of doubleword integers. You should also update the fcvt_low_from_sint description.
@akirilov-arm good catch, that can be ambiguous. Thanks!
Yes, exactly, the term "doubleword" means "64-bit" in the Arm architecture.
Force-pushed from 9af1890 to 90d4afc
Inst::new(
    "fcvt_low_from_uint",
    r#"
    Converts packed unsigned 32-bit integers to packed double precision floating point.
Can you copy the docs from fcvt_low_from_sint?
Force-pushed from 30210f6 to 8e88727
@@ -4411,6 +4411,19 @@ pub(crate) fn define(
        .operands_out(vec![a]),
    );

    ig.push(
Is there any specific reason to add a new IR operation? Correct me if I am wrong, but this is equivalent to uwiden_low + fcvt_from_uint, which should be straightforward to pattern-match in the backend (if it was a sequence of, say, 10 instructions, then a new IR operation would be perfectly understandable). I realize that this has already been done for fcvt_low_from_sint, so I guess the same question applies to it.
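In CLIF terms, the suggested decomposition would look roughly like the sketch below, with v0 standing for the incoming i32x4 value (a sketch of the equivalence being described, in the same notation used for the trunc_sat examples later in this thread; the exact type annotation required by the verifier is an assumption):

    v1 = uwiden_low v0
    v2 = fcvt_from_uint.f64x2 v1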
Hi @akirilov-arm, good question. I did the lowering for fcvt_low_from_sint but don't remember the reasoning; most likely I simply did not realize swiden_low could be used. This does raise the question of whether the goal is to minimize the number of instructions here by reusing existing ones and mapping many-to-one as much as possible, which has the consequence of a more generic definition of the instruction, or whether we instead want to be more specific in our instruction names and definitions and tie them closely to the mapped Wasm instruction. As you mention, how involved the pattern matching gets when instructions are shared may be the decider. I'll push another patch removing these new instructions and instead attempt to lower using uwiden_low and swiden_low.
@akirilov-arm I remember now: we were mimicking the implementation of F32x4ConvertI32x4S but couldn't, or didn't want to, use that same instruction. I think uwiden_low was never considered. Rereading your question, I am not sure what you mean by uwiden_low + fcvt_from_uint, as I initially thought you were just asking why I didn't reuse uwiden_low. Can you explain what you mean by uwiden_low + fcvt_from_uint with respect to the instruction definition in instructions.rs and code_translator.rs? The patch I just pushed should address all other comments except this one.
What I mean is that no changes would be necessary to instructions.rs. As for code_translator.rs, it would need to do something like:

    Operator::F64x2ConvertLowI32x4U => {
        let a = pop1_with_bitcast(state, I32X4, builder);
        let widened_a = builder.ins().uwiden_low(a);
        state.push1(builder.ins().fcvt_from_uint(F64X2, widened_a));
    }
Then if a backend needs to do anything special for that combination, it will match the pattern.
Hi @akirilov-arm, OK. I am following the other instructions as examples and don't see the others doing this. What does it mean to call two instruction builders here? I tried the above, and it wants something implemented for both uwiden_low and fcvt_from_uint. Is this what you are suggesting?
@akirilov-arm I see a mergable_load call that I assume doesn't apply here? Is it really best to try to merge instructions instead of just mapping explicitly 1:1 to the Wasm instruction as it is now? I don't think I follow exactly, but if I understand your suggestion, new logic would be needed in input_to_reg_mem that would get executed for every instruction calling that function. I guess the logic would check whether the instruction does a widen and a convert? That seems less straightforward. What's the advantage of avoiding the 1:1 mapping?
I haven't worked on the x64 backend, so I can't make specific implementation suggestions - my guess is that it would be similar to the AArch64 code, i.e. in the implementation of fcvt_from_uint you would check if the input is uwiden_low and act accordingly (input_to_reg_mem() doesn't seem like the right place). My point is that matching is possible and in this case easy because we need to match only one input, not a whole expression tree.
I am working on enabling the i16x8.q15mulr_sat_s and i32x4.trunc_sat_f64x2_s_zero operations, which also happen to be expressible in terms of existing IR operations. However, they result in much more complicated patterns, which are no longer easy to match, so in those cases I would not make the same suggestion.
The main advantage of avoiding the 1:1 mapping is that it reduces the work necessary for all backends to get basic support for the operation - once they implement the 2 other operations, which they are going to do anyway because they are required by the Wasm SIMD specification, they get f64x2.convert_low_i32x4_u for free (and the same would apply to f64x2.convert_low_i32x4_s if it is changed similarly). Also, the pattern might occur organically in the Wasm code and would be handled properly. There is probably an argument to be made that enums should be kept as lean as possible, and that other applications that try to generate or manipulate Cranelift IR (say, another compiler frontend or a peephole optimizer) would have an easier job because they will have fewer things to consider, but these are lesser concerns.
With all that said, I am definitely not categorically opposed to adding an IR operation, so perhaps it is best to hear a third opinion.
@jlb6740 we had some discussion on this topic last year -- see #2278 and #2376.
I think in general we want to limit new instructions at the CLIF level to those that are truly necessary. While combo-instructions make it easier to plumb through the new thing and get exactly the semantics you want, they impose ongoing maintenance cost and cost on new backend implementations, as the IR-level instruction set becomes more complex. Also, the operator combos that have efficient lowerings may differ between architectures; we then end up with the union of all combo-instructions, and while some backends will have efficient lowerings for the ones that were purpose-built for them, the other backends will have to add new logic that could have come from the combination of simpler ops automatically, as @akirilov-arm says. In general, pattern-matching is a good way to implement better lowerings for some combos without taking on this cost, I think.
In this particular case, the general helpers (e.g. input_to_reg_mem()) don't need to change at all; instead there would be some logic in the lowering case for fcvt_from_uint that looks for a uwiden_low. The matches_input() helper should be useful here.
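To make the shape of that check concrete, here is a deliberately simplified, self-contained toy (it does not use Cranelift's actual lowering APIs; the node representation and names are invented for illustration). It demonstrates the single-input matching being discussed: when lowering the conversion, peek at the producer of its one input and choose a fused lowering if that producer is the widening op.

    // Toy illustration only: not Cranelift's real lowering API.
    #[derive(Clone, Copy, PartialEq)]
    enum Op {
        UwidenLow,
        FcvtFromUint,
        Other,
    }

    // Each node records its opcode and (at most) one input, by index.
    struct Node {
        op: Op,
        input: Option<usize>,
    }

    // Decide how to lower a FcvtFromUint node by inspecting its single input.
    fn lower(nodes: &[Node], idx: usize) -> &'static str {
        match nodes[idx].op {
            Op::FcvtFromUint => {
                let input_is_widen = nodes[idx]
                    .input
                    .map(|i| nodes[i].op == Op::UwidenLow)
                    .unwrap_or(false);
                if input_is_widen {
                    "fused unsigned i32x4 -> f64x2 conversion"
                } else {
                    "generic fcvt_from_uint lowering"
                }
            }
            _ => "generic lowering",
        }
    }

    fn main() {
        // v0 = (some i32x4 value); v1 = uwiden_low v0; v2 = fcvt_from_uint v1
        let nodes = vec![
            Node { op: Op::Other, input: None },
            Node { op: Op::UwidenLow, input: Some(0) },
            Node { op: Op::FcvtFromUint, input: Some(1) },
        ];
        assert_eq!(lower(&nodes, 2), "fused unsigned i32x4 -> f64x2 conversion");
    }

The point of the toy is that only one producer needs to be inspected, so the match stays cheap compared with matching a whole expression tree.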
@cfallin It's probably reasonable to consider a "complexity bound" on the necessary matching, though, as I said - for instance, i32x4.trunc_sat_f64x2_s_zero is equivalent to:

    v1 = fcvt_to_sint_sat.i64x2 v0
    v2 = vconst.i64x2 0
    v3 = snarrow v1, v2

or:

    v1 = fcvt_to_sint_sat.i64x2 v0
    v2 = iconst.i64 0
    v3 = splat.i64x2 v2
    v4 = snarrow v1, v3

(and maybe I am forgetting another straightforward way to generate a vector of zeros)
I would lean towards introducing a new IR operation in that case.
Yes, definitely, the matching has some dynamic cost (in compile time and IR memory usage), so there's a tradeoff. When it's a simple A+B combo op as here, it seems reasonable to me to pattern-match the composition, but we should take it on a case-by-case basis. I'd be curious whether, in your examples, the lowering can give a more optimized instruction sequence for those 3/4 ops together than what would fall out of the simple lowering, but we can save that discussion for a future PR :-)
Force-pushed from 8e88727 to 52f8f89
    Considering only the low half of the register, each lane in `x` is interpreted as a
    unsigned 32-bit integer that is then converted to a double precision float. This
    which are converted to occupy twice the number of bits. No rounding should be needed
I think you missed a line in the description here.
@akirilov-arm @cfallin Hi, with some offline help from @cfallin I was able to understand what this should look like in the lowering. I used the existing algorithm, but I will investigate doing it another way that uses fewer instructions. I did not try to refactor any other lowerings such as f64x2.convert_low_i32x4_s, since this PR is about implementing f64x2.convert_low_i32x4_u and I figure we want to at least get the remaining instructions finished before optimizing and refactoring previous ones. That said, I do plan to refactor f64x2.convert_low_i32x4_s and maybe others in a different PR, if only for symmetry. Let me know if there is anything else needed for this PR.
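For readers unfamiliar with how x64 code generators commonly handle unsigned i32 -> f64 conversion without a dedicated instruction: one classic sequence places each 32-bit integer in the low mantissa bits of a double whose bit pattern encodes 2^52 and then subtracts 2^52, recovering the value exactly. Whether this PR's lowering emits exactly that sequence is an assumption; the scalar Rust sketch below only demonstrates why the trick is exact.

    // Scalar sketch of the "magic number" trick often used for unsigned
    // int -> double conversion on x64 (the vectorized form interleaves the
    // i32 lanes with the high half of 2^52 and then does a packed subtract).
    fn u32_to_f64_via_magic(x: u32) -> f64 {
        // Bit pattern of the double 2^52: exponent field only, zero mantissa.
        const MAGIC_BITS: u64 = 0x4330_0000_0000_0000;
        // OR-ing x into the low 32 mantissa bits yields exactly 2^52 + x,
        // because x < 2^32 fits in the 52-bit mantissa without rounding.
        let biased = f64::from_bits(MAGIC_BITS | x as u64);
        biased - f64::from_bits(MAGIC_BITS)
    }

    fn main() {
        for &x in &[0u32, 1, 0x8000_0000, u32::MAX] {
            assert_eq!(u32_to_f64_via_magic(x), x as f64);
        }
        println!("magic-number conversion matches the native casts");
    }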
Force-pushed from 3dabf08 to 6b5871c
Force-pushed from 81b4245 to 839b042
I can't really judge the x64-specific implementation, but the rest looks good to me.