Support for I128 operations in x64 backend. #2539

cfallin · 2021-01-04T06:08:26Z

This PR is a followup to #2538 and is dependent on it.

x64 backend: implement 128-bit ops and misc fixes.

This implements all of the ops on I128 that are implemented by the
legacy x86 backend, and includes all that are required by at least one
major use-case (cg_clif rustc backend).

The sequences are open-coded where necessary; for e.g. the bit
operations, this can be somewhat complex, but these sequences have been
tested carefully. This PR also includes a drive-by fix of clz/ctz for 8-
and 16-bit cases where they were incorrect previously.

Also includes ridealong fixes developed while bringing up cg_clif
support, because they are difficult to completely separate due to
other refactors that occurred in this PR:

fix REX prefix logic for some 8-bit instructions.

When using an 8-bit register in 64-bit mode on x86-64, the REX prefix
semantics are somewhat subtle: without the REX prefix, register numbers
4--7 correspond to the second-to-lowest byte of the first four registers
(AH, CH, BH, DH), whereas with the REX prefix, these register numbers
correspond to the usual encoding (SPL, BPL, SIL, DIL). We could always
emit a REX byte for instructions with 8-bit cases (this is harmless even
if unneeded), but this would unnecessarily inflate code size; instead,
the usual approach is to emit it only for these registers.

This logic was present in some cases but missing for some other
instructions: divide, not, negate, shifts.

Fixes Cranelift: rex prefix sometimes incorrectly omitted for 8bit operations with x64 newBE #2508.
avoid unaligned SSE loads on some f64 ops.

The implementations of several FP ops, such as fabs/fneg, used SSE
instructions. This is not a problem per-se, except that load-op merging
did not take alignment into account. Specifically, if an op on an f64
loaded from memory happened to merge that load, and the instruction into
which it was merged was an SSE instruction, then the SSE instruction
imposes stricter (128-bit) alignment requirements than the load.f64 did.

This PR simply forces any instruction lowerings that could use SSE
instructions to implement non-SIMD operations to take inputs in
registers only, and avoid load-op merging.

Fixes Support unaligned float loads and stores in x64 newBE #2507.
two bugfixes exposed by cg_clif: urem/srem.i8, select.b1.
- urem/srem.i8: the 8-bit form of the DIV instruction on x86-64 places
  the remainder in AH, not RDX, different from all the other width-forms
  of this instruction.
- select.b1: we were not recognizing selects of boolean values as
  integer-typed operations, so we were generating XMM moves instead (!).

abrown

Having read through this PR, the obvious split to me seems between the 8-bit changes and the new 128-bit sequences (to get @bnjbvr to take a look that might be a possible commit re-factoring). I like the added tests and I would merge it based on that, but I wonder in a few of my comments if there might not be a clearer way to express some of the logic. Since I only have vague ideas for improvement, though, I'll approve to keep things moving. Hopefully @bnjbvr or @julian-seward1 can come up with more specific suggestions.

cranelift/filetests/filetests/isa/x64/bitops-i128-run.clif

cranelift/codegen/src/isa/x64/inst/args.rs

abrown · 2021-01-11T19:20:26Z

cranelift/codegen/src/isa/x64/inst/args.rs

+            _ => false,
+        }
+    }
+}


It seems preferable to me to just have AluRmiROpcode::And and avoid this special-casing if possible. Do you agree? If yes, I wonder if there is a way to standardize register sizes throughout lower.rs using OperandSize and trying to clarify/simplify all of the special logic related to variables like is64, is_64, is_8bit, etc.

In general I would agree, except that we don't use 8-bit ops elsewhere for CLIF-level 8-bit instructions so I went for the limited-in-scope approach here. I'm totally open to a future refactor that centralizes OperandSize for all instructions, though -- that would remove a lot of the ad-hoc conditionals, as you note!

abrown · 2021-01-11T19:26:01Z

cranelift/codegen/src/isa/x64/inst/emit.rs


                match src {
                    RegMemImm::Reg { reg: reg_e } => {
+                        if is_8bit {
+                            rex.always_emit_if_8bit_needed(int_reg_enc(*reg_e));
+                            rex.always_emit_if_8bit_needed(int_reg_enc(reg_g.to_reg()));


I think this is the type of logic that is tricky to understand later: where is is_8bit coming from? It's sort of hidden that the registers used, if encoded in the range 4..7, should always emit a REX prefix. I'm not suggesting a great alternative here (sorry!), just noting that this is the type of thing that will make the lowering code tricky to troubleshoot and extend.

Perhaps the RexFlags could take an OperandSize with its constructor, if we centralize use of OperandSize; that would at least enforce the need to think about it. OK if we defer to a later refactor?

This implements all of the ops on I128 that are implemented by the legacy x86 backend, and includes all that are required by at least one major use-case (cg_clif rustc backend). The sequences are open-coded where necessary; for e.g. the bit operations, this can be somewhat complex, but these sequences have been tested carefully. This PR also includes a drive-by fix of clz/ctz for 8- and 16-bit cases where they were incorrect previously. Also includes ridealong fixes developed while bringing up cg_clif support, because they are difficult to completely separate due to other refactors that occurred in this PR: - fix REX prefix logic for some 8-bit instructions. When using an 8-bit register in 64-bit mode on x86-64, the REX prefix semantics are somewhat subtle: without the REX prefix, register numbers 4--7 correspond to the second-to-lowest byte of the first four registers (AH, CH, BH, DH), whereas with the REX prefix, these register numbers correspond to the usual encoding (SPL, BPL, SIL, DIL). We could always emit a REX byte for instructions with 8-bit cases (this is harmless even if unneeded), but this would unnecessarily inflate code size; instead, the usual approach is to emit it only for these registers. This logic was present in some cases but missing for some other instructions: divide, not, negate, shifts. Fixes bytecodealliance#2508. - avoid unaligned SSE loads on some f64 ops. The implementations of several FP ops, such as fabs/fneg, used SSE instructions. This is not a problem per-se, except that load-op merging did not take *alignment* into account. Specifically, if an op on an f64 loaded from memory happened to merge that load, and the instruction into which it was merged was an SSE instruction, then the SSE instruction imposes stricter (128-bit) alignment requirements than the load.f64 did. This PR simply forces any instruction lowerings that could use SSE instructions to implement non-SIMD operations to take inputs in registers only, and avoid load-op merging. Fixes bytecodealliance#2507. - two bugfixes exposed by cg_clif: urem/srem.i8, select.b1. - urem/srem.i8: the 8-bit form of the DIV instruction on x86-64 places the remainder in AH, not RDX, different from all the other width-forms of this instruction. - select.b1: we were not recognizing selects of boolean values as integer-typed operations, so we were generating XMM moves instead (!).

cfallin · 2021-01-14T21:50:31Z

Thanks, updated! As discussed just now, I'm happy to do some of the operand size-related refactors later, but I think it might be best to get this in sooner to avoid merge conflicts which seem to accumulate :-) Waiting for your final 👍 before merging though!

abrown · 2021-01-14T21:55:39Z

Consider this a 👍 from me. I think getting this type of thing in is better earlier than later, even without the refactoring opportunities we talked about.

cfallin · 2021-01-14T22:48:23Z

Opened #2586 to track the above refactors.

cfallin requested review from abrown, bnjbvr and julian-seward1 January 4, 2021 06:08

cfallin force-pushed the x64-i128 branch from 72eef07 to 8b38e20 Compare January 4, 2021 06:21

This was referenced Jan 4, 2021

x64 and aarch64: allow StructArgument and StructReturn args. #2541

Merged

Draft: I128 support (partial) on x64. #2504

Closed

cfallin force-pushed the x64-i128 branch 4 times, most recently from 4770836 to 8f79433 Compare January 6, 2021 18:01

abrown approved these changes Jan 11, 2021

View reviewed changes

bnjbvr removed their request for review January 12, 2021 14:07

cfallin force-pushed the x64-i128 branch from 8f79433 to 71ead6e Compare January 14, 2021 21:45

cfallin merged commit 25088de into bytecodealliance:main Jan 14, 2021

cfallin deleted the x64-i128 branch January 14, 2021 22:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for I128 operations in x64 backend. #2539

Support for I128 operations in x64 backend. #2539

cfallin commented Jan 4, 2021 •

edited

Loading

abrown left a comment •

edited

Loading

abrown Jan 11, 2021

cfallin Jan 14, 2021

abrown Jan 11, 2021

cfallin Jan 14, 2021

cfallin commented Jan 14, 2021

abrown commented Jan 14, 2021

cfallin commented Jan 14, 2021

Support for I128 operations in x64 backend. #2539

Support for I128 operations in x64 backend. #2539

Conversation

cfallin commented Jan 4, 2021 • edited Loading

abrown left a comment • edited Loading

Choose a reason for hiding this comment

abrown Jan 11, 2021

Choose a reason for hiding this comment

cfallin Jan 14, 2021

Choose a reason for hiding this comment

abrown Jan 11, 2021

Choose a reason for hiding this comment

cfallin Jan 14, 2021

Choose a reason for hiding this comment

cfallin commented Jan 14, 2021

abrown commented Jan 14, 2021

cfallin commented Jan 14, 2021

cfallin commented Jan 4, 2021 •

edited

Loading

abrown left a comment •

edited

Loading