
x64: Add more support for more AVX instructions #5931

Merged
merged 7 commits into bytecodealliance:main from x64-more-avx on Mar 10, 2023

Conversation

alexcrichton (Member)

One aspect of AVX that I have just recently become aware of is that there's apparently a performance penalty associated with mixing AVX and SSE instructions. One reason for this is that AVX instructions which operate over 128-bit values always zero the higher-than-128 bits of each register they operate on. SSE instructions, however, don't do this. This means that false dependencies can be created between instructions, because SSE instructions look like they're intentionally preserving the higher bits where AVX instructions intentionally zero them. According to this StackOverflow question the processor also tracks whether an AVX instruction has been executed, and there's a "scary red line" for mixing AVX/SSE.

On the local meshoptimizer benchmark this PR doesn't actually have any effect on the generated code's performance, or at least not one that I can measure. In that sense this is more of a hygiene thing than anything else.

Specifically, the changes here refactor many ISLE helpers that were generating instructions with SseOpcode.XXX manually to instead use the instruction helpers, which will use the AVX variant if enabled. Additionally, more AVX instructions were added for moving data to/from memory and such.

I don't think this 100% handles all the SSE instructions Cranelift can generate when AVX is enabled, but it at least raises the bar further and removes a bunch of cases where SSE instructions were generated even with AVX enabled.
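
To make the shape of that refactoring concrete, here is a minimal illustrative sketch of what such an instruction helper can look like in ISLE. The names used here (`x64_addps`, `use_avx`, `xmm_rmir_vex`, `xmm_rm_r`, `AvxOpcode.Vaddps`) are hypothetical stand-ins, not necessarily the exact helpers touched by this PR; the point is only that callers go through one helper, which picks the VEX-encoded form when AVX is enabled and otherwise falls back to the SSE encoding.

;; Illustrative sketch only; all helper and opcode names here are hypothetical.
;; One instruction helper for `addps`: callers never pick an SseOpcode directly.
(decl x64_addps (Xmm XmmMem) Xmm)

;; When AVX is enabled, emit the VEX-encoded `vaddps`, which zeroes bits
;; 128..255 of the destination and so cannot create a false dependency.
(rule 1 (x64_addps x y)
      (if-let $true (use_avx))
      (xmm_rmir_vex (AvxOpcode.Vaddps) x y))

;; Otherwise fall back to the legacy SSE `addps`, which leaves the upper
;; bits of the register untouched.
(rule 0 (x64_addps x y)
      (xmm_rm_r (SseOpcode.Addps) x y))

With lowerings written against helpers like this, enabling AVX flips the encoding in one place rather than at every use site.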

github-actions bot added the cranelift (Issues related to the Cranelift code generator), cranelift:area:x64 (Issues related to x64 codegen), and isle (Related to the ISLE domain-specific language) labels on Mar 4, 2023
github-actions bot commented Mar 4, 2023

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

  • cfallin: isle
  • fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.


alexcrichton requested a review from abrown on March 8, 2023, 20:49
abrown (Collaborator) left a comment


I think this change makes sense but I do feel the performance concern in the StackOverflow question is overstated, at least for this context. The OP for that question does modify the upper bits of a YMM register with a load and thus incurs a significant performance penalty; that is not really possible in Cranelift at the moment. If anyone stumbles on this PR later and is interested, I would highly recommend reading the entire section that question references in Intel's optimization manual — "Mixing AVX Code with SSE Code" in section 15.3. My reading is that in the Cranelift context (for Skylake and later micro-architectures) we only ever cycle on the "Clean UpperState" node of figure 15-2 — Cranelift only emits SSE instructions or 128-bit AVX instructions, avoiding the expensive transition penalty (the ~70 cycle slowdown mentioned in the StackOverflow question).

[figure: the AVX/SSE transition state diagram from Figure 15-2 of Intel's optimization manual]

Now, there could be a partial register dependency but even that is not entirely clear: that may only apply when the upper bits of a YMM register are dirty, which again I do not think is possible here. Again, let me just recommend reading the section directly or running experiments as @alexcrichton has done (with no performance effect, right?).

All in all, though, I think this change looks good to me! The nits are around renaming some helpers but those are not crucial.


(decl x64_movdqu_store (SyntheticAmode Xmm) SideEffectNoResult)
(rule (x64_movdqu_store addr data)
      (x64_xmm_movrm (SseOpcode.Movdqu) addr data))
abrown (Collaborator) commented on this code:


Maybe a rename?

Suggested change:
- (x64_xmm_movrm (SseOpcode.Movdqu) addr data))
+ (xmm_movrm_sse (SseOpcode.Movdqu) addr data))

Back when I was adding things like this, I tried to keep a convention of x64_<instruction> but it's been a while.

alexcrichton (Member, Author) replied:


I opted to drop the _sse suffix since the other xmm_* helpers don't have that, although it might be good to go back and rename Xmm* to Sse and Xmm*Vex to Avx* perhaps.

cranelift/codegen/src/isa/x64/inst.isle (outdated review thread, resolved)
alexcrichton (Member, Author) commented:

Oh good points! Shame on me for not actually reading all the way through on these bits...

So hypothetically if the host uses ymm registers in its own code, that might cause stalls but given that the stall requires hopping between the guest and the host it probably isn't really going to affect much?

Otherwise though locally I can't measure a difference before/after this PR, so the main motivation at this point is to copy what v8 does.

…ecodealliance#5930)

* x64: Add lowerings for `punpck{h,l}wd`

Add some special cases for `shuffle` for more specialized x86
instructions.

* x64: Add `shuffle` lowerings for `pshufd`

This commit adds special-cased lowerings for the x64 `shuffle`
instruction when the `pshufd` instruction alone is necessary. This is
possible when the shuffle immediate permutes 32-bit values within one of
the vector inputs of the `shuffle` instruction, but not both.

* x64: Add shuffle lowerings for `punpck{h,l}{q,}dq`

This adds specific permutations for some x86 instructions which
specifically interleave high/low bytes for 32 and 64-bit values. This
corresponds to the preexisting specific lowerings for interleaving 8 and
16-bit values.

* x64: Add `shuffle` lowerings for `shufps`

This commit adds targeted lowerings for the `shuffle` instruction that match the pattern that `shufps` supports. The `shufps` instruction selects two elements from the first vector and two elements from the second vector, which means that while it's not generally applicable it should still be more useful than the catch-all lowering of `shuffle`.

* x64: Add shuffle support for `pshuf{l,h}w`

This commit adds special lowering cases for these instructions which
permute 16-bit values within a 128-bit value either within the upper or
lower half of the 128-bit value.

* x64: Specialize `shuffle` with an all-zeros immediate

Instead of loading the all-zeros immediate from a rip-relative address at the end of the function, generate a zero with a `pxor` instruction and then use `pshufb` to do the broadcast (a rough sketch of this lowering follows after this commit list).

* Review comments
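
As a rough sketch of the all-zeros-immediate specialization mentioned in the commit above: a `shuffle` whose immediate is entirely zero broadcasts byte 0 of the first input, so the `pshufb` mask can be produced with a register-zeroing xor instead of a rip-relative constant load. The extractor and helper names below (`u128_from_immediate`, `x64_pshufb`, `xmm_zero`) are illustrative assumptions and may not match the exact names in the tree.

;; Hypothetical sketch; extractor and helper names are illustrative.
;; A shuffle immediate of all zeros selects byte 0 of the first input for
;; every output lane, i.e. a byte broadcast. An all-zero pshufb mask does
;; exactly that, and the mask itself can be materialized with a self-xor
;; (pxor/vpxor) rather than loaded from a rip-relative constant.
(rule 10 (lower (shuffle x y (u128_from_immediate 0)))
      (x64_pshufb x (xmm_zero $I8X16)))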
abrown (Collaborator) commented Mar 9, 2023

> So hypothetically if the host uses ymm registers in its own code, that might cause stalls but given that the stall requires hopping between the guest and the host it probably isn't really going to affect much?

Yeah, that's a good point. I guess we should remember that, beyond just the normal overhead of switching between guest and host, this YMM transition penalty could add to the switch overhead. Maybe it's worthwhile to think about running VZEROUPPER in the "host to guest" trampoline so that we feel more sure that guest code will be in the "Clean UpperState"? cc: @cfallin, @elliottt, @jameysharp; I guess this is a "better safe than sorry" kind of thought, but that goes along with the intent of this PR.

> Otherwise though locally I can't measure a difference before/after this PR, so the main motivation at this point is to copy what v8 does.

Yeah, I wanted to say it earlier but didn't want to sound cavalier: one might have to work rather hard to make the partial register dependency become a noticeable issue in a real benchmark. I'm not saying it can't be done or that we shouldn't try to avoid it, just... the StackOverflow answer ("you are experiencing a penalty for 'mixing' non-VEX SSE and VEX-encoded instructions") felt more alarmist than I thought was warranted.

alexcrichton (Member, Author) commented:

Oh sorry, I didn't mean to raise any alarms or convey any sense of urgency. I should probably more succinctly put it as: "I was interested in filling out more AVX instructions, but had no technical motivation to document as the reason to do so, so I picked the first Google result and pasted it here."

I'll need to read up more on VZEROUPPER as I'm not sure what it does and how it affects performance myself.

This will benefit from the lack of an alignment requirement vs the `pshufd` instruction when working with a memory operand. Additionally, as I've just learned, this reduces dependencies between instructions because the `v*` instructions zero the upper bits as opposed to preserving them, which could otherwise accidentally create false dependencies in the CPU between instructions.
This commit adds VEX-encoded versions of instructions such as `mov{ss,sd,upd,ups,dqu}` for load and store operations. It also changes some signatures so the `load` helpers specifically take a `SyntheticAmode` argument, which ended up requiring a small refactoring of the `*_regmove` variant used for `insertlane 0` into f64x2 vectors.
This commit refactors the internal ISLE helpers for creating zeroed xmm registers to leverage the AVX support used for all other instructions. This moves away from picking opcodes to instead picking instructions, with a bit of reorganization.
All existing users can be replaced with usage of the `xmm_uninit_value` helper instruction, so there's no longer any need for these otherwise-constant operations. This additionally reduces manual usage of opcodes in favor of instruction helpers (a rough sketch of the resulting shape follows below).
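
A rough sketch of what such a zeroed-register helper can look like after this refactor; `xmm_zero`, `use_avx`, `x64_vpxor`, and `x64_pxor` are illustrative names here, and the exact signatures in the tree (including whether `xmm_uninit_value` takes a type) may differ.

;; Illustrative sketch only; helper names and signatures are hypothetical.
;; A zeroed xmm register is produced by xor'ing an uninitialized register
;; with itself, preferring the VEX-encoded `vpxor` when AVX is enabled.
(decl xmm_zero (Type) Xmm)
(rule 1 (xmm_zero ty)
      (if-let $true (use_avx))
      (let ((tmp Xmm (xmm_uninit_value ty)))
        (x64_vpxor tmp tmp)))
(rule 0 (xmm_zero ty)
      (let ((tmp Xmm (xmm_uninit_value ty)))
        (x64_pxor tmp tmp)))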
jameysharp (Contributor) commented:

The optimization manual says that vzeroupper "has zero latency" so I guess the only cost is instruction decode. Given that, adding one no-operand instruction to Wasmtime's trampolines sounds reasonable to me. (I guess it should be added for transitions in both directions between host and guest, based on the optimization manual's recommendations.)

I think I remember at least one of those trampolines does a tail-call, so it doesn't have the opportunity to do this when the callee returns, which I suppose could lead to surprising results too.

Just to check, we don't need to worry about ABI here, right? I'm assuming no x86 ABI guarantees anything about bits beyond the first 128 of vector registers across a call, or all the vector registers are caller-saved, or something.

abrown (Collaborator) commented Mar 9, 2023

> Just to check, we don't need to worry about ABI here, right? I'm assuming no x86 ABI guarantees anything about bits beyond the first 128 of vector registers across a call, or all the vector registers are caller-saved, or something.

Honestly, I hadn't thought too much about this idea until today so I don't know, but if we did add VZEROUPPER in the "host to guest" direction, for example, I think we would want to do so before we fill in any registers with passed v128 values.

github-actions bot added the cranelift:area:machinst (Issues related to instruction selection and the new MachInst backend) label on Mar 9, 2023
alexcrichton added this pull request to the merge queue on Mar 9, 2023
Merged via the queue into bytecodealliance:main with commit 83f21e7 on Mar 10, 2023
alexcrichton deleted the x64-more-avx branch on March 10, 2023, 00:42
Labels
cranelift:area:machinst (Issues related to instruction selection and the new MachInst backend), cranelift:area:x64 (Issues related to x64 codegen), cranelift (Issues related to the Cranelift code generator), isle (Related to the ISLE domain-specific language)