Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tighten up some SSE2 ASCII + UTF-16 transcoding logic #32036

Merged
merged 2 commits into from Feb 11, 2020

Conversation

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Feb 10, 2020

Cleans up some of the logic slightly by removing SSE4.1-specific paths when SSE2 paths suffice with the same number of instructions. Also optimizes the SSE2-specific codegen a little bit by reusing existing registers whenever possible.

As an example of the type of optimization performed here:

/* old code - 3 instructions (pxor, pcmpgtw, pmovmskb) */
Sse2.MoveMask(Sse2.CompareGreaterThan(Sse2.Xor(combinedVector, asciiMaskForPXOR), asciiMaskForPCMPGTW).AsByte()

/* new code - 2 instructions (paddusw, pmovmskb) */
Sse2.MoveMask(Sse2.AddSaturate(combinedVector, asciiMaskForPADDUSW).AsByte())

Dropping from 3 instructions down to 2 instructions also allows us to get rid of one of the "const" vectors that we use for these operations, hence the more efficient use of SIMD registers.

This code was already fuzzed as part of #31904 (comment).

Copy link
Member

tannergooding left a comment

Nice optimization. Any perf numbers?

@GrabYourPitchforks

This comment has been minimized.

Copy link
Member Author

GrabYourPitchforks commented Feb 11, 2020

Raw throughput difference is in the range of noise, which isn't surprising. But I confirmed that there's more efficient use of registers:

;;; OLD ;;;
00007ffd`5e48bb90 b80080ffff      mov     eax,0FFFF8000h
00007ffd`5e48bb95 660f6ec0        movd    xmm0,eax
00007ffd`5e48bb99 0f28c8          movaps  xmm1,xmm0
00007ffd`5e48bb9c 660f61c8        punpcklwd xmm1,xmm0
00007ffd`5e48bba0 0f28c1          movaps  xmm0,xmm1
00007ffd`5e48bba3 660f70c000      pshufd  xmm0,xmm0,0
00007ffd`5e48bba8 0f28c8          movaps  xmm1,xmm0
00007ffd`5e48bbab b87f80ffff      mov     eax,0FFFF807Fh
00007ffd`5e48bbb0 660f6ed0        movd    xmm2,eax
00007ffd`5e48bbb4 0f28da          movaps  xmm3,xmm2
00007ffd`5e48bbb7 660f61da        punpcklwd xmm3,xmm2
00007ffd`5e48bbbb 0f28d3          movaps  xmm2,xmm3

;;; NEW ;;;
00007ffd`519ab240 b8807f0000      mov     eax,7F80h
00007ffd`519ab245 660f6ec0        movd    xmm0,eax
00007ffd`519ab249 0f28c8          movaps  xmm1,xmm0
00007ffd`519ab24c 660f61c8        punpcklwd xmm1,xmm0
00007ffd`519ab250 0f28c1          movaps  xmm0,xmm1
00007ffd`519ab253 660f70c000      pshufd  xmm0,xmm0,0
00007ffd`519ab258 0f28c8          movaps  xmm1,xmm0

And the main loop is also a little tighter, as expected:

;;; OLD ;;;
00007ffd`5e48bbc6 f30f6f21        movdqu  xmm4,xmmword ptr [rcx]
00007ffd`5e48bbca 0f28ec          movaps  xmm5,xmm4
00007ffd`5e48bbcd 660fefe8        pxor    xmm5,xmm0
00007ffd`5e48bbd1 0f28c5          movaps  xmm0,xmm5
00007ffd`5e48bbd4 660f65c2        pcmpgtw xmm0,xmm2
00007ffd`5e48bbd8 660fd7c0        pmovmskb eax,xmm0
00007ffd`5e48bbdc 85c0            test    eax,eax

;;; NEW ;;;
00007ffd`519ab28c f30f6f5110      movdqu  xmm2,xmmword ptr [rcx+10h]
00007ffd`519ab291 0f28c2          movaps  xmm0,xmm2
00007ffd`519ab294 660fddc1        paddusw xmm0,xmm1
00007ffd`519ab298 66440fd7c8      pmovmskb r9d,xmm0
00007ffd`519ab29d 4d63c9          movsxd  r9,r9d
00007ffd`519ab2a0 41f7c1aaaa0000  test    r9d,0AAAAh

I'm looking into why there's a stray sign extension in there. I wouldn't have expected that to be there.

@GrabYourPitchforks

This comment has been minimized.

Copy link
Member Author

GrabYourPitchforks commented Feb 11, 2020

Found and fixed the unnecessary movsxd instruction. :)
Edit: Opened #32089 to track improving this JIT-wide.

@GrabYourPitchforks

This comment has been minimized.

Copy link
Member Author

GrabYourPitchforks commented Feb 11, 2020

CI is on the floor right now due to failures in System.Threading.Tasks.Tests. Unrelated to this PR. Proceeding with merge.

@GrabYourPitchforks

This comment has been minimized.

Copy link
Member Author

GrabYourPitchforks commented Feb 11, 2020

Thanks Tanner for the review! :)

@GrabYourPitchforks GrabYourPitchforks merged commit 720f2d6 into dotnet:master Feb 11, 2020
61 of 67 checks passed
61 of 67 checks passed
runtime Build #20200210.104 had test failures
Details
runtime (Libraries Test Run checked coreclr OSX x64 Debug) Libraries Test Run checked coreclr OSX x64 Debug failed
Details
runtime (Libraries Test Run release coreclr Linux x64 Debug) Libraries Test Run release coreclr Linux x64 Debug failed
Details
runtime (Libraries Test Run release coreclr Linux_musl x64 Debug) Libraries Test Run release coreclr Linux_musl x64 Debug failed
Details
runtime (Libraries Test Run release coreclr OSX x64 Debug) Libraries Test Run release coreclr OSX x64 Debug failed
Details
runtime (Libraries Test Run release coreclr Windows_NT x64 Debug) Libraries Test Run release coreclr Windows_NT x64 Debug was canceled
Details
WIP Ready for review
Details
license/cla All CLA requirements met.
Details
runtime (Checkout) Checkout succeeded
Details
runtime (CoreCLR Product Build Linux arm release) CoreCLR Product Build Linux arm release succeeded
Details
runtime (CoreCLR Product Build Linux arm64 release) CoreCLR Product Build Linux arm64 release succeeded
Details
runtime (CoreCLR Product Build Linux x64 release) CoreCLR Product Build Linux x64 release succeeded
Details
runtime (CoreCLR Product Build Linux_musl arm64 release) CoreCLR Product Build Linux_musl arm64 release succeeded
Details
runtime (CoreCLR Product Build Linux_musl x64 release) CoreCLR Product Build Linux_musl x64 release succeeded
Details
runtime (CoreCLR Product Build OSX x64 checked) CoreCLR Product Build OSX x64 checked succeeded
Details
runtime (CoreCLR Product Build OSX x64 release) CoreCLR Product Build OSX x64 release succeeded
Details
runtime (CoreCLR Product Build Windows_NT arm release) CoreCLR Product Build Windows_NT arm release succeeded
Details
runtime (CoreCLR Product Build Windows_NT arm64 release) CoreCLR Product Build Windows_NT arm64 release succeeded
Details
runtime (CoreCLR Product Build Windows_NT x64 release) CoreCLR Product Build Windows_NT x64 release succeeded
Details
runtime (CoreCLR Product Build Windows_NT x86 release) CoreCLR Product Build Windows_NT x86 release succeeded
Details
runtime (Installer Build and Test Linux_arm Debug) Installer Build and Test Linux_arm Debug succeeded
Details
runtime (Installer Build and Test Linux_arm64 Debug) Installer Build and Test Linux_arm64 Debug succeeded
Details
runtime (Installer Build and Test Linux_musl_arm64 Debug) Installer Build and Test Linux_musl_arm64 Debug succeeded
Details
runtime (Installer Build and Test Linux_musl_x64 Debug) Installer Build and Test Linux_musl_x64 Debug succeeded
Details
runtime (Installer Build and Test Linux_x64 Debug) Installer Build and Test Linux_x64 Debug succeeded
Details
runtime (Installer Build and Test OSX_x64 Debug) Installer Build and Test OSX_x64 Debug succeeded
Details
runtime (Installer Build and Test Windows_NT_arm Debug) Installer Build and Test Windows_NT_arm Debug succeeded
Details
runtime (Installer Build and Test Windows_NT_arm64 Debug) Installer Build and Test Windows_NT_arm64 Debug succeeded
Details
runtime (Installer Build and Test Windows_NT_x64 Debug) Installer Build and Test Windows_NT_x64 Debug succeeded
Details
runtime (Installer Build and Test Windows_NT_x86 Debug) Installer Build and Test Windows_NT_x86 Debug succeeded
Details
runtime (Libraries Build Linux arm Release) Libraries Build Linux arm Release succeeded
Details
runtime (Libraries Build Linux arm64 Debug) Libraries Build Linux arm64 Debug succeeded
Details
runtime (Libraries Build Linux x64 Debug) Libraries Build Linux x64 Debug succeeded
Details
runtime (Libraries Build Linux_musl arm64 Release) Libraries Build Linux_musl arm64 Release succeeded
Details
runtime (Libraries Build Linux_musl x64 Debug) Libraries Build Linux_musl x64 Debug succeeded
Details
runtime (Libraries Build OSX x64 Debug) Libraries Build OSX x64 Debug succeeded
Details
runtime (Libraries Build WebAssembly wasm Debug) Libraries Build WebAssembly wasm Debug succeeded
Details
runtime (Libraries Build Windows_NT allConfigurations x64 Debug) Libraries Build Windows_NT allConfigurations x64 Debug succeeded
Details
runtime (Libraries Build Windows_NT arm Release) Libraries Build Windows_NT arm Release succeeded
Details
runtime (Libraries Build Windows_NT arm64 Release) Libraries Build Windows_NT arm64 Release succeeded
Details
runtime (Libraries Build Windows_NT net472 x86 Release) Libraries Build Windows_NT net472 x86 Release succeeded
Details
runtime (Libraries Build Windows_NT x64 Debug) Libraries Build Windows_NT x64 Debug succeeded
Details
runtime (Libraries Build Windows_NT x86 Debug) Libraries Build Windows_NT x86 Debug succeeded
Details
runtime (Libraries Build Windows_NT x86 Release) Libraries Build Windows_NT x86 Release succeeded
Details
runtime (Libraries Test Build Linux x64 Debug) Libraries Test Build Linux x64 Debug succeeded
Details
runtime (Libraries Test Build OSX x64 Debug) Libraries Test Build OSX x64 Debug succeeded
Details
runtime (Libraries Test Build Windows_NT x64 Debug) Libraries Test Build Windows_NT x64 Debug succeeded
Details
runtime (Libraries Test Run release coreclr Windows_NT x86 Debug) Libraries Test Run release coreclr Windows_NT x86 Debug succeeded
Details
runtime (Libraries Test Run release coreclr Windows_NT x86 Release) Libraries Test Run release coreclr Windows_NT x86 Release succeeded
Details
runtime (Libraries Test Run release mono Linux x64 Debug) Libraries Test Run release mono Linux x64 Debug succeeded
Details
runtime (Libraries Test Run release mono OSX x64 Debug) Libraries Test Run release mono OSX x64 Debug succeeded
Details
runtime (Mono LLVM Product Build Linux x64 debug) Mono LLVM Product Build Linux x64 debug succeeded
Details
runtime (Mono LLVM Product Build Linux x64 release) Mono LLVM Product Build Linux x64 release succeeded
Details
runtime (Mono Product Build Linux x64 debug) Mono Product Build Linux x64 debug succeeded
Details
runtime (Mono Product Build Linux x64 release) Mono Product Build Linux x64 release succeeded
Details
runtime (Mono Product Build Linux_musl x64 debug) Mono Product Build Linux_musl x64 debug succeeded
Details
runtime (Mono Product Build Linux_musl x64 release) Mono Product Build Linux_musl x64 release succeeded
Details
runtime (Mono Product Build OSX x64 debug) Mono Product Build OSX x64 debug succeeded
Details
runtime (Mono Product Build OSX x64 release) Mono Product Build OSX x64 release succeeded
Details
runtime (Mono Product Build Windows_NT x64 debug) Mono Product Build Windows_NT x64 debug succeeded
Details
runtime (Mono Product Build Windows_NT x64 release) Mono Product Build Windows_NT x64 release succeeded
Details
runtime-live-build Build #20200210.98 succeeded
Details
runtime-live-build (Build Linux x64 debug RuntimeFlavor_Mono) Build Linux x64 debug RuntimeFlavor_Mono succeeded
Details
runtime-live-build (Build Linux x64 debug Runtime_Release) Build Linux x64 debug Runtime_Release succeeded
Details
runtime-live-build (Build OSX x64 release Runtime_Debug) Build OSX x64 release Runtime_Debug succeeded
Details
runtime-live-build (Build Windows_NT x86 release Runtime_Debug) Build Windows_NT x86 release Runtime_Debug succeeded
Details
runtime-live-build (Checkout) Checkout succeeded
Details
@GrabYourPitchforks GrabYourPitchforks deleted the GrabYourPitchforks:intr_opt branch Feb 11, 2020
alistairjevans added a commit to alistairjevans/runtime that referenced this pull request Feb 11, 2020
Makes more efficient use of register allocation and uses paddusw to speed up some operations
monojenkins pushed a commit to monojenkins/mono that referenced this pull request Feb 11, 2020
Mono currently supports a limited subset of **Sse1**-**Sse42** intrinsics used only by CoreLib internally. So once a new API is used to optimize things internally we also have to implement it in Mono. So it recently happened in dotnet/runtime#32036 (`Sse2.AddSaturate` was used for the first time).

So this PR implements it for mono using named LLVM intrinsics. It turns out it's different for LLVM6 we currently based on and LLVM9 (we plan to migrate to soon).

it emits
```llvm
%result = call <8 x i16> @llvm.x86.sse2.paddus.b(<16 x i8> %left, <16 x i8> %right)
```
which is then emitted as `vpaddusw`
monojenkins pushed a commit to monojenkins/mono that referenced this pull request Feb 11, 2020
Mono currently supports a limited subset of **Sse1**-**Sse42** intrinsics used only by CoreLib internally. So once a new API is used to optimize things internally we also have to implement it in Mono. So it recently happened in dotnet/runtime#32036 (`Sse2.AddSaturate` was used for the first time).

So this PR implements it for mono using named LLVM intrinsics. It turns out it's different for LLVM6 we currently based on and LLVM9 (we plan to migrate to soon).

it emits
```llvm
%result = call <8 x i16> @llvm.x86.sse2.paddus.b(<16 x i8> %left, <16 x i8> %right)
```
which is then emitted as `vpaddusw`
monojenkins pushed a commit to monojenkins/mono that referenced this pull request Feb 11, 2020
Mono currently supports a limited subset of **Sse1**-**Sse42** intrinsics used only by CoreLib internally. So once a new API is used to optimize things there we also have to implement it in Mono. So it recently happened in dotnet/runtime#32036 (`Sse2.AddSaturate` was used for the first time).

So this PR implements it for mono (with all Sse2 overloads) using named LLVM intrinsics. It turns out it's different between LLVM6 (we currently based on) and LLVM9 (we plan to migrate to soon).

for `Vector128<byte>` overload it emits
```llvm
%result = call <8 x i16> @llvm.x86.sse2.paddus.b(<16 x i8> %left, <16 x i8> %right)
```
which is then emitted as `vpaddusw`
akoeplinger pushed a commit to mono/mono that referenced this pull request Feb 11, 2020
Mono currently supports a limited subset of **Sse1**-**Sse42** intrinsics used only by CoreLib internally. So once a new API is used to optimize things there we also have to implement it in Mono. So it recently happened in dotnet/runtime#32036 (`Sse2.AddSaturate` was used for the first time).

So this PR implements it for mono (with all Sse2 overloads) using named LLVM intrinsics. It turns out it's different between LLVM6 (we currently based on) and LLVM9 (we plan to migrate to soon).

for `Vector128<byte>` overload it emits
```llvm
%result = call <8 x i16> @llvm.x86.sse2.paddus.b(<16 x i8> %left, <16 x i8> %right)
```
which is then emitted as `vpaddusw`

Co-authored-by: Egor Bogatov <egorbo@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants
You can’t perform that action at this time.