Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Struct copy using movs rather than SSE on x64 #7469

Open
benaadams opened this issue Feb 19, 2017 · 3 comments
Open

Struct copy using movs rather than SSE on x64 #7469

benaadams opened this issue Feb 19, 2017 · 3 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@benaadams
Copy link
Member

benaadams commented Feb 19, 2017

Assigning a struct to another static variable struct (Memory<T> below)

public struct Memory<T> : IEquatable<Memory<T>>, IEquatable<ReadOnlyMemory<T>>
{
    readonly OwnedMemory<T> _owner;
    readonly int _id;
    readonly int _index;
    readonly int _length;
}

Can generate a movs copy

movs        qword ptr [rdi],qword ptr [rsi]  
movs        qword ptr [rdi],qword ptr [rsi]

Would this be better as a SSE copy?

movdqu      xmm0,xmmword ptr [rsi]  
movdqu      xmmword ptr [rdi],xmm0

As seen in dotnet/corefxlab#1227 (comment)

/cc @mikedn
category:cq
theme:block-opts
skill-level:intermediate
cost:small
impact:medium

@mikedn
Copy link
Contributor

mikedn commented Feb 19, 2017

To be clear the entire copy code looks like this:

; rsi loaded with the source address and rdi loaded with the destination address
; copy _owner field, GC write barrier, rsi += 8, rdi += 8
E89FEE0C5F           call     CORINFO_HELP_ASSIGN_BYREF
; copy 16 bytes: _id, _index, _length and 4 padding bytes
48A5                 movsq
48A5                 movsq

If there's no reference field then the generated code is:

; copy 12 bytes
488B08               mov      rcx, qword ptr [rax]
48890A               mov      qword ptr [rdx], rcx
8B4808               mov      ecx, dword ptr [rax+8]
894A08               mov      dword ptr [rdx+8], ecx

If there's no reference field and another int field is added so the struct has 16 bytes then we get this:

; copy 16 bytes
C4E17A6F00           vmovdqu  xmm0, qword ptr [rax]
C4E17A7F02           vmovdqu  qword ptr [rdx], xmm0

If instead of a reference field we use a long field then we get this:

; copy 24 bytes
C4E17A6F00           vmovdqu  xmm0, qword ptr [rax]
C4E17A7F02           vmovdqu  qword ptr [rdx], xmm0
488B4810             mov      rcx, qword ptr [rax+16]
48894A10             mov      qword ptr [rdx+16], rcx

So movs is used when the struct contains reference fields, it complements CORINFO_HELP_ASSIGN_BYREF which uses rsi and rdi just like movs does.

The use of movs may be worth investigation, it doesn't have a good reputation performance wise. For example, on Skylake movs requires 5 uops instead of 1 or 2 needed by movdqu. That said, its association with helper calls and memory accesses might mean that the overhead is too small to matter.

@mikedn
Copy link
Contributor

mikedn commented Feb 20, 2017

Some quick testing seems to indicate that when vmovdqu is used the code is ~1.5x faster. It would seem that movs is indeed very slow, especially considering that the test code includes 2 calls.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Jan 24, 2023
@huoyaoyuan
Copy link
Member

I believe this can be closed now. Struct copying now uses SSE/AVX when appropriate:

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKBuIGYACXDKAVzA0YFkZ9oAnjQDeNRuMaxsAEwgA7ADYDGASzlcA+hADucmFADcYiVNmLlazSulHqEyTBnylq9Yw1rpMBLfunnFm4aCjByAOYYABa+EpaM2DHiccCJrlxgqXE2NAC+NHRMZIwAwoyidhK8/FDK+KnG4sQoPAAUVYKMAG7YCmwwAJSMALwAfIz4w109fbY5QA

C.M(Memory)
    L0000: vzeroupper
    L0003: vmovdqu ymm0, [rdx]
    L0007: vmovdqu [rcx+8], ymm0
    L000c: vzeroupper
    L000f: ret

Struct with GC reference may use rep movsx though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

5 participants