Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RyuJIT: Unnecessary Copy of System V x86-64 Multi-Register Structure Returning #7897

Closed
fiigii opened this issue Apr 17, 2017 · 5 comments · Fixed by dotnet/coreclr#11133
Closed
Assignees
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization os-linux Linux OS (any supported distro) os-mac-os-x macOS aka OSX tenet-performance Performance related issue
Milestone

Comments

@fiigii
Copy link
Contributor

fiigii commented Apr 17, 2017

According to VTune characterization of TechEmpower with CoreCLR 2.0-beta, the top hot function of System::IO::Pipelines is Pipe.get_buffer, which is called by the top hot function WriteFast in Kestrel.

internal Buffer<byte> Buffer => _writingHead?.Buffer.Slice(_writingHead.End, _writingHead.WritableBytes) ?? Buffer<byte>.Empty;

There are codegen differences between Windows and Ubuntu as shown below:

Windows Ubuntu
Block 1: Block 1:
push rdi push r14
push rsi push r13
push rbx push rbx
sub rsp, 0x20 sub rsp, 0x40
mov rsi, rdx vzeroupper
mov rdi, qword ptr [rcx+0x38] mov r13, rdi
mov rbx, rdi lea rdi, ptr [rsp]
test rbx, rbx mov ecx, 0x10
jnz 0x7ffa9a429019 <Block 4> xor eax, eax
Block 2: rep stosd dword ptr [rdi]
mov rcx, rsi mov rdi, r13
call 0x7ffa9a421410 mov rbx, qword ptr [rdi+0x38]
Block 3: mov r14, rbx
mov rax, rsi test r14, r14
add rsp, 0x20 jnz 0x7f52d191a08c <Block 4>
pop rbx Block 2:
pop rsi call 0x7f52d1910b90
pop rdi Block 3:
ret mov qword ptr [rsp+0x10], rax
mov qword ptr [rsp+0x18], rdx
vmovdqu xmm0, xmmword ptr [rsp+0x10]
vmovdqu xmmword ptr [rsp+0x20], xmm0
mov rax, qword ptr [rsp+0x20]
mov rdx, qword ptr [rsp+0x28]
add rsp, 0x40
pop rbx
pop r13
pop r14
ret
Block 4: Block 4:
lea rdx, ptr [rbx+0x28] lea rax, ptr [r14+0x28]
mov rcx, rdx mov rdi, rax
mov rax, qword ptr [rcx] mov rsi, qword ptr [rdi]
mov ecx, dword ptr [rcx+0x8] mov edi, dword ptr [rdi+0x8]
mov r8d, dword ptr [rdi+0x1c] mov edx, dword ptr [rbx+0x1c]
mov r9d, r8d mov ecx, edx
mov r10d, dword ptr [rdi] mov r8d, dword ptr [rbx]
mov edx, dword ptr [rdx+0xc] mov eax, dword ptr [rax+0xc]
mov edi, edx sub eax, edx
sub edi, r8d lea rdx, ptr [rsp]
lea ebx, ptr [rcx+r9*1] vxorpd xmm0, xmm0, xmm0
lea rcx, ptr [rsi] vmovdqu xmmword ptr [rdx], xmm0
mov rdx, rax add edi, ecx
call CORINFO_HELP_CHECKED_ASSIGN_REF mov qword ptr [rsp], rsi
Block 5: mov dword ptr [rsp+0x8], edi
mov dword ptr [rsi+0x8], ebx mov dword ptr [rsp+0xc], eax
mov dword ptr [rsi+0xc], edi vmovdqu xmm0, xmmword ptr [rsp]
mov rax, rsi vmovdqu xmmword ptr [rsp+0x30], xmm0
add rsp, 0x20 mov rax, qword ptr [rsp+0x30]
pop rbx mov rdx, qword ptr [rsp+0x38]
pop rsi add rsp, 0x40
pop rdi pop rbx
ret pop r13
pop r14
ret

There are two codegen issues that make Linux version slower:

  1. Redundantly copy the result of calling Buffer<byte>.Empty (highlighted code in Block 3).
  2. Redundantly copy the result of calling (inlined) Slice (highlighted code in Block 4).

I will look into fixing these codegen issues.

@RussKeldorph
Copy link
Contributor

@dotnet/jit-contrib

There are a few existing issues that may be related

@fiigii
Copy link
Contributor Author

fiigii commented Apr 18, 2017

@RussKeldorph Most of these existing issues are related to structure argument passing, and this issue is only about returned structures.

@RussKeldorph
Copy link
Contributor

@fiigii Understood. I didn't mean to suggest your issue was a duplicate; just that in theory there might be other issues affected by a fix.

@fiigii
Copy link
Contributor Author

fiigii commented Apr 18, 2017

@RussKeldorph I see, thanks for explaining.

@omariom
Copy link
Contributor

omariom commented Apr 19, 2017

It may be related to First class structs

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization os-linux Linux OS (any supported distro) os-mac-os-x macOS aka OSX tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants