
core.simd inefficient codegen #17587

Open · dlangBugzillaToGithub opened this issue Jun 1, 2013 · 14 comments

Benjamin Thaut (@bethau) reported this on 2013-06-01T03:12:12Z

Transferred from https://issues.dlang.org/show_bug.cgi?id=10226


Description

The following test code:

import core.simd;
import std.stdio;

void main(string[] args)
{
	float[] value1 = [1.0f, 2.0f, 3.0f, 4.0f];
	float4 result = __simd(XMM.LODAPS, *cast(float4*)value1.ptr);
	result = __simd(XMM.ADDPS, result, result);
	__simd_sto(XMM.STOAPS, *cast(float4*)value1.ptr, result);
	writefln("%s", value1);
}

This produces the following assembly:

1 mov         rax,qword ptr [rbp-68h]  
2 movaps      xmm0,xmmword ptr [rax]  
3 movaps      xmmword ptr [rbp-60h],xmm0  
4 movdqa      xmm0,xmmword ptr [rbp-60h]  
5 addps       xmm0,xmmword ptr [rbp-60h]  
6 movaps      xmmword ptr [rbp-60h],xmm0  
7 movdqa      xmm0,xmmword ptr [rbp-60h]  
8 mov         rax,qword ptr [rbp-68h]  
9 movaps      xmmword ptr [rax],xmm0  

Instructions 3 and 4 are completely useless, as are instructions 6 and 7. Instruction 8 has no effect because RAX already contains that value. Ideally, the assembly would look as follows:

1 mov         rax,qword ptr [rbp-68h]  
2 movaps      xmm0,xmmword ptr [rax]  
5 addps       xmm0,xmmword ptr [rbp-60h]  
9 movaps      xmmword ptr [rax],xmm0  

This is a huge problem because SIMD only starts to pay off if values stay in the XMM registers as long as possible.

tested with dmd 2.063

code (@MartinNowak) commented on 2013-06-01T03:13:40Z

Small correction: the ideal assembly should look like this:

1 mov         rax,qword ptr [rbp-68h]  
2 movaps      xmm0,xmmword ptr [rax]  
5 addps       xmm0,xmm0
9 movaps      xmmword ptr [rax],xmm0 

Instruction 5 should use an XMM register as well, not add from memory.


turkeyman commented on 2013-06-03T03:43:34Z

(In reply to comment #1)
> Small correction: the ideal assembly should look like this:
> 
> 1 mov         rax,qword ptr [rbp-68h]  
> 2 movaps      xmm0,xmmword ptr [rax]  
> 5 addps       xmm0,xmm0
> 9 movaps      xmmword ptr [rax],xmm0 
> 
> Instruction 5 should use an XMM register as well, not add from memory.

Interesting. I haven't scrutinised DMD's codegen as much as GDC yet.
I've been working on the std.simd API mostly against GDC, and once I'm happy with that, I'll be logging codegen bugs in DMD accordingly.

What do you get if you do:

    float4 result = [1,2,3,4];
    result = __simd(XMM.ADDPS, result, result);
    writefln("%s", result.array);

Why do you need to issue the loads and stores explicitly?
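
For reference, the suggested snippet as a complete, compilable program (a sketch; it uses only constructs already shown in this thread):

    import core.simd;
    import std.stdio;

    void main()
    {
        // no explicit __simd load/store: the compiler emits them itself
        float4 result = [1.0f, 2.0f, 3.0f, 4.0f];
        result = __simd(XMM.ADDPS, result, result);
        writefln("%s", result.array);
    }

Compiled with the same flags (-release -O -inline), this would ideally keep the value entirely in XMM registers.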


code (@MartinNowak) commented on 2013-06-03T05:37:15Z

(In reply to comment #2)
> Why do you need to issue the loads and stores explicitly?

On modern processors unaligned loads come at almost no penalty, and it is a lot easier to use unaligned loads within highly optimized functions than to make all the data in the whole project 16-byte aligned and use aligned loads.
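
For illustration, the unaligned counterparts of the intrinsics in the original example are XMM.LODUPS/XMM.STOUPS. A sketch following the same pattern (though, as comment #6 below notes, the unaligned store had its own bug at the time, issue 10225):

    import core.simd;

    // same add-in-place as the original test, but with unaligned
    // load/store, so p does not need to be 16-byte aligned
    void addInPlaceUnaligned(float* p)
    {
        float4 v = __simd(XMM.LODUPS, *cast(float4*)p);  // movups load
        v = __simd(XMM.ADDPS, v, v);                     // addps
        __simd_sto(XMM.STOUPS, *cast(float4*)p, v);      // movups store
    }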


turkeyman commented on 2013-06-03T08:42:51Z

(In reply to comment #3)
> On modern processors unaligned loads come at almost no penalty, and it is a lot
> easier to use unaligned loads within highly optimized functions than to make
> all the data in the whole project 16-byte aligned and use aligned loads.

Umm, what?
This has nothing to do with alignment... where did that come from?
Your issue is the codegen, right? You expect the codegen to be what you pasted, and I agree. I'm suggesting you remove the explicit load/store and let the compiler generate them. What do you get using the code I suggest?


bearophile_hugs commented on 2013-06-03T09:46:24Z

(In reply to comment #3)

> On modern processors unaligned loads come at almost no penalty, and it is a lot
> easier to use unaligned loads within highly optimized functions than to make
> all the data in the whole project 16-byte aligned and use aligned loads.

If this is true, is it worth reducing the smallest allocation size of the D garbage collector from 16 bytes to 8 bytes and reducing the alignment to 8 bytes? This would save some memory when allocating many small objects, like small tree nodes.


code (@MartinNowak) commented on 2013-06-03T10:07:12Z

Reducing alignment on all allocations to 8 bytes is not a good idea. When I said recent processors I meant the i7 (Ivy Bridge); on older processors unaligned access still has a significant performance impact. Initially the manual load and store operations were unaligned, but because of this bug
http://d.puremagic.com/issues/show_bug.cgi?id=10225
I had to replace them with aligned loads and stores for the time being. Note, though, that once the unaligned store gets fixed, the generated code will be just as bad as with the aligned operations.


turkeyman commented on 2013-06-03T16:21:54Z

(In reply to comment #6)
> Initially the manual load and store operations were unaligned, but because of
> this bug http://d.puremagic.com/issues/show_bug.cgi?id=10225 I had to replace
> them with aligned loads and stores for the time being.

Okay, I see. It occurred to me that maybe DMD promoted that movups in your other bug to a movaps because it was able to detect that the source was always aligned? (I have no hard reason to believe that, but just a thought. Walter?)

Yeah, it looks kinda like the compiler is treating the mov*ps opcodes like a typical add or something, and generating its usual register allocation around opcodes that already perform explicit register allocation ;)
My guess is the compiler doesn't actually understand what the mov*ps (and friends) opcodes do, and therefore erroneously tries to helpfully configure a register state to execute them ;)

What flags are you building with? If it's building unoptimised or in some debug mode, the extra movs could be there to keep the stack synchronised.


code (@MartinNowak) commented on 2013-06-03T22:36:41Z

Building with debug symbols gives an ICE ;-)

I'm building with -release -O -inline


bugzilla (@WalterBright) commented on 2016-04-25T07:00:47Z

(In reply to Benjamin Thaut from comment #8)
> Building with debug symbols gives an ICE ;-)

It works when I compile with -g.


bugzilla (@WalterBright) commented on 2016-04-25T07:01:33Z

It isn't 'bad' codegen, it's just inefficient codegen. I reclassified this as an enhancement request.


code (@MartinNowak) commented on 2017-03-13T19:41:51Z

cat > bug.d << CODE
import core.simd;

alias vec = __vector(double[2]);

void storeUnaligned(ref vec a, in ref vec val)
{
    __simd_sto(XMM.STOUPD, a, val);
}

void test(ref vec a, in vec b)
{
    __simd_sto(XMM.STOUPD, a, b * b); // works
    immutable tmp = b * b;
    __simd_sto(XMM.STOUPD, a, tmp); // round-trips through the stack
    storeUnaligned(a, tmp); // round-trips through the stack
}
CODE
dmd -c -O -release -inline bug
----
        mulpd   xmm1, xmm3                              ; 000C _ 66: 0F 59. CB
        movapd  xmmword ptr [rsp], xmm1                 ; 0010 _ 66: 0F 29. 0C 24
        movdqa  xmm2, xmmword ptr [rsp]                 ; 0015 _ 66: 0F 6F. 14 24
        movupd  xmmword ptr [rdi], xmm2                 ; 001A _ 66: 0F 11. 17
----
It turns out that dmd's codegen is almost prohibitively inefficient: the intermediate value unnecessarily gets pushed to the stack.
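
By the same reasoning as in comment #1, the round trip through the stack (the movapd/movdqa pair) would ideally be elided entirely, leaving something like this (hypothetical ideal output):

----
        mulpd   xmm1, xmm3
        movupd  xmmword ptr [rdi], xmm1
----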


code (@MartinNowak) commented on 2017-03-14T01:03:42Z

The cause of the example in comment 11 seems to be the void16 parameters of __simd_sto, which break the

  (tyfloating(em->Ety) != 0) == (tyfloating(e->Ety) != 0)

condition in the localizer.

https://github.com/dlang/dmd/blob/0089ae06db7c7b4bebe4d11bfcf02eab69936d81/src/ddmd/backend/glocal.c#L344


The assignment of value

{
(__vector(float[4])* p = &a[0];) , ((__vector(float[4]) value = b - c;));
__simd_sto(cast(XMM)3857, cast(__vector(void[16]))*p, cast(__vector(void[16]))value);
}

is optimized to this el:

el:0x9003b0 cnt=0 cs=0 *  TYsigned char[16] 0x900350
 el:0x900350 cnt=0 cs=0 var  TY*  __a_8
el:0x8ffc94 cnt=0 cs=0 -  TYsigned char[16] 0x8ffbd4 0x8ffc34
 el:0x8ffbd4 cnt=0 cs=0 var  TYfloat[4]  b
 el:0x8ffc34 cnt=0 cs=0 var  TYfloat[4]  c

which assigns a float[4] vector to a byte[16] vector.


github-bugzilla commented on 2017-03-15T04:11:37Z

Commits pushed to master at https://github.com/dlang/dmd

https://github.com/dlang/dmd/commit/194fa0a92aec4a2b4fa5a6e28feeedca6daceb76
partly resolve Issue 10226 - inefficient core.simd codegen

- fix localization of float vector expressions passed to
  void16 parameters of simd intrinsics

https://github.com/dlang/dmd/commit/0d4eb1f06b938599293249eaecc2130fa6c89c81
Merge pull request #6626 from MartinNowak/issue10226

partly resolve Issue 10226 - inefficient core.simd codegen


github-bugzilla commented on 2017-08-07T13:15:31Z

Commits pushed to newCTFE at https://github.com/dlang/dmd

https://github.com/dlang/dmd/commit/194fa0a92aec4a2b4fa5a6e28feeedca6daceb76
partly resolve Issue 10226 - inefficient core.simd codegen

https://github.com/dlang/dmd/commit/0d4eb1f06b938599293249eaecc2130fa6c89c81
Merge pull request #6626 from MartinNowak/issue10226
