-
-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimization for put/doPut templates that use outputRanges that should be pass by value #2655
Conversation
8cd7d25 to
0a0656e
Compare
|
dynamic arrays actually must be passed by reference as they do change in size after each put. |
0a0656e to
6870d2a
Compare
|
Ah yes good catch, fixed and pushed. |
6870d2a to
cbcdfac
Compare
|
Hey just wanted to know what the status on this is. Is everyone too busy to look at it right now or is there something I can do to make this PR better? Thanks. |
|
I haven't gotten around to looking over the code yet. I'll see if I can do that sometime today. |
| { | ||
| import std.string; | ||
| static assert (false, | ||
| format("Cannot nativaly put a %s into a %s.", E.stringof, R.stringof)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this wasn't your doing, but can you fix nativaly? I'm assuming it should be natively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure thing
|
Can you give a hint as to what optimizations we can have by specializing on certain types? This seems like we are singling out certain types, but there are others that could be "pass by value". |
|
This optimization removes a level of indirection for types that already have indirection like pointers/classes/etc. If you pass a pointer by reference, you can modify what the pointer is pointing to and the pointer itself. The put/doPut functions will be modifying what the pointer is pointing to but will never be modifying the pointer itself. Therefore, passing a pointer by reference in this case is unnecessary. There may be more types that could be pass by value but these were all the ones I could think of (if you have any more let me know). However, the default is to use pass by reference, which at worst creates more indirection then needed. If any more types are discovered, they just need to by added to the It is unfortunate that there's no way to combine these functions into a single template. I posted an idea to do this by adding a new function attribute http://forum.dlang.org/thread/dhqrbvslzzoxmhzisnga@forum.dlang.org I didn't get any reponses though. Basically, a type that is "pointing" to data should be passed by value, whereas a type that is data ( like a struct ) should be passed by reference. The perfect example is a struct (pass by reference) or a pointer to a struct (pass by value). If you read the post I made and understand what
So if the callers variable is a pointer, the caller needs only to pass the pointer, if the callers variable is not a pointer, like a raw struct or a value type, then the caller should pass the variable by reference. |
cbcdfac to
1b4e617
Compare
OK, so it's simply an avoidance of indirection. The way you phrased the original post, it sounded like there were some other optimizations that can be made, but you were holding off to avoid conflating the two.
For instance a user-created smart pointer. Basically, any struct that is made up of references that have nothing to do with the range apparatus, because what the references point to is actually the range. |
Yes it's true that if the struct is simply wrapping the real output range then it could be passed by value and it would still work. However, I'm not sure it's possible to be able to determine in the general case if a struct is doing this. Furthermore, even if the struct is wrapping an output range, it may also contain a bunch of other fields which may make the user want to pass it by reference to avoid copying the entire struct to the stack just to call the put function. I think its reasonable to say that structs will always be passed by reference. If the struct is wrapping some type that is actually facilitating the output range, then the user can pass that type to the put function instead. |
|
I suppose it's just an optimization that we can't give users the ability to hook. Probably fine the way it is then. |
|
Please block this PR until #2661 became closed (merged/rejected). |
|
It looks like the test infrastructure keeps trying to re-validate this PR. Is there a way to prevent this while we are waiting for #2661? I don't want to take away build cycles from that PR. |
|
You can close the PR, and then open it later. |
1b4e617 to
e7bb046
Compare
|
Modified to accomodate rebasing from #2661. |
|
Thanks! |
1322cc8 to
db19a83
Compare
db19a83 to
3940364
Compare
|
Awesome, so all my comments are somewhere on Jonathon's branch... Anyhow some proof that this actually does add speed is needed. Maybe inliner just unrolls it all on its own and we just duplicate code (and significant amounts). |
|
@9il Moscow |
|
@DmitryOlshansky Moscow too. I will write my comment in Russian tomorrow. My English is very bad to explain all details I want. |
|
Oh! Thanks for clarifying. If the compiler is able to optimize a ref class/ref delegate/ref function/ref (pointer) by removing the level of indirection then this PR is useless. I'm not convinced that the compiler will do this in all cases, I'll have to do some "thinking". When you get a chance, what tool did you use to get the assembly code? Did you disassemble the binary or did you have DMD generate the assembly code? I would also love to get someone like @WalterBright to give his take on this. He might be able to immediately say whether or not this optimization is useful. @WalterBright If you have a minute, is there any use to remove the "ref" attribute on the |
|
@DmitryOlshansky @9il FYI: Moscow too;) |
|
@IgorStepanov #dconf in Moscow, 2018? :-) |
|
@9il Great idea! |
I was afraid it can since
I disassembled the binary, the tool is IDA Pro: https://www.hex-rays.com/products/ida/ On linux one can just use objdump, it;s good enough.
It doesn't produce ASM listing. To the best of my knowledge DMD outputs object/machine code directly, there is no intermediate asm format anywhere. |
|
@9il @IgorStepanov I recall folks hoped for the next DConf to be staged in Europe. Well, Moscow is Europe ... ;) @9il About x64, here are 2 dumps on Linux before/after this pull: As you can observe with you favorite diff tool, the only difference is addresses of calls. I tricked optimizer by using separate compilation, now it finally makes some sense: Variation is about +- 4ms. The latest code: import std.stdio;
import std.datetime;
import std.range;
import submod; // defines size_t global and is separately compiled
class MyOutputRange
{
size_t x, y;
final void put(int x)
{
// Do something to demonstrate that indirection causes slower performance
//asm{ nop; }
global = x;
y = global;
this.x = y;
}
}
void foo(){
asm { nop; }
}
void main(string[] args)
{
size_t runCount = 2;
size_t loopCount = 100_000_000;
// Use struct pointer to show that new put function operates quicker on pointers
MyOutputRange outputRange = new MyOutputRange();
StopWatch sw;
for(auto runIndex = 0; runIndex < runCount; runIndex++) {
writefln("OutputRange (struct pointer) run %s (loopcount %s)", runIndex + 1, loopCount);
sw.reset();
sw.start();
for(auto i = 0; i < loopCount; i++) {
foo();
}
sw.stop();
writefln(" nothing : %s microseconds", sw.peek.usecs);
sw.reset();
sw.start();
for(auto i = 0; i < loopCount; i++) {
outputRange.put(3);
}
sw.stop();
writefln(" range.put : %s microseconds", sw.peek.usecs);
sw.reset();
sw.start();
for(auto i = 0; i < loopCount; i++) {
put(outputRange, 3);
}
sw.stop();
writefln(" put(range) : %s microseconds", sw.peek.usecs);
version(PutRef) {
sw.reset();
sw.start();
for(auto i = 0; i < loopCount; i++) {
putRef(outputRange, 3);
}
sw.stop();
writefln(" putRef(range): %s microseconds", sw.peek.usecs);
}
}
} |
|
This may be getting a little "theoretical" here, but I would think that if a compiler could optimize a "ref" function parameter to a "pass-by-value" argument, then it could handle this: __gshared int global;
void func(ref int x) { global = x; }
void main() { int a; func(a); }Since the function parameter x is not getting modified, the compiler can optimize that argument to a pass-by-value. I haven't mentally decompiled the assembly but when I compile this code with and without the "ref", here's what I get. Non-ref version: SECTION .text$_D4test4funcFiZv align=4 execute ; section number 8, code
; Communal section not supported by YASM
_D4test4funcFiZv:; Function begin, communal
push eax ; 0000 _ 50
mov dword [_D4test6globali], eax ; 0001 _ A3, 00000000(segrel)
pop eax ; 0006 _ 58
ret ; 0007 _ C3
; _D4test4funcFiZv End of functionRef version: SECTION .text$_D4test4funcFKiZv align=4 execute ; section number 8, code
; Communal section not supported by YASM
_D4test4funcFKiZv:; Function begin, communal
push eax ; 0000 _ 50
mov ecx, dword [eax] ; 0001 _ 8B. 08
mov dword [_D4test6globali], ecx ; 0003 _ 89. 0D, 00000000(segrel)
pop eax ; 0009 _ 58
ret ; 000A _ C3
; _D4test4funcFKiZv End of functionIt appears that the compiler isn't able to optimize the indirection and still passes the integer argument by reference. I'm going to try this again with the pointer types but this is interesting. |
|
Looks like the same thing happens with classes, I'm going to disassemble the old put vs new put and see if I see the same thing as you. |
|
Ok, I'm definitely seeing differences in the assembly code. I'm not sure why you were not seeing differences. I've simplified the code since now were looking at assembly, no more need for all those write statements and extra imports: import std.range;
size_t global;
class MyOutputRange
{
size_t x;
void put(int x)
{
asm { nop; }
global = x;
}
}
void main(string[] args)
{
MyOutputRange outputRange;
put(outputRange, 3);
}Old Put SECTION .text$__Dmain align=4 execute ; section number 12, code
; Communal section not supported by YASM
__Dmain:; Function begin, communal
push eax ; 0000 _ 50
lea eax, [esp] ; 0001 _ 8D. 04 24
mov dword [esp], 0 ; 0004 _ C7. 04 24, 00000000
push eax ; 000B _ 50
mov eax, 3 ; 000C _ B8, 00000003
call _D3std5range35__T3putTC7testput13MyOutputRangeTiZ3putFKC7testput13MyOutputRangeiZv; 0011 _ E8, 00000000(rel)
xor eax, eax ; 0016 _ 31. C0
pop ecx ; 0018 _ 59
ret ; 0019 _ C3
; __Dmain End of function
SECTION .text$_D3std5range35__T3putTC7testput13MyOutputRangeTiZ3putFKC7testput13MyOutputRangeiZv align=4 execute ; section number 13, code
; Communal section not supported by YASM
_D3std5range35__T3putTC7testput13MyOutputRangeTiZ3putFKC7testput13MyOutputRangeiZv:; Function begin, communal
push eax ; 0000 _ 50
mov ecx, dword [esp+8H] ; 0001 _ 8B. 4C 24, 08
push eax ; 0005 _ 50
mov eax, dword [ecx] ; 0006 _ 8B. 01
mov edx, dword [eax] ; 0008 _ 8B. 10
call near [edx+14H] ; 000A _ FF. 52, 14
pop eax ; 000D _ 58
ret 4 ; 000E _ C2, 0004
; _D3std5range35__T3putTC7testput13MyOutputRangeTiZ3putFKC7testput13MyOutputRangeiZv End of function
SECTION .text$_D3std5range37__T5doPutTC7testput13MyOutputRangeTiZ5doPutFKC7testput13MyOutputRangeKiZv align=4 execute ; section number 14, code
; Communal section not supported by YASM
_D3std5range37__T5doPutTC7testput13MyOutputRangeTiZ5doPutFKC7testput13MyOutputRangeKiZv:; Function begin, communal
push eax ; 0000 _ 50
push dword [eax] ; 0001 _ FF. 30
mov ecx, dword [esp+0CH] ; 0003 _ 8B. 4C 24, 0C
mov eax, dword [ecx] ; 0007 _ 8B. 01
mov edx, dword [eax] ; 0009 _ 8B. 10
call near [edx+14H] ; 000B _ FF. 52, 14
pop eax ; 000E _ 58
ret 4 ; 000F _ C2, 0004
; _D3std5range37__T5doPutTC7testput13MyOutputRangeTiZ5doPutFKC7testput13MyOutputRangeKiZv End of functionNew Put SECTION .text$__Dmain align=4 execute ; section number 12, code
; Communal section not supported by YASM
__Dmain:; Function begin, communal
push eax ; 0000 _ 50
mov eax, 3 ; 0001 _ B8, 00000003
push 0 ; 0006 _ 6A, 00
call _D3std5range35__T3putTC7testput13MyOutputRangeTiZ3putFC7testput13MyOutputRangeiZv; 0008 _ E8, 00000000(rel)
xor eax, eax ; 000D _ 31. C0
pop ecx ; 000F _ 59
ret ; 0010 _ C3
; __Dmain End of function
SECTION .text$_D3std5range35__T3putTC7testput13MyOutputRangeTiZ3putFC7testput13MyOutputRangeiZv align=4 execute ; section number 13, code
; Communal section not supported by YASM
_D3std5range35__T3putTC7testput13MyOutputRangeTiZ3putFC7testput13MyOutputRangeiZv:; Function begin, communal
push eax ; 0000 _ 50
push eax ; 0001 _ 50
mov eax, dword [esp+0CH] ; 0002 _ 8B. 44 24, 0C
mov ecx, dword [eax] ; 0006 _ 8B. 08
call near [ecx+14H] ; 0008 _ FF. 51, 14
add esp, 4 ; 000B _ 83. C4, 04
ret 4 ; 000E _ C2, 0004
; _D3std5range35__T3putTC7testput13MyOutputRangeTiZ3putFC7testput13MyOutputRangeiZv End of function
SECTION .text$_D3std5range37__T5doPutTC7testput13MyOutputRangeTiZ5doPutFC7testput13MyOutputRangeKiZv align=4 execute ; section number 14, code
; Communal section not supported by YASM
_D3std5range37__T5doPutTC7testput13MyOutputRangeTiZ5doPutFC7testput13MyOutputRangeKiZv:; Function begin, communal
push eax ; 0000 _ 50
push dword [eax] ; 0001 _ FF. 30
mov eax, dword [esp+0CH] ; 0003 _ 8B. 44 24, 0C
mov ecx, dword [eax] ; 0007 _ 8B. 08
call near [ecx+14H] ; 0009 _ FF. 51, 14
add esp, 4 ; 000C _ 83. C4, 04
ret 4 ; 000F _ C2, 0004
; _D3std5range37__T5doPutTC7testput13MyOutputRangeTiZ5doPutFC7testput13MyOutputRangeKiZv End of function |
|
It doesn't even compile for me with full optimizations turned on: there is a null pointer. I'll try with MyOutputRange outputRange = new MyOutputRange(); though |
|
This is how it looks for me: vs Completely inlined. There are also present the bodies of put that look different, and in fact your pull makes version that looks better. The catch is it's NOT used anywhere after inlining. Anyway here they are: vs |
|
So you're saying that the doPut/put functions get generated but no one calls them because the call to put gets inlined? So let's say that's true, the times that the function can't get inlined, the new put is going to still be better correct? |
|
@marler8997 Yeah, that pretty much nails it. If compiler inlines it then ref doesn't matter, if not then passing classes and pointers by copy is going to be faster. |
|
Cool so where does that leave us. Is this something we want to integrate. I can understand the argument that the duplicate code may not be worth having such a small optimization. There also may a feature added to D in the future that would solve this problem (maybe adding a 'cref' attribute instead of 'ref' that means the parameter is referencing the "contents" of the caller's variable instead of the variable itself). What do you think? |
|
I think the problem is more general then put/doPut that is any generic function that takes a range and wants to modify it has to use Now if there was some cleaner idiom I'd be all for it, but solving it with duplication like this for a single case is too much. Conditional ref would be awesome... I have an idea - maybe trumpoline function that is so simple to always be inlined could help: auto someFuncTrumpl(ref T arg) if(ValueType!T){
return someFunc(arg);
}
auto someFuncTrumpl(ref T arg) if(!ValueType!T){
auto ptr= *&arg;
return someFunc(ptr); // get simple value
}The catch is without Need to test assembly output though for as many cases as possible. |
|
Even better - the above could be generalized to something like: // something like:
alias put(T) = forwardRef!(doPut, T);Where forwardRef would alias itself away to raw doPut if T is not a class or pointer , and be a trumpoline template otherwise. Best of both worlds it seems. |
|
Interesting...could you provide the code for |
|
I was wrong on the case that was aliased away and interestingly I've found it reduces template bloat: pointers and references use the same instance(!). Somewhat raw but a good start: import std.traits;
private enum bool isPassByValue(T) = isPointer!T || is( T == class ) ||
is (T == delegate) || is( T == function );
//same doPut but _stripped_ of `ref` storage class
void doPut(T)(T arg){
pragma(msg, "Instantiated " ~ T.stringof);
//...
static if(isPointer!T)
*arg = 1;
else
*arg = 1;
}
template forwardRef(alias Fn, T)
{
static if(isPassByValue!T)
{
pragma(msg, T.stringof ~" by value");
alias forwardRef = Fn!T;
}
else
{
pragma(msg, T.stringof ~" by ref");
auto forwardRef(ref T arg)
{
//convert to pointer explicitly
return Fn(&arg);
}
}
}
void main()
{
alias a = forwardRef!(doPut, int);
alias b = forwardRef!(doPut, int*);
int k;
a(k);
assert(k == 1);
k = 0;
b(&k);
assert(k == 1);
}UPDATE: fixed main to be more insteresting. |
|
I started out writing the new put function this way. The problem is that I couldn't figure out how to write the put template in a way that the template parameters could be deduced. If the template parameters cannot be deduced then everything that uses an output range will need to explicitly write the template parameter for every put call. That's why I proposed cref. But no one has looked at my proposal. The only way I know how to make a proposal for the language is to post on the forums but alot of times I get very little response. I think the put function demonstrates a deficiency that D has no solution to right now aside from writing the same function twice as I have done here. Is there a better way to make a proposal like cref and actually get people to see it? |
|
I'd like to get some closure on this PR. To summarize, this PR does make the generated put/doPut functions slightly more efficient by removing a level of indirection for "pointer" types, but at the cost of writing the functions source code twice. However, many times these functions get inlined which also removes the indirection anyway. @DmitryOlshansky feels like the duplicated source code is not a good solution but is open to exploring other solutions. I've proposed a new (conditional reference) attribute I made a post here: http://forum.dlang.org/thread/dhqrbvslzzoxmhzisnga@forum.dlang.org There are likely other solutions as well. If we could solve the template parameter type deduction problem we could use the solution that @DmitryOlshansky suggested. I would like to get some more opinions on this. Maybe we should close this PR and create an issue instead? Not sure how to move forward. |
|
Well a change to the language is not likely to be accepted for a corner case. Problem of optimal parameter passing is not new, and e.g. C++ guys "solved" in Boost library. It is however rather involved. If C++ can do it in library we must be able to. I'll take a shot at deduction problem in a day or two. |
|
Sounds good. If we find a solution there are more functions in phobos that could benefit as well. |
|
It's been a while but haven't had the time to try it out. Stay tuned. |
|
This is stuck and the general idea of optimal forwarding for classes/pointers should be just converted to bugzilla enhancement. The plus side is handling ref and pointer in one template instance. Last code: import std.traits;
private enum bool isPassByValue(T) = isPointer!T || is( T == class ) ||
is (T == delegate) || is( T == function );
//same doPut but _stripped_ of `ref` storage class
void doPut(T)(T arg){
pragma(msg, "Instantiated " ~ T.stringof);
//...
static if(isPointer!T)
*arg = 1; // pointer branch
else{
*arg = 1; // delelgate class branch
}
}
template forwardRef(alias Fn, T)
{
static if(isPassByValue!T)
{
pragma(msg, T.stringof ~" by value");
alias forwardRef = Fn!T;
}
else
{
pragma(msg, T.stringof ~" by ref");
auto forwardRef(ref T arg)
{
//convert to pointer explicitly
return Fn(&arg);
}
}
}
void main()
{
alias a = forwardRef!(doPut, int);
alias b = forwardRef!(doPut, int*);
int k;
a(k);
assert(k == 1);
k = 0;
b(&k);
assert(k == 1);
} |
|
I turned this into an ehancement request: With that I'm closing it - any effort on this front should be generic and address this problem w/o duplicating code. |
The original put/doPut functions pass in the OutputRange by ref, i.e.
However, some types do not need to be passed by reference such as pointers, dynamic arrays, classes, delegates...
To better optimize these output ranges I added overloads for the put/doPut functions that pass the OutputRange by value instead of by reference (See code).
Now that each overload handles a different set of types, some of the logic can be removed from each function, but I just duplicated the logic twice to minimize the potential amount of turmoil (I don't want to mess anything up). If anyone knows "for sure" what logic can be removed I'm willing to do that, but, most of the logic is compile time so the only gain will be slightly faster compile times (not necessarily smaller/leaner functions). Also feel free to make any modifications to my code:)