Optimization to integer addition #897

Closed
waywardmonkeys opened this Issue May 8, 2015 · 6 comments

@waywardmonkeys
Member

waywardmonkeys commented May 8, 2015

When we add two integers, we avoid fully untagging and retagging them by stripping the tag from only one operand:

define sealed inline method \+ (x :: <integer>, y :: <integer>)
 => (result :: <integer>)
  let mx = interpret-integer-as-machine-word(x);
  let my = strip-integer-tag(interpret-integer-as-machine-word(y));
  let result = machine-word-add-signal-overflow(mx, my);
  interpret-machine-word-as-integer(result)
end method \+;

The definition of strip-integer-tag is:

define inline-only function strip-integer-tag
    (x :: <machine-word>) => (result :: <machine-word>)
  machine-word-logxor(x, $integer-tag-value)
end function strip-integer-tag;

This results in this sort of thing in the generated assembly (x86_64):

    0x100003e6f <+47>: xorq   $0x1, %rax
    0x100003e73 <+51>: addq   %rax, %r14

@pkhuong has suggested that we could get better generated code by using a subtraction instead of an xor operation here, as that would allow LLVM to do better optimization:

[19:33:08] brucem: I think I kind of see what's going on.
[19:33:30] the xor is messing up llvm's detection of commutativity
[19:34:17] it does r14 <- rax + r14; r14 -> rbx; rbx -> rax
[19:34:49] ideally, you'd get rax <- rax + r14

This seems like an interesting experiment to try.

This also highlighted an issue: not all back-ends necessarily use tagged integers (my JS back-end, for example, might not).

An easy way to address this would be to add a new primitive-strip-integer-tag (and perhaps primitives for some other similar operations) so that each back-end can do the right thing.

Member

waywardmonkeys commented May 8, 2015

I also took the .ll file, hand-edited it to change the xor to a sub, passed that to clang -O2 -c -S, and then diffed the assembly outputs:

batavia:fib bruce$ diff -u fib.xor.S fib.sub.S
--- fib.xor.S   2015-05-08 20:43:43.000000000 +0700
+++ fib.sub.S   2015-05-08 20:44:09.000000000 +0700
@@ -34,7 +34,7 @@
 ## BB#3:
    movq    %rbx, %rdi
    callq   _KfibVfibI
-   xorq    $1, %rax
+   decq    %rax
    addq    %rax, %r14
    jo  LBB0_6
 ## BB#4:

The instruction count is unchanged here, but subtraction would do better in other situations where constants can be folded, etc.

Member

waywardmonkeys commented May 8, 2015

To clear things up a bit more, the register dance he was referring to can be seen here:

  callq _KfibVfibI
  movq  %rax, %r14
  subq  $8, %rbx
  jo  LBB0_6
## BB#3:
  movq  %rbx, %rdi
  callq _KfibVfibI
  decq  %rax
  addq  %rax, %r14
  jo  LBB0_6
## BB#4:
  movq  %r14, %rbx
LBB0_5:
  movb  $1, %dl
  movq  %rbx, %rax
pkhuong commented May 8, 2015

dec is one byte (+ REX); xor with immediate is 5 (+ REX). More compact code is good ;)

Here's a short example of how LLVM can fold things better with dec than xor:

int
foo (int *x, long index)
{
    return x[index ^ 1];
}

int
bar (int *x, long index)
{
    return x[index - 1];
}

foo compiles to xorq $1, %rsi / movl (%rdi,%rsi,4), %eax, while bar compiles to a straight movl -4(%rdi,%rsi,4), %eax.

Member

waywardmonkeys commented May 8, 2015

I'm testing the straightforward way of fixing this with a three-stage bootstrap now.

Member

waywardmonkeys commented May 8, 2015

Amusingly:

-rwxr-xr-x  1 bruce  staff  2129948 May  8 21:46 Bootstrap.2/lib/libdylan.dylib
-rwxr-xr-x  1 bruce  staff  2117804 May  8 21:35 Bootstrap.1/lib/libdylan.dylib

It got bigger!

I haven't looked at the generated code to see how much things may have changed yet.

waywardmonkeys added a commit to waywardmonkeys/opendylan that referenced this issue May 8, 2015

[dylan] strip-integer-tag can use subtraction.
This is an experimental change to have strip-integer-tag use
subtraction rather than xor as that should allow LLVM to
do better optimization in some cases. It should also result
in smaller code.

Fix #897.

* sources/dylan/machine-word-lowlevel.dylan
  (macro simple-arithmetic): New.
  (simple-arithmetic machine-word-subtract): New.

* sources/dylan/conversion-tagged-integer.dylan
  (strip-integer-tag): Use machine-word-subtract instead of
    machine-word-logxor.

* sources/dylan/dylan.lid: Build machine-word-lowlevel.dylan before
    character.dylan as character.dylan ends up having a top level
    reference to machine-word-subtract which is an illegal forward
    reference.
Member

waywardmonkeys commented May 8, 2015

With the change from that pull request, diffing the disassembly shows things like this (from table.dylan compiled with the C back-end):

- movl  0x1c(%eax), %eax
- xorl  $0x1, %eax
- subl  %eax, %esi
- movl  _dylan_teb+460(%edi), %eax
+ incl  %esi
+ subl  0x1c(%eax), %esi
+ movl  _dylan_teb+456(%edi), %eax
- andl  %ebx, %ecx
- movl  %esi, %edx
- movl  %edx, -0x58(%ebp)
- xorl  $0x1, %esi
- leal  0x5(%esi,%ecx), %eax
+ andl  %edx, %ecx
+ leal  0x1(%ecx), %edx
+ leal  -0x1(%esi), %eax
+ movl  %eax, -0x58(%ebp)
+ leal  0x4(%esi,%ecx), %eax
- movl  %ecx, %edx
- movl  %eax, %ecx
- xorl  $0x1, %ecx
- addl  %edx, %ecx
- movl  %ecx, %esi
+ leal  -0x1(%ecx,%eax), %edx
+ movl  %edx, %esi
- movl  _Dsecondary_stridesVKi-20622(%ecx), %eax
- movl  0x8(%edx,%eax), %eax
- xorl  $0x1, %eax
- movl  %eax, -0x44(%ebp)
- nopl  _Kuninitialized_table_testVKiI(%eax)
+ movl  _Dsecondary_stridesVKi-20334(%ecx), %eax
+ movl  $0x1, %esi
+ subl  0x8(%edx,%eax), %esi
+ movl  %esi, -0x40(%ebp)
+ nopl  (%eax)
  movl  %ebx, %eax
- subl  -0x44(%ebp), %eax
+ addl  -0x40(%ebp), %eax
+ testl %eax, %eax