
Algorithmically quite slow #28

Open
Restioson opened this issue Apr 30, 2018 · 14 comments

Comments

@Restioson

Restioson commented Apr 30, 2018

I decided to look at the generated assembly for the left_child function (inlining everything it calls):

example::left_child:
  push rbp
  mov rbp, rsp
  xor eax, eax
  test dil, 1
  jne .LBB0_2
  mov rdx, rdi
  pop rbp
  ret
.LBB0_2:
  mov rdx, rdi
  mov rcx, rdx
.LBB0_3:
  shr rcx
  add rax, 1
  test dl, 2
  mov rdx, rcx
  jne .LBB0_3
  test rax, rax
  je .LBB0_5
  lea ecx, [rax + 1]
  shr rdi, cl
  add rdi, rdi
  lea rdx, [rax - 1]
  mov ecx, eax
  shl rdi, cl
  mov esi, 1
  mov ecx, edx
  shl rsi, cl
  mov eax, 1
  add rsi, -1
  or rdi, rsi
  mov rdx, rdi
  pop rbp
  ret
.LBB0_5:
  mov eax, 1
  mov rdx, rdi
  pop rbp
  ret

By rewriting it to store each level in its entirety consecutively (level 1, then level 2, then level 3, and so on), I got this (the code has since changed and I'll update later, but it wasn't much longer):

example::left_child:
  push rbp
  mov rbp, rsp
  lea rax, [rdi + rdi]
  add rax, -1
  pop rbp
  ret

I understand that the algorithm is taken from the JS project, but I still think that this algorithm is quite slow...
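
For reference, the current numbering derives children from a node's depth and offset, which is where the loop in the assembly comes from. Roughly, and only as an illustrative sketch of the flat-tree-style indexing (not the crate's exact code):

fn depth(i: usize) -> usize {
  // count the trailing 1 bits with a loop, which is the loop
  // visible in the assembly above
  let mut i = i;
  let mut depth = 0;
  while i % 2 == 1 {
    i >>= 1;
    depth += 1;
  }
  depth
}

fn offset(i: usize) -> usize {
  // horizontal position of the node within its level
  i >> (depth(i) + 1)
}

fn index(d: usize, off: usize) -> usize {
  // node index for a given depth and offset
  (off << (d + 1)) | ((1 << d) - 1)
}

fn left_child(i: usize) -> Option<usize> {
  if i % 2 == 0 {
    return None; // leaves (even indexes) have no children
  }
  let d = depth(i);
  Some(index(d - 1, offset(i) * 2))
}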

@ralphtheninja
Contributor

Maybe so. Another question is whether it matters.

@yoshuawuyts
Collaborator

@Restioson oh, that's super interesting; that's quite a reduction!

I'd be super keen to see more on how this was achieved. You're right in that so far we've mostly focused on porting JS to Rust, but it'd be really cool to have a path forward to improve performance!

@yoshuawuyts
Collaborator

@Restioson oh, by the way: I'm seeing lots of test instructions in the output assembly. I believe those exist to prevent integer wrap-arounds. Could you make sure you're running the comparisons on release builds? Thanks!

@Restioson
Author

Restioson commented May 1, 2018

@yoshuawuyts I believe godbolt compiles in release mode by default. I'll try to create another godbolt demo and put the link in. Performance-wise, subbing out the flat-tree fns for my own took my time from roughly 15µs down to about 400ns. My algorithm stores a tree like

1
2 3

as [1, 2, 3].

To find the left child, we multiply the (1-indexed) index by two. To find the right child, we multiply the (again 1-indexed) index by two and add one. To find the parent, we divide the index by two and floor it. This is worse for cache locality, but in general faster. The very slow function, by the way, was depth: it has a loop, and is O(depth) (I think).
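
In code, that layout boils down to something like this (a minimal sketch; the names are illustrative and not this crate's API):

// 1-indexed, level-order layout: node i's children are 2i and 2i + 1
fn left_child(i: usize) -> usize {
  2 * i
}

fn right_child(i: usize) -> usize {
  2 * i + 1
}

fn parent(i: usize) -> usize {
  i / 2 // integer division already floors
}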

@ralphtheninja
Contributor

Sounds about right that it's only depth that "needs" to be optimized.

@Restioson
Author

I assume depth is calculating log2? If it is, http://huonw.github.io/fast-math/fast_math/fn.log_raw2.html could be used.

@khernyo
Member

khernyo commented Jun 16, 2018

@Restioson If I understand your proposal, that would change the numbering in a way that makes it unfit for streaming data. Let me explain.

This library is used in https://github.com/datrs/merkle-tree-stream to help with producing the merkle tree stream. The idea is that data blocks are coming in and the hashes of these blocks (the leaves) are generated continuously. When there are enough leaves (or parents) to calculate the cumulative hash of a subtree, that hash is generated. There is no info about how many data blocks we have, so no info about the depth either. This is why the numbering starts with the leftmost leaf, so that any number of nodes can be generated incrementally. Take a look at https://github.com/datrs/print-flat-tree to get some idea about this layout/numbering.

Would it be possible to retain the same ordering with your scheme? If so, then indeed it could be used.

Regarding the depth function, it can be reduced to the following:

pub fn depth(i: usize) -> usize {
  // negating flips the bits, so this counts the trailing 1 bits of i
  (!i).trailing_zeros() as usize
}

(It could be beneficial to make it return u32; then no conversion would be necessary.)
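
A sketch of that variant, purely illustrative:

pub fn depth(i: usize) -> u32 {
  // trailing_zeros already returns u32, so no cast is needed
  (!i).trailing_zeros()
}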

@Restioson
Author

@khernyo I wasn't specifically asking to replace the algorithm, just to improve the current one (perhaps using an alternate algorithm). What does depth do, by the way?

@khernyo
Member

khernyo commented Jun 17, 2018

Depth is a node's distance from a child leaf. So, e.g. in a tree with four leaves:

 0─┐
   1─┐
 2─┘ │
     3
 4─┐ │
   5─┘
 6─┘

the leaves (nodes 0, 2, 4 and 6) have a depth of 0, and the root (node 3) has a depth of 2.
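
As a quick sanity check, the reduced depth above gives the same values for that diagram (a small illustrative snippet):

fn depth(i: usize) -> usize {
  (!i).trailing_zeros() as usize
}

fn main() {
  // leaves sit at even indexes
  assert_eq!(depth(0), 0);
  assert_eq!(depth(2), 0);
  assert_eq!(depth(4), 0);
  assert_eq!(depth(6), 0);
  // their parents
  assert_eq!(depth(1), 1);
  assert_eq!(depth(5), 1);
  // the root of this 4-leaf tree
  assert_eq!(depth(3), 2);
}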

@Restioson
Author

Fair enough. I'll try to hack around with this and see how much of a difference subbing in the new depth function makes.

@yoshuawuyts
Collaborator

yoshuawuyts commented Nov 2, 2018

Just rewrote the depth function in #38, so now it's just:

#[inline]
pub fn depth(i: usize) -> usize {
  (!i).trailing_zeros() as usize
}

In turn, this now compiles to the following assembly:

xor     al, -1
movzx   edi, al
call    core::num::<impl u8>::trailing_zeros
mov     dword ptr [rsp + 4], eax
mov     eax, dword ptr [rsp + 4]
mov     ecx, eax
mov     eax, ecx

Which is about as fast as we can get. The only possible speedup I could see here is by not casting to a usize, but that would be a major change for the API.

Hope this brings us a bit closer to acceptable performance.

Also, if anyone would like to take a stab at building a perf suite using criterion, that'd be amazing!
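
If someone wants a starting point, a minimal criterion benchmark could look something like the sketch below (the flat_tree::depth path is an assumption; adjust to the crate's actual API):

// benches/depth.rs — assumes a [[bench]] target with harness = false in Cargo.toml
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_depth(c: &mut Criterion) {
  c.bench_function("depth", |b| {
    b.iter(|| {
      // hammer depth over a small range of indexes
      for i in 0..1024usize {
        black_box(flat_tree::depth(black_box(i)));
      }
    })
  });
}

criterion_group!(benches, bench_depth);
criterion_main!(benches);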

@Restioson
Author

I wonder if trailing_zeros will ever get inlined? That might make a difference.

@yoshuawuyts
Collaborator

yoshuawuyts commented Nov 5, 2018

@Restioson .trailing_zeros() should always get inlined (src).

edit: yeah, oops, I forgot to mention: if you add the right optimization flags to rustc, it'll produce native instructions. This happens automatically with cargo build --release.

depth:
  not     rdi
  tzcnt   rax, rdi
  ret

I wrote an RFC yesterday to introduce .trailing_ones() to Rust's stdlib, and it looks like all that needs to happen is submitting a PR to the compiler.

@SirRujak

Just wanted to mention that it appears that .trailing_ones() has been inlined as of January 27, 2020 as per this pull request: rust-lang/rust#68165
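
For reference, with that method available, the depth function could presumably be written directly as (a sketch, not the crate's current code):

#[inline]
pub fn depth(i: usize) -> usize {
  // trailing_ones counts the low 1 bits, which is what (!i).trailing_zeros() computes
  i.trailing_ones() as usize
}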
