
Algorithmically quite slow #28

Open
Restioson opened this issue Apr 30, 2018 · 14 comments

Comments

@Restioson

Restioson commented Apr 30, 2018

I decided to look at the generated assembly for the left_child function (inlining everything it calls):

example::left_child:
  push rbp
  mov rbp, rsp
  xor eax, eax
  test dil, 1
  jne .LBB0_2
  mov rdx, rdi
  pop rbp
  ret
.LBB0_2:
  mov rdx, rdi
  mov rcx, rdx
.LBB0_3:
  shr rcx
  add rax, 1
  test dl, 2
  mov rdx, rcx
  jne .LBB0_3
  test rax, rax
  je .LBB0_5
  lea ecx, [rax + 1]
  shr rdi, cl
  add rdi, rdi
  lea rdx, [rax - 1]
  mov ecx, eax
  shl rdi, cl
  mov esi, 1
  mov ecx, edx
  shl rsi, cl
  mov eax, 1
  add rsi, -1
  or rdi, rsi
  mov rdx, rdi
  pop rbp
  ret
.LBB0_5:
  mov eax, 1
  mov rdx, rdi
  pop rbp
  ret

By rewriting it to store each level in its entirety consecutively (level 1, then level 2, then level 3, and so on), I got this (the code has since changed and I'll update later, but it wasn't much longer):

example::left_child:
  push rbp
  mov rbp, rsp
  lea rax, [rdi + rdi]
  add rax, -1
  pop rbp
  ret

I understand that the algorithm is taken from the JS project, but I still think that this algorithm is quite slow...
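
For reference, the current numbering derives children from a node's depth and offset, which is where the loop in the assembly comes from. Roughly, and only as an illustrative sketch of the flat-tree-style indexing (not the crate's exact code):

fn depth(i: usize) -> usize {
  // count the trailing 1 bits with a loop, which is the loop
  // visible in the assembly above
  let mut i = i;
  let mut depth = 0;
  while i % 2 == 1 {
    i >>= 1;
    depth += 1;
  }
  depth
}

fn offset(i: usize) -> usize {
  // horizontal position of the node within its level
  i >> (depth(i) + 1)
}

fn index(d: usize, off: usize) -> usize {
  // node index for a given depth and offset
  (off << (d + 1)) | ((1 << d) - 1)
}

fn left_child(i: usize) -> Option<usize> {
  if i % 2 == 0 {
    return None; // leaves (even indexes) have no children
  }
  let d = depth(i);
  Some(index(d - 1, offset(i) * 2))
}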

@ralphtheninja
Contributor

Maybe so. Another question is whether it matters.

@yoshuawuyts
Collaborator

@Restioson oh, that's super interesting; that's quite a reduction!

I'd be super keen to see more on how this was achieved. You're right in that so far we've mostly focused on porting JS to Rust, but it'd be really cool to have a path forward to improve performance!

@yoshuawuyts
Collaborator

@Restioson oh, by the way: I'm seeing lots of test instructions in the output assembly. I believe those exist to prevent integer wrap-arounds. Could you make sure you're running the comparisons on release builds? Thanks!

@Restioson
Author

Restioson commented May 1, 2018

@yoshuawuyts I believe godbolt compiles in release mode by default. I'll try to create another godbolt demo and put the link in. Performance-wise, subbing out the flat-tree fns for my own took my time from roughly 15µs down to about 400ns. My algorithm stores a tree like

1
2 3

as [1, 2, 3].

To find the left child, we multiply the (1-indexed) index by two. To find the right child, we multiply the (again 1-indexed) index by two and add one. To find the parent, we divide the index by two and floor it. This is worse for cache locality, but in general faster. The very slow function, by the way, was depth: it has a loop, and is O(depth) (I think).
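
In code, that layout boils down to something like this (a minimal sketch; the names are illustrative and not this crate's API):

// 1-indexed, level-order layout: node i's children are 2i and 2i + 1
fn left_child(i: usize) -> usize {
  2 * i
}

fn right_child(i: usize) -> usize {
  2 * i + 1
}

fn parent(i: usize) -> usize {
  i / 2 // integer division already floors
}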

@ralphtheninja
Contributor

Sounds about right that it's only depth that "needs" to be optimized.

@Restioson
Author

I assume depth is calculating log2? If it is, http://huonw.github.io/fast-math/fast_math/fn.log_raw2.html could be used.

@khernyo
Member

khernyo commented Jun 16, 2018

@Restioson If I understand your proposal, that would change the numbering in a way that makes it unfit for streaming data. Let me explain.

This library is used in https://github.com/datrs/merkle-tree-stream to help with producing the merkle tree stream. The idea is that data blocks are coming in and the hashes of these blocks (the leaves) are generated continuously. When there are enough leaves (or parents) to calculate the cumulative hash of a subtree, that hash is generated. There is no info about how many data blocks we have, so no info about the depth either. This is why the numbering starts with the leftmost leaf, so that any number of nodes can be generated incrementally. Take a look at https://github.com/datrs/print-flat-tree to get some idea about this layout/numbering.

Would it be possible to retain the same ordering with your scheme? If so, then indeed it could be used.

Regarding the depth function, it can be reduced to the following:

pub fn depth(i: usize) -> usize {
  // negating flips the bits, so this counts the trailing 1 bits of i
  (!i).trailing_zeros() as usize
}

(It could be beneficial to make it return u32; then no conversion would be necessary.)
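
A sketch of that variant, purely illustrative:

pub fn depth(i: usize) -> u32 {
  // trailing_zeros already returns u32, so no cast is needed
  (!i).trailing_zeros()
}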

@Restioson
Author

@khernyo I wasn't specifically asking to replace the algorithm, just to improve the current one (perhaps using an alternate algorithm). What does depth do, by the way?

@khernyo
Member

khernyo commented Jun 17, 2018

Depth is a node's distance from a child leaf. So, e.g. in a tree with four leaves:

 0─┐
   1─┐
 2─┘ │
     3
 4─┐ │
   5─┘
 6─┘

the leaves (nodes 0, 2, 4 and 6) have a depth of 0, and the root (node 3) has a depth of 2.
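
As a quick sanity check, the reduced depth above gives the same values for that diagram (a small illustrative snippet):

fn depth(i: usize) -> usize {
  (!i).trailing_zeros() as usize
}

fn main() {
  // leaves sit at even indexes
  assert_eq!(depth(0), 0);
  assert_eq!(depth(2), 0);
  assert_eq!(depth(4), 0);
  assert_eq!(depth(6), 0);
  // their parents
  assert_eq!(depth(1), 1);
  assert_eq!(depth(5), 1);
  // the root of this 4-leaf tree
  assert_eq!(depth(3), 2);
}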

@Restioson
Author

Fair enough. I'll try to hack around with this and see how much of a difference subbing in the new depth function makes.

@yoshuawuyts
Collaborator

yoshuawuyts commented Nov 2, 2018

Just rewrote the depth function in #38, so now it's just:

#[inline]
pub fn depth(i: usize) -> usize {
  (!i).trailing_zeros() as usize
}

In turn, this now compiles to the following assembly:

xor     al, -1
movzx   edi, al
call    core::num::<impl u8>::trailing_zeros
mov     dword ptr [rsp + 4], eax
mov     eax, dword ptr [rsp + 4]
mov     ecx, eax
mov     eax, ecx

Which is about as fast as we can get. The only possible speedup I could see here is by not casting to a usize, but that would be a major change for the API.

Hope this brings us a bit closer to acceptable performance.

Also, if anyone would like to take a stab at building a perf suite using criterion, that'd be amazing!
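
If someone wants a starting point, a minimal criterion benchmark could look something like the sketch below (the flat_tree::depth path is an assumption; adjust to the crate's actual API):

// benches/depth.rs — assumes a [[bench]] target with harness = false in Cargo.toml
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_depth(c: &mut Criterion) {
  c.bench_function("depth", |b| {
    b.iter(|| {
      // hammer depth over a small range of indexes
      for i in 0..1024usize {
        black_box(flat_tree::depth(black_box(i)));
      }
    })
  });
}

criterion_group!(benches, bench_depth);
criterion_main!(benches);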

@Restioson
Author

I wonder if trailing_zeros will ever get inlined? That might make a difference.

@yoshuawuyts
Collaborator

yoshuawuyts commented Nov 5, 2018

@Restioson .trailing_zeros() should always get inlined (src).

edit: yeah, oops, I forgot to mention: if you add the right optimization flags to rustc, it'll produce native instructions. This happens automatically with cargo build --release.

depth:
  not     rdi
  tzcnt   rax, rdi
  ret

I wrote an RFC yesterday to introduce .trailing_ones() to Rust's stdlib, and it looks like all that needs to happen is submitting a PR to the compiler.

@SirRujak

Just wanted to mention that it appears that .trailing_ones() has been inlined as of January 27, 2020 as per this pull request: rust-lang/rust#68165
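
For reference, with that method available, the depth function could presumably be written directly as (a sketch, not the crate's current code):

#[inline]
pub fn depth(i: usize) -> usize {
  // trailing_ones counts the low 1 bits, which is what (!i).trailing_zeros() computes
  i.trailing_ones() as usize
}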
