New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of List.foldr #872

Closed
wants to merge 2 commits into
base: dev
from

Conversation

Projects
None yet
3 participants
@Skinney
Contributor

Skinney commented Jun 12, 2017

This PR takes the same idea presented in #707 and applies it to List.foldr. I see performance improvements of 30-40% on my machine (Safari and Chrome). Since foldr is used by many other list functions, this gives a nice performance boost across the board.

I tested each level of "unrollment" individually. 9 seemed to be the magic number.

I tried the same thing with foldl. It gave a modest improvement in Chrome (8-10%), but reduced performance in Safari (10-14%).

I have not tested if #707 still provides a performance improvement over a map fn using this foldr implementation.

@process-bot

This comment has been minimized.

Show comment
Hide comment
@process-bot

process-bot Jun 12, 2017

Thanks for the pull request! Make sure it satisfies this checklist. My human colleagues will appreciate it!

Here is what to expect next, and if anyone wants to comment, keep these things in mind.

process-bot commented Jun 12, 2017

Thanks for the pull request! Make sure it satisfies this checklist. My human colleagues will appreciate it!

Here is what to expect next, and if anyone wants to comment, keep these things in mind.

@evancz

This comment has been minimized.

Show comment
Hide comment
@evancz

evancz Jun 13, 2017

Member

Whoo, very cool! :D Three questions:

  • Instead of having branches for every single scenario, what if you only had 0, 1, and 9? Or 0, 1, 2, 4, 8? Or some other pattern for reducing the number of branches? How much does that reduce the size of generated code? How much perf does it cost?

  • Can you use f a (f b (f c ...)) instead of using |>? No reason to require a compiler optimization to trigger if we do not need to.

  • Is the generated code is doing things like xs._1._1._1._1.ctor? Can that can be avoided by writing nested cases instead of one big one? How does that change generated code size and performance?

Member

evancz commented Jun 13, 2017

Whoo, very cool! :D Three questions:

  • Instead of having branches for every single scenario, what if you only had 0, 1, and 9? Or 0, 1, 2, 4, 8? Or some other pattern for reducing the number of branches? How much does that reduce the size of generated code? How much perf does it cost?

  • Can you use f a (f b (f c ...)) instead of using |>? No reason to require a compiler optimization to trigger if we do not need to.

  • Is the generated code is doing things like xs._1._1._1._1.ctor? Can that can be avoided by writing nested cases instead of one big one? How does that change generated code size and performance?

@Skinney

This comment has been minimized.

Show comment
Hide comment
@Skinney

Skinney Jun 13, 2017

Contributor
  • Won't compile without branches for every single scenario.
  • Sure
  • Not only did nested case-statements reduce generated code size with about 550 bytes (2500bytes -> 1950bytes). It also increased performance. We're now in the 40-45% range of performance improvements (Chrome & Safari). Good call :)
Contributor

Skinney commented Jun 13, 2017

  • Won't compile without branches for every single scenario.
  • Sure
  • Not only did nested case-statements reduce generated code size with about 550 bytes (2500bytes -> 1950bytes). It also increased performance. We're now in the 40-45% range of performance improvements (Chrome & Safari). Good call :)
@Skinney

This comment has been minimized.

Show comment
Hide comment
@Skinney

Skinney Jun 13, 2017

Contributor

Another interesting thing, for future reference, is that I consistently get 50+% (so another 5-10 percentiles) performance improvement when inlining the ctr > 500 line in the compiled code. By that I mean changing the compiled code from ...Util.cmp(ctr, 500) > 0 -> ctr > 500.

Contributor

Skinney commented Jun 13, 2017

Another interesting thing, for future reference, is that I consistently get 50+% (so another 5-10 percentiles) performance improvement when inlining the ctr > 500 line in the compiled code. By that I mean changing the compiled code from ...Util.cmp(ctr, 500) > 0 -> ctr > 500.

@evancz

This comment has been minimized.

Show comment
Hide comment
@evancz

evancz Jun 13, 2017

Member

Neat! I want to do some thinking about whether this has implications on how pattern matching code is generated.

About it not compiling, you need to change it from a :: b :: [] to a :: b :: _ and do a recursive call.

Can you share the code that is doing the performance comparison? I would like to run it myself and see the numbers. With that ability, I can take a look at this on my own in the next few days.

And yeah, this is one of the first optimizations we can do if type information is kept all the way until code generation. Swapping in === and < for String, Int, etc. will surely make things faster.

Member

evancz commented Jun 13, 2017

Neat! I want to do some thinking about whether this has implications on how pattern matching code is generated.

About it not compiling, you need to change it from a :: b :: [] to a :: b :: _ and do a recursive call.

Can you share the code that is doing the performance comparison? I would like to run it myself and see the numbers. With that ability, I can take a look at this on my own in the next few days.

And yeah, this is one of the first optimizations we can do if type information is kept all the way until code generation. Swapping in === and < for String, Int, etc. will surely make things faster.

@Skinney

This comment has been minimized.

Show comment
Hide comment
Contributor

Skinney commented Jun 13, 2017

@evancz

This comment has been minimized.

Show comment
Hide comment
@evancz

evancz Jun 16, 2017

Member

I merged it in, but it did not register here for some reason.

I ended up reducing the unrolling to 4 to make things a bit smaller without huge perf losses. Not sure if it's the right trade, but it seemed like a reasonable compromise.

Anyway, thanks for looking into this!

Member

evancz commented Jun 16, 2017

I merged it in, but it did not register here for some reason.

I ended up reducing the unrolling to 4 to make things a bit smaller without huge perf losses. Not sure if it's the right trade, but it seemed like a reasonable compromise.

Anyway, thanks for looking into this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment