Use tail-recursive version of List.take at large n #668

Merged: 2 commits into elm:master on Jul 18, 2016

nphollon (Contributor) commented Jul 18, 2016

This is a follow-up to #659. We are looking for a way to implement List.take that avoids the risk of a stack overflow and performs well over as wide a domain as possible. My previous PR failed the second criterion, but after taking a look at this post, I think I have found something with acceptable performance.

Would love to get some feedback on this!

The Fast Way

takeFast is similar to the current version of take, with a bit of loop unrolling. Each call consumes several elements instead of one, so the recursion is shallower and the function can handle larger values of n without crashing. It is also slightly faster than the current implementation.

I expected that unrolling the loop 4 times would raise the limit on n by a factor of 4, but this turned out to be wrong. In Node, it is only a factor of about 2.6.

If we unroll the loop further, we can speed up the function and raise the limit on n, but these effects drop off pretty quickly.
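
The shape of the unrolled version can be sketched in plain JavaScript (the language Elm compiles to). This is a hypothetical reconstruction for illustration, not the PR's actual code; `Cons`/`Nil` are stand-ins for Elm's compiled list representation:

```javascript
// Minimal cons-list standing in for Elm's runtime list representation.
const Nil = null;
const Cons = (head, tail) => ({ head, tail });

// Non-tail-recursive take, unrolled 4x: each stack frame consumes up to
// four elements, so the recursion depth is roughly n / 4 instead of n.
function takeFast(n, list) {
  if (n <= 0 || list === Nil) return Nil;
  const a = list;
  if (n === 1 || a.tail === Nil) return Cons(a.head, Nil);
  const b = a.tail;
  if (n === 2 || b.tail === Nil) return Cons(a.head, Cons(b.head, Nil));
  const c = b.tail;
  if (n === 3 || c.tail === Nil) {
    return Cons(a.head, Cons(b.head, Cons(c.head, Nil)));
  }
  const d = c.tail;
  // Recurse after consuming four elements; the result is rebuilt on the
  // way back up, which is what eventually overflows the stack for large n.
  return Cons(a.head,
         Cons(b.head,
         Cons(c.head,
         Cons(d.head, takeFast(n - 4, d.tail)))));
}
```

Each frame does four head/tail inspections instead of one, trading code size for stack depth.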

The Tail-Recursive Way

takeTailRec implements take so that the recursive calls can be optimized. The good news is we no longer have to worry about stack overflows. The bad news is we need to reverse the list before returning it.

Experimentally, the execution time for takeTailRec is about 1.5 times that of takeFast.
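
In JavaScript terms, the tail-recursive version corresponds to the while loop that Elm's compiler produces for self tail calls. This is a sketch, not the PR's code; `Cons`/`Nil` again stand in for the compiled list representation:

```javascript
// Minimal cons-list standing in for Elm's runtime list representation.
const Nil = null;
const Cons = (head, tail) => ({ head, tail });

// Tail-recursive take, written as the loop a tail-call optimization
// yields: accumulate taken elements in constant stack space.
function takeTailRec(n, list) {
  let acc = Nil;
  while (n > 0 && list !== Nil) {
    acc = Cons(list.head, acc); // O(1) per element, no new stack frame
    list = list.tail;
    n--;
  }
  // The accumulator is backwards, so one extra reverse pass is needed.
  let result = Nil;
  while (acc !== Nil) {
    result = Cons(acc.head, result);
    acc = acc.tail;
  }
  return result;
}
```

The extra reverse pass is the price paid for never growing the stack, which matches the roughly 1.5x slowdown measured above.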

Switching between them

The new version of List.take in this PR uses takeFast for small n and switches to takeTailRec when n > 5000. This switch point is somewhat arbitrary. Obviously we want to use the faster implementation as much as we can, but different platforms have different stack sizes, so it's not clear exactly where the danger line is. Firefox started having issues with takeFast around n > 7000. Chrome and IE were better. If we want to test on other platforms, I can share the code for my (very primitive) test harness.
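
The dispatch itself is just a threshold check on the initial n. A simplified JavaScript sketch (the unrolling is omitted here for brevity; 5000 mirrors the switch point quoted above):

```javascript
// Minimal cons-list standing in for Elm's runtime list representation.
const Nil = null;
const Cons = (head, tail) => ({ head, tail });

// Reliable path: constant stack, one extra reverse pass.
function takeTailRec(n, list) {
  let acc = Nil;
  while (n > 0 && list !== Nil) { acc = Cons(list.head, acc); list = list.tail; n--; }
  let out = Nil;
  while (acc !== Nil) { out = Cons(acc.head, out); acc = acc.tail; }
  return out;
}

// Fast path: plain recursion, which risks a stack overflow at large n.
function takeFast(n, list) {
  if (n <= 0 || list === Nil) return Nil;
  return Cons(list.head, takeFast(n - 1, list.tail));
}

// Choose an implementation from the initial n.
function take(n, list) {
  return n > 5000 ? takeTailRec(n, list) : takeFast(n, list);
}
```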

Methodology

These implementations of take were tested by repeatedly running take n [1..10000] for different values of n. Testing was done in Node, Firefox, and Chromium (on Linux Mint) and IE (on Windows 7). I would be happy to provide the raw data and the source code for the test harness.
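
A primitive harness along these lines would reproduce the measurement. The author's actual harness was not shared, so the function names and repetition count here are made up:

```javascript
// Hypothetical timing harness: time take(n, list) over a 10,000-element
// list for several values of n. The implementation under test is passed
// in, so any of the variants above can be measured the same way.
function bench(take, ns, reps = 100) {
  const list = [];
  for (let i = 1; i <= 10000; i++) list.push(i);
  return ns.map(n => {
    const start = Date.now();
    for (let i = 0; i < reps; i++) take(n, list);
    return { n, ms: Date.now() - start };
  });
}
```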

process-bot commented Jul 18, 2016

Thanks for the pull request! Make sure it satisfies this checklist. My human colleagues will appreciate it!

Here is what to expect next, and if anyone wants to comment, keep these things in mind.

evancz (Member) commented Jul 18, 2016

This is really exciting! It looks like this implementation falls back on the slow/reliable version based on the initial input, whereas the approach Yaron suggested in his post only fell back on the other method after getting N deep. Did you try an implementation like that? I'd expect it to make the beginning of a long take faster. I think it would require having an additional Int argument that tracks stack depth in takeFast but I'm not certain.
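
The counter-based fallback described here might look like the following in JavaScript terms. This is a sketch under assumptions: the unrolling is omitted for clarity, and MAX_DEPTH = 1000 is an illustrative value, not a measured one:

```javascript
// Minimal cons-list standing in for Elm's runtime list representation.
const Nil = null;
const Cons = (head, tail) => ({ head, tail });

// Reliable fallback: constant stack, one extra reverse pass.
function takeTailRec(n, list) {
  let acc = Nil;
  while (n > 0 && list !== Nil) { acc = Cons(list.head, acc); list = list.tail; n--; }
  let out = Nil;
  while (acc !== Nil) { out = Cons(acc.head, out); acc = acc.tail; }
  return out;
}

// Fast path with a depth counter: recurse normally until we are
// MAX_DEPTH frames deep, then hand the remainder to the reliable
// version. The beginning of a long take stays on the fast path.
const MAX_DEPTH = 1000; // illustrative, not the PR's value
function takeFast(ctr, n, list) {
  if (n <= 0 || list === Nil) return Nil;
  if (n === 1) return Cons(list.head, Nil);
  if (ctr >= MAX_DEPTH) return takeTailRec(n, list);
  return Cons(list.head, takeFast(ctr + 1, n - 1, list.tail));
}
```

Callers would start with `takeFast(0, n, list)`, so the counter is the extra Int argument tracking stack depth.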


evancz (Member) commented Jul 18, 2016

Two other thoughts:

  1. I have a theory on why unrolling to 4x does not increase stack depth by 4x. The stack is a fixed size. Allocating a stack frame takes space, independent of the particulars of the function. So I guess we save on the fixed costs of allocating a stack frame, but the body of that stack frame is longer than before. And maybe that fixed cost is so high, that jamming more in the function body ultimately takes a bit less space. Otherwise, I'm not sure! :)
  2. I am curious if this approach would give us a faster map implementation as well. It'd be great to move more (all?) of List out of JS, and perhaps it'd be faster in the typical cases anyway. In other words, if you are interested in looking into this approach on other functions, I think there's a chance it would work really well!
evancz (Member) commented Jul 18, 2016

Okay, sorry for being disorganized. Here are the two crucial questions.

First, what is the impact of falling back to takeTailRec only after you have recursed through takeFast as far as possible?

If it works well, I think that's the way to go.

Second, what is the impact of doing some unrolling in takeTailRec?

I notice it does not do that at the moment. I assume unrolling makes the compiled code larger, so at some point, it becomes undesirable. Unclear how that tradeoff would play out, so I think it should be outside the scope of this PR. I just wanted to bring it up as soon as possible.


nphollon (Contributor) commented Jul 18, 2016

I hadn't tried switching implementations mid-stream the way Yaron does. But I'm looking at that now and it does look like an improvement. As you'd expect, the execution time drifts slowly from the takeFast curve to the takeTailRec curve. I will update the PR.

Unrolling takeTailRec gave a slight improvement in performance (about 10% faster for n=5000). This is a much weaker effect than it has on takeFast (about 75% faster at n=5000).


evancz (Member) commented Jul 18, 2016

About switching midway, great, glad it works!

That's really interesting about takeTailRec. Perhaps because it gets compiled to a while loop, the cost of looping back is not that high relative to everything else, whereas with a function call, the bookkeeping around it dominates the cost so much that avoiding it is a big deal. Given the numbers, unrolling or not both seem fine to me, so I'll leave it to your discretion.


nphollon (Contributor) commented Jul 18, 2016

Implementation has been updated. I left takeTailRec as is.

I'd be happy to take a look at the other List functions, but I am not sure I will have much time to contribute in the near future.


evancz (Member) commented Jul 18, 2016

Alright, looks good to me. Thanks for working on this and sharing your findings. I learned some cool stuff about JavaScript :)


@evancz evancz merged commit e24730e into elm:master Jul 18, 2016

1 check was pending: continuous-integration/travis-ci/pr (The Travis CI build is in progress)
nphollon (Contributor) commented Jul 18, 2016

Glad I was able to help!


Mouvedia commented Dec 25, 2016

Was the choice of 1000 as the limit completely arbitrary?


nphollon (Contributor) commented Dec 27, 2016

Not completely. See the comments in the original post about setting the limit at 5000. With the changes that happened after that, I needed to lower the limit further.

If the limit is too high, the stack may overflow. The exact size of the stack depends on the platform. It's probably safe to raise the limit a little bit, but since I could not test the code on every platform, I thought it better to err on the conservative side.

