Improve Enumerable.ToArray to avoid excessive copying for lazy enumerables #11208
The idea is that instead of resizing the array (i.e. throwing it out, allocating a new one, and re-copying the data) every time the sequence doesn't fit, we can keep the old array around, allocate a new one of twice the size, and simply continue reading into the new array at index 0. This requires keeping a list of the arrays we've filled so far, but the overhead of that list is substantially less than the size of the data itself.
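A minimal sketch of the scheme in Java (the PR itself is C#, and the class and field names here are hypothetical): finished chunks are saved in a list, and enumeration continues into a fresh chunk twice the size, so no element is copied more than once before the final pass.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the chunked-growth idea from the PR.
public class ChunkedBuffer {
    private final List<int[]> full = new ArrayList<>(); // chunks already filled
    private int[] current = new int[32];                // chunk being filled
    private int index = 0;                              // next free slot in current

    public void add(int value) {
        if (index == current.length) {
            full.add(current);                     // keep the old array around
            current = new int[current.length * 2]; // allocate one twice the size
            index = 0;                             // continue reading at index 0
        }
        current[index++] = value;
    }

    public int count() {
        int n = index;
        for (int[] chunk : full) n += chunk.length;
        return n;
    }

    public static void main(String[] args) {
        ChunkedBuffer b = new ChunkedBuffer();
        for (int i = 0; i < 300; i++) b.add(i);
        System.out.println(b.count()); // prints 300
    }
}
```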
A visualization of what I'm talking about for a 300-length enumerable:
(I also put this in the comments to help future readers understand.)
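The original image isn't reproduced here, but assuming a 32-element initial chunk that doubles each time it fills (an assumption based on the threshold discussed below), a 300-element sequence lands in four chunks, shown here as used/allocated:

```java
// Print the chunk layout (used/allocated) for a 300-element sequence,
// assuming a 32-element initial chunk that doubles on each fill.
public class ChunkLayout {
    public static void main(String[] args) {
        int n = 300, size = 32, filled = 0;
        StringBuilder sb = new StringBuilder();
        while (filled < n) {
            int used = Math.min(size, n - filled);
            sb.append("[").append(used).append("/").append(size).append("] ");
            filled += used;
            size *= 2;
        }
        System.out.println(sb.toString().trim());
        // prints "[32/32] [64/64] [128/128] [76/256]"
    }
}
```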
Then, when we need the final array to return from the method, we just calculate the total number of elements, allocate an array of exactly that size, and copy each buffer into it in order.
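The final assembly step can be sketched as follows (again in Java with assumed names, not the PR's actual C#): sum the chunk counts, allocate an exact-size result, and block-copy each full chunk followed by the used prefix of the last one.

```java
import java.util.ArrayList;
import java.util.List;

public class Finalize {
    // Copy every full chunk, then the used prefix of the last chunk,
    // into one exact-size result array. A single copy per element.
    static int[] toArray(List<int[]> full, int[] last, int lastCount) {
        int total = lastCount;
        for (int[] chunk : full) total += chunk.length;
        int[] result = new int[total];
        int offset = 0;
        for (int[] chunk : full) {
            System.arraycopy(chunk, 0, result, offset, chunk.length);
            offset += chunk.length;
        }
        System.arraycopy(last, 0, result, offset, lastCount);
        return result;
    }

    public static void main(String[] args) {
        List<int[]> full = new ArrayList<>();
        full.add(new int[] {1, 2});
        full.add(new int[] {3, 4, 5, 6});
        int[] last = new int[] {7, 8, 0, 0}; // only first 2 slots used
        int[] result = toArray(full, last, 2);
        System.out.println(result.length + " " + result[0] + " " + result[7]);
        // prints "8 1 8"
    }
}
```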
I opted to use this algorithm only for enumerables of length > 32, since for smaller sizes it would increase fragmentation, and the overhead of the list allocation would likely be noticeable relative to, say, 20 elements. ~32 is also the threshold at around which
https://gist.github.com/jamesqo/399b72bf5de8e2cbd83d044836cbefa4 (includes results/source code)
You can see that the new implementation consistently triggers about two-thirds as many gen0 collections as the old one.
The timings are somewhat inconsistent (disclaimer: I only ran 100,000 iterations since the tests were taking far too long), but the new implementation is generally as fast as or faster than the old one. I suspect the differences are hard to measure because of all the interface invocations involved, plus the covariant array type checks when T is a reference type.
I want to think about this some more. It's not clear to me yet that it's entirely a win, even if a microbenchmark for throughput and Gen0 GCs shows an improvement. It may be; I'm slightly concerned, though, that this will end up with more objects held onto for the duration of the operation (plus the few additional small objects that get allocated), etc. As you say, it's also a lot more code.
What is the improvement for this case, or other common Linq use cases?
Linq has ToArray optimized in the layer above via
If the number of customers is greater than 32, for example 50, then we exchange the
I don't love the complexity this adds, but I can see it being worth it for larger enumerables. Thanks for working on it.