Improve Enumerable.ToArray to avoid excessive copying for lazy enumerables #11208

Merged: 4 commits merged into dotnet:master from jamesqo:better-to-array on Sep 28, 2016

Conversation

jamesqo (Contributor) commented Aug 28, 2016

The idea is that instead of resizing the array (i.e., throwing it out, allocating a new one, and re-copying the data) every time the sequence doesn't fit, we can keep the old array around, allocate a new one of twice the size, and just continue reading into the new array at index 0. This requires keeping a list of the arrays we've filled up so far; however, the overhead of that list is substantially less than the size of the data itself.

A visualization of what I'm talking about for a 300-length enumerable:

First array: [items 0-31]

List of arrays:
[0]: [items 32-63]
[1]: [items 64-127]
[2]: [items 128-255]

Current array: [items 256-299]

(I also put this in the comments to help future readers understand.)

Then, when we need the final array to return from the method, we just calculate first.Length + list.Sum(a => a.Length) + indexWeFinishedReadingIntoInCurrentArray, pre-allocate an array of that size, and copy each of the arrays into it.

I opted to only use this algorithm for enumerables of length > 32, since for smaller sizes it would increase fragmentation, and the overhead of the list allocation would likely be noticeable next to, say, 20 elements. ~32 is also roughly the threshold at which Array.Copy starts getting faster than a for-loop.
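For illustration, here is a minimal sketch of the scheme; the helper name and structure are mine rather than the PR's exact code, and the starting capacity of 32 matches the threshold above:

```csharp
using System;
using System.Collections.Generic;

internal static class ToArraySketch
{
    // Minimal sketch of the buffering scheme described above; illustrative only.
    internal static T[] ToArray<T>(IEnumerable<T> source)
    {
        const int StartingCapacity = 32;

        using (IEnumerator<T> e = source.GetEnumerator())
        {
            if (!e.MoveNext())
                return Array.Empty<T>();

            // Fill the first buffer; sequences of <= 32 items never go further.
            var first = new T[StartingCapacity];
            int count = 0;
            do
            {
                first[count++] = e.Current;
            }
            while (count < first.Length && e.MoveNext());

            if (count < first.Length)
            {
                var small = new T[count];
                Array.Copy(first, small, count);
                return small;
            }

            if (!e.MoveNext())
                return first; // exactly 32 items: the first buffer is the answer

            // Keep filled buffers around instead of copying them on every resize.
            var buffers = new List<T[]>();
            var current = new T[first.Length]; // grows 32, 64, 128, ...
            int index = 0;
            int total = first.Length;
            do
            {
                if (index == current.Length)
                {
                    buffers.Add(current);
                    current = new T[current.Length * 2];
                    index = 0;
                }
                current[index++] = e.Current;
                total++;
            }
            while (e.MoveNext());

            // Single final allocation: stitch every buffer into the result.
            var result = new T[total];
            Array.Copy(first, result, first.Length);
            int offset = first.Length;
            foreach (T[] buffer in buffers)
            {
                Array.Copy(buffer, 0, result, offset, buffer.Length);
                offset += buffer.Length;
            }
            Array.Copy(current, 0, result, offset, index);
            return result;
        }
    }
}
```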

Benchmark results

https://gist.github.com/jamesqo/399b72bf5de8e2cbd83d044836cbefa4 (includes results/source code)

You can see that the new implementation consistently incurs about 2/3 as many gen0 collections as the old one.

The timings are somewhat inconsistent (disclaimer: I only ran 100,000 iterations, since the tests were taking far too long), but you can see that the new implementation is generally as fast as or faster than the old one. (I suspect the differences may be hard to measure because of all the interface invocations going on, in addition to the covariant array store checks when T is a reference type.)
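(For readers unfamiliar with that last cost: .NET arrays of reference types are covariant, so every store into a T[] is type-checked at run time. A standalone illustration:)

```csharp
using System;

class CovariantStoreCheck
{
    static void Main()
    {
        object[] array = new string[4]; // legal: string[] converts to object[]
        array[0] = "fine";              // run-time store check passes

        try
        {
            array[1] = new object();    // store check fails: not a string
        }
        catch (ArrayTypeMismatchException)
        {
            // Every write to a reference-type array pays for this possibility.
            Console.WriteLine("covariant store check threw");
        }
    }
}
```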

cc @stephentoub @VSadov @JonHanna

jamesqo (Contributor) commented Sep 4, 2016

@stephentoub can you PTAL?

I admit that the added logic is a bit complex; however, this is a frequently-used path in cases like customers.Where(c => c.Id == 3).ToArray(), and the memory improvements are substantial.

stephentoub (Member) commented Sep 5, 2016

I want to think about this some more. It's not yet clear to me that it's entirely a win, even if a microbenchmark shows improvements in throughput and gen0 GCs. It may be; I'm slightly concerned, though, that this will end up with more objects held onto for the duration of the operation (plus the few additional small objects that get allocated). As you say, it's also a lot more code.

@VSadov, @justinvp, @jkotas, thoughts?

(diff context under review; the comment is truncated in the original:)

```csharp
}
// Generic methods are compiled per-instantiation for value types.
// Do not force the jit to compile a bunch of code it will never
```
jkotas (Member), Sep 5, 2016

We want the implementation to work well for both JITed and AOT environments. This trick does not work for AOT - you will pay for the extra code there.

jamesqo (Contributor), Sep 5, 2016

This trick does not work for AOT - you will pay for the extra code there.

How much is the cost? Since the method is compiled in advance, there is no first-time overhead invoking the method like when it is jitted.

jkotas (Member), Sep 5, 2016

The cost is in binary size and compile time.

jkotas (Member) commented Sep 5, 2016

however this is a frequently-used path in cases like customers.Where(c => c.Id == 3).ToArray()

What is the improvement for this case, or other common Linq use cases?

Linq has ToArray optimized in the layer above via the IIListProvider interface. I am wondering how frequently this optimization would actually kick in; and if there are common cases where it does not, whether it would be better to optimize those using IIListProvider instead.

jamesqo (Contributor) commented Sep 5, 2016

What is the improvement for this case, or other common Linq use cases?

If the number of customers is greater than 32, for example 50, then we exchange the Customer[64] allocation at the end for a Customer[32] allocation. And we keep the Customer[32] which holds the first 32 customers around, instead of copying all of the data into a new array and then throwing it away.

Linq has ToArray optimized in the layer above via the IIListProvider interface. I am wondering how frequently this optimization would actually kick in

IIListProvider (to my understanding) represents an iterator whose count we know in advance, letting us make exactly one allocation for ToArray / ToList. For example, if we call customers.Select(c => Foo(c)) and we know that customers.Count == 38, we know the resulting enumerable will have exactly 38 elements and can allocate an array of exactly that size. I don't think this can be done with Where iterators.
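(For reference, a from-memory sketch of that internal LINQ interface; the exact shape lives in the System.Linq sources and may differ in detail:)

```csharp
using System.Collections.Generic;

// Iterators that know their element count up front (e.g. Select over an
// ICollection<T>) implement this so ToArray/ToList/Count can allocate once.
internal interface IIListProvider<TElement> : IEnumerable<TElement>
{
    TElement[] ToArray();           // single exact-size allocation
    List<TElement> ToList();
    int GetCount(bool onlyIfCheap); // -1 if counting would require enumerating
}
```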

(diff context under review:)

```csharp
// capacity. It only starts allocating arrays when we add the first item.
var buffers = new List<T[]>(); // list of previous buffers
var current = new T[first.Length]; // the current buffer we're reading the sequence into
```

VSadov (Member), Sep 15, 2016

curious - why not new T[first.Length * 2] ?

jamesqo (Contributor), Sep 18, 2016

I didn't want to be overly optimistic with the allocations: using first.Length * 2 (which would be 64 as this PR is currently implemented) would lead us to allocate more than 2N elements for an N-length enumerable in some cases. For example, N = 100 would lead us to allocate space for 224 elements.
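(Worked out for N = 100, counting element slots; the growth sequences follow from the sizes discussed above:)

```csharp
// If the second buffer started at first.Length * 2:
//   first   = new T[32];  // items 0..31
//   current = new T[64];  // items 32..95
//   next    = new T[128]; // items 96..99 only, mostly unused
//   => 32 + 64 + 128 = 224 buffer slots, more than 2 * N = 200.
//
// As implemented (second buffer starts at first.Length):
//   T[32], T[32], T[64]   // => 128 buffer slots for the same 100 items
```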

stephentoub (Member), Sep 26, 2016

Makes sense. In a typical doubling algorithm, you'd have 32 elements and double the size of the array to 64, making room for another 32. Then double again to 128, making room for another 64. Etc. Since we're allocating an array that won't include the initial set of elements, we don't want to do that initial doubling. So we allocate an additional array of length 32 for 32 elements (64 total), then an additional array of length 64 for 64 elements (128 total), etc.

nbaraz, Oct 2, 2016

I have an idea that should sometimes reduce the number of allocations/copying:
Allocate first.Length * 2 up front, but place the elements in the new array starting at first.Length. Once you reach the end of the new array, or the end of the iterator:

  • If there are more items, put them at the start of the new array (You will have to reorder them when you de-fragment the array list)
  • If there are no more items, place the elements from first in the new array, from 0 to first.Length

jamesqo (Contributor), Oct 2, 2016

@nbaraz That seems like it would increase allocations for e.g. lengths 33-64, since before we would allocate a T[32] and a T[32], and now we would allocate a T[32] and a T[64]. What would be an example in which it decreases allocations/copying?

nbaraz, Oct 2, 2016

If there are between 32 and 64 elements, you can just return the new array.
Edit: I don't know if resizing down allocates; if it does, then it doesn't help at all.

jamesqo (Contributor), Oct 2, 2016

@nbaraz Resizing down does allocate, unfortunately.
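(That matches Array.Resize's contract: it always allocates a new array of the requested length and copies into it, even when shrinking. A quick check:)

```csharp
using System;

class ResizeAllocates
{
    static void Main()
    {
        int[] a = new int[64];
        int[] before = a;

        Array.Resize(ref a, 32); // "resizing down"

        // False: Resize allocated a fresh 32-element array and copied into it.
        Console.WriteLine(ReferenceEquals(before, a));
    }
}
```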

nbaraz, Oct 2, 2016

Never mind then :) I assumed it didn't, will check next time.

VSadov (Member) commented Sep 15, 2016

I like the change. It seems that for n <= 32 it would have roughly the same behavior as before, while for large enumerables the transient bytes allocated and copying churn would be reduced by ~50%.
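(A back-of-the-envelope check of that estimate, using the 300-element example from the top of the thread; the old scheme's starting capacity of 4 is my assumption for illustration:)

```csharp
// Old scheme: double + copy on every resize, then one exact-size final copy.
//   allocated: 4 + 8 + 16 + ... + 512 (grow) + 300 (final) ~ 1320 slots
//   copied:    4 + 8 + ... + 256 (resizes)   + 300 (final) ~  808 moves
//
// New scheme: keep filled buffers, copy each element once at the end.
//   allocated: 32 + 32 + 64 + 128 + 256 (grow) + 300 (final) ~ 812 slots
//   copied:    300 (final stitch only)
```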

VSadov (Member) commented Sep 15, 2016

LGTM

VSadov (Member) commented Sep 15, 2016

@jamesqo - could you try benchmarking scenarios with T being a reference type (or a struct containing a reference type)? Copying references is more expensive because of GC fences, so I wonder how this change affects the throughput.

jamesqo (Contributor) commented Sep 18, 2016

@VSadov Here are the results for some types which are/contain references. I additionally took a look in PerfView and memmoveGCRefs seems to be taking up less time than it was before.

stephentoub (Member):

I don't love the complexity this adds, but I can see it being worth it for larger enumerables. Thanks for working on it.

(Outdated review threads on src/Common/src/System/Collections/Generic/EnumerableHelpers.cs collapsed.)

jamesqo added some commits Sep 26, 2016

stephentoub (Member) commented Sep 28, 2016

Thanks for your efforts on this, @jamesqo.

stephentoub merged commit c9c676e into dotnet:master on Sep 28, 2016

10 checks passed: Innerloop Debug and Release builds and tests on CentOS7.1, Linux ARM Emulator (cross build), OSX, Ubuntu14.04, and Windows_NT.

jamesqo deleted the jamesqo:better-to-array branch on Sep 29, 2016

karelz modified the milestone to 1.2.0 on Dec 3, 2016
