proposal: optimize whole-array operations on array views #16133
Comments
@benharsh and I talked about this a fair amount back in the day when he was doing bulk transfer work and I was doing slice optimizations, but nothing specific came of it that I can remember (on my side at least). @ronawho may have been in some of those conversations as well and may have additional thoughts here.
I don't think I remember anything concrete from those conversations. Somewhat related -- I have been wondering off and on if we can avoid allocations for some array views (either doing escape analysis to allocate on the stack, or something like that). There have been a number of times where the allocation overhead kills performance in an inner loop and we've resorted to using a c_ptr or direct indexing, e.g.
Recently, this came up in a GPU context. Traditionally, GPUs have a limited amount of memory, so for operations on bigger data sets you'd have to copy slices in and out of GPU memory. Moreover, our current way of doing multi-GPU parallelism:

```chapel
var InputArr, OutputArr: [...] int;

coforall gpu in here.gpus do on gpu {
  var DevArr = InputArr[someSlice];
  ...
  OutputArr[someSlice] = DevArr;
}
```

relies on heavy use of slicing in a way that can benefit from the optimization being asked for here.
@e-kayrakli - I've created #24343 about the general need to do something to improve the performance of creating slices / rank-change operations. At the same time, my current favorite approach for this is described in #23890. I'd also like to note here that the idea in this issue could arguably be an application of the "compound operator" idea described in #17999 (comment).
Those all look good. I am wondering where you stand on your original proposal here -- namely, transforming `Arr1[slice1] = Arr2[slice2];` into `chpl__transferView(Arr1, slice1, Arr2, slice2);`. This looks like a relatively easy transformation that would result in almost no overhead, and @benharsh also noted somewhere that much of the infrastructure to implement it already exists. The other solutions you propose are much bigger overhauls that would still have some sort of overhead (likely much smaller if we can get rid of class instance allocations). I guess, left to my own devices, I would still pursue the transformation above sooner and decouple those proposals from this optimization.
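For concreteness, here is a minimal sketch of what a `chpl__transferView` routine along these lines might do; the body (a direct zippered loop over the two slices' index sets) is an assumption for illustration, not an existing implementation:

```chapel
// Sketch only: assign element-by-element between the two index sets,
// without ever allocating array-view descriptors for the slices.
proc chpl__transferView(ref dst, dstSlice, const ref src, srcSlice) {
  forall (i, j) in zip(dstSlice, srcSlice) do
    dst[i] = src[j];
}
```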
@e-kayrakli -- yes, that seems like a reasonable strategy if we feel we can make near-term progress that way (especially if you have in mind somebody who can look at implementing it soon). I haven't yet been able to figure out how the approach in #23890 can help to resolve the similar problems with Block array slices/rank changes. Certainly it could reduce some of the overheads, but IMO there is some risk that it won't address the problem well enough. So, it seems that something like the transformation above is worth pursuing.
I would like to see it transformed into a foreach. That does not add any parallelization overhead and hopefully vectorizes things. If I had Engin's code like `a1[slice1] = a2[slice2]` and wanted parallelism (accepting its overhead), I could have written the (just as clear and concise) code `[(i, j) in zip(slice1, slice2)] a1[i] = a2[j];`. I would certainly like to see `a1[lo1..#N] = a2[lo2..#N]`, where N is a param, vectorized -- especially if N is even.
@damianmoz - it's also possible for the implementation to compute the size of the data transfer and make a decision about whether or not to be parallel based on that. We already use a memory-size threshold for non-slice bulk transfers. For small transfers, I agree with you that we'll want to avoid task overhead. For large transfers, I would expect parallelism to be worthwhile. And, at a quick glance, I think we currently avoid creating new tasks in the case of nested parallelism (for non-slice bulk transfers). I'd expect these same strategies to work OK for slice transfers.
As a more extreme example of Michael's point, if the arrays being assigned were distributed, we'd definitely want to use a forall rather than a foreach to avoid having one task drag everything over the network. So I agree that a heuristic is what is needed here to decide between forall and foreach, making this feel thematically similar to #24243 (throttling parallelism when it's overkill). |
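A sketch of the kind of heuristic being discussed, extending the hypothetical `chpl__transferView` above; the threshold name and value are placeholders rather than the actual bulk-transfer setting:

```chapel
// Placeholder threshold; the existing non-slice bulk-transfer heuristic is
// based on memory size, so an element count here is purely illustrative.
config const serialSliceAssignThreshold = 4096;

proc chpl__transferView(ref dst, dstSlice, const ref src, srcSlice) {
  if dstSlice.size <= serialSliceAssignThreshold {
    // small transfer: serial, vectorizable loop with no task-creation overhead
    foreach (i, j) in zip(dstSlice, srcSlice) do
      dst[i] = src[j];
  } else {
    // large (possibly distributed) transfer: parallel loop
    forall (i, j) in zip(dstSlice, srcSlice) do
      dst[i] = src[j];
  }
}
```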
I agree. I should have been more expansive in my comments. You and Michael have fleshed out more of the implementation details of what needs to be considered.
Consider operations like `A[x..y] = B[m..n]`. These are whole-array (or, more accurately, whole-array-view) operations. In the current implementation, they work by constructing an array-view object (which involves allocation) and then running the general `=` operation on that array-view object. While this approach is general, it does not necessarily perform as well as it could.
This issue proposes that the compiler instead translate these operations into something conceptually similar to `chpl__bulkTransferArray`. In addition, such a routine could take a tuple of ranges in order to avoid allocation entirely. For example, `A[x..y] = B[m..n]` could translate into `chpl__transferSliceView(A, (x..y,), B, (m..n,))`, and that call could be implemented in a more optimized way, along the lines of the sketch below. We would create a few different such functions for the specific scenarios we wish to optimize (including rank change, or assignment from a literal or iterator). The compiler would identify constructs like `A[x..y] = B[m..n]` late in function resolution and replace them with calls to the optimized function. It would be worth trying to understand how much performance improvement this could offer before implementing too far.
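As a hedged sketch of what the `chpl__transferSliceView` call and its optimized implementation might look like (the body and the exact rewrite are assumptions, not a final design):

```chapel
// Sketch only: build the slices' index sets directly from the range tuples,
// so no array-view class instances are allocated along the way.
proc chpl__transferSliceView(ref dst, dstRanges, const ref src, srcRanges) {
  forall (i, j) in zip({(...dstRanges)}, {(...srcRanges)}) do
    dst[i] = src[j];
}

// The compiler-introduced rewrite would then turn
//   A[x..y] = B[m..n];
// into something like
//   chpl__transferSliceView(A, (x..y,), B, (m..n,));
```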
See also the mailing list thread https://sourceforge.net/p/chapel/mailman/message/37068963/