
proposal: optimize whole-array operations on array views #16133

Open
mppf opened this issue Jul 25, 2020 · 10 comments · May be fixed by #24390

@mppf
Member

mppf commented Jul 25, 2020

Consider operations like these:

var A:[1..n] int;
A[4..10] = 1; // assignment of slice to constant
A[4..5] = A[7..8]; // assignment of slice to another slice
A[1..3] = [i in 1..3] i; // assignment of slice to a loop expression

These are whole-array (or, more accurately, whole-array-view) operations. In the current implementation, each of them constructs an array view object (which requires an allocation) and then runs a general = operation on that view.

While this approach is general, it does not necessarily perform as well as it could.

This issue proposes that the compiler instead translate these operations into something conceptually similar to chpl__bulkTransferArray. In addition, the translation could pass a tuple of ranges in order to avoid the allocation entirely.

For example, A[x..y] = B[m..n] could translate into chpl__transferSliceView(A, (x..y,), B, (m..n,)), and that call could be implemented in a more optimized way along the lines of:

  proc chpl__transferSliceView(A, aSliceDims, B, bSliceDims) {
    forall (i, j) in zip(A.domain.iterateSlice(aSliceDims),
                         B.domain.iterateSlice(bSliceDims)) {
      A[i] = B[j];
    }
  }
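
For reference, here is a self-contained sketch of the same idea that uses domain slicing in place of the hypothetical iterateSlice method; the name transferSlice and the 1-D example are illustrative only:

  proc transferSlice(ref A: [] ?t, aSliceDims, const ref B: [] t, bSliceDims) {
    // slice each array's domain by the given tuple of ranges and copy element-wise
    forall (i, j) in zip(A.domain[(...aSliceDims)], B.domain[(...bSliceDims)]) {
      A[i] = B[j];
    }
  }

  var A, B: [1..10] int;
  B = 1;
  transferSlice(A, (4..5,), B, (7..8,)); // behaves like A[4..5] = B[7..8]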

We would create a few different such functions for the specific scenarios that we wish to optimize (including rank changes and assignment from a literal or an iterator). The compiler would identify constructs like A[x..y] = B[m..n] late in function resolution and replace them with calls to the appropriate optimized function.

It would be worth trying to understand how much performance improvement this could offer before investing too much in an implementation.

See also the mailing list thread https://sourceforge.net/p/chapel/mailman/message/37068963/

@bradcray
Member

@benharsh and I talked about this a fair amount back in the day when he was doing bulk transfer work and I was doing slice optimizations, but nothing specific came of it that I can remember (on my side at least). @ronawho may have been in some of those conversations as well and may have additional thoughts here.

@ronawho
Contributor

ronawho commented Aug 3, 2020

I don't think I remember anything concrete from those conversations.

Somewhat related -- I have been wondering off and on if we can avoid allocations for some array views (either doing escape analysis to allocate on the stack, or something like that). There have been a number of times where the allocation overhead kills performance in an inner loop and we've resorted to using a c_ptr or direct indexing, e.g.
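
A made-up illustration of that kind of workaround (the arrays and bounds here are invented for the sketch, not the case referenced above):

var A: [1..1000] real;
const lo = 10, n = 100;

// slice version: constructs an array-view object each time it runs
// A[lo..#n] = 0.0;

// direct-indexing workaround, avoiding the view allocation:
foreach i in lo..#n do
  A[i] = 0.0;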

@e-kayrakli
Contributor

Recently, this came up in a GPU context. Traditionally, GPUs have a limited amount of memory, so for operations on bigger data sets you have to copy slices in and out of GPU memory. Moreover, our current way of doing multi-GPU parallelism via:

var InputArr, OutputArr: [...] int;

coforall gpu in here.gpus do on gpu {
  var DevArr = InputArr[someSlice];

  ...

  OutputArr[someSlice] = DevArr;
}

relies on heavy use of slicing in a way that can benefit from the optimization being asked for here.
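
As a fuller, purely illustrative sketch of that pattern (the per-GPU slice computation and sizes below are assumptions made up for the example):

config const n = 1_000_000;
var InputArr, OutputArr: [0..<n] int;

const numGpus = here.gpus.size;
coforall (gpu, gpuId) in zip(here.gpus, 0..<numGpus) do on gpu {
  const chunk = (n + numGpus - 1) / numGpus;
  const someSlice = gpuId*chunk ..< min((gpuId+1)*chunk, n);

  var DevArr = InputArr[someSlice];   // slice copied into GPU memory
  foreach i in DevArr.domain do       // runs as a GPU kernel
    DevArr[i] += 1;

  OutputArr[someSlice] = DevArr;      // slice copied back out
}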

@mppf
Member Author

mppf commented Feb 6, 2024

@e-kayrakli - I've created #24343 about the general need to do something to improve the performance of creating slices / rank change operations. At the same time, my current favorite approach for this is described in #23890.

I'd also like to note here that the idea in this issue could arguably be an application of the "compound operator" idea described in #17999 (comment) .

@e-kayrakli
Contributor

Those all look good.

I am wondering where you stand on your original proposal here. Namely, transforming

Arr1[slice1] = Arr2[slice2];

into

chpl__transferView(Arr1, slice1, Arr2, slice2);

This looks like a relatively easy transformation that would result in almost no overhead. @benharsh also noted somewhere that the infrastructure to implement chpl__transferView is already almost there in our bulk transfer code.

The other solutions you propose are much bigger overhauls that would still have some sort of overhead (likely much smaller if we can get rid of class instance allocations). I guess left to my own devices, I would still pursue the transformation above sooner and decouple those proposals from this optimization.

@mppf
Member Author

mppf commented Feb 8, 2024

@e-kayrakli -- yes, that seems like a reasonable strategy if we feel we can make near-term progress that way. (Especially if you have in mind somebody who can look at implementing it soon).

I haven't yet been able to figure out how the approach in #23890 can help to resolve the similar problems with Block array slices/rank changes. Certainly it could reduce some of the overheads, but IMO there is some risk there that it won't address the problem well enough. So, it seems that something like chpl__transferSliceView will probably be needed for Block arrays at least.

@damianmoz

I would like to see it transformed into a foreach. That does not add any parallelization overhead and hopefully vectorizes things. If I had Engin's code like

a1[slice1] = a2[slice2]

and wanted parallel execution (and its overhead), I could have written the (just as clear and concise) code

[ (i, j) in zip(slice1, slice2) ] a1[i] = a2[j];

I would certainly like to see

a1[lo1..#N] = a2[lo2..#N]

(where N is a param) vectorized, especially if N is even.
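
For instance, a foreach form of such a copy (the names and sizes here are illustrative) avoids task creation entirely and leaves vectorization to the backend:

param N = 8;
var a1, a2: [1..32] real;
const lo1 = 3, lo2 = 11;

// serial, vectorizable copy of N elements from a2 into a1
foreach (i, j) in zip(lo1..#N, lo2..#N) do
  a1[i] = a2[j];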

@mppf
Member Author

mppf commented Feb 9, 2024

@damianmoz - it's also possible for the implementation to compute the size of the data transfer and decide whether or not to be parallel based on that. We already use a memory-size threshold for non-slice bulk transfers. For small transfers, I agree with you that we'll want to avoid task overhead. For large transfers, I would expect parallelism to be worthwhile. And, at a quick glance, I think we currently avoid creating new tasks in the case of nested parallelism (for non-slice bulk transfers). I'd expect these same strategies to work OK for slice transfers.
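
A minimal sketch of such a size-based heuristic (the threshold value, helper name, and domain arguments are illustrative assumptions, not the existing bulk-transfer machinery):

config const parallelCopyThreshold = 2**20; // bytes; illustrative value

proc copySlice(ref A: [] ?t, aDom, const ref B: [] t, bDom) {
  const bigEnough = aDom.size * numBytes(t) >= parallelCopyThreshold;
  if bigEnough {
    // large transfer: parallelism is worth the task-creation overhead
    forall (i, j) in zip(aDom, bDom) do A[i] = B[j];
  } else {
    // small transfer: stay serial and let the loop vectorize
    foreach (i, j) in zip(aDom, bDom) do A[i] = B[j];
  }
}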

@bradcray
Member

bradcray commented Feb 9, 2024

As a more extreme example of Michael's point, if the arrays being assigned were distributed, we'd definitely want to use a forall rather than a foreach to avoid having one task drag everything over the network. So I agree that a heuristic is what is needed here to decide between forall and foreach, making this feel thematically similar to #24243 (throttling parallelism when it's overkill).

@damianmoz

I agree. I should have been more expansive in my comments. You and Michael have fleshed out more of the implementation details of what needs to be considered.

e-kayrakli linked a pull request (#24390) on Feb 10, 2024 that will close this issue.