
proposal: optimize whole-array operations on array views #16133

Open
mppf opened this issue Jul 25, 2020 · 10 comments · May be fixed by #24390

@mppf
Member

mppf commented Jul 25, 2020

Consider operations like these:

var A:[1..n] int;
A[4..10] = 1; // assignment of slice to constant
A[4..5] = A[7..8]; // assignment of slice to another slice
A[1..3] = [i in 1..3] i; // assignment of slice to a loop expression

These are whole-array (or, more accurately, whole-array-view) operations. In the current implementation, each of them constructs an array view object (which requires an allocation) and then runs a general = operation on that view.

While this approach is general, it does not necessarily perform as well as it could.

This issue proposes that the compiler instead translate these operations into something conceptually similar to chpl__bulkTransferArray. In addition, the translation could pass a tuple of ranges in order to avoid the allocation entirely.

For example, A[x..y] = B[m..n] could translate into chpl__transferSliceView(A, (x..y,), B, (m..n,)), and that call could be implemented in a more optimized way along the lines of:

  proc chpl__transferSliceView(A, aSliceDims, B, bSliceDims) {
    forall (i, j) in zip(A.domain.iterateSlice(aSliceDims),
                         B.domain.iterateSlice(bSliceDims)) {
      A[i] = B[j];
    }
  }
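
For reference, here is a self-contained sketch of the same idea that uses domain slicing in place of the hypothetical iterateSlice method; the name transferSlice and the 1-D example are illustrative only:

  proc transferSlice(ref A: [] ?t, aSliceDims, const ref B: [] t, bSliceDims) {
    // slice each array's domain by the given tuple of ranges and copy element-wise
    forall (i, j) in zip(A.domain[(...aSliceDims)], B.domain[(...bSliceDims)]) {
      A[i] = B[j];
    }
  }

  var A, B: [1..10] int;
  B = 1;
  transferSlice(A, (4..5,), B, (7..8,)); // behaves like A[4..5] = B[7..8]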

We would create a few different such functions for the specific scenarios that we wish to optimize (including rank changes and assignment from a literal or an iterator). The compiler would identify constructs like A[x..y] = B[m..n] late in function resolution and replace them with calls to the appropriate optimized function.

It would be worth trying to understand how much performance improvement this could offer before investing too much in an implementation.

See also the mailing list thread https://sourceforge.net/p/chapel/mailman/message/37068963/

@bradcray
Member

@benharsh and I talked about this a fair amount back in the day when he was doing bulk transfer work and I was doing slice optimizations, but nothing specific came of it that I can remember (on my side at least). @ronawho may have been in some of those conversations as well and may have additional thoughts here.

@ronawho
Contributor

ronawho commented Aug 3, 2020

I don't think I remember anything concrete from those conversations.

Somewhat related -- I have been wondering off and on if we can avoid allocations for some array views (either doing escape analysis to allocate on the stack, or something like that). There have been a number of times where the allocation overhead kills performance in an inner loop and we've resorted to using a c_ptr or direct indexing, e.g.
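
A made-up illustration of that kind of workaround (the arrays and bounds here are invented for the sketch, not the case referenced above):

var A: [1..1000] real;
const lo = 10, n = 100;

// slice version: constructs an array-view object each time it runs
// A[lo..#n] = 0.0;

// direct-indexing workaround, avoiding the view allocation:
foreach i in lo..#n do
  A[i] = 0.0;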

@e-kayrakli
Contributor

Recently, this came up in a GPU context. Traditionally, GPUs have a limited amount of memory, so for operations on bigger data sets you have to copy slices in and out of GPU memory. Moreover, our current way of doing multi-GPU parallelism via:

var InputArr, OutputArr: [...] int;

coforall gpu in here.gpus do on gpu {
  var DevArr = InputArr[someSlice];

  ...

  OutputArr[someSlice] = DevArr;
}

relies on heavy use of slicing in a way that can benefit from the optimization being asked for here.
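
As a fuller, purely illustrative sketch of that pattern (the per-GPU slice computation and sizes below are assumptions made up for the example):

config const n = 1_000_000;
var InputArr, OutputArr: [0..<n] int;

const numGpus = here.gpus.size;
coforall (gpu, gpuId) in zip(here.gpus, 0..<numGpus) do on gpu {
  const chunk = (n + numGpus - 1) / numGpus;
  const someSlice = gpuId*chunk ..< min((gpuId+1)*chunk, n);

  var DevArr = InputArr[someSlice];   // slice copied into GPU memory
  foreach i in DevArr.domain do       // runs as a GPU kernel
    DevArr[i] += 1;

  OutputArr[someSlice] = DevArr;      // slice copied back out
}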

@mppf
Member Author

mppf commented Feb 6, 2024

@e-kayrakli - I've created #24343 about the general need to do something to improve the performance of creating slices / rank change operations. At the same time, my current favorite approach for this is described in #23890.

I'd also like to note here that the idea in this issue could arguably be an application of the "compound operator" idea described in #17999 (comment) .

@e-kayrakli
Contributor

Those all look good.

I am wondering where you stand on your original proposal here. Namely, transforming

Arr1[slice1] = Arr2[slice2];

into

chpl__transferView(Arr1, slice1, Arr2, slice2);

This looks like a relatively easy transformation that would result in almost no overhead. @benharsh also noted somewhere that the infrastructure to implement chpl__transferView is already almost there in our bulk transfer code.

The other solutions you propose are much bigger overhauls that would still have some sort of overhead (likely much smaller if we can get rid of class instance allocations). I guess left to my own devices, I would still pursue the transformation above sooner and decouple those proposals from this optimization.

@mppf
Member Author

mppf commented Feb 8, 2024

@e-kayrakli -- yes, that seems like a reasonable strategy if we feel we can make near-term progress that way. (Especially if you have in mind somebody who can look at implementing it soon).

I haven't yet been able to figure out how the approach in #23890 can help to resolve the similar problems with Block array slices/rank changes. Certainly it could reduce some of the overheads, but IMO there is some risk there that it won't address the problem well enough. So, it seems that something like chpl__transferSliceView will probably be needed for Block arrays at least.

@damianmoz

I would like to see it transformed into a foreach. That does not add any parallelization overhead and hopefully vectorizes things. If I had Engin's code like

a1[slice1] = a2[slice2]

and wanted parallel execution (and its overhead), I could have written the (just as clear and concise) code

[ (i, j) in zip(slice1, slice2) ] a1[i] = a2[j];

I would certainly like to see

a1[lo1..#N] = a2[lo2..#N]

(where N is a param) vectorized, especially if N is even.
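
For instance, a foreach form of such a copy (the names and sizes here are illustrative) avoids task creation entirely and leaves vectorization to the backend:

param N = 8;
var a1, a2: [1..32] real;
const lo1 = 3, lo2 = 11;

// serial, vectorizable copy of N elements from a2 into a1
foreach (i, j) in zip(lo1..#N, lo2..#N) do
  a1[i] = a2[j];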

@mppf
Member Author

mppf commented Feb 9, 2024

@damianmoz - it's also possible for the implementation to compute the size of the data transfer and decide whether or not to be parallel based on that. We already use a memory-size threshold for non-slice bulk transfers. For small transfers, I agree with you that we'll want to avoid task overhead. For large transfers, I would expect parallelism to be worthwhile. And, at a quick glance, I think we currently avoid creating new tasks in the case of nested parallelism (for non-slice bulk transfers). I'd expect these same strategies to work OK for slice transfers.
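
A minimal sketch of such a size-based heuristic (the threshold value, helper name, and domain arguments are illustrative assumptions, not the existing bulk-transfer machinery):

config const parallelCopyThreshold = 2**20; // bytes; illustrative value

proc copySlice(ref A: [] ?t, aDom, const ref B: [] t, bDom) {
  const bigEnough = aDom.size * numBytes(t) >= parallelCopyThreshold;
  if bigEnough {
    // large transfer: parallelism is worth the task-creation overhead
    forall (i, j) in zip(aDom, bDom) do A[i] = B[j];
  } else {
    // small transfer: stay serial and let the loop vectorize
    foreach (i, j) in zip(aDom, bDom) do A[i] = B[j];
  }
}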

@bradcray
Member

bradcray commented Feb 9, 2024

As a more extreme example of Michael's point, if the arrays being assigned were distributed, we'd definitely want to use a forall rather than a foreach to avoid having one task drag everything over the network. So I agree that a heuristic is what is needed here to decide between forall and foreach, making this feel thematically similar to #24243 (throttling parallelism when it's overkill).

@damianmoz

I agree. I should have been more expansive in my comments. You and Michael have fleshed out more of the implementation details of what needs to be considered.

e-kayrakli linked a pull request (#24390) on Feb 10, 2024 that will close this issue.