spike: how to address performance issues in #11333 #11426

Closed
vasslitvinov opened this issue Oct 22, 2018 · 7 comments

Comments
@vasslitvinov
Member

vasslitvinov commented Oct 22, 2018

#11333 identifies a couple of performance issues with forall intents, including:

  • A + reduce intent over arrays slows down a forall loop.

This task is to identify changes to improve performance.

@bradcray
Member

@vasslitvinov: The theory with the reductions was that the identity routine in the UDR interface creates a local variable of the element type and returns it, assigning the result to the original variable. So if the eltType was [1..3] real (say), that results in an array initialization plus an assignment, such that a deep copy and some domain ownership management code, as noted in this comment, would fire. That led me to wonder whether having the identity routine take its element as a ref argument and do the zeroing in place would be preferable / have a significant performance impact.
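For illustration, here is a minimal sketch of the two shapes (the names identity and identityInPlace are just illustrative, not the actual interface):

// Current shape: create and return a fresh value of the element type.
// When eltType is an array type like [1..3] real, this costs an array
// initialization plus an assignment into the accumulator.
proc identity(type eltType) {
  var id: eltType;   // default-initialized (zeroed for numeric types)
  return id;
}

// Hypothetical alternative: zero the existing accumulator in place via
// a ref argument, avoiding the extra allocation and deep copy.
proc identityInPlace(ref acc) {
  acc = 0;   // whole-array assignment zeroes each element in place
}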

That said, on reflection, this should only result in O(numTasks) overhead, I think, so maybe there's something else going on here. For instance, compare the delay in the following code between printing "Before forall" and "In forall" with the reduce intent. Then rewrite it to use an in intent and see how much faster it goes.

var n_bodies = 10000;
const pDomain = {0..#n_bodies};

type vec3 = [0..#3] real;       // each vec3 is itself a small array

var forces: [pDomain] vec3;     // an array of arrays (AoA)
var velocities: [pDomain] vec3;
var positions: [pDomain] vec3;
var masses: [pDomain] real;

writeln("Before forall");
forall q in pDomain with (+ reduce forces) {
  if (q == 0) then
    writeln("In forall");
}
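For comparison, the in-intent rewrite mentioned above would change only the loop's intent clause (same declarations as above):

forall q in pDomain with (in forces) {   // per-task copy, no reduction
  if (q == 0) then
    writeln("In forall");
}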

@vasslitvinov
Member Author

@bradcray - how do you feel about first investigating the above code with vec3 declared as a tuple instead of a DR array?
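That is, with a homogeneous tuple carrying the three components:

type vec3 = 3*real;   // tuple of three reals instead of [0..#3] real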

@bradcray
Member

There's not a problem with that case, is there? What would we hope to learn from the investigation?

@vasslitvinov
Member Author

If that case is not problematic, then indeed there is no reason to look at it.

@vasslitvinov vasslitvinov changed the title Follow up on performance issues in #11333 spike: how to address performance issues in #11333 Oct 30, 2018
@benjamin-robbins benjamin-robbins added this to the PB Sprint 17 milestone Oct 30, 2018
vasslitvinov added a commit that referenced this issue Nov 2, 2018
Avoid creating an array in chpl__sumType()

chpl__sumType() computes the type of `x+x` given the type of `x`.
To do so, it declares a variable `x`, computes `x+x`, and takes the type
of the result. All of this happens strictly at compile time, which is
what we want, as long as the type is not a "runtime type", i.e. an array
or a domain.

When `x` is an array, however, this is time-consuming. chpl__sumType()
is smart enough to run `x+x` only on a variable of the array's element
type, recursing if that element type is itself an array, but when the
element type is an array, allocating even that variable is expensive,
especially when:
* the array is large,
* it is an array of arrays, and/or
* it is a distributed array.

This is one of the causes of the slowdown discussed in #11333 / #11426.

In most practical cases, the type of `x+x` is the same as the type of `x`.
When that holds, we can skip allocating the array altogether.

This PR adds a check for "is the type of `x+x` the same as the type of `x`?".
If so, chpl__sumType() skips declaring the variable `x` and returns
its type immediately.
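A minimal sketch of the resulting logic (hypothetical code; sumTypeIsPreserved is an invented name for the check, not the actual predicate):

proc chpl__sumType(type eltType) type {
  if sumTypeIsPreserved(eltType) then
    return eltType;        // fast path: no variable of a runtime type
  var x: eltType;          // fallback: the original approach
  return (x + x).type;
}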

This check is conservative, i.e. it has false negatives: it may answer "no"
in some cases where chpl__sumType() could have returned immediately, in
which case chpl__sumType() does the unnecessary work of allocating `x`.
This is largely just a performance concern, because chpl__sumType()
still produces a correct result, and in practice the situation should be rare.

While there, this change also emits a user-friendly error when + reducing
something that does not have a valid + implementation.

r: @mppf
@vasslitvinov
Member Author

vasslitvinov commented Nov 2, 2018

Here are my findings.

  • Run-time computations in chpl__sumType() appear responsible for most of the overhead. I removed those in "Avoid creating an array in chpl__sumType()" (#11556).

  • Each task creates two AS (accumulation state) variables. In this case, an AS has the same type as forces, i.e. an array of arrays. Two AS-es are also created at the top level.

    • Only one AS per task is needed; in this case, the top-level AS-es are not needed at all.

    • Implementing the new reduction interface proposal ("simple UDRI with examples", #9879) will eliminate the unnecessary AS-es.

  • Read/write operations on AOAs (arrays of arrays), mostly involving forces-like AS-es, take relatively little time.

  • Even AOA creation and deletion take relatively little time when done by the top-level task.

  • The majority of the overhead in the forall loop with (+ reduce forces), measured after #11556, comes from chpl__autoDestroy() calls on AOAs that are executed by multiple tasks concurrently. Creating an AOA concurrently with autoDestroys is also very slow.

This leads me to believe that the underlying cause is lock contention, which arises because all inner arrays are tied to the same domain: the domain of the vec3 array type. Each creation and destruction of an inner array involves locking that domain.
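To illustrate the sharing (a sketch, assuming the current implementation strategy where one domain object backs the array type):

type vec3 = [0..#3] real;   // one shared domain object behind every vec3
var a, b: vec3;             // a and b both register with that domain
// Creating or destroying any vec3 instance locks the shared domain,
// so many tasks doing so concurrently contend on a single lock.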

  • Ideally, an array over an unchangeable DR domain, e.g. of the type vec3, would be implemented without referencing a domain object, instead recording its dimensions inline. This is sufficient whenever there are no accesses to that domain object through the array, e.g. .domain queries, or whenever we can implement the corresponding Chapel code differently, e.g. by bypassing the array or the domain. This "inline the dimensions" implementation could also help with distributed computations, e.g. "Reduce intent on array doesn't appear to localize the array" (#6381) and "Remote by-value copies of arrays use remote domain causing communication bottleneck" (#10888).

  • We could cater to AOA creation/deletion explicitly, e.g. by locking the inner arrays' domain once per operation, or once per task participating in the global operation.
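Alternatively, with the tuple declaration suggested earlier, the inner dimension carries no domain object at all, so there is nothing to lock (a sketch; cf. the tup3 timings in the performance test below):

type vec3 = 3*real;             // no backing domain, hence no locking
var forces: [0..#10000] vec3;   // array of tuples instead of an AoA
forall q in forces.domain with (+ reduce forces) {
  // per-task accumulation state is now an array of tuples, which is
  // cheap to create and destroy concurrently
}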

@bradcray
Member

bradcray commented Nov 5, 2018

The reduction of loop startup time is dramatic, thanks!

W.r.t. domain/array overheads, note the relationship to issue #10911.

@vasslitvinov, do you attribute the remaining difference between this case and the in intents to allocating 2x the number of AoAs due to the duplicate array state? Here are some very rough timings:

reduce intents:

real	1m12.161s
user	3m57.464s
sys	0m0.568s

in intents:

real	0m27.974s
user	1m18.181s
sys	0m0.165s

@vasslitvinov
Member Author

In addition, there are two AoAs for the global accumulation state. They are created before the loop and deleted afterwards.

Re #10911, I wonder whether a const domain still needs to account for the arrays associated with it. At the very least, it needs to know how many there are, so that when an array is deallocated we know whether or not to deallocate the domain as well.
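A hedged sketch of that accounting (invented names, not the actual runtime code):

// Hypothetical domain-side bookkeeping of associated arrays:
class DomainState {
  var numArrays: atomic int;   // how many arrays currently use this domain

  proc arrayCreated()   { numArrays.add(1); }
  proc arrayDestroyed() {
    // Once the last associated array goes away, the domain itself may
    // become deallocatable (if no other references remain).
    if numArrays.fetchSub(1) == 1 then tryDeallocate(this);  // hypothetical
  }
}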

vasslitvinov added a commit to vasslitvinov/chapel that referenced this issue Nov 6, 2018
vasslitvinov added a commit that referenced this issue Nov 6, 2018
Add a performance test for reduce intents with arrays-of-arrays.

Related to the performance issues #11333 / #11426.

Since the test does not check anything correctness-worthy,
I skipif it unless testing performance.

FYI, the numbers I got on my Linux desktop without/with
the fix in #11556 are:

vec3 in-intent - startup: 0.011575 / 0.011829
vec3 in-intent - completion: 7.54034 / 7.61882
vec3 reduce intent - startup: 16.8032 / 0.634863
vec3 reduce intent - completion: 38.3702 / 19.1754

Also testing arrays of tuples (not affected by #11556):

tup3 in-intent - startup: 0.000096
tup3 in-intent - completion: 0.000286
tup3 reduce intent - startup: 0.000281
tup3 reduce intent - completion: 0.000371