spike: how to address performance issues in #11333 #11426

Closed
vasslitvinov opened this issue Oct 22, 2018 · 7 comments

Comments
@vasslitvinov
Member

vasslitvinov commented Oct 22, 2018

#11333 identifies a couple of performance issues with forall intents, including:

  • A + reduce intent over arrays slows down a forall loop.

This task is to identify changes to improve performance.

@bradcray
Member

@vasslitvinov: The theory with the reductions was that the identity routine in the UDR interface creates a local variable of the element type and returns it, assigning the result to the original variable. So if the eltType was [1..3] real (say), that results in an array initialization plus an assignment, such that a deep copy and some domain ownership management code, as noted in this comment, would fire. That led me to wonder whether having the identity routine take its element as a ref argument and do the zeroing in place would be preferable / have a significant performance impact.
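For illustration, here is a minimal sketch of the two shapes (the names identity and identityInPlace are just illustrative, not the actual interface):

// Current shape: create and return a fresh value of the element type.
// When eltType is an array type like [1..3] real, this costs an array
// initialization plus an assignment into the accumulator.
proc identity(type eltType) {
  var id: eltType;   // default-initialized (zeroed for numeric types)
  return id;
}

// Hypothetical alternative: zero the existing accumulator in place via
// a ref argument, avoiding the extra allocation and deep copy.
proc identityInPlace(ref acc) {
  acc = 0;   // whole-array assignment zeroes each element in place
}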

That said, on reflection, this should only result in O(numTasks) overhead, I think, so maybe there's something else going on here. For instance, compare the delay in the following code between printing "Before forall" and "In forall" with the reduce intent. Then rewrite it to use an in intent and see how much faster it goes.

var n_bodies = 10000;
const pDomain = {0..#n_bodies};

type vec3 = [0..#3] real;       // each vec3 is itself a small array

var forces: [pDomain] vec3;     // an array of arrays (AoA)
var velocities: [pDomain] vec3;
var positions: [pDomain] vec3;
var masses: [pDomain] real;

writeln("Before forall");
forall q in pDomain with (+ reduce forces) {
  if (q == 0) then
    writeln("In forall");
}
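For comparison, the in-intent rewrite mentioned above would change only the loop's intent clause (same declarations as above):

forall q in pDomain with (in forces) {   // per-task copy, no reduction
  if (q == 0) then
    writeln("In forall");
}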

@vasslitvinov
Member Author

@bradcray - how do you feel about first investigating the above code with vec3 declared as a tuple instead of a DR array?
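That is, with a homogeneous tuple carrying the three components:

type vec3 = 3*real;   // tuple of three reals instead of [0..#3] real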

@bradcray
Member

There's not a problem with that case, is there? What would we hope to learn from the investigation?

@vasslitvinov
Member Author

If that case is not problematic, then indeed there is no reason to look at it.

@vasslitvinov vasslitvinov changed the title Follow up on performance issues in #11333 spike: how to address performance issues in #11333 Oct 30, 2018
@benjamin-robbins benjamin-robbins added this to the PB Sprint 17 milestone Oct 30, 2018
vasslitvinov added a commit that referenced this issue Nov 2, 2018
Avoid creating an array in chpl__sumType()

chpl__sumType() computes the type of `x+x` given the type of `x`.
To do so, it declares a variable `x`, computes `x+x`, and takes the type
of the result. All of this happens strictly at compile time, which is
what we want, as long as the type is not a "runtime type", i.e. an array
or a domain.

When `x` is an array, however, this is time-consuming. chpl__sumType()
is smart enough to run `x+x` only on a variable of the array's element
type, recursing if that element type is itself an array, but when the
element type is an array, allocating even that variable is expensive,
especially when:
* the array is large,
* it is an array of arrays, and/or
* it is a distributed array.

This is one of the causes of the slowdown discussed in #11333 / #11426.

In most practical cases, the type of `x+x` is the same as the type of `x`.
When that holds, we can skip allocating the array altogether.

This PR adds a check for "is the type of `x+x` the same as the type of `x`?".
If so, chpl__sumType() skips declaring the variable `x` and returns
its type immediately.
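A minimal sketch of the resulting logic (hypothetical code; sumTypeIsPreserved is an invented name for the check, not the actual predicate):

proc chpl__sumType(type eltType) type {
  if sumTypeIsPreserved(eltType) then
    return eltType;        // fast path: no variable of a runtime type
  var x: eltType;          // fallback: the original approach
  return (x + x).type;
}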

This check is conservative, i.e. it has false negatives: it may answer "no"
in some cases where chpl__sumType() could have returned immediately, in
which case chpl__sumType() does the unnecessary work of allocating `x`.
This is largely just a performance concern, because chpl__sumType()
still produces a correct result, and in practice the situation should be rare.

While there, this change also emits a user-friendly error when + reducing
something that does not have a valid + implementation.

r: @mppf
@vasslitvinov
Member Author

vasslitvinov commented Nov 2, 2018

Here are my findings.

  • Run-time computations in chpl__sumType() appear responsible for most of the overhead. I removed those in "Avoid creating an array in chpl__sumType()" (#11556).

  • Each task creates two AS (accumulation state) variables. In this case, an AS has the same type as forces, i.e. an array of arrays. Two AS-es are also created at the top level.

    • Only one AS per task is needed; in this case, the top-level AS-es are not needed at all.

    • Implementing the new reduction interface proposal ("simple UDRI with examples", #9879) will eliminate the unnecessary AS-es.

  • Read/write operations on AOAs (arrays of arrays), mostly involving forces-like AS-es, take relatively little time.

  • Even AOA creation and deletion take relatively little time when done by the top-level task.

  • The majority of the overhead in the forall loop with (+ reduce forces), measured after #11556, comes from chpl__autoDestroy() calls on AOAs that are executed by multiple tasks concurrently. Creating an AOA concurrently with autoDestroys is also very slow.

This leads me to believe that the underlying cause is lock contention, which arises because all inner arrays are tied to the same domain: the domain of the vec3 array type. Each creation and destruction of an inner array involves locking that domain.
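To illustrate the sharing (a sketch, assuming the current implementation strategy where one domain object backs the array type):

type vec3 = [0..#3] real;   // one shared domain object behind every vec3
var a, b: vec3;             // a and b both register with that domain
// Creating or destroying any vec3 instance locks the shared domain,
// so many tasks doing so concurrently contend on a single lock.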

  • Ideally, an array over an unchangeable DR domain, e.g. of the type vec3, would be implemented without referencing a domain object, instead recording its dimensions inline. This is sufficient whenever there are no accesses to that domain object through the array, e.g. .domain queries, or whenever we can implement the corresponding Chapel code differently, e.g. by bypassing the array or the domain. This "inline the dimensions" implementation could also help with distributed computations, e.g. "Reduce intent on array doesn't appear to localize the array" (#6381) and "Remote by-value copies of arrays use remote domain causing communication bottleneck" (#10888).

  • We could cater to AOA creation/deletion explicitly, e.g. by locking the inner arrays' domain once per operation, or once per task participating in the global operation.
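Alternatively, with the tuple declaration suggested earlier, the inner dimension carries no domain object at all, so there is nothing to lock (a sketch; cf. the tup3 timings in the performance test below):

type vec3 = 3*real;             // no backing domain, hence no locking
var forces: [0..#10000] vec3;   // array of tuples instead of an AoA
forall q in forces.domain with (+ reduce forces) {
  // per-task accumulation state is now an array of tuples, which is
  // cheap to create and destroy concurrently
}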

@bradcray
Member

bradcray commented Nov 5, 2018

The reduction of loop startup time is dramatic, thanks!

W.r.t. domain/array overheads, note the relationship to issue #10911.

@vasslitvinov, do you attribute the remaining difference between this case and the in intents to allocating 2x the number of AoAs due to the duplicate array state? Here are some very rough timings:

reduce intents:

real	1m12.161s
user	3m57.464s
sys	0m0.568s

in intents:

real	0m27.974s
user	1m18.181s
sys	0m0.165s

@vasslitvinov
Member Author

In addition, there are two AoAs for the global accumulation state. They are created before the loop and deleted afterwards.

Re #10911, I wonder whether a const domain still needs to account for the arrays associated with it. At the very least, it needs to know how many there are, so that when an array is deallocated we know whether or not to deallocate the domain as well.
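A hedged sketch of that accounting (invented names, not the actual runtime code):

// Hypothetical domain-side bookkeeping of associated arrays:
class DomainState {
  var numArrays: atomic int;   // how many arrays currently use this domain

  proc arrayCreated()   { numArrays.add(1); }
  proc arrayDestroyed() {
    // Once the last associated array goes away, the domain itself may
    // become deallocatable (if no other references remain).
    if numArrays.fetchSub(1) == 1 then tryDeallocate(this);  // hypothetical
  }
}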

vasslitvinov added a commit to vasslitvinov/chapel that referenced this issue Nov 6, 2018
vasslitvinov added a commit that referenced this issue Nov 6, 2018
Add a performance test for reduce intents with arrays-of-arrays.

Related to the performance issues #11333 / #11426.

Since the test does not check anything correctness-worthy,
I skipif it unless testing performance.

FYI, the numbers I got on my Linux desktop without/with
the fix in #11556 are:

vec3 in-intent - startup: 0.011575 / 0.011829
vec3 in-intent - completion: 7.54034 / 7.61882
vec3 reduce intent - startup: 16.8032 / 0.634863
vec3 reduce intent - completion: 38.3702 / 19.1754

Also testing arrays of tuples (not affected by #11556):

tup3 in-intent - startup: 0.000096
tup3 in-intent - completion: 0.000286
tup3 reduce intent - startup: 0.000281
tup3 reduce intent - completion: 0.000371