Performance issues when running n-body simulation #11333

Closed
novoselrok opened this issue Oct 10, 2018 · 24 comments · Fixed by #15895

Comments

@novoselrok

Summary of Problem

Recently I asked for some help on StackOverflow on implementing an n-body simulation. Brad was helpful enough to provide a full-featured shared-memory implementation of the simulation. But after compiling and running it, I have run into performance issues with a large number of bodies (10,000 and more). The Chapel program seems to get 'stuck' for some amount of time at the forall q in pDomain with (+ reduce forces) loop (it eventually continues). You can verify this by observing the writeln output. Meanwhile, my OpenMP implementation does not run into any problems.

Steps to Reproduce

Source Code:
Here are some data files you can use for testing: link

config const filename = "input.txt";
config const iterations = 100;
config const out_filename = "out.txt";
const X = 0;
const Y = 1;
const Z = 2;
const G = 6.67e-11;
config const dt = 0.1;

// Read input file, initialize bodies                                       
var f = open(filename, iomode.r);
var reader = f.reader();

var n_bodies = reader.read(int);
const pDomain = {0..#n_bodies};

type vec3 = [0..#3] real;

var forces: [pDomain] vec3;
var velocities: [pDomain] vec3;
var positions: [pDomain] vec3;
var masses: [pDomain] real;

for i in pDomain {
  positions[i] = reader.read(vec3);

  velocities[i] = reader.read(vec3);

  masses[i] = reader.read(real);
}

reader.close();
f.close();

for i in 0..#iterations {
  // Reset forces                                                           
  forces = [0.0, 0.0, 0.0];
  writeln(i);
  forall q in pDomain with (+ reduce forces) {
    writeln("q ", q);
    stdout.flush();
    for k in pDomain {
      if k <= q {
        continue;
      }
      var diff = positions[q] - positions[k];
      var dist = sqrt(diff[X]**2 + diff[Y]**2 + diff[Z]**2);
      var dist_cubed = dist**3;

      var tmp = -G * masses[q] * masses[k] / dist_cubed;
      var force_qk = tmp * diff;

      forces[q] += force_qk;
      forces[k] -= force_qk;
    }
  }


  forall q in pDomain {
    positions[q] += dt * velocities[q];
    velocities[q] += dt / masses[q] * forces[q];
  }
}

var outf = open(out_filename, iomode.cw);
var writer = outf.writer();

for q in pDomain {
  writer.writeln("%er %er %er %er %er %er".format(positions[q][X], positions[q][Y], positions[q][Z], velocities[q][X], velocities[q][Y], velocities[q][Z]));
}

writer.close();
outf.close();

Compile command:
chpl -O -o main main.chpl

Execution command:
./main --filename=../data/test1e4.txt --iterations=1

Configuration Information

  • Output of chpl --version: chpl version 1.18.0
  • Output of $CHPL_HOME/util/printchplenv --anonymize:
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: intrinsics
CHPL_GMP: none
CHPL_HWLOC: hwloc
CHPL_REGEXP: re2
CHPL_AUX_FILESYS: none
  • Back-end compiler and version, e.g. gcc --version or clang --version: gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
  • CPU info:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               8
Model name:          AMD Ryzen 7 2700X Eight-Core Processor
Stepping:            2
CPU MHz:             1884.355
CPU max MHz:         3700,0000
CPU min MHz:         2200,0000
BogoMIPS:            7385.26
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

@bradcray
Member

I'm noting that you're not using the --fast flag, which can have a significant impact on Chapel performance (it turns off a number of execution-time sanity checks that can add a lot of overhead). While I'm trying to reproduce this, I'd be curious whether that helps you at all.
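For reference, assuming the same source file as in the report, that would be:

chpl --fast -o main main.chpl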

@novoselrok
Author

Unfortunately, I still have the same issue. I also ran it on a separate computer (a MacBook Pro), and the issue persists.

@bradcray
Member

Thanks for the quick check on --fast. I've verified that I also see a surprisingly long pause when entering the forall loop.

I have not verified anything about the OpenMP version... Note that the time required to enter our forall loop with the reduce intent should be somewhat comparable to OpenMP's allocation and initialization of the per-task forces arrays (since that's where we'd create our per-task copies), but I think you said on chat that you don't see any similar hiccup anywhere in OpenMP.

Here's a simple reproducer that doesn't require an input file:

var n_bodies = 10000;
const pDomain = {0..#n_bodies};

type vec3 = [0..#3] real;

var forces: [pDomain] vec3;
var velocities: [pDomain] vec3;
var positions: [pDomain] vec3;
var masses: [pDomain] real;

writeln("Before forall");
forall q in pDomain with (+ reduce forces) {
  if (q == 0) then
    writeln("In forall");
}

It looks to me like something about the reduce intent is bogging things down, as removing the with-clause is sufficient to make it run super-fast. For that reason, tagging @vasslitvinov as well as the @chapel-lang/perf-team.
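For concreteness, a minimal sketch of the two faster variants observed in this thread (the in-intent comparison is discussed in a later comment):

// dropping the with-clause entirely: the loop starts essentially instantly
forall q in pDomain {
  if (q == 0) then
    writeln("In forall");
}

// an in intent still creates per-task copies of forces, yet is much
// faster than the reduce intent (though it loses the reduction semantics)
forall q in pDomain with (in forces) {
  if (q == 0) then
    writeln("In forall");
}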

@ty1027

ty1027 commented Oct 10, 2018

On my Mac (OS X 10.11) with Chapel 1.16 (sorry, an old version...), changing this line:
type vec3 = [0..#3] real;
to:
type vec3 = 3*real;
seems to make the code very fast. Is this possibly related? (e.g., in the former case
the memory is not stored contiguously, causing a lot of overhead?)

Also, in the original n-body code, if I make a similar change to tuples, plus
X, Y, Z -> 1, 2, 3
read(vec3) -> read(real, real, real)
forces = [0.0, 0.0, 0.0]; -> forces = (0.0, 0.0, 0.0);
then the code runs very fast. (I learned this from
the BenchmarkGame page.)
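Applying those substitutions to Brad's reproducer above gives a small runnable sketch of the tuple variant (this relies on Chapel's elementwise + for homogeneous tuples, which the + reduction uses):

var n_bodies = 10000;
const pDomain = {0..#n_bodies};

type vec3 = 3*real;  // tuple instead of a [0..#3] real array

var forces: [pDomain] vec3;

writeln("Before forall");
forall q in pDomain with (+ reduce forces) {
  if (q == 0) then
    writeln("In forall");
}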

Using out1000.txt as input.txt, I get these results on my Mac (4 cores)
for iterations = 5 (with writeln("iteration = ", i) added):

$ chpl --fast nbody.chpl
$ time ./a.out
iteration = 0
iteration = 1
iteration = 2
iteration = 3
iteration = 4

real 0m4.090s
user 0m13.813s
sys 0m0.035s

$ chpl --fast nbody_tup.chpl
$ time ./a.out
iteration = 0
iteration = 1
iteration = 2
iteration = 3
iteration = 4

real 0m0.045s
user 0m0.086s
sys 0m0.007s

@benharsh
Member

benharsh commented Oct 10, 2018

@ty1027 beat me to it! I think you've stumbled across an unknown performance issue with arrays-of-arrays, so using tuples would be better in this case.

Edit: For other Chapel devs: offline we hypothesized that there were two causes for the poor performance:

  1. Overhead for creating/destroying the array-of-arrays in reduction's task-private variables
  2. Overhead for creating temporary arrays in the computation (var diff = positions[q] - positions[k];)
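Regarding cause 2, a hedged sketch of one workaround: scalarize the difference so the inner loop materializes no temporary vec3 arrays (same math as the original):

var dx = positions[q][X] - positions[k][X];
var dy = positions[q][Y] - positions[k][Y];
var dz = positions[q][Z] - positions[k][Z];
var dist_cubed = sqrt(dx**2 + dy**2 + dz**2)**3;
var tmp = -G * masses[q] * masses[k] / dist_cubed;
// accumulate componentwise; no diff or force_qk temporaries
forces[q][X] += tmp * dx;  forces[k][X] -= tmp * dx;
forces[q][Y] += tmp * dy;  forces[k][Y] -= tmp * dy;
forces[q][Z] += tmp * dz;  forces[k][Z] -= tmp * dz;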

@bradcray
Member

Though @benharsh refers to this as an unknown performance issue, after conferring with him, we're guessing that the underlying problem relates to some known existing issues (like #9414, in which a bunch of arrays with a shared const domain bash on it needlessly as they're created and destroyed; and #10911, which is my preferred way to address this specific case and other related problems).

It's interesting to note that, with my simple reproducer, changing from a reduce intent to an in intent makes things run much faster (though still not as fast as a tuple). Since the storage created in each case should be the same, I'm guessing that the problem is with the + reduction's identity() routine, which creates a local variable of the element type (in this case [0..#3] real), sets it, and returns it. This will result in the creation of lots more arrays with the shared const domain, bashing on it further.

This suggests to me, @vasslitvinov, that in the revised UDRI proposal, we should consider whether we can make the init() routine take the element in as a ref argument and initialize it in place. While the general optimizations for const domains alluded to above would reduce the current overheads, it still seems like it'd be preferable to zero out an array element in place rather than creating a temporary array, returning it, and assigning the result to a pre-existing one.
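To illustrate the shape of that change with a rough sketch (hypothetical signatures, not the actual UDRI interface):

// today (sketch): identity() declares, sets, and returns a fresh element;
// for an array element type, that means allocating a whole new array
proc identity() { var x: eltType; return x; }

// proposed (sketch): take the element by ref and zero it out in place
proc init(ref x: eltType) { x = 0; }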

@bradcray
Member

I want to assert, for those following this issue, that I believe it's our goal to have small const-sized arrays like this achieve performance that's nearly comparable to tuples. Until/unless we support some means of getting in-place allocation for such arrays, they'll probably always lag somewhat due to spatial locality issues and pointer dereferences, but I think we ought to be able to narrow the gap significantly compared to where it is today.

@ty1027

ty1027 commented Oct 10, 2018

I guess that people coming from C/C++ are very likely to write code with
[1..3] real (in a way similar to double [3]), which may cause a similar slowdown
and so not an ideal first impression... Const-size arrays (similar to tuples,
but with the full functionality of (serial) arrays) could be really nice for
such cases. (I've seen such a library here before,
so possibly just a wrapper type over an n-tuple might help?)
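A very rough sketch of what such a wrapper might look like (hypothetical type, untested; the exact method syntax varies across Chapel versions):

// a record around a homogeneous tuple that offers array-style indexing
record FixedVec {
  param n: int;      // compile-time length, like a tuple's
  var data: n*real;  // stored in place, contiguously

  // array-style indexing (1-based, like tuples)
  proc this(i: int) ref {
    return data(i);
  }
}

// elementwise +, so vector expressions read like the array version
proc +(a: FixedVec(?n), b: FixedVec(n)) {
  var c: FixedVec(n);
  for param i in 1..n do
    c.data(i) = a.data(i) + b.data(i);
  return c;
}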

Another concern is that people coming from Fortran (like me!) would probably write
the code with a rectangular 2-D array (of shape n_bodies x 3), because that is
fast in Fortran. For example:

config const filename = "input.txt";
config const iterations = 5;
config const out_filename = "out.txt";
const X = 1;
const Y = 2;
const Z = 3;
const G = 6.67e-11;
config const dt = 0.1;

// Read input file, initialize bodies                                       
var f = open(filename, iomode.r);
var reader = f.reader();

var n_bodies = reader.read(int);

const pRange = 1..n_bodies;

var forces, velocities, positions : [pRange, 1..3] real;

var masses : [pRange] real;

for i in pRange {
  positions[i, ..] = reader.read(real, real, real);
  velocities[i, ..] = reader.read(real, real, real);
  masses[i] = reader.read(real);
}

reader.close();
f.close();

for i in 0..#iterations {

  // Reset forces                                                           
  forces = 0.0;
  writeln( "iteration = ", i );

  forall q in pRange with (+ reduce forces) {

    for k in pRange {
      if k <= q then continue;

      var diff = positions[q, ..] - positions[k, ..];

      var dist = sqrt(diff[X]**2 + diff[Y]**2 + diff[Z]**2);

      var dist_cubed = dist**3;

      var tmp = -G * masses[q] * masses[k] / dist_cubed;
      var force_qk = tmp * diff;

      forces[q, ..] += force_qk;
      forces[k, ..] -= force_qk;
    }
  }

  forall q in pRange {
    positions[q, ..] += dt * velocities[q, ..];
    velocities[q, ..] += dt / masses[q] * forces[q, ..];
  }
}

var outf = open(out_filename, iomode.cw);
var writer = outf.writer();

for q in pRange {
  writer.writeln("%er %er %er %er %er %er".format(
                     positions[q, X], positions[q, Y], positions[q, Z],
                     velocities[q, X], velocities[q, Y], velocities[q, Z]));
}

writer.close();
outf.close();

But this gives (in the same environment as above, again with Chapel 1.16, an old version):

real 0m3.845s
user 0m12.057s
sys 0m0.024s

so the performance is similar to the code with the array of arrays. I guess
a similar scenario is very likely to happen for people coming from Fortran...

@novoselrok
Author

Including the changes from @ty1027, I can confirm that there is no longer an issue. The time is now comparable to the OpenMP implementation for 100,000 bodies and 1 iteration (6s for OpenMP, 10s for Chapel).

@novoselrok
Author

novoselrok commented Oct 20, 2018

I'm having the same issues when trying to implement the k-means algorithm in Chapel. Here is my implementation using reductions: https://github.com/novoselrok/parallel-algorithms/blob/eaf3efaa6307837b3d66e55c275ff006ca9836f7/kmeans/chapel/main.chpl

Compilation: chpl --fast -O -o main main.chpl
Execution: ./main --filename=../data/test100000.txt --rows=100000 --cols=10 --k=4

The commented-out line shows the proposed solution, using tuples instead of an array:

type Point = [0..#cols] real;
// type Point = 10 * real;

The non-tuple version takes 120s; the tuple version takes 0.8s for 100,000 points. But tuples are not really useful for this problem, since I do not know the point dimensions in advance. Is there any other way to work around this problem?

@ben-albrecht
Member

With the array version, the following line allocates a new array to store the result of data[i] on every iteration:

var point = data[i];

For the --rows=100000 --cols=10 case, that's creating 100,000 10-element arrays on every iteration!

I bet switching that line to a reference, avoiding the array copies, will speed your program up significantly:

ref point = data[i];

@novoselrok
Author

@ben-albrecht You are correct; this sped it up significantly for the 100,000-point test case (down to 1s).

To test it out, I generated another test case with 256,000 points, 16 columns, and 100 clusters (./main --filename=../data/test_R256k_C16_k100.txt --rows=256000 --cols=16 --k=100). This one still takes roughly 120s to complete (meanwhile the Python version takes <2s: https://github.com/novoselrok/parallel-algorithms/blob/26636c939f59ba6f64da136616ba7dd67bd641d7/kmeans/chapel/main.py). I know that the Python version calls some C libraries underneath, but they are single-threaded. That's why I expect the parallel Chapel version to be at least competitive with the Python version.

@ben-albrecht
Member

ben-albrecht commented Oct 21, 2018

This one still takes roughly 120s to complete (meanwhile the Python version takes <2s

I'm getting ~52 seconds for Chapel (1.19 pre-release 9099c557dd) and ~5 seconds for Python 3 (Anaconda 3.6.5) on my machine (4 physical cores), on master (26636c939f).

For reference, here are my reproducer steps:

# kmeans/data/
$ python3 gen.py 256000 16 100 test_R256k_C16_k100

# kmeans/chapel/
$ chpl main.chpl --fast
$ ./main --filename=../data/test_R256k_C16_k100.txt --rows=256000 --cols=16 --k=100

Another improvement I noticed: reduce the number of dists arrays allocated by allocating the array in the outer loop and using an in intent to copy it into the forall. This results in numTasks (~number of cores) allocations of the dists array rather than pDomain.size (~256k):

    var dists: [clusters.domain] real;
    forall i in pDomain with (+ reduce new_clusters, in dists) {
        ref point = data[i];
        dists = [cluster in clusters] cluster.dist(point);
        ...
    }

That shaved 5 seconds off for me - now ~47 seconds.

I know that the Python version calls some C libraries underneath, but they are single threaded.

When trying to optimize an algorithm against a reference version, it is usually easier to match the serial performance first, before introducing parallelism.

I switched the forall to a for (and dropped the with clauses) to see how the serial implementation would compare, and got ~24 seconds (2x faster). @chapel-lang/perf-team - maybe there is a performance issue for us to investigate here.

@novoselrok - From here, I'd suggest taking a look at the source code of sklearn.cluster.k_means_.py and trying to utilize the same optimizations they have implemented, to match their serial implementation with serial Chapel. For example, it looks like they precompute distances under some heuristically determined condition (matrix size < 100MB). Timing subsections of your implementation might give you a good idea of where to start looking for improvements. Alternatively, you could look at which parts sklearn offloads to C (via Cython) to see where performance matters most.
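For the subsection timing, a minimal sketch using the standard Time module (Timer being the stopwatch type in this era of Chapel):

use Time;

var t: Timer;
t.start();
// ... the section under investigation, e.g. the distance computations ...
t.stop();
writeln("section took ", t.elapsed(), " s");
t.clear();  // reset before timing the next section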

That's why I am expecting the parallel Chapel version to be at least competitive with the Python version.

I think this expectation is reasonable assuming the serial Chapel implementation matches the serial Python implementation, and the algorithm / workload / hardware can take advantage of parallelism.

It's generally easy to write a Chapel implementation that outperforms a Python implementation. However, as you pointed out, we're comparing a Chapel implementation to a Python implementation that calls C for the computationally expensive portions of the program. Chapel's comparisons to C are closer to parity, so I think it will take some effort, e.g. using the same optimizations as the sklearn implementation, to compete on performance.

@novoselrok
Author

Thank you for the detailed response! I agree that it's not a completely valid comparison between the Chapel and Python/C versions, since the underlying algorithms are quite different (the Python one has plenty of optimizations and heuristics). Perhaps it would be better to write a C version myself (serial, then adding OpenMP for parallelization) with the same algorithm as the Chapel version and then compare the two. Still, I'm not expecting the handwritten C performance to degrade that much compared to Python/C. At the very least, I can use the Chapel version as an example of how succinctly the k-means algorithm can be written 😃

@bradcray
Member

@ben-albrecht: Maybe I've lost the thread, but I thought we'd already identified the performance issue (here and here) as being due to needless domain-array ownership management, caused by not optimizing for const domains as well as we should (i.e., issues #10911 and #9414).

For those not up-to-date on the issue, the problem is that, in general, domains must track their arrays so that if the domain is resized, the arrays can be reallocated. For a const domain, this is not semantically required, but we don't optimize it away. So if I create a million arrays over a small const domain, I'll needlessly pay some (ridiculous) overhead managing that domain's ownership of those million arrays. E.g., here:

var A: [1..1000000] [1..3] real;

Since {1..3} is a constant domain, we shouldn't need to track the million arrays declared over it, yet we do. If we stopped doing this for const domains, the theory is that the cost of these arrays would come way down relative to tuples (though there'd still probably be a slight difference due to heap vs. in-place allocation and array metadata).
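The tracking itself is easy to see with a non-const domain (a minimal sketch):

var D = {1..3};
var A: [D] real;  // D records that A is declared over it
D = {1..6};       // resizing D reallocates A behind the scenes
writeln(A.size);  // prints 6

For a const domain that resizing can never happen, so the bookkeeping buys nothing.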

@ben-albrecht
Member

ben-albrecht commented Oct 22, 2018

Maybe I've lost the thread, but I thought we'd already identified the performance issue

@novoselrok posted about a k-means implementation where he was encountering performance issues as well, so I was suggesting some performance improvements specific to that code.

That said, the const-domain optimization (or lack thereof) identified in the n-body implementation is definitely impacting the k-means code as well.

@bradcray
Member

posted about a k means implementation

Oh, you're right that I overlooked this topic switch. It seems to me that this should've been forked into its own issue at that point.

the const domain optimization (or lack of) performance issue identified in the N-body implementation is definitely impacting the k means code as well.

Is there evidence that there are additional issues other than that one?

Belatedly, @vasslitvinov's issue #11426 reminded me that we saw surprising performance overheads when using reduce intents as compared to in intents, which might be a second source of overhead that I brushed past too quickly in my previous summary.

@ben-albrecht
Member

It seems to me that this should've been forked into its own issue at that point.

Agreed, and it's still not too late!

Is there evidence that there are additional issues other than that one?

Switching the forall to a serial for in the k-means implementation resulted in a 2x performance improvement on my machine:

    forall i in pDomain with (+ reduce new_clusters) {
        ref point = data[i];
        dists = [cluster in clusters] cluster.dist(point);
        var (_, min_index) = minloc reduce zip(dists, dists.domain);
        labels[i] = min_index;
        new_clusters[min_index].add_point(point);
    }

vs.

    for i in pDomain {
        ref point = data[i];
        dists = [cluster in clusters] cluster.dist(point);
        var (_, min_index) = minloc reduce zip(dists, dists.domain);
        labels[i] = min_index;
        new_clusters[min_index].add_point(point);
    }

This was for the pDomain.size=256000 and new_clusters.domain.size=100 case:

./main --filename=../data/test_R256k_C16_k100.txt --rows=256000 --cols=16 --k=100

This could be related to the reduce intent overhead issue (#11426) you mentioned.


vasslitvinov added a commit to vasslitvinov/chapel that referenced this issue Nov 2, 2018
Avoid creating an array in chpl__sumType()
vasslitvinov added a commit that referenced this issue Nov 2, 2018
Avoid creating an array in chpl__sumType()

chpl__sumType() computes the type of `x+x` given the type of `x`.
To do so, it declares the variable `x`, computes `x+x` and takes its type.
All this happens strictly at compile time, which is desired, as long as
the type is not a "runtime type" i.e. an array or a domain.

This is time-consuming in the case `x` is an array. Even though chpl__sumType()
is smart to run `x+x` only on a variable of the array's element type, and
if that itself is an array, then only on its element type recursively,
allocating an array is already expensive, especially when:
* an array is large,
* it is an array of arrays, and/or
* it is a distributed array.

This is one of the causes of the slowdown discussed in #11333 / #11426.

In most practical cases, the type of `x+x` is the same as the type of `x`.
If so, we can save the hassle of allocating the array.

This PR adds a check "is the type of `x+x` the same as the type of `x` ?".
If so, chpl__sumType() skips declaring the variable `x`, instead returning
its type immediately.

This check is "conservative", i.e. it has false negatives: it may say "no"
in some cases where chpl__sumType() could return immediately. If so,
chpl__sumType() will do the unnecessary work of allocating the `x`.
This is largely only a performance concern, because chpl__sumType()
still produces a correct result. In practice this situation should be rare.

While there, emit a user-friendly error when + reducing something that does not
have a valid + implementation.

r: @mppf
@vasslitvinov
Member

FYI, I posted some findings related to n-body performance here.

@bradcray
Member

bradcray commented Nov 5, 2018

After PR #11556, loop startup time in cases that use reduce intents has been reduced dramatically. Teardown is still not so good (similar to in intents), so there's more to do here still, linked back to issue #10911.

@ronawho
Contributor

ronawho commented Jun 19, 2020

#15895 will dramatically reduce the teardown time for the array-of-array cases, which I think will resolve this.

For Brad's smaller reproducer in #11333 (comment) I see a ~700x speedup:

$ time -p ./brad-11333-repro-master
Before forall
In forall
real 486.60

$ time -p ./brad-11333-repro-pr
Before forall
In forall
real 0.68

@ronawho
Contributor

ronawho commented Jun 19, 2020

For the original n-body code (with the writeln("q ", q) commented out), here's what I see:

$ time -p ./nbody-master --filename=test1e3.txt --iterations=1
real 192.94
$ time -p ./nbody-pr --filename=test1e3.txt --iterations=1
real 1.21

@novoselrok I don't have a sense of the expected performance here, so if you get a chance I'd love some more official comparisons.

ronawho added a commit that referenced this issue Jun 19, 2020
Switch the domain's list of arrays to a hashtable

[codeveloped with @mppf]

Domains keep track of the arrays declared over them so that when a
domain is resized it can resize all the arrays declared over it.
Previously, we used a simple LinkedList to store the arrays, which had
performance issues because removal from a linked list is O(n). For the
most part we got lucky and removal happened in reverse insertion order
so removal was effectively O(1). This was because we did serial init and
deinit of arrays-of-arrays, but there are still cases where the
insertion or removal was parallel, which killed performance. This
includes creating arrays in a task-intent clause (#11333) and having a
distributed array of arrays (#9414). This significantly improves the
performance of those cases, and does not hurt serial insertion/deletion
performance.

We have long wanted to use a set/hashtable to store a domain's arrays, but
previously we couldn't because using an associative array inside BaseDom and
BaseDist would result in a circular dependency. Now that chpl__hashtable has
been split out we can just use that. This creates a simple set wrapper over
hashtable. We do this instead of using the standard library set to have more
control and to avoid bringing in all of set code to hopefully limit the impact
on compilation speed.

This improves startup/teardown for a single node forall loop using
array-of-arrays task intents. There's a 75x speedup for ri-a2-AoA.chpl,
which is a reproducer for #11333. This also improves deinit for a block-dist
array-of-arrays by 500x at 512 nodes (improving non-kernel scalability
for isx -- #9414). Additionally, this will allow us to parallelize array
initialization for all types and array deinitialization in a future PR.

Some technical notes:
 - I had to comment out some `key` printing in chpl__hashtable to avoid
   dynamic dispatch trying to stamp out read/writeThis methods for all
   array types. This resulted in trying to print arrays of sync vars, and
   we don't support writing syncs.
 - Removing an element that isn't in the set is a no-op. This is
   required for optimizations/bulkcomm/bharshbarg/arrayViews, but I made
   inserting an existing element an assertion failure.
 - rootLocaleInitialized is now used earlier, so I moved it to
   ChapelBase.

Resolves #9414
Resolves #11333
Resolves Cray/chapel-private#1036
Closes Cray/chapel-private#1063
@novoselrok
Author

@ronawho Even though I'm not using Chapel day-to-day anymore, I really appreciate the effort. The timing numbers you posted seem to be what I was looking for (as far as I can remember).
