Performance issues when running n-body simulation #11333

Closed
novoselrok opened this issue Oct 10, 2018 · 24 comments · Fixed by #15895

Comments

@novoselrok

Summary of Problem

Recently I asked for some help on StackOverflow on implementing an n-body simulation. Brad was helpful enough to provide a full-featured shared-memory implementation of the simulation. But after compiling and running it, I have run into performance issues with a large number of bodies (10,000 and more). The Chapel program seems to get 'stuck' for some amount of time at the forall q in pDomain with (+ reduce forces) loop (it eventually continues). You can verify this by observing the writeln output. Meanwhile, my OpenMP implementation does not run into any problems.

Steps to Reproduce

Source Code:
Here are some data files you can use for testing: link

config const filename = "input.txt";
config const iterations = 100;
config const out_filename = "out.txt";
const X = 0;
const Y = 1;
const Z = 2;
const G = 6.67e-11;
config const dt = 0.1;

// Read input file, initialize bodies                                       
var f = open(filename, iomode.r);
var reader = f.reader();

var n_bodies = reader.read(int);
const pDomain = {0..#n_bodies};

type vec3 = [0..#3] real;

var forces: [pDomain] vec3;
var velocities: [pDomain] vec3;
var positions: [pDomain] vec3;
var masses: [pDomain] real;

for i in pDomain {
  positions[i] = reader.read(vec3);

  velocities[i] = reader.read(vec3);

  masses[i] = reader.read(real);
}

reader.close();
f.close();

for i in 0..#iterations {
  // Reset forces                                                           
  forces = [0.0, 0.0, 0.0];
  writeln(i);
  forall q in pDomain with (+ reduce forces) {
    writeln("q ", q);
    stdout.flush();
    for k in pDomain {
      if k <= q {
        continue;
      }
      var diff = positions[q] - positions[k];
      var dist = sqrt(diff[X]**2 + diff[Y]**2 + diff[Z]**2);
      var dist_cubed = dist**3;

      var tmp = -G * masses[q] * masses[k] / dist_cubed;
      var force_qk = tmp * diff;

      forces[q] += force_qk;
      forces[k] -= force_qk;
    }
  }


  forall q in pDomain {
    positions[q] += dt * velocities[q];
    velocities[q] += dt / masses[q] * forces[q];
  }
}

var outf = open(out_filename, iomode.cw);
var writer = outf.writer();

for q in pDomain {
  writer.writeln("%er %er %er %er %er %er".format(positions[q][X], positions[q][Y], positions[q][Z], velocities[q][X], velocities[q][Y], velocities[q][Z]));
}

writer.close();
outf.close();

Compile command:
chpl -O -o main main.chpl

Execution command:
./main --filename=../data/test1e4.txt --iterations=1

Configuration Information

  • Output of chpl --version: chpl version 1.18.0
  • Output of $CHPL_HOME/util/printchplenv --anonymize:
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: intrinsics
CHPL_GMP: none
CHPL_HWLOC: hwloc
CHPL_REGEXP: re2
CHPL_AUX_FILESYS: none
  • Back-end compiler and version, e.g. gcc --version or clang --version: gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
  • CPU info:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               8
Model name:          AMD Ryzen 7 2700X Eight-Core Processor
Stepping:            2
CPU MHz:             1884.355
CPU max MHz:         3700,0000
CPU min MHz:         2200,0000
BogoMIPS:            7385.26
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

@bradcray
Member

I'm noting that you're not using the --fast flag, which can have a significant impact on Chapel performance (it turns off a number of execution-time sanity checks that can add a lot of overhead). While I'm trying to reproduce this, I'd be curious whether that helps you at all.
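For reference, assuming the same source file as in the report, that would be:

chpl --fast -o main main.chpl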

@novoselrok
Author

Unfortunately, I still have the same issue. I also ran it on a separate computer (a MacBook Pro), and the issue persists.

@bradcray
Member

Thanks for the quick check on --fast. I've verified that I also see a surprisingly long pause when entering the forall loop.

I have not verified anything about the OpenMP version... Note that the time required to enter our forall loop with the reduce intent should be somewhat comparable to OpenMP's allocation and initialization of the per-task forces arrays (since that's where we'd create our per-task copies), but I think you said on chat that you don't see any similar hiccup anywhere in OpenMP.

Here's a simple reproducer that doesn't require an input file:

var n_bodies = 10000;
const pDomain = {0..#n_bodies};

type vec3 = [0..#3] real;

var forces: [pDomain] vec3;
var velocities: [pDomain] vec3;
var positions: [pDomain] vec3;
var masses: [pDomain] real;

writeln("Before forall");
forall q in pDomain with (+ reduce forces) {
  if (q == 0) then
    writeln("In forall");
}

It looks to me like something about the reduce intent is bogging things down, as removing the with-clause is sufficient to make it run super-fast. For that reason, tagging @vasslitvinov as well as the @chapel-lang/perf-team.
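For concreteness, a minimal sketch of the two faster variants observed in this thread (the in-intent comparison is discussed in a later comment):

// dropping the with-clause entirely: the loop starts essentially instantly
forall q in pDomain {
  if (q == 0) then
    writeln("In forall");
}

// an in intent still creates per-task copies of forces, yet is much
// faster than the reduce intent (though it loses the reduction semantics)
forall q in pDomain with (in forces) {
  if (q == 0) then
    writeln("In forall");
}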

@ty1027

ty1027 commented Oct 10, 2018

On my Mac (OS X 10.11) with Chapel 1.16 (sorry, an old version...), changing this line:
type vec3 = [0..#3] real;
to:
type vec3 = 3*real;
seems to make the code very fast. Is this possibly related? (e.g., in the former case
the memory is not stored contiguously, causing a lot of overhead?)

Also, in the original n-body code, if I make a similar change to tuples, plus
X, Y, Z -> 1, 2, 3
read(vec3) -> read(real, real, real)
forces = [0.0, 0.0, 0.0]; -> forces = (0.0, 0.0, 0.0);
then the code runs very fast. (I learned this from
the BenchmarkGame page.)
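Applying those substitutions to Brad's reproducer above gives a small runnable sketch of the tuple variant (this relies on Chapel's elementwise + for homogeneous tuples, which the + reduction uses):

var n_bodies = 10000;
const pDomain = {0..#n_bodies};

type vec3 = 3*real;  // tuple instead of a [0..#3] real array

var forces: [pDomain] vec3;

writeln("Before forall");
forall q in pDomain with (+ reduce forces) {
  if (q == 0) then
    writeln("In forall");
}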

Using out1000.txt as input.txt, I get these results on my Mac (4 cores)
for iterations = 5 (with writeln("iteration = ", i) added):

$ chpl --fast nbody.chpl
$ time ./a.out
iteration = 0
iteration = 1
iteration = 2
iteration = 3
iteration = 4

real 0m4.090s
user 0m13.813s
sys 0m0.035s

$ chpl --fast nbody_tup.chpl
$ time ./a.out
iteration = 0
iteration = 1
iteration = 2
iteration = 3
iteration = 4

real 0m0.045s
user 0m0.086s
sys 0m0.007s

@benharsh
Member

benharsh commented Oct 10, 2018

@ty1027 beat me to it! I think you've stumbled across an unknown performance issue with arrays-of-arrays, so using tuples would be better in this case.

Edit: For other Chapel devs: offline we hypothesized that there were two causes for the poor performance:

  1. Overhead for creating/destroying the array-of-arrays in reduction's task-private variables
  2. Overhead for creating temporary arrays in the computation (var diff = positions[q] - positions[k];)
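Regarding cause 2, a hedged sketch of one workaround: scalarize the difference so the inner loop materializes no temporary vec3 arrays (same math as the original):

var dx = positions[q][X] - positions[k][X];
var dy = positions[q][Y] - positions[k][Y];
var dz = positions[q][Z] - positions[k][Z];
var dist_cubed = sqrt(dx**2 + dy**2 + dz**2)**3;
var tmp = -G * masses[q] * masses[k] / dist_cubed;
// accumulate componentwise; no diff or force_qk temporaries
forces[q][X] += tmp * dx;  forces[k][X] -= tmp * dx;
forces[q][Y] += tmp * dy;  forces[k][Y] -= tmp * dy;
forces[q][Z] += tmp * dz;  forces[k][Z] -= tmp * dz;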

@bradcray
Member

Though @benharsh refers to this as an unknown performance issue, after conferring with him, we're guessing that the underlying problem relates to some known existing issues (like #9414, in which a bunch of arrays with a shared const domain bash on it needlessly as they're created and destroyed; and #10911, which is my preferred way to address this specific case and other related problems).

It's interesting to note that, with my simple reproducer, changing from a reduce intent to an in intent makes things run much faster (though still not as fast as a tuple). Since the storage created in each case should be the same, I'm guessing that the problem is with the + reduction's identity() routine, which creates a local variable of the element type (in this case [0..#3] real), sets it, and returns it. This will result in the creation of lots more arrays with the shared const domain, bashing on it further.

This suggests to me, @vasslitvinov, that in the revised UDRI proposal, we should consider whether we can make the init() routine take the element in as a ref argument and initialize it in place. While the general optimizations for const domains alluded to above would reduce the current overheads, it still seems like it'd be preferable to zero out an array element in place rather than creating a temporary array, returning it, and assigning the result to a pre-existing one.
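To illustrate the shape of that change with a rough sketch (hypothetical signatures, not the actual UDRI interface):

// today (sketch): identity() declares, sets, and returns a fresh element;
// for an array element type, that means allocating a whole new array
proc identity() { var x: eltType; return x; }

// proposed (sketch): take the element by ref and zero it out in place
proc init(ref x: eltType) { x = 0; }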

@bradcray
Member

I want to assert, for those following this issue, that I believe it's our goal to have small const-sized arrays like this achieve performance that's nearly comparable to tuples. Until/unless we support some means of getting in-place allocation for such arrays, they'll probably always lag somewhat due to spatial locality issues and pointer dereferences, but I think we ought to be able to narrow the gap significantly compared to where it is today.

@ty1027

ty1027 commented Oct 10, 2018

I guess that people coming from C/C++ are very likely to write code with
[1..3] real (in a way similar to double [3]), which may cause a similar slowdown
and so not an ideal first impression... Const-size arrays (similar to tuples,
but with the full functionality of (serial) arrays) could be really nice for
such cases. (I've seen such a library here before,
so possibly just a wrapper type over an n-tuple might help?)
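A very rough sketch of what such a wrapper might look like (hypothetical type, untested; the exact method syntax varies across Chapel versions):

// a record around a homogeneous tuple that offers array-style indexing
record FixedVec {
  param n: int;      // compile-time length, like a tuple's
  var data: n*real;  // stored in place, contiguously

  // array-style indexing (1-based, like tuples)
  proc this(i: int) ref {
    return data(i);
  }
}

// elementwise +, so vector expressions read like the array version
proc +(a: FixedVec(?n), b: FixedVec(n)) {
  var c: FixedVec(n);
  for param i in 1..n do
    c.data(i) = a.data(i) + b.data(i);
  return c;
}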

Another concern is that people coming from Fortran (like me!) would probably write
the code with a rectangular 2-D array (of shape n_bodies x 3), because that is
fast in Fortran. For example:

config const filename = "input.txt";
config const iterations = 5;
config const out_filename = "out.txt";
const X = 1;
const Y = 2;
const Z = 3;
const G = 6.67e-11;
config const dt = 0.1;

// Read input file, initialize bodies                                       
var f = open(filename, iomode.r);
var reader = f.reader();

var n_bodies = reader.read(int);

const pRange = 1..n_bodies;

var forces, velocities, positions : [pRange, 1..3] real;

var masses : [pRange] real;

for i in pRange {
  positions[i, ..] = reader.read(real, real, real);
  velocities[i, ..] = reader.read(real, real, real);
  masses[i] = reader.read(real);
}

reader.close();
f.close();

for i in 0..#iterations {

  // Reset forces                                                           
  forces = 0.0;
  writeln( "iteration = ", i );

  forall q in pRange with (+ reduce forces) {

    for k in pRange {
      if k <= q then continue;

      var diff = positions[q, ..] - positions[k, ..];

      var dist = sqrt(diff[X]**2 + diff[Y]**2 + diff[Z]**2);

      var dist_cubed = dist**3;

      var tmp = -G * masses[q] * masses[k] / dist_cubed;
      var force_qk = tmp * diff;

      forces[q, ..] += force_qk;
      forces[k, ..] -= force_qk;
    }
  }

  forall q in pRange {
    positions[q, ..] += dt * velocities[q, ..];
    velocities[q, ..] += dt / masses[q] * forces[q, ..];
  }
}

var outf = open(out_filename, iomode.cw);
var writer = outf.writer();

for q in pRange {
  writer.writeln("%er %er %er %er %er %er".format(
                     positions[q, X], positions[q, Y], positions[q, Z],
                     velocities[q, X], velocities[q, Y], velocities[q, Z]));
}

writer.close();
outf.close();

But this gives (in the same environment as above, again with Chapel 1.16, an old version):

real 0m3.845s
user 0m12.057s
sys 0m0.024s

so the performance is similar to the code with the array of arrays. I guess
a similar scenario is very likely to happen for people coming from Fortran...

@novoselrok
Author

Including the changes from @ty1027, I can confirm that there is no longer an issue. The time is now comparable to the OpenMP implementation for 100,000 bodies and 1 iteration (6s for OpenMP, 10s for Chapel).

@novoselrok
Author

novoselrok commented Oct 20, 2018

I'm having the same issues when trying to implement the k-means algorithm in Chapel. Here is my implementation using reductions: https://github.com/novoselrok/parallel-algorithms/blob/eaf3efaa6307837b3d66e55c275ff006ca9836f7/kmeans/chapel/main.chpl

Compilation: chpl --fast -O -o main main.chpl
Execution: ./main --filename=../data/test100000.txt --rows=100000 --cols=10 --k=4

The commented-out line shows the proposed solution, using tuples instead of an array:

type Point = [0..#cols] real;
// type Point = 10 * real;

The non-tuple version takes 120s; the tuple version takes 0.8s for 100,000 points. But tuples are not really useful for this problem, since I do not know the point dimensions in advance. Is there any other way to work around this problem?

@ben-albrecht
Member

With the array version, the following line allocates a new array to store the result of data[i] on every iteration:

var point = data[i];

For the --rows=100000 --cols=10 case, that's creating 100,000 10-element arrays on every iteration!

I bet switching that line to a reference, avoiding the array copies, will speed your program up significantly:

ref point = data[i];

@novoselrok
Author

@ben-albrecht You are correct; this sped it up significantly for the 100,000-point test case (down to 1s).

To test it out, I generated another test case with 256,000 points, 16 columns, and 100 clusters (./main --filename=../data/test_R256k_C16_k100.txt --rows=256000 --cols=16 --k=100). This one still takes roughly 120s to complete (meanwhile the Python version takes <2s: https://github.com/novoselrok/parallel-algorithms/blob/26636c939f59ba6f64da136616ba7dd67bd641d7/kmeans/chapel/main.py). I know that the Python version calls some C libraries underneath, but they are single-threaded. That's why I expect the parallel Chapel version to be at least competitive with the Python version.

@ben-albrecht
Member

ben-albrecht commented Oct 21, 2018

This one still takes roughly 120s to complete (meanwhile the Python version takes <2s

I'm getting ~52 seconds for Chapel (1.19 pre-release 9099c557dd) and ~5 seconds for Python 3 (Anaconda 3.6.5) on my machine (4 physical cores), on master (26636c939f).

For reference, here are my reproducer steps:

# kmeans/data/
$ python3 gen.py 256000 16 100 test_R256k_C16_k100

# kmeans/chapel/
$ chpl main.chpl --fast
$ ./main --filename=../data/test_R256k_C16_k100.txt --rows=256000 --cols=16 --k=100

Another improvement I noticed: reduce the number of dists arrays allocated by allocating the array in the outer loop and using an in intent to copy it into the forall. This results in numTasks (~number of cores) allocations of the dists array rather than pDomain.size (~256k):

    var dists: [clusters.domain] real;
    forall i in pDomain with (+ reduce new_clusters, in dists) {
        ref point = data[i];
        dists = [cluster in clusters] cluster.dist(point);
        ...
    }

That shaved 5 seconds off for me - now ~47 seconds.

I know that the Python version calls some C libraries underneath, but they are single threaded.

When trying to optimize an algorithm against a reference version, it is usually easier to match the serial performance first, before introducing parallelism.

I switched the forall to a for (and dropped the with clauses) to see how the serial implementation would compare, and got ~24 seconds (2x faster). @chapel-lang/perf-team - maybe there is a performance issue for us to investigate here.

@novoselrok - From here, I'd suggest taking a look at the source code of sklearn.cluster.k_means_.py and trying to utilize the same optimizations they have implemented, to match their serial implementation with serial Chapel. For example, it looks like they precompute distances under some heuristically determined condition (matrix size < 100MB). Timing subsections of your implementation might give you a good idea of where to start looking for improvements. Alternatively, you could look at which parts sklearn offloads to C (via Cython) to see where performance matters most.
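For the subsection timing, a minimal sketch using the standard Time module (Timer being the stopwatch type in this era of Chapel):

use Time;

var t: Timer;
t.start();
// ... the section under investigation, e.g. the distance computations ...
t.stop();
writeln("section took ", t.elapsed(), " s");
t.clear();  // reset before timing the next section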

That's why I am expecting the parallel Chapel version to be at least competitive with the Python version.

I think this expectation is reasonable assuming the serial Chapel implementation matches the serial Python implementation, and the algorithm / workload / hardware can take advantage of parallelism.

It's generally easy to write a Chapel implementation that outperforms a Python implementation. However, as you pointed out, we're comparing a Chapel implementation to a Python implementation that calls C for the computationally expensive portions of the program. Chapel's comparisons to C are closer to parity, so I think it will take some effort, e.g. using the same optimizations as the sklearn implementation, to compete on performance.

@novoselrok
Author

Thank you for the detailed response! I agree that it's not a completely valid comparison between the Chapel and Python/C versions, since the underlying algorithms are quite different (the Python one has plenty of optimizations and heuristics). Perhaps it would be better to write a C version myself (serial, then adding OpenMP for parallelization) with the same algorithm as the Chapel version and then compare the two. Still, I'm not expecting the handwritten C performance to degrade that much compared to Python/C. At the very least, I can use the Chapel version as an example of how succinctly the k-means algorithm can be written 😃

@bradcray
Member

@ben-albrecht: Maybe I've lost the thread, but I thought we'd already identified the performance issue (here and here) as being due to needless domain-array ownership management, caused by not optimizing for const domains as well as we should (i.e., issues #10911 and #9414).

For those not up-to-date on the issue, the problem is that, in general, domains must track their arrays so that if the domain is resized, the arrays can be reallocated. For a const domain, this is not semantically required, but we don't optimize it away. So if I create a million arrays over a small const domain, I'll needlessly pay some (ridiculous) overhead managing that domain's ownership of those million arrays. E.g., here:

var A: [1..1000000] [1..3] real;

Since {1..3} is a constant domain, we shouldn't need to track the million arrays declared over it, yet we do. If we stopped doing this for const domains, the theory is that the cost of these arrays would come way down relative to tuples (though there'd still probably be a slight difference due to heap vs. in-place allocation and array metadata).
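The tracking itself is easy to see with a non-const domain (a minimal sketch):

var D = {1..3};
var A: [D] real;  // D records that A is declared over it
D = {1..6};       // resizing D reallocates A behind the scenes
writeln(A.size);  // prints 6

For a const domain that resizing can never happen, so the bookkeeping buys nothing.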

@ben-albrecht
Member

ben-albrecht commented Oct 22, 2018

Maybe I've lost the thread, but I thought we'd already identified the performance issue

@novoselrok posted about a k-means implementation where he was encountering performance issues as well, so I was suggesting some performance improvements specific to that code.

That said, the const-domain optimization (or lack thereof) identified in the n-body implementation is definitely impacting the k-means code as well.

@bradcray
Member

posted about a k means implementation

Oh, you're right that I overlooked this topic switch. It seems to me that this should've been forked into its own issue at that point.

the const domain optimization (or lack of) performance issue identified in the N-body implementation is definitely impacting the k means code as well.

Is there evidence that there are additional issues other than that one?

Belatedly, @vasslitvinov's issue #11426 reminded me that we saw surprising performance overheads when using reduce intents as compared to in intents, which might be a second source of overhead that I brushed past too quickly in my previous summary.

@ben-albrecht
Member

It seems to me that this should've been forked into its own issue at that point.

Agreed, and it's still not too late!

Is there evidence that there are additional issues other than that one?

Switching the forall to a serial for in the k-means implementation resulted in a 2x performance improvement on my machine:

    forall i in pDomain with (+ reduce new_clusters) {
        ref point = data[i];
        dists = [cluster in clusters] cluster.dist(point);
        var (_, min_index) = minloc reduce zip(dists, dists.domain);
        labels[i] = min_index;
        new_clusters[min_index].add_point(point);
    }

vs.

    for i in pDomain {
        ref point = data[i];
        dists = [cluster in clusters] cluster.dist(point);
        var (_, min_index) = minloc reduce zip(dists, dists.domain);
        labels[i] = min_index;
        new_clusters[min_index].add_point(point);
    }

This was for the pDomain.size=256000 and new_clusters.domain.size=100 case:

./main --filename=../data/test_R256k_C16_k100.txt --rows=256000 --cols=16 --k=100

This could be related to the reduce intent overhead issue (#11426) you mentioned.


vasslitvinov added a commit to vasslitvinov/chapel that referenced this issue Nov 2, 2018
Avoid creating an array in chpl__sumType()
vasslitvinov added a commit that referenced this issue Nov 2, 2018
Avoid creating an array in chpl__sumType()

chpl__sumType() computes the type of `x+x` given the type of `x`.
To do so, it declares the variable `x`, computes `x+x` and takes its type.
All this happens strictly at compile time, which is desired, as long as
the type is not a "runtime type" i.e. an array or a domain.

This is time-consuming in the case `x` is an array. Even though chpl__sumType()
is smart to run `x+x` only on a variable of the array's element type, and
if that itself is an array, then only on its element type recursively,
allocating an array is already expensive, especially when:
* an array is large,
* it is an array of arrays, and/or
* it is a distributed array.

This is one of the causes of the slowdown discussed in #11333 / #11426.

In most practical cases, the type of `x+x` is the same as the type of `x`.
If so, we can save the hassle of allocating the array.

This PR adds a check "is the type of `x+x` the same as the type of `x` ?".
If so, chpl__sumType() skips declaring the variable `x`, instead returning
its type immediately.

This check is "conservative", i.e. it has false negatives: it may say "no"
in some cases where chpl__sumType() could return immediately. If so,
chpl__sumType() will do the unnecessary work of allocating the `x`.
This is largely only a performance concern, because chpl__sumType()
still produces a correct result. In practice this situation should be rare.

While there, emit a user-friendly error when + reducing something that does not
have a valid + implementation.

r: @mppf
@vasslitvinov
Member

FYI, I posted some findings related to n-body performance here.

@bradcray
Member

bradcray commented Nov 5, 2018

After PR #11556, loop startup time in cases that use reduce intents has been reduced dramatically. Teardown is still not so good (similar to in intents), so there's more to do here still, linked back to issue #10911.

@ronawho
Contributor

ronawho commented Jun 19, 2020

#15895 will dramatically reduce the teardown time for the array-of-array cases, which I think will resolve this.

For Brad's smaller reproducer in #11333 (comment) I see a ~700x speedup:

$ time -p ./brad-11333-repro-master
Before forall
In forall
real 486.60

$ time -p ./brad-11333-repro-pr
Before forall
In forall
real 0.68

@ronawho
Contributor

ronawho commented Jun 19, 2020

For the original n-body code (with the writeln("q ", q) commented out), here's what I see:

$ time -p ./nbody-master --filename=test1e3.txt --iterations=1
real 192.94
$ time -p ./nbody-pr --filename=test1e3.txt --iterations=1
real 1.21

@novoselrok I don't have a sense of the expected performance here, so if you get a chance I'd love some more official comparisons.

ronawho added a commit that referenced this issue Jun 19, 2020
Switch the domain's list of arrays to a hashtable

[codeveloped with @mppf]

Domains keep track of the arrays declared over them so that when a
domain is resized it can resize all the arrays declared over it.
Previously, we used a simple LinkedList to store the arrays, which had
performance issues because removal from a linked list is O(n). For the
most part we got lucky and removal happened in reverse insertion order
so removal was effectively O(1). This was because we did serial init and
deinit of arrays-of-arrays, but there are still cases where the
insertion or removal was parallel, which killed performance. This
includes creating arrays in a task-intent clause (#11333) and having a
distributed array of arrays (#9414). This significantly improves the
performance of those cases, and does not hurt serial insertion/deletion
performance.

We have long wanted to use a set/hashtable to store a domain's arrays, but
previously we couldn't because using an associative array inside BaseDom and
BaseDist would result in a circular dependency. Now that chpl__hashtable has
been split out we can just use that. This creates a simple set wrapper over
hashtable. We do this instead of using the standard library set to have more
control and to avoid bringing in all of set code to hopefully limit the impact
on compilation speed.

This improves startup/teardown for a single node forall loop using
array-of-arrays task intents. There's a 75x speedup for ri-a2-AoA.chpl,
which is a reproducer for #11333. This also improves deinit for a block-dist
array-of-arrays by 500x at 512 nodes (improving non-kernel scalability
for isx -- #9414). Additionally, this will allow us to parallelize array
initialization for all types and array deinitialization in a future PR.

Some technical notes:
 - I had to comment out some `key` printing in chpl__hashtable to avoid
   dynamic dispatch trying to stamp out read/writeThis methods for all
   array types. This resulted in trying to print arrays of sync vars, and
   we don't support writing syncs.
 - Removing an element that isn't in the set is a no-op. This is
   required for optimizations/bulkcomm/bharshbarg/arrayViews, but I made
   inserting an existing element an assertion failure.
 - rootLocaleInitialized is now used earlier, so I moved it to
   ChapelBase.

Resolves #9414
Resolves #11333
Resolves Cray/chapel-private#1036
Closes Cray/chapel-private#1063
@novoselrok
Author

@ronawho Even though I'm not using Chapel day-to-day anymore, I really appreciate the effort. The timing numbers you posted seem to be what I was looking for (as far as I can remember).
