Performance issues when running n-body simulation #11333
I'm noting that you're not using the `--fast` flag, which can have a significant impact on Chapel performance (it turns off a number of execution-time sanity checks that can add a lot of overhead). While I'm trying to reproduce this, I'd be curious whether that helps you at all.
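For reference, adding the flag to the compile line from the reproduction steps below would look like:

```
$ chpl --fast -o main main.chpl
```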
Unfortunately, I still have the same issue. I also ran it on a separate computer (a MacBook Pro) and the issue persists.
Thanks for the quick check on that. I have not verified anything about the OpenMP version... Note that the time required to enter our forall loop with the reduce intent should be somewhat comparable to OpenMP's allocation and initialization of the per-task forces arrays (since that's where we'd create our per-task copies), but I think you said on chat that you don't see any similar hiccup anywhere in OpenMP. Here's a simple reproducer that doesn't require an input file:

```chapel
var n_bodies = 10000;
const pDomain = {0..#n_bodies};
type vec3 = [0..#3] real;
var forces: [pDomain] vec3;
var velocities: [pDomain] vec3;
var positions: [pDomain] vec3;
var masses: [pDomain] real;
writeln("Before forall");
forall q in pDomain with (+ reduce forces) {
if (q == 0) then
writeln("In forall");
}
```

It looks to me like something about the reduce intent is bogging things down, as removing the `+ reduce forces` clause makes the delay go away.
On my Mac (OS X 10.11) with chapel-1.16 (sorry, an old version...), changing the `vec3` line from an array type to a tuple type eliminates the delay. Also, in the original n-body code, making the similar change to tuples gives a dramatic speedup. Using out1000.txt as input.txt, I get these results on my Mac (4-core):

```
$ chpl --fast nbody.chpl
real 0m4.090s

$ chpl --fast nbody_tup.chpl
real 0m0.045s
```
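Concretely, the change is to the `vec3` declaration from the reproducer above; the tuple form here mirrors the one used later in this thread:

```chapel
type vec3 = [0..#3] real;  // before: array-of-arrays element type (slow)
// type vec3 = 3*real;     // after: fixed-size tuple element type (fast)
```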
@ty1027 beat me to it! I think you've stumbled across some unknown performance issue with arrays-of-arrays, so I think using tuples in this case would be better. Edit: For other Chapel devs: offline we hypothesized two causes for the poor performance; see the discussion of #9414 and #10911 in the next comment.
Though @benharsh refers to this as an unknown performance issue, conferring with him, we're guessing that the underlying problem relates to some known existing issues (like #9414, in which a bunch of arrays with a shared const domain bash on it needlessly as they're created and destroyed; and #10911, which is my preferred way to address this specific case and other related problems). It's interesting to note that, with my simple reproducer, changing from an array element type to a tuple makes the hiccup disappear. This suggests to me, @vasslitvinov, that in the revised UDRI proposal, we should consider whether we can make the init() routine take the element in as an argument.
I want to assert, for those following this issue, that I believe it's our goal to have small const-sized arrays like this achieve performance that's nearly comparable to tuples. Until/unless we support some means of getting in-place allocation for such arrays, they'll probably always lag somewhat due to spatial locality issues / pointer dereferences, but I think we ought to be able to minimize the gap significantly compared to where it is today.
I guess that people coming from C/C++ are very likely to write code with arrays of arrays, like the code above. Another concern is that people coming from Fortran (like me!) would probably write a 2-D array version like the following:

```chapel
config const filename = "input.txt";
config const iterations = 5;
config const out_filename = "out.txt";
const X = 1;
const Y = 2;
const Z = 3;
const G = 6.67e-11;
config const dt = 0.1;
// Read input file, initialize bodies
var f = open(filename, iomode.r);
var reader = f.reader();
var n_bodies = reader.read(int);
const pRange = 1..n_bodies;
var forces, velocities, positions : [pRange, 1..3] real;
var masses : [pRange] real;
for i in pRange {
positions[i, ..] = reader.read(real, real, real);
velocities[i, ..] = reader.read(real, real, real);
masses[i] = reader.read(real);
}
f.close();
reader.close();
for i in 0..#iterations {
// Reset forces
forces = 0.0;
writeln( "iteration = ", i );
forall q in pRange with (+ reduce forces) {
for k in pRange {
if k <= q then continue;
var diff = positions[q, ..] - positions[k, ..];
var dist = sqrt(diff[X]**2 + diff[Y]**2 + diff[Z]**2);
var dist_cubed = dist**3;
var tmp = -G * masses[q] * masses[k] / dist_cubed;
var force_qk = tmp * diff;
forces[q, ..] += force_qk;
forces[k, ..] -= force_qk;
}
}
forall q in pRange {
positions[q, ..] += dt * velocities[q, ..];
velocities[q, ..] += dt / masses[q] * forces[q, ..];
}
}
var outf = open(out_filename, iomode.cw);
var writer = outf.writer();
for q in pRange {
writer.writeln("%er %er %er %er %er %er".format(
positions[q, X], positions[q, Y], positions[q, Z],
velocities[q, X], velocities[q, Y], velocities[q, Z]));
}
writer.close();
outf.close();
```

But this gives (in the same environment with chapel 1.16 (<-- an old version) as above):

```
real 0m3.845s
```

so the performance is similar to the code with arrays of arrays. I guess the slices (e.g. `positions[q, ..]`) create array temporaries, which runs into the same overhead.
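One way to avoid those per-iteration slice temporaries is to work on per-component scalars instead. A sketch of a drop-in replacement for the body of the inner `for k` loop above (untested against the original data files; it reuses `q`, `k`, `X`/`Y`/`Z`, `G`, `positions`, `masses`, and `forces` from the surrounding code):

```chapel
// per-component scalars: no sliced-array temporaries are created
const dx = positions[q, X] - positions[k, X],
      dy = positions[q, Y] - positions[k, Y],
      dz = positions[q, Z] - positions[k, Z];
const dist = sqrt(dx**2 + dy**2 + dz**2);
const tmp = -G * masses[q] * masses[k] / dist**3;
forces[q, X] += tmp * dx;  forces[k, X] -= tmp * dx;
forces[q, Y] += tmp * dy;  forces[k, Y] -= tmp * dy;
forces[q, Z] += tmp * dz;  forces[k, Z] -= tmp * dz;
```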
Including changes from @ty1027, I can confirm that there is no longer an issue. The time is now comparable to the OpenMP implementation for 100,000 bodies and 1 iteration (6s OpenMP, 10s Chapel).
I'm having the same issue when trying to implement the k-means algorithm in Chapel. Here is my implementation using reductions: https://github.com/novoselrok/parallel-algorithms/blob/eaf3efaa6307837b3d66e55c275ff006ca9836f7/kmeans/chapel/main.chpl

Compilation: `chpl main.chpl --fast`. I commented out the proposed solution, where I use tuples instead of an array:

```chapel
type Point = [0..#cols] real;
// type Point = 10 * real;
```

The non-tuple version takes 120s, the tuple version 0.8s for 100,000 points. But tuples are not really useful for this problem, since I do not know the point dimensions in advance. Is there any other way to work around this problem?
With the array version, the following line allocates a new array to store a copy of `data[i]`:

```chapel
var point = data[i];
```

I bet switching that line to a reference, to avoid the array copies, will speed your program up significantly:

```chapel
ref point = data[i];
```
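To make the distinction concrete, here is a standalone sketch (not taken from the k-means code itself) showing that a `ref` aliases the inner array while a `var` copies it:

```chapel
var data: [0..#4] [0..#3] real;

var copyOfRow = data[1];  // allocates a new 3-element array and copies into it
ref rowAlias  = data[1];  // no allocation; just another name for data[1]

rowAlias[0] = 42.0;
writeln(data[1][0]);      // prints 42.0: the ref aliases the original
writeln(copyOfRow[0]);    // prints 0.0: the copy is unaffected
```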
@ben-albrecht You are correct, this did speed it up significantly for the 100,000-point test case (down to 1s). To test it out, I generated another test case with 256,000 points, 16 columns, and 100 clusters.
I'm getting ~52 seconds for Chapel (1.19 pre-release). For reference, here are my reproducer steps:

```
# kmeans/data/
$ python3 gen.py 256000 16 100 test_R256k_C16_k100
# kmeans/chapel/
$ chpl main.chpl --fast
$ ./main --filename=../data/test_R256k_C16_k100.txt --rows=256000 --cols=16 --k=100
```

Another improvement I noticed is to reduce the number of array allocations by hoisting `dists` out of the loop and copying it in with an `in` intent:

```chapel
var dists: [clusters.domain] real;
forall i in pDomain with (+ reduce new_clusters, in dists) {
ref point = data[i];
dists = [cluster in clusters] cluster.dist(point);
...
}
```

That shaved 5 seconds off for me - now ~47 seconds.
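The `in` intent gives each task its own copy of `dists`, created once per task, so the loop body can reuse it instead of allocating a fresh array on every iteration. A minimal standalone sketch of that pattern:

```chapel
var scratch: [0..#100] real;          // allocated once, outside the loop
forall i in 0..#1000000 with (in scratch) {
  // each task works on its private copy of scratch; no per-iteration
  // array allocation happens inside the loop body
  scratch = i:real;
}
```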
When trying to optimize an algorithm against a reference version, it is usually easier to try to match the serial performance before introducing parallelism. I switched the `forall` loops to serial `for` loops for that comparison.

@novoselrok - From here, I'd suggest taking a look at the source code of sklearn's k-means implementation.
I think this expectation is reasonable, assuming the serial Chapel implementation matches the serial Python implementation, and the algorithm / workload / hardware can take advantage of parallelism. It's generally easy to write a Chapel implementation that outperforms a Python implementation. However, as you pointed out, we're comparing a Chapel implementation to a Python implementation that calls C for the computationally expensive portions of the program. Chapel's comparisons to C are closer to parity, so I think it will take some effort, e.g. using the same optimizations as the sklearn implementation, in order to compete on performance.
Thank you for the detailed response! I agree that it's not a completely valid comparison between the Chapel and Python/C versions, since the underlying algorithms are quite different (the Python one has plenty of optimizations and heuristics). Perhaps it would be better to write the C version myself (serial, plus OpenMP for parallelization) with the same algorithm as in the Chapel version and then compare the two. Still, I'm not expecting the handwritten C performance to degrade that much compared to Python/C. At the very least, I can use the Chapel version as an example of how succinctly the k-means algorithm can be written 😃
@ben-albrecht: Maybe I've lost the thread, but I thought we'd already identified the performance issue (here and here) as being due to needless domain-array ownership management, due to not optimizing for const domains as well as we should (i.e., issues #10911 and #9414). For those not up-to-date on the issue, the problem is that, in general, domains must track their arrays so that if the domain is resized, the array can be reallocated. For a const domain, this is not semantically required, but we don't optimize it away. So if I create a million arrays over a small const domain, I'll needlessly pay some (ridiculous) overhead managing that domain's ownership of those million arrays. E.g., here:

```chapel
var A: [1..1000000] [1..3] real;
```

Since each of the million inner arrays is declared over the same `{1..3}` domain, they all have to register and deregister with it as they are created and destroyed.
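Until that's optimized, the workaround from earlier in this thread applies here too: a tuple element type sidesteps the domain tracking entirely, since tuples aren't declared over a domain.

```chapel
var A: [1..1000000] [1..3] real;  // a million inner arrays registered with {1..3}
var B: [1..1000000] 3*real;       // tuples: no inner domain to manage
```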
@novoselrok posted about a k-means implementation where he was encountering performance issues as well, so I was suggesting some performance improvements unique to that code. That said, the const-domain optimization (or lack thereof) performance issue identified in the N-body implementation is definitely impacting the k-means code as well.
Oh, you're right that I overlooked this topic switch. It seems to me that this should've been forked into its own issue at that point.
Is there evidence that there are additional issues other than that one? Belatedly, @vasslitvinov's issue #11426 reminded me that we saw surprising performance overheads when using reduce intents.
Agreed, and it's still not too late!
Switching the main loop from this `forall` with a reduce intent:

```chapel
forall i in pDomain with (+ reduce new_clusters) {
ref point = data[i];
dists = [cluster in clusters] cluster.dist(point);
var (_, min_index) = minloc reduce zip(dists, dists.domain);
labels[i] = min_index;
new_clusters[min_index].add_point(point);
}
```

vs. this serial `for` loop:

```chapel
for i in pDomain {
ref point = data[i];
dists = [cluster in clusters] cluster.dist(point);
var (_, min_index) = minloc reduce zip(dists, dists.domain);
labels[i] = min_index;
new_clusters[min_index].add_point(point);
}
```

This was for the `test_R256k_C16_k100` dataset.
This could be related to the reduce intent overhead issue (#11426) you mentioned.
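For anyone reproducing comparisons like this, the standard Time module makes it easy to measure just the loop under test (a sketch; `Timer` is the name from the Chapel versions of this era, later renamed `stopwatch`):

```chapel
use Time;

var t: Timer;
t.start();
// ... loop under test ...
t.stop();
writeln("elapsed: ", t.elapsed(), " s");
```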
Avoid creating an array in chpl__sumType()

chpl__sumType() computes the type of `x+x` given the type of `x`. To do so, it declares the variable `x`, computes `x+x`, and takes its type. All this happens strictly at compile time, which is desired, as long as the type is not a "runtime type", i.e. an array or a domain.

This is time-consuming in the case where `x` is an array. Even though chpl__sumType() is smart enough to run `x+x` only on a variable of the array's element type, and if that itself is an array, then only on its element type recursively, allocating an array is already expensive, especially when:

* the array is large,
* it is an array of arrays, and/or
* it is a distributed array.

This is one of the causes of the slowdown discussed in #11333 / #11426.

In most practical cases, the type of `x+x` is the same as the type of `x`. If so, we can save the hassle of allocating the array. This PR adds a check: "is the type of `x+x` the same as the type of `x`?" If so, chpl__sumType() skips declaring the variable `x`, instead returning its type immediately.

This check is "conservative", i.e. it has false negatives: it may say "no" in some cases where chpl__sumType() could return immediately, in which case chpl__sumType() will do the unnecessary work of allocating `x`. This is largely only a performance concern, because chpl__sumType() still produces a correct result. In practice this situation should be rare.

While there, emit a user-friendly error when + reducing something that does not have a valid + implementation.

r: @mppf
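A conceptual sketch of the fast path being described (hypothetical names; not the actual internal code):

```chapel
// Sketch: if the type of x+x is statically known to equal the type of x,
// return it without ever materializing a value of the (possibly
// expensive) array type.
proc sumTypeSketch(type t) type {
  if isNumericType(t) then
    return t;            // numeric types are closed under +: skip the var
  var x: t;              // fallback: declare a value (costly for arrays)
  return (x + x).type;   // and take the type of x+x
}

writeln(sumTypeSketch(int):string);  // prints: int(64)
```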
FYI, I posted some findings related to n-body performance here.
#15895 will dramatically reduce the teardown time for the array-of-arrays cases, which I think will resolve this. For Brad's smaller reproducer in #11333 (comment) I see a ~700x speedup.
For the original nbody code, I see similar improvements.
@novoselrok I don't have a sense of the expected performance here, so if you get a chance I'd love some more official comparisons.
Switch the domain's list of arrays to a hashtable

[codeveloped with @mppf]

Domains keep track of the arrays declared over them so that when a domain is resized it can resize all the arrays declared over it. Previously, we used a simple LinkedList to store the arrays, which had performance issues because removal from a linked list is O(n). For the most part we got lucky and removal happened in reverse insertion order, so removal was effectively O(1). This was because we did serial init and deinit of arrays-of-arrays, but there are still cases where the insertion or removal was parallel, which killed performance. This includes creating arrays in a task-intent clause (#11333) and having a distributed array of arrays (#9414). This significantly improves the performance of those cases, and does not hurt serial insertion/deletion performance.

We have long wanted to use a set/hashtable to store a domain's arrays, but previously we couldn't because using an associative array inside BaseDom and BaseDist would result in a circular dependency. Now that chpl__hashtable has been split out, we can just use that. This creates a simple set wrapper over the hashtable. We do this instead of using the standard library set to have more control and to avoid bringing in all of the set code, to hopefully limit the impact on compilation speed.

This improves startup/teardown for a single-node forall loop using array-of-arrays task intents. There's a 75x speedup for ri-a2-AoA.chpl, which is a reproducer for #11333. This also improves deinit for a block-distributed array of arrays by 500x at 512 nodes (improves non-kernel scalability for isx -- #9414). Additionally, this will allow us to parallelize array initialization for all types, and array deinitialization, in a future PR.

Some technical notes:
- I had to comment out some `key` printing in chpl__hashtable to avoid dynamic dispatch trying to stamp out read/writeThis methods for all array types. This resulted in trying to print arrays of sync, and we don't support writing syncs.
- Removing an element that isn't in the set is a no-op. This is required for optimizations/bulkcomm/bharshbarg/arrayViews, but I made inserting an existing element an assertion failure.
- rootLocaleInitialized is now used earlier, so I moved it to ChapelBase.

Resolves #9414
Resolves #11333
Resolves Cray/chapel-private#1036
Closes Cray/chapel-private#1063
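A conceptual sketch of the data-structure change (hypothetical names; the real BaseDom internals differ). The point is that a set gives O(1) expected insert/remove regardless of order, where a linked list degrades to O(n) removal when arrays are destroyed in parallel, in arbitrary order:

```chapel
// Sketch: a domain tracking its registered arrays by ID.
class DomainSketch {
  var arrIds: domain(int);   // associative domain as a stand-in for a set

  proc register(id: int) {
    arrIds += id;            // O(1) expected insert
  }
  proc deregister(id: int) {
    arrIds -= id;            // O(1) expected remove, any order
  }
}

var d = new DomainSketch();
d.register(1); d.register(2);
d.deregister(1);             // no O(n) list scan needed
```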
@ronawho Even though I'm not using Chapel day-to-day anymore, I really appreciate the effort. The timing numbers you posted seem to be what I was looking for (as far as I can remember).
Summary of Problem
Recently I asked for some help on StackOverflow on implementing an n-body simulation. Brad was helpful enough to provide a full-featured shared-memory implementation of the simulation. But after compiling and running the implementation, I have run into some performance issues with large numbers of bodies (10,000 and more). The Chapel program seems to get 'stuck' (for some amount of time; it eventually continues) at the

```chapel
forall q in pDomain with (+ reduce forces)
```

loop. You can verify it by observing the `writeln` output. Meanwhile, my OpenMP implementation does not run into any problems.

Steps to Reproduce
Source Code:
Here are some data files you can use for testing: link
Compile command:

```
chpl -O -o main main.chpl
```

Execution command:

```
./main --filename=../data/test1e4.txt --iterations=1
```
Configuration Information

`chpl --version`:

```
chpl version 1.18.0
```

`$CHPL_HOME/util/printchplenv --anonymize`:

`gcc --version` or `clang --version`:

```
gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
```