Openmp #315
Conversation
…apticPathway etc.
…uch that we can use a function (openmp_pragma) that will handle the insertion of pragma statements in the code. The number of threads can now be given as a Brian preference. By default (0), nothing is done and the code does not rely on OpenMP at all; if a positive number is given, that number of threads is used
… master before launching a pull request
…d into templates now
…port system, needs some tests to check that openmp support is fine, and I'll push it
Conflicts:
	brian2/devices/cpp_standalone/brianlib/network.cpp
	brian2/devices/cpp_standalone/brianlib/network.h
	brian2/devices/cpp_standalone/device.py
	brian2/devices/cpp_standalone/templates/main.cpp
	brian2/devices/cpp_standalone/templates/ratemonitor.cpp
…ging, and change the default value of the openmp preference to 0. Everything seems to run smoothly; the speedup depends on the simulation, so more intensive benchmarks should be performed
…d some more pragma statements in the main.cpp files, when objects are created from the device.py file. Now I'll start to do some benchmarks to check when the optimizations are useful
…tion. Benchmarks are on their way, but the results seem to vary a lot depending on computer architectures, networks, ... Quite some overhead due to OpenMP thread creation could maybe be reduced by merging the threshold and stateupdate loops; need to check
…nd thanks to the ordered flag. This seems to lead to a large speed improvement, because the threshold operation is one of the major bottlenecks according to some profiling.
… Not sure this has any effect on performance, but at least it makes it clearer which calls executed by a single thread need to be synchronized and which do not. Note that updates are always synchronized because of the barrier in the network.cpp file
… in the code. Also added the static-ordered trick in the peek() function of the SynapticPathway, to share the load safely among threads. Even if not fully parallel, this still seems to be a bit faster than one thread running everything. Did some profiling: performance always improves with OpenMP, as long as the defaultclock is not too small. Sadly, for code using a small dt (<0.1ms), the overhead due to loops and synchronization slows everything down compared to running without OpenMP
…recorded and displayed, thanks to code insertion in the C++ templates. To do so, I added one #include in the main.cpp template. Note that these benchmarks show that for some deterministic examples, operations are not always performed in the same order on a run-to-run basis, so results can be bistable
To make the point that the OpenMP implementation speeds up the code, here are 4 benchmarks that can be found in dev/benchmarks/openmp/test_openmp_**.py. These are 4 different networks covering various objects, launched without OpenMP and then with 1, 2, 4, 6 threads. Simulation results are superimposed, and the bottom right of the plots shows the exact simulation time (extracted thanks to the insert_code_device() function). As you can see, OpenMP with threads > 1 always speeds up the simulation :-) Results are also always the same, and for simulations 3/4, discrepancies are due to the fact that the network has bistable behavior, as shown to Marcel, because of the order of pre/post operations.
…even for arbitrary orderings (same dt, when and order attributes).
Nice! I cherry-picked the commit that makes the order of objects deterministic (by falling back to the name) -- I think we all agree that even if ordering is arbitrary, it should be deterministic. But in the long run, for this specific case, I think pre codes should be executed before post codes (IIRC, this was the case in Brian1)
Oh, and I think from my side, the only thing missing in this branch is the documentation for the OpenMP features -- @thesamovar: if you are happy with the changes, feel free to merge it.
Great stuff! I haven't looked at the code yet but do we have a deterministic test that shows that the output is exactly the same? I think we need this before we can merge. I'll take a look at this in the next couple of days.
As said, I have 4 test cases of real and pretty complex networks showing …

Best, Pierre
What I mean is that we need tests in the test suite that show identical behaviour. Afaict there is nothing in the test suite at the moment?

I've had a look at the code now and run the examples, and I have to say I'm really impressed. On my machine I am actually getting some superlinear speedups. I am getting numbers that are a little slower for 0/1 threads, but faster for 6 threads. I'm on a quad core with hyperthreading (8 virtual cores), specifically an i7 3612QM @ 2.1 GHz.

I wonder, if you have time, could you perhaps write some notes on the parallelisation strategies and why you chose them/how they work? I'm happy for this to be merged once we have some docs and tests in the suite.

On August 28, 2014 3:45:25 PM GMT+01:00, Pierre Yger notifications@github.com wrote:
Good to hear! I would be more than happy to write some documentation about the …

Best, Pierre
We could perhaps have something so that if a template has the string …
I discussed a similar thing with Pierre previously, I think it is a good idea to have templates explicitly mention that they know about OpenMP (maybe …
…m a developer point of view. Note that the part on the OPENMP_AWARE flag is not written, because this still has to be done. Marcel said that Dan may have the time to implement that
…-threading. A quite decently complex network covers almost all the templates, and simulation results are carefully compared without OpenMP and with OpenMP using 1 or 2 threads.
Conflicts:
	brian2/devices/cpp_standalone/device.py
	brian2/devices/cpp_standalone/templates/spikemonitor.cpp
	brian2/tests/test_cpp_standalone.py
…all numerical differences
… Mark all the current standalone templates as compatible.
…at it does not raise an error when a target is not available
I merged this branch with the … The tests pass, except for the Python 3 timeout because of Cython (see #326)
Great, in that case I'm happy to merge this after we merge Cython. I think an error is probably better than a warning in this case, since they're explicitly trying to do something that doesn't work.
Conflicts: brian2/tests/__init__.py
I don't really mind, but since the number of threads is a preference, I could imagine a situation where you have this set to, say, 4 in your global preference file, and then it's annoying to set it to 0 manually whenever you run something that does not support multithreading. On the other hand, we don't have anything that does not support multithreading in standalone, anyway...
Yeah, it's pretty marginal one way or the other, indeed. OK, I'm going to have another look at this now. From your point of view, is it good to merge now?
Yes, it is.
I modified the OpenMP test to check that the standalone results are the same as the runtime results (as well as checking that standalone with threads was the same as without). The tests all pass on Windows and I'm happy with the code, so as soon as Travis reports back that the tests pass I'll merge this. I'll write some documentation for this and Cython in the docs branch (I think a new 'ways to run Brian' page might be a good idea).
OK, it passes, merging now!
Great! I'll have a look tomorrow. I wrote some doc for the openmp branch …

Best, Pierre
Hey @yger, I'm looking through your documentation for the openmp branch now - great stuff! I'm going to work on it for a little while in the docs_improvements branch. So maybe some time this weekend or next week you could take a look and add anything that isn't finished? |
The OpenMP branch is now up and running, and in the default case it does not depend on OpenMP or insert any #pragma in the code. The number of threads used in the simulation has to be set via brian_prefs.codegen.cpp_standalone.openmp_threads. By default this number is 0, but if it is positive, OpenMP is used in the templates and the appropriate number of threads is recruited. Note that the benchmark results are very encouraging, and the speedups are pretty robust for the codes located in dev/benchmark/openmp.

All the cpp_standalone templates are now OpenMP compatible. Basically, all object creation is done without parallelism, so no changes were needed there, but all objects inserted within the run() loop need to be OpenMP-compatible. Major changes, done with Marcel, are also that network.cpp and synapses.h (renamed to synapses_classes.cpp) are now templates and no longer in brianlib. This was needed to have a fully template-based solution.