
Openmp #315

Merged: 25 commits from openmp into master, Oct 2, 2014

Conversation

@yger (Contributor) commented Aug 18, 2014

The OpenMP branch is now up and running. In the default case it does not depend on OpenMP or insert any #pragma into the code. The number of threads used in the simulation is set via brian_prefs.codegen.cpp_standalone.openmp_threads. By default this number is 0; if it is positive, OpenMP is used in the templates and the corresponding number of threads is recruited. On benchmarks the results are very encouraging, and the speedups are fairly robust; the benchmark code is located in dev/benchmarks/openmp. All the C++ standalone templates are now OpenMP-compatible. Basically, all object creation is done without parallelism, so no changes were needed there, but every object inserted into the run() loop needs to be OpenMP-compatible. Another major change, done with Marcel, is that network.cpp and synapses.h (renamed to synapses_classes.cpp) are now templates and no longer live in brianlib. This was needed to have a fully template-based solution.
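
For concreteness, a minimal usage sketch (the preference name is taken from the description above; the exact import is an assumption on my side):

```python
# Minimal sketch, assuming brian_prefs can be imported from the top-level
# package; only the preference name below is taken from this PR.
from brian2 import brian_prefs

# 0 (the default): plain serial C++, no #pragma is inserted anywhere.
# Any positive value: OpenMP pragmas are emitted and that many threads
# are recruited at run time.
brian_prefs.codegen.cpp_standalone.openmp_threads = 4
```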

yger and others added 13 commits July 8, 2014 17:08
…uch that we can use a function (openmp_pragma) that will handle the insertion of pragma statements into the code. The number of threads can now be given as a Brian preference. By default (0), nothing is done and the code does not rely on OpenMP at all; otherwise, if a positive number is given, that number of threads is used.
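
For illustration, a rough sketch of what such a helper could look like (the name openmp_pragma comes from the commit message; the signature and the set of directives are assumptions):

```python
def openmp_pragma(directive, num_threads):
    """Return an OpenMP pragma string, or an empty string for serial builds.

    Hypothetical sketch: templates call a helper like this instead of
    hard-coding pragmas, so with openmp_threads == 0 the generated C++
    contains no trace of OpenMP at all.
    """
    if num_threads <= 0:
        return ''  # serial build: emit nothing
    pragmas = {
        'parallel': '#pragma omp parallel num_threads(%d)' % num_threads,
        'static': '#pragma omp for schedule(static)',
        'single': '#pragma omp single',
        'barrier': '#pragma omp barrier',
    }
    return pragmas[directive]  # unknown directives raise a KeyError
```
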
…port system; needs some tests to check that the OpenMP support is fine, and then I'll push it.

Conflicts:
	brian2/devices/cpp_standalone/brianlib/network.cpp
	brian2/devices/cpp_standalone/brianlib/network.h
	brian2/devices/cpp_standalone/device.py
	brian2/devices/cpp_standalone/templates/main.cpp
	brian2/devices/cpp_standalone/templates/ratemonitor.cpp
…ging, and change the default value of the openmp preference to 0. Everything seems to run smoothly; the speedup depends on the simulation, so more intensive benchmarks should be performed.
…d some more pragma statements in the main.cpp files, where objects are created from the device.py file. Now I'll start running some benchmarks to check when the optimizations are useful.
…tion. Benchmarks are on their way, but the results seem to vary a lot depending on computer architecture, network, etc. Some of the overhead due to OpenMP thread creation could perhaps be reduced by merging the threshold and stateupdate loops; this needs checking.
…nd thanks to the ordered flag. This seems to lead to a large speed improvement, because the threshold operation is one of the major bottlenecks according to some profiling.
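
To picture the trick: threshold tests can run in parallel while spikes are still appended in index order, so the result matches a serial run. A hand-written sketch of the kind of C++ this produces (not the actual Brian template), embedded as a Python template string:

```python
threshold_code = """
#pragma omp parallel for schedule(static) ordered
for (int i = 0; i < N; i++)
{
    const bool spiked = v[i] > v_threshold;
    #pragma omp ordered
    {
        // the ordered region runs in iteration order across threads,
        // so the spike container ends up identical to the serial result
        if (spiked)
            spikes.push_back(i);
    }
}
"""
```
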
… Not sure this has any effect on speed, but at least it makes it easier to understand which calls executed by a single thread need to be synchronized and which do not. Note that updates are always synchronized because of the barrier in the network.cpp file.
… in the code. Also add the static-ordered trick in the peek() function of the SynapticPathway, to share the load safely among threads. Even though it is not parallel, this still seems to be a bit faster than one thread running everything. Did some profiling: performance always improves with OpenMP, as long as the defaultclock is not too small. Sadly, for code using a small dt (< 0.1 ms), the overhead due to the loops and the synchronization slows everything down compared to running without OpenMP.
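
A sketch of the static-ordered idea (hypothetical; the real SynapticPathway::peek is not shown in this thread): a static schedule pins a fixed slice of the work to each thread, and the ordered region serialises the appends, which is consistent with the observation above that the loop is not really parallel yet still beats a single thread doing everything.

```python
peek_code = """
// (assumed to run inside an enclosing #pragma omp parallel region)
#pragma omp for schedule(static) ordered
for (int thread = 0; thread < num_threads; thread++)
{
    #pragma omp ordered
    {
        // each thread appends its pre-assigned slice of pending spikes in
        // turn; collect_slice() is a hypothetical helper for this sketch
        collect_slice(thread);
    }
}
"""
```
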
…recorded and displayed, thanks to code insertion in the C++ templates. To do so, I added one #include in the main.cpp templates. Note that these benchmarks show that for some deterministic examples, operations are not always performed in the same order from run to run, so the results can be bistable.
@yger (Contributor, Author) commented Aug 28, 2014

To make the point that the OpenMP implementation speeds up the code, here are 4 benchmarks, which can be found in dev/benchmarks/openmp/test_openmp_**.py. They are 4 different networks covering various objects, launched without OpenMP and then with 1, 2, 4 and 6 threads. The simulation results are superimposed, and the bottom right of each plot shows the exact simulation time (extracted thanks to the insert_code_device() function). As you can see, OpenMP with more than 1 thread always speeds up the simulation :-) The results are also always the same; for simulations 3 and 4, the discrepancies are due to the fact that the network has a bistable behaviour, as shown to Marcel, because of the order of the pre/post operations.

[Benchmark figures: cuba_openmp, stdp_openmp, net1_openmp, net2_openmp]

…even for arbitrary orderings (same dt, when and order attributes).
@mstimberg (Member)

Nice! I cherry-picked the commit that makes the order of objects deterministic (by falling back to the name) -- I think we all agree that even if the ordering is arbitrary, it should be deterministic. But in the long run, for this specific case, I think pre codes should be executed before post codes (IIRC, this was the case in Brian 1).
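
The fix can be pictured as a sort that uses the name as the final tie-breaker (an illustrative sketch; the attribute names follow the discussion in this thread):

```python
def sort_for_schedule(objects):
    # when/order determine the schedule; the unique name breaks any
    # remaining tie, so the order is deterministic from run to run
    # (though still arbitrary in the sense discussed above)
    return sorted(objects, key=lambda obj: (obj.when, obj.order, obj.name))
```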

@mstimberg (Member)

Oh, and I think from my side, the only thing missing in this branch is the documentation for the OpenMP features -- docs_sphinx/user/devices.rst would be the obvious place to put it, I guess.

@thesamovar: if you are happy with the changes, feel free to merge it.

@thesamovar (Member)

Great stuff!

I haven't looked at the code yet but do we have a deterministic test that shows that the output is exactly the same? I think we need this before we can merge.

I'll take a look at this in the next couple of days.

@yger (Contributor, Author) commented Aug 28, 2014

As said, I have 4 test cases of real and pretty complex networks showing that the results are always the same. Among those 4, 2 are deterministic networks, and the results are the same regardless of the number of threads (modulo Marcel's fix for the pre/post order of the synaptic updates).

Best

Pierre


@thesamovar (Member)

What I mean is that we need tests in the test suite that show identical behaviour. Afaict there is nothing in the test suite at the moment?

I've had a look at the code now and run the examples and I have to say I'm really impressed. On my machine I am actually getting some superlinear speedups. I am getting numbers that are a little slower for 0/1 thread, but faster for 6 threads. I'm on a quad core with hyperthreading (8 virtual cores). Specifically an i7 3612QM @ 2.1 GHz.

If you have time, I wonder if you could perhaps write some notes on the parallelisation strategies and why you chose them / how they work?

I'm happy for this to be merged once we have some docs and tests in the suite.


@yger (Contributor, Author) commented Aug 28, 2014

Good to hear!

I would be more than happy to write some documentation about the optimizations I performed, and also about how to use them. Regarding the tests needed, I'll ask Marcel, as always, for advice, and I could write some. The catch now is simply that any new template added to the standalone module will need to be OpenMP-compatible. If it isn't, the default behaviour without OpenMP will be fine, but as soon as multithreading is turned on, the device has to handle it properly, otherwise simulations may crash or produce false results. This is a developer guideline that should be explained somewhere.

Best

Pierre

@thesamovar (Member)

We could perhaps have something so that if a template has the string I_AM_OPENMP_COMPATIBLE in it then it is treated as such, otherwise we wrap the whole thing with a #pragma openmp single (or whatever the correct syntax is)?
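
(For the record, the directive is #pragma omp single: exactly one thread of the enclosing parallel region executes the block, with an implicit barrier at its end.) A sketch of the wrapping idea as proposed, using the marker string from the comment above; the branch ultimately went with a flag check plus an error instead (see below):

```python
def wrap_template(code):
    # templates carrying the marker are trusted as-is; everything else is
    # restricted to a single thread of the surrounding parallel region
    if 'I_AM_OPENMP_COMPATIBLE' in code:
        return code
    return '#pragma omp single\n{\n%s\n}' % code
```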

@mstimberg (Member)

> We could perhaps have something so that if a template has the string I_AM_OPENMP_COMPATIBLE in it then it is treated as such, otherwise we wrap the whole thing with a #pragma openmp single (or whatever the correct syntax is)?

I discussed a similar thing with Pierre previously; I think it is a good idea to have templates explicitly mention that they know about OpenMP (maybe OPENMP_AWARE?). I'm not sure how trivial the wrapping would be. Personally, I would not mind if the user simply got an error message: "your code is using template XYZ, which is not compatible with OpenMP, switching off multithreading".

Pierre Yger and others added 5 commits September 25, 2014 14:17
…m a developer's point of view. Note that the part on the OPENMP_AWARE flag is not written, because this still has to be done. Marcel said that Dan may have time to implement it.
…-threading. A decently complex network covers almost all the templates, and the simulation results are carefully compared without OpenMP and with OpenMP using 1 or 2 threads.
Conflicts:
	brian2/devices/cpp_standalone/device.py
	brian2/devices/cpp_standalone/templates/spikemonitor.cpp
	brian2/tests/test_cpp_standalone.py
Marcel Stimberg added 2 commits October 1, 2014 23:33
… Mark all the current standalone templates as compatible.
…at it does not raise an error when a target is not available
@mstimberg (Member)

I merged this branch with the cython_codegen2 branch (which is, I presume, the branch that will be merged into master next). I adapted the new correctness test to use assert_allclose instead of the strict equality test, since we can get small numerical differences (e.g. because of a different order of additions). I also added a new template comment {# IS_OPENMP_COMPATIBLE #} to all standalone templates (they are all OpenMP-compatible, right?) and made CppStandaloneDevice.build raise an error if one of the used templates does not have this flag while a number of threads > 0 has been requested. Maybe we should instead just raise a warning and set the number of threads to 0?

The tests pass, except for the Python 3 timeout caused by Cython (see #326).
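
A rough sketch of the build-time check described above (the actual logic in CppStandaloneDevice.build may differ in its details):

```python
def check_openmp_compatibility(used_templates, openmp_threads):
    # used_templates: hypothetical mapping of template name -> template source
    if openmp_threads <= 0:
        return  # serial build: nothing to check
    for name, source in used_templates.items():
        if '{# IS_OPENMP_COMPATIBLE #}' not in source:
            raise RuntimeError("Template '%s' is not marked as OpenMP-"
                               "compatible, but openmp_threads > 0 was "
                               "requested" % name)
```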

@thesamovar (Member)

Great, in that case I'm happy to merge this after we merge Cython. I think an error is probably better than a warning in this case, since they're explicitly trying to do something that doesn't work.

Conflicts:
	brian2/tests/__init__.py
@mstimberg (Member)

> I think an error is probably better than a warning in this case, since they're explicitly trying to do something that doesn't work.

I don't really mind, but since the number of threads is a preference, I could imagine a situation where you have it set to, say, 4 in your global preference file, and then it's annoying to have to set it to 0 manually whenever you run something that does not support multithreading. On the other hand, we don't have anything in standalone that does not support multithreading, anyway...

@thesamovar (Member)

Yeah it's pretty marginal one way or the other, indeed. OK I'm going to have another look at this now. From your point of view is it good to merge now?

@mstimberg (Member)

> From your point of view is it good to merge now?

Yes, it is.

@thesamovar (Member)

I modified the OpenMP test to check that the standalone results are the same as the runtime results (as well as checking that standalone with threads was the same as without). The tests all pass on Windows and I'm happy with the code so as soon as Travis reports back that the tests pass I'll merge this. I'll write some documentation for this and Cython in the docs branch (I think a new 'ways to run Brian' page might be a good idea).
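
In outline, the comparison strategy looks like this (illustrative only; the real test lives in brian2/tests/test_cpp_standalone.py and differs in its details):

```python
from numpy.testing import assert_allclose

def compare_traces(runtime_trace, standalone_trace, threaded_trace):
    # strict equality would be too brittle: a different number of threads
    # can change the order of floating-point additions, hence the
    # tolerance-based comparison (cf. the assert_allclose change above)
    assert_allclose(runtime_trace, standalone_trace)
    assert_allclose(standalone_trace, threaded_trace)
```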

@thesamovar (Member)

OK it passes, merging now!

@thesamovar thesamovar merged commit f9475df into master Oct 2, 2014
@thesamovar thesamovar deleted the openmp branch October 2, 2014 19:16
@yger (Contributor, Author) commented Oct 2, 2014

Great! I'll have a look tomorrow. I wrote some doc for the openmp branch, but more will never hurt...

Best

Pierre

@thesamovar (Member)

Hey @yger, I'm looking through your documentation for the openmp branch now - great stuff! I'm going to work on it for a little while in the docs_improvements branch. So maybe some time this weekend or next week you could take a look and add anything that isn't finished?
