Writes to shared variables not correctly handled with OpenMP #551

Closed
mstimberg opened this issue Aug 28, 2015 · 8 comments

@mstimberg

Writes to shared variables are executed by every thread, so statements like s += 1 do not give the expected result. Example:

from brian2 import *
set_device('cpp_standalone')
prefs.devices.cpp_standalone.openmp_threads = 4
G = NeuronGroup(10, 's : 1 (shared)')
G.run_regularly('s += 1')
run(defaultclock.dt)
device.build(run=True, compile=True)
print(G.s)

This should print 1.0, but it prints 4.0 (or sometimes 3.0 if two threads update s at the same time).
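
For illustration, this is roughly what the generated code does at the moment (a simplified sketch, not the actual template output): the parallel region is opened around the whole simulation loop, so the scalar statement runs once per thread.

#pragma omp parallel num_threads(4)
{
    // scalar code generated from 's += 1' -- executed by ALL threads,
    // since the parallel region was opened further up in the network run
    s += 1;  // after one time step: s == 4 (or 3 if two increments race)
}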

This is not trivial to fix, and I wonder whether it requires a major change to the templates. The problem is that we can't simply wrap the scalar code in the template in a single or master pragma, because variables defined within such a block are not accessible outside of it, which would prevent the vector code from accessing any of the scalar variables (e.g. the lio constants). We'd therefore have to split the scalar_code block into a part that declares the variables and one that contains the statements. I'd prefer to avoid this, since it entails low-level changes to the code generation pipeline and would also break the GPU devices that are currently being worked on.

I think a better option would be to open the OpenMP parallel block not in the network run, but in each code object individually. This way, the scalar code could be executed serially before entering the parallel part. Another advantage would be that the IS_OPENMP_COMPATIBLE annotation would no longer be needed: if a template does not know about OpenMP, it does not open a parallel block and everything runs serially. It would be a change to all templates, but one that would not affect the GPU devices.

However, I think @yger did something like this when he started working on the OpenMP support, and IIRC it was too slow because of the OpenMP overhead. Maybe I should just try this out manually with an example...

@thesamovar

I think the easiest solution would be to wrap the shared code with:

if(omp_get_thread_num()==0)
{
    // scalar code
    ...
}
#pragma omp barrier

The if check will ensure that only one thread does it, and the barrier ensures that shared variables aren't used before they're defined.

@mstimberg

This is equivalent to using something like #pragma omp master or #pragma omp single. This was actually the first thing I tried, but as I mentioned above, we would have to split the scalar block into declarations and statements for this to work. Otherwise, it generates code like this:

#pragma omp single
{
    // scalar code
    const double _lio_const = exp(-0.01);
}

// vector code
... = ... _lio_const ... ;  // error: _lio_const was not declared in this scope
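
Concretely, the split would have to turn the block above into something like this (just a sketch; it also assumes that _lio_const is shared between the threads at this point, otherwise the single construct would additionally need a copyprivate(_lio_const) clause):

// declaration, hoisted out of the block
double _lio_const;

#pragma omp single
{
    // statement: executed by exactly one thread; the implicit barrier
    // at the end of 'single' makes the value visible to all threads
    _lio_const = exp(-0.01);
}

// vector code
... = ... _lio_const ... ;  // compiles now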

@thesamovar

Ah, I see what you mean. OK, another option would be to split the code object into two: one for the scalar variables (with the vector code left empty) and another with the vector code (and the scalar code left empty).

@mstimberg

> OK, another option would be to split the code object into two: one for the scalar variables (with the vector code left empty) and another with the vector code (and the scalar code left empty).

But the vector code in general needs the scalar code (e.g. the loop-invariant constants), so we can't simply leave it out for the vector code object. And separating the parts we need from those we don't might not be possible in the general case, since we could have a situation like this:

// scalar code
scalar_variable += 1;
const double _lio_const = exp(-scalar_variable);

for (...)
{
    // vector code
    v += _lio_const;
}

@mstimberg

But actually, I think the straightforward solution of moving the parallel pragma into the templates is the best one. I think we already create a thread pool when we call omp_set_num_threads(..) at the start, so opening and closing parallel blocks should not take much time. In the end, it might even mean less overhead, because we currently have a lot of code objects (thresholds, synaptic code that updates post-synaptic variables, ...) that use #pragma omp single.
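
In template terms, each code object would then look roughly like this (a hypothetical sketch with made-up variable names, not the actual generated code):

void _run_codeobject()
{
    // scalar code: runs serially, exactly once -- shared writes are safe
    s += 1;
    const double _lio_const = exp(-0.01);

    // only the vector loop opens a parallel region
    #pragma omp parallel for
    for(int _idx = 0; _idx < N; _idx++)
    {
        v[_idx] += _lio_const;
    }
}

A template that knows nothing about OpenMP would simply omit the pragma and run everything serially, which is what makes the IS_OPENMP_COMPATIBLE annotation unnecessary.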

@thesamovar

Agreed, if it's fast enough that's the best option!

@mstimberg

> Agreed, if it's fast enough that's the best option!

Ok, I'll give it a try and test it on a few examples.

@thesamovar

If that doesn't work, I can't think of anything else apart from splitting up the declarations and evaluations.
