Writes to shared variables not correctly handled with OpenMP #551

Closed
mstimberg opened this issue Aug 28, 2015 · 8 comments

@mstimberg

Writes to shared variables are executed by every thread, so statements like s += 1 do not give the expected result. Example:

from brian2 import *
set_device('cpp_standalone')
prefs.devices.cpp_standalone.openmp_threads = 4
G = NeuronGroup(10, 's : 1 (shared)')
G.run_regularly('s += 1')
run(defaultclock.dt)
device.build(run=True, compile=True)
print(G.s)

This should print 1.0, but it prints 4.0 (or sometimes 3.0 if two threads update s at the same time).
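
For illustration, this is roughly what the generated code does at the moment (a simplified sketch, not the actual template output): the parallel region is opened around the whole simulation loop, so the scalar statement runs once per thread.

#pragma omp parallel num_threads(4)
{
    // scalar code generated from 's += 1' -- executed by ALL threads,
    // since the parallel region was opened further up in the network run
    s += 1;  // after one time step: s == 4 (or 3 if two increments race)
}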

This is not trivial to fix, and I wonder whether it requires a major change to the templates. The problem is that we can't simply wrap the scalar code in the template in a single or master pragma, because variables defined within such a block are not accessible outside of it, which would prevent the vector code from accessing any of the scalar variables (e.g. the lio constants). We'd therefore have to split the scalar_code block into a part that declares the variables and one that contains the statements. I'd prefer to avoid this, since it entails low-level changes to the code generation pipeline and would also break the GPU devices that are currently being worked on.

I think a better option would be to open the OpenMP parallel block not in the network run, but in each code object individually. This way, the scalar code could be executed serially before entering the parallel part. Another advantage would be that the IS_OPENMP_COMPATIBLE annotation would no longer be needed: if a template does not know about OpenMP, it does not open a parallel block and everything runs serially. It would be a change to all templates, but one that would not affect the GPU devices.

However, I think @yger did something like this when he started working on the OpenMP support, and IIRC it was too slow because of the OpenMP overhead. Maybe I should just try this out manually with an example...

@thesamovar

I think the easiest solution would be to wrap the shared code with:

if(omp_get_thread_num()==0)
{
    // scalar code
    ...
}
#pragma omp barrier

The if check will ensure that only one thread does it, and the barrier ensures that shared variables aren't used before they're defined.

@mstimberg

This is equivalent to using something like #pragma omp master or #pragma omp single. This was actually the first thing I tried, but as I mentioned above, we would have to split the scalar block into declarations and statements for this to work. Otherwise, it generates code like this:

#pragma omp single
{
    // scalar code
    const double _lio_const = exp(-0.01);
}

// vector code
... = ... _lio_const ... ;  // error: _lio_const was not declared in this scope
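
Concretely, the split would have to turn the block above into something like this (just a sketch; it also assumes that _lio_const is shared between the threads at this point, otherwise the single construct would additionally need a copyprivate(_lio_const) clause):

// declaration, hoisted out of the block
double _lio_const;

#pragma omp single
{
    // statement: executed by exactly one thread; the implicit barrier
    // at the end of 'single' makes the value visible to all threads
    _lio_const = exp(-0.01);
}

// vector code
... = ... _lio_const ... ;  // compiles now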

@thesamovar

Ah, I see what you mean. OK, another option would be to split the code object into two: one for the scalar variables (with the vector code left empty) and another with the vector code (and the scalar code left empty).

@mstimberg

> OK, another option would be to split the code object into two: one for the scalar variables (with the vector code left empty) and another with the vector code (and the scalar code left empty).

But the vector code in general needs the scalar code (e.g. the loop-invariant constants), so we can't simply leave it out for the vector code object. And separating the parts we need from those we don't might not be possible in the general case, since we could have a situation like this:

// scalar code
scalar_variable += 1;
const double _lio_const = exp(-scalar_variable);

for (...)
{
    // vector code
    v += _lio_const;
}

@mstimberg

But actually, I think the straightforward solution of moving the parallel pragma into the templates is the best one. I think we already create a thread pool when we call omp_set_num_threads(..) at the start, so opening and closing parallel blocks should not take much time. In the end, it might even mean less overhead, because we currently have a lot of code objects (thresholds, synaptic code that updates post-synaptic variables, ...) that use #pragma omp single.
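
In template terms, each code object would then look roughly like this (a hypothetical sketch with made-up variable names, not the actual generated code):

void _run_codeobject()
{
    // scalar code: runs serially, exactly once -- shared writes are safe
    s += 1;
    const double _lio_const = exp(-0.01);

    // only the vector loop opens a parallel region
    #pragma omp parallel for
    for(int _idx = 0; _idx < N; _idx++)
    {
        v[_idx] += _lio_const;
    }
}

A template that knows nothing about OpenMP would simply omit the pragma and run everything serially, which is what makes the IS_OPENMP_COMPATIBLE annotation unnecessary.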

@thesamovar

Agreed, if it's fast enough that's the best option!

@mstimberg

> Agreed, if it's fast enough that's the best option!

Ok, I'll give it a try and test it on a few examples.

@thesamovar

If that doesn't work, I can't think of anything else apart from splitting up the declarations and evaluations.
