_trace_lambda_mul_add_assign_nonlocal insufficiently safe #175

kostrzewa opened this Issue · 10 comments

2 participants

European Twisted Mass Collaboration member

The omp atomic pragma is, in general, not sufficient to guarantee synchronization between threads updating a given memory address. For a large enough number of collisions this will produce incorrect results. (according to these lecture slides: )

The problem is that if multiple threads read from and write to the same memory location, this can lead to conflicts even with the atomic pragma, because of OpenMP's complex memory model.

I think this is why the clover runs are currently not reproducible (but the problem is, in principle, present in deriv_Sb as well).

I will have to attempt several modifications of _trace_lambda_mul_add_assign_nonlocal which may or may not lead to correctness.

European Twisted Mass Collaboration member

A comprehensive workaround would be a reworking of sw_all and deriv_Sb to use a "pull" rather than a "push" type approach.

European Twisted Mass Collaboration member

Okay, so I've tried defining a local variable "dtemp" in which the computation is done, but this does not seem to be sufficient:

#define _trace_lambda_mul_add_assign_nonlocal(r,c,a) \
dtemp = c*(-cimag((a).c10)-cimag((a).c01)); \
_Pragma("omp atomic") \
(r).d1 += dtemp; \
dtemp = c*(+creal((a).c10)-creal((a).c01)); \
_Pragma("omp atomic") \
(r).d2 += dtemp; \
dtemp = c*(-cimag((a).c00)+cimag((a).c11)); \
_Pragma("omp atomic") \
(r).d3 += dtemp; \
dtemp = c*(-cimag((a).c20)-cimag((a).c02)); \
_Pragma("omp atomic") \
(r).d4 += dtemp; \
dtemp = c*(+creal((a).c20)-creal((a).c02)); \
_Pragma("omp atomic") \
(r).d5 += dtemp; \
dtemp = c*(-cimag((a).c21)-cimag((a).c12)); \
_Pragma("omp atomic") \
(r).d6 += dtemp; \
dtemp = c*(+creal((a).c21)-creal((a).c12)); \
_Pragma("omp atomic") \
(r).d7 += dtemp; \
dtemp = c*((-cimag((a).c00)-cimag((a).c11) + 2.0 * cimag(a.c22))*0.577350269189625); \
_Pragma("omp atomic") \
(r).d8 += dtemp;

This seems to confirm the explanation in those lecture notes: the construct is only safe if the address in question is modified by one thread and read by multiple others, or written to (note: written, not updated!!) by multiple threads and read by one, but not updated by multiple threads and read by multiple, as in our case.

The next thing I tried is a critical construct (which is madness from an efficiency point of view, because we only ever update each d# individually, never mixing between the eight):

#define _trace_lambda_mul_add_assign_nonlocal(r,c,a) \
_Pragma("omp critical") \
{ \
(r).d1 += c*(-cimag((a).c10)-cimag((a).c01)); \
(r).d2 += c*(+creal((a).c10)-creal((a).c01)); \
(r).d3 += c*(-cimag((a).c00)+cimag((a).c11)); \
(r).d4 += c*(-cimag((a).c20)-cimag((a).c02)); \
(r).d5 += c*(+creal((a).c20)-creal((a).c02)); \
(r).d6 += c*(-cimag((a).c21)-cimag((a).c12)); \
(r).d7 += c*(+creal((a).c21)-creal((a).c12)); \
(r).d8 += c*((-cimag((a).c00)-cimag((a).c11) + 2.0 * cimag(a.c22))*0.577350269189625); \
}

The fact that I DON'T GET reproducible results from this actually calms me down: the effect we're seeing might come down to varying rounding errors only! I think that critical sections should be synchronisation-safe (*), but I need to read up some more.

(*) before and after the critical section all threads see a consistent picture of the memory which was touched

European Twisted Mass Collaboration member

The only way to know conclusively whether these things are working is to look for effects on the acceptance rate and expectation value with many threads (64) and much higher statistics.

I see a very clear effect on the expectation value from this difference (unsurprisingly), but it is within the statistical error. The acceptance rate is only mildly affected. However, this is a probabilistic effect (i.e., most reads and writes will be correct and free of synchronization issues), and it clearly increases with the number of threads.

Another thing that calms me down is that on BG/Q, without the nonlocal macro, there was NO acceptance because of conflicts in deriv_Sb. After the change I got acceptance, but I obviously don't know whether a sub-percent effect could be hiding.

Would someone be willing to run (at least) two high-statistics (4^4) runs of sample-hmc-cloverdet with standard settings and maybe 4e5 trajectories, to pin down the statistical error and see whether this is really a problem? This should be run with as many threads as possible (say, 64 threads each on 2 nodes, making conflicts between threads inevitable because the local volume will be 64 [eo]). My current estimate (by eye) of the acceptance rate is 96.60 +- 0.07 and the autocorrelation time of the plaquette is 2.14 +- 0.04.


You mean on a BG/Q? Because I wouldn't know where else I could do this...

European Twisted Mass Collaboration member

Oh, by the way, note that the "problem" (if it is one) definitely emerges from sw_all; serializing it gives reproducible clover trajectories.

Yes I would do this on BG/Q. The only problem is that it might not show up because maybe BG/Q actually implements the atomic pragma in a reasonable way with proper synchronisation.

If the acceptance rate, expectation value and other typical statistical indicators are all fine and the results from various runs are in agreement within errors, I think we can safely assume that we're seeing just rounding errors varying because the accumulations are done in varying orders, in which case I would say the clover term is fine.


Isn't there an even easier test: since this happens by chance, why not run 1000 trajectories twice? If we get binary-identical results, shouldn't that be enough?

European Twisted Mass Collaboration member

I know that we don't get binary identical results even after one trajectory. The question is whether it is a problem or not... I know that serializing sw_all gives binary identical results. As soon as threads are involved the trajectories diverge right away. It seems (with only 8 threads), however, that statistically the results are OK. The mild difference I see in acceptance rate, autocorrelation time and expectation value could just be a roundoff effect because the summation order will be affected by the locking and threading (at least that's my current explanation for the differences) but I can't be sure we're not actually introducing a possibly subtle bias into our simulations. The only way to really find out is to maximize the probability for conflicts and do a high statistics run. If the expectation values and other statistical indicators are still OK, we're good and it's just roundoff.


ah, sorry, I see... Okay, I'll try to do such a run at some point.

But honestly, if we have a doubt, and I am not completely convinced, we should perhaps rewrite the corresponding routines to avoid this problem altogether...

European Twisted Mass Collaboration member

I'm not convinced either, which is why I think the test is important. On the other hand, look at the NDCLOVER expectation value here: and the difference between the two OpenMP runs... (note, there is no longer any difference between half- and fullspinor [1] in this run; the difference comes from sw_all)

On the other hand, the difference can also be smaller, as in this run:

[1] (NOTE: the serial_hs run was interrupted and does not quite have full statistics. The serial runs are also split into three jobs at slightly different trajectory numbers between the two, resulting in differences in random numbers after the split; the initial history of ~40000 trajectories is exactly the same)

European Twisted Mass Collaboration member

I wrote an explicit test for this with an accumulation similar to our trace_lambda_mul_add_assign... So far I haven't seen a failure, but if it happens once in 90 billion calls of += then I would probably not see it in 30 minutes...
