
Consider not using _mm_mfence even when it is available #36

Closed
AlexGuteniev opened this issue Jun 9, 2020 · 9 comments

Comments

@AlexGuteniev

Consider either making BOOST_ATOMIC_NO_MFENCE override the detection of SSE2 or x64, or even defaulting to never using mfence.

The reason is that mfence is slower than a dummy interlocked operation.

On x64 there's even an intrinsic for that - see __faststorefence

There might be a reason to use mfence specifically: an interlocked operation does not fence non-temporal stores. But that is out of scope for implementing the C++ memory model on x86 in the usual way.

@Lastique
Member

Lastique commented Jun 9, 2020

If my reading of the C++ standard ([intro.races]/9) is correct, a release operation on an atomic should make all prior memory accesses visible to other threads after they execute an acquire operation on the atomic. The standard doesn't discriminate between regular or non-temporal stores, but it can be argued that it doesn't have to because both are "evaluations", in the standard's language. Put simply, non-temporal or not, the stores before the release fence should probably be flushed.

I know Boost.Atomic issues mfence for seq_cst, not just release. But this probably means I may have to upgrade the release fence to sfence.

As for __faststorefence, I can't comment on that as I don't know which instructions it translates to. There may be a place for a lighter fence in Boost.Atomic as an extension, if it is generic enough and could be useful beyond just x86.

@Lastique
Member

Lastique commented Jun 9, 2020

OTOH, all compilers seem to generate mfence for seq_cst and no instructions for other cases, see: https://gcc.godbolt.org/z/P6gF4g.

@AlexGuteniev
Author

__faststorefence means lock or [rsp], 0, i.e. a dummy interlocked operation.

The particular choice of interlocked operation is a matter of micro-optimization: using as few registers as possible (including the result register), clobbering as few flags as possible, avoiding a stack variable, and so on. This one specifically avoids a stack variable. It can be approximately simulated by _InterlockedOr(reinterpret_cast<long volatile *>(_AddressOfReturnAddress()), 0)

So it is implementable only on x86 (32-bit and 64-bit), and actually exists as an intrinsic only on 64-bit x86.

> I know Boost.Atomic is issuing mfence for seq_cst, not just release. But this probably means I may have to upgrade release fence to sfence.

No, I don't want that. The standard does not define non-temporal stores, and they are always intrinsics, so I don't think these should be covered.

@AlexGuteniev
Author

Relevant SO answer: https://stackoverflow.com/a/61382843/2945027
Relevant MSVC STL PR: microsoft/STL#740

@AlexGuteniev
Author

I wouldn't mind if you prefer keeping mfence for seq_cst, but adding sfence for release may turn me away from Boost.Atomic. If you're inclined to do that, at least please consider a macro to control it.

@Lastique
Member

Lastique commented Jun 9, 2020

Given that other implementations don't issue sfence, I won't be doing that either. I guess everyone considers non-temporal stores an exception, so __faststorefence or an equivalent might be acceptable.

@AlexGuteniev
Author

The weirdest thing about emitting mfence for atomic_thread_fence(seq_cst) is that it is not emitted for other seq_cst operations: https://gcc.godbolt.org/z/JyPPxw

void inc_seq_cst(int* v)
{
    __atomic_fetch_add(v, 1, __ATOMIC_SEQ_CST);
}

compiles no differently than with weaker memory orders; no extra mfence is emitted.

Similarly, a release store does not get any extra fences either.
So NT stores receive no special treatment from the memory model implementation.

Maybe compiler developers hope that mfence will be faster than an interlocked op on future CPUs, or they just missed this optimization opportunity. I've also read that mfence only became this slow on Skylake after a microcode update, not from the very beginning.

@AlexGuteniev
Author

Thanks.

Especially for the link arguing against __faststorefence. So trying to save a stack variable turns out to be a bad idea. (It is hypothetically possible that __faststorefence could be taught to avoid that, but I would not even bother looking into it.)

Then probably the best fence one can do is lock not [dummy_variable], which does not try to avoid a variable, but saves registers and flags and has a short encoding. This was suggested during the MSVC change discussion, but unfortunately there is (currently) no intrinsic to emit it. (MSVC would use lock inc [stack_variable], emitted via _InterlockedIncrement.)

@Lastique
Member

Thanks, lock not indeed seems to be a better alternative.
