
Consider not using _mm_mfence even when it is available #36

Closed
AlexGuteniev opened this issue Jun 9, 2020 · 9 comments

Comments

@AlexGuteniev

Consider either making BOOST_ATOMIC_NO_MFENCE override the detection of SSE2 or x64, or even defaulting to never using mfence.

The reason is that mfence is slower than a dummy interlocked operation.

On x64 there's even an intrinsic for that - see __faststorefence

There might be a reason to use mfence specifically: an interlocked operation does not fence non-temporal stores. But that is out of scope for implementing the C++ memory model on x86 in the usual way.

@Lastique
Member

Lastique commented Jun 9, 2020

If my reading of the C++ standard ([intro.races]/9) is correct, a release operation on an atomic should make all prior memory accesses visible to other threads after they execute an acquire operation on the atomic. The standard doesn't discriminate between regular or non-temporal stores, but it can be argued that it doesn't have to because both are "evaluations", in the standard's language. Put simply, non-temporal or not, the stores before the release fence should probably be flushed.

I know Boost.Atomic issues mfence for seq_cst, not just release. But this probably means I may have to upgrade the release fence to sfence.

As for __faststorefence, I can't comment on that as I don't know which instructions it translates to. There may be a place for a lighter fence in Boost.Atomic as an extension, if it is generic enough and could be useful beyond just x86.

@Lastique
Member

Lastique commented Jun 9, 2020

OTOH, all compilers seem to generate mfence for seq_cst and no instructions for other cases, see: https://gcc.godbolt.org/z/P6gF4g.

@AlexGuteniev
Author

__faststorefence means lock or [rsp], 0, i.e. a dummy interlocked operation.

The particular choice of interlocked operation is a matter of micro-optimization: using as few registers as possible (including the result register), clobbering as few flags as possible, avoiding a stack variable, and so on. This one specifically avoids a stack variable. It can be approximately simulated by _InterlockedOr(reinterpret_cast<long volatile *>(_AddressOfReturnAddress()), 0)

So it is implementable only on x86 (32-bit and 64-bit), and actually exists as an intrinsic only on 64-bit x86.

> I know Boost.Atomic is issuing mfence for seq_cst, not just release. But this probably means I may have to upgrade release fence to sfence.

No, I don't want that. The standard does not define non-temporal stores, and they are always intrinsics, so I don't think these should be covered.

@AlexGuteniev
Author

Relevant SO answer: https://stackoverflow.com/a/61382843/2945027
Relevant MSVC STL PR: microsoft/STL#740

@AlexGuteniev
Author

I wouldn't mind if you prefer keeping mfence for seq_cst, but adding sfence for release may turn me away from Boost.Atomic. If you're inclined to do that, at least please consider a macro to control it.

@Lastique
Member

Lastique commented Jun 9, 2020

Given that other implementations don't issue sfence, I won't be doing that either. I guess everyone considers non-temporal stores an exception, so __faststorefence or an equivalent might be acceptable.

@AlexGuteniev
Author

The weirdest thing about emitting mfence for atomic_thread_fence(seq_cst) is that it is not emitted for other seq_cst operations: https://gcc.godbolt.org/z/JyPPxw

void inc_seq_cst(int* v)
{
    __atomic_fetch_add(v, 1, __ATOMIC_SEQ_CST);
}

compiles no differently than with weaker memory orders; no extra mfence is emitted.

Similarly, a release store does not get any extra fences either.
So NT stores receive no special treatment from the memory model implementation.

Maybe compiler developers hope that mfence will be faster than an interlocked op on future CPUs, or they just missed this optimization opportunity. I've also read that mfence only became this slow on Skylake after a microcode update, not from the very beginning.

@AlexGuteniev
Author

Thanks.

Especially for the link arguing against __faststorefence. So trying to save a stack variable turns out to be a bad idea. (It is hypothetically possible that __faststorefence could be taught to avoid that, but I would not even bother looking into it.)

Then probably the best fence one can do is lock not [dummy_variable], which does not try to avoid a variable, but saves registers and flags and has a short encoding. This was suggested during the MSVC change discussion, but unfortunately there is (currently) no intrinsic to emit it. (MSVC would use lock inc [stack_variable], emitted via _InterlockedIncrement.)

@Lastique
Member

Thanks, lock not indeed seems to be a better alternative.
