
[Arm64] Fix WorkStealingQueue memory ordering #17508

Closed

Conversation

sdmaclea

@sdmaclea sdmaclea commented Apr 11, 2018

Fixes #17178

Audited and reworked WorkStealingQueue to take into account multi-threaded memory ordering issues encountered in high-core-count arm64 testing.

This fixes easily reproducible errors in System.Threading.Tasks.Tests, System.Threading.Channel.Tests, and System.Threading.Tasks.Dataflow.Tests. The same issue was making the linux-arm64 SDK unusable.

This is very important for linux-arm64 for the 2.1 release. I am very happy to put the patch inside a #if ARM64 if this is deemed too risky for other platforms. (I suspect arm32 needs a similar/identical patch).

@BruceForstall FYI

@stephentoub stephentoub added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Apr 11, 2018
@stephentoub stephentoub requested a review from kouvel April 11, 2018 11:24
@kouvel
Member

kouvel commented Apr 11, 2018

I won't be able to take a look immediately (out of office for personal reasons), my current expectation is that I'll be able to look at this by end of week or earlier.

@sdmaclea sdmaclea force-pushed the PR-ARM64-CONSERVATIVE-THREADPOOL branch from 094c69e to 9ff8e86 Compare April 12, 2018 06:02
@sdmaclea sdmaclea changed the title WIP [Arm64] Conservative ThreadPool [Arm64] Conservative ThreadPool Apr 12, 2018
@sdmaclea sdmaclea changed the title [Arm64] Conservative ThreadPool [Arm64] Fix WorkStealingQueue memory ordering Apr 12, 2018
@Petermarcu
Member

@jkotas thoughts on getting this in by tomorrow for 2.1?

@jkotas
Member

jkotas commented Apr 13, 2018

The delta seems to be a combination of correctness fixes and optimizations. Optimizations of tricky lock-free code always come with a bug tail.

I think I would be ok with the minimal set of correctness fixes in 2.1, but not the optimizations.

@stephentoub
Member

This also appears to be adding interlocked operations where they weren't before, e.g. changing the arg to SpinLock.Exit to true. We wouldn't want to incur those costs for non-ARM64.

@sdmaclea
Author

@jkotas thoughts on getting this in by tomorrow for 2.1?

I would be ok with the minimal set of correctness fixes in 2.1, but not the optimizations.

It is hard for me to identify exactly which changes are correctness fixes and which are optimizations. The work was done as an audit and rewrite followed by debugging. I'll make a first attempt.

This also appears to be adding interlocked operations where they weren't before, e.g. changing the arg to SpinLock.Exit to true. We wouldn't want to incur those costs for non-ARM64

useMemoryBarrier is an unfortunate variable name. For arm* functional correctness, exiting the lock always needs a release barrier to keep writes made under the lock from being deferred (in the write-gather buffer) until long after the lock is released.

It doesn't have to be an Interlocked operation. It can be a simple Volatile.Write() when releasing the lock.

I would propose SpinLock.Exit(useMemoryBarrier: false) use a Volatile.Write() to exit the lock, at least for arm*. @stephentoub Is that OK for x64/x86, or do you want an #if (arm64 || arm) around the SpinLock.Exit() change?

@stephentoub
Member

It doesn't have to be an Interlocked operation. It can be a simple Volatile.Write() when releasing the lock.

I don't understand. It already is a volatile write when useMemoryBarrier is false, and an interlocked when useMemoryBarrier is true. The m_owner field itself is marked as volatile.
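In sketch form (a simplified illustration of the pattern just described, not the actual SpinLock source), the two exit flavors look like this:

using System.Threading;

// Minimal illustration only; the real SpinLock tracks owners, spins adaptively, etc.
internal sealed class TinySpinLock
{
    private int _owner; // 0 = free, 1 = held

    public void Enter()
    {
        // Atomic compare-and-swap to acquire the lock.
        while (Interlocked.CompareExchange(ref _owner, 1, 0) != 0)
        {
            Thread.SpinWait(1);
        }
    }

    public void Exit(bool useMemoryBarrier)
    {
        if (useMemoryBarrier)
        {
            // The "true" flavor exits with an interlocked store.
            Interlocked.Exchange(ref _owner, 0);
        }
        else
        {
            // The "false" flavor exits with a volatile (release) store, which is
            // cheaper but still publishes the writes made while the lock was held.
            Volatile.Write(ref _owner, 0);
        }
    }
}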

@sdmaclea
Author

It already is a volatile write when useMemoryBarrier is false, and an interlocked when useMemoryBarrier is true. The m_owner field itself is marked as volatile.

I didn't notice that. Like I said, useMemoryBarrier is an unfortunate variable name ... Thanks.

@sdmaclea
Author

I started revising to the minimal set of changes.

There are two symptoms w/ the original code.

  • The same work item was being executed by multiple threads
  • The queue was underflowing

The root cause seems to be that the LocalPop() fast path made a faulty assumption that at most one item could be stolen by TrySteal(). The actual number can be significantly higher (unbounded), especially in cases where the local thread is preempted.

Because the symptoms were fixed in the order above, it seems the Exchange on the m_array element may not be necessary in the minimal set of changes. I need to test to confirm.
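To make the failure mode concrete, here is a rough paraphrase of the shape of the old fast path (illustration only, not the actual coreclr source):

using System.Threading;

// Illustration only -- a rough paraphrase of the pre-fix fast path shape.
internal static class FastPathSketch
{
    // The old code published the new tail, then compared against head and assumed
    // that at most one steal could have raced into the window.
    public static bool LocalPopFastPath(object[] array, ref int headIndex, ref int tailIndex, out object item)
    {
        int tail = tailIndex - 1;
        Interlocked.Exchange(ref tailIndex, tail);   // publish the new tail

        if (headIndex <= tail)                       // fast path: "no interaction with a take"
        {
            int idx = tail & (array.Length - 1);
            item = array[idx];                       // a stealer may be racing for this same slot
            array[idx] = null;
            return item != null;
        }

        item = null;                                 // slow path (taking m_foreignLock) not shown
        return false;
    }
}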

@sdmaclea
Author

To fix the LocalPop() fast path issue, LocalPop() must be made properly lockless in the fast path. This will require the Exchange on m_array.

@sdmaclea sdmaclea force-pushed the PR-ARM64-CONSERVATIVE-THREADPOOL branch from 9ff8e86 to c0ef151 Compare April 13, 2018 18:36
@sdmaclea
Author

@jkotas I have greatly simplified the patch and tried to only include critical fixes.
@stephentoub I have removed the changes to the useMemoryBarrier argument.

The changes are now all in LocalPop() and TrySteal(). The basic idea is to use Exchange to arbitrate which thread gets m_array[last]. After popping/stealing an item, always check for queue underflow, and repair while holding the lock to guarantee the repair is correct.
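A condensed sketch of the arbitration idea (illustrative; not the exact patch):

using System.Threading;

// Illustration of the "Exchange arbitrates the slot" idea; not the exact patch.
internal static class ArbitrationSketch
{
    public static bool TryTakeSlot(object[] array, int last, out object item)
    {
        // Both LocalPop and TrySteal swap null into the contested slot.
        // Whichever thread gets a non-null value back owns the item;
        // the loser sees null and must fall back / repair under the lock.
        item = Interlocked.Exchange(ref array[last & (array.Length - 1)], null);
        return item != null;
    }
}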

@kouvel
Member

kouvel commented Apr 13, 2018

@sdmaclea, looking at the original code, I'm trying to understand what the issue is that can cause LocalPop and TrySteal to pop/steal the same item. I can see that would be an issue if the two Interlocked.Exchanges here do not imply a full memory barrier. Otherwise, my rationale is that either:

  • LocalPop will see the updated m_headIndex (from TrySteal's exchange) and fail the fast path, requiring a lock which resolves any potential race
  • Or TrySteal will see the updated m_tailIndex (from LocalPop's exchange) and fail to steal on that iteration

Does Interlocked.Exchange imply a full memory barrier on arm64?

Other than that, perhaps the initial reads of m_headIndex and m_tailIndex in both functions (and others) should be volatile reads, depending on uses from outside, so that they would not miss an existing item.

If there's some other potential ordering issue, could you please elaborate?

@kouvel
Member

kouvel commented Apr 13, 2018

It looks like Interlocked.Exchange is just an InternalCall (not an intrinsic) and translates to __sync_swap, whose documentation states that it is a full barrier (assuming that means relative to all memory and not just the address being referenced).

@sdmaclea
Author

I expect the Exchange to currently be a Load-Acquire Exclusive / Store-Release Exclusive pair. So it is better than a full barrier.

If I understand it, the issue occurred on the LocalPop() fast path.

// If there is no interaction with a take, we can head down the fast path.

On arm/arm64 the tail exchange is guaranteed to be observed before any other local write is observed. We then read head, which is guaranteed to be read after the exchange, but is not guaranteed to observe all pending writes from other observers. If (head <= tail) we take the fast path: we read m_array[idx], check for null, and write null.

There is insufficient handshaking with TrySteal() to guarantee that TrySteal() has not run more than once. With 48 threads on Centriq, it is highly likely multiple threads attempt to steal in these tests.
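This is essentially the classic store-buffering pattern. A minimal illustration (not the queue code itself) of how both sides can act on stale values when the stores provide only release ordering (or weaker) rather than a full fence:

// Store-buffering illustration: with stores that are only release-ordered (or plain),
// both threads can read the other's stale value and both "win", which matches the
// double-execution symptom above.
internal static class StoreBufferingSketch
{
    private static int _tail; // written by the local thread
    private static int _head; // written by the stealing thread

    public static void LocalSide()
    {
        _tail = 1;                 // analogous to publishing the new m_tailIndex
        int observedHead = _head;  // may still read the stale (old) head
        // if observedHead is stale, the local thread takes the fast path
    }

    public static void StealSide()
    {
        _head = 1;                 // analogous to publishing the new m_headIndex
        int observedTail = _tail;  // may still read the stale (old) tail
        // if observedTail is stale, the stealer also takes the item
    }

    // With a full fence (e.g. Interlocked.MemoryBarrier()) between each store and the
    // following load, at least one thread is guaranteed to observe the other's store.
}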

@kouvel
Member

kouvel commented Apr 13, 2018

We then read head, which is guaranteed to be read after the exchange, but is not guaranteed to observe all pending writes from other observers.

There may be a pending write to head in TrySteal that is not observed by LocalPop. That means LocalPop will pop the item through the fast path, but it also means that TrySteal will fail to steal that item because its exchange/barrier must have occurred after LocalPop's exchange/barrier and TrySteal will observe the updated tail. Or did I miss something?

There is insufficient handshaking with TrySteal() to guarantee that TrySteal() has not run more than once. With 48 threads on Centriq, it is highly likely multiple threads attempt to steal in these tests.

Most of what TrySteal does is inside the lock so I don't see how multiple TrySteal threads could surface additional ordering issues compared to just one TrySteal thread.

@kouvel
Member

kouvel commented Apr 13, 2018

Although, SpinLock exit is not the typical lock exit (no full barrier at the moment). Could that be the main issue?
Never mind, it only needs release and that's already covered above.

@sdmaclea
Author

SpinLock does have a release barrier on Exit(). As @stephentoub pointed out, m_owner is marked volatile. Volatile writes are treated as release barriers; volatile reads function as acquire barriers.
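For reference, this is the standard publish/consume pairing that the acquire/release behavior supports (a generic illustration, not queue-specific code):

using System.Threading;

// Volatile.Write is a release (the payload write cannot move after it);
// Volatile.Read is an acquire (the payload read cannot move before it).
// A consumer that sees _ready == true therefore also sees _payload.
internal static class PublishSketch
{
    private static int _payload;
    private static bool _ready;

    public static void Publish(int value)
    {
        _payload = value;
        Volatile.Write(ref _ready, true);   // release: publishes _payload
    }

    public static bool TryConsume(out int value)
    {
        if (Volatile.Read(ref _ready))      // acquire: pairs with the release above
        {
            value = _payload;
            return true;
        }
        value = 0;
        return false;
    }
}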

@sdmaclea
Author

Multiple TrySteal() calls can run.

The m_foreignLock guarantees two major things.

  • Only one TrySteal() can occur at a time.
  • The changes while the lock was held will be visible before the release of the lock is observed.

With multiple threads trying to steal work from a single producer's local queue, you can see how many items could be consumed.

The only thing that will stop them is the observation of an empty queue.

Since they are guaranteed to observe each other's writes to m_tailIndex after observing and acquiring the lock, they cannot underrun the queue without LocalPop().
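Schematically (shape only, not the real TrySteal):

using System.Threading;

// Stealers serialize on a lock, so each one sees the head advanced by the
// previous stealer and stops when the queue looks empty.
internal sealed class StealSketch
{
    private readonly object _foreignLock = new object();
    private int _headIndex;
    private int _tailIndex;

    public bool TryStealOne()
    {
        lock (_foreignLock)                        // only one stealer at a time
        {
            int head = _headIndex;                 // sees prior stealers' updates
            if (head < Volatile.Read(ref _tailIndex))
            {
                _headIndex = head + 1;             // claim one slot; ownership of the
                return true;                       // slot itself still has to be
            }                                      // arbitrated against LocalPop
            return false;                          // observed an empty queue: stop
        }
    }
}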

@sdmaclea
Author

So TrySteal() and LocalPop() must observe each other take the last item and handle it appropriately.

In the new code

TrySteal() increments m_headIndex, then performs an exchange on m_array[last], then reads m_tailIndex, then conditionally repairs m_headIndex and releases m_foreignLock.

LocalPop() will observe events as happening in the following order.

  1. Write m_headIndex
  2. Exchange of m_array[last]. (Atomic swap w/ barrier)
  3. Read of m_tailIndex
  4. Conditional write of m_headIndex
  5. Release of m_foreignLock

The exchange orders reads with respect to writes. This effectively means that the write to m_headIndex has left the write-gathering buffers before the reads of m_array[last] and m_tailIndex.

@sdmaclea
Author

Continuing with the new code

LocalPop() follows a similar pattern.

LocalPop() decrements m_tailIndex, then performs an exchange on m_array[last], then reads m_headIndex, then conditionally obtains m_foreignLock, conditionally repairs m_tailIndex, and releases m_foreignLock.

TrySteal() will observe events as happening in the following order.

  1. Write m_tailIndex
  2. Exchange of m_array[last]. (Atomic swap w/ barrier)
  3. Read of m_headIndex
  4. Conditional Exchange of m_foreignLock
  5. Conditional write of m_tailIndex
  6. Conditional Release of m_foreignLock

So if you look carefully at both TrySteal() and LocalPop(), on Arm the operations will be observed as if they were executed on a sequentially consistent machine.
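Putting the two walk-throughs together, the symmetric shape looks roughly like the sketch below. It is only meant to show the order of the operations discussed above; it is not a correct or complete queue, and the repair details are deliberately elided.

using System.Threading;

// Shape-only sketch of the ordering described above; not a correct or complete queue.
internal sealed class SymmetricOrderingSketch
{
    private readonly object[] _array = new object[32];
    private readonly object _foreignLock = new object();
    private int _headIndex;
    private int _tailIndex;

    // Local side: 1. write tail, 2. exchange the slot, 3. read head,
    // 4./5. conditionally repair under the foreign lock.
    public object LocalPopSketch()
    {
        int tail = _tailIndex - 1;
        Interlocked.Exchange(ref _tailIndex, tail);                       // 1. publish the new tail
        object item = Interlocked.Exchange(ref _array[tail & 31], null);  // 2. arbitrate the slot
        int head = _headIndex;                                            // 3. read head
        if (head > tail)                                                  // indexes crossed (underflow)
        {
            lock (_foreignLock)                                           // 4. serialize with stealers
            {
                // 5. repair the indexes so head <= tail again; the exact adjustment
                //    (and whether `item` counts as popped) is elided in this sketch.
            }
        }
        return item;                                                      // null means a stealer won the slot
    }

    // The foreign side mirrors this: write head, exchange the slot, read tail,
    // and conditionally repair head before releasing m_foreignLock.
}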

@sdmaclea
Author

Continuing with the new code

So the question remains: does the new ordering guarantee we will never

  • Execute the same item twice
  • Execute a LocalPush(item) and never be able to execute the item.

The "execute the same item twice" case is easily covered by the two Exchanges; only one can get the item.

The "execute a LocalPush(item) and never be able to execute the item" case is prevented as long as we never allow the head > tail condition to persist.

Given the ordering analysis above, we have:

Local Write tail --> Read head (Local and Foreign observed)
Foreign Write head --> Read tail (Local and Foreign observed)

So we have six possible orders of events.

Non-overlapping:
  • Local Write, Local Read, Foreign Write, Foreign Read
  • Foreign Write, Foreign Read, Local Write, Local Read

Overlapping:
  • Local Write, Foreign Write, Local Read, Foreign Read
  • Local Write, Foreign Write, Foreign Read, Local Read
  • Foreign Write, Local Write, Foreign Read, Local Read
  • Foreign Write, Local Write, Local Read, Foreign Read

In all cases, if the head and tail get out of order, one of them can fix the underflowed queue.

I think this is sufficient to demonstrate that the new code works.

@sdmaclea
Author

So now why was the old code broken?

@kouvel
Member

kouvel commented Apr 14, 2018

Thanks @sdmaclea, I'll get back shortly

@sdmaclea
Author

@kouvel I think my analysis of the old C# code was completely wrong. I deleted some of my most recent comments.

I just took a look at a C# 6.0 draft spec for the volatile keyword with respect to Execution ordering and critical points. From a C# perspective, it is not obvious to me why the old code was not working.

I just reread the ARMv8 ARM barrier-ordered section. From what I understand of the C# spec, the arm64 JIT for volatile and GT_XCHG, and the ARMv8 spec it seems the intended JIT behavior is correct.

I think I need to look at JIT generated code....

@sdmaclea
Author

The actual disassembly of the critical section of LocalPop():

885FFC02          ldaxr   w2, [x0]          // w2 = LoadAcqEx(&m_tailIndex)
8801FC13          stlxr   w1, w19, [x0]     // w1 = StoreRelEx(&m_tailIndex, tail)
35FFFFC1          cbnz    w1, G_M55277_IG05 // Retry
F9400FA0          ldr     x0, [fp,#24]      // x0 = this
B9801400          ldrsw   x0, [x0,#20]      // x0 = this->m_headIndex
D50331BF          dmb     oshld             // Load acquire barrier
6B13001F          cmp     w0, w19           // Compare head tail       
5400038C          bgt     G_M55277_IG07     // Branch if fast path is not safe

So the code to read m_headIndex surprised me. I expected an ldar, not an ldrsw. Because head/tail are signed and not native-sized ints, they must be sign-extended. I had forgotten that I skipped handling signed integer loads when I handled volatile in the JIT in #12087.

The net result is my analysis was correct or sort of correct, but for the wrong reasons.

The load of m_headIndex is able to move before the write to m_tailIndex. This is because the stlxr is only a release barrier: it keeps earlier accesses from moving after it, but it does not prevent a later load from moving before it.
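Restating that in source terms (the instruction sequences in the comments reflect the codegen shown above and my reading of the ARMv8 ordering rules; they are illustrative, not verified output):

internal sealed class VolatileHeadReadSketch
{
    private volatile int _headIndex; // signed 32-bit, like the real field

    public int ReadHead()
    {
        // What the disassembly above shows being emitted for this volatile read:
        //     ldrsw  x0, [x0,#20]
        //     dmb    oshld
        // versus the expected acquire load:
        //     ldar   w0, [x0]        ; plus sign extension as needed
        // The difference matters because ARMv8 keeps an ldar ordered after a preceding
        // stlr/stlxr, while a plain ldrsw (even with a trailing dmb) can still be
        // hoisted above the store-release that published m_tailIndex.
        return _headIndex;
    }
}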

@sdmaclea
Author

Looking at the available documentation for volatile and Interlocked.Exchange(), the behavior of this case is not exactly specified. The C# spec details the behavior of volatile writes and volatile reads, but not volatile read-modify-write or volatile atomic operations... The Interlocked.Exchange() docs also do not specify the effect on memory ordering.

I think it is reasonable to assume Interlocked.Exchange(ref m_tailIndex, tail) is either unordered or treated as a volatile write with store-release semantics. The generated arm64 code implements exactly that.

Therefore, functionally correct C# code should insert an Interlocked.MemoryBarrier() if it wants a full memory barrier.

So Interlocked.Exchange(ref m_tailIndex, tail); should become

m_tailIndex = tail;
Interlocked.MemoryBarrier();

I will revise the patch and test.
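For context, the proposed shape of the fast path would be roughly the following (a sketch, not the exact committed patch):

using System.Threading;

// Sketch of the proposed shape: a plain/volatile store of the new tail followed by an
// explicit full fence, so the later read of m_headIndex cannot be hoisted above the
// tail publication. Field names mirror the discussion; this is not the committed patch.
internal sealed class RevisedFastPathSketch
{
    private volatile int _headIndex;
    private volatile int _tailIndex;

    public bool FastPathCheck()
    {
        int tail = _tailIndex - 1;

        _tailIndex = tail;                 // volatile store of the new tail
        Interlocked.MemoryBarrier();       // full fence: orders the store above against
                                           // the read below on all architectures

        return _headIndex <= tail;         // safe to pop locally without the lock
    }
}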

@sdmaclea sdmaclea force-pushed the PR-ARM64-CONSERVATIVE-THREADPOOL branch from c0ef151 to 114ff67 Compare April 14, 2018 06:31
@sdmaclea
Author

The patch is revised. The revised fix is working on linux-arm64. The patch now certainly meets @jkotas' minimal-change requirement.

PTAL

@kouvel
Member

kouvel commented Apr 14, 2018

Looking at the available documentation for volatile and Interlocked.Exchange(), the behavior of this case is not exactly specified. The C# spec details the behavior of volatile writes and volatile reads, but not volatile read-modify-write or volatile atomic operations... The Interlocked.Exchange() docs also do not specify the effect on memory ordering.

I think it is reasonable to assume Interlocked.Exchange(ref m_tailIndex, tail) is either unordered or treated as a volatile write with store-release semantics. The generated arm64 code implements exactly that.

Though it is not sufficiently documented (documentation issues in this area don't seem uncommon), for historical reasons in .NET anyway (based on how it works on x86/x64), I believe any of the interlocked operations requires at minimum:

  1. The equivalent of a globally ordered memory barrier
  2. And sequential consistency on the memory region referenced (read and write where applicable) with respect to other interlocked operations referencing the same memory region

Based on what I gather from this, dmb satisfies (1) but ldaxr + stlxr do not. It's not very clear to me from the spec, but it appears as though an ldaxr + stlxr loop (which appears to be similar to a compare-and-swap loop) would satisfy (2) but not (1).

I tried compiling the following with the VC++ compiler targeting arm64:

#include <Windows.h>
#include <cstdint>

volatile uint32_t g_y;

int main()
{
    uint32_t x = 0;
    uint32_t y = InterlockedExchange(&x, 1);
    g_y = y;
    return 0;
}

And I see the following relevant code generated for the interlocked operation in main:

; 9235 :     return (unsigned) _InterlockedExchange((volatile long*) Target, (long) Value);

  00010	mov         w11,#1
  00014	mov         x10,sp
  00018		 |$LN5@main|
  00018	ldaxr       w8,[x10]
  0001c	stlxr       w9,w11,[x10]
  00020	cbnz        w9,|$LN5@main|
  00024	dmb         ish
; File main.cpp

; 6    :     volatile uint32_t y = InterlockedExchange(&x, 1);

  00028	str         w8,[sp]

The ldaxr + stlxr loop I think satisfies (2) and the dmb satisfies (1). The dmb would also prevent the reordering you mentioned here:

The load of m_headIndex is able to move before the write to m_tailIndex. This is because the stlxr is a release barrier.

So I think that dmb is critical to get the expected behavior. Any idea why the dmb is not there for the exchange in the generated code you posted?

I think a proper fix needs to be in the PAL since the code generated for the interlocked operations would be clang-generated code unless I'm mistaken. It may also affect other interlocked operations besides Exchange.

RE your latest change, I'll have to look at it more closely with regards to removal of the interlocked operation, I'll get back on this

@kouvel
Member

kouvel commented Apr 14, 2018

The docs for __sync_*, in particular the __sync_swap that I mentioned above, are also not very clear about what they guarantee in this respect.

@kouvel
Member

kouvel commented Apr 14, 2018

885FFC02          ldaxr   w2, [x0]          // w2 = LoadAcqEx(&m_tailIndex)
8801FC13          stlxr   w1, w19, [x0]     // w1 = StoreRelEx(&m_tailIndex, tail)
35FFFFC1          cbnz    w1, G_M55277_IG05 // Retry
F9400FA0          ldr     x0, [fp,#24]      // x0 = this
B9801400          ldrsw   x0, [x0,#20]      // x0 = this->m_headIndex

The load of m_headIndex is able to move before the write to m_tailIndex. This is because the stlxr is a release barrier.

Looking at this more closely, I don't see how the processor could legally perform that reordering, considering that it would be reordering a load into a loop that may incur a load-acquire on retry, which would refute the legality of the reordering. Perhaps it could do a speculative load prior to stlxr and retain the loaded value if stlxr does not fail. It seems like a stretch though, is that actually possible?

@kouvel
Member

kouvel commented Apr 14, 2018

@sdmaclea, you mentioned JIT-generated code above; is the Interlocked.Exchange code that you posted generated by the JIT? Maybe I missed how the JIT manages to treat interlocked operations as intrinsics. If the code is generated by the JIT, then maybe the fix should be there.

@sdmaclea
Author

JIT code is here.

void CodeGen::genLockedInstructions(GenTreeOp* treeNode)
{
    GenTree*  data      = treeNode->gtOp.gtOp2;
    GenTree*  addr      = treeNode->gtOp.gtOp1;
    regNumber targetReg = treeNode->gtRegNum;
    regNumber dataReg   = data->gtRegNum;
    regNumber addrReg   = addr->gtRegNum;

    regNumber exResultReg  = treeNode->ExtractTempReg(RBM_ALLINT);
    regNumber storeDataReg = (treeNode->OperGet() == GT_XCHG) ? dataReg : treeNode->ExtractTempReg(RBM_ALLINT);
    regNumber loadReg      = (targetReg != REG_NA) ? targetReg : storeDataReg;

    // Check allocator assumptions
    //
    // The register allocator should have extended the lifetimes of all input and internal registers so that
    // none interfere with the target.
    noway_assert(addrReg != targetReg);

    noway_assert(addrReg != loadReg);
    noway_assert(dataReg != loadReg);

    noway_assert(addrReg != storeDataReg);
    noway_assert((treeNode->OperGet() == GT_XCHG) || (addrReg != dataReg));

    assert(addr->isUsedFromReg());
    noway_assert(exResultReg != REG_NA);
    noway_assert(exResultReg != targetReg);
    noway_assert((targetReg != REG_NA) || (treeNode->OperGet() != GT_XCHG));

    // Store exclusive unpredictable cases must be avoided
    noway_assert(exResultReg != storeDataReg);
    noway_assert(exResultReg != addrReg);

    genConsumeAddress(addr);
    genConsumeRegs(data);

    // NOTE: `genConsumeAddress` marks the consumed register as not a GC pointer, as it assumes that the input
    // registers die at the first instruction generated by the node. This is not the case for these atomics as
    // the input registers are multiply-used. As such, we need to mark the addr register as containing a GC
    // pointer until we are finished generating the code for this node.
    gcInfo.gcMarkRegPtrVal(addrReg, addr->TypeGet());

    // TODO-ARM64-CQ Use ARMv8.1 atomics if available
    // https://github.com/dotnet/coreclr/issues/11881

    // Emit code like this:
    //   retry:
    //     ldxr loadReg, [addrReg]
    //     add storeDataReg, loadReg, dataReg # Only for GT_XADD & GT_LOCKADD
    //                                        # GT_XCHG storeDataReg === dataReg
    //     stxr exResult, storeDataReg, [addrReg]
    //     cbnz exResult, retry

    BasicBlock* labelRetry = genCreateTempLabel();
    genDefineTempLabel(labelRetry);

    emitAttr dataSize = emitActualTypeSize(data);

    // The following instruction includes a acquire half barrier
    // TODO-ARM64-CQ Evaluate whether this is necessary
    // https://github.com/dotnet/coreclr/issues/14346
    getEmitter()->emitIns_R_R(INS_ldaxr, dataSize, loadReg, addrReg);

    switch (treeNode->OperGet())
    {
        case GT_XADD:
        case GT_LOCKADD:
            if (data->isContainedIntOrIImmed())
            {
                // Even though INS_add is specified here, the encoder will choose either
                // an INS_add or an INS_sub and encode the immediate as a positive value
                genInstrWithConstant(INS_add, dataSize, storeDataReg, loadReg, data->AsIntConCommon()->IconValue(),
                                     REG_NA);
            }
            else
            {
                getEmitter()->emitIns_R_R_R(INS_add, dataSize, storeDataReg, loadReg, dataReg);
            }
            break;
        case GT_XCHG:
            assert(!data->isContained());
            storeDataReg = dataReg;
            break;
        default:
            unreached();
    }

    // The following instruction includes a release half barrier
    // TODO-ARM64-CQ Evaluate whether this is necessary
    // https://github.com/dotnet/coreclr/issues/14346
    getEmitter()->emitIns_R_R_R(INS_stlxr, dataSize, exResultReg, storeDataReg, addrReg);

    getEmitter()->emitIns_J_R(INS_cbnz, EA_4BYTE, labelRetry, exResultReg);

    gcInfo.gcMarkRegSetNpt(addr->gtGetRegMask());

    if (treeNode->gtRegNum != REG_NA)
    {
        genProduceReg(treeNode);
    }
}
//------------------------------------------------------------------------
// genCodeForCmpXchg: Produce code for a GT_CMPXCHG node.
//
// Arguments:
//    tree - the GT_CMPXCHG node
//
void CodeGen::genCodeForCmpXchg(GenTreeCmpXchg* treeNode)
{
    assert(treeNode->OperIs(GT_CMPXCHG));

    GenTree* addr      = treeNode->gtOpLocation;  // arg1
    GenTree* data      = treeNode->gtOpValue;     // arg2
    GenTree* comparand = treeNode->gtOpComparand; // arg3

    regNumber targetReg    = treeNode->gtRegNum;
    regNumber dataReg      = data->gtRegNum;
    regNumber addrReg      = addr->gtRegNum;
    regNumber comparandReg = comparand->gtRegNum;
    regNumber exResultReg  = treeNode->ExtractTempReg(RBM_ALLINT);

    // Check allocator assumptions
    //
    // The register allocator should have extended the lifetimes of all input and internal registers so that
    // none interfere with the target.
    noway_assert(addrReg != targetReg);
    noway_assert(dataReg != targetReg);
    noway_assert(comparandReg != targetReg);
    noway_assert(addrReg != dataReg);

    noway_assert(targetReg != REG_NA);
    noway_assert(exResultReg != REG_NA);
    noway_assert(exResultReg != targetReg);

    assert(addr->isUsedFromReg());
    assert(data->isUsedFromReg());
    assert(!comparand->isUsedFromMemory());

    // Store exclusive unpredictable cases must be avoided
    noway_assert(exResultReg != dataReg);
    noway_assert(exResultReg != addrReg);

    genConsumeAddress(addr);
    genConsumeRegs(data);
    genConsumeRegs(comparand);

    // NOTE: `genConsumeAddress` marks the consumed register as not a GC pointer, as it assumes that the input
    // registers die at the first instruction generated by the node. This is not the case for these atomics as
    // the input registers are multiply-used. As such, we need to mark the addr register as containing a GC
    // pointer until we are finished generating the code for this node.
    gcInfo.gcMarkRegPtrVal(addrReg, addr->TypeGet());

    // TODO-ARM64-CQ Use ARMv8.1 atomics if available
    // https://github.com/dotnet/coreclr/issues/11881

    // Emit code like this:
    //   retry:
    //     ldxr targetReg, [addrReg]
    //     cmp targetReg, comparandReg
    //     bne compareFail
    //     stxr exResult, dataReg, [addrReg]
    //     cbnz exResult, retry
    //   compareFail:

    BasicBlock* labelRetry       = genCreateTempLabel();
    BasicBlock* labelCompareFail = genCreateTempLabel();
    genDefineTempLabel(labelRetry);

    // The following instruction includes a acquire half barrier
    // TODO-ARM64-CQ Evaluate whether this is necessary
    // https://github.com/dotnet/coreclr/issues/14346
    getEmitter()->emitIns_R_R(INS_ldaxr, emitTypeSize(treeNode), targetReg, addrReg);

    if (comparand->isContainedIntOrIImmed())
    {
        if (comparand->IsIntegralConst(0))
        {
            getEmitter()->emitIns_J_R(INS_cbnz, emitActualTypeSize(treeNode), labelCompareFail, targetReg);
        }
        else
        {
            getEmitter()->emitIns_R_I(INS_cmp, emitActualTypeSize(treeNode), targetReg,
                                      comparand->AsIntConCommon()->IconValue());
            getEmitter()->emitIns_J(INS_bne, labelCompareFail);
        }
    }
    else
    {
        getEmitter()->emitIns_R_R(INS_cmp, emitActualTypeSize(treeNode), targetReg, comparandReg);
        getEmitter()->emitIns_J(INS_bne, labelCompareFail);
    }

    // The following instruction includes a release half barrier
    // TODO-ARM64-CQ Evaluate whether this is necessary
    // https://github.com/dotnet/coreclr/issues/14346
    getEmitter()->emitIns_R_R_R(INS_stlxr, emitTypeSize(treeNode), exResultReg, dataReg, addrReg);

    getEmitter()->emitIns_J_R(INS_cbnz, EA_4BYTE, labelRetry, exResultReg);

    genDefineTempLabel(labelCompareFail);

    gcInfo.gcMarkRegSetNpt(addr->gtGetRegMask());

    genProduceReg(treeNode);
}

If we are certain we want all interlocked operations to have barrier semantics, the fix is trivial.

Based on the above, it seems you are asserting that should be the expected behavior.

gcc exchanges are moving from the older __sync* builtins to newer intrinsics which allow specifying the ordering behavior: none, acquire, release, acquire-release, full ...

@sdmaclea
Author

@kouvel The JIT change is #17567. If you prefer that, please mark it with the 2.1 label.

@stephentoub
Member

If we are certain we want all interlocked operations to have barrier semantics, the fix is trivial.

There's a fair amount of code, both in coreclr/corefx and external to it, that expects these semantics. If we don't do that, I expect you'll be chasing a long tail of these kinds of race conditions.

@sdmaclea
Author

There's a fair amount of code, both in coreclr/corefx and external to it, that expects these semantics. If we don't do that, I expect you'll be chasing a long tail of these kinds of race conditions.

Currently, without #17567, the interlocked operations have load-acquire and store-release semantics. In the vast majority of cases that should be enough.

For the 2.1 release, #17567 seems safest.

My hesitation would be that the performance cost of the extra barriers is significant. However, as I have experienced, tracking down these obscure races also takes significant manpower.

Perhaps the long-term solution might be to extend the API to take an optional barrier type on these Interlocked methods, following the same evolution taken by gcc et al.

Given that, my recommendation would be to prefer #17567 over #17508 for the 2.1 release.
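Purely to illustrate the API-extension idea above (a hypothetical shape only; no such overloads exist in System.Threading today):

using System.Threading;

// Hypothetical sketch -- System.Threading.Interlocked has no ordering-aware overloads.
public enum MemoryOrder
{
    Relaxed,
    Acquire,
    Release,
    AcquireRelease,
    SequentiallyConsistent
}

public static class InterlockedWithOrder
{
    // Mirrors the explicit ordering parameter of the newer gcc atomic builtins.
    public static int Exchange(ref int location, int value, MemoryOrder order)
    {
        // Placeholder: today every Interlocked operation is treated as a full fence,
        // so all orderings map to the same implementation.
        return Interlocked.Exchange(ref location, value);
    }
}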

@sdmaclea
Author

Perhaps it could do a speculative load prior to stlxr and retain the loaded value if stlxr does not fail. It seems like a stretch though, is that actually possible?

Yes, it is possible and typical for modern arm CPUs.

@kouvel
Member

kouvel commented Apr 14, 2018

my recommendation would be to prefer #17567 over #17508 for the 2.1 release.

That sounds good to me as well, it seems like a good fix.

It looks like for an Exchange where the result is not used, it can just be:

  00000	mov         w9,#1
  00004	stlr        w9,[x0]
  00008	dmb         ish

stlr is necessary so that things can't be reordered after it and dmb is necessary so that things can't be reordered before the store. Not sure if the JIT also takes into consideration whether the result is used (haven't looked at #17567 yet).

@sdmaclea I wonder if this also needs a fix in the PAL, which uses the __sync* functions. Would you be able to double-check the code generated for those as well? I'll also try to take a look and see if I can figure out how to do that.

@stephentoub stephentoub removed this from the 2.1.0 milestone Apr 15, 2018
@kouvel
Member

kouvel commented Apr 15, 2018

From cross-compiling for arm64 with clang-3.9, it looks like the same reordering is possible in the code generated for most of the __sync* functions; I'll put up a fix. For example:

    unsigned int sum = 0;

    sum += __sync_swap(&x, 1);
    sum += g;

    sum += __sync_fetch_and_add(&x, 1);
    sum += g;
and the corresponding generated assembly:

	.loc	1 51 12 prologue_end    // main.cpp:51:12 // sum += __sync_swap(&x, 1);
	orr	w8, wzr, #0x1
.Ltmp20:
.LBB1_1:                                // =>This Inner Loop Header: Depth=1
	ldaxr	w9, [x0]
	stlxr	w10, w8, [x0]
	cbnz	w10, .LBB1_1
// BB#2:
.Ltmp21:
	//DEBUG_VALUE: Goo:sum <- %W9
	.loc	1 52 12                 // main.cpp:52:12 // sum += g;
	ldr		w8, [x1]
	.loc	1 52 9 is_stmt 0        // main.cpp:52:9
	add		w8, w8, w9
.Ltmp22:
	//DEBUG_VALUE: Goo:sum <- %W8
.LBB1_3:                                // =>This Inner Loop Header: Depth=1
	.loc	1 54 12 is_stmt 1       // main.cpp:54:12 // sum += __sync_fetch_and_add(&x, 1);
	ldaxr	w9, [x0]
	add	w10, w9, #1             // =1
	stlxr	w11, w10, [x0]
	cbnz	w11, .LBB1_3
// BB#4:
	.loc	1 55 12                 // main.cpp:55:12 // sum += g;
	ldr		w10, [x1]
	.loc	1 54 9                  // main.cpp:54:9
	add		w8, w8, w9

@sdmaclea
Author

sdmaclea commented Apr 15, 2018

@kouvel What do you think is wrong with the above code? Is it that it doesn't match CoreCLR assumptions for the C++ Interlocked operations?

@kouvel
Copy link
Member

kouvel commented Apr 15, 2018

Yes. Given the way the __sync* functions are used, for instance in the PAL's interlocked operations, it's expected that the load of g occurs deterministically after the store of x is completed (and is visible to other threads with respect to other interlocked operations).

@sdmaclea
Author

@kouvel Can I close this?

@kouvel
Member

kouvel commented Apr 15, 2018

Ya, I'll follow up in a separate PR, closing.
