[Arm64] Fix WorkStealingQueue memory ordering #17508
Conversation
I won't be able to take a look immediately (out of office for personal reasons); my current expectation is that I'll be able to look at this by end of week or earlier.
Force-pushed from 094c69e to 9ff8e86
@jkotas thoughts on getting this in by tomorrow for 2.1?
The delta seems to be a combination of correctness fixes and optimizations. Optimizations of tricky lock-free code always come with a bug tail. I think I would be ok with the minimal set of correctness fixes in 2.1, but not the optimizations.
This also appears to be adding interlocked operations where they weren't before, e.g. changing the arg to SpinLock.Exit to true. We wouldn't want to incur those costs for non-ARM64.
It is hard for me to identify exactly which is correctness and which is optimization. It was implemented as an audit and rewrite followed by debugging. I'll make a first attempt.
It doesn't have to be an interlocked operation. I would propose a volatile write.
I don't understand. It already is a volatile write when useMemoryBarrier is false, and an interlocked when useMemoryBarrier is true. The
I didn't notice that. Like I said
I started revising this down to the minimal set of changes. There are two symptoms with the original code.
The root cause seems to be that the LocalPop() fast path made a faulty assumption that only one item could be stolen by TrySteal(). The number of stolen items can be significantly higher (unbounded), especially in cases where the Local thread is preempted. Because the symptoms were fixed in the order above, it seems the Exchange on m_array may not be necessary in the minimal set of changes. I need to test to confirm.
To fix the
Force-pushed from 9ff8e86 to c0ef151
@jkotas I have greatly simplified the patch and tried to only include critical fixes. The changes are now all in
@sdmaclea, looking at the original code, I'm trying to understand what the issue is that can cause LocalPop and TrySteal to pop/steal the same item. I can see that would be an issue if the two Interlocked.Exchanges here do not imply a full memory barrier. Otherwise, my rationale is that either:
Does Interlocked.Exchange imply a full memory barrier on arm64? Other than that, perhaps the initial reads of m_headIndex and m_tailIndex in both functions (and others) should be volatile reads, depending on uses from outside, so that they would not miss an existing item. If there's some other potential ordering issue, could you please elaborate?
It looks like Interlocked.Exchange is just an InternalCall (not an intrinsic) and translates to __sync_swap, whose documentation states that it is a full barrier (assuming that means relative to all memory and not just the address being referenced).
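For reference, a minimal sketch (hypothetical variables; assuming clang, since __sync_swap is a clang extension) of the full-barrier property being described:

```cpp
#include <cstdint>

uint32_t data;           // plain shared data
volatile uint32_t flag;  // published via the swap

// If __sync_swap is a true full barrier, neither the earlier plain store
// nor any later load may be reordered across the swap -- relative to all
// memory, not just the address being exchanged.
void publish(uint32_t v)
{
    data = v;                     // must be visible before flag == 1
    (void)__sync_swap(&flag, 1u); // full-barrier exchange (clang builtin)
}
```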
I expect the Exchange to currently be a Load Acquire Exclusive and Store Release Exclusive. So it is better than a full barrier. If I understand it, the issue occurred on the
On arm/arm64 the tail exchange is guaranteed to be observed before any other Local write is observed. We then read head, which is guaranteed to be read after the exchange, but is not guaranteed to see all pending writes from other observers. If (head > tail) we take the fast path. We read m_array[idx], check for null, and write null. There is insufficient handshaking with TrySteal() to guarantee that TrySteal() has not run more than once. With 48 threads on Centriq, it is highly likely multiple threads attempt to steal in these tests.
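To make that concrete, here is a minimal C++ sketch (names invented, assuming C++11 atomics) of why an acquire/release exchange on tail followed by a read of head is not enough -- it is the classic store-buffering pattern:

```cpp
#include <atomic>

// Hypothetical model of the LocalPop fast path described above.
std::atomic<int> headIndex{0}, tailIndex{1};

bool LocalPopFastPath()
{
    // Publish the decremented tail. The acquire/release exchange keeps the
    // later head load from being hoisted above it in this thread...
    int tail = tailIndex.load(std::memory_order_relaxed) - 1;
    tailIndex.exchange(tail, std::memory_order_acq_rel);

    // ...but acquire/release does not force this load to observe a head
    // update that TrySteal has already performed on another core, so both
    // threads can conclude that the same slot is still available.
    int head = headIndex.load(std::memory_order_acquire);
    return head <= tail; // fast path: claim m_array[tail]
}
```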
There may be a pending write to head in TrySteal that is not observed by LocalPop. That means LocalPop will pop the item through the fast path, but it also means that TrySteal will fail to steal that item because its exchange/barrier must have occurred after LocalPop's exchange/barrier and TrySteal will observe the updated tail. Or did I miss something?
Most of what TrySteal does is inside the lock, so I don't see how multiple TrySteal threads could surface additional ordering issues compared to just one TrySteal thread.
SpinLock does have a Release barrier on Exit(). As @stephentoub pointed out
Multiple TrySteal() calls can run. The
With multiple threads trying to steal work from a single-producer Local queue, you could see how many items could be consumed. The only thing that will stop them is the observation of an empty queue. Since they are guaranteed to observe each other's writes to
So in the new code
The exchange effectively orders reads with respect to writes. This means that the
Continuing with the new code
So if you look carefully at both
Continuing with the new code. So the question remains: does the new ordering guarantee we will never
The
The
Given the ordering analysis above, we have:

Local: Write tail --> Read head (Local and Foreign observed)

So we have six possible orderings of events. Non-overlapping

In all cases, if the head and tail get out of order, one of them can fix the underflowed queue. I think this is sufficient to demonstrate that the new code works.
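One way to check the claim is as a store-buffering litmus test. A minimal C++ sketch (hypothetical names, assuming C++11 atomics): with full-barrier (sequentially consistent) exchanges, at least one side must observe the other's update, so both sides cannot claim the same slot:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Local publishes tail then reads head; Foreign publishes head then reads
// tail (one item in the queue: head == 0, tail == 1 initially).
std::atomic<int> head{0}, tail{1};
int localSawHead, foreignSawTail;

void local()   { tail.exchange(0, std::memory_order_seq_cst);
                 localSawHead  = head.load(std::memory_order_seq_cst); }
void foreign() { head.exchange(1, std::memory_order_seq_cst);
                 foreignSawTail = tail.load(std::memory_order_seq_cst); }

int main()
{
    std::thread a(local), b(foreign);
    a.join(); b.join();
    // With seq_cst ordering, "both read the initial values" is forbidden,
    // so at least one thread backs off the fast path.
    assert(localSawHead == 1 || foreignSawTail == 0);
    return 0;
}
```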
So now, why was the old code broken?
Thanks @sdmaclea, I'll get back shortly.
@kouvel I think my analysis of the old C# code was completely wrong. I deleted some of my most recent comments.

I just took a look at a C# 6.0 draft spec for the

I just reread the ARMv8 ARM barrier-ordering section. From what I understand of the C# spec, the arm64 JIT handling of volatile and GT_XCHG, and the ARMv8 spec, it seems the intended JIT behavior is correct. I think I need to look at JIT-generated code...
The actual disassembly of the critical section of
So the code to read the

The net result is that my analysis was correct, or sort of correct, but for the wrong reasons. The load of
Looking at the available documentation for

I think it is reasonable to assume the

Therefore functionally correct. So
I will revise the patch and test. |
Force-pushed from c0ef151 to 114ff67
The patch is revised. The revised fix is working on linux-arm64. The patch now certainly meets @jkotas' minimal-change requirements. PTAL
Though it is not sufficiently documented (documentation issues in this area don't seem uncommon), for historical reasons for .NET anyway (based on how it works on x86/x64), I believe any of the interlocked operations requires at minimum:

Based on what I gather from this, I tried compiling the following with the vc++ compiler targeting arm64:

```cpp
#include <Windows.h>

volatile uint32_t g_y;

int main()
{
    uint32_t x = 0;
    uint32_t y = InterlockedExchange(&x, 1);
    g_y = y;
    return 0;
}
```

And I see the following relevant code generated for the interlocked operation in main:

```asm
; 9235 : return (unsigned) _InterlockedExchange((volatile long*) Target, (long) Value);

  00010  mov    w11,#1
  00014  mov    x10,sp
  00018  |$LN5@main|
  00018  ldaxr  w8,[x10]
  0001c  stlxr  w9,w11,[x10]
  00020  cbnz   w9,|$LN5@main|
  00024  dmb    ish

; File main.cpp
; 6 : volatile uint32_t y = InterlockedExchange(&x, 1);

  00028  str    w8,[sp]
```

The

So I think that dmb is critical to get the expected behavior. Any idea why the dmb is not there for the exchange in the generated code you posted? I think a proper fix needs to be in the PAL, since the code generated for the interlocked operations would be clang-generated code unless I'm mistaken. It may also affect other interlocked operations besides Exchange.

RE your latest change, I'll have to look at it more closely with regard to removal of the interlocked operation; I'll get back on this.
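As a point of comparison, a minimal C++ sketch (assuming C++11 atomics and typical gcc/clang arm64 lowering) of one way to obtain that trailing full barrier: follow the exchange with an explicit sequentially consistent fence, which compiles to dmb ish:

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> x{0};

// On arm64 the exchange lowers to an ldaxr/stlxr loop; the explicit fence
// lowers to dmb ish, so later loads cannot be satisfied before the swap
// has completed.
uint32_t exchange_full_barrier(uint32_t v)
{
    uint32_t old = x.exchange(v, std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst); // dmb ish
    return old;
}
```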
The docs for
Looking at this more closely, I don't see how the processor could legally perform that reordering, considering that it would be reordering a load into a loop that may incur a load-acquire on retry, which would refute the legality of the reordering. Perhaps it could do a speculative load prior to the stlxr and retain the loaded value if the stlxr does not fail. It seems like a stretch though; is that actually possible?
@sdmaclea you mentioned JIT-generated code above; is the Interlocked.Exchange code that you posted above generated by the JIT? Maybe I missed how the JIT manages to treat interlocked operations as intrinsics. If the code is generated by the JIT, then maybe the fix should be there.
JIT code is here: coreclr/src/jit/codegenarm64.cpp, lines 2657 to 2862 (at 4942bb1).
If we are certain we want all interlocked operations to have full-barrier semantics, the fix is trivial. Based on the above, it seems you are asserting that should be the expected behavior. gcc is moving exchanges from the older __sync* builtins to newer intrinsics which allow specifying the ordering behavior: None, Acquire, Release, AcquireRelease, Full...
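For illustration, a sketch of those newer builtins (the gcc/clang __atomic forms, which take an explicit ordering argument, unlike the legacy __sync forms; variable names are hypothetical):

```cpp
#include <cstdint>

uint32_t x;

void ordering_flavors()
{
    // The same swap at three ordering strengths:
    uint32_t a = __atomic_exchange_n(&x, 1u, __ATOMIC_RELAXED); // no ordering
    uint32_t b = __atomic_exchange_n(&x, 1u, __ATOMIC_ACQ_REL); // acquire + release
    uint32_t c = __atomic_exchange_n(&x, 1u, __ATOMIC_SEQ_CST); // full barrier
    (void)a; (void)b; (void)c;
}
```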
There's a fair amount of code, both in coreclr/corefx and external to it, that expects these semantics. If we don't do that, I expect you'll be chasing a long tail of these kinds of race conditions.
Currently, w/o #17567, the interlocked operations have Load Acquire and Store Release semantics. In the vast majority of cases that should be enough. For the 2.1 release, #17567 seems safest. My hesitation would be that the performance cost of the extra barriers is significant. However, as I have experienced, tracking down these obscure races also takes significant manpower. Perhaps the long-term solution might be to extend the API to take an optional barrier type on these Interlocked methods, following the same evolution taken by gcc et al. Given that, my recommendation would be to prefer #17567 over #17508 for the 2.1 release.
Yes, it is possible and typical for modern ARM CPUs.
That sounds good to me as well; it seems like a good fix. It looks like for an Exchange where the result is not used, it can just be:

```asm
  00000  mov   w9,#1
  00004  stlr  w9,[x0]
  00008  dmb   ish
```

stlr is necessary so that things can't be reordered after it, and dmb is necessary so that things can't be reordered before the store. Not sure if the JIT also takes into consideration whether the result is used (haven't looked at #17567 yet).

@sdmaclea I wonder if this also needs a fix in the PAL, which uses the __sync* functions. Would you be able to double-check the code generated for those as well? I'll also try to take a look and see if I can figure out how to do that.
From cross-compiling for arm64 with clang-3.9, it looks like the same reordering is possible in the code generated for most of the __sync* functions; I'll put up a fix. For example:

```cpp
unsigned int sum = 0;
sum += __sync_swap(&x, 1);
sum += g;
sum += __sync_fetch_and_add(&x, 1);
sum += g;
```

```asm
	.loc 1 51 12 prologue_end     // main.cpp:51:12   // sum += __sync_swap(&x, 1);
	orr   w8, wzr, #0x1
.Ltmp20:
.LBB1_1:                          // =>This Inner Loop Header: Depth=1
	ldaxr w9, [x0]
	stlxr w10, w8, [x0]
	cbnz  w10, .LBB1_1
// BB#2:
.Ltmp21:
	//DEBUG_VALUE: Goo:sum <- %W9
	.loc 1 52 12                  // main.cpp:52:12   // sum += g;
	ldr   w8, [x1]
	.loc 1 52 9 is_stmt 0         // main.cpp:52:9
	add   w8, w8, w9
.Ltmp22:
	//DEBUG_VALUE: Goo:sum <- %W8
.LBB1_3:                          // =>This Inner Loop Header: Depth=1
	.loc 1 54 12 is_stmt 1        // main.cpp:54:12   // sum += __sync_fetch_and_add(&x, 1);
	ldaxr w9, [x0]
	add   w10, w9, #1             // =1
	stlxr w11, w10, [x0]
	cbnz  w11, .LBB1_3
// BB#4:
	.loc 1 55 12                  // main.cpp:55:12   // sum += g;
	ldr   w10, [x1]
	.loc 1 54 9                   // main.cpp:54:9
	add   w8, w8, w9
```
@kouvel What do you think is wrong with the above code? Is it that it doesn't match CoreCLR assumptions for the C++ Interlocked operations?
Yes. Given the way the __sync* functions are used, for instance in the PAL's interlocked operations, it's expected that the load of
@kouvel Can I close this?
Ya, I'll follow up in a separate PR. Closing.
Fixes #17178
Audited and reworked WorkStealingQueue to take into account multi-threaded memory ordering issues encountered in high-core-count arm64 testing.
This fixes easily reproducible errors in System.Threading.Tasks.Tests, System.Threading.Channel.Tests, and System.Threading.Tasks.Dataflow.Tests. The same issue was making the linux-arm64 SDK unusable.

This is very important for linux-arm64 for the 2.1 release. I am very happy to put the patch inside a #if ARM64 if this is deemed too risky for other platforms. (I suspect arm32 needs a similar/identical patch.)

@BruceForstall FYI