[Arm64] Fix WorkStealingQueue memory ordering #17508
Conversation
I won't be able to take a look immediately (out of office for personal reasons); my current expectation is that I'll be able to look at this by end of week or earlier.
Force-pushed from 094c69e to 9ff8e86
@jkotas thoughts on getting this in by tomorrow for 2.1?
The delta seems to be a combination of correctness fixes and optimizations. Optimizations of tricky lock-free code always come with a bug tail. I think I would be ok with the minimal set of correctness fixes in 2.1, but not the optimizations.
This also appears to be adding interlocked operations where they weren't before, e.g. changing the arg to SpinLock.Exit to true. We wouldn't want to incur those costs for non-ARM64.
It is hard for me to identify exactly which is correctness and which is optimization. It was implemented as an audit and rewrite followed by debugging. I'll make a first attempt.
It doesn't have to be an interlocked operation. I would propose a volatile write.
I don't understand. It already is a volatile write when useMemoryBarrier is false, and an interlocked when useMemoryBarrier is true. The
I didn't notice that. Like I said
I started revising this down to the minimal set of changes. There are two symptoms with the original code.
The root cause seems to be that the LocalPop() fast path made a faulty assumption that only one item could be stolen by TrySteal(). The number of stolen items can be significantly higher (unbounded), especially in cases where the Local thread is preempted. Because the symptoms were fixed in the order above, it seems the Exchange on m_array may not be necessary in the minimal set of changes. I need to test to confirm.
To fix the
Force-pushed from 9ff8e86 to c0ef151
@jkotas I have greatly simplified the patch and tried to only include critical fixes. The changes are now all in
@sdmaclea, looking at the original code, I'm trying to understand what the issue is that can cause LocalPop and TrySteal to pop/steal the same item. I can see that would be an issue if the two Interlocked.Exchanges here do not imply a full memory barrier. Otherwise, my rationale is that either:
Does Interlocked.Exchange imply a full memory barrier on arm64? Other than that, perhaps the initial reads of m_headIndex and m_tailIndex in both functions (and others) should be volatile reads, depending on uses from outside, so that they would not miss an existing item. If there's some other potential ordering issue, could you please elaborate?
It looks like Interlocked.Exchange is just an InternalCall (not an intrinsic) and translates to __sync_swap, whose documentation states that it is a full barrier (assuming that means relative to all memory and not just the address being referenced).
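For reference, a minimal sketch (hypothetical variables; assuming clang, since __sync_swap is a clang extension) of the full-barrier property being described:

```cpp
#include <cstdint>

uint32_t data;           // plain shared data
volatile uint32_t flag;  // published via the swap

// If __sync_swap is a true full barrier, neither the earlier plain store
// nor any later load may be reordered across the swap -- relative to all
// memory, not just the address being exchanged.
void publish(uint32_t v)
{
    data = v;                     // must be visible before flag == 1
    (void)__sync_swap(&flag, 1u); // full-barrier exchange (clang builtin)
}
```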
I expect the Exchange to currently be a Load Acquire Exclusive and Store Release Exclusive. So it is better than a full barrier. If I understand it, the issue occurred on the
On arm/arm64 the tail exchange is guaranteed to be observed before any other Local write is observed. We then read head, which is guaranteed to be read after the exchange, but is not guaranteed to see all pending writes from other observers. If (head > tail) we take the fast path. We read m_array[idx], check for null, and write null. There is insufficient handshaking with TrySteal() to guarantee that TrySteal() has not run more than once. With 48 threads on Centriq, it is highly likely multiple threads attempt to steal in these tests.
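To make that concrete, here is a minimal C++ sketch (names invented, assuming C++11 atomics) of why an acquire/release exchange on tail followed by a read of head is not enough -- it is the classic store-buffering pattern:

```cpp
#include <atomic>

// Hypothetical model of the LocalPop fast path described above.
std::atomic<int> headIndex{0}, tailIndex{1};

bool LocalPopFastPath()
{
    // Publish the decremented tail. The acquire/release exchange keeps the
    // later head load from being hoisted above it in this thread...
    int tail = tailIndex.load(std::memory_order_relaxed) - 1;
    tailIndex.exchange(tail, std::memory_order_acq_rel);

    // ...but acquire/release does not force this load to observe a head
    // update that TrySteal has already performed on another core, so both
    // threads can conclude that the same slot is still available.
    int head = headIndex.load(std::memory_order_acquire);
    return head <= tail; // fast path: claim m_array[tail]
}
```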
There may be a pending write to head in TrySteal that is not observed by LocalPop. That means LocalPop will pop the item through the fast path, but it also means that TrySteal will fail to steal that item because its exchange/barrier must have occurred after LocalPop's exchange/barrier and TrySteal will observe the updated tail. Or did I miss something?
Most of what TrySteal does is inside the lock, so I don't see how multiple TrySteal threads could surface additional ordering issues compared to just one TrySteal thread.
SpinLock does have a Release barrier on Exit(). As @stephentoub pointed out
Multiple TrySteal() calls can run. The
With multiple threads trying to steal work from a single-producer Local queue, you could see how many items could be consumed. The only thing that will stop them is the observation of an empty queue. Since they are guaranteed to observe each other's writes to
So in the new code
The exchange effectively orders reads with respect to writes. This means that the
Continuing with the new code
So if you look carefully at both
Continuing with the new code. So the question remains: does the new ordering guarantee we will never
The
The
Given the ordering analysis above, we have:

Local: Write tail --> Read head (Local and Foreign observed)

So we have six possible orderings of events. Non-overlapping

In all cases, if the head and tail get out of order, one of them can fix the underflowed queue. I think this is sufficient to demonstrate that the new code works.
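One way to check the claim is as a store-buffering litmus test. A minimal C++ sketch (hypothetical names, assuming C++11 atomics): with full-barrier (sequentially consistent) exchanges, at least one side must observe the other's update, so both sides cannot claim the same slot:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Local publishes tail then reads head; Foreign publishes head then reads
// tail (one item in the queue: head == 0, tail == 1 initially).
std::atomic<int> head{0}, tail{1};
int localSawHead, foreignSawTail;

void local()   { tail.exchange(0, std::memory_order_seq_cst);
                 localSawHead  = head.load(std::memory_order_seq_cst); }
void foreign() { head.exchange(1, std::memory_order_seq_cst);
                 foreignSawTail = tail.load(std::memory_order_seq_cst); }

int main()
{
    std::thread a(local), b(foreign);
    a.join(); b.join();
    // With seq_cst ordering, "both read the initial values" is forbidden,
    // so at least one thread backs off the fast path.
    assert(localSawHead == 1 || foreignSawTail == 0);
    return 0;
}
```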
So now, why was the old code broken?
Thanks @sdmaclea, I'll get back shortly.
@kouvel I think my analysis of the old C# code was completely wrong. I deleted some of my most recent comments.

I just took a look at a C# 6.0 draft spec for the

I just reread the ARMv8 ARM barrier-ordering section. From what I understand of the C# spec, the arm64 JIT handling of volatile and GT_XCHG, and the ARMv8 spec, it seems the intended JIT behavior is correct. I think I need to look at JIT-generated code...
The actual disassembly of the critical section of
So the code to read the

The net result is that my analysis was correct, or sort of correct, but for the wrong reasons. The load of
Looking at the available documentation for

I think it is reasonable to assume the

Therefore functionally correct. So
I will revise the patch and test. |
Force-pushed from c0ef151 to 114ff67
The patch is revised. The revised fix is working on linux-arm64. The patch now certainly meets @jkotas' minimal-change requirements. PTAL
Though it is not sufficiently documented (documentation issues in this area don't seem uncommon), for historical reasons for .NET anyway (based on how it works on x86/x64), I believe any of the interlocked operations requires at minimum:

Based on what I gather from this, I tried compiling the following with the vc++ compiler targeting arm64:

```cpp
#include <Windows.h>

volatile uint32_t g_y;

int main()
{
    uint32_t x = 0;
    uint32_t y = InterlockedExchange(&x, 1);
    g_y = y;
    return 0;
}
```

And I see the following relevant code generated for the interlocked operation in main:

```asm
; 9235 : return (unsigned) _InterlockedExchange((volatile long*) Target, (long) Value);

  00010  mov    w11,#1
  00014  mov    x10,sp
  00018  |$LN5@main|
  00018  ldaxr  w8,[x10]
  0001c  stlxr  w9,w11,[x10]
  00020  cbnz   w9,|$LN5@main|
  00024  dmb    ish

; File main.cpp
; 6 : volatile uint32_t y = InterlockedExchange(&x, 1);

  00028  str    w8,[sp]
```

The

So I think that dmb is critical to get the expected behavior. Any idea why the dmb is not there for the exchange in the generated code you posted? I think a proper fix needs to be in the PAL, since the code generated for the interlocked operations would be clang-generated code unless I'm mistaken. It may also affect other interlocked operations besides Exchange.

RE your latest change, I'll have to look at it more closely with regard to removal of the interlocked operation; I'll get back on this.
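As a point of comparison, a minimal C++ sketch (assuming C++11 atomics and typical gcc/clang arm64 lowering) of one way to obtain that trailing full barrier: follow the exchange with an explicit sequentially consistent fence, which compiles to dmb ish:

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> x{0};

// On arm64 the exchange lowers to an ldaxr/stlxr loop; the explicit fence
// lowers to dmb ish, so later loads cannot be satisfied before the swap
// has completed.
uint32_t exchange_full_barrier(uint32_t v)
{
    uint32_t old = x.exchange(v, std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst); // dmb ish
    return old;
}
```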
The docs for
Looking at this more closely, I don't see how the processor could legally perform that reordering, considering that it would be reordering a load into a loop that may incur a load-acquire on retry, which would refute the legality of the reordering. Perhaps it could do a speculative load prior to the stlxr and retain the loaded value if the stlxr does not fail. It seems like a stretch though; is that actually possible?
@sdmaclea you mentioned JIT-generated code above; is the Interlocked.Exchange code that you posted above generated by the JIT? Maybe I missed how the JIT manages to treat interlocked operations as intrinsics. If the code is generated by the JIT, then maybe the fix should be there.
JIT code is here: coreclr/src/jit/codegenarm64.cpp, lines 2657 to 2862 (at 4942bb1).
If we are certain we want all interlocked operations to have full-barrier semantics, the fix is trivial. Based on the above, it seems you are asserting that should be the expected behavior. gcc is moving exchanges from the older __sync* builtins to newer intrinsics which allow specifying the ordering behavior: None, Acquire, Release, AcquireRelease, Full...
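For illustration, a sketch of those newer builtins (the gcc/clang __atomic forms, which take an explicit ordering argument, unlike the legacy __sync forms; variable names are hypothetical):

```cpp
#include <cstdint>

uint32_t x;

void ordering_flavors()
{
    // The same swap at three ordering strengths:
    uint32_t a = __atomic_exchange_n(&x, 1u, __ATOMIC_RELAXED); // no ordering
    uint32_t b = __atomic_exchange_n(&x, 1u, __ATOMIC_ACQ_REL); // acquire + release
    uint32_t c = __atomic_exchange_n(&x, 1u, __ATOMIC_SEQ_CST); // full barrier
    (void)a; (void)b; (void)c;
}
```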
There's a fair amount of code, both in coreclr/corefx and external to it, that expects these semantics. If we don't do that, I expect you'll be chasing a long tail of these kinds of race conditions.
Currently, w/o #17567, the interlocked operations have Load Acquire and Store Release semantics. In the vast majority of cases that should be enough. For the 2.1 release, #17567 seems safest. My hesitation would be that the performance cost of the extra barriers is significant. However, as I have experienced, tracking down these obscure races also takes significant manpower. Perhaps the long-term solution might be to extend the API to take an optional barrier type on these Interlocked methods, following the same evolution taken by gcc et al. Given that, my recommendation would be to prefer #17567 over #17508 for the 2.1 release.
Yes, it is possible and typical for modern ARM CPUs.
That sounds good to me as well; it seems like a good fix. It looks like for an Exchange where the result is not used, it can just be:

```asm
  00000  mov   w9,#1
  00004  stlr  w9,[x0]
  00008  dmb   ish
```

stlr is necessary so that things can't be reordered after it, and dmb is necessary so that things can't be reordered before the store. Not sure if the JIT also takes into consideration whether the result is used (haven't looked at #17567 yet).

@sdmaclea I wonder if this also needs a fix in the PAL, which uses the __sync* functions. Would you be able to double-check the code generated for those as well? I'll also try to take a look and see if I can figure out how to do that.
From cross-compiling for arm64 with clang-3.9, it looks like the same reordering is possible in the code generated for most of the __sync* functions; I'll put up a fix. For example:

```cpp
unsigned int sum = 0;
sum += __sync_swap(&x, 1);
sum += g;
sum += __sync_fetch_and_add(&x, 1);
sum += g;
```

```asm
	.loc 1 51 12 prologue_end     // main.cpp:51:12   // sum += __sync_swap(&x, 1);
	orr   w8, wzr, #0x1
.Ltmp20:
.LBB1_1:                          // =>This Inner Loop Header: Depth=1
	ldaxr w9, [x0]
	stlxr w10, w8, [x0]
	cbnz  w10, .LBB1_1
// BB#2:
.Ltmp21:
	//DEBUG_VALUE: Goo:sum <- %W9
	.loc 1 52 12                  // main.cpp:52:12   // sum += g;
	ldr   w8, [x1]
	.loc 1 52 9 is_stmt 0         // main.cpp:52:9
	add   w8, w8, w9
.Ltmp22:
	//DEBUG_VALUE: Goo:sum <- %W8
.LBB1_3:                          // =>This Inner Loop Header: Depth=1
	.loc 1 54 12 is_stmt 1        // main.cpp:54:12   // sum += __sync_fetch_and_add(&x, 1);
	ldaxr w9, [x0]
	add   w10, w9, #1             // =1
	stlxr w11, w10, [x0]
	cbnz  w11, .LBB1_3
// BB#4:
	.loc 1 55 12                  // main.cpp:55:12   // sum += g;
	ldr   w10, [x1]
	.loc 1 54 9                   // main.cpp:54:9
	add   w8, w8, w9
```
@kouvel What do you think is wrong with the above code? Is it that it doesn't match CoreCLR assumptions for the C++ Interlocked operations?
Yes. Given the way the __sync* functions are used, for instance in the PAL's interlocked operations, it's expected that the load of
@kouvel Can I close this?
Ya, I'll follow up in a separate PR. Closing.
Fixes #17178
Audited and reworked WorkStealingQueue to take into account multi-threaded memory ordering issues encountered in high-core-count arm64 testing.
This fixes easily reproducible errors in System.Threading.Tasks.Tests, System.Threading.Channel.Tests, and System.Threading.Tasks.Dataflow.Tests. The same issue was making the linux-arm64 SDK unusable.

This is very important for linux-arm64 for the 2.1 release. I am very happy to put the patch inside a #if ARM64 if this is deemed too risky for other platforms. (I suspect arm32 needs a similar/identical patch.)

@BruceForstall FYI