Generic dictionary lookup race condition on expansion due to JIT codegen issue #101131
Labels
area-CodeGen-coreclr
CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Description
The failure occurs in a method compiled for shared generics and requires several specific conditions. Three threads are involved:
a. Thread A is attempting to access slot 4
b. Thread B is attempting to access slot 5
c. Thread C is attempting to access slot 4
The threads then interleave in the following order:
a. Thread A attempts to access slot 4, determines the value which should be placed in the slot, and gathers the address of the slot which should be updated
b. Thread B attempts to access slot 5, determines the value which should be placed in the slot, determines that accessing the slot requires expanding the generic dictionary, expands it, and copies the current state of the slots into the newly expanded dictionary. Notably, since thread A has not yet written to slot 4, that slot remains NULL in the copy. However, the expanded dictionary is not yet written into the MethodDesc.
c. Thread A fills in slot 4 in the old dictionary with the new value.
d. Thread C reads the value in slot 4, and determines that it is not NULL, and so commits to the fast path of generic dictionary lookup.
e. Thread B writes the newly expanded dictionary into the MethodDesc
f. Thread C then reads slot 4 again, this time through the newly published dictionary. Since the expanded dictionary still holds NULL in slot 4, the rest of execution fails unexpectedly.
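The interleaving above can be replayed deterministically on a single thread to see why the second read returns NULL. This is only a minimal sketch: `Dictionary`, `MethodDescModel`, and `ReplayRace` are hypothetical stand-ins invented for illustration, and the dictionary sizes (5 slots before expansion, 8 after) are assumptions, not the runtime's actual layout.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model types; these are NOT the runtime's real data structures.
using Slot = const void*;
using Dictionary = std::vector<Slot>;

struct MethodDescModel {
    Dictionary* dictionary;  // the pointer that is swapped when the dictionary expands
};

// Replays steps a-f in a fixed order and returns what Thread C observes
// on its second read of slot 4.
inline Slot ReplayRace() {
    static const int valueForSlot4 = 4;

    Dictionary oldDict(5, nullptr);  // assumed size: slots 0..4, all unresolved
    MethodDescModel md{&oldDict};

    // a. Thread A resolves its value and captures the slot address in the OLD dictionary.
    Slot* slotSeenByA = &(*md.dictionary)[4];

    // b. Thread B builds an expanded copy. Slot 4 is copied while it is still NULL.
    Dictionary newDict(*md.dictionary);
    newDict.resize(8, nullptr);      // assumed expanded size

    // c. Thread A fills in slot 4, but only in the old dictionary.
    *slotSeenByA = &valueForSlot4;

    // d. Thread C reads slot 4, sees a non-NULL value, and commits to the fast path.
    Slot firstRead = (*md.dictionary)[4];
    assert(firstRead != nullptr);

    // e. Thread B publishes the expanded dictionary.
    md.dictionary = &newDict;

    // f. Thread C reads slot 4 again through the MethodDesc and gets the stale copy.
    return (*md.dictionary)[4];
}
```

Calling `ReplayRace()` returns a null pointer even though the check in step d passed, which is the state that the fast path is not prepared to handle.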
The issue is caused by incorrect codegen by the JIT, at least in Tier0 mode.
The failing logic is here
The problem is that the dictionary returned by
00007ffb`b9e1b4e4 488b4910 mov rcx,qword ptr [rcx+10h]
can be updated by another thread, which can cause the actual dictionary entry to be resolved without a proper NULL check.
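As a rough illustration of that pattern, the lookup can be modeled as two separate loads of the dictionary pointer: one for the NULL check and one for the actual use. Everything here is an assumption made for the sketch: `MethodDescModel`, `LookupResult`, and `RacyLookup` are hypothetical names, and the `betweenLoads` callback simply stands in for whatever another thread does between the two loads. It is not the JIT's actual output.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>

namespace racy_model {

using Slot = const void*;

// Hypothetical model of the MethodDesc field loaded by `mov rcx, [rcx+10h]`.
struct MethodDescModel {
    explicit MethodDescModel(Slot* d) : dictionary(d) {}
    std::atomic<Slot*> dictionary;
};

struct LookupResult {
    bool tookFastPath;  // true if the NULL check passed
    Slot value;         // the entry the fast path would go on to use
};

// Loads the dictionary pointer twice: once to check the slot, once to use it.
inline LookupResult RacyLookup(MethodDescModel& md, std::size_t index,
                               const std::function<void()>& betweenLoads) {
    Slot* dictForCheck = md.dictionary.load(std::memory_order_acquire);
    assert(dictForCheck != nullptr);
    if (dictForCheck[index] == nullptr) {
        return {false, nullptr};  // would fall back to the slow path
    }

    betweenLoads();  // another thread may publish an expanded dictionary here

    Slot* dictForUse = md.dictionary.load(std::memory_order_acquire);
    return {true, dictForUse[index]};  // used without a second NULL check
}

}  // namespace racy_model
```

If the callback swaps in an expanded dictionary whose slot is still NULL, the function reports that it took the fast path while handing back a null entry.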
There are 2 possibilities for valid code here.
Either we could generate code that looks like…
Where we only read the dictionary pointer once.
OR
Where we read the dictionary entry once.
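The two options can be sketched in C++ terms as follows. This is a hedged model rather than the code the JIT should emit: the names are hypothetical, `betweenLoads` again simulates interference from another thread, and the first variant additionally assumes that the old dictionary stays alive after expansion and that a filled slot is never reset to NULL.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>

namespace fixed_model {

using Slot = const void*;

// Hypothetical model of the MethodDesc dictionary field.
struct MethodDescModel {
    explicit MethodDescModel(Slot* d) : dictionary(d) {}
    std::atomic<Slot*> dictionary;
};

struct LookupResult {
    bool tookFastPath;
    Slot value;
};

// Option 1: read the dictionary pointer once; check and use go through the same pointer.
// Assumes the old dictionary is not freed and its filled slots are never cleared.
inline LookupResult LookupReadingPointerOnce(MethodDescModel& md, std::size_t index,
                                             const std::function<void()>& betweenLoads) {
    Slot* dict = md.dictionary.load(std::memory_order_acquire);
    assert(dict != nullptr);
    if (dict[index] == nullptr) {
        return {false, nullptr};  // slow path
    }
    betweenLoads();               // a swap of md.dictionary no longer matters
    return {true, dict[index]};
}

// Option 2: read the dictionary entry once; check and use the same loaded value.
inline LookupResult LookupReadingEntryOnce(MethodDescModel& md, std::size_t index,
                                           const std::function<void()>& betweenLoads) {
    Slot* dict = md.dictionary.load(std::memory_order_acquire);
    assert(dict != nullptr);
    Slot entry = dict[index];
    betweenLoads();               // the value under test is already in a local
    if (entry == nullptr) {
        return {false, nullptr};  // slow path
    }
    return {true, entry};
}

}  // namespace fixed_model
```

In both variants the value that passes the NULL check is the value that gets used, so publishing an expanded dictionary in between can at worst send the lookup down the slow path.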
Reproduction Steps
Run code that uses many threads on subtly different code paths. This particular failure occurs in Roslyn, but it only reproduces when the machine running the code is under a tremendous amount of stress.
Expected behavior
Generic dictionary should be able to expand without triggering a crash.
Actual behavior
Under extremely rare conditions, a generic dictionary expansion causes the lookup to fail.
Regression?
No, this failing behavior has been in place since 2020.
Known Workarounds
No response
Configuration
No response
Other information
No response