
RyuJIT drops unused volatile reads. #6280

Closed
pgavlin opened this issue Jul 7, 2016 · 17 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI bug help wanted [up-for-grabs] Good issue for external contributors JitUntriaged CLR JIT issues needing additional triage memory model issues associated with memory model
Milestone

Comments

@pgavlin
Contributor

pgavlin commented Jul 7, 2016

If the following program is compiled with optimizations enabled (e.g. the /o+ flag is provided on the csc command line), RyuJIT will drop the volatile read of a:

using System.Runtime.CompilerServices;

static class C
{
    static volatile int a;
    static int b;

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void LightweightStoreFence()
    {
        a = 1;
        var dummy = a;
        b = 2;
    }

    static void Main()
    {
        LightweightStoreFence();
    }
}

The IL as reported by a JIT dump is:

IL_0000  17                ldc.i4.1    
IL_0001  fe 13             volatile.   
IL_0003  80 01 00 00 04    stsfld       0x4000001
IL_0008  fe 13             volatile.   
IL_000a  7e 01 00 00 04    ldsfld       0x4000001
IL_000f  26                pop         
IL_0010  18                ldc.i4.2    
IL_0011  80 02 00 00 04    stsfld       0x4000002
IL_0016  2a                ret

The generated assembly is:

G_M23768_IG01:        ; func=00, offs=000000H, size=0000H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
G_M23768_IG02:        ; offs=000000H, size=0014H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref

IN0001: 000000 mov      dword ptr [reloc classVar[0xd2155a38]], 1
IN0002: 00000A mov      dword ptr [reloc classVar[0xd2155a58]], 2

G_M23768_IG03:        ; offs=000014H, size=0001H, epilog, nogc, emitadd

IN0003: 000014 ret

From the dump, it appears that the importer drops the volatile ldsfld on the floor when it imports the pop instruction. This is not correct per the ECMA spec, which states that volatile operations may not be removed or coalesced.

Thanks to @omariom for reporting the original issue in #6257.

category:correctness
theme:volatile
skill-level:expert
cost:medium

@omariom
Contributor

omariom commented Jul 7, 2016

It doesn't happen if the result of volatile read is put in an observable location.
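A sketch of the workaround being described (the `sink` field is illustrative, not from the original report): routing the read's result into a location the JIT must treat as observable keeps the load alive.

```csharp
using System.Runtime.CompilerServices;

static class C
{
    static volatile int a;
    static int b;
    static int sink; // observable location: a static field whose store cannot be proven dead

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void LightweightStoreFence()
    {
        a = 1;
        sink = a; // the volatile read now feeds a visible store, so it is not dropped
        b = 2;
    }
}
```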

@benaadams
Member

If it happens in stack space, did it really happen? (Understand the motivation is for a barrier)

@GSPP

GSPP commented Jul 11, 2016

Is it ever observable if an unused volatile read is removed? ECMA surely cannot mean that they may never be dropped, because dropping one is not detectable. Coalescing is likewise fine because it is not detectable: it just looks as if the two reads happened without any interleaved writes, which was a possible outcome anyway.

So even if ECMA says otherwise, that must refer to the abstract machine, not to the generated x86. Under the as-if rule this is OK.

@mikedn
Contributor

mikedn commented Jul 11, 2016

ECMA surely does not say they can't be dropped because it's not detectable that they are dropped

The rationale provided by ECMA references memory-mapped hardware registers, and as such it's pretty clear that there's no such thing as an "unused volatile read".

@GSPP

GSPP commented Jul 11, 2016

I did not know that ECMA assigned hardware-register semantics to volatile! This seems like a bad hybrid of C++ volatile and memory barriers. ECMA should be changed to drop support for hardware registers in memory and assign only barrier semantics to volatile. .NET is not used in kernel mode, so that scenario is unrealistic.

Indeed, I.12.6.7 makes this optimization illegal. I think this is a design defect in ECMA: the rule should be changed, and the JIT should ignore it, since adhering to that paragraph can't possibly do any good.

This is not a breaking change, since hardware registers were never accessible from user mode; nobody could have relied on this rule.

@mikedn
Contributor

mikedn commented Jul 11, 2016

This seems like a bad hybrid of C++ volatile and memory barriers

More or less. It's what VC++ does with volatile and is an extension of the C++ standard.

ECMA should be changed to not support hardware registers in memory and only assign barrier semantics to volatile.

I'm not sure what actual problem that would solve.

@GSPP

GSPP commented Jul 11, 2016

It would allow the JIT to coalesce volatile accesses. Currently, that is illegal (although I think the JIT should just not comply).

C++ compilers already do this for std::atomic.

@pgavlin
Contributor Author

pgavlin commented Jul 11, 2016

Even if the ECMA memory model only dealt with barrier semantics, the removal of a volatile read is potentially observable. For example, given the code in the original post, if the load is removed, the compiler is free to reorder the non-volatile write to b with respect to the volatile write to a that precedes it. If another thread were to read b before reading a (ensuring ordering using volatile reads), that thread may observe the reordered writes. The generated code does not exhibit that sort of code motion, but it is theoretically possible (though unlikely given the types of optimizations currently performed by the JIT).
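As a sketch of the scenario described above (the observer code is hypothetical, not from the issue; field names follow the original program): if the unused volatile read is dropped and the writes are then reordered, a second thread can observe `b == 2` while `a == 0`.

```csharp
static class ReorderingSketch
{
    static volatile int a;
    static int b;

    static void Writer()
    {
        a = 1;         // volatile write
        var dummy = a; // volatile read: if dropped, the write to b below
                       // may legally move above the volatile write to a
        b = 2;
    }

    static void Reader()
    {
        int rb = b; // plain read
        int ra = a; // volatile (acquire) read
        if (rb == 2 && ra == 0)
        {
            // Reachable only if the writes were reordered, which removing
            // the unused volatile read of 'a' would make legal.
        }
    }
}
```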

@GSPP

GSPP commented Jul 11, 2016

Yes, the barrier must stay, but the load can go. For example, it could be a dummy load from a stack location. Ideally, the compiler would have an internal representation for a partial barrier; code generation could then pick the cheapest instructions for it.

@pgavlin
Contributor Author

pgavlin commented Jul 11, 2016

Yes, the barrier must stay, but the load can go. For example, it could be a dummy load from a stack location. Ideally, the compiler would have an internal representation for a partial barrier; code generation could then pick the cheapest instructions for it.

Yes, that's a good point.

This is not a breaking change since hardware registers never were accessible to user mode. Nobody could have made use of this rule.

I'm not terribly familiar with the intricacies of user-mode drivers, but this statement only holds so long as there is no way for a user-mode program to access device memory directly. If such memory has ever been accessible to a user-mode .NET program (e.g. via a call to MmMapLockedPagesSpecifyCache in a cooperating kernel-mode driver), then allowing the removal of unused volatile reads may break programs that expect them to be side-effecting.

@mikedn
Contributor

mikedn commented Jul 12, 2016

It would allow the JIT to coalesce volatile accesses. Currently, that is illegal (although I think the JIT should just not comply). C++ compilers already do this for std::atomic.

Coalesce as in "eliminate redundant loads"? Again, I'm not sure what actual problem that would solve.

Also, I don't remember seeing a C++ compiler do that, which is not surprising considering that it's not exactly trivial to do correctly.

@GSPP

GSPP commented Jul 12, 2016

Here are some resources that I liked:

  1. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
  2. https://gcc.gnu.org/wiki/Atomic/GCCMM/Optimizations
  3. There was an article demonstrating that GCC optimized a loop using std::atomic. I'm pretty sure it was on http://preshing.com/ but I can't find it right now.

Other compilers are starting to build understanding about (partial) barriers into their engines. I'm personally not sure that time is well spent for .NET since .NET code usually does not spend much time with such low level primitives. But it's useful to understand what's possible and what approaches are being taken by others.

@pgavlin
Contributor Author

pgavlin commented Jul 12, 2016

@GSPP these are good resources; thanks for sharing them!

Regarding the original issue: the likelihood of changes to ECMA, or of a decision to disregard the semantics it prescribes, is remote. A more realistic (but not necessarily likely) path forward for enabling some of these transformations is a new set of APIs, similar to the Volatile.* APIs, that provide the desired barrier semantics. If you think such APIs would be valuable, I'd encourage you to file an issue to that effect.

@mikedn
Contributor

mikedn commented Jul 12, 2016

Here are some resources that I liked:

Exaggerating a bit those resources can be summed up by the following quote from N4455:

We aren’t aware of compilers performing this optimization yet, but it is being discussed. std::atomic_signal_fence could be used to prevent this reordering and fusion, or one could use a stronger memory ordering for the operations: this optimization is only valid on relaxed operations which aren’t ordered with respect to each other.

That "being discussed" discussion is a link to an LLVM bug created some 3 years ago. After more than a year someone replied and asked for a use case; nothing has happened since.

I'm not saying that there aren't any such optimizations that are useful but it's very difficult to justify changing an existing spec based on what at the moment looks more like speculation. As already pointed out by @pgavlin above this would be better handled by adding new APIs, similar to what C++ did with std::atomic. Hacking volatile just won't do.

@mikedn
Contributor

mikedn commented Jul 12, 2016

Even if the ECMA memory model only dealt with barrier semantics, the removal of a volatile read is potentially observable. For example, given the code in the original post, if the load is removed, the compiler is free to reorder the non-volatile write to b with respect to the volatile write to a that precedes it

And even on x86, where stores aren't reordered with other stores, you can still run into issues if a load is removed, because that load might have been there to prevent store/load reordering (a later load moving ahead of an earlier store), which x86 does allow. Consider Intel's store/load reordering example:

x = 1       y = 1
r1 = y      r2 = x

It is possible to get r1 == 0 and r2 == 0. What if we don't want that to happen? Let's add a load from the previously stored location:

x = 1       y = 1
t = x       t = y
r1 = y      r2 = x

The added load can't be reordered with the previous store because they access the same location. It also can't be reordered with the following load because x86 (supposedly) doesn't reorder loads with other loads.

Now, imagine what happens if the JIT comes in and eliminates the added load because t isn't used anywhere.
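The pattern above, transcribed into C# terms (illustrative only; the original example is pseudo-code, and the `t = x` / `t = y` read-backs are exactly the unused volatile reads at risk of elimination):

```csharp
static class StoreLoadExample
{
    static volatile int x;
    static volatile int y;

    static void Thread1()
    {
        x = 1;
        var t = x;  // read-back from the same location, meant to keep the
                    // following load of y from moving above the store to x
        int r1 = y;
    }

    static void Thread2()
    {
        y = 1;
        var t = y;  // same idea on the other thread
        int r2 = x;
    }
}
```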

That said, fixing the JIT's volatile load elimination doesn't actually fix this; that's another story. At least then the JIT will follow Intel's memory model, even if that doesn't accurately match .NET's.

@VSadov
Member

VSadov commented Mar 30, 2022

I believe this can be closed.

ECMA reserves the possibility of using device memory, where reads may have nontrivial side effects, but the possibility of such a scenario in combination with the GC seems very doubtful.
If device memory is supported one day, it would need to be very explicit unmanaged access, most likely via special intrinsics (i.e. Device.Read<T>(T* ptr)).

For the regular case we should treat the effect of a volatile read as "once we read, we may observe a different value," not as "once we read, a bulb lights up on the wall."
Therefore an unused volatile read has no observable effects.

With the above said, the JIT is right. Unused volatile reads can be dropped.

@VSadov VSadov self-assigned this Sep 5, 2022
@VSadov VSadov added the memory model issues associated with memory model label Sep 7, 2022
@VSadov
Member

VSadov commented Dec 27, 2022

Per the memory model document, it is permitted to drop unused reads (volatile or ordinary).

NB: coalescing adjacent volatile reads is not permitted, but dropping unused ones is ok.
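A small sketch of that distinction (illustrative only, assuming a volatile field `v`):

```csharp
static class DropVsCoalesce
{
    static volatile int v;

    static void Examples()
    {
        // Dropping: permitted. The read's value is never used, and under the
        // .NET memory model an unused volatile read has no observable effect.
        var unused = v;

        // Coalescing: not permitted. Both reads must actually happen; folding
        // them into a single read would hide a write by another thread that
        // lands between them.
        int r1 = v;
        int r2 = v;
    }
}
```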

@VSadov VSadov closed this as completed Dec 27, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jan 26, 2023