
[API Proposal]: Volatile barrier APIs #98837

Open
hamarb123 opened this issue Feb 23, 2024 · 70 comments
Labels
api-ready-for-review, area-System.Threading, memory model
Milestone

Comments

@hamarb123
Contributor

hamarb123 commented Feb 23, 2024

Background and motivation

This API proposal exposes methods to perform non-atomic volatile memory operations. Our volatile semantics are explained in our memory model, but I will outline the tl;dr of the relevant parts here:

  • Usually memory accesses can be re-ordered and combined by the C# compiler, JIT, and CPU
  • Volatile writes disallow the write to occur earlier than specified (i.e., prior memory accesses must complete before this executes)
  • Volatile reads disallow the read to occur later than specified (i.e., subsequent memory accesses can only begin after this completes)
  • Memory accesses to primitive types are always atomic if they're properly aligned (unless unaligned. is used), and either 1) the size of the type is at most the size of the pointer, or 2) a method on Volatile or Interlocked such as Volatile.Write(double&, double) has been called

Currently, we expose APIs on Volatile. for the atomic memory accesses, but there is no way to perform the equivalent operations for non-atomic types. With Volatile barrier APIs, such operations would be easy to write, and it would be clear which memory operations can move past the barrier in which ways.

API Proposal

namespace System.Threading;

public static class Volatile
{
    public static void ReadBarrier();
    public static void WriteBarrier();
}

Desired semantics:

  • Volatile.ReadBarrier()

Provides a Read-ReadWrite barrier.
All reads preceding the barrier will need to complete before any subsequent memory operation.

Volatile.ReadBarrier() matches the semantics of Volatile.Read in terms of ordering reads relative to all subsequent operations in program order.

The important difference from Volatile.Read(ref x) is that Volatile.ReadBarrier() affects all preceding reads, not just a particular single read of x.

  • Volatile.WriteBarrier()

Provides a ReadWrite-Write barrier.
All memory operations preceding the barrier will need to complete before any subsequent write.

Volatile.WriteBarrier() matches the semantics of Volatile.Write in terms of ordering writes relative to all preceding operations in program order.

The important difference from Volatile.Write(ref x) is that Volatile.WriteBarrier() affects all subsequent writes, not just a particular single write of x.

The actual implementation will depend on the underlying platform.

  • On TSO architectures (x86/x64) it would have only compiler optimization-preventing effects.
  • On weaker architectures, some ordering instructions will be emitted in addition, as appropriate and available in the given ISA.
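For readers who think in C/C++11 terms, the two proposed barriers correspond roughly to standalone acquire and release fences (this mapping is my analogy, not part of the proposal). A minimal message-passing sketch, where the payload is a plain non-atomic object and only the flag is atomic:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;               // plain, non-atomic payload (could be multi-word)
std::atomic<int> ready{0};  // publication flag, accessed with relaxed ordering only

void writer() {
    data = 42;                                            // ordinary write
    std::atomic_thread_fence(std::memory_order_release);  // ~ Volatile.WriteBarrier()
    ready.store(1, std::memory_order_relaxed);            // publish after the barrier
}

int reader() {
    while (ready.load(std::memory_order_relaxed) == 0) {} // observe the flag
    std::atomic_thread_fence(std::memory_order_acquire);  // ~ Volatile.ReadBarrier()
    return data;                                          // guaranteed to see 42
}

int run() {
    std::thread t(writer);
    int v = reader();
    t.join();
    return v;
}
```

On x86/x64 both fences typically compile to no instructions at all (they only constrain the compiler), matching the TSO bullet above; on weaker architectures such as Arm64 they lower to dmb instructions.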

API Usage

The runtime uses an internal API, Interlocked.ReadMemoryBarrier(), in 2 places (here and here) to batch multiple reads on both CoreCLR and NativeAOT; it is supported on all platforms. This ability is also useful to third-party developers (such as me, in my example below), but currently cannot be written efficiently.

An example where non-atomic volatile operations would be useful is as follows. Consider a game which wants to save its state, ideally while continuing to run; these are the most obvious options:

  • Save on the main thread (can cause lag spikes if lots of data to save)
  • Create a copy of the data and then save on a separate thread (less time spent on main thread, but potentially still non-trivial, and a large number of allocations most likely)
  • Use locks around access to the memory (massive overhead)
  • Just save it on another thread without any sync (saves an inconsistent state)

But there is actually another option which utilises non-atomic volatile semantics:

  • Has low overhead, especially on x86 and x64
  • Saves a consistent state
//main thread sets IsSaving to true and increments SavingVersion before starting the saving thread, and sets IsSaving back to false once saving is definitely done (e.g., on the next frame)
//saving thread performs a full memory barrier before starting (when required, since starting a brand new thread every time isn't ideal), to ensure that _value is up-to-date
//memory synchronisation works because _value is always read before any saving information, and it's always written after the saving information
//if the version we read on the saving thread is not the current version, then our read from _value is correct, otherwise our read from _savingValue will be correct
//in the rare case that we loop to saving version == 0, then we can manually write all _savingVersion values to 0, skip to version == 1, and go from there (excluded from here though for clarity)

static class SavingState
{
    public static bool IsSaving { get; set; }
    public static nuint SavingVersion { get; set; }
}

struct SaveableHolder<T>
{
    nuint _savingVersion;
    T _value;
    T _savingValue;

    //Called only from main thread
    public T Value
    {
        get => _value;
        set
        {
            if (SavingState.IsSaving)
            {
                if (SavingVersion != SavingState.SavingVersion)
                {
                    _savingValue = _value;

                    //ensure the saving value is written before the saving version, so that we read it in the correct order
                    Volatile.Write(ref _savingVersion, SavingState.SavingVersion);
                }

                //_value can only become torn or incorrect after we have written our saving value and version
                Volatile.WriteBarrier();
                _value = value; //write must occur after prior writes
            }
            else
            {
                _value = value;
            }
        }
    }

    //Called only from saving thread while SavingState.IsSaving with a higher SavingState.SavingVersion than last time
    public T SavingValue
    {
        get
        {
            var value = Value; //read must occur before reads
            Volatile.ReadBarrier();

            //_savingVersion must be read after _value is, so if it has visibly changed (or is changing) then we will catch it here
            if (Volatile.Read(in _savingVersion) != SavingState.SavingVersion) return value;

            //volatile read on _savingVersion ensures we get an up-to-date _savingValue since it's written first
            return _savingValue;
        }
    }
}

Alternative Designs

  • We could expose read/write APIs instead:
namespace System.Threading;

public static class Volatile
{
    public static T ReadNonAtomic<T>(ref readonly T location)
    {
        //ldarg.0
        //volatile.
        //ldobj !!T
    }

    public static void WriteNonAtomic<T>(ref T location, T value)
    {
        //ldarg.0
        //ldarg.1
        //volatile.
        //stobj !!T
    }
}

We do have IL instructions, but they're currently broken and not exposed, see #91530. The proposal here was originally to expose APIs for volatile. ldobj and volatile. stobj plus the unaligned variants (as seen above), and to fix the instructions (or to implement these without the instructions and have the instructions call these APIs; there is not much of a difference really). It was changed based on feedback to expose barrier APIs, which can provide equivalent semantics but also allow additional scenarios. It is also clearer which memory operations can be reordered with the barrier APIs.

  • We could expose APIs on Interlocked instead:
public static class Interlocked
{
    // Existing API
    public static void MemoryBarrier();
    // New APIs
    public static void MemoryBarrierAcquire(); //volatile read semantics
    public static void MemoryBarrierRelease(); //volatile write semantics
}
  • We could expose the APIs on Unsafe instead:
namespace System.Runtime.CompilerServices;

public static class Unsafe
{
    public static T ReadVolatile<T>(ref readonly T location);
    public static void WriteVolatile<T>(ref T location, T value);
}
  • We could add unaligned overloads:
namespace System.Runtime.CompilerServices;

public static class Unsafe
{
    public static T ReadVolatileUnaligned<T>(ref readonly byte location);
    public static void WriteVolatileUnaligned<T>(ref byte location, T value);
}
  • We could also expose APIs for other operations which allow volatile. - initblk and cpblk, people may have use for these also:
namespace System.Runtime.CompilerServices;

public static class Unsafe
{
    public static void CopyBlockVolatile(ref byte destination, ref readonly byte source, uint byteCount);
    public static void CopyBlockVolatileUnaligned(ref byte destination, ref readonly byte source, uint byteCount);
    public static void InitBlockVolatile(ref byte startAddress, byte value, uint byteCount);
    public static void InitBlockVolatileUnaligned(ref byte startAddress, byte value, uint byteCount);
}

Open Questions

There is a question as to whether we should have Read-ReadWrite/ReadWrite-Write barriers or Read-Read/Write-Write barriers. I was initially in favour of the former (as it matches our current memory model), but now think the latter is probably better, since there are many scenarios (including in my example API usage, and the runtime's uses too) where the additional guarantees provided by the former are unnecessary, and thus may cause unnecessary overhead. We could also just provide both if we think they're both useful.

Risks

No more than other volatile/interlocked APIs really, other than potential misunderstanding of what they do.

@hamarb123 hamarb123 added the api-suggestion label Feb 23, 2024
@ghost ghost added the untriaged label Feb 23, 2024
@jkotas jkotas added the area-System.Threading label and removed the api-suggestion label Feb 23, 2024
@ghost

ghost commented Feb 23, 2024

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.


@jkotas
Member

jkotas commented Feb 23, 2024

ReadUnalignedNonAtomic(ref readonly T location)

ref T is expected to be aligned. API for manipulation of unaligned memory should not take ref T. I do not think it belongs on System.Threading.Volatile.

I think it is very rare to actually need volatile and misaligned memory together. It would make more sense to cover that scenario by Volatile.Read/WriteBarrier.

@jkotas
Member

jkotas commented Feb 23, 2024

Another alternative design:

class Interlocked
{
    // Existing API
    void MemoryBarrier();
    // New APIs
    void ReadMemoryBarrier();
    void WriteMemoryBarrier();
}

@alexrp
Contributor

alexrp commented Feb 23, 2024

@jkotas to clarify, do you mean something like MemoryBarrierAcquire()/MemoryBarrierRelease()? I ask because read/write barriers aren't typically defined to have the same semantics as acquire/release barriers, and I think it would be a mistake to introduce read/write barriers as an additional memory model concept that users have to understand.

@alexrp
Contributor

alexrp commented Feb 23, 2024

It's not great that .NET's atomic APIs are split across Volatile and Interlocked. On the one hand, MemoryBarrierAcquire()/MemoryBarrierRelease() seem like they should sit next to MemoryBarrier() on Interlocked... but then on the other hand, Interlocked currently only deals in sequential consistency semantics, so conceptual consistency for the API would be hurt. So maybe putting them on Volatile is better. But then again, there's a discoverability argument for putting them on Interlocked...

Unfortunate situation all around. Really makes me wish we could slap [EB(Never)] on Interlocked/Volatile and unify them in an Atomic class modelled mostly after C/C++11 atomics. But I digress.

I guess I lean 51/49 in favor of putting them on Volatile. 🤷

@jkotas
Member

jkotas commented Feb 23, 2024

do you mean something like MemoryBarrierAcquire()/MemoryBarrierRelease()?

Right, I do not have an opinion about the exact names. The naming is all over the place, as you have pointed out.

I assume that the only difference between the Volatile.Read/WriteBarrier and Interlocked.MemoryBarrierAcquire/Release alternatives is the names. There is no functional difference. Correct?

@hamarb123
Contributor Author

hamarb123 commented Feb 23, 2024

I assume that the only difference between the Volatile.Read/WriteBarrier and Interlocked.MemoryBarrierAcquire/Release alternatives is the names. There is no functional difference. Correct?

That is the idea, yes. They just emit the barrier that Volatile.Read/Write emits, without necessarily doing the operation (by "without necessarily doing the operation", I mean that you could also do the non-volatile operation and the JIT could combine it).

@hamarb123
Contributor Author

@mangod9 @kouvel (or anyone else who can mark it), what needs to be done to get this marked api-ready-for-review?

@jkotas
Member

jkotas commented Mar 10, 2024

I have looked at your API usage example - it appears to be either incomplete or buggy. Consider this interleaving of foreground and saving thread:

                   _savingVersion = SavingState.SavingVersion;
---
            var value = Volatile.ReadNonAtomic(in Value); //read must occur before subsequent code

            //_savingVersion must be read after _value is, so if it's visibly changed/changing then we will either catch it here
            if (_savingVersion != SavingState.SavingVersion) return value; <- this will be false - the foreground just assigned the _savingVersion
            return _savingValue; <- _savingValue has not been assigned by the foreground thread yet, this will return stale or uninitialized _savingValue

---
                   _savingValue = _value;

@hamarb123
Contributor Author

hamarb123 commented Mar 10, 2024

I have looked at your API usage example - it appears to be either incomplete or buggy. Consider this interleaving of foreground and saving thread:

You are correct - I needed the read/write to version to be volatile also, and the write to version/saving-value to be swapped. I'll update it shortly. I'm pretty sure the volatile operations on value are still needed though.

Thanks for noticing that :)

@hamarb123
Contributor Author

@jkotas was there anything else I need to fix or change?

@jkotas
Member

jkotas commented Mar 13, 2024

The area owners should take it from here.

@kouvel @VSadov @mangod9 Thoughts about this API proposal?

@VSadov
Member

VSadov commented Mar 13, 2024

I’d prefer read/write barriers.

Ordering accesses that are themselves nonatomic would have complicated semantics. Reasoning about barriers is easier and they can be used to construct “nonatomic” volatile accesses.

  • we probably want to match the ordering behavior of volatile accesses. It would be less confusing if, for example, a volatile write could be simulated by a barrier followed by an ordinary write.
  • We already have a read barrier. Just need to make it public.
  • The write barrier on arm64 would need to be emitted as full-fence (dmb). The dmb st is weaker than a volatile/release fence as it only waits for writes, while volatile write waits for reads too.

@hamarb123
Contributor Author

hamarb123 commented Mar 13, 2024

  • The write barrier on arm64 would need to be emitted as full-fence (dmb). The dmb st is weaker than a volatile/release fence as it only waits for writes, while volatile write waits for reads too.

My only concern is that we shouldn't need to emit a dmb ish for the following code:

Volatile.WriteBarrier(); //write barrier would be used on stlur for the first alignof(T) write of the following line
location = value;

And I would expect similar combining for read barrier where possible too.

And on X86 it'd be even better assembly code, obviously.

Other than that, I'd be 100% happy with the barriers solution. I changed the proposal to the ReadNonAtomic and WriteNonAtomic APIs when writing it, since that would make it easier to guarantee that the barrier and the access get combined when possible (since they're always combined in those APIs, obviously).

@kouvel
Member

kouvel commented Mar 13, 2024

I also prefer a barrier approach. I assume the barriers would be no-op on x86/x64.

The write barrier on arm64 would need to be emitted as full-fence (dmb). The dmb st is weaker than a volatile/release fence as it only waits for writes, while volatile write waits for reads too.

That doesn't seem to be the case from what I can tell. The docs on "Barrier-ordered-before" say the following:

A read or a write RW1 is Barrier-ordered-before a read or a write RW2 from the
same Observer if and only if RW1 appears in program order before RW2 and any of the following
cases apply:

  • ...
  • RW1 is a write W1 generated by an instruction with Release semantics and RW2 is a read R2
    generated by an instruction with Acquire semantics.
  • RW1 is a read R1 and either:
    • R1 appears in program order before a DMB LD that appears in program order before RW2.
    • R1 is generated by an instruction with Acquire or AcquirePC semantics.
  • RW2 is a write W2 and either:
    • RW1 is a write W1 appearing in program order before a DMB ST that appears in program
      order before W2.
    • W2 is generated by an instruction with Release semantics.

Which seems to suggest that stlr would have the same ordering guarantees as dmb st (or dmb ishst, which I assume is what we'd want) followed by a write.

In any case, from a spec perspective we probably shouldn't specify any stronger ordering guarantees for the volatile barriers than are necessary.

@kouvel
Member

kouvel commented Mar 13, 2024

As for this part:

RW1 is a write W1 generated by an instruction with Release semantics and RW2 is a read R2
generated by an instruction with Acquire semantics.

Is that something specific to arm/arm64? I'm not sure at the moment if other archs offer the same guarantee with acquire/release loads/stores. My inclination is to not specify that, unless it's a typical thing to be expected.

Anyway it seems the barriers don't guarantee that behavior, only the acquire/release load/store instructions, so may not be relevant here.

@VSadov
Member

VSadov commented Mar 13, 2024

We shouldn't need to emit a dmb ish for the following code would be my only concern:

Volatile.WriteBarrier(); //write barrier would be used on stlur for the first alignof(T) write of the following line
location = value;
And I would expect similar combining for read barrier where possible too.

Optimization that fuses a barrier with the following/preceding ordinary access into an acquiring/releasing access does not work in general. That is why barriers are useful even when releasing/acquiring accesses are also available. Barriers order multiple accesses at once, while an acq/rel access orders just one access against all preceding/following accesses (in program order).

Example:

R1, R2, ReadBarrier R3, R4

Here R4 is "ordered after" R2 and R1
Now if we apply "fusing" optimization, we have

R1, R2, VolatileRead(R3), R4

Now R4 is not ordered with R2 and R1. Thus the optimization is invalid.

R1, VolatileRead(R2), R3, R4

Still R4 is not ordered with R1.

Same goes for writes.

@kouvel
Member

kouvel commented Mar 13, 2024

R1, R2, VolatileRead(R3), R4
Now R4 is not ordered with R2 and R1. Thus the optimization is invalid.

R1, VolatileRead(R2), R3, R4
Still R4 is not ordered with R1.
Same goes for writes.

Why not? I thought that ordering guarantee was exactly what the VolatileRead was supposed to offer.

@VSadov
Member

VSadov commented Mar 13, 2024

I also prefer a barrier approach. I assume the barriers would be no-op on x86/x64.

There is effect on x86/x64.
The barriers must act as a compiler fence even if the hardware is strongly ordered and requires no instructions. While the hardware would not reorder accesses, the compiler could, so that must be prevented.

In the current implementation of the Interlocked.ReadMemoryBarrier(); the JIT needs to treat the barrier as present in the code until the very end where on x86/x64 it does not emit anything into the instruction stream. That is because a fence has optimization-suppressing effects.

@VSadov
Member

VSadov commented Mar 13, 2024

Which seems to suggest that stlr would have the same ordering guarantees as dmb st (or dmb ishst, which I assume is what we'd want) followed by a write.

The spec is terse and should be read very carefully to notice that it is not symmetrical. DMB ST has no effect on preceding reads.
This is just a quirk of DMB ST, that it is a Write-Write barrier, not a ReadWrite-Write barrier.

Arm64 does not have a barrier instruction that is ReadWrite-Write. I think RISC-V has.

As for this part:

RW1 is a write W1 generated by an instruction with Release semantics and RW2 is a read R2
generated by an instruction with Acquire semantics.

Is that something specific to arm/arm64? I'm not sure at the moment if other archs offer the same guarantee with acquire/release loads/stores. My inclination is to not specify that, unless it's a typical thing to be expected.

I think it is another implementation quirk of arm64 ISA that can be ignored here.

What arm64 calls "Acquire" is slightly stronger than what dotnet memory model specifies for volatile reads. "AcquirePC" is a closer match.
(this is why we use ldapr instead of ldar, when available)

@kouvel
Member

kouvel commented Mar 13, 2024

The barriers must act as a compiler fence

Agreed, I meant there wouldn't be any special instructions for these on x86/x64.

@VSadov
Member

VSadov commented Mar 13, 2024

To summarise:

  • Interlocked.ReadMemoryBarrier(); (existing name and class, but could change, this is not a public API right now)
    Provides a Read-ReadWrite barrier.
    All reads preceding the barrier will need to complete before any subsequent memory operation.

  • Interlocked.WriteMemoryBarrier(); (the name is just for consistency with above)
    Provides a ReadWrite-Write barrier.
    All memory operations preceding the barrier will need to complete before any subsequent write.

The barriers match semantics of acquire/release/volatile in terms of ordering reads or writes, correspondingly, relative to all operations.

The barriers are stronger than acquire/release/volatile in terms of ordering multiple reads or writes, correspondingly, not just a particular single read or write.

The actual implementation will depend on underlying platform.

  • On TSO architectures (x86/x64) it would have only compiler optimization-preventing effects.
  • On weaker architectures some ordering instructions will be emitted in addition - as appropriate and available in the given ISA.

@VSadov VSadov added the memory model issues associated with memory model label Mar 13, 2024
@kouvel
Member

kouvel commented Mar 13, 2024

The spec is terse and should be read very carefully to notice that it is not symmetrical. DMB ST has no effect on preceding reads.
This is just a quirk of DMB ST, that it is a Write-Write barrier, not a ReadWrite-Write barrier.

Unless I'm misreading it, stlr also appears to only guarantee Write-Write ordering. [Edit] Looks like I misread it, stlr appears to offer ReadWrite-Write ordering.

Do ordinary memory operations on x86/x64 guarantee Read-ReadWrite and ReadWrite-Write ordering for acquire/release semantics? I believe stores are not reordered there, but I thought loads could be reordered after an ordinary store (which has release semantics).

Wonder if these should just be Read-Read and Write-Write barriers? I see the benefits of the stronger guarantees, but it seems the implementation would not be much more efficient.

@kouvel
Member

kouvel commented Mar 13, 2024

Is the main benefit of these APIs for unaligned or larger-than-atomic reads/writes?

@VSadov
Member

VSadov commented Mar 13, 2024

One example of existing code using a read barrier could be found here:

In that scenario we do following reads
read version, [ read source, read targetResult ], read version again

The order between read source and read targetResult does not matter, but they both must be read after we read the version for the first time and before we read it second time.

To ensure that the first read happens before source/targetResult tuple, we just need to Volatile.Read(version).
To ensure that source and targetResult are both read before we read version for the second time, both must be volatile reads, since an ordinary read can be arbitrarily delayed even in the presence of acquiring reads.

So with only acquires we would have
Volatile.Read(version), [ Volatile.Read(source), Volatile.Read(targetResult) ], read version again

That is 3 volatile reads.

If we can have a read barrier, we can do:

Volatile.Read(version), [ read source, read targetResult ], [ReadMemoryBarrier] read version again

That is one barrier replacing 2 volatile reads. Benchmarks showed that even that is profitable.

However there is also a generic version of the same cache.

In the generic version, the entry (basically what goes into [ ... ] and needs to be "sandwiched" between the two version reads) contains {_key, _value}, which are generic, so reading the entry takes an unknown number of reads.
We definitely want a read barrier in the generic implementation.

Volatile.Read(version), [ read TKey _key, read TValue _value ], [ReadMemoryBarrier] read version again
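The version / entry / version-again sequence above is essentially the read side of a seqlock. A hedged C++ sketch (the names are mine, a single writer is assumed, and the entry fields are relaxed atomics only to avoid a formal data race in C++; the proposed .NET barrier would permit ordinary accesses there):

```cpp
#include <atomic>
#include <cassert>

// Seqlock-style cell mirroring the version / entry / version-again pattern.
struct Cell {
    std::atomic<unsigned> version{0};
    std::atomic<int> key{0};    // stand-ins for the generic _key/_value pair;
    std::atomic<int> value{0};  // reading them can take any number of reads

    // Single-writer update: bump version to odd, write the entry, bump to even.
    void write(int k, int v) {
        unsigned ver = version.load(std::memory_order_relaxed);
        version.store(ver + 1, std::memory_order_relaxed);   // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release); // version before entry
        key.store(k, std::memory_order_relaxed);
        value.store(v, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release); // entry before version
        version.store(ver + 2, std::memory_order_relaxed);   // even: write complete
    }

    // Volatile.Read(version), [ read entry ], ReadBarrier, read version again.
    bool try_read(int& k, int& v) {
        unsigned v1 = version.load(std::memory_order_acquire); // first version read
        k = key.load(std::memory_order_relaxed);     // entry reads need no ordering
        v = value.load(std::memory_order_relaxed);   // among themselves...
        std::atomic_thread_fence(std::memory_order_acquire);  // ...but must complete here
        unsigned v2 = version.load(std::memory_order_relaxed); // second version read
        return v1 == v2 && (v1 % 2) == 0;  // entry was stable across both reads
    }
};
```

The single acquire fence in try_read is what replaces per-field acquiring reads, which is exactly the saving the comment above describes.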

@hamarb123
Copy link
Contributor Author

I have updated the issue description - if we're all happy with it, someone can mark it as api-ready-for-review :) thanks

@kouvel
Copy link
Member

kouvel commented Mar 13, 2024

We should match it with the rest of .NET semantics imo. If we want a more general solution we could do something like Interlocked.Barrier and expose the following enum to choose from: https://en.cppreference.com/w/cpp/atomic/memory_order; otherwise we should just match the rest of .NET.

The .NET memory model for volatile accesses is clear. These are new APIs and I'd like to consider options. This is also an interesting read: https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering. It only makes very specific guarantees, and unless I misread it, does not make guarantees equivalent to those offered by the .NET memory model for volatile accesses. Interestingly, the summary descriptions of memory_order_acquire and memory_order_release (link) offer stronger guarantees than the detailed descriptions, though of course I might have missed something.

@kouvel
Copy link
Member

kouvel commented Mar 13, 2024

I can appreciate the value in keeping these barriers aligned with the .NET memory model for volatile accesses, I just wanted to point out some things. They would be more expensive (perhaps unnecessarily in some cases), but I can understand that it would fit better.

@hamarb123
Copy link
Contributor Author

hamarb123 commented Mar 13, 2024

We could expose one like this:

[Flags]
public enum MemoryOrderKind
{
    ReadEarlier = 1,
    WriteEarlier = 2,
    ReadLater = 4,
    WriteLater = 8,
}

void MemoryBarrier(MemoryOrderKind blockedOps);

There's not really a way to expose whether it should be per memory location, unless we also expose relevant read / write apis.

My current proposed APIs would just be MemoryBarrier(MemoryOrderKind.ReadEarlier | MemoryOrderKind.WriteEarlier) and MemoryBarrier(MemoryOrderKind.ReadLater | MemoryOrderKind.WriteLater) with these.

@hamarb123 hamarb123 changed the title [API Proposal]: Non-atomic volatile APIs [API Proposal]: Volatile barrier APIs Mar 13, 2024
@kouvel
Copy link
Member

kouvel commented Mar 13, 2024

public enum MemoryOrderKind

It would be nice but maybe not necessary yet? We could always add an overload.

@VSadov
Copy link
Member

VSadov commented Mar 13, 2024

Regarding exposing all possible memory ordering semantics - I think "never say never", but the need for that would be hard to demonstrate, while API costs are considerable.

It would entail introducing an enum for various memory ordering kinds and adding APIs to barriers/reads/writes/interlocked that would take extra parameters for ordering. Considering that things like CAS may have different ordering on pass and fail branches, that is a lot of combinations. Most of them would make difference only in extreme niche scenarios on platforms that have direct support for those.

Adding two volatile barriers would not close the door to going all the way of std::memory_order, if that will ever be needed enough.

@kouvel
Copy link
Member

kouvel commented Mar 13, 2024

I don't think it would be a big deal to add different ordering mechanics, it's not there already because there hasn't been sufficient interest. That said, we shouldn't confuse ease of development with design.

@kouvel
Copy link
Member

kouvel commented Mar 13, 2024

And that would be a way forward for those cases where the .NET memory model is too strict.

@kouvel kouvel removed the untriaged New issue has not been triaged by the area owner label Mar 13, 2024
@hamarb123
Copy link
Contributor Author

What about:

void ReadBarrier();
void WriteBarrier();
void PartialReadBarrier(); // or Full based on which we want to be the default / HalfReadBarrier / etc.
void PartialWriteBarrier(); // same as above

?
I don't have an issue with providing the ones with weaker guarantees that only order the same kind of operation rather than all operations (they would suffice in my code example, for instance).

What I do have an issue with is not also providing ones that match our current memory model, whether we only provide those, or provide them in addition to others.

@kouvel kouvel added the api-ready-for-review label Mar 13, 2024
@VSadov
Copy link
Member

VSadov commented Mar 13, 2024

@hamarb123 - Regarding the API proposal. A few suggestions:

  • I am not sure it is worth mentioning possible optimizations.
    I'd leave it to implementors to figure out whether and when a given optimization is applicable. (I think "fusing" optimizations would not be valid in general, for example, but maybe there are cases where they would work.)
    By definition optimizations should not be observable via program effects, so we should just specify the desired semantics and JIT developers will do whatever is needed/possible.

  • One more use of the barriers is to order a batch of multiple reads or writes. That also includes ordering non-atomic reads/writes like with structs. It is worth mentioning.
    This is the reason why Interlocked.ReadMemoryBarrier() exists. The fact that we have it in two runtimes (CoreCLR and NativeAOT) and on all supported platforms is a good evidence that someone else might use the barriers for similar purposes.

  • Not sure we need to go into detail about what we are no longer proposing - like volatile. ldobj and cpblk and initblk. Not in the beginning of the proposal anyway. This is a complicated topic - imagine the reader losing focus a little with every word they read. We do not want to spend that attention on cpblk. :-)
    Maybe worth mentioning in alternative designs, but not in the actual proposal.

@kouvel
Copy link
Member

kouvel commented Mar 13, 2024

Marked as ready-for-review, we can discuss further there.

@hamarb123
Copy link
Contributor Author

@VSadov I will update it when I'm able later today :)
Thanks

@kouvel
Copy link
Member

kouvel commented Mar 13, 2024

I do think the semantics of the operations need to be clearly defined. For instance, though it would be unfortunate, the difference indicated here would need to be clearly specified.

@kouvel
Copy link
Member

kouvel commented Mar 13, 2024

It may be a matter of documentation, but we should have a clear understanding of what the aim is for now from the OP.

@VSadov
Copy link
Member

VSadov commented Mar 13, 2024

For the desired semantics. I think we should start with:


  • Volatile.ReadBarrier()
    Provides a Read-ReadWrite barrier.
    All reads preceding the barrier will need to complete before any subsequent memory operation.

Volatile.ReadBarrier() matches the semantics of Volatile.Read in terms of ordering reads, relative to all subsequent, in program order, operations.

The important difference from Volatile.Read(ref x) is that Volatile.ReadBarrier() has an effect on all preceding reads and not just a particular single read of x.

  • Volatile.WriteBarrier()
    Provides a ReadWrite-Write barrier.
    All memory operations preceding the barrier will need to complete before any subsequent write.

Volatile.WriteBarrier() matches the semantics of Volatile.Write in terms of ordering writes, relative to all preceding, in program order, operations.

The important difference from Volatile.Write(ref x) is that Volatile.WriteBarrier() has effect on all subsequent writes and not just a particular single write of x.

The actual implementation will depend on underlying platform.

  • On TSO architectures (x86/x64) it would have only compiler optimization-preventing effects.
  • On weaker architectures some ordering instructions will be emitted in addition - as appropriate and available in the given ISA.

@hamarb123
Copy link
Contributor Author

hamarb123 commented Mar 14, 2024

It may be a matter of documentation, but we should have a clear understanding of what the aim is for now from the OP.

My main aim is to enable the API usage I have as an example. It would also be nice if we could fix volatile. prefixes, but this can be done separately if desired.

Notably, I don't think I would actually need the ReadWrite-Write or Read-ReadWrite barriers; I believe Write-Write and Read-Read should be enough for this.

@hamarb123
Copy link
Contributor Author

@VSadov I've updated it, can you double check that it's fine?

@VSadov
Copy link
Member

VSadov commented Mar 14, 2024

@hamarb123 Looks very good! Thanks!

I will add the expected semantics to the proposed entry points. But we will see where we will land with those after reviews.

@hamarb123
Copy link
Contributor Author

Btw @VSadov, both my example and the runtime's usages only seem to need Read-Read/Write-Write, so I think it'd be good to get overloads for those if we also keep the Read-ReadWrite/ReadWrite-Write ones that match our current memory model, since they should have lower overhead and seem to be all that would be required most of the time. It's in the open questions section, but just thought I'd mention it so you're aware if you hadn't seen it.

@kouvel
Copy link
Member

kouvel commented Mar 14, 2024

An alternative may be to overload Interlocked.MemoryBarrier with a MemoryConstraint enum or something like that, somewhat like in C++. The enum values could be something like ReadWrite (similar to full), Acquire, Release, and perhaps in the future if needed, Read and Write, which would be Read-Read and Write-Write respectively. Another enum value that may be useful for CAS operations could be None (similar to relaxed), if we were to expand those APIs with similar overloads. The APIs being on the Volatile class may imply that they have volatile semantics, which are very specific, and overloading them with options of different semantics may appear odd.

@kouvel
Copy link
Member

kouvel commented Mar 14, 2024

For instance, there are already use cases in Lock that could benefit from acquire/release/relaxed semantics for CAS operations. Enabling more granular barriers has also been proposed before.

@VSadov
Copy link
Member

VSadov commented Mar 14, 2024

Btw @VSadov, both my example and the runtime's usages only seem to need Read-Read/Write-Write, so I think it'd be good to get overloads for those if we also keep the Read-ReadWrite/ReadWrite-Write ones that match our current memory model, since they should have lower overhead and seem to be all that would be required most of the time. It's in the open questions section, but just thought I'd mention it so you're aware if you hadn't seen it.

Yes, I noticed. It is a common thing with volatile. While volatile orders relative to all accesses, some cases - typically a chain of several volatile accesses where you have just writes or just reads in a row - could use a weaker fence. This is the case in both scenarios that you mention.

The main impact of a fence is forbidding optimizations at the hardware level. Fences do not necessarily make the memory accesses themselves cost more. The level of cache that is being used is likely a lot more impactful than forcing a particular order of accesses.
Intuitively, with everything else the same, a weaker barrier would be cheaper, but I am not sure by how much in reality - 10%? 1%?

Figuring out the minimum strength required would be an even more difficult and error-prone task than figuring out when Volatile is needed.
Honestly - sometimes people just put volatile on everything accessed from different threads, because it is not that expensive compared to bugs that could happen once a week, a year after something shipped, just because there is a new chip on the market that does something different from where the code was originally tested.

I think going all the way of std::memory_order is possible, but being possible might not be enough reason to do it.

@VSadov
Copy link
Member

VSadov commented Mar 14, 2024

I think one datapoint that could be useful for the ReadWrite-Write vs. Write-Write discussion, could be the performance difference of dmb ish vs. dmb ishst on a few arm64 implementations - just to have a practical perspective on potential wins.

@kouvel
Copy link
Member

kouvel commented Mar 14, 2024

The perf differences may be more apparent in memory-intensive situations where the extra ordering constraints would disable some optimizations and impose extra work on the processor / cache. It may be difficult to measure the difference in typical microbenchmarks, though perhaps it would become more apparent by somehow folding in some memory pressure and measuring maybe not just the operation in question but also latency of other memory operations.

@jkotas
Copy link
Member

jkotas commented Mar 14, 2024

I think going all the way of std::memory_order is possible, but being possible might not be enough reason to do it.

I agree. I think we should start with simple barriers that are aligned with .NET memory model, and wait for evidence that we need more.

It is a non-goal for .NET programs to express everything that is possible. We strike a balance between simplicity and what may be possible in theory.

@hamarb123
Copy link
Contributor Author

hamarb123 commented Mar 14, 2024

I think going all the way of std::memory_order is possible, but being possible might not be enough reason to do it.

I agree. I think we should start with simple barriers that are aligned with .NET memory model, and wait for evidence that we need more.

It is worth mentioning that both my use case and the 2 uses of ReadBarrier in the runtime only require Read-Read/Write-Write barriers, whereas the ones matching our memory model would be Read-ReadWrite/ReadWrite-Write. This would mean throwing away some performance on arm for no reason other than lack of APIs (although I do not know precisely how much). I do think this is evidence that the full Read-ReadWrite/ReadWrite-Write barriers are probably less commonly needed than the plain Read-Read/Write-Write barriers.

Edit: I'd still be happy if we just ended up with the ones that matched our memory model, but I'd obviously be more happy if we got the Read-Read/Write-Write ones, since they'd perform slightly better and be all I require.

@kouvel kouvel added this to the 9.0.0 milestone May 1, 2024