Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interlocked.Read from 32-bit apps are much slower on Intel Sapphire Rapids CPUs #93624

Closed
TrevorFellman-MSFT opened this issue Oct 17, 2023 · 18 comments · Fixed by #93766
Closed
Labels
area-System.IO tenet-performance Performance related issue
Milestone

Comments

@TrevorFellman-MSFT
Copy link

TrevorFellman-MSFT commented Oct 17, 2023

Description

Interlocked.Read from 32-bit apps are much slower (100x) on Intel Sapphire Rapids CPUs.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace UnmanagedMemoryStreamPerfTest
{
    class Program
    {
        static bool _Is32BitProcess = true;
        static Stream _Stream = null;
        static void Main(string[] args)
        {
            _Is32BitProcess = GetIs32BitProcess();
            _Stream = GetStream();

            Console.WriteLine("Is 32 bit process: " + _Is32BitProcess);
            Stopwatch sw = new Stopwatch();
            sw.Start();

            MainAsync();

            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
            Console.ReadLine();

        }

        static async void MainAsync()
        {
            //100 threads to make it eaven more obvious
            List<Task> tasks = new List<Task>();
            for (int i = 0; i < 100; i++)
                tasks.Add(DoStuff(GetStream()));
            await Task.WhenAll(tasks);
        }


        static unsafe async Task DoStuff(Stream stream)
        {
            //contrived, but this is just so we can see it.
            //for every XAML file, WPF will read baml streams from the assembly.
            //for apps with dense UIs, that can thousands of streams to open a screen

             //the problem, 32-bit will be about 3-20 times slower.  200 times slower on high end 2023 AMD and Intel chips due to a throttling feature for the type of lock UnmangedMemoryStream takes.
            int count = 1000000;
            for (int i = 0; i < count; i++)
            {
                DoStuffStream(stream);
                //DoStuffInterlocked()
                //DoStuffInterlockedTreF()
            }

        }

        static unsafe long DoStuffStream(Stream stream)
        {
            return stream.Length;
        }

        static unsafe long DoStuffInterlocked()
        {
            long len = 0;
            Interlocked.Read(ref len);
            return len;
        }

        static unsafe long DoStuffInterlockedTreF()
        {
            // inspriation from Parallel.For() fix  //https://devdiv.visualstudio.com/DevDiv/_search?text=969699&type=workitem&pageSize=25&filters=Projects%7BDevDiv%7D
            //Read uses Add under the hood, so this should be ok

            //gotcha is that 32-bit apps could then only use 2GB streams, limited by 32-bit int
            //maybe another Flag _BigStream

            long len = 0;
            if (_Is32BitProcess  /*&& !_BigStream*/)
            {
                long* indexPtr = &len;
                {
                    Interlocked.Add(ref *(int*)indexPtr, 0);
                }
            }
            else
            {
                Interlocked.Read(ref len);
            }
            return len;
        }

        static unsafe Stream GetStream()
        {
            //https://learn.microsoft.com/en-us/dotnet/api/system.io.unmanagedmemorystream?view=net-7.0
            //this is over kill, but its copy/paste MS sample code...

            byte[] message = UnicodeEncoding.Unicode.GetBytes(GetMessage());

            // Allocate a block of unmanaged memory and return an IntPtr object.	
            IntPtr memIntPtr = Marshal.AllocHGlobal(message.Length);

            // Get a byte pointer from the IntPtr object.
            byte* memBytePtr = (byte*)memIntPtr.ToPointer();

            // Create an UnmanagedMemoryStream object using a pointer to unmanaged memory.
            UnmanagedMemoryStream writeStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Write);

            // Write the data.
            writeStream.Write(message, 0, message.Length);

            // Close the stream.
            writeStream.Close();
            writeStream.Dispose();

            // Create another UnmanagedMemoryStream object using a pointer to unmanaged memory.
            UnmanagedMemoryStream readStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Read);

            // Create a byte array to hold data from unmanaged memory.
            byte[] outMessage = new byte[message.Length];

            // Read from unmanaged memory to the byte array.
            readStream.Read(outMessage, 0, message.Length);

            // Close the stream.
            //readStream.Close();
            //readStream.Dispose();

            return readStream;
        }

        static string GetMessage()
        {
            return "Here is some data.";
        }


        static bool GetIs32BitProcess()
        {
            return IntPtr.Size == 4;
        }
    }
}

Configuration

.NET 4.8 and 6.0

Regression?

No, new CPU behavior.

Data

WPR of real world apps are available.

Analysis

This was found tuning WPF applications on VDI and seeing a noticeable performance drop opening new screens on these new processors.
Interlocked.Read is used by a lot of APIs and apps. If we can improve perf here, a lot of apps will magically run faster on modern hardware. Otherwise, this issue will have to be addressed on an individual basis.

Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.

@TrevorFellman-MSFT TrevorFellman-MSFT added the tenet-performance Performance related issue label Oct 17, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 17, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 17, 2023
@TrevorFellman-MSFT
Copy link
Author

Intel provided the following links.

https://lwn.net/Articles/790464
https://www.kernel.org/doc/html/v5.18/x86/buslock.html

Bit 29 in MSR 0x33 can be used to disable exception generation that is used by kernel/hypervisor to rate limit the bus locks
Your default value for MSR 0x33 may look like 0x20000000. You need to write 0 to bit 29 to disable this rate limit feature

Do we really need to take a lock to read a 64-bit int on 64-bit processors?

@jkotas jkotas added area-System.Threading and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Oct 17, 2023
@ghost
Copy link

ghost commented Oct 17, 2023

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Interlocked.Read from 32-bit apps are much slower (100x) on Intel Sapphire Rapids CPUs.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace UnmanagedMemoryStreamPerfTest
{
class Program
{
static bool _Is32BitProcess = true;
static Stream _Stream = null;
static void Main(string[] args)
{
_Is32BitProcess = GetIs32BitProcess();
_Stream = GetStream();

        Console.WriteLine("Is 32 bit process: " + _Is32BitProcess);
        Stopwatch sw = new Stopwatch();
        sw.Start();

        MainAsync();

        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
        Console.ReadLine();

    }

    static async void MainAsync()
    {
        //100 threads to make it eaven more obvious
        List<Task> tasks = new List<Task>();
        for (int i = 0; i < 100; i++)
            tasks.Add(DoStuff(GetStream()));
        await Task.WhenAll(tasks);
    }


    static unsafe async Task DoStuff(Stream stream)
    {
        //contrived, but this is just so we can see it.
        //for every XAML file, WPF will read baml streams from the assembly.
        //for apps with dense UIs, that can thousands of streams to open a screen

         //the problem, 32-bit will be about 3-20 times slower.  200 times slower on high end 2023 AMD and Intel chips due to a throttling feature for the type of lock UnmangedMemoryStream takes.
        int count = 1000000;
        for (int i = 0; i < count; i++)
        {
            DoStuffStream(stream);
            //DoStuffInterlocked()
            //DoStuffInterlockedTreF()
        }

    }

    static unsafe long DoStuffStream(Stream stream)
    {
        return stream.Length;
    }

    static unsafe long DoStuffInterlocked()
    {
        long len = 0;
        Interlocked.Read(ref len);
        return len;
    }

    static unsafe long DoStuffInterlockedTreF()
    {
        // inspriation from Parallel.For() fix  //https://devdiv.visualstudio.com/DevDiv/_search?text=969699&type=workitem&pageSize=25&filters=Projects%7BDevDiv%7D
        //Read uses Add under the hood, so this should be ok

        //gotcha is that 32-bit apps could then only use 2GB streams, limited by 32-bit int
        //maybe another Flag _BigStream

        long len = 0;
        if (_Is32BitProcess  /*&& !_BigStream*/)
        {
            long* indexPtr = &len;
            {
                Interlocked.Add(ref *(int*)indexPtr, 0);
            }
        }
        else
        {
            Interlocked.Read(ref len);
        }
        return len;
    }

















    static unsafe Stream GetStream()
    {
        //https://learn.microsoft.com/en-us/dotnet/api/system.io.unmanagedmemorystream?view=net-7.0
        //this is over kill, but its copy/paste MS sample code...

        byte[] message = UnicodeEncoding.Unicode.GetBytes(GetMessage());

        // Allocate a block of unmanaged memory and return an IntPtr object.	
        IntPtr memIntPtr = Marshal.AllocHGlobal(message.Length);

        // Get a byte pointer from the IntPtr object.
        byte* memBytePtr = (byte*)memIntPtr.ToPointer();

        // Create an UnmanagedMemoryStream object using a pointer to unmanaged memory.
        UnmanagedMemoryStream writeStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Write);

        // Write the data.
        writeStream.Write(message, 0, message.Length);

        // Close the stream.
        writeStream.Close();
        writeStream.Dispose();

        // Create another UnmanagedMemoryStream object using a pointer to unmanaged memory.
        UnmanagedMemoryStream readStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Read);

        // Create a byte array to hold data from unmanaged memory.
        byte[] outMessage = new byte[message.Length];

        // Read from unmanaged memory to the byte array.
        readStream.Read(outMessage, 0, message.Length);

        // Close the stream.
        //readStream.Close();
        //readStream.Dispose();

        return readStream;
    }

    static string GetMessage()
    {
        return "Here is some data.";
    }


    static bool GetIs32BitProcess()
    {
        return IntPtr.Size == 4;
    }
}

}

Configuration

.NET 4.8 and 6.0

Regression?

No, new CPU behavior.

Data

WPR of real world apps are available.

Analysis

This was found tuning WPF applications on VDI and seeing a noticeable performance drop opening new screens on these new processors.
Interlocked.Read is used by a lot of APIs and apps. If we can improve perf here, a lot of apps will magically run faster on modern hardware. Otherwise, this issue will have to be addressed on an individual basis.

Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.

Author: TrevorFellman-MSFT
Assignees: -
Labels:

area-System.Threading, tenet-performance, untriaged, needs-area-label

Milestone: -

@danmoseley danmoseley added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Threading labels Oct 18, 2023
@ghost
Copy link

ghost commented Oct 18, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Interlocked.Read from 32-bit apps are much slower (100x) on Intel Sapphire Rapids CPUs.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace UnmanagedMemoryStreamPerfTest
{
class Program
{
static bool _Is32BitProcess = true;
static Stream _Stream = null;
static void Main(string[] args)
{
_Is32BitProcess = GetIs32BitProcess();
_Stream = GetStream();

        Console.WriteLine("Is 32 bit process: " + _Is32BitProcess);
        Stopwatch sw = new Stopwatch();
        sw.Start();

        MainAsync();

        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
        Console.ReadLine();

    }

    static async void MainAsync()
    {
        //100 threads to make it eaven more obvious
        List<Task> tasks = new List<Task>();
        for (int i = 0; i < 100; i++)
            tasks.Add(DoStuff(GetStream()));
        await Task.WhenAll(tasks);
    }


    static unsafe async Task DoStuff(Stream stream)
    {
        //contrived, but this is just so we can see it.
        //for every XAML file, WPF will read baml streams from the assembly.
        //for apps with dense UIs, that can thousands of streams to open a screen

         //the problem, 32-bit will be about 3-20 times slower.  200 times slower on high end 2023 AMD and Intel chips due to a throttling feature for the type of lock UnmangedMemoryStream takes.
        int count = 1000000;
        for (int i = 0; i < count; i++)
        {
            DoStuffStream(stream);
            //DoStuffInterlocked()
            //DoStuffInterlockedTreF()
        }

    }

    static unsafe long DoStuffStream(Stream stream)
    {
        return stream.Length;
    }

    static unsafe long DoStuffInterlocked()
    {
        long len = 0;
        Interlocked.Read(ref len);
        return len;
    }

    static unsafe long DoStuffInterlockedTreF()
    {
        // inspriation from Parallel.For() fix  //https://devdiv.visualstudio.com/DevDiv/_search?text=969699&type=workitem&pageSize=25&filters=Projects%7BDevDiv%7D
        //Read uses Add under the hood, so this should be ok

        //gotcha is that 32-bit apps could then only use 2GB streams, limited by 32-bit int
        //maybe another Flag _BigStream

        long len = 0;
        if (_Is32BitProcess  /*&& !_BigStream*/)
        {
            long* indexPtr = &len;
            {
                Interlocked.Add(ref *(int*)indexPtr, 0);
            }
        }
        else
        {
            Interlocked.Read(ref len);
        }
        return len;
    }

















    static unsafe Stream GetStream()
    {
        //https://learn.microsoft.com/en-us/dotnet/api/system.io.unmanagedmemorystream?view=net-7.0
        //this is over kill, but its copy/paste MS sample code...

        byte[] message = UnicodeEncoding.Unicode.GetBytes(GetMessage());

        // Allocate a block of unmanaged memory and return an IntPtr object.	
        IntPtr memIntPtr = Marshal.AllocHGlobal(message.Length);

        // Get a byte pointer from the IntPtr object.
        byte* memBytePtr = (byte*)memIntPtr.ToPointer();

        // Create an UnmanagedMemoryStream object using a pointer to unmanaged memory.
        UnmanagedMemoryStream writeStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Write);

        // Write the data.
        writeStream.Write(message, 0, message.Length);

        // Close the stream.
        writeStream.Close();
        writeStream.Dispose();

        // Create another UnmanagedMemoryStream object using a pointer to unmanaged memory.
        UnmanagedMemoryStream readStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Read);

        // Create a byte array to hold data from unmanaged memory.
        byte[] outMessage = new byte[message.Length];

        // Read from unmanaged memory to the byte array.
        readStream.Read(outMessage, 0, message.Length);

        // Close the stream.
        //readStream.Close();
        //readStream.Dispose();

        return readStream;
    }

    static string GetMessage()
    {
        return "Here is some data.";
    }


    static bool GetIs32BitProcess()
    {
        return IntPtr.Size == 4;
    }
}

}

Configuration

.NET 4.8 and 6.0

Regression?

No, new CPU behavior.

Data

WPR of real world apps are available.

Analysis

This was found tuning WPF applications on VDI and seeing a noticeable performance drop opening new screens on these new processors.
Interlocked.Read is used by a lot of APIs and apps. If we can improve perf here, a lot of apps will magically run faster on modern hardware. Otherwise, this issue will have to be addressed on an individual basis.

Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.

Author: TrevorFellman-MSFT
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, untriaged

Milestone: -

@Clockwork-Muse
Copy link
Contributor

Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.

This might be true, but it always bugs me.
What I wish they would do is set up a private package repo....

@EgorBo
Copy link
Member

EgorBo commented Oct 18, 2023

@DeepakRajendrakumaran is that the issue we dicussed?

Long story short - the perf bottle-neck is in

the long field is not guaranteed to be aligned to 8-bytes on 32bit and that causes the perf regression. In theory, we could offer 8-byte alignment for such types (all types with long fields?) just like we do it for some cases with doubles (aligning to 8-bytes via fake objects) but that probably a bit of work, as, presumably, VM will have to align fields to 8-bytes then + potential heap size regression on 32bit.

Anyway we don't think it's a codegen issue in any way.

@DeepakRajendrakumaran
Copy link
Contributor

DeepakRajendrakumaran commented Oct 18, 2023

@DeepakRajendrakumaran is that the issue we dicussed?

Long story short - the perf bottle-neck is in

the long field is not guaranteed to be aligned to 8-bytes on 32bit and that causes the perf regression. In theory, we could offer 8-byte alignment for such types (all types with long fields?) just like we do it for some cases with doubles (aligning to 8-bytes via fake objects) but that probably a bit of work, as, presumably, VM will have to align fields to 8-bytes then + potential heap size regression on 32bit.

Anyway we don't think it's a codegen issue in any way.

Yes. This is the same issue

@EgorBo : Is using a 4 byte data type instead of long an option for position and length?

@EgorBo
Copy link
Member

EgorBo commented Oct 18, 2023

@EgorBo : Is using a 4 byte data type instead of long an option for position and length?

Not for the UnmanagedMemoryStream I guess, it's a public API, changing types for it will be a breaking change. As a workaround, users can make their own UnmanagedMemoryStream2 with int length..

@TrevorFellman-MSFT
Copy link
Author

is there a better implementation for Interlocked.Read that is better for modern hardware?
Why does UnmanagedMemoryStream need to use Interlocked.Read? Is it even necessary here?

MemoryStream doesn't

FileStream doesn't either

@EgorBo
Copy link
Member

EgorBo commented Oct 18, 2023

is there a better implementation for Interlocked.Read that is better for modern hardware?

I am not aware of simpler ways to perform a volatile/interlocked 8-byte load on x86 (32bit) other than https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange64

Why does UnmanagedMemoryStream need to use Interlocked.Read?

Presumably, to make it thread-safe so when some other thread is changing length, the value won't be torn if read by other thread - I don't know why it's done for UnmanagedMemoryStream and not for MemoryStream

PS: ah, MemoryStream's length is actually int so it doesn't need any Interlocked.

@Daniel-Svensson
Copy link
Contributor

Just curious, wouldn't it work if the field was changed to a double making it aligned allowing a 8 byte load to vector register?

However it would lead to increased complexity and worse code on x64 (which might be somewhat negated by different code paths based on size of intptr) so it is not be a generic solution

@EgorBo
Copy link
Member

EgorBo commented Oct 18, 2023

Just curious, wouldn't it work if the field was changed to a double making it aligned allowing a 8 byte load to vector register?

double fields won't force GC to use 8-byte alignment, I think we use 8-byte alignment on 32 bit only in two cases:

  1. arrays
  2. boxed structs with doubles

Regarding SIMD, I don't think a simd load can guarantee us atomicity?

@DeepakRajendrakumaran
Copy link
Contributor

There are other places we do something similar(End up doing a CompareExchange on long) -

For eg.

Interlocked.Or(ref Unsafe.As<Flags, ulong>(ref _flags), (ulong)flags);
->
long oldValue = CompareExchange(ref location1, newValue, current);

@EgorBo
Copy link
Member

EgorBo commented Oct 18, 2023

Interlocked.Or is expected to be accelerated on x64, but it's currently not implemented in JIT, it's a relatively new API (accelerated on ARM64 though).

@Clockwork-Muse
Copy link
Contributor

PS: ah, MemoryStream's length is actually int so it doesn't need any Interlocked.

...on 32bit+ systems.
A good chunk of the embedded space is (and possibly always will be) less than that.

@gfoidl
Copy link
Member

gfoidl commented Oct 19, 2023

Regarding SIMD, I don't think a simd load can guarantee us atomicity?

When data is aligned then it's atomic. Otherwise no guarantee, and for cacheline split it's not atomic.

@EgorBo
Copy link
Member

EgorBo commented Oct 19, 2023

Regarding SIMD, I don't think a simd load can guarantee us atomicity?

When data is aligned then it's atomic. Otherwise no guarantee, and for cacheline split it's not atomic.

Only on AVX+ hardware, if aligned to 16b and special aligned loads/stores are used. It makes it useless for us for atomicity

@DeepakRajendrakumaran
Copy link
Contributor

DeepakRajendrakumaran commented Oct 19, 2023

FYI - I locally switched the _position and _length in UnmanagedMemoryStream(We use the lock on these) to int and ran the app. The problem does go away

p.s. This was just an expt to verify and confirm that this is indeed the issue.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Oct 20, 2023
jkotas added a commit to jkotas/runtime that referenced this issue Oct 20, 2023
UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware..

This change removes the expensive Interlocked operations and addresses
the tearing issues in alternative way:
- The _length field is converted to nuint that is guaranteed to be
  updated atomically.
- Writes to _length field are volatile to guaranteed the
  unininitialized memory cannot be read.
- The _position field remains long and it has a risk of tearing. It is
  not a problem since tearing of this field cannot lead to buffer
  overruns.

Fixes dotnet#93624
@ghost
Copy link

ghost commented Oct 20, 2023

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Interlocked.Read from 32-bit apps are much slower (100x) on Intel Sapphire Rapids CPUs.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace UnmanagedMemoryStreamPerfTest
{
    class Program
    {
        static bool _Is32BitProcess = true;
        static Stream _Stream = null;
        static void Main(string[] args)
        {
            _Is32BitProcess = GetIs32BitProcess();
            _Stream = GetStream();

            Console.WriteLine("Is 32 bit process: " + _Is32BitProcess);
            Stopwatch sw = new Stopwatch();
            sw.Start();

            MainAsync();

            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
            Console.ReadLine();

        }

        static async void MainAsync()
        {
            //100 threads to make it eaven more obvious
            List<Task> tasks = new List<Task>();
            for (int i = 0; i < 100; i++)
                tasks.Add(DoStuff(GetStream()));
            await Task.WhenAll(tasks);
        }


        static unsafe async Task DoStuff(Stream stream)
        {
            //contrived, but this is just so we can see it.
            //for every XAML file, WPF will read baml streams from the assembly.
            //for apps with dense UIs, that can thousands of streams to open a screen

             //the problem, 32-bit will be about 3-20 times slower.  200 times slower on high end 2023 AMD and Intel chips due to a throttling feature for the type of lock UnmangedMemoryStream takes.
            int count = 1000000;
            for (int i = 0; i < count; i++)
            {
                DoStuffStream(stream);
                //DoStuffInterlocked()
                //DoStuffInterlockedTreF()
            }

        }

        static unsafe long DoStuffStream(Stream stream)
        {
            return stream.Length;
        }

        static unsafe long DoStuffInterlocked()
        {
            long len = 0;
            Interlocked.Read(ref len);
            return len;
        }

        static unsafe long DoStuffInterlockedTreF()
        {
            // inspriation from Parallel.For() fix  //https://devdiv.visualstudio.com/DevDiv/_search?text=969699&type=workitem&pageSize=25&filters=Projects%7BDevDiv%7D
            //Read uses Add under the hood, so this should be ok

            //gotcha is that 32-bit apps could then only use 2GB streams, limited by 32-bit int
            //maybe another Flag _BigStream

            long len = 0;
            if (_Is32BitProcess  /*&& !_BigStream*/)
            {
                long* indexPtr = &len;
                {
                    Interlocked.Add(ref *(int*)indexPtr, 0);
                }
            }
            else
            {
                Interlocked.Read(ref len);
            }
            return len;
        }

        static unsafe Stream GetStream()
        {
            //https://learn.microsoft.com/en-us/dotnet/api/system.io.unmanagedmemorystream?view=net-7.0
            //this is over kill, but its copy/paste MS sample code...

            byte[] message = UnicodeEncoding.Unicode.GetBytes(GetMessage());

            // Allocate a block of unmanaged memory and return an IntPtr object.	
            IntPtr memIntPtr = Marshal.AllocHGlobal(message.Length);

            // Get a byte pointer from the IntPtr object.
            byte* memBytePtr = (byte*)memIntPtr.ToPointer();

            // Create an UnmanagedMemoryStream object using a pointer to unmanaged memory.
            UnmanagedMemoryStream writeStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Write);

            // Write the data.
            writeStream.Write(message, 0, message.Length);

            // Close the stream.
            writeStream.Close();
            writeStream.Dispose();

            // Create another UnmanagedMemoryStream object using a pointer to unmanaged memory.
            UnmanagedMemoryStream readStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Read);

            // Create a byte array to hold data from unmanaged memory.
            byte[] outMessage = new byte[message.Length];

            // Read from unmanaged memory to the byte array.
            readStream.Read(outMessage, 0, message.Length);

            // Close the stream.
            //readStream.Close();
            //readStream.Dispose();

            return readStream;
        }

        static string GetMessage()
        {
            return "Here is some data.";
        }


        static bool GetIs32BitProcess()
        {
            return IntPtr.Size == 4;
        }
    }
}

Configuration

.NET 4.8 and 6.0

Regression?

No, new CPU behavior.

Data

WPR of real world apps are available.

Analysis

This was found tuning WPF applications on VDI and seeing a noticeable performance drop opening new screens on these new processors.
Interlocked.Read is used by a lot of APIs and apps. If we can improve perf here, a lot of apps will magically run faster on modern hardware. Otherwise, this issue will have to be addressed on an individual basis.

Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.

Author: TrevorFellman-MSFT
Assignees: -
Labels:

area-System.IO, tenet-performance, area-CodeGen-coreclr, untriaged, in-pr

Milestone: -

@JulieLeeMSFT JulieLeeMSFT removed the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 20, 2023
Jozkee pushed a commit that referenced this issue Oct 20, 2023
* Improve performance of UnmanagedMemoryStream

UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware..

This change removes the expensive Interlocked operations and addresses
the tearing issues in alternative way:
- The _length field is converted to nuint that is guaranteed to be
  updated atomically.
- Writes to _length field are volatile to guaranteed the
  unininitialized memory cannot be read.
- The _position field remains long and it has a risk of tearing. It is
  not a problem since tearing of this field cannot lead to buffer
  overruns.

Fixes #93624

* Add comment
@ghost ghost removed in-pr There is an active PR which will close this issue when it is merged untriaged New issue has not been triaged by the area owner labels Oct 20, 2023
@Jozkee Jozkee added this to the 9.0.0 milestone Oct 20, 2023
github-actions bot pushed a commit that referenced this issue Oct 21, 2023
UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware..

This change removes the expensive Interlocked operations and addresses
the tearing issues in alternative way:
- The _length field is converted to nuint that is guaranteed to be
  updated atomically.
- Writes to _length field are volatile to guaranteed the
  unininitialized memory cannot be read.
- The _position field remains long and it has a risk of tearing. It is
  not a problem since tearing of this field cannot lead to buffer
  overruns.

Fixes #93624
carlossanlop pushed a commit that referenced this issue Oct 23, 2023
* Improve performance of UnmanagedMemoryStream

UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware..

This change removes the expensive Interlocked operations and addresses
the tearing issues in alternative way:
- The _length field is converted to nuint that is guaranteed to be
  updated atomically.
- Writes to _length field are volatile to guaranteed the
  unininitialized memory cannot be read.
- The _position field remains long and it has a risk of tearing. It is
  not a problem since tearing of this field cannot lead to buffer
  overruns.

Fixes #93624

* Add comment

---------

Co-authored-by: Jan Kotas <jkotas@microsoft.com>
lewing added a commit that referenced this issue Oct 28, 2023
* [release/8.0] Stable branding for .NET 8 GA

* Handle an empty bandPreleaseVersion

* [release/8.0] Update dependencies from dotnet/emsdk (#93801)

* Update dependencies from https://github.com/dotnet/emsdk build 20231020.2

Microsoft.NET.Workload.Emscripten.Current.Manifest-8.0.100.Transport
 From Version 8.0.0-rtm.23511.3 -> To Version 8.0.0-rtm.23520.2

* switch to the stable version now

* Update dependencies

* Also fix the trigger

* pin the wbt sdk to avoid the latest analizer nonsense

* Use source generation for the serializer

* Try to make sourcebuild happy

* Try again to make sourcebuild happy

* Use reflection and suppress trim analysis instead

This reverts commit 768b65b.

* Fix reverting too much

---------

Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com>
Co-authored-by: Larry Ewing <lewing@microsoft.com>

* [release/8.0] Update APICompat settings under source build (#93865)

* Update APICompat settings under source build

* Update resolveContract.targets

---------

Co-authored-by: Viktor Hofer <viktor.hofer@microsoft.com>

* Override InformationalVersion for NativeAOT corelib too

* [release/8.0] Improve performance of UnmanagedMemoryStream (#93812)

* Improve performance of UnmanagedMemoryStream

UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware..

This change removes the expensive Interlocked operations and addresses
the tearing issues in alternative way:
- The _length field is converted to nuint that is guaranteed to be
  updated atomically.
- Writes to _length field are volatile to guaranteed the
  unininitialized memory cannot be read.
- The _position field remains long and it has a risk of tearing. It is
  not a problem since tearing of this field cannot lead to buffer
  overruns.

Fixes #93624

* Add comment

---------

Co-authored-by: Jan Kotas <jkotas@microsoft.com>

* Update dependencies from https://github.com/dotnet/emsdk build 20231023.2 (#93881)

Microsoft.SourceBuild.Intermediate.emsdk , Microsoft.NET.Workload.Emscripten.Current.Manifest-8.0.100
 From Version 8.0.0-rtm.23520.2 -> To Version 8.0.0-rtm.23523.2

Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com>

* [release/8.0] Update dependencies from dnceng/internal/dotnet-optimization (#93827)

* Update dependencies from https://dev.azure.com/dnceng/internal/_git/dotnet-optimization build 20231021.3

optimization.linux-arm64.MIBC.Runtime , optimization.linux-x64.MIBC.Runtime , optimization.windows_nt-arm64.MIBC.Runtime , optimization.windows_nt-x64.MIBC.Runtime , optimization.windows_nt-x86.MIBC.Runtime , optimization.PGO.CoreCLR
 From Version 1.0.0-prerelease.23519.5 -> To Version 1.0.0-prerelease.23521.3

* Update dependencies from https://dev.azure.com/dnceng/internal/_git/dotnet-optimization build 20231021.3

optimization.linux-arm64.MIBC.Runtime , optimization.linux-x64.MIBC.Runtime , optimization.windows_nt-arm64.MIBC.Runtime , optimization.windows_nt-x64.MIBC.Runtime , optimization.windows_nt-x86.MIBC.Runtime , optimization.PGO.CoreCLR
 From Version 1.0.0-prerelease.23519.5 -> To Version 1.0.0-prerelease.23521.3

---------

Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com>

* [release/8.0][wasm] Fix perf pipeline runs (#93888)

* Remove --experimental-wasm-eh argument from the wasm_args used for wasm performance runs. (#93357)

(cherry picked from commit a770017)

* performance-setup.sh: Use `release/8.0` as the default channel

* performance-setup.ps1: use release/8.0 as the default channel

* Fix passing wasmArgs for bdn

---------

Co-authored-by: Parker Bibus <parkerbibus@microsoft.com>

* [release/8.0] Honor JsonSerializerOptions.PropertyNameCaseInsensitive in property name conflict resolution. (#93935)

* Honor JsonSerializerOptions.PropertyNameCaseInsensitive in property name conflict resolution.

* Update src/libraries/System.Text.Json/tests/Common/PropertyNameTests.cs

Co-authored-by: Jeff Handley <jeffhandley@users.noreply.github.com>

* Address feedback

---------

Co-authored-by: Eirik Tsarpalis <eirik.tsarpalis@gmail.com>
Co-authored-by: Jeff Handley <jeffhandley@users.noreply.github.com>

* [8.0] Update MsQuic (#93979)

* Try pinning the installer version to a 8.01xx sdk

* Target net8.0 in SatelliteAssemblyFromProjectRef

---------

Co-authored-by: Carlos Sánchez López <1175054+carlossanlop@users.noreply.github.com>
Co-authored-by: Larry Ewing <lewing@microsoft.com>
Co-authored-by: dotnet-maestro[bot] <42748379+dotnet-maestro[bot]@users.noreply.github.com>
Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Viktor Hofer <viktor.hofer@microsoft.com>
Co-authored-by: Alexander Köplinger <alex.koeplinger@outlook.com>
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
Co-authored-by: Ankit Jain <radical@gmail.com>
Co-authored-by: Parker Bibus <parkerbibus@microsoft.com>
Co-authored-by: Eirik Tsarpalis <eirik.tsarpalis@gmail.com>
Co-authored-by: Jeff Handley <jeffhandley@users.noreply.github.com>
@ghost ghost locked as resolved and limited conversation to collaborators Nov 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.IO tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants