Interlocked.Read from 32-bit apps are much slower on Intel Sapphire Rapids CPUs #93624

TrevorFellman-MSFT · 2023-10-17T19:58:28Z

Description

Interlocked.Read from 32-bit apps are much slower (100x) on Intel Sapphire Rapids CPUs.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace UnmanagedMemoryStreamPerfTest
{
    class Program
    {
        static bool _Is32BitProcess = true;
        static Stream _Stream = null;
        static void Main(string[] args)
        {
            _Is32BitProcess = GetIs32BitProcess();
            _Stream = GetStream();

            Console.WriteLine("Is 32 bit process: " + _Is32BitProcess);
            Stopwatch sw = new Stopwatch();
            sw.Start();

            MainAsync();

            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
            Console.ReadLine();

        }

        static async void MainAsync()
        {
            //100 threads to make it eaven more obvious
            List<Task> tasks = new List<Task>();
            for (int i = 0; i < 100; i++)
                tasks.Add(DoStuff(GetStream()));
            await Task.WhenAll(tasks);
        }


        static unsafe async Task DoStuff(Stream stream)
        {
            //contrived, but this is just so we can see it.
            //for every XAML file, WPF will read baml streams from the assembly.
            //for apps with dense UIs, that can thousands of streams to open a screen

             //the problem, 32-bit will be about 3-20 times slower.  200 times slower on high end 2023 AMD and Intel chips due to a throttling feature for the type of lock UnmangedMemoryStream takes.
            int count = 1000000;
            for (int i = 0; i < count; i++)
            {
                DoStuffStream(stream);
                //DoStuffInterlocked()
                //DoStuffInterlockedTreF()
            }

        }

        static unsafe long DoStuffStream(Stream stream)
        {
            return stream.Length;
        }

        static unsafe long DoStuffInterlocked()
        {
            long len = 0;
            Interlocked.Read(ref len);
            return len;
        }

        static unsafe long DoStuffInterlockedTreF()
        {
            // inspriation from Parallel.For() fix  //https://devdiv.visualstudio.com/DevDiv/_search?text=969699&type=workitem&pageSize=25&filters=Projects%7BDevDiv%7D
            //Read uses Add under the hood, so this should be ok

            //gotcha is that 32-bit apps could then only use 2GB streams, limited by 32-bit int
            //maybe another Flag _BigStream

            long len = 0;
            if (_Is32BitProcess  /*&& !_BigStream*/)
            {
                long* indexPtr = &len;
                {
                    Interlocked.Add(ref *(int*)indexPtr, 0);
                }
            }
            else
            {
                Interlocked.Read(ref len);
            }
            return len;
        }

        static unsafe Stream GetStream()
        {
            //https://learn.microsoft.com/en-us/dotnet/api/system.io.unmanagedmemorystream?view=net-7.0
            //this is over kill, but its copy/paste MS sample code...

            byte[] message = UnicodeEncoding.Unicode.GetBytes(GetMessage());

            // Allocate a block of unmanaged memory and return an IntPtr object.	
            IntPtr memIntPtr = Marshal.AllocHGlobal(message.Length);

            // Get a byte pointer from the IntPtr object.
            byte* memBytePtr = (byte*)memIntPtr.ToPointer();

            // Create an UnmanagedMemoryStream object using a pointer to unmanaged memory.
            UnmanagedMemoryStream writeStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Write);

            // Write the data.
            writeStream.Write(message, 0, message.Length);

            // Close the stream.
            writeStream.Close();
            writeStream.Dispose();

            // Create another UnmanagedMemoryStream object using a pointer to unmanaged memory.
            UnmanagedMemoryStream readStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Read);

            // Create a byte array to hold data from unmanaged memory.
            byte[] outMessage = new byte[message.Length];

            // Read from unmanaged memory to the byte array.
            readStream.Read(outMessage, 0, message.Length);

            // Close the stream.
            //readStream.Close();
            //readStream.Dispose();

            return readStream;
        }

        static string GetMessage()
        {
            return "Here is some data.";
        }


        static bool GetIs32BitProcess()
        {
            return IntPtr.Size == 4;
        }
    }
}

Configuration

.NET 4.8 and 6.0

Regression?

No, new CPU behavior.

Data

WPR of real world apps are available.

Analysis

This was found tuning WPF applications on VDI and seeing a noticeable performance drop opening new screens on these new processors.
Interlocked.Read is used by a lot of APIs and apps. If we can improve perf here, a lot of apps will magically run faster on modern hardware. Otherwise, this issue will have to be addressed on an individual basis.

Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.

TrevorFellman-MSFT · 2023-10-17T20:52:44Z

Intel provided the following links.

https://lwn.net/Articles/790464
https://www.kernel.org/doc/html/v5.18/x86/buslock.html

Bit 29 in MSR 0x33 can be used to disable exception generation that is used by kernel/hypervisor to rate limit the bus locks
Your default value for MSR 0x33 may look like 0x20000000. You need to write 0 to bit 29 to disable this rate limit feature

Do we really need to take a lock to read a 64-bit int on 64-bit processors?

ghost · 2023-10-17T23:07:15Z

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Interlocked.Read from 32-bit apps are much slower (100x) on Intel Sapphire Rapids CPUs.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace UnmanagedMemoryStreamPerfTest
{
class Program
{
static bool _Is32BitProcess = true;
static Stream _Stream = null;
static void Main(string[] args)
{
_Is32BitProcess = GetIs32BitProcess();
_Stream = GetStream();

        Console.WriteLine("Is 32 bit process: " + _Is32BitProcess);
        Stopwatch sw = new Stopwatch();
        sw.Start();

        MainAsync();

        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
        Console.ReadLine();

    }

    static async void MainAsync()
    {
        //100 threads to make it eaven more obvious
        List<Task> tasks = new List<Task>();
        for (int i = 0; i < 100; i++)
            tasks.Add(DoStuff(GetStream()));
        await Task.WhenAll(tasks);
    }


    static unsafe async Task DoStuff(Stream stream)
    {
        //contrived, but this is just so we can see it.
        //for every XAML file, WPF will read baml streams from the assembly.
        //for apps with dense UIs, that can thousands of streams to open a screen

         //the problem, 32-bit will be about 3-20 times slower.  200 times slower on high end 2023 AMD and Intel chips due to a throttling feature for the type of lock UnmangedMemoryStream takes.
        int count = 1000000;
        for (int i = 0; i < count; i++)
        {
            DoStuffStream(stream);
            //DoStuffInterlocked()
            //DoStuffInterlockedTreF()
        }

    }

    static unsafe long DoStuffStream(Stream stream)
    {
        return stream.Length;
    }

    static unsafe long DoStuffInterlocked()
    {
        long len = 0;
        Interlocked.Read(ref len);
        return len;
    }

    static unsafe long DoStuffInterlockedTreF()
    {
        // inspriation from Parallel.For() fix  //https://devdiv.visualstudio.com/DevDiv/_search?text=969699&type=workitem&pageSize=25&filters=Projects%7BDevDiv%7D
        //Read uses Add under the hood, so this should be ok

        //gotcha is that 32-bit apps could then only use 2GB streams, limited by 32-bit int
        //maybe another Flag _BigStream

        long len = 0;
        if (_Is32BitProcess  /*&& !_BigStream*/)
        {
            long* indexPtr = &len;
            {
                Interlocked.Add(ref *(int*)indexPtr, 0);
            }
        }
        else
        {
            Interlocked.Read(ref len);
        }
        return len;
    }

















    static unsafe Stream GetStream()
    {
        //https://learn.microsoft.com/en-us/dotnet/api/system.io.unmanagedmemorystream?view=net-7.0
        //this is over kill, but its copy/paste MS sample code...

        byte[] message = UnicodeEncoding.Unicode.GetBytes(GetMessage());

        // Allocate a block of unmanaged memory and return an IntPtr object.	
        IntPtr memIntPtr = Marshal.AllocHGlobal(message.Length);

        // Get a byte pointer from the IntPtr object.
        byte* memBytePtr = (byte*)memIntPtr.ToPointer();

        // Create an UnmanagedMemoryStream object using a pointer to unmanaged memory.
        UnmanagedMemoryStream writeStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Write);

        // Write the data.
        writeStream.Write(message, 0, message.Length);

        // Close the stream.
        writeStream.Close();
        writeStream.Dispose();

        // Create another UnmanagedMemoryStream object using a pointer to unmanaged memory.
        UnmanagedMemoryStream readStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Read);

        // Create a byte array to hold data from unmanaged memory.
        byte[] outMessage = new byte[message.Length];

        // Read from unmanaged memory to the byte array.
        readStream.Read(outMessage, 0, message.Length);

        // Close the stream.
        //readStream.Close();
        //readStream.Dispose();

        return readStream;
    }

    static string GetMessage()
    {
        return "Here is some data.";
    }


    static bool GetIs32BitProcess()
    {
        return IntPtr.Size == 4;
    }
}

}

Configuration

.NET 4.8 and 6.0

Regression?

No, new CPU behavior.

Data

WPR of real world apps are available.

Analysis

This was found tuning WPF applications on VDI and seeing a noticeable performance drop opening new screens on these new processors.
Interlocked.Read is used by a lot of APIs and apps. If we can improve perf here, a lot of apps will magically run faster on modern hardware. Otherwise, this issue will have to be addressed on an individual basis.

Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.

Author:	TrevorFellman-MSFT
Assignees:	-
Labels:	`area-System.Threading`, `tenet-performance`, `untriaged`, `needs-area-label`
Milestone:	-

ghost · 2023-10-18T00:22:41Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Interlocked.Read from 32-bit apps are much slower (100x) on Intel Sapphire Rapids CPUs.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace UnmanagedMemoryStreamPerfTest
{
class Program
{
static bool _Is32BitProcess = true;
static Stream _Stream = null;
static void Main(string[] args)
{
_Is32BitProcess = GetIs32BitProcess();
_Stream = GetStream();

        Console.WriteLine("Is 32 bit process: " + _Is32BitProcess);
        Stopwatch sw = new Stopwatch();
        sw.Start();

        MainAsync();

        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
        Console.ReadLine();

    }

    static async void MainAsync()
    {
        //100 threads to make it eaven more obvious
        List<Task> tasks = new List<Task>();
        for (int i = 0; i < 100; i++)
            tasks.Add(DoStuff(GetStream()));
        await Task.WhenAll(tasks);
    }


    static unsafe async Task DoStuff(Stream stream)
    {
        //contrived, but this is just so we can see it.
        //for every XAML file, WPF will read baml streams from the assembly.
        //for apps with dense UIs, that can thousands of streams to open a screen

         //the problem, 32-bit will be about 3-20 times slower.  200 times slower on high end 2023 AMD and Intel chips due to a throttling feature for the type of lock UnmangedMemoryStream takes.
        int count = 1000000;
        for (int i = 0; i < count; i++)
        {
            DoStuffStream(stream);
            //DoStuffInterlocked()
            //DoStuffInterlockedTreF()
        }

    }

    static unsafe long DoStuffStream(Stream stream)
    {
        return stream.Length;
    }

    static unsafe long DoStuffInterlocked()
    {
        long len = 0;
        Interlocked.Read(ref len);
        return len;
    }

    static unsafe long DoStuffInterlockedTreF()
    {
        // inspriation from Parallel.For() fix  //https://devdiv.visualstudio.com/DevDiv/_search?text=969699&type=workitem&pageSize=25&filters=Projects%7BDevDiv%7D
        //Read uses Add under the hood, so this should be ok

        //gotcha is that 32-bit apps could then only use 2GB streams, limited by 32-bit int
        //maybe another Flag _BigStream

        long len = 0;
        if (_Is32BitProcess  /*&& !_BigStream*/)
        {
            long* indexPtr = &len;
            {
                Interlocked.Add(ref *(int*)indexPtr, 0);
            }
        }
        else
        {
            Interlocked.Read(ref len);
        }
        return len;
    }

















    static unsafe Stream GetStream()
    {
        //https://learn.microsoft.com/en-us/dotnet/api/system.io.unmanagedmemorystream?view=net-7.0
        //this is over kill, but its copy/paste MS sample code...

        byte[] message = UnicodeEncoding.Unicode.GetBytes(GetMessage());

        // Allocate a block of unmanaged memory and return an IntPtr object.	
        IntPtr memIntPtr = Marshal.AllocHGlobal(message.Length);

        // Get a byte pointer from the IntPtr object.
        byte* memBytePtr = (byte*)memIntPtr.ToPointer();

        // Create an UnmanagedMemoryStream object using a pointer to unmanaged memory.
        UnmanagedMemoryStream writeStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Write);

        // Write the data.
        writeStream.Write(message, 0, message.Length);

        // Close the stream.
        writeStream.Close();
        writeStream.Dispose();

        // Create another UnmanagedMemoryStream object using a pointer to unmanaged memory.
        UnmanagedMemoryStream readStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Read);

        // Create a byte array to hold data from unmanaged memory.
        byte[] outMessage = new byte[message.Length];

        // Read from unmanaged memory to the byte array.
        readStream.Read(outMessage, 0, message.Length);

        // Close the stream.
        //readStream.Close();
        //readStream.Dispose();

        return readStream;
    }

    static string GetMessage()
    {
        return "Here is some data.";
    }


    static bool GetIs32BitProcess()
    {
        return IntPtr.Size == 4;
    }
}

}

Configuration

.NET 4.8 and 6.0

Regression?

No, new CPU behavior.

Data

WPR of real world apps are available.

Analysis

This was found tuning WPF applications on VDI and seeing a noticeable performance drop opening new screens on these new processors.
Interlocked.Read is used by a lot of APIs and apps. If we can improve perf here, a lot of apps will magically run faster on modern hardware. Otherwise, this issue will have to be addressed on an individual basis.

Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.

Author:	TrevorFellman-MSFT
Assignees:	-
Labels:	`tenet-performance`, `area-CodeGen-coreclr`, `untriaged`
Milestone:	-

Clockwork-Muse · 2023-10-18T00:36:54Z

Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.

This might be true, but it always bugs me.
What I wish they would do is set up a private package repo....

EgorBo · 2023-10-18T12:11:28Z

@DeepakRajendrakumaran is that the issue we dicussed?

Long story short - the perf bottle-neck is in

runtime/src/libraries/System.Private.CoreLib/src/System/IO/UnmanagedMemoryStream.cs

Line 262 in 45ed0ab

return Interlocked.Read(ref _length);

the long field is not guaranteed to be aligned to 8-bytes on 32bit and that causes the perf regression. In theory, we could offer 8-byte alignment for such types (all types with long fields?) just like we do it for some cases with doubles (aligning to 8-bytes via fake objects) but that probably a bit of work, as, presumably, VM will have to align fields to 8-bytes then + potential heap size regression on 32bit.

Anyway we don't think it's a codegen issue in any way.

DeepakRajendrakumaran · 2023-10-18T16:22:34Z

@DeepakRajendrakumaran is that the issue we dicussed?

Long story short - the perf bottle-neck is in

runtime/src/libraries/System.Private.CoreLib/src/System/IO/UnmanagedMemoryStream.cs

Line 262 in 45ed0ab

return Interlocked.Read(ref _length);

the long field is not guaranteed to be aligned to 8-bytes on 32bit and that causes the perf regression. In theory, we could offer 8-byte alignment for such types (all types with long fields?) just like we do it for some cases with doubles (aligning to 8-bytes via fake objects) but that probably a bit of work, as, presumably, VM will have to align fields to 8-bytes then + potential heap size regression on 32bit.

Anyway we don't think it's a codegen issue in any way.

Yes. This is the same issue

@EgorBo : Is using a 4 byte data type instead of long an option for position and length?

EgorBo · 2023-10-18T16:37:31Z

@EgorBo : Is using a 4 byte data type instead of long an option for position and length?

Not for the UnmanagedMemoryStream I guess, it's a public API, changing types for it will be a breaking change. As a workaround, users can make their own UnmanagedMemoryStream2 with int length..

TrevorFellman-MSFT · 2023-10-18T18:40:46Z

is there a better implementation for Interlocked.Read that is better for modern hardware?
Why does UnmanagedMemoryStream need to use Interlocked.Read? Is it even necessary here?

MemoryStream doesn't

runtime/src/libraries/System.Private.CoreLib/src/System/IO/MemoryStream.cs

Line 292 in d65a39d

public override long Length

FileStream doesn't either

runtime/src/libraries/System.Private.CoreLib/src/System/IO/FileStream.cs

Line 451 in d65a39d

public override long Length

EgorBo · 2023-10-18T18:51:53Z

is there a better implementation for Interlocked.Read that is better for modern hardware?

I am not aware of simpler ways to perform a volatile/interlocked 8-byte load on x86 (32bit) other than https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange64

Why does UnmanagedMemoryStream need to use Interlocked.Read?

Presumably, to make it thread-safe so when some other thread is changing length, the value won't be torn if read by other thread - I don't know why it's done for UnmanagedMemoryStream and not for MemoryStream

PS: ah, MemoryStream's length is actually int so it doesn't need any Interlocked.

Daniel-Svensson · 2023-10-18T19:30:12Z

Just curious, wouldn't it work if the field was changed to a double making it aligned allowing a 8 byte load to vector register?

However it would lead to increased complexity and worse code on x64 (which might be somewhat negated by different code paths based on size of intptr) so it is not be a generic solution

EgorBo · 2023-10-18T21:07:52Z

Just curious, wouldn't it work if the field was changed to a double making it aligned allowing a 8 byte load to vector register?

double fields won't force GC to use 8-byte alignment, I think we use 8-byte alignment on 32 bit only in two cases:

arrays
boxed structs with doubles

Regarding SIMD, I don't think a simd load can guarantee us atomicity?

DeepakRajendrakumaran · 2023-10-18T21:57:13Z

There are other places we do something similar(End up doing a CompareExchange on long) -

For eg.

runtime/src/libraries/System.Private.Uri/src/System/Uri.cs

Line 211 in fa3babd

Interlocked.Or(ref Unsafe.As<Flags, ulong>(ref _flags), (ulong)flags);

->

runtime/src/libraries/System.Private.CoreLib/src/System/Threading/Interlocked.cs

Line 347 in fa3babd

long oldValue = CompareExchange(ref location1, newValue, current);

EgorBo · 2023-10-18T21:59:42Z

Interlocked.Or is expected to be accelerated on x64, but it's currently not implemented in JIT, it's a relatively new API (accelerated on ARM64 though).

Clockwork-Muse · 2023-10-19T02:29:07Z

PS: ah, MemoryStream's length is actually int so it doesn't need any Interlocked.

...on 32bit+ systems.
A good chunk of the embedded space is (and possibly always will be) less than that.

gfoidl · 2023-10-19T09:15:47Z

Regarding SIMD, I don't think a simd load can guarantee us atomicity?

When data is aligned then it's atomic. Otherwise no guarantee, and for cacheline split it's not atomic.

EgorBo · 2023-10-19T12:53:00Z

Regarding SIMD, I don't think a simd load can guarantee us atomicity?

When data is aligned then it's atomic. Otherwise no guarantee, and for cacheline split it's not atomic.

Only on AVX+ hardware, if aligned to 16b and special aligned loads/stores are used. It makes it useless for us for atomicity

DeepakRajendrakumaran · 2023-10-19T16:53:19Z

FYI - I locally switched the _position and _length in UnmanagedMemoryStream(We use the lock on these) to int and ran the app. The problem does go away

p.s. This was just an expt to verify and confirm that this is indeed the issue.

UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware.. This change removes the expensive Interlocked operations and addresses the tearing issues in alternative way: - The _length field is converted to nuint that is guaranteed to be updated atomically. - Writes to _length field are volatile to guaranteed the unininitialized memory cannot be read. - The _position field remains long and it has a risk of tearing. It is not a problem since tearing of this field cannot lead to buffer overruns. Fixes dotnet#93624

ghost · 2023-10-20T21:54:34Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Interlocked.Read from 32-bit apps are much slower (100x) on Intel Sapphire Rapids CPUs.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace UnmanagedMemoryStreamPerfTest
{
    class Program
    {
        static bool _Is32BitProcess = true;
        static Stream _Stream = null;
        static void Main(string[] args)
        {
            _Is32BitProcess = GetIs32BitProcess();
            _Stream = GetStream();

            Console.WriteLine("Is 32 bit process: " + _Is32BitProcess);
            Stopwatch sw = new Stopwatch();
            sw.Start();

            MainAsync();

            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
            Console.ReadLine();

        }

        static async void MainAsync()
        {
            //100 threads to make it eaven more obvious
            List<Task> tasks = new List<Task>();
            for (int i = 0; i < 100; i++)
                tasks.Add(DoStuff(GetStream()));
            await Task.WhenAll(tasks);
        }


        static unsafe async Task DoStuff(Stream stream)
        {
            //contrived, but this is just so we can see it.
            //for every XAML file, WPF will read baml streams from the assembly.
            //for apps with dense UIs, that can thousands of streams to open a screen

             //the problem, 32-bit will be about 3-20 times slower.  200 times slower on high end 2023 AMD and Intel chips due to a throttling feature for the type of lock UnmangedMemoryStream takes.
            int count = 1000000;
            for (int i = 0; i < count; i++)
            {
                DoStuffStream(stream);
                //DoStuffInterlocked()
                //DoStuffInterlockedTreF()
            }

        }

        static unsafe long DoStuffStream(Stream stream)
        {
            return stream.Length;
        }

        static unsafe long DoStuffInterlocked()
        {
            long len = 0;
            Interlocked.Read(ref len);
            return len;
        }

        static unsafe long DoStuffInterlockedTreF()
        {
            // inspriation from Parallel.For() fix  //https://devdiv.visualstudio.com/DevDiv/_search?text=969699&type=workitem&pageSize=25&filters=Projects%7BDevDiv%7D
            //Read uses Add under the hood, so this should be ok

            //gotcha is that 32-bit apps could then only use 2GB streams, limited by 32-bit int
            //maybe another Flag _BigStream

            long len = 0;
            if (_Is32BitProcess  /*&& !_BigStream*/)
            {
                long* indexPtr = &len;
                {
                    Interlocked.Add(ref *(int*)indexPtr, 0);
                }
            }
            else
            {
                Interlocked.Read(ref len);
            }
            return len;
        }

        static unsafe Stream GetStream()
        {
            //https://learn.microsoft.com/en-us/dotnet/api/system.io.unmanagedmemorystream?view=net-7.0
            //this is over kill, but its copy/paste MS sample code...

            byte[] message = UnicodeEncoding.Unicode.GetBytes(GetMessage());

            // Allocate a block of unmanaged memory and return an IntPtr object.	
            IntPtr memIntPtr = Marshal.AllocHGlobal(message.Length);

            // Get a byte pointer from the IntPtr object.
            byte* memBytePtr = (byte*)memIntPtr.ToPointer();

            // Create an UnmanagedMemoryStream object using a pointer to unmanaged memory.
            UnmanagedMemoryStream writeStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Write);

            // Write the data.
            writeStream.Write(message, 0, message.Length);

            // Close the stream.
            writeStream.Close();
            writeStream.Dispose();

            // Create another UnmanagedMemoryStream object using a pointer to unmanaged memory.
            UnmanagedMemoryStream readStream = new UnmanagedMemoryStream(memBytePtr, message.Length, message.Length, FileAccess.Read);

            // Create a byte array to hold data from unmanaged memory.
            byte[] outMessage = new byte[message.Length];

            // Read from unmanaged memory to the byte array.
            readStream.Read(outMessage, 0, message.Length);

            // Close the stream.
            //readStream.Close();
            //readStream.Dispose();

            return readStream;
        }

        static string GetMessage()
        {
            return "Here is some data.";
        }


        static bool GetIs32BitProcess()
        {
            return IntPtr.Size == 4;
        }
    }
}

Configuration

.NET 4.8 and 6.0

Regression?

No, new CPU behavior.

Data

WPR of real world apps are available.

Analysis

This was found tuning WPF applications on VDI and seeing a noticeable performance drop opening new screens on these new processors.
Interlocked.Read is used by a lot of APIs and apps. If we can improve perf here, a lot of apps will magically run faster on modern hardware. Otherwise, this issue will have to be addressed on an individual basis.

Many large enterprise apps have 32-bit 3rd-party dependencies, so cannot be rebuilt as 64-bit processes to avoid this perf issue.

Author:	TrevorFellman-MSFT
Assignees:	-
Labels:	`area-System.IO`, `tenet-performance`, `area-CodeGen-coreclr`, `untriaged`, `in-pr`
Milestone:	-

* Improve performance of UnmanagedMemoryStream UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware.. This change removes the expensive Interlocked operations and addresses the tearing issues in alternative way: - The _length field is converted to nuint that is guaranteed to be updated atomically. - Writes to _length field are volatile to guaranteed the unininitialized memory cannot be read. - The _position field remains long and it has a risk of tearing. It is not a problem since tearing of this field cannot lead to buffer overruns. Fixes #93624 * Add comment

UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware.. This change removes the expensive Interlocked operations and addresses the tearing issues in alternative way: - The _length field is converted to nuint that is guaranteed to be updated atomically. - Writes to _length field are volatile to guaranteed the unininitialized memory cannot be read. - The _position field remains long and it has a risk of tearing. It is not a problem since tearing of this field cannot lead to buffer overruns. Fixes #93624

* Improve performance of UnmanagedMemoryStream UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware.. This change removes the expensive Interlocked operations and addresses the tearing issues in alternative way: - The _length field is converted to nuint that is guaranteed to be updated atomically. - Writes to _length field are volatile to guaranteed the unininitialized memory cannot be read. - The _position field remains long and it has a risk of tearing. It is not a problem since tearing of this field cannot lead to buffer overruns. Fixes #93624 * Add comment --------- Co-authored-by: Jan Kotas <jkotas@microsoft.com>

* [release/8.0] Stable branding for .NET 8 GA * Handle an empty bandPreleaseVersion * [release/8.0] Update dependencies from dotnet/emsdk (#93801) * Update dependencies from https://github.com/dotnet/emsdk build 20231020.2 Microsoft.NET.Workload.Emscripten.Current.Manifest-8.0.100.Transport From Version 8.0.0-rtm.23511.3 -> To Version 8.0.0-rtm.23520.2 * switch to the stable version now * Update dependencies * Also fix the trigger * pin the wbt sdk to avoid the latest analizer nonsense * Use source generation for the serializer * Try to make sourcebuild happy * Try again to make sourcebuild happy * Use reflection and suppress trim analysis instead This reverts commit 768b65b. * Fix reverting too much --------- Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com> Co-authored-by: Larry Ewing <lewing@microsoft.com> * [release/8.0] Update APICompat settings under source build (#93865) * Update APICompat settings under source build * Update resolveContract.targets --------- Co-authored-by: Viktor Hofer <viktor.hofer@microsoft.com> * Override InformationalVersion for NativeAOT corelib too * [release/8.0] Improve performance of UnmanagedMemoryStream (#93812) * Improve performance of UnmanagedMemoryStream UnmanagedMemoryStream used Interlocked operations to update its state to prevent tearing of 64-bit values on 32-bit platforms. This pattern is expensive in general and it was found to be prohibitively expensive on recent hardware.. This change removes the expensive Interlocked operations and addresses the tearing issues in alternative way: - The _length field is converted to nuint that is guaranteed to be updated atomically. - Writes to _length field are volatile to guaranteed the unininitialized memory cannot be read. - The _position field remains long and it has a risk of tearing. It is not a problem since tearing of this field cannot lead to buffer overruns. Fixes #93624 * Add comment --------- Co-authored-by: Jan Kotas <jkotas@microsoft.com> * Update dependencies from https://github.com/dotnet/emsdk build 20231023.2 (#93881) Microsoft.SourceBuild.Intermediate.emsdk , Microsoft.NET.Workload.Emscripten.Current.Manifest-8.0.100 From Version 8.0.0-rtm.23520.2 -> To Version 8.0.0-rtm.23523.2 Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com> * [release/8.0] Update dependencies from dnceng/internal/dotnet-optimization (#93827) * Update dependencies from https://dev.azure.com/dnceng/internal/_git/dotnet-optimization build 20231021.3 optimization.linux-arm64.MIBC.Runtime , optimization.linux-x64.MIBC.Runtime , optimization.windows_nt-arm64.MIBC.Runtime , optimization.windows_nt-x64.MIBC.Runtime , optimization.windows_nt-x86.MIBC.Runtime , optimization.PGO.CoreCLR From Version 1.0.0-prerelease.23519.5 -> To Version 1.0.0-prerelease.23521.3 * Update dependencies from https://dev.azure.com/dnceng/internal/_git/dotnet-optimization build 20231021.3 optimization.linux-arm64.MIBC.Runtime , optimization.linux-x64.MIBC.Runtime , optimization.windows_nt-arm64.MIBC.Runtime , optimization.windows_nt-x64.MIBC.Runtime , optimization.windows_nt-x86.MIBC.Runtime , optimization.PGO.CoreCLR From Version 1.0.0-prerelease.23519.5 -> To Version 1.0.0-prerelease.23521.3 --------- Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com> * [release/8.0][wasm] Fix perf pipeline runs (#93888) * Remove --experimental-wasm-eh argument from the wasm_args used for wasm performance runs. (#93357) (cherry picked from commit a770017) * performance-setup.sh: Use `release/8.0` as the default channel * performance-setup.ps1: use release/8.0 as the default channel * Fix passing wasmArgs for bdn --------- Co-authored-by: Parker Bibus <parkerbibus@microsoft.com> * [release/8.0] Honor JsonSerializerOptions.PropertyNameCaseInsensitive in property name conflict resolution. (#93935) * Honor JsonSerializerOptions.PropertyNameCaseInsensitive in property name conflict resolution. * Update src/libraries/System.Text.Json/tests/Common/PropertyNameTests.cs Co-authored-by: Jeff Handley <jeffhandley@users.noreply.github.com> * Address feedback --------- Co-authored-by: Eirik Tsarpalis <eirik.tsarpalis@gmail.com> Co-authored-by: Jeff Handley <jeffhandley@users.noreply.github.com> * [8.0] Update MsQuic (#93979) * Try pinning the installer version to a 8.01xx sdk * Target net8.0 in SatelliteAssemblyFromProjectRef --------- Co-authored-by: Carlos Sánchez López <1175054+carlossanlop@users.noreply.github.com> Co-authored-by: Larry Ewing <lewing@microsoft.com> Co-authored-by: dotnet-maestro[bot] <42748379+dotnet-maestro[bot]@users.noreply.github.com> Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Viktor Hofer <viktor.hofer@microsoft.com> Co-authored-by: Alexander Köplinger <alex.koeplinger@outlook.com> Co-authored-by: Jan Kotas <jkotas@microsoft.com> Co-authored-by: Ankit Jain <radical@gmail.com> Co-authored-by: Parker Bibus <parkerbibus@microsoft.com> Co-authored-by: Eirik Tsarpalis <eirik.tsarpalis@gmail.com> Co-authored-by: Jeff Handley <jeffhandley@users.noreply.github.com>

TrevorFellman-MSFT added the tenet-performance Performance related issue label Oct 17, 2023

dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 17, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Oct 17, 2023

jkotas added area-System.Threading and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Oct 17, 2023

danmoseley added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Threading labels Oct 18, 2023

jkotas mentioned this issue Oct 20, 2023

Improve performance of UnmanagedMemoryStream #93766

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Oct 20, 2023

JulieLeeMSFT added the area-System.IO label Oct 20, 2023

JulieLeeMSFT removed the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 20, 2023

Jozkee closed this as completed in #93766 Oct 20, 2023

ghost removed in-pr There is an active PR which will close this issue when it is merged untriaged New issue has not been triaged by the area owner labels Oct 20, 2023

Jozkee added this to the 9.0.0 milestone Oct 20, 2023

ghost locked as resolved and limited conversation to collaborators Nov 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interlocked.Read from 32-bit apps are much slower on Intel Sapphire Rapids CPUs #93624

Interlocked.Read from 32-bit apps are much slower on Intel Sapphire Rapids CPUs #93624

TrevorFellman-MSFT commented Oct 17, 2023 •

edited by EgorBo

TrevorFellman-MSFT commented Oct 17, 2023

ghost commented Oct 17, 2023

Description

Configuration

Regression?

Data

Analysis

ghost commented Oct 18, 2023

Description

Configuration

Regression?

Data

Analysis

Clockwork-Muse commented Oct 18, 2023

EgorBo commented Oct 18, 2023 •

edited

DeepakRajendrakumaran commented Oct 18, 2023 •

edited

EgorBo commented Oct 18, 2023

TrevorFellman-MSFT commented Oct 18, 2023

EgorBo commented Oct 18, 2023 •

edited

Daniel-Svensson commented Oct 18, 2023

EgorBo commented Oct 18, 2023

DeepakRajendrakumaran commented Oct 18, 2023

EgorBo commented Oct 18, 2023

Clockwork-Muse commented Oct 19, 2023

gfoidl commented Oct 19, 2023

EgorBo commented Oct 19, 2023 •

edited

DeepakRajendrakumaran commented Oct 19, 2023 •

edited

ghost commented Oct 20, 2023

Description

Configuration

Regression?

Data

Analysis

Interlocked.Read from 32-bit apps are much slower on Intel Sapphire Rapids CPUs #93624

Interlocked.Read from 32-bit apps are much slower on Intel Sapphire Rapids CPUs #93624

Comments

TrevorFellman-MSFT commented Oct 17, 2023 • edited by EgorBo

Description

Configuration

Regression?

Data

Analysis

TrevorFellman-MSFT commented Oct 17, 2023

ghost commented Oct 17, 2023

Description

Configuration

Regression?

Data

Analysis

ghost commented Oct 18, 2023

Description

Configuration

Regression?

Data

Analysis

Clockwork-Muse commented Oct 18, 2023

EgorBo commented Oct 18, 2023 • edited

DeepakRajendrakumaran commented Oct 18, 2023 • edited

EgorBo commented Oct 18, 2023

TrevorFellman-MSFT commented Oct 18, 2023

EgorBo commented Oct 18, 2023 • edited

Daniel-Svensson commented Oct 18, 2023

EgorBo commented Oct 18, 2023

DeepakRajendrakumaran commented Oct 18, 2023

EgorBo commented Oct 18, 2023

Clockwork-Muse commented Oct 19, 2023

gfoidl commented Oct 19, 2023

EgorBo commented Oct 19, 2023 • edited

DeepakRajendrakumaran commented Oct 19, 2023 • edited

ghost commented Oct 20, 2023

Description

Configuration

Regression?

Data

Analysis

TrevorFellman-MSFT commented Oct 17, 2023 •

edited by EgorBo

EgorBo commented Oct 18, 2023 •

edited

DeepakRajendrakumaran commented Oct 18, 2023 •

edited

EgorBo commented Oct 18, 2023 •

edited

EgorBo commented Oct 19, 2023 •

edited

DeepakRajendrakumaran commented Oct 19, 2023 •

edited