Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Span<T> wrapped in a struct isn't performing as fast as it could be #68797

Closed
essoperagma opened this issue May 3, 2022 · 10 comments · Fixed by #88090
Closed

Span<T> wrapped in a struct isn't performing as fast as it could be #68797

essoperagma opened this issue May 3, 2022 · 10 comments · Fixed by #88090
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:2 Work that is important, but not critical for the release tenet-performance Performance related issue
Milestone

Comments

@essoperagma
Copy link

essoperagma commented May 3, 2022

Brief intro:

I'm developing a binary serialization library with two major requirements:

  1. offering high performance
  2. minimizing user errors by utilizing the type system as much as possible.

For requirement #1, I'm heavily using Span. I'm quite happy with the performance and how clean the internal serializer implementation is.

For requirement #2, I'm trying to prevent users from accidentally passing the wrong Span to a method.
As a solution, I'm wrapping Span in my own struct: WrappedSpan. All my methods are requesting a WrappedSpan and they won't accept a Span.
This helps prevent mistakes via compile-time errors in the case that the user passes a Span instead of a WrappedSpan to any of the serialization methods.

The problem:

I noticed that a regular Span<byte> is performing 100% better than a Span wrapped in an otherwise-empty struct in some scenarios (many, small, chained method calls).

The method using Span<byte>:

public MethodHost_Span WriteInt32_Span(ref Span<byte> span, int value)
{
    MemoryMarshal.Cast<byte, int>(span)[0] = value;
    span = span.Slice(sizeof(int));
    // Return of zero-size struct is needed for method chaining.
    return default(MethodHost_Span);
}

The method using WrappedSpan:

public MethodHost_WrappedSpan WriteInt32_WrappedSpan(ref WrappedSpan wrapper, int value)
{
    MemoryMarshal.Cast<byte, int>(wrapper.Span)[0] = value;
    wrapper.Span = wrapper.Span.Slice(sizeof(int));
    // Return of zero-size struct is needed for method chaining.
    return default(MethodHost_WrappedSpan);
}

public ref struct WrappedSpan
{
    public Span<byte> Span;
}

Actual benchmark method is here.

More details:

  • All this is happening on .NET 7.0.100-preview.3.22179.4, but I got similar results on .NET 6 and 5.
  • Setting DOTNET_TieredPGO doesn't impact the performance.
  • The full benchmark project can be found here.
  • Outputs from Disasmo can be found here in the same repo.
  • Local runtime build used by Disasmo is at commit: 3535e0769f202ae4cd820bea24afd20cee313966
  • I'm on Windows 10, Version 10.0.19044 Build 19044
  • Building for amd64.

BenchmarkDotNet results with different data types:

Method Mean Error StdDev
WriteMany_Int32_Span 1.163 us 0.0122 us 0.0114 us
WriteMany_Int32_WrappedSpan 2.169 us 0.0194 us 0.0172 us
WriteMany_Single_Span 1.085 us 0.0119 us 0.0106 us
WriteMany_Single_WrappedSpan 2.183 us 0.0206 us 0.0161 us
WriteMany_Double_Span 1.096 us 0.0215 us 0.0221 us
WriteMany_Double_WrappedSpan 2.192 us 0.0181 us 0.0161 us
WriteMany_Mixed_Span 1.134 us 0.0127 us 0.0119 us
WriteMany_Mixed_WrappedSpan 2.183 us 0.0242 us 0.0226 us

Expected

I would expect the wrapping struct to have no impact on the generated byte code. In other words, WrappedSpan performs just as fast as a regular Span.

My questions

  1. Considering JIT internals, is this an expected result? If yes, could you share the decision-making process of JIT that results in such perf difference?
  2. Are there any options/tricks that I can use to get better results with the WrappedSpan?
  3. Would you consider improving JIT to generate better performing code for such scenarios?

category:cq
theme:structs
skill-level:expert
cost:large
impact:medium

@essoperagma essoperagma added the tenet-performance Performance related issue label May 3, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Memory untriaged New issue has not been triaged by the area owner labels May 3, 2022
@ghost
Copy link

ghost commented May 3, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Brief intro:

I'm developing a binary serialization library with two major requirements:

  1. offering high performance
  2. minimizing user errors by utilizing the type system as much as possible.

For requirement #1, I'm heavily using Span. I'm quite happy with the performance and how clean the internal serializer implementation is.

For requirement #2, I'm trying to prevent users from accidentally passing the wrong Span to a method.
As a solution, I'm wrapping Span in my own struct: WrappedSpan. All my methods are requesting a WrappedSpan and they won't accept a Span.
This helps prevent mistakes via compile-time errors in the case that the user passes a Span instead of a WrappedSpan to any of the serialization methods.

The problem:

I noticed that a regular Span<byte> is performing 100% better than a Span wrapped in an otherwise-empty struct in some scenarios (many, small, chained method calls).

The method using Span<byte>:

public MethodHost_Span WriteInt32_Span(ref Span<byte> span, int value)
{
    MemoryMarshal.Cast<byte, int>(span)[0] = value;
    span = span.Slice(sizeof(int));
    // Return of zero-size struct is needed for method chaining.
    return default(MethodHost_Span);
}

The method using WrappedSpan:

public MethodHost_WrappedSpan WriteInt32_WrappedSpan(ref WrappedSpan wrapper, int value)
{
    MemoryMarshal.Cast<byte, int>(wrapper.Span)[0] = value;
    wrapper.Span = wrapper.Span.Slice(sizeof(int));
    // Return of zero-size struct is needed for method chaining.
    return default(MethodHost_WrappedSpan);
}

public ref struct WrappedSpan
{
    public Span<byte> Span;
}

Actual benchmark method is here.

More details:

  • All this is happening on .NET 7.0.100-preview.3.22179.4, but I got similar results on .NET 6 and 5.
  • Setting DOTNET_TieredPGO doesn't impact the performance.
  • The full benchmark project can be found here.
  • Outputs from Disasmo can be found here in the same repo.
  • Local runtime build used by Disasmo is at commit: 3535e0769f202ae4cd820bea24afd20cee313966
  • I'm on Windows 10, Version 10.0.19044 Build 19044
  • Building for amd64.

BenchmarkDotNet results with different data types:

Method Mean Error StdDev
WriteMany_Int32_Span 1.163 us 0.0122 us 0.0114 us
WriteMany_Int32_WrappedSpan 2.169 us 0.0194 us 0.0172 us
WriteMany_Single_Span 1.085 us 0.0119 us 0.0106 us
WriteMany_Single_WrappedSpan 2.183 us 0.0206 us 0.0161 us
WriteMany_Double_Span 1.096 us 0.0215 us 0.0221 us
WriteMany_Double_WrappedSpan 2.192 us 0.0181 us 0.0161 us
WriteMany_Mixed_Span 1.134 us 0.0127 us 0.0119 us
WriteMany_Mixed_WrappedSpan 2.183 us 0.0242 us 0.0226 us

Expected

I would expect the wrapping struct to have no impact on the generated byte code. In other words, WrappedSpan performs just as fast as a regular Span.

My questions

  1. Considering JIT internals, is this an expected result? If yes, could you share the decision-making process of JIT that results in such perf difference?
  2. Are there any options/tricks that I can use to get better results with the WrappedSpan?
  3. Would you consider improving JIT to generate better performing code for such scenarios?
Author: essoperagma
Assignees: -
Labels:

area-System.Memory, tenet-performance, untriaged

Milestone: -

@gfoidl
Copy link
Member

gfoidl commented May 3, 2022

In the assembly you linked there are a lot of System.Diagnostics.Debug:Fail-calls. Did you get these from Release or a Debug build (looks like the latter)?

In sharplab almost the same code is generated.


FWIW 1: Instead of MemoryMarshal.Cast<byte, int>(span)[0] = value; I'd use MemoryMarshal.TryWrite(span, ref value);. Is more natural way of expressing what should happen, and yields better code (by avoiding the checks for the cast).

FWIW 2: Instead of having ref Span-argument, the "usual" way is to have the method like this

-public MethodHost_Span WriteInt32_Span(ref Span<byte> span, int value);
+public MethodHost_Span WriteInt32_Span(Span<byte> span, int value, out int bytesWritten);

So the consumer of that method has to slice the span.
W/o the ref Span arguments it's easier for the JIT to optimize. See sharplab, code for type Foo. With the ref Span the method isn't inlined, while the other two get inlined.

@essoperagma
Copy link
Author

Hey @gfoidl,
Thanks for the comment.

I am not sure why Debug.Fails are there. But I'm sure that:

  • the benchmarks are running in Release mode (benchmark.net complains in debug mode anyway),
  • Disasmo is building my project in Release mode (as I can see it in the UI)
  • The local runtime build I have for Disasmo was built in release mode ( I used .\build.cmd Clr+Libs -c Release -rc Checked)

I didn't know about MemoryMarshal.TryWrite, but it seems like a better alternative than Cast. I will go with Write instead of TryWrite though as I neither want to handle the error nor swallow it.

As for your second point, having an out int bytesWritten will greatly complicate the call site for the users. See how I'm chaining calls to serialize a Book object:

Span<byte> w = buffer.Span;
BookDetailsMessage bookDetails =  BookDetailsMessage.Build()
    .WithIdentifier(ref w)
    .WithTitle(ref w, book.Title)
    .WithISBN(ref w, book.ISBN)
    .WithPageCount(ref w, book.PageCount)
    .WithYear(ref w, book.Year);

The next method in the chain is still expecting a Span. So I would need to call span.Slice(bytesWritten), ending up just shifting the work around.

Finally, I don't think comparing WriteInt32_Span byte code to WriteInt32_WrappedSpan byte code is very relevant for this case. These methods will always be chained and as far as I have been able to test, they will almost a 100% be inlined. I'm more interested in the comparison of the call sites, like this one from the benchmark.

@gfoidl
Copy link
Member

gfoidl commented May 3, 2022

Note: comment put inside a details to not distract too much from the main-concern you have.

Re: As for your second point

As for your second point, having an out int bytesWritten will greatly complicate the call site for the users. See how I'm chaining calls to serialize a Book object:

I don't know if this the concrete usage, or just a demo that you showed. But by using a more standard builder-pattern, this can be accomplished quite nicely (and very efficient):

using System.Buffers;
using System.Runtime.InteropServices;
using System.Text;

using BookDetailsMessageBuilder bookDetailsMessageBuilder = new(stackalloc byte[256]);

BookDetailsMessage bookDetailsMessage = bookDetailsMessageBuilder
    .WithTitle(".NET is amazing Pt. 7")
    .WithPageCount(42)
    .Build();

public class BookDetailsMessage
{
    private readonly byte[] _binaryData;

    // Or hower it should be used.
    public BookDetailsMessage(byte[] binaryData) => _binaryData = binaryData;
}

public ref struct BookDetailsMessageBuilder
{
    private int _curPos;
    private Span<byte> _buffer;
    private byte[]? _bufferFromPool;

    public BookDetailsMessageBuilder()
    {
        _curPos = 0;
        _buffer = _bufferFromPool = ArrayPool<byte>.Shared.Rent(1024);
    }

    public BookDetailsMessageBuilder(Span<byte> initialBuffer)
    {
        _curPos = 0;
        _buffer = initialBuffer;
        _bufferFromPool = null;
    }

    public void Dispose()
    {
        if (_bufferFromPool is not null)
        {
            ArrayPool<byte>.Shared.Return(_bufferFromPool);
            _bufferFromPool = null;
        }
    }

    private Span<byte> CurrentBuffer => _buffer.Slice(_curPos);

    public BookDetailsMessageBuilder WithTitle(string title)
    {
        // Validate that into the CurrentBuffer can be written, otherwise grow the buffer.
        // E.g. by renting a larger buffer from the ArrayPool, and copying over the 
        // current contents. Omitted here for brevity.

        int written = Encoding.UTF8.GetBytes(title, CurrentBuffer);
        _curPos += written;

        return this;
    }

    public BookDetailsMessageBuilder WithPageCount(int pageCount)
    {
        // Validate that into the CurrentBuffer can be written, otherwise grow the buffer.
        // E.g. by renting a larger buffer from the ArrayPool, and copying over the 
        // current contents. Omitted here for brevity.

        MemoryMarshal.Write(CurrentBuffer, ref pageCount);
        _curPos += sizeof(int);

        return this;
    }

    public BookDetailsMessage Build()
    {
        return new BookDetailsMessage(_buffer[.._curPos].ToArray());
    }
}

For the "grow" you can take inspiration at the code of List<T> for instance.

@tannergooding
Copy link
Member

I am not sure why Debug.Fails are there. But I'm sure that:
The local runtime build I have for Disasmo was built in release mode ( I used .\build.cmd Clr+Libs -c Release -rc Checked)

It's worth noting that -rc Checked means the runtime will be built as Checked. This means optimizations are enabled, but debug validation is still preserved. -rc includes System.Private.Corelib which will impact disassembly and potentially perf (https://github.com/dotnet/runtime/blob/main/src/coreclr/System.Private.CoreLib/System.Private.CoreLib.csproj#L87-L94)

While I haven't done the investigation to look at why there are differences, there are several known cases where having wrapper structs may be less efficient. This ranges from ABI differences between T value and struct Wrapper { T value; } to a few cases where the JIT is simply known to pessimize the codegen (often due to ABI requirements, such as around wrappers of single floating-point values). Improving these areas, where possible, is covered by:

Some of this will only be possible to fix if we deviate from the native ABI and native code (C, C++, Rust compilers, etc) would likewise pessimize the result when inlining cannot happen.

@jakobbotsch jakobbotsch added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Memory labels May 4, 2022
@ghost
Copy link

ghost commented May 4, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Brief intro:

I'm developing a binary serialization library with two major requirements:

  1. offering high performance
  2. minimizing user errors by utilizing the type system as much as possible.

For requirement #1, I'm heavily using Span. I'm quite happy with the performance and how clean the internal serializer implementation is.

For requirement #2, I'm trying to prevent users from accidentally passing the wrong Span to a method.
As a solution, I'm wrapping Span in my own struct: WrappedSpan. All my methods are requesting a WrappedSpan and they won't accept a Span.
This helps prevent mistakes via compile-time errors in the case that the user passes a Span instead of a WrappedSpan to any of the serialization methods.

The problem:

I noticed that a regular Span<byte> is performing 100% better than a Span wrapped in an otherwise-empty struct in some scenarios (many, small, chained method calls).

The method using Span<byte>:

public MethodHost_Span WriteInt32_Span(ref Span<byte> span, int value)
{
    MemoryMarshal.Cast<byte, int>(span)[0] = value;
    span = span.Slice(sizeof(int));
    // Return of zero-size struct is needed for method chaining.
    return default(MethodHost_Span);
}

The method using WrappedSpan:

public MethodHost_WrappedSpan WriteInt32_WrappedSpan(ref WrappedSpan wrapper, int value)
{
    MemoryMarshal.Cast<byte, int>(wrapper.Span)[0] = value;
    wrapper.Span = wrapper.Span.Slice(sizeof(int));
    // Return of zero-size struct is needed for method chaining.
    return default(MethodHost_WrappedSpan);
}

public ref struct WrappedSpan
{
    public Span<byte> Span;
}

Actual benchmark method is here.

More details:

  • All this is happening on .NET 7.0.100-preview.3.22179.4, but I got similar results on .NET 6 and 5.
  • Setting DOTNET_TieredPGO doesn't impact the performance.
  • The full benchmark project can be found here.
  • Outputs from Disasmo can be found here in the same repo.
  • Local runtime build used by Disasmo is at commit: 3535e0769f202ae4cd820bea24afd20cee313966
  • I'm on Windows 10, Version 10.0.19044 Build 19044
  • Building for amd64.

BenchmarkDotNet results with different data types:

Method Mean Error StdDev
WriteMany_Int32_Span 1.163 us 0.0122 us 0.0114 us
WriteMany_Int32_WrappedSpan 2.169 us 0.0194 us 0.0172 us
WriteMany_Single_Span 1.085 us 0.0119 us 0.0106 us
WriteMany_Single_WrappedSpan 2.183 us 0.0206 us 0.0161 us
WriteMany_Double_Span 1.096 us 0.0215 us 0.0221 us
WriteMany_Double_WrappedSpan 2.192 us 0.0181 us 0.0161 us
WriteMany_Mixed_Span 1.134 us 0.0127 us 0.0119 us
WriteMany_Mixed_WrappedSpan 2.183 us 0.0242 us 0.0226 us

Expected

I would expect the wrapping struct to have no impact on the generated byte code. In other words, WrappedSpan performs just as fast as a regular Span.

My questions

  1. Considering JIT internals, is this an expected result? If yes, could you share the decision-making process of JIT that results in such perf difference?
  2. Are there any options/tricks that I can use to get better results with the WrappedSpan?
  3. Would you consider improving JIT to generate better performing code for such scenarios?
Author: essoperagma
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, untriaged

Milestone: -

@AndyAyersMS
Copy link
Member

As Tanner notes, wrapping structs around types may not give the same performance as passing the underlying type. This is especially true when the data being wrapped is itself a multi-field struct (like Span<T>).

Structs wrapping single-field structs are optimized in many but not all cases; see eg dotnet/coreclr#22867.

There is a general concept of "recursive struct promotion" where optimizations like struct promotion can be applied to arbitrary structs with arbitrary nesting which we hope to address someday. There's no specific issue open for this and no timeline for when we might get around to working on it.

@jakobbotsch
Copy link
Member

With the same modifications as in #32415 (comment) physical promotion gets us:

Method Job EnvironmentVariables Mean Error StdDev
WriteMany_Int32_Span Job-AZLIHY Empty 610.1 ns 2.96 ns 2.62 ns
WriteMany_Int32_WrappedSpan Job-AZLIHY Empty 775.0 ns 2.07 ns 1.93 ns
WriteMany_Int32_WrappedSpan Job-CPRVWC DOTNET_JitEnablePhysicalPromotion=1 611.1 ns 3.03 ns 2.83 ns
WriteMany_Single_Span Job-AZLIHY Empty 579.7 ns 1.64 ns 1.37 ns
WriteMany_Single_WrappedSpan Job-AZLIHY Empty 775.8 ns 1.64 ns 1.37 ns
WriteMany_Single_WrappedSpan Job-CPRVWC DOTNET_JitEnablePhysicalPromotion=1 614.6 ns 6.86 ns 6.42 ns
WriteMany_Double_Span Job-AZLIHY Empty 584.5 ns 3.23 ns 3.02 ns
WriteMany_Double_WrappedSpan Job-AZLIHY Empty 779.1 ns 2.81 ns 2.35 ns
WriteMany_Double_WrappedSpan Job-CPRVWC DOTNET_JitEnablePhysicalPromotion=1 605.8 ns 1.04 ns 0.97 ns
WriteMany_Mixed_Span Job-AZLIHY Empty 629.0 ns 2.13 ns 1.78 ns
WriteMany_Mixed_WrappedSpan Job-AZLIHY Empty 825.9 ns 7.44 ns 6.96 ns
WriteMany_Mixed_WrappedSpan Job-CPRVWC DOTNET_JitEnablePhysicalPromotion=1 591.7 ns 1.31 ns 1.16 ns

Double checked a few of the cases and the codegen was equal between the WrappedSpan and Span versions with physical promotion enabled.

@jakobbotsch jakobbotsch self-assigned this Jun 16, 2023
@jakobbotsch jakobbotsch modified the milestones: Future, 8.0.0 Jun 16, 2023
@essoperagma
Copy link
Author

Great to see so much progress on this. Thanks @jakobbotsch!

Are the changes in #32415 (comment) already merged to main? I'm hoping to get a daily build to give this a try.

Even without DOTNET_JitEnablePhysicalPromotion=1 though, WrappedSpan seems to be performing much better in .NET 8 compared to .NET 7. Was there some other struct-promotion work in already completed in .NET 8?

@jakobbotsch
Copy link
Member

@essoperagma Not yet. I can let you know once I submit the PR to fix the remaining problem here, but it may still be a bit before that happens.

Even without DOTNET_JitEnablePhysicalPromotion=1 though, WrappedSpan seems to be performing much better in .NET 8 compared to .NET 7. Was there some other struct-promotion work in already completed in .NET 8?

Nothing I'm aware of that would affect this benchmark, but it's likely various other improvements in the JIT affected it. Sadly it's a bit hard to check what exactly is the cause of the improvement.

@jakobbotsch jakobbotsch added the Priority:2 Work that is important, but not critical for the release label Jun 20, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 27, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 29, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 29, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:2 Work that is important, but not critical for the release tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants