Span<T> wrapped in a struct isn't performing as fast as it could be #68797

essoperagma · 2022-05-03T01:42:32Z

Brief intro:

I'm developing a binary serialization library with two major requirements:

offering high performance
minimizing user errors by utilizing the type system as much as possible.

For requirement #1, I'm heavily using Span. I'm quite happy with the performance and how clean the internal serializer implementation is.

For requirement #2, I'm trying to prevent users from accidentally passing the wrong Span to a method.
As a solution, I'm wrapping Span in my own struct: WrappedSpan. All my methods are requesting a WrappedSpan and they won't accept a Span.
This helps prevent mistakes via compile-time errors in the case that the user passes a Span instead of a WrappedSpan to any of the serialization methods.

The problem:

I noticed that a regular Span<byte> is performing 100% better than a Span wrapped in an otherwise-empty struct in some scenarios (many, small, chained method calls).

The method using Span<byte>:

public MethodHost_Span WriteInt32_Span(ref Span<byte> span, int value)
{
    MemoryMarshal.Cast<byte, int>(span)[0] = value;
    span = span.Slice(sizeof(int));
    // Return of zero-size struct is needed for method chaining.
    return default(MethodHost_Span);
}

The method using WrappedSpan:

public MethodHost_WrappedSpan WriteInt32_WrappedSpan(ref WrappedSpan wrapper, int value)
{
    MemoryMarshal.Cast<byte, int>(wrapper.Span)[0] = value;
    wrapper.Span = wrapper.Span.Slice(sizeof(int));
    // Return of zero-size struct is needed for method chaining.
    return default(MethodHost_WrappedSpan);
}

public ref struct WrappedSpan
{
    public Span<byte> Span;
}

Actual benchmark method is here.

More details:

All this is happening on .NET 7.0.100-preview.3.22179.4, but I got similar results on .NET 6 and 5.
Setting DOTNET_TieredPGO doesn't impact the performance.
The full benchmark project can be found here.
Outputs from Disasmo can be found here in the same repo.
Local runtime build used by Disasmo is at commit: 3535e0769f202ae4cd820bea24afd20cee313966
I'm on Windows 10, Version 10.0.19044 Build 19044
Building for amd64.

BenchmarkDotNet results with different data types:

Method	Mean	Error	StdDev
WriteMany_Int32_Span	1.163 us	0.0122 us	0.0114 us
WriteMany_Int32_WrappedSpan	2.169 us	0.0194 us	0.0172 us
WriteMany_Single_Span	1.085 us	0.0119 us	0.0106 us
WriteMany_Single_WrappedSpan	2.183 us	0.0206 us	0.0161 us
WriteMany_Double_Span	1.096 us	0.0215 us	0.0221 us
WriteMany_Double_WrappedSpan	2.192 us	0.0181 us	0.0161 us
WriteMany_Mixed_Span	1.134 us	0.0127 us	0.0119 us
WriteMany_Mixed_WrappedSpan	2.183 us	0.0242 us	0.0226 us

Expected

I would expect the wrapping struct to have no impact on the generated byte code. In other words, WrappedSpan performs just as fast as a regular Span.

My questions

Considering JIT internals, is this an expected result? If yes, could you share the decision-making process of JIT that results in such perf difference?
Are there any options/tricks that I can use to get better results with the WrappedSpan?
Would you consider improving JIT to generate better performing code for such scenarios?

category:cq
theme:structs
skill-level:expert
cost:large
impact:medium

The text was updated successfully, but these errors were encountered:

ghost · 2022-05-03T01:42:36Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Brief intro:

I'm developing a binary serialization library with two major requirements:

offering high performance
minimizing user errors by utilizing the type system as much as possible.

For requirement #1, I'm heavily using Span. I'm quite happy with the performance and how clean the internal serializer implementation is.

For requirement #2, I'm trying to prevent users from accidentally passing the wrong Span to a method.
As a solution, I'm wrapping Span in my own struct: WrappedSpan. All my methods are requesting a WrappedSpan and they won't accept a Span.
This helps prevent mistakes via compile-time errors in the case that the user passes a Span instead of a WrappedSpan to any of the serialization methods.

The problem:

I noticed that a regular Span<byte> is performing 100% better than a Span wrapped in an otherwise-empty struct in some scenarios (many, small, chained method calls).

The method using Span<byte>:

public MethodHost_Span WriteInt32_Span(ref Span<byte> span, int value)
{
    MemoryMarshal.Cast<byte, int>(span)[0] = value;
    span = span.Slice(sizeof(int));
    // Return of zero-size struct is needed for method chaining.
    return default(MethodHost_Span);
}

The method using WrappedSpan:

public MethodHost_WrappedSpan WriteInt32_WrappedSpan(ref WrappedSpan wrapper, int value)
{
    MemoryMarshal.Cast<byte, int>(wrapper.Span)[0] = value;
    wrapper.Span = wrapper.Span.Slice(sizeof(int));
    // Return of zero-size struct is needed for method chaining.
    return default(MethodHost_WrappedSpan);
}

public ref struct WrappedSpan
{
    public Span<byte> Span;
}

Actual benchmark method is here.

More details:

All this is happening on .NET 7.0.100-preview.3.22179.4, but I got similar results on .NET 6 and 5.
Setting DOTNET_TieredPGO doesn't impact the performance.
The full benchmark project can be found here.
Outputs from Disasmo can be found here in the same repo.
Local runtime build used by Disasmo is at commit: 3535e0769f202ae4cd820bea24afd20cee313966
I'm on Windows 10, Version 10.0.19044 Build 19044
Building for amd64.

BenchmarkDotNet results with different data types:

Method	Mean	Error	StdDev
WriteMany_Int32_Span	1.163 us	0.0122 us	0.0114 us
WriteMany_Int32_WrappedSpan	2.169 us	0.0194 us	0.0172 us
WriteMany_Single_Span	1.085 us	0.0119 us	0.0106 us
WriteMany_Single_WrappedSpan	2.183 us	0.0206 us	0.0161 us
WriteMany_Double_Span	1.096 us	0.0215 us	0.0221 us
WriteMany_Double_WrappedSpan	2.192 us	0.0181 us	0.0161 us
WriteMany_Mixed_Span	1.134 us	0.0127 us	0.0119 us
WriteMany_Mixed_WrappedSpan	2.183 us	0.0242 us	0.0226 us

Expected

I would expect the wrapping struct to have no impact on the generated byte code. In other words, WrappedSpan performs just as fast as a regular Span.

My questions

Considering JIT internals, is this an expected result? If yes, could you share the decision-making process of JIT that results in such perf difference?
Are there any options/tricks that I can use to get better results with the WrappedSpan?
Would you consider improving JIT to generate better performing code for such scenarios?

Author:	essoperagma
Assignees:	-
Labels:	`area-System.Memory`, `tenet-performance`, `untriaged`
Milestone:	-

gfoidl · 2022-05-03T08:36:21Z

In the assembly you linked there are a lot of System.Diagnostics.Debug:Fail-calls. Did you get these from Release or a Debug build (looks like the latter)?

In sharplab almost the same code is generated.

FWIW 1: Instead of MemoryMarshal.Cast<byte, int>(span)[0] = value; I'd use MemoryMarshal.TryWrite(span, ref value);. Is more natural way of expressing what should happen, and yields better code (by avoiding the checks for the cast).

FWIW 2: Instead of having ref Span-argument, the "usual" way is to have the method like this

-public MethodHost_Span WriteInt32_Span(ref Span<byte> span, int value);
+public MethodHost_Span WriteInt32_Span(Span<byte> span, int value, out int bytesWritten);

So the consumer of that method has to slice the span.
W/o the ref Span arguments it's easier for the JIT to optimize. See sharplab, code for type Foo. With the ref Span the method isn't inlined, while the other two get inlined.

essoperagma · 2022-05-03T11:58:29Z

Hey @gfoidl,
Thanks for the comment.

I am not sure why Debug.Fails are there. But I'm sure that:

the benchmarks are running in Release mode (benchmark.net complains in debug mode anyway),
Disasmo is building my project in Release mode (as I can see it in the UI)
The local runtime build I have for Disasmo was built in release mode ( I used .\build.cmd Clr+Libs -c Release -rc Checked)

I didn't know about MemoryMarshal.TryWrite, but it seems like a better alternative than Cast. I will go with Write instead of TryWrite though as I neither want to handle the error nor swallow it.

As for your second point, having an out int bytesWritten will greatly complicate the call site for the users. See how I'm chaining calls to serialize a Book object:

Span<byte> w = buffer.Span;
BookDetailsMessage bookDetails =  BookDetailsMessage.Build()
    .WithIdentifier(ref w)
    .WithTitle(ref w, book.Title)
    .WithISBN(ref w, book.ISBN)
    .WithPageCount(ref w, book.PageCount)
    .WithYear(ref w, book.Year);

The next method in the chain is still expecting a Span. So I would need to call span.Slice(bytesWritten), ending up just shifting the work around.

Finally, I don't think comparing WriteInt32_Span byte code to WriteInt32_WrappedSpan byte code is very relevant for this case. These methods will always be chained and as far as I have been able to test, they will almost a 100% be inlined. I'm more interested in the comparison of the call sites, like this one from the benchmark.

gfoidl · 2022-05-03T12:42:30Z

Note: comment put inside a details to not distract too much from the main-concern you have.

Re: As for your second point

As for your second point, having an out int bytesWritten will greatly complicate the call site for the users. See how I'm chaining calls to serialize a Book object:

I don't know if this the concrete usage, or just a demo that you showed. But by using a more standard builder-pattern, this can be accomplished quite nicely (and very efficient):

using System.Buffers;
using System.Runtime.InteropServices;
using System.Text;

using BookDetailsMessageBuilder bookDetailsMessageBuilder = new(stackalloc byte[256]);

BookDetailsMessage bookDetailsMessage = bookDetailsMessageBuilder
    .WithTitle(".NET is amazing Pt. 7")
    .WithPageCount(42)
    .Build();

public class BookDetailsMessage
{
    private readonly byte[] _binaryData;

    // Or hower it should be used.
    public BookDetailsMessage(byte[] binaryData) => _binaryData = binaryData;
}

public ref struct BookDetailsMessageBuilder
{
    private int _curPos;
    private Span<byte> _buffer;
    private byte[]? _bufferFromPool;

    public BookDetailsMessageBuilder()
    {
        _curPos = 0;
        _buffer = _bufferFromPool = ArrayPool<byte>.Shared.Rent(1024);
    }

    public BookDetailsMessageBuilder(Span<byte> initialBuffer)
    {
        _curPos = 0;
        _buffer = initialBuffer;
        _bufferFromPool = null;
    }

    public void Dispose()
    {
        if (_bufferFromPool is not null)
        {
            ArrayPool<byte>.Shared.Return(_bufferFromPool);
            _bufferFromPool = null;
        }
    }

    private Span<byte> CurrentBuffer => _buffer.Slice(_curPos);

    public BookDetailsMessageBuilder WithTitle(string title)
    {
        // Validate that into the CurrentBuffer can be written, otherwise grow the buffer.
        // E.g. by renting a larger buffer from the ArrayPool, and copying over the 
        // current contents. Omitted here for brevity.

        int written = Encoding.UTF8.GetBytes(title, CurrentBuffer);
        _curPos += written;

        return this;
    }

    public BookDetailsMessageBuilder WithPageCount(int pageCount)
    {
        // Validate that into the CurrentBuffer can be written, otherwise grow the buffer.
        // E.g. by renting a larger buffer from the ArrayPool, and copying over the 
        // current contents. Omitted here for brevity.

        MemoryMarshal.Write(CurrentBuffer, ref pageCount);
        _curPos += sizeof(int);

        return this;
    }

    public BookDetailsMessage Build()
    {
        return new BookDetailsMessage(_buffer[.._curPos].ToArray());
    }
}

For the "grow" you can take inspiration at the code of List<T> for instance.

tannergooding · 2022-05-03T12:49:33Z

I am not sure why Debug.Fails are there. But I'm sure that:
The local runtime build I have for Disasmo was built in release mode ( I used .\build.cmd Clr+Libs -c Release -rc Checked)

It's worth noting that -rc Checked means the runtime will be built as Checked. This means optimizations are enabled, but debug validation is still preserved. -rc includes System.Private.Corelib which will impact disassembly and potentially perf (https://github.com/dotnet/runtime/blob/main/src/coreclr/System.Private.CoreLib/System.Private.CoreLib.csproj#L87-L94)

While I haven't done the investigation to look at why there are differences, there are several known cases where having wrapper structs may be less efficient. This ranges from ABI differences between T value and struct Wrapper { T value; } to a few cases where the JIT is simply known to pessimize the codegen (often due to ABI requirements, such as around wrappers of single floating-point values). Improving these areas, where possible, is covered by:

Some of this will only be possible to fix if we deviate from the native ABI and native code (C, C++, Rust compilers, etc) would likewise pessimize the result when inlining cannot happen.

ghost · 2022-05-04T08:51:29Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Brief intro:

I'm developing a binary serialization library with two major requirements:

offering high performance
minimizing user errors by utilizing the type system as much as possible.

For requirement #1, I'm heavily using Span. I'm quite happy with the performance and how clean the internal serializer implementation is.

For requirement #2, I'm trying to prevent users from accidentally passing the wrong Span to a method.
As a solution, I'm wrapping Span in my own struct: WrappedSpan. All my methods are requesting a WrappedSpan and they won't accept a Span.
This helps prevent mistakes via compile-time errors in the case that the user passes a Span instead of a WrappedSpan to any of the serialization methods.

The problem:

I noticed that a regular Span<byte> is performing 100% better than a Span wrapped in an otherwise-empty struct in some scenarios (many, small, chained method calls).

The method using Span<byte>:

public MethodHost_Span WriteInt32_Span(ref Span<byte> span, int value)
{
    MemoryMarshal.Cast<byte, int>(span)[0] = value;
    span = span.Slice(sizeof(int));
    // Return of zero-size struct is needed for method chaining.
    return default(MethodHost_Span);
}

The method using WrappedSpan:

public MethodHost_WrappedSpan WriteInt32_WrappedSpan(ref WrappedSpan wrapper, int value)
{
    MemoryMarshal.Cast<byte, int>(wrapper.Span)[0] = value;
    wrapper.Span = wrapper.Span.Slice(sizeof(int));
    // Return of zero-size struct is needed for method chaining.
    return default(MethodHost_WrappedSpan);
}

public ref struct WrappedSpan
{
    public Span<byte> Span;
}

Actual benchmark method is here.

More details:

All this is happening on .NET 7.0.100-preview.3.22179.4, but I got similar results on .NET 6 and 5.
Setting DOTNET_TieredPGO doesn't impact the performance.
The full benchmark project can be found here.
Outputs from Disasmo can be found here in the same repo.
Local runtime build used by Disasmo is at commit: 3535e0769f202ae4cd820bea24afd20cee313966
I'm on Windows 10, Version 10.0.19044 Build 19044
Building for amd64.

BenchmarkDotNet results with different data types:

Method	Mean	Error	StdDev
WriteMany_Int32_Span	1.163 us	0.0122 us	0.0114 us
WriteMany_Int32_WrappedSpan	2.169 us	0.0194 us	0.0172 us
WriteMany_Single_Span	1.085 us	0.0119 us	0.0106 us
WriteMany_Single_WrappedSpan	2.183 us	0.0206 us	0.0161 us
WriteMany_Double_Span	1.096 us	0.0215 us	0.0221 us
WriteMany_Double_WrappedSpan	2.192 us	0.0181 us	0.0161 us
WriteMany_Mixed_Span	1.134 us	0.0127 us	0.0119 us
WriteMany_Mixed_WrappedSpan	2.183 us	0.0242 us	0.0226 us

Expected

I would expect the wrapping struct to have no impact on the generated byte code. In other words, WrappedSpan performs just as fast as a regular Span.

My questions

Considering JIT internals, is this an expected result? If yes, could you share the decision-making process of JIT that results in such perf difference?
Are there any options/tricks that I can use to get better results with the WrappedSpan?
Would you consider improving JIT to generate better performing code for such scenarios?

Author:	essoperagma
Assignees:	-
Labels:	`tenet-performance`, `area-CodeGen-coreclr`, `untriaged`
Milestone:	-

AndyAyersMS · 2022-05-04T18:48:21Z

As Tanner notes, wrapping structs around types may not give the same performance as passing the underlying type. This is especially true when the data being wrapped is itself a multi-field struct (like Span<T>).

Structs wrapping single-field structs are optimized in many but not all cases; see eg dotnet/coreclr#22867.

There is a general concept of "recursive struct promotion" where optimizations like struct promotion can be applied to arbitrary structs with arbitrary nesting which we hope to address someday. There's no specific issue open for this and no timeline for when we might get around to working on it.

jakobbotsch · 2023-06-16T21:40:56Z

With the same modifications as in #32415 (comment) physical promotion gets us:

Method	Job	EnvironmentVariables	Mean	Error	StdDev
WriteMany_Int32_Span	Job-AZLIHY	Empty	610.1 ns	2.96 ns	2.62 ns
WriteMany_Int32_WrappedSpan	Job-AZLIHY	Empty	775.0 ns	2.07 ns	1.93 ns
WriteMany_Int32_WrappedSpan	Job-CPRVWC	DOTNET_JitEnablePhysicalPromotion=1	611.1 ns	3.03 ns	2.83 ns


WriteMany_Single_Span	Job-AZLIHY	Empty	579.7 ns	1.64 ns	1.37 ns
WriteMany_Single_WrappedSpan	Job-AZLIHY	Empty	775.8 ns	1.64 ns	1.37 ns
WriteMany_Single_WrappedSpan	Job-CPRVWC	DOTNET_JitEnablePhysicalPromotion=1	614.6 ns	6.86 ns	6.42 ns


WriteMany_Double_Span	Job-AZLIHY	Empty	584.5 ns	3.23 ns	3.02 ns
WriteMany_Double_WrappedSpan	Job-AZLIHY	Empty	779.1 ns	2.81 ns	2.35 ns
WriteMany_Double_WrappedSpan	Job-CPRVWC	DOTNET_JitEnablePhysicalPromotion=1	605.8 ns	1.04 ns	0.97 ns


WriteMany_Mixed_Span	Job-AZLIHY	Empty	629.0 ns	2.13 ns	1.78 ns
WriteMany_Mixed_WrappedSpan	Job-AZLIHY	Empty	825.9 ns	7.44 ns	6.96 ns
WriteMany_Mixed_WrappedSpan	Job-CPRVWC	DOTNET_JitEnablePhysicalPromotion=1	591.7 ns	1.31 ns	1.16 ns

Double checked a few of the cases and the codegen was equal between the WrappedSpan and Span versions with physical promotion enabled.

essoperagma · 2023-06-18T23:50:23Z

Great to see so much progress on this. Thanks @jakobbotsch!

Are the changes in #32415 (comment) already merged to main? I'm hoping to get a daily build to give this a try.

Even without DOTNET_JitEnablePhysicalPromotion=1 though, WrappedSpan seems to be performing much better in .NET 8 compared to .NET 7. Was there some other struct-promotion work in already completed in .NET 8?

jakobbotsch · 2023-06-19T07:49:47Z

@essoperagma Not yet. I can let you know once I submit the PR to fix the remaining problem here, but it may still be a bit before that happens.

Even without DOTNET_JitEnablePhysicalPromotion=1 though, WrappedSpan seems to be performing much better in .NET 8 compared to .NET 7. Was there some other struct-promotion work in already completed in .NET 8?

Nothing I'm aware of that would affect this benchmark, but it's likely various other improvements in the JIT affected it. Sadly it's a bit hard to check what exactly is the cause of the improvement.

essoperagma added the tenet-performance Performance related issue label May 3, 2022

dotnet-issue-labeler bot added area-System.Memory untriaged New issue has not been triaged by the area owner labels May 3, 2022

jakobbotsch added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Memory labels May 4, 2022

AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label May 4, 2022

AndyAyersMS added this to the Future milestone May 4, 2022

Chicken-Bones mentioned this issue Jun 30, 2022

Poor performance/code-gen with struct enumerators, especially when containing ReadOnlySpan #71510

Closed

This was referenced Oct 12, 2022

JIT: Generalized struct promotion #76928

Closed

JIT struct work planned for .NET 8 #76931

Closed

jakobbotsch self-assigned this Jun 16, 2023

jakobbotsch modified the milestones: Future, 8.0.0 Jun 16, 2023

jakobbotsch added the Priority:2 Work that is important, but not critical for the release label Jun 20, 2023

jakobbotsch mentioned this issue Jun 27, 2023

JIT: Enable physical promotion by default #88090

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 27, 2023

jakobbotsch closed this as completed in 9dcc7b1 Jun 29, 2023

jakobbotsch closed this as completed in #88090 Jun 29, 2023

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 29, 2023

ghost locked as resolved and limited conversation to collaborators Jul 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Span<T> wrapped in a struct isn't performing as fast as it could be #68797

Span<T> wrapped in a struct isn't performing as fast as it could be #68797

essoperagma commented May 3, 2022 •

edited by BruceForstall

Loading

ghost commented May 3, 2022

Brief intro:

The problem:

BenchmarkDotNet results with different data types:

Expected

My questions

gfoidl commented May 3, 2022 •

edited

Loading

essoperagma commented May 3, 2022

gfoidl commented May 3, 2022

tannergooding commented May 3, 2022

ghost commented May 4, 2022

Brief intro:

The problem:

BenchmarkDotNet results with different data types:

Expected

My questions

AndyAyersMS commented May 4, 2022

jakobbotsch commented Jun 16, 2023

essoperagma commented Jun 18, 2023

jakobbotsch commented Jun 19, 2023

Span<T> wrapped in a struct isn't performing as fast as it could be #68797

Span<T> wrapped in a struct isn't performing as fast as it could be #68797

Comments

essoperagma commented May 3, 2022 • edited by BruceForstall Loading

Brief intro:

The problem:

BenchmarkDotNet results with different data types:

Expected

My questions

ghost commented May 3, 2022

Brief intro:

The problem:

BenchmarkDotNet results with different data types:

Expected

My questions

gfoidl commented May 3, 2022 • edited Loading

essoperagma commented May 3, 2022

gfoidl commented May 3, 2022

tannergooding commented May 3, 2022

ghost commented May 4, 2022

Brief intro:

The problem:

BenchmarkDotNet results with different data types:

Expected

My questions

AndyAyersMS commented May 4, 2022

jakobbotsch commented Jun 16, 2023

essoperagma commented Jun 18, 2023

jakobbotsch commented Jun 19, 2023

essoperagma commented May 3, 2022 •

edited by BruceForstall

Loading

gfoidl commented May 3, 2022 •

edited

Loading