ReadOnlySpan<char>.Trim for small inputs is few times slower on Linux #13669

adamsitnik · 2019-10-28T15:48:58Z

ReadOnlySpan<char>.Trim for small inputs is a few times slower on Linux than on Windows:

| Slower benchmark | Lin/Win ratio | Win median (ns) | Lin median (ns) |
|---|---|---|---|
| System.Memory.ReadOnlySpan.Trim(input: "") | 6.45 | 3.36 | 21.68 |
| System.Memory.ReadOnlySpan.Trim(input: "abcdefg") | 5.67 | 4.82 | 27.33 |
| System.Memory.ReadOnlySpan.Trim(input: " abcdefg ") | 3.32 | 7.78 | 25.83 |

Benchmark:

https://github.com/dotnet/performance/blob/8b23cabe793b4ff73a9b28c7dd092b11dc17b197/src/benchmarks/micro/corefx/System.Memory/ReadOnlySpan.cs#L77-L79

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp5.0 --filter System.Memory.ReadOnlySpan.Trim

I've created a very small repro app:

using System;

class Program
{
    static int Main(string[] args)
    {
        int result = 0;

        ReadOnlySpan<char> span = string.Empty.AsSpan();
        for (int i = 0; i < 1_000_000_000; i++)
        {
            result ^= TrimSourceCopied(span).Length;
        }

        return result;
    }
    
    private static ReadOnlySpan<char> TrimSourceCopied(ReadOnlySpan<char> span)
    {
        int start = 0;
        for (; start < span.Length; start++)
        {
            if (!char.IsWhiteSpace(span[start]))
            {
                break;
            }
        }

        int end = span.Length - 1;
        for (; end > start; end--)
        {
            if (!char.IsWhiteSpace(span[end]))
            {
                break;
            }
        }

        return span.Slice(start, end - start + 1);
    }
}

And using VTune I was able to narrow down the problem to struct copying:

The body of the Main method on Ubuntu 18.04:

Note the body of the loop:

When I change TrimSourceCopied to accept the span as a readonly ref (in) parameter:

private static ReadOnlySpan<char> TrimSourceCopied(in ReadOnlySpan<char> span)
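For comparison, here is a minimal sketch of the two signatures side by side. The class name `TrimVariants` and the method names `TrimByValue` and `TrimByIn` are made up for this illustration; the trimming logic matches the repro above, only written with `while` loops. Both methods return the same slice, and the only difference is how the span reaches the callee.

```csharp
using System;

// Hypothetical names, for illustration only.
static class TrimVariants
{
    // The span struct (a reference plus a length) is passed by value.
    public static ReadOnlySpan<char> TrimByValue(ReadOnlySpan<char> span)
    {
        int start = 0;
        while (start < span.Length && char.IsWhiteSpace(span[start]))
        {
            start++;
        }

        int end = span.Length - 1;
        while (end > start && char.IsWhiteSpace(span[end]))
        {
            end--;
        }

        return span.Slice(start, end - start + 1);
    }

    // Same logic, but the span is passed by readonly reference,
    // so the callee reads the caller's copy instead of receiving its own.
    public static ReadOnlySpan<char> TrimByIn(in ReadOnlySpan<char> span)
    {
        int start = 0;
        while (start < span.Length && char.IsWhiteSpace(span[start]))
        {
            start++;
        }

        int end = span.Length - 1;
        while (end > start && char.IsWhiteSpace(span[end]))
        {
            end--;
        }

        return span.Slice(start, end - start + 1);
    }
}
```

At the call site the second variant can be invoked as `TrimVariants.TrimByIn(in span)`. Which of the two is faster depends on the JIT version and platform, so it should be measured rather than assumed.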

System.Memory.ReadOnlySpan.Trim is only one example of a benchmark that uses Span a lot and is slower on Linux than on Windows. This pattern, and the problem that comes with it, is quite common.

/cc @AndyAyersMS

category:cq
theme:structs
skill-level:expert
cost:medium


adamsitnik · 2019-10-28T15:51:48Z

Windows codegen (different machine, but also x64):

GrabYourPitchforks · 2019-10-28T16:29:36Z

As it so happens I already have a fix for this in a local branch. Assigning to self.

Edit: My fix involved improving Trim specifically by optimizing the method implementation. It does not solve the more general issue of structs not being enregistered optimally across method boundaries.

AndyAyersMS · 2019-10-28T19:06:51Z

The general fix is to take better advantage of the SysV calling conventions for structs. This is especially notable for benchmarks that pass small-length spans (see e.g. #12901), as the called method tends not to do much work, so the calling overhead stands out.
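To make the struct-passing point concrete: a span is essentially a reference plus a length, which comes to 16 bytes on x64. As far as I understand the ABIs, SysV x64 allows a 16-byte aggregate of integer-class fields to travel in two registers, while the Windows x64 convention passes any struct whose size is not 1, 2, 4, or 8 bytes by reference to a caller-made copy. The sketch below uses a made-up stand-in struct, `PointerAndLength` (an assumption for illustration, not the real `ReadOnlySpan<char>` definition), just to show the size involved.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in with the same shape as a span:
// one pointer-sized field followed by a 32-bit length.
struct PointerAndLength
{
    public IntPtr Pointer;
    public int Length;
}

static class StructSizeDemo
{
    // Returns the managed size of the stand-in struct in bytes.
    // With default sequential layout the int is padded up to pointer
    // alignment, so the result is two pointer-sized slots.
    public static int SizeOfStandIn() => Unsafe.SizeOf<PointerAndLength>();
}
```

On a 64-bit process `StructSizeDemo.SizeOfStandIn()` should come out as two pointer-sized slots, which is the size class SysV can split across a register pair.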

cc @CarolEidt

GrabYourPitchforks · 2019-10-28T21:32:06Z

Is it worthwhile to improve the performance of the ROS<char>.Trim() extension method more generally, even separate from efficiencies from changing calling conventions? I have a proof of concept that knocks around 30% off the runtime (at least on Windows), but the method is already so efficient that we're talking nanoseconds. I'm having trouble justifying complicating the code for such a gain.

adamsitnik · 2019-10-29T17:29:30Z

@AndyAyersMS @CarolEidt It looks like the JIT knows how to pass structs by readonly reference on Linux when we explicitly tell it to do so. Is it "just" a matter of tuning the heuristic that decides when to pass a struct by reference?

I've done some more profiling today and I've observed this problem in more places. I think we should consider fixing it for 5.0.

adamsitnik · 2019-10-29T17:34:47Z

> Is it worthwhile to improve the performance of the ROS<char>.Trim() extension method more generally, even separate from efficiencies from changing calling conventions?

@kevingosse @mjsabby have you ever seen Trim being hot in your profiles?

adamsitnik · 2019-10-29T17:34:54Z

> have a proof of concept that knocks around 30%

@GrabYourPitchforks I wonder if it would be possible to have only one Unsafe.Add in the second loop of your improved implementation and gain a few extra percent.

jkotas · 2019-10-29T17:45:28Z

> have you ever seen Trim being hot in your profiles?

It has zero hits in the Azure telemetry.

kevingosse · 2019-10-29T21:52:04Z

> @kevingosse @mjsabby have you ever seen Trim being hot in your profiles?

Never noticed it, but we don't make heavy use of Span. In our applications it's only used by corefx and aspnetcore.
I'll check tomorrow to see if it appears at all.

ezsilmar · 2019-10-30T12:18:46Z

> I'll check tomorrow to see if it appears at all.

Trim indeed has no hits in our CPU samples, as @kevingosse expected.

The only thing showing up around Span is System.Buffers.BuffersExtensions::WriteMultiSegment, and only in a couple of samples, less than 0.1%.

adamsitnik · 2019-11-04T09:02:08Z

@ezsilmar thank you very much!

CarolEidt · 2020-10-26T22:45:42Z

The struct values passed to TrimSourceCopied are now kept in registers, but more work is needed to ensure that they remain in registers in the called method.

CarolEidt · 2020-12-02T00:25:05Z

Fixed by #43870

adamsitnik · 2020-12-02T07:42:29Z

@CarolEidt awesome! can't wait to see how many benchmarks show improvement on Linux

FWIW these 3 charts should show it for this particular method in the next 24 hours: 1, 2, 3

GrabYourPitchforks self-assigned this Oct 28, 2019

GrabYourPitchforks removed their assignment Oct 28, 2019

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

CarolEidt mentioned this issue Oct 27, 2020

Keep structs in registers #43867

Closed

10 tasks

CarolEidt modified the milestones: Future, 6.0.0 Oct 27, 2020

CarolEidt closed this as completed Dec 2, 2020

ghost locked as resolved and limited conversation to collaborators Jan 1, 2021
