Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReadOnlySpan<char>.Trim for small inputs is few times slower on Linux #13669

Closed
adamsitnik opened this issue Oct 28, 2019 · 14 comments
Closed

ReadOnlySpan<char>.Trim for small inputs is few times slower on Linux #13669

adamsitnik opened this issue Oct 28, 2019 · 14 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization os-linux Linux OS (any supported distro) tenet-performance Performance related issue
Milestone

Comments

@adamsitnik
Copy link
Member

ReadOnlySpan<char>.Trim for small inputs is few times slower on Linux

Slower Lin/Win Win Median (ns) Lin Median (ns) Modality
System.Memory.ReadOnlySpan.Trim(input: "") 6.45 3.36 21.68
System.Memory.ReadOnlySpan.Trim(input: "abcdefg") 5.67 4.82 27.33
System.Memory.ReadOnlySpan.Trim(input: " abcdefg ") 3.32 7.78 25.83

Benchmark:

https://github.com/dotnet/performance/blob/8b23cabe793b4ff73a9b28c7dd092b11dc17b197/src/benchmarks/micro/corefx/System.Memory/ReadOnlySpan.cs#L77-L79

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp5.0 --filter System.Memory.ReadOnlySpan.Trim

I've created a very small repro app:

class Program
{
    static int Main(string[] args)
    {
        int result = 0;

        ReadOnlySpan<char> span = string.Empty.AsSpan();
        for (int i = 0; i < 1_000_000_000; i++)
        {
            result ^= TrimSourceCopied(span).Length;
        }

        return result;
    }
    
    private static ReadOnlySpan<char> TrimSourceCopied(ReadOnlySpan<char> span)
    {
        int start = 0;
        for (; start < span.Length; start++)
        {
            if (!char.IsWhiteSpace(span[start]))
            {
                break;
            }
        }

        int end = span.Length - 1;
        for (; end > start; end--)
        {
            if (!char.IsWhiteSpace(span[end]))
            {
                break;
            }
        }

        return span.Slice(start, end - start + 1);
    }
}

And using VTune I was able to narrow down the problem to struct copying:

The body of the Main method on Ubuntu 18.04:

image

Please mind the body of the loop:

image

When I change TrimSourceCopied to accept the Span as readonly ref parameter:

private static ReadOnlySpan<char> TrimSourceCopied(in ReadOnlySpan<char> span)

image

System.Memory.ReadOnlySpan.Trim is only one example of a benchmark that uses Span a lot and is slower on Linux compared to Windows. This pattern|problem is quite common.

/cc @AndyAyersMS

category:cq
theme:structs
skill-level:expert
cost:medium

@adamsitnik
Copy link
Member Author

Windows codegen (different machine, but also x64):

image

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Oct 28, 2019

As it so happens I already have a fix for this in a local branch. Assigning to self.

Edit: My fix involved improving Trim specifically by optimizing the method implementation. It does not solve the more general issue of structs not being enregistered optimally across method boundaries.

@GrabYourPitchforks GrabYourPitchforks self-assigned this Oct 28, 2019
@AndyAyersMS
Copy link
Member

The general fix is to better take advantage of the SysV calling conventions for structs. This is especially notable for benchmarks that pass small-length spans (see eg #12901) as the method called tends not to do much work, so the calling overhead stands out.

cc @CarolEidt

@GrabYourPitchforks GrabYourPitchforks removed their assignment Oct 28, 2019
@GrabYourPitchforks
Copy link
Member

Is it worthwhile to improve the performance of the ROS<char>.Trim() extension method more generally, even separate from efficiencies from changing calling conventions? I have a proof of concept that knocks around 30% off the runtime (at least on Windows), but the method is already so efficient that we're talking nanoseconds. I'm having trouble justifying complicating the code for such a gain.

@adamsitnik
Copy link
Member Author

@AndyAyersMS @CarolEidt It looks like JIT knows how pass the structs by readonly reference on Linux when we tell it explictilty to do so. Is it "just" a matter of tuning the heuristic which decides when to pass the struct by reference?

I've done some more profiling today and I've observed this problem in more places. I think we should consider fixing it for 5.0.

@adamsitnik
Copy link
Member Author

Is it worthwhile to improve the performance of the ROS<char>.Trim() extension method more generally, even separate from efficiencies from changing calling conventions?

@kevingosse @mjsabby have you ever seen Trim being hot in your profiles?

@adamsitnik
Copy link
Member Author

have a proof of concept that knocks around 30%

@GrabYourPitchforks I wonder if it would be possible to have only one Unsafe.Add in the second loop of your improved implementation and get some % extra

@jkotas
Copy link
Member

jkotas commented Oct 29, 2019

have you ever seen Trim being hot in your profiles?

It has zero hits in the Azure telemetry.

@kevingosse
Copy link
Contributor

@kevingosse @mjsabby have you ever seen Trim being hot in your profiles?

Never noticed it, but we don't make a heavy usage of Span. In our applications it's only used by corefx and aspnetcore.
I'll check tomorrow to see if it appears at all.

@ezsilmar
Copy link
Contributor

I'll check tomorrow to see if it appears at all.

Trim indeed has no hits in our CPU samples, as @kevingosse expected.

The only thing showing up around span is System.Buffers.BuffersExtensions::WriteMultiSegment, but it's just a couple of them, less than 0.1%.

@adamsitnik
Copy link
Member Author

@ezsilmar thank you very much!

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@CarolEidt
Copy link
Contributor

The struct values passed to TrimSourceCopied are now kept in registers, but more work is needed to ensure that they remain in registers in the called method.

@CarolEidt CarolEidt modified the milestones: Future, 6.0.0 Oct 27, 2020
@CarolEidt
Copy link
Contributor

Fixed by #43870

@adamsitnik
Copy link
Member Author

@CarolEidt awesome! can't wait to see how many benchmarks show improvement on Linux

FWIW these 3 charts should show it for this particular method in the next 24 hours: 1, 2, 3

@ghost ghost locked as resolved and limited conversation to collaborators Jan 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization os-linux Linux OS (any supported distro) tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

8 participants