Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Perf -26%] System.Tests.Perf_String (4) #37816

Closed
DrewScoggins opened this issue Jun 12, 2020 · 19 comments
Closed

[Perf -26%] System.Tests.Perf_String (4) #37816

DrewScoggins opened this issue Jun 12, 2020 · 19 comments
Labels
Milestone

Comments

@DrewScoggins
Copy link
Member

DrewScoggins commented Jun 12, 2020

Run Information

Architecture x64
OS Windows 10.0.18362
Changes diff

Regressions in System.Tests.Perf_String

Benchmark Baseline Test Test/Base Modality Baseline Outlier
ToLower 19.36 ns 22.83 ns 1.18 False
IndexerCheckBoundCheckHoist 32.35 ns 47.41 ns 1.47 False
ToUpper 19.69 ns 22.59 ns 1.15 False
Replace_Char 12.59 ns 15.80 ns 1.25 False

graph
graph
graph
graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Tests.Perf_String*';

Histogram

System.Tests.Perf_String.ToLower(s: "TEST")

[17.383 ; 18.296) | @@
[18.296 ; 19.756) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[19.756 ; 21.029) | @@@@@
[21.029 ; 22.525) | 
[22.525 ; 23.985) | @@@@@@@@@@@@@@@@@
[23.985 ; 25.194) | @
[25.194 ; 25.850) | 
[25.850 ; 27.310) | @

System.Tests.Perf_String.IndexerCheckBoundCheckHoist

[30.905 ; 35.815) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[35.815 ; 40.085) | 
[40.085 ; 45.827) | @@@@
[45.827 ; 50.738) | @@@@@@@@@@@@@@@@@@@@@@@@@

System.Tests.Perf_String.ToUpper(s: "TeSt")

[17.170 ; 18.265) | @@@
[18.265 ; 19.772) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[19.772 ; 21.112) | @@@@
[21.112 ; 22.508) | 
[22.508 ; 24.015) | @@@@@@@@@@@@@@@@@
[24.015 ; 25.273) | @
[25.273 ; 26.343) | 
[26.343 ; 27.850) | @

System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'l', newChar: '!')

[11.532 ; 13.074) | @@@@@@@@@@@@@@@@
[13.074 ; 14.615) | 
[14.615 ; 16.631) | @@@@@@@@@@@@@@@@@
[16.631 ; 18.420) | @@

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins DrewScoggins added the tenet-performance-benchmarks Issue from performance benchmark label Jun 12, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Runtime untriaged New issue has not been triaged by the area owner labels Jun 12, 2020
@DrewScoggins
Copy link
Member Author

@adamsitnik

@danmoseley
Copy link
Member

@DrewScoggins is there some way to close the gap in the graph? Are we missing all perf data from Dec-April or were just these tests broken?

Ideally we would see which build it regressed in.

@DrewScoggins
Copy link
Member Author

The tests are not broken. We are also not missing any data. In general, for comparisons that span many months we try not to put every point in the graph. I will regenerate the graphs that do not show the actual move and update the issues accordingly.

@danmoseley
Copy link
Member

Thanks @DrewScoggins. Looks like you did this. Can you clarify the Legend on the graph? The purple and orange lines have build numbers, but I think I would expect the build number to be changing every day. My mental picture is there's a single line, with a datapoint every day, is that wrong?

@danmoseley
Copy link
Member

In the case of scenarios like this that are likely affected by the ICU change, a superimposed Linux measurement would likely be interesting - for example, we might see that Windows now matches Linux. (Whether we're OK with that is a separate issue)

@DrewScoggins
Copy link
Member Author

Yeah, I should probably change the legend as it doesn't make sense here outside the context of the full report. You are correct in thinking that these are about one build per day. Also yeah, I can take a look at building out graphs that have the Windows vs Linux comparison so we can see how we look compared to that as well. I will post those in the relevant issues.

@danmoseley
Copy link
Member

One other feedback, it would be good in these issue to clarify what the diff link represents. In the graphs above, it seems clear that there are 3 distinct regressions across these scenarios -- Dec, early Jan, and early March. The single diff link provided above covers 7 months , and it's out of the dotnet/installer repo. It's so huge that I get a Unicorn trying to view it. What might be ideal is 3 diff links, each of ~1 week or less out of the dotnet/runtime repo.

cc @adamsitnik @tannergooding since I know you guys are collectively tuning these issues.

@DrewScoggins DrewScoggins added tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark and removed tenet-performance-benchmarks Issue from performance benchmark labels Jul 7, 2020
@joperezr joperezr removed the untriaged New issue has not been triaged by the area owner label Jul 7, 2020
@joperezr joperezr added this to the 5.0.0 milestone Jul 7, 2020
@danmoseley
Copy link
Member

@tarekgh @safern are you tracking this as part of ICU perf remediation work? It would be good to know how we stand with respect to NLS-ICU perf gap. I know @tarekgh you mentioned an effort to cache casing table.

@tarekgh
Copy link
Member

tarekgh commented Aug 13, 2020

@danmosemsft I am focusing on the ordinal ignore casing (the high priority issue). This issue is not ordinal. I cannot cache all possible casing for all possible cultures. The best we can do here is to look if there is a way to optimize when calling ICU.

@adamsitnik
Copy link
Member

FWIW The Replace_Char benchmark regressed for most of the configs:

System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'l', newChar: '!')

Conclusion Base Diff Base/Diff Modality Operating System Bit Processor Name Base Runtime Diff Runtime
Slower 26.14 28.24 0.93 Windows 10.0.18363.1016 Arm Microsoft SQ1 3.0 GHz .NET Core 3.1.6 5.0.100-rc.1.20413.9
Slower 12.01 14.30 0.84 Windows 10.0.18363.959 X64 Intel Xeon CPU E5-1650 v4 3.60GHz .NET Core 3.1.6 5.0.100-rc.1.20404.3
Slower 12.34 15.04 0.82 Windows 10.0.18363.959 X64 Intel Xeon CPU E5-1650 v4 3.60GHz .NET Core 3.1.6 5.0.100-rc.1.20418.3
Slower 14.66 17.56 0.83 Windows 10.0.19041.450 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) .NET Core 3.1.6 5.0.100-rc.1.20413.9
Same 13.48 14.02 0.96 Windows 10.0.19041.450 X64 Intel Core i7-6700 CPU 3.40GHz (Skylake) .NET Core 3.1.6 5.0.100-rc.1.20419.9
Slower 11.52 13.17 0.87 Windows 10.0.19042 X64 Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) .NET Core 3.1.6 5.0.100-rc.1.20418.3
Same 14.34 15.29 0.94 bimodal Windows 10.0.19041.450 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) .NET Core 3.1.6 5.0.100-rc.1.20419.14
Slower 16.14 17.13 0.94 Windows 10.0.18363.959 X86 Intel Xeon CPU E5-1650 v4 3.60GHz .NET Core 3.1.6 5.0.100-rc.1.20420.14
Slower 18.58 21.28 0.87 Windows 10.0.19041.450 X86 Intel Core i7-5557U CPU 3.10GHz (Broadwell) .NET Core 3.1.6 5.0.100-rc.1.20419.5
Slower 35.09 53.71 0.65 several? ubuntu 18.04 Arm64 Unknown processor .NET Core 3.1.6 6.0.100-alpha.1.20421.6
Slower 14.77 17.57 0.84 ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz .NET Core 3.1.6 5.0.100-rc.1.20403.23
Slower 16.83 21.49 0.78 macOS Mojave 10.14.5 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) .NET Core 3.1.6 5.0.100-rc.1.20404.2
Slower 11.55 21.53 0.54 alpine 3.11 X64 Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) .NET Core 3.1.6 6.0.100-alpha.1.20421.6
Slower 16.97 18.69 0.91 ubuntu 18.04 X64 Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) .NET Core 3.1.6 5.0.100-rc.1.20418.3

@jeffhandley
Copy link
Member

@tarekgh Do you want to leave this open for 5.0.0 or should we move to 6.0.0/Future?

@tarekgh
Copy link
Member

tarekgh commented Sep 2, 2020

@jeffhandley please move them to 6.0 for now and we can bring it back if we'll address it soon.

@Gnbrkm41
Copy link
Contributor

Gnbrkm41 commented Sep 4, 2020

I suspect that #279 impacted the performance for this particular case - the minimum size of the string for entering the vectorised path is 8 chars (SSE2/NEON) or 16 chars (if AVX2 is available), and the additional checks / etc may have introduced additional operations that regresses the path for short strings.

Do we believe this change is worth it? While it definitely does benefit long strings, it might be useful to collect informations on the length of the string that this method usually gets called with, and see if the majority of the strings are short enough to not enter the vectorised path.

Another thing to consider might be whether we think the difference in actual nanoseconds are significant enough, since while those added initial set-up overheads are likely to overwhelm the difference in percentages it might be not that much of the time lost. I'm seeing regressions of between 0.6ns best case and 18ns worst case (assuming those are in nanoseconds).

I think this is something that's going to show up fairly frequently with vectorisations of variable-length data in general. For example, we had the same issue with BitArray: #37813

@tannergooding
Copy link
Member

I wonder if there is something we could do for vectorized code paths that would allow the initial size check to be optimized away.

That is, much of the code is in a pattern such as:

if (Avx2.IsSupported && (data.length >= 32))
{
    VectorizedPath();
}
else
{
    SoftwareFallback();
}

If data.length is a known constant, then inlining the method would allow it to be a direct call to SoftwareFallback or to VectorizedPath which would lessen the impact that vectorization has on short inputs.
However, a non-constant length may make it undesirable to inline due to increased bloat....

C++ allows you to specialize for these cases using templates, but we don't have an equivalent in .NET today and so you can't really say "Do this if the input is 4 bytes and this if the input is greater than 128 bytes"

@tarekgh
Copy link
Member

tarekgh commented Sep 4, 2020

@Gnbrkm41 it may better we confirm first #279 is the culprit and then we can try to think how we fix it. Is this something you can help confirming first?

@Gnbrkm41
Copy link
Contributor

Gnbrkm41 commented Sep 4, 2020

@tarekgh, Looks like I didn't forget to include a benchmark in the original PR fortunately so yes: dotnet/coreclr#27798

Specifically, dotnet/coreclr#27798 (comment) is the relevant part. Quoting the comment from below - you can see the exact same case regressing on my machine which has i7-8700. Not exactly sure which version of Windows I used at the time but it was one of the 10 Insider builds.

C:\Users\gotos\source\repos\performance\src\tools\ResultsComparer>dotnet run --base C:\Users\gotos\desktop\replace-before --diff C:\users\gotos\desktop\replace-after --threshold 5%
summary:
better: 4, geomean: 2.626
worse: 1, geomean: 1.204
total diff: 5
Slower diff/base Base Median (ns) Diff Median (ns) Modality
"Hello".Replace('l', '!') 1.20 10.49 12.63
Faster base/diff Base Median (ns) Diff Median (ns) Modality
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut id tempor metus.".Replace('z', 'y') 5.19 46.50 8.97
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut id tempor metus.".Replace('e', 'E') 2.79 66.68 23.93
"This is a very nice sentence".Replace('z', 'y') 2.21 14.65 6.62
"This is a very nice sentence".Replace('i', 'I') 1.49 28.97 19.47 several?

(Also ran "Hello".Replace('a', 'b'), but there was no difference at 5% threshold)

@tarekgh
Copy link
Member

tarekgh commented Sep 4, 2020

Thanks @Gnbrkm41 for the details.

Do we have some idea about the min string length that we start see the perf improvement with?

@tannergooding looks data.Length looks is not constant I guess which means inlining the method wouldn't help according to your comment. is there any other idea can be tried?

CC @GrabYourPitchforks @jkotas

@jkotas
Copy link
Member

jkotas commented Sep 4, 2020

This method allocates new strings. It makes it unlikely to be used in the most performance critical scenarios. It is nice that it is vectorized and significantly faster on long strings now, but I do not think it is worth it to spend time on this or making the code more complex to match the earlier performance for small strings.

@tarekgh
Copy link
Member

tarekgh commented Sep 4, 2020

sounds good. I am closing this one and we can revisit if the issue raised in any customer scenario.

@tarekgh tarekgh closed this as completed Sep 4, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

10 participants