.NET 5.0 Microbenchmarks Performance Study Report #41871

adamsitnik · 2020-09-04T14:25:33Z

Goals

The main goal of my study was to ensure that we ship .NET 5.0 without any performance regressions and validate whether in the near future we can fully rely on the regressions auto-filing bot written by @DrewScoggins.
My other goal was to get .NET Library Team members involved and keep on growing the performance culture.

#tl;dr The bot is doing a great job in detecting regressions. Most serious regressions have been already fixed, however a few investigations are still in progress.

Methodology (and how it evolved)

In 2018 I had the pleasure of reviewing @AndreyAkinshin's "Pro .NET Benchmarking" book. The "Statistics for Performance Engineers" and "Performance Analysis and Performance Testing" chapters inspired me to implement a small tool called Results Comparer. The tool uses the Mann-Whitney U statistical test to detect performance regressions in results exported by BenchmarkDotNet. It's being used (or at least it should be) as part of our benchmarking workflow to prevent introducing regressions to .NET.
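As a rough illustration of the kind of comparison described above, here is a minimal sketch of a Mann-Whitney U test applied to two sets of measurements. It is hand-written with the normal approximation and a tie correction, and it is not the implementation used by Results Comparer. The function names, the sample values, and the 0.05 cut-off are assumptions made for the example.

```python
import math
from statistics import median


def mann_whitney_u(base, diff):
    """Return (U statistic of `base`, two-sided p-value) via the normal approximation."""
    n1, n2 = len(base), len(diff)
    combined = sorted([(v, 0) for v in base] + [(v, 1) for v in diff])
    ranks = [0.0] * len(combined)
    tie_term = 0
    i = 0
    while i < len(combined):
        # Find the run of equal values and give each the average rank.
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_base = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_base = rank_sum_base - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_base, 1.0
    z = (u_base - mean_u) / math.sqrt(var_u)
    return u_base, math.erfc(abs(z) / math.sqrt(2))


def classify(base, diff, alpha=0.05):
    """Label a benchmark as Same, Slower, or Faster (alpha is an assumed cut-off)."""
    _, p_value = mann_whitney_u(base, diff)
    if p_value >= alpha:
        return "Same"
    return "Slower" if median(diff) > median(base) else "Faster"


# Made-up measurements, for illustration only.
base_runs = [570.1, 571.3, 569.8, 572.0, 570.9, 571.5, 570.4, 569.9]
diff_runs = [3068.2, 3071.0, 3069.5, 3070.4, 3067.9, 3072.1, 3069.0, 3070.8]
print(classify(base_runs, diff_runs))
```

The test works on ranks rather than raw values, so it does not assume normally distributed measurements, which is convenient for benchmark timings.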

In 2019 I was asked by @danmosemsft to verify .NET Core 3.0 performance. Initially, I ran all the microbenchmarks from the dotnet/performance repository on a single machine with dual boot for Windows 10 and Ubuntu 18.04 x64 and used the Results Comparer to find regressions. It very quickly turned out that such a sample was way too small to make sure that we didn't have any regressions. Some benchmarks were simply unstable, and some architectures like ARM and ARM64 were not covered at all. Other Linux distros and CPU families were also not covered.

Then I ran the benchmarks on all the PCs, laptops, and VMs that I could access. But I was still missing AMD and ARM results, so I asked @tannergooding and @BruceForstall for help. @tannergooding ran the benchmarks on all his AMD machines. @BruceForstall gave me access to a document that explains how to use the ARM machines owned by the JIT Team. This turned out to be an invaluable help, as I've used these machines many, many times, including this year during the 5.0 investigation.

After having enough samples to cover our matrix of supported OSes and architectures, I’ve built a simple console app on top of ResultsComparer (source code available here). The tool uses the very same statistical test to detect regressions, aggregates the results from all different configurations, and sorts them from the biggest regression to the biggest improvement.
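A minimal sketch of the aggregation step, under stated assumptions: the Ratio column in the tables of this report equals Base / Diff (so values below 1 mean the newer build is slower), and benchmarks are ordered by the geometric mean of their ratios across configurations. The ordering key and the function name are assumptions, not necessarily what the tool does. The sample rows are taken from the tables in this report.

```python
import math
from collections import defaultdict

# (benchmark, configuration, base, diff), copied from the tables in this report.
rows = [
    ("Perf_Enumerable.FirstWithPredicate_LastElementMatches", "Windows 10.0.19041.388 X64", 570.88, 3069.76),
    ("Perf_Enumerable.FirstWithPredicate_LastElementMatches", "ubuntu 18.04 X64", 626.94, 3928.55),
    ("Perf_CancellationToken.Cancel", "Windows 10.0.19041.388 X64", 116.42, 120.28),
    ("Perf_CancellationToken.Cancel", "ubuntu 20.04 Arm64", 675.51, 318.18),
]


def rank_benchmarks(rows):
    """Sort benchmarks from the biggest regression to the biggest improvement."""
    ratios = defaultdict(list)
    for name, _config, base, diff in rows:
        ratios[name].append(base / diff)
    # The geometric mean is the usual way to average ratios.
    summary = {
        name: math.exp(sum(math.log(r) for r in values) / len(values))
        for name, values in ratios.items()
    }
    return sorted(summary.items(), key=lambda item: item[1])


for name, ratio in rank_benchmarks(rows):
    print(f"{ratio:.2f}  {name}")
```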

Such an approach allows for very quick identification of regressions of all kinds:

affecting every configuration

System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches(input: IOrderedEnumerable)

Result	Base	Diff	Ratio	Operating System	Bit
Slower	570.88	3069.76	0.19	Windows 10.0.19041.388	X64
Slower	610.20	3674.19	0.17	Windows 10.0.18363.959	X64
Slower	598.37	3519.26	0.17	Windows 10.0.18363.959	X64
Slower	700.86	4238.85	0.17	Windows 10.0.19041.450	X64
Slower	583.19	3538.60	0.16	Windows 10.0.19041.450	X64
Slower	546.58	3015.23	0.18	Windows 10.0.19042	X64
Slower	665.53	3776.10	0.18	Windows 10.0.19041.450	X64
Slower	515.15	3162.05	0.16	Windows 10.0.19041.450	X64
Slower	626.94	3928.55	0.16	ubuntu 18.04	X64
Slower	630.90	4196.01	0.15	manjaro	X64
Slower	813.80	4605.57	0.18	pop 20.04	X64
Slower	608.59	3587.44	0.17	alpine 3.11	X64
Slower	615.67	3390.01	0.18	ubuntu 18.04	X64
Slower	2148.33	10335.71	0.21	ubuntu 16.04	Arm64
Slower	2183.77	10620.53	0.21	ubuntu 16.04	Arm64
Slower	2163.67	10815.16	0.20	ubuntu 16.04	Arm64
Slower	1176.33	11641.04	0.10	ubuntu 18.04	Arm64
Slower	1550.48	5183.74	0.30	ubuntu 20.04	Arm64
Slower	568.67	3637.59	0.16	Windows 10.0.18363.959	X86
Slower	664.86	4576.24	0.15	Windows 10.0.19041.450	X86
Slower	972.74	8054.46	0.12	Windows 10.0.18363.1016	Arm
Slower	790.15	5171.92	0.15	macOS Catalina 10.15.6	X64
Slower	668.62	4153.54	0.16	macOS Catalina 10.15.6	X64
Slower	743.69	4727.58	0.16	macOS Mojave 10.14.5	X64

affecting specific OS families (Windows, Unix)

System.Globalization.Tests.StringSearch.IsPrefix_DifferentFirstChar(Options: (en-US, IgnoreSymbols, False))

Result	Base	Diff	Ratio	Operating System	Bit
Slower	53.24	26589.31	0.00	Windows 10.0.19041.388	X64
Slower	65.47	28371.93	0.00	Windows 10.0.18363.959	X64
Slower	63.89	27952.39	0.00	Windows 10.0.18363.959	X64
Slower	75.24	35910.74	0.00	Windows 10.0.19041.450	X64
Slower	67.29	55198.94	0.00	Windows 10.0.19041.450	X64
Slower	58.36	31008.73	0.00	Windows 10.0.19042	X64
Slower	70.38	34632.87	0.00	Windows 10.0.19041.450	X64
Slower	58.92	27533.16	0.00	Windows 10.0.19041.450	X64
Same	24197.26	24316.40	1.00	ubuntu 18.04	X64
Same	23317.93	23585.42	0.99	manjaro	X64
Same	30855.66	30176.99	1.02	pop 20.04	X64
Same	29081.88	28590.29	1.02	alpine 3.11	X64
Same	23929.07	23728.33	1.01	ubuntu 18.04	X64
Same	51918.86	51256.87	1.01	ubuntu 16.04	Arm64
Same	51674.77	51693.86	1.00	ubuntu 16.04	Arm64
Same	51690.93	52015.88	0.99	ubuntu 16.04	Arm64
Same	61071.92	43711.17	1.40	ubuntu 18.04	Arm64
Faster	43870.66	26020.13	1.69	ubuntu 20.04	Arm64
Slower	78.42	36208.27	0.00	Windows 10.0.18363.959	X86
Slower	88.01	42312.37	0.00	Windows 10.0.19041.450	X86
Slower	104.29	57622.86	0.00	Windows 10.0.18363.1016	Arm
Same	38089.02	40079.68	0.95	macOS Catalina 10.15.6	X64
Same	32208.09	32537.00	0.99	macOS Catalina 10.15.6	X64
Same	32575.17	32782.69	0.99	macOS Mojave 10.14.5	X64

affecting specific Linux distros

System.Threading.Tests.Perf_CancellationToken.Cancel

Result	Base	Diff	Ratio	Operating System	Bit
Same	116.42	120.28	0.97	Windows 10.0.19041.388	X64
Same	148.25	146.53	1.01	Windows 10.0.18363.959	X64
Same	144.37	144.09	1.00	Windows 10.0.18363.959	X64
Same	154.82	151.57	1.02	Windows 10.0.19041.450	X64
Same	134.57	133.40	1.01	Windows 10.0.19041.450	X64
Same	122.52	119.39	1.03	Windows 10.0.19042	X64
Same	154.48	150.92	1.02	Windows 10.0.19041.450	X64
Same	128.87	122.90	1.05	Windows 10.0.19041.450	X64
Same	169.50	168.46	1.01	ubuntu 18.04	X64
Faster	171.67	155.11	1.11	manjaro	X64
Same	179.54	175.17	1.02	pop 20.04	X64
Slower	146.39	203.94	0.72	alpine 3.11	X64
Same	179.39	180.75	0.99	ubuntu 18.04	X64
Same	1068.08	1029.35	1.04	ubuntu 16.04	Arm64
Same	1066.73	1056.79	1.01	ubuntu 16.04	Arm64
Same	1111.72	1037.54	1.07	ubuntu 16.04	Arm64
Same	751.74	622.83	1.21	ubuntu 18.04	Arm64
Faster	675.51	318.18	2.12	ubuntu 20.04	Arm64
Same	258.80	257.15	1.01	Windows 10.0.18363.959	X86
Same	194.61	192.96	1.01	Windows 10.0.19041.450	X86
Same	486.93	508.05	0.96	Windows 10.0.18363.1016	Arm
Same	200.25	203.78	0.98	macOS Catalina 10.15.6	X64
Same	168.62	163.47	1.03	macOS Catalina 10.15.6	X64
Same	174.95	177.88	0.98	macOS Mojave 10.14.5	X64

affecting specific CPU families

System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64EncodeInPlace(NumberOfBytes: 200000000)

Result	Base	Diff	Ratio	Operating System	Bit	Processor Name
Same	125616750.00	125476550.00	1.00	Windows 10.0.19041.388	X64	AMD Ryzen 9 3900X
Same	161388400.00	156493500.00	1.03	Windows 10.0.18363.959	X64	Intel Xeon CPU E5-1650 v4 3.60GHz
Same	154933500.00	154730800.00	1.00	Windows 10.0.18363.959	X64	Intel Xeon CPU E5-1650 v4 3.60GHz
Same	180481800.00	180129900.00	1.00	Windows 10.0.19041.450	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)
Slower	161742300.00	211160300.00	0.77	Windows 10.0.19041.450	X64	Intel Core i7-6700 CPU 3.40GHz (Skylake)
Same	152928600.00	150232700.00	1.02	Windows 10.0.19042	X64	Intel Core i7-7700 CPU 3.60GHz (Kaby Lake)
Same	206708750.00	206860050.00	1.00	Windows 10.0.19041.450	X64	Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)
Slower	140924300.00	185228400.00	0.76	Windows 10.0.19041.450	X64	Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Same	154948321.00	154788579.50	1.00	ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz
Same	175860282.50	163007313.50	1.08	manjaro	X64	Intel Core i7-4771 CPU 3.50GHz (Haswell)
Slower	199713880.00	255270486.50	0.78	pop 20.04	X64	Intel Core i7-6600U CPU 2.60GHz (Skylake)
Same	151256100.00	168661900.00	0.90	alpine 3.11	X64	Intel Core i7-7700 CPU 3.60GHz (Kaby Lake)
Same	171229200.00	165843050.00	1.03	ubuntu 18.04	X64	Intel Core i7-7700 CPU 3.60GHz (Kaby Lake)
Same	503785101.00	505992400.50	1.00	ubuntu 16.04	Arm64	Unknown processor
Same	503901205.00	506190175.00	1.00	ubuntu 16.04	Arm64	Unknown processor
Same	504131772.50	506220395.00	1.00	ubuntu 16.04	Arm64	Unknown processor
Same	473629200.00	541631800.00	0.87	ubuntu 18.04	Arm64	Unknown processor
Same	331381500.00	333779500.00	0.99	ubuntu 20.04	Arm64	Unknown processor
Same	246876150.00	247010200.00	1.00	Windows 10.0.18363.959	X86	Intel Xeon CPU E5-1650 v4 3.60GHz
Same	290036150.00	289409500.00	1.00	Windows 10.0.19041.450	X86	Intel Core i7-5557U CPU 3.10GHz (Broadwell)
Same	418007450.00	415404450.00	1.01	Windows 10.0.18363.1016	Arm	Microsoft SQ1 3.0 GHz
Same	204196936.50	204410652.50	1.00	macOS Catalina 10.15.6	X64	Intel Core i5-4278U CPU 2.60GHz (Haswell)
Same	176763730.00	175647563.50	1.01	macOS Catalina 10.15.6	X64	Intel Core i7-4870HQ CPU 2.50GHz (Haswell)
Same	180812724.00	184849205.00	0.98	macOS Mojave 10.14.5	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)
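The four cases above differ only in which configurations report Slower. A small sketch of how that scope could be read off programmatically: group the rows by some dimension and compute the share of Slower results per group. The helper names and the row layout are hypothetical. The sample rows are a subset of the StringSearch table above.

```python
from collections import defaultdict


def slower_share_by(rows, key):
    """Return {group: fraction of rows marked Slower} for the given grouping key."""
    counts = defaultdict(lambda: [0, 0])  # group -> [slower, total]
    for row in rows:
        group = key(row)
        counts[group][1] += 1
        if row["result"] == "Slower":
            counts[group][0] += 1
    return {group: slower / total for group, (slower, total) in counts.items()}


def os_family(row):
    """Map an operating system name to a coarse family (assumed mapping)."""
    name = row["os"].lower()
    if name.startswith("windows"):
        return "Windows"
    if name.startswith("macos"):
        return "macOS"
    return "Linux"


string_search_rows = [
    {"result": "Slower", "os": "Windows 10.0.19041.388", "arch": "X64"},
    {"result": "Slower", "os": "Windows 10.0.18363.959", "arch": "X86"},
    {"result": "Slower", "os": "Windows 10.0.18363.1016", "arch": "Arm"},
    {"result": "Same", "os": "ubuntu 18.04", "arch": "X64"},
    {"result": "Same", "os": "alpine 3.11", "arch": "X64"},
    {"result": "Same", "os": "macOS Catalina 10.15.6", "arch": "X64"},
]

print(slower_share_by(string_search_rows, os_family))
```

A share of 1.0 for one group and 0.0 for the others suggests a regression specific to that group; grouping by architecture or processor name instead would surface the other cases.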

Using the tool had one major flaw: it was not automated and hence we were finding out about the regressions only when we searched for them.

This has been recognized and a new project has been started. In 2020 @DrewScoggins started implementing a GitHub bot that uses the data gathered from performance lab (a set of machines owned by the .NET Performance Team) microbenchmark runs to detect and auto-file regressions. So far the bot has been reporting new issues in a dedicated repository, and once a week the workgroup led by @DrewScoggins, consisting of @AndyAyersMS, @kunalspathak, @tannergooding and myself, has been going through the list and triaging the issues. Issues that were deemed actual regressions were labeled as Needs Transfer and were later moved by @DrewScoggins to the runtime repo.

A few weeks ago we were getting close to "code freeze" for .NET 5 and I asked myself a question: are we sure that the bot has reported all possible regressions for all the supported OS versions?

The bot is using different statistical methods to detect regressions and so far it has been enabled only for Windows 10 x64, Ubuntu 18.04 x64, and Windows 10 x86. So I've decided to spend some time and use the old tool that I wrote to verify it. To increase the sample size and get other .NET Libraries Team members involved, I've simply asked the Team to run the benchmarks and share the results with me.

Running the performance repo microbenchmarks against the latest .NET Core SDK is super easy thanks to a python script implemented by @jorive. The script downloads the right SDK and starts benchmarking with cleared environment variables.

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --filter '*'

Data

The data I've received from the .NET Libraries Team members allowed me to cover a big part of the entire matrix of supported configurations:

Operating System	Arch	Processor Name	Provided by
Windows 10.0.19041.388	X64	AMD Ryzen 9 3900X	@tannergooding
Windows 10.0.18363.959	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	@adamsitnik
Windows 10.0.19041.450	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	@adamsitnik
Windows 10.0.19041.450	X64	Intel Core i7-6700 CPU 3.40GHz (Skylake)	@GrabYourPitchforks
Windows 10.0.19042	X64	Intel Core i7-7700 CPU 3.60GHz (Kaby Lake)	@danmosemsft
Windows 10.0.19041.450	X64	Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)	@jeffhandley
Windows 10.0.19041.450	X64	Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)	@jeffhandley
ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	@adamsitnik
manjaro	X64	Intel Core i7-4771 CPU 3.50GHz (Haswell)	@ManickaP
pop 20.04	X64	Intel Core i7-6600U CPU 2.60GHz (Skylake)	@carlossanlop
alpine 3.11 (WSL2)	X64	Intel Core i7-7700 CPU 3.60GHz (Kaby Lake)	@danmosemsft
ubuntu 18.04 (WSL2)	X64	Intel Core i7-7700 CPU 3.60GHz (Kaby Lake)	@danmosemsft
ubuntu 16.04	Arm64	Qualcomm Centriq	@adamsitnik
ubuntu 18.04 (WSL2)	Arm64	Microsoft SQ1 3.0 GHz (Surface Pro X)	@carlossanlop
ubuntu 20.04 (WSL2)	Arm64	Microsoft SQ1 3.0 GHz (Surface Pro X)	@pgovind
Windows 10.0.18363.959	X86	Intel Xeon CPU E5-1650 v4 3.60GHz	@adamsitnik
Windows 10.0.19041.450	X86	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	@adamsitnik
Windows 10.0.18363.1016	Arm	Microsoft SQ1 3.0 GHz (Surface Pro X)	@adamsitnik
macOS Catalina 10.15.6	X64	Intel Core i5-4278U CPU 2.60GHz (Haswell)	@jeffhandley
macOS Catalina 10.15.6	X64	Intel Core i7-4870HQ CPU 2.50GHz (Haswell)	@carlossanlop
macOS Mojave 10.14.5	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	@adamsitnik

Everyone interested can download the data from here. The full report generated by the tool is available here.

Moreover, the full historical data turned out to be extremely useful. I've used it every time I was not sure whether something was a regression or just an unstable or multimodal benchmark.

Regressions

Already fixed

System.Collections.Contains*, System.Memory.SequenceReader.TryReadTo, System.Text.Json.Tests.Perf_Segment.ReadSingleSegmentSequenceByN
- was a 32 bit issue only (both x86 and ARM)
- detected by the bot, reported in [Perf -196%] System.Collections.ContainsTrue<Int32> (6) DrewScoggins/performance-2#910 (comment)
- confirmed: [Perf -196%] System.Collections.ContainsTrue<Int32> (6) DrewScoggins/performance-2#910 (comment)
- transferred to runtime repo: [Perf -196%] System.Collections.ContainsTrue<Int32> (6) #41167
- fixed in Fix perf regression in IntPtr operators on 32-bit platforms #41198
- backported to 5.0 in [release/5.0] Fix perf regression in IntPtr operators on 32-bit platforms #41254
System.Collections.CtorGivenSize<Int32>.Array(Size: 512)
- specific to Alpine only
- created an issue Performance regression: 6x slower array allocation on Alpine #41398
- confirmed by @jkotas to be not WSL specific, but a much bigger Alpine perf problem
- it has shown that an increased number of Gen 0 collections is a valuable metric to detect regressions
- fixed in Fix reading cpu cache size for Alpine(musl) #41532
- backported to 5.0 [release/5.0] Fix reading cpu cache size for Alpine(musl) #41547
- created Create a unit test for PAL_GetLogicalProcessorCacheSizeFromOS #41708 to add unit tests that ensure that this problem is not coming back
System.Numerics.Tests.Perf_Quaternion.Conjugate and System.Numerics.Tests.Perf_Quaternion.Negat*
- not reported by the bot because it's a brand new benchmark and we did not have historical data at the time of my investigation
- issue created: Performance regressions in Quaternion.Conjugate and Quaternion.Negate #41738
- fixed in Marking Matrix3x2, Matrix4x4, Plane, and Quaternion as Intrinsic #41829
- backported to 5.0-rc2 in [release/5.0-rc2] Marking Matrix3x2, Matrix4x4, Plane, and Quaternion as Intrinsic #41885
Directory.EnumerateFiles
- not reported by the bot, most probably because it was a very fresh regression
- issue created: [Unix] Potential performance regression in Directory.EnumerateFiles #41739
- fixed in [Unix] Potential performance regression in Directory.EnumerateFiles #41739
- backported to 5.0-rc2 in [release/5.0-rc2] Revert #40641 #41820
ByteMark.BenchIDEAEncryption
- not reported by the bot, most probably because it was a very fresh regression
- issue created: Performance regression in ByteMark.BenchIDEAEncryption #41677
- fixed in Alternative fix for folding of *(typ*)&lclVar for small types #40607 #40871
- backported to 5.0-rc2 in [release/5.0-rc2] Fix for folding of *(typ*)&lclVar for small types #41838
System.Text.Perf_Utf8Encoding
- not detected by the bot because it was not enabled for ARM yet
- issue created: [ARM64] Performance regression: Utf8Encoding #41699
- fixed in Temporarily disable arm64 intrinsics in UTF-16 validation code paths #42052
- backported to 5.0-rc2 in [release/5.0-rc2] Disable arm64 intrinsics in UTF-16 validation code paths #42064

Investigation in progress

System.Memory.Slice
- not detected by the bot because it was not enabled for ARM yet
- seems to be ARM64-specific, created an issue [ARM64] Possible perf regression: slicing #41704
- investigation is in progress
PerfLabTests.CastingPerf2.CastingPerf.IntObj
- not detected by the bot because it was not enabled for ARM yet
- seems to be ARM64-specific, created an issue [ARM64] Performance regression: PerfLabTests.CastingPerf2.CastingPerf.IntObj #41706
- investigation is in progress

By design or Acceptable

ICU-related regressions
- System.Globalization.Tests.StringSearch: detected by the bot, reported in [Perf -1,796%] System.Globalization.Tests.StringSearch (33) #37819
- System.Memory.ReadOnlySpan.IndexOfString: detected by the bot, reported in [Perf -14%] System.Memory.ReadOnlySpan.IndexOfString (2) #39724
- System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja): detected by the bot, reported in [Perf - 10-20x regression] System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse in ja #37807
- System.Globalization.Tests.StringEquality: detected by the bot, reported in [Perf -97%] System.Globalization.Tests.StringEquality (8) #39038
- I've created one uber issue to track all of them in one place: List of performance regressions caused by switching to ICU #40942
- OrdinalIgnoreCase has been optimized in Port Ordinal Ignore Case Optimization changes #40962
- TODO: doc update still required
System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches(input: IOrderedEnumerable)
- detected by the bot, reported in [Perf -492%] System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches #39032
- closed, by design: removed the O(N log N) cost of the OrderBy [Perf -492%] System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches #39032 (comment)
System.Collections.Tests.Perf_BitArray.*(Size: 4)
- detected by the bot, reported in [Perf -118%] System.Collections.Tests.Perf_BitArray for small inputs (3) #37813
- closed, by design: introduction of vectorization has increased the cost of operations for small inputs: [Perf -118%] System.Collections.Tests.Perf_BitArray for small inputs (3) #37813 (comment)
System.Threading.Tests.Perf_Thread.GetCurrentProcessorId
- detected by the bot, reported in [Perf -35%] System.Threading.Tests.Perf_Thread.GetCurrentProcessorId #37804
- closed, by design: precision was improved at a cost of acceptable minor perf regression: [Perf -35%] System.Threading.Tests.Perf_Thread.GetCurrentProcessorId #37804 (comment)
PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceNo, PerfLabTests.CastingPerf.CheckObjIsInterfaceNo
- detected by the bot, reported in [Perf -29%] PerfLabTests.CastingPerf (2) #37803
- closed, by design: known tradeoff: [Perf -29%] PerfLabTests.CastingPerf (2) #37803 (comment)
System.Net.NetworkInformation.Tests.PhysicalAddressTests.PAShort
- detected by the bot, reported in [Perf -19%] System.Net.NetworkInformation.Tests.PhysicalAddressTests.PAShort #39720
- closed, acceptable for improved code reuse [Perf -19%] System.Net.NetworkInformation.Tests.PhysicalAddressTests.PAShort #39720 (comment)
- benchmark for 1 byte removed, added 6 bytes in remove PAShort benchmark that uses 1 byte long input and add a "Medim" that consists of 6 bytes performance#1490
System.Numerics.Tests.Perf_Vector*.GetHashCodeBenchmark
- detected by the bot, reported in [Perf -98%] System.Numerics.Tests.Perf_Vector2.GetHashCodeBenchmark #39035 and [Perf -53%] System.Numerics.Tests.Perf_Vector4.GetHashCodeBenchmark #39029
- closed, "it should not be used" [Perf -53%] System.Numerics.Tests.Perf_Vector4.GetHashCodeBenchmark #39029 (comment)
System.Net.Primitives.Tests.CredentialCacheTests.ForEach(uriCount: 0, hostPortCount: 0)
- detected by the bot, reported in [Perf -17%] System.Net.Primitives.Tests.CredentialCacheTests (2) DrewScoggins/performance-2#510
- confirmed: [Perf -17%] System.Net.Primitives.Tests.CredentialCacheTests (2) DrewScoggins/performance-2#510 (comment)
- awaiting the transfer to runtime repo. Most probably a by-design regression.

Moved to 6.0

System.Tests.Perf_Char.GetUnicodeCategory(c: '?')
- detected and reported by the bot in [Perf -11%] System.Tests.Perf_Char.GetUnicodeCategory DrewScoggins/performance-2#574, I've created Minor regression in System.Tests.Perf_Char.GetUnicodeCategory for non-ascii characters #41107
- minor regression for non-ascii characters, moved to 6.0
PerfLabTests.StackWalk.Walk
- detected by the bot and reported in [Perf -55%] PerfLabTests.StackWalk.Walk #39115
- confirmed in [Perf -55%] PerfLabTests.StackWalk.Walk #39115 (comment)
- specific to everything that is not Windows x64, rather not critical -> moved to 6.0: [Perf -55%] PerfLabTests.StackWalk.Walk #39115 (comment)
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'l', newChar: '!')
- reported in [Perf -26%] System.Tests.Perf_String (4) #37816
- confirmed in [Perf -26%] System.Tests.Perf_String (4) #37816 (comment)
- moved to 6.0
System.Text.Perf_Utf8String.IsAscii(Input: EnglishAllAscii)
- not reported by the bot because it was a brand new benchmark and we did not have historical data at the time of my investigation
- issue created: Performance regression: Utf8String.IsAscii (x86 only) #41388
- moved to 6.0 as Utf8String is still only experimental
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8
- not reported by the bot because it was a brand new benchmark and we did not have historical data at the time of my investigation
- issue created: Performance regression: System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8 #41104
- moved to 6.0

Unstable or multimodal benchmarks

There were of course more of them; here are the ones that I've noted to use as Contract Tests in the near future (to reduce the noise produced by the bot):

System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer
- detected by the bot, reported in [Perf -138%] System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer #39031
- asked for historical data to verify if it's multimodal or not [Perf -138%] System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer #39031 (comment)
- thanks to historical data provided it was possible to tell that it's unstable for x64 and bimodal for x86: [Perf -138%] System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer #39031 (comment)
System.Memory.ReadOnlySequence.Slice_Repeat_StartPosition_And_EndPosition(Segment: Multiple)
- quite unstable benchmark, I've verified that 5.0 codegen is better
PerfLabTests.BlockCopyPerf.CallBlockCopy
- detected by the bot, reported in [Perf -47%] PerfLabTests.BlockCopyPerf.CallBlockCopy #37808
- copying 0 elements does not add value: [Perf -47%] PerfLabTests.BlockCopyPerf.CallBlockCopy #37808 (comment)
- test case for copying 0 elements removed in measuring the performance of copying of 0 elements does not add value performance#1465
- closed as unstable based on full historical data: [Perf -47%] PerfLabTests.BlockCopyPerf.CallBlockCopy #37808 (comment)
System.Tests.Perf_String.Trim_CharArr(s: "Test", c: [' ', ' '])
- multimodal benchmark, needs a rewrite as stated a long time ago: Performance regression: string.Trim #13135
System.Threading.Tests.Perf_Interlocked.CompareExchange_long
- the benchmark typically reports 10ns, but sometimes 100x that. Only for x86. I need logs to verify whether it's a BDN bug or not.
- issue created CompareExchange_long benchmark sometimes reports very long execution time on x86 performance#1497
System.Memory.Span<Int32>.IndexOfValue(Size: 512)
- reported in [Perf -35%] System.Memory.Span<Char>.IndexOfValue #39722
- confirmed that it was due to code alignment change in [Perf -35%] System.Memory.Span<Char>.IndexOfValue #39722 (comment)
Benchstone.BenchI.Fib.Test
- perfectly multimodal, great example for a contract test
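To make the distinction between a bimodal benchmark and an occasional outlier concrete, here is a naive heuristic sketch that works on a list of historical measurements. It is not the method used by the bot or by BenchmarkDotNet. The function name, thresholds, and sample values are assumptions for illustration.

```python
def looks_bimodal(samples, min_gap_ratio=1.5, min_cluster_share=0.2):
    """Flag a history whose sorted values split into two well-separated clusters."""
    values = sorted(samples)
    if len(values) < 4:
        return False
    best_ratio, split = 1.0, 0
    for i in range(1, len(values)):
        if values[i - 1] <= 0:
            continue
        ratio = values[i] / values[i - 1]
        if ratio > best_ratio:
            best_ratio, split = ratio, i
    # Both clusters need a meaningful share, otherwise it is just an outlier.
    smaller_cluster = min(split, len(values) - split)
    return (
        best_ratio >= min_gap_ratio
        and smaller_cluster / len(values) >= min_cluster_share
    )


# Made-up histories, for illustration only.
two_modes = [10.0, 10.2, 9.9, 10.1, 20.0, 20.3, 19.8, 20.1]
stable = [10.0, 10.1, 9.9, 10.2, 10.05, 9.95]
rare_spike = [10.0] * 9 + [1000.0]

print(looks_bimodal(two_modes), looks_bimodal(stable), looks_bimodal(rare_spike))
```

With these thresholds a single rare spike, like the occasional 100x reading mentioned above, is treated as an outlier rather than as a second mode.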

Summary

The bot has reported all major performance issues for the configurations that it was enabled for (Windows x64, x86, and Ubuntu x64). Great work @DrewScoggins!
The full historical data turned out to be extremely useful to exclude all false positives for multimodal and unstable benchmarks.
We have missed one important x86 bug during triaging (human error), but it got discovered during the study ([Perf -196%] System.Collections.ContainsTrue<Int32> (6) #41167 (comment)). To avoid such problems in the future and to enable the bot in the runtime repo, the noise of the bot needs to be reduced. Currently, it's quite high, mostly due to the multimodal nature of the benchmarks.
The study has detected relatively many new ARM64 perf problems at a late stage of the release. The sooner we enable the bot for ARM64, the better. Moreover, we should be more frequently asking for ARM64 results when reviewing big changes that affect the performance of frequently used features (like sorting the arrays).
The study has shown that measuring the performance of GNU libc based Linux distros like Ubuntu is not enough to detect musl libc specific regressions. We should consider adding Alpine runs to the perf lab.
This time no important issues specific to macOS and different CPU families were discovered. It has proven that the perf lab has good hardware coverage.
The Alpine regression has shown that an increased number of Gen 0 collections can be a very valuable metric to detect regressions. We should consider extending the bot to use it.

Big thanks to everyone involved!


Dotnet-GitSync-Bot · 2020-09-04T14:25:42Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

danmoseley · 2020-09-04T14:37:15Z

Great work @adamsitnik . I am pleased that our systems have improved since last cycle, and that you and @DrewScoggins will be using this exercise to improve them such that more issues are found earlier and not at the end of the cycle.

Cc @Lxiamail @jkotas

no important issues specific to macOS and different CPU families were discovered. It has proven that the perf lab has good hardware coverage.

I thought the perf lab did not cover Mac and only had one type of CPU. How can we catch such regression before the end of the cycle?

adamsitnik · 2020-09-04T14:42:29Z

thought the perf lab did not cover Mac and only had one type of CPU.

You are right.

no important issues specific to macOS and different CPU families were discovered. It has proven that the perf lab has good hardware coverage.

I meant that in my study I have gathered macOS and different CPU families results but I did not find any important regressions (specific to macOS etc), so adding macOS or more types of CPUs to perf lab is not needed

danmoseley · 2020-09-04T14:44:19Z

Could an effort like this (script run over manual collected submissions) be made cheap enough that we could do it occasionally during the cycle?

Lxiamail · 2020-09-04T15:57:35Z

@adamsitnik Great job! Based on the data out of this exercise, we will add Alpine to .net perf lab.

ladeak · 2020-09-04T20:38:45Z

Is the collected data internal only or available publicly as well?

jeffhandley · 2020-09-04T23:29:51Z

Could an effort like this (script run over manual collected submissions) be made cheap enough that we could do it occasionally during the cycle?

We talked about this offline, but sharing the comment here too... We'll be putting effort into that between now and the first 6.0 previews, with the goal of completing targeted manual runs for each of the 6.0 Preview/RC releases.

DrewScoggins · 2020-09-09T22:15:36Z

thought the perf lab did not cover Mac and only had one type of CPU.

You are right.

no important issues specific to macOS and different CPU families were discovered. It has proven that the perf lab has good hardware coverage.

I meant that in my study I have gathered macOS and different CPU families results but I did not find any important regressions (specific to macOS etc), so adding macOS or more types of CPUs to perf lab is not needed

Just for clarity here. We will have a brand new batch of AMD hardware this calendar year. So we will not be covering MacOS in the lab, but we will have both Intel and AMD hardware for coverage.

tannergooding · 2020-09-11T15:21:56Z

@DrewScoggins, do we know what the hardware specs are?

DrewScoggins · 2020-09-11T17:54:21Z

I do not, @billwert did the work of speccing out the machines.

richlander · 2021-03-15T18:09:29Z

@adamsitnik -- that sharepoint link didn't work for me. Can you just publish the data to GH?

richlander · 2021-03-16T15:59:58Z

My mistake. I was using the wrong browser profile.

adamsitnik added tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark discussion Triaged labels Sep 4, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Sep 4, 2020

adamsitnik removed the untriaged New issue has not been triaged by the area owner label Sep 4, 2020

danmoseley added the area-Meta label Sep 4, 2020

joperezr added the tracking This issue is tracking the completion of other related issues. label Sep 8, 2020

iSazonov mentioned this issue Sep 9, 2020

Cleanup/Optimization: Remove remaining LINQ usage in the compiler PowerShell/PowerShell#13543

Closed

14 tasks

jeffhandley added this to the 5.0.0 milestone Sep 16, 2020

jeffhandley closed this as completed Sep 16, 2020

adamsitnik mentioned this issue Oct 23, 2020

Generate Exception: Access to the path '/su' is denied on Gentoo dotnet/BenchmarkDotNet#882

Closed

adamsitnik mentioned this issue Nov 23, 2020

Improve throughput of Environment.GetEnvironmentVariables() #45057

Merged

ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
