Dynamic PGO Microbenchmark Regressions #87194

AndyAyersMS · 2023-06-06T20:32:45Z

This issue tracks investigation into microbenchmarks that have reported regressions with Dynamic PGO enabled. It is a continuation of #84264 which tracked regressions from PGO before it was enabled.

The report below is collated from the following autofiling reports.

[Perf] Windows/arm64: 67 Regressions on 5/19/2023 1:23:34 PM perf-autofiling-issues#17979
[Perf] Linux/arm64: 62 Regressions on 5/19/2023 1:23:34 PM perf-autofiling-issues#17982
[Perf] Windows/x64: 100 Regressions on 5/19/2023 3:32:16 PM perf-autofiling-issues#17994
[Perf] Windows/arm64: 61 Regressions on 5/19/2023 1:23:34 PM perf-autofiling-issues#18096
[Perf] Windows/arm64: 7 Regressions on 5/19/2023 9:46:56 PM perf-autofiling-issues#18097
[Perf] Linux/arm64: 81 Regressions on 5/19/2023 1:23:34 PM perf-autofiling-issues#18103
[Perf] Windows/arm64: 4 Regressions on 5/19/2023 1:23:34 PM perf-autofiling-issues#18109
[Perf] Windows/arm64: 9 Regressions on 5/20/2023 3:57:43 PM perf-autofiling-issues#18111
[Perf] Linux/x64: 100 Regressions on 5/19/2023 3:32:16 PM perf-autofiling-issues#18139
[Perf] Windows/x64: 122 Regressions on 5/19/2023 3:32:16 PM perf-autofiling-issues#18151
[Perf] Windows/x64: 55 Regressions on 5/19/2023 3:32:16 PM perf-autofiling-issues#18582

The table is auto-generated by a tool written by @EgorBo but may be edited by hand as regression analysis produces results. The "Score" is the geomean regression across all architectures; benchmarks that did not regress (or get reported) on some architectures are assumed to have produced the same results with and without PGO. "Recent Score" is the current performance (as of 2023-06-06) versus the non-PGO result; "Orig Score" is based on the results of auto filing. They will differ if benchmark performance has improved or regressed since the auto filing ran (see for example the results for System.Text.Json.Tests.Perf_Get.GetByte, which has improved already).
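As a rough illustration of the scoring described above, here is a minimal Python sketch of a geomean over per-configuration ratios. It is not the actual tool; the function name and the `total_configs` padding parameter are assumptions made for illustration. Padding with 1.0 models the "same results with and without PGO" rule for configurations that did not report.

```python
import math

def geomean_score(ratios, total_configs=None):
    """Geometric mean of PGO / non-PGO ratios.

    If total_configs is given, configurations with no reported
    regression are padded with 1.0 (no change). This padding rule
    is an assumption based on the description in the text.
    """
    values = list(ratios)
    if total_configs is not None:
        values += [1.0] * (total_configs - len(values))
    return math.exp(sum(math.log(v) for v in values) / len(values))

# "Recent" ratios from the GetSByte row, which has data for all six configurations.
get_sbyte = [1.27, 1.28, 1.24, 2.09, 2.25, 1.86]
print(round(geomean_score(get_sbyte), 2))
```

For the GetSByte row this gives 1.61, which agrees with the Recent Score listed in the table. Rows with missing configurations depend on how the tool treats the gaps, which this sketch does not try to reproduce.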

Only the 36 entries with recent scores >= 1.3 are included; this leaves off approximately 220 more rows with scores below 1.3. Our plan is to prioritize investigation of these benchmarks initially, as they have the largest aggregate regressions. If time permits, we will regenerate this chart to pick up the impact of any fixes and see how much of the remainder we can tackle.

Each arch/os result is a hyperlink to the performance data graph for that benchmark. ~~Note we currently have no autofiling data for win-x64-intel. If/when that shows up we will regenerate the table.~~

[edit: had to regenerate the table once already, as the scoring logic was off]
[edit: have x64 win intel data now, new table. Note current results have shifted, so the table is somewhat different...]

cc @dotnet/jit-contrib

Notes	Recent Score	Orig Score	arm64-lin-ampere	arm64-win-surface	arm64-win-ampere	x64-lin-intel	x64-win-intel	x64-win-amd	Benchmark
noise	3.38	1.37						3.37 1.36	System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "zqj", Options: None)
noise	3.36	1.37						3.36 1.37	System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "zqj", Options: NonBacktracking)
notes	2.71	3.39		2.71 3.39					System.Memory.Span(Int32).EndsWith(Size: 4)
likely same as above	2.62	3.03	2.55 2.27	2.59 3.04					System.Memory.Span(Int32).SequenceEqual(Size: 4)
likely same as above	1.87	1.76		1.87 1.76					System.Memory.Span(Int32).SequenceCompareToDifferent(Size: 512)
(lack of) if conversion	1.82	1.80				1.67 1.63	1.93 1.92	1.86 1.85	System.Tests.Perf_Random.NextSingle
budget	1.75	1.88	1.33 1.47	1.35 1.49		1.90 1.99	2.29 2.43	2.10 2.19	System.Text.Json.Tests.Perf_Get.GetInt16
BDN	1.73	2.81	3.55 3.54	1.89 2.00		1.28 4.73	1.32 2.01	1.39 2.68	System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64EncodeInPlace(NumberOfBytes: 200000000)
notes	1.64	1.63	1.84 1.82	1.65 1.64					System.Tests.Perf_UInt32.TryParseHex(value: "0")
budget	1.61	1.70	1.27 1.44	1.28 1.46	1.24 1.18	2.09 2.17	2.25 2.33	1.86 1.94	System.Text.Json.Tests.Perf_Get.GetSByte
bimodal	1.61	1.59						1.60 1.58	System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "Sherlock Holmes", Options: Compiled)
cast expansion	1.60	1.64				1.82 1.87	1.41 1.43		System.Buffers.Tests.ReadOnlySequenceTests(Char).FirstSingleSegment
cast expansion	1.58	1.62				1.58 1.62			System.Buffers.Tests.ReadOnlySequenceTests(Byte).FirstSpanTenSegments
cast expansion	1.52	1.65				1.48 1.81	1.56 1.50		System.Buffers.Tests.ReadOnlySequenceTests(Byte).FirstSingleSegment
cast expansion	1.50	1.73				1.88 2.13	1.20 1.41		System.Buffers.Tests.ReadOnlySequenceTests(Char).FirstTenSegments
likely same as span cases above	1.48	1.28			1.48 1.28				System.Memory.Span(Int32).Reverse(Size: 4)
cast expansion	1.47	1.44				1.47 1.44			System.Buffers.Tests.ReadOnlySequenceTests(Byte).FirstSpanSingleSegment
notes	1.47	1.42						1.46 1.42	Benchstone.BenchF.InvMt.Test
unclear	1.46	1.15					1.46 1.15		MicroBenchmarks.Serializers.Json_FromStream(MyEventsListerViewModel).DataContractJsonSerializer_
fixed itself	1.45	1.09		1.45 1.09					System.Tests.Perf_Uri.EscapeDataString(input: "{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{
unclear	1.44	1.44				1.44 1.44			Burgers.Test1
unclear	1.43	1.27				1.43 1.27			System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfNumbers)
unclear, linux arm64 only	1.41	1.58	1.41 1.58						System.Text.Tests.Perf_StringBuilder.Append_Char_Capacity(length: 100000)
unclear, linux arm64 only	1.39	1.62	1.39 1.62						BenchmarksGame.RegexRedux_5.RunBench(options: Compiled)
bimodal	1.39	1.39	1.39 1.39						System.MathBenchmarks.Single.Min
bimodal	1.39	1.39	1.39 1.39						System.MathBenchmarks.Single.Max
unclear, linux arm64 only	1.39	1.32	1.39 1.32						System.IO.Pipes.Tests.Perf_NamedPipeStream.ReadWriteAsync(size: 1000000, Options: Asynchronous)
noise	1.38	1.29					1.38 1.29		System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateFromFile_Read(capacity: 10000000)
bimodal	1.37	1.37						1.37 1.37	System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "zqj", Options: Compiled)
notes	1.37	1.36	1.26 1.29	1.42 1.43	1.24 1.28	1.60 1.48			System.Collections.Sort(IntStruct).Array(Size: 512)
budget	1.36	1.93	1.15 1.56	1.15 1.58	1.27 1.66	1.42 2.14	1.80 2.67	1.49 2.24	System.Text.Json.Tests.Perf_Get.GetByte
noise	1.35	1.31						1.36 1.33	System.Memory.Span(Char).IndexOfAnyTwoValues(Size: 512)
arm64 only; ldar vs dmb	1.35	1.36	1.35 1.34	1.38 1.40					System.Collections.CtorFromCollection(Int32).ConcurrentBag(Size: 512)
fixed by physical promotion	1.35	1.36		1.35 1.36					Devirtualization.EqualityComparer.ValueTupleCompareWrapped
budget	1.34	1.42	1.42 1.26	1.28 1.38	1.35 1.44	1.35 1.42	1.35 1.55	1.31 1.46	System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToStream(Mode: SourceGen)
notes	1.34	1.45	1.18 1.29	1.40 1.44	1.13 1.41	1.71 1.71			System.Collections.Sort(IntStruct).List(Size: 512)
notes	1.33	1.33						1.33 1.33	System.Tests.Perf_HashCode.Combine_1
inlining different; exposed local	1.33	1.32					1.34 1.33	1.32 1.32	System.Memory.ReadOnlySequence.Slice_Repeat(Segment: Multiple)
notes	1.33	1.18				1.33 1.18			System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfStrings)
budget	1.32	1.37	1.24 1.39	1.20 1.28	1.37 1.15	1.39 1.46	1.45 1.57	1.27 1.39	System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToWriter(Mode: SourceGen)
budget	1.32	1.39	1.37 1.28	1.22 1.42	1.34 1.31	1.32 1.38	1.30 1.50	1.34 1.38	System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToUtf8Bytes(Mode: SourceGen)
budget	1.31	1.88	1.15 1.59	1.18 1.62	1.03 1.37	1.49 2.22	1.66 2.49	1.49 2.24	System.Text.Json.Tests.Perf_Get.GetUInt16
budget	1.31	1.33	1.38 1.25	1.20 1.23	1.23 1.26	1.35 1.46	1.40 1.40	1.41 1.43	System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToString(Mode: SourceGen)
jcc errata	1.31	1.39					1.31 1.39		Span.Sorting.QuickSortSpan(Size: 512)
lack of cold inline exposes local	1.31	1.29					1.31 1.31	1.31 1.27	System.Memory.ReadOnlySequence.Slice_Start_And_Length(Segment: Multiple)
budget	1.31	1.39	1.32 1.19	1.20 1.50		1.31 1.37	1.40 1.50	1.31 1.34	System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeObjectProperty(Mode: SourceGen)
lack of `ldapr`	1.30	1.30	1.29 1.30	1.30 1.30					System.Collections.CtorFromCollection(String).ConcurrentBag(Size: 512)


ghost · 2023-06-06T20:33:11Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This issue tracks investigation into microbenchmarks that have reported regressions with Dynamic PGO enabled. It is a continuation of #84264 which tracked regressions from PGO before it was enabled.

The report below is collated from the following autofiling reports.

[Perf] Windows/arm64: 67 Regressions on 5/19/2023 1:23:34 PM perf-autofiling-issues#17979
[Perf] Linux/arm64: 62 Regressions on 5/19/2023 1:23:34 PM perf-autofiling-issues#17982
[Perf] Windows/x64: 100 Regressions on 5/19/2023 3:32:16 PM perf-autofiling-issues#17994
[Perf] Windows/arm64: 61 Regressions on 5/19/2023 1:23:34 PM perf-autofiling-issues#18096
[Perf] Windows/arm64: 7 Regressions on 5/19/2023 9:46:56 PM perf-autofiling-issues#18097
[Perf] Linux/arm64: 81 Regressions on 5/19/2023 1:23:34 PM perf-autofiling-issues#18103
[Perf] Windows/arm64: 4 Regressions on 5/19/2023 1:23:34 PM perf-autofiling-issues#18109
[Perf] Windows/arm64: 9 Regressions on 5/20/2023 3:57:43 PM perf-autofiling-issues#18111
[Perf] Linux/x64: 100 Regressions on 5/19/2023 3:32:16 PM perf-autofiling-issues#18139
[Perf] Windows/x64: 122 Regressions on 5/19/2023 3:32:16 PM perf-autofiling-issues#18151

The table is auto-generated by a tool written by @EgorBo but may be edited by hand as regression analysis produces results. The "Score" is the geomean regression across all architectures; benchmarks that did not regress (or get reported) on some architectures are assumed to have produced the same results with and without PGO. "Recent Score" is the current performance (as of 2023-06-06) versus the non-PGO result; "Orig Score" is based on the results of auto filing. They will differ if benchmark performance has improved or regressed since the auto filing ran (see for example row 4's results for System.Text.Json.Tests.Perf_Get.GetUInt16, which has improved already).

Only the 54 entries with recent scores > 1.4 are included; this leaves off approximately 220 more rows with scores between 1.4 and 1.0. Our plan is to prioritize investigation of these benchmarks initially, as they have the largest aggregate regressions. If time permits we will regenerate this chart to pick up the impact of any fixes and see how much of the remainder we can tackle.

Each arch/os result is a hyperlink to the performance data graph for that benchmark. Note we currently have no autofiling data for win-x64-intel. If/when that shows up we will regenerate the table.

cc @dotnet/jit-contrib

Recent Score	Orig Score	win-x64-amd	lin-x64-intel	win-arm64-surface	win-arm64-ampere	lin-arm64-ampere	Benchmark
3.66	3.66	2.68 2.68	4.60 4.73	2.00 2.00		3.54 3.54	System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64EncodeInPlace(NumberOfBytes: 200000000)
2.80	2.97	1.86 1.94	2.09 2.17	1.28 1.46	1.22 1.18	1.27 1.44	System.Text.Json.Tests.Perf_Get.GetSByte
2.79	2.83	1.90 1.85	3.84 3.83	1.30 1.27		2.45 2.70	System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64DecodeInPlace(NumberOfBytes: 200000000)
2.65	3.41	1.48 2.24	1.49 2.22	1.17 1.62	1.19 1.37	1.17 1.59	System.Text.Json.Tests.Perf_Get.GetUInt16
2.53	2.60	1.31 1.38	1.37 1.38	1.21 1.42	1.30 1.31	1.35 1.28	System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToUtf8Bytes(Mode: SourceGen)
2.52	3.32	1.49 2.24	1.42 2.14	1.14 1.58	1.39 1.66	1.14 1.56	System.Text.Json.Tests.Perf_Get.GetByte
2.50	2.56	1.33 1.39	1.37 1.46	1.21 1.28	1.16 1.15	1.36 1.39	System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToWriter(Mode: SourceGen)
2.50	2.64	1.36 1.46	1.38 1.42	1.16 1.38	1.15 1.44	1.39 1.26	System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToStream(Mode: SourceGen)
2.49	2.53	1.43 1.43	1.30 1.46	1.22 1.23	1.17 1.26	1.38 1.25	System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToString(Mode: SourceGen)
2.35	2.47	2.10 2.19	1.92 1.99	1.35 1.49		1.33 1.47	System.Text.Json.Tests.Perf_Get.GetInt16
2.22	2.22	1.23 1.23		1.45 1.45	1.27 1.33	1.53 1.44	System.Tests.Perf_UInt64.TryParseHex(value: "0")
2.10	2.10		1.48 1.48	1.44 1.43	1.35 1.28	1.25 1.29	System.Collections.Sort(IntStruct).Array(Size: 512)
2.04	2.10	1.31 1.34	1.34 1.37	1.22 1.50		1.35 1.19	System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeObjectProperty(Mode: SourceGen)
2.04	2.18		1.63 1.71	1.42 1.44	1.03 1.41	1.19 1.29	System.Collections.Sort(IntStruct).List(Size: 512)
1.97	1.99		1.33 1.39	1.26 1.22	1.12 1.19	1.23 1.24	System.Memory.Span(Int32).IndexOfValue(Size: 512)
1.95	2.60	1.15 3.31	1.05 1.29	1.08 1.56		1.08 1.08	System.Text.Json.Tests.Perf_Get.GetInt32
1.92	1.88			3.36 3.04		2.94 2.27	System.Memory.Span(Int32).SequenceEqual(Size: 4)
1.92	2.04	1.19 1.43	1.15 1.23	1.20 1.30		1.17 1.27	System.Text.Json.Serialization.Tests.WriteJson(ImmutableSortedDictionary(String, String)).SerializeToWriter(Mode: SourceGen)
1.87	1.97	0.99 1.11	1.14 1.50	1.31 1.22		1.09 1.10	System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: Url,&lorem ipsum=dolor sit amet,16)
1.87	1.86	1.47 1.46	1.16 1.13	1.18 1.18		1.20 1.19	System.Collections.ContainsTrue(Int32).List(Size: 512)
1.83	1.84	1.27 1.31	1.21 1.21	1.17 1.17		1.17 1.17	System.Collections.ContainsTrue(Int32).ICollection(Size: 512)
1.80	1.89	1.05 1.38	1.30 1.30	1.20 1.21		1.19 1.19	System.Collections.ContainsTrue(Int32).Array(Size: 512)
1.73	1.82	0.89 1.16		1.13 1.14	1.34 1.45	1.16 1.14	System.Memory.Span(Int32).LastIndexOfValue(Size: 512)
1.72	2.66	0.84 3.31	0.93 1.48	0.84 1.56		0.85 1.07	System.Text.Json.Tests.Perf_Get.GetUInt32
1.72	1.73	1.07 1.07	1.28 1.30	1.07 1.06		1.07 1.08	System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf16(arguments: Url,&lorem ipsum=dolor sit amet,16)
1.71	1.76	1.08 1.18	1.11 1.15	1.10 1.14		1.10 1.10	System.Collections.IterateForEach(Int32).ImmutableHashSet(Size: 512)
1.71	1.83	0.94 1.13	0.95 1.05	0.99 1.11		1.01 1.10	System.Text.Json.Tests.Perf_Get.GetInt64
1.67	1.66		1.29 1.23	1.10 1.10		1.12 1.11	System.Memory.Span(Byte).IndexOfValue(Size: 512)
1.66	1.66		1.34 1.33	1.08 1.06		1.08 1.08	System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Mariomkas.Count(Pattern: "(?:(?:25[0-5]
1.64	1.65	1.15 1.19		1.08 1.10		1.11 1.13	System.Memory.Span(Byte).LastIndexOfValue(Size: 512)
1.63	1.64		1.10 1.12	1.28 1.29		1.41 1.41	System.Tests.Perf_UInt64.TryParseHex(value: "3039")
1.60	1.59		1.40 1.34	1.16 1.17		1.15 1.16	Benchstone.BenchI.BubbleSort2.Test
1.60	1.59		1.33 1.30	1.16 1.16		1.18 1.16	System.Collections.ContainsTrue(Int32).ImmutableArray(Size: 512)
1.58	1.58	1.85 1.85	1.62 1.63				System.Tests.Perf_Random.NextSingle
1.57	1.59		1.25 1.35	1.15 1.14		1.13 1.13	System.Collections.ContainsTrue(Int32).Queue(Size: 512)
1.56	1.67	0.99 1.19		0.98 1.15		1.00 1.13	System.Tests.Perf_Uri.EscapeDataString(input: "a{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{ü
1.56	1.63	1.17 1.31		1.48 1.46		1.28 1.40	System.Memory.Span(Char).LastIndexOfValue(Size: 512)
1.55	1.55	1.47 1.45		1.21 1.21		1.21 1.22	System.Collections.ContainsTrue(Int32).Span(Size: 512)
1.54	1.54	1.20 1.20		1.11 1.09		1.08 1.08	System.Memory.Span(Int32).IndexOfAnyFourValues(Size: 33)
1.54	1.52	1.16 1.17		1.37 1.28		1.34 1.32	System.Tests.Perf_UInt32.TryParseHex(value: "3039")
1.52	1.51	1.15 1.20		1.49 1.39			System.Memory.Span(Byte).SequenceCompareToDifferent(Size: 512)
1.51	1.55	1.08 1.24		1.13 1.13		1.05 1.10	System.Text.Perf_Utf8Encoding.GetBytes(Input: Chinese)
1.50	1.61			1.26 1.47		1.22 1.72	System.Tests.Perf_Guid.GuidToString
1.48	1.49			1.20 1.19		1.18 1.24	System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 16,16 bits)
1.47	1.49		1.30 1.35	1.03 1.08			System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: JavaScript,&Hello+(World)!,16)
1.46	1.62	0.78 1.09		0.88 1.08		0.87 1.07	System.Text.Json.Tests.Perf_Get.GetString
1.45	1.46			1.54 1.64		1.88 1.82	System.Tests.Perf_UInt32.TryParseHex(value: "0")
1.45	1.48	1.27 1.41		1.06 1.06		1.09 1.09	System.Memory.Span(Int32).IndexOfAnyFiveValues(Size: 33)
1.45	1.45		1.14 1.11			1.06 1.10	BenchmarksGame.BinaryTrees_5.RunBench
1.45	1.45			3.41 3.39			System.Memory.Span(Int32).EndsWith(Size: 4)
1.43	1.43			1.05 1.05		1.06 1.06	System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "Sherlock
1.42	1.45			0.99 1.08		1.06 1.10	System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Leipzig.Count(Pattern: "[a-z]shing", Options: Compiled)
1.40	1.41	1.22 1.22	1.24 1.32				LinqBenchmarks.Count00ForX
1.40	1.40		1.41 1.38	1.09 1.13			System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: Url,&lorem ipsum=dolor sit amet,512)

Author:	AndyAyersMS
Assignees:	AndyAyersMS
Labels:	`area-CodeGen-coreclr`
Milestone:	-

AndyAyersMS · 2023-06-06T20:54:33Z

System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64EncodeInPlace(NumberOfBytes: 200000000)

System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64DecodeInPlace(NumberOfBytes: 200000000)

----

These benchmarks were analyzed before PGO was enabled: #84264 (comment)

BDN's strategy doesn't run the benchmark enough, because each iteration is long running, and so (since the key benchmark methods are R2R'd) the test ends up measuring tier1-instr code.

AndyAyersMS · 2023-06-06T21:03:44Z

System.Text.Json.Tests.Perf_Get.GetInt16

System.Text.Json.Tests.Perf_Get.GetSByte

System.Text.Json.Tests.Perf_Get.GetByte

Likely the analysis from #84264 (comment) is still relevant and explains the related tests' regressions as well: we run out of inlining budget, in part because the benchmark method is small and there are quite a few large aggressive inline methods, and so we're unable to do some key inlines.

Tracking issue for this is #85531.

Some of these improved with #86551.

AndyAyersMS · 2023-06-06T22:17:40Z

System.Tests.Perf_UInt32.TryParseHex(value: "0")

This regresses across the board, so not sure why we don't have more autofiling for it.

I can't repro on win-x64

Method	Job	Toolchain	value	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
TryParseHex	Job-EEXSZV	\base\corerun.exe	0	7.860 ns	0.1366 ns	0.1573 ns	7.814 ns	7.640 ns	8.281 ns	1.00	0.00	-	NA
TryParseHex	Job-EZIDSX	\diff\corerun.exe	0	7.834 ns	0.1461 ns	0.1683 ns	7.783 ns	7.634 ns	8.344 ns	1.00	0.02	-	NA

But I can (perhaps) on win-arm64 (volterra)

Method	Job	Toolchain	value	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
TryParseHex	Job-DFURZJ	\base\corerun.exe	0	3.265 ns	0.0267 ns	0.0237 ns	3.257 ns	3.235 ns	3.316 ns	1.00	0.00	-	NA
TryParseHex	Job-URZCKJ	\diff\corerun.exe	0	3.876 ns	0.0899 ns	0.0751 ns	3.849 ns	3.794 ns	4.034 ns	1.19	0.02	-	NA

base

00.81%   4.1E+05     ?        Unknown
48.82%   2.458E+07   Tier-1   [491fed18-8a14-46f6-b30c-6320dc770919]Runnable_0.WorkloadActionUnroll(int64)
23.52%   1.184E+07   Tier-1   [System.Private.CoreLib]Number.TryParseBinaryIntegerHexOrBinaryNumberStyle(value class System.ReadOnlySpan`1<!!0>,value class System.Globalization.NumberStyles,!!1&)
12.69%   6.39E+06    Tier-1   [MicroBenchmarks]Perf_UInt32.TryParseHex(class System.String)
08.92%   4.49E+06    Tier-1   [System.Private.CoreLib]NumberFormatInfo.<GetInstance>g__GetProviderNonNull|58_0(class System.IFormatProvider)
04.37%   2.2E+06     Tier-1   [System.Private.CoreLib]CastHelpers.IsInstanceOfClass(void*,class System.Object)

diff

50.62%   2.666E+07   Tier-1   [d735f3a6-719a-4885-9631-16a7aeff132c]Runnable_0.WorkloadActionUnroll(int64)
23.94%   1.261E+07   Tier-1   [System.Private.CoreLib]Number.TryParseBinaryIntegerHexOrBinaryNumberStyle(value class System.ReadOnlySpan`1<!!0>,value class System.Globalization.NumberStyles,!!1&)
10.73%   5.65E+06    Tier-1   [MicroBenchmarks]Perf_UInt32.TryParseHex(class System.String)
08.62%   4.54E+06    Tier-1   [System.Private.CoreLib]NumberFormatInfo.<GetInstance>g__GetProviderNonNull|58_0(class System.IFormatProvider)
04.08%   2.15E+06    Tier-1   [System.Private.CoreLib]CastHelpers.IsInstanceOfClass(void*,class System.Object)

Note the very high overhead. WorkloadActionUnroll should have the AggressiveOptimization attribute so its codegen does not vary.

PGO codegen for TryParseBinaryIntegerHexOrBinaryNumberStyle looks good, all the hot code is adjacent and the method is straight line code. So not clear why it is ~7% or so slower.
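The "~7%" figure can be read off the exclusive sample counts in the two profiles above. A small Python sketch of that comparison follows; the shortened method names are just dictionary keys chosen here, and comparing raw sample counts assumes both runs did a comparable amount of work, which BDN does not guarantee.

```python
# Exclusive sample counts copied from the base and diff profiles above.
base = {
    "WorkloadActionUnroll": 2.458e7,
    "TryParseBinaryIntegerHexOrBinaryNumberStyle": 1.184e7,
    "Perf_UInt32.TryParseHex": 6.39e6,
    "GetProviderNonNull": 4.49e6,
    "IsInstanceOfClass": 2.2e6,
}
diff = {
    "WorkloadActionUnroll": 2.666e7,
    "TryParseBinaryIntegerHexOrBinaryNumberStyle": 1.261e7,
    "Perf_UInt32.TryParseHex": 5.65e6,
    "GetProviderNonNull": 4.54e6,
    "IsInstanceOfClass": 2.15e6,
}

def sample_ratios(base_samples, diff_samples):
    """Return diff/base sample ratio for every method present in both profiles."""
    return {
        name: diff_samples[name] / base_samples[name]
        for name in base_samples
        if name in diff_samples
    }

ratios = sample_ratios(base, diff)
# Print methods from most regressed to most improved.
for name, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:45s} {ratio:.3f}")
```

On these numbers TryParseBinaryIntegerHexOrBinaryNumberStyle comes out at about 1.065, while the benchmark method itself is slightly below 1.0.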

Path lengths are similar but PGO does one extra STP/LDP and a few more register arg moves. So perhaps that's the explanation?

AndyAyersMS · 2023-06-06T22:29:26Z

System.Memory.Span<Int32>.EndsWith(Size: 4)

System.Memory.Span<Int32>.SequenceEqual(Size: 4)

These two appear to be arm64 specific.

Method	Job	Toolchain	Size	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
EndsWith	Job-INARIQ	\base-rel\corerun.exe	4	2.712 ns	0.0209 ns	0.0163 ns	2.712 ns	2.680 ns	2.734 ns	1.00	0.00	-	NA
EndsWith	Job-YXKXJS	\diff-rel\corerun.exe	4	3.357 ns	0.2569 ns	0.2959 ns	3.175 ns	3.079 ns	3.928 ns	1.23	0.10	-	NA

base
01.52%   5.7E+05     ?        Unknown
35.26%   1.319E+07   Tier-1   [18850c97-dca6-4aad-8393-0715e0b95a4a]Runnable_0.WorkloadActionUnroll(int64)
34.51%   1.291E+07   Tier-1   [MicroBenchmarks]System.Memory.Span`1[System.Int32].EndsWith()
27.72%   1.037E+07   Tier-1   [System.Private.CoreLib]SpanHelpers.SequenceEqual(unsigned int8&,unsigned int8&,unsigned int)

diff
01.29%   5.1E+05     ?        Unknown
35.20%   1.392E+07   Tier-1   [MicroBenchmarks]System.Memory.Span`1[System.Int32].EndsWith()
31.40%   1.242E+07   Tier-1   [72707307-0596-4e84-8515-1a5c07b80d3a]Runnable_0.WorkloadActionUnroll(int64)
31.10%   1.23E+07    Tier-1   [System.Private.CoreLib]SpanHelpers.SequenceEqual(unsigned int8&,unsigned int8&,unsigned int)

So the issue appears to be in SequenceEqual?

Seems like PGO and non-PGO codegen is the same. Method is R2R'd and with PGO we do a tier1 instr, but do not add any probes. Explanation: this method is an intrinsic and not on the whitelist.

Fixing that (hack) gives:

Method	Job	Toolchain	Size	Mean	Error	StdDev	Median	Min	Max	Ratio	Allocated	Alloc Ratio
EndsWith	Job-FBJVVQ	\base-rel\corerun.exe	4	2.732 ns	0.0330 ns	0.0275 ns	2.729 ns	2.692 ns	2.793 ns	1.00	-	NA
EndsWith	Job-UKVWAD	\diff-rel\corerun.exe	4	3.146 ns	0.0472 ns	0.0419 ns	3.153 ns	3.084 ns	3.208 ns	1.15	-	NA
EndsWith	Job-BJUBBV	\hack-rel\corerun.exe	4	2.660 ns	0.0364 ns	0.0322 ns	2.659 ns	2.610 ns	2.731 ns	0.97	-	NA

and more broadly

Method	Job	Toolchain	Size	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
EndsWith	Job-SDVJLL	\base-rel\corerun.exe	4	2.685 ns	0.0479 ns	0.0448 ns	2.679 ns	2.627 ns	2.766 ns	1.00	0.00	-	NA
EndsWith	Job-FTSDAY	\diff-rel\corerun.exe	4	3.151 ns	0.0178 ns	0.0158 ns	3.151 ns	3.126 ns	3.181 ns	1.17	0.02	-	NA
EndsWith	Job-EGIDWC	\hack-rel\corerun.exe	4	3.125 ns	0.0394 ns	0.0368 ns	3.125 ns	3.080 ns	3.180 ns	1.16	0.02	-	NA

EndsWith	Job-SDVJLL	\base-rel\corerun.exe	33	4.739 ns	0.0558 ns	0.0495 ns	4.728 ns	4.686 ns	4.840 ns	1.00	0.00	-	NA
EndsWith	Job-FTSDAY	\diff-rel\corerun.exe	33	5.322 ns	0.0576 ns	0.0539 ns	5.299 ns	5.257 ns	5.446 ns	1.12	0.01	-	NA
EndsWith	Job-EGIDWC	\hack-rel\corerun.exe	33	5.369 ns	0.0265 ns	0.0235 ns	5.369 ns	5.333 ns	5.419 ns	1.13	0.01	-	NA

EndsWith	Job-SDVJLL	\base-rel\corerun.exe	512	32.020 ns	0.0209 ns	0.0175 ns	32.023 ns	31.999 ns	32.059 ns	1.00	0.00	-	NA
EndsWith	Job-FTSDAY	\diff-rel\corerun.exe	512	32.422 ns	0.0351 ns	0.0328 ns	32.417 ns	32.383 ns	32.495 ns	1.01	0.00	-	NA
EndsWith	Job-EGIDWC	\hack-rel\corerun.exe	512	31.819 ns	0.0370 ns	0.0346 ns	31.815 ns	31.776 ns	31.893 ns	0.99	0.00	-	NA

But does not explain why there is a PGO regression. And as you can see the Size=4 results are not very stable.

Similar diffs for EndsWith (better layout, slightly higher prolog/epilog costs).

AndyAyersMS · 2023-06-06T22:37:04Z

System.Memory.Span<Int32>.SequenceCompareToDifferent(Size: 512)

Also seems to be arm64 specific.

AndyAyersMS · 2023-07-05T18:05:28Z

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "zqj", Options: None)

These spiked up but then recovered and match their longer-term behavior

AndyAyersMS · 2023-07-05T18:11:23Z

Devirtualization.EqualityComparer.ValueTupleCompareWrapped

Fixed by physical promotion

AndyAyersMS · 2023-07-17T20:04:12Z

Benchstone.BenchF.InvMt.Test

This one is more substantially regressed on amd64 HW...

This doesn't repro on my local Zen3 box

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22000.2176/21H2/SunValley)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
[Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2
Job-EGKMFB : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-KOAKXG : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	Gen0	Allocated	Alloc Ratio
InvMt	Job-EGKMFB	\base-rel\corerun.exe	1.540 ms	0.0146 ms	0.0137 ms	1.540 ms	1.521 ms	1.568 ms	1.00	12.5000	105.07 KB	1.00
InvMt	Job-KOAKXG	\diff-rel\corerun.exe	1.513 ms	0.0077 ms	0.0072 ms	1.512 ms	1.504 ms	1.527 ms	0.98	12.5000	105.07 KB	1.00

Perf lab is running Ryzen 7 3700 PRO.

AndyAyersMS · 2023-07-17T21:41:24Z

System.Tests.Perf_Random.NextSingle

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
NextSingle	Job-IRALDH	\base-rel\corerun.exe	4.697 ns	0.0866 ns	0.0810 ns	4.696 ns	4.528 ns	4.796 ns	1.00	0.00	-	NA
NextSingle	Job-VNWIBR	\diff-rel\corerun.exe	8.382 ns	0.1170 ns	0.1094 ns	8.394 ns	8.229 ns	8.574 ns	1.79	0.04	-	NA

Profiling

base

00.95%   4.8E+05     ?        Unknown
36.70%   1.859E+07   Tier-1   [System.Private.CoreLib]Random+CompatPrng.InternalSample()
34.94%   1.77E+07    Tier-1   [System.Private.CoreLib]Random+Net5CompatSeedImpl.NextSingle()
12.38%   6.27E+06    Tier-1   [System.Private.CoreLib]Random.NextSingle()
07.54%   3.82E+06    Tier-1   [cb5bf117-1cbd-438a-bbce-a71239c42b3d]Runnable_0.WorkloadActionUnroll(int64)
06.85%   3.47E+06    Tier-1   [MicroBenchmarks]Perf_Random.NextSingle()
00.26%   1.3E+05     native   clrjit.dll
00.12%   6E+04       native   coreclr.dll
00.12%   6E+04       native   ntoskrnl.exe
00.10%   5E+04       native   ntdll.dll

diff

02.33%   2.07E+06    ?        Unknown
79.57%   7.061E+07   Tier-1   [System.Private.CoreLib]Random+Net5CompatSeedImpl.NextSingle()
14.97%   1.328E+07   Tier-1   [MicroBenchmarks]Perf_Random.NextSingle()
02.56%   2.27E+06    Tier-1   [c48f5bab-b8a5-4a49-b7a3-55a58e477b40]Runnable_0.WorkloadActionUnroll(int64)
00.21%   1.9E+05     native   clrjit.dll
00.17%   1.5E+05     native   coreclr.dll
00.14%   1.2E+05     native   ntoskrnl.exe

Looks like with PGO we inline InternalSample, and this hurts perf. Why?

Root cause is lack of if conversion in InternalSample, once PGO has inlined it into Random+Net5CompatSeedImpl.NextSingle.

(with DOTNET_JitDoIfConversion=0)

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
NextSingle	Job-UYNTQT	\base\corerun.exe	10.451 ns	0.1370 ns	0.1578 ns	10.424 ns	10.252 ns	10.94 ns	1.00	0.00	-	NA
NextSingle	Job-FNVAZN	\diff\corerun.exe	9.709 ns	0.3340 ns	0.3846 ns	9.751 ns	8.154 ns	10.10 ns	0.93	0.04	-	NA

@jakobbotsch here's an example where not if converting in loops is painful. PGO undoes the improvements that came in with #81267.
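For readers unfamiliar with the term, if conversion rewrites a small conditional assignment into a straight-line select (cmov on x64, csel on arm64), so there is no branch to mispredict. The Python sketch below only shows that the two shapes compute the same value; it is not the code of InternalSample, and the function names and the limit of 56 are hypothetical.

```python
def wrap_branchy(index, limit):
    """Increment with wrap-around, written as a conditional branch."""
    index += 1
    if index == limit:  # branch form: taken only when the index wraps
        index = 1
    return index

def wrap_select(index, limit):
    """Same computation as a single select, the shape if conversion produces."""
    index += 1
    return 1 if index == limit else index  # select form: no control flow split

# Both forms agree for every starting index (limit value is illustrative only).
LIMIT = 56
assert all(wrap_branchy(i, LIMIT) == wrap_select(i, LIMIT) for i in range(LIMIT))
```

The semantics are identical; the difference is purely in the generated machine code, which is why losing the select form inside a hot loop can show up as a regression.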

ARM64 does not seem to be affected for some reason.

#79101 may show a similar problem

AndyAyersMS · 2023-07-17T23:40:24Z

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSingleSegment

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
FirstSingleSegment	Job-MECUZL	\base-rel\corerun.exe	3.166 ns	0.0275 ns	0.0317 ns	3.156 ns	3.125 ns	3.248 ns	1.00	0.00	-	NA
FirstSingleSegment	Job-IYGZNT	\diff-rel\corerun.exe	5.107 ns	0.0487 ns	0.0560 ns	5.089 ns	5.040 ns	5.237 ns	1.61	0.03	-	NA

base
00.57%   4.51E+06    ?        Unknown
64.12%   5.094E+08   Tier-1   [MicroBenchmarks]System.Buffers.Tests.ReadOnlySequenceTests`1[System.Char].First(value class System.Buffers.ReadOnlySequence`1<!0>)
16.92%   1.344E+08   Tier-1   [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
09.23%   7.33E+07    Tier-1   [MicroBenchmarks]System.Buffers.Tests.ReadOnlySequenceTests`1[System.Char].FirstSingleSegment()
05.37%   4.269E+07   Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Char]..ctor(class System.Buffers.ReadOnlySequenceSegment`1<!0>,int32,class System.Buffers.ReadOnlySequenceSegment`1<!0>,int32)
02.28%   1.81E+07    Tier-1   [3e900f5f-7929-4719-a376-9aa860256700]Runnable_0.WorkloadActionUnroll(int64)
01.44%   1.142E+07   native   coreclr.dll

diff
00.45%   5.24E+06    ?        Unknown
57.87%   6.67E+08    Tier-1   [MicroBenchmarks]System.Buffers.Tests.ReadOnlySequenceTests`1[System.Char].First(value class System.Buffers.ReadOnlySequence`1<!0>)
29.19%   3.364E+08   Tier-1   [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
06.48%   7.467E+07   Tier-1   [MicroBenchmarks]System.Buffers.Tests.ReadOnlySequenceTests`1[System.Char].FirstSingleSegment()
03.18%   3.665E+07   Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Char]..ctor(class System.Buffers.ReadOnlySequenceSegment`1<!0>,int32,class System.Buffers.ReadOnlySequenceSegment`1<!0>,int32)
01.84%   2.116E+07   Tier-1   [60fe30c0-c2f0-4348-9bd8-9b632dabcde3]Runnable_0.WorkloadActionUnroll(int64)
00.91%   1.049E+07   native   coreclr.dll

Same set of inlines, similar optimizations.

Main delta is code layout, in both First and in ChkCastClassSpecial. Not clear why this causes such a big perf diff as the profile data should be accurate.

The issue is that by default we don't profile casts, and without profile data, we assume casts will fall back to the helper 25% of the time. This leads the jit to move all the cast calls to the end of the method, and so each call site must jump to the call and then jump back into the regular flow:

;; base
						;; size=41 bbWeight=0.50 PerfScore 7.12
G_M11072_IG110:  ;; offset=0C55H
       mov      rdx, r8
       call     [CORINFO_HELP_CHKCASTCLASS_SPECIAL]
						;; size=9 bbWeight=0.12 PerfScore 0.41
G_M11072_IG111:  ;; offset=0C5EH
       mov      rdx, gword ptr [rax+18H]

;; diff

       jne      SHORT G_M11072_IG52
						;; size=45 bbWeight=0.50 PerfScore 7.12
G_M11072_IG49:  ;; offset=0863H
       mov      rdx, gword ptr [rax+18H]
       ...

G_M11072_IG52:  ;; offset=08B8H
       mov      rdx, r8
       call     [CORINFO_HELP_CHKCASTCLASS_SPECIAL]
       jmp      SHORT G_M11072_IG49

BBWeight of IG52 comes from QMARK expansion; we assume 25% chance?
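The 25% guess is consistent with the weights in the base listing, assuming the bbWeight=0.50 line is the footer of the block that precedes the helper call: 0.50 × 0.25 = 0.125, which would print as the 0.12 shown for the call block. A minimal Python sketch of that arithmetic (the function name is hypothetical, and this is not the JIT's actual weight computation):

```python
def helper_block_weight(parent_weight, helper_likelihood=0.25):
    """Weight of the cast-helper block when no cast profile data exists.

    helper_likelihood is the assumed chance of falling back to the
    helper; 0.25 is the default guess discussed in the text.
    """
    return parent_weight * helper_likelihood

weight = helper_block_weight(0.50)
print(weight)
```

With cast profiling enabled the likelihood would come from measured data instead of the fixed guess, which is what the DOTNET_JitProfileCasts=1 run tests.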

With DOTNET_JitProfileCasts=1:

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	Allocated	Alloc Ratio
FirstSingleSegment	Job-BCGWCL	\base\corerun.exe	3.217 ns	0.0506 ns	0.0583 ns	3.245 ns	3.130 ns	3.276 ns	1.00	-	NA
FirstSingleSegment	Job-BXIQEJ	\diff\corerun.exe	2.705 ns	0.0115 ns	0.0133 ns	2.704 ns	2.669 ns	2.736 ns	0.84	-	NA

@EgorBo where did we end up in the evaluation of cast profiling?

AndyAyersMS · 2023-07-18T02:24:48Z

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "Sherlock Holmes", Options: Compiled)

Looks like this one is bimodal, especially on amd64

AndyAyersMS · 2023-07-18T02:33:39Z

Burgers.Test1

This one is pretty stable except on linux x64:

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.100-preview.4.23260.5
[Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2
Job-AJUYHP : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-IYXLDU : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
Burgers_1	Job-PTZPWJ	/base-rel/corerun	181.0 ms	1.61 ms	1.85 ms	180.9 ms	178.4 ms	187.2 ms	1.00	0.00	157.03 KB	1.00
Burgers_1	Job-ZGFEKO	/diff-rel/corerun	185.4 ms	2.27 ms	2.62 ms	184.8 ms	181.5 ms	191.4 ms	1.02	0.02	157.03 KB	1.00

Can't repro this one locally. I think I have the same HW as the lab (i7-8700) but not sure.

Also does not repro on my old Sandy Bridge

BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 20.04
Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET SDK=8.0.100-preview.6.23330.14
[Host] : .NET 7.0.9 (7.0.923.32018), X64 RyuJIT AVX
Job-CWOHEC : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX
Job-ERFHVR : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	Allocated	Alloc Ratio
Burgers_1	Job-CWOHEC	/base-rel/corerun	703.3 ms	1.05 ms	0.93 ms	703.4 ms	701.8 ms	704.9 ms	1.00	157.03 KB	1.00
Burgers_1	Job-ERFHVR	/diff-rel/corerun	704.8 ms	2.09 ms	1.74 ms	704.2 ms	703.1 ms	709.1 ms	1.00	157.03 KB	1.00

EgorBo · 2023-07-18T13:50:06Z

> @EgorBo where did we end up in the evaluation of cast profiling?

I think it should be good to enable it for complex types of casts, but not in .NET 8.0 (afair, we don't profile all types of casts + never checked actual codegen size overhead from it)

AndyAyersMS · 2023-07-18T17:55:55Z

System.MathBenchmarks.Single.Min

Seems like it is bimodal and just happened to flip around the time we enabled Dynamic PGO.

Similar for Max

Recent regressions are #88482

AndyAyersMS · 2023-07-21T17:19:20Z

MicroBenchmarks.Serializers.Json_FromStream<MyEventsListerViewModel>>.DataContractJsonSerializer_

This is not a pure regression but rather an increase in variance (ala #87324), with lower lows and higher highs -- but note this is only the case for windows x64 intel (and arm64); Linux is relatively stable and faster with PGO, and windows on amd64 seems ok as well.

AndyAyersMS · 2023-07-21T17:31:05Z

System.Tests.Perf_Uri.EscapeDataString(input: "{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{

Generally looks to have recovered.

AndyAyersMS · 2023-07-21T17:37:16Z

System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfNumbers)

Regression is linux-x64 only; win-x64 benefits from PGO, arm64 is more or less unchanged.

I can repro the regression on WSL2, however there aren't great profiling tools available there.

Method	Job	Toolchain	TestCase	Mean	Error	StdDev	Median	Min	Max	Ratio	Allocated	Alloc Ratio
EnumerateUsingIndexer	Job-OAGCJY	/base-rel/corerun	ArrayOfNumbers	984.7 ns	7.21 ns	6.02 ns	985.8 ns	976.5 ns	992.7 ns	1.00	-	NA
EnumerateUsingIndexer	Job-AJUODA	/diff-rel/corerun	ArrayOfNumbers	1,123.7 ns	7.15 ns	6.69 ns	1,121.5 ns	1,111.4 ns	1,134.2 ns	1.14	-	NA

Nominal profile (from win-x64) is:

74.36%   3.889E+07   Tier-1   [System.Text.Json]JsonDocument.GetArrayIndexElement(int32,int32)
18.53%   9.69E+06    Tier-1   [MicroBenchmarks]Perf_EnumerateArray.EnumerateUsingIndexer()
06.08%   3.18E+06    native   coreclr.dll
00.34%   1.8E+05     native   clrjit.dll
00.19%   1E+05       Tier-1   [System.Text.Json]JsonDocument.GetArrayLength(int32)

Using BDN's -p EP it appears similar for the run on WSL2 and suggests the regression is in GetArrayIndexElement. Codegen for this method differs (mainly streamlined layout w/PGO). The PGO code looks better to me. Without more details it is hard to figure out where else to look.

EnumerateUsingIndexer inlines GetArrayLength with PGO, but the loop invoking GetArrayIndexElement is identical and looks like it iterates ~300? times.

Actually, looking at the profile data, it seems wrong. A bit of digging turned up a bug in the 32 bit counter helper on linux, see #89340.

Fixing that doesn't alter the codegen, it just makes the loop seem a bit less hot (but still very hot).

AndyAyersMS · 2023-07-24T23:07:06Z

System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateFromFile_Read(capacity: 10000000)

No regression here.

AndyAyersMS · 2023-07-24T23:19:40Z

System.Collections.Sort<IntStruct>.List(Size: 512)

(see also #84264 (comment))

Did not regress on windows-x64, but did everywhere else.

System.Collections.Sort<IntStruct>.Array(Size: 512)

Ditto.

These two are quite likely the same issue. Last I looked into sorting, it was layout related; let's look again.

AndyAyersMS · 2023-07-27T02:20:47Z

System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 512)

Looks like noise

AndyAyersMS · 2023-07-27T02:34:51Z

System.Collections.CtorFromCollection<Int32>.ConcurrentBag(Size: 512)

Improved on x64, regressed on arm64.

Ah, likely the same issue with barriers as noted below: #87194 (comment)

AndyAyersMS · 2023-07-27T02:39:19Z

System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToStream(Mode: SourceGen)

And related...

see previous analysis

AndyAyersMS · 2023-07-28T15:29:52Z

System.Tests.Perf_HashCode.Combine_1

amd64 only

Does not repro on my local Zen3 machine:

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
[Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2
Job-XLXHLH : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-UGGZHH : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	Allocated	Alloc Ratio
Combine_1	Job-XLXHLH	\base-rel\corerun.exe	62.43 us	0.298 us	0.249 us	62.44 us	62.01 us	63.00 us	1.00	-	NA
Combine_1	Job-UGGZHH	\diff-rel\corerun.exe	61.18 us	0.137 us	0.128 us	61.21 us	60.93 us	61.35 us	0.98	-	NA

In fact the whole suite looks pretty good, save perhaps _4:

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
[Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2
Job-WPDJSJ : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-YNXXXX : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	Allocated	Alloc Ratio
Combine_1	Job-WPDJSJ	\base-rel\corerun.exe	62.04 us	0.139 us	0.108 us	62.04 us	61.80 us	62.23 us	1.00	-	NA
Combine_1	Job-YNXXXX	\diff-rel\corerun.exe	60.92 us	0.096 us	0.090 us	60.86 us	60.85 us	61.10 us	0.98	-	NA

Combine_2	Job-WPDJSJ	\base-rel\corerun.exe	79.40 us	0.182 us	0.162 us	79.40 us	79.16 us	79.74 us	1.00	-	NA
Combine_2	Job-YNXXXX	\diff-rel\corerun.exe	78.70 us	0.091 us	0.085 us	78.67 us	78.62 us	78.85 us	0.99	-	NA

Combine_3	Job-WPDJSJ	\base-rel\corerun.exe	69.93 us	0.093 us	0.087 us	69.90 us	69.85 us	70.11 us	1.00	-	NA
Combine_3	Job-YNXXXX	\diff-rel\corerun.exe	69.88 us	0.040 us	0.034 us	69.87 us	69.86 us	69.97 us	1.00	-	NA

Combine_4	Job-WPDJSJ	\base-rel\corerun.exe	83.42 us	0.035 us	0.030 us	83.41 us	83.39 us	83.50 us	1.00	-	NA
Combine_4	Job-YNXXXX	\diff-rel\corerun.exe	95.03 us	0.013 us	0.010 us	95.03 us	95.01 us	95.05 us	1.14	-	NA

Combine_5	Job-WPDJSJ	\base-rel\corerun.exe	72.14 us	0.012 us	0.011 us	72.14 us	72.13 us	72.17 us	1.00	-	NA
Combine_5	Job-YNXXXX	\diff-rel\corerun.exe	72.15 us	0.059 us	0.049 us	72.13 us	72.12 us	72.29 us	1.00	-	NA

Combine_6	Job-WPDJSJ	\base-rel\corerun.exe	83.46 us	0.104 us	0.092 us	83.40 us	83.38 us	83.65 us	1.00	-	NA
Combine_6	Job-YNXXXX	\diff-rel\corerun.exe	83.43 us	0.064 us	0.057 us	83.41 us	83.39 us	83.55 us	1.00	-	NA

Combine_7	Job-WPDJSJ	\base-rel\corerun.exe	94.72 us	0.112 us	0.093 us	94.68 us	94.66 us	94.93 us	1.00	-	NA
Combine_7	Job-YNXXXX	\diff-rel\corerun.exe	94.86 us	0.177 us	0.166 us	94.81 us	94.69 us	95.26 us	1.00	-	NA

Combine_8	Job-WPDJSJ	\base-rel\corerun.exe	116.49 us	0.078 us	0.065 us	116.46 us	116.43 us	116.66 us	1.00	-	NA
Combine_8	Job-YNXXXX	\diff-rel\corerun.exe	117.22 us	0.260 us	0.243 us	117.18 us	116.88 us	117.69 us	1.01	-	NA

AndyAyersMS · 2023-07-28T16:02:20Z

System.Memory.ReadOnlySequence.Slice_Repeat(Segment: Multiple)

Win-x64 only

This one repros

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
[Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2
Job-RBZAAU : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-AJWTXX : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method	Job	Toolchain	Segment	Mean	Error	StdDev	Median	Min	Max	Ratio	Allocated	Alloc Ratio
Slice_Repeat	Job-RBZAAU	\base-rel\corerun.exe	Multiple	32.03 ns	0.020 ns	0.017 ns	32.03 ns	32.01 ns	32.06 ns	1.00	-	NA
Slice_Repeat	Job-AJWTXX	\diff-rel\corerun.exe	Multiple	43.23 ns	0.274 ns	0.243 ns	43.16 ns	42.93 ns	43.75 ns	1.35	-	NA

Issue here seems to be that with PGO we mark a call site that takes V05 (struct local) as rare and don't inline it, and so V05 ends up getting address exposed and has more expensive copy semantics.

@EgorBo example where not doing an inline in a cold block impacts codegen in a hot block.
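A minimal C# sketch of the pattern, with made-up names: `Hot`, `ColdPath`, and a value tuple standing in for the struct local V05. `NoInlining` stands in for the PGO-driven decision not to inline the rare call site. Whether the JIT actually address-exposes the local here depends on the JIT version, so treat this as an illustration rather than a repro.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical rarely-called helper. NoInlining mimics the call site being
// marked rare and left un-inlined.
[MethodImpl(MethodImplOptions.NoInlining)]
static long ColdPath(ref (long Start, long Length) s) => s.Start - s.Length;

static long Hot(long start, long length)
{
    // Struct local, playing the role of V05.
    var local = (Start: start, Length: length);

    if (length < 0)
    {
        // Rare path: the local's address escapes to a real call, so the JIT
        // may have to treat it as address exposed for the whole method.
        return ColdPath(ref local);
    }

    // Hot path: uses the same local, and can pay for the exposure above.
    return local.Start + local.Length;
}

Console.WriteLine(Hot(2, 3)); // 5
```

If `ColdPath` were inlined, the `ref` would disappear and the local could in principle be promoted to registers.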

base
01.75%   8.8E+05     ?        Unknown
56.39%   2.828E+07   Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].Slice(int64,int64)
13.86%   6.95E+06    Tier-1   [MicroBenchmarks]ReadOnlySequence.Slice_Repeat()
10.91%   5.47E+06    Tier-1   [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
09.71%   4.87E+06    native   coreclr.dll
06.28%   3.15E+06    Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].SeekMultiSegment(class System.Buffers.ReadOnlySequenceSegment`1<!0>,class System.Object,int32,int64,value class System.ExceptionArgument)
00.62%   3.1E+05     Tier-1   [7c853c35-6121-4c85-8327-3f1f8585f3b1]Runnable_0.WorkloadActionUnroll(int64)
00.28%   1.4E+05     native   clrjit.dll
00.12%   6E+04       native   ntoskrnl.exe
00.06%   3E+04       native   ntdll.dll

diff

00.69%   3.5E+05     ?        Unknown
74.57%   3.797E+07   Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].Slice(int64,int64)
11.82%   6.02E+06    Tier-1   [MicroBenchmarks]ReadOnlySequence.Slice_Repeat()
05.32%   2.71E+06    native   coreclr.dll
03.69%   1.88E+06    Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].SeekMultiSegment(class System.Buffers.ReadOnlySequenceSegment`1<!0>,class System.Object,int32,int64,value class System.ExceptionArgument)
03.18%   1.62E+06    Tier-1   [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
00.37%   1.9E+05     native   ntoskrnl.exe
00.26%   1.3E+05     native   clrjit.dll

AndyAyersMS · 2023-07-28T18:49:17Z

System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfStrings)

x64 linux only. Likely the same as #87194 (comment)

AndyAyersMS · 2023-07-28T18:56:33Z

Span.Sorting.QuickSortSpan(Size: 512)

Windows intel x64 only.

Method	Job	Toolchain	Size	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
QuickSortSpan	Job-FBLSKS	\base-rel\corerun.exe	512	12.76 us	2.346 us	2.304 us	13.86 us	8.584 us	15.80 us	1.00	0.00	-	NA
QuickSortSpan	Job-UVWYZR	\diff-rel\corerun.exe	512	14.04 us	3.065 us	3.530 us	13.52 us	10.049 us	19.78 us	1.18	0.42	-	NA

At first look, it seems like BDN is not iterating this enough... we are measuring Tier0 code.

base
55.66%   2.31E+06    Tier-0   [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
18.80%   7.8E+05     native   clrjit.dll
15.18%   6.3E+05     Tier-1   [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
04.82%   2E+05       native   coreclr.dll

diff
37.39%   1.23E+06    Tier-1   [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
31.91%   1.05E+06    native   coreclr.dll
26.44%   8.7E+05     Tier-0   [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
03.04%   1E+05       native   clrjit.dll

The per-iteration times tell a similar story:

base
000 1021.818 -- 1052.150 : 30.332
001 1053.657 -- 1083.828 : 30.171
002 1085.532 -- 1115.977 : 30.445
003 1117.422 -- 1147.984 : 30.562
004 1149.518 -- 1180.699 : 31.181
005 1182.397 -- 1213.166 : 30.769
006 1214.624 -- 1245.324 : 30.700
007 1247.707 -- 1273.592 : 25.886
008 1275.115 -- 1283.721 : 8.607
009 1285.118 -- 1293.938 : 8.820
010 1295.437 -- 1304.091 : 8.655
011 1305.529 -- 1314.128 : 8.599
012 1315.535 -- 1324.085 : 8.550
013 1325.503 -- 1334.060 : 8.557
014 1335.444 -- 1343.915 : 8.471

diff
000 816.876 -- 872.133 : 55.258
001 873.602 -- 940.129 : 66.526
002 942.247 -- 997.177 : 54.931
003 998.693 -- 1020.673 : 21.980
004 1022.154 -- 1032.488 : 10.334
005 1033.906 -- 1044.327 : 10.421
006 1045.780 -- 1056.235 : 10.455
007 1059.274 -- 1069.671 : 10.398
008 1071.102 -- 1081.520 : 10.418
009 1082.967 -- 1093.448 : 10.481
010 1094.849 -- 1105.169 : 10.320
011 1106.637 -- 1117.139 : 10.502
012 1118.663 -- 1129.091 : 10.427
013 1130.485 -- 1140.868 : 10.384
014 1142.225 -- 1152.484 : 10.260

For diff (PGO) we eagerly instrument, so the Tier0 code is slower. But even so, the optimized code is slower as well...

@adamsitnik seems like we should up the iterations per invocation for these tests to something like 10_000 (at 1000, each benchmark interval is only 10ms).
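Rough arithmetic behind that suggestion, as a sketch. The 10.4 ms figure is eyeballed from the steady-state diff iterations above, and `IterationMs` is a hypothetical helper, not a BDN API.

```csharp
using System;

// Hypothetical helper: wall-clock time of one benchmark iteration, given the
// cost of one invocation and the number of invocations per iteration.
double IterationMs(double usPerInvocation, int invocationsPerIteration)
    => usPerInvocation * invocationsPerIteration / 1000.0;

// ~10.4 ms per iteration at 1,000 invocations is ~10.4 us per invocation.
double measuredUs = 10.4 * 1000.0 / 1_000;

Console.WriteLine($"1,000 invocations:  {IterationMs(measuredUs, 1_000):F1} ms per iteration");
Console.WriteLine($"10,000 invocations: {IterationMs(measuredUs, 10_000):F1} ms per iteration");
```

At 10,000 invocations each iteration would run for roughly 100 ms, so the Tier0 warm-up would take up far fewer of the 15 measured iterations.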

Think this might be caused by JCC errata. Main optimizations are identical, but code layout differs.

;;
       cmp      dword ptr [rbx+4*r8], edx
       jge      SHORT G_M24415_IG04
						;; size=17 bbWeight=5.09 PerfScore 27.98
G_M24415_IG07:  ;; offset=0041H

and note how in diff that jge straddles a 32 byte boundary. Profiling shows there is a prominent peak at offset 0x2A that is not there in the baseline version.
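To make the boundary claim concrete, here is a minimal sketch of the check. It assumes (this is not stated above) that the jge is the 2-byte short form and is the last instruction before IG07 at offset 0x41, which would put it at bytes 0x3F and 0x40. `TouchesBoundary32` is a made-up helper name. The condition it tests, a jump that crosses or ends on a 32-byte boundary, is my reading of Intel's JCC erratum guidance.

```csharp
using System;

// True if an instruction occupying [offset, offset + length) crosses a
// 32-byte boundary or ends exactly on one.
bool TouchesBoundary32(int offset, int length)
{
    int lastByte = offset + length - 1;
    bool crosses = (offset / 32) != (lastByte / 32);
    bool endsOn = (offset + length) % 32 == 0;
    return crosses || endsOn;
}

// Assumed placement of the jge: bytes 0x3F and 0x40, straddling 0x40.
Console.WriteLine(TouchesBoundary32(0x3F, 2)); // True

// The same 2-byte jump placed at 0x30 stays inside one 32-byte chunk.
Console.WriteLine(TouchesBoundary32(0x30, 2)); // False
```

A full check would also consider the cmp, since a macro-fused cmp+jcc pair is treated as one unit for this purpose, as far as I understand it.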

AndyAyersMS · 2023-07-28T20:09:33Z

System.Memory.ReadOnlySequence.Slice_Start_And_Length(Segment: Multiple)

also win-x64 only

Same underlying issue as #87194 (comment)

AndyAyersMS · 2023-07-28T20:13:46Z

System.Collections.CtorFromCollection<String>.ConcurrentBag(Size: 512)

arm64 only

Repros on my volterra:

BenchmarkDotNet v0.13.7-nightly.20230724.45, Windows 11 (10.0.22621.2070/22H2/2022Update/SunValley2)
Snapdragon 8cx Gen 3 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
[Host] : .NET 8.0.0 (8.0.23.32907), Arm64 RyuJIT AdvSIMD
Job-EGKPHU : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-FFUPKC : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method	Job	Toolchain	Size	Mean	Error	StdDev	Median	Min	Max	Ratio	Gen0	Gen1	Gen2	Allocated	Alloc Ratio
ConcurrentBag	Job-EGKPHU	\base-rel\corerun.exe	512	15.78 us	0.107 us	0.100 us	15.73 us	15.68 us	16.00 us	1.00	2.6309	2.5667	0.0642	16.16 KB	1.00
ConcurrentBag	Job-FFUPKC	\diff-rel\corerun.exe	512	22.78 us	0.119 us	0.112 us	22.76 us	22.63 us	22.99 us	1.44	2.5491	2.4547	-	16.16 KB	1.00

base
12.70%   4.89E+06    ?        Unknown
43.43%   1.672E+07   Tier-1   [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentBag`1+WorkStealingQueue[System.__Canon].LocalPush(!0,int64&)
32.36%   1.246E+07   native   coreclr.dll
03.12%   1.2E+06     native   ntoskrnl.exe
02.68%   1.03E+06    Tier-1   [System.Private.CoreLib]System.SZGenericArrayEnumerator`1[System.__Canon].get_Current()
02.05%   7.9E+05     Tier-1   [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentBag`1[System.__Canon]..ctor(class System.Collections.Generic.IEnumerable`1<!0>)
01.51%   5.8E+05     Tier-1   [System.Private.CoreLib]SZGenericArrayEnumeratorBase.MoveNext()

diff

06.89%   2.71E+06    ?        Unknown
60.37%   2.376E+07   Tier-1   [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentBag`1+WorkStealingQueue[System.__Canon].LocalPush(!0,int64&)
23.04%   9.07E+06    native   coreclr.dll
02.69%   1.06E+06    Tier-1   [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentBag`1[System.__Canon]..ctor(class System.Collections.Generic.IEnumerable`1<!0>)
02.59%   1.02E+06    Tier-1   [System.Private.CoreLib]System.SZGenericArrayEnumerator`1[System.__Canon].get_Current()
02.57%   1.01E+06    native   ntoskrnl.exe

So issue is evidently in LocalPush.

In the base (no-PGO) jit we form a CSE, and this lets us use ldapr; in the diff (PGO) jit we don't do the CSE and end up emitting a separate barrier. That may be the culprit.

BASE

Generating: N507 (  1,  1) [000436] -----------                  t436 =    LCL_VAR   byref  V16 cse0          x28 REG x28 $1c8
                                                                        /--*  t436   byref  
Generating: N509 (  3,  2) [000174] V---GO-----                  t174 = *  IND       ref    REG x0 $156
IN0062:             ldapr   x0, [x28]

DIFF

Generating: N347 (  1,  1) [000172] -----------                  t172 =    LCL_VAR   ref    V00 this         u:1 x22 REG x22 $80
                                                                        /--*  t172   ref    
Generating: N349 (  3,  4) [000356] -c---------                  t356 = *  LEA(b+8)  byref  REG NA
                                                                        /--*  t356   byref  
Generating: N351 (  6,  6) [000174] V---GO-----                  t174 = *  IND       ref    REG x0 $156
IN00ac:             ldr     x0, [x22, #0x08]
IN00ad:             dmb     ishld

In base there are just two sampling hot spots, both tied to ldapr:

0x0044 : 1037
0x00DC : 162

IN0005: 000040      swpal   w1, w1, [x21]
IN0006: 000044      add     x22, x0, #28
IN0007: 000048      ldapr   w23, [x22]

IN002b: 0000D8      ldapr   w25, [x24]
IN002c: 0000DC      ldrb    w1, [x0, #0x34]

In diff there are more hot spots, and all hottest samples are near the dmbs.

0x003C : 655
0x0340 : 367
0x00C4 : 239
0x00AC : 201
0x0064 : 208

IN0006: 00003C      ldr     w1, [x0, #0x1C]
IN0007: 000040      dmb     ishld

IN00c6: 00033C      dmb     ish
IN00c7: 000340      str     wzr, [x0, #0x2C]

IN0027: 0000C0      dmb     ish
IN0028: 0000C4      str     w1, [x0, #0x1C]

IN0021: 0000A8      dmb     ishld
IN0022: 0000AC      ldr     x2, [fp, #0x28]	// [V01 arg1]

IN000f: 000060      dmb     ishld
IN0010: 000064      ldrb    w1, [x0, #0x34]

@EgorBo example we were chatting about.

adamsitnik · 2023-07-31T12:20:57Z

> @adamsitnik seems like we should up the iterations per invocation for these tests to something like 10_000 (at 1000, each benchmark interval is only 10ms).

I took a look at the benchmark source code and it's safe to do it (the benchmark ID won't change, as the InvocationsPerIteration const is not an argument or a parameter for this benchmark)

AndyAyersMS · 2023-07-31T23:33:51Z

While there are still a few benchmarks where the analysis is unclear, they are isolated to specific OS/ABI combinations. So I'm going to close this out.

DrewScoggins · 2023-08-24T21:12:59Z

Did BubbleSort2 ever get looked at?

DrewScoggins · 2023-08-24T21:42:52Z

adamsitnik · 2023-08-25T07:22:35Z

Please keep in mind that both bubble sort and IndexOf are heavily dependent on memory alignment, so it can be unrelated to PGO. The easiest way to verify is to allocate an aligned memory using NativeMemory.AlignedAlloc, create a span out of it and try to repro.
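A minimal sketch of that check, assuming a 64-byte alignment and a 1024-element int buffer (both picked for illustration) and using `Span<int>.Sort()` as a stand-in for the benchmark body. `SortInAlignedBuffer` is a hypothetical name. It needs .NET 6 or later and `AllowUnsafeBlocks` enabled in the project.

```csharp
using System;
using System.Runtime.InteropServices;

// Allocates an aligned native buffer, wraps it in a span, runs a stand-in
// workload on it, and reports whether the buffer was aligned and sorted.
unsafe bool SortInAlignedBuffer(int length, nuint alignment)
{
    // alignment must be a power of two.
    int* buffer = (int*)NativeMemory.AlignedAlloc((nuint)(length * sizeof(int)), alignment);
    try
    {
        var span = new Span<int>(buffer, length);

        // Descending input, so the sort has work to do.
        for (int i = 0; i < span.Length; i++)
        {
            span[i] = span.Length - i;
        }

        span.Sort(); // stand-in for the code under test

        bool aligned = (nuint)buffer % alignment == 0;
        bool sorted = span[0] == 1 && span[length - 1] == length;
        return aligned && sorted;
    }
    finally
    {
        // Memory from AlignedAlloc must be released with AlignedFree.
        NativeMemory.AlignedFree(buffer);
    }
}

Console.WriteLine(SortInAlignedBuffer(1024, 64));
```

If the timing difference disappears with the aligned buffer, data alignment rather than PGO is the more likely cause.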

DrewScoggins · 2023-08-25T20:52:13Z

> Please keep in mind that both bubble sort and IndexOf are heavily dependent on memory alignment, so it can be unrelated to PGO. The easiest way to verify is to allocate an aligned memory using NativeMemory.AlignedAlloc, create a span out of it and try to repro.

Makes sense. I am just adding tests here that were not previously included, but regressed over the same commit range as this check-in.

DrewScoggins · 2023-08-25T20:52:17Z

DrewScoggins · 2023-08-25T22:19:32Z

Major instability starting with this commit.

AndyAyersMS · 2023-08-25T22:22:24Z

> > Please keep in mind that both bubble sort and IndexOf are heavily dependent on memory alignment, so it can be unrelated to PGO. The easiest way to verify is to allocate an aligned memory using NativeMemory.AlignedAlloc, create a span out of it and try to repro.
>
> Makes sense. I am just adding tests here that were not previously included, but regressed over the same commit range as this check-in.

Also note that bubble sort runs for a very long time, so BDN + lab customization is likely not reliably measuring the Tier1 codegen, but instead some mixture of Tier0, Tier0 + instrumentation, OSR, and/or R2R code.

AndyAyersMS · 2023-08-25T22:26:06Z

> Major instability starting with this commit.

Feel free to add this kind of thing to #87324

DrewScoggins · 2023-08-25T22:27:01Z

I will do that going forward, didn't know about that issue :)

DrewScoggins · 2023-08-28T06:46:41Z

This is specifically on Windows x86.

dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jun 6, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Jun 6, 2023

AndyAyersMS self-assigned this Jun 6, 2023

AndyAyersMS added this to the 8.0.0 milestone Jun 6, 2023

AndyAyersMS mentioned this issue Jun 6, 2023

Investigate microbenchmarks that regress with PGO enabled #84264

Closed

EgorBo mentioned this issue Jun 8, 2023

[Perf] Linux/arm64: 1 Regression on 5/19/2023 4:22:12 AM dotnet/perf-autofiling-issues#18102

Closed

AndyAyersMS added the Priority:2 Work that is important, but not critical for the release label Jun 9, 2023

This was referenced Jul 22, 2023

JIT_CountProfile32 incorrect native codegen on linux #89340

Closed

Block inlining of IntroSort #89310

Merged

AndyAyersMS mentioned this issue Jul 29, 2023

JIT: Don't use addressing modes for volatile loads for gc types #70794

Merged

AndyAyersMS mentioned this issue Jul 31, 2023

Increase Sorting invocations per iteration dotnet/performance#3202

Merged

AndyAyersMS closed this as completed Jul 31, 2023

AndyAyersMS mentioned this issue Aug 4, 2023

Regression in IfStatements.IfStatements #79101

Closed

DrewScoggins mentioned this issue Sep 5, 2023

[Perf] 7.0 -> 8.0-RC1 Microbenchmark Comparison #91626

Open

DrewScoggins mentioned this issue Sep 15, 2023

[Perf] Triage backlog #92164

Open

ghost locked as resolved and limited conversation to collaborators Sep 27, 2023
