Dynamic PGO Microbenchmark Regressions #87194

Closed
AndyAyersMS opened this issue Jun 6, 2023 · 45 comments
Labels: area-CodeGen-coreclr, Priority:2

@AndyAyersMS
Member

AndyAyersMS commented Jun 6, 2023

This issue tracks investigation into microbenchmarks that have reported regressions with Dynamic PGO enabled. It is a continuation of #84264 which tracked regressions from PGO before it was enabled.

The report below is collated from the following autofiling reports.

The table is auto-generated by a tool written by @EgorBo but may be edited by hand as regression analysis produces results. The "Score" is the geomean regression across all architectures; benchmarks that did not regress (or were not reported) on some architectures are assumed to have produced the same results with and without PGO. "Recent Score" is the current performance (as of 2023-06-06) versus the non-PGO result; "Orig Score" is based on the results of auto filing. The two will differ if benchmark performance has improved or regressed since the auto filing ran (see for example the results for System.Text.Json.Tests.Perf_Get.GetByte, which has improved already).
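The scoring described above can be sketched in a few lines of Python. This is only an illustration, not @EgorBo's tool: the function name `geomean_score` and the `total_archs` padding parameter are assumptions, and padding with 1.0 is just one reading of "assumed to have produced the same results".

```python
import math

def geomean_score(ratios, total_archs=None):
    """Geometric mean of per-architecture regression ratios.

    If total_archs is given, architectures with no reported ratio are
    padded with 1.0, i.e. treated as unchanged by PGO.
    """
    values = list(ratios)
    if total_archs is not None:
        if total_archs < len(values):
            raise ValueError("total_archs is smaller than the number of ratios")
        values += [1.0] * (total_archs - len(values))
    if not values:
        raise ValueError("need at least one ratio")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Recent per-arch ratios listed for Perf_Get.GetInt16 in the table.
getint16_recent = [1.33, 1.35, 1.90, 2.29, 2.10]
reported_only = geomean_score(getint16_recent)
padded_to_six = geomean_score(getint16_recent, total_archs=6)
print(f"{reported_only:.2f} {padded_to_six:.2f}")
```

For these five values the reported-only geomean comes out at about 1.75, which lines up with that row's Recent Score; padding to six architectures gives a lower number.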

Only the 36 entries with recent scores >= 1.3 are included; this leaves off approximately 220 more rows with scores below 1.3. Our plan is to prioritize investigation of these benchmarks initially, as they have the largest aggregate regressions. If time permits, we will regenerate this chart to pick up the impact of any fixes and see how much of the remainder we can tackle.

Each arch/os result is a hyperlink to the performance data graph for that benchmark. ~~Note we currently have no autofiling data for win-x64-intel. If/when that shows up we will regenerate the table.~~

[edit: had to regenerate the table once already, as the scoring logic was off]
[edit: have x64 win intel data now, new table. Note current results have shifted, so the table is somewhat different...]

cc @dotnet/jit-contrib

Per-arch columns, in order: arm64-lin-ampere, arm64-win-surface, arm64-win-ampere, x64-lin-intel, x64-win-intel, x64-win-amd. Each row lists only the architectures that reported, as recent / orig pairs in column order.

| Notes | Recent Score | Orig Score | Per-arch results (recent / orig) | Benchmark |
| --- | --- | --- | --- | --- |
| noise | 3.38 | 1.37 | 3.37 / 1.36 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "zqj", Options: None) |
| noise | 3.36 | 1.37 | 3.36 / 1.37 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "zqj", Options: NonBacktracking) |
| notes | 2.71 | 3.39 | 2.71 / 3.39 | System.Memory.Span(Int32).EndsWith(Size: 4) |
| likely same as above | 2.62 | 3.03 | 2.55 / 2.27, 2.59 / 3.04 | System.Memory.Span(Int32).SequenceEqual(Size: 4) |
| likely same as above | 1.87 | 1.76 | 1.87 / 1.76 | System.Memory.Span(Int32).SequenceCompareToDifferent(Size: 512) |
| (lack of) if conversion | 1.82 | 1.80 | 1.67 / 1.63, 1.93 / 1.92, 1.86 / 1.85 | System.Tests.Perf_Random.NextSingle |
| budget | 1.75 | 1.88 | 1.33 / 1.47, 1.35 / 1.49, 1.90 / 1.99, 2.29 / 2.43, 2.10 / 2.19 | System.Text.Json.Tests.Perf_Get.GetInt16 |
| BDN | 1.73 | 2.81 | 3.55 / 3.54, 1.89 / 2.00, 1.28 / 4.73, 1.32 / 2.01, 1.39 / 2.68 | System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64EncodeInPlace(NumberOfBytes: 200000000) |
| notes | 1.64 | 1.63 | 1.84 / 1.82, 1.65 / 1.64 | System.Tests.Perf_UInt32.TryParseHex(value: "0") |
| budget | 1.61 | 1.70 | 1.27 / 1.44, 1.28 / 1.46, 1.24 / 1.18, 2.09 / 2.17, 2.25 / 2.33, 1.86 / 1.94 | System.Text.Json.Tests.Perf_Get.GetSByte |
| bimodal | 1.61 | 1.59 | 1.60 / 1.58 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "Sherlock Holmes", Options: Compiled) |
| cast expansion | 1.60 | 1.64 | 1.82 / 1.87, 1.41 / 1.43 | System.Buffers.Tests.ReadOnlySequenceTests(Char).FirstSingleSegment |
| cast expansion | 1.58 | 1.62 | 1.58 / 1.62 | System.Buffers.Tests.ReadOnlySequenceTests(Byte).FirstSpanTenSegments |
| cast expansion | 1.52 | 1.65 | 1.48 / 1.81, 1.56 / 1.50 | System.Buffers.Tests.ReadOnlySequenceTests(Byte).FirstSingleSegment |
| cast expansion | 1.50 | 1.73 | 1.88 / 2.13, 1.20 / 1.41 | System.Buffers.Tests.ReadOnlySequenceTests(Char).FirstTenSegments |
| likely same as span cases above | 1.48 | 1.28 | 1.48 / 1.28 | System.Memory.Span(Int32).Reverse(Size: 4) |
| cast expansion | 1.47 | 1.44 | 1.47 / 1.44 | System.Buffers.Tests.ReadOnlySequenceTests(Byte).FirstSpanSingleSegment |
| notes | 1.47 | 1.42 | 1.46 / 1.42 | Benchstone.BenchF.InvMt.Test |
| unclear | 1.46 | 1.15 | 1.46 / 1.15 | MicroBenchmarks.Serializers.Json_FromStream(MyEventsListerViewModel).DataContractJsonSerializer_ |
| fixed itself | 1.45 | 1.09 | 1.45 / 1.09 | System.Tests.Perf_Uri.EscapeDataString(input: "{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{ |
| unclear | 1.44 | 1.44 | 1.44 / 1.44 | Burgers.Test1 |
| unclear | 1.43 | 1.27 | 1.43 / 1.27 | System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfNumbers) |
| unclear, linux arm64 only | 1.41 | 1.58 | 1.41 / 1.58 | System.Text.Tests.Perf_StringBuilder.Append_Char_Capacity(length: 100000) |
| unclear, linux arm64 only | 1.39 | 1.62 | 1.39 / 1.62 | BenchmarksGame.RegexRedux_5.RunBench(options: Compiled) |
| bimodal | 1.39 | 1.39 | 1.39 / 1.39 | System.MathBenchmarks.Single.Min |
| bimodal | 1.39 | 1.39 | 1.39 / 1.39 | System.MathBenchmarks.Single.Max |
| unclear, linux arm64 only | 1.39 | 1.32 | 1.39 / 1.32 | System.IO.Pipes.Tests.Perf_NamedPipeStream.ReadWriteAsync(size: 1000000, Options: Asynchronous) |
| noise | 1.38 | 1.29 | 1.38 / 1.29 | System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateFromFile_Read(capacity: 10000000) |
| bimodal | 1.37 | 1.37 | 1.37 / 1.37 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "zqj", Options: Compiled) |
| notes | 1.37 | 1.36 | 1.26 / 1.29, 1.42 / 1.43, 1.24 / 1.28, 1.60 / 1.48 | System.Collections.Sort(IntStruct).Array(Size: 512) |
| budget | 1.36 | 1.93 | 1.15 / 1.56, 1.15 / 1.58, 1.27 / 1.66, 1.42 / 2.14, 1.80 / 2.67, 1.49 / 2.24 | System.Text.Json.Tests.Perf_Get.GetByte |
| noise | 1.35 | 1.31 | 1.36 / 1.33 | System.Memory.Span(Char).IndexOfAnyTwoValues(Size: 512) |
| arm64 only; ldar vs dmb | 1.35 | 1.36 | 1.35 / 1.34, 1.38 / 1.40 | System.Collections.CtorFromCollection(Int32).ConcurrentBag(Size: 512) |
| fixed by physical promotion | 1.35 | 1.36 | 1.35 / 1.36 | Devirtualization.EqualityComparer.ValueTupleCompareWrapped |
| budget | 1.34 | 1.42 | 1.42 / 1.26, 1.28 / 1.38, 1.35 / 1.44, 1.35 / 1.42, 1.35 / 1.55, 1.31 / 1.46 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToStream(Mode: SourceGen) |
| notes | 1.34 | 1.45 | 1.18 / 1.29, 1.40 / 1.44, 1.13 / 1.41, 1.71 / 1.71 | System.Collections.Sort(IntStruct).List(Size: 512) |
| notes | 1.33 | 1.33 | 1.33 / 1.33 | System.Tests.Perf_HashCode.Combine_1 |
| inlining different; exposed local | 1.33 | 1.32 | 1.34 / 1.33, 1.32 / 1.32 | System.Memory.ReadOnlySequence.Slice_Repeat(Segment: Multiple) |
| notes | 1.33 | 1.18 | 1.33 / 1.18 | System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfStrings) |
| budget | 1.32 | 1.37 | 1.24 / 1.39, 1.20 / 1.28, 1.37 / 1.15, 1.39 / 1.46, 1.45 / 1.57, 1.27 / 1.39 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToWriter(Mode: SourceGen) |
| budget | 1.32 | 1.39 | 1.37 / 1.28, 1.22 / 1.42, 1.34 / 1.31, 1.32 / 1.38, 1.30 / 1.50, 1.34 / 1.38 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToUtf8Bytes(Mode: SourceGen) |
| budget | 1.31 | 1.88 | 1.15 / 1.59, 1.18 / 1.62, 1.03 / 1.37, 1.49 / 2.22, 1.66 / 2.49, 1.49 / 2.24 | System.Text.Json.Tests.Perf_Get.GetUInt16 |
| budget | 1.31 | 1.33 | 1.38 / 1.25, 1.20 / 1.23, 1.23 / 1.26, 1.35 / 1.46, 1.40 / 1.40, 1.41 / 1.43 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToString(Mode: SourceGen) |
| jcc errata | 1.31 | 1.39 | 1.31 / 1.39 | Span.Sorting.QuickSortSpan(Size: 512) |
| lack of cold inline exposes local | 1.31 | 1.29 | 1.31 / 1.31, 1.31 / 1.27 | System.Memory.ReadOnlySequence.Slice_Start_And_Length(Segment: Multiple) |
| budget | 1.31 | 1.39 | 1.32 / 1.19, 1.20 / 1.50, 1.31 / 1.37, 1.40 / 1.50, 1.31 / 1.34 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeObjectProperty(Mode: SourceGen) |
| lack of ldapr | 1.30 | 1.30 | 1.29 / 1.30, 1.30 / 1.30 | System.Collections.CtorFromCollection(String).ConcurrentBag(Size: 512) |
dotnet-issue-labeler bot added the needs-area-label label Jun 6, 2023
ghost added the untriaged label Jun 6, 2023
AndyAyersMS self-assigned this Jun 6, 2023
AndyAyersMS added area-CodeGen-coreclr and removed untriaged, needs-area-label labels Jun 6, 2023
AndyAyersMS added this to the 8.0.0 milestone Jun 6, 2023
@ghost

ghost commented Jun 6, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This issue tracks investigation into microbenchmarks that have reported regressions with Dynamic PGO enabled. It is a continuation of #84264 which tracked regressions from PGO before it was enabled.

The report below is collated from the following autofiling reports.

The table is auto-generated by a tool written by @EgorBo but may be edited by hand as regression analysis produces results. The "Score" is the geomean regression across all architectures; benchmarks that did not regress (or were not reported) on some architectures are assumed to have produced the same results with and without PGO. "Recent Score" is the current performance (as of 2023-06-06) versus the non-PGO result; "Orig Score" is based on the results of auto filing. The two will differ if benchmark performance has improved or regressed since the auto filing ran (see for example row 4's results for System.Text.Json.Tests.Perf_Get.GetUInt16, which has improved already).

Only the 54 entries with recent scores > 1.4 are included; this leaves off approximately 220 more rows with scores between 1.4 and 1.0. Our plan is to prioritize investigation of these benchmarks initially, as they have the largest aggregate regressions. If time permits we will regenerate this chart to pick up the impact of any fixes and see how much of the remainder we can tackle.

Each arch/os result is a hyperlink to the performance data graph for that benchmark. Note we currently have no autofiling data for win-x64-intel. If/when that shows up we will regenerate the table.

cc @dotnet/jit-contrib

Per-arch columns, in order: win-x64-intel, win-x64-amd, lin-x64-intel, win-arm64-surface, win-arm64-ampere, lin-arm64-ampere. Each row lists only the architectures that reported, as recent / orig pairs in column order.

| Notes | Recent Score | Orig Score | Per-arch results (recent / orig) | Benchmark |
| --- | --- | --- | --- | --- |
| | 3.66 | 3.66 | 2.68 / 2.68, 4.60 / 4.73, 2.00 / 2.00, 3.54 / 3.54 | System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64EncodeInPlace(NumberOfBytes: 200000000) |
| | 2.80 | 2.97 | 1.86 / 1.94, 2.09 / 2.17, 1.28 / 1.46, 1.22 / 1.18, 1.27 / 1.44 | System.Text.Json.Tests.Perf_Get.GetSByte |
| | 2.79 | 2.83 | 1.90 / 1.85, 3.84 / 3.83, 1.30 / 1.27, 2.45 / 2.70 | System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64DecodeInPlace(NumberOfBytes: 200000000) |
| | 2.65 | 3.41 | 1.48 / 2.24, 1.49 / 2.22, 1.17 / 1.62, 1.19 / 1.37, 1.17 / 1.59 | System.Text.Json.Tests.Perf_Get.GetUInt16 |
| | 2.53 | 2.60 | 1.31 / 1.38, 1.37 / 1.38, 1.21 / 1.42, 1.30 / 1.31, 1.35 / 1.28 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToUtf8Bytes(Mode: SourceGen) |
| | 2.52 | 3.32 | 1.49 / 2.24, 1.42 / 2.14, 1.14 / 1.58, 1.39 / 1.66, 1.14 / 1.56 | System.Text.Json.Tests.Perf_Get.GetByte |
| | 2.50 | 2.56 | 1.33 / 1.39, 1.37 / 1.46, 1.21 / 1.28, 1.16 / 1.15, 1.36 / 1.39 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToWriter(Mode: SourceGen) |
| | 2.50 | 2.64 | 1.36 / 1.46, 1.38 / 1.42, 1.16 / 1.38, 1.15 / 1.44, 1.39 / 1.26 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToStream(Mode: SourceGen) |
| | 2.49 | 2.53 | 1.43 / 1.43, 1.30 / 1.46, 1.22 / 1.23, 1.17 / 1.26, 1.38 / 1.25 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeToString(Mode: SourceGen) |
| | 2.35 | 2.47 | 2.10 / 2.19, 1.92 / 1.99, 1.35 / 1.49, 1.33 / 1.47 | System.Text.Json.Tests.Perf_Get.GetInt16 |
| | 2.22 | 2.22 | 1.23 / 1.23, 1.45 / 1.45, 1.27 / 1.33, 1.53 / 1.44 | System.Tests.Perf_UInt64.TryParseHex(value: "0") |
| | 2.10 | 2.10 | 1.48 / 1.48, 1.44 / 1.43, 1.35 / 1.28, 1.25 / 1.29 | System.Collections.Sort(IntStruct).Array(Size: 512) |
| | 2.04 | 2.10 | 1.31 / 1.34, 1.34 / 1.37, 1.22 / 1.50, 1.35 / 1.19 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableDictionary(String, String)).SerializeObjectProperty(Mode: SourceGen) |
| | 2.04 | 2.18 | 1.63 / 1.71, 1.42 / 1.44, 1.03 / 1.41, 1.19 / 1.29 | System.Collections.Sort(IntStruct).List(Size: 512) |
| | 1.97 | 1.99 | 1.33 / 1.39, 1.26 / 1.22, 1.12 / 1.19, 1.23 / 1.24 | System.Memory.Span(Int32).IndexOfValue(Size: 512) |
| | 1.95 | 2.60 | 1.15 / 3.31, 1.05 / 1.29, 1.08 / 1.56, 1.08 / 1.08 | System.Text.Json.Tests.Perf_Get.GetInt32 |
| | 1.92 | 1.88 | 3.36 / 3.04, 2.94 / 2.27 | System.Memory.Span(Int32).SequenceEqual(Size: 4) |
| | 1.92 | 2.04 | 1.19 / 1.43, 1.15 / 1.23, 1.20 / 1.30, 1.17 / 1.27 | System.Text.Json.Serialization.Tests.WriteJson(ImmutableSortedDictionary(String, String)).SerializeToWriter(Mode: SourceGen) |
| | 1.87 | 1.97 | 0.99 / 1.11, 1.14 / 1.50, 1.31 / 1.22, 1.09 / 1.10 | System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: Url,&lorem ipsum=dolor sit amet,16) |
| | 1.87 | 1.86 | 1.47 / 1.46, 1.16 / 1.13, 1.18 / 1.18, 1.20 / 1.19 | System.Collections.ContainsTrue(Int32).List(Size: 512) |
| | 1.83 | 1.84 | 1.27 / 1.31, 1.21 / 1.21, 1.17 / 1.17, 1.17 / 1.17 | System.Collections.ContainsTrue(Int32).ICollection(Size: 512) |
| | 1.80 | 1.89 | 1.05 / 1.38, 1.30 / 1.30, 1.20 / 1.21, 1.19 / 1.19 | System.Collections.ContainsTrue(Int32).Array(Size: 512) |
| | 1.73 | 1.82 | 0.89 / 1.16, 1.13 / 1.14, 1.34 / 1.45, 1.16 / 1.14 | System.Memory.Span(Int32).LastIndexOfValue(Size: 512) |
| | 1.72 | 2.66 | 0.84 / 3.31, 0.93 / 1.48, 0.84 / 1.56, 0.85 / 1.07 | System.Text.Json.Tests.Perf_Get.GetUInt32 |
| | 1.72 | 1.73 | 1.07 / 1.07, 1.28 / 1.30, 1.07 / 1.06, 1.07 / 1.08 | System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf16(arguments: Url,&lorem ipsum=dolor sit amet,16) |
| | 1.71 | 1.76 | 1.08 / 1.18, 1.11 / 1.15, 1.10 / 1.14, 1.10 / 1.10 | System.Collections.IterateForEach(Int32).ImmutableHashSet(Size: 512) |
| | 1.71 | 1.83 | 0.94 / 1.13, 0.95 / 1.05, 0.99 / 1.11, 1.01 / 1.10 | System.Text.Json.Tests.Perf_Get.GetInt64 |
| | 1.67 | 1.66 | 1.29 / 1.23, 1.10 / 1.10, 1.12 / 1.11 | System.Memory.Span(Byte).IndexOfValue(Size: 512) |
| | 1.66 | 1.66 | 1.34 / 1.33, 1.08 / 1.06, 1.08 / 1.08 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Mariomkas.Count(Pattern: "(?:(?:25[0-5] |
| | 1.64 | 1.65 | 1.15 / 1.19, 1.08 / 1.10, 1.11 / 1.13 | System.Memory.Span(Byte).LastIndexOfValue(Size: 512) |
| | 1.63 | 1.64 | 1.10 / 1.12, 1.28 / 1.29, 1.41 / 1.41 | System.Tests.Perf_UInt64.TryParseHex(value: "3039") |
| | 1.60 | 1.59 | 1.40 / 1.34, 1.16 / 1.17, 1.15 / 1.16 | Benchstone.BenchI.BubbleSort2.Test |
| | 1.60 | 1.59 | 1.33 / 1.30, 1.16 / 1.16, 1.18 / 1.16 | System.Collections.ContainsTrue(Int32).ImmutableArray(Size: 512) |
| | 1.58 | 1.58 | 1.85 / 1.85, 1.62 / 1.63 | System.Tests.Perf_Random.NextSingle |
| | 1.57 | 1.59 | 1.25 / 1.35, 1.15 / 1.14, 1.13 / 1.13 | System.Collections.ContainsTrue(Int32).Queue(Size: 512) |
| | 1.56 | 1.67 | 0.99 / 1.19, 0.98 / 1.15, 1.00 / 1.13 | System.Tests.Perf_Uri.EscapeDataString(input: "a{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{üa{ü |
| | 1.56 | 1.63 | 1.17 / 1.31, 1.48 / 1.46, 1.28 / 1.40 | System.Memory.Span(Char).LastIndexOfValue(Size: 512) |
| | 1.55 | 1.55 | 1.47 / 1.45, 1.21 / 1.21, 1.21 / 1.22 | System.Collections.ContainsTrue(Int32).Span(Size: 512) |
| | 1.54 | 1.54 | 1.20 / 1.20, 1.11 / 1.09, 1.08 / 1.08 | System.Memory.Span(Int32).IndexOfAnyFourValues(Size: 33) |
| | 1.54 | 1.52 | 1.16 / 1.17, 1.37 / 1.28, 1.34 / 1.32 | System.Tests.Perf_UInt32.TryParseHex(value: "3039") |
| | 1.52 | 1.51 | 1.15 / 1.20, 1.49 / 1.39 | System.Memory.Span(Byte).SequenceCompareToDifferent(Size: 512) |
| | 1.51 | 1.55 | 1.08 / 1.24, 1.13 / 1.13, 1.05 / 1.10 | System.Text.Perf_Utf8Encoding.GetBytes(Input: Chinese) |
| | 1.50 | 1.61 | 1.26 / 1.47, 1.22 / 1.72 | System.Tests.Perf_Guid.GuidToString |
| | 1.48 | 1.49 | 1.20 / 1.19, 1.18 / 1.24 | System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 16,16 bits) |
| | 1.47 | 1.49 | 1.30 / 1.35, 1.03 / 1.08 | System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: JavaScript,&Hello+(World)!,16) |
| | 1.46 | 1.62 | 0.78 / 1.09, 0.88 / 1.08, 0.87 / 1.07 | System.Text.Json.Tests.Perf_Get.GetString |
| | 1.45 | 1.46 | 1.54 / 1.64, 1.88 / 1.82 | System.Tests.Perf_UInt32.TryParseHex(value: "0") |
| | 1.45 | 1.48 | 1.27 / 1.41, 1.06 / 1.06, 1.09 / 1.09 | System.Memory.Span(Int32).IndexOfAnyFiveValues(Size: 33) |
| | 1.45 | 1.45 | 1.14 / 1.11, 1.06 / 1.10 | BenchmarksGame.BinaryTrees_5.RunBench |
| | 1.45 | 1.45 | 3.41 / 3.39 | System.Memory.Span(Int32).EndsWith(Size: 4) |
| | 1.43 | 1.43 | 1.05 / 1.05, 1.06 / 1.06 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "Sherlock |
| | 1.42 | 1.45 | 0.99 / 1.08, 1.06 / 1.10 | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Leipzig.Count(Pattern: "[a-z]shing", Options: Compiled) |
| | 1.40 | 1.41 | 1.22 / 1.22, 1.24 / 1.32 | LinqBenchmarks.Count00ForX |
| | 1.40 | 1.40 | 1.41 / 1.38, 1.09 / 1.13 | System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: Url,&lorem ipsum=dolor sit amet,512) |

Author: AndyAyersMS
Assignees: AndyAyersMS
Labels: area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Member Author

AndyAyersMS commented Jun 6, 2023

System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64EncodeInPlace(NumberOfBytes: 200000000)

image

System.Buffers.Text.Tests.Base64EncodeDecodeInPlaceTests.Base64DecodeInPlace(NumberOfBytes: 200000000)

image

These benchmarks were analyzed before PGO was enabled: #84264 (comment)

BDN's strategy doesn't run the benchmark enough, because each iteration is long running, and so (since the key benchmark methods are R2R'd) the test ends up measuring tier1-instr code.

@AndyAyersMS
Member Author

AndyAyersMS commented Jun 6, 2023

System.Text.Json.Tests.Perf_Get.GetInt16

image

System.Text.Json.Tests.Perf_Get.GetSByte

image

System.Text.Json.Tests.Perf_Get.GetByte

image

Likely the analysis from #84264 (comment) is still relevant and explains the related tests' regressions as well: we run out of inlining budget, in part because the benchmark method is small and there are quite a few large aggressive-inline methods, so we're unable to do some key inlines.

Tracking issue for this is #85531.

Some of these improved with #86551.

@AndyAyersMS
Member Author

AndyAyersMS commented Jun 6, 2023

System.Tests.Perf_UInt32.TryParseHex(value: "0")

image

This regresses across the board, so not sure why we don't have more autofiling for it.

I can't repro on win-x64

Method Job Toolchain value Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
TryParseHex Job-EEXSZV \base\corerun.exe 0 7.860 ns 0.1366 ns 0.1573 ns 7.814 ns 7.640 ns 8.281 ns 1.00 0.00 - NA
TryParseHex Job-EZIDSX \diff\corerun.exe 0 7.834 ns 0.1461 ns 0.1683 ns 7.783 ns 7.634 ns 8.344 ns 1.00 0.02 - NA

But I can (perhaps) on win-arm64 (volterra)

Method Job Toolchain value Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
TryParseHex Job-DFURZJ \base\corerun.exe 0 3.265 ns 0.0267 ns 0.0237 ns 3.257 ns 3.235 ns 3.316 ns 1.00 0.00 - NA
TryParseHex Job-URZCKJ \diff\corerun.exe 0 3.876 ns 0.0899 ns 0.0751 ns 3.849 ns 3.794 ns 4.034 ns 1.19 0.02 - NA
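
One way to read "can't repro" versus "perhaps" is to check whether the two means are separated by more than their error bars. A rough sketch, assuming BenchmarkDotNet's Error column is the half-width of the confidence interval around the mean; the helper name `intervals_overlap` is made up for illustration:

```python
def intervals_overlap(base_mean, base_err, diff_mean, diff_err):
    """True if [mean - err, mean + err] for base and diff intersect."""
    base_lo, base_hi = base_mean - base_err, base_mean + base_err
    diff_lo, diff_hi = diff_mean - diff_err, diff_mean + diff_err
    return base_lo <= diff_hi and diff_lo <= base_hi

# (mean ns, error ns) copied from the win-x64 and win-arm64 tables above.
win_x64 = intervals_overlap(7.860, 0.1366, 7.834, 0.1461)
win_arm64 = intervals_overlap(3.265, 0.0267, 3.876, 0.0899)
print(win_x64, win_arm64)
```

The win-x64 intervals overlap while the win-arm64 ones do not, which is consistent with only the arm64 run showing a difference.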
base

00.81%   4.1E+05     ?        Unknown
48.82%   2.458E+07   Tier-1   [491fed18-8a14-46f6-b30c-6320dc770919]Runnable_0.WorkloadActionUnroll(int64)
23.52%   1.184E+07   Tier-1   [System.Private.CoreLib]Number.TryParseBinaryIntegerHexOrBinaryNumberStyle(value class System.ReadOnlySpan`1<!!0>,value class System.Globalization.NumberStyles,!!1&)
12.69%   6.39E+06    Tier-1   [MicroBenchmarks]Perf_UInt32.TryParseHex(class System.String)
08.92%   4.49E+06    Tier-1   [System.Private.CoreLib]NumberFormatInfo.<GetInstance>g__GetProviderNonNull|58_0(class System.IFormatProvider)
04.37%   2.2E+06     Tier-1   [System.Private.CoreLib]CastHelpers.IsInstanceOfClass(void*,class System.Object)

diff

50.62%   2.666E+07   Tier-1   [d735f3a6-719a-4885-9631-16a7aeff132c]Runnable_0.WorkloadActionUnroll(int64)
23.94%   1.261E+07   Tier-1   [System.Private.CoreLib]Number.TryParseBinaryIntegerHexOrBinaryNumberStyle(value class System.ReadOnlySpan`1<!!0>,value class System.Globalization.NumberStyles,!!1&)
10.73%   5.65E+06    Tier-1   [MicroBenchmarks]Perf_UInt32.TryParseHex(class System.String)
08.62%   4.54E+06    Tier-1   [System.Private.CoreLib]NumberFormatInfo.<GetInstance>g__GetProviderNonNull|58_0(class System.IFormatProvider)
04.08%   2.15E+06    Tier-1   [System.Private.CoreLib]CastHelpers.IsInstanceOfClass(void*,class System.Object)

Note the very high overhead. WorkloadActionUnroll should have the AggressiveOptimization attribute so its codegen does not vary.

PGO codegen for TryParseBinaryIntegerHexOrBinaryNumberStyle looks good, all the hot code is adjacent and the method is straight line code. So not clear why it is ~7% or so slower.

Path lengths are similar but PGO does one extra STP/LDP and a few more register arg moves. So perhaps that's the explanation?

@AndyAyersMS
Member Author

AndyAyersMS commented Jun 6, 2023

System.Memory.Span<Int32>.EndsWith(Size: 4)

image

System.Memory.Span<Int32>.SequenceEqual(Size: 4)

image

These two appear to be arm64 specific.


Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
EndsWith Job-INARIQ \base-rel\corerun.exe 4 2.712 ns 0.0209 ns 0.0163 ns 2.712 ns 2.680 ns 2.734 ns 1.00 0.00 - NA
EndsWith Job-YXKXJS \diff-rel\corerun.exe 4 3.357 ns 0.2569 ns 0.2959 ns 3.175 ns 3.079 ns 3.928 ns 1.23 0.10 - NA
base
01.52%   5.7E+05     ?        Unknown
35.26%   1.319E+07   Tier-1   [18850c97-dca6-4aad-8393-0715e0b95a4a]Runnable_0.WorkloadActionUnroll(int64)
34.51%   1.291E+07   Tier-1   [MicroBenchmarks]System.Memory.Span`1[System.Int32].EndsWith()
27.72%   1.037E+07   Tier-1   [System.Private.CoreLib]SpanHelpers.SequenceEqual(unsigned int8&,unsigned int8&,unsigned int)

diff
01.29%   5.1E+05     ?        Unknown
35.20%   1.392E+07   Tier-1   [MicroBenchmarks]System.Memory.Span`1[System.Int32].EndsWith()
31.40%   1.242E+07   Tier-1   [72707307-0596-4e84-8515-1a5c07b80d3a]Runnable_0.WorkloadActionUnroll(int64)
31.10%   1.23E+07    Tier-1   [System.Private.CoreLib]SpanHelpers.SequenceEqual(unsigned int8&,unsigned int8&,unsigned int)

So the issue appears to be in SequenceEqual?
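
That reading of the two profiles can be made explicit by comparing each method's share of samples. A small sketch using the percentages above; the helper name `share_shift` and the shortened method names are illustrative, and it assumes the two runs are comparable enough for shares to be meaningful:

```python
def share_shift(base_shares, diff_shares):
    """Change in sample share (percentage points), diff minus base."""
    methods = set(base_shares) | set(diff_shares)
    return {m: diff_shares.get(m, 0.0) - base_shares.get(m, 0.0) for m in methods}

# Percent of samples per method, taken from the base and diff profiles.
base = {"WorkloadActionUnroll": 35.26, "EndsWith": 34.51, "SequenceEqual": 27.72}
diff = {"WorkloadActionUnroll": 31.40, "EndsWith": 35.20, "SequenceEqual": 31.10}

shift = share_shift(base, diff)
biggest_gain = max(shift, key=shift.get)
print(biggest_gain, round(shift[biggest_gain], 2))
```

SequenceEqual gains the most share (roughly 3.4 points), EndsWith moves by under a point, and the benchmark harness loop loses share.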

Seems like the PGO and non-PGO codegen is the same. The method is R2R'd, and with PGO we do go through tier1 instr but do not add any probes. Explanation: this method is an intrinsic and is not on the whitelist.

Fixing that (hack) gives:

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio Allocated Alloc Ratio
EndsWith Job-FBJVVQ \base-rel\corerun.exe 4 2.732 ns 0.0330 ns 0.0275 ns 2.729 ns 2.692 ns 2.793 ns 1.00 - NA
EndsWith Job-UKVWAD \diff-rel\corerun.exe 4 3.146 ns 0.0472 ns 0.0419 ns 3.153 ns 3.084 ns 3.208 ns 1.15 - NA
EndsWith Job-BJUBBV \hack-rel\corerun.exe 4 2.660 ns 0.0364 ns 0.0322 ns 2.659 ns 2.610 ns 2.731 ns 0.97 - NA

and more broadly

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
EndsWith Job-SDVJLL \base-rel\corerun.exe 4 2.685 ns 0.0479 ns 0.0448 ns 2.679 ns 2.627 ns 2.766 ns 1.00 0.00 - NA
EndsWith Job-FTSDAY \diff-rel\corerun.exe 4 3.151 ns 0.0178 ns 0.0158 ns 3.151 ns 3.126 ns 3.181 ns 1.17 0.02 - NA
EndsWith Job-EGIDWC \hack-rel\corerun.exe 4 3.125 ns 0.0394 ns 0.0368 ns 3.125 ns 3.080 ns 3.180 ns 1.16 0.02 - NA
EndsWith Job-SDVJLL \base-rel\corerun.exe 33 4.739 ns 0.0558 ns 0.0495 ns 4.728 ns 4.686 ns 4.840 ns 1.00 0.00 - NA
EndsWith Job-FTSDAY \diff-rel\corerun.exe 33 5.322 ns 0.0576 ns 0.0539 ns 5.299 ns 5.257 ns 5.446 ns 1.12 0.01 - NA
EndsWith Job-EGIDWC \hack-rel\corerun.exe 33 5.369 ns 0.0265 ns 0.0235 ns 5.369 ns 5.333 ns 5.419 ns 1.13 0.01 - NA
EndsWith Job-SDVJLL \base-rel\corerun.exe 512 32.020 ns 0.0209 ns 0.0175 ns 32.023 ns 31.999 ns 32.059 ns 1.00 0.00 - NA
EndsWith Job-FTSDAY \diff-rel\corerun.exe 512 32.422 ns 0.0351 ns 0.0328 ns 32.417 ns 32.383 ns 32.495 ns 1.01 0.00 - NA
EndsWith Job-EGIDWC \hack-rel\corerun.exe 512 31.819 ns 0.0370 ns 0.0346 ns 31.815 ns 31.776 ns 31.893 ns 0.99 0.00 - NA

But does not explain why there is a PGO regression. And as you can see the Size=4 results are not very stable.

Similar diffs for EndsWith (better layout, slightly higher prolog/epilog costs).

@AndyAyersMS
Member Author

System.Memory.Span<Int32>.SequenceCompareToDifferent(Size: 512)

image

Also seems to be arm64 specific.

AndyAyersMS added the Priority:2 label Jun 9, 2023
@AndyAyersMS
Member Author

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "zqj", Options: None)

These spiked up but then recovered and match their longer-term behavior

image
image

@AndyAyersMS
Member Author

Devirtualization.EqualityComparer.ValueTupleCompareWrapped

Fixed by physical promotion

image

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 17, 2023

Benchstone.BenchF.InvMt.Test

This one is more substantially regressed on amd64 HW...

image

This doesn't repro on my local Zen3 box

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22000.2176/21H2/SunValley)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
[Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2
Job-EGKMFB : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-KOAKXG : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method Job Toolchain Mean Error StdDev Median Min Max Ratio Gen0 Allocated Alloc Ratio
InvMt Job-EGKMFB \base-rel\corerun.exe 1.540 ms 0.0146 ms 0.0137 ms 1.540 ms 1.521 ms 1.568 ms 1.00 12.5000 105.07 KB 1.00
InvMt Job-KOAKXG \diff-rel\corerun.exe 1.513 ms 0.0077 ms 0.0072 ms 1.512 ms 1.504 ms 1.527 ms 0.98 12.5000 105.07 KB 1.00

Perf lab is running Ryzen 7 3700 PRO.

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 17, 2023

System.Tests.Perf_Random.NextSingle

Method Job Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
NextSingle Job-IRALDH \base-rel\corerun.exe 4.697 ns 0.0866 ns 0.0810 ns 4.696 ns 4.528 ns 4.796 ns 1.00 0.00 - NA
NextSingle Job-VNWIBR \diff-rel\corerun.exe 8.382 ns 0.1170 ns 0.1094 ns 8.394 ns 8.229 ns 8.574 ns 1.79 0.04 - NA

Profiling

base

00.95%   4.8E+05     ?        Unknown
36.70%   1.859E+07   Tier-1   [System.Private.CoreLib]Random+CompatPrng.InternalSample()
34.94%   1.77E+07    Tier-1   [System.Private.CoreLib]Random+Net5CompatSeedImpl.NextSingle()
12.38%   6.27E+06    Tier-1   [System.Private.CoreLib]Random.NextSingle()
07.54%   3.82E+06    Tier-1   [cb5bf117-1cbd-438a-bbce-a71239c42b3d]Runnable_0.WorkloadActionUnroll(int64)
06.85%   3.47E+06    Tier-1   [MicroBenchmarks]Perf_Random.NextSingle()
00.26%   1.3E+05     native   clrjit.dll
00.12%   6E+04       native   coreclr.dll
00.12%   6E+04       native   ntoskrnl.exe
00.10%   5E+04       native   ntdll.dll

diff

02.33%   2.07E+06    ?        Unknown
79.57%   7.061E+07   Tier-1   [System.Private.CoreLib]Random+Net5CompatSeedImpl.NextSingle()
14.97%   1.328E+07   Tier-1   [MicroBenchmarks]Perf_Random.NextSingle()
02.56%   2.27E+06    Tier-1   [c48f5bab-b8a5-4a49-b7a3-55a58e477b40]Runnable_0.WorkloadActionUnroll(int64)
00.21%   1.9E+05     native   clrjit.dll
00.17%   1.5E+05     native   coreclr.dll
00.14%   1.2E+05     native   ntoskrnl.exe

Looks like with PGO we inline InternalSample, and this hurts perf. Why?

Root cause is lack of if conversion in InternalSample, once PGO has inlined it into Random+Net5CompatSeedImpl.NextSingle.
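
If conversion replaces a short conditional branch with a branch-free select. The sketch below only illustrates the shape of that transformation; it is not the InternalSample source, the function names and the wrap limit are hypothetical, and Python itself does not emit conditional-select instructions.

```python
def next_index_branchy(i, limit=8):
    """Increment with wraparound, written with a conditional branch."""
    i += 1
    if i >= limit:
        i = 1
    return i

def next_index_select(i, limit=8):
    """Same result, written as an unconditional select between two values."""
    i += 1
    wrap = int(i >= limit)            # 1 when the index must wrap, else 0
    return wrap * 1 + (1 - wrap) * i  # picks 1 or i without branching on wrap

# Both forms agree for every starting index in this range.
same = all(next_index_branchy(i) == next_index_select(i) for i in range(0, 20))
print(same)
```

The select form has no branch to mispredict, at the cost of always computing both candidates.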

(with DOTNET_JitDoIfConversion=0)

Method Job Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
NextSingle Job-UYNTQT \base\corerun.exe 10.451 ns 0.1370 ns 0.1578 ns 10.424 ns 10.252 ns 10.94 ns 1.00 0.00 - NA
NextSingle Job-FNVAZN \diff\corerun.exe 9.709 ns 0.3340 ns 0.3846 ns 9.751 ns 8.154 ns 10.10 ns 0.93 0.04 - NA

@jakobbotsch here's an example where not if converting in loops is painful. PGO undoes the improvements that came in with #81267.

ARM64 does not seem to be affected for some reason.

image

#79101 may show a similar problem

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 17, 2023

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSingleSegment

Method Job Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
FirstSingleSegment Job-MECUZL \base-rel\corerun.exe 3.166 ns 0.0275 ns 0.0317 ns 3.156 ns 3.125 ns 3.248 ns 1.00 0.00 - NA
FirstSingleSegment Job-IYGZNT \diff-rel\corerun.exe 5.107 ns 0.0487 ns 0.0560 ns 5.089 ns 5.040 ns 5.237 ns 1.61 0.03 - NA
base
00.57%   4.51E+06    ?        Unknown
64.12%   5.094E+08   Tier-1   [MicroBenchmarks]System.Buffers.Tests.ReadOnlySequenceTests`1[System.Char].First(value class System.Buffers.ReadOnlySequence`1<!0>)
16.92%   1.344E+08   Tier-1   [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
09.23%   7.33E+07    Tier-1   [MicroBenchmarks]System.Buffers.Tests.ReadOnlySequenceTests`1[System.Char].FirstSingleSegment()
05.37%   4.269E+07   Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Char]..ctor(class System.Buffers.ReadOnlySequenceSegment`1<!0>,int32,class System.Buffers.ReadOnlySequenceSegment`1<!0>,int32)
02.28%   1.81E+07    Tier-1   [3e900f5f-7929-4719-a376-9aa860256700]Runnable_0.WorkloadActionUnroll(int64)
01.44%   1.142E+07   native   coreclr.dll

diff
00.45%   5.24E+06    ?        Unknown
57.87%   6.67E+08    Tier-1   [MicroBenchmarks]System.Buffers.Tests.ReadOnlySequenceTests`1[System.Char].First(value class System.Buffers.ReadOnlySequence`1<!0>)
29.19%   3.364E+08   Tier-1   [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
06.48%   7.467E+07   Tier-1   [MicroBenchmarks]System.Buffers.Tests.ReadOnlySequenceTests`1[System.Char].FirstSingleSegment()
03.18%   3.665E+07   Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Char]..ctor(class System.Buffers.ReadOnlySequenceSegment`1<!0>,int32,class System.Buffers.ReadOnlySequenceSegment`1<!0>,int32)
01.84%   2.116E+07   Tier-1   [60fe30c0-c2f0-4348-9bd8-9b632dabcde3]Runnable_0.WorkloadActionUnroll(int64)
00.91%   1.049E+07   native   coreclr.dll

Same set of inlines, similar optimizations.

Main delta is code layout, in both First and in ChkCastClassSpecial. Not clear why this causes such a big perf diff as the profile data should be accurate.


The issue is that by default we don't profile casts, and without profile data, we assume casts will fall back to the helper 25% of the time. This leads the jit to move all the cast calls to the end of the method, and so each call site must jump to the call and then jump back into the regular flow:

;; base
						;; size=41 bbWeight=0.50 PerfScore 7.12
G_M11072_IG110:  ;; offset=0C55H
       mov      rdx, r8
       call     [CORINFO_HELP_CHKCASTCLASS_SPECIAL]
						;; size=9 bbWeight=0.12 PerfScore 0.41
G_M11072_IG111:  ;; offset=0C5EH
       mov      rdx, gword ptr [rax+18H]

;; diff

       jne      SHORT G_M11072_IG52
						;; size=45 bbWeight=0.50 PerfScore 7.12
G_M11072_IG49:  ;; offset=0863H
       mov      rdx, gword ptr [rax+18H]
       ...

G_M11072_IG52:  ;; offset=08B8H
       mov      rdx, r8
       call     [CORINFO_HELP_CHKCASTCLASS_SPECIAL]
       jmp      SHORT G_M11072_IG49

The bbWeight of IG52 comes from QMARK expansion; do we assume a 25% chance?
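
The weights printed in the listing are consistent with that 25% figure. As a sketch (the function name and the idea of simply multiplying a parent block weight by a fallback likelihood are an illustrative simplification, not the JIT's actual code):

```python
ASSUMED_HELPER_LIKELIHOOD = 0.25  # assumed default when casts are not profiled

def helper_block_weight(parent_weight, helper_likelihood=ASSUMED_HELPER_LIKELIHOOD):
    """Weight given to the block that calls the cast helper."""
    return parent_weight * helper_likelihood

unprofiled = helper_block_weight(0.50)
# Hypothetical profiled case in which the helper was never needed.
profiled_never_taken = helper_block_weight(0.50, helper_likelihood=0.0)
print(unprofiled, profiled_never_taken)
```

0.50 * 0.25 = 0.125, in line with the bbWeight=0.12 shown for the helper-call block that follows a bbWeight=0.50 block.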

With DOTNET_JitProfileCasts=1:

Method Job Toolchain Mean Error StdDev Median Min Max Ratio Allocated Alloc Ratio
FirstSingleSegment Job-BCGWCL \base\corerun.exe 3.217 ns 0.0506 ns 0.0583 ns 3.245 ns 3.130 ns 3.276 ns 1.00 - NA
FirstSingleSegment Job-BXIQEJ \diff\corerun.exe 2.705 ns 0.0115 ns 0.0133 ns 2.704 ns 2.669 ns 2.736 ns 0.84 - NA

@EgorBo where did we end up in the evaluation of cast profiling?

@AndyAyersMS
Member Author

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "Sherlock Holmes", Options: Compiled)

Looks like this one is bimodal, especially on amd64

image

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 18, 2023

Burgers.Test1

This one is pretty stable except on linux x64:

image

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.100-preview.4.23260.5
[Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2
Job-AJUYHP : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-IYXLDU : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Burgers_1 | Job-PTZPWJ | /base-rel/corerun | 181.0 ms | 1.61 ms | 1.85 ms | 180.9 ms | 178.4 ms | 187.2 ms | 1.00 | 0.00 | 157.03 KB | 1.00 |
| Burgers_1 | Job-ZGFEKO | /diff-rel/corerun | 185.4 ms | 2.27 ms | 2.62 ms | 184.8 ms | 181.5 ms | 191.4 ms | 1.02 | 0.02 | 157.03 KB | 1.00 |

Can't repro this one locally. I think I have the same HW as the lab (i7-8700) but not sure.

Also does not repro on my old Sandy Bridge

BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 20.04
Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET SDK=8.0.100-preview.6.23330.14
[Host] : .NET 7.0.9 (7.0.923.32018), X64 RyuJIT AVX
Job-CWOHEC : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX
Job-ERFHVR : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Burgers_1 | Job-CWOHEC | /base-rel/corerun | 703.3 ms | 1.05 ms | 0.93 ms | 703.4 ms | 701.8 ms | 704.9 ms | 1.00 | 157.03 KB | 1.00 |
| Burgers_1 | Job-ERFHVR | /diff-rel/corerun | 704.8 ms | 2.09 ms | 1.74 ms | 704.2 ms | 703.1 ms | 709.1 ms | 1.00 | 157.03 KB | 1.00 |

@EgorBo
Member

EgorBo commented Jul 18, 2023

> @EgorBo where did we end up in the evaluation of cast profiling?

I think it should be good to enable it for complex types of casts, but not in .NET 8.0 (afair, we don't profile all types of casts + never checked actual codegen size overhead from it)

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 18, 2023

System.MathBenchmarks.Single.Min

image

Seems like it is bimodal and just happened to flip around the time we enabled Dynamic PGO.


Similar for Max

image

Recent regressions are #88482

@AndyAyersMS
Member Author

MicroBenchmarks.Serializers.Json_FromStream<MyEventsListerViewModel>>.DataContractJsonSerializer_

image

This is not a pure regression but rather an increase in variance (à la #87324), with lower lows and higher highs. Note that this is only the case for Windows x64 Intel (and arm64); Linux is relatively stable and faster with PGO, and Windows on AMD64 seems OK as well.

@AndyAyersMS
Member Author

System.Tests.Perf_Uri.EscapeDataString(input: "{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{

image

Generally looks to have recovered.

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 21, 2023

System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfNumbers)

image

Regression is linux-x64 only; win-x64 benefits from PGO, arm64 is more or less unchanged.


I can repro the regression on WSL2, however there aren't great profiling tools available there.

| Method | Job | Toolchain | TestCase | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EnumerateUsingIndexer | Job-OAGCJY | /base-rel/corerun | ArrayOfNumbers | 984.7 ns | 7.21 ns | 6.02 ns | 985.8 ns | 976.5 ns | 992.7 ns | 1.00 | - | NA |
| EnumerateUsingIndexer | Job-AJUODA | /diff-rel/corerun | ArrayOfNumbers | 1,123.7 ns | 7.15 ns | 6.69 ns | 1,121.5 ns | 1,111.4 ns | 1,134.2 ns | 1.14 | - | NA |

Nominal profile (from win-x64) is:

74.36%   3.889E+07   Tier-1   [System.Text.Json]JsonDocument.GetArrayIndexElement(int32,int32)
18.53%   9.69E+06    Tier-1   [MicroBenchmarks]Perf_EnumerateArray.EnumerateUsingIndexer()
06.08%   3.18E+06    native   coreclr.dll
00.34%   1.8E+05     native   clrjit.dll
00.19%   1E+05       Tier-1   [System.Text.Json]JsonDocument.GetArrayLength(int32)

Using BDN's -p EP, the profile looks similar for the run on WSL2 and suggests the regression is in GetArrayIndexElement. Codegen for this method differs (mainly streamlined layout w/PGO). The PGO code looks better to me. Without more details it is hard to figure out where else to look.

EnumerateUsingIndexer inlines GetArrayLength with PGO, but the loop invoking GetArrayIndexElement is identical and looks like it iterates ~300? times.

Actually, looking at the profile data, it seems wrong. A bit of digging turned up a bug in the 32-bit counter helper on Linux; see #89340.

Fixing that doesn't alter the codegen, it just makes the loop seem a bit less hot (but still very hot).

@AndyAyersMS
Member Author

System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateFromFile_Read(capacity: 10000000)

No regression here.

image

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 24, 2023

System.Collections.Sort<IntStruct>.List(Size: 512)

(see also #84264 (comment))

image

Did not regress on windows-x64, but did everywhere else.

System.Collections.Sort<IntStruct>.Array(Size: 512)

image

Ditto.


These two are quite likely the same issue. Last I looked into sorting, it was layout related; let's look again.

@AndyAyersMS
Member Author

System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 512)

Looks like noise

image

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 27, 2023

System.Collections.CtorFromCollection<Int32>.ConcurrentBag(Size: 512)

Improved on x64, regressed on arm64.

image

Ah, likely the same issue with barriers as noted below: #87194 (comment)

@AndyAyersMS
Member Author

System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToStream(Mode: SourceGen)

And related...

image

see previous analysis

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 28, 2023

System.Tests.Perf_HashCode.Combine_1

amd64 only

image

Does not repro on my local Zen3 machine:

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
[Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2
Job-XLXHLH : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-UGGZHH : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Combine_1 | Job-XLXHLH | \base-rel\corerun.exe | 62.43 us | 0.298 us | 0.249 us | 62.44 us | 62.01 us | 63.00 us | 1.00 | - | NA |
| Combine_1 | Job-UGGZHH | \diff-rel\corerun.exe | 61.18 us | 0.137 us | 0.128 us | 61.21 us | 60.93 us | 61.35 us | 0.98 | - | NA |

In fact the whole suite looks pretty good, save perhaps _4:

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
[Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2
Job-WPDJSJ : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-YNXXXX : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Combine_1 | Job-WPDJSJ | \base-rel\corerun.exe | 62.04 us | 0.139 us | 0.108 us | 62.04 us | 61.80 us | 62.23 us | 1.00 | - | NA |
| Combine_1 | Job-YNXXXX | \diff-rel\corerun.exe | 60.92 us | 0.096 us | 0.090 us | 60.86 us | 60.85 us | 61.10 us | 0.98 | - | NA |
| Combine_2 | Job-WPDJSJ | \base-rel\corerun.exe | 79.40 us | 0.182 us | 0.162 us | 79.40 us | 79.16 us | 79.74 us | 1.00 | - | NA |
| Combine_2 | Job-YNXXXX | \diff-rel\corerun.exe | 78.70 us | 0.091 us | 0.085 us | 78.67 us | 78.62 us | 78.85 us | 0.99 | - | NA |
| Combine_3 | Job-WPDJSJ | \base-rel\corerun.exe | 69.93 us | 0.093 us | 0.087 us | 69.90 us | 69.85 us | 70.11 us | 1.00 | - | NA |
| Combine_3 | Job-YNXXXX | \diff-rel\corerun.exe | 69.88 us | 0.040 us | 0.034 us | 69.87 us | 69.86 us | 69.97 us | 1.00 | - | NA |
| Combine_4 | Job-WPDJSJ | \base-rel\corerun.exe | 83.42 us | 0.035 us | 0.030 us | 83.41 us | 83.39 us | 83.50 us | 1.00 | - | NA |
| Combine_4 | Job-YNXXXX | \diff-rel\corerun.exe | 95.03 us | 0.013 us | 0.010 us | 95.03 us | 95.01 us | 95.05 us | 1.14 | - | NA |
| Combine_5 | Job-WPDJSJ | \base-rel\corerun.exe | 72.14 us | 0.012 us | 0.011 us | 72.14 us | 72.13 us | 72.17 us | 1.00 | - | NA |
| Combine_5 | Job-YNXXXX | \diff-rel\corerun.exe | 72.15 us | 0.059 us | 0.049 us | 72.13 us | 72.12 us | 72.29 us | 1.00 | - | NA |
| Combine_6 | Job-WPDJSJ | \base-rel\corerun.exe | 83.46 us | 0.104 us | 0.092 us | 83.40 us | 83.38 us | 83.65 us | 1.00 | - | NA |
| Combine_6 | Job-YNXXXX | \diff-rel\corerun.exe | 83.43 us | 0.064 us | 0.057 us | 83.41 us | 83.39 us | 83.55 us | 1.00 | - | NA |
| Combine_7 | Job-WPDJSJ | \base-rel\corerun.exe | 94.72 us | 0.112 us | 0.093 us | 94.68 us | 94.66 us | 94.93 us | 1.00 | - | NA |
| Combine_7 | Job-YNXXXX | \diff-rel\corerun.exe | 94.86 us | 0.177 us | 0.166 us | 94.81 us | 94.69 us | 95.26 us | 1.00 | - | NA |
| Combine_8 | Job-WPDJSJ | \base-rel\corerun.exe | 116.49 us | 0.078 us | 0.065 us | 116.46 us | 116.43 us | 116.66 us | 1.00 | - | NA |
| Combine_8 | Job-YNXXXX | \diff-rel\corerun.exe | 117.22 us | 0.260 us | 0.243 us | 117.18 us | 116.88 us | 117.69 us | 1.01 | - | NA |

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 28, 2023

System.Memory.ReadOnlySequence.Slice_Repeat(Segment: Multiple)

Win-x64 only

image

This one repros

BenchmarkDotNet v0.13.7-nightly.20230717.35, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
[Host] : .NET 6.0.20 (6.0.2023.32017), X64 RyuJIT AVX2
Job-RBZAAU : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-AJWTXX : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

| Method | Job | Toolchain | Segment | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Slice_Repeat | Job-RBZAAU | \base-rel\corerun.exe | Multiple | 32.03 ns | 0.020 ns | 0.017 ns | 32.03 ns | 32.01 ns | 32.06 ns | 1.00 | - | NA |
| Slice_Repeat | Job-AJWTXX | \diff-rel\corerun.exe | Multiple | 43.23 ns | 0.274 ns | 0.243 ns | 43.16 ns | 42.93 ns | 43.75 ns | 1.35 | - | NA |

Issue here seems to be that with PGO we mark a call site that takes V05 (struct local) as rare and don't inline it, and so V05 ends up getting address exposed and has more expensive copy semantics.

@EgorBo this is an example where not doing an inline in a cold block impacts codegen in a hot block.

base
01.75%   8.8E+05     ?        Unknown
56.39%   2.828E+07   Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].Slice(int64,int64)
13.86%   6.95E+06    Tier-1   [MicroBenchmarks]ReadOnlySequence.Slice_Repeat()
10.91%   5.47E+06    Tier-1   [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
09.71%   4.87E+06    native   coreclr.dll
06.28%   3.15E+06    Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].SeekMultiSegment(class System.Buffers.ReadOnlySequenceSegment`1<!0>,class System.Object,int32,int64,value class System.ExceptionArgument)
00.62%   3.1E+05     Tier-1   [7c853c35-6121-4c85-8327-3f1f8585f3b1]Runnable_0.WorkloadActionUnroll(int64)
00.28%   1.4E+05     native   clrjit.dll
00.12%   6E+04       native   ntoskrnl.exe
00.06%   3E+04       native   ntdll.dll

diff

00.69%   3.5E+05     ?        Unknown
74.57%   3.797E+07   Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].Slice(int64,int64)
11.82%   6.02E+06    Tier-1   [MicroBenchmarks]ReadOnlySequence.Slice_Repeat()
05.32%   2.71E+06    native   coreclr.dll
03.69%   1.88E+06    Tier-1   [System.Memory]System.Buffers.ReadOnlySequence`1[System.Byte].SeekMultiSegment(class System.Buffers.ReadOnlySequenceSegment`1<!0>,class System.Object,int32,int64,value class System.ExceptionArgument)
03.18%   1.62E+06    Tier-1   [System.Private.CoreLib]CastHelpers.ChkCastClassSpecial(void*,class System.Object)
00.37%   1.9E+05     native   ntoskrnl.exe
00.26%   1.3E+05     native   clrjit.dll

@AndyAyersMS
Member Author

System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateUsingIndexer(TestCase: ArrayOfStrings)

image

x64 linux only. Likely the same as #87194 (comment)

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 28, 2023

Span.Sorting.QuickSortSpan(Size: 512)

Windows intel x64 only.

image

| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| QuickSortSpan | Job-FBLSKS | \base-rel\corerun.exe | 512 | 12.76 us | 2.346 us | 2.304 us | 13.86 us | 8.584 us | 15.80 us | 1.00 | 0.00 | - | NA |
| QuickSortSpan | Job-UVWYZR | \diff-rel\corerun.exe | 512 | 14.04 us | 3.065 us | 3.530 us | 13.52 us | 10.049 us | 19.78 us | 1.18 | 0.42 | - | NA |

At first look, it seems like BDN is not iterating this enough... we are measuring Tier0 code.

base
55.66%   2.31E+06    Tier-0   [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
18.80%   7.8E+05     native   clrjit.dll
15.18%   6.3E+05     Tier-1   [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
04.82%   2E+05       native   coreclr.dll

diff
37.39%   1.23E+06    Tier-1   [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
31.91%   1.05E+06    native   coreclr.dll
26.44%   8.7E+05     Tier-0   [MicroBenchmarks]Sorting.TestQuickSortSpan(value class System.Span`1<int32>)
03.04%   1E+05       native   clrjit.dll

The per-iteration times tell a similar story:

base
000 1021.818 -- 1052.150 : 30.332
001 1053.657 -- 1083.828 : 30.171
002 1085.532 -- 1115.977 : 30.445
003 1117.422 -- 1147.984 : 30.562
004 1149.518 -- 1180.699 : 31.181
005 1182.397 -- 1213.166 : 30.769
006 1214.624 -- 1245.324 : 30.700
007 1247.707 -- 1273.592 : 25.886
008 1275.115 -- 1283.721 : 8.607
009 1285.118 -- 1293.938 : 8.820
010 1295.437 -- 1304.091 : 8.655
011 1305.529 -- 1314.128 : 8.599
012 1315.535 -- 1324.085 : 8.550
013 1325.503 -- 1334.060 : 8.557
014 1335.444 -- 1343.915 : 8.471

diff
000 816.876 -- 872.133 : 55.258
001 873.602 -- 940.129 : 66.526
002 942.247 -- 997.177 : 54.931
003 998.693 -- 1020.673 : 21.980
004 1022.154 -- 1032.488 : 10.334
005 1033.906 -- 1044.327 : 10.421
006 1045.780 -- 1056.235 : 10.455
007 1059.274 -- 1069.671 : 10.398
008 1071.102 -- 1081.520 : 10.418
009 1082.967 -- 1093.448 : 10.481
010 1094.849 -- 1105.169 : 10.320
011 1106.637 -- 1117.139 : 10.502
012 1118.663 -- 1129.091 : 10.427
013 1130.485 -- 1140.868 : 10.384
014 1142.225 -- 1152.484 : 10.260

For diff (PGO) we eagerly instrument, so the Tier0 code is slower. But even so, the optimized code is slower...
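To separate the warm-up effect from the steady-state effect, one rough approach is to keep only the iterations that are close to the fastest one and compare those. The sketch below uses the per-iteration durations from the two listings above; the 10% tolerance and the helper names are arbitrary choices for illustration, not something BDN does.

```python
# Per-iteration durations (ms), copied from the base and diff listings above.
base = [30.332, 30.171, 30.445, 30.562, 31.181, 30.769, 30.700, 25.886,
        8.607, 8.820, 8.655, 8.599, 8.550, 8.557, 8.471]
diff = [55.258, 66.526, 54.931, 21.980,
        10.334, 10.421, 10.455, 10.398, 10.418, 10.481, 10.320,
        10.502, 10.427, 10.384, 10.260]

def steady_state(durations, tolerance=1.10):
    """Keep iterations within `tolerance` of the fastest one.
    Assumption: those iterations ran fully tiered-up code."""
    fastest = min(durations)
    return [d for d in durations if d <= fastest * tolerance]

def mean(xs):
    return sum(xs) / len(xs)

base_steady = steady_state(base)
diff_steady = steady_state(diff)

print(len(base_steady), len(diff_steady))
print(f"all iterations: {mean(diff) / mean(base):.2f}")
print(f"steady state:   {mean(diff_steady) / mean(base_steady):.2f}")
```

With this filter the base run keeps only its last 7 iterations while the diff run keeps 11, and the steady-state ratio comes out at roughly 1.2x, noticeably worse than the ratio over all 15 iterations. That is consistent with the mean being diluted by Tier0 iterations.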

@adamsitnik seems like we should up the iterations per invocation for these tests to something like 10_000 (at 1000, each benchmark interval is only 10ms).
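The interval estimate is simple arithmetic: invocations per iteration times the per-invocation cost. A small sketch, assuming roughly 10 us per invocation (taken from the steady-state per-iteration times above); the function name is illustrative only.

```python
def iteration_ms(per_invocation_us: float, invocations: int) -> float:
    """Approximate wall-clock length of one benchmark iteration, in ms."""
    return per_invocation_us * invocations / 1000.0

# Current setting (1000) versus the proposed one (10_000).
for n in (1_000, 10_000):
    print(n, iteration_ms(10.0, n))
```

At 10_000 invocations each iteration would run for about 100 ms, which is closer to the 250 ms IterationTime configured for the other benchmarks in this thread.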


I think this might be caused by the JCC erratum. Main optimizations are identical, but code layout differs.

;;
       cmp      dword ptr [rbx+4*r8], edx
       jge      SHORT G_M24415_IG04
						;; size=17 bbWeight=5.09 PerfScore 27.98
G_M24415_IG07:  ;; offset=0041H

Note how in diff that jge straddles a 32-byte boundary. Profiling shows there is a prominent peak at offset 0x2A that is not there in the baseline version.
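The boundary claim can be checked from the offsets in the listing. The JCC erratum mitigation concerns jumps that cross, or end on, a 32-byte boundary. The sketch below assumes the jge uses the short (rel8) encoding, which is 2 bytes, and that it is the last instruction before IG07 at offset 0x41; the helper name is made up for illustration.

```python
def touches_boundary(start: int, length: int, boundary: int = 32) -> bool:
    """True if the instruction occupying [start, start + length) crosses a
    `boundary`-aligned address or ends exactly on one."""
    end = start + length  # address of the first byte after the instruction
    crosses = (start // boundary) != ((end - 1) // boundary)
    ends_on = end % boundary == 0
    return crosses or ends_on

IG07_OFFSET = 0x41    # from the listing above
JGE_SHORT_LEN = 2     # assumption: opcode byte + 8-bit displacement
jge_start = IG07_OFFSET - JGE_SHORT_LEN

print(hex(jge_start), touches_boundary(jge_start, JGE_SHORT_LEN))  # prints 0x3f True
```

Under those assumptions the jge occupies bytes 0x3F and 0x40, so it straddles the boundary at 0x40.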

@AndyAyersMS
Member Author

System.Memory.ReadOnlySequence.Slice_Start_And_Length(Segment: Multiple)

also win-x64 only

image

Same underlying issue as #87194 (comment)

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 28, 2023

System.Collections.CtorFromCollection<String>.ConcurrentBag(Size: 512)

arm64 only

image

Repros on my volterra:

BenchmarkDotNet v0.13.7-nightly.20230724.45, Windows 11 (10.0.22621.2070/22H2/2022Update/SunValley2)
Snapdragon 8cx Gen 3 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.100-preview.6.23330.14
[Host] : .NET 8.0.0 (8.0.23.32907), Arm64 RyuJIT AdvSIMD
Job-EGKPHU : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-FFUPKC : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ConcurrentBag | Job-EGKPHU | \base-rel\corerun.exe | 512 | 15.78 us | 0.107 us | 0.100 us | 15.73 us | 15.68 us | 16.00 us | 1.00 | 2.6309 | 2.5667 | 0.0642 | 16.16 KB | 1.00 |
| ConcurrentBag | Job-FFUPKC | \diff-rel\corerun.exe | 512 | 22.78 us | 0.119 us | 0.112 us | 22.76 us | 22.63 us | 22.99 us | 1.44 | 2.5491 | 2.4547 | - | 16.16 KB | 1.00 |

base
12.70%   4.89E+06    ?        Unknown
43.43%   1.672E+07   Tier-1   [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentBag`1+WorkStealingQueue[System.__Canon].LocalPush(!0,int64&)
32.36%   1.246E+07   native   coreclr.dll
03.12%   1.2E+06     native   ntoskrnl.exe
02.68%   1.03E+06    Tier-1   [System.Private.CoreLib]System.SZGenericArrayEnumerator`1[System.__Canon].get_Current()
02.05%   7.9E+05     Tier-1   [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentBag`1[System.__Canon]..ctor(class System.Collections.Generic.IEnumerable`1<!0>)
01.51%   5.8E+05     Tier-1   [System.Private.CoreLib]SZGenericArrayEnumeratorBase.MoveNext()

diff

06.89%   2.71E+06    ?        Unknown
60.37%   2.376E+07   Tier-1   [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentBag`1+WorkStealingQueue[System.__Canon].LocalPush(!0,int64&)
23.04%   9.07E+06    native   coreclr.dll
02.69%   1.06E+06    Tier-1   [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentBag`1[System.__Canon]..ctor(class System.Collections.Generic.IEnumerable`1<!0>)
02.59%   1.02E+06    Tier-1   [System.Private.CoreLib]System.SZGenericArrayEnumerator`1[System.__Canon].get_Current()
02.57%   1.01E+06    native   ntoskrnl.exe

So issue is evidently in LocalPush.

In the base (no-PGO) jit we form a CSE and this lets us use ldapr; in the diff (PGO) jit we don't do the CSE and end up emitting a separate barrier. That may be the culprit.

BASE

Generating: N507 (  1,  1) [000436] -----------                  t436 =    LCL_VAR   byref  V16 cse0          x28 REG x28 $1c8
                                                                        /--*  t436   byref  
Generating: N509 (  3,  2) [000174] V---GO-----                  t174 = *  IND       ref    REG x0 $156
IN0062:             ldapr   x0, [x28]

DIFF

Generating: N347 (  1,  1) [000172] -----------                  t172 =    LCL_VAR   ref    V00 this         u:1 x22 REG x22 $80
                                                                        /--*  t172   ref    
Generating: N349 (  3,  4) [000356] -c---------                  t356 = *  LEA(b+8)  byref  REG NA
                                                                        /--*  t356   byref  
Generating: N351 (  6,  6) [000174] V---GO-----                  t174 = *  IND       ref    REG x0 $156
IN00ac:             ldr     x0, [x22, #0x08]
IN00ad:             dmb     ishld

In base there are just two sampling hot spots, both tied to ldapr:

0x0044 : 1037
0x00DC : 162

IN0005: 000040      swpal   w1, w1, [x21]
IN0006: 000044      add     x22, x0, #28
IN0007: 000048      ldapr   w23, [x22]

IN002b: 0000D8      ldapr   w25, [x24]
IN002c: 0000DC      ldrb    w1, [x0, #0x34]

In diff there are more hot spots, and all hottest samples are near the dmbs.

0x003C : 655
0x0340 : 367
0x00C4 : 239
0x00AC : 201
0x0064 : 208

IN0006: 00003C      ldr     w1, [x0, #0x1C]
IN0007: 000040      dmb     ishld

IN00c6: 00033C      dmb     ish
IN00c7: 000340      str     wzr, [x0, #0x2C]

IN0027: 0000C0      dmb     ish
IN0028: 0000C4      str     w1, [x0, #0x1C]

IN0021: 0000A8      dmb     ishld
IN0022: 0000AC      ldr     x2, [fp, #0x28]	// [V01 arg1]

IN000f: 000060      dmb     ishld
IN0010: 000064      ldrb    w1, [x0, #0x34]

@EgorBo example we were chatting about.

@adamsitnik
Member

> @adamsitnik seems like we should up the iterations per invocation for these tests to something like 10_000 (at 1000, each benchmark interval is only 10ms).

I took a look at the benchmark source code and it's safe to do it (the benchmark ID won't change, as the InvocationsPerIteration const is not an argument or a parameter for this benchmark)

@AndyAyersMS
Member Author

While there are still a few benchmarks where the analysis is unclear, they are isolated to specific OS/ABI combinations. So I'm going to close this out.

@DrewScoggins
Member

Did BubbleSort2 ever get looked at?

image

@DrewScoggins
Member

image

@adamsitnik
Member

Please keep in mind that both bubble sort and IndexOf are heavily dependent on memory alignment, so it can be unrelated to PGO. The easiest way to verify is to allocate an aligned memory using NativeMemory.AlignedAlloc, create a span out of it and try to repro.

@DrewScoggins
Member

> Please keep in mind that both bubble sort and IndexOf are heavily dependent on memory alignment, so it can be unrelated to PGO. The easiest way to verify is to allocate an aligned memory using NativeMemory.AlignedAlloc, create a span out of it and try to repro.

Makes sense. I am just adding tests here that were not previously included, but regressed over the same commit range as this check-in.

@DrewScoggins
Member

image

@DrewScoggins
Member

Major instability starting with this commit.

image

@AndyAyersMS
Member Author

> > Please keep in mind that both bubble sort and IndexOf are heavily dependent on memory alignment, so it can be unrelated to PGO. The easiest way to verify is to allocate an aligned memory using NativeMemory.AlignedAlloc, create a span out of it and try to repro.
>
> Makes sense. I am just adding tests here that were not previously included, but regressed over the same commit range as this check-in.

Also note that bubble sort runs for a very long time, so BDN + lab customization is likely not reliably measuring the Tier1 codegen, but instead some mixture of Tier0, Tier0 + instrumentation, OSR, and/or R2R code.

@AndyAyersMS
Member Author

> Major instability starting with this commit.
>
> image

Feel free to add this kind of thing to #87324

@DrewScoggins
Member

I will do that going forward, didn't know about that issue :)

@DrewScoggins
Member

This is specifically on Windows x86.

image

@ghost ghost locked as resolved and limited conversation to collaborators Sep 27, 2023