QuickJitForLoops causing regressions #80210

Open
Tracked by #33658
alexcovington opened this issue Jan 4, 2023 · 4 comments

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)

@alexcovington
Contributor

alexcovington commented Jan 4, 2023

Description

I am noticing that some of the microbenchmarks perform worse on .NET 7.0, or take longer before TieredCompilation can fully optimize them. Specifically, this appears to be caused by the DOTNET_TC_QuickJitForLoops configuration setting.

For example, if I run some of the microbenchmarks using .NET 7.0 with default settings, I get worse performance than if I run with DOTNET_TC_QuickJitForLoops=0.

Using .\Microbenchmarks.exe --filter 'FractalPerf.Launch.Test':

// * Summary *

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.25267.1000)
.NET SDK=7.0.101
  [Host]     : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2
  Job-BGBUKP : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

|      Method |     Mean |   Error |  StdDev |   Median |      Min |      Max | Allocated |
|------------ |---------:|--------:|--------:|---------:|---------:|---------:|----------:|
| FractalPerf | 253.7 ms | 0.24 ms | 0.21 ms | 253.7 ms | 253.3 ms | 254.1 ms |     744 B |

Using .\Microbenchmarks.exe --filter 'FractalPerf.Launch.Test' --envVars DOTNET_TC_QuickJitForLoops:0:

// * Summary *

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.25267.1000)
.NET SDK=7.0.101
  [Host]     : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2
  Job-ZQLSHR : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_TC_QuickJitForLoops=0  PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

|      Method |     Mean |   Error |  StdDev |   Median |      Min |      Max | Allocated |
|------------ |---------:|--------:|--------:|---------:|---------:|---------:|----------:|
| FractalPerf | 129.8 ms | 0.10 ms | 0.09 ms | 129.8 ms | 129.6 ms | 129.9 ms |     444 B |
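
To put the two summaries side by side, here is a small sketch that computes the slowdown from the two reported means. The variable names are illustrative only; the numbers are copied from the tables above.

```python
# Means copied from the two BenchmarkDotNet summaries above (milliseconds).
mean_default_ms = 253.7            # .NET 7.0.1, default settings
mean_no_quickjit_loops_ms = 129.8  # .NET 7.0.1, DOTNET_TC_QuickJitForLoops=0

# How many times slower the default configuration is for this benchmark.
slowdown = mean_default_ms / mean_no_quickjit_loops_ms

# The same figure expressed as extra time relative to the faster run.
extra_time_pct = (slowdown - 1.0) * 100.0

print(f"slowdown: {slowdown:.2f}x")          # slowdown: 1.95x
print(f"extra time: {extra_time_pct:.0f}%")  # extra time: 95%
```

In other words, with default settings this benchmark takes roughly twice as long per invocation on this machine.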

Configuration

All benchmarks were run on various x64 systems (AMD Ryzen and Intel).

Baseline .NET version used is .NET 6.0.12.

Comparison .NET version used is .NET 7.0.1.

I ran two comparisons:

  • First comparison: .NET 6.0 vs .NET 7.0 (DOTNET_TC_QuickJitForLoops=1)
  • Second comparison: .NET 6.0 vs .NET 7.0 (DOTNET_TC_QuickJitForLoops=0)

Regression?

Yes, this is a regression going from .NET 6.0.12 -> .NET 7.0.1.

Data

I've noticed these microbenchmarks are affected by this:

  • Benchstone.BenchI.Array2.Test
  • Benchstone.BenchI.NDhrystone.Test
  • FractalPerf.Launch.Test
  • System.Collections.IndexerSetReverse<Int32>.IList(Size: 512)

6.0 vs 7.0 (Base, QuickJitForLoops=1)

## Benchstone.BenchI.Array2.Test

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower |  0.56 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower |  0.59 |        -240 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## Benchstone.BenchI.NDhrystone.Test

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower |  0.87 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower |  0.85 |       -2232 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## FractalPerf.Launch.Test

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower |  0.72 |        -784 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower |  0.52 |        -464 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## System.Collections.IndexerSetReverse<Int32>.IList(Size: 512)

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower |  0.88 |          +0 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower |  0.88 |          +0 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |

6.0 vs 7.0 (Diff, QuickJitForLoops=0)

## Benchstone.BenchI.Array2.Test

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   |  0.99 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   |  1.01 |        -240 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## Benchstone.BenchI.NDhrystone.Test

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   |  1.00 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   |  0.93 |       -2232 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## FractalPerf.Launch.Test

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   |  0.99 |        -784 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   |  1.02 |        -464 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## System.Collections.IndexerSetReverse<Int32>.IList(Size: 512)

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   |  1.00 |          +0 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   |  1.00 |          +0 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |

@alexcovington alexcovington added the tenet-performance Performance related issue label Jan 4, 2023
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jan 4, 2023
@jakobbotsch jakobbotsch added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 5, 2023
@ghost

ghost commented Jan 5, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

I am noticing some of the microbenchmarks perform worse or take longer before TieredCompilation can fully optimize. Specifically, this looks to be due to the DOTNET_TC_QuickJitForLoops configuration setting.

For example, if I run some of the microbenchmarks using .NET 7.0 with default settings, I get worse performance than if I run with DOTNET_TC_QuickJitForLoops=0.

Using .\Microbenchmarks.exe --filter 'FractalPerf.Launch.Test':

// * Summary *

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.25267.1000)
AMD Eng Sample: 100-000000589-50_Y, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.101
  [Host]     : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2
  Job-BGBUKP : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

|      Method |     Mean |   Error |  StdDev |   Median |      Min |      Max | Allocated |
|------------ |---------:|--------:|--------:|---------:|---------:|---------:|----------:|
| FractalPerf | 253.7 ms | 0.24 ms | 0.21 ms | 253.7 ms | 253.3 ms | 254.1 ms |     744 B |

Using .\Microbenchmarks.exe --filter 'FractalPerf.Launch.Test' --envVars DOTNET_TC_QuickJitForLoops:0:

// * Summary *

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.25267.1000)
AMD Eng Sample: 100-000000589-50_Y, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.101
  [Host]     : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2
  Job-ZQLSHR : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_TC_QuickJitForLoops=0  PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

|      Method |     Mean |   Error |  StdDev |   Median |      Min |      Max | Allocated |
|------------ |---------:|--------:|--------:|---------:|---------:|---------:|----------:|
| FractalPerf | 129.8 ms | 0.10 ms | 0.09 ms | 129.8 ms | 129.6 ms | 129.9 ms |     444 B |

Configuration

All benchmarks were run on various x64 systems (AMD Zen 3, AMD Zen 4, and Intel Rocket Lake).

Baseline .NET version used is .NET 6.0.12.

Comparison .NET version used is .NET 7.0.1.

I ran two comparisons:

  • First comparison was between .NET 6.0 vs .NET 7.0 (DOTNET_TC_QuickJitForLoops=1)
  • Second comparison was between .NET 6.0 vs .NET 7.0 (DOTNET_TC_QuickJitForLoops=0)

Regression?

Yes, this is a regression going from .NET 6.0.12 -> .NET 7.0.1.

Data

I've noticed these microbenchmarks are affected by this:

  • Benchstone.BenchI.Array2.Test
  • Benchstone.BenchI.NDhrystone.Test
  • FractalPerf.Launch.Test
  • System.Collections.IndexerSetReverse<Int32>.IList(Size: 512)

6.0 vs 7.0 (Base, QuickJitForLoops=1)

## Benchstone.BenchI.Array2.Test

| Result |         Base |         Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | ------------:| ------------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower | 378068100.00 | 679266400.00 |  0.56 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower | 382520250.00 | 643846900.00 |  0.59 |        -144 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Slower | 356147150.00 | 600143300.00 |  0.59 |        -240 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## Benchstone.BenchI.NDhrystone.Test

| Result |         Base |         Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | ------------:| ------------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower | 287615500.00 | 329207650.00 |  0.87 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower | 295240800.00 | 361841300.00 |  0.82 |       -1936 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Slower | 266774400.00 | 314499600.00 |  0.85 |       -2232 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## FractalPerf.Launch.Test

| Result |         Base |         Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | ------------:| ------------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower | 177198500.00 | 245312800.00 |  0.72 |        -784 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower | 135577150.00 | 254808900.00 |  0.53 |        -416 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Slower | 130623500.00 | 252697300.00 |  0.52 |        -464 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## System.Collections.IndexerSetReverse<Int32>.IList(Size: 512)

| Result |    Base |    Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -------:| -------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower | 1039.31 | 1180.33 |  0.88 |          +0 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower | 1033.24 | 1180.34 |  0.88 |          +0 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Slower | 1034.39 | 1179.53 |  0.88 |          +0 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |

6.0 vs 7.0 (Diff, QuickJitForLoops=0)

## Benchstone.BenchI.Array2.Test

| Result |         Base |         Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | ------------:| ------------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   | 378068100.00 | 380040700.00 |  0.99 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   | 382520250.00 | 378924350.00 |  1.01 |        -144 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Same   | 356147150.00 | 353812900.00 |  1.01 |        -240 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## Benchstone.BenchI.NDhrystone.Test

| Result |         Base |         Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | ------------:| ------------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   | 287615500.00 | 287273800.00 |  1.00 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower | 295240800.00 | 354430000.00 |  0.83 |       -1936 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Same   | 266774400.00 | 285916500.00 |  0.93 |       -2232 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## FractalPerf.Launch.Test

| Result |         Base |         Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | ------------:| ------------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   | 177198500.00 | 178737900.00 |  0.99 |        -784 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   | 135577150.00 | 135551375.00 |  1.00 |        -416 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Same   | 130623500.00 | 128415250.00 |  1.02 |        -464 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## System.Collections.IndexerSetReverse<Int32>.IList(Size: 512)

| Result |    Base |    Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -------:| -------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   | 1039.31 | 1035.28 |  1.00 |          +0 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   | 1033.24 | 1033.49 |  1.00 |          +0 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Same   | 1034.39 | 1033.07 |  1.00 |          +0 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |
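
The tables do not say how the Ratio column is derived, but the numbers are consistent with Ratio = Base / Diff rounded to two decimals, so values below 1.00 mean the .NET 7.0 run was slower. A quick sketch that checks this reading against a few rows copied from the tables above (the row selection and short names are only for illustration):

```python
# (benchmark, processor, Base, Diff, reported Ratio) copied from the tables above.
rows = [
    ("Array2, QuickJitForLoops=1", "i9-11900K", 378068100.00, 679266400.00, 0.56),
    ("NDhrystone, QuickJitForLoops=1", "i9-11900K", 287615500.00, 329207650.00, 0.87),
    ("FractalPerf, QuickJitForLoops=1", "Ryzen 9 7900X", 130623500.00, 252697300.00, 0.52),
    ("IndexerSetReverse, QuickJitForLoops=1", "i9-11900K", 1039.31, 1180.33, 0.88),
    ("Array2, QuickJitForLoops=0", "i9-11900K", 378068100.00, 380040700.00, 0.99),
]

# Assumption being checked: Ratio = Base / Diff, rounded to two decimals.
computed = [round(base / diff, 2) for _, _, base, diff, _ in rows]

for (name, cpu, _, _, reported), ratio in zip(rows, computed):
    status = "matches" if abs(ratio - reported) < 1e-9 else "differs"
    print(f"{name} on {cpu}: computed {ratio:.2f}, reported {reported:.2f} ({status})")
```
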
Author: alexcovington
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, untriaged

Milestone: -

@jakobbotsch
Member

jakobbotsch commented Jan 5, 2023

In general it was expected that certain microbenchmarks would be regressions due to various interactions between on-stack replacement and how BenchmarkDotNet works. @AndyAyersMS analyzed most of them in #33658 and #67594.

@AndyAyersMS
Member

> In general it was expected that certain microbenchmarks would be regressions

Right, mostly benchmarks whose invocation times are on the order of 100 ms or more and that spend most of their time in a single method. When this happens, BDN does not run many invocations per iteration, so the code being tested does not get a chance to tier up. Such code inevitably also contains loops, so BDN ends up measuring the performance of the OSR version of the method.
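
As a rough illustration of the invocation-count argument, here is a back-of-the-envelope sketch. It is a simplified model, not BenchmarkDotNet's actual algorithm: it assumes each iteration fits as many whole invocations as it can into IterationTime (at least one), ignores the pilot and jitting stages, and assumes a tier-up call-count threshold of 30. The function names and the threshold constant are assumptions made for this example.

```python
import math

# Settings taken from the BenchmarkDotNet summaries in this issue.
ITERATION_TIME_MS = 250.0
WARMUP_COUNT = 1
MAX_ITERATION_COUNT = 20

# Assumption for illustration: number of calls a method needs before the
# runtime considers rejitting it at a higher tier.
ASSUMED_CALL_COUNT_THRESHOLD = 30


def invocations_per_iteration(iteration_time_ms: float, invocation_ms: float) -> int:
    """Simplified model: whole invocations that fit in one iteration, minimum 1."""
    return max(1, math.floor(iteration_time_ms / invocation_ms))


def max_calls_in_run(invocation_ms: float) -> int:
    """Upper bound on benchmark method calls across warmup and measured iterations."""
    per_iteration = invocations_per_iteration(ITERATION_TIME_MS, invocation_ms)
    return per_iteration * (WARMUP_COUNT + MAX_ITERATION_COUNT)


# FractalPerf.Launch.Test takes about 253.7 ms per invocation with default settings.
fractal_calls = max_calls_in_run(253.7)
print(f"calls in the whole run: {fractal_calls}")
print(f"reaches assumed threshold: {fractal_calls >= ASSUMED_CALL_COUNT_THRESHOLD}")
```

Under these assumptions a benchmark that takes about 250 ms per invocation is called only a couple of dozen times in the whole run, which fits the observation that the measured code is the OSR version rather than the fully tiered-up method.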

There are a variety of reasons why the OSR code may be less efficient than the tiered-up method code. I have ideas on how to mitigate some of these (see the linked issues above), but nothing is committed or scheduled.

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Jan 9, 2023
@JulieLeeMSFT JulieLeeMSFT added this to the Future milestone Jan 9, 2023