Choosing the right defaults for Tiered Compilation #24064

Open
richlander opened this issue Apr 17, 2019 · 8 comments

@richlander
Member

commented Apr 17, 2019

Choosing the right defaults for tiered compilation

Tiered compilation (TC) is a runtime feature that is able to control the compilation speed and quality of the JIT to achieve various performance outcomes. It is enabled by default in .NET Core 3.0 builds. We are considering what the default TC configuration should be for the final 3.0 release. We have been investigating the performance impact (positive and/or negative) for a variety of application scenarios, with the goal of selecting a default that is good for all scenarios, and providing configuration switches to enable developers to opt apps into other configurations.

We would like your feedback on this exercise and want to share how we are thinking about TC currently.

TC Feature Explained (briefly)

TC is based on the underlying re-jit capability in the runtime, which enables methods to be compiled more than once (typically with different code). The re-jit capability was initially built to support instrumenting profilers.

The fundamental benefit and capability of TC is to enable (re-)jitting methods with either lower-quality code that is faster to produce or higher-quality code that is slower to produce, in order to increase the performance of an application as it goes through various stages of execution, from startup through steady state. This contrasts with the non-TC approach, where every method is compiled a single way (the same as the high-quality tier), which biases toward steady-state performance over startup performance.

TC isn't solely about jitted code. TC is able to re-jit R2R code to higher-quality jitted code. Ahead-of-time compiled ready-to-run (R2R) images are biased towards startup performance, and are worse for steady-state performance than high-quality jitted code. This capability of TC can significantly improve steady-state performance for compute-intensive applications like web servers.

Only methods that are called multiple times are re-jitted, once the number of calls to a method reaches a threshold, currently defined as 30 calls. Many methods are called only a few times, and don't warrant optimization.

We call code that is either already available (specifically R2R code) or can be inexpensively produced at startup "tier 0". We call optimized code that is generated after startup "tier 1". Tier 1 code is the code that is generated after a method has been called multiple times, as described above.

At startup, tier 0 code can be one of the following:

  • Ahead-of-time compiled R2R code.
  • Tier 0 jitted code, produced by "Quick JIT". Quick JIT applies fewer optimizations (similar to "minopts") to compile code faster.
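
As a rough illustration of the call-count promotion described above, here is a toy model in Python. It is a sketch only, not the CoreCLR implementation: the `Method` class, its fields, and the method names are hypothetical, and promotion is modeled as happening synchronously on the call that reaches the threshold, which is a simplifying assumption.

```python
from dataclasses import dataclass

CALL_COUNT_THRESHOLD = 30  # the threshold stated in the text above


@dataclass
class Method:
    """Hypothetical stand-in for a managed method tracked by the runtime."""
    name: str
    has_r2r_code: bool = False  # precompiled ready-to-run code is available
    tier: int = 0
    calls: int = 0
    code_source: str = ""

    def __post_init__(self):
        # Tier 0 is either existing R2R code or cheap Quick JIT output.
        self.code_source = "R2R" if self.has_r2r_code else "QuickJIT"

    def call(self) -> None:
        self.calls += 1
        if self.tier == 0 and self.calls >= CALL_COUNT_THRESHOLD:
            # Assumption: promotion is immediate in this toy model.
            self.tier = 1
            self.code_source = "OptimizedJIT"


hot = Method("ParseRequest")                    # hypothetical hot method
cold = Method("ReadConfig", has_r2r_code=True)  # hypothetical startup-only method
for _ in range(100):
    hot.call()
for _ in range(3):
    cold.call()

print(hot.tier, hot.code_source)    # 1 OptimizedJIT
print(cold.tier, cold.code_source)  # 0 R2R
```

The cold method never leaves tier 0, which is the point of the threshold: methods called only a few times do not pay for optimized compilation.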

Context

We first introduced TC with .NET Core 2.1. We intended at that time to enable TC by default. We found regressions with some ASP.NET benchmarks, so opted to leave the feature off by default. We have heard that some users (including Microsoft products) have enabled TC based on observed benefits. That's great, and is part of the information we are collecting to make the decision on how to configure TC for 3.0.

As part of the .NET Core 3.0 release, we have invested significant effort into improving and optimizing TC, again with the goal of enabling TC by default. At this point, we are focussed less on further improvements to TC and more on the final ship configuration.

Recently, we saw a report of concerning performance with TC and AWS Lambda. We are working with both Zac Charles and Norm Johanson to better understand the results and try the same testing with more real-world Lambda applications. Zac and Norm have been excellent to work with. Major kudos to Zac for all the leg-work he's done helping us! Note that the results in the blog post were based on a Lambda application that just calls ToUpper() on a string. It doesn't make sense to base our analysis solely on an application that small.

Updated: Benchmarking .NET Core 3.0-preview4 on Lambda

We have started a conversation with the Azure Functions team to see if similar benchmarks produce similar results in that environment. The Functions team told us that they tried TC with .NET Core 2.1 and opted not to enable it because they didn't see a benefit in their testing. However, they are about to start testing .NET Core 3.0. We will work with the Functions team to specifically look at the impact of TC on their performance benchmarks.

We're not making .NET Core product decisions exclusively for the serverless application type. However, the post that Zac wrote and other community feedback (example) made us ask a few questions:

  • Is TC a good feature to have enabled by default? Is it generally beneficial or does it only show benefits with certain types of applications?
  • Is TC bad for people benchmarking with .NET Core? Will they need to read documentation to benchmark .NET Core correctly, specifically to account for TC?
  • Almost all of the TC investigations have been on web apps. What about WPF and Windows Forms client applications, which are new in 3.0? What about more sophisticated console apps like PowerShell? What about constrained devices like Raspberry Pi or Docker containers with <= 1 CPU allocated?

The rest of this doc details our plan for answering these questions, and for using the performance data we generate to define a final configuration for TC for .NET Core 3.0.

Desired Outcomes

First, we'll start with the characteristics we would want to see in order to make TC default.

  • No or limited regressions (<5% due to TC; 3.0 w/TC disabled is baseline); regressions could be: startup time, steady-state throughput, allocation rates, memory usage, ...
  • Significant improvements for some scenarios with a bias to steady state execution (for example, as measured by RPS for web apps)
  • Developers benchmarking .NET Core do not need to read documentation to get accurate results
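
The "<5%" regression budget above can be made concrete with a small check. The sketch below is illustrative only: the metric names, the `HIGHER_IS_BETTER` table, and the sample numbers are assumptions made for the example, not measured results.

```python
REGRESSION_LIMIT = 0.05  # the "<5%" budget from the list above

# Assumed metric names; True means a larger value is better.
HIGHER_IS_BETTER = {"rps": True, "startup_ms": False, "working_set_mb": False}


def regression(metric: str, baseline: float, with_tc: float) -> float:
    """Fractional regression versus the baseline (3.0 with TC disabled).

    A negative result means TC improved the metric.
    """
    change = (with_tc - baseline) / baseline
    return -change if HIGHER_IS_BETTER[metric] else change


def within_budget(metric: str, baseline: float, with_tc: float) -> bool:
    return regression(metric, baseline, with_tc) < REGRESSION_LIMIT


# Made-up numbers, for illustration only.
print(within_budget("startup_ms", 200.0, 206.0))  # True: 3% slower startup
print(within_budget("rps", 1000.0, 940.0))        # False: 6% lower throughput
```

Treating the direction of each metric explicitly avoids misreading a throughput gain as a regression, or a startup slowdown as an improvement.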

Define Performance Baselines

  • 2.2 Customer default -- R2R enabled, TC disabled
  • 2.2 TechEmpower configuration -- R2R disabled, TC disabled

Measurement Modes

  • TC enabled (same as Preview 3 default)
  • TC enabled, QuickJit disabled (same as Preview 4 default)
  • TC disabled (same as 2.2 default)
  • R2R disabled, TC disabled (same as 2.2 TechEmpower configuration)
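
One way to drive these modes from a test script is through runtime environment variables. The sketch below assumes the `COMPlus_TieredCompilation`, `COMPlus_TC_QuickJit`, and `COMPlus_ReadyToRun` settings behave as documented for the runtime version under test; verify the names against the documentation for your build. The mode keys, `env_for`, `run_app`, and the application path are hypothetical.

```python
import os
import subprocess

# Measurement modes from the list above, expressed as environment overrides.
# Assumption: these COMPlus_* variables are honored by the runtime being tested.
MODES = {
    "tc_enabled": {"COMPlus_TieredCompilation": "1", "COMPlus_TC_QuickJit": "1"},
    "tc_enabled_quickjit_disabled": {"COMPlus_TieredCompilation": "1", "COMPlus_TC_QuickJit": "0"},
    "tc_disabled": {"COMPlus_TieredCompilation": "0"},
    "r2r_disabled_tc_disabled": {"COMPlus_ReadyToRun": "0", "COMPlus_TieredCompilation": "0"},
}


def env_for(mode: str) -> dict:
    """Return a copy of the current environment with the mode's overrides applied."""
    env = dict(os.environ)
    env.update(MODES[mode])
    return env


def run_app(mode: str, app_dll: str) -> subprocess.CompletedProcess:
    """Launch the app under one mode; app_dll is a hypothetical path to a built app."""
    return subprocess.run(["dotnet", app_dll], env=env_for(mode), check=True)


for name, overrides in MODES.items():
    print(name, overrides)
```

Setting the switches per process keeps each run isolated, so the same build can be measured in all four modes without recompiling.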

Action Plan

We intend to make a decision on the .NET Core default mode for TC in May or June. We will use the following action plan.

Measure cold startup, warm startup, throughput and working set, in the defined measurement modes, for a broad set of applications:

  • UI client apps: WPF, Windows Forms, UWP
  • Console apps: Roslyn compiler (compiling Roslyn), and PowerShell (pure startup and long-running script)
  • ASP.NET: TechEmpower and Music Store
  • Serverless: Azure Functions and AWS Lambda
  • PaaS: Azure websites

Note: some performance metrics may not be critical/relevant for all application types.

Execution plan:

  • Collect and publish performance data
  • Investigate anomalies
  • Consider experiments to improve results, then rinse and repeat; some experiments will need to be postponed until a later release
  • Make changes in a preview and watch for feedback
  • Document the final decision with recommendations, as appropriate

Desired community engagement:

  • Provide general feedback, preferably with data justifying viewpoints
  • Run performance tests and report results and any associated analysis (file issues on dotnet/coreclr repo)

Theories and Thoughts

We have developed a few theories. They are not guiding the investigation, but are ideas that we want to prove or disprove.

  • The AWS Lambda throughput benchmarks are negatively impacted by Quick JIT. The Lambda environment is very constrained, resulting in poor throughput for an extended time, until tier 1 code can be generated. Some applications running in such an environment may never reach optimal execution because they may not run long enough.
  • System.Private.Corelib.dll was moved from being compiled with fragile NGEN to ready to run format in .NET Core 3.0. We believe that some startup performance regressions are due to this change.

Key Resources

@MeikTranel

commented Apr 17, 2019

Has there been any testing with regard to runtime compilation: common tasks like Regex-Compile/XMLSerializer, but also more complex examples like Cake-Build?
I assume tiered compilation would have at least some influence on code that is compiled, loaded and executed in the same runtime cycle?

With regard to the action plan:
I also think it makes sense to have a tighter feedback loop with the communities that depend on these features (such as PowerShell) and then have their feedback compared to that of regular customers to find common ground, while simultaneously gathering experiences with the various knobs users can tweak.

For regular customers, I feel like there needs to be more beginner-friendly documentation on how to tweak TC. Right now I don't see any documentation on how to enable different configurations, unless you want us to go to the code and look for CLR config keys.

@richlander

Member Author

commented Apr 17, 2019

@MeikTranel -- I think you are asking about dynamic code scenarios, like Reflection.Emit. Tiering is not currently enabled for those scenarios. It's something that we'll consider post 3.0.

This issue is intended to create that feedback loop. I'll also reference this issue in our next release blog post. PowerShell currently has tiering enabled with their current in-market version, on top of .NET Core 2.1. The PowerShell team saw significant benefit from tiering, so they enabled it for all their users.

Yes, TC docs need to improve. That's on my plate to fix, after we figure out what the defaults will be. To be fair, you can enable TC on 2.1/2.2 via an msbuild property, as documented @ https://devblogs.microsoft.com/dotnet/tiered-compilation-preview-in-net-core-2-1/.

@Maoni0

Member

commented Apr 18, 2019

@janvorli do you believe we will still get to this in 3.0?

@janvorli

Member

commented Apr 18, 2019

@Maoni0 I guess you wanted to ask someone else, or you put your question in the wrong issue.

@Maoni0

Member

commented Apr 18, 2019

yes..I did; during my frantic bug triage I apparently put the comment in the wrong issue. please ignore...sorry! (I can delete my last comment and this one after you've read it :P)

@normj

commented Apr 21, 2019

Adding clarification on AWS Lambda.

AWS Lambda's memory and compute are as constrained as the user requests them to be, similar to a Docker container. Generally, memory and compute are set at pretty low levels. Our AWS tooling defaults to 256 MB of memory for .NET Core Lambda functions. Compute is a sliding scale based on the memory specified.

What is unique about Lambda is the process lifecycle: the process is either processing an event, which is usually of short duration like a single web request, or it is frozen. There is no idle time in the compute environment. If there are no incoming events for some duration, the compute environment is reclaimed, and it will be reconstructed, including restarting the .NET process, when a new event comes in.

@iSazonov referenced this issue on Apr 23, 2019: .NET Core 3.0 port #9425 (closed, 9 of 9 tasks complete)

@GSPP

commented Jun 10, 2019

A few thoughts:

  1. It might be worth taking a look at Java. Java has had TC for a very long time, and it is enabled by default. It is well known in the Java space that benchmarking needs to take TC into account. We can learn from their experience how the community handles the existence of TC.

  2. Benchmarking is complicated already (need Release compile, no debugger attached, warmup, sufficient duration, etc.). This is just one more item. Frameworks such as BenchmarkDotNet can automatically handle this issue.

  3. Disabling TC by default is a heavy solution. Maybe we can just tune the default heuristics so that they are less prone to regressions?

  4. For most applications I have worked with (many), steady-state throughput is the only relevant concern. I'm personally most interested in TC because it can deliver higher throughput when the highest tier kicks in. This is an important use case.

  5. Java supports on-stack replacement (OSR). This means that running functions can be upgraded and downgraded. This solves many issues. Even if a benchmark consists of a single function call executing a long-running loop, TC will very quickly upgrade that code to the highest tier. If the benchmark duration is chosen long enough, the measurement will be meaningful even in the presence of TC.

With OSR it becomes much easier to pick heuristics that work for a broad range of scenarios. Are there plans to implement OSR? It seems very desirable to me.

OSR enables further very powerful optimizations as well. The runtime can track which types have ever been instantiated. If an interface only has one concrete type ever instantiated the JIT can pretend that values of the interface type are actually the single concrete type. If another type is ever instantiated OSR can undo this optimization. This can lead to very far reaching devirtualization and other specializations.
