Tiered Compilation step 1 #10478

Merged: merged 1 commit into from Mar 30, 2017

Conversation

noahfalk
Member

Tiered compilation is a new feature we are experimenting with that aims to improve startup times. Initially we jit methods non-optimized, then switch to an optimized version once the method has been called a number of times. More details about the current feature operation are in the comments of TieredCompilation.cpp.

This is only the first step in a longer process of building the feature. The primary goal for now is to avoid regressing any runtime behavior in the shipping configuration, in which the COMPlus variable is OFF, while putting enough code in place that we can measure performance in the daily builds and make incremental progress visible to collaborators and reviewers. The design of the TieredCompilationManager is likely to change substantively, and the call counter may also change.
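
To make the mechanism concrete, here is a minimal C++ sketch of counted promotion from unoptimized to optimized code. Everything in it is illustrative: the threshold, the names, and the background-queue call are assumptions, not the actual TieredCompilationManager / CallCounter code in this PR.

#include <atomic>

// Illustrative only: per-method state that triggers promotion to
// optimized code once a call-count threshold is crossed.
struct MethodTieringState
{
    std::atomic<int> callsRemaining{30};       // hypothetical threshold
    std::atomic<void*> optimizedCode{nullptr}; // published by the optimizing jit

    void* OnMethodCalled(void* unoptimizedCode)
    {
        void* opt = optimizedCode.load(std::memory_order_acquire);
        if (opt != nullptr)
            return opt;                        // already promoted

        if (callsRemaining.fetch_sub(1) == 1)
            QueueForOptimizedJit(this);        // hypothetical: background rejit
                                               // eventually stores optimizedCode

        return unoptimizedCode;                // keep running tier-0 code
    }

    static void QueueForOptimizedJit(MethodTieringState*) { /* elided */ }
};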

@noahfalk
Member Author

@jkotas @davidwrighton - are you guys the best reviewers for this stuff or is there someone else I should be asking? Thanks!

@jkotas
Member

jkotas commented Mar 26, 2017

are you guys the best reviewers for this stuff or is there someone else I should be asking?

For the overall approach: @dotnet/jit-contrib

For the integration with the rest of the VM: @kouvel @janvorli @gkhanna79

@@ -13,6 +13,7 @@
<FeatureDbiOopDebugging_HostOneCorex86 Condition="'$(TargetArch)' == 'i386' or '$(TargetArch)' == 'arm'">true</FeatureDbiOopDebugging_HostOneCorex86>
<FeatureDbiOopDebugging_HostOneCoreamd64 Condition="'$(TargetArch)' == 'amd64'">true</FeatureDbiOopDebugging_HostOneCoreamd64>
<FeatureEventTrace>true</FeatureEventTrace>
<FeatureFitJit>true</FeatureFitJit>
Member

Can we call it something more self-describing, like FEATURE_TIERED_JIT?

Member Author

Sure, that was a holdover from some internal naming

@mattwarren

Initially we jit methods non-optimized, then switch to an optimized version once the method has been called a number of times.

Apologies if this is a stupid question, but why not interpreted first, then non-optimised, followed by optimised? There's already an Interpreter available, or is it not considered suitable for production code?

How different is the overhead between non-optimised and optimised JITting?

@noahfalk
Member Author

There's already an Interpreter available, or is it not considered suitable for production code?

It's a fine question, but you guessed correctly - the interpreter is not in good enough shape to run production code as-is. There are also some significant issues if you want debugging and profiling tools to work (which we do). Given enough time and effort it is all solvable, it just isn't the easiest place to start.

How different is the overhead between non-optimised and optimised JITting?

On my machine non-optimized jitting took about 65% of the time that optimized jitting took for similar IL input sizes, but of course I expect results will vary by workload and hardware. Getting this first step checked in should make it easier to collect better measurements.

@mattwarren

mattwarren commented Mar 27, 2017

@noahfalk thanks for the response, I'd not even considered profiling/debugging, that's useful to know.

On my machine non-optimized jitting took about 65% of the time that optimized jitting took for similar IL input sizes, but of course I expect results will vary by workload and hardware. Getting this first step checked in should make it easier to collect better measurements.

Interesting, so there's some decent saving to be made, that's cool

#if defined(FEATURE_FITJIT)

public:
TieredCompilationManager & GetTieredCompilationManager()
Member

In coreclr runtime, pointers are used instead of references in most places. I would prefer returning pointer here and from the GetCallCounter below. In AppDomain::Init(), which is the only caller of this method, you need a pointer anyways.
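
As a sketch of the pointer-returning shape being suggested (the stub types stand in for the real TieredCompilationManager and CallCounter from this PR; the field names are assumptions):

class TieredCompilationManager { /* stub for illustration */ };
class CallCounter { /* stub for illustration */ };

class AppDomain
{
public:
    // Return pointers, matching the prevailing coreclr VM style.
    TieredCompilationManager* GetTieredCompilationManager()
    {
        return &m_tieredCompilationManager;
    }

    CallCounter* GetCallCounter()
    {
        return &m_callCounter;
    }

private:
    TieredCompilationManager m_tieredCompilationManager;
    CallCounter m_callCounter;
};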

Member Author

Sure thing

}
CONTRACTL_END;

SpinLockHolder holder(&m_lock);
Member

Is the spinlock really needed here? It looks like just making the m_pTieredCompilationManager VolatilePtr and using its Store / Load methods for accessing it would be sufficient.
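
As a sketch of that suggestion, assuming coreclr's VolatilePtr wrapper (from volatile.h) and single-writer initialization:

// Sketch: lazily publish the manager without a spinlock. Load / Store on
// the VolatilePtr provide the ordering needed for safe publication.
VolatilePtr<TieredCompilationManager> m_pTieredCompilationManager;

TieredCompilationManager* GetTieredCompilationManager()
{
    return m_pTieredCompilationManager.Load();
}

void SetTieredCompilationManager(TieredCompilationManager* pManager)
{
    m_pTieredCompilationManager.Store(pManager);
}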

// pointer need to be very careful about if and when they cache it
// if it is not stable.
//
// The stability of the native code pointer is seperate from the
Member

A nit: seperate -> separate

#ifdef FEATURE_FITJIT
// Keep in-sync with MethodDesc::IsEligibleForTieredCompilation()
if (g_pConfig->TieredCompilation() &&
!GetModule()->HasNativeOrReadyToRunImage() &&
Member

A nit - the formatting here looks off; you have tabs here instead of spaces.

// and complicating the code to narrow an already rare error case isn't desirable.
{
SpinLockHolder holder(&m_lock);
SListElem<MethodDesc*>* pMethodListItem = new (nothrow) SListElem<MethodDesc*>(pMethodDesc);
Member

It would be better to move the allocation out of the spinlock to minimize the amount of work done inside of it.
Actually, it seems to me that you really need the spinlock just for the m_methodsToOptimize list access, and you don't need it for the m_countOptimizationThreadsRunning, m_isAppDomainShuttingDown and m_domainId access.
You can use Volatile<...> for the m_isAppDomainShuttingDown and m_domainId access, and Interlocked operations for incrementing and decrementing m_countOptimizationThreadsRunning. Please correct me if I am wrong, but it doesn't look like the check of m_isAppDomainShuttingDown and the m_countOptimizationThreadsRunning increment / decrement need to be a single atomic operation.
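
A minimal sketch of the suggested restructuring, reusing the names from the snippet above (error handling elided):

// Allocate outside the spinlock so the lock only guards the list itself.
SListElem<MethodDesc*>* pMethodListItem =
    new (nothrow) SListElem<MethodDesc*>(pMethodDesc);
if (pMethodListItem != NULL)
{
    SpinLockHolder holder(&m_lock);
    m_methodsToOptimize.InsertTail(pMethodListItem);
}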

Member Author

Agreed on moving the allocation.

You are correct, there is no requirement of atomicity between the various field updates. However, I'm not sure that changing to lockless volatile access for the other fields would be an improvement. Unless this lock proves to be a performance hotspot, I think we are better off optimizing the code for simplicity.

Member

Ok, let's leave the spinlock usage as it is. I guess the hottest path is the OnMethodCalled function, and it needs the spinlock anyway for syncing access to the m_methodsToOptimize list.
If we see that the lock is a perf issue here in the future, it seems we could even get rid of it completely by using a simple lockfree list (push one / pop all style, which is trivial to make lockfree - see the sketch below).
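
For reference, a sketch of the "push one / pop all" pattern mentioned above, written with std::atomic for illustration (coreclr itself would use its own Interlocked primitives; WorkItem is a hypothetical node type):

#include <atomic>

class MethodDesc; // only used via pointer here

struct WorkItem
{
    MethodDesc* pMethod;
    WorkItem* pNext;
};

std::atomic<WorkItem*> g_workList{nullptr};

void PushOne(WorkItem* pItem)
{
    WorkItem* pHead = g_workList.load(std::memory_order_relaxed);
    do
    {
        pItem->pNext = pHead; // pHead is refreshed by a failed CAS
    } while (!g_workList.compare_exchange_weak(
                 pHead, pItem,
                 std::memory_order_release, std::memory_order_relaxed));
}

WorkItem* PopAll()
{
    // Detach the whole list in one atomic exchange; the caller walks it.
    return g_workList.exchange(nullptr, std::memory_order_acquire);
}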

@noahfalk
Member Author

Thanks @janvorli ! If I don't hear anything further I'll squash and commit tomorrow (technically later today now)

@AndyAyersMS
Member

I think minopts (as it currently exists) is a plausible starting place for the initial method jit, but it is something we will want to change fairly soon.

What we want in the initial jit attempt is to have the jit generate code as fast as possible, not to generate code with minimal optimizations. Those are not the same thing: some optimizations actually make jitting faster. We haven't really explored this space very well and I don't have anything concrete to recommend here yet, but it should be the case that some optimization more than pays for itself.

Second, minopts does no inlining whatsoever, and this will both cause larger-than-normal counter overhead and kick off jitting for methods that arguably never need to be jitted on their own (e.g. methods marked with aggressive inlining).

I have some data that shows inlining is one of the optimizations that may make jitting faster, at least for very simple inlinees. It is not an open-and-shut case, because those measurements were made with the rest of the jit running its normal optimization passes, and they do not fully capture possible additional costs from class loading (which are tricky to account for, since it's somewhat unfair to pin them on any particular inlining decision). Here's a plot of the data for the jit-time impact of individual inlines as a function of IL size. Vertical units are microseconds; values below zero mean the jit is faster if we inline than if we don't.
[Plot: per-inline jit-time impact vs. inlinee IL size]
This data shows jitting is faster when the jit inlines methods with IL sizes 0-4, and is a decent bet to be faster or as fast even up to methods as large as 10 IL bytes.

The current inlining policy is to always inline methods that are 16 bytes of IL or less. There is an alternative policy (the "size policy") that might be a good fit for initial jitting, as it tries to minimize overall method size (it also honors aggressive inlines). For the jit, jit time is typically proportional to the size of the generated code.

All of this impacts policy and tradeoff -- enabling some optimization initially can make the initial jitting faster and make the initially jitted code run faster. So it might buy us more time to use that initially jitted code until we decide to rejit, at which point we can possibly be somewhat more aggressive.

So it would be nice, even now, to generalize the notion of "please jit fast" by passing in a new flag instead of reusing an old one. Initially the jit can map this to minopts, but in the future we can experiment with alternatives.

@BruceForstall
Member

So it would be nice to even now to generalize the notion of "please jit fast" by passing in a new flag instead of reusing an old one.

One reason for minopts is to do as little as possible in case there is a bug in non-minopts, e.g. if we hit a noway_assert, or to be able to tell customers to try minopts to avoid hitting a bug in the field. So we really don't want it doing inlining, for example.

@AndyAyersMS
Member

I'm not saying we should get rid of minopts or change what it does.

I'm saying that the initial jit attempt should not be minopts, but something new that we don't have a flag for today, e.g. fastopts. As an initial cut, fastopts can be mapped by the jit onto minopts.

Over time fastopts should diverge from minopts and enable some optimization. And if fastopts hits an issue, the jit or user can always fall back to minopts.
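
As a sketch of that aliasing step (none of these names are real coreclr identifiers; they are placeholders for whatever flag gets added):

// Hypothetical: a new "please jit fast" request that the jit initially
// maps onto today's minopts behavior.
enum class JitSpeedFlag { Normal, FastOpts };

bool ShouldUseMinOpts(JitSpeedFlag flag)
{
    // Step 1: fastopts is an alias for minopts.
    // Later: give fastopts its own policy (e.g. the cheap inlines that
    // make jitting faster), diverging from minopts.
    return flag == JitSpeedFlag::FastOpts;
}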

@JosephTremoulet

@AndyAyersMS / @BruceForstall, I think you're touching on a larger question of what's the right set of optimization levels/flags, which is something we've been meaning to address; I've just opened #10560 for discussion about that.

@cmckinsey

@AndyAyersMS / @JosephTremoulet There is certainly some exploration required in order to arrive at the right opt/speed trade-offs. I agree we shouldn't hard-code MinOpts to imply Tier 0 in the JIT, and this does overlap with your opt levels, Joe; however, I don't think it's clear even now how many tiers we might need. We said 3 might be the right thing to shoot for out of the gate. It's probably best to start with some notion of an actual level counter and then virtualize it behind the JIT interface to imply the set of on/off switches and limits per optimization.

@discostu105

Have profiling scenarios been considered for this change? Specifically, I mean a profiler which uses the JitCompilationStarted callback to exchange IL code for instrumentation. We use this feature heavily in our product.

If IL code is interpreted at first, and jitted only later on, then code already runs before JitCompilationStarted is called. So an IL code modification is only possible "eventually".

@noahfalk
Member Author

@AndyAyersMS @cmckinsey @JosephTremoulet @BruceForstall - I think we are all in agreement about the desirability of a jit mode which obtains the best set of perf tradeoffs for tier 0. My mention of the min-opt jit above was only to the extent that it is the best pre-existing approximation. Thanks for raising the point.

How about this as a proposal:

  1. I will make a small follow-on change very shortly that adds a new flag and changes the code here to use it. I will alias the flag to minopts, because that is the closest configuration that currently exists.
  2. At some point, when it is convenient, a new optimization policy can be developed for this flag, and it can be unaliased from min-opt.
  3. As we gain experience working on tiered compilation in general, we can continue to collaborate on what additional configuration knobs are appropriate, be it a level number, tracing info, block counts, type test results, etc.

@cmckinsey - I hesitate to add a level counter 'right out of the gate' because we don't yet have the machinery to track the progression of a method through multiple levels. Adding a counter at this point would be a placeholder only; the JIT would only ever be called with two of the levels.

@noahfalk
Member Author

Have profiling scenarios been considered for this change?

@discostu105 - Yep! As much as possible we want diagnostic tools to continue to work with the tiered jitting support we are building. I'm looking to do it in a way that keeps those tools working as-is, or with relatively minor updates, but given the low-level interactions profilers and debuggers have with the runtime, it's hard to keep significant runtime changes 100% abstracted. For instance, I think we'll need to reveal that there are additional method jittings which didn't occur before, but we can preserve the semantics that if you update IL when you get the first JitCompilationStarted notification, then that modification will correctly apply to every form of the code that eventually runs. We should continue to evaluate the impact as some additional work comes online that aims to make this change work more smoothly with the profiler. If there are further opportunities to mitigate compat issues by making runtime changes, I'm glad to discuss it.
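
For context, the profiler pattern under discussion looks roughly like this (a fragment, not a complete profiler: the COM boilerplate and the actual IL rewriting are elided, m_pInfo is the ICorProfilerInfo pointer obtained at Initialize, and BuildInstrumentedBody is a hypothetical helper):

#include <cor.h>
#include <corprof.h>

HRESULT STDMETHODCALLTYPE MyProfiler::JITCompilationStarted(
    FunctionID functionId, BOOL fIsSafeToBlock)
{
    ClassID classId;
    ModuleID moduleId;
    mdToken methodToken;
    if (SUCCEEDED(m_pInfo->GetFunctionInfo(functionId, &classId,
                                           &moduleId, &methodToken)))
    {
        // With the semantics described above, IL swapped in at the first
        // JitCompilationStarted applies to the tier-0 jitting and to any
        // later optimized rejitting of the same method.
        LPCBYTE pNewBody = BuildInstrumentedBody(moduleId, methodToken);
        m_pInfo->SetILFunctionBody(moduleId, methodToken, pNewBody);
    }
    return S_OK;
}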

If IL-code is interpreted at first, and jitted only later on, then code already runs before JitCompilationStarted is called. So, an IL-code modification is only possible "eventually".

There is no short-term plan to add such an interpreter, in part because of the additional work it would take to integrate it with the current set of profiling and diagnostic tools, and in part because of the expectation that it would cause exactly this kind of trouble for them.

@JosephTremoulet

How about this as a proposal...

works for me.

@GSPP

GSPP commented May 6, 2017

This is fantastic. It's going to be a big leap in the long run for hot code performance and startup time.

On my machine non-optimized jitting took about 65% of the time that optimized jitting took for similar IL input sizes

This means that optimizations currently slow compilation down only by a factor of 100/65 ≈ 1.5x. If we fully optimize only hot code, then the time spent on optimization can be increased greatly. I don't see why 5x slower compilation would be a problem if it is applied only to the top 5% of methods: those methods would then account for only 25% of the original jit time, which is still covered by the gains from compiling the cold code faster.
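
Making that back-of-envelope arithmetic explicit (the 5% hot fraction and the 5x optimization cost are GSPP's hypothetical numbers):

relative jit cost = 0.95 * 0.65 + 0.05 * 5.0
                  = 0.6175 + 0.25
                  ≈ 0.87 of the all-optimized baseline

so even heavily optimizing the hot 5% leaves total jit time below today's.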

@mattwarren

@GSPP

If we JIT only hot code than the time spent on optimization can be increased greatly.

Note that this feature is only enabling slow or fast JIT; a 'no JIT' (interpreted) option isn't currently possible because the .NET interpreter isn't considered production-ready, see #10478 (comment)

@GSPP

GSPP commented May 8, 2017

@mattwarren thanks for letting me know. A fast JIT should be similar in consequences to an interpreter, I think, so that seems very good still.

At the very least this should remove the (correct) reluctance of the team to add expensive optimizations.

@karelz karelz added this to the 2.0.0 milestone Aug 28, 2017