Introduce a tiered JIT #4331

Closed
GSPP opened this Issue Apr 14, 2016 · 63 comments


GSPP commented Apr 14, 2016

Why is the .NET JIT not tiered?

The JIT has two primary design goals: fast startup time and high steady-state throughput.

At first, these goals appear at odds. But with a two-tier JIT design they are both attainable:

  1. All code starts out interpreted. This results in extremely fast startup time (faster than RyuJIT). Example: The Main method is almost always cold and jitting it is a waste of time.
  2. Code that runs often is jitted using a very high-quality code generator. Very few methods will be hot (1%?). Therefore, throughput of the high-quality JIT does not matter much. It can spend as much time as a C compiler spends to generate very good code. Also, it can assume that the code is hot. It can inline like crazy and unroll loops. Code size is not a concern. (A toy sketch of this promotion policy follows the list.)
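A toy sketch of this promotion policy in C# (illustrative only: the type, the threshold, and the counter placement are invented for the example; a real runtime would do this inside the execution engine, not in user code):

using System;

// Toy model of a tiered method: start with cheap code (an interpreter),
// count invocations, and swap in optimized code once the method is hot.
class TieredMethod
{
    const int PromotionThreshold = 30;   // assumed: calls before tier-up
    int _hitCount;
    Func<int, int> _impl;                // currently active code
    readonly Func<int, int> _optimized;  // stands in for tier-2 output

    public TieredMethod(Func<int, int> interpreted, Func<int, int> optimized)
    {
        _impl = interpreted;
        _optimized = optimized;
    }

    public int Invoke(int arg)
    {
        if (++_hitCount == PromotionThreshold)
            _impl = _optimized;          // promote: later calls run fast code
        return _impl(arg);
    }
}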

Reaching this architecture does not seem too costly:

  1. Writing an interpreter seems cheap compared to a JIT.
  2. A high quality code generator must be created. This could be VC, or the LLILC project.
  3. It must be possible to transition running interpreted code to compiled code. This is possible; the JVM does it. It's called on-stack replacement (OSR).

Is this idea being pursued by the JIT team?

.NET runs on hundreds of millions of servers. I feel like a lot of performance is left on the table, and millions of servers are wasted for customers, because of suboptimal code gen.

category:throughput
theme:big-bets
skill-level:expert
cost:extra-large

RussKeldorph (Member) commented Apr 14, 2016

@GSPP Tiering is a constant topic in planning conversations. My impression is that it's a matter of when, not if, if that provides any solace. As to why it's not already there, I think it's because, historically, the perceived potential gains didn't justify the additional development resources necessary to manage the increased complexity and risk of multiple codegen modes. I should really let the experts speak to this, though, so I'll add them.

/cc @dotnet/jit-contrib @russellhadley

mikedn (Contributor) commented Apr 14, 2016

Somehow I doubt that this is still relevant in a world of crossgen/NGen, Ready to Run, and CoreRT.


GSPP commented Apr 14, 2016

None of these deliver high steady-state throughput right now, which is what's important for most web apps. If they ever do, I'm happy with that, since I personally don't care about startup time.

But so far, all code generators for .NET have attempted an impossible balancing act between the two goals, fulfilling neither very well. Let's get rid of that balancing act so that we can turn optimizations up to 11.

mikedn (Contributor) commented Apr 14, 2016

> But so far, all code generators for .NET have attempted an impossible balancing act between the two goals, fulfilling neither very well. Let's get rid of that balancing act so that we can turn optimizations up to 11.

I agree, but fixing this doesn't require things like an interpreter. Just a good crossgen compiler, be it a better RyuJIT or LLILC.

DemiMarie commented Apr 14, 2016

I think the biggest advantage is for applications that need to generate code at runtime. These include dynamic languages and server containers.

CarolEidt (Member) commented Apr 14, 2016

It's true that dynamically generated code is one motivation - but it is also true that a static compiler will never have access to all of the information available at runtime. Not only that, even when it speculates (e.g. based on profile information), it is much more difficult for a static compiler to do so in the presence of modal or external context-dependent behavior.

GSPP commented Apr 14, 2016

Web apps should not need any ngen-style processing. It does not fit well into the deployment pipeline, and it takes a lot of time to ngen a big binary (even if almost all of the code is dynamically dead or cold).

Also, when debugging and testing a web app you can't rely on ngen to give you realistic performance.

Further, I second Carol's point about using dynamic information. The interpretation tier can profile the code (branches, loop trip counts, dynamic dispatch targets). It's a perfect match: first collect the profile, then optimize.
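As a sketch of what such an interpreter-collected profile could look like (hypothetical C#; the type and the 95% bias threshold are assumptions, not anything the CLR defines):

using System;

// Hypothetical per-branch counters an interpreter tier could maintain
// and a tier-2 compiler could later consult.
struct BranchProfile
{
    public int Taken;
    public int NotTaken;

    // Treat the branch as predictable when one side dominates.
    public bool IsBiased =>
        Math.Max(Taken, NotTaken) >= 0.95 * (Taken + NotTaken);
}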

Tiering solves everything in every scenario forever. Approximately speaking :) This can actually get us to the promise of JITs: Achieve performance beyond what a C compiler can do.

redknightlois commented Apr 15, 2016

The current implementation of RyuJIT is good enough for a Tier 1... The question is: would it make sense to have a Tier 2 extreme-optimization JIT for hot paths that can run after the fact? Essentially, when we detect (or have enough runtime information to know) that something is hot, or when asked to, use that instead from the start.

GSPP commented Apr 15, 2016

RyuJIT is by far good enough to be the tier 1. One problem with that is that an interpreter would have far faster startup time (in my estimation). A second problem is that in order to advance to tier 2, the local state of executing tier 1 code must be transferable to the new tier 2 code (OSR). That requires RyuJIT changes. Adding an interpreter would, I think, be a cheaper path, with better startup latency at the same time.

An even cheaper variant would be to not replace running code with tier 2 code but instead wait until the tier 1 code naturally returns. This can be a problem if the code enters a long-running hot loop: it will never reach tier 2 performance that way.

I think that would not be too bad and could be used as a v1 strategy. Mitigating ideas are available, such as an attribute marking a method as hot (this should exist anyway, even with the current JIT strategy).
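Such an attribute might look like this (purely hypothetical; no such attribute exists in the framework as of this discussion):

using System;

// Hypothetical hint: ask the runtime to skip the cheap tier and compile
// this method with full optimization immediately.
[AttributeUsage(AttributeTargets.Method, Inherited = false)]
public sealed class HotMethodAttribute : Attribute { }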

redknightlois commented Apr 15, 2016

@GSPP That is true, but that doesn't mean you wouldn't know it on the next run. If jitted code & instrumentation become persistent, then on the second execution you will still get Tier 2 code (at the expense of some startup time) --- which, for once, I personally don't care about, as I write mostly server code.

svick (Contributor) commented Apr 15, 2016

> Writing an interpreter seems cheap compared to a JIT.

Instead of writing a brand new interpreter, could it make sense to run RyuJIT with optimizations disabled? Would that improve startup time enough?

> A high quality code generator must be created. This could be VC

Are you talking about C2, the Visual C++ backend? That's not cross-platform and not open source. I doubt that fixing both would happen anytime soon.

GSPP commented Apr 15, 2016

Good idea with disabling optimizations. The OSR problem remains, though. I'm not sure how difficult it is to generate code that allows the runtime to derive the IL architectural state (locals and stack) at a safe point, copy that into tier-2 jitted code, and resume tier 2 execution mid-function. The JVM does it, but who knows how much time that took to implement.
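To make the problem concrete, the state to transfer at a safe point is roughly the following (a hypothetical C# shape; a real OSR mechanism works on raw stack frames and registers, not boxed objects):

using System;

// Hypothetical snapshot of the IL architectural state of one frame,
// captured at a safe point so a tier-2 version can resume mid-method.
sealed class OsrFrameState
{
    public int IlOffset;                               // where execution paused
    public object[] Locals = Array.Empty<object>();    // IL local slots
    public object[] EvalStack = Array.Empty<object>(); // IL evaluation stack
}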

Yes, I was talking about C2. I think I remember that at least one of the Desktop JITs is based on C2 code. Probably does not work for CoreCLR but maybe for Desktop. I'm sure Microsoft is interested in having aligned code bases so that's probably out indeed. LLVM seems to be a great choice. I believe multiple languages are currently interested in making LLVM work with GCs and with managed runtimes in general.

swgillespie (Contributor) commented Apr 18, 2016

> LLVM seems to be a great choice. I believe multiple languages are currently interested in making LLVM work with GCs and with managed runtimes in general.

An interesting article on this topic: Apple recently moved the final tier of their JavaScript JIT away from LLVM: https://webkit.org/blog/5852/introducing-the-b3-jit-compiler/. We would likely encounter similar issues to what they encountered: slow compile times and LLVM's lack of knowledge of the source language.

GSPP commented Apr 18, 2016

10x slower than RyuJIT would be totally acceptable for a 2nd tier.

I don't think that the lack of knowledge of the source language (which is a true concern) is inherent in LLVM's architecture. I believe multiple teams are busy moving LLVM into a state where source language knowledge can be utilized more easily. All non-C high-level languages have this problem when compiling on LLVM.

The WebKit FTL/B3 project is in a harder position to succeed than .NET because it must excel at running code that in total consumes a few hundred milliseconds and then exits. That is the nature of the JavaScript workloads driving web pages. .NET is not in that spot.

AndyAyersMS (Member) commented Apr 18, 2016

@GSPP I'm sure you probably know about LLILC. If not, take a look.

We have been working for a while on LLVM support for CLR concepts and have invested in both EH and GC improvements. Still quite a bit more to do on both. Beyond that, there's an unknown amount of work to get optimizations working properly in the presence of GC.

DemiMarie commented Apr 20, 2016

LLILC seems to be stalled. Is it?

russellhadley (Contributor) commented Apr 20, 2016

@drbo - LLILC is on the back burner for the moment - the MS team has been focusing on getting more targets brought up in RyuJIT as well as fixing issues that come up as CoreCLR drives to release and that's taken pretty much all our time. It's on my TODO list (in my copious free time) to write up a lessons learned post based on how far we've (currently) gotten with LLILC, but I haven't gotten to it yet.
On the tiering: this topic has generated lots of discussion over the years. I think that, given some of the new workloads as well as the new addition of versionable Ready to Run images, we'll be taking a fresh look at how and where to tier.

papaslavik (Contributor) commented Jul 19, 2016

@russellhadley did you have the free time to write the post?

I hypothesize there is something about unpromoted stack slots and GC roots breaking the optimizations, plus slow jitting time... I should have a closer look at the project's code.

papaslavik (Contributor) commented Jul 19, 2016

I also wonder whether it's possible and profitable to jump directly into SelectionDAG and run only part of the LLVM backend. At least some peephole optimization and copy propagation... if, e.g., gcroot promotion to registers is supported in LLILC.

choikwa (Contributor) commented Oct 14, 2016

I am curious about the status of LLILC, including its current bottlenecks and how it fares against RyuJIT. LLVM, being a full-fledged "industrial-strength" compiler, should have a great wealth of optimizations available to OSS. There has been some talk on the mailing list about more efficient, faster serialization/deserialization of the bitcode format; I wonder whether that would be useful for LLILC.

DemiMarie commented Dec 1, 2016

Have there been any more thoughts on this? @russellhadley CoreCLR has been released and RyuJIT has been ported to (at least) x86 – what is next on the roadmap?

noahfalk (Member) commented Nov 19, 2017

@benaadams - Yeah, multicore JIT works. I don't recall which (if any) scenarios have it enabled by default, but you can turn it on via configuration: https://github.com/dotnet/coreclr/blob/master/src/inc/clrconfigvalues.h#L548
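For app-level use, the public surface for multicore JIT is System.Runtime.ProfileOptimization; a minimal sketch (the directory and profile names are arbitrary examples, and the profile root directory must already exist):

using System.Runtime;

class Program
{
    static void Main()
    {
        // Record a JIT profile on the first run; replay it on later runs,
        // jitting the recorded methods in parallel on background threads.
        ProfileOptimization.SetProfileRoot(@"C:\MyAppProfile");
        ProfileOptimization.StartProfile("Startup.profile");
        // ... rest of startup ...
    }
}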

ciplogic commented Apr 3, 2018

I wrote a half-toy compiler, and I've noticed that most of the hard-hitting optimizations can be done fairly well on the same infrastructure; very few things can only be done in a higher-tier optimizer.

What I mean is this: if a function is hit many times, tune the compilation parameters as follows:

  • increase the inline instruction-count budget
  • use a more "advanced" register allocator (an LLVM-like backtracking allocator or a full graph-coloring allocator)
  • run more optimization passes, maybe some specialized with local knowledge. For example: replace a full object allocation with a stack allocation if the object is declared in the method and never escapes the body of the larger inlined function.
  • use PICs (polymorphic inline caches) for the hottest call sites where CHA (class hierarchy analysis) is not possible. Even StringBuilder, for instance, is very likely never overridden; if a call site is observed to receive a StringBuilder every time, all methods called on it can be safely devirtualized, with a type guard set in front of the access (see the sketch after this list).
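The type-guard idea in the last bullet, written out by hand (the Sink/SbSink types are invented for the example; a tier-2 JIT would emit the equivalent machine code from profile data):

using System.Text;

abstract class Sink { public abstract void Append(string s); }

sealed class SbSink : Sink
{
    public readonly StringBuilder Sb = new StringBuilder();
    public override void Append(string s) => Sb.Append(s);
}

static class Demo
{
    // Guarded devirtualization: the profile says the receiver is almost
    // always an SbSink, so test that type once and call the known method
    // directly (inlinable); otherwise fall back to virtual dispatch.
    public static void Append(Sink sink, string s)
    {
        if (sink is SbSink sb)     // type guard derived from the profile
            sb.Sb.Append(s);       // devirtualized, inlinable fast path
        else
            sink.Append(s);        // cold fallback: virtual call
    }
}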

It would also be very nice (but maybe this is me dreaming awake) for CompilerServices to expose the "advanced compiler" so it can be accessed via code or metadata; places like games or trading platforms could then state ahead of time which classes and methods should be "more deeply compiled". This is not NGen; but if a non-tiered compiler is not possible (or desirable), it would at least make the more heavily optimized code available for the critical parts that need the extra performance. Of course, if a platform does not offer the heavy optimizations (let's say Mono), the API calls would simply be no-ops.

AndyAyersMS (Member) commented Apr 24, 2018

We have a solid foundation for tiering in place now thanks to the hard work of @noahfalk, @kouvel and others.

I suggest that we close this issue and open a "how can we make tiered jitting better" issue. I encourage anyone interested in the topic to give the current tiering a try to get an idea where things are at right now. We would love to get feedback on the actual behavior, whether good or bad.

ltrzesniewski commented Apr 24, 2018

Is the current behavior described somewhere? I only found this, but it's more about the implementation details than about the tiering specifically.

AndyAyersMS (Member) commented Apr 25, 2018

I believe we're going to have some kind of summary writeup available soon, with some of the data we've gathered.

Tiering can be enabled in 2.1 by setting COMPlus_TieredCompilation=1. If you try it, please report back what you find...

noahfalk (Member) commented May 4, 2018

With recent PRs (#17840, dotnet/sdk#2201) you also have the ability to specify tiered compilation as a runtimeconfig.json property or an MSBuild project property. Using this functionality requires very recent builds, whereas the environment variable has been around for a while.
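For reference, the two knobs look roughly like this (a sketch; consult the linked PRs for the exact property names). In the project file:

<PropertyGroup>
  <TieredCompilation>true</TieredCompilation>
</PropertyGroup>

Or directly in runtimeconfig.json:

{
  "runtimeOptions": {
    "configProperties": {
      "System.Runtime.TieredCompilation": true
    }
  }
}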

alpencolt (Member) commented Aug 17, 2018

As we've discussed before with @jkotas, tiered JIT can improve startup time. Does it work when we use native images?
We've made measurements for several apps on a Tizen phone; here are the results:

System DLLs | App DLLs | Tiered | Time, s
R2R         | R2R      | no     | 2.68
R2R         | R2R      | yes    | 2.61 (-3%)
R2R         | no       | no     | 4.40
R2R         | no       | yes    | 3.63 (-17%)

We'll check FNV mode as well, but it looks like tiering works well when there are no native images.

cc @gbalykov @nkaretnikov2

BruceForstall (Contributor) commented Aug 17, 2018

FYI, tiered compilation is now the default for .NET Core: #19525

kouvel (Member) commented Aug 17, 2018

@alpencolt, startup time improvements may be less when using AOT compilation such as R2R. The startup time improvement currently comes from jitting more quickly with fewer optimizations, and when using AOT compilation there would be less to JIT. Some methods are not pregenerated, such as some generics, IL stubs, and other dynamic methods. Some generics may benefit from tiering during startup even when using AOT compilation.

noahfalk (Member) commented Aug 17, 2018

I'm going to go ahead and close this issue, since with @kouvel's commit I think we have achieved the ask in the title :D People are welcome to continue discussion and/or open new issues on more specific topics such as requested improvements, questions, or particular investigations. If anyone thinks it was closed prematurely, of course let us know.

noahfalk closed this Aug 17, 2018

daxian-dbw commented Aug 27, 2018

@kouvel Sorry to comment on the closed issue. I wonder: when using AOT compilation such as crossgen, will the application still benefit from second-tier compilation for the hot code paths?

benaadams (Collaborator) commented Aug 27, 2018

@daxian-dbw Yes, very much so; at runtime the JIT can do cross-assembly inlining (between DLLs), branch elimination based on runtime constants (readonly statics), etc.

masonwheeler commented Aug 27, 2018

@benaadams And a well-designed AOT compiler couldn't?

daxian-dbw commented Aug 27, 2018

I found some information about this at https://blogs.msdn.microsoft.com/dotnet/2018/08/02/tiered-compilation-preview-in-net-core-2-1/:

> the pre-compiled images have versioning constraints and CPU instruction constraints that prohibit some types of optimization. For any methods in these images that are called frequently Tiered Compilation requests the JIT to create optimized code on a background thread that will replace the pre-compiled version.

masonwheeler commented Aug 27, 2018

Yeah, that's an example of "not a well-designed AOT." 😛

fiigii (Collaborator) commented Aug 27, 2018

> the pre-compiled images have versioning constraints and CPU instruction constraints that prohibit some types of optimization.

One example is methods that use hardware intrinsics. The AOT compiler (crossgen) just assumes SSE2 as the codegen target on x86/x64, so all methods that use hardware intrinsics will be rejected by crossgen and compiled by the JIT, which knows the underlying hardware.

> And a well-designed AOT compiler couldn't?

An AOT compiler needs link-time optimization (for cross-assembly inlining) and profile-guided optimization (for runtime constants). Meanwhile, an AOT compiler needs "bottom-line" hardware info (like -mavx2 in gcc/clang) at build time for SIMD code.
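For illustration, this is the pattern the JIT can resolve at runtime but a fixed-baseline AOT compiler cannot (using the System.Runtime.Intrinsics.X86 APIs that are in preview around this time):

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class VectorMath
{
    // The JIT knows the actual CPU, so Avx.IsSupported folds to a constant
    // and the dead branch disappears. Crossgen, pinned to an SSE2 baseline,
    // cannot make that decision ahead of time.
    public static Vector256<float> Add(Vector256<float> a, Vector256<float> b)
    {
        if (Avx.IsSupported)
            return Avx.Add(a, b);
        throw new PlatformNotSupportedException();
    }
}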

masonwheeler commented Aug 27, 2018

> One example is methods that use hardware intrinsics. The AOT compiler (crossgen) just assumes SSE2 as the codegen target on x86/x64, so all methods that use hardware intrinsics will be rejected by crossgen and compiled by the JIT, which knows the underlying hardware.

Wait, what? I don't quite follow here. Why would the AOT compiler reject the intrinsics?

> And a well-designed AOT compiler couldn't?

> An AOT compiler needs link-time optimization (for cross-assembly inlining) and profile-guided optimization (for runtime constants). Meanwhile, an AOT compiler needs "bottom-line" hardware info (like -mavx2 in gcc/clang) at build time for SIMD code.

Yes, as I said, "a well-designed AOT compiler." 😁

benaadams (Collaborator) commented Aug 27, 2018

@masonwheeler Different scenario; crossgen is AoT that works with the JIT and allows servicing/patching of DLLs without requiring full application recompilation and redistribution. It offers better codegen than Tier 0 with faster start-up than Tier 1, but it isn't platform-neutral.

Tier 0, crossgen, and Tier 1 all work together as a cohesive model in CoreCLR.

To cross-assembly inline without the JIT would require compiling a statically linked single-file executable, and would require full recompilation and redistribution of the application to patch any library it used, as well as targeting a specific platform (which version of SSE, AVX, etc. to use; the lowest common denominator, or a version for each?).

CoreRT will AoT-compile this style of application.

However, to do certain types of branch elimination that the JIT can do would require generating a large amount of extra asm for the alternative paths, plus runtime patching in of the correct tree,

e.g. any code using a method like the following (where the Tier 1 JIT will remove all the ifs):

static readonly int _numProcs = Environment.ProcessorCount;

public void DoThing()
{
    if (_numProcs == 1) 
    {
       // Single proc path
    }
    else if (_numProcs == 2) 
    {
       // Two proc path
    }
    else
    {
       // Multi proc path
    }
}
masonwheeler commented Aug 27, 2018

@benaadams

> To cross-assembly inline without the JIT would require compiling a statically linked single-file executable, and would require full recompilation and redistribution of the application to patch any library it used, as well as targeting a specific platform (which version of SSE, AVX, etc. to use; the lowest common denominator, or a version for each?).

It shouldn't require a full re-distribution of the application. Look at Android's ART compilation system: You distribute the application as managed code (Java in their case, but the same principles apply) and the compiler, which lives on the local system, AOT compiles the managed code into a super-optimized native executable.

If you change some little library, all the managed code is still there and you wouldn't have to re-distribute everything, just the thing with the patch, and then the AOT can be re-run to produce a new executable. (Obviously this is where the Android analogy breaks down, due to Android's APK app distribution model, but that doesn't apply to desktop/server development.)

benaadams (Collaborator) commented Aug 27, 2018

> and the compiler, which lives on the local system, AOT compiles the managed code...

That's the previous NGen model that the full framework used; though I don't think it created a single assembly inlining the framework code into the application's code either? The difference between the two approaches was highlighted in the "Bing.com runs on .NET Core 2.1!" blog post:

> ReadyToRun Images
>
> Managed applications often can have poor startup performance as methods first have to be JIT compiled to machine code. .NET Framework has a precompilation technology, NGEN. However, NGEN requires the precompilation step to occur on the machine on which the code will execute. For Bing, that would mean NGENing on thousands of machines. This coupled with an aggressive deployment cycle would result in significant serving capacity reduction as the application gets precompiled on the web-serving machines. Furthermore, running NGEN requires administrative privileges, which are often unavailable or heavily scrutinized in a datacenter setting. On .NET Core, the crossgen tool allows the code to be precompiled as a pre-deployment step, such as in the build lab, and the images deployed to production are Ready To Run!

AndyAyersMS (Member) commented Aug 27, 2018

@masonwheeler AOT faces headwinds in full .NET because of the dynamic nature of a .NET process. For instance, method bodies in .NET can be modified via a profiler at any time, classes can be loaded or created via reflection, and new code can be created by the runtime as needed for things like interop -- so interprocedural analysis information at best reflects a transient state of the running process. Any interprocedural analysis or optimization (including inlining) in .NET must be undoable at runtime.

AOT works best when the set of things that can change between AOT time and runtime is small and the impact of such changes is localized, so that the expansive scope available for AOT optimization largely reflects things that must always be true (or have perhaps a small number of alternatives).

If you can build in mechanisms for coping with or restricting the dynamic nature of .Net processes then pure AOT can do pretty well -- for instance .Net Native considers the impact of reflection and interop, and outlaws assembly loading, reflection emit, and (I presume) profile attach. But it is not simple.

There is some work underway to allow us to expand the scope of crossgen to multiple assemblies so we can AOT compile all of the core frameworks (or all the asp.net assemblies) as a bundle. But that's only viable because we have the JIT as a fallback to redo codegen when things change.

masonwheeler commented Aug 27, 2018

@AndyAyersMS I've never believed that the .NET AOT solution should be a "pure AOT-only" solution, for exactly the reasons you're describing here. Having the JIT around to create new code as needed is very important. But the situations in which it's needed are very much in the minority, and therefore I think that Anders Hejlsberg's rule for type systems could be profitably applied here:

> Static where possible, dynamic when necessary.

iSazonov commented Sep 6, 2018

From System.Linq.Expressions:

public TDelegate Compile(bool preferInterpretation);

Does tiered compilation continue to work if preferInterpretation is true?
