Enable R2R compilation/inlining of PInvoke stubs where no marshalling is required #22560

fadimounir · 2019-02-13T00:59:11Z

These changes enable the inlining of some PInvokes that do not require any marshalling. With inlined pinvokes, R2R performance should become slightly better, since we'll avoid jitting some of the pinvoke IL stubs that we jit today for S.P.CoreLib. Performance gains not yet measured.

Added JIT_PInvokeBegin/End helpers for all architectures. Linux stubs not yet implemented.
Add INLINE_GETTHREAD for arm/arm64
Set CORJIT_FLAG_USE_PINVOKE_HELPERS jit flag for ReadyToRun compilations
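
For readers unfamiliar with the transition, here is a minimal sketch of the shape an inlined PInvoke takes when the begin/end work is delegated to helpers. It is a toy model, not the runtime's code: `ModelThread`, `ModelFrame`, `ModelPInvokeBegin`, `ModelPInvokeEnd`, and `NativeAdd` are hypothetical names, and the real JIT_PInvokeBegin/End helpers do more work (and are partly written in assembly).

```cpp
#include <cassert>

// Hypothetical stand-in for the transition frame linked onto the thread.
struct ModelFrame {
    ModelFrame* next = nullptr;   // previously active frame on this thread
};

// Hypothetical stand-in for the runtime's per-thread state.
struct ModelThread {
    ModelFrame* frame = nullptr;  // top of the thread's frame chain
    bool preemptiveGC = false;    // true while native code is running
};

// Models the role of a "begin" helper: link the frame, then switch to
// preemptive mode so a GC does not have to wait for this thread.
void ModelPInvokeBegin(ModelThread& t, ModelFrame& f) {
    f.next = t.frame;
    t.frame = &f;
    t.preemptiveGC = true;
}

// Models the role of an "end" helper: return to cooperative mode and
// unlink the frame that the begin helper pushed.
void ModelPInvokeEnd(ModelThread& t, ModelFrame& f) {
    t.preemptiveGC = false;
    assert(t.frame == &f);
    t.frame = f.next;
}

// A blittable native target: plain ints in, plain int out, so no
// marshalling stub is needed around the call.
extern "C" int NativeAdd(int a, int b) { return a + b; }

// What the inlined call site conceptually looks like in the caller.
int CallNativeAddInlined(ModelThread& t, int a, int b) {
    ModelFrame frame;
    ModelPInvokeBegin(t, frame);
    int result = NativeAdd(a, b);
    ModelPInvokeEnd(t, frame);
    return result;
}
```

The point of the sketch is only that, when no marshalling is required, the whole transition fits around a direct call, so there is no separate IL stub left to jit at run time.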

src/vm/arm/PInvokeStubs.asm

src/vm/dllimport.cpp

src/inc/readytorun.h

src/jit/compiler.h

fadimounir · 2019-02-27T21:32:25Z

@jkotas PTAL. I'm still going to run the P0 tests with crossgen enabled, for verification, and will get some perf measurements.

src/jit/lower.cpp

src/vm/amd64/PInvokeStubs.asm

src/jit/compiler.hpp

src/vm/amd64/PInvokeStubs.asm

src/vm/jithelpers.cpp

fadimounir · 2019-03-11T16:47:37Z

@dotnet-bot test Windows_NT x64 Checked CoreFX Tests

fadimounir · 2019-03-11T19:07:47Z

@jkotas PTAL at the new changes I submitted

jkotas · 2019-03-11T20:51:08Z

The delta looks reasonable to me. Have you done any R2R specific testing on this?

fadimounir · 2019-03-11T21:00:00Z

I have run the P0 tests using the 'crossgen' command as described in this doc: https://github.com/dotnet/coreclr/blob/master/Documentation/building/windows-test-instructions.md
Results were clean.

jkotas · 2019-03-11T21:06:12Z

These helpers have tight interaction with the GC. I would also do some crossgen+GC stress testing (with tiered compilation disabled).

fadimounir · 2019-03-11T21:16:14Z

Sounds good. I'll look into it

fadimounir · 2019-03-13T19:34:54Z

@jkotas crossgen testing with and without GC stress was clean with regard to these changes (x64 only). For the other architectures, my targeted pinvoke test case had some form of GC stress enabled and was passing.

fadimounir · 2019-03-13T19:35:14Z

I'm still waiting on the perf job to complete to see what the impact of the changes is.

fadimounir · 2019-03-13T22:58:47Z

/cc @sergiy-k

src/zap/zapinfo.cpp

src/vm/amd64/PInvokeStubs.asm

fadimounir · 2019-03-21T23:04:32Z

Hmm... It doesn't look like we're getting the noticeable startup perf wins I expected: http://benchview/compare?jobid=158586&comparejobids=[158569]&testid=61944&
Some scenarios actually seem slower now, so I'll need to dig in further. I do see some good wins under the "Inlining" category, with 11% faster execution, and CscBench is also 1% faster. However, some tests, mainly under BenchmarksGame and Benchstone, seem slightly slower. Could it be noise?

@AndyAyersMS, @jkotas what do you guys think?

/cc @brianrob

jkotas · 2019-03-28T04:07:54Z

This comment is closed, but I do not see a response to it. Just want to make sure you have seen it:

> Another option is to move the popping of the frame on the slow path into the C helper. If you do that, the need for this macro will disappear and you will have a bit less assembly code to maintain, which is always goodness.

fadimounir · 2019-03-28T04:27:39Z

> Where does Linq use PInvokes to explain this gain?

Without tiered compilation, the previous lab results were showing a 20% regression for some weird reason, even though Linq shouldn't really be impacted by pinvokes. I just wanted to dig deeper into that regression, and make sure it was bogus.

fadimounir · 2019-03-28T04:30:10Z

> Another option is to move the popping of the frame on the slow path into the C helper

Can this be done in the same JIT_RareDisableHelper method or should I add a wrapper for it? I don't know what else uses this helper, and if popping the frame from the thread at that location would have other side effects.

fadimounir · 2019-03-28T04:33:35Z

> How many of these methods are PInvoke stubs? It would be useful to get the list and see how many of them are easy to convert to blittable PInvokes as follow up.

After a second look, I realized that the baseline measurement may also have been a partial R2R image, which is why it shows more jitting. However, for a hello world scenario, I can confirm by debugging that about 5 or 6 pinvokes are inlined and invoked (JIT_PInvokeBegin/End called).

jkotas · 2019-03-28T04:42:51Z

> Can this be done in the same JIT_RareDisableHelper method

It should be a separate method. I would copy and paste the code for JIT_RareDisableHelper and add the extra piece to it.
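
A rough sketch of that suggestion, under stated assumptions: the end-of-PInvoke path pops the frame inline on the fast path, and on the slow path calls a separate helper that both handles the pending suspension and pops the frame, so the existing rare-disable helper stays untouched for its other callers. All names here (`SketchThread`, `SketchFrame`, `g_suspendPending`, `SketchRareDisableAndPopFrame`, `SketchPInvokeEnd`, `RunOnce`) are hypothetical, and the "wait" is reduced to a counter.

```cpp
#include <cassert>

// Hypothetical stand-ins for the runtime's frame and thread state.
struct SketchFrame { SketchFrame* next = nullptr; };

struct SketchThread {
    SketchFrame* frame = nullptr;  // top of the thread's frame chain
    bool preemptiveGC = true;      // the thread is returning from native code
    int slowPathCalls = 0;         // instrumentation for this sketch only
};

// Hypothetical flag meaning "a suspension (for example a GC) is pending".
bool g_suspendPending = false;

// Separate slow-path helper: does the rare-disable work and ALSO pops the
// frame, so the caller needs no extra frame-popping code on this path.
void SketchRareDisableAndPopFrame(SketchThread& t, SketchFrame& f) {
    ++t.slowPathCalls;  // the real helper would wait for the suspension here
    assert(t.frame == &f);
    t.frame = f.next;
}

// End-of-PInvoke transition with a fast path and a slow path.
void SketchPInvokeEnd(SketchThread& t, SketchFrame& f) {
    t.preemptiveGC = false;
    if (g_suspendPending) {
        SketchRareDisableAndPopFrame(t, f);  // slow path: helper pops the frame
        return;
    }
    assert(t.frame == &f);
    t.frame = f.next;                        // fast path: pop inline
}

// Drives one transition and reports how many times the slow path ran.
int RunOnce(bool suspendPending) {
    g_suspendPending = suspendPending;
    SketchThread t;
    SketchFrame f;
    f.next = t.frame;
    t.frame = &f;
    SketchPInvokeEnd(t, f);
    assert(t.frame == nullptr);
    return t.slowPathCalls;
}
```

Calling `RunOnce(false)` exercises only the inline pop, while `RunOnce(true)` routes through the separate helper; in both cases the frame chain ends up empty.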

src/inc/corinfo.h

src/vm/amd64/PInvokeStubs.asm

src/vm/jithelpers.cpp

Increase size reserve for InlineCallFrame

Small adjustment to the arm/arm64 INLINE_GET_THREAD macros

AndyAyersMS · 2019-04-01T19:21:59Z

> Hmm... Doesn't look like we're getting noticeable startup perf wins I expected

Not too surprising; the jit-focused CoreCLR perf tests do not measure startup (or jit time, for the most part). Using ETW to look at jit time and jit requests (or using scenario startup metrics) is a better way to assess this.

Is there a follow-up plan to enable this for non-windows platforms?

fadimounir · 2019-04-01T19:44:05Z

> Is there a follow-up plan to enable this for non-windows platforms?

Yes. I'm currently working on it and will create a separate PR

… is required (dotnet/coreclr#22560)
* These changes enable the inlining of some PInvokes that do not require any marshalling. With inlined pinvokes, R2R performance should become slightly better, since we'll avoid jitting some of the pinvoke IL stubs that we jit today for S.P.CoreLib. Performance gains not yet measured.
* Added JIT_PInvokeBegin/End helpers for all architectures. Linux stubs not yet implemented
* Add INLINE_GETTHREAD for arm/arm64
* Set CORJIT_FLAG_USE_PINVOKE_HELPERS jit flag for ReadyToRun compilations
* Updating R2RDump tool to handle pinvokes

Commit migrated from dotnet/coreclr@bc9248c

fadimounir added the * NO MERGE * label Feb 13, 2019

jkotas reviewed Feb 13, 2019

src/vm/arm/PInvokeStubs.asm (outdated)

jkotas reviewed Feb 13, 2019

src/vm/dllimport.cpp (outdated)

jkotas reviewed Feb 13, 2019

src/vm/dllimport.cpp (outdated)

MichalStrehovsky reviewed Feb 13, 2019

src/inc/readytorun.h

jkotas reviewed Feb 26, 2019

src/jit/compiler.h (outdated)

fadimounir force-pushed the enable_some_pinvokes branch from 8e27394 to 776cf97 February 27, 2019 21:25

fadimounir removed the * NO MERGE * label Feb 27, 2019

fadimounir changed the title ~~[WIP] Enable some pinvokes - Do not merge~~ Enable R2R compilation/inlining of PInvoke stubs where no marshalling is required Feb 27, 2019