JIT: modify box/unbox/isinst/castclass expansions for fast jitting #13188

Merged
merged 2 commits into dotnet:master from AndyAyersMS:RestrictImporterExpansionInMinopts on Aug 10, 2017

Conversation

7 participants
@AndyAyersMS
Member

AndyAyersMS commented Aug 3, 2017

When the jit is generating code in debug/minopts/rare-block/Tier0 modes,
we'd prefer it to generate code more quickly and worry less about overall
generated code performance.

Generally speaking, smaller intermediate and final code should correlate
well with faster jitting, since to first order jit time is proportional to
the amount of IR the jit has to carry around.

This is preliminary work to experiment with different box expansion strategies
to see if there's any notable impact on jit throughput.

Member

AndyAyersMS commented Aug 3, 2017

Still exploring options here, but feedback welcome.

@adiaaida what's the magic phrase to test minopts TP?
cc @dotnet/jit-contrib

Member

AndyAyersMS commented Aug 3, 2017

Regular jit-diffs shows some small improvement from the rare block logic:

Total bytes of diff: -2598 (-0.02 % of base)
    diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
         526 : Microsoft.CodeAnalysis.CSharp.dasm (0.02 % of base)
          38 : System.Security.Cryptography.X509Certificates.dasm (0.04 % of base)
          20 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.00 % of base)
           4 : System.Collections.Concurrent.dasm (0.01 % of base)
           3 : System.Threading.dasm (0.02 % of base)
Top file improvements by size (bytes):
       -3121 : System.Private.CoreLib.dasm (-0.09 % of base)
         -41 : Microsoft.CodeAnalysis.dasm (-0.01 % of base)
         -20 : System.Collections.dasm (-0.02 % of base)
          -7 : System.Collections.Immutable.dasm (0.00 % of base)
9 total files with size differences (4 improved, 5 regressed), 70 unchanged.

This also helps clean up some of the craziness seen in #13187 as boxes of large structs on cold paths are now kept compact (while still doing way more work than needed to feed GetType).

Forcing minopts on and running jit-diffs shows larger wins, so hopefully these show up in the new minopts TP runs...

Total bytes of diff: -45002 (-0.27 % of base)
    diff is an improvement.

Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes

Top file regressions by size (bytes):
          24 : Microsoft.VisualBasic.dasm (0.01 % of base)

Top file improvements by size (bytes):
      -41354 : System.Private.CoreLib.dasm (-0.82 % of base)
        -993 : System.Linq.Parallel.dasm (-0.13 % of base)
        -633 : System.Collections.dasm (-0.39 % of base)
        -500 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.02 % of base)
        -453 : Microsoft.CodeAnalysis.dasm (-0.05 % of base)

23 total files with size differences (22 improved, 1 regressed), 56 unchanged.

Still some losses to look at. Small structs may benefit from inline expansion, and the best expansion depends both on how the box value is produced and on how the box is consumed.

Member

AndyAyersMS commented Aug 3, 2017

Trying to reverse engineer, looks like it might be

@dotnet-bot test Windows_NT x64 min_opt Throughput

(or maybe min_opts, seems like the PR side might be inconsistent?)

Member

AndyAyersMS commented Aug 3, 2017

Trying @dotnet-bot test Windows_NT x64 min_opts Throughput

Member

AndyAyersMS commented Aug 3, 2017

Trying @dotnet-bot test Windows_NT x64 Throughput

Member

AndyAyersMS commented Aug 4, 2017

The TP job data was inconclusive; not surprising given the relatively small expected impact.

To get a more detailed look, I used a variant of the instructions retired explorer to mine ETL on Windows. Looking just at corelib:

| Jit | Minopts | Inst Retired | Ratio to resp base | Ratio to base min |
| --- | --- | --- | --- | --- |
| Base | Yes | 1.096e10 | -- | -- |
| Base | No | 1.786e10 | -- | 1.630 |
| Diff | Yes | 1.092e10 | 0.996 | 0.996 |
| Diff | No | 1.787e10 | 1.000 | 1.630 |
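As a quick check of how the two ratio columns are formed (values copied from the table above; plain arithmetic, not JIT code):

```cpp
#include <cstdio>

// "Ratio to resp base" divides each diff run by the base run in the same mode;
// "Ratio to base min" normalizes everything to the minopts baseline.
int main()
{
    const double baseMin  = 1.096e10; // Base, minopts
    const double baseFull = 1.786e10; // Base, full opt
    const double diffMin  = 1.092e10; // Diff, minopts
    const double diffFull = 1.787e10; // Diff, full opt

    printf("minopts  diff/base:       %.3f\n", diffMin / baseMin);   // ~0.996
    printf("full opt diff/base:       %.3f\n", diffFull / baseFull); // ~1.001, essentially unchanged
    printf("full opt vs minopts base: %.3f\n", baseFull / baseMin);  // ~1.630
    return 0;
}
```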

So there is no TP impact in regular opts, and a small win in minopts.

FWIW raw data looks like this:

DIFF MIN OPT
InstRetired for crossgen: 166592 events, 1.091777E+010 instrs
Jitting           : 00.00% (0 methods)
Jit-generated code: 00.00%
  Jitted code     : 00.00%

58.47%   6.38E+09   native  clrjit.dll
30.11%   3.29E+09   native  crossgen.exe
05.03%   5.49E+08   native  ntdll.dll
03.34%   3.64E+08   native  ucrtbase.dll
02.80%   3.06E+08   native  ntoskrnl.exe

With full opt, jit time almost doubles over minopts:

BASE FULL OPT
InstRetired for crossgen: 272597 events, 1.786492E+010 instrs
Jitting           : 00.00% (0 methods)
Jit-generated code: 00.00%
  Jitted code     : 00.00%

66.43%   1.19E+10   native  clrjit.dll
21.85%   3.9E+09    native  crossgen.exe
04.99%   8.92E+08   native  ntoskrnl.exe
04.20%   7.5E+08    native  ntdll.dll
02.34%   4.18E+08   native  ucrtbase.dll

Member

AndyAyersMS commented Aug 4, 2017

Added similar checks to the isinst/castclass helpers. These kick in a fair amount. Updated stats:

| Jit | Minopts | Inst Retired | Ratio to resp base | Ratio to base min |
| --- | --- | --- | --- | --- |
| Base | Yes | 1.096e10 | -- | -- |
| Base | No | 1.786e10 | -- | 1.630 |
| Diff-Box | Yes | 1.092e10 | 0.996 | 0.996 |
| Diff-Box | No | 1.787e10 | 1.000 | 1.630 |
| Diff-Box+Cast | Yes | 1.085e10 | 0.990 | 0.990 |
| Diff-Box+Cast | No | 1.781e10 | 1.000 | 1.630 |

So on corelib, about a 1% TP improvement for minopts, with 1.9% smaller code.

Jit-diffs under-reports the impact on other assemblies because R2R overrides all this logic already. Need to look at fragile prejitting for a broader take, or else assess on desktop.

We can extend this approach to UNBOX and possibly more. Still investigating.

@CarolEidt

LGTM - and I really appreciate the restructuring and helpful comments and dumps.

src/jit/importer.cpp
// case the call can construct its return value directly into the
// box payload, saving possibly some up-front zeroing.
//
// Currently primitive type boxes get inline expanded. We may

@CarolEidt

CarolEidt Aug 4, 2017

Member

Just for clarity, you might want to say "get inline expanded when not in a constrained mode."

@AndyAyersMS

AndyAyersMS Aug 5, 2017

Member

Primitive type boxes are always expanded inline currently. Will update the comment.

Not sure if this is always the best way when size constrained -- some more experimenting may be needed here.

Member

AndyAyersMS commented Aug 5, 2017

Added similar logic to unbox. This kicks in more broadly in jit-diffs since there is no R2R special casing. Net effect of all 3 on jit-diffs (with MINOPTS) is now:

Total bytes of diff: -146204 (-0.87 % of base)
    diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file improvements by size (bytes):
     -122880 : System.Private.CoreLib.dasm (-2.45 % of base)
      -10799 : System.Linq.Expressions.dasm (-1.45 % of base)
       -3921 : Microsoft.VisualBasic.dasm (-1.83 % of base)
       -1826 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.07 % of base)
       -1284 : Microsoft.CodeAnalysis.dasm (-0.14 % of base)

and the updated corelib TP data (reordered a bit from above) is:

| Jit | Minopts | Inst Retired | Ratio to resp base | Ratio to base minopts |
| --- | --- | --- | --- | --- |
| Base | Yes | 1.096e10 | -- | -- |
| Diff-Box | Yes | 1.092e10 | 0.996 | 0.996 |
| Diff-Box+Cast | Yes | 1.085e10 | 0.990 | 0.990 |
| Diff-Box+Cast+Unbox | Yes | 1.081e10 | 0.986 | 0.986 |
| Base | No | 1.786e10 | -- | 1.630 |
| Diff-Box | No | 1.787e10 | 1.000 | 1.630 |
| Diff-Box+Cast | No | 1.781e10 | 0.997 | 1.625 |
| Diff-Box+Cast+Unbox | No | 1.788e10 | 1.001 | 1.631 |

There is now a slight regression in non-minopts TP, so it may be that simple unboxes (primitives and structs with one field, say) should always be expanded inline, as we are doing for simple boxes.

Member

AndyAyersMS commented Aug 5, 2017

TP impact might be measurable now, so will give the job another shot. Also retesting the arm legs that should now pass.

@dotnet-bot retest Tizen armel Cross Release Build
@dotnet-bot retest Ubuntu arm Cross Release Build
@dotnet-bot test Windows_NT x64 Throughput

Member

AndyAyersMS commented Aug 5, 2017

TP job shows about a 1% net improvement for minopts.

Also a regression for non-minopts. The regression is a bit harder to understand, as the only non-minopts impact should come from rarely run blocks. Looks like most of the regression cases are in small assemblies, so this could be in part an artifact of how the overall aggregate score is computed. Will pick one or two and drill in further.

There is a hypothesis of mine that (for minopts anyways, and even generally for maxopt) the time spent jitting is proportional to the size of the jit output. Let's see how well that holds up here.

Looking just at S.P.Corelib where we have detailed data, there is a net 2.45% decrease in native code size. Jit time is around 60% of crossgen time. So if jit time is directly proportional to native code size, we'd expect to see around a (2.45 * .6) = 1.5% improvement in overall throughput.

Inst retired data shows a 1.4% improvement. So the hypothesis that native code size ~ jit time seems to hold up pretty well here.
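A rough sanity check of that estimate, using the figures quoted above (plain arithmetic, not JIT code):

```cpp
#include <cstdio>

// Expected overall crossgen win = native code size reduction * jit share of crossgen time.
int main()
{
    const double codeSizeReduction  = 0.0245; // 2.45% smaller native code for S.P.CoreLib (minopts)
    const double jitShareOfCrossgen = 0.60;   // clrjit.dll is ~60% of crossgen instructions retired

    // ~1.47%, close to the ~1.4% improvement measured via instructions retired.
    printf("expected crossgen TP win: %.2f%%\n", codeSizeReduction * jitShareOfCrossgen * 100.0);
    return 0;
}
```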

This can be used as a proxy for evaluating other similar approaches to decreasing minopts jit time -- or more generally for trying to get the jit to generate code as fast as possible, e.g. for debuggable codegen or as a new Tier0 jitting mode.

Approaches include:

  • selectively choosing more compact forms during importation (like we do here)
  • selectively enabling simple optimizations that are likely to decrease code size
  • running simple optimizations early to try and eliminate code sooner (say, #10731)
  • introducing new helpers to capture common jit codegen sequences, or reusing existing helpers that have fallen into disuse (things like struct init and struct copy, for instance).

Contributor

pgavlin commented Aug 5, 2017

There is a hypothesis of mine that (for minopts anyways, and even generally for maxopt) the time spent jitting is proportional to the size of the jit output. Let's see how well that holds up here.

It would be interesting to see what the relationship is between the size of the IR at various points in the phase order and the size of the JIT output. I would guess that there is a pretty direct relationship there as well.

Contributor

pgavlin commented Aug 5, 2017

(note that we already have support in the CSV log for reporting the IR size at various phases if COMPlus_JitMeasureIR is set)

Member

AndyAyersMS commented Aug 5, 2017

From the TP job: chart of the full opt crossgen time differences. X axis (log scale) is milliseconds to crossgen with the baseline jit. Not yet sure what this tells me...
[chart image]
Aggregate difference over these 95 assemblies is about 30ms or 0.1%. Geomean of individual ratios is more like 1.1%. So, losses are biased towards smaller assemblies.
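To see how the aggregate ratio and the geomean of per-assembly ratios can diverge like this, here is a small sketch with made-up per-assembly times (not the actual TP job data):

```cpp
#include <cmath>
#include <cstdio>

// One large assembly that improves plus several small ones that regress:
// the total barely moves, but the geomean weights every assembly equally.
int main()
{
    const double times[][2] = { // { base ms, diff ms }
        { 10000.0, 9800.0 }, { 50.0, 55.0 }, { 40.0, 44.0 }, { 30.0, 33.0 }
    };
    const int n = sizeof(times) / sizeof(times[0]);

    double baseTotal = 0.0, diffTotal = 0.0, logSum = 0.0;
    for (int i = 0; i < n; i++)
    {
        baseTotal += times[i][0];
        diffTotal += times[i][1];
        logSum += std::log(times[i][1] / times[i][0]);
    }

    printf("aggregate diff/base: %.3f\n", diffTotal / baseTotal); // ~0.981, dominated by the big assembly
    printf("geomean   diff/base: %.3f\n", std::exp(logSum / n));  // ~1.069, pulled up by the small regressions
    return 0;
}
```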

Will post a similar one for minopts when I have time.

pgavlin added a commit to pgavlin/coreclr that referenced this pull request Aug 6, 2017

Use a smaller expansion of GT_INDEX in MinOpts.

Member

AndyAyersMS commented Aug 6, 2017

From the minopts TP job, geomean diff/base time ratio is 0.985, net diff/base ratio is 0.988. No obvious size-related bias in the results, though it appears smaller runs are noisier (not surprising).

Each data point here is the average of 5 runs. But for this kind of thing averaging can be misleading since we don't have any expectation as to the underlying "noise" distribution.

For example, the worst data point below comes from crossgenning System.Memory, which is reported as being 1.6x slower. The per-iteration times are

| Jit | Run1 | Run2 | Run3 | Run4 | Run5 |
| --- | --- | --- | --- | --- | --- |
| Diff | 46.8424 | 45.77212 | 44.81503 | 45.83143 | 147.5008 |
| Base | 41.76613 | 41.76357 | 41.70683 | 41.68032 | 41.53919 |

So the diff Run5 trashes the diff results. If we'd used the median instead to summarize results, we'd have reported a 1.1x slowdown, which seems more representative.
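Redoing that summary with the per-iteration times above, a small sketch comparing the mean-based and median-based ratios:

```cpp
#include <algorithm>
#include <cstdio>

// Per-iteration crossgen times (ms) for System.Memory, copied from the table above.
int main()
{
    const int n = 5;
    double diffTimes[n] = { 46.8424, 45.77212, 44.81503, 45.83143, 147.5008 };
    double baseTimes[n] = { 41.76613, 41.76357, 41.70683, 41.68032, 41.53919 };

    auto mean = [&](const double* v) { double s = 0.0; for (int i = 0; i < n; i++) s += v[i]; return s / n; };
    auto median = [&](double* v) { std::sort(v, v + n); return v[n / 2]; };

    printf("mean   diff/base: %.2f\n", mean(diffTimes) / mean(baseTimes));     // ~1.59: wrecked by the 147ms outlier
    printf("median diff/base: %.2f\n", median(diffTimes) / median(baseTimes)); // ~1.10: closer to representative
    return 0;
}
```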

[chart image]

pgavlin added a commit to pgavlin/coreclr that referenced this pull request Aug 6, 2017

Use a smaller expansion of `GT_INDEX` in MinOpts.
We currently expand `GT_INDEX` nodes during morph into an explicit
bounds check followed by a load. For example, this tree:

```
[000059] ------------       /--*  LCL_VAR   int    V09 loc6
[000060] R--XG-------    /--*  INDEX     ref
[000058] ------------    |  \--*  LCL_VAR   ref    V00 arg0
[000062] -A-XG-------    *  ASG       ref
[000061] D------N----    \--*  LCL_VAR   ref    V10 loc7
```

is expanded into this tree:

```
[000060] R--XG+------       /--*  IND       ref
[000491] -----+------       |  |  /--*  CNS_INT   long   16 Fseq[#FirstElem]
[000492] -----+------       |  \--*  ADD       byref
[000488] -----+-N----       |     |     /--*  CNS_INT   long   3
[000489] -----+------       |     |  /--*  LSH       long
[000487] -----+------       |     |  |  \--*  CAST      long <- int
[000484] i----+------       |     |  |     \--*  LCL_VAR   int    V09 loc6
[000490] -----+------       |     \--*  ADD       byref
[000483] -----+------       |        \--*  LCL_VAR   ref    V00 arg0
[000493] ---XG+------    /--*  COMMA     ref
[000486] ---X-+------    |  \--*  ARR_BOUNDS_CHECK_Rng void
[000059] -----+------    |     +--*  LCL_VAR   int    V09 loc6
[000485] ---X-+------    |     \--*  ARR_LENGTH int
[000058] -----+------    |        \--*  LCL_VAR   ref    V00 arg0
[000062] -A-XG+------    *  ASG       ref
[000061] D----+-N----    \--*  LCL_VAR   ref    V10 loc7

```

Even in this simple case where both the array object and the index are
lclVars, this represents a rather large increase in the size of the IR.
In the worst case, the JIT introduces an additional lclVar for both the
array object and the index, adding several additional nodes to the tree.
When optimizing, exposing the structure of the array access may be
helpful, as it may allow the compiler to better analyze the program.
When we are not optimizing, however, the expansion serves little purpose
besides constraining the IR shapes that must be handled by the backend.
Due to its need for lclVars in the worst case, this expansion may even
bloat the size of the generated code, as all lclVar references are
generated as loads/stores from/to the stack when we are not optimizing.
In the case above, the expanded tree generates the following x64
assembly:

```
IN0018: 000092 mov      rdi, gword ptr [V00 rbp-10H]
IN0019: 000096 mov      edi, dword ptr [rdi+8]
IN001a: 000099 cmp      dword ptr [V09 rbp-48H], edi
IN001b: 00009C jae      G_M5106_IG38
IN001c: 0000A2 mov      rdi, gword ptr [V00 rbp-10H]
IN001d: 0000A6 mov      esi, dword ptr [V09 rbp-48H]
IN001e: 0000A9 movsxd   rsi, esi
IN001f: 0000AC mov      rdi, gword ptr [rdi+8*rsi+16]
IN0020: 0000B1 mov      gword ptr [V10 rbp-50H], rdi
```

Inspired by other recent experiments (e.g. #13188), this change
introduces a new node that replaces the above expansion in MinOpts. This
node, `GT_INDEX_ADDR`, represents the bounds check and address
computation involved in an array access, and returns the address of the
element that is to be loaded or stored. Using this node, the example
tree given above expands to the following:

```
[000489] a--XG+------    /--*  IND       ref
[000059] -----+------    |  |  /--*  LCL_VAR   int    V09 loc6
[000060] R--XG+--R---    |  \--*  INDEX_ADDR byref
[000058] -----+------    |     \--*  LCL_VAR   ref    V00 arg0
[000062] -A-XG+------    *  ASG       ref
[000061] D----+-N----    \--*  LCL_VAR   ref    V10 loc7
```

This expansion requires only the addition of the `GT_IND` node that
represents the memory access itself. This savings in IR size translates
to about a 2% decrease in instructions retired during non-optimizing
compilation. Furthermore, this expansion tends to generate smaller
code; for example, the tree given above is generated in 29 rather than
35 bytes:

```
IN0018: 000092 mov      edi, dword ptr [V09 rbp-48H]
IN0019: 000095 mov      rsi, gword ptr [V00 rbp-10H]
IN001a: 000099 cmp      rdi, qword ptr [rsi+8]
IN001b: 00009D jae      G_M5106_IG38
IN001c: 0000A3 lea      rsi, bword ptr [rsi+8*rdi+16]
IN001d: 0000A8 mov      rdi, gword ptr [rsi]
IN001e: 0000AB mov      gword ptr [V10 rbp-50H], rdi
```

Member

AndyAyersMS commented Aug 7, 2017

Looked at the two slowest full-opts cases and the slowest min-opt case from the TP job. Could not repro the slowdowns locally.

Doing some desktop testing and if that looks good I'll remove the WIP.

Member

AndyAyersMS commented Aug 8, 2017

@dotnet-bot test Windows_NT minopts

JIT: modify box/unbox/isinst/castclass expansions for fast jitting
When the jit is generating code in debug/minopts/rare-block modes,
we'd prefer it to generate code more quickly and worry less about overall
generated code performance.

Generally speaking, smaller intermediate and final code should correlate
well with faster jitting.

This change alters the expansions of box, unbox, isinst, and castclass when
generating code for minopts, debug, or in rarely run blocks. In such modes
the jit estimates whether an inline sequence or general helper call would
result in more compact code, and then chooses the smaller sequence.

This reduces generated code size around 2.5% in a variety of scenarios,
and roughly translates to a 1.5% improvement in time spent jitting.

Similar strategies can be applied to other complex operations during
importation. That work is forthcoming.
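A minimal sketch of the "choose the smaller sequence" policy the commit message describes (all names, types, and size estimates below are invented for illustration; the real logic lives in src/jit/importer.cpp):

```cpp
#include <cstdio>

enum class BoxExpansion { Inline, HelperCall };

struct BoxSiteInfo
{
    bool compactJitRequested; // minopts / debug codegen, or a rarely run block
    int  inlineSizeEstimate;  // estimated size of the inline allocate-and-copy sequence
    int  helperSizeEstimate;  // estimated size of setting up and calling the box helper
};

BoxExpansion ChooseBoxExpansion(const BoxSiteInfo& site)
{
    // In optimized, hot code keep the inline expansion so later phases can still
    // optimize the box (e.g. the box(value) == null folding mentioned below).
    if (!site.compactJitRequested)
    {
        return BoxExpansion::Inline;
    }

    // Otherwise pick whichever sequence is estimated to be smaller, since smaller
    // IR and code roughly track faster jitting.
    return (site.helperSizeEstimate < site.inlineSizeEstimate) ? BoxExpansion::HelperCall
                                                               : BoxExpansion::Inline;
}

int main()
{
    const BoxSiteInfo largeStructInColdBlock = { true, 120, 40 };
    printf("use helper call: %d\n",
           ChooseBoxExpansion(largeStructInColdBlock) == BoxExpansion::HelperCall); // prints 1
    return 0;
}
```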
Member

AndyAyersMS commented Aug 8, 2017

Desktop testing underway, no issues so far. Just pushed up a squashed version. Removing the WIP.

@dotnet/jit-contrib PTAL

@AndyAyersMS AndyAyersMS changed the title from "[WIP] Experiment with box expansion strategy for fast jitting" to "JIT: modify box/unbox/isinst/castclass expansions for fast jitting" on Aug 8, 2017

Member

AndyAyersMS commented Aug 9, 2017

Desktop tests passed, but after looking over diffs I'm going to back away from using the box helper for rarely run blocks in full opt. It seems tricky to get the logic right here; using the helper call can break the box(value) == null optimization and also may push zeroinits into the prolog.

Since the value of this in full opt is less clear, it seems prudent to just defer for now.

isinst/castclass/unbox don't seem to have such issues, but will revisit just to be sure, once the diffs from unbox are out of the picture.

Member

AndyAyersMS commented Aug 9, 2017

No full-opt diffs in jit-diffs. Desktop now shows 23599 total methods with size differences (23530 improved, 69 regressed), 25 unchanged; diffs are minor.

Think this is ready to go.

@JosephTremoulet PTAL
@dotnet/jit-contrib any final thoughts?

@AndyAyersMS AndyAyersMS merged commit 47b38da into dotnet:master Aug 10, 2017

19 checks passed

CROSS Check
CentOS7.1 x64 Debug Build and Test
FreeBSD x64 Checked Build
OSX10.12 x64 Checked Build and Test
Tizen armel Cross Debug Build
Tizen armel Cross Release Build
Ubuntu arm Cross Release Build
Ubuntu arm64 Cross Debug Build
Ubuntu x64 Checked Build and Test
Ubuntu x64 Formatting
Ubuntu16.04 arm Cross Debug Build
Windows_NT arm Cross Checked Build and Test
Windows_NT arm64 Cross Debug Build
Windows_NT x64 CoreCLR Perf Tests Correctness
Windows_NT x64 Debug Build and Test
Windows_NT x64 Formatting
Windows_NT x64 Release Priority 1 Build and Test
Windows_NT x86 Checked Build and Test
Windows_NT x86 CoreCLR Perf Tests Correctness

@AndyAyersMS AndyAyersMS deleted the AndyAyersMS:RestrictImporterExpansionInMinopts branch Aug 10, 2017

pgavlin added a commit to pgavlin/coreclr that referenced this pull request Aug 15, 2017

Use a smaller expansion of `GT_INDEX` in MinOpts.

pgavlin added a commit to pgavlin/coreclr that referenced this pull request Aug 15, 2017

Use a smaller expansion of `GT_INDEX` in MinOpts.

pgavlin added a commit to pgavlin/coreclr that referenced this pull request Aug 21, 2017

Use a smaller expansion of `GT_INDEX` in MinOpts.

pgavlin added a commit to pgavlin/coreclr that referenced this pull request Aug 21, 2017

Use a smaller expansion of `GT_INDEX` in MinOpts.

@karelz karelz modified the milestone: 2.1.0 Aug 28, 2017
