Improve loop cloning, with debugging improvements #55299

Merged

Conversation

BruceForstall
Member

When loop cloning was creating cloning conditions, it was creating unnecessary bounds checks in some multi-dimensional array index cases. When creating a set of cloning conditions, first a null check is done, then an array length check is done, etc. Thus, the array length expression itself won't fault because we've already done a null check. And a subsequent array index expression won't fault (or need a bounds check) because we've already checked the array length (i.e., we've done a manual bounds check). So, stop creating the unnecessary bounds checks, and mark the appropriate instructions as non-faulting by clearing the GTF_EXCEPT bit.
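The ordering argument above can be sketched in ordinary C++. This is only an illustration under stated assumptions, not the JIT's implementation: `Row`, `Jagged`, `canUseFastPath`, `checkedLoad`, and `sumRow` are hypothetical names, and `std::shared_ptr` is used merely to model nullable managed array references. The point is that each cloning condition makes the next expression safe to evaluate without its own check.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for managed int[] and int[][] references; nullptr models null.
using Row    = std::shared_ptr<std::vector<int>>;
using Jagged = std::shared_ptr<std::vector<Row>>;

// Cloning conditions for: for (j = 0; j < m; j++) sum += a[i][j];
// Each step is safe only because of the step before it.
bool canUseFastPath(const Jagged& a, std::size_t i, std::size_t m)
{
    if (a == nullptr) return false;    // 1. null check on the outer array
    if (i >= a->size()) return false;  // 2. the length read cannot fault after (1)
    const Row& row = (*a)[i];          // 3. the index needs no bounds check after (2)
    if (row == nullptr) return false;  // 4. null check on the inner array
    return m <= row->size();           // 5. covers every j in [0, m)
}

// Fully checked element load, standing in for the slow path's per-access checks.
int checkedLoad(const Jagged& a, std::size_t i, std::size_t j)
{
    if (a == nullptr) throw std::runtime_error("null outer array");
    const Row& row = a->at(i); // throws std::out_of_range on a bad index
    if (row == nullptr) throw std::runtime_error("null inner array");
    return row->at(j);         // throws std::out_of_range on a bad index
}

int sumRow(const Jagged& a, std::size_t i, std::size_t m)
{
    int sum = 0;
    if (canUseFastPath(a, i, m))
    {
        // Fast clone: all checks were hoisted into the cloning conditions.
        const std::vector<int>& row = *(*a)[i];
        for (std::size_t j = 0; j < m; j++)
        {
            assert(j < row.size()); // invariant established by the cloning conditions
            sum += row[j];
        }
    }
    else
    {
        // Slow clone: every access keeps its null and bounds checks.
        for (std::size_t j = 0; j < m; j++)
        {
            sum += checkedLoad(a, i, j);
        }
    }
    return sum;
}
```

In this model, the unchecked reads in steps 2 and 3 correspond to the expressions that the change marks as non-faulting or emits without a bounds check.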

Note that I did not turn on the code to clear GTF_EXCEPT for array length checks because it leads to negative downstream effects in CSE. Namely, there end up being array length expressions that are identical except for the exception bit. When CSE sees this, it gives up on creating a CSE, which leads to regressions in some cases where we don't CSE the array length expression.

Also, for multi-dimension jagged arrays, when optimizing the fast path, we were not removing as many bounds checks as we could. In particular, we weren't removing outer bounds checks, only inner ones. Add code to handle all the bounds checks.

There are some runtime improvements (measured via BenchmarkDotNet on the JIT microbenchmarks), but also some regressions, due, as far as I can tell, to the Intel jcc erratum performance impact. In particular, benchmark ludcmp shows up to a 9% regression due to a jae instruction in the hot loop now crossing a 32-byte boundary, because code changes earlier in the function affect instruction alignment. The hot loop itself is exactly the same (modulo register allocation differences). Nothing can be done about this without mitigating the jcc erratum; it's "bad luck".

In addition to those functional changes, there are a number of debugging-related improvements:

  1. Loop cloning: (a) Improved dumping of cloning conditions and other things, (b) remove an unnecessary member to LcOptInfo, (c) convert the LoopCloneContext raw arrays to jitstd::vector for easier debugging, as clrjit.natvis can be taught to understand them.
  2. CSE improvements: (a) Add getCSEAvailBit and getCSEAvailCrossCallBit functions to avoid multiple hard-codings of these expressions, (b) stop printing all the details of the CSE dataflow to JitDump; just print the result, (c) add optPrintCSEDataFlowSet function to print the CSE dataflow set in symbolic form, not just the raw bits, (d) added FMT_CSE string to use for formatting CSE candidates, (e) added optOptimizeCSEs to the phase structure for JitDump output, (f) remove unused optCSECandidateTotal (remnant of Valnum + lexical CSE)
  3. Alignment: (a) Moved printing of alignment boundaries from emitIssue1Instr to emitEndCodeGen, to avoid the possibility of reading an instruction beyond the basic block. Also, improved the Intel jcc erratum criteria calculations, (b) Change align instructions of zero size to have a zero PerfScore throughput number (since they don't generate code), (c) Add COMPlus_JitDasmWithAlignmentBoundaries to force disasm output to display alignment boundaries.
  4. Codegen / Emitter: (a) Added emitLabelString function for constructing a string to display for a bound emitter label. Created emitPrintLabel to directly print the label, (b) Add genInsDisplayName function to create a string for use when outputting an instruction. For xarch, this prepends the "v" for SIMD instructions, as necessary. This is preferable to calling the raw genInsName function, (c) For each insGroup, created a debug-only list of basic blocks that contributed code to that insGroup. Display this set of blocks in the JitDump disasm output, with block ID. This is useful for looking at an IG, and finding the blocks in a .dot flow graph visualization that contributed to it, (d) remove unused instDisp
  5. Clrjit.natvis: (a) add support for jitstd::vector, JitExpandArray<T>, JitExpandArrayStack<T>, LcOptInfo.
  6. Misc: (a) When compacting an empty loop preheader block with a subsequent block, clear the preheader flag.

benchmarks.run.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 25504
Total bytes of diff: 25092
Total bytes of delta: -412 (-1.62% of base)
Total relative delta: -0.31
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (bytes):
         -92 : 14861.dasm (-2.57% of base)
         -88 : 2430.dasm (-0.77% of base)
         -68 : 12182.dasm (-3.82% of base)
         -48 : 24678.dasm (-1.61% of base)
         -31 : 21598.dasm (-5.13% of base)
         -26 : 21601.dasm (-4.57% of base)
         -21 : 25069.dasm (-7.14% of base)
         -16 : 14859.dasm (-1.38% of base)
         -11 : 14862.dasm (-1.35% of base)
          -6 : 21600.dasm (-1.83% of base)
          -5 : 25065.dasm (-0.58% of base)

11 total files with Code Size differences (11 improved, 0 regressed), 1 unchanged.

Top method improvements (bytes):
         -92 (-2.57% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
         -88 (-0.77% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
         -68 (-3.82% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
         -48 (-1.61% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
         -31 (-5.13% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool
         -26 (-4.57% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
         -21 (-7.14% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
         -16 (-1.38% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
         -11 (-1.35% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
          -6 (-1.83% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
          -5 (-0.58% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this

Top method improvements (percentages):
         -21 (-7.14% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
         -31 (-5.13% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool
         -26 (-4.57% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
         -68 (-3.82% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
         -92 (-2.57% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
          -6 (-1.83% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
         -48 (-1.61% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
         -16 (-1.38% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
         -11 (-1.35% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
         -88 (-0.77% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
          -5 (-0.58% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this

11 total methods with Code Size differences (11 improved, 0 regressed), 1 unchanged.



Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 38374.96
Total PerfScoreUnits of diff: 37914.07000000001
Total PerfScoreUnits of delta: -460.89 (-1.20% of base)
Total relative delta: -0.12
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (PerfScoreUnits):
     -220.67 : 24678.dasm (-1.74% of base)
      -99.27 : 14861.dasm (-2.09% of base)
      -66.30 : 21598.dasm (-1.41% of base)
      -18.73 : 2430.dasm (-0.28% of base)
      -18.40 : 21601.dasm (-1.37% of base)
       -9.73 : 25065.dasm (-0.56% of base)
       -9.05 : 14859.dasm (-0.77% of base)
       -5.51 : 21600.dasm (-0.77% of base)
       -4.15 : 12182.dasm (-0.17% of base)
       -3.92 : 14860.dasm (-0.32% of base)
       -3.46 : 25069.dasm (-2.31% of base)
       -1.70 : 14862.dasm (-0.20% of base)

12 total files with Perf Score differences (12 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
     -220.67 (-1.74% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
      -99.27 (-2.09% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
      -66.30 (-1.41% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool
      -18.73 (-0.28% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
      -18.40 (-1.37% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
       -9.73 (-0.56% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this
       -9.05 (-0.77% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
       -5.51 (-0.77% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
       -4.15 (-0.17% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
       -3.92 (-0.32% of base) : 14860.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long
       -3.46 (-2.31% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
       -1.70 (-0.20% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])

Top method improvements (percentages):
       -3.46 (-2.31% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
      -99.27 (-2.09% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
     -220.67 (-1.74% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
      -66.30 (-1.41% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool
      -18.40 (-1.37% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
       -9.05 (-0.77% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
       -5.51 (-0.77% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
       -9.73 (-0.56% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this
       -3.92 (-0.32% of base) : 14860.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long
      -18.73 (-0.28% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
       -1.70 (-0.20% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
       -4.15 (-0.17% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])

12 total methods with Perf Score differences (12 improved, 0 regressed), 0 unchanged.


coreclr_tests.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 25430
Total bytes of diff: 24994
Total bytes of delta: -436 (-1.71% of base)
Total relative delta: -0.42
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (bytes):
         -92 : 194668.dasm (-2.57% of base)
         -68 : 194589.dasm (-3.82% of base)
         -48 : 248565.dasm (-1.61% of base)
         -32 : 249053.dasm (-3.58% of base)
         -31 : 251012.dasm (-5.13% of base)
         -26 : 251011.dasm (-4.57% of base)
         -19 : 248561.dasm (-6.76% of base)
         -16 : 194667.dasm (-1.38% of base)
         -15 : 252241.dasm (-0.72% of base)
         -12 : 252242.dasm (-0.81% of base)
         -11 : 194669.dasm (-1.35% of base)
          -9 : 246308.dasm (-1.06% of base)
          -9 : 246307.dasm (-1.06% of base)
          -9 : 246245.dasm (-1.06% of base)
          -9 : 246246.dasm (-1.06% of base)
          -6 : 228622.dasm (-0.77% of base)
          -6 : 251010.dasm (-1.83% of base)
          -5 : 248557.dasm (-0.61% of base)
          -4 : 249054.dasm (-0.50% of base)
          -4 : 249052.dasm (-0.47% of base)

22 total files with Code Size differences (22 improved, 0 regressed), 1 unchanged.

Top method improvements (bytes):
         -92 (-2.57% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
         -68 (-3.82% of base) : 194589.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
         -48 (-1.61% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
         -32 (-3.58% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2()
         -31 (-5.13% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool
         -26 (-4.57% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
         -19 (-6.76% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
         -16 (-1.38% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
         -15 (-0.72% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int
         -12 (-0.81% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int
         -11 (-1.35% of base) : 194669.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
          -9 (-1.06% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
          -6 (-0.77% of base) : 228622.dasm - SciMark2.LU:solve(System.Double[][],System.Int32[],System.Double[])
          -6 (-1.83% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
          -5 (-0.61% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool
          -4 (-0.50% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3()
          -4 (-0.47% of base) : 249052.dasm - SimpleArray_01.Test:BadMatrixMul1()

Top method improvements (percentages):
         -19 (-6.76% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
         -31 (-5.13% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool
         -26 (-4.57% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
         -68 (-3.82% of base) : 194589.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
         -32 (-3.58% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2()
         -92 (-2.57% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
          -6 (-1.83% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
         -48 (-1.61% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
         -16 (-1.38% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
         -11 (-1.35% of base) : 194669.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
          -3 (-1.11% of base) : 249057.dasm - SimpleArray_01.Test:Test2()
          -9 (-1.06% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
         -12 (-0.81% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int
          -6 (-0.77% of base) : 228622.dasm - SciMark2.LU:solve(System.Double[][],System.Int32[],System.Double[])
         -15 (-0.72% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int
          -5 (-0.61% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool
          -4 (-0.50% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3()

22 total methods with Code Size differences (22 improved, 0 regressed), 1 unchanged.



Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 161610.68999999997
Total PerfScoreUnits of diff: 160290.10999999996
Total PerfScoreUnits of delta: -1320.58 (-0.82% of base)
Total relative delta: -0.20
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (PerfScoreUnits):
     -639.25 : 252241.dasm (-0.97% of base)
     -220.67 : 248565.dasm (-1.74% of base)
     -132.59 : 252242.dasm (-0.26% of base)
      -99.27 : 194668.dasm (-2.09% of base)
      -66.30 : 251012.dasm (-1.41% of base)
      -62.20 : 249053.dasm (-2.74% of base)
      -18.40 : 251011.dasm (-1.37% of base)
       -9.33 : 248557.dasm (-0.54% of base)
       -9.05 : 194667.dasm (-0.77% of base)
       -8.32 : 249054.dasm (-0.42% of base)
       -5.85 : 246308.dasm (-0.52% of base)
       -5.85 : 246307.dasm (-0.52% of base)
       -5.85 : 246245.dasm (-0.52% of base)
       -5.85 : 246246.dasm (-0.52% of base)
       -5.51 : 251010.dasm (-0.77% of base)
       -4.36 : 249052.dasm (-0.22% of base)
       -4.16 : 253363.dasm (-0.21% of base)
       -4.15 : 194589.dasm (-0.17% of base)
       -3.92 : 194666.dasm (-0.32% of base)
       -3.41 : 248561.dasm (-2.29% of base)

23 total files with Perf Score differences (23 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
     -639.25 (-0.97% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int
     -220.67 (-1.74% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
     -132.59 (-0.26% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int
      -99.27 (-2.09% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
      -66.30 (-1.41% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool
      -62.20 (-2.74% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2()
      -18.40 (-1.37% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
       -9.33 (-0.54% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool
       -9.05 (-0.77% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
       -8.32 (-0.42% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3()
       -5.85 (-0.52% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
       -5.51 (-0.77% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
       -4.36 (-0.22% of base) : 249052.dasm - SimpleArray_01.Test:BadMatrixMul1()
       -4.16 (-0.21% of base) : 253363.dasm - MatrixMul.Test:MatrixMul()
       -4.15 (-0.17% of base) : 194589.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
       -3.92 (-0.32% of base) : 194666.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long
       -3.41 (-2.29% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)

Top method improvements (percentages):
      -62.20 (-2.74% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2()
       -3.41 (-2.29% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
      -99.27 (-2.09% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
     -220.67 (-1.74% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
       -2.70 (-1.71% of base) : 249057.dasm - SimpleArray_01.Test:Test2()
      -66.30 (-1.41% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool
      -18.40 (-1.37% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
     -639.25 (-0.97% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int
       -9.05 (-0.77% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
       -5.51 (-0.77% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
       -9.33 (-0.54% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool
       -5.85 (-0.52% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
       -8.32 (-0.42% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3()
       -3.92 (-0.32% of base) : 194666.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long
     -132.59 (-0.26% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int
       -1.89 (-0.22% of base) : 228622.dasm - SciMark2.LU:solve(System.Double[][],System.Int32[],System.Double[])
       -4.36 (-0.22% of base) : 249052.dasm - SimpleArray_01.Test:BadMatrixMul1()

23 total methods with Perf Score differences (23 improved, 0 regressed), 0 unchanged.


libraries.crossgen2.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 10828
Total bytes of diff: 10809
Total bytes of delta: -19 (-0.18% of base)
Total relative delta: -0.00
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (bytes):
         -19 : 72504.dasm (-0.18% of base)

1 total files with Code Size differences (1 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -19 (-0.18% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this

Top method improvements (percentages):
         -19 (-0.18% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this

1 total methods with Code Size differences (1 improved, 0 regressed), 0 unchanged.



Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 6597.12
Total PerfScoreUnits of diff: 6586.31
Total PerfScoreUnits of delta: -10.81 (-0.16% of base)
Total relative delta: -0.00
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (PerfScoreUnits):
      -10.81 : 72504.dasm (-0.16% of base)

1 total files with Perf Score differences (1 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
      -10.81 (-0.16% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this

Top method improvements (percentages):
      -10.81 (-0.16% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this

1 total methods with Perf Score differences (1 improved, 0 regressed), 0 unchanged.


@dotnet-issue-labeler (bot) added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Jul 7, 2021
@BruceForstall
Member Author

@AndyAyersMS @kunalspathak @dotnet/jit-contrib PTAL

@kunalspathak
Member

Does it help in #35056 or #35293?

@BruceForstall
Member Author

> Does it help in #35056 or #35293?

No, loop cloning currently only handles jagged arrays, not true multi-dimensional arrays.

@BruceForstall
Member Author

/azp run runtime-coreclr jitstress

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Allows inner loop of 3-nested loops (e.g., Array2 benchmark)
to be cloned.
to avoid unnecessary bounds checks.

Revert max cloning condition blocks to 3; allowing more doesn't
seem to improve performance (probably too many conditions before
a not-sufficiently-executed loop, at least for the Array2 benchmark).

1. "#if 0" the guts of the CSE dataflow; that's not useful to most people.
2. Add readable CSE number output to the CSE dataflow set output
3. Add FMT_CSE to commonize CSE number output.
4. Add PHASE_OPTIMIZE_VALNUM_CSES to the pre-phase output "allow list"
and stop doing its own blocks/trees output.
5. Remove unused optCSECandidateTotal
6. Add functions `getCSEAvailBit` and `getCSEAvailCrossCallBit` to avoid
hand-coding these bit calculations in multiple places, for the CSE dataflow set bits.
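
As an illustration of why such helpers are useful, here is a minimal sketch of the bit calculations and of a symbolic printout of a dataflow set. It assumes that CSE candidate numbers are 1-based and that each candidate owns two adjacent bits (availability, then availability across a call). The `"CSE #%02u"` format and the `.c` suffix for the cross-call bit are assumptions made for this example, and `cseDataFlowSetToString` is a hypothetical name, not the JIT's `optPrintCSEDataFlowSet`.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Assumption: candidate numbers start at 1 and each candidate uses two adjacent bits.
unsigned getCSEAvailBit(unsigned cseNum)
{
    return (cseNum - 1) * 2;
}

unsigned getCSEAvailCrossCallBit(unsigned cseNum)
{
    return getCSEAvailBit(cseNum) + 1;
}

// Hypothetical helper: render a dataflow set by candidate number instead of raw bits.
std::string cseDataFlowSetToString(const std::vector<bool>& bits, unsigned cseCount)
{
    std::string out;
    for (unsigned n = 1; n <= cseCount; n++)
    {
        if (!bits[getCSEAvailBit(n)])
        {
            continue; // candidate n is not available in this set
        }
        char buf[32];
        std::snprintf(buf, sizeof(buf), "CSE #%02u", n); // assumed display format
        if (!out.empty())
        {
            out += ", ";
        }
        out += buf;
        if (bits[getCSEAvailCrossCallBit(n)])
        {
            out += ".c"; // assumed marker: also available across a call
        }
    }
    return out;
}
```

With the two calculations in one place, every consumer of the dataflow set agrees on the layout, and a dump can name candidates directly instead of showing a bit pattern.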
When generating loop cloning conditions, mark array index expressions
as non-faulting, as we have already null- and range-checked the array
before generating an index expression.

I also added similar code to mark array length expressions as non-faulting,
for the same reason. However, that leads to CQ losses because of downstream
CSE effects.

This outputs the alignment boundaries without requiring outputting the actual addresses.
It makes it easier to diff changes.

Create function for printing bound emitter labels.

Also, add debug code to associate a BasicBlock with an insGroup, and
output the block number and ID with the emitter label in JitDump, so it's easier
to find where a group of generated instructions came from.

For instructions or instruction sequences which match the Intel jcc
erratum criteria, note that in the alignment boundary dump.

Also, a few fixes:
1. Move the alignment boundary dumping from `emitIssue1Instr` to
`emitEndCodeGen` to avoid the possibility of reading the next instruction in
a group when there is no next instruction.
2. Create `IsJccInstruction` and `IsJmpInstruction` functions for use by the
jcc criteria detection, and fix that detection to fix a few omissions/errors.
3. Change the jcc criteria detection to be hard-coded to 32 byte boundaries
instead of assuming `compJitAlignLoopBoundary` is 32.

An example:
```
    cmp      r11d, dword ptr [rax+8]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 0 ; jcc erratum) 32B boundary ...............................
    jae      G_M42486_IG103
```
In this case, the `cmp` doesn't cross the boundary; it ends adjacent to it (the zero indicates the number of bytes
of the instruction that cross the boundary) and is followed by the `jae`, which starts after the boundary.

Indicating the jcc erratum criteria can help point out potential performance issues due to unlucky
alignment of these instructions in asm diffs.
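
The boundary arithmetic behind that annotation can be sketched as follows. This is a minimal illustration rather than the emitter's code: the function names are hypothetical, the offsets in the usage note are made up, and the criterion used (an affected instruction or fused pair crosses, or ends on, a 32-byte boundary) is my reading of Intel's published description of the erratum.

```cpp
#include <cstddef>

// Hard-coded to 32, independent of any loop alignment setting.
constexpr std::size_t kJccAlignBoundary = 32;

// Number of bytes of the span [offset, offset + size) that lie past the first
// 32-byte boundary following its start. Zero means the span does not cross it.
std::size_t bytesCrossingBoundary(std::size_t offset, std::size_t size)
{
    const std::size_t nextBoundary = (offset / kJccAlignBoundary + 1) * kJccAlignBoundary;
    const std::size_t end          = offset + size;
    return (end > nextBoundary) ? (end - nextBoundary) : 0;
}

// True if the last byte of the span is immediately before a 32-byte boundary.
bool endsOnBoundary(std::size_t offset, std::size_t size)
{
    return (size > 0) && ((offset + size) % kJccAlignBoundary == 0);
}

// Assumed criterion: a jump, or a fused compare-and-jump pair treated as one span,
// is affected if it crosses or ends on a 32-byte boundary.
bool meetsJccErratumCriteria(std::size_t offset, std::size_t size)
{
    return (bytesCrossingBoundary(offset, size) > 0) || endsOnBoundary(offset, size);
}
```

For example, a hypothetical 4-byte `cmp` at offset 28 gives `bytesCrossingBoundary(28, 4) == 0` while ending exactly at the boundary, which matches the "cmp: 0" annotation shown above. Treating it together with a following 6-byte jump as one 10-byte span starting at offset 28 makes `meetsJccErratumCriteria(28, 10)` true.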
XArch sometimes prepends a "v" to the instructions names from the instruction
table. Add a function `genInsDisplayName` to create the full instruction name
that should be displayed, and use that in most places an instruction name will
be displayed, such as in the alignment messages, and normal disassembly. Use
this instead of the raw `genInsName`.

This could be extended to handle arm32 appending an "s", but I didn't want to
touch arm32 with this change.
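
A minimal sketch of the idea, assuming the caller already knows whether the instruction will be emitted in a form that takes the "v" prefix: `insDisplayName` and its `needsVPrefix` parameter are hypothetical and do not reflect the real signature of `genInsDisplayName`, which decides this from the instruction and target itself.

```cpp
#include <string>

// Hypothetical helper: build the name to show in disassembly from the raw
// instruction-table name. The decision about the "v" prefix is passed in by the
// caller here; that is an assumption made to keep the sketch self-contained.
std::string insDisplayName(const std::string& rawName, bool needsVPrefix)
{
    return needsVPrefix ? ("v" + rawName) : rawName;
}
```

Routing every printout through one function like this keeps the alignment messages and the normal disassembly consistent with each other.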
@BruceForstall force-pushed the IncreaseLoopCloningMaxBlocksHeuristic branch from 25ab640 to 35de73f on July 8, 2021 17:54
@BruceForstall
Member Author

Rebased to pick up formatting fix. All test failures were standard infra noise.

Member

@kunalspathak left a comment

Overall looks good...Added some minor feedback.

if (id->idCodeSize() == 0)
{
// We're not going to generate any instruction, so it doesn't count for PerfScore.
result.insThroughput = PERFSCORE_THROUGHPUT_ZERO;
Member

Should this be PERFSCORE_THROUGHPUT_2X * id->idCodeSize()? I suggest PERFSCORE_THROUGHPUT_2X because NOPs are cheap.

Member

Could we also count the NOP compensation we add as part of alignment?

if (emitComp->opts.disAsm)
{
emitDispInsAddr(dst);
printf("\t\t ;; NOP compensation instructions of %d bytes.\n", diff);
}
#endif
BYTE* dstRW = dst + writeableOffset;
dstRW = emitOutputNOP(dstRW, diff);
dst = dstRW - writeableOffset;

Member Author

I think both of these ideas make sense as follow-up work, although maybe use PERFSCORE_THROUGHPUT_4X * id->idCodeSize() to make them even cheaper?

Member

I am OK with doing this as follow-up work. I'm not sure which of PERFSCORE_THROUGHPUT_2X, PERFSCORE_THROUGHPUT_4X, etc. to pick. @tannergooding - any thoughts?
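
To make the options in this thread concrete, here is a small sketch of the proposed costing. The constants are stand-ins with assumed values (0.0, 0.5, and 0.25) chosen for illustration, not the JIT's definitions, and `alignThroughput` is a hypothetical helper rather than a function in this PR.

```cpp
// Assumed stand-in values; the real PERFSCORE_THROUGHPUT_* definitions are not
// shown in this conversation.
constexpr float kThroughputZero = 0.0f;
constexpr float kThroughput2X   = 0.5f;  // assumed: two instructions per cycle
constexpr float kThroughput4X   = 0.25f; // assumed: four instructions per cycle

// Hypothetical costing for an align instruction: zero when no code is generated,
// otherwise a per-byte cost scaled by the number of padding bytes.
float alignThroughput(unsigned codeSize, float perByteCost)
{
    if (codeSize == 0)
    {
        return kThroughputZero; // no instruction is generated, so nothing to count
    }
    return perByteCost * static_cast<float>(codeSize);
}
```

Under these assumed values, 4 bytes of padding would cost 2.0 with the 2X constant and 1.0 with the 4X constant, which is the trade-off being discussed.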

// one of cmp, test, add, sub, and, inc, or dec), direct unconditional jump, indirect jump,
// direct/indirect call, and return.

size_t jccAlignBoundary = 32;
Member

Currently, JitAlignLoopForJcc alignment is done when we align the loop using the non-adaptive approach. At that time, to determine how much padding is needed to align JCC, we base our calculation on compJitAlignLoopBoundary.

// Mitigate JCC erratum by making sure the jmp doesn't fall on the boundary
if (emitComp->opts.compJitAlignLoopForJcc)
{
    // TODO: See if extra padding we might end up adding to mitigate JCC erratum is worth doing?
    currentOffset++;
}

As such, either change this to compJitAlignLoopBoundary or add a note that, once we start aligning JCC during adaptive loop alignment, JCC must always be aligned using a 32-byte boundary, because that's what the reference manual says.

Member Author

I have a couple questions about this comment.

  1. We don't do anything about jcc erratum in adaptive loop alignment mode, currently. Are you suggesting we do?
  2. I don't see how this code for compJitAlignLoopForJcc works. First, it's DEBUG only, so it appears to have been an experiment only, not in the shipping product. Also, how does incrementing currentOffset affect whether the jcc erratum condition is avoided?

Member

> We don't do anything about jcc erratum in adaptive loop alignment mode, currently. Are you suggesting we do?

In general, if we think that it degrades performance (as you observed), then yes, we should at least do it for instructions. Here is the graph from the findings made in #35730:

Axes:

  • X: Ratio of after/before. Stated another way, the ratio is (withmcu/withoutmcu). Ratios less than 1 mean the benchmark performed better with the JCC microcode update applied. Ratios greater than 1 mean the benchmark performed worse with the JCC microcode update applied.
  • Y: Count of benchmarks in the bucket.

[Figure: histogram of benchmark counts per after/before ratio bucket]

In general, I see more microbenchmarks degraded (albeit by a small amount) than improved.

My proposal would be to at least apply the JCC erratum mitigation to jumps that participate in the loop (like the backedge).

> I don't see how this code for compJitAlignLoopForJcc works. First, it's DEBUG only, so it appears to have been an experiment only, not in the shipping product. Also, how does incrementing currentOffset affect whether the jcc erratum condition is avoided?

Yes, it was intentionally made DEBUG only because it didn't fully solve the JCC erratum, nor did it give big benefits. The way it works is: for non-adaptive alignment, if the loop already starts from an offset such that it will still fit in the minimum number of blocks required, then we skip aligning it. Before determining this, if JitAlignLoopForJcc=1, we just increase the offset from which the loop starts, so that the last backedge is pushed further down. The assumption is that, if the JCC erratum applied, our condition if (currentOffset > extraBytesNotInLoop) would do the right thing and not align the last backedge at the boundary. The more I think about it now, it won't work properly in most cases, and we need some more tracking to make sure that the backedges don't fall on the boundary. Conversely, it could also happen that the backedge was previously not on the boundary and, after aligning the loop, it falls on the boundary, worsening performance; that too needs to be handled correctly.
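
As a rough illustration of the boundary condition being discussed (not the JIT's actual code), the check below reports whether an instruction crosses or ends on a 32-byte boundary, and how much padding would move it to the start of the next boundary. The names `CrossesOrEndsOnBoundary` and `PaddingToAvoidBoundary` are hypothetical; the 32-byte default mirrors `jccAlignBoundary` above, and the sketch assumes the instruction is smaller than the boundary.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of the JIT: returns true if an instruction
// occupying bytes [offset, offset + size) crosses a `boundary`-byte boundary
// or ends exactly on one.
bool CrossesOrEndsOnBoundary(std::size_t offset, std::size_t size, std::size_t boundary = 32)
{
    assert(size > 0 && size < boundary);
    const std::size_t lastByte  = offset + size - 1; // last byte of the instruction
    const std::size_t endOffset = offset + size;     // first byte after the instruction

    const bool crosses = (offset / boundary) != (lastByte / boundary);
    const bool endsOn  = (endOffset % boundary) == 0;
    return crosses || endsOn;
}

// Hypothetical helper: number of padding bytes to emit before the instruction
// so that it starts at the next boundary (0 if it is not affected).
std::size_t PaddingToAvoidBoundary(std::size_t offset, std::size_t size, std::size_t boundary = 32)
{
    if (!CrossesOrEndsOnBoundary(offset, size, boundary))
    {
        return 0;
    }
    return boundary - (offset % boundary);
}
```

For example, a 2-byte jump at offset 30 ends on the boundary and one at offset 31 crosses it, while the same jump at offset 28 is unaffected. Any padding added for one jump shifts everything after it, so a real mitigation would have to re-check later jumps, which is the extra tracking mentioned above.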

{
printf("\n");
}
if (GetEmitter()->IsAVXInstruction(ins) && !GetEmitter()->IsBMIInstruction(ins))
Member

Suggested change
-    if (GetEmitter()->IsAVXInstruction(ins) && !GetEmitter()->IsBMIInstruction(ins))
+#ifdef TARGET_XARCH
+    const int TEMP_BUFFER_LEN = 40;
+    static unsigned curBuf = 0;
+    static char buf[4][TEMP_BUFFER_LEN];
+    const char* retbuf;
+    if (GetEmitter()->IsAVXInstruction(ins) && !GetEmitter()->IsBMIInstruction(ins))
+    {
+        sprintf_s(buf[curBuf], TEMP_BUFFER_LEN, "v%s", insName);
+        retbuf = buf[curBuf];
+        curBuf = (curBuf + 1) % 4;
+        return retbuf;
+    }
+#endif
+    return insName;

We don't need the else and #else.

@@ -183,4 +190,48 @@ The .NET Foundation licenses this file to you under the MIT license.
</Expand>
</Type>

<Type Name="jitstd::vector&lt;*&gt;">
Member

Glad to see that this file is getting longer :)

@@ -1620,6 +1620,9 @@ void Compiler::fgCompactBlocks(BasicBlock* block, BasicBlock* bNext)
}
}
bNext->bbPreds = nullptr;

// `block` can no longer be a loop pre-header (if it was before).
block->bbFlags &= ~BBF_LOOP_PREHEADER;
Member

Do we maintain this invariant? Do we have asserts in other places to make sure that a loop pre-header block doesn't have more than one incoming edge?

Member Author

The bit is not widely used (just assertion prop and fgDominate for "new" blocks). Adding such an assert might make sense. I just noticed this because it didn't make sense in a JitDump I was looking at for a block to be marked as a pre-header. fwiw, this change alone causes no asm diffs.

@@ -38,7 +38,6 @@ void Compiler::optInit()
optNativeCallCount = 0;
optAssertionCount = 0;
optAssertionDep = nullptr;
optCSECandidateTotal = 0;
Member

Probably worth adding it in JitTimeLogCsv?

Member Author

Probably not. This gets back to the idea for generally improving JIT stats/metrics: #52877

@@ -494,6 +494,7 @@ enum GenTreeFlags : unsigned int

GTF_INX_RNGCHK = 0x80000000, // GT_INDEX/GT_INDEX_ADDR -- the array reference should be range-checked.
GTF_INX_STRING_LAYOUT = 0x40000000, // GT_INDEX -- this uses the special string array layout
GTF_INX_NONFAULTING = 0x20000000, // GT_INDEX -- the INDEX does not throw an exception (morph to GTF_IND_NONFAULTING)
Member

The name can be easily confused with GTF_IND_NONFAULTING. In fact, even I got confused while reviewing. Can you please pick a different name?

Member Author

It's actually exactly the same as GTF_IND_NONFAULTING except for use on GT_INDEX nodes, not GT_IND nodes. Maybe GTF_INX_NOFAULT just to be different?

Member

That sounds good.

// REVIEW: due to the definition of `condBlocks`, above, the effective max is 3 blocks, meaning
// `maxRank` of 1. Question: should the heuristic allow more blocks to be created in some situations?
// REVIEW: make this based on a COMPlus configuration?
if (condBlocks > 4)
Member

Earlier, the comment mentioned 3 blocks, but we were allowing 4 blocks?

Member Author

The value being compared was 2n + 1, so it was always odd. Comparing against 3 and comparing against 4 are therefore equivalent. I started this work trying to change it to 5, to allow more cloning (especially for the Array2 benchmark), but saw perf regressions. Now that I have fixed some issues, and have more experience looking into those kinds of regressions, especially the jcc erratum and alignment-related regressions, I want to go back and try setting this to 5 again, and see if the regressions still exist and can be mitigated.
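
To illustrate the parity argument (a minimal sketch, not the loop cloning code): if the compared value always has the form 2n + 1, then `> 3` and `> 4` reject exactly the same inputs, while `> 5` additionally admits n = 2. `CondBlocksFor` and `ThresholdsAgree` are hypothetical names, and treating n as the rank mentioned in the REVIEW comment is an assumption.

```cpp
#include <cassert>

// Hypothetical helper: the compared value, following the "2n + 1" shape
// described above. What n stands for in the JIT is an assumption here.
unsigned CondBlocksFor(unsigned n)
{
    return 2 * n + 1;
}

// Returns true if the checks `condBlocks > lo` and `condBlocks > hi` give the
// same answer for every n in [0, maxN].
bool ThresholdsAgree(unsigned maxN, unsigned lo, unsigned hi)
{
    assert(lo <= hi);
    for (unsigned n = 0; n <= maxN; n++)
    {
        const unsigned condBlocks = CondBlocksFor(n);
        if ((condBlocks > lo) != (condBlocks > hi))
        {
            return false; // the two thresholds disagree for this n
        }
    }
    return true;
}
```

With these definitions, `ThresholdsAgree(100, 3, 4)` is true because no odd value equals 4, while `ThresholdsAgree(100, 4, 5)` is false because n = 2 yields 5, which `> 4` rejects and `> 5` accepts.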

Member

@AndyAyersMS left a comment

Changes look good to me.

It would be ideal to separate out the diff-causing changes from the dumping/refactoring changes and have a no-diff PR for the latter. I realize one often works on both in tandem and pulling them apart later can be painful, so no need to do that here. When things are intermixed like this, it makes it harder to be confident that all the changes that lead to diffs were looked at carefully.

@BruceForstall
Member Author

> It would be ideal to separate out the diff causing changes from the dumping/refactoring changes...

Yeah, I thought about that. In fact, I'd prefer to submit PRs for each separate debugging change (CSE, alignment, etc.). But it's so painful and time consuming nowadays to get clean CI runs (and code reviews), that I "gave up" and decided not to.

1. Rename GTF_INX_NONFAULTING to GTF_INX_NOFAULT to increase clarity compared
to existing GTF_IND_NONFAULTING.
2. Minor cleanup in getInsDisplayName.
@BruceForstall merged commit 02ccdac into dotnet:main on Jul 10, 2021
@BruceForstall deleted the IncreaseLoopCloningMaxBlocksHeuristic branch on July 10, 2021 17:18
#ifdef DEBUG
ig->lastGeneratedBlock = nullptr;
// Explicitly call the constructor, since IGs don't actually have a constructor.
ig->igBlocks.jitstd::list<BasicBlock*>::list(emitComp->getAllocator(CMK_LoopOpt));
Member

The GCC leg was failing when the PR got merged. It looks to be expecting a type rather than a ctor (function):

Suggested change
-    ig->igBlocks.jitstd::list<BasicBlock*>::list(emitComp->getAllocator(CMK_LoopOpt));
+    ig->igBlocks.jitstd::list<BasicBlock*>(emitComp->getAllocator(CMK_LoopOpt));
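
For reference, the portable C++ way to run a constructor on memory that was allocated without one is placement new. The following is a minimal sketch, not a proposed patch: `std::list<int>` and a local buffer are stand-ins (assumptions) for `jitstd::list<BasicBlock*>` and the IG memory, and `ConstructInRawStorage` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <new>

using BlockList = std::list<int>; // stand-in for jitstd::list<BasicBlock*>

// Constructs a list inside raw, uninitialized storage with placement new,
// uses it, and destroys it explicitly. Returns the number of elements added.
std::size_t ConstructInRawStorage()
{
    alignas(BlockList) unsigned char storage[sizeof(BlockList)];

    BlockList* blocks = new (storage) BlockList(); // runs the constructor in place
    assert(blocks->empty());                       // a freshly constructed list is empty

    blocks->push_back(1);
    blocks->push_back(2);
    const std::size_t count = blocks->size();

    blocks->~BlockList(); // destroy explicitly; there is no matching delete for this storage
    return count;
}
```

Whether this form fits the emitter's allocation scheme would need to be checked against the actual code.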

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI