Improve loop cloning, with debugging improvements #55299

BruceForstall · 2021-07-07T23:05:35Z

When loop cloning was creating cloning conditions, it was creating unnecessary bounds checks in some multi-dimensional array index cases. When creating a set of cloning conditions, first a null check is done, then an array length check is done, etc. Thus, the array length expression itself won't fault because we've already done a null check. And a subsequent array index expression won't fault (or need a bounds check) because we've already checked the array length (i.e., we've done a manual bounds check). So, stop creating the unnecessary bounds checks, and mark the appropriate instructions as non-faulting by clearing the GTF_EXCEPT bit.

Note that I did not turn on the code to clear GTF_EXCEPT for array length checks because it leads to negative downstream effects in CSE. Namely, there end up being array length expressions that are identical except for the exception bit. When CSE sees this, it gives up on creating a CSE, which leads to regressions in some cases where we don't CSE the array length expression.

Also, for multi-dimension jagged arrays, when optimizing the fast path, we were not removing as many bounds checks as we could. In particular, we weren't removing outer bounds checks, only inner ones. Add code to handle all the bounds checks.
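
To illustrate the ordering argument above, here is a minimal C++ sketch, not the JIT's implementation. It models a jagged array as a vector of nullable row pointers; the names `Row`, `Jagged`, `cloningConditionsHold`, and `sumRowPrefix` are hypothetical. Each condition is evaluated only after the checks it depends on have passed, so evaluating the conditions cannot fault, and the fast path needs no checks on either the outer or the inner index.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Row    = std::vector<int>;
using Jagged = std::vector<const Row*>; // rows may be null, as in a managed jagged array

// Hypothetical helper: evaluate the cloning conditions in dependency order.
bool cloningConditionsHold(const Jagged* a, int i, int m)
{
    if (a == nullptr)
        return false; // null check on the outer array
    if (i < 0 || static_cast<std::size_t>(i) >= a->size())
        return false; // outer length check; reading the length cannot fault here
    const Row* row = (*a)[i]; // outer index cannot fault: it was just range-checked
    if (row == nullptr)
        return false; // null check on the inner array
    if (m < 0 || static_cast<std::size_t>(m) > row->size())
        return false; // inner length check covers every j in [0, m)
    return true;
}

// Sums a[i][0..m). The fast path omits all checks; the slow path checks every access.
int sumRowPrefix(const Jagged* a, int i, int m)
{
    int sum = 0;
    if (cloningConditionsHold(a, i, m))
    {
        const Row& row = *(*a)[i];
        for (int j = 0; j < m; j++)
        {
            assert(static_cast<std::size_t>(j) < row.size()); // invariant, debug-only
            sum += row[static_cast<std::size_t>(j)];          // unchecked access
        }
    }
    else
    {
        for (int j = 0; j < m; j++)
        {
            if (a == nullptr)
                throw std::invalid_argument("null array");
            const Row* row = a->at(static_cast<std::size_t>(i)); // checked outer access
            if (row == nullptr)
                throw std::invalid_argument("null row");
            sum += row->at(static_cast<std::size_t>(j)); // checked inner access
        }
    }
    return sum;
}
```

The slow path keeps the original per-access behavior, including throwing at the first out-of-range element, which is why the clone can only be entered when every condition holds.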

There are some runtime improvements (measured via BenchmarkDotNet on the JIT microbenchmarks), but also some regressions that, as far as I can tell, are due to the performance impact of the Intel jcc erratum. In particular, benchmark ludcmp shows up to a 9% regression because a jae instruction in the hot loop now crosses a 32-byte boundary: code changes earlier in the function shifted the instruction alignment. The hot loop itself is exactly the same (modulo register allocation differences). Nothing can be done about this without mitigating the jcc erratum; it's "bad luck".
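
As a rough illustration of the condition involved, the erratum concerns jump instructions that cross a 32-byte boundary or end on one. The sketch below checks that for a single instruction; the function name `touchesJccBoundary` and the offsets in the usage note are hypothetical, and this is not the JIT's detection code.

```cpp
#include <cstddef>

constexpr std::size_t kJccBoundary = 32; // boundary size named by the erratum

// True if the instruction occupying [offset, offset + size) crosses a 32-byte
// boundary or its last byte is immediately before one (it "ends on" the boundary).
bool touchesJccBoundary(std::size_t offset, std::size_t size)
{
    if (size == 0)
        return false; // no bytes emitted, nothing to cross
    const std::size_t startBlock = offset / kJccBoundary;
    const std::size_t endBlock   = (offset + size) / kJccBoundary; // block of the byte just past the end
    return startBlock != endBlock;
}
```

For example, a 6-byte jump at offset 28 crosses the boundary at 32, while the same jump at offset 32 does not touch any boundary. A change of a few bytes earlier in a function can move a jump from one case to the other.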

In addition to those functional changes, there are a number of debugging-related improvements:

Loop cloning: (a) Improved dumping of cloning conditions and other things, (b) remove an unnecessary member from LcOptInfo, (c) convert the LoopCloneContext raw arrays to jitstd::vector for easier debugging, as clrjit.natvis can be taught to understand them.
CSE improvements: (a) Add getCSEAvailBit and getCSEAvailCrossCallBit functions to avoid multiple hard-codings of these expressions, (b) stop printing all the details of the CSE dataflow to JitDump; just print the result, (c) add optPrintCSEDataFlowSet function to print the CSE dataflow set in symbolic form, not just the raw bits, (d) added FMT_CSE string to use for formatting CSE candidates, (e) added optOptimizeCSEs to the phase structure for JitDump output, (f) remove unused optCSECandidateTotal (remnant of Valnum + lexical CSE)
Alignment: (a) Moved printing of alignment boundaries from emitIssue1Instr to emitEndCodeGen, to avoid the possibility of reading an instruction beyond the basic block. Also, improved the Intel jcc erratum criteria calculations, (b) Change align instructions of zero size to have a zero PerfScore throughput number (since they don't generate code), (c) Add COMPlus_JitDasmWithAlignmentBoundaries to force disasm output to display alignment boundaries.
Codegen / Emitter: (a) Added emitLabelString function for constructing a string to display for a bound emitter label. Created emitPrintLabel to directly print the label, (b) Add genInsDisplayName function to create a string for use when outputting an instruction. For xarch, this prepends the "v" for SIMD instructions, as necessary. This is preferable to calling the raw genInsName function, (c) For each insGroup, created a debug-only list of basic blocks that contributed code to that insGroup. Display this set of blocks in the JitDump disasm output, with block ID. This is useful for looking at an IG, and finding the blocks in a .dot flow graph visualization that contributed to it, (d) remove unused instDisp
Clrjit.natvis: (a) add support for jitstd::vector, JitExpandArray<T>, JitExpandArrayStack<T>, LcOptInfo.
Misc: (a) When compacting an empty loop preheader block with a subsequent block, clear the preheader flag.

benchmarks.run.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 25504
Total bytes of diff: 25092
Total bytes of delta: -412 (-1.62% of base)
Total relative delta: -0.31
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (bytes):
         -92 : 14861.dasm (-2.57% of base)
         -88 : 2430.dasm (-0.77% of base)
         -68 : 12182.dasm (-3.82% of base)
         -48 : 24678.dasm (-1.61% of base)
         -31 : 21598.dasm (-5.13% of base)
         -26 : 21601.dasm (-4.57% of base)
         -21 : 25069.dasm (-7.14% of base)
         -16 : 14859.dasm (-1.38% of base)
         -11 : 14862.dasm (-1.35% of base)
          -6 : 21600.dasm (-1.83% of base)
          -5 : 25065.dasm (-0.58% of base)

11 total files with Code Size differences (11 improved, 0 regressed), 1 unchanged.

Top method improvements (bytes):
         -92 (-2.57% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
         -88 (-0.77% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
         -68 (-3.82% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
         -48 (-1.61% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
         -31 (-5.13% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool
         -26 (-4.57% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
         -21 (-7.14% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
         -16 (-1.38% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
         -11 (-1.35% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
          -6 (-1.83% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
          -5 (-0.58% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this

Top method improvements (percentages):
         -21 (-7.14% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
         -31 (-5.13% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool
         -26 (-4.57% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
         -68 (-3.82% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
         -92 (-2.57% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
          -6 (-1.83% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
         -48 (-1.61% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
         -16 (-1.38% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
         -11 (-1.35% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
         -88 (-0.77% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
          -5 (-0.58% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this

11 total methods with Code Size differences (11 improved, 0 regressed), 1 unchanged.


Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 38374.96
Total PerfScoreUnits of diff: 37914.07000000001
Total PerfScoreUnits of delta: -460.89 (-1.20% of base)
Total relative delta: -0.12
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (PerfScoreUnits):
     -220.67 : 24678.dasm (-1.74% of base)
      -99.27 : 14861.dasm (-2.09% of base)
      -66.30 : 21598.dasm (-1.41% of base)
      -18.73 : 2430.dasm (-0.28% of base)
      -18.40 : 21601.dasm (-1.37% of base)
       -9.73 : 25065.dasm (-0.56% of base)
       -9.05 : 14859.dasm (-0.77% of base)
       -5.51 : 21600.dasm (-0.77% of base)
       -4.15 : 12182.dasm (-0.17% of base)
       -3.92 : 14860.dasm (-0.32% of base)
       -3.46 : 25069.dasm (-2.31% of base)
       -1.70 : 14862.dasm (-0.20% of base)

12 total files with Perf Score differences (12 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
     -220.67 (-1.74% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
      -99.27 (-2.09% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
      -66.30 (-1.41% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool
      -18.73 (-0.28% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
      -18.40 (-1.37% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
       -9.73 (-0.56% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this
       -9.05 (-0.77% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
       -5.51 (-0.77% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
       -4.15 (-0.17% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
       -3.92 (-0.32% of base) : 14860.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long
       -3.46 (-2.31% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
       -1.70 (-0.20% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])

Top method improvements (percentages):
       -3.46 (-2.31% of base) : 25069.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
      -99.27 (-2.09% of base) : 14861.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
     -220.67 (-1.74% of base) : 24678.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
      -66.30 (-1.41% of base) : 21598.dasm - Benchstone.BenchI.Array2:Bench(int):bool
      -18.40 (-1.37% of base) : 21601.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
       -9.05 (-0.77% of base) : 14859.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
       -5.51 (-0.77% of base) : 21600.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
       -9.73 (-0.56% of base) : 25065.dasm - Benchstone.BenchF.InProd:Test():bool:this
       -3.92 (-0.32% of base) : 14860.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long
      -18.73 (-0.28% of base) : 2430.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
       -1.70 (-0.20% of base) : 14862.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
       -4.15 (-0.17% of base) : 12182.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])

12 total methods with Perf Score differences (12 improved, 0 regressed), 0 unchanged.

coreclr_tests.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 25430
Total bytes of diff: 24994
Total bytes of delta: -436 (-1.71% of base)
Total relative delta: -0.42
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (bytes):
         -92 : 194668.dasm (-2.57% of base)
         -68 : 194589.dasm (-3.82% of base)
         -48 : 248565.dasm (-1.61% of base)
         -32 : 249053.dasm (-3.58% of base)
         -31 : 251012.dasm (-5.13% of base)
         -26 : 251011.dasm (-4.57% of base)
         -19 : 248561.dasm (-6.76% of base)
         -16 : 194667.dasm (-1.38% of base)
         -15 : 252241.dasm (-0.72% of base)
         -12 : 252242.dasm (-0.81% of base)
         -11 : 194669.dasm (-1.35% of base)
          -9 : 246308.dasm (-1.06% of base)
          -9 : 246307.dasm (-1.06% of base)
          -9 : 246245.dasm (-1.06% of base)
          -9 : 246246.dasm (-1.06% of base)
          -6 : 228622.dasm (-0.77% of base)
          -6 : 251010.dasm (-1.83% of base)
          -5 : 248557.dasm (-0.61% of base)
          -4 : 249054.dasm (-0.50% of base)
          -4 : 249052.dasm (-0.47% of base)

22 total files with Code Size differences (22 improved, 0 regressed), 1 unchanged.

Top method improvements (bytes):
         -92 (-2.57% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
         -68 (-3.82% of base) : 194589.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
         -48 (-1.61% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
         -32 (-3.58% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2()
         -31 (-5.13% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool
         -26 (-4.57% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
         -19 (-6.76% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
         -16 (-1.38% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
         -15 (-0.72% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int
         -12 (-0.81% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int
         -11 (-1.35% of base) : 194669.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
          -9 (-1.06% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
          -6 (-0.77% of base) : 228622.dasm - SciMark2.LU:solve(System.Double[][],System.Int32[],System.Double[])
          -6 (-1.83% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
          -5 (-0.61% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool
          -4 (-0.50% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3()
          -4 (-0.47% of base) : 249052.dasm - SimpleArray_01.Test:BadMatrixMul1()

Top method improvements (percentages):
         -19 (-6.76% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
         -31 (-5.13% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool
         -26 (-4.57% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
         -68 (-3.82% of base) : 194589.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
         -32 (-3.58% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2()
         -92 (-2.57% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
          -6 (-1.83% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
         -48 (-1.61% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
         -16 (-1.38% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
         -11 (-1.35% of base) : 194669.dasm - LUDecomp:lubksb(System.Double[][],int,System.Int32[],System.Double[])
          -3 (-1.11% of base) : 249057.dasm - SimpleArray_01.Test:Test2()
          -9 (-1.06% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
          -9 (-1.06% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
         -12 (-0.81% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int
          -6 (-0.77% of base) : 228622.dasm - SciMark2.LU:solve(System.Double[][],System.Int32[],System.Double[])
         -15 (-0.72% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int
          -5 (-0.61% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool
          -4 (-0.50% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3()

22 total methods with Code Size differences (22 improved, 0 regressed), 1 unchanged.


Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 161610.68999999997
Total PerfScoreUnits of diff: 160290.10999999996
Total PerfScoreUnits of delta: -1320.58 (-0.82% of base)
Total relative delta: -0.20
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (PerfScoreUnits):
     -639.25 : 252241.dasm (-0.97% of base)
     -220.67 : 248565.dasm (-1.74% of base)
     -132.59 : 252242.dasm (-0.26% of base)
      -99.27 : 194668.dasm (-2.09% of base)
      -66.30 : 251012.dasm (-1.41% of base)
      -62.20 : 249053.dasm (-2.74% of base)
      -18.40 : 251011.dasm (-1.37% of base)
       -9.33 : 248557.dasm (-0.54% of base)
       -9.05 : 194667.dasm (-0.77% of base)
       -8.32 : 249054.dasm (-0.42% of base)
       -5.85 : 246308.dasm (-0.52% of base)
       -5.85 : 246307.dasm (-0.52% of base)
       -5.85 : 246245.dasm (-0.52% of base)
       -5.85 : 246246.dasm (-0.52% of base)
       -5.51 : 251010.dasm (-0.77% of base)
       -4.36 : 249052.dasm (-0.22% of base)
       -4.16 : 253363.dasm (-0.21% of base)
       -4.15 : 194589.dasm (-0.17% of base)
       -3.92 : 194666.dasm (-0.32% of base)
       -3.41 : 248561.dasm (-2.29% of base)

23 total files with Perf Score differences (23 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
     -639.25 (-0.97% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int
     -220.67 (-1.74% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
     -132.59 (-0.26% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int
      -99.27 (-2.09% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
      -66.30 (-1.41% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool
      -62.20 (-2.74% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2()
      -18.40 (-1.37% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
       -9.33 (-0.54% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool
       -9.05 (-0.77% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
       -8.32 (-0.42% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3()
       -5.85 (-0.52% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
       -5.51 (-0.77% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
       -4.36 (-0.22% of base) : 249052.dasm - SimpleArray_01.Test:BadMatrixMul1()
       -4.16 (-0.21% of base) : 253363.dasm - MatrixMul.Test:MatrixMul()
       -4.15 (-0.17% of base) : 194589.dasm - AssignJagged:second_assignments(System.Int32[][],System.Int16[][])
       -3.92 (-0.32% of base) : 194666.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long
       -3.41 (-2.29% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)

Top method improvements (percentages):
      -62.20 (-2.74% of base) : 249053.dasm - SimpleArray_01.Test:BadMatrixMul2()
       -3.41 (-2.29% of base) : 248561.dasm - Benchstone.BenchF.InProd:InnerProduct(byref,System.Double[][],System.Double[][],int,int)
      -99.27 (-2.09% of base) : 194668.dasm - LUDecomp:ludcmp(System.Double[][],int,System.Int32[],byref):int
     -220.67 (-1.74% of base) : 248565.dasm - Benchstone.BenchI.MulMatrix:Inner(System.Int32[][],System.Int32[][],System.Int32[][])
       -2.70 (-1.71% of base) : 249057.dasm - SimpleArray_01.Test:Test2()
      -66.30 (-1.41% of base) : 251012.dasm - Benchstone.BenchI.Array2:Bench(int):bool
      -18.40 (-1.37% of base) : 251011.dasm - Benchstone.BenchI.Array2:VerifyCopy(System.Int32[][][],System.Int32[][][]):bool
     -639.25 (-0.97% of base) : 252241.dasm - Complex_Array_Test:Main(System.String[]):int
       -9.05 (-0.77% of base) : 194667.dasm - LUDecomp:build_problem(System.Double[][],int,System.Double[])
       -5.51 (-0.77% of base) : 251010.dasm - Benchstone.BenchI.Array2:Initialize(System.Int32[][][])
       -9.33 (-0.54% of base) : 248557.dasm - Benchstone.BenchF.InProd:Bench():bool
       -5.85 (-0.52% of base) : 246308.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246307.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246245.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagAry(System.Object[][][],int,int):this
       -5.85 (-0.52% of base) : 246246.dasm - DefaultNamespace.MulDimJagAry:SetThreeDimJagVarAry(System.Object[][][],int,int):this
       -8.32 (-0.42% of base) : 249054.dasm - SimpleArray_01.Test:BadMatrixMul3()
       -3.92 (-0.32% of base) : 194666.dasm - LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long
     -132.59 (-0.26% of base) : 252242.dasm - Simple_Array_Test:Main(System.String[]):int
       -1.89 (-0.22% of base) : 228622.dasm - SciMark2.LU:solve(System.Double[][],System.Int32[],System.Double[])
       -4.36 (-0.22% of base) : 249052.dasm - SimpleArray_01.Test:BadMatrixMul1()

23 total methods with Perf Score differences (23 improved, 0 regressed), 0 unchanged.

libraries.crossgen2.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 10828
Total bytes of diff: 10809
Total bytes of delta: -19 (-0.18% of base)
Total relative delta: -0.00
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (bytes):
         -19 : 72504.dasm (-0.18% of base)

1 total files with Code Size differences (1 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -19 (-0.18% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this

Top method improvements (percentages):
         -19 (-0.18% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this

1 total methods with Code Size differences (1 improved, 0 regressed), 0 unchanged.


Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 6597.12
Total PerfScoreUnits of diff: 6586.31
Total PerfScoreUnits of delta: -10.81 (-0.16% of base)
Total relative delta: -0.00
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (PerfScoreUnits):
      -10.81 : 72504.dasm (-0.16% of base)

1 total files with Perf Score differences (1 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
      -10.81 (-0.16% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this

Top method improvements (percentages):
      -10.81 (-0.16% of base) : 72504.dasm - System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this

1 total methods with Perf Score differences (1 improved, 0 regressed), 0 unchanged.

BruceForstall · 2021-07-07T23:06:26Z

@AndyAyersMS @kunalspathak @dotnet/jit-contrib PTAL

kunalspathak · 2021-07-08T00:36:23Z

Does it help in #35056 or #35293?

BruceForstall · 2021-07-08T00:43:49Z

> Does it help in #35056 or #35293?

No, loop cloning currently only handles jagged arrays, not true multi-dimensional arrays.

BruceForstall · 2021-07-08T01:38:03Z

/azp run runtime-coreclr jitstress

azure-pipelines · 2021-07-08T01:38:21Z

Azure Pipelines successfully started running 1 pipeline(s).

Allows inner loop of 3-nested loops (e.g., Array2 benchmark) to be cloned.

to avoid unnecessary bounds checks. Revert max cloning condition blocks to 3; allowing more doesn't seem to improve performance (probably too many conditions before a not-sufficiently-executed loop, at least for the Array2 benchmark)

1. "#if 0" the guts of the CSE dataflow; that's not useful to most people.
2. Add readable CSE number output to the CSE dataflow set output.
3. Add FMT_CSE to commonize CSE number output.
4. Add PHASE_OPTIMIZE_VALNUM_CSES to the pre-phase output "allow list" and stop doing its own blocks/trees output.
5. Remove unused optCSECandidateTotal.
6. Add functions `getCSEAvailBit` and `getCSEAvailCrossCallBit` to avoid hand-coding these bit calculations in multiple places, for the CSE dataflow set bits.
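
A minimal sketch of what such helpers and a symbolic set printer might look like. The bit layout (two adjacent bits per 1-based CSE number), the `CSE #%02u` format, the `.c` suffix for the cross-call bit, and the function `formatCseSet` are all assumptions made for illustration, not taken from the change itself.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Assumed layout: CSE numbers are 1-based and each owns two adjacent bits.
unsigned getCSEAvailBit(unsigned cseNum)
{
    assert(cseNum >= 1);
    return (cseNum - 1) * 2;
}

unsigned getCSEAvailCrossCallBit(unsigned cseNum)
{
    return getCSEAvailBit(cseNum) + 1;
}

// Hypothetical printer: renders the set symbolically instead of as raw bits.
std::string formatCseSet(const std::vector<bool>& bits, unsigned candidateCount)
{
    assert(bits.size() >= static_cast<std::size_t>(candidateCount) * 2);
    std::string out;
    for (unsigned cse = 1; cse <= candidateCount; cse++)
    {
        if (!bits[getCSEAvailBit(cse)])
            continue; // this CSE is not available in the set
        if (!out.empty())
            out += ", ";
        char buf[32];
        std::snprintf(buf, sizeof(buf), "CSE #%02u", cse); // assumed format string
        out += buf;
        if (bits[getCSEAvailCrossCallBit(cse)])
            out += ".c"; // also available across calls
    }
    return out;
}
```

Centralizing the index arithmetic in two small functions means a change to the layout touches one place rather than every site that tests or sets a bit.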

When generating loop cloning conditions, mark array index expressions as non-faulting, as we have already null- and range-checked the array before generating an index expression. I also added similar code to mark array length expressions as non-faulting, for the same reason. However, that leads to CQ losses because of downstream CSE effects.

This outputs the alignment boundaries without requiring outputting the actual addresses. It makes it easier to diff changes.

Create function for printing bound emitter labels. Also, add debug code to associate a BasicBlock with an insGroup, and output the block number and ID with the emitter label in JitDump, so it's easier to find where a group of generated instructions came from.

For instructions or instruction sequences which match the Intel jcc erratum criteria, note that in the alignment boundary dump. Also, a few fixes:

1. Move the alignment boundary dumping from `emitIssue1Instr` to `emitEndCodeGen` to avoid the possibility of reading the next instruction in a group when there is no next instruction.
2. Create `IsJccInstruction` and `IsJmpInstruction` functions for use by the jcc criteria detection, and fix that detection to fix a few omissions/errors.
3. Change the jcc criteria detection to be hard-coded to 32 byte boundaries instead of assuming `compJitAlignLoopBoundary` is 32.

An example:

```
cmp r11d, dword ptr [rax+8]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 0 ; jcc erratum) 32B boundary ...............................
jae G_M42486_IG103
```

In this case, the `cmp` doesn't cross the boundary, it is adjacent (the zero indicates the number of bytes of the instruction which cross the boundary), followed by the `jae` which starts after the boundary. Indicating the jcc erratum criteria can help point out potential performance issues due to unlucky alignment of these instructions in asm diffs.
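
The byte count in that annotation can be modeled roughly as follows. This is a sketch under assumptions: `BoundaryInfo`, `classify`, and `fusedPairReachesBoundary` are hypothetical names, and the offsets and sizes in the usage note are made up to reproduce the "adjacent, zero bytes crossing" case.

```cpp
#include <cstddef>

constexpr std::size_t kBoundary = 32; // hard-coded, independent of any loop alignment setting

struct BoundaryInfo
{
    bool        reachesBoundary; // instruction ends on or crosses the next 32B boundary
    std::size_t bytesPast;       // bytes located at or after that boundary (0 if adjacent)
};

// Classify an instruction occupying [offset, offset + size).
BoundaryInfo classify(std::size_t offset, std::size_t size)
{
    const std::size_t nextBoundary = (offset / kBoundary + 1) * kBoundary;
    const std::size_t end          = offset + size;
    if (end < nextBoundary)
        return {false, 0};
    return {true, end - nextBoundary};
}

// A compare followed immediately by a conditional jump is treated as one unit:
// the pair matters if its combined span ends on or crosses the boundary.
bool fusedPairReachesBoundary(std::size_t cmpOffset, std::size_t cmpSize, std::size_t jccSize)
{
    return classify(cmpOffset, cmpSize + jccSize).reachesBoundary;
}
```

Assuming a 4-byte `cmp` at offset 28, `classify(28, 4)` reports that the boundary is reached with zero bytes past it, matching the `cmp: 0` shown in the dump, and a following 6-byte `jae` makes the pair span the boundary.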

XArch sometimes prepends a "v" to the instructions names from the instruction table. Add a function `genInsDisplayName` to create the full instruction name that should be displayed, and use that in most places an instruction name will be displayed, such as in the alignment messages, and normal disassembly. Use this instead of the raw `genInsName`. This could be extended to handle arm32 appending an "s", but I didn't want to touch arm32 with this change.
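
A minimal sketch of the idea, assuming the decision to prepend "v" is available as a boolean; `displayName` and its parameters are hypothetical stand-ins, not the signature of `genInsDisplayName`.

```cpp
#include <string>

// Hypothetical stand-in: tableName is the raw name from the instruction table,
// needsVPrefix says whether this instruction is displayed with a leading "v".
std::string displayName(const std::string& tableName, bool needsVPrefix)
{
    return needsVPrefix ? "v" + tableName : tableName;
}
```

Routing every display site through one function keeps disassembly, alignment messages, and dumps consistent about which name they show.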

BruceForstall · 2021-07-08T17:54:30Z

Rebased to pick up formatting fix. All test failures were standard infra noise.

kunalspathak

Overall looks good...Added some minor feedback.

kunalspathak · 2021-07-08T00:43:49Z

src/coreclr/jit/emitxarch.cpp

+            if (id->idCodeSize() == 0)
+            {
+                // We're not going to generate any instruction, so it doesn't count for PerfScore.
+                result.insThroughput = PERFSCORE_THROUGHPUT_ZERO;


Should this be PERFSCORE_THROUGHPUT_2X * id->idCodeSize()? I suggest PERFSCORE_THROUGHPUT_2X because NOPs are cheap.

Could we also count the NOP compensation we add as part of alignment?

runtime/src/coreclr/jit/emitxarch.cpp

Lines 14355 to 14364 in ede3733

    if (emitComp->opts.disAsm)
    {
        emitDispInsAddr(dst);
        printf("\t\t ;; NOP compensation instructions of %d bytes.\n", diff);
    }
    #endif
    BYTE* dstRW = dst + writeableOffset;
    dstRW = emitOutputNOP(dstRW, diff);
    dst = dstRW - writeableOffset;

I think both of these ideas make sense as follow-up work, although maybe use PERFSCORE_THROUGHPUT_4X * id->idCodeSize() to make them even cheaper?

I am OK with this as follow-up work. I'm not sure which of PERFSCORE_THROUGHPUT_2X, PERFSCORE_THROUGHPUT_4X, etc. to pick. @tannergooding - any thoughts?

kunalspathak · 2021-07-08T21:29:21Z

src/coreclr/jit/emit.cpp

+                // one of cmp, test, add, sub, and, inc, or dec), direct unconditional jump, indirect jump,
+                // direct/indirect call, and return.
+
+                size_t jccAlignBoundary     = 32;


Currently, JitAlignLoopForJcc alignment is done when we align the loop using the non-adaptive approach. At that time, to determine how much padding is needed to align JCC, we base our calculation on the compJitAlignLoopBoundary boundary.

runtime/src/coreclr/jit/emit.cpp

Lines 5144 to 5149 in fdbca22

    // Mitigate JCC erratum by making sure the jmp doesn't fall on the boundary
    if (emitComp->opts.compJitAlignLoopForJcc)
    {
        // TODO: See if extra padding we might end up adding to mitigate JCC erratum is worth doing?
        currentOffset++;
    }

As such, either change this to compJitAlignLoopBoundary, or add a note that once we start aligning JCC during adaptive loop alignment, we must make sure that JCC is always aligned using a 32-byte boundary, because that's what the reference manual says.

I have a couple questions about this comment.

We don't do anything about jcc erratum in adaptive loop alignment mode, currently. Are you suggesting we do?

I don't see how this code for compJitAlignLoopForJcc works. First, it's DEBUG only, so it appears to have been an experiment only, not in the shipping product. Also, how does incrementing currentOffset affect whether the jcc erratum condition is avoided?

We don't do anything about jcc erratum in adaptive loop alignment mode, currently. Are you suggesting we do?

In general, if we think that it degrades performance (as you observed), then yes, we should at least do it for instructions - Here is the graph from the findings made in #35730:

Axis:

X: Ratio of after/before. Stated another way, the ratio is (withmcu/withoutmcu). Ratios less than 1 mean the benchmark performed better with the JCC microcode update applied. Ratios greater than 1 mean the benchmark performed worse with the JCC microcode update applied.

Y: Count of benchmarks in the bucket.

In general, I see more microbenchmarks degraded (albeit by a small amount) than improved.

My proposal would be to at least do JCC erratum for jumps that participate in the loop (like backedge).

I don't see how this code for compJitAlignLoopForJcc works. First, it's DEBUG only, so it appears to have been an experiment only, not in the shipping product. Also, how does incrementing currentOffset affect whether the jcc erratum condition is avoided?

Yes, it was intentionally made DEBUG-only because it didn't fully solve the JCC erratum, nor did it give big benefits. The way it works is: for non-adaptive alignment, if the loop already starts at an offset such that it will still fit in the minimum number of blocks required, then we skip aligning it. Before determining this, if JitAlignLoopForJcc=1, we just increase the offset from which the loop starts, so that the last backedge is pushed further down. The assumption is that, if the JCC erratum applied, our condition if (currentOffset > extraBytesNotInLoop) would do the right thing and not align the last backedge at the boundary. The more I think about it now, it won't work properly in most cases, and we need some more tracking to make sure that the backedges don't fall on the boundary. Conversely, it could also happen that the backedge was previously not on the boundary and, after aligning the loop, it falls on the boundary, worsening performance; that too needs to be handled correctly.
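The "more tracking" mentioned above would need some way to tell whether a given jump touches a boundary. The following is a minimal sketch of such a check, assuming the 32-byte boundary from the diff hunk above and the usual statement of the erratum (a jump that crosses a 32-byte boundary or ends exactly on one). The function names are hypothetical and this is not the JIT's implementation; as noted in the thread, the adaptive mode currently does nothing about the erratum.

```cpp
#include <cassert>
#include <cstddef>

// True if an instruction occupying [offset, offset + size) crosses a
// `boundary`-byte boundary or ends exactly on one.
inline bool touchesJccBoundary(std::size_t offset, std::size_t size, std::size_t boundary = 32)
{
    assert(size > 0 && boundary > 0);
    const std::size_t end     = offset + size; // one past the last byte
    const bool        crosses = (offset / boundary) != ((end - 1) / boundary);
    const bool        endsOn  = (end % boundary) == 0;
    return crosses || endsOn;
}

// Smallest number of padding bytes to insert before the instruction so that
// it no longer touches a boundary. Requires size < boundary, which guarantees
// the loop terminates at the next boundary-aligned offset at the latest.
inline std::size_t paddingToAvoidJccBoundary(std::size_t offset, std::size_t size, std::size_t boundary = 32)
{
    assert(size < boundary);
    std::size_t pad = 0;
    while (touchesJccBoundary(offset + pad, size, boundary))
    {
        ++pad;
    }
    return pad;
}
```

Note that padding inserted ahead of one jump shifts every later instruction, so a real mitigation would have to re-check subsequent jumps, which is the concern raised in the last two sentences of the comment above.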

kunalspathak · 2021-07-08T21:38:53Z

src/coreclr/jit/instr.cpp

-        {
-            printf("\n");
-        }
+    if (GetEmitter()->IsAVXInstruction(ins) && !GetEmitter()->IsBMIInstruction(ins))


Suggested change

-    if (GetEmitter()->IsAVXInstruction(ins) && !GetEmitter()->IsBMIInstruction(ins))
+#ifdef TARGET_XARCH
+    const int       TEMP_BUFFER_LEN = 40;
+    static unsigned curBuf          = 0;
+    static char     buf[4][TEMP_BUFFER_LEN];
+    const char*     retbuf;
+    if (GetEmitter()->IsAVXInstruction(ins) && !GetEmitter()->IsBMIInstruction(ins))
+    {
+        sprintf_s(buf[curBuf], TEMP_BUFFER_LEN, "v%s", insName);
+        retbuf = buf[curBuf];
+        curBuf = (curBuf + 1) % 4;
+        return retbuf;
+    }
+#endif
+    return insName;

We don't need the else and #else.
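The suggested code relies on a small ring of static buffers so that several returned names can be alive at the same time. Below is a self-contained sketch of that pattern, using the standard `snprintf` instead of `sprintf_s`; `displayName` and its `addPrefix` parameter are hypothetical stand-ins for the JIT's function and its AVX/BMI checks.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Returns `name` prefixed with "v" when `addPrefix` is true, otherwise `name`
// itself. Prefixed names are written into one of four static buffers used
// round-robin, so up to four results stay valid at once (for example, as
// arguments to a single printf call). Not thread-safe.
inline const char* displayName(const char* name, bool addPrefix)
{
    constexpr int   kBufLen = 40;
    static unsigned curBuf  = 0;
    static char     buf[4][kBufLen];

    assert(name != nullptr);
    assert(std::strlen(name) + 2 <= static_cast<std::size_t>(kBufLen)); // 'v' + name + NUL

    if (!addPrefix)
    {
        return name;
    }

    char* out = buf[curBuf];
    std::snprintf(out, kBufLen, "v%s", name);
    curBuf = (curBuf + 1) % 4;
    return out;
}
```

A fifth prefixed call reuses the first buffer, so callers must not hold on to more than four results at a time.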

kunalspathak · 2021-07-08T21:39:38Z

src/coreclr/jit/clrjit.natvis

@@ -183,4 +190,48 @@ The .NET Foundation licenses this file to you under the MIT license.
    </Expand>
  </Type>

+  <Type Name="jitstd::vector&lt;*&gt;">


Glad to see that this file is getting longer :)

kunalspathak · 2021-07-08T21:42:01Z

src/coreclr/jit/fgopt.cpp

@@ -1620,6 +1620,9 @@ void Compiler::fgCompactBlocks(BasicBlock* block, BasicBlock* bNext)
            }
        }
        bNext->bbPreds = nullptr;
+
+        // `block` can no longer be a loop pre-header (if it was before).
+        block->bbFlags &= ~BBF_LOOP_PREHEADER;


Do we maintain this invariant? Do we have asserts in other places to make sure that a loop pre-header block doesn't have more than one incoming edge?

The bit is not widely used (just assertion prop and fgDominate for "new" blocks). Adding such an assert might make sense. I just noticed this because it didn't make sense in a JitDump I was looking at for a block to be marked as a pre-header. fwiw, this change alone causes no asm diffs.
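For readers less familiar with the flag idiom in the hunk above, here is a minimal sketch of clearing one bit while preserving the rest, followed by the kind of assert discussed in the reply. `Block`, `kLoopPreheader`, `kOtherFlag`, and `onCompacted` are hypothetical stand-ins, not the JIT's `BasicBlock` or its flag values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag bits (values chosen only for this example).
constexpr std::uint32_t kLoopPreheader = 0x1;
constexpr std::uint32_t kOtherFlag     = 0x2;

// Hypothetical stand-in for a basic block.
struct Block
{
    std::uint32_t flags = 0;
};

// Called after the following block has been merged into `block`: the merged
// block can no longer be treated as a pre-header, so drop that bit only.
inline void onCompacted(Block& block)
{
    block.flags &= ~kLoopPreheader;
    assert((block.flags & kLoopPreheader) == 0);
}
```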

kunalspathak · 2021-07-08T21:44:48Z

src/coreclr/jit/optimizer.cpp

@@ -38,7 +38,6 @@ void Compiler::optInit()
    optNativeCallCount   = 0;
    optAssertionCount    = 0;
    optAssertionDep      = nullptr;
-    optCSECandidateTotal = 0;


Probably worth adding it in JitTimeLogCsv?

Probably not. This gets back to the idea for generally improving JIT stats/metrics: #52877

kunalspathak · 2021-07-08T21:51:41Z

src/coreclr/jit/gentree.h

@@ -494,6 +494,7 @@ enum GenTreeFlags : unsigned int

    GTF_INX_RNGCHK              = 0x80000000, // GT_INDEX/GT_INDEX_ADDR -- the array reference should be range-checked.
    GTF_INX_STRING_LAYOUT       = 0x40000000, // GT_INDEX -- this uses the special string array layout
+    GTF_INX_NONFAULTING         = 0x20000000, // GT_INDEX -- the INDEX does not throw an exception (morph to GTF_IND_NONFAULTING)


The name can be easily confused with GTF_IND_NONFAULTING...In fact, even I got confused while reviewing. Can you please pick a different name?

It's actually exactly the same as GTF_IND_NONFAULTING except for use on GT_INDEX nodes, not GT_IND nodes. Maybe GTF_INX_NOFAULT just to be different?

That sounds good.

kunalspathak · 2021-07-08T22:17:32Z

src/coreclr/jit/loopcloning.cpp

-    // REVIEW: due to the definition of `condBlocks`, above, the effective max is 3 blocks, meaning
-    // `maxRank` of 1. Question: should the heuristic allow more blocks to be created in some situations?
-    // REVIEW: make this based on a COMPlus configuration?
-    if (condBlocks > 4)


Earlier, the comment mentioned 3 blocks, but we were allowing 4 blocks?

The value being compared was 2n + 1, so was always odd. Comparing against 3 and comparing against 4 are equivalent. I started this work trying to change it to 5, to allow more cloning (especially for the Array2 benchmark), but saw perf regressions. Now that I have fixed some issues, and have more experience looking into those kinds of regressions, especially the jcc erratum and alignment related regressions, I want to go back and try setting this to 5 again, and see if the regressions still exist and can be mitigated.
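The parity argument above can be checked mechanically. This small sketch leaves `n` abstract, as the comment does, and the helper names are made up for the example.

```cpp
#include <cassert>

// The compared value has the form 2n + 1, per the discussion above.
constexpr unsigned condBlocksFor(unsigned n)
{
    return 2 * n + 1;
}

// True when `blocks > lo` and `blocks > hi` give the same answer for every
// n in [0, maxN].
inline bool thresholdsAgree(unsigned lo, unsigned hi, unsigned maxN)
{
    assert(lo <= hi);
    for (unsigned n = 0; n <= maxN; ++n)
    {
        const unsigned blocks = condBlocksFor(n);
        if ((blocks > lo) != (blocks > hi))
        {
            return false;
        }
    }
    return true;
}
```

Thresholds 3 and 4 agree because no odd value lies strictly between them and 4; thresholds 4 and 5 differ at n = 2, where the value is 5, which is why raising the limit to 5 admits one more level.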

AndyAyersMS

Changes look good to me.

It would be ideal to separate out the diff-causing changes from the dumping/refactoring changes and have a no-diff PR for the latter. I realize one often works on both in tandem and pulling them apart later can be painful. So no need to do that here. When things are intermixed like this, it is harder to be confident that all the changes that lead to diffs were looked at carefully.

BruceForstall · 2021-07-09T19:43:22Z

It would be ideal to separate out the diff causing changes from the dumping/refactoring changes...

Yeah, I thought about that. In fact, I'd prefer to submit PRs for each separate debugging change (CSE, alignment, etc.). But it's so painful and time consuming nowadays to get clean CI runs (and code reviews), that I "gave up" and decided not to.

1. Rename GTF_INX_NONFAULTING to GTF_INX_NOFAULT to increase clarity compared to existing GTF_IND_NONFAULTING. 2. Minor cleanup in getInsDisplayName.

am11 · 2021-07-11T11:54:06Z

src/coreclr/jit/emit.cpp

+#ifdef DEBUG
+    ig->lastGeneratedBlock = nullptr;
+    // Explicitly call the constructor, since IGs don't actually have a constructor.
+    ig->igBlocks.jitstd::list<BasicBlock*>::list(emitComp->getAllocator(CMK_LoopOpt));


The GCC leg was failing when the PR got merged. It looks to be expecting a type rather than a ctor (function):

Suggested change

-    ig->igBlocks.jitstd::list<BasicBlock*>::list(emitComp->getAllocator(CMK_LoopOpt));
+    ig->igBlocks.jitstd::list<BasicBlock*>(emitComp->getAllocator(CMK_LoopOpt));
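For comparison, the standard C++ way to run a constructor on storage that already exists is placement new. The sketch below shows that technique with `std::list<int>` and the default allocator; it is a swapped-in alternative for illustration, not what the PR or the suggestion above does, and `constructListAt` is a hypothetical helper.

```cpp
#include <cassert>
#include <list>
#include <new>

using IntList = std::list<int>;

// Constructs an IntList in caller-provided storage and returns a pointer to
// it. The storage must be suitably sized and aligned, and the caller is
// responsible for invoking the destructor explicitly.
inline IntList* constructListAt(void* rawStorage)
{
    assert(rawStorage != nullptr);
    return new (rawStorage) IntList(); // placement new: no allocation for the list object itself
}
```

A caller would typically pair this with an explicit destructor call such as `list->~IntList();` once the object is no longer needed.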

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 7, 2021

BruceForstall requested review from AndyAyersMS and kunalspathak July 7, 2021 23:05

BruceForstall added 16 commits July 8, 2021 10:52

Increase loop cloning max allowed condition blocks

448e5f1

Allows inner loop of 3-nested loops (e.g., Array2 benchmark) to be cloned.

Remove outer index bounds checks

cdcc120

Convert loop cloning data structures to vector for better debugging

74d62df

Don't count zero-sized align instructions in PerfScore

f45028f

Add COMPlus_JitDasmWithAlignmentBoundaries

0efd5c4

This outputs the alignment boundaries without requiring outputting the actual addresses. It makes it easier to diff changes.

Improve bounds check output

546785b

Formatting

a3cecf7

Clear BBF_LOOP_PREHEADER bit when compacting empty pre-header block

141c44b

Keep track of all basic blocks that contribute code to an insGroup

653a7e2

Fix build

35de73f

BruceForstall force-pushed the IncreaseLoopCloningMaxBlocksHeuristic branch from 25ab640 to 35de73f Compare July 8, 2021 17:54

kunalspathak reviewed Jul 8, 2021

AndyAyersMS approved these changes Jul 9, 2021

Code review feedback

3bb3110

1. Rename GTF_INX_NONFAULTING to GTF_INX_NOFAULT to increase clarity compared to existing GTF_IND_NONFAULTING. 2. Minor cleanup in getInsDisplayName.

Formatting

182f579

runfoapp bot mentioned this pull request Jul 10, 2021

Feed unreliability affecting CI #55449

Closed

BruceForstall merged commit 02ccdac into dotnet:main Jul 10, 2021

BruceForstall deleted the IncreaseLoopCloningMaxBlocksHeuristic branch July 10, 2021 17:18

am11 reviewed Jul 11, 2021

am11 mentioned this pull request Jul 11, 2021

W^X support #54954

Merged

ManickaP mentioned this pull request Jul 20, 2021

[QUIC] Remove AppContext switch from S.N.Quic #56027

Merged

danmoseley mentioned this pull request Jul 27, 2021

[Perf] Regression in System.Collections.CtorFromCollection<String>.ConcurrentQueue #56017

Closed

kunalspathak mentioned this pull request Aug 9, 2021

[Perf] Changes at 7/11/2021 12:10:48 AM dotnet/perf-autofiling-issues#300

Closed

ghost locked as resolved and limited conversation to collaborators Aug 10, 2021
