
JIT: Remove BBJ_NONE #94239

Merged (20 commits into dotnet:main, Nov 28, 2023)

Conversation

amanasifkhalid
Member

Next step for #93020, per conversation on #93772. Replacing BBJ_NONE with BBJ_ALWAYS to the next block helps limit our use of implicit fall-through (though we still expect BBJ_COND to fall through when its false branch is taken; #93772 should eventually address this).

I've added a small peephole optimization to skip emitting unconditional branches to the next block during codegen.
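For illustration, here is a minimal sketch of the shape of that peephole check (simplified; the real check in genCodeForBBlist also accounts for hot/cold section splitting, and was later gated on optimizations being enabled per review feedback):

// Minimal sketch of the codegen peephole (simplified from the actual change):
const bool jumpIsRedundant = block->KindIs(BBJ_ALWAYS) &&
                             block->JumpsToNext() &&                        // target is the lexically next block
                             ((block->bbFlags & BBF_KEEP_BBJ_ALWAYS) == 0); // EH does not require keeping the branch

if (!jumpIsRedundant)
{
    // ... emit the unconditional branch as usual ...
}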

@dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Oct 31, 2023
@ghost assigned amanasifkhalid Oct 31, 2023
@ghost commented Oct 31, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details


Author: amanasifkhalid
Assignees: amanasifkhalid
Labels: area-CodeGen-coreclr
Milestone: -

@amanasifkhalid
Member Author

Failures look like #91757.

@amanasifkhalid marked this pull request as ready for review November 2, 2023 20:05
@amanasifkhalid
Member Author

CC @dotnet/jit-contrib, @AndyAyersMS PTAL. I tried to rein in the asmdiffs as much as possible without adding too many weird edge cases. The code size increases in the libraries_tests.run... collections for FullOpts are pretty dramatic, though for what it's worth, the JIT seems to be justifying these increases with improved PerfScores. Here's the PerfScore diff for this collection when targeting Windows ARM64:

Found 72 files with textual diffs.

Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 632533.3599999999
Total PerfScoreUnits of diff: 459753.63
Total PerfScoreUnits of delta: -172779.73 (-27.32 % of base)
Total relative delta: -19.98
    diff is an improvement.
    relative diff is an improvement.


Top file regressions (PerfScoreUnits):
       13.10 : 618160.dasm (45.96% of base)
       12.40 : 562710.dasm (40.91% of base)
       12.40 : 153205.dasm (40.90% of base)

Top file improvements (PerfScoreUnits):
    -6109.65 : 121208.dasm (-34.90% of base)
    -6100.66 : 308960.dasm (-34.96% of base)
    -6098.06 : 26795.dasm (-34.95% of base)
    -6094.56 : 72962.dasm (-34.94% of base)
    -6094.06 : 372595.dasm (-34.94% of base)
    -6093.96 : 267789.dasm (-34.94% of base)
    -6093.86 : 123496.dasm (-34.94% of base)
    -6093.86 : 622898.dasm (-34.94% of base)
    -6093.21 : 246705.dasm (-34.84% of base)
    -6093.21 : 169013.dasm (-34.84% of base)
    -6086.91 : 94037.dasm (-34.82% of base)
    -6080.10 : 356732.dasm (-35.18% of base)
    -6079.46 : 500179.dasm (-34.89% of base)
    -6079.30 : 68364.dasm (-35.17% of base)
    -6079.30 : 115234.dasm (-35.17% of base)
    -6069.86 : 623747.dasm (-34.85% of base)
    -6066.10 : 187066.dasm (-35.12% of base)
    -6065.55 : 347667.dasm (-35.22% of base)
    -6064.80 : 120021.dasm (-35.12% of base)
    -6064.80 : 103264.dasm (-35.12% of base)

72 total files with Perf Score differences (69 improved, 3 regressed), 20 unchanged.

@AndyAyersMS
Member

Diffs

Very interesting. I would not have expected massive code size improvements from something like this, and I'd like to understand this aspect a bit better (especially the min opts cases). Can we pick a few examples for case studies?

I will need some time to go through the changes -- will try to get you a first pass later today.

@jakobbotsch (Member) commented Nov 2, 2023

CC @dotnet/jit-contrib, @AndyAyersMS PTAL. I tried to rein in the asmdiffs as much as possible without adding too many weird edge cases. The code size increases in the libraries_tests.run... collections for FullOpts are pretty dramatic, though for what it's worth, the JIT seems to be justifying these increases with improved PerfScores. Here's the PerfScore diff for this collection when targeting Windows ARM64:


What is this from? You need to pass -metrics PerfScore to superpmi.py to do a correct PerfScore measurement. Analyzing the example diffs produced for a normal asmdiffs run will not give the right results (I would expect way more than 72 files). Even so, the results may not be very insightful, since I think they would include PerfScore diffs in MinOpts contexts.

For this change: it looks like the "jump to next BB" optimization is kicking in a few places in MinOpts. I don't see an easy way to make the behavior the same as before, but we should ensure that this doesn't impact debugging.

@amanasifkhalid
Member Author

You need to pass -metrics PerfScore to superpmi.py to do a correct perfscore measurement. Analyzing the example diffs produced for a normal asmdiffs run will not give right results (I would expect way more than 72 files).

Ah, thanks for catching that. I'm rerunning the PerfScore measurement now; will update with the results here.

For this change: it looks like the "jump to next BB" optimization is kicking in a few places in MinOpts.

By "jump to next BB optimization," do you mean the peephole optimization for jumping to the next block during codegen, or one of the flowgraph optimizations (like fgCompactBlocks, which will compact a BBJ_ALWAYS to the next block, or fgOptimizeBranchToNext, etc)?

@jakobbotsch
Member

By "jump to next BB optimization," do you mean the peephole optimization for jumping to the next block during codegen, or one of the flowgraph optimizations (like fgCompactBlocks, which will compact a BBJ_ALWAYS to the next block, or fgOptimizeBranchToNext, etc)?

I meant the peephole optimization, e.g. I assume that's the cause of diffs like:
[screenshot of an example asm diff]

We should just make sure we don't lose the ability to place breakpoints on "closing braces", for example (though I wouldn't expect that). As Andy mentioned, understanding these diffs would be a good idea. Also, any idea what's causing TP regressions in asp.net and benchmarks.run_pgo MinOpts?

@amanasifkhalid (Member Author) commented Nov 3, 2023

I'd like to understand this aspect a bit better (especially the min opts cases). Can we pick a few examples for case studies?

Sure thing. There are a couple methods in the libraries_tests.run... collection with big size regressions. For example, System.DefaultBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this (Instrumented Tier1) increased in size from 8741 bytes to 14283 bytes (+5542 bytes, or 63.40% of the base size). The instruction count increased from 1867 to 3249. However, the PerfScore decreased from 13300.68 to 8990.12, so the JIT thinks this increase is worth it. Looking at the JIT dumps, the biggest diffs are during loop cloning, where the JIT is now much more aggressive for this method: The baseline JIT didn't clone any loops, while the diff JIT cloned 4 loops. This cloning increased the number of basic blocks from 236 to 454. It seems that in the baseline JIT, Compiler::optCanOptimizeByLoopCloning bails out pretty early, getting only a few statements deep into each loop. Here's a dump snippet for the baseline JIT:

Considering loop L00 to clone for optimizations.
Checking loop L00 for optimization candidates (GDV tests)
...GDV considering [000111]
------------------------------------------------------------
Considering loop L01 to clone for optimizations.
Checking loop L01 for optimization candidates (GDV tests)
...GDV considering [000786]
...GDV considering [000761]
------------------------------------------------------------
Considering loop L02 to clone for optimizations.
Checking loop L02 for optimization candidates (GDV tests)
...GDV considering [001478]
...GDV considering [001486]
...GDV considering [001519]
...GDV considering [002400]
...GDV considering [002404]
...GDV considering [004228]
...GDV considering [004229]
... right form for type test with local V156
... but not invariant
...GDV considering [004300]
...GDV considering [002429]
...GDV considering [001464]
------------------------------------------------------------
Considering loop L03 to clone for optimizations.
Checking loop L03 for optimization candidates (GDV tests)
...GDV considering [001127]
------------------------------------------------------------
Considering loop L04 to clone for optimizations.
Checking loop L04 for optimization candidates (GDV tests)
...GDV considering [001150]
------------------------------------------------------------

Loops cloned: 0
Loops statically optimized: 0

I don't fully understand the requirements for deciding to clone a loop, but I'm guessing some slightly different decisions made in Compiler::fgUpdateFlowGraph enabled the diff JIT to do cloning. Some cloned loops are pretty large -- one clone added over 100 new blocks (I'm guessing we don't consider loop size when cloning, or the tolerance for size increases is pretty high?). Also @jakobbotsch I haven't investigated the TP diffs for MinOpts yet, but I hypothesize this ambitious loop cloning is to blame for the TP regression for libraries_tests.run... with FullOpts.

In this case, do we trust the PerfScore improvements, or does this amount of cloning seem extreme? I'll update shortly with an example of a code size improvement in MinOpts.

@amanasifkhalid
Member Author

I took a look at an example similar to the one Jakob screenshotted above; see MicroBenchmarks.Serializers.DataGenerator:Generate[int]():int (Tier0) in benchmarks.run_pgo.windows.arm64.checked. For the baseline JIT, this method is 28 instructions (112 bytes) long, with a PerfScore of 38.20. For the diff JIT, this method is 7 instructions (28 bytes) long, with a PerfScore of 8.80. By percent decreases, this method was one of the best improvements in the collection, with a 75% decrease in size. Taking a look at the JIT dumps, this method has a lot of BBJ_COND -> BBJ_RETURN chains. Here's a snippet:

--------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]
--------------------------------------------------------------------------
BB01 [0000]  1                             1       [000..01B)-> BB03 ( cond )                     
BB02 [0001]  1       BB01                  1       [01B..026)        (return)                     
BB03 [0002]  1       BB01                  1       [026..041)-> BB05 ( cond )                     
BB04 [0003]  1       BB03                  1       [041..04C)        (return)                     
BB05 [0004]  1       BB03                  1       [04C..067)-> BB07 ( cond )                     
BB06 [0005]  1       BB05                  1       [067..072)        (return)

The conditions of these BBJ_COND blocks get folded into constants, and we're able to convert them into BBJ_ALWAYS blocks. Here's what they look like after:

BB01 [0000]  1                             1       [000..01B)-> BB03 (always)
BB02 [0001]  0                             1       [01B..026)        (return)
BB03 [0002]  1       BB01                  1       [026..041)-> BB05 (always)
BB04 [0003]  0                             1       [041..04C)        (return)
BB05 [0004]  1       BB03                  1       [04C..067)-> BB07 (always)
BB06 [0005]  0                             1       [067..072)        (return)

The BBJ_RETURN blocks are no longer reachable, since their previous blocks cannot fall through into them. So during the post-import phase, these blocks are removed. Now, the block list looks like this:

BB01 [0000]  1                             1       [000..01B)-> BB03 (always)
BB03 [0002]  1       BB01                  1       [026..041)-> BB05 (always)
BB05 [0004]  1       BB03                  1       [04C..067)-> BB07 (always) 
BB07 [0006]  1       BB05                  1       [072..08D)-> BB09 (always)

Normally, the JIT would convert these jumps to the next block into BBJ_NONE via fgOptimizeBranchToNext in the layout optimization phase, but that phase isn't run in MinOpts, so by the time we get to codegen, these blocks are still BBJ_ALWAYS. The new peephole optimization kicks in and removes all of the unnecessary jumps. So aside from that optimization, the actual flowgraph behavior doesn't seem to differ at all between the baseline and diff JITs; in MinOpts we simply never had the chance to convert these BBJ_ALWAYS blocks to BBJ_NONE. The top code size improvements for this collection, by both raw and percent decrease, are all Tier 0, so the peephole optimization appears to make up for fgOptimizeBranchToNext not running in MinOpts. As Jakob alluded to, this optimization seems responsible for many of the MinOpts improvements.

I imagine it would be easy to disable this optimization in MinOpts the same way we disable the layout optimization phase, if we find it interferes with debugging unoptimized code.

@AndyAyersMS
Member

In this case, do we trust the perfScore improvements, or does this amount of cloning seem extreme? I'll update shortly with an example of a code size improvement in MinOpts

I've seen cloning "unexpectedly" kick in from changes like this. It is something we'll just have to tolerate for now.

It would be helpful to see how much of the code growth comes in methods where more cloning happens, and whether aside from that there are other things going on that would be worth understanding.

However it may be a little bit painful to gather this data. For instance you could add output to the jit's disasm footer line indicating the number of cloned loops, then update jit-analyze or similar to parse this and aggregate data separately for methods where # of clones matches vs those where # of clones differs...

I imagine it would be easy to disable this optimization in MinOpts the same way we disable the layout optimization phase, if we find it interferes with debugging unoptimized code.

For now you should try and see if you can get to zero diffs (or close) for MinOpts; we can always come back later to see if this optimization can be safely enabled in modes where we are generally (and intentionally) not optimizing the code. For instance, we might want to turn it on for Tier0 but not MinOpts or Debuggable Code.

@BruceForstall
Member

However it may be a little bit painful to gather this data. For instance you could add output to the jit's disasm footer line indicating the number of cloned loops, then update jit-analyze or similar to parse this and aggregate data separately for methods where # of clones matches vs those where # of clones differs...

DOTNET_JitTimeLogCsv=* includes cloned loops as one of the columns in the (per-function) output.

@amanasifkhalid
Member Author

DOTNET_JitTimeLogCsv=* includes cloned loops as one of the columns in the (per-function) output.

Thanks for pointing this out. I'll share some metrics on loop cloning for the collections with the biggest size regressions for FullOpts.

For now you should try and see if you can get to zero diffs (or close) for MinOpts, we can always come back later to see if this optimization can be safely enabled in modes where we are generally (and intentionally) not optimizing the code.

Sure thing, I'll try disabling the optimization for MinOpts. FYI, while disabling it will probably reduce or remove the size improvements, I expect us to get some pretty big diffs in the opposite direction for MinOpts, as all the blocks that used to be BBJ_NONE will now have a branch instruction at the end. So either way, I don't think we'll be close to zero diffs.

@AndyAyersMS
Member

I expect us to get some pretty big diffs in the opposite direction for MinOpts,

Interesting ... so we seemingly have lost something important by removing BBJ_NONE. I wonder if we have anything else lying around that could help us figure out when to materialize the jump for MinOpts.

One idea is to look at the associated IL offset info. If the block's end IL offset is valid, differs from the IL offset of the block's last statement, and also differs from the next block's start IL offset, then the jump may be significant for source-level debugging, and so it should result in some instruction (though perhaps emitting a nop would be sufficient).
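A rough sketch of that heuristic (illustrative only; lastStmtILOffset is a hypothetical helper for fetching the last statement's IL offset, and the real check would need to live wherever MinOpts codegen decides to emit the branch):

bool jumpNeedsAnInstruction(BasicBlock* block)
{
    const IL_OFFSET blockEndIL = block->bbCodeOffsEnd;
    if (blockEndIL == BAD_IL_OFFSET)
    {
        return false; // no valid end offset to map a breakpoint to
    }

    // If the block-end IL offset differs from both the last statement's offset and the
    // next block's start offset, the branch may carry its own sequence point (e.g. a
    // closing brace), so it should produce a jump or at least a nop.
    BasicBlock* next = block->Next();
    return (blockEndIL != lastStmtILOffset(block)) &&
           ((next == nullptr) || (blockEndIL != next->bbCodeOffs));
}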

If we can't get this to zero diff for MinOpts/Debuggable code, we'll have to verify behavior with the debugger tests.

I suppose we could also try and look at the debug info we generate for some of these methods (say, pass --debuginfo to superpmi.py asmdiffs)... while it won't match up exactly before/after, every IL offset we used to report should still be reported. So perhaps a simple check that the number of offset records is the same would be sufficient. I don't recall how smart the SPMI debug info differ is; it may already flag this case.

@@ -737,7 +737,9 @@ void CodeGen::genCodeForBBlist()
 {
     // Peephole optimization: If this block jumps to the next one, skip emitting the jump
     // (unless we are jumping between hot/cold sections, or if we need the jump for EH reasons)
-    const bool skipJump = block->JumpsToNext() && !(block->bbFlags & BBF_KEEP_BBJ_ALWAYS) &&
+    // (Skip this optimization in MinOpts)
+    const bool skipJump = !compiler->opts.MinOpts() && block->JumpsToNext() &&
Member

You probably want compiler->optimizationsEnabled() here.

@AndyAyersMS (Member) commented Nov 4, 2023

[screenshots: original vs. revised]

Seems like there is more than just this peephole involved, as x64 is still smaller than it was.

@AndyAyersMS (Member) left a comment

Left some notes.

Not sure about some of the changes to fgReorderBlocks, but that method is complex enough that I'd need to walk through some cases to understand it better.

I still think the goal here for now should be to try and minimize diffs. If there are opportunities to improve, we can do those as follow-ups. We should pick a few relatively small, simple methods for case studies and make sure we understand what is leading to the diffs there.

Happy to help with this.

// If that happens, make sure a NOP is emitted as the last instruction in the block.
emitNop = true;
break;

Member

Did we ever hit the case in the old code where we had BBJ_NONE on the last block?

Member Author

I added an assert(false) to that case in the old code to see if I could get it to hit during a SuperPMI replay, and it never hit across all collections. Also in the new code, I added an assert that BBJ_ALWAYS has a jump before trying to emit the jump, so that we never have a BBJ_ALWAYS that "falls into" nothing at the end of the block list -- that also never hit.

@@ -3157,7 +3155,7 @@ unsigned Compiler::fgMakeBasicBlocks(const BYTE* codeAddr, IL_OFFSET codeSize, F

jmpDist = (sz == 1) ? getI1LittleEndian(codeAddr) : getI4LittleEndian(codeAddr);

-if ((jmpDist == 0) && (opcode == CEE_BR || opcode == CEE_BR_S) && opts.DoEarlyBlockMerging())
+if ((jmpDist == 0) && (jmpKind == BBJ_ALWAYS) && opts.DoEarlyBlockMerging())
Member

Does this handle the CEE_LEAVE cases like the old code did?

Suspect you may want to change this back to what it was before.

Member Author

I think so, since jmpKind is set to BBJ_ALWAYS only when the opcode is CEE_BR or CEE_BR_S, though I can change this back for simplicity.

@@ -1432,19 +1428,6 @@ void Compiler::fgDebugCheckTryFinallyExits()
}
}
}
Member

Can you also update the comment above this code since case (d) is no longer possible? Instead of compacting (e) and (f) maybe just do something like

                // ~~(d) via a fallthrough to an empty block to (b)~~ [no longer possible]

@@ -1472,7 +1455,7 @@ void Compiler::fgDebugCheckTryFinallyExits()
block->bbNum, succBlock->bbNum);
}

-allTryExitsValid = allTryExitsValid & thisExitValid;
+allTryExitsValid = allTryExitsValid && thisExitValid;
Member

Nit: since these bools are almost always true, & is possibly a bit more efficient than &&.

const bool fallThroughIsTruePred = BlockSetOps::IsMember(this, jti.m_truePreds, jti.m_fallThroughPred->bbNum);
const bool predJumpsToNext = jti.m_fallThroughPred->KindIs(BBJ_ALWAYS) && jti.m_fallThroughPred->JumpsToNext();
Member

This code can likely be simplified quite a bit now that there are no implicit fall throughs. That is, there is no longer any reason to set jti.m_fallThroughPred to true.

Can you leave a todo comment here and maybe a note in the meta-issue that we should revisit this?

Member Author

Sure thing.


// TODO: Now that block has a jump to bNext,
// can we relax this requirement?
assert(!fgInDifferentRegions(block, bNext));
Member

Probably not, unless the block whose IR is moving is empty or can't cause an exception.

@@ -2750,6 +2733,13 @@ void Compiler::optRedirectBlock(BasicBlock* blk, BlockToBlockMap* redirectMap, R
break;

case BBJ_ALWAYS:
// Fall-through successors are assumed correct and are not modified
Member

Is this new logic really necessary?

Member Author

I think so. That comment is copied from the notes in the doc comment for Compiler::optRedirectBlock; I added this check in to emulate the no-op behavior it previously had for BBJ_NONE. If I remove it, we hit assert(h->HasJumpTo(t) || !h->KindIs(BBJ_ALWAYS)) in Compiler::optCanonicalizeLoopCore.

Member

The logic seems wrong with this added code here -- I would expect optRedirectBlock to always redirect a BBJ_ALWAYS based on the map. The behavior now doesn't match the documentation. Maybe some update to redirectMap is needed somewhere?

Member Author

To preserve the original behavior of this method, which was to not redirect BBJ_NONE, would it be OK to use BBF_NONE_QUIRK here instead to determine if we should redirect a BBJ_ALWAYS? This seems to work locally, e.g.:

if (blk->JumpsToNext() && ((blk->bbFlags & BBF_NONE_QUIRK) != 0)) // Functionally equivalent to BBJ_NONE
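A sketch of how that check could slot into the BBJ_ALWAYS case of optRedirectBlock (simplified; newJumpDest and the redirectMap lookup pattern are assumed from the existing method, and the accessor names follow those used elsewhere in this PR):

case BBJ_ALWAYS:
    // A branch to the next block marked BBF_NONE_QUIRK stands in for the old
    // BBJ_NONE, which this method never redirected -- so leave it alone too.
    if (blk->JumpsToNext() && ((blk->bbFlags & BBF_NONE_QUIRK) != 0))
    {
        break;
    }

    if (redirectMap->Lookup(blk->GetJumpDest(), &newJumpDest))
    {
        blk->SetJumpDest(newJumpDest);
    }
    break;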

Member

It would be better, even though I generally think quirks should not affect behavior in this way. It seems like there is some form of bug here around how the redirection happens or how the map is constructed.

{
preHead = BasicBlock::bbNewBasicBlock(this, BBJ_ALWAYS, entry);
}
BasicBlock* preHead = BasicBlock::New(this, BBJ_ALWAYS, (isTopEntryLoop ? top : entry));
Member

I think we can just always branch to entry now; the logic before was trying to optimize the case of top-entry loops and fall through, but we don't need to do that anymore.

@BruceForstall (Member) left a comment

It's cool to see how much code is getting deleted (and how much more will probably follow).

I agree with Andy that you should get as close to zero diffs as possible with this change, putting in temporary workarounds if necessary for cases we can later remove.

In addition:

  1. Comment unrelated to this change: I don't like how HasJump() determines if we have a jump using bbJumpDest != nullptr. Note that bbJumpDest is part of a union. We should never access that value without first checking bbJumpKind (or asserting that bbJumpKind is one where we expect bbJumpDest to be set). It would probably be useful to have functions:
    HasJumpDest() => KindIs(BBJ_ALWAYS,<others>)  // bbJumpDest is valid
    HasJumpSwt() => KindIs(BBJ_SWITCH)            // bbJumpSwt is valid
    HasJumpEhf() => KindIs(BBJ_EHFINALLYRET)      // bbJumpEhf is valid

(and presumably bbJumpOffs is only valid during a limited time during importing?)

These could be used in appropriate asserts as well (a rough sketch follows after this list).

  2. Questions about next steps: (a) do we get rid of the BBJ_COND "fall through"? (b) do we get rid of "JumpsToNext()"? (c) do we get rid of bbFallsThrough()? (d) do we get rid of fgConnectFallThrough()? (e) do we rename/remove fgIsBetterFallThrough()?
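A rough sketch of what such accessors could look like (illustrative; the exact set of jump kinds that carry a valid bbJumpDest is an assumption here):

bool BasicBlock::HasJumpDest() const
{
    // bbJumpDest is only meaningful for these kinds (illustrative list)
    return KindIs(BBJ_ALWAYS, BBJ_COND, BBJ_CALLFINALLY, BBJ_EHCATCHRET, BBJ_LEAVE);
}

BasicBlock* BasicBlock::GetJumpDest() const
{
    assert(HasJumpDest()); // never read the union member without checking bbJumpKind
    return bbJumpDest;
}

bool BasicBlock::HasJumpSwt() const
{
    return KindIs(BBJ_SWITCH); // bbJumpSwt is only valid for switches
}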

@@ -1101,15 +1097,15 @@ PhaseStatus Compiler::fgCloneFinally()
{
BasicBlock* newBlock = blockMap[block];
// Jump kind/target should not be set yet
-assert(newBlock->KindIs(BBJ_NONE));
+assert(!newBlock->HasJump());
Member

Shouldn't this be:

Suggested change:
-assert(!newBlock->HasJump());
+assert(newBlock->KindIs(BBJ_ALWAYS));

? Or do you also want:

            assert(newBlock->KindIs(BBJ_ALWAYS) && !newBlock->HasJump());

Member Author

The second is what I was going for; I'll update it.

{
return true;
noway_assert(b1->KindIs(BBJ_ALWAYS));
Member

This assert seems unnecessary. At least, make it a normal assert (not a noway_assert)

BBJ_THROW, // SCK_ARG_EXCPN
BBJ_THROW, // SCK_ARG_RNG_EXCPN
BBJ_THROW, // SCK_FAIL_FAST
BBJ_ALWAYS, // SCK_NONE
Member

This seems odd. Does SCK_NONE ever get used?

Member Author

I guess not; I added an assert to test whether add->acdKind == SCK_NONE is ever true, and the SuperPMI replay was clean. A quick search of the source code doesn't yield any places where we assign SCK_NONE, so maybe I can add an assert here that ensures we don't use SCK_NONE? Then I can remove the conditional logic below for setting the jump target if newBlk is a BBJ_ALWAYS.

if (newBlk->KindIs(BBJ_ALWAYS))
{
assert(add->acdKind == SCK_NONE);
newBlk->SetJumpDest(newBlk->Next());
Member

What if newBlk is the last block of the function? Then is bbJumpDest == nullptr an indication of "fall off the end?" (previously, we'd have a BBJ_NONE and generate an int3 / breakpoint if that happened)

Member Author

Good point. When emitting jumps, I added an assert to see if we ever get a BBJ_ALWAYS with a null jump target, and it never hit during SuperPMI replays. Per your note on SCK_NONE, I think we can get rid of this if statement altogether.

@amanasifkhalid (Member Author) commented Nov 6, 2023

@AndyAyersMS @BruceForstall thank you both for the code reviews! I'll address your feedback shortly.

I suppose we could also try and look at the debug info we generate for some of these methods (say, pass --debuginfo to superpmi.py asmdiffs)...

I tried this with the peephole optimization always enabled, and only found differences in the number of IL offsets when the number of blocks differed (usually from flowgraph optimizations like fgCompactBlocks behaving less/more aggressively), so those diffs only applied for FullOpts. I spot-checked the MinOpts examples where the peephole optimization significantly reduced the code size (like the second case study above), and found no differences in the number of IL offsets -- only the actual offsets differed to reflect the different codegen. So I don't think this optimization will affect debugging? If you'd like, I can try modifying jit-analyze to check for diffs in the number of IL offsets; maybe I'm missing something, but it doesn't explicitly report diffs in debug info for me.

It would be helpful to see how much of the code growth comes in methods where more cloning happens, and whether aside from that there are other things going on that would be worth understanding.

I replayed libraries_tests.run.windows.arm64.Release with JitTimeLogCsv and diff'd methods by the number of loops cloned, and all of the methods with the top size regressions by percentage reported by SuperPMI had diffs in number of cloned loops. @AndyAyersMS I see in #94363 you have similar size regressions. I'll tweak the script I used to collect diffs a bit to improve its usability, and share it with you offline.

Seems like there is more than just this peephole involved, as x64 is still smaller than it was.

I'll take a look at the JIT dumps for the top improvements next to see where else the decreases are coming from (particularly for MinOpts).

@BruceForstall
Member

I'd like to understand why there is more (or less) cloning. There's a lot of code that likes "top entry loops" so perhaps those aren't being distinguished as before?

More fundamentally: are a different set/number of loops being recognized by the loop recognizer, without "fall through"? Does loop inversion happen the same amount?

@amanasifkhalid
Member Author

More fundamentally: are a different set/number of loops being recognized by the loop recognizer, without "fall through"?

I think this is the case. Looking at the DOTNET_JitTimeLogCsv output, for the methods where the diff JIT does more loop cloning, many are reported as having more loops to begin with, so it looks like I've broken loop recognition. I think there are still plenty of methods where we recognize the same loops, but because prior optimization passes behaved slightly differently, we're able to clone loops we previously couldn't. But I'll look at fixing loop recognition first, and see how much that cuts down on the code size increases.

@kunalspathak
Member

I've added a small peephole optimization to skip emitting unconditional branches to the next block during codegen.

Is this different from #69041?

@amanasifkhalid
Member Author

@kunalspathak they seem to have the same goal, but I think my approach is more aggressive, in that it checks to see if a BBJ_ALWAYS is functionally equivalent to BBJ_NONE during codegen. Without the opt I added, I saw plenty of unnecessary jumps to next after replacing BBJ_NONE with BBJ_ALWAYS, so since #69041 was implemented with the assumption that we have BBJ_NONE, I'm guessing that approach isn't as aggressive?

@AndyAyersMS
Member

Looks like that was the problem.

@AndyAyersMS (Member) left a comment

LGTM. Thanks for hanging in there through a number of revisions.

@amanasifkhalid
Member Author

Thank you for all the reviews!

@amanasifkhalid merged commit 52e65a5 into dotnet:main Nov 28, 2023
127 of 129 checks passed
@BruceForstall
Member

I asked Andy if we should do similar work for BBJ_CALLFINALLY so we can split up BBJ_CALLFINALLY/BBJ_ALWAYS pairs, and we decided at the time to leave these be for now. I don't know if we have anything to gain from being able to split them up in terms of codegen, but removing all those edge cases around call/always pairs would be nice, so maybe we should add a similar successor pointer for BBJ_CALLFINALLY...

There is no point in allowing the BBJ_ALWAYS to be split up from (live in a different place from, have code in between, etc.) its paired BBJ_CALLFINALLY. All the codegen happens when processing the BBJ_CALLFINALLY, which uses the data from the BBJ_ALWAYS. The BBJ_ALWAYS itself generates nothing.

@jakobbotsch
Member

Yeah, the main benefit would be to allow us to get rid of all code in the JIT that has to deal with the possibility of implicit fallthrough (bbFallsThrough and all its uses). I expect that would be a significant simplification of a lot of logic. Something to consider, but clearly BBJ_COND is the more important one for now, and most (or at least a lot) of the places that deal with fallthrough only end up having to deal with BBJ_COND and BBJ_NONE anyway.

@amanasifkhalid
Member Author

Thanks for clarifying @BruceForstall. Maybe later on, we can consider replacing BBJ_CALLFINALLY/BBJ_ALWAYS pairs with a new jump kind that has two jump targets: the CALLFINALLY jump target, and then the ALWAYS target after. This should allow us to get rid of a lot of the special checks for these pairs when modifying BBJ_ALWAYS blocks.

@BruceForstall
Member

bbFallsThrough

It's very weird to think about non-retless CALLFINALLY as "fall through": you can't insert a block between the CALLFINALLY and ALWAYS, and control flow doesn't "fall through" to the block after the ALWAYS since the finally returns to the ALWAYS target. I'm not sure what it means.

Maybe later on, we can consider replacing BBJ_CALLFINALLY/BBJ_ALWAYS pairs with a new jump kind that has two jump targets: the CALLFINALLY jump target, and then the ALWAYS target after. This should allow us to get rid of a lot of the special checks for these pairs when modifying BBJ_ALWAYS blocks.

If we could make that work, it would be great. Currently, the finally's EHFINALLYRET returns each ALWAYS block as a successor, and that ALWAYS block has its EHFINALLYRET as a predecessor. What you suggest would simplify a lot of current logic, but it might add additional special logic to flow graph interpretation. E.g., the EHFINALLYRET would, I suppose, yield all the continuation blocks as successors (makes sense), and each of them would have an EHFINALLYRET as a predecessor, as well as any non-finally block predecessors.

@jakobbotsch
Member

It's very weird to think about non-retless CALLFINALLY as "fall through": you can't insert a block between the CALLFINALLY and ALWAYS, and control flow doesn't "fall through" to the block after the ALWAYS since the finally returns to the ALWAYS target. I'm not sure what it means.

The function has the comment "Can a BasicBlock be inserted after this without altering the flowgraph" and the current design definitely means retless CALLFINALLY falls in the category (well, the opposite meaning is implied I'm sure). I guess that's why it returns true.

I think the representation where we store the continuation in the CALLFINALLY sounds natural. Finding the successors of the EHFINALLYRET would be done by looking at the regular predecessors of the handler entry, which should be all relevant CALLFINALLY blocks. (Actually couldn't we do that even today, instead of the side table added in #93377?)
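A sketch of that successor enumeration under the proposed representation (PredBlocks() is the existing predecessor iterator; GetFinallyContinuation is a hypothetical accessor standing in for wherever the continuation block would be stored):

template <typename TVisitor>
void VisitFinallyRetSuccessors(BasicBlock* finallyEntry, TVisitor visit)
{
    // The EHFINALLYRET's successors are the continuations of every CALLFINALLY
    // that invokes this finally, i.e. the CALLFINALLY predecessors of its entry.
    for (BasicBlock* const pred : finallyEntry->PredBlocks())
    {
        if (pred->KindIs(BBJ_CALLFINALLY))
        {
            visit(pred->GetFinallyContinuation()); // hypothetical accessor
        }
    }
}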

@amanasifkhalid
Member Author

The function has the comment "Can a BasicBlock be inserted after this without altering the flowgraph"

And based on the return values, the comment should probably be something like "Would inserting after this block alter the flowgraph", since it returns true for blocks with implicit fallthrough.

@BruceForstall
Member

Finding the successors of the EHFINALLYRET would be done by looking at the regular predecessors of the handler entry, which should be all relevant CALLFINALLY blocks. (Actually couldn't we do that even today, instead of the side table added in #93377?)

The "side table" is mostly for performance, simplicity, use by the iterators, and consistency with how switches are represented.

What you say about using the regular predecessors of the finally entry makes sense. It wasn't done that way before, possibly because the previous code was written before predecessors were always available.

@AndyAyersMS
Member

I think we can indeed get rid of these, and it would be nice to no longer have all that special-case handling everywhere.

We might need to constrain layout and/or add to codegen to handle the case where the retfinally target block does not end up immediately after the corresponding callfinally block. I don't know if codegen is flexible enough today to do the latter (basically introducing a label that's not at a block begin, and after that, a branch to the right spot).

@BruceForstall
Member

I wrote a proposal to reconsider the BBJ_CALLFINALLY/BBJ_ALWAYS representation: #95355

Please add comments there about what might be required to make that work.

amanasifkhalid added a commit that referenced this pull request Nov 30, 2023
Follow-up to #94239. In MinOpts scenarios, we should remove branches to the next block regardless of whether BBF_NONE_QUIRK is set, as this yields code size and TP improvements.
@amanasifkhalid
Member Author

Collated set of improvements/regressions (lower is better) as of 12/12/2023.

Notes Recent Score Orig Score arm64
Ubuntu
arm64
Windows
intel x64
Ubuntu
intel x64
amd x64
Windows
Benchmark
1.51 1.49 1.51
1.49
System.Memory.Span(Char).IndexOfAnyThreeValues(Size: 512)
1.40 1.41 1.40
1.41
System.Tests.Perf_Uri.EscapeDataString(input: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1.39 1.38 1.39
1.38
System.Memory.Span(Char).LastIndexOfAnyValues(Size: 512)
1.31 1.31 1.31
1.31
Burgers.Test2
1.28 1.28 1.28
1.28
System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
1.26 1.26 1.26
1.26
System.Memory.ReadOnlySpan.IndexOfString(input: "???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
1.25 1.25 1.25
1.25
SciMark2.kernel.benchSparseMult
1.25 1.25 1.25
1.25
System.Tests.Perf_Int128.ParseSpan(value: "170141183460469231731687303715884105727")
1.25 1.25 1.25
1.25
System.Tests.Perf_Int128.Parse(value: "170141183460469231731687303715884105727")
1.23 1.23 1.23
1.23
System.Tests.Perf_Int128.TryParse(value: "170141183460469231731687303715884105727")
1.22 1.22 1.22
1.22
System.Tests.Perf_Int128.TryParseSpan(value: "170141183460469231731687303715884105727")
1.21 1.21 1.21
1.21
Benchstone.BenchI.CSieve.Test
1.21 1.17 1.21
1.17
System.Collections.IterateFor(Int32).ImmutableSortedSet(Size: 512)
1.20 1.20 1.20
1.20
System.Text.Perf_Utf8Encoding.GetByteCount(Input: Chinese)
1.20 1.20 1.20
1.20
System.Text.Perf_Utf8Encoding.GetByteCount(Input: Cyrillic)
1.20 1.19 1.20
1.19
System.Text.Perf_Utf8Encoding.GetByteCount(Input: EnglishMostlyAscii)
1.19 1.19 1.19
1.19
System.Text.Perf_Utf8Encoding.GetByteCount(Input: Greek)
1.17 1.13 1.17
1.13
System.Tests.Perf_Enum.InterpolateIntoStringBuilder_NonFlags(value: 42)
1.16 1.16 1.16
1.16
System.Numerics.Tests.Perf_VectorOf(Int16).DivisionOperatorBenchmark
1.15 1.13 1.15
1.13
System.Memory.Span(Char).Reverse(Size: 512)
1.14 1.13 1.14
1.13
System.Tests.Perf_String.ToUpperInvariant(s: "This is a much longer piece of text that might benefit more from vectorization.")
1.13 1.13 1.13
1.13
System.Collections.ContainsFalse(Int32).ImmutableSortedSet(Size: 512)
1.13 1.10 1.13
1.10
System.Tests.Perf_String.ToUpper(s: "This is a much longer piece of text that might benefit more from vectorization.")
1.13 1.13 1.13
1.13
System.MathBenchmarks.Single.ScaleB
1.12 1.05 1.12
1.05
System.Tests.Perf_Enum.InterpolateIntoStringBuilder_Flags(value: 32)
1.12 1.12 1.12
1.12
SciMark2.kernel.benchFFT
1.12 1.12 1.12
1.12
System.Tests.Perf_Enum.InterpolateIntoSpan_NonFlags(value: 42)
1.11 1.07 1.11
1.07
System.Collections.ContainsFalse(String).FrozenSet(Size: 512)
1.11 1.10 1.11
1.10
System.Collections.IndexerSet(Int32).SortedList(Size: 512)
1.11 1.10 1.11
1.10
Benchstone.BenchI.Array1.Test
1.10 1.11 1.10
1.11
System.Tests.Perf_String.ToLowerInvariant(s: "This is a much longer piece of text that might benefit more from vectorization.")
1.10 1.09 1.10
1.09
System.Tests.Perf_String.ToLower(s: "This is a much longer piece of text that might benefit more from vectorization.")
1.09 1.10 1.09
1.10
Benchstone.MDBenchI.MDAddArray2.Test
1.09 1.10 1.09
1.10
System.Memory.Span(Char).Clear(Size: 512)
1.09 1.32 1.09
1.32
System.Tests.Perf_String.Concat_CharEnumerable
1.08 1.08 1.08
1.08
System.Collections.IterateFor(String).ImmutableSortedSet(Size: 512)
1.08 1.08 1.08
1.08
System.Collections.ContainsFalse(Int32).SortedSet(Size: 512)
1.08 1.07 1.37
1.37
0.85
0.84
System.Tests.Perf_String.IndexerCheckPathLength
1.07 1.07 1.07
1.07
System.Collections.ContainsFalse(Int32).ImmutableHashSet(Size: 512)
1.07 1.06 1.07
1.06
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: Url,?2020,16)
1.06 1.06 1.06
1.06
PerfLabTests.CastingPerf.IFooFooIsIFoo
1.06 1.06 1.06
1.06
PerfLabTests.CastingPerf.ObjFooIsObj2
1.03 1.29 1.03
1.29
System.Collections.ContainsFalse(Int32).Queue(Size: 512)
1.00 0.91 1.00
0.91
IfStatements.IfStatements.And
1.00 0.82 1.00
0.82
IfStatements.IfStatements.Single
1.00 0.84 1.00
0.84
PerfLabTests.LowLevelPerf.StructWithInterfaceInterfaceMethod
1.00 0.89 1.00
0.89
PerfLabTests.CastingPerf2.CastingPerf.FooObjCastIfIsa
1.00 0.89 1.00
0.89
PerfLabTests.LowLevelPerf.InterfaceInterfaceMethodSwitchCallType
0.94 0.94 0.95
0.94
0.93
0.94
System.Collections.Tests.Perf_BitArray.BitArrayCopyToByteArray(Size: 512)
0.94 0.93 0.94
0.93
System.MathBenchmarks.Single.SinCosPi
0.94 0.94 0.94
0.94
System.Numerics.Tests.Perf_BitOperations.Log2_ulong
0.94 0.94 0.94
0.94
System.Numerics.Tests.Perf_BitOperations.Log2_uint
0.94 0.93 0.94
0.93
System.Collections.ContainsKeyTrue(Int32, Int32).Dictionary(Size: 512)
0.94 0.88 0.94
0.88
System.Text.Tests.Perf_Encoding.GetChars(size: 16, encName: "ascii")
0.93 0.93 0.93
0.93
System.Text.Json.Tests.Utf8JsonReaderCommentsTests.Utf8JsonReaderCommentParsing(CommentHandling: Skip, SegmentSize: 0, TestCase: LongMultiLine)
0.93 0.94 0.93
0.94
System.Buffers.Tests.SearchValuesCharTests.LastIndexOfAny(Values: "abcdefABCDEF0123456789")
0.93 0.93 0.93
0.93
0.93
0.93
ByteMark.BenchIDEAEncryption
0.93 0.93 0.92
0.92
0.94
0.94
System.Collections.ContainsFalse(Int32).Span(Size: 512)
0.92 0.93 0.92
0.93
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: UnsafeRelaxed,hello "there",512)
0.91 0.93 0.91
0.93
MicroBenchmarks.Serializers.Xml_ToStream(IndexViewModel).XmlSerializer_
0.91 0.92 0.91
0.92
Benchstone.MDBenchF.MDSqMtx.Test
0.91 0.91 0.91
0.90
0.92
0.93
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: UnsafeRelaxed,hello "there",16)
0.91 0.91 0.91
0.91
System.Collections.IterateForEachNonGeneric(String).Stack(Size: 512)
0.91 0.90 0.91
0.90
System.Collections.Tests.Perf_PriorityQueue(Guid, Guid).Dequeue_And_Enqueue(Size: 100)
0.90 0.88 0.90
0.88
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: JavaScript,&Hello+(World)!,512)
0.90 0.90 0.90
0.90
Benchstone.BenchI.BubbleSort.Test
0.90 0.87 0.92
0.90
0.88
0.85
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: JavaScript,&Hello+(World)!,16)
0.90 0.86 0.90
0.86
System.Collections.Perf_LengthBucketsFrozenDictionary.ToFrozenDictionary(Count: 10000, ItemsPerBucket: 1)
0.90 0.84 0.90
0.84
Benchstone.BenchI.Puzzle.Test
0.90 0.90 0.90
0.90
System.Collections.IterateForEachNonGeneric(String).ArrayList(Size: 512)
0.89 0.94 0.89
0.94
System.Collections.ContainsKeyTrue(Int32, Int32).ConcurrentDictionary(Size: 512)
0.89 0.89 0.89
0.89
System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_True_FrozenDictionary(Count: 100, ItemsPerBucket: 1)
0.88 0.85 0.88
0.85
System.Text.Tests.Perf_Encoding.GetByteCount(size: 16, encName: "ascii")
0.87 0.87 0.87
0.87
System.Text.Json.Tests.Utf8JsonReaderCommentsTests.Utf8JsonReaderCommentParsing(CommentHandling: Skip, SegmentSize: 100, TestCase: LongMultiLine)
0.87 0.82 0.87
0.82
System.Tests.Perf_Int32.ParseSpan(value: "12345")
0.87 0.87 0.87
0.87
Benchstone.BenchI.Array2.Test
0.87 0.87 0.87
0.87
PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod
0.86 0.85 0.88
0.84
0.85
0.87
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: Url,&lorem ipsum=dolor sit amet,512)
0.86 0.87 0.86
0.87
Benchstone.BenchI.BenchE.Test
0.86 0.87 0.86
0.87
System.Buffers.Tests.SearchValuesByteTests.IndexOfAnyExcept(Values: "abcdefABCDEF0123456789Ü")
0.86 0.85 0.88
0.88
0.83
0.83
System.Numerics.Tests.Perf_BitOperations.PopCount_uint
0.85 0.87 0.88
0.88
0.88
0.88
0.78
0.84
ByteMark.BenchAssignJagged
0.85 0.85 0.85
0.85
System.Numerics.Tests.Perf_BigInteger.Parse(numberString: 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012
0.85 0.84 0.85
0.84
System.Tests.Perf_Int32.TryParse(value: "12345")
0.84 0.84 0.84
0.84
Burgers.Test1
0.84 0.86 0.84
0.86
System.Buffers.Tests.SearchValuesCharTests.LastIndexOfAny(Values: "ßäöüÄÖÜ")
0.84 0.85 0.84
0.85
System.Tests.Perf_Int32.TryParseSpan(value: "12345")
0.83 0.81 0.83
0.81
System.Tests.Perf_Int32.ParseSpan(value: "2147483647")
0.83 0.79 0.83
0.79
Interop.StructureToPtr.MarshalDestroyStructure
0.83 0.82 0.83
0.82
System.Text.Tests.Perf_Encoding.GetByteCount(size: 512, encName: "ascii")
0.82 0.82 0.82
0.82
PerfLabTests.CastingPerf2.CastingPerf.IntObj
0.82 0.83 0.82
0.83
PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj
0.82 0.82 0.82
0.82
System.Memory.Span(Char).IndexOfAnyFiveValues(Size: 512)
0.82 0.81 0.82
0.81
System.Numerics.Tests.Perf_BigInteger.Parse(numberString: -2147483648)
0.82 0.79 0.82
0.79
Interop.StructureToPtr.MarshalPtrToStructure
0.81 0.79 0.81
0.79
System.Tests.Perf_Int32.TryParseSpan(value: "2147483647")
0.81 0.80 0.82
0.81
0.80
0.80
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: Url,&lorem ipsum=dolor sit amet,16)
0.81 0.80 0.81
0.80
Interop.StructureToPtr.MarshalStructureToPtr
0.80 0.81 0.81
0.83
0.80
0.80
System.Collections.Sort(IntStruct).List(Size: 512)
0.80 0.80 0.80
0.80
System.Collections.Tests.Perf_PriorityQueue(String, String).Enumerate(Size: 1000)
0.80 0.79 0.80
0.79
System.Tests.Perf_UInt32.ParseSpan(value: "4294967295")
0.80 0.80 0.80
0.80
System.Memory.Span(Byte).IndexOfAnyTwoValues(Size: 512)
0.79 0.83 0.80
0.82
0.78
0.84
System.Collections.Sort(IntStruct).Array(Size: 512)
0.78 0.75 0.78
0.75
System.Tests.Perf_Int32.TryParse(value: "2147483647")
0.78 0.78 0.78
0.78
System.Collections.IterateForEachNonGeneric(String).Queue(Size: 512)
0.78 0.77 0.69
0.69
0.88
0.86
System.Memory.Span(Char).IndexOfValue(Size: 512)
0.76 0.78 0.76
0.78
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Mariomkas.Count(Pattern: "(?:(?:25[0-5]
0.76 0.76 0.76
0.76
System.Tests.Perf_Int32.ParseHex(value: "80000000")
0.75 0.74 0.75
0.74
System.Tests.Perf_UInt32.TryParse(value: "4294967295")
0.75 0.75 0.74
0.75
0.75
0.76
System.Buffers.Tests.ReadOnlySequenceTests(Char).FirstSpanSingleSegment
0.74 0.73 0.74
0.73
System.Tests.Perf_Int32.ParseHex(value: "7FFFFFFF")
0.73 0.76 0.73
0.76
System.Tests.Perf_UInt32.Parse(value: "4294967295")
0.73 0.72 0.72
0.72
0.73
0.73
System.Buffers.Tests.ReadOnlySequenceTests(Char).FirstSpanTenSegments
0.70 0.86 0.70
0.86
System.Text.Tests.Perf_Encoding.GetByteCount(size: 512, encName: "utf-8")
0.70 0.69 0.70
0.69
System.Memory.Span(Char).Fill(Size: 512)
0.67 0.67 0.67
0.67
System.MathBenchmarks.Single.Max
0.67 0.67 0.67
0.67
System.MathBenchmarks.Single.Min
0.67 0.68 0.68
0.68
0.66
0.68
System.Tests.Perf_String.IndexerCheckBoundCheckHoist
0.66 0.67 0.66
0.67
0.66
0.67
System.Tests.Perf_String.IndexerCheckLengthHoisting
0.56 0.54 0.56
0.54
System.Collections.Tests.Perf_Dictionary.ContainsValue(Items: 3000)
