Add lowering support for conditional nodes #71705

a74nh · 2022-07-06T10:44:06Z

This builds on the code added in #71616

The code is pulled directly from #67894

As before, with this patch nothing uses the conditional nodes, so the impact on code gen should be zero.

ghost · 2022-07-06T10:44:22Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This builds on the code added in #71616

The code is pulled directly from #67286

As before, with this patch nothing uses the conditional nodes, so the impact on code gen should be zero.

Author:	a74nh
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `community-contribution`
Milestone:	-

a74nh · 2022-07-06T17:04:38Z

Looks like those test failures might be related to my changes in LowerNodeCC and LowerHWIntrinsicCC. Will investigate.

kunalspathak · 2022-07-07T23:21:10Z

Seems there are code paths touching x64 as seen in spmi-diff

a74nh · 2022-07-11T08:58:33Z

> Seems there are code paths touching x64 as seen in spmi-diff

Highly suspect this is my codegenxarch changes. Will look at this.

jakobbotsch · 2022-07-19T07:29:25Z

src/coreclr/jit/codegenarm64.cpp

+            // An And that is not contained should not have any contained children.
+            assert(!op1->isContained() && !op2->isContained());


This code should not make any assumptions about this.

The problem is handling that case.

Firstly that would mean having extra logic in the AND code generation to handle contained children. Do we then assume the contained children of the AND can be of any node type?

Then should I apply compare chain logic to it? If op1 is a compare chain, then op2 should probably generate a conditional compare, and then the AND needs to generate into a register. Alternatively, just take the easy way: generate op1 into a register, op2 into a register, and do a normal AND.

The problem is, those cases aren't going to happen (due to the lower phase not creating non-contained ANDs that have contained children).

So maybe the answer is:
in the code generation for AND, if it has contained children, then generate those children into registers and then plant the AND as normal

> Firstly that would mean having extra logic in the AND code generation to handle contained children. Do we then assume the contained children of the AND can be of any node type?

No; this should be left up to codegen of AND as it exists today.

For any uncontained child it does not make sense to ask questions about its children in turn; those will have been handled as part of its code generation. They should be opaque to the grandparent node. Above you are also assuming that all operands of GT_AND nodes are themselves GenTreeOp; this is not an ok assumption to make either.

These are probably indications that the implementation is not completely right yet. It is ok to make this work only for contained nodes and to rely on containment checks done in lowering previously, but the way this is currently implemented it is not the case. For example, for LIR like:

    a = LCL_VAR V00
    b = CNS_INT 4
    c = AND a, b
    d = LCL_VAR_ADDR V00
    e = CALL Foo, d
    f = CNS_INT 1
    g = SELECT c, e, f

we will not be able to contain AND in SELECT, yet IIUC genCodeForSelect will come here and assume that LCL_VAR V00 and CNS_INT 4 can be cast to GenTreeOp (not true) and that they cannot be contained (not true for the constant).

> No; this should be left up to codegen of AND as it exists today.

There will still need to be some changes in the codegen for AND....

in CodeGen::genCodeForBinary() for Arm64 it only handles contained for MUL:
if (op2->OperIs(GT_MUL) && op2->isContained())

> They should be opaque to the grandparent node

Agreed (and the subsequent comments too)

> There will still need to be some changes in the codegen for AND....

I agree this will be necessary if we are going to add support for AND to consume compares via ccmp instead of by materializing the truth value into a register. However, if we are leaving the support for SELECT only then I don't see why it should be necessary, but maybe I am missing something.

I guess the way I would shape this is the following:

1. Make genCodeForConditionalCompare always produce a value into the flags, never into a register. Assert that it is only called on contained AND and compare nodes of the right shapes. In the base cases, we still need to call genConsumeReg on operands and get values from registers.

2. Change genCodeForSelect to satisfy the above: if the cond op is contained, then call genCodeForConditionalCompare, and otherwise call genConsumeReg and generate code to compare the register with 0.

3. (Optional) Make a similar change in codegen for GT_AND to support contained comparisons there via ccmp as well. This will need to materialize a truth value.

The way the current code is shaped might also be fine, but you are probably missing some handling for the base cases (where the operands are no longer contained).
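
To illustrate the shape suggested above, here is a minimal toy model, not the JIT's actual code. `Node`, `Oper`, `ConsumeReg`, `GenConditionToFlags` and `GenSelect` are hypothetical stand-ins, instructions are emitted as plain strings, and the `ccmp` operands for the flags immediate and condition are omitted for brevity. The point is only the control flow: a contained AND/compare chain goes to the flags, anything else is consumed from a register and compared against zero.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-ins for JIT node kinds.
enum class Oper { Leaf, Compare, And, Select };

struct Node {
    Oper        oper;
    bool        contained; // true: the parent generates code for this node
    std::string reg;       // register holding the value when not contained
    const Node* op1;
    const Node* op2;
    const Node* op3;
};

using Code = std::vector<std::string>;

// Stand-in for genConsumeReg: the operand is opaque, only its register is read.
inline std::string ConsumeReg(const Node* n) {
    assert(!n->contained);
    return n->reg;
}

// Leaves the condition in the flags, never in a register.
// Only valid on contained AND / compare nodes.
inline void GenConditionToFlags(const Node* n, Code& out, bool& first) {
    assert(n->contained);
    if (n->oper == Oper::And) {
        GenConditionToFlags(n->op1, out, first);
        GenConditionToFlags(n->op2, out, first);
        return;
    }
    assert(n->oper == Oper::Compare);
    std::string a = ConsumeReg(n->op1);
    std::string b = ConsumeReg(n->op2);
    // The first compare in a chain is a plain cmp, later ones are ccmp.
    out.push_back((first ? "cmp " : "ccmp ") + a + ", " + b);
    first = false;
}

// SELECT: use the flags directly for a contained condition,
// otherwise compare the condition's register with zero.
inline Code GenSelect(const Node* sel) {
    assert(sel->oper == Oper::Select);
    Code out;
    const Node* cond = sel->op1;
    if (cond->contained) {
        bool first = true;
        GenConditionToFlags(cond, out, first);
    } else {
        out.push_back("cmp " + ConsumeReg(cond) + ", #0");
    }
    out.push_back("csel " + sel->reg + ", " + ConsumeReg(sel->op2) + ", " +
                  ConsumeReg(sel->op3));
    return out;
}
```

In this model an uncontained AND feeding a SELECT never has its children inspected: the SELECT only reads the AND's result register, which is the "opaque to the grandparent" property discussed above.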

That logic doesn't quite work. For a 'SELECT' node, if the conditional is contained then it could be a 'CMP' node or an 'AND' node. So we still need something to handle both.

I'll refine the existing code using some of the above, and add some code in the AND and then see where that gets us....

My suggestion was to handle the contained AND in genCodeForConditionalCompare, in the same way you are doing now.
It is fine to assert that you only see contained ANDs/CMP here.

This part is updated.

It looks great now, but I'll have to spend a bit of time on this to make sure that this handles interference checking with IsSafeToContainMem correctly.

jakobbotsch · 2022-07-19T08:03:07Z

src/coreclr/jit/lower.cpp

+// Return Value:
+//    True if the chain is valid
+//
+bool Lowering::ContainCheckCompareChain(GenTree* tree, GenTree* parent, GenTree** earliest_valid)


I'm wondering if we can start using this in other places in this PR to test it without having to introduce the complicated if-conversion logic.
For example, should it not be beneficial to do this for any possible comparison/AND node? E.g. code like `bool b = (x < 5) & (y < 3);` might be able to use this even without if-conversion.

In fact, thinking about it some more, this part of the PR seems orthogonal to GT_SELECT, there are several nodes that can be taught to consume flags without materializing a truth value in a register:

GT_AND via ccmp

GT_SELECT via csel

GT_JTRUE via cb

Hopefully all of these will end up on the same plan (doesn't have to be in this PR, but would be nice if we could do something for GT_AND to at least test it out).
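
As a rough illustration of why `ccmp` lets an AND of two compares stay in the flags, the sketch below emulates the Arm64 NZCV behaviour of `cmp`, `ccmp` and `cset` for the `(x < 5) & (y < 3)` example. The helper names (`Flags`, `Cmp`, `CondLt`, `CcmpLt`, `BothLessThan`) and the register names in the comments are illustrative assumptions, not anything taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// NZCV condition flags as set by the Arm64 compare instructions.
struct Flags {
    bool n, z, c, v;
};

// Flags produced by `cmp a, b` (a 32-bit subtract that discards the result).
inline Flags Cmp(int32_t a, int32_t b) {
    uint32_t ua = static_cast<uint32_t>(a);
    uint32_t ub = static_cast<uint32_t>(b);
    uint32_t r  = ua - ub; // wraps modulo 2^32
    bool signA  = (ua >> 31) != 0;
    bool signB  = (ub >> 31) != 0;
    bool signR  = (r >> 31) != 0;
    Flags f;
    f.n = signR;
    f.z = (r == 0);
    f.c = (ua >= ub);                           // carry set means "no borrow"
    f.v = (signA != signB) && (signR != signA); // signed overflow
    return f;
}

// Signed "less than" condition: N != V.
inline bool CondLt(const Flags& f) { return f.n != f.v; }

// `ccmp a, b, #nzcv, lt`: if LT holds for the current flags, compare a with b;
// otherwise load the fallback flags given as the immediate.
inline Flags CcmpLt(const Flags& cur, int32_t a, int32_t b, Flags fallback) {
    return CondLt(cur) ? Cmp(a, b) : fallback;
}

// (x < 5) & (y < 3) without materializing either truth value.
inline bool BothLessThan(int32_t x, int32_t y) {
    Flags f     = Cmp(x, 5);                    // cmp  w0, #5
    Flags notLt = {false, false, false, false}; // #0: N == V, so LT is false
    f = CcmpLt(f, y, 3, notLt);                 // ccmp w1, #3, #0, lt
    return CondLt(f);                           // cset w0, lt
}
```

If the first compare fails, the fallback flags force the final condition to be false, which is what makes the two-instruction chain equivalent to the bitwise AND of the two comparisons.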

Thinking this through.... If I add compare chain logic to the lowering of GT_AND, then most of ContainCheckCompareChain() will vanish because the AND nodes will be lowered before the SELECT node. That's probably a good thing.

Sounds good to me. Potentially some heuristics may be needed to ensure we only do this when it is profitable; I'm not sure about the latency/throughput of ccmp vs bitwise and. But this way we can get some early testing on this logic here, which will be nice.

Also, we should be sure to only do this transformation when optimizations are enabled (comp->opts.OptimizationEnabled()).

> But this way we can get some early testing on this logic here which will be nice.

+1 on that. @a74nh , just pinging to see if you agree and plan to do this while lowering AND itself.

It sounds like there is a decision to be made here -- what is the most efficient way to generate this pattern of nodes. It does not matter how we represent the sequence of nodes, we still have to make that decision.

Just to clarify, because I'm not sure my point was very clear. Right now the TEST_EQ transformation is assuming ahead of time that it is always the most optimal way to do it. If we instead had conditional compare nodes, we would be doing the opposite -- assuming ahead of time that the conditional compares are always the most optimal way. In reality we need to consider these transforms that conflict on a case by case basis and make a decision for each on what the best way is. So I think it is a good thing that we hit a situation like this.

I also think it's best to disable the TEST_EQ optimisation if it sees the contained chain. Disabling that causes the JTRUE to JCC optimisation to kick in instead. Disabling that too means we end up with:

    cmp ...
    ccmp ...., le
    cset x0, gt
    cbz x0, label

Which is much better than the code without my patch.

I don't see how GT_SELECT helps in the general case. I can only see that it would help if the successor blocks to the block containing GT_JTRUE are simple enough that the entire thing can be replaced with GT_SELECT. Can you elaborate what you mean?

In most cases in the code above, the last two instructions will become a csel.
But, yes, there will be scenarios where we can't use GT_SELECT. In most of those cases though, we'll also fail to generate a compare chain. If it turns out there are more instances than I expect, then we can look at using the JCC in a follow on patch to replace the last two instructions in the code above with a jgt.

> Which is much better than the code without my patch.

Indeed, looks great. I look forward to the PR that puts JTRUE on the same plan as this.
One thing I would like to see is a micro benchmark to make sure our expectation that this is better is correct.
The easy way to do that is via BenchmarkDotNet, where you can specify the old and new corerun with the --corerun <old corerun> <new corerun> option.

> In most cases in the code above, the last two instructions will become a csel.

I would not expect that to be the case. Only for very limited forms of IR will we be able to generate GT_SELECT in this case, since it requires both successor blocks to be single assignments to the same variable without any other side effects. This is probably a common pattern, but not the majority case pattern?

New patch with everything above included.

I've left in the compare chain size calculation in the lower phase, but it's not really being used at the moment, as it's allowing all chain sizes.

Getting the register consuming working took a while. I had to move the consume calls out of GenCodeCompare(), otherwise the node gets consumed in the BinaryOp generation, then during the compare chain generation, it gets consumed again during the compare generation.

Running asmdiffs on the library tests, I only get 10 functions firing. Not seeing any chains longer than the above examples being generated (possible that it's due to my code). Want to spend a little more time playing with the results.

FWIW, I'm fine with this getting low hits for now. I expect in .NET 8 we can enable it for JTRUE and expand boolean optimizations (I believe we do not transform (a relop b) && (c relop d) => (a relop b) & (c relop d) today), which should make the optimization much more impactful.

Also, we can presumably do something similar for bitwise or.

a74nh · 2022-08-05T18:08:21Z

Failures in Antigen and jitstress and outerloop. Are these something I should be concerned about and any quick pointers on what I should be running to reproduce them?

kunalspathak · 2022-08-05T18:35:22Z

Antigen failures are known issues. I would wait for the jitstress and outerloop legs to complete. I do see some new failures (only on Arm) for outerloop, and the way to reproduce them is to download the correlation payload using runfo.

Some known failures are:

jakobbotsch · 2022-08-05T19:40:35Z

> I do see some new failures (only on Arm) for outerloop

From what I can see there are no new outerloop failures if you compare to the last outerloop run on main:
https://dev.azure.com/dnceng/public/_build/results?buildId=1925086&view=results

On the other hand, the arm64 System.Threading.Tests.SpinLockTests.RunSpinLockTests_NegativeTests failure looks like it is related.

a74nh · 2022-08-08T10:15:40Z

> > I do see some new failures (only on Arm) for outerloop

> From what I can see there are no new outerloop failures if you compare to the last outerloop run on main: https://dev.azure.com/dnceng/public/_build/results?buildId=1925086&view=results

> On the other hand, the arm64 System.Threading.Tests.SpinLockTests.RunSpinLockTests_NegativeTests failure looks like it is related.

That's odd, I don't get any failures when I run it myself:

❯ ~/dotnet/runtime_csel/.dotnet/dotnet build -t:Test -c Release -p:XunitMethodName=System.Threading.Tests.SpinLockTests.RunSpinLockTests_NegativeTests
MSBuild version 17.3.0-preview-22306-01+1c045cf58 for .NET
  Determining projects to restore...
  All projects are up-to-date for restore.
  Microsoft.Interop.SourceGeneration -> /home/alahay01/dotnet/runtime_csel/artifacts/bin/Microsoft.Interop.SourceGeneration/Release/netstandard2.0/Microsoft.Interop.SourceGeneration.dll
  LibraryImportGenerator -> /home/alahay01/dotnet/runtime_csel/artifacts/bin/LibraryImportGenerator/Release/netstandard2.0/Microsoft.Interop.LibraryImportGenerator.dll
  TestUtilities -> /home/alahay01/dotnet/runtime_csel/artifacts/bin/TestUtilities/Release/net6.0/TestUtilities.dll
  System.Threading.Tests -> /home/alahay01/dotnet/runtime_csel/artifacts/bin/System.Threading.Tests/Release/net7.0/System.Threading.Tests.dll
  ----- start Mon Aug 8 10:05:18 UTC 2022 =============== To repro directly: =====================================================
  pushd /home/alahay01/dotnet/runtime_csel/artifacts/bin/System.Threading.Tests/Release/net7.0
  /home/alahay01/dotnet/runtime_csel/artifacts/bin/testhost/net7.0-Linux-Release-arm64/dotnet exec --runtimeconfig System.Threading.Tests.runtimeconfig.json --depsfile System.Threading.Tests.deps.json xunit.console.dll System.Threading.Tests.dll -xml testResults.xml -nologo -method System.Threading.Tests.SpinLockTests.RunSpinLockTests_NegativeTests -notrait category=OuterLoop -notrait category=failing
  popd
  ===========================================================================================================
  ~/dotnet/runtime_csel/artifacts/bin/System.Threading.Tests/Release/net7.0 ~/dotnet/runtime_csel/src/libraries/System.Threading/tests
    Discovering: System.Threading.Tests (method display = ClassAndMethod, method display options = None)
    Discovered:  System.Threading.Tests (found 1 of 274 test case)
    Starting:    System.Threading.Tests (parallel test collections = on, max threads = 64)
    Finished:    System.Threading.Tests
  === TEST EXECUTION SUMMARY ===
     System.Threading.Tests  Total: 1, Errors: 0, Failed: 0, Skipped: 0, Time: 0.571s
  ~/dotnet/runtime_csel/src/libraries/System.Threading/tests
  ----- end Mon Aug 8 10:05:23 UTC 2022 ----- exit code 0 ----------------------------------------------------------
  exit code 0 means Exited Successfully

Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:00:11.84


❯ env | grep COMPlus
COMPlus_TieredCompilation=1
COMPlus_JitStress=1

I also did a run with JitDisasm=*, which showed exactly 2 uses of ccmp (although it's hard to tell much from this, because it was mixed with the output of other functions. A way of dumping each function to a different file would be really useful)

a74nh · 2022-08-08T10:25:57Z

Fixed up all of Kunal's comments.

jakobbotsch · 2022-08-08T10:42:44Z

> That's odd, I don't get any failures when I run it myself:

This is the codegen diff for System.Threading.SpinLock.Exit(bool). It does not look right, seems like this change ends up treating EQ(AND(x, 0x80000000), 0) as EQ(x, 0x80000000).

jakobbotsch · 2022-08-08T10:49:19Z

FWIW, whether or not you see the failure probably depends on whether we end up tiering System.Threading.SpinLock:Exit(bool).
You might be able to reproduce it more consistently with COMPlus_ReadyToRun=0 and COMPlus_TieredCompilation=0.

a74nh · 2022-08-08T11:45:17Z

> FWIW, whether or not you see the failure probably depends on whether we end up tiering System.Threading.SpinLock:Exit(bool). You might be able to reproduce it more consistently with COMPlus_ReadyToRun=0 and COMPlus_TieredCompilation=0.

Getting the diff now, but not the failure. Investigating the IR.

a74nh · 2022-08-08T14:33:45Z

> > That's odd, I don't get any failures when I run it myself:

> This is the codegen diff for System.Threading.SpinLock.Exit(bool). It does not look right, seems like this change ends up treating EQ(AND(x, 0x80000000), 0) as EQ(x, 0x80000000).

That looks fine to me.

Original code is:

        public void Exit(bool useMemoryBarrier)
        {
            // This is the fast path for the thread tracking is disabled and not to use memory barrier, otherwise go to the slow path
            // The reason not to add else statement if the usememorybarrier is that it will add more branching in the code and will prevent
            // method inlining, so this is optimized for useMemoryBarrier=false and Exit() overload optimized for useMemoryBarrier=true.
            int tmpOwner = _owner;
            if ((tmpOwner & LOCK_ID_DISABLE_MASK) != 0 & !useMemoryBarrier)
            {
                _owner = tmpOwner & (~LOCK_ANONYMOUS_OWNED);
            }
            else
            {
                ExitSlowPath(useMemoryBarrier);
            }
        }

(Note - this chaining only happens because the C# code is using & instead of &&).

We have the following IR: (Ignoring the children of NE 36 and EQ 41 for space)

                 [000033] -----------                         *  JTRUE     void  
                 [000034] J------N---                         \--*  EQ        int   
                 [000035] -----------                            +--*  AND       int   
                 [000036] -----------                            |  +--*  NE        int   ...
                 [000041] -----------                            |  \--*  EQ        int     ...
                 [000045] -----------                            \--*  CNS_INT   int    0

In current head, the AND EQ 0 is turned into a TEST_NE during lowering, giving:

N023 ( 15, 12) [000041] -----------                   t41 = *  EQ        int    REG x2 $202
N029 ( 10, 10) [000036] -----------                   t36 = *  TEST_NE   int    REG x3 $204
                                                            /--*  t41    int    
                                                            +--*  t36    int    
N031 ( 28, 26) [000034] J------N---                         *  TEST_NE   void   REG NA
N033 ( 30, 28) [000033] -----------                         *  JTRUE     void   REG NA $VN.Void

(and the NE was turned into a TEST_NE, but that's not relevant)

That's generated as:

EQ 41:
        7100003F          cmp     w1, #0        
        9A9F17E2          cset    x2, eq   
TEST_NE 36:
        7201027F          tst     w19, #0x80000000      
        9A9F07E3          cset    x3, ne  
TEST_NE 34:
        6A03005F          tst     w2, w3        
        54000161          bne     G_M57783_IG06

With my patch, that optimisation is skipped (due to hitting the isContained checks in lowering).

Instead, a different optimisation kicks in, switching the JTRUE EQ AND 0 to a JCMP AND 0:
(This optimisation isn't part of my patch)

  N008 ( 15, 12) [000041] -c---------                   t41 = *  EQ        int    $202
  N013 ( 10, 10) [000036] -c---------                   t36 = *  TEST_NE   int    $204
                                                              /--*  t41    int    
                                                              +--*  t36    int    
  N014 ( 26, 23) [000035] -----------                   t35 = *  AND       int    $205
  N015 (  1,  2) [000045] -c---------                   t45 =    CNS_INT   int    0 $40
                                                              /--*  t35    int    
                                                              +--*  t45    int    
  N016 ( 28, 26) [000034] CNE-------N---                         *  JCMP      void

That generates:

Compare chain: EQ 41:
  IN0004:                           cmp     w1, #0
Compare chain: TEST_NE 36:
  IN0005:                           ccmp    w19, w2, z, eq
Compare chain Finished: move the result from flags into a register
  IN0006:                           cset    x2, ne
JCMP 34:
  IN0007:                           cbnz    w2, G_M57783_IG06

(Ideally, the chain wouldn't need to generate into a register)

jakobbotsch · 2022-08-08T14:48:04Z

The semantics of TEST_NE(x, y) is (x & y) != 0. I don't think you can turn this into a conditional compare.
Small self-contained example:

public static void Main(string[] args)
{
    Console.WriteLine(Foo(3, false));
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static int Foo(int i, bool b)
{
    if ((i & 1) != 0 & !b)
        return 1;
    return 0;
}

Expected result: 1
Actual result on this PR with TC disabled: 0
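
To spell out the semantic gap, here is a small sketch contrasting the two operations. `TestNe`, `CmpNe` and `Foo` are illustrative helpers; the sketch only shows that the two predicates disagree on some inputs, and does not attempt to reproduce the exact miscompiled sequence.

```cpp
#include <cassert>
#include <cstdint>

// TEST_NE(x, y) as described above: true when x and y share a set bit.
inline bool TestNe(int32_t x, int32_t y) { return (x & y) != 0; }

// NE(x, y): a plain value comparison, which is what cmp/ccmp evaluate.
inline bool CmpNe(int32_t x, int32_t y) { return x != y; }

// Foo from the example above, written with the intended semantics.
inline int Foo(int32_t i, bool b) { return (TestNe(i, 1) & !b) ? 1 : 0; }
```

For instance, with x = 1 and y = 1 the bit test is true while the value comparison is false, and with x = 2 and y = 1 it is the other way round, so a TEST_ node cannot simply be dropped into a chain that compares values.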


a74nh · 2022-08-08T16:03:01Z

> The semantics of TEST_NE(x, y) is (x & y) != 0. I don't think you can turn this into a conditional compare. Small self-contained example:

Right! I was paying attention to the wrong part of the code.

New version pushed. I've added OperIsCmpCompare() to ensure TEST_ nodes are not put into the chains. (I guess there's an argument for not creating the TEST_ nodes if a chain could be created, but I wouldn't want to do that here).
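
A minimal sketch of the distinction such a predicate draws, assuming a made-up operator enum: the names below mirror the ones used in this thread, but the enum, its ordering, and `CanJoinCompareChain` are hypothetical and are not the JIT's actual definitions.

```cpp
#include <cassert>

// Hypothetical operator list; the order is chosen so range checks work.
enum class Oper { EQ, NE, LT, LE, GE, GT, TEST_EQ, TEST_NE, AND };

// Any comparison-like node, including the bit-test forms.
inline bool IsCompare(Oper o) { return o >= Oper::EQ && o <= Oper::TEST_NE; }

// Only the value comparisons that map onto cmp/ccmp.
inline bool IsCmpCompare(Oper o) { return o >= Oper::EQ && o <= Oper::GT; }

// A node may join a compare chain only if it is a plain value comparison.
inline bool CanJoinCompareChain(Oper o) { return IsCmpCompare(o); }
```

With this split, a TEST_NE still counts as a compare for general purposes but is rejected when building a chain.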

kunalspathak · 2022-08-08T23:04:58Z

Do we need to make similar change in codegen too? OperIsCompare() -> OperIsCmpCompare()?

a74nh · 2022-08-09T10:35:06Z

> Do we need to make similar change in codegen too? OperIsCompare() -> OperIsCmpCompare()?

Yes. It probably will never make any difference (due to lower never creating an invalid sequence), but should be there.

Added a patch to do this. Plus, I added some tests for these types of sequences.

kunalspathak · 2022-08-09T14:44:26Z

> Yes. It probably will never make any difference (due to lower never creating an invalid sequence), but should be there.

Sorry, I should have mentioned this yesterday, but I still see some places in gentree and lsrabuild where we still use OperIsCompare(). Do they also need fixup?

a74nh · 2022-08-09T15:32:46Z

> > Yes. It probably will never make any difference (due to lower never creating an invalid sequence), but should be there.

> Sorry, I should have mentioned this yesterday, but I still see some places in gentree and lsrabuild where we still use OperIsCompare(). Do they also need fixup?

The lsrabuild one needed fixing up, and that is done now.

The others don't need changing - they are simply me reverting from OperIsCompare() || OperIsConditionalCompare() back to OperIsCompare().

kunalspathak · 2022-08-09T15:49:44Z

We don't have to do it in this PR, but wondering is the CCMP <immediate> variant not supported today?

Below I see that we do mov w1, #55 before using it in ccmp.

a74nh · 2022-08-09T15:57:54Z

> We don't have to do it in this PR, but wondering is the CCMP <immediate> variant not supported today?

> Below I see that we do mov w1, #55 before using it in ccmp.

We are using the immediate version of ccmp, but it only has 5 bits of space for the value.

That's different to the immediate version of cmp, which has 12 bits plus an optional shift.

(This is why after containing a compare we have to redo the containing of its children)

You'll see quite a few places where an immediate will fit into cmp but not into ccmp.
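
The difference in immediate range can be sketched as two encodability checks. `FitsCcmpImmediate` and `FitsCmpImmediate` are hypothetical helpers written for this illustration, not the JIT's own routines, and they only model non-negative values (negative constants, which can sometimes be handled with the negated compare forms, are deliberately left out).

```cpp
#include <cassert>
#include <cstdint>

// ccmp (immediate): 5-bit unsigned immediate, so 0..31.
inline bool FitsCcmpImmediate(int64_t v) { return v >= 0 && v <= 31; }

// cmp (immediate): 12-bit unsigned immediate, optionally shifted left by 12.
inline bool FitsCmpImmediate(int64_t v) {
    if (v < 0) return false;       // negative values are not modelled here
    if (v <= 0xFFF) return true;   // plain 12-bit immediate
    return (v & 0xFFF) == 0 && (v >> 12) <= 0xFFF; // 12 bits shifted by 12
}
```

Under these checks the constant 55 from the question above fits `cmp` but not `ccmp`, which is why it ends up in a register first.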

kunalspathak · 2022-08-09T21:19:35Z

Thank you @a74nh for your contribution and thank you @jakobbotsch for the thorough review.

kunalspathak

LGTM

ghost added the community-contribution label Jul 6, 2022

dotnet-issue-labeler bot added the area-CodeGen-coreclr label Jul 6, 2022

a74nh mentioned this pull request Jul 6, 2022

Add Conditional nodes and Arm64 code generation #71616

Merged

runfoapp bot mentioned this pull request Jul 6, 2022

system.text.regularexpressions.tests Failing on ARM64 linux #71722

Closed

jakobbotsch reviewed Jul 6, 2022

src/coreclr/jit/gentree.cpp (outdated)

jakobbotsch reviewed Jul 6, 2022

src/coreclr/jit/liveness.cpp (outdated)

SingleAccretion reviewed Jul 6, 2022

src/coreclr/jit/compiler.h (outdated)

a74nh force-pushed the github_a74nh_csel_lower branch from 2fbd8e5 to 2b34f75 Compare July 14, 2022 10:17

JulieLeeMSFT assigned a74nh and kunalspathak Jul 14, 2022

jakobbotsch reviewed Jul 15, 2022

src/coreclr/jit/codegenarm64.cpp (outdated)

runfoapp bot mentioned this pull request Jul 19, 2022

system.net.quic.functional.tests failing with stack buffer overflow #72429

Closed