Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Lift remaining cmov restrictions by introducing GT_SELECTCC #82235

Merged
merged 9 commits into from Feb 23, 2023

Conversation

jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented Feb 16, 2023

This introduces GT_SELECTCC and unifies its handling with GT_JCC. We no longer use containment for GT_SELECT conditions in the xarch backend.

Additionally teaches liveness DCE about GT_SETCC and GT_SELECTCC by allowing it to remove GTF_SET_FLAGS from the previous node when they are unused.

Minor diffs expected; the additional cases are really not that common. The main benefit is that GT_SELECT is now fully on par with GT_JTRUE, and does not have any odd limitations. The code to handle conditions is completely shared.

cc @a74nh

Example:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test(int x, int y)
{
    return (x & (1 << y)) != 0 ? 2 : 4;
}

Before:

G_M25862_IG02:              ;; offset=0000H
       0FA3D1               bt       ecx, edx
       7206                 jb       SHORT G_M25862_IG05
                                                ;; size=5 bbWeight=1 PerfScore 1.50
G_M25862_IG03:              ;; offset=0005H
       B804000000           mov      eax, 4
                                                ;; size=5 bbWeight=0.50 PerfScore 0.12
G_M25862_IG04:              ;; offset=000AH
       C3                   ret
                                                ;; size=1 bbWeight=0.50 PerfScore 0.50
G_M25862_IG05:              ;; offset=000BH
       B802000000           mov      eax, 2
                                                ;; size=5 bbWeight=0.50 PerfScore 0.12
G_M25862_IG06:              ;; offset=0010H
       C3                   ret
                                                ;; size=1 bbWeight=0.50 PerfScore 0.50

After:

G_M25862_IG02:              ;; offset=0000H
       B802000000           mov      eax, 2
       41B804000000         mov      r8d, 4
       0FA3D1               bt       ecx, edx
       410F43C0             cmovae   eax, r8d
                                                ;; size=18 bbWeight=1 PerfScore 1.25
G_M25862_IG03:              ;; offset=0012H
       C3                   ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test(int x, int y)
{
    return (x + y) != 0 ? 2 : 4;
}

Before:

G_M25862_IG02:              ;; offset=0000H
       03D1                 add      edx, ecx
       7506                 jne      SHORT G_M25862_IG05
                                                ;; size=4 bbWeight=1 PerfScore 1.25
G_M25862_IG03:              ;; offset=0004H
       B804000000           mov      eax, 4
                                                ;; size=5 bbWeight=0.50 PerfScore 0.12
G_M25862_IG04:              ;; offset=0009H
       C3                   ret
                                                ;; size=1 bbWeight=0.50 PerfScore 0.50
G_M25862_IG05:              ;; offset=000AH
       B802000000           mov      eax, 2
                                                ;; size=5 bbWeight=0.50 PerfScore 0.12
G_M25862_IG06:              ;; offset=000FH
       C3                   ret
                                                ;; size=1 bbWeight=0.50 PerfScore 0.50

After:

G_M25862_IG02:              ;; offset=0000H
       B802000000           mov      eax, 2
       41B804000000         mov      r8d, 4
       03D1                 add      edx, ecx
       410F44C0             cmove    eax, r8d
                                                ;; size=17 bbWeight=1 PerfScore 1.00
G_M25862_IG03:              ;; offset=0011H
       C3                   ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00

A cool diff for BitOperations.IsPow2:

 G_M47362_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
-       blsr     eax, ecx
+       xor      eax, eax
        xor      edx, edx
        test     ecx, ecx
-       setg     cl
-       movzx    rcx, cl
-       test     eax, eax
-       mov      eax, ecx
-       cmovne   eax, edx
-						;; size=22 bbWeight=1 PerfScore 3.00
+       setg     dl
+       blsr     ecx, ecx
+       cmove    eax, edx
+						;; size=17 bbWeight=1 PerfScore 2.50
 G_M47362_IG03:        ; bbWeight=1, epilog, nogc, extend
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
 
-; Total bytes of code 23, prolog size 0, PerfScore 6.30, instruction count 9, allocated bytes for code 23 (MethodHash=8f5346fd) for method System.Numerics.BitOperations:IsPow2(int):bool
+; Total bytes of code 18, prolog size 0, PerfScore 5.30, instruction count 7, allocated bytes for code 18 (MethodHash=8f5346fd) for method System.Numerics.BitOperations:IsPow2(int):bool

Example LIR diff:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test(int x, int y)
{
    return x > 10 ? 2 : 4;
}

Base:

N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  1) [000001] -c---------                    t1 =    CNS_INT   int    10 $42
                                                            ┌──▌  t0     int    
                                                            ├──▌  t1     int    
N003 (  3,  3) [000002] Jc-----N---                    t2 =GT        int    $100
N004 (  1,  1) [000004] -----------                    t4 =    CNS_INT   int    2 $43
N005 (  1,  1) [000006] -----------                    t6 =    CNS_INT   int    4 $44
                                                            ┌──▌  t2     int    
                                                            ├──▌  t4     int    
                                                            ├──▌  t6     int    
N006 (  6,  6) [000008] -----------                    t8 =SELECT    int   
                                                            ┌──▌  t8     int    
N007 (  7,  7) [000007] -----------RETURN    int    $VN.Void

Diff:

N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  1) [000001] -c---------                    t1 =    CNS_INT   int    10 $42
N004 (  1,  1) [000004] -----------                    t4 =    CNS_INT   int    2 $43
N005 (  1,  1) [000006] -----------                    t6 =    CNS_INT   int    4 $44
                                                            ┌──▌  t0     int    
                                                            ├──▌  t1     int    
N003 (  3,  3) [000002] -------N---CMP       void  
                                                            ┌──▌  t4     int    
                                                            ├──▌  t6     int    
N006 (  6,  6) [000008] -----------                    t8 =SELECTCC  int    cond=SGT
                                                            ┌──▌  t8     int    
N007 (  7,  7) [000007] -----------RETURN    int    $VN.Void

This introduces GT_SELECTCC and unifies its handling with GT_JCC. We no
longer use containment for GT_SELECT conditions in the xarch backend.

Additionally teaches liveness DCE about GT_SETCC and GT_SELECTCC by
allowing it to remove GTF_SET_FLAGS from the previous node when they are
unused.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 16, 2023
@ghost ghost assigned jakobbotsch Feb 16, 2023
@ghost
Copy link

ghost commented Feb 16, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

This introduces GT_SELECTCC and unifies its handling with GT_JCC. We no longer use containment for GT_SELECT conditions in the xarch backend.

Additionally teaches liveness DCE about GT_SETCC and GT_SELECTCC by allowing it to remove GTF_SET_FLAGS from the previous node when they are unused.

Minor diffs expected; the additional cases are really not that common. The main benefit is that GT_SELECT is not fully on par with GT_JTRUE, and does not have any odd limitations. The code to handle conditions is completely shared.

cc @a74nh

Author: jakobbotsch
Assignees: jakobbotsch
Labels:

area-CodeGen-coreclr

Milestone: -

@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@jakobbotsch
Copy link
Member Author

Failures look like #81688, #81901, #82252.

Diffs. As expected not very impactful (the cases I highlighted are not that common), but it does unify GT_SELECT with conditional jumps which is nice on its own. I also think this should simplify arm64 significantly in the future.

There is a small TP impact, but it is mostly related to IsInvariantInRange learning that we can never move nodes that consume CPU flags. That's sort of a necessary evil (the function just returned the wrong answer before).

cc @dotnet/jit-contrib PTAL @BruceForstall

@@ -8536,6 +8544,24 @@ struct GenTreeCC final : public GenTree
#endif // DEBUGGABLE_GENTREE
};

// Represents a node with two operands and a condition.
struct GenTreeOpCC final : public GenTreeOp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any difference between a GenTreeOpCC and a GenTreeConditional?

If not, I assume it's here just to help with categorising nodes.
Could GenTreeOpCC be a subclass of GenTreeConditional (or vice versa?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- GenTreeOpCC is a binary node with a condition code in it. As such, it only makes sense in the backend with explicit ordering where the flags definition can happen right before it. There is no explicit link to the flags definition (at least until #82355, though this is just an experiment for now).
So it's sort of a lowered version of GenTreeConditional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't spot this at first but GenTreeOpCC does not have a pointer to the genTree for the condition, except via gtPrev. I see that GenTreeCC works in the same way. That feels odd at first, but this is LIR, and it's the prev/next that matter. Ok, I'm happy with how that works now.

Thinking about the CCMP nodes, they would also use GenTreeOpCC. We'd need a CCMP node plus all 6 conditions (CCMP_EQ, CCMP_NE, CCMP_LT, etc).

Coming from HIR we have:

JTRUE(NE(AND(LT, GT), 0))

OptimizeConstCompare() will lower to a TEST_EQ:

JTRUE(TEST_EQ(LT, GT)))

Then lowerCompare() will create a CCMP and CMP:

CMP
JTRUE(CCMP_LT[GT])

Finally LowerJtrue() will create the JCC:

CMP
CCMP[GT]
JCC[LT]

Which means I need to wait for this PR before I can continue with the CCMP work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me. Feel free to lift any of the code from this PR if it helps.

We'd need a CCMP node plus all 6 conditions (CCMP_EQ, CCMP_NE, CCMP_LT, etc).

Note that this PR is teaching TryLowerConditionToFlagsNode about GT_SETCC nodes. I think I had a note on it somewhere else, but I wouldn't expect the 6 combinations to be necessary due to that. I.e. we can represent CCMP_LT as CCMP + SETCC[LT]. The transformation done in this PR will automatically lower JTRUE(SETCC[LT]) into JCC[LT].
I think it's likely compare chains can end up being handled similarly to this, that is, something like AND(SETCC[cond], x) can be turned into some shape of CCMP + SETCC and then lowering will continue working on these successively (haven't worked out the details).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With that scheme:

Coming from HIR we have:

JTRUE(NE(AND(LT, GT), 0))

OptimizeConstCompare() will lower to a TEST_EQ:

JTRUE(TEST_EQ(LT, GT)))

Then lowerCompare() will create a CCMP, SETCC and CMP:

CMP
CCMP[GT]
JTRUE(SETCC[LT])

Finally LowerJtrue() will create the JCC:

CMP
CCMP[GT]
JCC[LT]

I'm happy with that setup.

It's not clear to me how much of the existing AND chains code will be left in codegen after this, but that can be checked later.

@@ -3400,10 +3369,138 @@ GenTree* Lowering::LowerSelect(GenTreeConditional* select)
}
}

#ifdef TARGET_XARCH
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X86 only - did you plan on doing this for the other architectures?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in a follow-up. As you can imagine it will be more involved to do for the compare chains. We might also need a new node type due to the "flags" immediate that is part of ccmp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in a follow-up. As you can imagine it will be more involved to do for the compare chains. We might also need a new node type due to the "flags" immediate that is part of ccmp.

Did you still plan on doing both SELECTCC for arm64 and compare chains to CCMP in a single PR?

I'm rebasing my AND chains, and it's going to conflict heavily if your doing compare chains soon. I'm happy to hold off again if it's all in your plan.

Also, for CCMP nodes the cflags add a little big of complexity. Consider:

cmp     w0, #0
ccmp    w1, #0, nzc, eq
cset    x0, gt

The cflagsnzc is created from the gt (inverting it then turning into flags). So that means the CCMP node needs to hold both the condition taken from the previous compare (eq) and it's own condition now held by the cset (gt). Either CCMP would have to have two gtCondition fields, or we'd have to have CCMP_GT etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you still plan on doing both SELECTCC for arm64 and compare chains to CCMP in a single PR?

Hmm, I suppose it will be necessary to change both at the same time. Let me prioritize working on this to unblock you and to avoid unnecessarily churning things multiple times.

The cflagsnzc is created from the gt (inverting it then turning into flags). So that means the CCMP node needs to hold both the condition taken from the previous compare (eq) and it's own condition now held by the cset (gt).

Right, I think we can add a GenTreeCCMP node that has the flags immediate too. Or we could encode it as part of gtFlags. Or we could do what you said and have separate node types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I suppose it will be necessary to change both at the same time. Let me prioritize working on this to unblock you and to avoid unnecessarily churning things multiple times.

Ta!

Right, I think we can add a GenTreeCCMP node that has the flags immediate too. Or we could encode it as part of gtFlags. Or we could do what you said and have separate node types.

gtFlags feels bad because most of the flag space is in use, but I'm happy with any of the above.

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

src/coreclr/jit/codegenxarch.cpp Show resolved Hide resolved
@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr superpmi-diffs, runtime-coreclr superpmi-replay

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch
Copy link
Member Author

Failure is #82397 and known timeouts

@jakobbotsch jakobbotsch merged commit 3bc4f0e into dotnet:main Feb 23, 2023
@jakobbotsch jakobbotsch deleted the selectcc branch February 23, 2023 13:13
GTNODE(BT , GenTreeOp ,0,(GTK_BINOP|GTK_NOVALUE|DBK_NOTHIR))
#endif
// Variant of SELECT that reuses flags computed by a previous node with the specified condition.
GTNODE(SELECTCC , GenTreeCC ,0,GTK_BINOP|DBK_NOTHIR)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should SELECTCC here be a GenTreeOpCC ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should. Doesn't look like it has any impact since we only use this for some node measuring that can be turned on, but I will keep in mind to fix it in a follow-up.

@dotnet dotnet locked as resolved and limited conversation to collaborators Mar 26, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants