Avoid VN bugs with struct reinterpretation. #57076

Closed
sandreenko wants to merge 5 commits

@sandreenko sandreenko commented Aug 9, 2021

What issue are we trying to solve?

VN optimizations use a unique aliasing model that depends on correct FieldSeqs, and at least 3 places do not respect them:

  1. C# Unsafe.As<struct1, struct2>(ref struct1).fieldOfStruct2 will be imported as FIELD(ADDR byref(*)), and the JIT does not keep the pointer type, so ADDR byref(LCL_VAR struct1) and ADDR byref(LCL_VAR struct2) look the same to us.
More details about this case

Later, if lclmorph sees FIELD fieldOfStruct2(ADDR byref(LCL_VAR struct1)), it will produce LCL_FLD struct1.fieldOfStruct2, and VN will do incorrect things with it. For example, suppose fieldOfStruct2 has offset 0 and type INT, fieldOfStruct1 has the same offset and the same type, and we have:

struct1 = init with 0;
LCL_FLD struct1.fieldOfStruct2 = 100;
if (LCL_FLD struct1.fieldOfStruct1 == 0) <- VN treats this as always true: it does not see that we changed the value at offset 0, because the store had a different FldSeq.

Question: why can't we catch and reject LCL_FLD creation when we see FIELD fieldOfStruct2(ADDR byref(LCL_VAR struct1)) and
the CORINFO_FIELD_HANDLE of the field does not belong to the CORINFO_CLASS_HANDLE of the struct?
Answer: because we don't have a JitEE method to check that a CORINFO_FIELD_HANDLE belongs to a CORINFO_CLASS_HANDLE. One can be added, but it is maybe too late for 6.0.
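
To make the failure mode in the first case concrete, here is a minimal sketch in C++. It is a toy model, not JIT code: `SymbolicLocal`, `PhysicalLocal` and their methods are hypothetical names invented for illustration. The symbolic model keys known values by field name (roughly what a FieldSeq-based lookup does), while the physical model keys them by offset and size.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical toy model, not the JIT's value numbering.
// Both classes track what is known about one zero-initialized local.

// Symbolic model: values are keyed by field name, like a FieldSeq-based lookup.
struct SymbolicLocal {
    std::map<std::string, int32_t> known;

    void Store(const std::string& field, int32_t value) { known[field] = value; }

    int32_t Load(const std::string& field) const {
        auto it = known.find(field);
        return it == known.end() ? 0 : it->second; // the local was initialized with 0
    }
};

// Physical model: values are keyed by (offset, size), so two fields that
// occupy the same bytes share one entry. Only exact ranges are modeled here.
struct PhysicalLocal {
    std::map<std::pair<unsigned, unsigned>, int32_t> known;

    void Store(unsigned offs, unsigned size, int32_t value) { known[{offs, size}] = value; }

    int32_t Load(unsigned offs, unsigned size) const {
        auto it = known.find({offs, size});
        return it == known.end() ? 0 : it->second;
    }
};
```

Storing 100 through "fieldOfStruct2" and then loading "fieldOfStruct1" returns the stale 0 in the symbolic model, which is the wrong "always true" comparison described above. The physical model returns 100 for the same offset and size.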

  2. For nested structs, lclmorph can replace an outer struct with a nested one if their sizes are the same. It could create ASG(LCL_VAR struct1, LCL_VAR struct2). For the rest of the JIT these are identical memory chunks.
More details about this case

This happens for structs like:

struct S1
{
  int s1Field1;
}

struct S2
{
  S1 s2Field1;
}

S1 s1 = s2.s2Field1;

lclmorph transforms this into

ASG(s1, s2)

losing the `s2Field1` field seq, because the JIT does not support LCL_FLD of struct type.

Question: why can't we see during VN that CORINFO_CLASS_HANDLE of LHS and RHS are different?
Answer: it is possible, but it would require changes in VN similar to what is done in this PR, and it won't solve the first issue. Also, there could be an issue with CORINFO_CLASS_HANDLE of Generic<Canon> != CORINFO_CLASS_HANDLE of Generic<Concrete>.

  3. Canon/non-canon accesses. With R2R we could have a tree like:
               [000256] -AC---------              *  ASG       byref
               [000255] D------N----              +--*  LCL_VAR   byref  V18 tmp14
               [000376] ----G-------              \--*  FIELD     byref  _value
               [000375] ------------                 \--*  ADDR      byref
               [000374] -------N----                    \--*  FIELD     struct _pointer
               [000373] ------------                       \--*  ADDR      byref
               [000372] -------N----                          \--*  LCL_VAR   struct<System.Span`1[System.__Canon], 16>(P) V25 tmp21
                                                              \--*    byref  V25._pointer (offs=0x00) -> V47 tmp43
                                                              \--*    int    V25._length (offs=0x08) -> V48 tmp44

That looks fine until we check [000374] -------N---- \--* FIELD struct _pointer and see that its class name is System.Span`1[System.TimeZoneInfo+AdjustmentRule], so if we use both FldSeqs for canon and non-canon classes, VN will make wrong transformations.

How does this PR solve it?

It marks GT_LCL_VAR nodes after such transformations as GTF_VAR_DONT_VN, and GT_LCL_FLD nodes are created with a NoField seq. Then we try to keep these values until VN and respect them there. If everything works correctly, we generate a unique VN for the LCL_VAR and we don't try to determine its value or the values of its fields.
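
The idea in the paragraph above can be sketched in a few lines of C++. This is a toy model under stated assumptions, not the JIT's ValueNumStore: `ToyVNStore`, `ToyLclVar`, `FreshOpaqueVN` and `VNForStore` are hypothetical names, and `dontVN` only stands in for GTF_VAR_DONT_VN.

```cpp
#include <cassert>

// Hypothetical toy model of the approach, not the JIT's ValueNumStore.
using ValueNum = unsigned;

struct ToyVNStore {
    ValueNum next = 100;

    // Returns a value number that no other expression shares.
    ValueNum FreshOpaqueVN() { return next++; }
};

struct ToyLclVar {
    bool dontVN = false; // stands in for GTF_VAR_DONT_VN
};

// VN recorded for a store to a local: propagate the RHS value number,
// unless the local was reinterpreted and must stay opaque to VN.
ValueNum VNForStore(ToyVNStore& store, const ToyLclVar& lcl, ValueNum rhsVN) {
    return lcl.dontVN ? store.FreshOpaqueVN() : rhsVN;
}
```

Because every flagged store gets a fresh number, later reads cannot be proven equal to the stored value or to each other, which trades optimization opportunities for safety.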

There are 3 potential issues with this approach:

  1. We don't set the value;
  2. It does not survive phases before VN;
  3. VN does not check it.

The first is low-risk, because the places where we do such things are well-known - inlining of Unsafe.As in import and MorphLocalIndir in lclmorph.

The second is worse, but from my analysis, we keep the flag because we copy it from old tree to a new one or because we don't change trees (for example, assertion propagation replaces LCL_NUM and keeps the flag intact).

The third is the biggest risk, in my opinion, because our understanding and control over VN is vague.

What were the alternatives?

  1. Disable the MorphLocalIndir transformation.

It causes pretty bad regressions, like

benchmarks.run.windows.x64.checked.mch
Total bytes of delta: 3913
55 total files with Code Size differences (0 improved, 55 regressed)

and does not solve the first issue.

  2. Use varDsc->lvOverlappingFields = true; for such structs, as was done before.

This approach does not work, as #42517 has shown. lvOverlappingFields in the JIT is 'type' information, so the JIT expects two LCL_VARs that have the same struct type to have the same value of lvOverlappingFields.
Because of that assumption, we don't check the flag during assertion prop, and we had cases like:

LCL_VAR1 struct1 = LCL_VAR2 struct1;
LCL_VAR3 struct2 = Unsafe.As<struct1, struct2>(ref LCL_VAR1 struct1); // set lvOverlappingFields on struct1.

and assertion propagation was replacing LCL_VAR1 with LCL_VAR2 and ended up with

ASG(LCL_VAR3 struct2, LCL_VAR2 struct1) 

where neither local has lvOverlappingFields set, so the flag on LCL_VAR1 is forgotten.

Also, the flag set by itself was not blocking struct promotion, so we often replaced local with the fields without that flag set.
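
The assertion propagation case above can be illustrated with a small C++ sketch. It is a toy model only: `ToyLocal` and `PropagateCopy` are hypothetical names, `structType` stands in for the struct's class handle, and `overlappingFields` stands in for lvOverlappingFields.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy model, not the JIT's assertion propagation.
struct ToyLocal {
    int  structType;        // stand-in for the struct's class handle
    bool overlappingFields; // stand-in for lvOverlappingFields
};

// Toy copy propagation: a use of local `from` is replaced with local `to`
// when their struct types match. Like the behaviour described above, it
// treats the flag as type information and never inspects it.
int PropagateCopy(const std::vector<ToyLocal>& locals, int from, int to) {
    return locals[from].structType == locals[to].structType ? to : from;
}
```

If local 0 carries the flag and local 1 is a plain copy of the same struct type, `PropagateCopy(locals, 0, 1)` returns 1, and the flag that was set on local 0 is no longer visible at the use.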

  3. Set lclAddressExposed in both cases. This will kill all optimizations and end up with awful code generated in scenarios where people use Unsafe to get performance.

  4. A hybrid approach, like rely on other phases not to maintain correct FldSeq (with a new JitEEInterface method) but force VN to check LHS CORINFO_CLASS_HANDLE == RHS CORINFO_CLASS_HANDLE for assignments.

This one could actually work; I need to revisit what it will take to add a JitEE method.

I am from the future, how do I fix it if I have time?

1. Change VN to a physical model, get rid of FldSeq;
or
2. Add JitEEInterface methods to check that FldSeq is correct during VN, like I was trying in #49504, and do all these checks inside VN so other phases don't depend on what is necessary only for VN.

and in addition:
3. Support LCL_FLD struct types;

Fixes

Fixes #42517, fixes #49954, fixes #54102, fixes #56980.

Diffs

The diffs are small, all regressions.

Diffs.
benchmarks.run.windows.x64
Total bytes of delta: 152
8 total files with Code Size differences (0 improved, 8 regressed)

libraries.crossgen2.windows.x64
Total bytes of delta: 26
4 total files with Code Size differences (0 improved, 4 regressed)

libraries.pmi.windows.x64
Total bytes of delta: 578
37 total files with Code Size differences (1 improved, 36 regressed)

libraries.pmi.Linux.arm64
Total bytes of delta: 620
33 total files with Code Size differences (0 improved, 33 regressed)

benchmarks.run.Linux.x64
Total bytes of delta: 121
6 total files with Code Size differences (0 improved, 6 regressed)

libraries.pmi.Linux.x64
Total bytes of delta: 694
33 total files with Code Size differences (0 improved, 33 regressed)

benchmarks.run.windows.arm64
Total bytes of delta: 104
6 total files with Code Size differences (0 improved, 6 regressed)

libraries.crossgen2.windows.arm64
Total bytes of delta: 4
1 total files with Code Size differences (0 improved, 1 regressed)

libraries.pmi.windows.arm64
Total bytes of delta: 572
33 total files with Code Size differences (0 improved, 33 regressed)

benchmarks.run.windows.x86
Total bytes of delta: 218
10 total files with Code Size differences (1 improved, 9 regressed)

libraries.crossgen2.windows.x86
Total bytes of delta: 2
1 total files with Code Size differences (0 improved, 1 regressed)

libraries.pmi.windows.x86
Total bytes of delta: 1007
66 total files with Code Size differences (1 improved, 65 regressed)

libraries.pmi.Linux.arm
Total bytes of delta: 504
33 total files with Code Size differences (0 improved, 33 regressed)
Regression analysis.

The regressions are in cases like:

               [000184] -ACXG-------              *  ASG       struct (copy)
               [000183] D------N----              +--*  LCL_VAR   struct<System.ReadOnlyMemory`1[System.Byte], 16>(P) V08 tmp1         
                                                  +--*    ref    V08._object (offs=0x00) -> V49 tmp42        
                                                  +--*    int    V08._index (offs=0x08) -> V50 tmp43        
                                                  +--*    int    V08._length (offs=0x0c) -> V51 tmp44        
               [000181] --CXG-------              \--*  OBJ       struct<System.ReadOnlyMemory`1[System.Byte], 16>
               [000189] ------------                 \--*  ADDR      byref 
               [000190] -------N----                    \--*  LCL_VAR   struct<System.Memory`1[System.Byte], 16>(P) V20 tmp13        
                                                        \--*    ref    V20._object (offs=0x00) -> V63 tmp56        
                                                        \--*    int    V20._index (offs=0x08) -> V64 tmp57        
                                                        \--*    int    V20._length (offs=0x0c) -> V65 tmp58

In the past we were setting V20.lvOverlappingFields = true, but we were forgetting it after the field-by-field copy block transformation. Now we keep the flag and it blocks VN.
Such a transformation is correct because we replace structs with primitive types (because there is no struct field support) and we check field types here:

// Both structs should be of the same type, or have the same number of fields of the same type.

So these regressions don't fix any issues, but avoiding them would require more changes, and it is RC1, so it is probably better to stay conservative.

@sandreenko sandreenko added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 9, 2021
@sandreenko sandreenko added this to the 6.0.0 milestone Aug 9, 2021
@sandreenko
Contributor Author

/azp list

@tannergooding
Member

Do we need a similar fix for hardware intrinsics (#35620)? CC. @briansull

@sandreenko
Contributor Author

/azp run runtime-coreclr libraries-jitstress, runtime-coreclr jitstress

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).


// Generated by Fuzzlyn v1.2 on 2021-08-06 19:31:41
// Run on .NET 6.0.0-dev on Arm Linux
// Seed: 5500224797583134883
Contributor

Did you hand modify this program?
And will it be generated using this Seed value?

Contributor Author

Yes, all Fuzzlyn repros are hand-modified because they usually don't return 100 on success.

S0 vr0 = default(S0);
long vr1 = vr0.F0++;
s_1[0, 0] = vr0;
if (s_1[0, 0].F3)
Contributor

Field F3 looks to me like it will always be false.

Contributor

The issue is that in a release build the condition is satisfied and the Console.WriteLine call runs (1 line is printed in release). The debug build is fine, as no line is printed.

@sandreenko
Contributor Author

/azp run runtime-coreclr libraries-jitstress, runtime-coreclr jitstress, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@sandreenko
Contributor Author

/azp run runtime-coreclr libraries-jitstress, runtime-coreclr jitstress, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@sandreenko
Contributor Author

/azp run runtime-coreclr libraries-jitstress, runtime-coreclr jitstress, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

ClassLayout::AreCompatible(structLayout, varDsc->GetLayout()))
{
indir->ChangeOper(GT_LCL_VAR);
indir->AsLclVar()->SetLclNum(val.LclNum());

if (structLayout->GetClassHandle() != varDsc->GetLayout()->GetClassHandle())
Contributor

Can you add a comment here about the class handle mismatch and how/why we handle it?

@@ -1065,6 +1087,14 @@ class LocalAddressVisitor final : public GenTreeVisitor<LocalAddressVisitor>

indir->gtFlags = flags;

if (cantDoVNOnTHisTree)
Contributor

Add a comment here as well

tree->gtVNPair = vnStore->VNPairApplySelectors(lclVNPair, lclFld->GetFieldSeq(), indType);
if (lclVNPair.GetLiberal() == ValueNumStore::NoVN)
{
// We didn't not assign a correct VN to the local, probably it was written using a different
Contributor Author

@briansull I am trying to pass the CI, could you please help me with it? The failures look like https://dev.azure.com/dnceng/public/_build/results?buildId=1289712&view=ms.vss-test-web.build-test-results-tab&runId=38073760&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab&resultId=107788
and occur on arm32 windows/linux only, with JitStress only, and only in one test. We assert when we try to ApplySelector from NoVN.
I have added this code for one case, but we have more failures. What I don't understand is why we don't hit these asserts without my changes. Do I understand correctly that:

  1. LCL_VAR can have a unique VN before the changes;
  2. other code parts expect it and do the right things with it;

If both are true, why do we hit these asserts, and what is the best way to fix it? Should I add similar checks, if (lclVNPair.GetLiberal() == ValueNumStore::NoVN), to all failing cases?

Contributor

I am trying to understand your issue.
It is difficult to answer your questions without a full understanding of what is going on here.

Contributor

@briansull briansull Aug 12, 2021

Generally it is an error to have a read with a ValueNumber of NoVN.

Check if line 7500 (above) applies here:

                        // There are a couple of cases where we haven't assigned a valid value number to 'lcl'
                        //
                        if (lcl->gtVNPair.GetLiberal() == ValueNumStore::NoVN)

Contributor Author

Yes, but I don't understand where we update lvaTable[lclNum].GetPerSsaData(lclDefSsaNum)->m_vnPair.SetBoth(lclVarVN); here.
And when I do lvaTable[lclNum].GetPerSsaData(lclDefSsaNum)->m_vnPair.SetBoth(ValueNumStore::NoVN); it starts to produce errors. So what value should I write to GetPerSsaData when the local has an invalid VN, and why don't we write it there in the example on line 7500?

Contributor

@sandreenko
I can look at a Jit Dump if you have one

Contributor

Perhaps you need to mark the local variable as unsafe to ValueNumber

Contributor

@briansull briansull Aug 13, 2021

Another fix would be to set

lvaTable[V09].GetPerSsaData() to $500

when processing [000365]

LCL_VAR long V15 tmp9 u:2 (last use) $500

Contributor Author

If I understand correctly, the VN that we print on the LHS of an assignment does not matter at all; the real value is saved in lvaTable[V09].GetPerSsaData().
At least that is how I understand this:

lcl->gtVNPair = ValueNumPair(); // Avoid confusion -- we don't set the VN of a lcl being defined.

but in our case it is not only a def, because it is also marked as a use:

N005 ( 13, 11) [000365] -A------R----             *  ASG       long   $VN.Void
N004 (  6,  5) [000363] *------N-----             +--*  IND       long   $500
N003 (  3,  3) [000321] -------------             |  \--*  ADDR      byref  Zero Fseq[_00(0xab45adc)] $345
N002 (  3,  2) [000322] $U------N-----            |     \--*  LCL_VAR   struct<System.Runtime.Intrinsics.Vector128`1[Double], 16> V09 tmp3         ud:2->3 $482 <- marked as use.
N001 (  3,  2) [000364] -------------             \--*  LCL_VAR   long   V15 tmp9         u:2 (last use) $500

but, obviously, we don't care about its value, we won't read it anywhere.

What we care about is the value of lvaTable[V09].GetPerSsaData(): we want it to be NoVN, and we want the next tree that uses it not to try to optimize it or "see" the values of any of the fields inside this lclVar.

Contributor

What you are saying above does sound correct to me. I haven't made many (or any) changes in this area (SSA and VN), so it is also my understanding from reading the code.

Contributor

@briansull briansull Aug 13, 2021

But with this approach you will have to check for this value at every read (use) site, because the mismatch can occur at either the def or the use. In the above case we have a mismatch at the def site, which I guess already writes a NoVN. Later, at the use site, we pull out the NoVN and try to use it in VNPairApplySelectors, and the use site won't have the GTF_VAR_DONT_VN flag set because there isn't a mismatch at the use site.
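
As a rough illustration of the use-site check discussed in this thread, here is a minimal C++ sketch. It is not the JIT's code: `GuardedVNStore`, `ApplySelector` and `VNForFieldRead` are hypothetical names, the selector arithmetic is arbitrary, and `NoVN` is only a stand-in for ValueNumStore::NoVN.

```cpp
#include <cassert>

// Hypothetical toy model, not the JIT's ValueNumStore.
using ValueNum = int;
constexpr ValueNum NoVN = -1; // stand-in for ValueNumStore::NoVN

struct GuardedVNStore {
    ValueNum next = 1000;

    // Returns a value number that no other expression shares.
    ValueNum FreshOpaqueVN() { return next++; }

    // Stand-in for applying a field selector to a known local VN.
    // The formula is arbitrary; only the precondition matters here.
    ValueNum ApplySelector(ValueNum lclVN, int fieldId) {
        assert(lclVN != NoVN); // mirrors the kind of assert that fired in CI
        return lclVN * 31 + fieldId;
    }
};

// Use-site guard: a field read from a local whose reaching def has NoVN
// gets an opaque VN instead of going through ApplySelector.
ValueNum VNForFieldRead(GuardedVNStore& store, ValueNum lclVN, int fieldId) {
    if (lclVN == NoVN) {
        return store.FreshOpaqueVN();
    }
    return store.ApplySelector(lclVN, fieldId);
}
```

The sketch shows why the guard has to live on the read path: the def site only records NoVN, and each consumer that would otherwise call the selector logic must notice it on its own.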

Comment on lines +7818 to +7829
ValueNumPair lhsVNPair;

if (!lcl->DontVN())
{
lhsVNPair = rhsVNPair;
}
else
{
ValueNum uniqVN = vnStore->VNForExpr(compCurBB, lcl->TypeGet());
lhsVNPair.SetBoth(uniqVN);
}

Contributor

It appears lhsVNPair is unused?

Contributor Author

Interesting, thanks for the catch. I need to investigate why it fixes one of the test cases that I was targeting with this change.

@sandreenko
Contributor Author

This JIT-only change is too risky, so I am closing this in favor of #57282, which has fewer VN changes.

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
5 participants