Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: fix bug where a gc struct is not zero initialized #67825

Merged

Conversation

AndyAyersMS
Copy link
Member

@AndyAyersMS AndyAyersMS commented Apr 10, 2022

This fixes an unexpected interaction between the zero-init optimization and
dead stores.

We have a local gc struct that is tracked but not promoted (and so on the
frame) with an explicit zero init.

The zero-init opt determines that in-prolog init is not needed because the
local is tracked and has a live explicit initializer. So it marks the local
as lvHasExplicitInit. But subsequent control flow optimizations end up
making the explicit zero init dead, and it is removed by dead stores.

Later on when we report GC info for the struct we report it as untracked. This
leads to the GC seeing an uninitialized stack slot as a GC ref.

The fix is to inhibit dead stores of zero initializers for lvHasExplicitInit
(restricted to GC locals with multiple references).

[some alternative fixes were considered, see notes in the PR].

Addresses #65694 (needs backporting for a full fix).

This fixes an unexpected interaction between the zero-init optimization and
dead stores.

We have a local gc struct that is tracked but not promoted (and so on the
frame) with an explicit zero init.

The zero-init opt determines that in-prolog init is not needed because the
local is tracked and has a live explicit initializer. So it marks the local
as `lvHasExplicitInit`. But subsequent control flow optimizations end up
making the explicit zero init dead, and it is removed by dead stores.

Later on when we report GC info for the struct we report it as untracked. This
leads to the GC seeing an uninitialized stack slot as a GC ref.

The fix is to inhibit dead stores of zero initializers for `lvHasExplicitInit`
(restricted to GC locals with multiple references).

[some alternative fixes were considered, see notes in the PR].

Fixes dotnet#65694.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 10, 2022
@ghost ghost assigned AndyAyersMS Apr 10, 2022
@ghost
Copy link

ghost commented Apr 10, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

This fixes an unexpected interaction between the zero-init optimization and
dead stores.

We have a local gc struct that is tracked but not promoted (and so on the
frame) with an explicit zero init.

The zero-init opt determines that in-prolog init is not needed because the
local is tracked and has a live explicit initializer. So it marks the local
as lvHasExplicitInit. But subsequent control flow optimizations end up
making the explicit zero init dead, and it is removed by dead stores.

Later on when we report GC info for the struct we report it as untracked. This
leads to the GC seeing an uninitialized stack slot as a GC ref.

The fix is to inhibit dead stores of zero initializers for lvHasExplicitInit
(restricted to GC locals with multiple references).

[some alternative fixes were considered, see notes in the PR].

Fixes #65694.

Author: AndyAyersMS
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Copy link
Member Author

we will want to back port this to .net 6.

@BruceForstall PTAL
cc @dotnet/jit-contrib (fyi @erozenfeld)

There are a few problematic aspects. First is that we can do untracked GC reporting for a tracked local. Second is that because of optimizations we can change liveness and so deductions/expectations based in earlier phases might end up getting broken.

In this case the problematic control flow comes from

            if (_d?.TryGetValue(k, out p) == true && (p.x == 33))

Initially the jit sees a path where p can be referenced without invoking TryGetValue. So the upstream zero init of p is live.

The control flow snippet below is from the original repro case. The call to TryGetValue is in BB97, and there's a path from the null check of d in BB94 down through BB95 and BB96 to the access of the struct field in BB101. This is the state of the flow graph when the zero-init opt runs.

But it turns out that taking the BB94->BB95 path will then result in always taking the BB95->BB102 path; the jit proves this during subsequent optimization phases and removes the path. As a result the upstream zero initializer becomes dead.

image

Some alternative fixes I considered:

  • Recognize that despite being tracked this local is going to be reported as untracked, and so treat as case (a) in zero init
  • Fix jit so it can report tracked structs on the frame as tracked in the GC info
  • (possibly) do dependent promotion for all tracked structs?
  • (possibly) don’t track unpromoted structs

Diffs with this change are minimal -- just one method has diffs in SPMI, and the case with diffs does not have the bug (instead it has a local struct with multiple dead references).

Likely the odd control flow setup here is somewhat rare.

Comment on lines +1993 to +1995
// If this is a zero init of an explicit zero init gc local
// that has at least one other reference, we will keep the zero init.
//
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe expand a bit on why? This just seems to be a restatement of the code below.

@AndyAyersMS
Copy link
Member Author

Arm64 failures: #67821 and one run that failed with no progress on DDARM64-046.

@erozenfeld
Copy link
Member

Have you considered a solution where you mark an explicit zero initialization with some sort of side effect to prevent its removal if zero-init optimization makes a decision based on the presence of the explicit zero int?

@erozenfeld
Copy link
Member

Or will that still lead to a problem if we report as untracked?

Copy link
Member

@erozenfeld erozenfeld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you have is a good solution.

@AndyAyersMS
Copy link
Member Author

Have you considered a solution where you mark an explicit zero initialization with some sort of side effect to prevent its removal if zero-init optimization makes a decision based on the presence of the explicit zero int?

Yeah, that would be more surgical, but I'm not confident I can find a suitable flag -- we seem to widely assume stores to locals can't have operator/lhs side effects.

I suppose I could also insert a fake use ala keep alive.

@BruceForstall
Copy link
Member

The fix is to inhibit dead stores of zero initializers for lvHasExplicitInit
(restricted to GC locals with multiple references).

Since the JIT decided to report the struct as untracked (it would useful to understand under what conditions this happens), doesn't that invalidate the zeroing optimization assumptions that allowed it to use the in-body zeroing in the first place? I.e., even if you leave the dead in-body zeroing, isn't there a range between the end of the prolog and the in-body zeroing where the struct local is uninitialized and GC could occur? I guess optRemoveRedundantZeroInits checks for GC safe points, but I thought offset zero (immediately after the prolog) was always a GC safe point?

@AndyAyersMS
Copy link
Member Author

Since the JIT decided to report the struct as untracked (it would useful to understand under what conditions this happens),

if (varTypeIsGC(varDsc->TypeGet()))
{
// Do we have an argument or local variable?
if (!varDsc->lvIsParam)
{
// If is is pinned, it must be an untracked local.
assert(!varDsc->lvPinned || !varDsc->lvTracked);
if (varDsc->lvTracked || !varDsc->lvOnFrame)
{
continue;
}
}
else
{
// Stack-passed arguments which are not enregistered
// are always reported in this "untracked stack
// pointers" section of the GC info even if lvTracked==true
// Has this argument been fully enregistered?
CLANG_FORMAT_COMMENT_ANCHOR;
if (!varDsc->lvOnFrame)
{
// If a CEE_JMP has been used, then we need to report all the arguments
// even if they are enregistered, since we will be using this value
// in a JMP call. Note that this is subtle as we require that
// argument offsets are always fixed up properly even if lvRegister
// is set.
if (!compiler->compJmpOpUsed)
{
continue;
}
}
else
{
if (varDsc->lvIsRegArg && varDsc->lvTracked)
{
// If this register-passed arg is tracked, then
// it has been allocated space near the other
// pointer variables and we have accurate life-
// time info. It will be reported with
// gcVarPtrList in the "tracked-pointer" section.
continue;
}
}
}
// If we haven't continued to the next variable, we should report this as an untracked local.
CLANG_FORMAT_COMMENT_ANCHOR;
GcSlotFlags flags = GC_SLOT_UNTRACKED;
if (varDsc->TypeGet() == TYP_BYREF)
{
// Or in byref_OFFSET_FLAG for 'byref' pointer tracking
flags = (GcSlotFlags)(flags | GC_SLOT_INTERIOR);
}
if (varDsc->lvPinned)
{
// Or in pinned_OFFSET_FLAG for 'pinned' pointer tracking
flags = (GcSlotFlags)(flags | GC_SLOT_PINNED);
}
GcStackSlotBase stackSlotBase = GC_SP_REL;
if (varDsc->lvFramePointerBased)
{
stackSlotBase = GC_FRAMEREG_REL;
}
if (noTrackedGCSlots)
{
// No need to hash/lookup untracked GC refs; just grab a new Slot Id.
if (mode == MAKE_REG_PTR_MODE_ASSIGN_SLOTS)
{
gcInfoEncoderWithLog->GetStackSlotId(varDsc->GetStackOffset(), flags, stackSlotBase);
}
}
else
{
StackSlotIdKey sskey(varDsc->GetStackOffset(), (stackSlotBase == GC_FRAMEREG_REL), flags);
GcSlotId varSlotId;
if (mode == MAKE_REG_PTR_MODE_ASSIGN_SLOTS)
{
if (!m_stackSlotMap->Lookup(sskey, &varSlotId))
{
varSlotId =
gcInfoEncoderWithLog->GetStackSlotId(varDsc->GetStackOffset(), flags, stackSlotBase);
m_stackSlotMap->Set(sskey, varSlotId);
}
}
}
}
// If this is a TYP_STRUCT, handle its GC pointers.
// Note that the enregisterable struct types cannot have GC pointers in them.
if ((varDsc->TypeGet() == TYP_STRUCT) && varDsc->lvOnFrame && (varDsc->lvExactSize >= TARGET_POINTER_SIZE))
{
ClassLayout* layout = varDsc->GetLayout();
unsigned slots = layout->GetSlotCount();
for (unsigned i = 0; i < slots; i++)
{
if (!layout->IsGCPtr(i))
{
continue;
}
int offset = varDsc->GetStackOffset() + i * TARGET_POINTER_SIZE;
#if DOUBLE_ALIGN
// For genDoubleAlign(), locals are addressed relative to ESP and
// arguments are addressed relative to EBP.
if (compiler->genDoubleAlign() && varDsc->lvIsParam && !varDsc->lvIsRegArg)
offset += compiler->codeGen->genTotalFrameSize();
#endif
GcSlotFlags flags = GC_SLOT_UNTRACKED;
if (layout->GetGCPtrType(i) == TYP_BYREF)
{
flags = (GcSlotFlags)(flags | GC_SLOT_INTERIOR);
}
GcStackSlotBase stackSlotBase = GC_SP_REL;
if (varDsc->lvFramePointerBased)
{
stackSlotBase = GC_FRAMEREG_REL;
}
StackSlotIdKey sskey(offset, (stackSlotBase == GC_FRAMEREG_REL), flags);
GcSlotId varSlotId;
if (mode == MAKE_REG_PTR_MODE_ASSIGN_SLOTS)
{
if (!m_stackSlotMap->Lookup(sskey, &varSlotId))
{
varSlotId = gcInfoEncoderWithLog->GetStackSlotId(offset, flags, stackSlotBase);
m_stackSlotMap->Set(sskey, varSlotId);
}
}
}
}
}

Here we have TYP_STRUCT, which is not a gc type, so we end up at line 4217 and report the GC fields of the struct as untracked.

I thought offset zero (immediately after the prolog) was always a GC safe point?

Not sure. Empirically that doesn't seem to be the case for partially interruptible methods (if so, the repro above should fail under GCStress=3, but it passes).

The zero init opt "remove prolog" flavor won't kick in for fully interruptible methods.

@AndyAyersMS
Copy link
Member Author

I thought offset zero (immediately after the prolog) was always a GC safe point?

I asked both Jans about this and neither of them thought offset 0 was special either.

const LclVarDsc& varDsc = lvaTable[node->AsLclVarCommon()->GetLclNum()];
const bool isExplicitInitLocal = varDsc.lvHasExplicitInit;
const bool isReferencedLocal = varDsc.lvRefCnt() > 1;
const bool isZeroInit = store->OperIsInitBlkOp();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you also need to check the init value is actually zero if you only want to avoid deleting only zero inits? E.g., check IsConstInitVal and then the constant init value?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this problem occur for normal untracked ref/byref lclVar and not just for struct field gc refs reported as untracked to the gc?

i.e., should we stop marking all gc vars or structs with gc fields as explicit init in optRemoveRedundantZeroInits?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you also need to check the init value is actually zero

I suppose so, but this kicks in so rarely it won't matter in practice.

Couldn't this problem occur for normal untracked ref/byref lclVa

We won't have liveness info for these so won't dead store.

data->SetUnusedValue();

if (data->isIndir())
if (isExplicitInitLocal && isReferencedLocal && isZeroInit && isGCInit)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't know if this is the first dead store. E.g., what if the lclVar has multiple zeroing that are dead stores? All of them will be kept even though (probably) only the first was the "explicit init" one as determined by optRemoveRedundantZeroInits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, if there are multiple dead stores we will keep them all.

@AndyAyersMS
Copy link
Member Author

@erozenfeld were you going to take another look?

@erozenfeld
Copy link
Member

@erozenfeld were you going to take another look?

Yeah, I'll take another look tomorrow.

Copy link
Member

@erozenfeld erozenfeld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this again and I think this is the right fix.

@AndyAyersMS
Copy link
Member Author

Thanks @erozenfeld.

@AndyAyersMS AndyAyersMS merged commit b13513a into dotnet:main Apr 14, 2022
@AndyAyersMS
Copy link
Member Author

/backport to release/6.0

@github-actions
Copy link
Contributor

Started backporting to release/6.0: https://github.com/dotnet/runtime/actions/runs/2169354977

@github-actions
Copy link
Contributor

@AndyAyersMS backporting to release/6.0 failed, the patch most likely resulted in conflicts:

$ git am --3way --ignore-whitespace --keep-non-patch changes.patch

Applying: JIT: fix bug where a gc struct is not zero initialized
.git/rebase-apply/patch:131: trailing whitespace.
    public static void F() 
.git/rebase-apply/patch:164: new blank line at EOF.
+
warning: 2 lines add whitespace errors.
Using index info to reconstruct a base tree...
M	src/coreclr/jit/liveness.cpp
M	src/coreclr/jit/optimizer.cpp
Falling back to patching base and 3-way merge...
Auto-merging src/coreclr/jit/optimizer.cpp
CONFLICT (content): Merge conflict in src/coreclr/jit/optimizer.cpp
Auto-merging src/coreclr/jit/liveness.cpp
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 JIT: fix bug where a gc struct is not zero initialized
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this pull request Apr 14, 2022
carlossanlop pushed a commit that referenced this pull request May 4, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants