Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug in frame pointer chaining codegen for frames with localloc. #4651

Closed
LLITCHEV opened this issue Nov 5, 2015 · 21 comments
Closed
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI bug
Milestone

Comments

@LLITCHEV
Copy link
Contributor

LLITCHEV commented Nov 5, 2015

Fix a bug in frame pointer chaining codegen (prolog code generation) for frames with localloc for AMD64 and System V AMD64.

@LLITCHEV
Copy link
Contributor Author

LLITCHEV commented Nov 6, 2015

Just to clarify. The issue happens only on frames that have localloc or very largs local var count. It works for the rest. It potentially affects debuggers and profilers that use frame chaining to unwind the stack. In the VM for GC and exception handling we use the PData/XData stype unwind info for jitted frames and that works correctly.

@gkhanna79
Copy link
Member

@tzwlai - Can you please confirm if this affects debugger on Mac/Linux?

@LLITCHEV
Copy link
Contributor Author

LLITCHEV commented Nov 8, 2015

I have talked to Mike on Friday. This will affect the non SOS debugger on Linux/Mac. The SOS debugger (as the GC unwinding and EH) is using the PDatas/XData unwind info for managed frames and that behaves correctly. The GDB/LLDB bt command is affected by this since it doesn't use the PData/XData (Windows) unwind debug info. @tzwlai please correct me if I am wrong.

@brianrob
Copy link
Member

FYI, this will affect performance tracing on Linux and Mac.

@BruceForstall
Copy link
Member

@brianrob: is this because performance tracing on Linux/Mac x64 uses RBP chains exclusively (and not the Windows AMD64 unwind codes that we still emit) for stack traces?

@brianrob
Copy link
Member

@BruceForstall, that's correct. Linux does not know anything about Windows AMD64 unwind info.

@janvorli
Copy link
Member

@brianrob if it would be possible to plug in our own stack walker, then we might be able to use the copy of the windows unwinder in CoreCLR that we have there to be able to unwind stack for exception handling and GC purposes.

@brianrob
Copy link
Member

@janvorli, unfortunately, there is no extension point in perf for this. There actually aren't really any extension points to my knowledge. It's something that we could consider building, but we'd likely need to engage the maintainers early - I'm not sure if there are others that would want this as well.

@BruceForstall
Copy link
Member

This is a discussion of the problem and proposal for a fix.

The problem is that the JIT for non-Windows, x64, in functions with localloc, uses the RBP register as a frame pointer, but does not point RBP and the saved version of the caller's RBP. Thus, the "RBP frame chain" is broken, which confuses non-Windows x64 tools such as LLDB/GDB which depend on accurate RBP chains. In functions without localloc, RBP is either unused, or it points at the caller's saved RBP. If it is unused, that function simply doesn't appear in the frame chain, but the frame chain isn't "corrupted". On Windows x64, the RBP register can be freely used as an extra register (this is controlled by the ETW_EBP_FRAMED define, which is currently set to zero, thus enabling the free use of RBP).

Windows x64 itself does not depend on frame chaining. Instead, it uses the Windows x64 defined unwind codes for implementing stack walking/unwinding. These are defined in the Windows x64 software conventions, here: https://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx. Many non-Windows x64 .NET features also use these unwind codes for stack walks, such as the SOS debugger extension, and the .NET runtime itself.

One goal of a fix is to support both the Windows x64 unwind codes as well as accurate, uncorrupted RBP chains, at the same time, in functions with localloc. The fix would ideally apply to all x64 platforms, to avoid the need for per-platform special-case code, and allow us to leverage our extensive existing Windows x64 testing assets.

There are some important restrictions to consider with the Windows x64 unwind codes:
1. At the time when a frame pointer is established, the frame pointer can be set to point no more than 240 bytes away from the current stack pointer. This is the reason why the current code, in functions with localloc, sets the offset of RBP from RSP to the minimum of 240 and the outgoing argument space (see CodeGenInterface::genSPtoFPdelta() in CodeGenXArch.cpp). For most functions, this means RBP will point immediately above the outgoing argument space. In functions that call functions taking a very large number of arguments, RBP is limited to pointing no more than 240 bytes into the outgoing argument space. We could instead point RBP directly at RSP (such as is done for Edit & Continue), but it seemed better to point it closer to the local variables area, even though we don’t do any kind of layout and frame pointer optimization to specifically minimize frame pointer offsets.
2. All offsets in the unwind codes, for saving registers using move instructions, such as we use to save XMM register state (on Windows, which has non-volatile XMM registers; System V conventions define no non-volatile XMM registers), are positive and based on the RSP value at the time the frame pointer is established. Thus, all non-volatile register save space (including for the PUSH instructions to save the integer registers in the frame) must occur before the frame pointer is established.
3. The stack pointer when the frame pointer is established becomes the Windows "establisher frame" pointer used during exception dispatch. This establisher frame pointer is passed as an argument to exception handling funclets, where it is used to find the PSPSym and, ultimately, the parent function frame pointer. Thus, the PSPSym needs to move. Currently, it lives immediately above the outgoing argument space, so as to have a fixed, small, constant offset from InitialSP. If we left it where it is, then funclets would need to allocate the same fixed-size frame as their parent functions, which would be wasteful, since we wouldn't use any of the allocated local variable or temporary space. The PSPSym needs to move to a location at or above the RSP at frame pointer establishment time.

Some implications of these restrictions:
1. For functions with large outgoing argument space, we need to do at least two stack pointer adjustments, such that we can establish the frame pointer within the 240 byte range after one, and then establish the outgoing argument space after another.
2. We can't use a traditional "push rbp; mov rbp, rsp" prolog prefix, since all callee-saved registers must be allocated space before establishing the frame pointer. In fact, the establishing the frame pointer will likely be the last unwind code reported.

Note that for localloc functions with small fixed-size frames (about 240 bytes or less), we can establish the frame pointer just as we do today, after the full fixed-size stack allocation. We can easily do this for all platforms.

Otherwise, for localloc functions with large fixed-size frames, the frame will be allocated as follows:

1. PUSH RBP ; Saved RBP
2. PUSH other callee-saved integer registers
3. SUB RSP, XXX ; space for saved XMM registers, PSPSym in functions with EH, and possible 8-byte alignment to keep the frame 16-byte aligned
4. ; Note that if we implement this on Windows, we have created exactly 240 bytes of offset if all callee-saved integer and XMM registers are saved, plus 8 bytes for alignment before XMM saves, 8 bytes of padding after XMM saves, and 8 bytes for the PSPSym. (We could potentially optionally put the PSPSym where the alignment between integer and XMM register space is inserted, if any.)
5. LEA RBP, [RSP + YYY] ; point RBP at the saved RBP; establish the frame chain.
6. SUB RSP, ZZZ ; establish the remainder of the frame. This instruction is NOT reported in Windows x64 unwind codes, as it  is not necessary for unwinding -- the unwinder first unwinds RSP to where it was when RBP was set, then it continues.

This introduces an extra SUB RSP, but more than that, it complicates the prolog logic for both main function and funclet prologs by moving various things around.

An option to this design is to establish RBP chains using the tradition "push RBP; mov RBP, RSP" prolog, but don't report that in the Windows x64 unwind codes, or use that for local variable addressing. Instead, reserve another callee-saved register in this situation, e.g. R15, and use that where we previously used RBP. Report R15 as the frame pointer in the Windows x64 unwind codes. This solution has the disadvantage of "burning" a register. On advantage is that it doesn't require changing the layout of the frame. Implementation-wise, it isn't clear which will require more pervasive changes.

Note that in funclets, RBP points where the RBP of the parent pointed. That means that during RBP-based stackwalks, the funclet frames, as well as any intermediary frames such as VM exception-handling frames, will not be visible. This is determined to be acceptable.

@BruceForstall
Copy link
Member

@janvorli @jkotas @LLITCHEV Comments?

@LLITCHEV
Copy link
Contributor Author

Thanks for writing this Bruce! It is very useful.
Couple of things that come to mind:

  1. IMHO burning a register is a bad thing in terms of CQ. I'd personally try to avoid this.
  2. The last paragraph states that using the RBP in funclets to point to the parent's frame is not a problem. As I recall talking to @janvorli and @JosephTremoulet there was a desire to change this. It prevents the debugger unwinder to show native frames that are calling the funclet (native frames between the funclet and the parent). As far as reading the code, it looks like right now we use at max 2 arg registers to call the funclet. Could we use, for example ARG_2 to point to where the parent RBP pointed?

@BruceForstall
Copy link
Member

I believe you're suggesting "burning" a register in the funclets only (but not the main function) to hold the "frame pointer", instead of using RBP to access locals. It needs to be a callee-saved register (if the funclet has a call); it's easiest just to designate one in advance. It might be hard to plumb through the JIT which "frame pointer register" is being used at a particular point.

@LLITCHEV
Copy link
Contributor Author

Yes. This is exactly what I meant above. Burning a register in the funclet might not be as bad for CQ since funclets are invoked for exception handling. The finally case might be a bit questionable, but I think it should be OK too.

@RussKeldorph
Copy link
Contributor

Agree with @LLITCHEV --the analysis is useful. I'm not sure burning a register is a big deal if it's only for localloc frames with uncommonly large fixed sizes, but if the complexity is the same, might as well implement the one that has better CQ, which I assume is the first ("two SUB") solution. Unless you want to investigate more to determine which will require more pervasive changes, I would go for the first. If you don't mind pointing out the affected code in question to save newbies like me time, that would be great but isn't required.

@BruceForstall
Copy link
Member

There's another restriction to consider: GC encoding. For non-x86 platforms, a local variable can be reported to the GC as an offset relative to SP, Caller-SP, or a frame register. The frame register can be specified as any non-volatile register; it doesn't need to be RBP. In the localloc case here, if we established RBP in the 2-step fashion, we could continue referencing locals with RBP and reporting them that way to the GC. If we burned an extra register to be used as a frame pointer for the purpose of locals access, we could report that to the GC. Or, we could choose to report locals as CallerSP-relative.

For x86 GC encoding, however, EBP is the only allowed frame pointer.

@CppStars
Copy link

Code quality issue aside about burning a register (real impact of which I've rarely seen in .NET code I write which is in a high-performance situation) there is something to be said about the difference this introduces in code-generation for Windows and non-Windows.

I write tools that work off EBP walking and it's really annoying that only on Windows x64 do I not have access to my tools, even for source code I compile because all Microsoft compilers choose to thrash it.

And an even stronger point, some of us would love to use CoreCLR x64 on Windows 7 (one of your supported platforms), where XPERF (the recommended Microsoft performance tool) does not work because you break the EBP chain.

Now I have a question, can the generation of code that does not trash the EBP be a command line option for the JIT? This is particularly useful to us performance analysts because when we need to diagnose something we can at least run it using this option and turn it back off it code quality is impacted. Using a 32-bit process is not going to cut it because you can imagine they maybe using 64-bit native dlls.

@BruceForstall
Copy link
Member

Yet another constraint: any JIT-generated code P/Invoke code that generates an InlineCallFrame on the function stack must use RBP as the frame pointer (this is x64 only). This is baked in to the VM, and it uses the saved RBP and RSP to seed further stack unwinding. It does not have the capability to use a different frame pointer register.

@BruceForstall
Copy link
Member

@CppStars: I agree that it can be convenient to always have RBP chains sometimes. For Windows x64, though, that seems unlikely to ever happen (obviously, native code compilers won't do this). I think the problem you've seen on Windows 7 with XPERF is not because of breaking the RBP chain, it's because prior to Windows 8, the OS didn't have a way to get the unwind info generated by the JIT, so a stack would get lost at the native/managed boundary.

@BruceForstall
Copy link
Member

General update: I went down the path of implementing the 2-SUB solution detailed above, and that proved to be very complex and invasive. I then decided to go with the "burn a register" solution. This was much cleaner, and there was lots of existing code for doing this, such as for ARM. However, I hit a dead end when I discovered the requirement mentioned above regarding InlineCallFrame. So, I'm back to investigating the original solution once again.

@jkotas
Copy link
Member

jkotas commented Mar 30, 2016

Yet another potential way to fix this:

On Windows, the unwind info encoding is set in stone - because of it follows Windows ABI. We cannot extend it.

It is not the case on Unix. We do have our own local copy of the Windows unwind info decoder for Unix, and thus can easily add a new unwind code to get around the restriction that is getting into way.

So a potential fix may look like:

  • Emit a new unwind code for the large RBP frames
  • Interpret the new unwind code in src\unwinder\amd64\unwinder_amd64.cpp

This should be pretty simple fix. It is Unix-specific codepath, but it is very simple one - it should not have a material impact on how different is the Windows and Unix codegen.

@BruceForstall
Copy link
Member

I love that idea. It should be simple to implement. One minor detail is that the UWOP_SET_FPREG code is weird in that frame pointer offset and register gets set in the UNWIND_INFO header instead of the unwind code, to provide a fast path at the beginning of VirtualUnwind() to not require parsing the unwind codes in case of unwinding from within the main body. Maybe for Linux we either always parse the unwind codes or find someplace else to store this. Presumably we can't change the UNWIND_INFO header type

BruceForstall referenced this issue in BruceForstall/coreclr Mar 31, 2016
The JIT will now always create RBP chains on Unix platforms.

To do this, the VM is extended with a new Unix-only AMD64 unwind code: UWOP_SET_FPREG_LARGE.
The existing unwind code which is used to establish a frame pointer, UWOP_SET_FPREG, requires
that the frame pointer, when established, be no more than 240 bytes offset from the stack pointer.
This doesn't work well for frames that use localloc. (Large frames are ok, because we don't
report the frame pointer in the unwind info except for in functions with localloc or EnC.)

The new code has a 32-bit range, scaled by 16. If used, UNWIND_INFO.FrameRegister
must be set to the frame pointer register, and UNWIND_INFO.FrameOffset must be set to 15
(its maximum value). This code is followed by two UNWIND_CODEs that are combined to form
its 32-bit offset. This offset is then scaled by 16. This result is used as the FP register
offset from SP at the time the frame pointer is established.
BruceForstall referenced this issue in BruceForstall/coreclr Apr 5, 2016
The JIT will now always create RBP chains on Unix platforms. This includes in functions
the use localloc.

To do this, the VM is extended with a new Unix-only AMD64 unwind code: UWOP_SET_FPREG_LARGE.
The existing unwind code which is used to establish a frame pointer, UWOP_SET_FPREG, requires
that the frame pointer, when established, be no more than 240 bytes offset from the stack pointer.
This doesn't work well for frames that use localloc. (Large frames without localloc are ok,
because we don't report the frame pointer in the unwind info except for in functions with
localloc or EnC.)

The new unwind code has a 32-bit range. If used, UNWIND_INFO.FrameRegister
must be set to the frame pointer register, and UNWIND_INFO.FrameOffset must be set to 15
(its maximum value). This code is followed by two UNWIND_CODEs that are combined to form
its 32-bit offset. The high 4 bits must be zero. This offset is then scaled by 16. This
result is used as the FP register offset from SP at the time the frame pointer is established.
@dotnet-bot dotnet-bot assigned mmitche and unassigned BruceForstall Apr 5, 2016
@mmitche mmitche assigned BruceForstall and unassigned mmitche Apr 5, 2016
BruceForstall referenced this issue in BruceForstall/coreclr Apr 5, 2016
The JIT will now always create RBP chains on Unix platforms. This includes in functions
the use localloc.

To do this, the VM is extended with a new Unix-only AMD64 unwind code: UWOP_SET_FPREG_LARGE.
The existing unwind code which is used to establish a frame pointer, UWOP_SET_FPREG, requires
that the frame pointer, when established, be no more than 240 bytes offset from the stack pointer.
This doesn't work well for frames that use localloc. (Large frames without localloc are ok,
because we don't report the frame pointer in the unwind info except for in functions with
localloc or EnC.)

The new unwind code has a 32-bit range. If used, UNWIND_INFO.FrameRegister
must be set to the frame pointer register, and UNWIND_INFO.FrameOffset must be set to 15
(its maximum value). This code is followed by two UNWIND_CODEs that are combined to form
its 32-bit offset. The high 4 bits must be zero. This offset is then scaled by 16. This
result is used as the FP register offset from SP at the time the frame pointer is established.
BruceForstall referenced this issue in dotnet/coreclr Apr 5, 2016
Fix #1977: always create RBP chains on Unix
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the 1.0.0-rtm milestone Jan 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 4, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI bug
Projects
None yet
Development

No branches or pull requests