Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack probing in managed code for frames > 64kB doesn't work on Linux kernel >= 4.9.34 #16827

Closed
janvorli opened this issue Mar 8, 2018 · 17 comments

Comments

Projects
None yet
8 participants
@janvorli
Copy link
Member

commented Mar 8, 2018

I have spend some time during the last week tracing down an issue reported by @pavlexander (#16462) that he had when running his .NET app on Debian 9.3.
It manifested itself as a sigsegv when running a static constructor. And it was also intermittent, happening on some runs and not happening on others. And also not happening when running under a debugger. Until @pavlexander found that it is due to the fact that debuggers disable ASLR by default. After enabling it, it started to repro under the debugger too. But only on Debian 9.3, not e.g. on Ubuntu 14.04 or 16.04 etc.

This static constructor has enormous frame size, almost 128kB. So depending on where the ASLR set the initial RSP for the process, the used stack size either crossed the initial stack size of 128kB or not. And if it did, it has crashed during the stack probing. That was pretty strange since this was running on the primary thread and the maximum stack size was 8MB. So the kernel did not convert the fault to stack expansion on the affected system.

I’ve written a little testing app in C that basically does the same thing as the probing generated by the JIT. And on all my systems other than the Debian 9.3 (I even had Debian 9.0 and it was ok), it was working fine. On the Debian 9.3, it was crashing with sigsegv.

To make the story shorter, I’ve found that it is a kernel version dependent thing. By bisection and building and testing various kernel versions, I’ve found that it started to happen on kernel 4.9.34. Further bisection identified the linux kernel commit that changed that (torvalds/linux@cfc0eb4). It was a fix to a potential issue fixed by a rewrite of the guard page handling stuff. I spent some time reading and debugging the related portions of the kernel source and it turns out that basically, we were just lucky it has worked before this change. There is this comment close to the stack expansion invocation:

Accessing the stack below %sp is always a bug. The large cushion allows instructions like enter and pusha to work. ("enter $65535, $31" pushes 32 pointers and then decrements %sp by 65535.)

There is a check that tests if the fault address is farther than (65536 + 32 * 8) bytes from the stack pointer and if it is, it refuses to expand the stack. But before this check, there is another check testing if the fault address is inside of a range stored in a virtual memory descriptor for the stack. And that’s where the new and old kernels differ. The new kernel includes only the committed portion of the stack virtual memory range while the old one was including the guard page in that range too.
That means that on the old kernel, our probing has hit the guard page and since it was inside of the range described by the virtual memory descriptor for the stack, the test for the distance between SP and the failure address was skipped, the new page committed and the range expanded by another guard page.
On the new kernel, we hit the guard page, it is not part of the range for the stack anymore, so it checks the distance between the failure address and the RSP, finds that it is too large and so it passes sigsegv to our process.

I’ve tested even the latest Linux kernel 4.15 and this new behavior persists.

That means that in order to make stack probing work correctly on the new kernels too, we will need to modify the probing so that it moves the RSP as it probes (or at least once every 64kB). Btw, it looks like this check of the address to SP distance is there only for x86 / x64 and not for ARM / ARM64.

@janvorli janvorli added this to the 2.1.0 milestone Mar 8, 2018

@janvorli

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2018

cc: @jkotas, @sergiy-k

@BruceForstall , do you see any potential issues with such change from the JIT point of view?

@danmosemsft

This comment has been minimized.

Copy link
Member

commented Mar 8, 2018

Impressive investigation thanks for the educational writeup @janvorli.

@RussKeldorph

This comment has been minimized.

Copy link
Member

commented Mar 8, 2018

@BruceForstall

This comment has been minimized.

Copy link
Member

commented Mar 8, 2018

It seems like changing the JIT to loop with:

  1. change SP, then
  2. probe at SP

would be ok, though slightly more code. It seems like this should work on Windows as well. We can't change SP by more than one page size, or risk missing the guard page.

I wonder: what does Linux stack probing code look like (e.g., some Linux version of "chkstk")?

@janvorli

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2018

@BruceForstall:

what does Linux stack probing code look like (e.g., some Linux version of "chkstk")?

I am not aware of existence of anything like that on Unix.

would be ok, though slightly more code

It may actually be just one more instruction out of the loop. This is what we had in the failure I was investigating:

0x7fe4cc260083: testq (%rsp,%rax), %rax
0x7fe4cc260087: subq $0x1000, %rax ; imm = 0x1000
0x7fe4cc26008d: cmpq $-0x1d3c0, %rax ; imm = 0xFFFFFFFFFFFE2C40
0x7fe4cc260093: jge 0x7fe4cc260083

And I think the new version can look like this:

    lea rax, [rsp - 0x1d3c0]
loop:
    test [rsp], rax
    sub rsp, 0x1000
    cmp rsp, rax
    jge loop
@AndyAyersMS

This comment has been minimized.

Copy link
Member

commented Mar 8, 2018

I vaguely recall there was some reason why on windows we don't move RSP until the end of the probing sequence (or at least don't move it at the beginning). Something like RSP always has to point at a valid page. I could very well be wrong though.

Let me see if I can dig it up....

@jkotas

This comment has been minimized.

Copy link
Member

commented Mar 8, 2018

some Linux version of "chkstk"

You have to opt into it via -fstack-check gcc command line option.

For example, the code for method like this:

void foo()
{
    char b[10000];
    bar(b);
}

Looks like this (when compiled with gcc -fstack-check -O2):

        subq    $14152, %rsp
        orq     $0, 5928(%rsp)
        orq     $0, 1832(%rsp)
        orq     $0, (%rsp)
        addq    $4128, %rsp
...

-fstack-check option does not seem to be supported by clang.

@BruceForstall

This comment has been minimized.

Copy link
Member

commented Mar 8, 2018

I don't think we'd want to move SP to it's final value before the probing happens; presumably someone could asynchronously start using our stack (maybe the VM, or an interrupt handler?), and SP wouldn't be valid. I'm surprised that's what the code @jkotas shows does.

@janvorli

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2018

The you can replace

test [rsp], rax

by

test [rsp - 0x1000], rax

I actually makes my sample above correct, the way I've put it, we would not be probing the last page.

@BruceForstall

This comment has been minimized.

Copy link
Member

commented Mar 22, 2018

Related: #14481.

It looks like we do currently change RSP as we probe, so I'm wondering what the current "bad" prolog probe code looks like that @janvorli was looking at.

However, it appears our current probe code does not work on Windows AMD64. An exception thrown within the probe loop (e.g., on the probe instruction) fails to unwind because the static prolog unwind codes can't handle dynamically changing RSP.

cc @briansull @adityamandaleeka

@RussKeldorph

This comment has been minimized.

Copy link
Member

commented Mar 23, 2018

Now that #14481 is undone, is this issue the one true "figure out how to probe the Unix stack" bug? Do we know whether there is a single pattern that can correctly probe on both Windows and Unix or we have to bifurcate? Does OSX have a required probing pattern different from Linux?

@RussKeldorph

This comment has been minimized.

Copy link
Member

commented Mar 23, 2018

Another question: Why don't we just call __chkstk like C++? And write proper versions for non-Windows platforms?

@janvorli

This comment has been minimized.

Copy link
Member Author

commented Mar 23, 2018

@RussKeldorph the way that would work on Unix was the one that was reverted, provided we fix the unwind info issue.
I think that calling a jit helper that would behave like __chkstk would work and I could implement that in asm for Unix to be able to precisely define the necessary unwind info.
We would need to handle the __chkstk in a special way though, like e.g. the GC write barriers, so that when hardware exception occurs in it, it would be seen as happening at the __chkstk call site in the managed code (IsIPInMarkedJitHelper). So it seems we would then need to reimplement the __chkstk stuff in our runtime for Windows too to be able to check that.
The problem with the current issue was that the app was using dotnet core 2.0 that doesn't have that probing change - see the code I've seen when the issue happened in the comment above (#16827 (comment)).
I've tried to build the application provided by @pavlexander using the latest CLI with 2.1 runtime now and the issue doesn't occur. I have not realized before that we have had the probing fix in 2.1 already, so I haven't tried it at that time.

@RussKeldorph

This comment has been minimized.

Copy link
Member

commented Mar 23, 2018

@janvorli Thanks. So it sounds like the most expedient thing is to put back #14481 conditional on Unix only.

@RussKeldorph

This comment has been minimized.

Copy link
Member

commented Mar 23, 2018

...or does the unwind info need to be fixed for Unix as well?

@janvorli

This comment has been minimized.

Copy link
Member Author

commented Mar 23, 2018

I think we don't need to fix the unwind info on Unix, since all stack overflow and probes are handled as failfast and we are not trying to unwind stack in such cases. So putting it back just for Unix sounds good.

kouvel added a commit to kouvel/corefx that referenced this issue Mar 29, 2018

Fix thread constructor test on newer kernels
Works around dotnet/coreclr#17170
- Works around an issue on linux-arm64 where pthread_create appears to reserve less stack space than requested without failing, and limits the max reserved stack size to the ulimit-configued value, leading to seg fault when the requested size of stack space is attempted to be used
- Fixed to write every page in the stack region that is being checked, to guarantee hitting the guard page in case of failure. The JIT's stack probe for stackalloc should cover this, dotnet/coreclr#16827 details why that was not happening, in any case this test is not intended to test stack probing behavior, so this change makes the test a bit stronger in what it actually intends to test.

kouvel added a commit to kouvel/corefx that referenced this issue Mar 29, 2018

Fix thread constructor test on newer kernels
Works around and closes dotnet/coreclr#17170
- Works around an issue on linux-arm64 where pthread_create appears to reserve less stack space than requested without failing, and limits the max reserved stack size to the ulimit-configued value, leading to seg fault when the requested size of stack space is attempted to be used
- Fixed to write every page in the stack region that is being checked, to guarantee hitting the guard page in case of failure. The JIT's stack probe for stackalloc should cover this, dotnet/coreclr#16827 details why that was not happening, in any case this test is not intended to test stack probing behavior, so this change makes the test a bit stronger in what it actually intends to test.
@echesakovMSFT

This comment has been minimized.

Copy link
Member

commented Mar 31, 2018

Resolved with #17360

kouvel added a commit to dotnet/corefx that referenced this issue Apr 1, 2018

Fix thread constructor test on newer kernels (#28613)
Works around and closes dotnet/coreclr#17170
- Works around an issue on linux-arm64 where pthread_create appears to reserve less stack space than requested without failing, and limits the max reserved stack size to the ulimit-configued value, leading to seg fault when the requested size of stack space is attempted to be used
- Fixed to write every page in the stack region that is being checked, to guarantee hitting the guard page in case of failure. The JIT's stack probe for stackalloc should cover this, dotnet/coreclr#16827 details why that was not happening, in any case this test is not intended to test stack probing behavior, so this change makes the test a bit stronger in what it actually intends to test.

satano added a commit to satano/corefx that referenced this issue May 30, 2018

Fix thread constructor test on newer kernels (dotnet#28613)
Works around and closes dotnet/coreclr#17170
- Works around an issue on linux-arm64 where pthread_create appears to reserve less stack space than requested without failing, and limits the max reserved stack size to the ulimit-configued value, leading to seg fault when the requested size of stack space is attempted to be used
- Fixed to write every page in the stack region that is being checked, to guarantee hitting the guard page in case of failure. The JIT's stack probe for stackalloc should cover this, dotnet/coreclr#16827 details why that was not happening, in any case this test is not intended to test stack probing behavior, so this change makes the test a bit stronger in what it actually intends to test.

MattGal pushed a commit to MattGal/coreclr that referenced this issue Aug 2, 2018

Fixdotnet#16827 Stack probing for Linux (dotnet#17360)
* Add stack probing algorithm that moves sp while touching pages under _TARGET_UNIX_ only
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.