Handle page protection errors as D errors on Linux with --DRT-memoryError=1 #2249

wilzbach · 2018-07-11T04:04:59Z

See also: https://forum.dlang.org/post/pi31ab$me3$1@digitalmars.com

dlang-bot · 2018-07-11T04:05:00Z

Thanks for your pull request, @wilzbach!

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + druntime#2249"

Geod24 · 2018-07-11T04:16:31Z

This change could use a bit more rationale. Is there a way to disable it? Do we really want that?

Personally I'm worried about having different behavior across platforms.

wilzbach · 2018-07-11T04:28:19Z

This change could use a bit more rationale.

There's an entire NG thread with people complaining about this not being the default.
To quote @schveiguy

It was controversial at the time, and considered a hack. It's also only supported on Linux. So I don't know the reason why it's not always done for Linux, I really think it should be.

Is there a way to disable it

deregisterMemoryErrorHandler

Personally I'm worried about having different behavior across platforms.

Other platforms will just segfault (not really a good behavior anyhow).
Also note that there's already tons of behavior that's platform-specific, e.g. #2035 (rt_trap_exceptions isn't required on Windows because it's always set there when a debugger is detected).
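For readers unfamiliar with the opt-in API mentioned above, here is a minimal sketch of turning the handler on and off again. It assumes a target where druntime's etc.linux.memoryerror module is supported (Linux, x86 or x86_64, DMD); the nested getNull helper is only there for illustration.

```d
// Sketch only: assumes a target where etc.linux.memoryerror is supported.
import etc.linux.memoryerror;
import core.stdc.stdio : printf;

void main()
{
    // The handler functions are only declared on supported targets,
    // so guard their use at compile time.
    static if (is(typeof(registerMemoryErrorHandler)))
    {
        // Illustrative helper that hides the null from the optimizer.
        int* getNull() { return null; }

        // Opt in: from here on, page faults are reported as D errors.
        registerMemoryErrorHandler();

        try
        {
            *getNull() = 42; // raises SIGSEGV
        }
        catch (NullPointerError e)
        {
            printf("null pointer dereference reported as a D error\n");
        }

        // Opt out: later faults are ordinary segfaults again.
        deregisterMemoryErrorHandler();
    }
}
```

On targets without support the static if branch is simply skipped and the program does nothing.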

adamdruppe · 2018-07-11T13:36:27Z

Null pointers on win32 also throw errors btw.

adamdruppe · 2018-07-11T13:45:03Z

Though while I'm not voting no, I do object to calling segfaults bad behavior. They do exactly what they are supposed to do - and exactly what a D error is allowed to do according to our spec - and have superior debugging capabilities. (I'd say superior handling capabilities too, you can handle-and-resume or retry a segfault, not so much a D exception, but we don't really lose that since you can just register your own signal handler instead.)

The exception isn't bad, but neither is the segfault, you just need to know how to actually use it and then you see the advantages.

nemanja-boric-sociomantic · 2018-07-11T14:56:47Z

One good thing about segmentation fault is that you can generate core dump which you could inspect to see what caused the segmentation fault very easily. Is one still able to do that with this approach? If I'm not mistaken, the position of the segmentation fault is removed from the backtrace.

nemanja-boric-sociomantic · 2018-07-11T14:58:18Z

Sorry, I read the stacktrace in a wrong way 🤦‍♂️

JohanEngelen · 2018-07-11T17:14:07Z

This feels very much rushed after minimal discussion on the NG. There is even a post that mentions that in the past this was debated, so at least I expect some arguments both ways and with a conclusion of why it now should come out different than before.

Did you test how this works with debuggers and ASan and other tools?

JohanEngelen · 2018-07-11T17:20:43Z

My red flag here is that with this PR per default after a page error, more code is executed that even allocates (new), uses D's exception mechanism, calls constructors of class objects, etc. ...... That sounds problematic to me.

Shachar · 2018-07-11T17:53:01Z

I second the detractors.
The Linux segmentation fault behavior is a well-known capability with established ways of handling it. Turning it into an exception makes zero sense to me.

At the very least, it should be possible to turn this capability off.

adamdruppe · 2018-07-11T18:21:14Z

A few notes:

you can turn this off by calling a function (just like right now, you can turn it on by calling the function). Maybe we should make it a DRT command line argument too. The implementation, btw, just registers a signal handler with sigaction (http://dpldocs.info/experimental-docs/source/etc.linux.memoryerror.d.html#L33). If running in a debugger, just like how we want uncaught exceptions to go through the debugger, these should too.
Executing more code after it doesn't actually bother me since the segfault interrupts the illegal operation... it is perfectly normal to handle and resume with a page fault.

This is why I could go either way on this. I like things the way they are now, but I could live with this change too. (or a compromise: print the stack trace to the console, then abort the program, but I think that has the same downsides to this basically anyway)

Shachar · 2018-07-12T01:55:07Z

After having slept on this, I have to raise the level of my objection to this change.

Before I begin, I'll point out that I've tried to follow the code for registerMemoryErrorHandler, but I'm not sure what the playing around with the variables called RIP actually do. As such, it might be that this objection is incorrect.

With that in mind, please keep in mind that the program might have been doing anything when it segfaulted. It might have been in the middle of a throw. It might have been in the middle of a context switch. It might have been mucking around with variables that outside scope(failure)s might be relying on. It might have been in the middle of performing five separate program lines, interleaved by the compiler, all partially completed.

Letting the program flow continue, even as an exception, is just too likely to cause further problems. These will mask the original problem, or might even drive the program into an infinite loop of crashing, trying to throw, crashing, trying to throw.

As such, I must re-iterate how bad an idea I think it is to turn this on by default.

wilzbach · 2018-07-12T04:48:17Z

Alright, I was just trying to follow up on the NG thread.
Thanks a lot for all your input and I will close this soon. Really appreciated!

However, we definitely should improve the documentation, and I will keep this open until then so that I don't forget it.

jacob-carlborg · 2018-07-12T05:51:54Z

What about turning it on by default when the -debug flag is used?

ibuclaw · 2018-07-12T06:38:37Z

-1 on the grounds that the Linux memory error module is both dmd and x86 centric, and should be improved by some measure first so that it be more useable before making it a druntime dependency.

Though that is a purely technical reason only.

Shachar · 2018-07-12T07:10:15Z

I think we should leave it as is. It should be documented, along with the possible pitfalls it might bring. If anyone wants it, they can add it to their code.

I don't think it should be on by default. Ever.

schveiguy · 2018-07-12T13:23:52Z

I don't know the original reasons why this isn't the default, or why it's not supported on other platforms. What I do know, is that I enable it on Linux in any server program that I write, and it works swimmingly well.

Yes, a segfault can be useful in certain circumstances. But a stack trace is eminently easier to use for diagnostics, especially when you aren't in control of the environment. I can tell just by reading the stack trace where the problem has occurred. It may require more in-depth analysis, and perhaps using a debugger, or dumping a core file. But it also could be a "oops! I forgot to initialize that thing!" type of bug, in which case, I need nothing more than an editor to fix, and I'm done in 2 minutes. Instead of digging out the exact binary (hopefully you saved all those), loading into a debugger, hoping that the customer has core dumps turned on, hoping that they didn't just delete the file, hoping their disk has enough space to store it, etc., etc.

In regards to "different behavior on different platforms", my understanding is that Windows ALWAYS generated a stack trace for seg faults. There is plenty of precedent here.

It definitely makes sense to provide runtime switches to turn it off or on (given that this would enable it before e.g. any static ctors are run), but I still believe the default should be on. I'm not sure of the harm here, we should be exploiting the most we can from the platform in terms of providing diagnostic information.

schveiguy · 2018-07-12T13:25:39Z

or a compromise: print the stack trace to the console, then abort the program, but I think that has the same downsides to this basically anyway

This would work too. Printing the stack trace and aborting has the advantage that you aren't going to unwind the stack, possibly running things after something very bad has happened.
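A rough sketch of that compromise, for illustration only: a hand-written SIGSEGV handler that dumps the raw stack frames and aborts without throwing or unwinding. It is not what the PR or etc.linux.memoryerror does. It assumes Linux with glibc (for backtrace), and the handler name printTraceAndAbort is made up for this example.

```d
// Sketch only: assumes Linux with glibc; not the PR's implementation.
import core.stdc.stdlib : abort;
import core.sys.linux.execinfo : backtrace, backtrace_symbols_fd;
import core.sys.posix.signal;
import core.sys.posix.unistd : STDERR_FILENO;

// Hypothetical handler: print the frames, then abort. Nothing is thrown,
// so no destructors or scope guards run after the fault.
extern (C) void printTraceAndAbort(int signum, siginfo_t* info, void* context)
{
    void*[64] frames;
    immutable count = backtrace(frames.ptr, cast(int) frames.length);
    // Writes symbol lines directly to the file descriptor.
    backtrace_symbols_fd(frames.ptr, count, STDERR_FILENO);
    abort(); // raises SIGABRT, so a core dump is still possible
}

void main()
{
    sigaction_t action;
    action.sa_sigaction = &printTraceAndAbort;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    sigaction(SIGSEGV, &action, null);

    // Illustrative fault: the handler prints the frames and aborts.
    int* getNull() { return null; }
    *getNull() = 42;
}
```

The frames printed this way are mangled symbol names and addresses rather than a formatted D stack trace, which is part of the trade-off being discussed.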

schveiguy · 2018-07-12T13:37:56Z

BTW, here is the original PR, you can read a LOT of comments in there: #187

wilzbach · 2019-01-02T13:37:22Z

So I came back to this PR and it looks like I already did the required work here, i.e. adding a flag to druntime (--DRT-memoryError=1) which allows users to enable this option without needing to recompile their binaries (see the changelog entry of this PR).

Anything else that we want to do here?

CC @thewilsonator

thewilsonator · 2019-01-02T14:21:54Z

I think it's fine, and a lot of the previous criticisms were about it being a default, but I'd still like to get some more feedback on this (but don't let me forget about it!).

schveiguy · 2019-01-02T14:49:40Z

Since my comments on this, I have found one place where this really sucks. If you encounter a segfault inside a destructor being called by the GC, then the attempt to new an Error causes an invalid memory operation. It took me a while to track that down, as the information printed there is totally useless.

Before making this the default (I think reading the code, this is not the default, if it was before), I think we should make the memory error not depend on the GC. In fact, if there is a way to print the stack trace and abort without unwinding anything, that would be ideal, and much much safer.

jacob-carlborg · 2019-01-02T15:28:50Z

If you encounter a segfault inside a destructor being called by the GC, then the attempt to new an Error causes an invalid memory operation.

Don't we have a buffer somewhere for statically allocated exceptions?

schveiguy · 2019-01-02T16:17:55Z

Don't we have a buffer somewhere for statically allocated exceptions?

It's done elsewhere, but not for this extension. Would be a useful change, even if this PR isn't accepted.

wilzbach · 2019-01-02T16:33:48Z

Before making this the default (I think reading the code, this is not the default, if it was before),

Yep no longer the default. This PR was down-graded to just adding the CLI-equivalent of registerMemoryErrorHandler.

Don't we have a buffer somewhere for statically allocated exceptions?

See e.g. #1710

jacob-carlborg · 2019-01-02T16:36:59Z

changelog/trap_memory_error.dd

+Page protection error handling can now be registered via `--DRT-memoryError=1`
+
+In environments where attaching a debugger or retrieving a core dump isn't possible or hard,
+on Linux x86_64 one has always been able to attach druntime's  memory handler:


The code in the etc.linux.memoryerror module is defined for both 32 bit and 64 bit.

src/rt/dmain2.d

jacob-carlborg · 2019-01-02T16:41:31Z

There's no test for what happens when this flag is passed on a platform that doesn't support it.

schveiguy · 2019-01-02T16:41:40Z

Ugh, looking again at staticError, it appears that it suppresses the stack trace. Which is completely useless. Same issue as I had with InvalidMemoryOperation that I mentioned earlier.

Somehow, we need to fix this problem. However, that shouldn't hold up this PR. All it is doing is exposing the linux memory error registration as a DRT flag.

schveiguy · 2019-01-02T16:42:35Z

There's no test for what happens when this flag is passed on a platform that doesn't support it.

In general, unsupported or unknown DRT flags are going to be ignored. For example, if you passed this flag on an older version of druntime, nothing will happen.

jacob-carlborg · 2019-01-02T16:43:49Z

In general, unsupported or unknown DRT flags are going to be ignored. For example, if you passed this flag on an older version of druntime, nothing will happen.

This is slightly different in that it's only available on some platforms. But it might not be worth testing.

jacob-carlborg · 2019-01-02T16:59:42Z

Ugh, looking again at staticError, it appears that it suppresses the stack trace.

Technically it's not the staticError function but rather all Errors that use staticError that sets the info instance variable to an object that will suppress the stack trace.

Somehow, we need to fix this problem

Looking at core.runtime.defaultTraceHandler, there are two things stopping this from being @nogc: calling core.memory.gc_inFinalizer and creating an instance of DefaultTraceInfo. The latter can be fixed, again, with a static allocation; not sure what to do about gc_inFinalizer.

schveiguy · 2019-01-02T18:38:05Z

gc_inFinalizer actually doesn't require GC since it's now lazily initialized:

druntime/src/gc/impl/proto/gc.d

Lines 234 to 237 in 01daddf

    
           bool inFinalizer() nothrow 
        
           { 
        
               return false; 
        
           }

schveiguy · 2019-01-02T18:40:58Z

We'd have to mark inFinalizer as @nogc in the interface, but that shouldn't be a problem, as the implementation even with a GC doesn't require doing any GC-ish things.

jacob-carlborg · 2019-01-02T19:16:33Z

We'd have to mark inFinalizer as @nogc in the interface, but that shouldn't be a problem, as the implementation even with a GC doesn't require doing any GC-ish things.

Yes, exactly.

I was looking into the allocation of DefaultTraceInfo, the question is: does it need to return a new instance every time or can it return the same pre-allocated instance?

schveiguy · 2019-01-02T19:44:10Z

does it need to return a new instance every time or can it return the same pre-allocated instance?

I think you need a unique instance per exception. But if we are preallocating the exception, we can preallocate the trace info as well. Looks like the big part of the trace info is the stack frame array. The way the default trace info is stored will have to change. Looks like it's an inner class, but it doesn't need to be.

jacob-carlborg · 2019-01-03T09:00:14Z

I think you need a unique instance per exception.

The way staticError currently works, you can only have one exception anyway. It will always overwrite the buffer.

RazvanN7 · 2021-11-09T12:10:46Z

@schveiguy @wilzbach @thewilsonator I have rebased this. Should we merge it?

ibuclaw · 2021-11-09T16:15:11Z

-1 on the grounds that the Linux memory error module is both dmd and x86 centric, and should be improved by some measure first so that it be more useable before making it a druntime dependency.

Though that is a purely technical reason only.

Still a -1, this time because it would add a circular dependency between druntime and phobos.

wilzbach requested a review from andralex as a code owner July 11, 2018 04:04

wilzbach added the Blocked label Jul 12, 2018

wilzbach force-pushed the memory-error branch from 288c872 to cfa33c5 Compare July 29, 2018 20:10

wilzbach force-pushed the memory-error branch from cfa33c5 to f83baf3 Compare January 2, 2019 13:34

wilzbach changed the title ~~Handle page protection errors by default as D errors on Linux~~ Handle page protection errors as D errors on Linux with --DRT-memoryError=1 Jan 2, 2019

wilzbach removed the Blocked label Jan 2, 2019

wilzbach force-pushed the memory-error branch from f83baf3 to 11cc23e Compare January 2, 2019 16:32

jacob-carlborg reviewed Jan 2, 2019

jacob-carlborg reviewed Jan 2, 2019

wilzbach force-pushed the memory-error branch from 11cc23e to 09e0840 Compare May 10, 2019 08:49

dlang-bot added the stalled label May 27, 2021

Handle page protection errors by default as D errors on Linux

fb7c309

RazvanN7 force-pushed the memory-error branch from 09e0840 to fb7c309 Compare November 9, 2021 12:10

dlang-bot removed the stalled label Nov 9, 2021

dlang-bot added the stalled label Nov 9, 2021

RazvanN7 closed this Nov 10, 2021