Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in ICorProfilerInfo2.DoStackSnapshot #37586

Closed
birnbeidla opened this issue Jun 8, 2020 · 10 comments
Closed

Crash in ICorProfilerInfo2.DoStackSnapshot #37586

birnbeidla opened this issue Jun 8, 2020 · 10 comments

Comments

@birnbeidla
Copy link

Hi

I'm writing a profiler that periodically gets the managed stacks from all available threads.
Collecting stack data is done by a separate thread that does stack walking with a specific interval. We call that thread the collector thread.
The profiler works with both desktop and core CLR and can runs on Windows aswell as on Linux.

For asynchronous stack walking, I tried several different approcaches and in one specific environment, I encountered a problem.
The environment is 32bit processes on Windows with core CLR.

This is a brief explaination of how the async stack walking works: (all the work is done on the collector thread)

  1. Enumerate all existing managed threads and picks one. We call that thread the target thread.
  2. Suspend the target with a call to win32 APIs SuspendThread
  3. Get the target threads CONTEXT information via win32 APIs GetThreadContext
  4. Stack walk to the first managed frame.
    • Call GetFunctionFromIP from profiler callbacks to check for a managed frame
    • Use StackWalk64 from dbghelp.dll to walk to the next frame.
  5. With the so prepared CONTEXT struct, call DoStackSnapshot from the profiler callbacks and retrieve the function IDs
  6. Resume the target thread via win32 API

The problem is, that in some very rare cases, I run into process crashes that are occur in my call to DoStackSnapshot:
I managed to resolve the top frames of the collector thread for such a case:

coreclr!__report_gsfailure+17 f:\dd\vctools\crt\vcstartup\src\gs\gs_report.c:220
coreclr!CrawlFrame::SetCurGSCookie+1e e:\a\_work\525\s\src\vm\stackwalk.cpp:435
coreclr!StackFrameIterator::ProcessCurrentFrame+1e4 e:\a\_work\525\s\src\vm\stackwalk.cpp:2990
coreclr!StackFrameIterator::NextRaw+256 e:\a\_work\525\s\src\vm\stackwalk.cpp:2760

I'm not sure if this is a bug in core CLR or if my approach has some flaws.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-Diagnostics-coreclr untriaged New issue has not been triaged by the area owner labels Jun 8, 2020
@ghost
Copy link

ghost commented Jun 8, 2020

Tagging subscribers to this area: @tommcdon
Notify danmosemsft if you want to be subscribed.

@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Jun 8, 2020
@tommcdon tommcdon added this to the 5.0 milestone Jun 8, 2020
@tommcdon
Copy link
Member

tommcdon commented Jun 8, 2020

@davmason

@davmason
Copy link
Member

davmason commented Jun 9, 2020

Hi @birnbeidla,

The stack frames that you provide there are for a stack walk that is already in progress, so the issue you're seeing is that you're trying to stack walk a thread that is already in the middle of stack walking another thread. Even if it didn't crash you would likely run in to all sorts of other issues.

In general collecting callstacks this was is error prone and full of easy to fall in to errors like this. You also have to watch out for deadlocks, if you happen to suspend a thread while it is executing runtime code it could be holding an arbitrary lock that DoStackSnapshot needs and you can deadlock.

Because of these dangers, and the fact that there is no SuspendThread api on linux, we introduced the ICorProfilerInfo::SuspendRuntime and ICorProfilerInfo10::ResumeRuntime apis in .net 3.0. SuspendRuntime will pause the runtime using the same mechanism that the GC uses to suspend the runtime, but not actually perform a GC. This will leave all the threads in the runtime in a state that is known to be safe to sample them. You can call DoStackSnapshot on all managed threads at that point without having to do any sort of detection about what the thread is doing. Then once you're done you can call ResumeRuntime to let code execute again.

If you are building a stack sampling profiler at this point, I highly suggest using SuspendRuntime/ResumeRuntime. Your life will be a lot easier 😄

@birnbeidla
Copy link
Author

Thanks a lot for your answers.

I am aware that you introduced the Suspend/Resume Runtime API and I really look forward to try it out.

Unfortunately, the requirement for the profiler is to run on versions prior to 3.0. I am also aware that async stack walking holds a lot of pifalls and for the deadlock scenario you describe, I already have another thread running that acts as a watchdog and can resolve the deadlock scenario.

As for the scenario you describe:
When you say that the target thread is currently stack-walking itself, do you mean, its walking its own stack (sync) or walking another threads stack (async) ?
Is there any way to detect such a scenario?
Is there any way to recover from the error?

@davmason
Copy link
Member

I may have misunderstood the data you provided. Is this stack the thread that you're collecting stacks from or the thread that you call DoStackSnapshot from?

coreclr!__report_gsfailure+17 f:\dd\vctools\crt\vcstartup\src\gs\gs_report.c:220
coreclr!CrawlFrame::SetCurGSCookie+1e e:\a\_work\525\s\src\vm\stackwalk.cpp:435
coreclr!StackFrameIterator::ProcessCurrentFrame+1e4 e:\a\_work\525\s\src\vm\stackwalk.cpp:2990
coreclr!StackFrameIterator::NextRaw+256 e:\a\_work\525\s\src\vm\stackwalk.cpp:2760

I thought the first time that you meant that it was the thread you were collecting stacks, but re-reading it just now it seems like it's actually the callstack from the crashing thread.

If that is the case, then a crash in __report_gsfailure indicates that stack corruption has happened, which would likely mean there is some sort of buffer overrun in your program overwriting stack values.

@birnbeidla
Copy link
Author

No, your first thoughts were correct. The stack is from the collector thread, NOT the target thread.

But since it happens while (async) walking the targets thread's stack (while the target is suspended), I can't rule out heap corruption completely but it's very unlikely that the corrupted stack is the one from my collector thread.

Anyway, my real question is how to detect or recover from such a situation since __report_gsfailure does a fast fail and I don't know any possibility to recover from that.
This is critical for my application because I surely can skip a few stack samples but crashing the target application is a no-go.

@davmason
Copy link
Member

Do you have the stack of the target thread? This failure means that DoStackSnapshot is trying to walk the thread and detected that one of the stack cookies is invalid. Usually this means that stack corruption happened, which would be unrecoverable. But it is also possible that you caught a runtime thread in the middle of a dangerous operation.

How to detect and avoid it depends on what the target thread is doing. If you get a crash under a debugger, then you should be able to find out the target thread by casting the ThreadID to a coreclr!Thread * and then inspecting that object, then you can find the callstack of the target thread in a debugger.

If you have a dump with the failfast and can share it with me, I can take a look.

@birnbeidla
Copy link
Author

Again, thank you very much for your comments.

The stack of the target thread is inconclusive. I've had that problem a few times now and I've never seen a pattern in the stacks of the target thread.
What would be considered as "dangerous operation"? Is that documented somewhere?

I am already aware that there are some situations I have to avoid. For instance, I block module unload until I've finished stackwalking to avoid possible deadlocks with loader lock.
A list of scenarios like this would be really handy....

I'll check if I can provide a dump. Since this is company work, I'm also bound to legal restrictions.
If so, how can I reach out to you?

@davmason
Copy link
Member

davmason commented Jul 10, 2020

Hi @birnbeidla,

Sorry that I never responded. I thought I did and moved on.

Since you opened this issue we have discovered that the jit has some error about how it treats gs cookies, which might be related to your issue. The only way to know for sure is to see the callstack of the target thread. See #13041

@tommcdon
Copy link
Member

Closing as this looks like a duplicate to #13041. Please let us know if that is not true and we will investigate further.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants