-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crash in ICorProfilerInfo2.DoStackSnapshot #37586
Comments
Tagging subscribers to this area: @tommcdon |
Hi @birnbeidla, The stack frames that you provide there are for a stack walk that is already in progress, so the issue you're seeing is that you're trying to stack walk a thread that is already in the middle of stack walking another thread. Even if it didn't crash you would likely run in to all sorts of other issues. In general collecting callstacks this was is error prone and full of easy to fall in to errors like this. You also have to watch out for deadlocks, if you happen to suspend a thread while it is executing runtime code it could be holding an arbitrary lock that DoStackSnapshot needs and you can deadlock. Because of these dangers, and the fact that there is no SuspendThread api on linux, we introduced the ICorProfilerInfo::SuspendRuntime and ICorProfilerInfo10::ResumeRuntime apis in .net 3.0. SuspendRuntime will pause the runtime using the same mechanism that the GC uses to suspend the runtime, but not actually perform a GC. This will leave all the threads in the runtime in a state that is known to be safe to sample them. You can call DoStackSnapshot on all managed threads at that point without having to do any sort of detection about what the thread is doing. Then once you're done you can call ResumeRuntime to let code execute again. If you are building a stack sampling profiler at this point, I highly suggest using SuspendRuntime/ResumeRuntime. Your life will be a lot easier 😄 |
Thanks a lot for your answers. I am aware that you introduced the Suspend/Resume Runtime API and I really look forward to try it out. Unfortunately, the requirement for the profiler is to run on versions prior to 3.0. I am also aware that async stack walking holds a lot of pifalls and for the deadlock scenario you describe, I already have another thread running that acts as a watchdog and can resolve the deadlock scenario. As for the scenario you describe: |
I may have misunderstood the data you provided. Is this stack the thread that you're collecting stacks from or the thread that you call DoStackSnapshot from?
I thought the first time that you meant that it was the thread you were collecting stacks, but re-reading it just now it seems like it's actually the callstack from the crashing thread. If that is the case, then a crash in |
No, your first thoughts were correct. The stack is from the collector thread, NOT the target thread. But since it happens while (async) walking the targets thread's stack (while the target is suspended), I can't rule out heap corruption completely but it's very unlikely that the corrupted stack is the one from my collector thread. Anyway, my real question is how to detect or recover from such a situation since __report_gsfailure does a fast fail and I don't know any possibility to recover from that. |
Do you have the stack of the target thread? This failure means that DoStackSnapshot is trying to walk the thread and detected that one of the stack cookies is invalid. Usually this means that stack corruption happened, which would be unrecoverable. But it is also possible that you caught a runtime thread in the middle of a dangerous operation. How to detect and avoid it depends on what the target thread is doing. If you get a crash under a debugger, then you should be able to find out the target thread by casting the ThreadID to a coreclr!Thread * and then inspecting that object, then you can find the callstack of the target thread in a debugger. If you have a dump with the failfast and can share it with me, I can take a look. |
Again, thank you very much for your comments. The stack of the target thread is inconclusive. I've had that problem a few times now and I've never seen a pattern in the stacks of the target thread. I am already aware that there are some situations I have to avoid. For instance, I block module unload until I've finished stackwalking to avoid possible deadlocks with loader lock. I'll check if I can provide a dump. Since this is company work, I'm also bound to legal restrictions. |
Hi @birnbeidla, Sorry that I never responded. I thought I did and moved on. Since you opened this issue we have discovered that the jit has some error about how it treats gs cookies, which might be related to your issue. The only way to know for sure is to see the callstack of the target thread. See #13041 |
Closing as this looks like a duplicate to #13041. Please let us know if that is not true and we will investigate further. |
Hi
I'm writing a profiler that periodically gets the managed stacks from all available threads.
Collecting stack data is done by a separate thread that does stack walking with a specific interval. We call that thread the collector thread.
The profiler works with both desktop and core CLR and can runs on Windows aswell as on Linux.
For asynchronous stack walking, I tried several different approcaches and in one specific environment, I encountered a problem.
The environment is 32bit processes on Windows with core CLR.
This is a brief explaination of how the async stack walking works: (all the work is done on the collector thread)
The problem is, that in some very rare cases, I run into process crashes that are occur in my call to DoStackSnapshot:
I managed to resolve the top frames of the collector thread for such a case:
I'm not sure if this is a bug in core CLR or if my approach has some flaws.
The text was updated successfully, but these errors were encountered: