-
Notifications
You must be signed in to change notification settings - Fork 4.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrong GSCookie range recorded in GCInfo #13041
Comments
Hi @k15tfu, can you tell me more about your scenario? I would like to understand what you are trying to accomplish to make sure I give the right advice. Right now the only officially supported ways to call DoStackSnapshot on Linux are the following
There are OS differences that prevent us from giving identical DoStackSnapshot behavior on Linux versus what is capable on Windows. I'm not convinced that allowing the profiler to skip the GSCookie check is the right way to go with this, and even if I was convinced there is likely not enough time to make that change safely. That being said, my goal here is to find an acceptable workaround to make your scenario work. |
@davmason, hi!
I am calling DoStackSnapshot for current thread from a signal handler, without use of ICorProfilerInfo10::SuspendRuntime. It looks correct, right?
And what are the differences? Is there anything I should know? :)
I know that it's not the best thing we can do because the root of this problem is an incorrect epilog info, but I think it's more complicated fix, taking into account that your debugger team have already faced with this issue and they decided to add SKIP_GSCOOKIE_CHECK flag. So, that patch is already used in 3.0, and the only thing we should do for the profiler is to use SKIP_GSCOOKIE_CHECK flag in DoStackSnapshot (what looks easy for me). |
@k15tfu are you aware of the fact that from a signal handler, you can call only a code that uses very limited subset of runtime functions (so called async signal safe functions) which are listed in http://man7.org/linux/man-pages/man7/signal-safety.7.html? Otherwise you can get crashes, hangs and other unexpected issues. Even accessing thread local store variables, taking locks or doing memory allocations is prohibited in general. |
@janvorli Yes, I know, and the only things I use are libunwind (which is safe), and DoStackSnapshot. Bottom line: DoStackSnapshot should not be called from signal handlers because it's not safe, and can cause ... any kind of errors. Is it correct? If yes, the one ways the sampling profiler can be implemented are:
Is there other options? And I think in case dotnet/coreclr#2 we will avoid signal handler limitations, but this bug still will occur. So it's still actual issue. |
Yes.
I think that another option might be to have a separate thread / set of threads dedicated to calling the DoStackSnapshot. The signal handler would synchronize with that thread e.g. using pipe, eventfd or some other mean that is async signal safe. It would essentially wait in the handler until the DoStackSnapshot completes and then continue running. |
@janvorli Do agree that in this case GSCookie check continue fails, and that this bug is still actual? |
Yes, it seems like a JIT bug. @k15tfu can you please share which function have you hit this issue with? |
@janvorli I don't know the exact function, but I can share the project where it happens: Mandelbrot.zip , and try it with netcoreapp2.1 / Debug:
CheckGSCookies() fails in threads running Writer.CreatePgmAscii and Writer.CreatePpmAscii (probably it's the same issue). Here is
|
Btw, @janvorli one of the way to work with unsafe functions from a signal handler is to guarantee that signal will not be delivered during such functions, therefore http://man7.org/linux/man-pages/man7/signal-safety.7.html suggests to block signals before calling them. So, if I don't do anything printf-related in my program, I can safely use printf() in signal handler. What if I do all DoStackSnapshot() stuff in the signal handler, but to avoid possible problems, I will check if the signal arrives during any of the unsafe functions? I do already skip all native frames, trying to find the topmost managed frame, so I can also check the name/module of these frames and do an early-return if find something unsafe. |
How do you know that the DoStackSnapshot doesn't end up calling unsafe functions itself? How do you know it would not allocate memory using malloc, for example? I think it is too dangerous to assume such things. Moreover, even if we don't do anything unsafe today, it is too easy to add some code somewhere in the stack walker that will break this assuption. |
@sergiy-k reported this problem as well offline. His repro was not using profiler interfaces, but used a different atypical environment. This is a problem in the JIT. The JIT makes following simplifying assumption about GS cookie reporting:
The simplifying assumption is not correct. The GS cookie can be queried in the epilog in certain rare cases. |
cc @dotnet/jit-contrib -- we should consider fixing this for 5.0. |
I am planning to fix this issue by exclusing the epilog block from the live range, like:
however, we don't have correct epilog (or epilogs) size right now and I am working on a change to calculate that. @jkotas I think it would be a sufficient fix. We have our last use of GS before we jump to an epilog so can shorter its lifetime. Am I right or do we need to know a precise live range of the GS cookie (including a part of the prolog and part of the epilog)? In the meantime, I need to find a way to repro the issue to test the fix, @sergiy-k could would it be hard to use your case for a local repro? |
I think it is ok to stop reporting the GS cookie after the check that validated it did not get corrupted. I assume that it is what you meant by the last use. |
Yes, thanks. |
The information about end of GS cookie scope recorded in GC info is not accurate and it cannot even be made accurate without redesign that is not worth it. Detect end of GS cookie scope by comparing it with current SP instead. Fixes dotnet#13041
The information about end of GS cookie scope recorded in GC info is not accurate and it cannot even be made accurate without redesign that is not worth it. Detect end of GS cookie scope by comparing it with current SP instead. Fixes #13041
Hi! We use DoStackSnapshot() in a signal handler, but sometimes CheckGSCookies() fails like:
I found that it happens only when checking GSCookie in a particular slot =
fffffdd8
(in hex),and only when
relOffset
(in EECodeManager::GetGSCookieAddr()) is close to the end of its valid region (i.e. gcInfoDecoder.GetGSCookieValidRangeEnd()).In my case, it failed many times with different
relOffset
: 0x451, 0x44f, 0x451, 0x451, 0x454, 0x451, 0x453, 0x454, whereas the valid range was 0x59 -- 0x455.Here is a disasm for the top stack frame before signal handler:
After receiving the signal being on one of these tail instructions, if we use more than 0x218 bytes on the stack (we do) the old GSCookie value will be overwritten and then when we call DoStackSnapshot() the CheckGSCookies() will abort due to GSCookie mismatch. So obviously GetGSCookieValidRangeEnd() returns incorrect offset because it contains the code after freeing the stack.
As I can see @hoyosjs has faced with the same problem dotnet/coreclr@32741b9 , so probably in my case we have to backport this patch to all supported versions and disable GSCookie check for ProfToEEInterfaceImpl::DoStackSnapshot(). P.S. I'm ready to prepare a PR and check it locally.
category:correctness
theme:gc-info
skill-level:intermediate
cost:medium
The text was updated successfully, but these errors were encountered: