Protecting async-profiler from corrupt Method pointers #831
The JFR code you pointed to refers to the collection of an async stack trace (i.e. outside a safepoint). However, JVM TI …
It may be a JVM bug: jmethodIDs are supposed to remain valid indefinitely, with "valid" implying that …

I don't have reason to believe that this is related, but I was wondering what ramifications there might be to the massive proliferation of unique jmethodIDs for lambdas. We've found it helpful to strip the hexadecimal suffixes to get them to collapse properly in flame graphs, but won't otherwise-identical call stacks still have multiple entries in callTraceStorage?
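For illustration, a minimal sketch of the suffix stripping mentioned above, not async-profiler's actual code; the exact suffix shape varies by JDK version, and the pattern here is illustrative:

```cpp
#include <regex>
#include <string>

// Lambda frames carry a per-instance tail such as
// "Foo$$Lambda$123/0x00000008012c5440"; dropping it lets
// otherwise-identical frames collapse in a flame graph.
static std::string stripLambdaSuffix(const std::string& frame) {
    static const std::regex tail("(\\$\\$Lambda)(\\$\\d+)?(/(0x)?[0-9a-fA-F]+)?");
    return std::regex_replace(frame, tail, "$1");
}
```

Note that this only merges frames at rendering time; as the comment says, the profiler's internal storage would still hold distinct entries, since it keys on jmethodIDs rather than on names.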
The design of jmethodIDs in HotSpot assumes that any JVM TI call on … Blindly applying a non-portable workaround (relying on private symbols) for an unknown bug that it might not even fix is not an option.
I agree. Please file a separate issue, since it's unrelated to the jmethodID segfault.
Running for a while with our workaround in place, and inserting a detectable error frame when it's triggered, may give us a more precise idea of how to repro this portably. I understand that it might not be palatable for general distribution.
This is valid only as long as there is an existing strong reference to the class containing the method to which the jmethodID points. That is not true for stack traces collected by JVM TI, e.g. for allocation profiles: the lifecycle of the jmethodIDs contained in the stack traces can be very different from the lifecycle of the classes containing the methods those jmethodIDs point to.

I have a pretty reliable reproducer (which I, unfortunately, cannot share because it is basically a heavily mocking test fixture of one of our internal projects) where the crash due to a messed-up jmethodID is guaranteed within a few minutes.

@pnf I came to this workaround as well. I don't want to cross-post to a potentially competing product (although I don't think it is…), but one can also use the vmstructs to do more exhaustive checks on the method/class/classloader structures used by various JVM TI calls, e.g. to resolve the defining class. However, all these checks are just band-aids for the fact that a … I confirmed the hypothesis of method and class metadata being partially cleaned when trying to use a …

Ok, these are just my 2c in case this info might help someone.
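For illustration, a rough sketch of the vmstructs-based checking mentioned above, under heavy assumptions: the offset variables are presumed to have been filled in from HotSpot's exported gHotSpotVMStructs tables at startup, and all names are illustrative rather than real async-profiler code.

```cpp
#include <cstdint>

// Assumed to be populated from the exported VMStructs tables at startup.
extern int method_constMethod_offset;    // Method::_constMethod
extern int constMethod_constants_offset; // ConstMethod::_constants
extern int constantPool_holder_offset;   // ConstantPool::_pool_holder

static inline void* loadPtr(const void* base, int offset) {
    return *(void**)((const char*)base + offset);
}

static inline bool plausiblePtr(const void* p) {
    // Null or misaligned pointers cannot be valid metadata pointers.
    return p != nullptr && ((uintptr_t)p & (sizeof(void*) - 1)) == 0;
}

// Walks Method* -> ConstMethod* -> ConstantPool* -> holder klass,
// bailing out on the first implausible pointer. Real code must also
// guard the raw loads against SIGSEGV (the memory behind a freed
// Method can be unmapped), e.g. with a SafeAccess-style wrapper.
static void* methodHolderOrNull(const void* method) {
    if (!plausiblePtr(method)) return nullptr;
    void* cm = loadPtr(method, method_constMethod_offset);
    if (!plausiblePtr(cm)) return nullptr;
    void* cp = loadPtr(cm, constMethod_constants_offset);
    if (!plausiblePtr(cp)) return nullptr;
    void* holder = loadPtr(cp, constantPool_holder_offset);
    return plausiblePtr(holder) ? holder : nullptr;
}
```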
@jbachorik Thank you for sharing your thoughts. As I pointed out in the JBS issue, this is not how jmethodIDs are supposed to work. The machinery to protect them from concurrent class unloading is there in the HotSpot code; if it does not work as expected, that is obviously a JVM bug. I was not able to reproduce it; if you can, it would be helpful if you provided some debugging details. The proposed workaround does not actually fix the issue, but makes it much harder to reproduce.
@apangin I totally agree that the proposed change is just a workaround and will not guarantee never touching a corrupted jmethodID. The word from the JVM runtime folks (David Holmes, Dan Daugherty) is that this is kind of known but not easy to fix: https://mail.openjdk.org/pipermail/serviceability-dev/2023-June/049711.html (at least that's my understanding of what they are saying).

As for reproducing: I am only able to reproduce it when there are really many classes generated on the fly (Mockito mocks), and so far I have only one of our internal projects as a testbed. I might attempt to create a reproducer which tries to mimic what Mockito does but, TBH, this feels like a lot of work for a very slim chance that this will ever get fixed.

TL;DR: the teardown/cleanup would need to be made atomic with respect to the jmethodID/method pointer. The memory must not be released while someone is actively using the jmethodID, and the nulling of the pointer and the cleanup must be done in an exact and predictable order. This would require, e.g., an API to atomically create a strong reference to the class containing the method the jmethodID points to, such that the method/class structures remain immutable while we are operating on that particular jmethodID 🤷
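For illustration only: from agent code (outside a signal handler), part of this can be approximated today by pinning the declaring class with a JNI global reference before using the jmethodID. GetMethodDeclaringClass, GetMethodName, and NewGlobalRef are the standard JVM TI/JNI calls; everything else in this sketch is assumed, and the fundamental race remains, because the first call can itself be handed an already-stale jmethodID.

```cpp
#include <jni.h>
#include <jvmti.h>

// A sketch, not a fix: pin the declaring class so it cannot be unloaded
// while we use the jmethodID. GetMethodDeclaringClass itself may still
// crash on a stale jmethodID; a real fix would need resolution and
// pinning to be one atomic operation inside the JVM.
static bool describeMethod(jvmtiEnv* jvmti, JNIEnv* jni, jmethodID mid) {
    jclass holder = nullptr;
    if (jvmti->GetMethodDeclaringClass(mid, &holder) != JVMTI_ERROR_NONE) {
        return false;
    }
    jobject pin = jni->NewGlobalRef(holder);  // strong ref: no unloading now

    char* name = nullptr;
    bool ok = jvmti->GetMethodName(mid, &name, nullptr, nullptr) == JVMTI_ERROR_NONE;
    if (ok) {
        // ... use 'name' while the class is pinned ...
        jvmti->Deallocate((unsigned char*)name);
    }
    jni->DeleteGlobalRef(pin);
    return ok;
}
```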
IIUC, the approach to achieve this in HotSpot was to postpone releasing memory after cleanup until a global safepoint happens. At least, this is how it was designed. Since methodIDs are resolved in …
FWIW, I did some in-depth research and I think I was able to find the real culprit.
@jbachorik Thank you for the thorough analysis. So it's indeed a JVM bug. If I have a reliable way to reproduce the crash with the profiler, I'll try to come up with a proper workaround.
I could reproduce the issue using a test case from @jbachorik's PR. |
We recently encountered a SEGV in the JDK's `Method::checked_resolve_jmethod_id`, called via async-profiler's `FrameName::javaMethodName`. We're as yet unsure what caused the corrupt method data but, for what it's worth, the offending `jmethodID` came originally from a JVM TI `GetStackTrace` after an `ObjectSampler` event. Casting in the debugger shows a superficially reasonable `Method*`; the SEGV seems to occur during the evaluation of `o->method_holder()->is_loader_alive()`. What's also possibly of interest is that the frame is almost surely a lambda.

The program being profiled offers many possible suspects, including custom class loaders and byte-code generation, so we have our work cut out for us. While we can reproduce the problem with multiple versions of Java and of async-profiler, we don't have anything self-contained and shareable at this point.
I saw your comments on https://bugs.openjdk.org/browse/JDK-8313816 and openjdk/jdk#15171, about which I share your skepticism. However, I did notice the check in https://github.com/openjdk/jdk/blob/master/src/hotspot/share/jfr/recorder/stacktrace/jfrStackTrace.cpp#L254, which suggests that corrupt method data might be more common than we thought, and that the variously desperate checks in `Method::is_valid_method` might sometimes be helpful in recovering from it. That method is mangled and local to libjvm.so, but if we grab it with `VMStructs::libjvm()->findSymbol("_ZN6Method15is_valid_methodEPKS_")` then, in our case, it does seem to be able to detect and elide the frames that were causing our crash. For obvious reasons, this is not a fully satisfying resolution, but I think it's a worthwhile defensive measure.
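A minimal sketch of that defensive measure, with the symbol lookup abstracted away: since `is_valid_method` is not exported, plain `dlsym` will not find it; `VMStructs::libjvm()->findSymbol` reads the ELF symbol table directly, and `findLibjvmSymbol` below is an assumed stand-in for it.

```cpp
// Screen Method pointers through HotSpot's own sanity checks before
// symbolizing a frame. The mangled name is
// Method::is_valid_method(const Method*), as quoted above.
typedef bool (*is_valid_method_fn)(const void* method);

void* findLibjvmSymbol(const char* mangled_name);  // assumed helper

static bool frameLooksValid(const void* method_ptr) {
    static is_valid_method_fn check = (is_valid_method_fn)
        findLibjvmSymbol("_ZN6Method15is_valid_methodEPKS_");
    // If the symbol is unavailable, keep the original optimistic
    // behavior; otherwise elide frames whose Method* fails the check.
    return check == nullptr || check(method_ptr);
}
```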