
Protecting async-profiler from corrupt Method pointers #831

Closed
pnf opened this issue Oct 27, 2023 · 13 comments

pnf commented Oct 27, 2023

We recently encountered a SEGV in the JDK's Method::checked_resolve_jmethod_id, called via async-profiler's FrameName::javaMethodName.

We're as yet unsure what caused the corrupt method data, but for what it's worth, the offending jmethodID originally came from a JVM TI GetStackTrace call after an ObjectSampler event. Casting in the debugger shows a superficially reasonable Method*; the SEGV seems to occur during the evaluation of o->method_holder()->is_loader_alive(). Also possibly of interest: the frame is almost certainly a lambda.

The program being profiled offers many possible suspects, including custom class loaders and byte-code generation, so we have our work cut out for us. While we can reproduce the problem with multiple versions of java and of async-profiler, we don't have anything self-contained and shareable at this point.

I saw your comments on https://bugs.openjdk.org/browse/JDK-8313816 and openjdk/jdk#15171, about which I share your skepticism. However, I did notice

    if (!Method::is_valid_method(method) || is_full(enqueue_buffer)) {
      // we throw away everything we've gathered in this sample since
      // none of it is safe
      return false;
    }

in https://github.com/openjdk/jdk/blob/master/src/hotspot/share/jfr/recorder/stacktrace/jfrStackTrace.cpp#L254, which suggests that corrupt method data might be more common than we thought, and that the variously desperate checks in is_valid_method might sometimes help in recovering from it. That method's symbol is mangled and local to libjvm.so, but if we grab it with
VMStructs::libjvm()->findSymbol("_ZN6Method15is_valid_methodEPKS_")
then, in our case, it does seem able to detect and elide the frames that were causing our crash. For obvious reasons this is not a fully satisfying resolution, but I think it's a worthwhile defensive measure.
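
For concreteness, the guard we have in mind looks roughly like the sketch below. The stand-in Method type, the function names, and the "[invalid_method]" placeholder are all hypothetical; only the mangled symbol name comes from the findSymbol call above, which would populate the function pointer at startup.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for HotSpot's Method; illustrative only.
struct Method;

// Signature of the private Method::is_valid_method(const Method*) in
// libjvm.so, resolved at runtime via the mangled symbol quoted above.
typedef bool (*is_valid_method_fn)(const Method*);

// Populated once at startup, e.g. from
// VMStructs::libjvm()->findSymbol("_ZN6Method15is_valid_methodEPKS_");
// stays nullptr if the symbol is not found, in which case no filtering happens.
static is_valid_method_fn is_valid_method = nullptr;

// Guarded frame-name lookup: a jmethodID is a Method**, so dereference it
// once and ask the JVM's own heuristics whether the Method* looks sane
// before touching method_holder() and friends.
const char* java_method_name_guarded(void* jmethod_id) {
    Method* m = (jmethod_id != nullptr) ? *(Method**)jmethod_id : nullptr;
    if (is_valid_method != nullptr && !is_valid_method(m)) {
        return "[invalid_method]";   // elide the frame instead of crashing
    }
    // ... fall through to the normal GetMethodName-based resolution ...
    return "resolved";
}
```

When the symbol lookup fails (older or differently built JVMs), the guard degrades to the current behavior rather than rejecting every frame.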


apangin commented Oct 27, 2023

The JFR code you pointed to deals with collection of an async stack trace (i.e. outside a safepoint). However, the JVM TI GetStackTrace function called from an ObjectSampler callback runs at a safepoint and must return valid jmethodIDs. If it doesn't, this is either a serious JVM bug or the initial assumptions are wrong.


pnf commented Oct 28, 2023

It may be a JVM bug to guarantee that jmethodIDs remain valid indefinitely, with "valid" implying that (*((Method**) jmethodID))->method_holder()->is_loader_alive() is safe to invoke via GetMethodName. It seems that there have been similar bugs in the past. In any case, adding the check to javaMethodName seems to be working for us.

I don't have reason to believe this is related, but I was wondering what the ramifications might be of a massive proliferation of unique jmethodIDs for lambdas? We've found it helpful to strip the hexadecimal suffixes so that they collapse properly in flame graphs, but won't otherwise-identical call stacks still have multiple entries in callTraceStorage?
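
For reference, the normalization we apply is sketched below. The exact suffix format varies across JDK versions (e.g. Foo$$Lambda$123/0x00007f... on older JDKs, no counter on newer ones), so treat the pattern as an assumption rather than a spec.

```cpp
#include <cassert>
#include <string>

// Collapse generated lambda class names so otherwise-identical stacks merge
// in flame graphs: "com.example.Foo$$Lambda$123/0x00007f08d4389440"
// becomes "com.example.Foo$$Lambda". Everything after the "$$Lambda" marker
// (the counter and the hexadecimal address) is per-instantiation noise.
std::string strip_lambda_suffix(const std::string& frame) {
    const std::string marker = "$$Lambda";
    size_t pos = frame.find(marker);
    if (pos == std::string::npos) {
        return frame;                         // not a lambda frame, keep as-is
    }
    return frame.substr(0, pos + marker.size());
}
```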


apangin commented Oct 30, 2023

It may be a JVM bug to guarantee that jmethodIDs remain valid indefinitely

The design of jmethodIDs in HotSpot assumes that any JVM TI call on a jmethodID either succeeds or returns one of the JVM TI errors (e.g. JVMTI_ERROR_INVALID_METHODID). If it does not work this way, it is a JVM bug that needs to be fixed in the JVM. It would be helpful if you could share more information on how to reproduce or debug the issue.

Blindly applying a non-portable workaround (relying on private symbols) for an unknown bug that might not even fix the issue is not an option.


apangin commented Oct 30, 2023

We've found it helpful to strip the hexadecimal suffixes to get them to collapse properly in flame graphs

I agree. Please file a separate issue, since it's unrelated to the jmethodID segfault.


pnf commented Oct 31, 2023

The design of jmethodIDs in HotSpot assumes that any JVM TI call on a jmethodID either succeeds or returns one of the JVM TI errors (e.g. JVMTI_ERROR_INVALID_METHODID). If it does not work this way, it is a JVM bug that needs to be fixed in the JVM. It would be helpful if you could share more information on how to reproduce or debug the issue.

Running for a while with our workaround in place, and inserting a detectable error frame whenever it triggers, may give us a more precise idea of how to reproduce this portably. I understand the workaround might not be palatable for general distribution.


jbachorik commented Nov 7, 2023

The design of jmethodIDs in HotSpot assumes that any JVM TI call on a jmethodID either succeeds or returns one of the JVM TI errors (e.g. JVMTI_ERROR_INVALID_METHODID). If it does not work this way, it is a JVM bug that needs to be fixed in the JVM. It would be helpful if you could share more information on how to reproduce or debug the issue.

This is valid only as long as there is an existing strong reference to the class containing the method the jmethodID points to. That is not true for stack traces collected by JVM TI for e.g. allocation profiles: the lifecycle of the jmethodIDs contained in the stack traces can be very different from the lifecycle of the classes containing the methods those jmethodIDs point to.

I have a pretty reliable reproducer (which I unfortunately cannot share, because it is basically a heavily mocking test fixture of one of our internal projects) where the crash due to a messed-up jmethodID is guaranteed within a few minutes.

@pnf I came to this workaround as well. I don't want to cross-post to a potentially competing product (although I don't think it is ...), but one can also use VMStructs to do more exhaustive checks on the method/class/classloader structures used by various JVM TI calls, e.g. to resolve the defining class.

However, all these checks are just band-aids for the fact that a jmethodID is safe to use only while the containing class is strongly referenced (e.g. only within the same C function where you take the stack trace; otherwise you need to create and manage a global reference, which is pretty costly). There are places in the code dealing with class unloading and metaspace cleanup (and, interestingly, CDS) which can blow up when the cleanup happens to run concurrently with your native code. E.g. in Method::deallocate_contents the memory is first deallocated, and only then is the pointer to that memory nulled. I know that source code order does not guarantee execution order, since the compiler might reorder the calls, but that does not make the situation any better. There are similar patterns for InstanceKlass etc.
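
To illustrate the ordering concern (this is not HotSpot code; the names are made up): freeing the backing memory before unpublishing the pointer leaves a window where a concurrent reader loads a dangling Method*. The safe order unpublishes first, with release/acquire semantics, and frees only afterwards.

```cpp
#include <atomic>
#include <cassert>

struct FakeMethod { int id; };                 // stand-in for Method metadata

// The pointer a concurrent profiler thread would read (think jmethodID slot).
std::atomic<FakeMethod*> published{nullptr};

// Hazardous order (as described above for Method::deallocate_contents):
//   delete m;                 // memory gone...
//   published = nullptr;      // ...but still reachable until here
//
// Safe order: atomically unpublish first, so no new reader can obtain the
// pointer, then release the memory. (HotSpot additionally defers the free
// until a safepoint so that in-flight readers drain first.)
void safe_teardown() {
    FakeMethod* m = published.exchange(nullptr, std::memory_order_acq_rel);
    delete m;
}
```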

I confirmed the hypothesis that method and class metadata are partially cleaned up while a jmethodID is being used in JVM TI calls by building a heavily instrumented JDK with many more assertions checking the validity of the pointers the code assumes are fine to use. Lo and behold, the assertions trigger about as often as I used to see the jmethodID-related crashes, so there is definitely something going on there.
Also, when you disable concurrent class unloading and CDS, jmethodID usage becomes stable again.

OK, these are just my 2c, in case this info helps someone.


apangin commented Nov 7, 2023

@jbachorik Thank you for sharing your thoughts. As I pointed out in the JBS issue, this is not how jmethodIDs are supposed to work. The machinery to protect them from concurrent class unloading is there in the HotSpot code; if it does not work as expected, it is obviously a JVM bug. I was not able to reproduce it; if you can, it would be helpful if you provided some debugging details. The proposed workaround does not actually fix the issue, but makes it much harder to reproduce.


jbachorik commented Nov 7, 2023

@apangin I totally agree that the proposed change is just a workaround and will not guarantee we never touch a corrupted jmethodID. The word from the JVM runtime folks (David Holmes, Dan Daugherty) is that this is kind of known but not easy to fix: https://mail.openjdk.org/pipermail/serviceability-dev/2023-June/049711.html (at least that's my understanding of what they are saying).

As for reproducing: I am only able to reproduce when there are really many classes generated on the fly (Mockito mocks), and so far I have only one of our internal projects as a testbed. I might attempt to create a reproducer that mimics what Mockito does but, TBH, this feels like a lot of work for a very slim chance that this will ever get fixed.

TL;DR: the teardown/cleanup would need to be made atomic with respect to the jmethodID/method pointer. The memory must not be released while someone is actively using the jmethodID, and the nulling of the pointer and the cleanup must happen in an exact and predictable order. This would require e.g. an API to atomically create a strong reference to the class containing the method the jmethodID points to, such that the method/class structures remain immutable while we are operating on that particular jmethodID 🤷
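
The "strong reference for the duration of use" idea could look like the RAII sketch below. This is purely hypothetical (HotSpot exposes no such API); it only shows the pin/unload handshake being argued for, and a real implementation would also have to close the race between checking for pins and acquiring one.

```cpp
#include <atomic>
#include <cassert>

// Stand-in for the class metadata a jmethodID transitively depends on.
struct KlassMeta {
    std::atomic<int>  pins{0};        // active users of this metadata
    std::atomic<bool> unloaded{false};
};

// RAII pin: while alive, the metadata may not be torn down.
class MethodPin {
    KlassMeta* k_;
public:
    explicit MethodPin(KlassMeta* k) : k_(k) {
        k_->pins.fetch_add(1, std::memory_order_acquire);
    }
    ~MethodPin() {
        k_->pins.fetch_sub(1, std::memory_order_release);
    }
    // Check after pinning: the pin only blocks *future* teardown.
    bool valid() const { return !k_->unloaded.load(std::memory_order_acquire); }
};

// Teardown succeeds only when no pin is held, making cleanup and use
// mutually exclusive instead of racing.
bool try_unload(KlassMeta* k) {
    if (k->pins.load(std::memory_order_acquire) != 0) return false;
    k->unloaded.store(true, std::memory_order_release);
    return true;
}
```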


apangin commented Nov 7, 2023

the memory must not be released while someone is actively using the jmethodid

IIUC, the approach to achieve this in HotSpot was to postpone releasing memory after cleanup until a global safepoint happens. At least, this is how it was designed. Since jmethodIDs are resolved in the in_vm thread state, the safepoint acts as a barrier preventing concurrent use.

@jbachorik

FWIW, I did some in-depth research and I think I was able to find the real culprit.
I summed it up in my blog post (https://jbachorik.github.io/posts/mysterious-jmethodid), but the TL;DR is that this is caused by redefined/retransformed methods, not by concurrent class unloading as I (incorrectly) assumed.


apangin commented Nov 20, 2023

@jbachorik Thank you for the thorough analysis. So it's indeed a JVM bug.

If I have a reliable way to reproduce the crash with the profiler, I'll try to come up with a proper workaround.

apangin added a commit that referenced this issue Dec 4, 2023

apangin commented Dec 4, 2023

I could reproduce the issue using a test case from @jbachorik's PR.
I pushed a workaround that prevents the crash. Please check if it helps in your case.

apangin closed this as completed Dec 5, 2023