Measure and document overhead #14
I'm most interested in this profiler for a problem microservice that, to respond to a request, calls a sequence of methods reaching a stack depth of about 3000 frames. (Since I've used perf and frame pointers to walk this, perf sees the after-inlining stack, which is about 1000 frames deep.) This is the only microservice we have that exhibits this problem: enabling frame pointers when thousands of frames are frequently traversed adds a single-digit-percentage always-on overhead, which can reach 10%. It's an extreme case; most microservices are < 0.3%. To replicate this problem workload, I tried a simple microbenchmark where I forced the stack depth to be 3000 frames deep. Here it is running while I enabled async-profiler:
So async-profiler costs about a 3% hit while profiling, based on the drop in the "rate" reported by this microbenchmark. (Note that I think this is really profiling at 100 Hertz, despite it saying 1 ms.) That's pretty good. What's most interesting is that this is on-demand, so we avoid the always-on penalty of frame pointers. Using bcc/BPF to measure the time spent in AsyncGetCallTrace():
So between 0.2 and 0.5 ms per AsyncGetCallTrace() call. That's manageable. This is a very simple microbenchmark, and I'm not sure how much worse it will be in production, but so far it looks promising enough to test. There will be extra overhead in production when dumping the profile output, as it must look up symbols from a much larger symbol table.
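As a back-of-the-envelope sanity check (my own arithmetic, not part of the original thread): at 100 Hz, a 0.2–0.5 ms stack walk per sample on a single busy thread works out to 2–5% of that thread's CPU time, which brackets the ~3% rate drop observed above:

```java
// Back-of-the-envelope check: per-sample stack-walk cost at a given
// sampling rate, expressed as a percentage of one busy thread's CPU time.
public class OverheadEstimate {
    static double overheadPercent(double sampleHz, double walkMillis) {
        // seconds spent walking stacks per second of wall clock, per thread
        double walkSecondsPerSecond = sampleHz * (walkMillis / 1000.0);
        return walkSecondsPerSecond * 100.0;
    }

    public static void main(String[] args) {
        // 100 Hz sampling with 0.2-0.5 ms per AsyncGetCallTrace() walk
        System.out.printf("low:  %.1f%%%n", overheadPercent(100, 0.2)); // 2.0%
        System.out.printf("high: %.1f%%%n", overheadPercent(100, 0.5)); // 5.0%
    }
}
```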
A good analysis! Brendan, would you mind sharing the benchmark, so it could serve as a performance regression test in further development?
It's pretty dumb, but sure. BurnerRecursive.java:

```java
/*
 * BurnerRecursive - simple microbenchmark that burns CPU at a deep stack depth.
 *
 * This prints the rate of a simple operation that is performed at a deep
 * stack depth (arrived at via recursion), so that the overhead of profilers
 * that walk the stack can be studied.
 *
 * 09-Apr-2017 Brendan Gregg Created this.
 */
class BurnerThread extends Thread {
    // volatile so the main thread sees a current value without synchronization
    private volatile int myj;

    BurnerThread() {
        myj = 0;
    }

    public void dive(int depth) {
        // recurse until we are 3000 frames deep, then burn CPU there;
        // the deepest frame never returns
        if (depth < 3000) {
            dive(depth + 1);
        }
        for (;;) {
            myj++;
            // burn CPU; the counter is read by another thread, so the
            // compiler can't toss this out entirely:
            for (int i = 0; i < 1000000; i++) { }
        }
    }

    public void run() {
        this.dive(0);
    }

    public int get() {
        return myj;
    }
}

public class BurnerRecursive {
    public static void main(String[] args) {
        int j = 0, lastj = 0;
        BurnerThread t1 = new BurnerThread();
        t1.start();
        for (;;) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) { }
            j = t1.get();
            System.out.println("rate: " + (j - lastj));
            lastj = j;
        }
    }
}
```
Using async-profiler for measuring cold-start performance of JVM applications, I noticed that async-profiler itself might introduce significant overhead (I've seen 10%), which seems related to class loading. It has stacks like this (screenshot omitted).
Yes, I noticed that as well. The same problem applies to honest-profiler, too, and generally to all profilers that rely on AGCT. Unfortunately, there are already too many problems with AGCT, so I'm starting to think about how to get rid of this call altogether.
This is valuable analysis, so I'm reluctant to close the issue and lower its visibility. I think the right thing to do is to convert it to a wiki article.
I'm experiencing ~5x degradation in startup times as well when attempting to profile a monolithic application that does a significant amount of class loading. Running perf on the VM while async-profiler is enabled, I see the following:

```
71.49% 0.00% java libasyncProfiler.so [.] VM::loadMethodIDs
```

71% of the VM's time is spent in libasyncProfiler.so's VM::loadMethodIDs code, which eventually calls down to Method::make_jmethod_id:

```cpp
jmethodID Method::make_jmethod_id(ClassLoaderData* loader_data, Method* m) {
  ClassLoaderData* cld = loader_data;
  if (!SafepointSynchronize::is_at_safepoint()) {
    // Have to add jmethod_ids() to class loader data thread-safely.
    // Also have to add the method to the list safely, which the cld lock
    // protects as well.
    MutexLockerEx ml(cld->metaspace_lock(), Mutex::_no_safepoint_check_flag);
    if (cld->jmethod_ids() == NULL) {
      cld->set_jmethod_ids(new JNIMethodBlock());
    }
    // jmethodID is a pointer to Method*
    return (jmethodID)cld->jmethod_ids()->add_method(m);
  } else {
    // At safepoint, we are single threaded and can set this.
    if (cld->jmethod_ids() == NULL) {
      cld->set_jmethod_ids(new JNIMethodBlock());
    }
    // jmethodID is a pointer to Method*
    return (jmethodID)cld->jmethod_ids()->add_method(m);
  }
}
```

I'm not too familiar with make_jmethod_id, but it looks like it holds a lock to protect metadata writes to the class loader. I haven't worked out whether the issue is the lock being under heavy contention, the time the lock is held in the critical section being too high, or both. What are thoughts on a solution going forward? Should we be focused on #66, or look at making changes to the VM to reduce the impact of locking on the class loader data structure (improve time spent in the critical section, go lock-free, etc.)?
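To reproduce this kind of slowdown in isolation, a small harness can define the same class bytes under many fresh class loaders; each definition creates a new Class whose methods the JVM must register, which is the path an attached AGCT-based profiler drives through loadMethodIDs. This is my own sketch, not a tool from the thread, and the class and method names are made up:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical harness: repeatedly define the same class under fresh
// class loaders, then compare wall time with and without a profiler
// agent attached to expose jmethodID allocation overhead.
public class ClassLoadBench {
    static class DefiningLoader extends ClassLoader {
        DefiningLoader() { super(null); } // no parent: always define our own copy
        Class<?> define(String name, byte[] b) {
            return defineClass(name, b, 0, b.length);
        }
    }

    // Read a class's bytecode back out of the classpath
    static byte[] classBytes(Class<?> c) throws IOException {
        String res = "/" + c.getName().replace('.', '/') + ".class";
        try (InputStream in = c.getResourceAsStream(res)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) > 0; ) out.write(buf, 0, n);
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws Exception {
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 20000;
        byte[] bytes = classBytes(ClassLoadBench.class);
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            // each fresh loader yields a distinct Class with fresh Method* entries
            new DefiningLoader().define(ClassLoadBench.class.getName(), bytes);
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(iterations + " class definitions in " + ms + " ms");
    }
}
```

Running this once plain and once with the agent preloaded should make the class-loading penalty directly visible as a difference in elapsed time.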
@toaler Right. While fixing OpenJDK is probably the right way to go in the long term, I believe async-profiler needs to address this issue for current and previous versions of the JDK. I'll see if I can make a workaround. But ultimately #66 should solve this problem entirely.
@apangin Excuse my ignorance: other than the locking, what is the issue with make_jmethod_id? What are your thoughts about batch jmethodID allocation as a workaround?
@toaler [partially lost explanation of the cost for each method being added] The workaround could be to preallocate the jmethodID cache at class-loading time. On the other hand, if patching OpenJDK is not a problem for you, the easiest fix would be to replace [snippet omitted] with [snippet omitted] in method.cpp.
@apangin I think it's worth calling that out in the documentation. We observed a 60x slow-down in class-loading times, from 5 seconds to 5 minutes, in our use case of ~50k classes (most IDL-generated with a lot of methods), by accidentally preloading async-profiler as part of our troubleshooting suite. We're on OpenJDK 8u252.
@alexeykudinkin This is a JVM bug, JDK-8062116. It has been fixed only in JDK 9. However, I made a workaround in async-profiler: check the jmethodid branch. For more details, see #328, where you'll also find prebuilt binaries with the fix.
@apangin awesome! Thanks for putting up the workaround! Any plans to merge into master? This might be silently affecting everyone who's actively loading async-profiler into their applications with large number of classes/methods. |
Yes, the fix will likely get into the nearest release. |
Are there any updates or guidance about the overhead of recent versions of async-profiler?
The overhead of the profiler should be measured and documented. We should test CPU-bound code as well as I/O-bound code, with varying stack depths (up to 1000s of frames).
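For such a test matrix, the BurnerRecursive benchmark above could be generalized to take the stack depth as a command-line parameter and swept across depths. This is a sketch of mine, not an existing tool; the class and method names are made up:

```java
// Sketch: BurnerRecursive generalized so the stack depth is a parameter,
// making it easy to sweep depths (e.g. 10, 100, 1000, 3000) while
// measuring profiler overhead from the reported rate.
class DepthBurner extends Thread {
    private final int targetDepth;
    private volatile int ops; // read by the main thread each second

    DepthBurner(int targetDepth) {
        this.targetDepth = targetDepth;
    }

    private void dive(int depth) {
        // recurse to the target depth; the deepest frame never returns
        if (depth < targetDepth) {
            dive(depth + 1);
        }
        for (;;) {
            ops++;
            for (int i = 0; i < 1_000_000; i++) { } // burn CPU at depth
        }
    }

    public void run() { dive(0); }

    public int ops() { return ops; }
}

public class BurnerAtDepth {
    public static void main(String[] args) throws InterruptedException {
        int depth = args.length > 0 ? Integer.parseInt(args[0]) : 3000;
        DepthBurner burner = new DepthBurner(depth);
        burner.setDaemon(true);
        burner.start();
        int last = 0;
        for (;;) {
            Thread.sleep(1000);
            int now = burner.ops();
            System.out.println("depth " + depth + " rate: " + (now - last));
            last = now;
        }
    }
}
```

Comparing the steady-state rate with and without the profiler attached, across depths, would give the overhead-vs-depth curve this issue asks for.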