
Measure and document overhead #14

Open
goldshtn opened this issue Apr 9, 2017 · 15 comments

@goldshtn
Contributor

goldshtn commented Apr 9, 2017

The overhead of the profiler should be measured and documented. We should test CPU-bound code as well as I/O-bound code, with varying stack depths (up to 1000s of frames).
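
As a starting point, a harness along the following lines could drive both cases at a chosen stack depth and report a work rate to compare with and without the profiler attached. This is only a rough sketch; the class name, the depth/mode arguments, and the /etc/hostname file used for the I/O-bound path are illustrative, not an existing test.

// DeepStackHarness.java - hypothetical sketch of an overhead-measurement harness:
// run either CPU-bound or I/O-bound work at a configurable stack depth and
// print the achieved rate once per second.
import java.io.FileInputStream;
import java.io.IOException;

public class DeepStackHarness {
    private static volatile long iterations; // completed work units, read by the reporter
    private static volatile long sink;       // keeps the JIT from eliminating the CPU loop

    // Recurse to the requested depth, then loop on the chosen workload forever.
    static void work(int depth, int targetDepth, boolean ioBound) throws IOException {
        if (depth < targetDepth) {
            work(depth + 1, targetDepth, ioBound);
            return;
        }
        byte[] buf = new byte[4096];
        while (true) {
            if (ioBound) {
                // I/O-bound unit: re-read a small file.
                try (FileInputStream in = new FileInputStream("/etc/hostname")) {
                    while (in.read(buf) != -1) { /* discard */ }
                }
            } else {
                // CPU-bound unit: spin on arithmetic.
                long acc = 0;
                for (int i = 0; i < 1_000_000; i++) { acc += i; }
                sink = acc;
            }
            iterations++;
        }
    }

    public static void main(String[] args) throws Exception {
        int targetDepth = args.length > 0 ? Integer.parseInt(args[0]) : 3000;
        boolean ioBound = args.length > 1 && args[1].equals("io");
        Thread worker = new Thread(() -> {
            try { work(0, targetDepth, ioBound); } catch (IOException e) { e.printStackTrace(); }
        });
        worker.setDaemon(true);
        worker.start();

        long last = 0;
        while (true) {
            Thread.sleep(1000);
            long now = iterations;
            System.out.println("rate: " + (now - last)); // work units per second
            last = now;
        }
    }
}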

@brendangregg

I'm most interested in this profiler for a problem microservice that, to respond to a request, calls a sequence of methods reaching a stack depth of about 3000 frames. (I've used perf and frame pointers to walk this; perf sees the after-inlining stack, which is about 1000 frames deep.) This is the only microservice we have that exhibits this problem, and enabling frame pointers when thousands of frames are frequently traversed adds an always-on overhead in the single-digit percentages, as much as 10% here. It's an extreme case; most microservices are < 0.3%.

To replicate this problem workload, I tried a simple microbenchmark where I forced the stack depth to be 3000 frames. Here it is running while I enabled the async profiler:

# time /usr/lib/jvm/java-8-oracle-1.8.0.112/bin/java BurnerRecursive
rate: 1029
rate: 1031
rate: 1031
rate: 1031
rate: 1031
rate: 1019
Profiling started with interval 1 ms
rate: 1003
rate: 1000
rate: 1002
rate: 1001
rate: 1001
rate: 1001
rate: 1001
rate: 1002
rate: 1000
rate: 1002
rate: 1001
rate: 1001
rate: 1001
rate: 1001
rate: 1001
rate: 987
rate: 1000
rate: 1000
rate: 968
rate: 1000
rate: 1001
rate: 1001
rate: 1001
rate: 1001
rate: 1001
rate: 1000
Profiling stopped
Frame buffer usage 3002/1048576=0.286293%
Dumping raw traces to /tmp/traces6.txt
rate: 1029
rate: 1031
rate: 1031
^C
real    0m35.759s
user    0m35.772s
sys    0m0.080s

So async-profiler costs about a 3% hit while profiling, based on the drop in the "rate" of this microbenchmark: roughly 1031 before profiling vs. roughly 1001 during, or (1031 − 1001) / 1031 ≈ 2.9%. (Note that I think this is really profiling at 100 Hertz, despite it saying 1 ms.) That's pretty good. What's most interesting is that this is on-demand, so we avoid the always-on penalty of frame pointers.

Using bcc/BPF to measure the time in AsyncGetCallTrace():

# ./funclatency.py '/usr/lib/jvm/java-8-oracle-1.8.0.112/jre/lib/amd64/server/libjvm.so:AsyncGetCallTrace'
Tracing 1 functions for "/usr/lib/jvm/java-8-oracle-1.8.0.112/jre/lib/amd64/server/libjvm.so:AsyncGetCallTrace"... Hit Ctrl-C to end.
^C

Function = AsyncGetCallTrace [17250]
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 307      |****************************************|
Detaching...

So between 0.26 and 0.52 ms per AsyncGetCallTrace() call; across the 307 calls captured here, that's at most about 0.16 s of CPU time in total. That's manageable.

This is a very simple microbenchmark, and I'm not sure how much worse it will be in production, but so far it looks promising enough to test. There will be extra overhead in production when dumping the profile output, as it must look up symbols in a much larger symbol table.

@apangin
Collaborator

apangin commented Apr 10, 2017

A good analysis! Brendan, would you mind sharing the benchmark, so it could serve as a performance regression test in further development?

@brendangregg

It's pretty dumb, but sure. BurnerRecursive.java:

/*
 * BurnerRecursive - simple microbenchmark that burns CPU at a deep stack depth.
 *
 * This prints the rate of a simple operation that is performed at a deep
 * stack depth (arrived via recursion), so that the overhead of profilers that
 * walk the stack can be studied.
 *
 * 09-Apr-2017	Brendan Gregg	Created this.
 */
class BurnerThread extends Thread {
	private Thread t;
	private int myj;
	
	BurnerThread() {
		myj = 0;
	}
	
	public void dive(int depth) {
		if (depth < 3000) {
			depth++;
			dive(depth);
		}
		for (;;) {
			myj++;
			// burn CPU. check the compiler doesn't toss this out:
			for (int i=0; i < 1000000; i++) { }
		}
	}

	public void run() {
		this.dive(0);
	}
	
	public int get() {
		return myj;
	}
	
}

public class BurnerRecursive {
	public static void main(String args[]) {
		int j = 0, lastj = 0;
		BurnerThread T1 = new BurnerThread();
		T1.start();

		for (;;) {
			try {
				Thread.sleep(1000);
			} catch (Exception e) { }
			j = T1.get();
			System.out.println("rate: " + (j - lastj));
			lastj = j;
		}
	}
	
}

@jrudolph

While using async-profiler to measure cold-start performance of JVM applications, I noticed that async-profiler itself can introduce significant overhead (I've seen 10%), which seems related to class loading. It produces stacks like this (VM::loadMethodIDs appears to come from async-profiler):

InterpreterRuntime::resolve_get_put(JavaThread*, Bytecodes::Code)
LinkResolver::resolve_field_access(fieldDescriptor&, constantPoolHandle, int, Bytecodes::Code, Thread*)
LinkResolver::resolve_field(fieldDescriptor&, KlassHandle, Symbol*, Symbol*, KlassHandle, Bytecodes::Code, bool, bool, Thread*)
InstanceKlass::initialize(Thread*)
InstanceKlass::initialize_impl(instanceKlassHandle, Thread*)
InstanceKlass::link_class_impl(instanceKlassHandle, bool, Thread*)
VM::loadMethodIDs(_jvmtiEnv*, _jclass*)
jvmti_GetClassMethods
JvmtiEnv::GetClassMethods(oopDesc*, int*, _jmethodID***)
InstanceKlass::get_jmethod_id(instanceKlassHandle, methodHandle)
Method::make_jmethod_id(ClassLoaderData*, Method*)

@apangin
Collaborator

apangin commented Oct 30, 2017

Yes, I noticed that as well. AsyncGetCallTrace requires that jmethodIDs for all methods in the stack trace are available, so I have to preallocate jmethodIDs at class loading time, and this seems to take a long time.

The same problem applies to honest-profiler, too, and generally to all profilers that rely on AGCT. Unfortunately, there are already too many problems with AGCT, so I'm starting to think about how to get rid of this call altogether.
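
For reference, the preallocation visible in the stack traces above (VM::loadMethodIDs calling jvmti_GetClassMethods) boils down to requesting a class's methods from JVMTI when the class is prepared, since GetClassMethods creates a jmethodID for each method as a side effect. A minimal standalone sketch of that idea, not the actual async-profiler code (the agent and callback names are illustrative):

// PreallocAgent.cpp - sketch of preallocating jmethodIDs at class load time
// (illustrative only; not the async-profiler implementation).
#include <jvmti.h>

// On every ClassPrepare event, ask JVMTI for the class's methods. The IDs
// themselves are not needed here; creating them is the point, so that
// AsyncGetCallTrace can later resolve frames in these methods.
static void JNICALL ClassPrepare(jvmtiEnv* jvmti, JNIEnv* jni,
                                 jthread thread, jclass klass) {
    jint count = 0;
    jmethodID* methods = NULL;
    if (jvmti->GetClassMethods(klass, &count, &methods) == JVMTI_ERROR_NONE) {
        jvmti->Deallocate((unsigned char*) methods);
    }
}

JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM* vm, char* options, void* reserved) {
    jvmtiEnv* jvmti = NULL;
    if (vm->GetEnv((void**) &jvmti, JVMTI_VERSION_1_0) != JNI_OK) {
        return JNI_ERR;
    }
    jvmtiEventCallbacks callbacks = {0};
    callbacks.ClassPrepare = ClassPrepare;
    jvmti->SetEventCallbacks(&callbacks, sizeof(callbacks));
    jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_CLASS_PREPARE, NULL);
    return JNI_OK;
}

Loaded with -agentpath:, an agent like this moves all jmethodID creation to class-prepare time, which is exactly the work that shows up as VM::loadMethodIDs in the profiles in this thread.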

@nitsanw
Contributor

nitsanw commented Apr 3, 2018

This is valuable analysis, so I'm reluctant to close the issue and lower its visibility. I think the right thing to do is to convert it to a wiki article.

@toaler

toaler commented Apr 25, 2018

I'm experiencing a ~5x degradation in startup times as well when attempting to profile a monolithic application that does a significant amount of class loading.

Running perf on the VM while async-profiler is enabled, I see the following:

71.49% 0.00% java libasyncProfiler.so [.] VM::loadMethodIDs

  • VM::loadMethodIDs
    • 71.49% jvmti_GetClassMethods
      • 71.49% JvmtiEnv::GetClassMethods
        • 71.49% InstanceKlass::get_jmethod_id
          • 68.81% Method::make_jmethod_id
            • 3.89% apic_timer_interrupt
              0.86% pthread_mutex_lock
            • 0.00% irq_work_interrupt
            • 0.00% Monitor::lock_without_safepoint_check
              0.00% Method::make_jmethod_id
            • 0.00% retint_user
            • 0.00% return_from_stub
            • 0.00% pthread_cond_signal@@GLIBC_2.3.2
            • 0.00% int_ret_from_sys_call
              0.00% entry_SYSCALL_64_after_swapgs
            • 0.00% call_function_single_interrupt
              0.00% native_sched_clock
              0.00% native_write_msr_safe
          • 2.69% PerfEvents::signalHandler
          • 0.00% pthread_cond_signal@@GLIBC_2.3.2
          • 0.00% 0x1051d
          • 0.00% __restore_rt
            0.00% native_irq_return_iret
            0.00% pthread_mutex_unlock@plt
            0.00% pthread_getspecific

71% of the VM's time is spent in libasyncProfiler.so's VM::loadMethodIDs code, which eventually calls down to Method::make_jmethod_id.

jmethodID Method::make_jmethod_id(ClassLoaderData* loader_data, Method* m) {
  ClassLoaderData* cld = loader_data;

  if (!SafepointSynchronize::is_at_safepoint()) {
    // Have to add jmethod_ids() to class loader data thread-safely.
    // Also have to add the method to the list safely, which the cld lock
    // protects as well.
    MutexLockerEx ml(cld->metaspace_lock(),  Mutex::_no_safepoint_check_flag);
    if (cld->jmethod_ids() == NULL) {
      cld->set_jmethod_ids(new JNIMethodBlock());
    }
    // jmethodID is a pointer to Method*
    return (jmethodID)cld->jmethod_ids()->add_method(m);
  } else {
    // At safepoint, we are single threaded and can set this.
    if (cld->jmethod_ids() == NULL) {
      cld->set_jmethod_ids(new JNIMethodBlock());
    }
    // jmethodID is a pointer to Method*
    return (jmethodID)cld->jmethod_ids()->add_method(m);
  }
}

I'm not too familiar with make_jmethod_id, but it looks like it holds a lock to protect metadata writes to the class loader. I haven't worked out whether the issue is the lock being under heavy contention and/or the time the lock is held in the critical section being too high.

What are your thoughts on a solution going forward? Should we focus on #66, or look at making changes to the VM to reduce the impact of locking on the class loader data structure (reduce time spent in the critical section, go lock-free, etc.)?

@apangin
Collaborator

apangin commented Apr 26, 2018

@toaler Right, make_jmethod_id is implemented suboptimally in the JDK, and it is not just a locking problem. This could be improved, at least for the case of batch jmethodID allocation at class loading time.

While fixing OpenJDK is probably the right way to go in the long term, I believe async-profiler needs to address this issue for current and previous versions of the JDK. I'll see if I can make a workaround. But ultimately #66 should solve this problem entirely.

@toaler

toaler commented Apr 27, 2018

@apangin Excuse my ignorance: other than the locking, what is the issue with make_jmethod_id?

What are your thoughts about batch jmethodID allocation and a workaround?

@apangin
Collaborator

apangin commented Apr 27, 2018

@toaler For each method being added, make_jmethod_id linearly walks a linked list of JNIMethodBlocks to find a free slot, so the cost of adding each new ID grows with the number of IDs already allocated, i.e. roughly quadratic overall. This can be a problem if a class has a lot of methods.

A workaround could be to preallocate the jmethodID cache at class loading time.

On the other hand, if patching OpenJDK is not a problem for you, the easiest fix would be to replace

  enum { number_of_methods = 8 };

with

  enum { number_of_methods = 64 };

in method.cpp.

@alexeykudinkin

@apangin I think it's worth calling that out in the documentation. We observed a 60x slowdown in class-loading time (from 5 seconds to 5 minutes) in our use case, which has ~50k classes, most of them IDL-generated with a lot of methods, after accidentally preloading async-profiler as part of our troubleshooting suite.

We're on OpenJDK 8u252.

@apangin
Collaborator

apangin commented Jun 27, 2020

@alexeykudinkin This is a JVM bug, JDK-8062116, and it was fixed only in JDK 9. However, I made a workaround in async-profiler; check the jmethodid branch.

For more details, see #328. There you'll also find the prebuilt binaries with the fix.

@alexeykudinkin

@apangin awesome! Thanks for putting up the workaround!

Any plans to merge it into master? This might be silently affecting everyone who's actively loading async-profiler into applications with a large number of classes/methods.

@apangin
Collaborator

apangin commented Jun 27, 2020

Yes, the fix will likely make it into the next release.

@jumarko

jumarko commented Dec 20, 2021

Are there any updates or guidance about the overhead of (recent versions of) async-profiler?
Is there any documentation that discusses this topic?
Thanks!
