
Measure and document overhead #14

Open
goldshtn opened this issue Apr 9, 2017 · 15 comments

@goldshtn
Contributor

goldshtn commented Apr 9, 2017

The overhead of the profiler should be measured and documented. We should test CPU-bound code as well as I/O-bound code, with varying stack depths (up to 1000s of frames).
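
As a starting point, a harness along the following lines could drive both cases at a chosen stack depth and report a work rate to compare with and without the profiler attached. This is only a rough sketch; the class name, the depth/mode arguments, and the /etc/hostname file used for the I/O-bound path are illustrative, not an existing test.

// DeepStackHarness.java - hypothetical sketch of an overhead-measurement harness:
// run either CPU-bound or I/O-bound work at a configurable stack depth and
// print the achieved rate once per second.
import java.io.FileInputStream;
import java.io.IOException;

public class DeepStackHarness {
    private static volatile long iterations; // completed work units, read by the reporter
    private static volatile long sink;       // keeps the JIT from eliminating the CPU loop

    // Recurse to the requested depth, then loop on the chosen workload forever.
    static void work(int depth, int targetDepth, boolean ioBound) throws IOException {
        if (depth < targetDepth) {
            work(depth + 1, targetDepth, ioBound);
            return;
        }
        byte[] buf = new byte[4096];
        while (true) {
            if (ioBound) {
                // I/O-bound unit: re-read a small file.
                try (FileInputStream in = new FileInputStream("/etc/hostname")) {
                    while (in.read(buf) != -1) { /* discard */ }
                }
            } else {
                // CPU-bound unit: spin on arithmetic.
                long acc = 0;
                for (int i = 0; i < 1_000_000; i++) { acc += i; }
                sink = acc;
            }
            iterations++;
        }
    }

    public static void main(String[] args) throws Exception {
        int targetDepth = args.length > 0 ? Integer.parseInt(args[0]) : 3000;
        boolean ioBound = args.length > 1 && args[1].equals("io");
        Thread worker = new Thread(() -> {
            try { work(0, targetDepth, ioBound); } catch (IOException e) { e.printStackTrace(); }
        });
        worker.setDaemon(true);
        worker.start();

        long last = 0;
        while (true) {
            Thread.sleep(1000);
            long now = iterations;
            System.out.println("rate: " + (now - last)); // work units per second
            last = now;
        }
    }
}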

@brendangregg

I'm most interested in this profiler for a problem microservice that, to respond to a request, calls a sequence of methods reaching a stack depth of about 3000 frames. (I've used perf and frame pointers to walk this; perf sees the after-inlining stack, which is about 1000 frames deep.) This is the only microservice we have that exhibits this problem, and enabling frame pointers when thousands of frames are frequently traversed adds an always-on overhead in the single-digit percentages, as much as 10% here. It's an extreme case; most microservices are < 0.3%.

To replicate this problem workload, I tried a simple microbenchmark where I forced the stack depth to be 3000 frames. Here it is running while I enabled the async profiler:

# time /usr/lib/jvm/java-8-oracle-1.8.0.112/bin/java BurnerRecursive
rate: 1029
rate: 1031
rate: 1031
rate: 1031
rate: 1031
rate: 1019
Profiling started with interval 1 ms
rate: 1003
rate: 1000
rate: 1002
rate: 1001
rate: 1001
rate: 1001
rate: 1001
rate: 1002
rate: 1000
rate: 1002
rate: 1001
rate: 1001
rate: 1001
rate: 1001
rate: 1001
rate: 987
rate: 1000
rate: 1000
rate: 968
rate: 1000
rate: 1001
rate: 1001
rate: 1001
rate: 1001
rate: 1001
rate: 1000
Profiling stopped
Frame buffer usage 3002/1048576=0.286293%
Dumping raw traces to /tmp/traces6.txt
rate: 1029
rate: 1031
rate: 1031
^C
real    0m35.759s
user    0m35.772s
sys    0m0.080s

So async-profiler costs about a 3% hit while profiling, based on the drop in the "rate" of this microbenchmark: roughly 1031 before profiling vs. roughly 1001 during, or (1031 − 1001) / 1031 ≈ 2.9%. (Note that I think this is really profiling at 100 Hertz, despite it saying 1 ms.) That's pretty good. What's most interesting is that this is on-demand, so we avoid the always-on penalty of frame pointers.

Using bcc/BPF to measure the time in AsyncGetCallTrace():

# ./funclatency.py '/usr/lib/jvm/java-8-oracle-1.8.0.112/jre/lib/amd64/server/libjvm.so:AsyncGetCallTrace'
Tracing 1 functions for "/usr/lib/jvm/java-8-oracle-1.8.0.112/jre/lib/amd64/server/libjvm.so:AsyncGetCallTrace"... Hit Ctrl-C to end.
^C

Function = AsyncGetCallTrace [17250]
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 307      |****************************************|
Detaching...

So between 0.26 and 0.52 ms per AsyncGetCallTrace() call; across the 307 calls captured here, that's at most about 0.16 s of CPU time in total. That's manageable.

This is a very simple microbenchmark, and I'm not sure how much worse it will be in production, but so far it looks promising enough to test. There will be extra overhead in production when dumping the profile output, as it must look up symbols in a much larger symbol table.

@apangin
Collaborator

apangin commented Apr 10, 2017

A good analysis! Brendan, would you mind sharing the benchmark, so it could serve as a performance regression test in further development?

@brendangregg

It's pretty dumb, but sure. BurnerRecursive.java:

/*
 * BurnerRecursive - simple microbenchmark that burns CPU at a deep stack depth.
 *
 * This prints the rate of a simple operation that is performed at a deep
 * stack depth (arrived via recursion), so that the overhead of profilers that
 * walk the stack can be studied.
 *
 * 09-Apr-2017	Brendan Gregg	Created this.
 */
class BurnerThread extends Thread {
	private Thread t;
	private int myj;
	
	BurnerThread() {
		myj = 0;
	}
	
	public void dive(int depth) {
		if (depth < 3000) {
			depth++;
			dive(depth);
		}
		for (;;) {
			myj++;
			// burn CPU. check the compiler doesn't toss this out:
			for (int i=0; i < 1000000; i++) { }
		}
	}

	public void run() {
		this.dive(0);
	}
	
	public int get() {
		return myj;
	}
	
}

public class BurnerRecursive {
	public static void main(String args[]) {
		int j = 0, lastj = 0;
		BurnerThread T1 = new BurnerThread();
		T1.start();

		for (;;) {
			try {
				Thread.sleep(1000);
			} catch (Exception e) { }
			j = T1.get();
			System.out.println("rate: " + (j - lastj));
			lastj = j;
		}
	}
	
}

@jrudolph

While using async-profiler to measure cold-start performance of JVM applications, I noticed that async-profiler itself can introduce significant overhead (I've seen 10%), which seems related to class loading. It produces stacks like this (VM::loadMethodIDs appears to come from async-profiler):

InterpreterRuntime::resolve_get_put(JavaThread*, Bytecodes::Code)
LinkResolver::resolve_field_access(fieldDescriptor&, constantPoolHandle, int, Bytecodes::Code, Thread*)
LinkResolver::resolve_field(fieldDescriptor&, KlassHandle, Symbol*, Symbol*, KlassHandle, Bytecodes::Code, bool, bool, Thread*)
InstanceKlass::initialize(Thread*)
InstanceKlass::initialize_impl(instanceKlassHandle, Thread*)
InstanceKlass::link_class_impl(instanceKlassHandle, bool, Thread*)
VM::loadMethodIDs(_jvmtiEnv*, _jclass*)
jvmti_GetClassMethods
JvmtiEnv::GetClassMethods(oopDesc*, int*, _jmethodID***)
InstanceKlass::get_jmethod_id(instanceKlassHandle, methodHandle)
Method::make_jmethod_id(ClassLoaderData*, Method*)

@apangin
Collaborator

apangin commented Oct 30, 2017

Yes, I noticed that as well. AsyncGetCallTrace requires that jmethodIDs for all methods in the stack trace are available, so I have to preallocate jmethodIDs at class loading time, and this seems to take a long time.

The same problem applies to honest-profiler, too, and generally to all profilers that rely on AGCT. Unfortunately, there are already too many problems with AGCT, so I'm starting to think about how to get rid of this call altogether.
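
For reference, the preallocation visible in the stack traces above (VM::loadMethodIDs calling jvmti_GetClassMethods) boils down to requesting a class's methods from JVMTI when the class is prepared, since GetClassMethods creates a jmethodID for each method as a side effect. A minimal standalone sketch of that idea, not the actual async-profiler code (the agent and callback names are illustrative):

// PreallocAgent.cpp - sketch of preallocating jmethodIDs at class load time
// (illustrative only; not the async-profiler implementation).
#include <jvmti.h>

// On every ClassPrepare event, ask JVMTI for the class's methods. The IDs
// themselves are not needed here; creating them is the point, so that
// AsyncGetCallTrace can later resolve frames in these methods.
static void JNICALL ClassPrepare(jvmtiEnv* jvmti, JNIEnv* jni,
                                 jthread thread, jclass klass) {
    jint count = 0;
    jmethodID* methods = NULL;
    if (jvmti->GetClassMethods(klass, &count, &methods) == JVMTI_ERROR_NONE) {
        jvmti->Deallocate((unsigned char*) methods);
    }
}

JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM* vm, char* options, void* reserved) {
    jvmtiEnv* jvmti = NULL;
    if (vm->GetEnv((void**) &jvmti, JVMTI_VERSION_1_0) != JNI_OK) {
        return JNI_ERR;
    }
    jvmtiEventCallbacks callbacks = {0};
    callbacks.ClassPrepare = ClassPrepare;
    jvmti->SetEventCallbacks(&callbacks, sizeof(callbacks));
    jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_CLASS_PREPARE, NULL);
    return JNI_OK;
}

Loaded with -agentpath:, an agent like this moves all jmethodID creation to class-prepare time, which is exactly the work that shows up as VM::loadMethodIDs in the profiles in this thread.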

@nitsanw
Contributor

nitsanw commented Apr 3, 2018

This is valuable analysis, so I'm reluctant to close the issue and lower its visibility. I think the right thing to do is to convert it to a wiki article.

@toaler

toaler commented Apr 25, 2018

I'm experiencing a ~5x degradation in startup times as well when attempting to profile a monolithic application that does a significant amount of class loading.

Running perf on the VM while async-profiler is enabled, I see the following:

71.49% 0.00% java libasyncProfiler.so [.] VM::loadMethodIDs

  • VM::loadMethodIDs
    • 71.49% jvmti_GetClassMethods
      • 71.49% JvmtiEnv::GetClassMethods
        • 71.49% InstanceKlass::get_jmethod_id
          • 68.81% Method::make_jmethod_id
            • 3.89% apic_timer_interrupt
              0.86% pthread_mutex_lock
            • 0.00% irq_work_interrupt
            • 0.00% Monitor::lock_without_safepoint_check
              0.00% Method::make_jmethod_id
            • 0.00% retint_user
            • 0.00% return_from_stub
            • 0.00% pthread_cond_signal@@GLIBC_2.3.2
            • 0.00% int_ret_from_sys_call
              0.00% entry_SYSCALL_64_after_swapgs
            • 0.00% call_function_single_interrupt
              0.00% native_sched_clock
              0.00% native_write_msr_safe
          • 2.69% PerfEvents::signalHandler
          • 0.00% pthread_cond_signal@@GLIBC_2.3.2
          • 0.00% 0x1051d
          • 0.00% __restore_rt
            0.00% native_irq_return_iret
            0.00% pthread_mutex_unlock@plt
            0.00% pthread_getspecific

71% of the VM's time is spent in libasyncProfiler.so's VM::loadMethodIDs code, which eventually calls down to Method::make_jmethod_id.

jmethodID Method::make_jmethod_id(ClassLoaderData* loader_data, Method* m) {
  ClassLoaderData* cld = loader_data;

  if (!SafepointSynchronize::is_at_safepoint()) {
    // Have to add jmethod_ids() to class loader data thread-safely.
    // Also have to add the method to the list safely, which the cld lock
    // protects as well.
    MutexLockerEx ml(cld->metaspace_lock(),  Mutex::_no_safepoint_check_flag);
    if (cld->jmethod_ids() == NULL) {
      cld->set_jmethod_ids(new JNIMethodBlock());
    }
    // jmethodID is a pointer to Method*
    return (jmethodID)cld->jmethod_ids()->add_method(m);
  } else {
    // At safepoint, we are single threaded and can set this.
    if (cld->jmethod_ids() == NULL) {
      cld->set_jmethod_ids(new JNIMethodBlock());
    }
    // jmethodID is a pointer to Method*
    return (jmethodID)cld->jmethod_ids()->add_method(m);
  }
}

I'm not too familiar with make_jmethod_id, but it looks like it holds a lock to protect metadata writes to the class loader. I haven't worked out whether the issue is the lock being under heavy contention and/or the time the lock is held in the critical section being too high.

What are your thoughts on a solution going forward? Should we focus on #66, or look at making changes to the VM to reduce the impact of locking on the class loader data structure (reduce time spent in the critical section, go lock-free, etc.)?

@apangin
Collaborator

apangin commented Apr 26, 2018

@toaler Right, make_jmethod_id is implemented suboptimally in the JDK, and it is not just a locking problem. This could be improved, at least for the case of batch jmethodID allocation at class loading time.

While fixing OpenJDK is probably the right way to go in the long term, I believe async-profiler needs to address this issue for current and previous versions of the JDK. I'll see if I can make a workaround. But ultimately #66 should solve this problem entirely.

@toaler

toaler commented Apr 27, 2018

@apangin Excuse my ignorance: other than the locking, what is the issue with make_jmethod_id?

What are your thoughts about batch jmethodID allocation and a workaround?

@apangin
Collaborator

apangin commented Apr 27, 2018

@toaler For each method being added, make_jmethod_id linearly walks a linked list of JNIMethodBlocks to find a free slot, so the cost of adding each new ID grows with the number of IDs already allocated, i.e. roughly quadratic overall. This can be a problem if a class has a lot of methods.

A workaround could be to preallocate the jmethodID cache at class loading time.

On the other hand, if patching OpenJDK is not a problem for you, the easiest fix would be to replace

  enum { number_of_methods = 8 };

with

  enum { number_of_methods = 64 };

in method.cpp.

@alexeykudinkin

@apangin I think it's worth calling that out in the documentation. We observed a 60x slowdown in class-loading time (from 5 seconds to 5 minutes) in our use case, which has ~50k classes, most of them IDL-generated with a lot of methods, after accidentally preloading async-profiler as part of our troubleshooting suite.

We're on OpenJDK 8u252.

@apangin
Collaborator

apangin commented Jun 27, 2020

@alexeykudinkin This is a JVM bug, JDK-8062116, and it was fixed only in JDK 9. However, I made a workaround in async-profiler; check the jmethodid branch.

For more details, see #328. There you'll also find the prebuilt binaries with the fix.

@alexeykudinkin

@apangin awesome! Thanks for putting up the workaround!

Any plans to merge it into master? This might be silently affecting everyone who's actively loading async-profiler into applications with a large number of classes/methods.

@apangin
Collaborator

apangin commented Jun 27, 2020

Yes, the fix will likely make it into the next release.

@jumarko

jumarko commented Dec 20, 2021

Are there any updates or guidance about the overhead of (recent versions of) async-profiler?
Is there any documentation that discusses this topic?
Thanks!
