Include scheduling events (or syscall usually invoking the scheduler) #47

Closed
aktau opened this Issue Feb 26, 2015 · 3 comments

aktau commented Feb 26, 2015

After watching your recent (fantastic) presentation at SCaLE13, I decided to try out a few of the things you showcased, especially the flame graph. Proudly, I showed my colleagues a nice way to lay out where a binary/system is spending its time.

One of my colleagues threw me a curve-ball though. He said, trace me this program:

#include <unistd.h>

/* Busy loop: accumulates on-CPU time. */
int compute(int start) {
  for (int i = 0; i < 10000000; ++i) {
     start += i;
  }
  return start;
}

int main(int argc, char **argv) {
  (void)argv; /* unused */
  int ret = 0;
  for (int i = 0; i != 5; ++i) {
     sleep(argc == 1 ? 0 : 1); /* with any argument: sleep 1 s per iteration (off-CPU) */
     ret += compute(argc);
  }
  return ret;
}
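
A compile line along these lines should reproduce the setup (the file name test.c and the exact flags are my assumptions; -O0 keeps the compute loop from being folded away by the optimizer, and -g provides the DWARF info used by --call-graph dwarf):

# assumed build: save the snippet as test.c
gcc -std=gnu99 -O0 -g -o test test.c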

So I traced it with:

sudo perf record -F 500 -g --call-graph dwarf -- ./test 1 && sudo perf script | ~/github/FlameGraph/stackcollapse-perf.pl | ~/github/FlameGraph/flamegraph.pl > test.svg

Naturally, I found that the nanosleep system calls were not being shown in the flame graph, even though that is where the binary spends most of its time. As you illustrate in this article about speeding up vim, sometimes nanosleep is indeed the culprit. It's not really obvious from the flame graph (I don't even see that it calls do_nanosleep, let alone spends a lot of time in it). I thought to myself that this must be because the scheduler is just running some other binary instead, and thus perf can't see it. This is fine for whole-system profiling, but worse if one tries to look at one binary (or a group of binaries) in isolation, where "wait" time is important, especially if introduced by locking or sleeping.
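
A quick way to confirm that suspicion (a sketch of my own, not part of the workflow above) is to compare CPU time against wall-clock time with perf stat:

# CPU time vs. wall-clock time for the test binary
perf stat -- ./test 1
# expect "task-clock" (actual CPU time) to be well below the
# "seconds time elapsed" line, since most of the run sleeps off-CPU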

I looked around and found some things to try on https://perf.wiki.kernel.org/index.php/Tutorial, namely:

$ ./perf record -e sched:sched_stat_sleep -e sched:sched_switch  -e sched:sched_process_exit -g -o ~/perf.data.raw ~/foo
$ ./perf inject -v -s -i ~/perf.data.raw -o ~/perf.data
$ ./perf report --stdio --show-total-period -i ~/perf.data

This is nice, and indeed shows the sleep times. I can't convert this into a flame graph though (I guess I was foolish to expect this to work; event tracing must be wholly different from sampling).

Is a flamegraph the wrong tool to illustrate this with? I actually believe it could be useful.

I don't expect you to solve this issue for me, but if you could point me in the right direction, perhaps I could whip up the tool and show my colleague that I can succinctly illustrate where the program is spending its time (showing both the actual on-CPU stack traces and "other" areas like sleeping/locking/... in one visual representation).

Once again, thanks for the fantastic tools.

brendangregg (Owner) commented

G'Day Nicolas,

Flamegraph is an appropriate tool, and we've been calling these "Off-CPU Time Flame Graphs". My LISA13 talk included them: http://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs/122 . As I described in the talk -- these are very important, and are the counterpart to on-CPU flame graphs.

So, regarding perf inject -s... We want the duration, not switch counts, so that the frame width is proportional to total sleep time. (Otherwise you are making a context switch flame graph, which can be interesting, but I also want the off-CPU time flame graph.) perf report has this as --show-total-period, but support in perf script is only very recent (-f period, which is in my 3.19 kernel, but not in 3.13).
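
A rough way to check whether a given perf build accepts the period field (my own sketch; it assumes a perf.data file from the record step already exists):

# exits non-zero if this perf script does not know the period field
perf script -f period -i perf.data > /dev/null 2>&1 \
    && echo "period field supported" \
    || echo "period field not supported (older perf)"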

So it could work like this (eg, trace all processes, for 1 second):

perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -a -g -o perf.data.raw sleep 1
perf inject -v -s -i perf.data.raw -o perf.data
perf script -f comm,pid,tid,cpu,time,period,event,ip,sym,dso,trace | awk '
    # event header line: remember the process name, and convert the period
    # (the sleep time in nanoseconds, after perf inject -s) to milliseconds
    NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
    # stack frame line: print the symbol name
    NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
    # blank line ends a stack: emit the process name, then the sleep time in ms as the count
    NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }' | \
    ./stackcollapse.pl | \
    ./flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" --colors=io > offcpu.svg

So this simplifies the output, and runs it through just stackcollapse.pl. Example off-CPU time flame graph, generated by Linux perf_events:

[image: off-CPU time flame graph (offcpu.svg)]

I like how this off-CPU time flame graph shows the "sleep 1" command as the tower on the far right, sleeping for 1,000 ms. :-)

The perf inject workflow strikes me as a bit weird, and this step could be skipped by doing more processing of the original perf.data file with perf script. I'd do this if I were on older kernels that lacked perf script -f period.

Note that scheduler events can be very frequent, and the overhead of dumping them to the file system (perf.data) may be prohibitive in production environments. (This is why I put a "sleep 1" in the perf record). If I had to do this in production, I'd also consider other tools, as well as revisiting perf_events once it supports more in-kernel programming (eBPF). Although, there may be some scenarios where the current perf_events overhead is acceptable, especially if you match on a single PID of interest...
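
For example, a per-PID variant of the record step might look like the following (a sketch; the process name test and the use of pgrep are my assumptions, and the trailing "sleep 1" still bounds the recording to one second):

# record scheduler events for a single already-running process only
perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit \
    -g -p "$(pgrep -o test)" -o perf.data.raw sleep 1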

Anyway, there's lots more one can do with flamegraphs, as you've been discovering. It's hard to cover in a talk (my LISA talk was 90 mins).

aktau commented Mar 5, 2015

Hi Brendan,

Thanks for the very informative response. I'll spend the following week poring over your post whenever I get some time.

The off-CPU time flame graph definitely seems very useful, especially when put next to a regular on-CPU flame graph (they are mutually exclusive, I assume?). In the specific case of dissecting one running binary, I think it would even be really nice to have one single flame graph that shows both on-CPU and off-CPU time (in different hues). But I'm getting ahead of myself :).

I'll close this issue since you've given me more than enough to chew on for the time being, thanks!

@aktau aktau closed this Mar 5, 2015
sievlev commented Jun 10, 2016

Brendan, could you help me understand your calculation?
What does period mean?

According to the documentation, period is the number of events collected, and the default sampling rate is 1000 Hz.
Why do you divide the period by 1,000,000 to convert it into an event duration in milliseconds?
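
For what it's worth, in the pipeline above the period field does not appear to be a sample count: after perf inject -s, each injected sched_switch sample seems to carry the preceding sleep duration in nanoseconds as its period (which is consistent with the division by 1,000,000 and the --countname=ms label). A minimal check of the arithmetic, assuming a 1-second sleep:

# 1 s of sleep reported as 1,000,000,000 ns; dividing by 1,000,000 gives ms
awk 'BEGIN { printf "%d ms\n", 1000000000 / 1000000 }'   # prints: 1000 ms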
