Include scheduling events (or syscalls that usually invoke the scheduler) #47
G'Day Nicolas,

A flame graph is an appropriate tool, and we've been calling these "off-CPU time flame graphs". My LISA13 talk included them: http://www.slideshare.net/brendangregg/blazing-performance-with-flame-graphs/122. As I described in the talk, these are very important, and are the counterpart to on-CPU flame graphs.

So, regarding `perf inject -s`: we want the duration, not switch counts, so that the frame width is proportional to total sleep time. (Otherwise you are making a context-switch flame graph, which can be interesting, but I also want the off-CPU time flame graph.) `perf report` has this as `--show-total-period`, but support in `perf script` is only very recent (`-f period`, which is in my 3.19 kernel, but not 3.13). So it could work like this (e.g., trace all processes for 1 second):
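A sketch of those commands, assembled from the description above (the tracepoint list, field list, and awk reformatting are my assumptions based on it; `sched:sched_stat_sleep` needs CONFIG_SCHEDSTATS, and exact output formatting varies by kernel version):

```
# record scheduler events system-wide for 1 second, with stack traces
perf record -e sched:sched_stat_sleep -e sched:sched_switch \
    -e sched:sched_process_exit -a -g -o perf.data.raw sleep 1

# merge the sched_stat sleep durations onto the sched_switch stacks
perf inject -v -s -i perf.data.raw -o perf.data

# dump stacks with the period (duration in ns) field, reformat into
# multi-line stacks ending in a millisecond count, then fold and render
# (the awk field positions are illustrative and depend on the -f list)
perf script -f comm,pid,tid,cpu,time,period,event,ip,sym,dso,trace \
    -i perf.data | awk '
    NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
    NF > 1 && NF <= 4 && $1 !~ /:/ { print $2 }
    NF < 2 { printf "%s\n%d\n\n", exec, period_ms }' | \
    ./stackcollapse.pl | \
    ./flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" \
        --colors=io > offcpu.svg
```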
So this simplifies the output, and runs it through just stackcollapse.pl. Here's an example off-CPU time flame graph, generated by Linux perf_events. I like how it shows the "sleep 1" command as the tower on the far right, sleeping for 1,000 ms. :-)

The `perf inject` workflow strikes me as a bit weird, and that step could be skipped by doing more processing of the original perf.data file after `perf script`. I'd do that if I were on an older kernel that lacked `perf script -f period` (a rough sketch of that approach follows below).

Note that scheduler events can be very frequent, and the overhead of dumping them to the file system (perf.data) may be prohibitive in production environments. (This is why I put a "sleep 1" in the perf record.) If I had to do this in production, I'd also consider other tools, and revisit perf_events once it supports more in-kernel programming (eBPF). Although there may be some scenarios where the current perf_events overhead is acceptable, especially if you match on a single PID of interest.

Anyway, there's lots more one can do with flame graphs, as you've been discovering. It's hard to cover it all in a talk (my LISA talk was 90 mins).
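For that older-kernel path, one rough sketch of the extra processing (my own illustration, not a recipe from this thread) is to pair each thread's switch-out timestamp with its next switch-in from the raw sched_switch events, summing per-PID off-CPU time (attributing stack traces would take more work):

```
# dump the raw scheduler events, skipping the perf inject step
perf script -i perf.data.raw > sched.txt

# sum off-CPU time per pid (illustrative awk; assumes the default
# perf script formatting of sched_switch lines, with prev_pid=/next_pid=
# fields and a "seconds.micros:" timestamp field)
awk '
    /sched:sched_switch/ {
        prev = ""; nxt = ""; t = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^prev_pid=/)        { split($i, a, "="); prev = a[2] }
            if ($i ~ /^next_pid=/)        { split($i, b, "="); nxt = b[2] }
            if ($i ~ /^[0-9]+\.[0-9]+:$/) { t = $i; sub(/:$/, "", t) }
        }
        # thread nxt is back on-CPU: credit the time it spent off-CPU
        if (nxt != "" && (nxt in off) && t != "") {
            total[nxt] += t - off[nxt]
            delete off[nxt]
        }
        # thread prev just left the CPU at time t
        if (prev != "" && t != "") off[prev] = t
    }
    END { for (p in total) printf "pid %s: %.6f s off-CPU\n", p, total[p] }
' sched.txt
```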
Hi Brendan, Thanks for the very informative response. I'll spend the following week poring over your post whenever I get some time. The off-CPU time flame graph definitely seems very useful, especially when put next to a regular on-CPU flame graph (they are mutually exclusive, I assume?). In the specific case of dissecting one running binary, I think it would even be really nice to have a single flame graph that shows both on-CPU and off-CPU time (in different hues). But I'm getting ahead of myself :). I'll close this issue since you've given me more than enough to chew on for the time being. Thanks!
Brendan, could you help me understand your calculations? According to the documentation, period is the number of events collected, and the default sampling rate is 1000 Hz.
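For context on that question, my understanding (an assumption on my part, not stated in this thread) is that the meaning of period depends on the event type: for timed sampling at the default ~1000 Hz each sample approximates 1 ms of on-CPU time, whereas for the sched:sched_stat_sleep tracepoint the period carries the sleep duration in nanoseconds, which would be why the awk above divides by 1,000,000 to get milliseconds:

```
# timed sampling at ~1000 Hz: N samples ~= N ms of on-CPU time
# sched:sched_stat_sleep:     period = sleep duration in nanoseconds
awk 'BEGIN { period = 1000000000; printf "%d ms\n", period / 1000000 }'  # a 1 s sleep -> "1000 ms"
```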

After watching your recent (fantastic) presentation at SCaLE13, I decided to try out a few of the things you showcased, especially the flame graph. Proudly, I showed my colleagues a nice way to lay out where a binary/system is spending its time.
One of my colleagues threw me a curve-ball though. He said, trace me this program:
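(The program itself isn't reproduced here; as a hypothetical stand-in, anything that spends nearly all of its time sleeping shows the same behavior:)

```
# hypothetical stand-in: almost all of its time is spent in nanosleep(2), off-CPU
while :; do sleep 0.01; done
```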
So I traced it with:
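(The exact invocation isn't shown; a typical on-CPU recipe, as in the FlameGraph README, would be:)

```
# sample all CPUs at 99 Hz with stack traces, then fold and render
perf record -F 99 -a -g -- sleep 60
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg
```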
Naturally, I found that the nanosleep system calls were not being shown in the flame graph, even though this is where the binary spends most of its time. As you illustrate in your article about speeding up vim, sometimes `nanosleep` is indeed the culprit. It's not really obvious from the flame graph (I don't even see that it calls `do_nanosleep`, let alone spends a lot of time in it). I thought to myself that this must be because the scheduler is just running some other binary instead, and thus perf can't see it. This is fine for whole-system profiling, but worse if one tries to look at one binary (or a group of binaries) in isolation, where "wait" time is important, especially if introduced by locking or sleeping.

I looked around and found some things to try on https://perf.wiki.kernel.org/index.php/Tutorial, namely:
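(Roughly the tutorial's sleep-time profiling recipe, paraphrased from that wiki page, with `./myprogram` standing in for the traced binary:)

```
perf record -e sched:sched_stat_sleep -e sched:sched_switch \
    -e sched:sched_process_exit -g -o perf.data.raw ./myprogram
perf inject -v -s -i perf.data.raw -o perf.data
perf report --stdio --show-total-period -i perf.data
```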
This is nice, and indeed shows the sleep times. I can't convert this into a flame graph, though (I guess I was foolish to expect that to work; event tracing must be wholly different from sampling).

Is a flame graph the wrong tool to illustrate this with? I actually believe it could be useful.

I don't expect you to solve this issue for me, but if you could point me in the right direction, perhaps I could whip up the tool and show my colleague that I can succinctly illustrate where the program is spending its time (showing both the actual on-CPU stack traces and "other" areas like sleeping/locking/... in one visual representation).
Once again, thanks for the fantastic tools.