Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stackcollapse-perf.pl ignores period of samples #165

Closed
hardc0der opened this issue Mar 20, 2018 · 18 comments · Fixed by #250
Closed

stackcollapse-perf.pl ignores period of samples #165

hardc0der opened this issue Mar 20, 2018 · 18 comments · Fixed by #250

Comments

@hardc0der
Copy link

It seems like stackcollapse-perf.pl ignores the period associated with each sample record so all recorded samples are given the same weight. This ends up in the generated flamegraph not being consistent with what perf report outputs for the overhead of the different call stacks. I usually sample by specifying a frequency and the hardware CPU cycles event with something like perf record -e cpu-cycles -F 997 and I can get pretty different output. I was wondering if you run into the same thing.

@guoshimin
Copy link
Contributor

Also ran into the same issue. @hardc0der did you somehow resolve it?

@brendangregg
Copy link
Owner

FWIW, practically all the sampling we do is using cpu-clock since cpu-cycles typically isn't available in cloud environments (perf warns: "The cycles event is not supported, trying to fall back to cpu-clock-ticks"), and with cpu-clock that field shows the frequency and can be ignored. If there's a problem with cpu-cycles then I may be unaware as I rarely get a chance to use it, so please let me know.

What do you think period means, anyway? E.g., I just tried on a bare metal system; what does this mean?:

# perf record -F 1 -a 
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 4.160 MB perf.data (41 samples) ]
# perf script
            bash 14404 [007] 2431612.075407: 2358327118 cycles:ppp:  ffffffff88dd479c security_inode_permission ([kernel.kallsyms])

(No, bash was not hot on-CPU.)

@guoshimin
Copy link
Contributor

From https://perf.wiki.kernel.org/index.php/Tutorial#Event-based_sampling_overview:

The period is expressed as the number of occurrences of an event, not the number of timer ticks.

And later on the same page:

... the kernel is dynamically adjusting the sampling period to achieve the target average rate. The adjustment in period is reported in the raw profile data

I think it's the dynamically chosen sampling period, expressed in terms of number of events, for this sample.

In your example, the requested sampling frequency is 1 Hz, and the period turned out to be 2358327118 cycles. Does that match the clock frequency of your machine?

@brendangregg
Copy link
Owner

Sure, that roughly matches the clock frequency. But what do we do with this information? bash was not in this function for 1 second.

@guoshimin
Copy link
Contributor

I take that to mean that this sample was taken after that many cycle events fired, not that the IP stayed there for that long.

@brendangregg
Copy link
Owner

Right, so what do we do with this information? What's the proposed change to stackcollapse-perf.pl?

@guoshimin
Copy link
Contributor

I guess we should weight the samples by this amount. I'm going to check how perf report does it.

@guoshimin
Copy link
Contributor

Yeah perf report weight the samples by period. The code is https://github.com/torvalds/linux/blob/5925fa68fe8244651b3f78a88c4af99190a88f0d/tools/perf/ui/hist.c#L36-L40

@brendangregg
Copy link
Owner

So in the same profile:

         swapper     0 [007] 2431604.697259:          1 cycles:ppp:  ffffffff88a6e448 native_write_msr ([kernel.kallsyms])

Currently it weights them equally since it ignores the period. But you're saying you'd weight the bash sample as 2358327118 and this sample as 1? I don't think bash consumed 2.3 billion times more CPU in security_inode_permission() than this sample in native_write_msr(). For this example that seems even more wrong than just ignoring the period.

If all the samples were roughly 2358327118, then I could understand that this is a problem of sampling granularity (since I choose -F 1). But that's not the case. I get periods from 1 through to 6281124533.

Usually perf is a great reference of doing things right, but in this case I don't understand it.

@brendangregg
Copy link
Owner

I'm looking at other profiles where if you sum the period, it shows more than >10x the real CPU time in the process (as shown by other tools).

I choose the -F 1 to magnify sampling and period issues. If someone can 100% explain the periods in an -F 1 profile, including why they vary from 1 to 1B+, and why their sum greatly exceeds measured CPU, then we should be in a position to fix stackcollapse-perf.pl if needed (and the fix may just be to use the period as a weight). I'm nervous about fixing something I don't completely understand.

@guoshimin
Copy link
Contributor

Yeah I'm getting astronomical numbers that clearly don't make sense in the period field as well when I'm using the cycles event. I don't know if it's a kernel issue (I'm on 4.4.0), a bug in the perf tool, or a hardware issue.

I still think the period field is meant to give weights to samples, but it seems the field is currently not reliable.

I think for me the solution is to use cpu-clock for now.

@brendangregg
Copy link
Owner

It strikes me if I ignore the period=1 samples, then perhaps the other periods makes sense. Perhaps the period=1 events are due to some odd kernel code. And by weighting on period, they are effectively ignored anyway.

I'm still thinking about the consequences. Given mixed periods, it means the flame graph will have mixed resolution. You may be able to zoom into some towers and see detail, but other towers and not see detail where the frames are elided (and wrongly assume the CPU time is in the parent frame). I don't know of a good solution though, other than documenting it (or artificially ignoring extra resolution).

@guoshimin
Copy link
Contributor

Yeah I think that's fair. Maybe recommend setting -c instead of -F when using cycles to get constant periods.

@Knio
Copy link

Knio commented Sep 21, 2020

I may be totally wrong here, but just a note:

On any Intel or AMD processor, the cycle event does not count when the processor is idle, i.e., when it calls mwait(). 1

The kernel's automatic frequency adjustment seems to be designed for counters that mostly tick at a constant rate, and I think the behavior of cycles stopping entirely when the CPU is sleeping results in abnormally low periods right after a wakeup when the event tries to catch up, and then high periods once on-CPU. 1 is also the initial value for frequency-based counters.

@brendangregg
Copy link
Owner

@Knio thanks for the kernel code link: that sample_period = 1; sure sounds like the source of these period=1 events.

I think we understand this enough to justify the change, and pay attention to the period as @hardc0der suggested. Someone want to do the PR? (plus I have other PRs to catch up on, now I've finished another book.)

mdcallag added a commit to mdcallag/FlameGraph that referenced this issue Mar 31, 2022
Summary:

Disclaimer - I am not an expert on Perl, perf or FlameGraph.

This is a hack for brendangregg#165
By hack I mean that it works for me but I didn't update all of the test files.
The problem is that stackcollapse assumes that all stack traces have an equal
weight but they don't based on based on current perf output. For example,
there can be lines like this where 90689, 32298287 and 1 are the weights
but stackcollapse assumes they all have weight 1, and by weight I mean the
equivalent of the number of times this stack trace happens. Thus, the relative
frequency of xpl_accept-3 is 1 / (90689 + 32289287 + 1).

xpl_accept-1 2777710 12891973.429160: 90689 cycles:
xpl_accept-2 2777710 12891975.431112: 32289287 cycles:
xpl_accept-3 2777710 12891975.431112: 1 cycles:

This is based on brendangregg#250 but
I am not sure that PR is correct, it is also ~20 diffs behind current FlameGraph.

Test Plan:

works for me

Reviewers:

Subscribers:

Tasks:

Tags:
@mdcallag
Copy link

Are you open to another PR to fix this? Because I get extremely misleading flamegraphs today with:

perf record -g -p $pid -- sleep $perf_secs
perf script > o.scr
cat o.scr | ./stackcollapse-perf.pl > o.fold
./flamegraph.pl o.fold > o.svg

I am not sure the PR is correct and it is behind latest by ~20 diffs. My hack is here. It works for me and is current with latest.

@kgibm
Copy link

kgibm commented Nov 21, 2022

I found this issue because the SVG produced by flamegraph.pl reported astronomical "sample" counts for a short perf run:

flamegraph_huge_samples

Digging into this was very confusing, so I thought it would be useful to others to summarize. Please correct me if I'm wrong:

  1. perf record tries to use -e cpu-cycles by default if it's available (as it was in my case) and if -e isn't otherwise specified.
  2. If -e cpu-cycles is unavailable, as is common with many environments, then perf record will fall back to -e cpu-clock. This was the original use case stackcollapse-perf.pl was built for.
  3. -e cpu-clock measures the passage of time.
  4. -e cpu-cycles measures the number of CPU cycles executed.
  5. When specifying sampling frequency (-F), as is often desired for flame graphs, since -e cpu-cycles is not directly time-based, the Linux kernel tries to constantly adjust the period of CPU cycles at which it samples to try to match the requested frequency:

    the kernel is dynamically adjusting the sampling period to achieve the target average rate. The adjustment in period is reported in the raw profile data

  6. stackcollapse-perf.pl was originally designed for -e cpu-clock, thus it treats every stack sample equally. In perf script, there is a value to the left of the cpu-clock event which is the clock frequency, and this was ignored.
  7. If -e cpu-cycles is used, then, in perf script, there is a value to the left of the cpu-cycles event which is the last period (in CPU cycles) that the kernel used for that event.
  8. If -e cpu-cycles is used, then, perf report "weights" the samples using this period. I still don't really understand what this means.
  9. Before pull request take period into account when collapsing stacks #250, stackcollapse-perf.pl would simply count each cpu-cycles event as it did with -e cpu-clock, treating each event as a single sample.
  10. With -e cpu-cycles, this seems to possibly over-count certain samples because the kernel may record many events within a short period of time (less than -F) with a period of 1, although I still don't understand under what conditions it does this. It might also not take into account other side effects of the period scaling, although I don't understand what this means.
  11. Pull request take period into account when collapsing stacks #250 changed stackcollapse-perf.pl to "weight" the samples based on the value to the left of the event. Therefore, for both -e cpu-clock and -e cpu-cycles, it is expected to now get astronomical "sample" counts.
  12. flamegraph.pl has an option --countname that defaults to "samples". This seems to now be incorrect, because the intuitive understanding of "samples" is a count of event samples, rather than sum(clock frequency|cycle periods).
  13. If a user wants something more intuitive, they can either:
    1. Explicitly use perf record -e cpu-clock
    2. Use -c instead of -F with -e cpu-cycles, although it's not totally clear to me what exactly this means because although this is more clear on the period at which perf records, this counter only increments on CPU cycles, so it's still not wallclock-based.

I don't have an intuitive understanding of what it means to sample based on CPU cycles executed:

  1. What does it mean that the kernel creates many events nearly at the same time with period=1?
  2. Under what conditions is this data more accurate than sampling based on wallclock time?

If the answer to the last question is "unclear," then perhaps #250 should be reverted (given the confusing nature of the "samples" count and its interpretation), and users should just be encouraged, for this use case of FlameGraphs, to always explicitly specify perf record -e cpu-clock?

@kgibm
Copy link

kgibm commented Nov 23, 2022

If a user wants something more intuitive, they can either:

  1. Explicitly use perf record -e cpu-clock

Not so simple as cpu-clock isn't always available:

Info 11/22/2022 15:25:32   cpu-clock: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'

Might be related to using --call-graph lbr.

jacobvosmaer added a commit to jacobvosmaer/perfserve that referenced this issue Feb 13, 2024
Seems to have be because of this bug:
brendangregg/FlameGraph#165. One workaround
there was to record data with `-e cpu-clock`. I don't know what that
means but that is what this commit does.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants