
Conversation

@jeremyeder
Contributor

We found cases where, when a longer run (1 hour) was executed on a system with 32GB of RAM, the post-processing step for the perf data would be OOM-killed by the kernel. When post-processing fails, the system is left in an awkward state that needs to be cleaned up, and the failure may also prevent the collection of remotely-registered data recordings.

We already set the perf sampling frequency to 100Hz when we enable callgraphs. This patch makes 100Hz the default perf frequency to reduce the sample volume. In my experience this won't cause a loss of fidelity in trace data.
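For context, a sketch of the kind of invocation involved (illustrative only: the pbench wrapper's exact command line is not shown in this thread, and the hostname/output paths are placeholders):

```shell
# -F sets the sampling frequency in Hz. Lowering it from the previous
# rate (the thread's "10x reduction" implies 1000Hz) to 100Hz cuts the
# number of samples written to perf.data by roughly a factor of ten,
# which shrinks the data 'perf report' must later process.
perf record -a -F 100 -g -o perf.data -- sleep 3600
```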

@atheurer
Contributor

Just curious, is the OOM killer used because there is no swap? And does 'perf report' memory usage directly correlate to perf.data size? I would have expected the memory usage to be similar, no matter the perf.data size (seems unlikely perf would need to keep all of this data in memory).

@jeremyeder
Contributor Author

The swap part is quite possible. OpenShift/Kube/Atomic Enterprise run swapless by design.

@jeremyeder
Contributor Author

I guess I need to play around a bit to answer your question. But honestly, I think we should take advantage of the 10x reduction in sample volume and the associated overhead/CPU usage of the tracing. I think 100Hz is enough, and it would get us what we need for both short and long runs.
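A back-of-envelope sketch of the reduction being discussed (assumptions not stated in the thread: per-CPU sampling on a hypothetical 32-CPU box, and a prior rate of 1000Hz implied by the cited 10x reduction):

```python
def samples(freq_hz, duration_s, ncpus):
    """Approximate number of perf samples collected system-wide
    when sampling every CPU at freq_hz for duration_s seconds."""
    return freq_hz * duration_s * ncpus

hour = 3600
before = samples(1000, hour, 32)   # 115,200,000 samples
after = samples(100, hour, 32)     #  11,520,000 samples
print(before, after, before // after)  # -> 115200000 11520000 10
```

Every one of those samples ends up in perf.data, so a 10x drop in sampling frequency translates directly into a 10x smaller file for post-processing to chew through.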

@atheurer
Contributor

Sure, I support the default of 100

atheurer added a commit that referenced this pull request Feb 10, 2016
reduce sampling interval for perf to 100Hz
@atheurer atheurer merged commit 126e2e8 into distributed-system-analysis:master Feb 10, 2016
@portante portante self-assigned this May 9, 2022