-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add pprof collection to 100 node scale test #32056
Conversation
My goal was to add pprof collection to both the 100 node scale test and the node throughput test, however injecting extra modules is not available yet in the node throughput test. Tabeling that work for now. |
53d7154
to
5e97267
Compare
Successful run: https://github.com/cilium/cilium/actions/runs/8740642867/job/23984794752?pr=32056. I checked the artifacts exported by the run and can confirm that the pprofs were created successfully. Here's a graph showing the CPU samples that I generated using |
5e97267
to
91d9ab1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
A comment regarding the renovate configuration which still doesn't work as expected (getting renovate to behave as you expect is pretty tricky, and requires some trial and error attempts), and a couple of additional nits inline.
91d9ab1
to
cbeaad8
Compare
This commit updates the CL2 commit ref in the scale test workflows. The goal with this update is to include changes from kubernetes/perf-tests@5a36f2c. This commit also adds a comment explaining why renovate is not used to update the CL2 ref automatically, to help make it clear why this functionality shouldn't be added in the future. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
cbeaad8
to
279f2ca
Compare
The 100 node scale test just failed due to one of the pprof collection commands hitting a timeout: https://github.com/cilium/cilium/actions/runs/8757144845/job/24035169956?pr=32056. Based on the data that was collected from the command, the command was just finishing up when the timeout went into effect - the tar archive is 90% complete. When flowchart TD
Start --> pprof-cpu
Start --> pprof-trace
pprof-trace --> pprof-heap
pprof-heap --> gops-memstats
Collecting CPU pprof data and trace pprof data takes 10 seconds each, but these are executed in parallel. After the trace data is collected, heap and gops data is collected sequentially. Maybe the heap and gops steps, along with other potential sources of overhead, take about five seconds to execute in total. I'm going to bump the timeout to 25 seconds for now and see how it goes. The goal of the timeout is to prevent the pprof measurement from hanging if some kind of weird error occurs that causes a hang during the execution of cilium-bugtool, it's not to try and assert that the bugtool's execution takes less than the provided timeout value. This is why I think it is OK for us to bump the timeout in this case, as all signs point to the bugtool executing successfully and just needing longer than 15 seconds. Now, if we have to keep bumping this timeout over time, I think that would warrant an investigation because it would imply there is some kind of regression, but since we are in the discovery phase while getting this set up, tuning the timeout seems appropriate. |
4aa142f
to
7cf0ab2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks!
(Requesting changes only because it still contains the temporary commit to be removed).
Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This commit enables the pprof endpoint in cilium-agent in the 100 node scale test. This will be used in subsequent commits to collect pprofs during the test. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This commit adjusts the 100 node scale test to include the cilium-agent pprofs CL2 module. This will trigger the collection of cilium-agent pprofs throughout the duration of the test, assisting in debugging of regressions. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
Chatted offline with @marseel and @giorio94 about bumping the number of samples that we gather in the pprofs. I changed the settings so we gather a 40s pprof on five agents, on a one-minute interval. @marseel and @giorio94 do y'all mind taking a look? I'll do a test run to see what the data looks like. |
7cf0ab2
to
5b61354
Compare
Here's a graph on CPU usage from run https://github.com/cilium/cilium/actions/runs/8786739995/job/24110295603?pr=32056. There's a fair amount of files saved as artifacts - 152 in total compared to 60 before the change I made to increase the amount of data we're getting. However, we've increased the total duration of sampled CPU time from 30s to 280s. I think it's a fair tradeoff as that's a lot of useful data and the extra samples will increase the chances that we'll sample a regression. |
5b61354
to
fcd7bb5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
/test |
Please ensure your pull request adheres to the following guidelines:
description and a
Fixes: #XXX
line if the commit addresses a particularGitHub issue.
Fixes: <commit-id>
tag, thenplease add the commit author[s] as reviewer[s] to this issue.
This PR utilizes a recent change to ClusterLoader2 that allows for injecting additional measurements to collect pprofs throughout the duration of the 100 node scale test. These pprofs are saved as artifacts and can be used for debugging.
Pprofs are pulled using the PodPeriodicCommand measurement, which periodically executes
cilium-bugtool --get-pprof
inside a randomly selected cilium-agent. A tar archive containing the profiles is saved to stdout, which is saved by the PodPeriodicCommand measurement and can later be extracted.