Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pprof collection to 100 node scale test #32056

Merged
merged 4 commits into from
Apr 29, 2024

Conversation

learnitall
Copy link
Contributor

@learnitall learnitall commented Apr 18, 2024

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

This PR utilizes a recent change to ClusterLoader2 that allows for injecting additional measurements to collect pprofs throughout the duration of the 100 node scale test. These pprofs are saved as artifacts and can be used for debugging.

Pprofs are pulled using the PodPeriodicCommand measurement, which periodically executes cilium-bugtool --get-pprof inside a randomly selected cilium-agent. A tar archive containing the profiles is saved to stdout, which is saved by the PodPeriodicCommand measurement and can later be extracted.

Scrape pprofs in 100 node scale test workflow for extra debugging information

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 18, 2024
@learnitall learnitall changed the title Add pprof collection to scale tests Add pprof collection to 100 node scale test Apr 18, 2024
@learnitall
Copy link
Contributor Author

My goal was to add pprof collection to both the 100 node scale test and the node throughput test, however injecting extra modules is not available yet in the node throughput test. Tabeling that work for now.

@learnitall learnitall force-pushed the pr/learnitall/add-pprofs-scale-tests branch 2 times, most recently from 53d7154 to 5e97267 Compare April 18, 2024 15:27
@learnitall
Copy link
Contributor Author

learnitall commented Apr 18, 2024

Successful run: https://github.com/cilium/cilium/actions/runs/8740642867/job/23984794752?pr=32056.

I checked the artifacts exported by the run and can confirm that the pprofs were created successfully. Here's a graph showing the CPU samples that I generated using go tool pprof using the data that was collected in the above test run:

profile001

@learnitall learnitall force-pushed the pr/learnitall/add-pprofs-scale-tests branch from 5e97267 to 91d9ab1 Compare April 18, 2024 16:23
@learnitall learnitall added area/CI Continuous Integration testing issue or flake kind/performance There is a performance impact of this. release-note/ci This PR makes changes to the CI. labels Apr 18, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 18, 2024
@learnitall learnitall marked this pull request as ready for review April 18, 2024 16:24
@learnitall learnitall requested review from a team as code owners April 18, 2024 16:24
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

A comment regarding the renovate configuration which still doesn't work as expected (getting renovate to behave as you expect is pretty tricky, and requires some trial and error attempts), and a couple of additional nits inline.

.github/workflows/scale-test-node-throughput-gce.yaml Outdated Show resolved Hide resolved
contrib/cl2-modules/cilium-agent-pprofs.yaml Outdated Show resolved Hide resolved
.github/workflows/scale-test-100-gce.yaml Outdated Show resolved Hide resolved
@learnitall learnitall force-pushed the pr/learnitall/add-pprofs-scale-tests branch from 91d9ab1 to cbeaad8 Compare April 19, 2024 17:03
This commit updates the CL2 commit ref in the scale test workflows. The
goal with this update is to include changes from
kubernetes/perf-tests@5a36f2c.

This commit also adds a comment explaining why renovate is not used to
update the CL2 ref automatically, to help make it clear why this
functionality shouldn't be added in the future.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
@learnitall learnitall force-pushed the pr/learnitall/add-pprofs-scale-tests branch from cbeaad8 to 279f2ca Compare April 19, 2024 17:05
@learnitall learnitall added the dont-merge/preview-only Only for preview or testing, don't merge it. label Apr 19, 2024
@learnitall
Copy link
Contributor Author

The 100 node scale test just failed due to one of the pprof collection commands hitting a timeout: https://github.com/cilium/cilium/actions/runs/8757144845/job/24035169956?pr=32056. Based on the data that was collected from the command, the command was just finishing up when the timeout went into effect - the tar archive is 90% complete.

When cilium-bugtool is executed with --get-pprof --pprof-trace-seconds=10, the following actions are performed:

flowchart TD
    Start --> pprof-cpu
    Start --> pprof-trace
    pprof-trace --> pprof-heap
    pprof-heap --> gops-memstats

Collecting CPU pprof data and trace pprof data takes 10 seconds each, but these are executed in parallel. After the trace data is collected, heap and gops data is collected sequentially. Maybe the heap and gops steps, along with other potential sources of overhead, take about five seconds to execute in total. I'm going to bump the timeout to 25 seconds for now and see how it goes.

The goal of the timeout is to prevent the pprof measurement from hanging if some kind of weird error occurs that causes a hang during the execution of cilium-bugtool, it's not to try and assert that the bugtool's execution takes less than the provided timeout value. This is why I think it is OK for us to bump the timeout in this case, as all signs point to the bugtool executing successfully and just needing longer than 15 seconds.

Now, if we have to keep bumping this timeout over time, I think that would warrant an investigation because it would imply there is some kind of regression, but since we are in the discovery phase while getting this set up, tuning the timeout seems appropriate.

@learnitall learnitall force-pushed the pr/learnitall/add-pprofs-scale-tests branch from 4aa142f to 7cf0ab2 Compare April 19, 2024 20:41
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!

(Requesting changes only because it still contains the temporary commit to be removed).

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This commit enables the pprof endpoint in cilium-agent in the 100 node
scale test. This will be used in subsequent commits to collect pprofs
during the test.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This commit adjusts the 100 node scale test to include the
cilium-agent pprofs CL2 module. This will trigger the collection
of cilium-agent pprofs throughout the duration of the test,
assisting in debugging of regressions.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
@learnitall
Copy link
Contributor Author

Chatted offline with @marseel and @giorio94 about bumping the number of samples that we gather in the pprofs. I changed the settings so we gather a 40s pprof on five agents, on a one-minute interval. @marseel and @giorio94 do y'all mind taking a look? I'll do a test run to see what the data looks like.

@learnitall learnitall force-pushed the pr/learnitall/add-pprofs-scale-tests branch from 7cf0ab2 to 5b61354 Compare April 22, 2024 15:29
@learnitall
Copy link
Contributor Author

Here's a graph on CPU usage from run https://github.com/cilium/cilium/actions/runs/8786739995/job/24110295603?pr=32056. There's a fair amount of files saved as artifacts - 152 in total compared to 60 before the change I made to increase the amount of data we're getting. However, we've increased the total duration of sampled CPU time from 30s to 280s. I think it's a fair tradeoff as that's a lot of useful data and the extra samples will increase the chances that we'll sample a regression.
profile001

@learnitall learnitall force-pushed the pr/learnitall/add-pprofs-scale-tests branch from 5b61354 to fcd7bb5 Compare April 22, 2024 21:04
@learnitall learnitall removed the dont-merge/preview-only Only for preview or testing, don't merge it. label Apr 22, 2024
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

@viktor-kurchenko viktor-kurchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@marseel
Copy link
Contributor

marseel commented Apr 25, 2024

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 25, 2024
@michi-covalent michi-covalent added this pull request to the merge queue Apr 29, 2024
Merged via the queue into main with commit 08d9b14 Apr 29, 2024
263 checks passed
@michi-covalent michi-covalent deleted the pr/learnitall/add-pprofs-scale-tests branch April 29, 2024 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake kind/performance There is a performance impact of this. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants