
metric: Avoid memory leak/increase in cilium-agent #31714

Merged
merged 1 commit into cilium:main from tam/memory-increase on Apr 2, 2024

Conversation

sayboras
Member

@sayboras sayboras commented Apr 2, 2024

Description

This commit makes sure that the processed item in the pod deletion queue is removed by explicitly calling the Done() function, as suggested in the godoc1.

Without this change, the cilium-agent memory usage keeps increasing when Hubble metrics are enabled. This might take days (if not weeks) to observe in a normal Cilium deployment due to the low number of Pod deletion events (in a high-churn environment, the memory increases at a faster pace).
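For reference, the pattern the fix follows looks roughly like the sketch below. This is a minimal, hypothetical Go example (the queue key and handler are illustrative, not the actual Cilium code): every item obtained via Get() on a client-go workqueue must be paired with a Done() call, otherwise the queue keeps tracking it internally.

```go
// Minimal, hypothetical sketch of the client-go workqueue pattern referenced
// in the godoc: every item obtained via Get() must be marked as finished with
// Done(), otherwise the queue keeps it in its internal "processing" set and
// memory grows with every pod deletion event.
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	queue := workqueue.New()

	// Simulate a pod deletion event being queued (the key is illustrative).
	queue.Add("default/pod-churn-job-abc12")

	item, shutdown := queue.Get()
	if shutdown {
		return
	}
	// Deferring Done() right after Get() guarantees the item is released
	// even if the handler below returns early or panics.
	defer queue.Done(item)

	fmt.Println("handling pod deletion:", item)
}
```

Skipping the Done() call leaves each processed key tracked by the queue, which is what produces the slow memory growth described in this PR.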

Testing was done before and after the change, as shown below.

Sample workload to simulate a high number of pod deletion events:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-churn-job
spec:
  completions: 50000000
  parallelism: 100
  template:
    metadata:
      labels:
        app: pod-churn-job
    spec:
      containers:
      - name: churn-app
        image: sandeshkv92/highpodchurn:linux_amd64
      restartPolicy: Never
```

Before this change, cilium-agent memory kept increasing from 150MB to ~500MB in less than 3 hours, whereas with the same workload and this change applied, memory stayed quite stable over a longer period (e.g. 5 hours).

Fixes: 782f934

Testing

Testing was done with the above workload; please refer to the dashboards below for visualization.

Before

(dashboard screenshot)

After

(dashboard screenshot)

Footnotes

  1. https://pkg.go.dev/k8s.io/client-go@v0.29.3/util/workqueue#Type.Get

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 2, 2024
@sayboras sayboras added needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch release-note/bug This PR fixes an issue in a previous release of Cilium. labels Apr 2, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 2, 2024
@sayboras sayboras force-pushed the tam/memory-increase branch 3 times, most recently from 24ad2d3 to b1a0ebe on April 2, 2024 02:26
@sayboras
Member Author

sayboras commented Apr 2, 2024

/test

This commit makes sure that the processed item in the pod deletion
queue is removed by explicitly calling the Done() function, as
suggested in the godoc[^1].

Without this change, the cilium-agent memory usage keeps increasing
when Hubble metrics are enabled. This might take days (if not weeks)
to observe in a normal Cilium deployment due to the low number of Pod
deletion events (in a high-churn environment, the memory increases at
a faster pace).

Testing was done before and after the change, as shown below.

Sample workload to simulate a high number of pod deletion events:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-churn-job
spec:
  completions: 50000000
  parallelism: 100
  template:
    metadata:
      labels:
        app: pod-churn-job
    spec:
      containers:
      - name: churn-app
        image: sandeshkv92/highpodchurn:linux_amd64
      restartPolicy: Never
```

Before this change, cilium-agent memory kept increasing from 150MB
to ~500MB in less than 3 hours, whereas with the same workload and
this change applied, memory stayed quite stable over a longer period
(e.g. 5 hours).

[^1]: https://pkg.go.dev/k8s.io/client-go@v0.29.3/util/workqueue#Type.Get

Fixes: 782f934
Signed-off-by: Tam Mach <tam.mach@cilium.io>
@sayboras sayboras force-pushed the tam/memory-increase branch 2 times, most recently from fb9b6b6 to a5526d9 on April 2, 2024 03:11
@sayboras
Member Author

sayboras commented Apr 2, 2024

/test

@michi-covalent
Contributor

After dashboard seems to be missing 👀

@sayboras
Member Author

sayboras commented Apr 2, 2024

> After dashboard seems to be missing 👀

Thanks, I have updated the PR description; I will let it run for some more time before marking this ready for review.

@sayboras sayboras marked this pull request as ready for review April 2, 2024 06:51
@sayboras sayboras requested review from a team as code owners April 2, 2024 06:52
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 2, 2024
lambdanis added a commit to lambdanis/tetragon that referenced this pull request Apr 2, 2024
This is done to prevent a memory leak in environments with a high pod churn or
very long-running Tetragon agents.

Credits to Tam Mach for cilium/cilium#31714.

Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
lambdanis added a commit to lambdanis/tetragon that referenced this pull request Apr 2, 2024
This is done to prevent a memory leak in environments with a high pod churn or
very long-running Tetragon agents.

Credits to Tam Mach for fixing a similar issue in cilium/cilium#31714.

Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
@sayboras sayboras added this pull request to the merge queue Apr 2, 2024
Merged via the queue into cilium:main with commit 4d05e9f Apr 2, 2024
62 checks passed
@sayboras sayboras deleted the tam/memory-increase branch April 2, 2024 22:08
lambdanis added a commit to cilium/tetragon that referenced this pull request Apr 3, 2024
This is done to prevent a memory leak in environments with a high pod churn or
very long-running Tetragon agents.

Credits to Tam Mach for fixing a similar issue in cilium/cilium#31714.

Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
@nbusseneau nbusseneau mentioned this pull request Apr 10, 2024
9 tasks
@nbusseneau nbusseneau added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Apr 10, 2024
@nbusseneau nbusseneau mentioned this pull request Apr 10, 2024
13 tasks
@nbusseneau nbusseneau added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Apr 10, 2024
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Apr 11, 2024