
[bug] v0.29.1 seems to have a file descriptor leak #4296

Closed
mrparkers opened this issue Jul 18, 2023 · 5 comments · Fixed by kubernetes-sigs/karpenter#416
Assignees: jonathan-innis
Labels: bug (Something isn't working), burning (Time sensitive issues)

Comments

@mrparkers

Description

Observed Behavior:

The number of open file descriptors held by karpenter (v0.29.1) pods appears to climb steadily over time. This was caught by a DataDog monitor that tracks open file descriptors per container via the metric container.pid.open_files.

We automatically deploy new releases of karpenter to our staging environment. Here is a graph of this metric over a ~6 hour time period for v0.29.1:

[Screenshot 2023-07-18 at 4:41 PM: open file descriptor count for v0.29.1 over ~6 hours]

Here is a graph of the same time period for all of our other clusters, which are running v0.29.0:

[Screenshot 2023-07-18 at 4:43 PM: same metric over the same window for the clusters running v0.29.0]

For another test, I used SSM to open a shell on one of the EKS nodes running a karpenter pod and checked the system-wide open file descriptor count:

[ssm-user@ip-10-0-142-143 bin]$ sysctl fs.file-nr
fs.file-nr = 5536       0       1615745

Then I deleted the karpenter pod running on that node, and tried again:

[ssm-user@ip-10-0-142-143 bin]$ sysctl fs.file-nr
fs.file-nr = 1728       0       1615745

The result of deleting the pod is also visible in the same metric referenced above. The replacement pod climbed back to ~250 open file descriptors within 30 minutes of the test:

[Screenshot 2023-07-18 at 4:54 PM: open file descriptor count after deleting the pod]
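
For reference, fs.file-nr reports three values: allocated file handles, allocated-but-unused handles, and the system-wide maximum, so it counts the whole node rather than a single process. A per-process count can be taken straight from /proc; a minimal sketch, assuming the karpenter controller process can be matched by name on the node (the pgrep pattern is an assumption, adjust it to your deployment):

# find the karpenter controller process on the node (the match pattern is an assumption)
KARPENTER_PID=$(pgrep -o -f karpenter)
# count the file descriptors it currently holds
sudo ls /proc/"${KARPENTER_PID}"/fd | wc -l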

Expected Behavior:

Karpenter pods should not have a file descriptor leak.

Reproduction Steps (Please include YAML):

  1. Deploy karpenter v0.29.1
  2. Observe the number of open file descriptors climbing steadily over time (a polling sketch follows this list).
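
A simple way to watch the climb from a shell on the node (a sketch that polls the same system-wide counter used above once a minute for an hour; the interval and duration are arbitrary):

# sample the system-wide file descriptor count once a minute for an hour
for i in $(seq 1 60); do
  date +%T
  sysctl fs.file-nr
  sleep 60
done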

I can provide more details about our exact configuration if necessary.

Versions:

  • Chart Version: v0.29.1
  • Kubernetes Version (kubectl version):
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.1", GitCommit:"4c9411232e10168d7b050c49a1b59f6df9d7ea4b", GitTreeState:"clean", BuildDate:"2023-04-14T13:14:41Z", GoVersion:"go1.20.3", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.6-eks-a5565ad", GitCommit:"895ed80e0cdcca657e88e56c6ad64d4998118590", GitTreeState:"clean", BuildDate:"2023-06-16T17:34:03Z", GoVersion:"go1.19.10", Compiler:"gc", Platform:"linux/amd64"}
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@mrparkers added the bug (Something isn't working) label Jul 18, 2023
@jonathan-innis added the burning (Time sensitive issues) label, and added and then removed the needs-investigation (Issues that need to be investigated before triaging) label Jul 19, 2023
@jonathan-innis (Contributor) commented Jul 19, 2023

Thanks for reporting this @mrparkers. I believe I tracked down the root cause and am not seeing a file descriptor leak anymore.

[Screenshot 2023-07-18 at 6:58 PM: open file descriptor count with the candidate fix applied]

Do you mind trying this snapshot version and confirming that it fixes the leak on your side as well?

KARPENTER_VERSION=v0-8d82ffce1f13161df94bc9959bcefbbfcdcd0a3c
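
For anyone else who wants to try it, here is a sketch of pulling in the snapshot with Helm, assuming an existing install and the standard release chart location oci://public.ecr.aws/karpenter/karpenter; snapshot builds may be published to a separate snapshot repository, so check the Karpenter docs for the exact chart location:

# a sketch only -- the chart repository for snapshot builds is an assumption
export KARPENTER_VERSION=v0-8d82ffce1f13161df94bc9959bcefbbfcdcd0a3c
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --version "${KARPENTER_VERSION}" \
  --reuse-values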

@jonathan-innis jonathan-innis self-assigned this Jul 19, 2023
@agido-heppe

We had some OOMKills for karpenter with v0.29.1. Our requests/limits are set to 300Mi. The memory usage increased rapidly until the containers were OOMKilled.

This might be related to this issue, since using v0-8d82ffce1f13161df94bc9959bcefbbfcdcd0a3c or rolling back to v0.29.0 seems to resolve this problem.
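
A quick way to confirm those OOMKills (a sketch, assuming the chart's default app.kubernetes.io/name=karpenter label and a karpenter namespace; adjust both to your install):

# print each karpenter pod with its last termination reason (expect "OOMKilled")
kubectl get pods -n karpenter -l app.kubernetes.io/name=karpenter \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'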

@mrparkers (Author)

Hi @jonathan-innis, thanks for the quick response, I appreciate the turnaround on this. I tested the new version and I no longer see an unusual amount of open file descriptors. It looks like the issue has been fixed.

@jonathan-innis (Contributor)

We're planning to release v0.29.2 with a fix for this issue. We'll make the announcement once it's out, and we'll recommend that everyone currently on v0.29.1 migrate to that version.

@jonathan-innis (Contributor)

v0.29.2 is released with the fix.
