Memory Errors with stemcell 1.351+ #318
Some of this is in the Slack thread above, but here is our timeline, versions, and observations.

Tech details: IaaS AWS GovCloud; BOSH directors at …

- Back on Jan 13/14th we started to see errors in …
- Jan 23rd we moved to stemcell …
- Jan 30th we moved to stemcell …
- Feb 1st we kept the same versions as Jan 30th, but we increased …
- Feb 7th we deployed cf-deployment …
- Feb 8th we deployed stemcell …
- On Feb 13th we updated our deployment to increase the staging memory limit from the default …

During this whole event, looking at Prometheus metrics from BOSH, the … The change of stemcell kernel from … What we have not narrowed down yet was why the start of the … |
Dear Colleagues,

CF Foundation versions used: …

From the Cloud Foundry side we noticed the following symptom affecting CF apps.
Based on our investigation we validated that the current staging container memory limit of 1024 MB is not sufficient to stage an application which had previously been staged successfully. To work around the issue we: …
|
same here with 1.360: …
As @ChrisMcGowan and @PlamenDoychev mentioned, increasing the memory as a workaround solves the issue... |
Any new updates from any of the working groups, or new band-aids folks have found? Stemcell … Just for reference for folks rolling back to … |
FYI, Lakin and I have discovered that the OOMs occur with … |
Synopsis: We're still troubleshooting the OOM problem.
|
Synopsis: …
|
### FAQ: OOM Errors on New Jammy Stemcells During Staging

The most recent set of stemcells (Jammy 1.351+) has introduced intermittent OOM (out-of-memory) failures when staging Cloud Foundry applications. Though intermittent, these errors have disrupted user updates and triggered several open issues. We believe it's a kernel bug. Until the issue is fixed, we recommend users pin to an earlier stemcell (1.340) or increase the staging limit of applications.

#### What's the error that users are seeing?

Before running an application on Cloud Foundry, the application must be "staged" (choosing the appropriate buildpack (Ruby, Golang, etc.), compiling, resolving dependencies). When a user runs "cf push" or "cf restage", it fails with "Error staging application: StagingError - Staging error: staging failed". Viewing /var/log/kern.log on the Diego cell where the app was staged, one sees the error "Memory cgroup out of memory: Killed process …" along with a stack trace.

#### How Can I Avoid OOMs on my Foundation?

One way is to pin the stemcell to Jammy 1.340 and not upgrade past that. If that's not possible, bump the staging RAM limit from 1 GiB to 2 GiB or higher, depending on your staging footprint. Specifically, modify `dea_next.staging_memory_limit_mb`: the current default is 1024; we recommend bumping it to 2048 or 4096.

#### Will Increasing the Staging RAM Limit Adversely Affect the Foundation?

We doubt increasing the staging RAM limit will have a negative impact unless the user is in the habit of restaging all their applications at the same time. The staging cycle is short-lived, and though staging an app will reserve a greater amount of RAM, that RAM is released when the staging cycle completes within a few minutes.

#### What Stemcells are Affected?

The Linux stemcells based on Canonical's Ubuntu Jammy Jellyfish release are affected, from version 1.351 (released January 29, 2024) through 1.390 (present). That coincides with Canonical's introduction of the 6.5 Linux kernel (prior stemcells had the 6.2 Linux kernel).

#### Which IaaSes are Affected?

We have seen the problem on vSphere, GCP, and AWS, and we suspect it occurs on all IaaSes.

#### What's Causing the Error?

We believe the error is caused by a poor interaction between the Linux 6.5 kernel and v1 cgroups. The Linux 6.5 kernel was introduced with the 1.351 stemcell. Specifically, the introduction of Multi-Gen LRU: …
#### Which Applications are Affected?

Golang, and we have reports of NodeJS and Java as well.

#### What's Being Done to Fix the Error?

We're planning to roll back the kernel from 6.5 to 5.15. In the meantime, we're pursuing a fix with Canonical; we're also considering bumping the default staging limit.
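For operators applying the staging-limit workaround above, here is a minimal sketch. The ops-file path is an assumption based on cf-deployment's usual `api` instance group layout; verify both the path and the property location against your own manifest before deploying. Only the property name (`dea_next.staging_memory_limit_mb`) comes from the FAQ itself.

```bash
# Hypothetical ops file bumping the staging memory limit; the
# /instance_groups path is an assumption -- check it against your manifest.
cat > bump-staging-memory.yml <<'EOF'
- type: replace
  path: /instance_groups/name=api/jobs/name=cloud_controller_ng/properties/dea_next?/staging_memory_limit_mb
  value: 2048
EOF

# Apply it alongside your usual ops files.
bosh -d cf deploy cf-deployment.yml -o bump-staging-memory.yml
```
|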
Hi @cunnie, I ran main.go on 5.15, 6.2 and 6.5, and found that under 512 MiB we OOM every time; there is no scenario where that is enough memory to compile main.go. I also made an unbounded cgroup to see how much memory is consumed at the peak, and found it to be somewhere around 1-1.1 GB.
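For anyone who wants to repeat this kind of measurement, a minimal sketch follows, assuming a cgroup-v1 memory controller mounted at the standard path; the cgroup name is made up, and this is not necessarily the exact setup used above.

```bash
# Build inside a cgroup-v1 memory cgroup capped at 512 MiB, then read the peak.
sudo mkdir -p /sys/fs/cgroup/memory/go-build-repro
echo $((512 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/memory/go-build-repro/memory.limit_in_bytes
echo $$ | sudo tee /sys/fs/cgroup/memory/go-build-repro/cgroup.procs   # move this shell into the cgroup
go build main.go    # gets OOM-killed under the cap on the affected kernels
# Peak usage recorded by the v1 memory controller (skip the limit above to measure unbounded):
cat /sys/fs/cgroup/memory/go-build-repro/memory.max_usage_in_bytes
```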
Have you managed to compile main.go within a 512 MiB limit on any kernel? Do you have an example workload that used to work under 6.2, but now fails on 6.5? I am also reviewing all commits to cgroup v1 between 6.2 and 6.5. Support for v1 does indeed exist under jammy, but v2 will always be better tested as it is more widely used, since it has been the new default for a few years now. Thanks, |
Hey @matthewruffell I was just trying to reproduce again and had some trouble getting it to build on 6.2, which is weird because I'm pretty sure I'm using the exact same steps I was using the other day. I did get it to pass on 6.2 and fail on 6.5 when turning the …
|
We tested on the 5.15 non-hwe kernel and were not able to reproduce the issue there. So it doesn't appear to be introduced by something that got back-ported to 5.15. |
We've seen OOM problems when running cgroups v1 with the 6.5 kernel line. Resolves [#318]
* Revert from hwe kernel to normal lts kernel

  We've seen OOM problems when running cgroups v1 with the 6.5 kernel line.
  Resolves [#318]

* The path for libsubcmd has changed between kernel versions; add the alternate path

  8fecc29 originally changed this; now we are adding both for compatibility.

Signed-off-by: Joseph Palermo <joseph.palermo@broadcom.com>
Co-authored-by: Long Nguyen <nguyenlo@vmware.com>
Co-authored-by: Brian Cunnie <brian.cunnie@broadcom.com>
Is anything going to be done for when we are back on a 6.5 kernel? |
If it is a cgroups v1 & 6.5 kernel problem, the current plan is to move the noble stemcell to cgroups v2, so it shouldn't be impacted there. It's also likely the problem will get fixed in the 6.5 kernel at some point; it's just a question of when.
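As a quick sanity check of which cgroup hierarchy a given VM is actually running (relevant to whether the cgroups-v2 plan sidesteps the bug), the filesystem type of /sys/fs/cgroup is enough:

```bash
# 'cgroup2fs' means the unified v2 hierarchy; 'tmpfs' means the v1 layout.
stat -fc %T /sys/fs/cgroup/
```
|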
The candidate stemcells with the 5.15 kernel seem to work just as well as the 6.5 ones. The current plan, unless people find problems, is to publish them on Monday. The version numbers are obviously not what will be released; that's just the way this part of the pipeline works. We did some testing against public IaaSes to verify VM types work. We did not do an exhaustive test of all types, but we generally found that VM types that work with the current Jammy seem to work fine with the 5.15 version too.
|
I think the 5.15 kernel is a good workaround for the time being. I am still looking into the 6.5 cgroups v1 issue, but it is tricky: I can sometimes get the reproducer to build correctly on 6.5 with the 512 MiB RAM limit, which makes bisecting difficult because the failure is not deterministic. I will write back once I have some more information to share. I will spend today and early next week looking into this. Thanks, |
Ubuntu-jammy 1.404 has been cut with the 5.15 kernel. Can people experiencing the problem confirm if this resolves it? |
I have been reading all changes between 6.2 and 6.5 related to cgroups and … (in Linus Torvalds' linux repository):

$ git log --grep "memcg" v6.2..v6.5

I came up with the following shortlist of commits which touch memory controls.

One-line shortlist (Hash, Subject): …

Full git log: …

Now, there are about 70 commits of interest, and I can't guarantee any of them …

The major feature that stands out is the major work done on Multi-Generational LRU.

Documentation: …
Explanations: …
News articles: …

LRU is a caching concept: Least Recently Used. The "old" implementation, used …

The Multi-Gen LRU is quite complicated, but the idea is to group pages together …

Multi-Gen LRU was merged in 6.1, but disabled by default. The Ubuntu 6.2 kernel …
6.2:
$ cat /sys/kernel/mm/lru_gen/enabled
0x0000

6.5:
$ cat /sys/kernel/mm/lru_gen/enabled
0x0007

0x0000 is off. 0x0007 is fully on, as per the table in: https://docs.kernel.org/next/admin-guide/mm/multigen_lru.html

Before I go too deep into this rabbit hole, could we please try turning off Multi-Gen LRU on an affected 6.5 system and re-running the reproducer?
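For reference, a sketch of the runtime toggle described in the admin guide linked above (no reboot needed; the expected readback values come from the same table):

```bash
# Disable Multi-Gen LRU at runtime on the affected 6.5 cell.
echo n | sudo tee /sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled   # should now read 0x0000

# Re-enable later with:
echo y | sudo tee /sys/kernel/mm/lru_gen/enabled   # reads back 0x0007
```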
Please let me know if it makes any difference. If it does, then we have our culprit. Thanks, |
We did some testing with Multi-Gen LRU disabled and were unable to reproduce the problem. If anybody is still running the 6.5 kernel stemcells and is able to disable it and see whether that resolves the problem, that would be great data to have. You could either run it via a …
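The truncated suggestion above presumably points at some boot-time hook. As one hypothetical shape, a sketch of a script that flips the knob on boot (for example wired up through a BOSH pre-start mechanism such as an os-conf script job, if your setup has one; treat that wiring as an assumption):

```bash
#!/usr/bin/env bash
# Hypothetical boot-time script to disable Multi-Gen LRU on a 6.5 stemcell.
set -e
lru_gen=/sys/kernel/mm/lru_gen/enabled
if [ -w "$lru_gen" ]; then
  echo n > "$lru_gen"
fi
```
|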
Hi *,

Thanks for digging into it!

EDIT: … |
Sounds like the ubuntu-jammy 1.404 stemcell has resolved the issue by using the 5.15 kernel. Going to close now, but reopen if people see the issue again. |
Hi everyone,

Just wanted to drop by with an update on the current situation.

I was reading the commits to Multi-Gen LRU between 6.5 and 6.8, and came across commit 669281ee7ef731fb5204df9d948669bf32a5e68d, which was particularly interesting. The symptoms are pretty much the same, …

Unfortunately, this commit is already applied to the 6.5 kernel, and is a part …

$ git log --grep "Multi-gen LRU: fix per-zone reclaim"

You were testing 6.5.0-15-generic at least, so the commit is present. So we are …

I checked all the recent additions of patches into 6.5, and there was actually …

$ git log --grep "mglru" --grep "MGLRU" --grep "Multi-gen LRU" Ubuntu-6.5.0-15.15..origin/master-next
commit c28ac3c7eb945fee6e20f47d576af68fdff1392a
commit 4376807bf2d5371c3e00080c972be568c3f8a7d1
commit 8aa420617918d12d1f5d55030a503c9418e73c2c
commit 5095a2b23987d3c3c47dd16b3d4080e2733b8bb9
commit 081488051d28d32569ebb7c7a23572778b2e7d57
commit bb5e7f234eacf34b65be67ebb3613e3b8cf11b87

All but the very last commit are introduced in 6.5.0-27-generic, which was …

These fixups get us most of the way to all the fixes available in 6.8, and I …

Would anyone please be able to install 6.5.0-27-generic to a stemcell, and run …

I would be very eager to hear the results. These patches should help with …

Thanks, |
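For anyone picking up that request, a rough sketch of pulling the requested kernel build onto a jammy test VM. The package names follow Ubuntu's usual linux-image naming for HWE builds and are an assumption; confirm availability with apt first. GRUB will typically boot the highest-versioned installed kernel after the reboot.

```bash
sudo apt-get update
# Install the specific 6.5.0-27 HWE kernel build, then reboot into it.
sudo apt-get install -y linux-image-6.5.0-27-generic linux-modules-6.5.0-27-generic
sudo reboot
# After reboot:
uname -r   # expect 6.5.0-27-generic
```
|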
Over the past few days we have seen memory errors with the newer stemcells (1.351+). Let's use this ticket to document all findings and decide how to mitigate this issue.
Slack discussion: https://cloudfoundry.slack.com/archives/C02HWMDUQ/p1707492827160649