Memory Errors with stemcell 1.351+ #318

Closed
max-soe opened this issue Feb 14, 2024 · 23 comments · Fixed by #329

@max-soe

max-soe commented Feb 14, 2024

Over the past few days we have seen memory errors with the newer stemcells (1.351+). Let's use this ticket to document all findings and decide how to mitigate the issue.

Slack discussion: https://cloudfoundry.slack.com/archives/C02HWMDUQ/p1707492827160649

@ChrisMcGowan

ChrisMcGowan commented Feb 14, 2024

Some of this is in the Slack thread above, but here is our timeline, versions, and observations.

Tech details: IaaS is AWS GovCloud; BOSH directors at 280.0.14 and stemcell 1.351. Diego cells are m5.2xlarge, using a memory overallocation of 50 GB vs. the physical RAM of 32 GB. The platform has used this configuration for a few years now without issue.

Back on Jan 13/14th we started to see "Memory cgroup out of memory" OOM errors in /var/log/kern.log on diego-cells. At the time we were on stemcell 1.329, which was Linux kernel 6.2.0-39, and cf-deployment v35.3. This deployment combo was done on Dec 28th of 2023.

Jan 23rd we moved to stemcell 1.340, which was still on Linux kernel 6.2.0-39, and cf-deployment v37.0.0. The memory cgroup out-of-memory errors continued in kern.log. At this time we had no reported issues from our users.

Jan 30th we moved to stemcell 1.351, which was now Linux kernel 6.5.0-15, and cf-deployment v37.2.0. The memory cgroup out-of-memory errors continued in kern.log but increased. At the same time our users started to see OOM errors while staging their apps, as well as an increase in running instances crashing with OOM errors - some users were seeing "Instance became unhealthy: Liveness check unsuccessful: failed to make TCP connection to <silk cidr addr>:8080: dial tcp <silk cidr addr>:8080: connect: connection refused (out of memory); process did not exit" in cf events. These crash events seemed to be spread out, not localized to specific diego-cells, and happened on apps using various buildpacks. Looking at Prometheus, the value of firehose_value_metric_bbs_crashed_actual_lr_ps increased as well.

Feb 1st we kept the same versions as Jan 30th but increased diego-cell capacity by 10%. No change in errors or reported issues.

Feb 7th we deployed cf-deployment v37.3.0 but were still on stemcell 1.351. No change in the increased kern.log errors or the user-reported OOM issues.

Feb 8th we deployed stemcell 1.360, which was Linux kernel 6.5.0-17 and was supposed to contain the fix noted here. This deployment was also done with cf-deployment v37.4.0. Again, no change in the amount of kern.log errors or staging errors, but crashing instances increased, then leveled out, and are still elevated in firehose_value_metric_bbs_crashed_actual_lr_ps.

On Feb 13th we updated our deployment to increase the staging memory limit from the default 1024 to 2048. We also expanded our diego-cell capacity another 10%. Versions of stemcell and cf-deployment stayed the same as Feb 8th. The kern.log errors stayed the same. For users that did a cf restage of their apps, most stopped getting OOM errors, but a few still occur. We are still looking closer, but the consistently failing one is based on the node-js buildpack. The number of instances still crashing did drop according to the firehose_value_metric_bbs_crashed_actual_lr_ps metric, but it is still elevated from before the event. Looking at one of the new crashes, we found it was on a new, just-added diego-cell that only had 10 running instances on it, and this app instance ran and crashed twice on that cell with the same kern.log errors.

During this whole event, looking at Prometheus metrics from BOSH, the diego-cell average physical memory usage stayed around 50-65% - typical of what we have seen in the past. No metrics, log entries, or BOSH HM events indicate any of the diego-cells ran out of physical RAM. Prometheus metrics on diego-cell allocated vs. available memory capacity showed we were between 50-70% of allocated memory in use. The first capacity add was meant to lower that amount closer to 50%. The second add was to see if some additional cushion would help - it didn't.

The change of stemcell kernel from the 6.2 series to 6.5 seems to have made the problem a lot worse. Increasing staging memory is a band-aid for staging apps, but it does nothing for the increase in running instances crashing with OOM errors.

What we have not yet narrowed down is why the kern.log errors started on Jan 13/14 when we were still on stemcell 1.329 using kernel 6.2.0-39. The bug fix noted above involved rolling back to kernel 6.2.0-35, so maybe 6.2.0-39 had a bug as well that took longer to manifest? Going back over 6 months, there are zero hits on these errors in our log archive. Another CF user reported that rolling back to stemcell 1.340 removed most of their issues, but going back to 1.340 from 1.360 re-opens at least 1 high, 2 high/med, and 4 med/high CVEs according to the release notes.

@PlamenDoychev

PlamenDoychev commented Feb 15, 2024

Dear Colleagues,

CF Foundation versions used:
CF-Deployment v37
linux stemcell v1.351, tested also on v1.360

From the Cloud Foundry side we noticed the following symptom affecting CF apps.
During the CF staging process we observed a large number of apps (using different buildpacks) failing with OOM.

   Exit status 137 (out of memory)
   Cell f8e8a121-82a1-4d0a-a98e-32aa29ae483d stopping instance 23a3b93a-2ce2-4470-89e5-7c89fffd3508
   Cell f8e8a121-82a1-4d0a-a98e-32aa29ae483d destroying container for instance 23a3b93a-2ce2-4470-89e5-7c89fffd3508
Error staging application: StagingError - Staging error: staging failed
FAILED

Based on our investigation, we validated that the current staging container memory limit of 1024 MB is not sufficient to stage applications that had previously staged successfully.

To work around the issue, we:

  1. Noticed that in some cases increasing the requested app memory solves the issue. E.g., a dummy hello-world application that usually requires 500 MB was consistently failing to stage; by increasing the requested memory to 1 GB, the app was able to stage. Unfortunately this solution does not work in general and isn't acceptable for customers.
  2. Had a plan to adjust the general configuration for the staging container size: https://github.com/cloudfoundry/capi-release/blob/96fda367a817aaccbfc4c735db0ab81882066a5c/jobs/cloud_controller_ng/spec#L927. This would most probably have masked the problem, but the underlying issue would still be in place.
  3. Decided to downgrade the stemcell to 1.340 as the last known good version. This generally resolved the staging issues.

@schindlersebastian

schindlersebastian commented Feb 15, 2024

same here with 1.360:

Exit status 137 (out of memory)
StagingError - Staging error: staging failed
FAILED

As @ChrisMcGowan and @PlamenDoychev mentioned, increasing the memory as a workaround solves the issue...

@ChrisMcGowan

Any new updates from any of the working groups, or new band-aids folks have found?

Stemcell 1.379 was released the other day, but the kernel bump is minor (to 6.5.0-18), so I'm not expecting much if any change to the issue. We still plan to roll that stemcell out into production.

Just for reference for folks rolling back to 1.340 or older stemcells: the switch of the kernel from 6.2.X to 6.5.X was made because the 6.2.X kernel is now EOL - see: https://ubuntu.com/about/release-cycle#ubuntu-kernel-release-cycle. Rolling back to something EOL and losing patched CVEs is a deal breaker for us.

@cunnie
Member

cunnie commented Feb 23, 2024

FYI, Lakin and I have discovered that the OOMs occur with total_cache bumping up against the 1 GB memory limit of the cgroup. It appears to be a problem with cache eviction not working properly. We have 25 OOMs on an 8-core, 16 GB, 1.379 vSphere Diego cell, and the total cache ranges from 925 MiB to 972 MiB. On the earlier stemcells, the total footprint (including cache) rarely exceeds 500 MiB.

@cunnie
Member

cunnie commented Feb 28, 2024

Synopsis: We're still troubleshooting the OOM problem.

  • 2/27: We've managed to get an OOM on a regular Jammy VM (not a stemcell). This helps Canonical replicate the problem (they don't need to stand up an entire Cloud Foundry foundation); however, there are some caveats:
    • We only saw 2 OOMs over the course of several hours on the vanilla Jammy VM; that stands in stark contrast to ~17/hour on the Diego VMs — we're missing something.
  • 2/26 Reviewed the changes to the Linux kernel's mm/memcontrol.c from v6.2 → v6.5, hoping to glean some understanding of what might be causing the OOMs on v6.5. I came up empty-handed.
  • 2/25 We managed to replicate the problem on a 4-vCPU system; we don't need an 8- or 16-vCPU system.
  • 2/25 Unsuccessfully attempted to eliminate the OOMs by using the sysctl job in the os-conf BOSH release to set vm.swappiness and vm.dirty_background_ratio
  • 2/25 Unsuccessfully attempted to eliminate the OOMs by "pre-heating" the Diego Cell (consuming the free RAM and then releasing) in a BOSH pre-start.
  • 2/24 We managed to replicate ~100 OOMs over the course of several hours on two Diego VMs.

@cunnie
Member

cunnie commented Mar 1, 2024

Synopsis:

  • 3/1 Have not heard back from Canonical.
  • 2/29 We were able to enhance our OOM-replication for Canonical to troubleshoot — on a regular Jammy VM, we were able to OOM almost every time. We accomplished this by lowering the memory limit from 1GiB to 512MiB. This makes it easier for Canonical to debug.
  • 2/29 We were able to confirm that OOM does not occur on v2 cgroups (CF uses v1 cgroups). This indicates the problem may be a v1 cgroups + 6.5 kernel problem.
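For anyone who wants to check which cgroup hierarchy their own VMs are running, a quick sketch (standard kernel mount points; nothing CF-specific is assumed):

$ stat -fc %T /sys/fs/cgroup/   # "cgroup2fs" => unified v2 hierarchy, "tmpfs" => legacy v1 hierarchy
$ mount | grep ^cgroup          # lists the per-controller v1 mounts, or the single cgroup2 mount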

@cunnie
Member

cunnie commented Mar 1, 2024

FAQ: OOM Errors on New Jammy Stemcells During Staging

The most recent set of stemcells (Jammy 1.351+) has introduced intermittent OOM (out-of-memory) failures when staging Cloud Foundry applications. Though intermittent, these errors have disrupted user updates and triggered several open issues. We believe it’s a kernel bug. Until the issue is fixed, we recommend users pin to an earlier stemcell (1.340) or increase the staging memory limit of applications.

What’s the error that users are seeing?

Before running an application on Cloud Foundry, the application must be “staged” (choosing the appropriate buildpack (Ruby, Golang, etc.), compiling, resolving dependencies). When a user runs “cf push” or “cf restage”, it fails with “Error staging application: StagingError - Staging error: staging failed”. When viewing /var/log/kern.log on the Diego cell where the app was staged, one sees the error “Memory cgroup out of memory: Killed process …” along with a stack trace.

How Can I Avoid OOMs on my Foundation?

One way is to pin the stemcell to Jammy 1.340 and not upgrade past it. If that’s not possible, bump the staging RAM limit from 1 GiB to 2 GiB or higher, depending on your staging footprint. Specifically, modify dea_next.staging_memory_limit_mb. The current default is 1024; we recommend bumping it to 2048 or 4096.
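For cf-deployment operators, a minimal sketch of an ops file that applies the bump (the instance group name api and the exact property path are assumptions about your manifest layout; adjust to match your deployment):

$ cat > bump-staging-memory.yml <<'EOF'
# Raise the staging container memory limit from the 1024 MiB default to 2048 MiB
- type: replace
  path: /instance_groups/name=api/jobs/name=cloud_controller_ng/properties/dea_next?/staging_memory_limit_mb
  value: 2048
EOF
$ bosh -d cf deploy cf-deployment.yml -o bump-staging-memory.yml   # plus your usual ops files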

Will Increasing the Staging RAM Limit Adversely Affect the Foundation?

We doubt increasing the staging RAM limit will have a negative impact unless the user is in the habit of restaging all their applications at the same time. The staging cycle is short-lived, and though staging an app will reserve a greater amount of RAM, that RAM will be released when the staging cycle completes within a few minutes.

What Stemcells are Affected?

The Linux stemcells based on Canonical’s Ubuntu Jammy Jellyfish release are affected, from version 1.351 (released January 29, 2024) through 1.390 (the current release). That coincides with Canonical’s introduction of the 6.5 Linux kernel (prior stemcells had the 6.2 Linux kernel).

Which IaaSes are Affected?

We have seen the problem on vSphere, GCP, and AWS, and we suspect it occurs on all IaaSes.

What’s Causing the Error?

We believe that the error is caused by a poor interaction between the Linux 6.5 kernel and v1 cgroups. The Linux 6.5 kernel was introduced with the 1.351 stemcell. Specifically, the introduction of Multi-Gen LRU:

  • Multi-Gen LRU was merged in 6.1, but disabled by default. The Ubuntu 6.2 kernel has it disabled by default.
  • The Ubuntu 6.5 kernel turns it on by default.
  • To disable: echo n | sudo tee /sys/kernel/mm/lru_gen/enabled
  • To verify, cat that file; it should read 0x0000

Which Applications are Affected?

Golang apps, and we have reports of NodeJS and Java apps as well.

What's Being Done to Fix the Error?

We're planning to roll back the kernel from 6.5 to 5.15. In the meantime, we're pursuing a fix with Canonical; we're also thinking of bumping the staging limit.

@matthewruffell

2/29 We were able to enhance our OOM-replication for Canonical to troubleshoot — on a regular Jammy VM, we were able to OOM almost every time. We accomplished this by lowering the memory limit from 1GiB to 512MiB. This makes it easier for Canonical to debug.

Hi @cunnie, I ran main.go on 5.15, 6.2, and 6.5, and found that under 512 MiB we OOM every time; there is no scenario in which this is enough memory to compile main.go. I also made an unbounded cgroup to see how much memory is consumed at the peak, and found that somewhere around 1-1.1 GB is needed.

$ uname -rv
5.15.0-97-generic #107-Ubuntu SMP Wed Feb 7 13:26:48 UTC 2024
$ cat /sys/fs/cgroup/memory/system.slice/512mb/memory.max_usage_in_bytes 
536883200
$ cat /sys/fs/cgroup/memory/system.slice/unbounded/memory.max_usage_in_bytes 
1091538944

$ uname -rv
6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2
$ cat /sys/fs/cgroup/memory/system.slice/512mb/memory.max_usage_in_bytes
536920064
$ cat /sys/fs/cgroup/memory/system.slice/unbounded/memory.max_usage_in_bytes
1083924480

$ uname -rv
6.5.0-21-generic #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb  9 13:32:52 UTC 2
$ cat /sys/fs/cgroup/memory/system.slice/512mb/memory.max_usage_in_bytes
537112576
$ cat /sys/fs/cgroup/memory/system.slice/unbounded/memory.max_usage_in_bytes
1180794880

Have you managed to compile main.go within a 512MiB limit on any kernel?

Do you have an example workload that used to work under 6.2, but now fails on 6.5?

I am also reviewing all commits to cgroup v1 between 6.2 and 6.5. Support for v1 does indeed exist under Jammy, but v2 will always be better tested, as it is more widely used and has been the default for a few years now.

Thanks,
Matthew

@jpalermo
Member

jpalermo commented Mar 5, 2024

Hey @matthewruffell

I was just trying to reproduce again and had some trouble getting it to build on 6.2, which is weird because I'm pretty sure I'm using the exact same steps I was using the other day.

I did get it to pass on 6.2 and fail on 6.5 when turning GOMAXPROCS down to 4 from the 16 we were using the other day.

$ uname -rv
6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2
$ cat /sys/fs/cgroup/memory/system.slice/oom-test/memory.limit_in_bytes
536870912
$ cat /sys/fs/cgroup/memory/system.slice/oom-test/memory.memsw.limit_in_bytes
536870912
$ GOMAXPROCS=4 ../go/bin/go build -mod vendor -ldflags="-s -w" -a .
$


$ uname -rv
6.5.0-1014-gcp #14~22.04.1-Ubuntu SMP Sat Feb 10 04:57:00 UTC 2024
$ cat /sys/fs/cgroup/memory/system.slice/oom-test/memory.limit_in_bytes
536870912
$ cat /sys/fs/cgroup/memory/system.slice/oom-test/memory.memsw.limit_in_bytes
536870912
$ GOMAXPROCS=4 ../go/bin/go build -mod vendor -ldflags="-s -w" -a .
github.com/jackc/pgtype: /workspace/go/pkg/tool/linux_amd64/compile: signal: killed
github.com/redis/go-redis/v9: /workspace/go/pkg/tool/linux_amd64/compile: signal: killed
$
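For reference, a rough sketch of how a bounded cgroup-v1 memory group like the one above can be set up (this assumes the v1 memory controller is mounted at /sys/fs/cgroup/memory as on these stemcells; the memsw line needs swap accounting enabled and can be skipped if the file is absent):

$ sudo mkdir -p /sys/fs/cgroup/memory/system.slice/oom-test
$ echo $((512*1024*1024)) | sudo tee /sys/fs/cgroup/memory/system.slice/oom-test/memory.limit_in_bytes
$ echo $((512*1024*1024)) | sudo tee /sys/fs/cgroup/memory/system.slice/oom-test/memory.memsw.limit_in_bytes
$ echo $$ | sudo tee /sys/fs/cgroup/memory/system.slice/oom-test/cgroup.procs   # move the current shell into the cgroup
$ GOMAXPROCS=4 ../go/bin/go build -mod vendor -ldflags="-s -w" -a .             # then run the reproducer from that shell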

@jpalermo
Member

jpalermo commented Mar 6, 2024

We tested on the 5.15 non-HWE kernel and were not able to reproduce the issue there, so the problem doesn't appear to have been introduced by something that got back-ported to 5.15.

jpalermo added a commit that referenced this issue Mar 7, 2024
We've seen OOM problems when running cgroups v1 with the 6.5 kernel line.

Resolves [#318]
jpalermo added a commit that referenced this issue Mar 7, 2024
* Revert from hwe kernel to normal lts kernel

We've seen OOM problems when running cgroups v1 with the 6.5 kernel line.

Resolves [#318]

The path for libsubcmd has changed between kernel versions, add the alternate path
8fecc29 originally changed this, now
we are adding both for compatibility.

Signed-off-by: Joseph Palermo <joseph.palermo@broadcom.com>
Co-authored-by: Long Nguyen <nguyenlo@vmware.com>
Co-authored-by: Brian Cunnie <brian.cunnie@broadcom.com>
@mymasse

mymasse commented Mar 7, 2024

Is anything going to be done for when we are back on a 6.5 kernel?

@jpalermo
Member

jpalermo commented Mar 7, 2024

If it is a cgroups v1 + 6.5 kernel problem, the current plan is to move the Noble stemcell to cgroups v2, so it should not be impacted there.

It's also likely the problem will get fixed in the 6.5 kernel at some point, it's just a question of when.

@jpalermo jpalermo reopened this Mar 7, 2024
@jpalermo
Member

jpalermo commented Mar 7, 2024

The candidate stemcells with the 5.15 kernel seem to work just as well as the 6.5 ones. The current plan, unless people find problems, is to publish them on Monday.

The version numbers are obviously not what will be released; that's just how this part of the pipeline works.

We did some testing against public IaaSes to verify that VM types work. We did not do an exhaustive test of all types, but we generally found that VM types that work with the current Jammy stemcell seem to work fine with the 5.15 version too.

https://storage.googleapis.com/bosh-core-stemcells-candidate/google/bosh-stemcell-210.892-google-kvm-ubuntu-jammy-go_agent.tgz
https://storage.googleapis.com/bosh-core-stemcells-candidate/aws/bosh-stemcell-210.892-aws-xen-hvm-ubuntu-jammy-go_agent.tgz
https://storage.googleapis.com/bosh-core-stemcells-candidate/azure/bosh-stemcell-210.892-azure-hyperv-ubuntu-jammy-go_agent.tgz
https://storage.googleapis.com/bosh-core-stemcells-candidate/vsphere/bosh-stemcell-210.892-vsphere-esxi-ubuntu-jammy-go_agent.tgz
https://storage.googleapis.com/bosh-core-stemcells-candidate/openstack/bosh-stemcell-210.892-openstack-kvm-ubuntu-jammy-go_agent-raw.tgz
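To try one of these candidates, upload it to your director before deploying, e.g.:

$ bosh upload-stemcell https://storage.googleapis.com/bosh-core-stemcells-candidate/google/bosh-stemcell-210.892-google-kvm-ubuntu-jammy-go_agent.tgz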

@matthewruffell

I think 5.15 is a good workaround for the time being. I am still looking into the 6.5 cgroups v1 issue, but it is tricky: I can sometimes get the reproducer to build successfully on 6.5 with the 512 MiB RAM limit, which makes bisecting hard because the failure is not deterministic.

I will write back once I have some more information to share. I will spend today, and early next week looking into this.

Thanks,
Matthew

cf-bosh-ci-bot pushed a commit that referenced this issue Mar 12, 2024
* Revert from hwe kernel to normal lts kernel

We've seen OOM problems when running cgroups v1 with the 6.5 kernel line.

Resolves [#318]

The path for libsubcmd has changed between kernel versions, add the alternate path
8fecc29 originally changed this, now
we are adding both for compatibility.

Signed-off-by: Joseph Palermo <joseph.palermo@broadcom.com>
Co-authored-by: Long Nguyen <nguyenlo@vmware.com>
Co-authored-by: Brian Cunnie <brian.cunnie@broadcom.com>
@jpalermo
Member

Ubuntu-jammy 1.404 has been cut with the 5.15 kernel. Can people experiencing the problem confirm if this resolves it?

@matthewruffell

Hi @jpalermo @cunnie,

I have been reading all changes between 6.2 and 6.5 related to cgroups and
memory control groups.

(in Linus Torvalds linux repository)

$ git log --grep "memcg" v6.2..v6.5
$ git log --grep "memcontrol" v6.2..v6.5
$ git log --grep "mm: memcontrol:" v6.2..v6.5

I came up with the following shortlist of commits that touch memory controls
within cgroups and, especially, cache eviction:

One line shortlist (Hash, Subject):
https://paste.ubuntu.com/p/dqFPP6kPHr/

Full git log:
https://paste.ubuntu.com/p/Hd2jqnw6Qw/

Now, there are about 70 commits of interest, and I can't guarantee any of them
are the culprit without testing them against a consistent 100% reproducer, which
we don't currently have, so the following is the current theory only.

The feature that stands out is the major work done on Multi-Generational
Least Recently Used (Multi-Gen LRU).

Documentation:
https://docs.kernel.org/mm/multigen_lru.html
https://docs.kernel.org/next/admin-guide/mm/multigen_lru.html

Explanations:
https://lwn.net/Articles/851184/
https://lwn.net/Articles/856931/

News articles:
https://www.phoronix.com/news/MGLRU-In-Linux-6.1
https://www.phoronix.com/news/Linux-MGLRU-memcg-LRU

LRU is a caching concept: Least Recently Used. The "old" implementation, used
in 5.15 and 6.2, adds an "age bit" that gets incremented each time the cache
line is used. The entries with the lowest bits get evicted when we experience
memory pressure. It is pretty simple, but not very sophisticated.

Multi-Gen LRU is quite complicated, but the idea is to group pages together
into "generations", taking into account things like spatial locality (how
close the page is to another frequently used page, e.g. 5 pages in a row used
for the same thing get grouped together), among other metrics.

Multi-Gen LRU was merged in 6.1, but disabled by default. The Ubuntu 6.2 kernel
has it disabled by default. The Ubuntu 6.5 kernel turns it on by default.

$ grep -Rin "LRU_GEN" config-*
config-6.2.0-39-generic:1157:CONFIG_LRU_GEN=y
config-6.2.0-39-generic:1158:# CONFIG_LRU_GEN_ENABLED is not set
config-6.2.0-39-generic:1159:# CONFIG_LRU_GEN_STATS is not set
config-6.5.0-21-generic:1167:CONFIG_LRU_GEN=y
config-6.5.0-21-generic:1168:CONFIG_LRU_GEN_ENABLED=y
config-6.5.0-21-generic:1169:# CONFIG_LRU_GEN_STATS is not set

6.2:

$ cat /sys/kernel/mm/lru_gen/enabled
0x0000

6.5:

$ cat /sys/kernel/mm/lru_gen/enabled
0x0007

0x0000 is off. 0x0007 is fully on, as per the table in:

https://docs.kernel.org/next/admin-guide/mm/multigen_lru.html

Before I go too deep into this rabbit hole, could we please try turning off
Multi-Gen LRU, and reverting back to the basic LRU as in 5.15 and 6.2, on the
6.5 kernel?

  1. Boot into the 6.5 kernel.
  2. Confirm Multi-Gen LRU is on:
$ cat /sys/kernel/mm/lru_gen/enabled
0x0007
  3. Disable Multi-Gen LRU:
$ echo n | sudo tee /sys/kernel/mm/lru_gen/enabled
n
  4. Check it is off:
$ cat /sys/kernel/mm/lru_gen/enabled
0x0000
  5. Make a cgroup and try to reproduce the issue with main.go or a real-world workload.

Please let me know if it makes any difference. If it does, then we have our
culprit, and we can study Multi-Gen LRU more, and if it doesn't, it's back to
the drawing board, and we need to get a 100% reproducer running for further
analysis.

Thanks,
Matthew

@jpalermo
Member

We did some testing with Multi-Gen LRU disabled and were unable to reproduce the problem. If anybody still running the 6.5 kernel stemcells is able to disable it and see whether that resolves the problem, that would be great data to have.

You could either run it easily via a bosh ssh -c against the whole instance group, or do it via a pre-start script and the os-conf release, as sketched below.
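For example (a sketch; the deployment and instance-group names are placeholders for whatever your foundation uses):

$ bosh -d cf ssh diego-cell -c 'echo n | sudo tee /sys/kernel/mm/lru_gen/enabled'
$ bosh -d cf ssh diego-cell -c 'cat /sys/kernel/mm/lru_gen/enabled'   # should now print 0x0000

Note that this does not survive a VM recreate, which is why the os-conf pre-start approach is the better option for anything longer-lived.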

@schindlersebastian

schindlersebastian commented Mar 14, 2024

Hi *,
we did some testing with Multi-Gen LRU disabled on all of our diego cells.
I can confirm that afterwards the OOM problem was no longer reproducible!

Thanks for digging into it!
Sebastian

EDIT:
After in-depth testing in all of our stages, we saw no occurrences of the OOM error with
/sys/kernel/mm/lru_gen/enabled set to "n" (0x0000).
As soon as we re-enable Multi-Gen LRU, cf pushes start to fail in 50-70% of attempts.

@jpalermo
Member

Sounds like the ubuntu-jammy 1.404 stemcell has resolved the issue by using the 5.15 kernel. Going to close this now; please reopen if people see the issue again.

@matthewruffell

Hi everyone,

Just wanted to drop by with an update on the current situation.

I was reading the commits to Multi-Gen LRU between 6.5 and 6.8, and came across

commit 669281ee7ef731fb5204df9d948669bf32a5e68d
Author: Kalesh Singh kaleshsingh@google.com
Date: Tue Aug 1 19:56:02 2023 -0700
Subject: Multi-gen LRU: fix per-zone reclaim
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=669281ee7ef731fb5204df9d948669bf32a5e68d

which was particularly interesting. The symptoms are pretty much the same,
and there is a github issue [1] which describes the same sort of issues:

[1] raspberrypi/linux#5395

Unfortunately, this commit is already applied to the 6.5 kernel, and is a part
of 6.5.0-9-generic:

$ git log --grep "Multi-gen LRU: fix per-zone reclaim"
27c7d0b93445 Multi-gen LRU: fix per-zone reclaim
$ git describe --contains 27c7d0b93445eaadfe46bcdb57dab2090e023c19
Ubuntu-hwe-6.5-6.5.0-9.9_22.04.2~128

You were testing 6.5.0-15-generic at least, so the commit is present. So we are
looking for another fix.

I checked all the recent additions of patches to 6.5, and there were actually
a lot of Multi-Gen LRU commits added to the Ubuntu 6.5 kernel very recently.

$ git log --grep "mglru" --grep "MGLRU" --grep "Multi-gen LRU" Ubuntu-6.5.0-15.15..origin/master-next

commit c28ac3c7eb945fee6e20f47d576af68fdff1392a
Author: Yu Zhao yuzhao@google.com
Date: Fri Dec 22 21:56:47 2023 -0700
Subject: mm/mglru: skip special VMAs in lru_gen_look_around()
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c28ac3c7eb945fee6e20f47d576af68fdff1392a

commit 4376807bf2d5371c3e00080c972be568c3f8a7d1
Author: Yu Zhao yuzhao@google.com
Date: Thu Dec 7 23:14:07 2023 -0700
Subject: mm/mglru: reclaim offlined memcgs harder
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4376807bf2d5371c3e00080c972be568c3f8a7d1

commit 8aa420617918d12d1f5d55030a503c9418e73c2c
Author: Yu Zhao yuzhao@google.com
Date: Thu Dec 7 23:14:06 2023 -0700
Subject: mm/mglru: respect min_ttl_ms with memcgs
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8aa420617918d12d1f5d55030a503c9418e73c2c

commit 5095a2b23987d3c3c47dd16b3d4080e2733b8bb9
Author: Yu Zhao yuzhao@google.com
Date: Thu Dec 7 23:14:05 2023 -0700
Subject: mm/mglru: try to stop at high watermarks
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5095a2b23987d3c3c47dd16b3d4080e2733b8bb9

commit 081488051d28d32569ebb7c7a23572778b2e7d57
Author: Yu Zhao yuzhao@google.com
Date: Thu Dec 7 23:14:04 2023 -0700
Subject: mm/mglru: fix underprotected page cache
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=081488051d28d32569ebb7c7a23572778b2e7d57

commit bb5e7f234eacf34b65be67ebb3613e3b8cf11b87
Author: Kalesh Singh kaleshsingh@google.com
Date: Tue Aug 1 19:56:03 2023 -0700
Subject: Multi-gen LRU: avoid race in inc_min_seq()
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bb5e7f234eacf34b65be67ebb3613e3b8cf11b87

All but the very last commit are introduced in 6.5.0-27-generic, which was
just released to -updates today.

These fixups get us most of the way to all the fixes available in 6.8, and I
do wonder if they help things.

Would anyone please be able to install 6.5.0-27-generic on a stemcell and run
a real-world workload through it, with Multi-Gen LRU enabled?
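A rough sketch of a manual test on a running Jammy VM (the package name follows Ubuntu's usual versioned kernel packaging; this is an assumption, not a stemcell-specific recipe):

$ sudo apt-get update
$ sudo apt-get install -y linux-image-6.5.0-27-generic
$ sudo reboot
# after the reboot:
$ uname -rv                              # confirm 6.5.0-27-generic is running
$ cat /sys/kernel/mm/lru_gen/enabled     # should still read 0x0007 (Multi-Gen LRU enabled)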

I would be very eager to hear the results. These patches should help with
cache eviction and not running out of memory.

Thanks,
Matthew

@matthewruffell

Hi @jpalermo @cunnie,

Have you had a chance to have a look at 6.5.0-27-generic?

Thanks,
Matthew

@matthewruffell

Hi @jpalermo @cunnie,

Is it okay to assume that memory reclaim is good enough on 6.5.0-27-generic or later?
Let me know if you are still interested in 6.5. Noble has been released with its 6.8 kernel now; I hope your cgroups v2 transition is going well.

Thanks,
Matthew
