Performance degradation in 3033.2.0? #597

Open

dee-kryvenko opened this issue Jan 16, 2022 · 9 comments
Labels
area/cgroup2 (Issues uncovered through the migration to cgroup2) · kind/bug (Something isn't working) · resolution-suggested

Comments

@dee-kryvenko

Description

We are upgrading from 2905.2.4 to 3033.2.0 on AWS, managed with kops, using the following AMI lookup:

data "aws_ami" "flatcar" {
  owners      = ["075585003325"]
  most_recent = true

  filter {
    name   = "architecture"
    values = ["x86_64"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  filter {
    name   = "name"
    values = ["Flatcar-stable-${var.flatcar_version}*"]
  }
}

And we are seeing what looks like a performance hit. Our workloads are tightly limited:

        resources:
          requests:
            memory: 128Mi
            cpu: 50m
          limits:
            memory: 128Mi
            cpu: 500m

And some of them (specifically, the ones based on Java SpringBoot) are simply unable to start after the upgrade. They take ages to init the Java code, until the probe backs off and restarts the container. We have ruled out everything else (kops version, K8s version, etc.): just swapping the node group AMI from 2905.2.4 to 3033.2.0 is what triggers this behavior, under the same resource constraints and probe configuration.
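For reference, the restart loop is visible from the pod status and events; something along these lines (a sketch, with namespace and pod names as placeholders):

# Placeholders for namespace/pod name; check probe failures and restart count.
kubectl -n <namespace> describe pod <springboot-pod>
# In the Events section, look for repeated "Liveness probe failed" and
# "Back-off restarting failed container" entries while the container log
# never gets past the first couple of init lines.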

Impact

We have detected this in our test clusters, and as a result we are not able to upgrade our prod clusters. If a bunch of workloads are simply unable to start after the rolling upgrade in prod, we will have a major outage on our hands.

Environment and steps to reproduce

K8s 1.20.14, kops 1.20.3, AWS.

Expected behavior

I'd expect containers to be able to start with the same probe and resource constraints as they did on previous versions.

Additional information

N/A

@dee-kryvenko dee-kryvenko added the kind/bug Something isn't working label Jan 16, 2022
@jepio
Member

jepio commented Jan 17, 2022

The first thing that comes to mind is the switch to cgroupv2 - https://www.flatcar.org/docs/latest/container-runtimes/switching-to-unified-cgroups/. If I were you I would check if switching back allows the workloads to start with the newer Flatcar version.

Are you seeing the pods getting OOM killed? Cgroup2 and legacy cgroups perform memory accounting differently (legacy cgroups didn't account for all of it), so there is no guarantee that you will be able to use the same value for the memory limit. CPU accounting should be similar enough for there not to be a difference.
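One quick way to rule OOM kills in or out (a sketch; namespace and pod names are placeholders):

# Reports "OOMKilled" if the previous container instance hit the memory limit:
kubectl -n <namespace> get pod <pod> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

# On the node itself, the kernel logs every OOM kill:
dmesg -T | grep -i -e oom -e "killed process"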

@dee-kryvenko
Author

dee-kryvenko commented Jan 17, 2022

It is not getting OOM killed - it is just veeeery slow to start. Java is known to be slow to start, but my SpringBoot applications produce something like two lines of init logs and then there is no activity at all until the container gets killed by the probe timeout. That makes me feel like it is either IO or CPU throttled, but I guess it might be due to a lack of memory too.

Is there any human-readable explanation of what exactly changed in cgroups v2 with respect to memory usage?

@jepio
Member

jepio commented Jan 17, 2022

I don't think you'll find a single human-readable explanation; it's spread out over many blog posts and conference talks.

The best resource is probably this section, https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory, together with the memory.stat list. The biggest changes to the memory controller in cgroup2 are that kernel memory allocations, TCP socket buffers and block IO writeback buffers are now all counted against the limit.

If you increase the memory limit, does the application start correctly? If so, you could determine new limits by looking at the memory.current file.
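For example, on a cgroup v2 node the unified files are visible from inside the container, so usage can be read directly (a sketch; namespace and pod names are placeholders):

# Current usage vs. the configured limit for this container's cgroup:
kubectl -n <namespace> exec <pod> -- cat /sys/fs/cgroup/memory.current /sys/fs/cgroup/memory.max

# Per-category breakdown (kernel allocations, sock, file/writeback, ...):
kubectl -n <namespace> exec <pod> -- cat /sys/fs/cgroup/memory.stat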

@t-lo
Member

t-lo commented Jan 28, 2022

Hi @dee-kryvenko, we had other folks reporting Java slowness with old Java versions (in the case of the report I am referring to, Java 8) in combination with cgroups v2. The issues were resolved by switching the affected nodes back to cgroups v1.

Would you mind giving this a go and getting back to us with the results?

@jepio
Member

jepio commented Jan 28, 2022

There are cases where old Java runtimes don't know how to parse cgroup2 data and therefore don't configure heap and thread-pool sizes optimally for the cgroup limits. That could be it, and the only solutions would be to update the Java runtime or switch to cgroup v1 like @t-lo mentioned.
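One way to check whether a given runtime is affected (a hedged sketch, run with the image's own java binary; exact output varies by JDK version):

# JDK 11+ prints the container limits the JVM detected under
# "Operating System Metrics"; on a cgroup v2 node a runtime without
# cgroup v2 support will report no (or host-wide) limits there.
java -XshowSettings:system -version

# More detail on what the container-awareness code sees (JDK 10+):
java -Xlog:os+container=debug -version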

@jepio jepio added resolution-suggested area/cgroup2 Issues uncovered through the migration to cgroup2. labels Jan 28, 2022
@dee-kryvenko
Author

Hmmm, thank you @t-lo and @jepio - I think all applications we experienced issues with were Java, but I am pretty sure some of them were running Amazon Corretto 11. We have rolled back to the older Flatcar for the time being, but for the next upgrade attempt this is definitely something we'll look at.

@t-lo
Member

t-lo commented Jan 31, 2022

Thanks for getting back to us, @dee-kryvenko.

To ensure your issue is actually caused by cgroups v2, it would also be very helpful if you could run 3033.2.0 in cgroups v1 mode (see https://www.flatcar.org/docs/latest/container-runtimes/switching-to-unified-cgroups/#starting-new-nodes-with-legacy-cgroups) and validate whether you're still hitting performance issues.
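After rebooting a node with that configuration, the active hierarchy can be confirmed with (a sketch):

# "tmpfs" means the legacy (v1) hierarchy is mounted at /sys/fs/cgroup,
# "cgroup2fs" means the node is still on the unified (v2) hierarchy.
stat -fc %T /sys/fs/cgroup/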

On a more general note, the maintainers team is currently investigating options to make it easier to continue using cgroups v1 by default in future releases. Stay tuned!

@sayanchowdhury
Member

@dee-kryvenko Are you still facing the reported issue? If not, can we close this issue?

@dee-kryvenko
Author

We have since moved away from kops and Flatcar, so no, we are not having this issue anymore.
