Performance degradation in 3033.2.0? #597

Open

dee-kryvenko opened this issue Jan 16, 2022 · 9 comments
Labels
area/cgroup2 (Issues uncovered through the migration to cgroup2) · kind/bug (Something isn't working) · resolution-suggested

Comments

@dee-kryvenko

Description

We are upgrading from 2905.2.4 to 3033.2.0 on AWS, managed with kops, using the following AMI lookup:

data "aws_ami" "flatcar" {
  owners      = ["075585003325"]
  most_recent = true

  filter {
    name   = "architecture"
    values = ["x86_64"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  filter {
    name   = "name"
    values = ["Flatcar-stable-${var.flatcar_version}*"]
  }
}

And we are seeing what looks like a performance hit. Our workloads are tightly limited:

        resources:
          requests:
            memory: 128Mi
            cpu: 50m
          limits:
            memory: 128Mi
            cpu: 500m

And some of them (specifically, the ones based on Java SpringBoot) are simply unable to start after the upgrade. They take ages to init the Java code, until the probe backs off and restarts the container. We have ruled out everything else (kops version, K8s version, etc.): just swapping the node group AMI from 2905.2.4 to 3033.2.0 is what triggers this behavior, under the same resource constraints and probe configuration.
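For reference, the restart loop is visible from the pod status and events; something along these lines (a sketch, with namespace and pod names as placeholders):

# Placeholders for namespace/pod name; check probe failures and restart count.
kubectl -n <namespace> describe pod <springboot-pod>
# In the Events section, look for repeated "Liveness probe failed" and
# "Back-off restarting failed container" entries while the container log
# never gets past the first couple of init lines.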

Impact

We have detected this in our test clusters, and as a result we are not able to upgrade our prod clusters. If a bunch of workloads are simply unable to start after the rolling upgrade in prod, we will have a major outage on our hands.

Environment and steps to reproduce

K8s 1.20.14, kops 1.20.3, AWS.

Expected behavior

I'd expect containers to be able to start with the same probe and resource constraints as they did on previous versions.

Additional information

N/A

@dee-kryvenko dee-kryvenko added the kind/bug Something isn't working label Jan 16, 2022
@jepio
Member

jepio commented Jan 17, 2022

The first thing that comes to mind is the switch to cgroupv2 - https://www.flatcar.org/docs/latest/container-runtimes/switching-to-unified-cgroups/. If I were you I would check if switching back allows the workloads to start with the newer Flatcar version.

Are you seeing the pods getting OOM killed? Cgroup2 and legacy cgroups perform memory accounting differently (legacy cgroups didn't account for all of it), so there is no guarantee that you will be able to use the same value for the memory limit. CPU accounting should be similar enough for there not to be a difference.
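One quick way to rule OOM kills in or out (a sketch; namespace and pod names are placeholders):

# Reports "OOMKilled" if the previous container instance hit the memory limit:
kubectl -n <namespace> get pod <pod> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

# On the node itself, the kernel logs every OOM kill:
dmesg -T | grep -i -e oom -e "killed process"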

@dee-kryvenko
Author

dee-kryvenko commented Jan 17, 2022

It is not getting OOM killed - it is just veeeery slow to start. Java is known to be slow to start, but my SpringBoot applications produce something like two lines of init logs and then there is no activity at all until the container gets killed by the probe timeout. That makes me feel like it is either IO or CPU throttled, but I guess it might be due to a lack of memory too.

Is there any human-readable explanation of what exactly changed in cgroups v2 with respect to memory usage?

@jepio
Member

jepio commented Jan 17, 2022

I don't think you'll find a single human-readable explanation; it's spread out over many blog posts and conference talks.

The best resource is probably this section, https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory, together with the memory.stat list. The biggest changes to the memory controller in cgroup2 are that kernel memory allocations, TCP socket buffers and block IO writeback buffers are now all counted against the limit.

If you increase the memory limit, does the application start correctly? If so, you could determine new limits by looking at the memory.current file.
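For example, on a cgroup v2 node the unified files are visible from inside the container, so usage can be read directly (a sketch; namespace and pod names are placeholders):

# Current usage vs. the configured limit for this container's cgroup:
kubectl -n <namespace> exec <pod> -- cat /sys/fs/cgroup/memory.current /sys/fs/cgroup/memory.max

# Per-category breakdown (kernel allocations, sock, file/writeback, ...):
kubectl -n <namespace> exec <pod> -- cat /sys/fs/cgroup/memory.stat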

@t-lo
Member

t-lo commented Jan 28, 2022

Hi @dee-kryvenko, we had other folks reporting Java slowness with old Java versions (in the case of the report I am referring to, Java 8) in combination with cgroups v2. The issues were resolved by switching the affected nodes back to cgroups v1.

Would you mind giving this a go and getting back to us with the results?

@jepio
Member

jepio commented Jan 28, 2022

There are cases where old Java runtimes don't know how to parse cgroup2 data and therefore don't configure heap and thread-pool sizes optimally for the cgroup limits. That could be it, and the only solutions would be to update the Java runtime or switch to cgroup v1 like @t-lo mentioned.
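One way to check whether a given runtime is affected (a hedged sketch, run with the image's own java binary; exact output varies by JDK version):

# JDK 11+ prints the container limits the JVM detected under
# "Operating System Metrics"; on a cgroup v2 node a runtime without
# cgroup v2 support will report no (or host-wide) limits there.
java -XshowSettings:system -version

# More detail on what the container-awareness code sees (JDK 10+):
java -Xlog:os+container=debug -version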

@jepio jepio added resolution-suggested area/cgroup2 Issues uncovered through the migration to cgroup2. labels Jan 28, 2022
@dee-kryvenko
Author

Hmmm, thank you @t-lo and @jepio - I think all applications we experienced issues with were Java, but I am pretty sure some of them were running Amazon Corretto 11. We have rolled back to the older Flatcar for the time being, but for the next upgrade attempt this is definitely something we'll look at.

@t-lo
Member

t-lo commented Jan 31, 2022

Thanks for getting back to us, @dee-kryvenko.

To ensure your issue is actually caused by cgroups v2, it would also be very helpful if you could run 3033.2.0 in cgroups v1 mode (see https://www.flatcar.org/docs/latest/container-runtimes/switching-to-unified-cgroups/#starting-new-nodes-with-legacy-cgroups) and validate whether you're still hitting performance issues.
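After rebooting a node with that configuration, the active hierarchy can be confirmed with (a sketch):

# "tmpfs" means the legacy (v1) hierarchy is mounted at /sys/fs/cgroup,
# "cgroup2fs" means the node is still on the unified (v2) hierarchy.
stat -fc %T /sys/fs/cgroup/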

On a more general note, the maintainers team is currently investigating options to make it easier to continue using cgroups v1 by default in future releases. Stay tuned!

@sayanchowdhury
Member

@dee-kryvenko Are you still facing the reported issue? If not, can we close this issue?

@dee-kryvenko
Author

We have since moved away from kops and Flatcar, so no, we are not having this issue anymore.
