Containers fail to create and probe exec errors related to seccomp on recent kernel-5.10 versions #1219

essh · 2023-03-13T07:11:07Z

What happened:

After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS clusters, within a few days a number of the nodes had containers stuck in the ContainerCreating state or liveness/readiness probes reporting the following error:

Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4a11039f730203ffc003b7e64d5e682113437c8c07b8301771e53c710a6ca6ee": OCI runtime exec failed: exec failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown

This issue is very similar to #1179. However, we had not been seeing this issue on previous AMIs and it only started to occur on v20230217 (following the upgrade from kernel 5.4 to 5.10) with no other changes to the underlying cluster or workloads.

We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528), which immediately allowed containers to be created and probes to execute, but after approximately a day the issue returned and the value returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' was steadily increasing.
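As a rough sketch, the check above can be turned into a usage-versus-limit comparison. The function names bpf_jit_bytes and bpf_jit_pct are illustrative only and not part of any existing tool; the input is assumed to be in /proc/vmallocinfo format, with the allocation size in the second column.

```shell
#!/bin/sh
# Sum the size column (field 2) of every bpf_jit line read from stdin.
# Input is assumed to be in /proc/vmallocinfo format.
bpf_jit_bytes() {
    awk '/bpf_jit/ { s += $2 } END { printf "%.0f\n", s }'
}

# Print usage as a whole-number percentage of the limit.
# $1 = bytes in use, $2 = limit in bytes (e.g. net.core.bpf_jit_limit).
bpf_jit_pct() {
    echo $(( $1 * 100 / $2 ))
}

# On a node (root is needed to read vmallocinfo):
#   used=$(sudo cat /proc/vmallocinfo | bpf_jit_bytes)
#   limit=$(sysctl -n net.core.bpf_jit_limit)
#   echo "bpf_jit: ${used}/${limit} bytes ($(bpf_jit_pct "$used" "$limit")%)"
```

Keeping the parsing in a function that reads stdin means it can be tried against a saved copy of vmallocinfo without root access.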

What you expected to happen:

Containers to launch successfully and become Ready
Liveness and readiness probes to execute successfully

How to reproduce it (as minimally and precisely as possible):

I don't currently have a reproduction that I can share due to my current one using some internal code (I can hopefully produce a more generic one if required when I get a chance).

As a starting point we only noticed this happening on nodes that had pods scheduled on them which had an exec liveness & readiness probe running every 10 seconds that performs a health check against a gRPC service using grpcurl. In addition to this we also have a default Pod Security Policy (yes we know they are deprecated 😄) that has the following annotation seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default.

These two conditions seem to be enough to trigger this issue and the values reported by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' will steadily increase over time until containers can no longer be created on the node.

Anything else we need to know?:

Environment:

AWS Region: Multiple
Instance Type(s): Mix of x86_64 and arm64 instances of varying sizes
EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.4"
Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.24"
AMI Version: v20230217
Kernel (e.g. uname -a): 5.10.165-143.735.amzn2.x86_64 #1 SMP Wed Jan 25 03:13:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Release information (run cat /etc/eks/release on a node):

BASE_AMI_ID="ami-09bffa74b1e396075"
BUILD_TIME="Fri Feb 17 21:59:10 UTC 2023"
BUILD_KERNEL="5.10.165-143.735.amzn2.x86_64"
ARCH="x86_64"

Official Guidance

Kubernetes pods using SECCOMP filtering on EKS optimized AMIs based on Linux Kernel version 5.10.x may get stuck in ContainerCreating state or their liveness/readiness probes fail with the following error:

unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524

When a process with SECCOMP filters creates a child process, the same filters are inherited and applied to the new process. The Amazon Linux kernel versions 5.10.x are affected by a memory leak that occurs when a parent process is terminated while creating a child process. When the total amount of memory allocated for SECCOMP filters exceeds the limit, a process cannot create a new SECCOMP filter. As a result, the parent process fails to create a new child process and the above error message is logged.

This issue is more likely to be encountered with kernel versions kernel-5.10.176-157.645.amzn2 and kernel-5.10.177-158.645.amzn2 where the rate of the memory leak is higher.

Amazon Linux will be releasing the fixed kernel by May 1st, 2023. We will be releasing a new set of EKS AMIs with the updated kernel by May 3rd, 2023 at the latest.
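A minimal sketch for checking whether a node runs a kernel in the 5.10.x series described above. The function name is_kernel_5_10 is hypothetical, and the check only matches the version prefix; it does not know which specific 5.10 builds contain the fix.

```shell
#!/bin/sh
# Return 0 if the given kernel release string is in the 5.10.x series,
# 1 otherwise. Only the version prefix is inspected.
is_kernel_5_10() {
    case "$1" in
        5.10.*) return 0 ;;
        *)      return 1 ;;
    esac
}

# On a node:
#   is_kernel_5_10 "$(uname -r)" && echo "5.10.x kernel: may be affected"
```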

essh · 2023-03-17T06:14:24Z

I've managed to build a reliable reproduction for this issue that I can now share. A quick summary is that the impact seems to depend on instance type. I have been able to consistently reproduce this issue on c5d.xlarge & c5a.xlarge instance types (x86_64). I have seen some bpf_jit memory growth on c6g.xlarge instances (arm64) but it seems a bit slower and I haven't seen containers fail to create on these nodes yet as a result. I can't reproduce this issue on a t3a.large instance as bpf_jit memory levels remain pretty consistent.

The easiest way to reproduce this is to spin up a fresh EKS 1.24 cluster and add a single node of the required instance type (this makes it easier to observe) running EKS AMI v20230217 (or v20230304). Then run the following commands:

kubectl delete clusterrolebinding eks:podsecuritypolicy:authenticated
kubectl delete clusterrole eks:podsecuritypolicy:privileged
kubectl delete podsecuritypolicy eks.privileged
kubectl apply -f https://gist.githubusercontent.com/essh/f7dd219a48df25e7294847484da112b7/raw/503ff9a8f32f19430040cd65c213479979bfcc3c/bpf-jit-leak.yaml

This removes the eks.privileged PSP, installs PSPs that use seccomp and starts up a simple app with some exec probes that trigger the issue. The container used for this app is built from the source available at https://github.com/essh/grpc-greeter-node.

Once this is running you can observe memory growth by executing sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' on the node. You can tweak the replica count up and down to speed up/slow down this process. If you leave it long enough the value will exceed net.core.bpf_jit_limit and you will end up with failure to create containers/exec probes. We were seeing this after about 2-3 days with our node types/workloads in our environment.
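To get a feel for how long a node has before it hits the limit, two samples of the value above can be extrapolated. This is only a sketch: bpf_jit_eta is a made-up helper name, and it assumes the growth is linear, which may not hold for real workloads.

```shell
#!/bin/sh
# Estimate seconds until usage reaches the limit, assuming linear growth.
# $1 = earlier sample (bytes), $2 = later sample (bytes),
# $3 = seconds between the two samples, $4 = limit (bytes).
# Prints "stable" when usage did not grow between the samples,
# and 0 when the later sample is already at or over the limit.
bpf_jit_eta() {
    growth=$(( $2 - $1 ))
    if [ "$growth" -le 0 ]; then
        echo "stable"
        return 0
    fi
    remaining=$(( $4 - $2 ))
    if [ "$remaining" -le 0 ]; then
        echo 0
        return 0
    fi
    echo $(( remaining * $3 / growth ))
}

# On a node, take two samples an hour apart and compare against
# the value of: sysctl -n net.core.bpf_jit_limit
#   bpf_jit_eta "$first" "$second" 3600 "$limit"
```

A result of "stable" over several samples would match what was seen on t3a.large, while a steadily shrinking estimate would match the c5d.xlarge and c5a.xlarge behaviour.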

This same test against EKS AMI v20230203 nodes or lower (kernel 5.4) does not exhibit this issue.

cartermckinnon · 2023-03-17T16:02:19Z

@essh really appreciate the details; I'm following up internally with our kernel folks and will update here as I try to reproduce.

essh · 2023-03-18T05:52:03Z

If it helps I see the same behaviour with the following much simpler manifest that doesn't require any of the (deprecated/removed) PSP fiddling. You can apply this directly to a newly created cluster that meets the reproduction requirements, nothing else required.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bpf-jit-leak
  labels:
    app: bpf-jit-leak
spec:
  replicas: 8
  selector:
    matchLabels:
      app: bpf-jit-leak
  template:
    metadata:
      labels:
        app: bpf-jit-leak
    spec:
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: bpf-jit-leak
        image: essh/grpc-greeter-node:latest
        ports:
        - containerPort: 50051
          name: grpc
          protocol: TCP
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 256Mi
        livenessProbe:
          exec:
            command:
            - /opt/app/scripts/health.sh
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command:
            - /opt/app/scripts/health.sh
          failureThreshold: 1
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 3
          timeoutSeconds: 5

Without the following on the spec I don't see the issue, i.e. the value reported by sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' is not periodically increasing.

      securityContext:
        seccompProfile:
          type: RuntimeDefault

borkmann · 2023-03-20T10:32:44Z

> @essh really appreciate the details; I'm following up internally with our kernel folks and will update here as I try to reproduce.

@essh @cartermckinnon I happened to take a look at this recently, and tried to reproduce this on latest bpf tree kernel. I dumped the values around bpf_jit_charge_modmem and bpf_jit_uncharge_modmem, in particular the size passed in and the value of bpf_jit_current after the operation. They all look sane to me. For example, when running tcpdump with a specific filter (e.g. tcpdump -i lo tcp) but also a test application loading a seccomp BPF policy, I can see the bpf_jit_current counter going up and then discharging again with the same value. Also I tested on native eBPF programs, same here. This all looks good to me.

@cartermckinnon if you follow up with the kernel folks, I'd suggest checking the same, i.e. is bpf_jit_current steadily increasing (and never decreasing), or does it look sane when loading/unloading programs and just the default limit is too low?

Either way, the default limit for any BPF user for the JIT is currently set to 1/4 of the module memory space, and I'll send an upstream patch (and also recommend for stable) to bump this default limit to 1/2.
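For illustration, the effect of moving from 1/4 to 1/2 can be worked through with simple arithmetic. This is a simplified sketch, not the kernel's code: jit_limit is a hypothetical helper, the rounding to a page multiple is an assumption, and the 1 GiB module space in the usage lines is a made-up figure rather than a value for any particular architecture.

```shell
#!/bin/sh
# Compute a JIT limit as module_space / divisor, rounded up to a
# multiple of the page size.
# $1 = module memory space in bytes
# $2 = divisor (4 for the old default, 2 for the proposed one)
# $3 = page size in bytes
jit_limit() {
    raw=$(( $1 / $2 ))
    echo $(( (raw + $3 - 1) / $3 * $3 ))
}

# With a hypothetical 1 GiB module space and 4 KiB pages:
#   jit_limit 1073741824 4 4096   # old default: 268435456 (256 MiB)
#   jit_limit 1073741824 2 4096   # new default: 536870912 (512 MiB)
```

Doubling the limit only delays the failure if the counter never decreases; it helps when usage is legitimately high but bounded.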

From @essh's description though, it looks like the counter is never decreasing which looks like an AWS kernel bug if indeed true, perhaps some backport going wrong, etc. Would be good to double check.

We've seen recent AWS EKS (Kubernetes) user reports like the following:

  After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS clusters after a few days a number of the nodes have containers stuck in ContainerCreating state or liveness/readiness probes reporting the following error:

    Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4a11039f730203ffc003b7[...]": OCI runtime exec failed: exec failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown

  However, we had not been seeing this issue on previous AMIs and it only started to occur on v20230217 (following the upgrade from kernel 5.4 to 5.10) with no other changes to the underlying cluster or workloads.

  We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528) which helped to immediately allow containers to be created and probes to execute but after approximately a day the issue returned and the value returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' was steadily increasing.

I tested bpf tree to observe bpf_jit_charge_modmem, bpf_jit_uncharge_modmem their sizes passed in as well as bpf_jit_current under tcpdump BPF filter, seccomp BPF and native (e)BPF programs, and the behavior all looks sane and expected, that is nothing "leaking" from an upstream perspective.

The bpf_jit_limit knob was originally added in order to avoid a situation where unprivileged applications loading BPF programs (e.g. seccomp BPF policies) consuming all the module memory space via BPF JIT such that loading of kernel modules would be prevented. The default limit was defined back in 2018 and while good enough back then, we are generally seeing far more BPF consumers today.

Adjust the limit for the BPF JIT pool from originally 1/4 to now 1/2 of the module memory space to better reflect today's needs and avoid more users running into potentially hard to debug issues.

Fixes: fdadd04 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
Reported-by: Stephen Haynes <sh@synk.net>
Reported-by: Lefteris Alexakis <lefteris.alexakis@kpn.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: awslabs/amazon-eks-ami#1179
Link: awslabs/amazon-eks-ami#1219

borkmann · 2023-03-21T19:23:48Z

Looks like a potentially missing kernel commit in seccomp is causing this issue: a1140cb215fa ("seccomp: Move copy_seccomp() to no failure path.") (via https://lore.kernel.org/bpf/20230321170925.74358-1-kuniyu@amazon.com/)

stevo-f3 · 2023-03-21T22:50:01Z

Is the memleak (mentioned in https://lore.kernel.org/bpf/20230321170925.74358-1-kuniyu@amazon.com/) fixed in 5.4? If so, would it make sense for the kernel in the published amazon-eks-ami AMI to be downgraded from 5.10 to 5.4 until the memleak fix is backported to 5.10 and newer?

borkmann · 2023-03-21T23:11:07Z

5.4 kernel would not be affected as it does not seem to have the offending commit 3a15fb6ed92c ("seccomp: release filter after task is fully dead") which a1140cb215fa ("seccomp: Move copy_seccomp() to no failure path.") fixes.

stevo-f3 · 2023-03-22T07:02:11Z

Thanks @borkmann for heads up!

It's non trivial to downgrade the kernel downstream when building AMI based on this upstream EKS node AMI which is on kernel 5.10. It would be great if this upstream AMI were downgraded to kernel 5.4 (at least until the memory leak fix is backported to affected 5.10+ kernels); anyone who really needs 5.10 or newer and can live with the known memory leak can more easily upgrade the kernel on their own in a custom AMI based on the upstream one. WDYT?

borkmann · 2023-03-22T09:24:31Z

I'll defer to AWS folks with regards to your question, Cc @cartermckinnon. Hopefully this can be fixed quickly by cherry-picking the two commits below for EKS 5.10 kernel.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a1140cb215fa13dcec06d12ba0c3ee105633b7c4
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=10ec8ca8ec1a2f04c4ed90897225231c58c124a7

dims · 2023-03-22T11:00:18Z

@borkmann ACK on behalf of @cartermckinnon. please give us some time to do things...

cartermckinnon · 2023-03-23T21:09:09Z

Unfortunately the series of patches we've cherrypicked internally does not seem to resolve the issue. We're still looking into it.

I was not able to reproduce this with 5.15, so we're diff-ing the changelog as well.

cartermckinnon · 2023-03-23T21:25:31Z

> It's non trivial to downgrade the kernel downstream when building AMI based on this upstream EKS node AMI which is on kernel 5.10

@stevo-f3 This should do it:

yum versionlock delete kernel
amazon-linux-extras disable kernel-5.10
amazon-linux-extras enable kernel-5.4
yum install -y kernel

At present, we have more users needing 5.10 who are not experiencing this leak than those who are; downgrading the official build to 5.4 would be a last resort if we can't put a fix together.

stevo-f3 · 2023-03-23T22:13:10Z

We can't use 5.15. We recently downgraded to 5.10 because with 5.15 we were experiencing kernel panics on instance startup, in production only.

Thanks for the instructions for downgrading to 5.4; it is trivial after all. We will use them at least until a fix is available.

stevo-f3 mentioned this issue Mar 23, 2023

feat: optional disabled by default collector for /proc/vmallocinfo to make bpf jit usage observable prometheus/node_exporter#2640

apolovov mentioned this issue Feb 7, 2024

[monitoring] Throw an alert if the bpf_jit buffer is full deckhouse/deckhouse#7402
