nodeinit pods failing in 1.15.5 #32674

Closed
2 of 3 tasks
dlahn opened this issue May 22, 2024 · 15 comments
Labels
info-completed: The GH issue has received a reply from the author
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
kind/regression: This functionality worked fine before, but was broken in a newer release of Cilium.
sig/agent: Cilium agent related.

Comments

@dlahn

dlahn commented May 22, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We have recently tried to upgrade to 1.15.5 and the latest pre-release, but our nodeinit pods are failing with the following error:

nsenter: cannot open /proc/1/ns/ipc: Permission denied
!!! startup-script failed! exit code '1'

Reverting to 1.15.4 resolves the issue.

Cilium Version

1.15.5

Kernel Version

.

Kubernetes Version

v1.30.0-gke.145700

Regression

1.15.4

Sysdump

No response

Relevant log output

nsenter: cannot open /proc/1/ns/ipc: Permission denied
!!! startup-script failed! exit code '1'

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@dlahn dlahn added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 22, 2024
@dlahn
Author

dlahn commented May 22, 2024

@lmb lmb added kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. sig/agent Cilium agent related. labels May 23, 2024
@lmb
Contributor

lmb commented May 23, 2024

Do you have any custom helm config related to the nodeinit pod?

@lmb lmb added the need-more-info More information is required to further debug or fix the issue. label May 23, 2024
@dlahn
Author

dlahn commented May 23, 2024

@lmb

  nodeinit:
    enabled: true
    reconfigureKubelet: true
    removeCbrBridge: true

@github-actions github-actions bot added info-completed The GH issue has received a reply from the author and removed need-more-info More information is required to further debug or fix the issue. labels May 23, 2024
@aanm
Member

aanm commented May 23, 2024

@dlahn can you provide the steps you used for both 1.15.4 and 1.15.5? Thank you

@aanm aanm added need-more-info More information is required to further debug or fix the issue. and removed info-completed The GH issue has received a reply from the author labels May 23, 2024
@dlahn
Author

dlahn commented May 23, 2024

@aanm I think the regression may have been introduced here: https://github.com/cilium/cilium/pull/31641/files#diff-0ea42ad21164b19bec1732225e254d3096d1e4040481c00053669287d81015fe. So I misspoke: I think the last working version was 1.15.3. If we simply upgrade the Helm chart to the newest version, we get these errors.

The only way to get 1.15.4 to work is to add this to the nodeinit section:

  nodeinit:
    enabled: true
    reconfigureKubelet: true
    removeCbrBridge: true
    image:
      tag: "62093c5c233ea914bfa26a10ba41f8780d9b737f"

However, this doesn't work in 1.15.5.

@github-actions github-actions bot added info-completed The GH issue has received a reply from the author and removed need-more-info More information is required to further debug or fix the issue. labels May 23, 2024
@dlahn
Author

dlahn commented May 29, 2024

Any ideas here?

@jbmolle

jbmolle commented Jun 4, 2024

Hi!
I raised this issue on the Kubernetes GitHub: kubernetes/kubernetes#125069
I don't know whether your error is related, but I couldn't start Cilium with version 1.15.5 either, because the pod annotations were removed and replaced by an appArmorProfile of type Unconfined.
However, the Unconfined appArmorProfile doesn't work for me with containerd. So if you also use containerd, you can try re-adding the annotations as in 1.15.4:
container.apparmor.security.beta.kubernetes.io/cilium-agent: "unconfined"
container.apparmor.security.beta.kubernetes.io/clean-cilium-state: "unconfined"
container.apparmor.security.beta.kubernetes.io/mount-cgroup: "unconfined"
container.apparmor.security.beta.kubernetes.io/apply-sysctl-overwrites: "unconfined"
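
For anyone unsure where these go, here is a minimal sketch, assuming they are set on the pod template of the agent DaemonSet. Only the four annotation keys come from the list above. The DaemonSet name and namespace are assumptions about a typical install, and the fragment leaves out everything else a real manifest needs (selector, containers, and so on).

```yaml
# Sketch only: fragment of a DaemonSet showing where per-container
# AppArmor annotations live. Not a complete manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cilium            # assumed name
  namespace: kube-system  # assumed namespace
spec:
  template:
    metadata:
      annotations:
        # Key format: container.apparmor.security.beta.kubernetes.io/<container name>
        container.apparmor.security.beta.kubernetes.io/cilium-agent: "unconfined"
        container.apparmor.security.beta.kubernetes.io/clean-cilium-state: "unconfined"
        container.apparmor.security.beta.kubernetes.io/mount-cgroup: "unconfined"
        container.apparmor.security.beta.kubernetes.io/apply-sysctl-overwrites: "unconfined"
```

The suffix of each key has to match a container (or init container) name in the same pod template, so the list would differ for a pod with other container names.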

@dlahn
Author

dlahn commented Jun 4, 2024

Adding these annotations seems to have resolved the issue for us.

@ti-mo ti-mo removed the needs/triage This issue requires triaging to establish severity and next steps. label Jun 20, 2024
@danieljkemp

> Hi!
> I raised this issue on the Kubernetes GitHub: kubernetes/kubernetes#125069
> I don't know whether your error is related, but I couldn't start Cilium with version 1.15.5 either, because the pod annotations were removed and replaced by an appArmorProfile of type Unconfined.
> However, the Unconfined appArmorProfile doesn't work for me with containerd. So if you also use containerd, you can try re-adding the annotations as in 1.15.4:
> container.apparmor.security.beta.kubernetes.io/cilium-agent: "unconfined"
> container.apparmor.security.beta.kubernetes.io/clean-cilium-state: "unconfined"
> container.apparmor.security.beta.kubernetes.io/mount-cgroup: "unconfined"
> container.apparmor.security.beta.kubernetes.io/apply-sysctl-overwrites: "unconfined"

I ended up with the same issue/solution, found it by doing a helm chart diff.

Which leads to the follow-up question of why containerd doesn't support the new profile type. Are you also running Rancher RKE2?

@jbmolle

jbmolle commented Jun 25, 2024

For me, containerd turned out to be working fine. The problem came from the OpenTelemetry operator, which has a mutating admission webhook that was removing the appArmorProfile key from the pod definition.
They needed to update to Go 1.22 before they could pull in the latest K8s schemas and accept the appArmorProfile key.
For now I've removed the OpenTelemetry operator and Cilium is working fine again. The fix in the OpenTelemetry operator is done, so we just need to wait for the next release and everything should be good.
If you're not using the OpenTelemetry operator, check whether anything else you run uses a mutating admission webhook.
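
As a rough guide to what to look for: a webhook can only alter the Cilium pods if its rules match pod creation. Below is a sketch of what such a configuration might look like. Every name, the service reference, and the path are hypothetical and not taken from any real operator.

```yaml
# Hypothetical example: all names and the service reference are made up.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-operator-mutating-webhook
webhooks:
  - name: mpod.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: example-operator-webhook
        namespace: example-system
        path: /mutate-v1-pod
    rules:
      # A rule like this means the webhook sees (and may rewrite) every new Pod,
      # unless a namespaceSelector or objectSelector narrows it down.
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```

The configurations present in a cluster can be listed with `kubectl get mutatingwebhookconfigurations -o yaml`. Any entry whose rules include `pods` is a candidate for the kind of field stripping described above.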

@danieljkemp

Just cert-manager and cnpg for those (and an istio-sidecar-injector, despite having removed Istio a while ago).

Weirdly, Cilium did run fine after adding the annotations back? Haven't tried with a new cluster just yet.

@jbmolle

jbmolle commented Jun 25, 2024

Well, cert-manager is not the problem. I'm using it too, and its mutating webhook doesn't modify the securityContext.
If it's a mutating webhook problem, it makes sense that the annotations work.
K8s looks for either appArmorProfile in the securityContext or the annotations to give the container the correct AppArmor permissions.
If your webhook removes appArmorProfile from the securityContext but leaves the annotations alone, K8s still receives what it needs.
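
To make the two forms concrete, here is a sketch of a single Pod carrying both, assuming Kubernetes 1.30 or later for the field-based form. The pod, container, and image names are placeholders and have nothing to do with the Cilium chart.

```yaml
# Illustrative only: placeholder names throughout.
apiVersion: v1
kind: Pod
metadata:
  name: apparmor-example
  annotations:
    # Older, annotation-based form; the key suffix must match the container name.
    container.apparmor.security.beta.kubernetes.io/app: "unconfined"
spec:
  containers:
    - name: app
      image: registry.example/app:latest
      securityContext:
        # Newer, field-based form.
        appArmorProfile:
          type: Unconfined
```

If a webhook drops the `appArmorProfile` block here, the annotation still requests the same profile. As far as I understand, when both are present they have to agree with each other.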

@jbmolle

jbmolle commented Jun 25, 2024

I don't know cnpg, but a quick look shows some references to appArmorProfile in releases/cnpg-1.23.2.yaml.
Those references aren't there in versions older than 1.23.2, so maybe you aren't on the latest version?

@danieljkemp

Yeah, but that webhook only applies to cnpg Postgres backup objects according to its rules. Either way, the created Cilium pods end up with the AppArmor context defined as expected in the running pod spec.

@aanm
Member

aanm commented Jul 12, 2024

Fixed, see the solution in here


6 participants