
Cilium v1.12.0-rc2 complains on startup: "Unable to patch node resource with annotation" #19816

Closed
2 tasks done
joestringer opened this issue May 13, 2022 · 15 comments
Labels
kind/bug: This is a bug in the Cilium logic.
needs/triage: This issue requires triaging to establish severity and next steps.
release-blocker/1.12: This issue will prevent the release of the next version of Cilium.

Comments

@joestringer
Member

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I installed Cilium v1.9.x using the Helm charts into a kind environment by following the Kind GSG.

Afterwards, I upgraded to Cilium 1.12.0-rc2 by executing the following command:

helm upgrade -i cilium cilium/cilium --version 1.12.0-rc2 --namespace kube-system -f kind-values.yaml

Here is my kind-values.yaml:

kubeProxyReplacement: partial
hostServices:
  enabled: false
externalIPs:
  enabled: true
nodePort:
  enabled: true
hostPort:
  enabled: true
bpf:
  masquerade: false
image:
  pullPolicy: IfNotPresent
ipam:
  mode: kubernetes
extraConfig:
  mtu: "1280"

The following warning is then regularly printed to the logs:

level=warning msg="Unable to patch node resource with annotation" error="nodes \"kind-control-plane\" is forbidden: User \"system:serviceaccount:kube-system:cilium\" cannot patch resource \"nodes/status\" in API group \"\" at the cluster scope" key=0 nodeName=kind-control-plane subsys=k8s v4CiliumHostIP.IPv4=10.244.0.245 v4Prefix=10.244.0.0/24 v4healthIP.IPv4=10.244.0.164 v6CiliumHostIP.IPv6="<nil>" v6Prefix="<nil>" v6healthIP.IPv6="<nil>"

Cilium Version

1.12.0-rc2

Kernel Version

5.14.0-1034-oem (Ubuntu Focal 20.04.4 LTS)

Kubernetes Version

1.21.1

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@joestringer added the kind/bug, needs/triage, and release-blocker/1.12 labels on May 13, 2022
@joestringer
Member Author

We should investigate here to see whether this can be avoided either with improvements to the Helm charts or upgrade guide docs, since I wasn't directly and closely following any upgrade guides. This could be user error on my part, but I want to make sure that we're not ignoring a real signal.

@sayboras
Member

Annotating the node is gated by the flag .Values.annotateK8sNode (enabled by default in 1.9, disabled by default in 1.12.0-rc2).

So I just want to confirm: was the above error log showing after the upgrade, or after the 1.9 installation?

I have done a quick 1.9 installation, and surprisingly the above error happens even though the clusterrole has the patch permission on nodes/status.

$ cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         3 errors
 \__/¯¯\__/    Operator:       OK
 /¯¯\__/¯¯\    Hubble:         disabled
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

Deployment        cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
DaemonSet         cilium             Desired: 2, Ready: 1/2, Available: 1/2, Unavailable: 1/2
Containers:       cilium             Running: 2
                  cilium-operator    Running: 2
Cluster Pods:     3/3 managed by Cilium
Image versions    cilium             quay.io/cilium/cilium:v1.9: 2
                  cilium-operator    quay.io/cilium/operator-generic:v1.9: 2
Errors:           cilium             cilium-f9jb9    controller update-k8s-node-annotations is failing since 3s (5x): nodes "19816-worker" is forbidden: User "system:serviceaccount:kube-system:cilium" cannot patch resource "nodes" in API group "" at the cluster scope
                  cilium             cilium          1 pods of DaemonSet cilium are not ready
                  cilium             cilium-zqqmk    controller update-k8s-node-annotations is failing since 3s (4x): nodes "19816-control-plane" is forbidden: User "system:serviceaccount:kube-system:cilium" cannot patch resource "nodes" in API group "" at the cluster scope

$ ksysg clusterrole cilium -o json | jq '.rules[3]'                 
{
  "apiGroups": [
    ""
  ],
  "resources": [
    "nodes/status"
  ],
  "verbs": [
    "patch"
  ]
}

Manually changing the clusterrole permission (from patch on nodes/status to patch on nodes) solves the issue. I don't know what changed since the last time I verified this in #19590 (review).

$ cilium status               
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         OK
 \__/¯¯\__/    Operator:       OK
 /¯¯\__/¯¯\    Hubble:         disabled
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

Deployment        cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
DaemonSet         cilium             Desired: 2, Ready: 2/2, Available: 2/2
Containers:       cilium             Running: 2
                  cilium-operator    Running: 2
Cluster Pods:     3/3 managed by Cilium
Image versions    cilium             quay.io/cilium/cilium:v1.9: 2
                  cilium-operator    quay.io/cilium/operator-generic:v1.9: 2

$ ksysg clusterrole cilium -o json | jq '.rules[3]'
{
  "apiGroups": [
    ""
  ],
  "resources": [
    "nodes"
  ],
  "verbs": [
    "patch"
  ]
}
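
For reference, a minimal sketch of applying that manual change as a one-liner, assuming the patch rule is still at index 3 of the cilium ClusterRole as shown in the jq output above:

$ kubectl patch clusterrole cilium --type=json \
    -p='[{"op":"replace","path":"/rules/3/resources/0","value":"nodes"}]'

This is the same edit as changing the rule by hand with kubectl edit clusterrole cilium; it only swaps nodes/status for nodes in the patch rule.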

@joestringer
Member Author

joestringer commented May 16, 2022

So I just want to confirm: was the above error log showing after the upgrade, or after the 1.9 installation?

I only noticed it after the upgrade to v1.12.0-rc2. I believe this was in the cilium-agent logs, but I guess I'll have to try again and confirm.

Image versions    cilium             quay.io/cilium/cilium:v1.9: 2
                 cilium-operator    quay.io/cilium/operator-generic:v1.9: 2

It may be worth double-checking with quay.io/cilium/cilium-ci:v1.9 and quay.io/cilium/operator-generic-ci:v1.9. The images you listed above look like they are 2+ months old, so they likely do not include all of the latest changes. I don't think we update the branch version images on cilium/* any more.

@sayboras
Member

sayboras commented May 17, 2022

My bad, I thought I had added the -ci repo suffix before. Testing again with the recent 1.9.16 release looks good.

I then upgraded to v1.12.0-rc2, and everything seems fine to me. Note that the iptables-related warnings in the logs below are not related to this issue.

Out of curiosity, do we support upgrading across major versions (e.g. 1.9 -> 1.11/1.12), or should the upgrade be done incrementally (e.g. 1.9.x -> 1.10.x -> 1.11.x -> ...)?

v1.9.16 -> v1.12.0-rc2
$ cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         OK
 \__/¯¯\__/    Operator:       OK
 /¯¯\__/¯¯\    Hubble:         disabled
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

Deployment        cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
DaemonSet         cilium             Desired: 2, Ready: 2/2, Available: 2/2
Containers:       cilium             Running: 2
                  cilium-operator    Running: 2
Cluster Pods:     3/3 managed by Cilium
Image versions    cilium             quay.io/cilium/cilium:v1.9.16: 2
                  cilium-operator    quay.io/cilium/operator-generic:v1.9.16: 2
                  
# after upgrade to v1.12.0-rc2
$ helm upgrade -i cilium cilium/cilium --version 1.12.0-rc2 --namespace kube-system -f kind-values.yaml
$ cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         OK
 \__/¯¯\__/    Operator:       OK
 /¯¯\__/¯¯\    Hubble:         disabled
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

DaemonSet         cilium             Desired: 2, Ready: 2/2, Available: 2/2
Deployment        cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
Containers:       cilium             Running: 2
                  cilium-operator    Running: 2
Cluster Pods:     3/3 managed by Cilium
Image versions    cilium             quay.io/cilium/cilium:v1.12.0-rc2: 2
                  cilium-operator    quay.io/cilium/operator-generic:v1.12.0-rc2: 2

$ ksyslo --timestamps ds/cilium | egrep "level=warn|error|fatal"
Found 2 pods, using pod/cilium-n6zbg
2022-05-17T02:07:47.712328392Z level=info msg="  --kvstore-max-consecutive-quorum-errors='2'" subsys=daemon
2022-05-17T02:07:50.390575314Z level=warning msg="Failed to sysctl -w" error="could not open the sysctl file /proc/sys/net/core/bpf_jit_enable: open /proc/sys/net/core/bpf_jit_enable: no such file or directory" subsys=sysctl sysParamName=net.core.bpf_jit_enable sysParamValue=1
2022-05-17T02:07:54.743243464Z level=warning msg="Unable to delete Cilium &{iptables cilium_node_set_v4 [-w 5]} rule" error="exit status 1" obj="-A OLD_CILIUM_OUTPUT_raw -o lxc+ -m mark --mark 0xa00/0xfffffeff -m comment --comment \"cilium: NOTRACK for proxy return traffic\" -j NOTRACK" subsys=iptables
2022-05-17T02:07:54.744258481Z level=warning msg="Unable to delete Cilium &{iptables cilium_node_set_v4 [-w 5]} rule" error="exit status 1" obj="-A OLD_CILIUM_OUTPUT_raw -o cilium_host -m mark --mark 0xa00/0xfffffeff -m comment --comment \"cilium: NOTRACK for proxy return traffic\" -j NOTRACK" subsys=iptables
2022-05-17T02:07:54.745233864Z level=warning msg="Unable to delete Cilium &{iptables cilium_node_set_v4 [-w 5]} rule" error="exit status 1" obj="-A OLD_CILIUM_PRE_raw -m mark --mark 0x200/0xf00 -m comment --comment \"cilium: NOTRACK for proxy traffic\" -j NOTRACK" subsys=iptables

# Explicitly enabled k8s annotation to make sure no issue with RBAC.
$ ksysg cm cilium-config -o json  | jqd | grep annotate | head -1
    "annotate-k8s-node": "true",
$  ksyslo --timestamps ds/cilium | egrep "level=warn|error|fatal"
Found 2 pods, using pod/cilium-f7tgr
2022-05-17T02:16:59.371175801Z level=info msg="  --kvstore-max-consecutive-quorum-errors='2'" subsys=daemon
2022-05-17T02:17:02.044752568Z level=warning msg="Failed to sysctl -w" error="could not open the sysctl file /proc/sys/net/core/bpf_jit_enable: open /proc/sys/net/core/bpf_jit_enable: no such file or directory" subsys=sysctl sysParamName=net.core.bpf_jit_enable sysParamValue=1
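
If node annotations are wanted after the upgrade, the same flag can also be set directly through Helm rather than by editing the ConfigMap (a sketch, reusing the values file from the original report):

$ helm upgrade -i cilium cilium/cilium --version 1.12.0-rc2 --namespace kube-system \
    -f kind-values.yaml --set annotateK8sNode=true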

@joestringer
Member Author

Out of curiosity, do we support upgrading across major versions (e.g. 1.9 -> 1.11/1.12), or should the upgrade be done incrementally (e.g. 1.9.x -> 1.10.x -> 1.11.x -> ...)?

We only officially support the latter; there are some steps that will get missed if you upgrade directly from 1.9 -> 1.11/1.12. As a developer I just like to live life on the edge 😁

@joestringer
Member Author

I've just retried reproducing this, and like @sayboras I can't reproduce it any more. I'll close this out; feel free to comment or reopen if you see this again.

@AndreiHardziyenkaIR

AndreiHardziyenkaIR commented Jun 2, 2022

Reproduced in GKE 1.22.8-gke.201

Found a quick workaround:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cilium-node-patcher
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: "system:node"
subjects:
- apiGroup: ""
  kind: ServiceAccount
  name: cilium
  namespace: kube-system

ClusterRole system:node is a predefined role, so there is no need to create it.
It gives the cilium service account the ability to patch nodes.
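
Since system:node grants far more than node patching, a narrower alternative is a dedicated ClusterRole with only the patch verb on nodes, mirroring the permission change discussed earlier in this thread (a sketch; the cilium-node-annotate name is illustrative):

$ kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cilium-node-annotate
rules:
# Only the verb and resource named in the error message; nothing else.
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cilium-node-annotate
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cilium-node-annotate
subjects:
- kind: ServiceAccount
  name: cilium
  namespace: kube-system
EOF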

@sayboras
Member

sayboras commented Jun 2, 2022

@AndreiHardziyenkaIR can you share the steps for how you are installing Cilium? I just tried with the above version as well, but I am unable to replicate it.

$ cilium install --version v1.12.0-rc2 --helm-set annotateK8sNode=true
level=warning msg="WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.25+; use gcloud instead." subsys=klog
level=warning msg="To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke" subsys=klog
🔮 Auto-detected Kubernetes kind: GKE
ℹ️  using Cilium version "v1.12.0-rc2"
🔮 Auto-detected cluster name: gke-cilium-dev-australia-southeast1-a-tammach
🔮 Auto-detected IPAM mode: kubernetes
🔮 Auto-detected datapath mode: gke
✅ Detected GKE native routing CIDR: 10.44.0.0/14
ℹ️  helm template --namespace kube-system cilium cilium/cilium --version 1.12.0-rc2 --set annotateK8sNode=true,cluster.id=0,cluster.name=gke-cilium-dev-australia-southeast1-a-tammach,cni.binPath=/home/kubernetes/bin,encryption.nodeEncryption=false,gke.disableDefaultSnat=true,gke.enabled=true,ipam.mode=kubernetes,ipv4NativeRoutingCIDR=10.44.0.0/14,kubeProxyReplacement=disabled,nodeinit.enabled=true,nodeinit.reconfigureKubelet=true,nodeinit.removeCbrBridge=true,operator.replicas=1,serviceAccounts.cilium.name=cilium,serviceAccounts.operator.name=cilium-operator
ℹ️  Storing helm values file in kube-system/cilium-cli-helm-values Secret
🚀 Creating Resource quotas...
🔑 Created CA in secret cilium-ca
🔑 Generating certificates for Hubble...
🚀 Creating Service accounts...
🚀 Creating Cluster roles...
🚀 Creating ConfigMap for Cilium version 1.12.0-rc2...
🚀 Creating GKE Node Init DaemonSet...
🚀 Creating Agent DaemonSet...
🚀 Creating Operator Deployment...
⌛ Waiting for Cilium to be installed and ready...
✅ Cilium was successfully installed! Run 'cilium status' to view installation health

$ cilium status    
level=warning msg="WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.25+; use gcloud instead." subsys=klog
level=warning msg="To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke" subsys=klog
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         OK
 \__/¯¯\__/    Operator:       OK
 /¯¯\__/¯¯\    Hubble:         disabled
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

DaemonSet         cilium             Desired: 2, Ready: 2/2, Available: 2/2
Deployment        cilium-operator    Desired: 1, Ready: 1/1, Available: 1/1
Containers:       cilium             Running: 2
                  cilium-operator    Running: 1
Cluster Pods:     9/9 managed by Cilium
Image versions    cilium-operator    quay.io/cilium/operator-generic:v1.12.0-rc2: 1
                  cilium             quay.io/cilium/cilium:v1.12.0-rc2: 2


$ kg nodes gke-tammach-default-pool-a0225682-w3ql -o json | jq .metadata.annotations
W0602 20:44:27.236473  246667 gcp.go:120] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.25+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
{
  "container.googleapis.com/instance_id": "5257812964495667207",
  "csi.volume.kubernetes.io/nodeid": "{\"pd.csi.storage.gke.io\":\"projects/cilium-dev/zones/australia-southeast1-a/instances/gke-tammach-default-pool-a0225682-w3ql\"}",
  "io.cilium.network.ipv4-cilium-host": "10.44.0.32",
  "io.cilium.network.ipv4-health-ip": "10.44.0.186",
  "io.cilium.network.ipv4-pod-cidr": "10.44.0.0/24",
  "node.alpha.kubernetes.io/ttl": "0",
  "node.gke.io/last-applied-node-labels": "cloud.google.com/gke-boot-disk=pd-standard,cloud.google.com/gke-container-runtime=containerd,cloud.google.com/gke-cpu-scaling-level=2,cloud.google.com/gke-max-pods-per-node=110,cloud.google.com/gke-nodepool=default-pool,cloud.google.com/gke-os-distribution=cos,cloud.google.com/gke-preemptible=true,cloud.google.com/machine-family=e2",
  "node.gke.io/last-applied-node-taints": "",
  "volumes.kubernetes.io/controller-managed-attach-detach": "true"
}

$ kgnoowide                                           
W0602 20:47:29.519756  251512 gcp.go:120] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.25+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
NAME                                     STATUS   ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP      OS-IMAGE                             KERNEL-VERSION   CONTAINER-RUNTIME
gke-tammach-default-pool-a0225682-w3ql   Ready    <none>   26m   v1.22.8-gke.201   10.152.0.15   34.151.103.121   Container-Optimized OS from Google   5.10.90+         containerd://1.5.4
gke-tammach-default-pool-a0225682-wfur   Ready    <none>   28m   v1.22.8-gke.201   10.152.0.8    34.151.69.43     Container-Optimized OS from Google   5.10.90+         containerd://1.5.4


@AndreiHardziyenkaIR

@sayboras
The version I am using was installed automatically with GKE.
I believe it comes with Dataplane V2 enabled.

@aanm
Member

aanm commented Jun 3, 2022

Hi @AndreiHardziyenkaIR, in that case you should try and contact Google support and explain the situation. Thank you.

@altanozlu

The same happened with v1.11.6 and Dataplane V2; I then used cilium uninstall, which crashed the whole cluster.

@xynova

xynova commented Jul 12, 2022

@AndreiHardziyenkaIR I just discovered the same issue in our GKE clusters. Did you get any more information about this? Is it required at all?

@KrustyHack

Ay,

For information, I've contacted GCP support and here is the response:

Hello Nicolas,

Thank you for contacting Google Cloud Platform Support.

From the case description, I understand that you have a project ‘xxx’ which has a cluster named ‘xxx’ implemented in europe-west1 zone. For a long time you are able to see this error [1] in your GKE cluster logs. Please let me know if my understanding of the issue is incorrect.

Upon inspecting your project ‘xxx’, I don’t find any logs indicating the error you mentioned in your case description.

Before we further proceed I just want to set your expectation that Cilium is a 3rd party tool and the assistance I can provide with it is limited. The error is displayed due to a permissions change on the cilium version, and as stated on the shared documentation, annotating node is gated by flag .Values.annotateK8sNode which is:
 
-enabled by default in 1.9
-disabled by default in 1.12.0-rc2
 
According to the document [2], if the upgrade is performed from 1.9 to 1.11/1.12 there are some steps missing that cause the error [1]. I won’t be able to provide the exact details of the error/missing steps since it's an issue on Cilium’s end. But as mentioned in the issue [2], the other workaround is to ‘Manually change clusterrole permission (from patch nodes/status to patch nodes)’, hence manual intervention was needed due to the permission changes in the Cilium version.

[1] GKE permissions error: User "system:serviceaccount:kube-system:cilium" cannot patch resource "nodes" in API group ""

[2] https://github.com/cilium/cilium/issues/19816

Anyway, the workaround works well. 👍

@marandalucas

marandalucas commented Dec 1, 2022

@KrustyHack and @AndreiHardziyenkaIR Thanks!! You saved my day!

Have you ever seen this error in GKE with Dataplane v2 enabled?

Warning FailedCreatePodSandBox 3m38s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "bb2b9e4bf0f20bc6fb92858b7ef089f83" network for pod "channel-d779f84-98tjr": networkPlugin cni failed to set up pod "channel-d779f84-98tjr_namespace" network: unable to create endpoint: [PUT /endpoint/\{id}][429] putEndpointIdTooManyRequests

@KrustyHack

Ay @marandalucas,

No, sorry, I never had this error on our GKE clusters. If you have access to GCP support, I would recommend contacting them.
