Agent panic allocating address #30746

Closed
Zariel opened this issue Feb 13, 2024 · 1 comment · Fixed by #30757
Labels
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
kind/regression: This functionality worked fine before, but was broken in a newer release of Cilium.
sig/agent: Cilium agent related.

Comments

@Zariel
Contributor

Zariel commented Feb 13, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Panic when the agent restarted, only on a single node. The node has one Mellanox ConnectX-4 Lx passed through from Proxmox via an SR-IOV VF, plus a virtio_net device which is not used.

Cilium Version

1.15.0

Kernel Version

6.1.74-talos

Kubernetes Version

1.29.1

Sysdump

cilium-sysdump-20240213-194440.zip

Relevant log output

level=info msg="Restored router address from node_config" file=/var/run/cilium/state/globals/node_config.h ipv4=10.42.1.15 ipv6="<nil>" subsys=node
level=info msg="Initializing node addressing" subsys=daemon
level=info msg="Initializing kubernetes IPAM" subsys=ipam v4Prefix=10.42.1.0/24 v6Prefix="<nil>"
level=info msg="Restoring endpoints..." subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=kube-system/reflector-6f44fcc976-fmpf5 endpointID=1399 error="Kubernetes pod kube-system/reflector-6f44fcc976-fmpf5 does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=csi-addons-system/csi-addons-controller-manager-576657d4b8-5b5wg endpointID=572 error="Kubernetes pod csi-addons-system/csi-addons-controller-manager-576657d4b8-5b5wg does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=network/cloudflared-949b6855f-6jdvx endpointID=353 error="Kubernetes pod network/cloudflared-949b6855f-6jdvx does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=monitoring/thanos-ruler-1 endpointID=135 error="Kubernetes pod monitoring/thanos-ruler-1 does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=network/echo-server-b75667455-gqsnm endpointID=2486 error="Kubernetes pod network/echo-server-b75667455-gqsnm does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=kube-system/hubble-relay-688895cdd5-2psts endpointID=2212 error="Kubernetes pod kube-system/hubble-relay-688895cdd5-2psts does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=default/unpackerr-6dcd888459-jzfcq endpointID=1484 error="Kubernetes pod default/unpackerr-6dcd888459-jzfcq does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=monitoring/kube-prometheus-stack-operator-68fc5448f8-gnskg endpointID=679 error="Kubernetes pod monitoring/kube-prometheus-stack-operator-68fc5448f8-gnskg does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=monitoring/unpoller-5957848f8c-qskmd endpointID=1373 error="Kubernetes pod monitoring/unpoller-5957848f8c-qskmd does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=cert-manager/cert-manager-cainjector-584f44558c-jxbph endpointID=719 error="Kubernetes pod cert-manager/cert-manager-cainjector-584f44558c-jxbph does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=kube-system/nfs-subdir-external-provisioner-f56fb548f-c45k6 endpointID=756 error="Kubernetes pod kube-system/nfs-subdir-external-provisioner-f56fb548f-c45k6 does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=kube-system/kubelet-csr-approver-6d5b957c-8d5h2 endpointID=3795 error="Kubernetes pod kube-system/kubelet-csr-approver-6d5b957c-8d5h2 does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=cilium-test/echo-other-node-58999bbffd-rvmq8 endpointID=1744 error="Kubernetes pod cilium-test/echo-other-node-58999bbffd-rvmq8 does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=monitoring/kube-state-metrics-67699d74df-8sk9s endpointID=3981 error="Kubernetes pod monitoring/kube-state-metrics-67699d74df-8sk9s does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=cert-manager/cert-manager-7ddb449fb6-crspw endpointID=117 error="Kubernetes pod cert-manager/cert-manager-7ddb449fb6-crspw does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=cilium-test/client3-868f7b8f6b-pkvrn endpointID=485 error="Kubernetes pod cilium-test/client3-868f7b8f6b-pkvrn does not exist" subsys=daemon
level=warning msg="Unable to restore endpoint, ignoring" ciliumEndpointName=cert-manager/cert-manager-webhook-76f9945d6f-5dd6n endpointID=656 error="Kubernetes pod cert-manager/cert-manager-webhook-76f9945d6f-5dd6n does not exist" subsys=daemon
level=info msg="Endpoints restored" failed=17 restored=11 subsys=daemon
level=info msg="Addressing information:" subsys=daemon
level=info msg="  Cluster-Name: holywoo" subsys=daemon
level=info msg="  Cluster-ID: 1" subsys=daemon
level=info msg="  Local node-name: k8s-3" subsys=daemon
level=info msg="  Node-IPv6: <nil>" subsys=daemon
level=info msg="  External-Node IPv4: 192.168.42.13" subsys=daemon
level=info msg="  Internal-Node IPv4: 10.42.1.15" subsys=daemon
level=info msg="  IPv4 allocation prefix: 10.42.1.0/24" subsys=daemon
level=info msg="  IPv4 native routing prefix: 10.42.0.0/16" subsys=daemon
level=info msg="  Loopback IPv4: 169.254.42.1" subsys=daemon
level=info msg="  Local IPv4 addresses:" subsys=daemon
level=info msg="  - 10.42.1.15" subsys=daemon
level=info msg="  - 192.168.42.13" subsys=daemon
level=info msg="Node updated" clusterName=holywoo nodeName=k8s-3 subsys=nodemanager
level=info msg="Adding local node to cluster" node=k8s-3 subsys=nodediscovery
level=info msg="Creating or updating CiliumNode resource" node=k8s-3 subsys=nodediscovery
level=info msg="Waiting until all pre-existing resources have been received" subsys=k8s-watcher
level=info msg="Node updated" clusterName=holywoo nodeName=k8s-0 subsys=nodemanager
level=info msg="Node updated" clusterName=holywoo nodeName=k8s-1 subsys=nodemanager
level=info msg="Node updated" clusterName=holywoo nodeName=k8s-2 subsys=nodemanager
level=info msg="Node updated" clusterName=holywoo nodeName=k8s-4 subsys=nodemanager
level=info msg="Initializing identity allocator" subsys=identity-cache
level=info msg="Allocating identities between range" cluster-id=1 max=131071 min=65536 subsys=identity-cache
panic: runtime error: index out of range [3] with length 0

goroutine 1 [running]:
github.com/cilium/cilium/pkg/byteorder.NetIPv4ToHost32({0x0?, 0xc003bf8120?, 0x394decb?})
        /go/src/github.com/cilium/cilium/pkg/byteorder/byteorder.go:15 +0x65
github.com/cilium/cilium/pkg/datapath/linux/config.(*HeaderfileWriter).WriteNodeConfig(0xc0013e5140, {0x3eea5c0?, 0xc00013c9d0?}, 0xc002008de0)
        /go/src/github.com/cilium/cilium/pkg/datapath/linux/config/config.go:548 +0x6c0a
github.com/cilium/cilium/pkg/datapath/loader.hashDatapath({0x7f9cfe9a4c50, 0xc0016ce140}, 0xc002008de0, {0x0, 0x0}, {0x0?, 0x0})
        /go/src/github.com/cilium/cilium/pkg/datapath/loader/hash.go:49 +0x68
github.com/cilium/cilium/pkg/datapath/loader.(*objectCache).Update(0xc002478f00, 0x0?)
        /go/src/github.com/cilium/cilium/pkg/datapath/loader/cache.go:162 +0x45
github.com/cilium/cilium/pkg/datapath/loader.newObjectCache({0x7f9cfe9a4c50?, 0xc0016ce140}, 0x2b?, {0xc0010b3aa0, 0x15})
        /go/src/github.com/cilium/cilium/pkg/datapath/loader/cache.go:142 +0x185
github.com/cilium/cilium/pkg/datapath/loader.NewObjectCache(...)
        /go/src/github.com/cilium/cilium/pkg/datapath/loader/cache.go:156
github.com/cilium/cilium/pkg/datapath/loader.(*Loader).init.func1()
        /go/src/github.com/cilium/cilium/pkg/datapath/loader/loader.go:88 +0x45
sync.(*Once).doSlow(0xc00200ac00?, 0x7f9cfe9a4c50?)
        /usr/local/go/src/sync/once.go:74 +0xbf
sync.(*Once).Do(...)
        /usr/local/go/src/sync/once.go:65
github.com/cilium/cilium/pkg/datapath/loader.(*Loader).init(0xc0013e5200, {0x7f9cfe9a4c50?, 0xc0016ce140?}, 0xc002008de0)
        /go/src/github.com/cilium/cilium/pkg/datapath/loader/loader.go:87 +0x66
github.com/cilium/cilium/pkg/datapath/loader.(*Loader).Reinitialize(0x0?, {0x3f38b28, 0xc000837860}, {0x3f3c198, 0xc000bf8000}, {{0x0, 0x0}, 0x0, {0x0, 0x0}, ...}, ...)
        /go/src/github.com/cilium/cilium/pkg/datapath/loader/base.go:301 +0x1c5
github.com/cilium/cilium/daemon/cmd.(*Daemon).init(0xc000bf8000)
        /go/src/github.com/cilium/cilium/daemon/cmd/daemon.go:258 +0x6b7
github.com/cilium/cilium/daemon/cmd.newDaemon({0x3f38b28, 0xc000837860}, 0xc0007e5000, 0xc000005500)
        /go/src/github.com/cilium/cilium/daemon/cmd/daemon.go:960 +0x5ae5
github.com/cilium/cilium/daemon/cmd.newDaemonPromise.func1({0x355ee80, 0x496c00})
        /go/src/github.com/cilium/cilium/daemon/cmd/daemon_main.go:1688 +0x66
github.com/cilium/cilium/pkg/hive/cell.Hook.Start(...)
        /go/src/github.com/cilium/cilium/pkg/hive/cell/lifecycle.go:45
github.com/cilium/cilium/pkg/hive/cell.(*DefaultLifecycle).Start(0xc00068ce10, {0x3f38b98?, 0xc0004fb8f0?})
        /go/src/github.com/cilium/cilium/pkg/hive/cell/lifecycle.go:108 +0x337
github.com/cilium/cilium/pkg/hive.(*Hive).Start(0xc0006097c0, {0x3f38b98, 0xc0004fb8f0})
        /go/src/github.com/cilium/cilium/pkg/hive/hive.go:310 +0xf9
github.com/cilium/cilium/pkg/hive.(*Hive).Run(0xc0006097c0)
        /go/src/github.com/cilium/cilium/pkg/hive/hive.go:210 +0x73
github.com/cilium/cilium/daemon/cmd.NewAgentCmd.func1(0xc001004b00?, {0x3925201?, 0x4?, 0x3925069?})
        /go/src/github.com/cilium/cilium/daemon/cmd/root.go:39 +0x17b
github.com/spf13/cobra.(*Command).execute(0xc000954600, {0xc000072050, 0x1, 0x1})
        /go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:987 +0xaa3
github.com/spf13/cobra.(*Command).ExecuteC(0xc000954600)
        /go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:1039
github.com/cilium/cilium/daemon/cmd.Execute(0xc0006097c0?)
        /go/src/github.com/cilium/cilium/daemon/cmd/root.go:79 +0x13
main.main()
        /go/src/github.com/cilium/cilium/daemon/main.go:14 +0x57


Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc: https://github.com/cilium/cilium/blob/main/USERS.md

Code of Conduct

  • I agree to follow this project's Code of Conduct
@Zariel Zariel added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Feb 13, 2024
@aanm aanm added the kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. label Feb 13, 2024
@youngnick youngnick added the sig/agent Cilium agent related. label Feb 14, 2024
@dylandreimerink
Member

I was able to find the cause of the issue from the sysdump. The order of events is as follows:

  1. The combination of settings used implicitly enables direct routing.
  2. No direct routing device is explicitly set.
  3. The agent attempts to detect the routing device: it lists all devices it considers external, and if there is exactly one, that device is used for direct routing.
  4. Due to a bug (Kube-proxy-replacement incorrectly detects aws-cni pod interfaces as host interfaces (1.15 regression), #30563), an LXC device is selected as the routing device, as evidenced by the log line:
    2024-02-13T19:43:36.060461460Z level=info msg="Direct routing device detected" direct-routing-device=enxbc2411da2b0a subsys=linux-datapath
  5. Later in the initialization, getDirectRouting is called by code that expects an IPv4 address back. Presumably the LXC device doesn't have an IPv4 address, at least not yet, and due to a second bug nil is returned instead of an explicit error.
  6. The nil value is passed into an IP-to-bytes function that also assumes its input is valid, which causes the panic (see the sketch below).
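A minimal, self-contained Go sketch of steps 5 and 6, assuming simplified helper shapes rather than the actual Cilium functions (the names directRoutingIPv4 and netIPv4ToHost32 below are illustrative): a nil net.IP that slips past the lookup reproduces the exact panic message seen in the stack trace above.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// netIPv4ToHost32 mimics the shape of byteorder.NetIPv4ToHost32: it assumes
// ip holds a valid IPv4 address and reads its 4 bytes as a uint32
// (little-endian here, purely for the sketch). With a nil IP, ip.To4()
// returns nil and the Uint32 call indexes byte 3 of a zero-length slice.
func netIPv4ToHost32(ip net.IP) uint32 {
	ipv4 := ip.To4() // nil when ip is nil or not an IPv4 address
	return binary.LittleEndian.Uint32(ipv4)
}

// directRoutingIPv4 is a hypothetical stand-in for the buggy lookup: it
// returns nil instead of an error when the device has no IPv4 address.
func directRoutingIPv4(addrs []net.IP) net.IP {
	for _, a := range addrs {
		if v4 := a.To4(); v4 != nil {
			return v4
		}
	}
	return nil // the second bug: the caller expects a valid IPv4 here
}

func main() {
	ip := directRoutingIPv4(nil) // detected device has no IPv4 address (yet)
	// panics with: runtime error: index out of range [3] with length 0
	fmt.Println(netIPv4ToHost32(ip))
}
```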

The first bug should be fixed by a backport of #30691, which should land in v1.15.1. That doesn't address the secondary bug, for which I will open a PR.

I think you should be able to mitigate the bug for now by explicitly setting the --direct-routing-device agent flag or the nodePort.directRoutingDevice Helm option. See https://docs.cilium.io/en/v1.15/network/kubernetes/kubeproxy-free/#nodeport-devices-port-and-bind-settings
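For illustration, a minimal Helm values sketch of that mitigation, assuming the nodePort.directRoutingDevice path mentioned above is valid for your chart version; the interface name enp1s0 is a placeholder for the node's real external device:

```yaml
# Sketch only: pin the direct routing device explicitly instead of relying
# on auto-detection. Replace enp1s0 with the node's actual external interface.
nodePort:
  directRoutingDevice: "enp1s0"
# Equivalent agent flag: --direct-routing-device=enp1s0
```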

@dylandreimerink dylandreimerink removed the needs/triage This issue requires triaging to establish severity and next steps. label Feb 14, 2024