Cilium 1.14.5+ can break Kubernetes pod connectivity due to the reserved:init label never being removed #30524

Closed
timoreimann opened this issue Jan 30, 2024 · 4 comments · Fixed by #30909
Labels: kind/bug (This is a bug in the Cilium logic.) · kind/community-report (This was reported by a user in the Cilium community, eg via Slack.) · kind/regression (This functionality worked fine before, but was broken in a newer release of Cilium.) · sig/agent (Cilium agent related.)

Comments

@timoreimann
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Starting with Cilium 1.14.5, we are occasionally seeing pods in newly provisioned (DigitalOcean managed) Kubernetes clusters completely fail to ingress or egress any traffic. The built-in cilium monitor shows that packets are dropped / rejected on their way into and out of such pods.

What we did notice is that the affected pods all have the reserved:init label set on the Cilium endpoint in addition to the regular pod labels. Also, the policies for both ingress and egress are marked enabled for such pods according to cilium endpoint get/list, even though the policy enforcement mode is set to default and no network policies are installed at all -- output from cilium endpoint list for an affected CoreDNS instance:

ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                                   IPv6   IPv4          STATUS
           ENFORCEMENT        ENFORCEMENT
2354       Enabled            Enabled           26544      k8s:doks.digitalocean.com/managed=true                                               10.244.0.77   ready
                                                           k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=coredns
                                                           k8s:io.kubernetes.pod.namespace=kube-system
                                                           k8s:k8s-app=kube-dns
                                                           reserved:init

I am assuming that the policy enforcement is a consequence of the reserved:init label being set. What is not clear to me is why that special label has not been removed despite all other pod labels having been discovered successfully.
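
For anyone less familiar with the mechanism: if I read the Cilium docs correctly, endpoints carrying the init identity stay under policy enforcement until their real labels are resolved, and only rules that explicitly select reserved:init apply to them in the meantime. Purely to illustrate that (not as a workaround we actually deploy), a policy along the following lines should select such endpoints -- the policy name and ports are made up, and I haven't verified the exact selector syntax against 1.14:

apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  # made-up name, for illustration only
  name: allow-dns-from-init-identity
spec:
  # selects endpoints that still carry the init identity
  endpointSelector:
    matchLabels:
      "reserved:init": ""
  egress:
  # allow DNS egress to kube-dns while the endpoint's labels are unresolved
  - toEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": kube-system
        "k8s:k8s-app": kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP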

According to the documentation for the reserved:init label, it may be assigned when one of three described conditions holds. We do not use Docker on the Kubernetes nodes and we also don't run in etcd mode, so only a (brief) unavailability of the Kubernetes API server remains as a possible explanation for why the label was added in the first place. FWIW, in none of the cases I have looked at more closely was I able to spot a clear error message in the Cilium agent logs indicating that an API server request may have failed. (If the failures manifest as API watch events not being propagated to the agent in time, or only with a delay, then it is conceivable that no error logs would be produced at all.)

So far we have only spotted the problem during the initial provisioning of a cluster. Once a cluster manages to finish bootstrapping without the reserved:init label being assigned, the issue does not seem to occur anymore during the remaining runtime of a cluster (or we haven't spotted it yet). I've been speculating whether the API server not being totally stable and settled in during the provisioning process may be a contributing factor.

The issue also happens on 1.14.6 but does not seem to manifest in 1.14.4.

I've been trying to reproduce the problem by attempting to surgically disallow access from the Cilium agent to the API server just as far as label discovery is concerned. This turned out to be somewhat tricky since pods won't even come up if access is prohibited too broadly. The closest I have gotten is by modifying the RBAC permissions of the agent to strip the get and watch permissions from the pods resource, essentially splitting the (default) RBAC rule for the core API group

- apiGroups:
  - ""
  resources:
  - namespaces
  - services
  - pods
  - endpoints
  - nodes
  verbs:
  - get
  - list
  - watch

into two separate groups, like this:

- apiGroups:
  - ""
  resources:
  - namespaces
  - services
  - endpoints
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list

After restarting the Cilium agent, newly launched pods received the reserved:init label. Even when I re-enable all permissions for the pods resource, the label sticks around.
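
In case it helps anyone trying to repeat this: any trivial pod is enough to check whether a freshly created endpoint keeps the label. Just for illustration, a minimal manifest like the following does the job (name and image are arbitrary):

apiVersion: v1
kind: Pod
metadata:
  # arbitrary name; the pod only exists so we can inspect its CiliumEndpoint
  name: init-label-canary
spec:
  containers:
  - name: sleep
    # any image works; the pod just needs to get scheduled and networked
    image: busybox:1.36
    command: ["sleep", "3600"]

Once such a pod is running, the matching CiliumEndpoint (kubectl get cep in the pod's namespace, or cilium endpoint list on the node) shows whether reserved:init is still attached.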

I have to say that I am not super certain whether the above is the perfect or even right way to reproduce the bug: I would have expected the CiliumEndpoint objects of deliberately broken pods not to have any regular pod labels at all but only the reserved:init label while the RBAC permission is modified. However, when I watch for new CiliumEndpoint objects, I always see both the regular labels and the reserved:init label associated right from the beginning. Chances are, though, that I am missing some details on how exactly the label is handled, so I'm happy to hear feedback on this and apply any repro suggestions folks may have.

One larger-scale reproduction that does work fairly reliably is to launch 50-60 new clusters in our environment. I'd then usually hit 1-2 broken clusters that show the described symptoms.

We're now downgrading to 1.14.4 again as a workaround, but obviously we would want to see the bug addressed so that we can roll forward again. I've been peeking around the code a bit but haven't been able to identify any particular area that would need to be worked on for a fix. Once the root cause is identified and the approach to fixing it becomes apparent, I'm open to trying to submit a PR myself. Appreciate your support on this case.
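
For completeness, in case someone else needs the same stopgap and deploys Cilium via the upstream Helm chart (we don't, so treat the value names below as an untested sketch), pinning the images back to v1.14.4 should amount to something like:

# untested sketch: Helm values pinning agent and operator back to v1.14.4
# (value names assume the upstream cilium/cilium chart)
image:
  tag: v1.14.4
  useDigest: false
operator:
  image:
    tag: v1.14.4
    useDigest: false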

Cilium Version

v1.14.5 (also tried v1.14.6 which is similarly affected)

Kernel Version

Linux pool-om3sd 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux

Kubernetes Version

1.29.0

Sysdump

Able to provide on request

Relevant log output

2024-01-26T00:04:41.390401603Z level=info msg=Invoked duration="908.81µs" function="pprof.glob..func1 (cell.go:51)" subsys=hive
2024-01-26T00:04:41.390567519Z level=info msg=Invoked duration="64.056µs" function="gops.registerGopsHooks (cell.go:39)" subsys=hive
2024-01-26T00:04:41.391783162Z level=info msg=Invoked duration="982.257µs" function="metrics.NewRegistry (registry.go:65)" subsys=hive
2024-01-26T00:04:41.488123856Z level=info msg="Spire Delegate API Client is disabled as no socket path is configured" subsys=spire-delegate
2024-01-26T00:04:41.488171945Z level=info msg="Mutual authentication handler is disabled as no port is configured" subsys=auth
2024-01-26T00:04:41.488280537Z level=info msg=Invoked duration=96.513857ms function="cmd.glob..func4 (daemon_main.go:1601)" subsys=hive
2024-01-26T00:04:41.488382206Z level=info msg=Invoked duration="35.659µs" function="gc.registerSignalHandler (cell.go:47)" subsys=hive
2024-01-26T00:04:41.488536036Z level=info msg=Invoked duration="39.124µs" function="utime.initUtimeSync (cell.go:30)" subsys=hive
2024-01-26T00:04:41.488730601Z level=info msg=Invoked duration="88.651µs" function="agentliveness.newAgentLivenessUpdater (agent_liveness.go:44)" subsys=hive
2024-01-26T00:04:41.488958006Z level=info msg=Invoked duration="112.482µs" function="l2responder.NewL2ResponderReconciler (l2responder.go:64)" subsys=hive
2024-01-26T00:04:41.489084697Z level=info msg=Invoked duration="103.902µs" function="garp.newGARPProcessor (processor.go:27)" subsys=hive
2024-01-26T00:04:41.489200850Z level=info msg=Starting subsys=hive
2024-01-26T00:04:41.489216641Z level=info msg="Started gops server" address="127.0.0.1:9890" subsys=gops
2024-01-26T00:04:41.489727059Z level=info msg="Start hook executed" duration="484.177µs" function="gops.registerGopsHooks.func1 (cell.go:44)" subsys=hive
2024-01-26T00:04:41.489737887Z level=info msg="Start hook executed" duration="2.267µs" function="metrics.NewRegistry.func1 (registry.go:86)" subsys=hive
2024-01-26T00:04:41.489813097Z level=info msg="Establishing connection to apiserver" host="https://82355bbf-9ca0-4d3d-a592-75e4d53c7e6b.k8s.ondigitalocean.com" subsys=k8s-client
2024-01-26T00:04:41.490044725Z level=info msg="Serving prometheus metrics on :9090" subsys=metrics
2024-01-26T00:04:41.509208964Z level=info msg="Connected to apiserver" subsys=k8s-client
2024-01-26T00:04:41.511584747Z level=info msg="Start hook executed" duration=21.701086ms function="client.(*compositeClientset).onStart" subsys=hive
2024-01-26T00:04:41.511770437Z level=info msg="Start hook executed" duration="9.643µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Node].Start" subsys=hive
2024-01-26T00:04:41.512263508Z level=info msg="Using autogenerated IPv4 allocation range" subsys=node v4Prefix=10.140.0.0/16
2024-01-26T00:04:41.515728836Z level=info msg="Start hook executed" duration=3.787094ms function="node.NewLocalNodeStore.func1 (local_node_store.go:77)" subsys=hive
2024-01-26T00:04:41.516483065Z level=info msg="Start hook executed" duration="166.129µs" function="authmap.newAuthMap.func1 (cell.go:28)" subsys=hive
2024-01-26T00:04:41.516501794Z level=info msg="Start hook executed" duration="22.36µs" function="configmap.newMap.func1 (cell.go:24)" subsys=hive
2024-01-26T00:04:41.516505680Z level=info msg="Start hook executed" duration="25.359µs" function="signalmap.newMap.func1 (cell.go:45)" subsys=hive
2024-01-26T00:04:41.516509559Z level=info msg="Start hook executed" duration="12.43µs" function="nodemap.newNodeMap.func1 (cell.go:24)" subsys=hive
2024-01-26T00:04:41.516512892Z level=info msg="Start hook executed" duration="82.721µs" function="eventsmap.newEventsMap.func1 (cell.go:36)" subsys=hive
2024-01-26T00:04:41.520518001Z level=info msg="Start hook executed" duration=4.193596ms function="datapath.newDatapath.func1 (cells.go:114)" subsys=hive
2024-01-26T00:04:41.520704546Z level=info msg="Restored 3 node IDs from the BPF map" subsys=linux-datapath
2024-01-26T00:04:41.520748280Z level=info msg="Start hook executed" duration="200.324µs" function="datapath.newDatapath.func2 (cells.go:127)" subsys=hive
2024-01-26T00:04:41.520820754Z level=info msg="Start hook executed" duration="11.382µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Service].Start" subsys=hive
2024-01-26T00:04:41.622053303Z level=info msg="Start hook executed" duration=100.881896ms function="*manager.diffStore[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Service].Start" subsys=hive
2024-01-26T00:04:41.622083432Z level=info msg="Start hook executed" duration="4.597µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s.Endpoints].Start" subsys=hive
2024-01-26T00:04:41.622349078Z level=info msg="Using discoveryv1.EndpointSlice" subsys=k8s
2024-01-26T00:04:41.723057344Z level=info msg="Start hook executed" duration=100.930422ms function="*manager.diffStore[*github.com/cilium/cilium/pkg/k8s.Endpoints].Start" subsys=hive
2024-01-26T00:04:41.723085171Z level=info msg="Start hook executed" duration="5.137µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.CiliumNode].Start" subsys=hive
2024-01-26T00:04:41.723088198Z level=info msg="Start hook executed" duration="1.368µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Pod].Start" subsys=hive
2024-01-26T00:04:41.723090904Z level=info msg="Start hook executed" duration="1.549µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1.Namespace].Start" subsys=hive
2024-01-26T00:04:41.723094117Z level=info msg="Start hook executed" duration="1.572µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.CiliumNetworkPolicy].Start" subsys=hive
2024-01-26T00:04:41.723097113Z level=info msg="Start hook executed" duration="1.016µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.CiliumClusterwideNetworkPolicy].Start" subsys=hive
2024-01-26T00:04:41.723099674Z level=info msg="Start hook executed" duration=959ns function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2alpha1.CiliumCIDRGroup].Start" subsys=hive
2024-01-26T00:04:41.723126172Z level=info msg="Start hook executed" duration="34.152µs" function="endpointmanager.newDefaultEndpointManager.func1 (cell.go:204)" subsys=hive
2024-01-26T00:04:41.723130232Z level=info msg="Start hook executed" duration="10.288µs" function="cmd.newPolicyTrifecta.func1 (policy.go:130)" subsys=hive
2024-01-26T00:04:41.723185040Z level=info msg="Start hook executed" duration="71.568µs" function="*manager.manager.Start" subsys=hive
2024-01-26T00:04:41.723572425Z level=info msg="Generating CNI configuration file with mode portmap" subsys=cni-config
2024-01-26T00:04:41.723767126Z level=info msg="Start hook executed" duration="142.716µs" function="*cni.cniConfigManager.Start" subsys=hive
2024-01-26T00:04:41.724109125Z level=info msg="Serving cilium node monitor v1.2 API at unix:///var/run/cilium/monitor1_2.sock" subsys=monitor-agent
2024-01-26T00:04:41.724120522Z level=info msg="Start hook executed" duration="326.374µs" function="agent.newMonitorAgent.func1 (cell.go:62)" subsys=hive
2024-01-26T00:04:41.724146025Z level=info msg="Start hook executed" duration="7.324µs" function="*resource.resource[*github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2alpha1.CiliumL2AnnouncementPolicy].Start" subsys=hive
2024-01-26T00:04:41.724150338Z level=info msg="Start hook executed" duration="6.022µs" function="*job.group.Start" subsys=hive
2024-01-26T00:04:41.724422612Z level=info msg="Start hook executed" duration="229.775µs" function="proxy.newProxy.func1 (cell.go:56)" subsys=hive
2024-01-26T00:04:41.724664549Z level=info msg="Start hook executed" duration="212.795µs" function="signal.provideSignalManager.func1 (cell.go:26)" subsys=hive
2024-01-26T00:04:41.725070625Z level=info msg="Envoy: Starting xDS gRPC server listening on /var/run/cilium/envoy/sockets/xds.sock" subsys=envoy-manager
2024-01-26T00:04:41.725266385Z level=info msg="Datapath signal listener running" subsys=signal
2024-01-26T00:04:41.726322572Z level=info msg="Start hook executed" duration=1.619123ms function="auth.registerAuthManager.func1 (cell.go:110)" subsys=hive
2024-01-26T00:04:41.726351430Z level=info msg="Start hook executed" duration="3.362µs" function="auth.registerGCJobs.func1 (cell.go:159)" subsys=hive
2024-01-26T00:04:41.726355783Z level=info msg="Start hook executed" duration="12.298µs" function="*job.group.Start" subsys=hive
2024-01-26T00:04:41.728032633Z level=info msg="Using Managed Neighbor Kernel support" subsys=daemon
2024-01-26T00:04:41.728045813Z level=warning msg="Deprecated value for --kube-proxy-replacement: partial (use either \"true\", or \"false\")" subsys=daemon
2024-01-26T00:04:41.728764093Z level=info msg="Inheriting MTU from external network interface" device=eth1 ipAddr=10.136.0.13 mtu=1500 subsys=mtu
2024-01-26T00:04:41.729110422Z level=info msg="Restoring 1 old CIDR identities" subsys=daemon
2024-01-26T00:04:41.729372790Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
2024-01-26T00:04:41.729484456Z level=info msg="Removed map pin at /sys/fs/bpf/tc/globals/cilium_ipcache, recreating and re-pinning map cilium_ipcache" file-path=/sys/fs/bpf/tc/globals/cilium_ipcache name=cilium_ipcache subsys=bpf
2024-01-26T00:04:41.730835154Z level=info msg="Restored services from maps" failedServices=0 restoredServices=6 subsys=service
2024-01-26T00:04:41.731061985Z level=info msg="Restored backends from maps" failedBackends=0 restoredBackends=6 skippedBackends=0 subsys=service
2024-01-26T00:04:41.731133399Z level=info msg="Reading old endpoints..." subsys=daemon
2024-01-26T00:04:41.736524723Z level=info msg="Reusing previous DNS proxy port: 46327" subsys=daemon
2024-01-26T00:04:41.736577715Z level=info msg="Waiting until all Cilium CRDs are available" subsys=k8s
2024-01-26T00:04:42.737076426Z level=info msg="All Cilium CRDs have been found and are available" subsys=k8s
2024-01-26T00:04:42.737255001Z level=info msg="Creating or updating CiliumNode resource" node=pool-o7caf subsys=nodediscovery
2024-01-26T00:04:42.737383326Z level=info msg="Creating or updating CiliumNode resource" node=pool-o7caf subsys=nodediscovery
2024-01-26T00:04:42.753990106Z level=warning msg="Unable to update CiliumNode resource, will retry" error="Operation cannot be fulfilled on ciliumnodes.cilium.io \"pool-o7caf\": the object has been modified; please apply your changes to the latest version and try again" subsys=nodediscovery
2024-01-26T00:04:42.758977918Z level=info msg="Retrieved node information from cilium node" nodeName=pool-o7caf subsys=k8s
2024-01-26T00:04:42.759006576Z level=info msg="Received own node information from API server" ipAddr.ipv4=10.136.0.13 ipAddr.ipv6="<nil>" k8sNodeIP=10.136.0.13 labels="map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:s-2vcpu-4gb beta.kubernetes.io/os:linux doks.digitalocean.com/managed:true doks.digitalocean.com/node-id:2684529e-fb0e-4062-8414-beda244612e9 doks.digitalocean.com/node-pool:pool doks.digitalocean.com/node-pool-id:4fda5043-42f3-4379-85a1-caf2b1eacb29 doks.digitalocean.com/version:1.29.0-do.0 failure-domain.beta.kubernetes.io/region:nyc1 kubernetes.io/arch:amd64 kubernetes.io/hostname:pool-o7caf kubernetes.io/os:linux node.kubernetes.io/instance-type:s-2vcpu-4gb topology.kubernetes.io/region:nyc1]" nodeName=pool-o7caf subsys=k8s v4Prefix=10.244.0.0/25 v6Prefix="<nil>"
2024-01-26T00:04:42.759011260Z level=info msg="k8s mode: Allowing localhost to reach local endpoints" subsys=daemon
2024-01-26T00:04:42.759103839Z level=info msg="Detected devices" devices="[]" subsys=linux-datapath
2024-01-26T00:04:42.759179373Z level=info msg="Enabling k8s event listener" subsys=k8s-watcher
2024-01-26T00:04:42.760241864Z level=info msg="Removing stale endpoint interfaces" subsys=daemon
2024-01-26T00:04:42.760478879Z level=info msg="Skipping kvstore configuration" subsys=daemon
2024-01-26T00:04:42.760961750Z level=info msg="Restored router address from node_config" file=/var/run/cilium/state/globals/node_config.h ipv4=10.244.0.109 ipv6="<nil>" subsys=node
2024-01-26T00:04:42.761448424Z level=info msg="Initializing node addressing" subsys=daemon
2024-01-26T00:04:42.761460281Z level=info msg="Initializing cluster-pool IPAM" subsys=ipam v4Prefix=10.244.0.0/25 v6Prefix="<nil>"
2024-01-26T00:04:42.761640076Z level=info msg="Restoring endpoints..." subsys=daemon
2024-01-26T00:04:42.763431075Z level=info msg="Waiting until local node addressing before starting watchers depending on it" subsys=k8s-watcher
2024-01-26T00:04:42.962400165Z level=info msg="Endpoints restored" failed=0 restored=6 subsys=daemon
2024-01-26T00:04:42.962430052Z level=info msg="Addressing information:" subsys=daemon
2024-01-26T00:04:42.962433422Z level=info msg="  Cluster-Name: default" subsys=daemon
2024-01-26T00:04:42.962436919Z level=info msg="  Cluster-ID: 0" subsys=daemon
2024-01-26T00:04:42.962439527Z level=info msg="  Local node-name: pool-o7caf" subsys=daemon
2024-01-26T00:04:42.962442045Z level=info msg="  Node-IPv6: <nil>" subsys=daemon
2024-01-26T00:04:42.962448986Z level=info msg="  External-Node IPv4: 10.136.0.13" subsys=daemon
2024-01-26T00:04:42.962455059Z level=info msg="  Internal-Node IPv4: 10.244.0.109" subsys=daemon
2024-01-26T00:04:42.962458750Z level=info msg="  IPv4 allocation prefix: 10.244.0.0/25" subsys=daemon
2024-01-26T00:04:42.962462305Z level=info msg="  IPv4 native routing prefix: 10.244.0.0/16" subsys=daemon
2024-01-26T00:04:42.962466128Z level=info msg="  Loopback IPv4: 169.254.42.1" subsys=daemon
2024-01-26T00:04:42.962469546Z level=info msg="  Local IPv4 addresses:" subsys=daemon
2024-01-26T00:04:42.962473028Z level=info msg="  - 157.245.80.140" subsys=daemon
2024-01-26T00:04:42.962476366Z level=info msg="  - 10.10.0.24" subsys=daemon
2024-01-26T00:04:42.962479782Z level=info msg="  - 10.136.0.13" subsys=daemon
2024-01-26T00:04:42.962483332Z level=info msg="  - 10.244.0.109" subsys=daemon
2024-01-26T00:04:42.962486436Z level=info msg="  - 157.245.80.140" subsys=daemon
2024-01-26T00:04:42.962490491Z level=info msg="Node updated" clusterName=default nodeName=pool-o7caf subsys=nodemanager
2024-01-26T00:04:42.962494904Z level=info msg="Adding local node to cluster" node="{pool-o7caf default [{ExternalIP 157.245.80.140} {InternalIP 10.136.0.13} {CiliumInternalIP 10.244.0.109} {ExternalIP 157.245.80.140}] 10.244.0.0/25 [] <nil> [] 10.244.0.27 <nil> <nil> <nil> 0 local 0 map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:s-2vcpu-4gb beta.kubernetes.io/os:linux doks.digitalocean.com/managed:true doks.digitalocean.com/node-id:2684529e-fb0e-4062-8414-beda244612e9 doks.digitalocean.com/node-pool:pool doks.digitalocean.com/node-pool-id:4fda5043-42f3-4379-85a1-caf2b1eacb29 doks.digitalocean.com/version:1.29.0-do.0 failure-domain.beta.kubernetes.io/region:nyc1 kubernetes.io/arch:amd64 kubernetes.io/hostname:pool-o7caf kubernetes.io/os:linux node.kubernetes.io/instance-type:s-2vcpu-4gb topology.kubernetes.io/region:nyc1] map[] 1 }" subsys=nodediscovery
2024-01-26T00:04:42.965301180Z level=info msg="Creating or updating CiliumNode resource" node=pool-o7caf subsys=nodediscovery
2024-01-26T00:04:42.965716920Z level=info msg="Waiting until all pre-existing resources have been received" subsys=k8s-watcher
2024-01-26T00:04:43.348121938Z level=info msg="Annotating k8s node" subsys=daemon v4CiliumHostIP.IPv4=10.244.0.109 v4IngressIP.IPv4="<nil>" v4Prefix=10.244.0.0/25 v4healthIP.IPv4=10.244.0.27 v6CiliumHostIP.IPv6="<nil>" v6IngressIP.IPv6="<nil>" v6Prefix="<nil>" v6healthIP.IPv6="<nil>"
2024-01-26T00:04:43.348367448Z level=info msg="Initializing identity allocator" subsys=identity-cache
2024-01-26T00:04:43.348470354Z level=info msg="Allocating identities between range" cluster-id=0 max=65535 min=256 subsys=identity-cache
2024-01-26T00:04:43.350564210Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_host.forwarding sysParamValue=1
2024-01-26T00:04:43.350756885Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_host.rp_filter sysParamValue=0
2024-01-26T00:04:43.350868487Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_host.accept_local sysParamValue=1
2024-01-26T00:04:43.350977126Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_host.send_redirects sysParamValue=0
2024-01-26T00:04:43.351149592Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_net.forwarding sysParamValue=1
2024-01-26T00:04:43.351260008Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_net.rp_filter sysParamValue=0
2024-01-26T00:04:43.351345559Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_net.accept_local sysParamValue=1
2024-01-26T00:04:43.351459024Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_net.send_redirects sysParamValue=0
2024-01-26T00:04:43.351702548Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.core.bpf_jit_enable sysParamValue=1
2024-01-26T00:04:43.351815163Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.all.rp_filter sysParamValue=0
2024-01-26T00:04:43.351903376Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.fib_multipath_use_neigh sysParamValue=1
2024-01-26T00:04:43.352029385Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=kernel.unprivileged_bpf_disabled sysParamValue=1
2024-01-26T00:04:43.352122822Z level=info msg="Setting sysctl" subsys=sysctl sysParamName=kernel.timer_migration sysParamValue=0
2024-01-26T00:04:43.353289679Z level=info msg="Setting up BPF datapath" bpfClockSource=ktime bpfInsnSet="<nil>" subsys=datapath-loader
2024-01-26T00:04:43.548237273Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
2024-01-26T00:04:43.890761610Z level=info msg="Iptables rules installed" subsys=iptables
2024-01-26T00:04:43.891115167Z level=info msg="Adding new proxy port rules for cilium-dns-egress:46327" id=cilium-dns-egress subsys=proxy
2024-01-26T00:04:43.893717449Z level=info msg="Iptables proxy rules installed" subsys=iptables
2024-01-26T00:04:43.895074952Z level=info msg="Start hook executed" duration=2.168474667s function="cmd.newDaemonPromise.func1 (daemon_main.go:1657)" subsys=hive
2024-01-26T00:04:43.895119583Z level=info msg="Start hook executed" duration="12.659µs" function="utime.initUtimeSync.func1 (cell.go:34)" subsys=hive
2024-01-26T00:04:43.895127935Z level=info msg="Start hook executed" duration="5.428µs" function="*job.group.Start" subsys=hive
2024-01-26T00:04:43.895133373Z level=info msg="Start hook executed" duration="49.337µs" function="l2respondermap.newMap.func1 (l2_responder_map4.go:45)" subsys=hive
2024-01-26T00:04:43.895138709Z level=info msg="Start hook executed" duration="3.031µs" function="*job.group.Start" subsys=hive
2024-01-26T00:04:43.895236946Z level=info msg="Starting IP identity watcher" subsys=ipcache
2024-01-26T00:04:43.895451436Z level=info msg="Initializing daemon" subsys=daemon
2024-01-26T00:04:43.895469615Z level=info msg="Validating configured node address ranges" subsys=daemon
2024-01-26T00:04:43.895496976Z level=info msg="Starting connection tracking garbage collector" subsys=daemon
2024-01-26T00:04:43.897047924Z level=info msg="Conntrack garbage collector interval recalculated" deleteRatio=7.62939453125e-05 newInterval=7m30s subsys=map-ct
2024-01-26T00:04:43.897065581Z level=info msg="Initial scan of connection tracking completed" subsys=ct-gc
2024-01-26T00:04:43.897158707Z level=info msg="Regenerating restored endpoints" numRestored=6 subsys=daemon
2024-01-26T00:04:43.897380334Z level=info msg="New endpoint" containerID=77ffdb95b8 datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=188 identity=26484 ipv4=10.244.0.99 ipv6= k8sPodName=kube-system/hubble-relay-568f96b49b-vrssn subsys=endpoint
2024-01-26T00:04:43.897767756Z level=info msg="New endpoint" containerID=da67bbf6ea datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=1375 identity=2782 ipv4=10.244.0.85 ipv6= k8sPodName=kube-system/hubble-ui-dd784fb98-vlh74 subsys=endpoint
2024-01-26T00:04:43.898534050Z level=info msg="New endpoint" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=1013 identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2024-01-26T00:04:43.898551402Z level=info msg="New endpoint" containerID=27a24173b5 datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=716 identity=21226 ipv4=10.244.0.64 ipv6= k8sPodName=kube-system/coredns-6857f5494d-vwfh6 subsys=endpoint
2024-01-26T00:04:43.898914880Z level=info msg="New endpoint" containerID=bb1c6e4107 datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=3484 identity=10529 ipv4=10.244.0.112 ipv6= k8sPodName=kube-system/konnectivity-agent-2v8kj subsys=endpoint
2024-01-26T00:04:43.899126866Z level=info msg="New endpoint" containerID=e1ca34e31b datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=2354 identity=26544 ipv4=10.244.0.77 ipv6= k8sPodName=kube-system/coredns-6857f5494d-9fq8l subsys=endpoint
2024-01-26T00:04:43.899229019Z level=info msg="Successfully restored endpoint. Scheduling regeneration" endpointID=2354 subsys=daemon
2024-01-26T00:04:43.899239049Z level=info msg="Successfully restored endpoint. Scheduling regeneration" endpointID=3484 subsys=daemon
2024-01-26T00:04:43.899244666Z level=info msg="Successfully restored endpoint. Scheduling regeneration" endpointID=716 subsys=daemon
2024-01-26T00:04:43.899614239Z level=info msg="Successfully restored endpoint. Scheduling regeneration" endpointID=1013 subsys=daemon
2024-01-26T00:04:43.899630563Z level=info msg="Successfully restored endpoint. Scheduling regeneration" endpointID=1375 subsys=daemon
2024-01-26T00:04:43.899662538Z level=info msg="Successfully restored endpoint. Scheduling regeneration" endpointID=188 subsys=daemon
2024-01-26T00:04:43.899831711Z level=info msg="Reusing existing global key" key="k8s:doks.digitalocean.com/managed=true;reserved:init=;k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=coredns;k8s:io.kubernetes.pod.namespace=kube-system;k8s:k8s-app=kube-dns;" subsys=allocator
2024-01-26T00:04:43.899913299Z level=info msg="Reusing existing global key" key="k8s:doks.digitalocean.com/managed=true;reserved:init=;k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=konnectivity-agent;k8s:io.kubernetes.pod.namespace=kube-system;k8s:k8s-app=konnectivity-agent;" subsys=allocator
2024-01-26T00:04:43.900073394Z level=info msg="Reusing existing global key" key="k8s:doks.digitalocean.com/managed=true;k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=coredns;k8s:io.kubernetes.pod.namespace=kube-system;k8s:k8s-app=kube-dns;" subsys=allocator
2024-01-26T00:04:43.900353241Z level=info msg="Reusing existing global key" key="k8s:app.kubernetes.io/name=hubble-ui;k8s:app.kubernetes.io/part-of=cilium;k8s:doks.digitalocean.com/managed=true;reserved:init=;k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=hubble-ui;k8s:io.kubernetes.pod.namespace=kube-system;k8s:k8s-app=hubble-ui;" subsys=allocator
2024-01-26T00:04:43.900466662Z level=info msg="Reusing existing global key" key="k8s:app.kubernetes.io/name=hubble-relay;k8s:app.kubernetes.io/part-of=cilium;k8s:doks.digitalocean.com/managed=true;k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=hubble-relay;k8s:io.kubernetes.pod.namespace=kube-system;k8s:k8s-app=hubble-relay;" subsys=allocator
2024-01-26T00:04:43.903343744Z level=info msg="Removed endpoint" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=3392 identity=4 ipv4=10.244.0.30 ipv6= k8sPodName=/ subsys=endpoint
2024-01-26T00:04:43.903635602Z level=info msg="Service has stale backends: triggering refresh" l3n4Addr="{AddrCluster:10.245.203.223 L4Addr:{Protocol:NONE Port:443} Scope:0}" orphanBackends=1 serviceID=3 serviceName=kube-system/hubble-peer subsys=service
2024-01-26T00:04:43.903849263Z level=info msg="Deleted orphan backends" orphanBackends=0 subsys=service
2024-01-26T00:04:43.909894169Z level=info msg="Launching Cilium health daemon" subsys=daemon
2024-01-26T00:04:43.915782854Z level=info msg="Launching Cilium health endpoint" subsys=daemon
2024-01-26T00:04:43.916025190Z level=info msg="Started healthz status API server" address="127.0.0.1:9879" subsys=daemon
2024-01-26T00:04:43.916074342Z level=info msg="Processing queued endpoint deletion requests from /var/run/cilium/deleteQueue" subsys=daemon
2024-01-26T00:04:43.916189367Z level=info msg="processing 0 queued deletion requests" subsys=daemon
2024-01-26T00:04:43.916200581Z level=info msg="Initializing Cilium API" subsys=daemon
2024-01-26T00:04:44.008081737Z level=info msg="Daemon initialization completed" bootstrapTime=3.075300848s subsys=daemon
2024-01-26T00:04:44.009645573Z level=info msg="Serving cilium API at unix:///var/run/cilium/cilium.sock" subsys=daemon
2024-01-26T00:04:44.009862278Z level=info msg="Configuring Hubble server" eventQueueSize=2048 maxFlows=4095 subsys=hubble
2024-01-26T00:04:44.010125005Z level=info msg="Starting local Hubble server" address="unix:///var/run/cilium/hubble.sock" subsys=hubble
2024-01-26T00:04:44.019216650Z level=info msg="Beginning to read perf buffer" startTime="2024-01-26 00:04:44.019011616 +0000 UTC m=+3.164582993" subsys=monitor-agent
2024-01-26T00:04:44.026700452Z level=info msg="Starting Hubble server" address=":4244" subsys=hubble
2024-01-26T00:04:44.423351037Z level=info msg="Compiled new BPF template" BPFCompilationTime=517.96385ms file-path=/var/run/cilium/state/templates/ee45e889ba1ebc46ccd364277ee0a6dcfd1fdf536c1a3c96824acb8e9e59a485/bpf_host.o subsys=datapath-loader
2024-01-26T00:04:44.427521331Z level=info msg="Re-pinning map with ':pending' suffix" bpfMapName=cilium_calls_hostns_01013 bpfMapPath=/sys/fs/bpf/tc/globals/cilium_calls_hostns_01013 subsys=bpf
2024-01-26T00:04:44.432339008Z level=info msg="Unpinning map after successful recreation" bpfMapName=cilium_calls_hostns_01013 bpfMapPath="/sys/fs/bpf/tc/globals/cilium_calls_hostns_01013:pending" subsys=bpf
2024-01-26T00:04:44.436300519Z level=info msg="Re-pinning map with ':pending' suffix" bpfMapName=cilium_calls_netdev_00006 bpfMapPath=/sys/fs/bpf/tc/globals/cilium_calls_netdev_00006 subsys=bpf
2024-01-26T00:04:44.440506589Z level=info msg="Unpinning map after successful recreation" bpfMapName=cilium_calls_netdev_00006 bpfMapPath="/sys/fs/bpf/tc/globals/cilium_calls_netdev_00006:pending" subsys=bpf
2024-01-26T00:04:44.440594962Z level=info msg="Rewrote endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1013 identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2024-01-26T00:04:44.441248004Z level=info msg="Restored endpoint" endpointID=1013 ipAddr="[ ]" subsys=endpoint
2024-01-26T00:04:44.548550782Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
2024-01-26T00:04:45.032551571Z level=info msg="New endpoint" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=2510 ipv4=10.244.0.27 ipv6= k8sPodName=/ subsys=endpoint
2024-01-26T00:04:45.032770312Z level=info msg="Resolving identity labels (blocking)" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=2510 identityLabels="reserved:health" ipv4=10.244.0.27 ipv6= k8sPodName=/ subsys=endpoint
2024-01-26T00:04:45.032893383Z level=info msg="Identity of endpoint changed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=2510 identity=4 identityLabels="reserved:health" ipv4=10.244.0.27 ipv6= k8sPodName=/ oldIdentity="no identity" subsys=endpoint
2024-01-26T00:04:45.403891964Z level=info msg="Compiled new BPF template" BPFCompilationTime=1.501767707s file-path=/var/run/cilium/state/templates/0a76c06fbd3a7ff6a468a86e716d7c632a9f2067f6d96caaa28c5628ba641692/bpf_lxc.o subsys=datapath-loader
2024-01-26T00:04:45.410653039Z level=info msg="Re-pinning map with ':pending' suffix" bpfMapName=cilium_calls_03484 bpfMapPath=/sys/fs/bpf/tc/globals/cilium_calls_03484 subsys=bpf
2024-01-26T00:04:45.415997033Z level=info msg="Re-pinning map with ':pending' suffix" bpfMapName=cilium_calls_02354 bpfMapPath=/sys/fs/bpf/tc/globals/cilium_calls_02354 subsys=bpf
2024-01-26T00:04:45.469463405Z level=info msg="Unpinning map after successful recreation" bpfMapName=cilium_calls_03484 bpfMapPath="/sys/fs/bpf/tc/globals/cilium_calls_03484:pending" subsys=bpf
2024-01-26T00:04:45.469506208Z level=info msg="Rewrote endpoint BPF program" containerID=bb1c6e4107 datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=3484 identity=10529 ipv4=10.244.0.112 ipv6= k8sPodName=kube-system/konnectivity-agent-2v8kj subsys=endpoint
2024-01-26T00:04:45.478110959Z level=info msg="Re-pinning map with ':pending' suffix" bpfMapName=cilium_calls_01375 bpfMapPath=/sys/fs/bpf/tc/globals/cilium_calls_01375 subsys=bpf
2024-01-26T00:04:45.479430692Z level=info msg="Restored endpoint" endpointID=3484 ipAddr="[10.244.0.112 ]" subsys=endpoint
2024-01-26T00:04:45.519716495Z level=info msg="Unpinning map after successful recreation" bpfMapName=cilium_calls_02354 bpfMapPath="/sys/fs/bpf/tc/globals/cilium_calls_02354:pending" subsys=bpf
2024-01-26T00:04:45.519773211Z level=info msg="Rewrote endpoint BPF program" containerID=e1ca34e31b datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2354 identity=26544 ipv4=10.244.0.77 ipv6= k8sPodName=kube-system/coredns-6857f5494d-9fq8l subsys=endpoint
2024-01-26T00:04:45.532402052Z level=info msg="Restored endpoint" endpointID=2354 ipAddr="[10.244.0.77 ]" subsys=endpoint
2024-01-26T00:04:45.535879006Z level=info msg="Re-pinning map with ':pending' suffix" bpfMapName=cilium_calls_00188 bpfMapPath=/sys/fs/bpf/tc/globals/cilium_calls_00188 subsys=bpf
2024-01-26T00:04:45.550138716Z level=info msg="Unpinning map after successful recreation" bpfMapName=cilium_calls_01375 bpfMapPath="/sys/fs/bpf/tc/globals/cilium_calls_01375:pending" subsys=bpf
2024-01-26T00:04:45.550242261Z level=info msg="Rewrote endpoint BPF program" containerID=da67bbf6ea datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1375 identity=2782 ipv4=10.244.0.85 ipv6= k8sPodName=kube-system/hubble-ui-dd784fb98-vlh74 subsys=endpoint
2024-01-26T00:04:45.563264304Z level=info msg="Restored endpoint" endpointID=1375 ipAddr="[10.244.0.85 ]" subsys=endpoint
2024-01-26T00:04:45.593125470Z level=info msg="Re-pinning map with ':pending' suffix" bpfMapName=cilium_calls_00716 bpfMapPath=/sys/fs/bpf/tc/globals/cilium_calls_00716 subsys=bpf
2024-01-26T00:04:45.642603138Z level=info msg="Unpinning map after successful recreation" bpfMapName=cilium_calls_00188 bpfMapPath="/sys/fs/bpf/tc/globals/cilium_calls_00188:pending" subsys=bpf
2024-01-26T00:04:45.642835490Z level=info msg="Rewrote endpoint BPF program" containerID=77ffdb95b8 datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=188 identity=26484 ipv4=10.244.0.99 ipv6= k8sPodName=kube-system/hubble-relay-568f96b49b-vrssn subsys=endpoint
2024-01-26T00:04:45.644256652Z level=info msg="Restored endpoint" endpointID=188 ipAddr="[10.244.0.99 ]" subsys=endpoint
2024-01-26T00:04:45.705821201Z level=info msg="Unpinning map after successful recreation" bpfMapName=cilium_calls_00716 bpfMapPath="/sys/fs/bpf/tc/globals/cilium_calls_00716:pending" subsys=bpf
2024-01-26T00:04:45.705908265Z level=info msg="Rewrote endpoint BPF program" containerID=27a24173b5 datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=716 identity=21226 ipv4=10.244.0.64 ipv6= k8sPodName=kube-system/coredns-6857f5494d-vwfh6 subsys=endpoint
2024-01-26T00:04:45.706987230Z level=info msg="Restored endpoint" endpointID=716 ipAddr="[10.244.0.64 ]" subsys=endpoint
2024-01-26T00:04:45.707135247Z level=info msg="Finished regenerating restored endpoints" regenerated=6 subsys=daemon total=6
2024-01-26T00:04:45.707835652Z level=info msg="Removed stale bpf map" file-path=/sys/fs/bpf/tc/globals/cilium_lb4_source_range subsys=daemon
2024-01-26T00:04:45.734345427Z level=info msg="Rewrote endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2510 identity=4 ipv4=10.244.0.27 ipv6= k8sPodName=/ subsys=endpoint
2024-01-26T00:04:45.936505162Z level=info msg="Serving cilium health API at unix:///var/run/cilium/health.sock" subsys=health-server
2024-01-26T00:05:56.985744848Z level=info msg="Delete endpoint request" containerID=bb1c6e4107 endpointID=3484 k8sNamespace=kube-system k8sPodName=konnectivity-agent-2v8kj subsys=daemon
2024-01-26T00:05:56.986100153Z level=info msg="Releasing key" key="[k8s:doks.digitalocean.com/managed=true reserved:init k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=konnectivity-agent k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=konnectivity-agent]" subsys=allocator
2024-01-26T00:05:56.988354365Z level=info msg="Removed endpoint" containerID=bb1c6e4107 datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=3484 identity=10529 ipv4=10.244.0.112 ipv6= k8sPodName=kube-system/konnectivity-agent-2v8kj subsys=endpoint
2024-01-26T00:05:57.444237785Z level=info msg="Create endpoint request" addressing="&{10.244.0.47 91fd74df-3b1f-435b-b8ef-031008c03cab default   }" containerID=93558b36a83bd32e510d1d4feb10f8aea0f44d7724ca155609ce5ecb1abfd52e datapathConfiguration="&{false false false false false <nil>}" interface=lxc78519e01f29f k8sPodName=kube-system/konnectivity-agent-xjfkb labels="[]" subsys=daemon sync-build=true
2024-01-26T00:05:57.444275019Z level=info msg="New endpoint" containerID=93558b36a8 datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=3761 ipv4=10.244.0.47 ipv6= k8sPodName=kube-system/konnectivity-agent-xjfkb subsys=endpoint
2024-01-26T00:05:57.444283551Z level=info msg="Resolving identity labels (blocking)" containerID=93558b36a8 datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=3761 identityLabels="k8s:doks.digitalocean.com/managed=true,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=konnectivity-agent,k8s:io.kubernetes.pod.namespace=kube-system,k8s:k8s-app=konnectivity-agent" ipv4=10.244.0.47 ipv6= k8sPodName=kube-system/konnectivity-agent-xjfkb subsys=endpoint
2024-01-26T00:05:57.444377904Z level=info msg="Skipped non-kubernetes labels when labelling ciliumidentity. All labels will still be used in identity determination" labels="map[k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name:kube-system]" subsys=crd-allocator
2024-01-26T00:05:57.462595124Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
2024-01-26T00:05:57.462629809Z level=info msg="Invalid state transition skipped" containerID=93558b36a8 datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=3761 endpointState.from=waiting-for-identity endpointState.to=waiting-to-regenerate file=/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go ipv4=10.244.0.47 ipv6= k8sPodName=kube-system/konnectivity-agent-xjfkb line=611 subsys=endpoint
2024-01-26T00:05:57.462866690Z level=info msg="Allocated new global key" key="k8s:doks.digitalocean.com/managed=true;k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=konnectivity-agent;k8s:io.kubernetes.pod.namespace=kube-system;k8s:k8s-app=konnectivity-agent;" subsys=allocator
2024-01-26T00:05:57.462877879Z level=info msg="Identity of endpoint changed" containerID=93558b36a8 datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=3761 identity=46597 identityLabels="k8s:doks.digitalocean.com/managed=true,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=konnectivity-agent,k8s:io.kubernetes.pod.namespace=kube-system,k8s:k8s-app=konnectivity-agent" ipv4=10.244.0.47 ipv6= k8sPodName=kube-system/konnectivity-agent-xjfkb oldIdentity="no identity" subsys=endpoint
2024-01-26T00:05:57.462883264Z level=info msg="Waiting for endpoint to be generated" containerID=93558b36a8 datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=3761 identity=46597 ipv4=10.244.0.47 ipv6= k8sPodName=kube-system/konnectivity-agent-xjfkb subsys=endpoint
2024-01-26T00:05:57.550155103Z level=info msg="Rewrote endpoint BPF program" containerID=93558b36a8 datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=3761 identity=46597 ipv4=10.244.0.47 ipv6= k8sPodName=kube-system/konnectivity-agent-xjfkb subsys=endpoint
2024-01-26T00:05:57.550475668Z level=info msg="Successful endpoint creation" containerID=93558b36a8 datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=3761 identity=46597 ipv4=10.244.0.47 ipv6= k8sPodName=kube-system/konnectivity-agent-xjfkb subsys=daemon
2024-01-26T00:05:58.460132845Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager

Anything else?

This was initially discussed on Slack: https://cilium.slack.com/archives/C53TG4J4R/p1706254144830909

My best bet as to which change may have introduced a bug into Cilium is #29248 by @aanm. He also mentioned that there's another, currently still open PR #30170 (authored by @oblazek) that is related to his PR referenced earlier. Mentioning it here as well for the sake of completeness.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@timoreimann timoreimann added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Jan 30, 2024
@julianwiedmann
Member

Ta Timo, nice catch! One suggestion would be to bisect between v1.14.4 and v1.14.5 and see whether you truly end up at #29248. But the repro doesn't sound like a 5-minute exercise :).

@julianwiedmann julianwiedmann added kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. sig/agent Cilium agent related. labels Feb 1, 2024
@timoreimann
Contributor Author

Hey @julianwiedmann, thanks for the suggestion. You're right, it might involve some work.

I think I'll try to patch out the culprit change first from 1.14.5 and see if I'm lucky with that. If not, I'll start bisecting.

Will let you know once I know more.

@timoreimann
Contributor Author

So I went ahead and patched out #29251 (or, rather, its commits) from 1.14.5, which is the 1.14 backport of upstream #29248 that I suspected to be the culprit. I pushed the resulting branch to my fork (the top commit contains the reverting patch).

I then built the Cilium agent as well as operator images to test in a custom cluster. In that setup, the RBAC-based reproduction does not work anymore, meaning I could not trigger the bug (similar to how I wasn't able to do that in v1.14.4).

I then created 500 clusters and checked for pods having the reserved:init label: I could not find a single one, i.e., all clusters came up healthy.

Just to be safe, I also ran the counter-example and created several clusters on 1.14.5 (unpatched). Out of 100 cluster creations, I ran into two broken instances right away again.

Overall, I believe this confirms that the PR in question is the problematic one.

Worth highlighting that the backporting PR #29251 has an additional commit (endpoint: fix panic in RunMetadataResolver due to send on closed channel) that the original upstream PR did not have. I had planned to run another test where I reverted all commits from #29251 except for that specific one, to further distinguish which commit(s) are truly responsible; unfortunately, the revert does not apply cleanly in that combination. I made no further attempts to resolve the conflicts since I felt I was really stepping outside my territory and potentially making things worse.

Regardless, I think the above demonstrated that #29251 is related to the reported bug.

I can try to dig further into what the change in question did to hopefully understand where the essence of the problem lies, but I might need some pointers/guidance from folks more familiar with the code (possibly @aanm).

@timoreimann
Contributor Author

FWIW, here's the output of cilium endpoint log for an endpoint that suffers from the dangling reserved:init label:

Timestamp              Status    State                   Message
2024-02-12T09:43:01Z   OK        ready                   Successfully regenerated endpoint program (Reason: one or more identities created or deleted)
2024-02-12T09:43:01Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2024-02-12T09:43:01Z   OK        regenerating            Regenerating endpoint: one or more identities created or deleted
2024-02-12T09:43:01Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to one or more identities created or deleted
2024-02-12T09:43:00Z   OK        ready                   Successfully regenerated endpoint program (Reason: one or more identities created or deleted)
2024-02-12T09:43:00Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2024-02-12T09:43:00Z   OK        regenerating            Regenerating endpoint: one or more identities created or deleted
2024-02-12T09:43:00Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to one or more identities created or deleted
2024-02-12T09:25:52Z   OK        ready                   Successfully regenerated endpoint program (Reason: )
2024-02-12T09:25:52Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2024-02-12T09:25:52Z   OK        regenerating            Regenerating endpoint:
2024-02-12T09:25:52Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to
2024-02-12T09:25:51Z   OK        ready                   Successfully regenerated endpoint program (Reason: one or more identities created or deleted)
2024-02-12T09:25:51Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2024-02-12T09:25:51Z   OK        regenerating            Regenerating endpoint: one or more identities created or deleted
2024-02-12T09:25:51Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to one or more identities created or deleted
2024-02-12T09:25:50Z   OK        ready                   Successfully regenerated endpoint program (Reason: one or more identities created or deleted)
2024-02-12T09:25:50Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2024-02-12T09:25:50Z   OK        regenerating            Regenerating endpoint: one or more identities created or deleted
2024-02-12T09:25:50Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to one or more identities created or deleted
2024-02-12T09:25:50Z   OK        ready                   Successfully regenerated endpoint program (Reason: updated security labels)
2024-02-12T09:25:50Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2024-02-12T09:25:50Z   OK        regenerating            Regenerating endpoint: updated security labels
2024-02-12T09:25:50Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to updated security labels
2024-02-12T09:25:49Z   OK        ready                   Successfully regenerated endpoint program (Reason: one or more identities created or deleted)
2024-02-12T09:25:49Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2024-02-12T09:25:49Z   OK        regenerating            Regenerating endpoint: one or more identities created or deleted
2024-02-12T09:25:49Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to one or more identities created or deleted
2024-02-12T09:25:30Z   OK        ready                   Successfully regenerated endpoint program (Reason: retrying regeneration)
2024-02-12T09:25:30Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2024-02-12T09:25:30Z   OK        regenerating            Regenerating endpoint: retrying regeneration
2024-02-12T09:25:30Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to retrying regeneration
2024-02-12T09:25:30Z   Failure   ready                   Error regenerating endpoint: context canceled
2024-02-12T09:25:30Z   OK        ready                   Completed endpoint regeneration with no pending regeneration requests
2024-02-12T09:25:30Z   OK        regenerating            Regenerating endpoint: Initial build on endpoint creation
2024-02-12T09:25:30Z   OK        waiting-to-regenerate   Successfully regenerated endpoint program (Reason: updated security labels)
2024-02-12T09:25:30Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to Initial build on endpoint creation
2024-02-12T09:25:30Z   OK        regenerating            Regenerating endpoint: updated security labels
2024-02-12T09:25:30Z   OK        waiting-to-regenerate   Triggering endpoint regeneration due to updated security labels
2024-02-12T09:25:30Z   OK        ready                   Set identity for this endpoint
2024-02-12T09:25:30Z   OK        waiting-for-identity    Endpoint creation

Note the line

2024-02-12T09:25:30Z   Failure   ready                   Error regenerating endpoint: context canceled

where presumably, the API server wasn't immediately ready yet.

I thought the final waiting-for-identity sounded suspicious, but I see the same in the output in a good cluster where I patched out the suspected commits.

@aanm aanm removed the needs/triage This issue requires triaging to establish severity and next steps. label Feb 22, 2024
@aanm aanm self-assigned this Feb 22, 2024