[Kubernetes] Enable state_cronjob toggle breaks data collection #10845

jlind23 · 2024-08-21T17:07:15Z

While debugging a case, we found that enabling state_cronjob put the Elastic Agent leader into some sort of frozen state, with no error or warning logs generated.

Kubernetes version: 1.20.11
Kube state metrics: 2.4.2
Stack version: 8.14.3


jlind23 · 2024-08-21T17:08:52Z

@graphaelli @mlunadia sorry to bother you, but I do not really know which label I should apply to this in order to get someone working on it.

gizas · 2024-08-27T15:01:28Z

I ran a first round of testing locally with kind 1.20.15 and kube-state-metrics:v2.4.2

Note that the v1 enhancement for CronJobs was added in KSM v2.4.0.

Some more details from the k8s cluster:

kubectl api-resources -o wide | grep -i cronjob
cronjobs                          cj           batch/v1beta1                          true         CronJob                          create,delete,deletecollection,get,list,patch,update,watch   all

❯ kubectl version
Server Version: v1.20.15

This shows that on this cluster CronJobs are served under batch/v1beta1 and NOT under batch/v1, which is what our implementation expects.
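The check above can be scripted. The following is a minimal sketch, assuming the `kubectl api-resources -o wide` output has the column layout shown above (NAME, optional SHORTNAMES, APIVERSION, NAMESPACED, KIND, ...). The function names `served_api_versions` and `serves_batch_v1_cronjob` are hypothetical helpers for illustration, not part of kubectl or the integration.

```python
from typing import List


def served_api_versions(api_resources_output: str, kind: str) -> List[str]:
    """Return the APIVERSION values of rows whose KIND column equals `kind`.

    Assumes rows shaped like the `kubectl api-resources -o wide` output
    pasted in this thread. SHORTNAMES is optional, so the KIND field is
    located first and APIVERSION is read two fields to its left.
    """
    versions = []
    for line in api_resources_output.splitlines():
        fields = line.split()
        if kind not in fields:
            continue
        idx = fields.index(kind)
        # The field right before KIND must be the NAMESPACED boolean.
        if idx >= 2 and fields[idx - 1] in ("true", "false"):
            versions.append(fields[idx - 2])
    return versions


def serves_batch_v1_cronjob(api_resources_output: str) -> bool:
    """True if the cluster lists CronJob under batch/v1."""
    return "batch/v1" in served_api_versions(api_resources_output, "CronJob")
```

Fed the line captured from the 1.20.15 cluster, `serves_batch_v1_cronjob` returns False, which matches the observation that only batch/v1beta1 is available there.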

The KSM logs also show:

W0827 14:49:00.631664       1 reflector.go:324] pkg/mod/k8s.io/client-go@v0.23.3/tools/cache/reflector.go:167: failed to list *v1.CronJob: the server could not find the requested resource
E0827 14:49:00.631716       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.23.3/tools/cache/reflector.go:167: Failed to watch *v1.CronJob: failed to list *v1.CronJob: the server could not find the requested resource
W0827 14:49:03.377860       1 reflector.go:324] pkg/mod/k8s.io/client-go@v0.23.3/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: the server could not find the requested resource
E0827 14:49:03.377887       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.23.3/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: the server could not find the requested resource
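To spot this kind of failure quickly, the reflector messages can be scanned for the resource kinds that could not be listed. This is a rough sketch that assumes the message wording shown in the lines above ("failed to list *<kind>:"). `failed_list_kinds` is a hypothetical helper, and the pattern may need adjusting for other client-go versions.

```python
import re
from typing import List

# Matches the "failed to list *v1.CronJob:" fragment of the reflector messages.
FAILED_LIST = re.compile(r"failed to list \*([\w.]+):")


def failed_list_kinds(log_text: str) -> List[str]:
    """Return the sorted, de-duplicated kinds that the reflector failed to list."""
    kinds = set()
    for line in log_text.splitlines():
        match = FAILED_LIST.search(line)
        if match:
            kinds.add(match.group(1))
    return sorted(kinds)
```

Because it only looks for the message fragment, the same function also works on the JSON-wrapped agent log lines, where the reflector message is embedded in the "message" field.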

Now on to our ES setup, with the Kubernetes integration installed.

Initially, with state_cronjob disabled, we receive metrics:

{"log.level":"info","@timestamp":"2024-08-27T14:50:36.569Z","message":"Non-zero metrics in the last 30s","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"monitoring","log.origin":{"file.line":187,"file.name":"log/log.go","function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot"},"service.name":"metricbeat","monitoring":{"ecs.version":"1.6.0","metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":2813865984}}}},"cpu":{"system":{"ticks":4740,"time":{"ms":40}},"total":{"ticks":11710,"time":{"ms":120},"value":11710},"user":{"ticks":6970,"time":{"ms":80}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":19},"info":{"ephemeral_id":"b94ac108-249f-4c69-9708-f7393f4aa056","uptime":{"ms":3630055},"version":"8.14.3"},"memstats":{"gc_next":80328416,"memory_alloc":39507344,"memory_total":1538689880,"rss":208138240},"runtime":{"goroutines":252}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":14}},"output":{"events":{"acked":161,"active":0,"batches":3,"total":161},"read":{"bytes":4639,"errors":3},"write":{"bytes":40944,"latency":{"histogram":{"count":332,"max":92,"mean":21.656626506024097,"median":20,"min":12,"p75":24,"p95":30.349999999999984,"p99":38.669999999999995,"p999":92,"stddev":5.892320768697601}}}},"pipeline":{"clients":14,"events":{"active":37,"published":150,"total":150},"queue":{"acked":161}}},"metricbeat":{"kubernetes":{"state_container":{"events":42,"success":42},"state_daemonset":{"events":9,"success":9},"state_deployment":{"events":9,"success":9},"state_job":{"events":9,"success":9},"state_namespace":{"events":15,"success":15},"state_node":{"events":3,"success":3},"state_pod":{"events":42,"success":42},"state_replicaset":{"events":9,"success":9},"state_service":{"events":9,"success":9},"state_storageclass":{"events":3,"success":3}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":1.64,"15":0.7,"5":0.93,"norm":{"1":0.328,"15":0.14,"5":0.186}}}}},"ecs.version":"1.6.0"}
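The per-metricset counters are buried deep in that monitoring line, so a small script can make them easier to compare before and after the toggle. This is a sketch that assumes the nesting seen in the log above (monitoring.metrics.metricbeat.kubernetes.<metricset>.events). `metricset_events` is a hypothetical helper, not something shipped with the agent.

```python
import json
from typing import Dict


def metricset_events(log_line: str) -> Dict[str, int]:
    """Map each kubernetes metricset name to its `events` count.

    Expects one "Non-zero metrics in the last 30s" JSON log line.
    Returns an empty dict when the kubernetes section is absent,
    which would be consistent with ingestion having stopped.
    """
    record = json.loads(log_line)
    kubernetes = (
        record.get("monitoring", {})
        .get("metrics", {})
        .get("metricbeat", {})
        .get("kubernetes", {})
    )
    return {name: stats.get("events", 0) for name, stats in kubernetes.items()}
```

Running it over successive monitoring lines and watching for the state_* entries to disappear could serve as a crude signal that collection has stalled.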

When I enable state_cronjob, I see v1.CronJob errors in the agent logs:

{"log.level":"error","@timestamp":"2024-08-27T14:54:02.373Z","message":"W0827 14:54:02.373061   33100 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.CronJob: the server could not find the requested resource","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-08-27T14:54:02.373Z","message":"E0827 14:54:02.373092   33100 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.CronJob: failed to list *v1.CronJob: the server could not find the requested resource","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}

Then ingestion stops!
A clearer indication of the KSM failure would be nice. I will try testing with other versions or with debug logging.

@jlind23 did you see the same behaviour?

jlind23 · 2024-08-27T15:20:00Z

@gizas yes this is exactly the behaviour we observed!

jlind23 added the Integration:kubernetes Kubernetes label Aug 21, 2024

graphaelli added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team [elastic/obs-cloudnative-monitoring] label Aug 21, 2024

tetianakravchenko assigned gizas Aug 26, 2024

gizas assigned gizas and unassigned gizas Aug 27, 2024
