[Kubernetes] Enable state_cronjob toggle breaks data collection #10845

Open
jlind23 opened this issue Aug 21, 2024 · 3 comments
Labels
Integration:kubernetes Kubernetes Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team [elastic/obs-cloudnative-monitoring]

Comments

@jlind23
Contributor

jlind23 commented Aug 21, 2024

While debugging a case, we found that enabling the state_cronjob toggle put the Elastic Agent leader into some sort of frozen state, and no error or warning logs were generated.

image

Kubernetes version: 1.20.11
Kube state metrics: 2.4.2
Stack version: 8.14.3

@jlind23
Contributor Author

jlind23 commented Aug 21, 2024

@graphaelli @mlunadia sorry to bother you, but I do not really know which label I should apply to this in order to get someone working on it.

@jlind23 jlind23 added the Integration:kubernetes Kubernetes label Aug 21, 2024
@graphaelli graphaelli added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team [elastic/obs-cloudnative-monitoring] label Aug 21, 2024
@gizas
Contributor

gizas commented Aug 27, 2024

I ran a first round of testing locally with kind 1.20.15 and kube-state-metrics:v2.4.2.

As you can see, KSM has used the v1 API for CronJobs since v2.40.

Some more details from the k8s cluster:

kubectl api-resources -o wide | grep -i cronjob
cronjobs                          cj           batch/v1beta1                          true         CronJob                          create,delete,deletecollection,get,list,patch,update,watch   all

❯ kubectl version
Server Version: v1.20.15

This shows that the cluster serves CronJobs under batch/v1beta1, and NOT under the batch/v1 API that our implementation expects.
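The same check can be scripted. The following is only a minimal sketch: it assumes the `kubectl api-resources -o wide` output has already been captured as text, and the helper name `cronjob_api_versions` is hypothetical, not part of any Elastic or Kubernetes tooling.

```python
def cronjob_api_versions(api_resources_output: str) -> list[str]:
    """Return the batch/* API versions listed for rows whose KIND is CronJob."""
    versions = []
    for line in api_resources_output.splitlines():
        fields = line.split()
        # Rows look like: NAME SHORTNAMES APIVERSION NAMESPACED KIND VERBS CATEGORIES
        if "CronJob" in fields:
            versions.extend(f for f in fields if f.startswith("batch/"))
    return versions


# Row copied from the cluster output above.
sample = (
    "cronjobs   cj   batch/v1beta1   true   CronJob   "
    "create,delete,deletecollection,get,list,patch,update,watch   all"
)

versions = cronjob_api_versions(sample)
serves_v1 = "batch/v1" in versions
print(versions, serves_v1)
```

If `serves_v1` is false, a client that lists `*v1.CronJob` would be expected to hit the "could not find the requested resource" error on that cluster.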

Also in KSM logs we can see

W0827 14:49:00.631664       1 reflector.go:324] pkg/mod/k8s.io/client-go@v0.23.3/tools/cache/reflector.go:167: failed to list *v1.CronJob: the server could not find the requested resource
E0827 14:49:00.631716       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.23.3/tools/cache/reflector.go:167: Failed to watch *v1.CronJob: failed to list *v1.CronJob: the server could not find the requested resource
W0827 14:49:03.377860       1 reflector.go:324] pkg/mod/k8s.io/client-go@v0.23.3/tools/cache/reflector.go:167: failed to list *v1.PodDisruptionBudget: the server could not find the requested resource
E0827 14:49:03.377887       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.23.3/tools/cache/reflector.go:167: Failed to watch *v1.PodDisruptionBudget: failed to list *v1.PodDisruptionBudget: the server could not find the requested resource

Now to our ES setup, with the k8s integration installed:

  • Initially when cronjob is disabled:

You can see that we receive metrics:

{"log.level":"info","@timestamp":"2024-08-27T14:50:36.569Z","message":"Non-zero metrics in the last 30s","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"monitoring","log.origin":{"file.line":187,"file.name":"log/log.go","function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot"},"service.name":"metricbeat","monitoring":{"ecs.version":"1.6.0","metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":2813865984}}}},"cpu":{"system":{"ticks":4740,"time":{"ms":40}},"total":{"ticks":11710,"time":{"ms":120},"value":11710},"user":{"ticks":6970,"time":{"ms":80}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":19},"info":{"ephemeral_id":"b94ac108-249f-4c69-9708-f7393f4aa056","uptime":{"ms":3630055},"version":"8.14.3"},"memstats":{"gc_next":80328416,"memory_alloc":39507344,"memory_total":1538689880,"rss":208138240},"runtime":{"goroutines":252}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":14}},"output":{"events":{"acked":161,"active":0,"batches":3,"total":161},"read":{"bytes":4639,"errors":3},"write":{"bytes":40944,"latency":{"histogram":{"count":332,"max":92,"mean":21.656626506024097,"median":20,"min":12,"p75":24,"p95":30.349999999999984,"p99":38.669999999999995,"p999":92,"stddev":5.892320768697601}}}},"pipeline":{"clients":14,"events":{"active":37,"published":150,"total":150},"queue":{"acked":161}}},"metricbeat":{"kubernetes":{"state_container":{"events":42,"success":42},"state_daemonset":{"events":9,"success":9},"state_deployment":{"events":9,"success":9},"state_job":{"events":9,"success":9},"state_namespace":{"events":15,"success":15},"state_node":{"events":3,"success":3},"state_pod":{"events":42,"success":42},"state_replicaset":{"events":9,"success":9},"state_service":{"events":9,"success":9},"state_storageclass":{"events":3,"success":3}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":1.64,"15":0.7,"5":0.93,"norm":{"1":0.328,"15":0.14,"5":0.186}}}}},"ecs.version":"1.6.0"}
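To compare which state_* metricsets report events before and after the toggle, the `monitoring.metrics.metricbeat.kubernetes` section of such a snapshot line can be pulled out. This is a rough sketch: the function name `reporting_metricsets` is hypothetical, and the sample line is a heavily trimmed version of the log entry above, not a complete one.

```python
import json


def reporting_metricsets(log_line: str) -> dict[str, int]:
    """Map each kubernetes metricset in a monitoring snapshot to its event count."""
    doc = json.loads(log_line)
    kubernetes = (
        doc.get("monitoring", {})
        .get("metrics", {})
        .get("metricbeat", {})
        .get("kubernetes", {})
    )
    return {name: stats.get("events", 0) for name, stats in kubernetes.items()}


# Trimmed sample: only two of the metricsets from the log entry above are kept.
trimmed_line = (
    '{"message":"Non-zero metrics in the last 30s",'
    '"monitoring":{"metrics":{"metricbeat":{"kubernetes":{'
    '"state_job":{"events":9,"success":9},'
    '"state_pod":{"events":42,"success":42}}}}}}'
)

counts = reporting_metricsets(trimmed_line)
print(counts)
```

An empty result for a later snapshot would suggest that no state_* metricset published events in that 30s window.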

Screenshot 2024-08-27 at 5 52 31 PM

  • When I enable the cronjob, I see v1.CronJob errors in the agent:
{"log.level":"error","@timestamp":"2024-08-27T14:54:02.373Z","message":"W0827 14:54:02.373061   33100 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.CronJob: the server could not find the requested resource","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-08-27T14:54:02.373Z","message":"E0827 14:54:02.373092   33100 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.CronJob: failed to list *v1.CronJob: the server could not find the requested resource","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}

Then ingestion stops!
A clearer indication of the KSM failure would be nice. I will try testing with other versions or with debug enabled.

@jlind23 did you see the same behaviour?
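Until there is a clearer indication, one way to surface the failure is to scan the KSM or agent logs for the reflector message and report which kinds cannot be listed. A minimal sketch, assuming the message format matches the log lines quoted above; `unlistable_kinds` is a hypothetical helper, not an existing API.

```python
import re
from collections import Counter

# Matches the client-go reflector text seen above, e.g. "failed to list *v1.CronJob".
LIST_FAILURE = re.compile(r"failed to list \*([\w.]+)")


def unlistable_kinds(log_lines):
    """Count, per resource kind, the log lines reporting a failed list call."""
    counts = Counter()
    for line in log_lines:
        match = LIST_FAILURE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines in the shape of the KSM output above (source path shortened).
ksm_lines = [
    "W0827 14:49:00.631664       1 reflector.go:324] reflector.go:167: "
    "failed to list *v1.CronJob: the server could not find the requested resource",
    "E0827 14:49:00.631716       1 reflector.go:138] reflector.go:167: "
    "Failed to watch *v1.CronJob: failed to list *v1.CronJob: "
    "the server could not find the requested resource",
]

failures = unlistable_kinds(ksm_lines)
print(dict(failures))
```

Because the pattern only looks at the message text, it also matches the JSON-wrapped agent log lines, where the same string appears inside the "message" field.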

@gizas gizas assigned gizas and unassigned gizas Aug 27, 2024
@jlind23
Contributor Author

jlind23 commented Aug 27, 2024

@gizas yes this is exactly the behaviour we observed!
