Prometheus charm comes up and then goes down after relations from another model are added. #103
Comments
Hi @amc94, let me try to understand the situation.
Are you able to reproduce the same behaviour using
In charm code we call the Pebble API behind a `can_connect` guard. It is one of those cases that we deemed "ok to go into error state". We often see pebble exceptions after the `can_connect` guard when testing on a slow VM (although this is the first time I see this particular one). Is that a transient error? In the logs (1, 2, 3) it is active/idle.
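For context, the guard in question looks roughly like this (a minimal sketch using the ops/Pebble API, not the charm's exact code):

```python
import ops


class PrometheusCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(self.on.config_changed, self._on_config_changed)

    def _on_config_changed(self, event: ops.EventBase) -> None:
        container = self.unit.get_container("prometheus")

        # Guard: only talk to Pebble once its socket is reachable.
        if not container.can_connect():
            event.defer()
            return

        # The workload can still die between the check and this call
        # (e.g. OOMKilled on a slow/overloaded VM), so a pebble error
        # raised here puts the unit into error state.
        container.replan()
```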
@amc94 from the screenshot it looks like prometheus was in error for about 40 sec and then eventually went active/idle? It would also be handy to see the output of
@sed-i it's persistent; it hits active/idle for a small amount of time after a restart.
Thanks @amc94, we have another hint: prometheus is being OOMKilled.
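To confirm, the OOM kills should show up in the pod's last container state (a sketch; the `cos` namespace and pod name are taken from the crash-loop message in the bug description):

```shell
# Look for "Reason: OOMKilled" in the last container state
kubectl -n cos describe pod prometheus-0 | grep -A4 "Last State"

# Or query the termination reason directly
kubectl -n cos get pod prometheus-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="prometheus")].lastState.terminated.reason}'
```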
Any chance prometheus has accumulated a large WAL that doesn't fit into memory? (Could you attach the output of `du -sh /var/lib/prometheus/wal`?)
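For reference, one way to run that check from the Juju side (assuming the workload container is named `prometheus`):

```shell
# Report the WAL directory size inside the workload container
juju ssh --container prometheus prometheus/0 du -sh /var/lib/prometheus/wal
```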
This type of failure could be more obvious if you apply resource limits to the pod:
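For example (a sketch, assuming the charm exposes `cpu` and `memory` config options for setting pod resource limits):

```shell
# Cap the workload pod so memory pressure surfaces as a clear limit breach
juju config prometheus cpu=500m memory=2Gi
```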
16M /var/lib/prometheus/wal
Yep, 500*10 = 5000 values every 30 sec is not a high load at all, and the WAL reflects it.
journalctl was empty for both |
Really odd to see
@sed-i
(Technically, cos-proxy doesn't send metrics; cos-proxy sends scrape job specs over relation data to prometheus, and prometheus does the scraping.) It is much more likely that loki gets overloaded. When both prom and loki consume a lot of resources, I've seen the OOM-kill algorithm select prometheus over loki. From the jenkins logs you shared I couldn't spot the bundle yamls that are related to the cos charms.
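For illustration, the relation data carries scrape job specs shaped roughly like Prometheus scrape configs (a hypothetical sketch, not the exact interface schema; the job name and target are made up):

```yaml
# Hypothetical scrape job spec as cos-proxy might publish it over relation data
scrape_jobs:
  - job_name: "node-exporter"          # illustrative job name
    metrics_path: "/metrics"
    static_configs:
      - targets: ["10.0.0.5:9100"]     # illustrative scrape target
```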
Thank you for explaining.
Have you seen this error recently? |
It has not recurred.
Bug Description
After Solutions QA successfully deploys the cos layer, we deploy another layer such as kubernetes or openstack. When cos-proxy is related to prometheus, prometheus seems to go into an error state. Often juju status says 'installing agent' and the unit has the message 'crash loop backoff: back-off 5m0s restarting failed container=prometheus pod=prometheus-0_cos'.
some failed runs:
https://solutions.qa.canonical.com/testruns/80f369b2-cf62-4eea-9aa8-79d6ce619ab7
https://solutions.qa.canonical.com/testruns/b2d5136c-032b-444e-bc63-38676f812450
https://solutions.qa.canonical.com/testruns/123275ec-4ee3-48b3-869d-3a6021611897
logs:
https://oil-jenkins.canonical.com/artifacts/80f369b2-cf62-4eea-9aa8-79d6ce619ab7/index.html
https://oil-jenkins.canonical.com/artifacts/b2d5136c-032b-444e-bc63-38676f812450/index.html
https://oil-jenkins.canonical.com/artifacts/123275ec-4ee3-48b3-869d-3a6021611897/index.html
To Reproduce
On top of MAAS we bootstrap a Juju controller, deploy microk8s v1.29 and COS on latest/stable, and then either an openstack layer or a kubernetes-maas layer.
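A rough sketch of the equivalent commands (assuming the cos-lite bundle and a cross-model offer; cloud, controller, and model names are made up, and endpoint names are taken from the cos-proxy and prometheus-k8s charms and may differ by channel):

```shell
# Bootstrap on MAAS and register the microk8s cluster as a k8s cloud
juju bootstrap maas-cloud maas-controller
juju add-k8s microk8s-cloud --controller maas-controller

# Deploy COS in a k8s model and offer prometheus cross-model
juju add-model cos microk8s-cloud
juju deploy cos-lite --trust --channel latest/stable
juju offer cos.prometheus:metrics-endpoint

# From the machine model, consume the offer and relate cos-proxy to it
juju add-model openstack
juju deploy cos-proxy
juju consume cos.prometheus
juju relate cos-proxy:downstream-prometheus-scrape prometheus
```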
Environment
Both of these runs were on KVMs.
Relevant log output
Additional context
The main bug is prometheus falling back into an 'installing agent' state after it has already been set up. I'll keep adding test runs that I come across with this error.