Metrics stop working within a few hours of node restart with v0.9.4 running as a bridge #2188
Comments
Could you please give more information about the "metrics absent" alert? Is it triggered when there are no metrics at all, or does it target specific ones?
I can confirm that. My bridge just stops sending metrics to the otel-collector after a while.
Our Prometheus server has alert rules configured that use the "absent_over_time" function for the 2 bridge metrics we are most interested in:
This issue also happens on full nodes.
Does this happen only on 0.9.4? Can you try building the binary with #2175 reverted? It's the only patch that touched metrics at all since the last release, so maybe it broke something.
Yes, we're only seeing this with
Did you revert to #2175, or did you revert only that commit? Asking because it is important to rule this one out.
We've built the binary after running
@Pete-LunaNova Could you try running the latest main? It has a fix for a deadlock that was blocking the async metric exporter.
No problem. I've built this and it is up and running on our backup bridge node, commit:
I've not seen a failure in exporting metrics since starting this new version yesterday, so it looks like the issue has been fixed in the latest main. Thank you for your work on this, it's much appreciated 👍 Is it safe to roll out this version to our primary bridge node, or should we wait for a new release?
It is safe, and you can expect the new version tomorrow.
Celestia Node version
0.9.4
OS
Ubuntu 20.04
Install tools
Running a self-compiled binary from the git tag 0.9.4
Others
Binary managed via systemd to run a bridge node with the standard flags that have worked for previous celestia-node versions:
--metrics.tls=false --metrics --metrics.endpoint 127.0.0.1:4318
This points to an otel collector that exports local prometheus metrics whilst forwarding otel metrics on to the blockspacerace external endpoint.
Steps to reproduce it
Metrics show correctly via the otel collector on a fresh restart of a node, as they have done for previous celestia-node versions. However, within a few hours of running version 0.9.4, these simply stop being updated.
Expected result
Metrics should continue to be updated and exported whilst the celestia node is running
Actual result
Metrics stop working within a few hours of a restart (sometimes as soon as 30 minutes) and this triggers our "metrics absent" alerts on our Prometheus servers.
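As a rough illustration of what a "metrics absent" alert checks: Prometheus's absent_over_time fires when a series has no samples inside the lookback window. The sketch below mimics that check in plain Python. The function name, the scrape timestamps, and the 15-minute window are illustrative assumptions, not our actual alert rule or metric data.

```python
from typing import Sequence


def absent_over_window(sample_times: Sequence[float], now: float, window: float) -> bool:
    """Return True when no sample timestamp falls within (now - window, now].

    This approximates the condition under which absent_over_time would fire;
    it is a simplified stand-in, not Prometheus's implementation.
    """
    return not any(now - window < t <= now for t in sample_times)


# Hypothetical sample timestamps in seconds since the node restart:
# the exporter delivers samples for 30 minutes, then goes silent.
samples = [0, 600, 1200, 1800]

# Shortly after the last sample, the 15-minute window still contains data.
print(absent_over_window(samples, now=2000, window=900))

# One hour after restart, the window is empty and the alert condition holds.
print(absent_over_window(samples, now=3600, window=900))
```

With a check of this shape, the alert only fires once the exporter has been silent for the whole window, which is consistent with alerts appearing some time after the metrics actually stop.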
Relevant log output
No response
Notes
No response