
Metrics stop working within a few hours of node restart with v0.9.4 running as a bridge #2188

Closed
Pete-LunaNova opened this issue May 12, 2023 · 12 comments
Labels
bug (Something isn't working) · external (Issues created by non node team members)

Comments

@Pete-LunaNova

Celestia Node version

0.9.4

OS

Ubuntu 20.04

Install tools

Running a self-compiled binary from the git tag 0.9.4

Others

Binary managed via systemd to run a bridge node with the standard flags that have worked for previous celestia-node versions:
--metrics.tls=false --metrics --metrics.endpoint 127.0.0.1:4318
This points to an otel collector that exports local Prometheus metrics whilst forwarding OTel metrics on to the blockspacerace external endpoint (a sketch of such a collector config is shown below).
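
A minimal sketch of that collector setup, assuming the stock OpenTelemetry Collector otlp receiver, prometheus exporter, and otlphttp exporter; the local scrape port and the external endpoint URL are placeholders rather than our exact values:

```yaml
# Receive OTLP/HTTP from the bridge node, expose a local Prometheus scrape
# endpoint, and forward the same metrics to the external collector.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 127.0.0.1:4318        # matches --metrics.endpoint on the node

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889              # local scrape target (placeholder port)
  otlphttp:
    endpoint: https://otel.example.com  # placeholder for the blockspacerace endpoint

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus, otlphttp]
```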

Steps to reproduce it

Metrics show correctly via the otel collector on a fresh restart of a node, as they have done for previous celestia-node versions. However, within a few hours of running version 0.9.4 they simply stop being updated.

Expected result

Metrics should continue to be updated and exported whilst the celestia node is running

Actual result

Metrics stop working within a few hours of a restart (sometimes as soon as 30 minutes) and this triggers our "metrics absent" alerts on our Prometheus servers.

Relevant log output

No response

Notes

No response

@Pete-LunaNova added the bug (Something isn't working) label on May 12, 2023
@github-actions bot added the external (Issues created by non node team members) label on May 12, 2023
@walldiss
Member

Could you please give more information about what the "metrics absent" alert is? Is it triggered when there are no metrics at all, or does it target specific ones?

@zekchan

zekchan commented May 12, 2023

I can confirm that. My bridge just stops sending metrics to the otel-collector after a while.

@Pete-LunaNova
Author

> Could you please give more information about what the "metrics absent" alert is? Is it triggered when there are no metrics at all, or does it target specific ones?

Our Prometheus server has alert rules configured that use the absent_over_time function for the two bridge metrics we are most interested in: celestia_incentivized_bridge_total_synced_headers and celestia_incentivized_bridge_head (a sketch of such a rule is shown below).
However, I can confirm that when the celestia bridge encounters this issue, all metrics it usually exports are no longer present.
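
A minimal sketch of the kind of rule involved, assuming a standard Prometheus alerting rules file; the alert name, lookback window, and for duration here are illustrative rather than our exact values:

```yaml
groups:
  - name: celestia-bridge
    rules:
      - alert: CelestiaBridgeMetricsAbsent
        # Fires when the head metric has not been scraped at all over the lookback window.
        expr: absent_over_time(celestia_incentivized_bridge_head[15m])
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Celestia bridge metrics are absent
```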

@malise800

This issue also happens for full nodes.

@Wondertan
Member

Does this happen only on 0.9.4? Can you try building the binary with #2175 reverted? It's the only patch since the last release that touched metrics at all, and maybe it broke something.

@Pete-LunaNova
Author

> Does this happen only on 0.9.4? Can you try building the binary with #2175 reverted? It's the only patch since the last release that touched metrics at all, and maybe it broke something.

Yes, we're only seeing this with 0.9.4. None of the previous versions we've used for blockspacerace have displayed this issue.
We've built #2175 and got it running on our backup bridge node. There are fewer metrics being exported than with other versions. I will keep an eye on it and provide an update if it suddenly stops exporting metrics as 0.9.4 did.

@walldiss
Member

Did you revert to #2175, or did you revert this commit only? Asking because it is important to exclude this one.

@Pete-LunaNova
Author

> Did you revert to #2175, or did you revert this commit only? Asking because it is important to exclude this one.

We built the binary after running gh pr checkout 2175, which gives us the following commit: 96c16fea28fe10d7bf198c33a0b9a89ec828f9ee.
It sounds like this is not what you require. Could you let me know which commit you'd like us to build, and I'll get it up and running?

@walldiss
Member

@Pete-LunaNova Could you try running the latest main? It has a fix for a deadlock that was blocking the async metric exporter.

@Pete-LunaNova
Author

> @Pete-LunaNova Could you try running the latest main? It has a fix for a deadlock that was blocking the async metric exporter.

No problem. I've built this and it is up and running on our backup bridge node, commit: 9baade4bd0300dabcd160c185e145a824c128e30.
I will keep an eye on it and let you know if it looks like the issue is solved or not.

@Pete-LunaNova
Author

I've not seen a failure in exporting metrics since starting this new version yesterday, so it looks like the issue has been fixed in the latest main. Thank you for your work on this; it's much appreciated 👍

Is it safe to roll out this version to our primary bridge node or should we wait for a new release?

@Wondertan
Member

It is safe, and expect the new version tomorrow.
