
Metrics stop working within a few hours of node restart with v0.9.4 running as a bridge #2188

Closed
Pete-LunaNova opened this issue May 12, 2023 · 12 comments
Labels
bug (Something isn't working) · external (Issues created by non node team members)

Comments

@Pete-LunaNova

Celestia Node version

0.9.4

OS

Ubuntu 20.04

Install tools

Running a self-compiled binary from the git tag 0.9.4

Others

Binary managed via systemd to run a bridge node with the standard flags that have worked for previous celestia-node versions:
--metrics.tls=false --metrics --metrics.endpoint 127.0.0.1:4318
This points to an otel collector that exports local Prometheus metrics whilst forwarding OTel metrics on to the blockspacerace external endpoint (a sketch of such a collector config is shown below).
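
A minimal sketch of that collector setup, assuming the stock OpenTelemetry Collector otlp receiver, prometheus exporter, and otlphttp exporter; the local scrape port and the external endpoint URL are placeholders rather than our exact values:

```yaml
# Receive OTLP/HTTP from the bridge node, expose a local Prometheus scrape
# endpoint, and forward the same metrics to the external collector.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 127.0.0.1:4318        # matches --metrics.endpoint on the node

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889              # local scrape target (placeholder port)
  otlphttp:
    endpoint: https://otel.example.com  # placeholder for the blockspacerace endpoint

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus, otlphttp]
```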

Steps to reproduce it

Metrics show correctly via the otel collector on a fresh restart of a node, as they have done for previous celestia-node versions. However, within a few hours of running version 0.9.4 they simply stop being updated.

Expected result

Metrics should continue to be updated and exported whilst the celestia node is running

Actual result

Metrics stop working within a few hours of a restart (sometimes as soon as 30 minutes) and this triggers our "metrics absent" alerts on our Prometheus servers.

Relevant log output

No response

Notes

No response

@Pete-LunaNova added the bug (Something isn't working) label on May 12, 2023
@github-actions bot added the external (Issues created by non node team members) label on May 12, 2023
@walldiss
Member

Could you please give more information about what the "metrics absent" alert is? Is it triggered when there are no metrics at all, or does it target specific ones?

@zekchan

zekchan commented May 12, 2023

I can confirm that. My bridge just stops sending metrics to the otel-collector after a while.

@Pete-LunaNova
Author

> Could you please give more information about what the "metrics absent" alert is? Is it triggered when there are no metrics at all, or does it target specific ones?

Our Prometheus server has alert rules configured that use the absent_over_time function for the two bridge metrics we are most interested in: celestia_incentivized_bridge_total_synced_headers and celestia_incentivized_bridge_head (a sketch of such a rule is shown below).
However, I can confirm that when the celestia bridge encounters this issue, all metrics it usually exports are no longer present.
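
A minimal sketch of the kind of rule involved, assuming a standard Prometheus alerting rules file; the alert name, lookback window, and for duration here are illustrative rather than our exact values:

```yaml
groups:
  - name: celestia-bridge
    rules:
      - alert: CelestiaBridgeMetricsAbsent
        # Fires when the head metric has not been scraped at all over the lookback window.
        expr: absent_over_time(celestia_incentivized_bridge_head[15m])
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Celestia bridge metrics are absent
```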

@malise800

This issue also happens for full nodes.

@Wondertan
Member

Does this happen only on 0.9.4? Can you try building the binary with #2175 reverted? It's the only patch since the last release that touched metrics at all, and maybe it broke something.

@Pete-LunaNova
Author

> Does this happen only on 0.9.4? Can you try building the binary with #2175 reverted? It's the only patch since the last release that touched metrics at all, and maybe it broke something.

Yes, we're only seeing this with 0.9.4. None of the previous versions we've used for blockspacerace have displayed this issue.
We've built #2175 and got it running on our backup bridge node. There are fewer metrics being exported than with other versions. I will keep an eye on it and provide an update if it suddenly stops exporting metrics as 0.9.4 did.

@walldiss
Member

Did you revert to #2175, or did you revert this commit only? Asking because it is important to exclude this one.

@Pete-LunaNova
Author

> Did you revert to #2175, or did you revert this commit only? Asking because it is important to exclude this one.

We built the binary after running gh pr checkout 2175, which gives us the following commit: 96c16fea28fe10d7bf198c33a0b9a89ec828f9ee.
It sounds like this is not what you require. Could you let me know which commit you'd like us to build, and I'll get it up and running?

@walldiss
Member

@Pete-LunaNova Could you try running the latest main? It has a fix for a deadlock that was blocking the async metric exporter.

@Pete-LunaNova
Author

> @Pete-LunaNova Could you try running the latest main? It has a fix for a deadlock that was blocking the async metric exporter.

No problem. I've built this and it is up and running on our backup bridge node, commit: 9baade4bd0300dabcd160c185e145a824c128e30.
I will keep an eye on it and let you know if it looks like the issue is solved or not.

@Pete-LunaNova
Author

I've not seen a failure in exporting metrics since starting this new version yesterday, so it looks like the issue has been fixed in the latest main. Thank you for your work on this; it's much appreciated 👍

Is it safe to roll out this version to our primary bridge node or should we wait for a new release?

@Wondertan
Member

It is safe, and expect the new version tomorrow.
