
Metrics are missing when coturn is used #914

Closed
FritschAuctores opened this issue May 27, 2022 · 11 comments

@FritschAuctores

The Prometheus target is down when coturn is used.

Here is the turn_total_traffic_peer_rcvb metric:

[image: grafik]

What's the problem here? The CPU load of the server was between 20% and 40%.

There is no network problem; other exporters (cAdvisor) work without any problems during this time.

@ggarber
Contributor

ggarber commented Jul 26, 2022

Do you have any logs from the agent trying to retrieve the information from the Prometheus endpoint? What error code do you get? Maybe you can tune some timeouts or retries?

AFAIK other people are using the Prometheus endpoint without problems.

@FritschAuctores
Author

In Prometheus: read tcp x.x.x.x:56552->x.x.x.x:9641: read: connection reset by peer
In Browser: Error: network timeout, No error code

I will add a timeout and keep watching it.

@ggarber
Contributor

ggarber commented Jul 26, 2022

Thanks for the quick answer. Do you get any error in the coturn logs when that failure happens?
Is it possible that you are running out of file descriptors on the coturn server and that is why those Prometheus HTTP requests are failing?

I haven't used the Prometheus endpoint at scale, but maybe other people can comment on whether they can reproduce this issue.

@eakraly
Collaborator

eakraly commented Aug 8, 2022

@FritschAuctores
One common reason for gaps like this in Prometheus metrics is a scrape timeout: say Prometheus sets its scrape timeout to 10s, but it takes coturn 11s to generate and report the metrics; in that case the Prometheus server drops the connection before the metrics are read.

Or the other way around: coturn (or rather the underlying library that implements the Prometheus protocol and web server) drops the connection after a timeout.

You can probably try to review those settings on both sides.
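One way to check which side gives up first is to time the endpoint directly with a client-side timeout comfortably larger than the scrape_timeout configured in Prometheus: if coturn regularly needs longer than scrape_timeout to answer, the scraper is the one dropping the connection. A minimal diagnostic sketch using libcurl (the port matches the one in this thread; the file name and the 15-second cap are placeholders, and this is not anything shipped with coturn):

```c
/* time_metrics.c -- time how long the /metrics endpoint takes to answer.
 * Build: gcc time_metrics.c -lcurl
 */
#include <curl/curl.h>
#include <stdio.h>

/* Discard the response body; only the timing matters here. */
static size_t discard(void *ptr, size_t size, size_t nmemb, void *userdata)
{
  (void)ptr;
  (void)userdata;
  return size * nmemb;
}

int main(void)
{
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL *h = curl_easy_init();
  if (h == NULL)
    return 1;

  curl_easy_setopt(h, CURLOPT_URL, "http://localhost:9641/metrics");
  curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, discard);
  /* Allow more time than the scrape_timeout you use in Prometheus. */
  curl_easy_setopt(h, CURLOPT_TIMEOUT, 15L);

  CURLcode rc = curl_easy_perform(h);
  if (rc != CURLE_OK) {
    fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));
  } else {
    double total = 0.0;
    curl_easy_getinfo(h, CURLINFO_TOTAL_TIME, &total);
    printf("metrics endpoint answered in %.2f s\n", total);
  }

  curl_easy_cleanup(h);
  curl_global_cleanup();
  return rc == CURLE_OK ? 0 : 1;
}
```

Running it in a loop while the gaps occur should show whether the endpoint is merely slow (the scraper times out) or resets/never answers at all (the server side drops it).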

Another point is that until the recent commit d62f483 (10 days ago as of today), the exporter was tagging a bunch of metrics with the username, which created massive dimensionality and, as a result, a massive number of metrics. That alone is a good reason for scrapes to time out (not to mention the resource/memory leak).
Try running a version without -prometheus-username-labels set.

@eakraly
Collaborator

eakraly commented Aug 14, 2022

@FritschAuctores I can confirm that I am seeing similar behavior with the latest master (f74f50c): there are gaps in the metrics for a few minutes that cause my alerts to fire.

And this is without -prometheus-username-labels set, so this is definitely a regression (compared to Releases/4.5.2).

@eakraly eakraly added the bug label Aug 15, 2022
@ggarber
Contributor

ggarber commented Aug 30, 2022

I've been looking at the prometheus-client-c code.

  • It is using microhttpd with a single thread and the select() API (this last part is decided by coturn and could be switched to epoll or auto).
  • The connection timeout is set to 0 (no timeout).

I'm wondering if the single-thread option is what ends up creating this problem. Unfortunately, it is not possible to enable the thread pool option in microhttpd without forking prometheus-client-c.
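For illustration, here is a minimal, self-contained sketch (not coturn's or prometheus-client-c's actual code) of where those choices live at the libmicrohttpd level: the daemon flag selects the polling model and threading, and MHD_OPTION_CONNECTION_TIMEOUT / MHD_OPTION_THREAD_POOL_SIZE are the knobs that are currently left at "no timeout" and "no pool". The handler and port below are placeholders:

```c
/* mhd_sketch.c -- illustrative libmicrohttpd server, not coturn code.
 * Build: gcc mhd_sketch.c -lmicrohttpd
 */
#include <microhttpd.h>
#include <stdio.h>
#include <string.h>

/* Placeholder standing in for the /metrics handler.
 * (libmicrohttpd older than 0.9.71 uses `int` instead of `enum MHD_Result`.) */
static enum MHD_Result
handle(void *cls, struct MHD_Connection *conn, const char *url,
       const char *method, const char *version,
       const char *upload_data, size_t *upload_data_size, void **con_cls)
{
  static const char body[] = "# placeholder metrics\n";
  struct MHD_Response *r = MHD_create_response_from_buffer(
      strlen(body), (void *)body, MHD_RESPMEM_PERSISTENT);
  enum MHD_Result ret = MHD_queue_response(conn, MHD_HTTP_OK, r);
  MHD_destroy_response(r);
  return ret;
}

int main(void)
{
  struct MHD_Daemon *d = MHD_start_daemon(
      /* One internal thread driven by select(); switching this flag to
       * MHD_USE_EPOLL_INTERNAL_THREAD or MHD_USE_AUTO changes the
       * polling model. */
      MHD_USE_SELECT_INTERNALLY,
      9641 /* placeholder port */,
      NULL, NULL,   /* accept-policy callback: accept every client */
      &handle, NULL,
      /* Close idle connections after 10 s instead of never (0). */
      MHD_OPTION_CONNECTION_TIMEOUT, (unsigned int)10,
      /* The thread pool option mentioned above would be:
       * MHD_OPTION_THREAD_POOL_SIZE, (unsigned int)4, */
      MHD_OPTION_END);
  if (d == NULL) {
    fprintf(stderr, "MHD_start_daemon failed\n");
    return 1;
  }
  getchar();        /* serve until Enter is pressed */
  MHD_stop_daemon(d);
  return 0;
}
```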

@FritschAuctores what coturn version are you using? Did you get any further insight in the last weeks? Thank you.

@eakraly
Collaborator

eakraly commented Sep 29, 2022

epoll use in microhttpd was introduced in #997.

And the issue is resolved for me.

@eakraly
Collaborator

eakraly commented Sep 30, 2022

@FritschAuctores please try commit 57f5a25 (which is the #997 I mentioned above)

@eakraly eakraly closed this as completed Oct 6, 2022
@eakraly
Collaborator

eakraly commented Nov 11, 2022

I still experience the issue as of 716417e.
It seems to happen on servers that experience higher load.
Most of the time I am not getting any response at all:

curl localhost:9641/metrics
curl: (52) Empty reply from server

@eakraly eakraly reopened this Nov 11, 2022
@eakraly
Collaborator

eakraly commented Nov 11, 2022

@ggarber I can confirm that #999 is the root cause.
What I think is happening is that the code to enable epoll is not working in this case, but I do not know why.
If it is using select() instead of epoll, then having no timeout set on connections explains why the server stops responding after some time (or responds intermittently, as old connections eventually do close), which is exactly how it behaved before epoll was enabled for libmicrohttpd.

@eakraly
Collaborator

eakraly commented Dec 31, 2022

For future reference: in some cases MHD_USE_EPOLL_INTERNAL_THREAD is not set (whereas it should be). This could be a bug in the microhttpd library.
Make sure (by patching?) that you set MHD_USE_EPOLL_INTERNAL_THREAD.
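For anyone hitting this, a quick way to see what a given host's libmicrohttpd will accept is a small probe (a sketch, not coturn code; the port is a placeholder, and it assumes a reasonably recent libmicrohttpd): MHD_is_feature_supported() reports whether epoll support was compiled in, and MHD_start_daemon() returns NULL if the requested flag combination is rejected.

```c
/* epoll_probe.c -- check epoll availability in the local libmicrohttpd.
 * Build: gcc epoll_probe.c -lmicrohttpd
 */
#include <microhttpd.h>
#include <stdio.h>

/* Placeholder handler: reject every request; we only probe daemon startup. */
static enum MHD_Result
refuse(void *cls, struct MHD_Connection *conn, const char *url,
       const char *method, const char *version,
       const char *upload_data, size_t *upload_data_size, void **con_cls)
{
  return MHD_NO;
}

int main(void)
{
  printf("epoll compiled into libmicrohttpd: %s\n",
         MHD_is_feature_supported(MHD_FEATURE_EPOLL) == MHD_YES ? "yes" : "no");

  struct MHD_Daemon *d = MHD_start_daemon(
      MHD_USE_EPOLL_INTERNAL_THREAD, 9642 /* placeholder port */,
      NULL, NULL, &refuse, NULL, MHD_OPTION_END);
  printf("daemon accepted MHD_USE_EPOLL_INTERNAL_THREAD: %s\n",
         d != NULL ? "yes" : "no");

  if (d != NULL)
    MHD_stop_daemon(d);
  return 0;
}
```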

@eakraly eakraly closed this as completed Dec 31, 2022