
Metrics are missing when coturn is used #914

Closed
FritschAuctores opened this issue May 27, 2022 · 11 comments

@FritschAuctores

The Prometheus target is down when coturn is used.

Here is the turn_total_traffic_peer_rcvb metric:

[image: grafik]

What's the problem here? The CPU load of the server was between 20% and 40%.

There is no network problem; other exporters (cAdvisor) work without any problems during this time.

@ggarber
Contributor

ggarber commented Jul 26, 2022

Do you have any logs from the agent trying to retrieve the information from the Prometheus endpoint? What error code do you get? Maybe you can tune some timeouts or retries?

AFAIK other people are using the Prometheus endpoint without problems.

@FritschAuctores
Author

In Prometheus: read tcp x.x.x.x:56552->x.x.x.x:9641: read: connection reset by peer
In Browser: Error: network timeout, No error code

I will add a timeout and keep watching it.

@ggarber
Contributor

ggarber commented Jul 26, 2022

Thanks for the quick answer. Do you get any error in the coturn logs when that failure happens?
Is it possible that you are running out of file descriptors on the coturn server and that is why those Prometheus HTTP requests are failing?

I haven't used the Prometheus endpoint at scale, but maybe other people can comment on whether they can reproduce this issue.

@eakraly
Collaborator

eakraly commented Aug 8, 2022

@FritschAuctores
One common reason for gaps like this in Prometheus metrics is a scrape timeout: say Prometheus sets its scrape timeout to 10s, but it takes coturn 11s to generate and report the metrics; in that case the Prometheus server drops the connection before the metrics are read.

Or the other way around: coturn (or rather the underlying library that implements the Prometheus protocol and web server) drops the connection after a timeout.

You can probably try to review those settings on both sides.
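One way to check which side gives up first is to time the endpoint directly with a client-side timeout comfortably larger than the scrape_timeout configured in Prometheus: if coturn regularly needs longer than scrape_timeout to answer, the scraper is the one dropping the connection. A minimal diagnostic sketch using libcurl (the port matches the one in this thread; the file name and the 15-second cap are placeholders, and this is not anything shipped with coturn):

```c
/* time_metrics.c -- time how long the /metrics endpoint takes to answer.
 * Build: gcc time_metrics.c -lcurl
 */
#include <curl/curl.h>
#include <stdio.h>

/* Discard the response body; only the timing matters here. */
static size_t discard(void *ptr, size_t size, size_t nmemb, void *userdata)
{
  (void)ptr;
  (void)userdata;
  return size * nmemb;
}

int main(void)
{
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL *h = curl_easy_init();
  if (h == NULL)
    return 1;

  curl_easy_setopt(h, CURLOPT_URL, "http://localhost:9641/metrics");
  curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, discard);
  /* Allow more time than the scrape_timeout you use in Prometheus. */
  curl_easy_setopt(h, CURLOPT_TIMEOUT, 15L);

  CURLcode rc = curl_easy_perform(h);
  if (rc != CURLE_OK) {
    fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));
  } else {
    double total = 0.0;
    curl_easy_getinfo(h, CURLINFO_TOTAL_TIME, &total);
    printf("metrics endpoint answered in %.2f s\n", total);
  }

  curl_easy_cleanup(h);
  curl_global_cleanup();
  return rc == CURLE_OK ? 0 : 1;
}
```

Running it in a loop while the gaps occur should show whether the endpoint is merely slow (the scraper times out) or resets/never answers at all (the server side drops it).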

Another point is that until the recent commit d62f483 (10 days ago as of today), the exporter was tagging a bunch of metrics with the username, which created massive dimensionality and, as a result, a massive number of metrics. That alone is a good reason for scrapes to time out (not to mention the resource/memory leak).
Try running a version without -prometheus-username-labels set.

@eakraly
Collaborator

eakraly commented Aug 14, 2022

@FritschAuctores I can confirm that I am seeing similar behavior with the latest master (f74f50c): there are gaps in the metrics for a few minutes that cause my alerts to fire.

And this is without -prometheus-username-labels set, so this is definitely a regression (compared to Releases/4.5.2).

@eakraly eakraly added the bug label Aug 15, 2022
@ggarber
Contributor

ggarber commented Aug 30, 2022

I've been looking at the prometheus-client-c code.

  • It is using microhttpd with a single thread and the select() API (this last part is decided by coturn and could be switched to epoll or auto).
  • The connection timeout is set to 0 (no timeout).

I'm wondering if the single-thread option is what ends up creating this problem. Unfortunately, it is not possible to enable the thread pool option in microhttpd without forking prometheus-client-c.
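For illustration, here is a minimal, self-contained sketch (not coturn's or prometheus-client-c's actual code) of where those choices live at the libmicrohttpd level: the daemon flag selects the polling model and threading, and MHD_OPTION_CONNECTION_TIMEOUT / MHD_OPTION_THREAD_POOL_SIZE are the knobs that are currently left at "no timeout" and "no pool". The handler and port below are placeholders:

```c
/* mhd_sketch.c -- illustrative libmicrohttpd server, not coturn code.
 * Build: gcc mhd_sketch.c -lmicrohttpd
 */
#include <microhttpd.h>
#include <stdio.h>
#include <string.h>

/* Placeholder standing in for the /metrics handler.
 * (libmicrohttpd older than 0.9.71 uses `int` instead of `enum MHD_Result`.) */
static enum MHD_Result
handle(void *cls, struct MHD_Connection *conn, const char *url,
       const char *method, const char *version,
       const char *upload_data, size_t *upload_data_size, void **con_cls)
{
  static const char body[] = "# placeholder metrics\n";
  struct MHD_Response *r = MHD_create_response_from_buffer(
      strlen(body), (void *)body, MHD_RESPMEM_PERSISTENT);
  enum MHD_Result ret = MHD_queue_response(conn, MHD_HTTP_OK, r);
  MHD_destroy_response(r);
  return ret;
}

int main(void)
{
  struct MHD_Daemon *d = MHD_start_daemon(
      /* One internal thread driven by select(); switching this flag to
       * MHD_USE_EPOLL_INTERNAL_THREAD or MHD_USE_AUTO changes the
       * polling model. */
      MHD_USE_SELECT_INTERNALLY,
      9641 /* placeholder port */,
      NULL, NULL,   /* accept-policy callback: accept every client */
      &handle, NULL,
      /* Close idle connections after 10 s instead of never (0). */
      MHD_OPTION_CONNECTION_TIMEOUT, (unsigned int)10,
      /* The thread pool option mentioned above would be:
       * MHD_OPTION_THREAD_POOL_SIZE, (unsigned int)4, */
      MHD_OPTION_END);
  if (d == NULL) {
    fprintf(stderr, "MHD_start_daemon failed\n");
    return 1;
  }
  getchar();        /* serve until Enter is pressed */
  MHD_stop_daemon(d);
  return 0;
}
```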

@FritschAuctores what coturn version are you using? Did you get any further insight in the last weeks? Thank you.

@eakraly
Collaborator

eakraly commented Sep 29, 2022

epoll use in microhttpd was introduced in #997.

And the issue is resolved for me.

@eakraly
Collaborator

eakraly commented Sep 30, 2022

@FritschAuctores please try commit 57f5a25 (which is the #997 I mentioned above)

@eakraly eakraly closed this as completed Oct 6, 2022
@eakraly
Collaborator

eakraly commented Nov 11, 2022

I still experience the issue as of 716417e.
It seems to happen on servers that experience higher load.
Most of the time I am not getting any response at all:

curl localhost:9641/metrics
curl: (52) Empty reply from server

@eakraly eakraly reopened this Nov 11, 2022
@eakraly
Collaborator

eakraly commented Nov 11, 2022

@ggarber I can confirm that #999 is the root cause.
What I think is happening is that the code to enable epoll is not working in this case, but I do not know why.
If it is using select() instead of epoll, then having no timeout set on connections explains why the server stops responding after some time (or responds intermittently, as old connections eventually do close), which is exactly how it behaved before epoll was enabled for libmicrohttpd.

@eakraly
Collaborator

eakraly commented Dec 31, 2022

For future reference: in some cases MHD_USE_EPOLL_INTERNAL_THREAD is not set (whereas it should be). This could be a bug in the microhttpd library.
Make sure (by patching?) that you set MHD_USE_EPOLL_INTERNAL_THREAD.
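For anyone hitting this, a quick way to see what a given host's libmicrohttpd will accept is a small probe (a sketch, not coturn code; the port is a placeholder, and it assumes a reasonably recent libmicrohttpd): MHD_is_feature_supported() reports whether epoll support was compiled in, and MHD_start_daemon() returns NULL if the requested flag combination is rejected.

```c
/* epoll_probe.c -- check epoll availability in the local libmicrohttpd.
 * Build: gcc epoll_probe.c -lmicrohttpd
 */
#include <microhttpd.h>
#include <stdio.h>

/* Placeholder handler: reject every request; we only probe daemon startup. */
static enum MHD_Result
refuse(void *cls, struct MHD_Connection *conn, const char *url,
       const char *method, const char *version,
       const char *upload_data, size_t *upload_data_size, void **con_cls)
{
  return MHD_NO;
}

int main(void)
{
  printf("epoll compiled into libmicrohttpd: %s\n",
         MHD_is_feature_supported(MHD_FEATURE_EPOLL) == MHD_YES ? "yes" : "no");

  struct MHD_Daemon *d = MHD_start_daemon(
      MHD_USE_EPOLL_INTERNAL_THREAD, 9642 /* placeholder port */,
      NULL, NULL, &refuse, NULL, MHD_OPTION_END);
  printf("daemon accepted MHD_USE_EPOLL_INTERNAL_THREAD: %s\n",
         d != NULL ? "yes" : "no");

  if (d != NULL)
    MHD_stop_daemon(d);
  return 0;
}
```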

@eakraly eakraly closed this as completed Dec 31, 2022